* dcache shrink list corruption?
@ 2014-04-29 16:01 Miklos Szeredi
  2014-04-29 17:43 ` Linus Torvalds
  2014-04-29 17:56 ` Al Viro
  0 siblings, 2 replies; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-29 16:01 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, linux-kernel, linux-fsdevel

This was reported by IBM for 3.12, but if my analysis is right, it affects
current kernel as well as older ones.

So the question is: does anything protect the shrink list from concurrent
modification by one or more dput() instances?

E.g. two dentries are on the shrink list, and for both of them dget(), d_drop()
and dput() are called.  dput() -> dentry_kill() -> dentry_lru_del() ->
d_shrink_del() -> list_del_init().  Unlike the LRU list, the shrink list is only
protected by d_lock on the individual dentries, which is not enough to prevent
list corruption:

list->next = a, list->prev = b
a->next = b, a->prev = list
b->next = list, b->prev = a

CPU1: list_del_init(b)
        __list_del(a, list)
             a->next = list ...
CPU2: list_del_init(a)
        __list_del(list, list)
             list->next = list
             list->prev = list
CPU1: (continuing list_del_init(b))
             list->prev = a

Attached patch is just a starting point (untested).  Not sure how to minimize
contention without adding too much complexity.

Thanks,
Miklos


diff --git a/fs/dcache.c b/fs/dcache.c
index 40707d88a945..5e0719292e3e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -357,10 +357,14 @@ static void d_lru_del(struct dentry *dentry)
 	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
 }
 
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_shrink_lock);
+
 static void d_shrink_del(struct dentry *dentry)
 {
 	D_FLAG_VERIFY(dentry, DCACHE_SHRINK_LIST | DCACHE_LRU_LIST);
+	spin_lock(&dcache_shrink_lock);
 	list_del_init(&dentry->d_lru);
+	spin_unlock(&dcache_shrink_lock);
 	dentry->d_flags &= ~(DCACHE_SHRINK_LIST | DCACHE_LRU_LIST);
 	this_cpu_dec(nr_dentry_unused);
 }
@@ -368,7 +372,9 @@ static void d_shrink_del(struct dentry *dentry)
 static void d_shrink_add(struct dentry *dentry, struct list_head *list)
 {
 	D_FLAG_VERIFY(dentry, 0);
+	spin_lock(&dcache_shrink_lock);
 	list_add(&dentry->d_lru, list);
+	spin_unlock(&dcache_shrink_lock);
 	dentry->d_flags |= DCACHE_SHRINK_LIST | DCACHE_LRU_LIST;
 	this_cpu_inc(nr_dentry_unused);
 }
@@ -391,7 +397,9 @@ static void d_lru_shrink_move(struct dentry *dentry, struct list_head *list)
 {
 	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
 	dentry->d_flags |= DCACHE_SHRINK_LIST;
+	spin_lock(&dcache_shrink_lock);
 	list_move_tail(&dentry->d_lru, list);
+	spin_unlock(&dcache_shrink_lock);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-04-29 16:01 dcache shrink list corruption? Miklos Szeredi
@ 2014-04-29 17:43 ` Linus Torvalds
  2014-04-29 18:03   ` Miklos Szeredi
  2014-04-29 17:56 ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-04-29 17:43 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 9:01 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> This was reported by IBM for 3.12, but if my analysis is right, it affects
> current kernel as well as older ones.
>
> So the question is: does anything protect the shrink list from concurrent
> modification by one or more dput() instances?

Ugh. I don't see anything. The shrinking list is a private list, so
adding on its own would be entirely safe, and I think that's where the
"we don't need no steenking locking" comes from.

But yes, the dentries are then visible on the hash chains, and there
can be concurrent removals from the list.

That new global lock smells, though - and if we want to use a global
lock, we should simply use the existing per-superblock LRU lock, not
make up some new global one. The moving case already holds it, can't
we just take it in the add/del case? Was there some reason you didn't
do that?

Let me think about it, maybe there's some trick we can do by virtue of
the list head being private and us holding the dentry lock. So at
least on addition, we have *two* of the three involved nodes locked.
Does that perhaps allow for any lockless games to be played?

               Linus


* Re: dcache shrink list corruption?
  2014-04-29 16:01 dcache shrink list corruption? Miklos Szeredi
  2014-04-29 17:43 ` Linus Torvalds
@ 2014-04-29 17:56 ` Al Viro
  1 sibling, 0 replies; 61+ messages in thread
From: Al Viro @ 2014-04-29 17:56 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Tue, Apr 29, 2014 at 06:01:39PM +0200, Miklos Szeredi wrote:

> Attached patch is just a starting point (untested).  Not sure how to minimize
> contention without adding too much complexity.

Contention isn't the worst problem here - I'd expect the cacheline ping-pong
to hurt more...  I agree with the analysis, but I'd really like to avoid that
spinlock ;-/

Let me see if we can avoid that...  Oh, well - at least that's a good excuse
to take a break from fucking deadlock analysis around the damn acct(2), most
of VFS and network filesystems ;-/


* Re: dcache shrink list corruption?
  2014-04-29 17:43 ` Linus Torvalds
@ 2014-04-29 18:03   ` Miklos Szeredi
  2014-04-29 18:16     ` Al Viro
  2014-04-29 18:17     ` Linus Torvalds
  0 siblings, 2 replies; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-29 18:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 7:43 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 29, 2014 at 9:01 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> This was reported by IBM for 3.12, but if my analysis is right, it affects
>> current kernel as well as older ones.
>>
>> So the question is: does anything protect the shrink list from concurrent
>> modification by one or more dput() instances?
>
> Ugh. I don't see anything. The shrinking list is a private list, so
> adding on its own would be entirely safe, and I think that's where the
> "we don't need no steenking locking" comes from.
>
> But yes, the dentries are then visible on the hash chains, and there
> can be concurrent removals from the list.
>
> That new global lock smells, though - and if we want to use a global
> lock, we should simply use the existing per-superblock LRU lock, not
> make up some new global one. The moving case already holds it, can't
> we just take it in the add/del case? Was there some reason you didn't
> do that?

Because we no longer have that.  It now uses the list_lru thing, with
a "per-node" lock, whatever that one is.

Introducing a new per-sb lock should be OK.

Another idea, which could have subtler effects, is simply not to kill
a dentry that is on the shrink list (indicated by
DCACHE_SHRINK_LIST), since it's bound to get killed anyway.  But
that's a change in behaviour...

>
> Let me think about it, maybe there's some trick we can do by virtue of
> the list head being private and us holding the dentry lock. So at
> least on addition, we have *two* of the three involved nodes locked.
> Does that perhaps allow for any lockless games to be played?

I don't understand how that could be made to work.


* Re: dcache shrink list corruption?
  2014-04-29 18:03   ` Miklos Szeredi
@ 2014-04-29 18:16     ` Al Viro
  2014-04-29 19:10       ` Al Viro
  2014-04-29 18:17     ` Linus Torvalds
  1 sibling, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-29 18:16 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 08:03:24PM +0200, Miklos Szeredi wrote:

> Introducing a new per-sb lock should be OK.
> 
> Another idea, which could have subtler effects, is simply not to kill
> a dentry that is on the shrink list (indicated by
> DCACHE_SHRINK_LIST), since it's bound to get killed anyway.  But
> that's a change in behaviour...

Umm...  You mean, if final dput() finds dentry already on shrink list,
just leave it there and return?  Might get really painful - the code
that knows it's holding the last reference to already unhashed dentry
might get a nasty surprise when dput() returns before it's killed off.


* Re: dcache shrink list corruption?
  2014-04-29 18:03   ` Miklos Szeredi
  2014-04-29 18:16     ` Al Viro
@ 2014-04-29 18:17     ` Linus Torvalds
  1 sibling, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2014-04-29 18:17 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 11:03 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> Because we no longer have that.  It now uses the list_lru thing, with
> a "per-node" lock, whatever that one is.

Oh, yes. Right you are. I just started looking at that and went "ugh".

The lru lists are all distributed now with multiple locks (well, one
per list node).

> Another idea, which could have subtler effects, is simply not to kill
> a dentry that is on the shrink list (indicated by
> DCACHE_SHRINK_LIST), since it's bound to get killed anyway.  But
> that's a change in behaviour...

Ooh, I like the way you think. I don't see why this wouldn't be the
right approach.

            Linus


* Re: dcache shrink list corruption?
  2014-04-29 18:16     ` Al Viro
@ 2014-04-29 19:10       ` Al Viro
  2014-04-29 21:18         ` Dave Chinner
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-29 19:10 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 07:16:10PM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 08:03:24PM +0200, Miklos Szeredi wrote:
> 
> > Introducing a new per-sb lock should be OK.
> > 
> > Another idea, which could have subtler effects, is simply not to kill
> > a dentry that is on the shrink list (indicated by
> > DCACHE_SHRINK_LIST), since it's bound to get killed anyway.  But
> > that's a change in behaviour...
> 
> Umm...  You mean, if final dput() finds dentry already on shrink list,
> just leave it there and return?  Might get really painful - the code
> that knows it's holding the last reference to already unhashed dentry
> might get a nasty surprise when dput() returns before it's killed off.

I wonder if we could have the dput() side of things check if we are on the
shrink list, mark it "I'll be killing it anyway" and go ahead without
removal from the shrink list and instead of freeing mark it "I'm done
with it".  With shrink_dentry_list(), on the other hand, checking for those
marks, treating the former as "just move it to private list and keep
going".  After the list of victims is dealt with, keep picking dentries
from the second list, wait for them to get the second mark and do actual
freeing.  That ought to avoid any extra locks and preserve all ordering,
except for that of kmem_cache_free(), AFAICS...

Comments?


* Re: dcache shrink list corruption?
  2014-04-29 19:10       ` Al Viro
@ 2014-04-29 21:18         ` Dave Chinner
  2014-04-29 21:48           ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Dave Chinner @ 2014-04-29 21:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 08:10:15PM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 07:16:10PM +0100, Al Viro wrote:
> > On Tue, Apr 29, 2014 at 08:03:24PM +0200, Miklos Szeredi wrote:
> > 
> > > Introducing a new per-sb lock should be OK.
> > > 
> > > Another idea, which could have subtler effects, is simply not to kill
> > > a dentry that is on the shrink list (indicated by
> > > DCACHE_SHRINK_LIST), since it's bound to get killed anyway.  But
> > > that's a change in behaviour...
> > 
> > Umm...  You mean, if final dput() finds dentry already on shrink list,
> > just leave it there and return?  Might get really painful - the code
> > that knows it's holding the last reference to already unhashed dentry
> > might get a nasty surprise when dput() returns before it's killed off.
> 
> > I wonder if we could have the dput() side of things check if we are on the
> shrink list, mark it "I'll be killing it anyway" and go ahead without
> removal from the shrink list and instead of freeing mark it "I'm done
> with it".  With shrink_dentry_list(), on the other hand, checking for those
> marks, treating the former as "just move it to private list and keep
> going".  After the list of victims is dealt with, keep picking dentries
> from the second list, wait for them to get the second mark and do actual
> freeing.  That ought to avoid any extra locks and preserve all ordering,
> except for that of kmem_cache_free(), AFAICS...
> 
> Comments?

Seems like it would work, but it seems fragile to me - I'm
wondering how we can ensure that the private shrink list
manipulations can be kept private.

We have a similar situation with the inode cache (private shrink
list) but the I_FREEING flag is set the entire time the inode is on
the shrink list. Any new hash lookup or attempt to grab the inode
that occurs while I_FREEING is set fails, so perhaps dentries also
need a well defined "being torn down and freed" state where new
references cannot be taken even though the dentry can still be
found...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: dcache shrink list corruption?
  2014-04-29 21:18         ` Dave Chinner
@ 2014-04-29 21:48           ` Al Viro
  2014-04-29 23:04             ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-29 21:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Miklos Szeredi, Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 07:18:51AM +1000, Dave Chinner wrote:

> Seems like it would work, but it seems fragile to me - I'm
> wondering how we can ensure that the private shrink list
> manipulations can be kept private.
> 
> We have a similar situation with the inode cache (private shrink
> list) but the I_FREEING flag is set the entire time the inode is on
> the shrink list. Any new hash lookup or attempt to grab the inode
> that occurs while I_FREEING is set fails, so perhaps dentries also
> need a well defined "being torn down and freed" state where new
> references cannot be taken even though the dentry can still be
> found...

Ummm...  You mean, have d_lookup() et al. fail on something that is on
a shrink list?


* Re: dcache shrink list corruption?
  2014-04-29 21:48           ` Al Viro
@ 2014-04-29 23:04             ` Linus Torvalds
  2014-04-29 23:20               ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-04-29 23:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, Miklos Szeredi, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 2:48 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Ummm...  You mean, have d_lookup() et.al. fail on something that is on
> a shrink list?

So I tried to see if that would work: just consider the dentry dead by the
time it hits the shrink list, and if somebody does a lookup on it, fail
that lookup and allocate a new dentry to do a real lookup with.

But at a minimum, we have "d_op->d_prune()" that would now possibly
be called for the old dentry *after* a new dentry has been allocated.
Not to mention the inode not having been dropped. So it looks like a
disaster where the filesystem now ends up having concurrent "live"
dentries for the same entry.  Even if one of them is on its way out,
it's still visible to the filesystem. That makes me very
uncomfortable.

I'm starting to think that Miklos' original patch (perhaps with the
spinlock split to at least be one per superblock) is the most
straightforward approach after all. It's annoying having to add locks
here, but the whole pruning path should not be a hotpath, so maybe
it's not actually a big deal.

               Linus


* Re: dcache shrink list corruption?
  2014-04-29 23:04             ` Linus Torvalds
@ 2014-04-29 23:20               ` Al Viro
  2014-04-30  2:31                 ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-29 23:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Miklos Szeredi, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> But at a minimum, we have "d_op->d_prune()" that would now possibly
> be called for the old dentry *after* a new dentry has been allocated.
> Not to mention the inode not having been dropped. So it looks like a
> disaster where the filesystem now ends up having concurrent "live"
> dentries for the same entry.  Even if one of them is on its way out,
> it's still visible to the filesystem. That makes me very
> uncomfortable.
> 
> I'm starting to think that Miklos' original patch (perhaps with the
> spinlock split to at least be one per superblock) is the most
> straightforward approach after all. It's annoying having to add locks
> here, but the whole pruning path should not be a hotpath, so maybe
> it's not actually a big deal.

FWIW, my solution is more or less working; I'll give more local beating
and post it... 


* Re: dcache shrink list corruption?
  2014-04-29 23:20               ` Al Viro
@ 2014-04-30  2:31                 ` Al Viro
  2014-04-30  2:56                   ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30  2:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Miklos Szeredi, Linux Kernel Mailing List, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 8710 bytes --]

On Wed, Apr 30, 2014 at 12:20:13AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> > But at a minimum, we have "d_op->d_prune()" that would now possibly
> > be called for the old dentry *after* a new dentry has been allocated.
> > Not to mention the inode not having been dropped. So it looks like a
> > disaster where the filesystem now ends up having concurrent "live"
> > dentries for the same entry.  Even if one of them is on its way out,
> > it's still visible to the filesystem. That makes me very
> > uncomfortable.
> > 
> > I'm starting to think that Miklos' original patch (perhaps with the
> > spinlock split to at least be one per superblock) is the most
> > straightforward approach after all. It's annoying having to add locks
> > here, but the whole pruning path should not be a hotpath, so maybe
> > it's not actually a big deal.
> 
> FWIW, my solution is more or less working; I'll give more local beating
> and post it... 

OK, aggregate diff follows, more readable splitup (3 commits) attached.
It seems to survive beating here; testing, review and comments are
welcome.

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,16 +246,8 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
+static void dentry_free(struct dentry *dentry)
 {
-	BUG_ON((int)dentry->d_lockref.count > 0);
-	this_cpu_dec(nr_dentry);
-	if (dentry->d_op && dentry->d_op->d_release)
-		dentry->d_op->d_release(dentry);
-
 	/* if dentry was never visible to RCU, immediate free is OK */
 	if (!(dentry->d_flags & DCACHE_RCUACCESS))
 		__d_free(&dentry->d_u.d_rcu);
@@ -420,40 +412,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-	__releases(dentry->d_lock)
-	__releases(parent->d_lock)
-	__releases(dentry->d_inode->i_lock)
-{
-	list_del(&dentry->d_u.d_child);
-	/*
-	 * Inform d_walk() that we are no longer attached to the
-	 * dentry tree
-	 */
-	dentry->d_flags |= DCACHE_DENTRY_KILLED;
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	dentry_iput(dentry);
-	/*
-	 * dentry_iput drops the locks, at which point nobody (except
-	 * transient RCU lookups) can reach this dentry.
-	 */
-	d_free(dentry);
-	return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -499,6 +457,8 @@ void d_drop(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_drop);
 
+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
@@ -506,16 +466,17 @@ EXPORT_SYMBOL(d_drop);
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
 	struct dentry *parent;
+	bool can_free = true;
 
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-		if (unlock_on_failure) {
+		if (!morgue) {
 			spin_unlock(&dentry->d_lock);
 			cpu_relax();
 		}
@@ -531,6 +492,15 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		/* morgue must be non-NULL */
+		list_move(&dentry->d_lru, morgue);
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		/* inode must be NULL */
+		spin_unlock(&dentry->d_lock);
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -543,10 +513,43 @@ relock:
 	if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
 		dentry->d_op->d_prune(dentry);
 
-	dentry_lru_del(dentry);
+	if (dentry->d_flags & DCACHE_LRU_LIST) {
+		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+			d_lru_del(dentry);
+		else if (morgue)
+			d_shrink_del(dentry);
+		else
+			can_free = false;
+	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
-	return d_kill(dentry, parent);
+	list_del(&dentry->d_u.d_child);
+	/*
+	 * Inform d_walk() that we are no longer attached to the
+	 * dentry tree
+	 */
+	dentry->d_flags |= DCACHE_DENTRY_KILLED;
+	if (parent)
+		spin_unlock(&parent->d_lock);
+	dentry_iput(dentry);
+	/*
+	 * dentry_iput drops the locks, at which point nobody (except
+	 * transient RCU lookups) can reach this dentry.
+	 */
+	BUG_ON((int)dentry->d_lockref.count > 0);
+	this_cpu_dec(nr_dentry);
+	if (dentry->d_op && dentry->d_op->d_release)
+		dentry->d_op->d_release(dentry);
+
+	if (likely(can_free)) {
+		dentry_free(dentry);
+	} else {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+		wake_up(&free_wq);
+	}
+	return parent;
 }
 
 /* 
@@ -602,7 +605,7 @@ repeat:
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1);
+	dentry = dentry_kill(dentry, NULL);
 	if (dentry)
 		goto repeat;
 }
@@ -815,47 +818,10 @@ restart:
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *
- * This may fail if locks cannot be acquired no problem, just try again.
- */
-static struct dentry * try_prune_one_dentry(struct dentry *dentry)
-	__releases(dentry->d_lock)
-{
-	struct dentry *parent;
-
-	parent = dentry_kill(dentry, 0);
-	/*
-	 * If dentry_kill returns NULL, we have nothing more to do.
-	 * if it returns the same dentry, trylocks failed. In either
-	 * case, just loop again.
-	 *
-	 * Otherwise, we need to prune ancestors too. This is necessary
-	 * to prevent quadratic behavior of shrink_dcache_parent(), but
-	 * is also expected to be beneficial in reducing dentry cache
-	 * fragmentation.
-	 */
-	if (!parent)
-		return NULL;
-	if (parent == dentry)
-		return dentry;
-
-	/* Prune ancestors. */
-	dentry = parent;
-	while (dentry) {
-		if (lockref_put_or_lock(&dentry->d_lockref))
-			return NULL;
-		dentry = dentry_kill(dentry, 1);
-	}
-	return NULL;
-}
-
 static void shrink_dentry_list(struct list_head *list)
 {
-	struct dentry *dentry;
+	struct dentry *dentry, *parent;
+	LIST_HEAD(morgue);
 
 	rcu_read_lock();
 	for (;;) {
@@ -891,24 +857,43 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 
+		parent = dentry_kill(dentry, &morgue);
 		/*
-		 * If 'try_to_prune()' returns a dentry, it will
-		 * be the same one we passed in, and d_lock will
-		 * have been held the whole time, so it will not
-		 * have been added to any other lists. We failed
-		 * to get the inode lock.
-		 *
-		 * We just add it back to the shrink list.
+		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
-		dentry = try_prune_one_dentry(dentry);
-
-		rcu_read_lock();
-		if (dentry) {
+		if (!parent) {
+			rcu_read_lock();
+			continue;
+		}
+		if (unlikely(parent == dentry)) {
+			/*
+			 * trylocks have failed and d_lock has been held the
+			 * whole time, so it could not have been added to any
+			 * other lists. Just add it back to the shrink list.
+			 */
+			rcu_read_lock();
 			d_shrink_add(dentry, list);
 			spin_unlock(&dentry->d_lock);
+			continue;
 		}
+		/*
+		 * We need to prune ancestors too. This is necessary to prevent
+		 * quadratic behavior of shrink_dcache_parent(), but is also
+		 * expected to be beneficial in reducing dentry cache
+		 * fragmentation.
+		 */
+		dentry = parent;
+		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
+			dentry = dentry_kill(dentry, NULL);
+		rcu_read_lock();
 	}
 	rcu_read_unlock();
+	while (unlikely(!list_empty(&morgue))) {
+		dentry = list_first_entry(&morgue, struct dentry, d_lru);
+		list_del_init(&dentry->d_lru);
+		wait_event(free_wq, dentry->d_flags & DCACHE_MAY_FREE);
+		dentry_free(dentry);
+	}
 }
 
 static enum lru_status
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)

[-- Attachment #2: mb2 --]
[-- Type: text/plain, Size: 11212 bytes --]

From 895aeb48465bbf78803fd11dee2556d010ada23a Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 15:45:28 -0400
Subject: [PATCH 1/3] fold d_kill() and d_free()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |   76 +++++++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 52 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..9b15c5c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,23 +246,6 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
-{
-	BUG_ON((int)dentry->d_lockref.count > 0);
-	this_cpu_dec(nr_dentry);
-	if (dentry->d_op && dentry->d_op->d_release)
-		dentry->d_op->d_release(dentry);
-
-	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
-		__d_free(&dentry->d_u.d_rcu);
-	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
-}
-
 /**
  * dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
  * @dentry: the target dentry
@@ -420,40 +403,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-	__releases(dentry->d_lock)
-	__releases(parent->d_lock)
-	__releases(dentry->d_inode->i_lock)
-{
-	list_del(&dentry->d_u.d_child);
-	/*
-	 * Inform d_walk() that we are no longer attached to the
-	 * dentry tree
-	 */
-	dentry->d_flags |= DCACHE_DENTRY_KILLED;
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	dentry_iput(dentry);
-	/*
-	 * dentry_iput drops the locks, at which point nobody (except
-	 * transient RCU lookups) can reach this dentry.
-	 */
-	d_free(dentry);
-	return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -546,7 +495,30 @@ relock:
 	dentry_lru_del(dentry);
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
-	return d_kill(dentry, parent);
+	list_del(&dentry->d_u.d_child);
+	/*
+	 * Inform d_walk() that we are no longer attached to the
+	 * dentry tree
+	 */
+	dentry->d_flags |= DCACHE_DENTRY_KILLED;
+	if (parent)
+		spin_unlock(&parent->d_lock);
+	dentry_iput(dentry);
+	/*
+	 * dentry_iput drops the locks, at which point nobody (except
+	 * transient RCU lookups) can reach this dentry.
+	 */
+	BUG_ON((int)dentry->d_lockref.count > 0);
+	this_cpu_dec(nr_dentry);
+	if (dentry->d_op && dentry->d_op->d_release)
+		dentry->d_op->d_release(dentry);
+
+	/* if dentry was never visible to RCU, immediate free is OK */
+	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+		__d_free(&dentry->d_u.d_rcu);
+	else
+		call_rcu(&dentry->d_u.d_rcu, __d_free);
+	return parent;
 }
 
 /* 
-- 
1.7.10.4


From aff934c47717be0216c9e2c10a2e8ca0f829bb54 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 16:13:18 -0400
Subject: [PATCH 2/3] fold try_prune_one_dentry()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |   75 ++++++++++++++++++++---------------------------------------
 1 file changed, 25 insertions(+), 50 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9b15c5c..a5540d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -787,47 +787,9 @@ restart:
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *
- * This may fail if locks cannot be acquired no problem, just try again.
- */
-static struct dentry * try_prune_one_dentry(struct dentry *dentry)
-	__releases(dentry->d_lock)
-{
-	struct dentry *parent;
-
-	parent = dentry_kill(dentry, 0);
-	/*
-	 * If dentry_kill returns NULL, we have nothing more to do.
-	 * if it returns the same dentry, trylocks failed. In either
-	 * case, just loop again.
-	 *
-	 * Otherwise, we need to prune ancestors too. This is necessary
-	 * to prevent quadratic behavior of shrink_dcache_parent(), but
-	 * is also expected to be beneficial in reducing dentry cache
-	 * fragmentation.
-	 */
-	if (!parent)
-		return NULL;
-	if (parent == dentry)
-		return dentry;
-
-	/* Prune ancestors. */
-	dentry = parent;
-	while (dentry) {
-		if (lockref_put_or_lock(&dentry->d_lockref))
-			return NULL;
-		dentry = dentry_kill(dentry, 1);
-	}
-	return NULL;
-}
-
 static void shrink_dentry_list(struct list_head *list)
 {
-	struct dentry *dentry;
+	struct dentry *dentry, *parent;
 
 	rcu_read_lock();
 	for (;;) {
@@ -863,22 +825,35 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 
+		parent = dentry_kill(dentry, 0);
 		/*
-		 * If 'try_to_prune()' returns a dentry, it will
-		 * be the same one we passed in, and d_lock will
-		 * have been held the whole time, so it will not
-		 * have been added to any other lists. We failed
-		 * to get the inode lock.
-		 *
-		 * We just add it back to the shrink list.
+		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
-		dentry = try_prune_one_dentry(dentry);
-
-		rcu_read_lock();
-		if (dentry) {
+		if (!parent) {
+			rcu_read_lock();
+			continue;
+		}
+		if (unlikely(parent == dentry)) {
+			/*
+			 * trylocks have failed and d_lock has been held the
+			 * whole time, so it could not have been added to any
+			 * other lists. Just add it back to the shrink list.
+			 */
+			rcu_read_lock();
 			d_shrink_add(dentry, list);
 			spin_unlock(&dentry->d_lock);
+			continue;
 		}
+		/*
+		 * We need to prune ancestors too. This is necessary to prevent
+		 * quadratic behavior of shrink_dcache_parent(), but is also
+		 * expected to be beneficial in reducing dentry cache
+		 * fragmentation.
+		 */
+		dentry = parent;
+		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
+			dentry = dentry_kill(dentry, 1);
+		rcu_read_lock();
 	}
 	rcu_read_unlock();
 }
-- 
1.7.10.4


>From 68fa8c9758d609094cc6a6004a8765ce0ea8052d Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 17:00:27 -0400
Subject: [PATCH 3/3] don't evict from shrink list on dentry_kill(dentry, 1)

let shrinker deal with that and with actual freeing of such dentries

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c            |   60 +++++++++++++++++++++++++++++++++++++++---------
 include/linux/dcache.h |    2 ++
 2 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a5540d4..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,6 +246,15 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
+static void dentry_free(struct dentry *dentry)
+{
+	/* if dentry was never visible to RCU, immediate free is OK */
+	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+		__d_free(&dentry->d_u.d_rcu);
+	else
+		call_rcu(&dentry->d_u.d_rcu, __d_free);
+}
+
 /**
  * dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
  * @dentry: the target dentry
@@ -448,6 +457,8 @@ void d_drop(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_drop);
 
+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
@@ -455,16 +466,17 @@ EXPORT_SYMBOL(d_drop);
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
 	struct dentry *parent;
+	bool can_free = true;
 
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-		if (unlock_on_failure) {
+		if (!morgue) {
 			spin_unlock(&dentry->d_lock);
 			cpu_relax();
 		}
@@ -480,6 +492,15 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		/* morgue must be non-NULL */
+		list_move(&dentry->d_lru, morgue);
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		/* inode must be NULL */
+		spin_unlock(&dentry->d_lock);
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -492,7 +513,14 @@ relock:
 	if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
 		dentry->d_op->d_prune(dentry);
 
-	dentry_lru_del(dentry);
+	if (dentry->d_flags & DCACHE_LRU_LIST) {
+		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+			d_lru_del(dentry);
+		else if (morgue)
+			d_shrink_del(dentry);
+		else
+			can_free = false;
+	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
 	list_del(&dentry->d_u.d_child);
@@ -513,11 +541,14 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
-		__d_free(&dentry->d_u.d_rcu);
-	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
+	if (likely(can_free)) {
+		dentry_free(dentry);
+	} else {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+		wake_up(&free_wq);
+	}
 	return parent;
 }
 
@@ -574,7 +605,7 @@ repeat:
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1);
+	dentry = dentry_kill(dentry, NULL);
 	if (dentry)
 		goto repeat;
 }
@@ -790,6 +821,7 @@ EXPORT_SYMBOL(d_prune_aliases);
 static void shrink_dentry_list(struct list_head *list)
 {
 	struct dentry *dentry, *parent;
+	LIST_HEAD(morgue);
 
 	rcu_read_lock();
 	for (;;) {
@@ -825,7 +857,7 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 
-		parent = dentry_kill(dentry, 0);
+		parent = dentry_kill(dentry, &morgue);
 		/*
 		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
@@ -852,10 +884,16 @@ static void shrink_dentry_list(struct list_head *list)
 		 */
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry, 1);
+			dentry = dentry_kill(dentry, NULL);
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
+	while (unlikely(!list_empty(&morgue))) {
+		dentry = list_first_entry(&morgue, struct dentry, d_lru);
+		list_del_init(&dentry->d_lru);
+		wait_event(free_wq, dentry->d_flags & DCACHE_MAY_FREE);
+		dentry_free(dentry);
+	}
 }
 
 static enum lru_status
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-04-30  2:31                 ` Al Viro
@ 2014-04-30  2:56                   ` Linus Torvalds
  2014-04-30  4:04                     ` Al Viro
  2014-04-30  9:15                     ` Miklos Szeredi
  0 siblings, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2014-04-30  2:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, Miklos Szeredi, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, aggregate diff follows, more readable splitup (3 commits) attached.
> It seems to survive beating here; testing, review and comments are
> welcome.

Miklos, did you have some particular load that triggered this, or was
it just some reports? It would be really good to get this patch some
stress-testing.

I like how the patch removes more lines than it adds, but apart from
that it's hard to read the patch (even the split-out ones) and say
anything more about it. I think this needs a *lot* of testing.

                 Linus

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-04-30  2:56                   ` Linus Torvalds
@ 2014-04-30  4:04                     ` Al Viro
  2014-04-30 15:49                       ` Miklos Szeredi
  2014-04-30  9:15                     ` Miklos Szeredi
  1 sibling, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30  4:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Miklos Szeredi, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
> 
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
> 
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

FWIW, the first two are really straightforward expanding the function
into its only callsite.  The last needs more splitup.  Not sure if the
following is good enough, but it ought to be at least somewhat cleaner.
Combined change is identical to the original, so it doesn't invalidate
the testing so far...

>From 895aeb48465bbf78803fd11dee2556d010ada23a Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 15:45:28 -0400
Subject: [PATCH 1/6] fold d_kill() and d_free()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |   76 +++++++++++++++++++----------------------------------------
 1 file changed, 24 insertions(+), 52 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..9b15c5c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,23 +246,6 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
-{
-	BUG_ON((int)dentry->d_lockref.count > 0);
-	this_cpu_dec(nr_dentry);
-	if (dentry->d_op && dentry->d_op->d_release)
-		dentry->d_op->d_release(dentry);
-
-	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
-		__d_free(&dentry->d_u.d_rcu);
-	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
-}
-
 /**
  * dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
  * @dentry: the target dentry
@@ -420,40 +403,6 @@ static void dentry_lru_del(struct dentry *dentry)
 }
 
 /**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
-	__releases(dentry->d_lock)
-	__releases(parent->d_lock)
-	__releases(dentry->d_inode->i_lock)
-{
-	list_del(&dentry->d_u.d_child);
-	/*
-	 * Inform d_walk() that we are no longer attached to the
-	 * dentry tree
-	 */
-	dentry->d_flags |= DCACHE_DENTRY_KILLED;
-	if (parent)
-		spin_unlock(&parent->d_lock);
-	dentry_iput(dentry);
-	/*
-	 * dentry_iput drops the locks, at which point nobody (except
-	 * transient RCU lookups) can reach this dentry.
-	 */
-	d_free(dentry);
-	return parent;
-}
-
-/**
  * d_drop - drop a dentry
  * @dentry: dentry to drop
  *
@@ -546,7 +495,30 @@ relock:
 	dentry_lru_del(dentry);
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
-	return d_kill(dentry, parent);
+	list_del(&dentry->d_u.d_child);
+	/*
+	 * Inform d_walk() that we are no longer attached to the
+	 * dentry tree
+	 */
+	dentry->d_flags |= DCACHE_DENTRY_KILLED;
+	if (parent)
+		spin_unlock(&parent->d_lock);
+	dentry_iput(dentry);
+	/*
+	 * dentry_iput drops the locks, at which point nobody (except
+	 * transient RCU lookups) can reach this dentry.
+	 */
+	BUG_ON((int)dentry->d_lockref.count > 0);
+	this_cpu_dec(nr_dentry);
+	if (dentry->d_op && dentry->d_op->d_release)
+		dentry->d_op->d_release(dentry);
+
+	/* if dentry was never visible to RCU, immediate free is OK */
+	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+		__d_free(&dentry->d_u.d_rcu);
+	else
+		call_rcu(&dentry->d_u.d_rcu, __d_free);
+	return parent;
 }
 
 /* 
-- 
1.7.10.4


>From aff934c47717be0216c9e2c10a2e8ca0f829bb54 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 16:13:18 -0400
Subject: [PATCH 2/6] fold try_prune_one_dentry()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |   75 ++++++++++++++++++++---------------------------------------
 1 file changed, 25 insertions(+), 50 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9b15c5c..a5540d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -787,47 +787,9 @@ restart:
 }
 EXPORT_SYMBOL(d_prune_aliases);
 
-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *
- * This may fail if locks cannot be acquired no problem, just try again.
- */
-static struct dentry * try_prune_one_dentry(struct dentry *dentry)
-	__releases(dentry->d_lock)
-{
-	struct dentry *parent;
-
-	parent = dentry_kill(dentry, 0);
-	/*
-	 * If dentry_kill returns NULL, we have nothing more to do.
-	 * if it returns the same dentry, trylocks failed. In either
-	 * case, just loop again.
-	 *
-	 * Otherwise, we need to prune ancestors too. This is necessary
-	 * to prevent quadratic behavior of shrink_dcache_parent(), but
-	 * is also expected to be beneficial in reducing dentry cache
-	 * fragmentation.
-	 */
-	if (!parent)
-		return NULL;
-	if (parent == dentry)
-		return dentry;
-
-	/* Prune ancestors. */
-	dentry = parent;
-	while (dentry) {
-		if (lockref_put_or_lock(&dentry->d_lockref))
-			return NULL;
-		dentry = dentry_kill(dentry, 1);
-	}
-	return NULL;
-}
-
 static void shrink_dentry_list(struct list_head *list)
 {
-	struct dentry *dentry;
+	struct dentry *dentry, *parent;
 
 	rcu_read_lock();
 	for (;;) {
@@ -863,22 +825,35 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 
+		parent = dentry_kill(dentry, 0);
 		/*
-		 * If 'try_to_prune()' returns a dentry, it will
-		 * be the same one we passed in, and d_lock will
-		 * have been held the whole time, so it will not
-		 * have been added to any other lists. We failed
-		 * to get the inode lock.
-		 *
-		 * We just add it back to the shrink list.
+		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
-		dentry = try_prune_one_dentry(dentry);
-
-		rcu_read_lock();
-		if (dentry) {
+		if (!parent) {
+			rcu_read_lock();
+			continue;
+		}
+		if (unlikely(parent == dentry)) {
+			/*
+			 * trylocks have failed and d_lock has been held the
+			 * whole time, so it could not have been added to any
+			 * other lists. Just add it back to the shrink list.
+			 */
+			rcu_read_lock();
 			d_shrink_add(dentry, list);
 			spin_unlock(&dentry->d_lock);
+			continue;
 		}
+		/*
+		 * We need to prune ancestors too. This is necessary to prevent
+		 * quadratic behavior of shrink_dcache_parent(), but is also
+		 * expected to be beneficial in reducing dentry cache
+		 * fragmentation.
+		 */
+		dentry = parent;
+		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
+			dentry = dentry_kill(dentry, 1);
+		rcu_read_lock();
 	}
 	rcu_read_unlock();
 }
-- 
1.7.10.4


>From d250553b1ca7d4c9b0544b4b6a3cdf5e2a3dd147 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 23:40:14 -0400
Subject: [PATCH 3/6] new helper: dentry_free()

The part of old d_free() that dealt with actual freeing of dentry.
Taken out of dentry_kill() into a separate function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a5540d4..dab7db1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,6 +246,15 @@ static void __d_free(struct rcu_head *head)
 	kmem_cache_free(dentry_cache, dentry); 
 }
 
+static void dentry_free(struct dentry *dentry)
+{
+	/* if dentry was never visible to RCU, immediate free is OK */
+	if (!(dentry->d_flags & DCACHE_RCUACCESS))
+		__d_free(&dentry->d_u.d_rcu);
+	else
+		call_rcu(&dentry->d_u.d_rcu, __d_free);
+}
+
 /**
  * dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
  * @dentry: the target dentry
@@ -513,11 +522,7 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	/* if dentry was never visible to RCU, immediate free is OK */
-	if (!(dentry->d_flags & DCACHE_RCUACCESS))
-		__d_free(&dentry->d_u.d_rcu);
-	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
+	dentry_free(dentry);
 	return parent;
 }
 
-- 
1.7.10.4


>From 2f2c6a0a4153054f2b3b44011ffbe8e4bae1a1e1 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 23:42:52 -0400
Subject: [PATCH 4/6] expand the call of dentry_lru_del() in dentry_kill()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index dab7db1..e482775 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -501,7 +501,12 @@ relock:
 	if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
 		dentry->d_op->d_prune(dentry);
 
-	dentry_lru_del(dentry);
+	if (dentry->d_flags & DCACHE_LRU_LIST) {
+		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+			d_lru_del(dentry);
+		else
+			d_shrink_del(dentry);
+	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
 	list_del(&dentry->d_u.d_child);
-- 
1.7.10.4


>From 1d1c5bc3cff207847a2643d85442d25adcf9041a Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: [PATCH 5/6] Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations.  Now the meat of that
thing: dentry_kill() has two kinds of call sites - ones that had just
dropped the last reference (dput(), killing ancestors in shrink_dentry_list())
and one where the victim had sat on shrink list with zero refcount.  We already
have a flag telling which kind it is (unlock_on_failure).  What we want to do
is to avoid d_shrink_del() in the first kind of dentry_kill().  We do, however,
want everything except the actual freeing still done as we would in mainline.
So we need to deal with two new situations - the first kind of dentry_kill()
finding a dentry on shrink list (result of laziness in dget(); we had it on
shrink list with refcount 1) and the second kind finding a dentry that is
currently being killed by the first kind.  What we do is this:

* add another list in shrink_dentry_list() ('morgue'), pass it to the second
kind of dentry_kill(), which will move dentry there, drop locks and do nothing
else if it finds that the dentry is in process of being killed; we detect that
by seeing DCACHE_DENTRY_KILLED already set.
* have the first kind skip removal from shrink list and actual freeing
if dentry was on the shrink list; in that case instead of freeing the victim
just mark it DCACHE_MAY_FREE and wake free_wq.
* once shrink_dentry_list() has dealt with the entire list it has been given,
morgue will contain the ones that had triggered "don't free them yet" path in
dentry_kill(dentry, 1, ...).  They are going to acquire DCACHE_MAY_FREE once
dentry_kill() is through with them, so just wait for that (on free_wq) and
free them.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c            |   40 ++++++++++++++++++++++++++++++++++------
 include/linux/dcache.h |    2 ++
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..f0773f6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -457,6 +457,8 @@ void d_drop(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_drop);
 
+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
 /*
  * Finish off a dentry we've decided to kill.
  * dentry->d_lock must be held, returns with it unlocked.
@@ -464,11 +466,12 @@ EXPORT_SYMBOL(d_drop);
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, int unlock_on_failure, struct list_head *morgue)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
 	struct dentry *parent;
+	bool can_free = true;
 
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
@@ -489,6 +492,15 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(morgue && dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		list_move(&dentry->d_lru, morgue);
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (inode)
+			spin_unlock(&inode->i_lock);
+		spin_unlock(&dentry->d_lock);
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +516,10 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
+		else if (!unlock_on_failure)
 			d_shrink_del(dentry);
+		else
+			can_free = false;
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +541,14 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	if (likely(can_free)) {
+		dentry_free(dentry);
+	} else {
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+		wake_up(&free_wq);
+	}
 	return parent;
 }
 
@@ -584,7 +605,7 @@ repeat:
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1);
+	dentry = dentry_kill(dentry, 1, NULL);
 	if (dentry)
 		goto repeat;
 }
@@ -800,6 +821,7 @@ EXPORT_SYMBOL(d_prune_aliases);
 static void shrink_dentry_list(struct list_head *list)
 {
 	struct dentry *dentry, *parent;
+	LIST_HEAD(morgue);
 
 	rcu_read_lock();
 	for (;;) {
@@ -835,7 +857,7 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 
-		parent = dentry_kill(dentry, 0);
+		parent = dentry_kill(dentry, 0, &morgue);
 		/*
 		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
@@ -862,10 +884,16 @@ static void shrink_dentry_list(struct list_head *list)
 		 */
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry, 1);
+			dentry = dentry_kill(dentry, 1, NULL);
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
+	while (unlikely(!list_empty(&morgue))) {
+		dentry = list_first_entry(&morgue, struct dentry, d_lru);
+		list_del_init(&dentry->d_lru);
+		wait_event(free_wq, dentry->d_flags & DCACHE_MAY_FREE);
+		dentry_free(dentry);
+	}
 }
 
 static enum lru_status
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
-- 
1.7.10.4


>From 3b9afa6f49ec53bf45067584d5baf567a0aa5105 Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 23:57:14 -0400
Subject: [PATCH 6/6] clean up afterwards

a) dentry_kill() arguments are redundant - unlock_on_failure == !morgue
b) the second kind of dentry_kill() can see a dentry in the process of being
killed only after it's been unlocked by dentry_iput(), so inode is
guaranteed to be NULL in that case
c) the first kind of dentry_kill() can't see DCACHE_DENTRY_KILLED at all
(it's only set when refcount reaches zero and is unable to grow back;
the first kind is done right after having dropped the refcount to zero).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f0773f6..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -466,7 +466,7 @@ static DECLARE_WAIT_QUEUE_HEAD(free_wq);
  * Returns dentry requiring refcount drop, or NULL if we're done.
  */
 static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure, struct list_head *morgue)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
 	__releases(dentry->d_lock)
 {
 	struct inode *inode;
@@ -476,7 +476,7 @@ dentry_kill(struct dentry *dentry, int unlock_on_failure, struct list_head *morg
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
 relock:
-		if (unlock_on_failure) {
+		if (!morgue) {
 			spin_unlock(&dentry->d_lock);
 			cpu_relax();
 		}
@@ -492,12 +492,12 @@ relock:
 		goto relock;
 	}
 
-	if (unlikely(morgue && dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		/* morgue must be non-NULL */
 		list_move(&dentry->d_lru, morgue);
 		if (parent)
 			spin_unlock(&parent->d_lock);
-		if (inode)
-			spin_unlock(&inode->i_lock);
+		/* inode must be NULL */
 		spin_unlock(&dentry->d_lock);
 		return parent;
 	}
@@ -516,7 +516,7 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else if (!unlock_on_failure)
+		else if (morgue)
 			d_shrink_del(dentry);
 		else
 			can_free = false;
@@ -605,7 +605,7 @@ repeat:
 	return;
 
 kill_it:
-	dentry = dentry_kill(dentry, 1, NULL);
+	dentry = dentry_kill(dentry, NULL);
 	if (dentry)
 		goto repeat;
 }
@@ -857,7 +857,7 @@ static void shrink_dentry_list(struct list_head *list)
 		}
 		rcu_read_unlock();
 
-		parent = dentry_kill(dentry, 0, &morgue);
+		parent = dentry_kill(dentry, &morgue);
 		/*
 		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
@@ -884,7 +884,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 */
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry, 1, NULL);
+			dentry = dentry_kill(dentry, NULL);
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-04-30  2:56                   ` Linus Torvalds
  2014-04-30  4:04                     ` Al Viro
@ 2014-04-30  9:15                     ` Miklos Szeredi
  2014-05-02  5:51                       ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-30  9:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
> 
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
> 
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
tried to reproduce it.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-04-30  4:04                     ` Al Viro
@ 2014-04-30 15:49                       ` Miklos Szeredi
  2014-04-30 15:56                         ` Miklos Szeredi
  2014-04-30 16:03                         ` Al Viro
  0 siblings, 2 replies; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-30 15:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 05:04:36AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > > It seems to survive beating here; testing, review and comments are
> > > welcome.
> > 
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> > 
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
> 
> FWIW, the first two are really straightforward expanding the function
> into its only callsite.  The last needs more splitup.  Not sure if the
> following is good enough, but it ought to be at least somewhat cleaner.
> Combined change is identical to the original, so it doesn't invalidate
> the testing so far...

Hmm, patches look okay, but I'm wondering if we really need the morgue list and
the waiting.  Why not just skip dentries that are presently being handled by
dentry_kill()?

Thanks,
Miklos
---


From: Al Viro <viro@zeniv.linux.org.uk>
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations.  Now the meat of
that thing: dentry_kill() has two kinds of call sites - ones that had just
dropped the last reference (dput(), killing ancestors in
shrink_dentry_list()) and one where the victim had sat on shrink list with
zero refcount.  We already have a flag telling which kind it is
(unlock_on_failure).  What we want to do is to avoid d_shrink_del() in the
first kind of dentry_kill().  We do, however, want everything except the
actual freeing still done as we would in mainline.  So we need to deal with
two new situations - the first kind of dentry_kill() finding a dentry on
shrink list (result of laziness in dget(); we had it on shrink list with
refcount 1) and the second kind finding a dentry that is currently being
killed by the first kind.  What we do is this:

 - in the first kind of dentry_kill() we don't remove the dentry from the
   shrink list, this makes the shrink list really private to the shrinker
   functions

 - we proceed with the normal killing but only actually free the dentry if
   it's definitely not on the shrink list at this point.  If it's still on
   the shrink list set DCACHE_MAY_FREE and defer the freeing to
   shrink_dentry_list()

 - shrink_dentry_list() will detect if the dentry is killed with
   DCACHE_DENTRY_KILLED.  If DCACHE_MAY_FREE isn't yet set, just take the
   dentry off the shrink list and let the still running dentry_kill()
   finish it off.  If DCACHE_MAY_FREE is set, then free the dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/dcache.c            |   44 ++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |    2 ++
 2 files changed, 44 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -469,6 +469,7 @@ dentry_kill(struct dentry *dentry, int u
 {
 	struct inode *inode;
 	struct dentry *parent;
+	bool can_free = true;
 
 	inode = dentry->d_inode;
 	if (inode && !spin_trylock(&inode->i_lock)) {
@@ -504,8 +505,10 @@ dentry_kill(struct dentry *dentry, int u
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
+		else if (!unlock_on_failure)
 			d_shrink_del(dentry);
+		else
+			can_free = false;
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +530,25 @@ dentry_kill(struct dentry *dentry, int u
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	if (unlikely(!can_free)) {
+		spin_lock(&dentry->d_lock);
+		/*
+		 * Could have gotten off the shrink list while not holding the
+		 * dcache lock.
+		 *
+		 * If still on the shrink list shrink_dentry_list will take care
+		 * of freeing it for us.  Signal this by setting the
+		 * DCACHE_MAY_FREE flag.
+		 */
+		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+			can_free = true;
+		else
+			dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+	}
+	if (likely(can_free))
+		dentry_free(dentry);
+
 	return parent;
 }
 
@@ -835,6 +856,25 @@ static void shrink_dentry_list(struct li
 		}
 		rcu_read_unlock();
 
+		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+			bool free_it;
+			/*
+			 * This dentry has already been killed, but was stuck on
+			 * the shrink list.  It may be undergoing cleanup, in
+			 * which case let dput() finish it off.
+			 *
+			 * If it's already clean, dput() deferred freeing to us.
+			 */
+			free_it = (dentry->d_flags & DCACHE_MAY_FREE) != 0;
+			spin_unlock(&dentry->d_lock);
+
+			if (free_it)
+				dentry_free(dentry);
+
+			rcu_read_lock();
+			continue;
+		}
+
 		parent = dentry_kill(dentry, 0);
 		/*
 		 * If dentry_kill returns NULL, we have nothing more to do.
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)
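The handoff protocol described in the changelog above (dput() defers freeing via DCACHE_MAY_FREE when the shrinker still owns the dentry) can be modelled in a few lines. This is an editorial, single-threaded toy sketch with invented names (`toy_*`, `T_*` are stand-ins, not the kernel's dcache structures); it only demonstrates that, in either ordering, the dentry is freed exactly once.

```c
/*
 * Toy model of the dput()/shrink_dentry_list() handoff.  All names and
 * flag values are simplified stand-ins, not real kernel code.
 */
#include <assert.h>

#define T_SHRINK_LIST	0x1	/* stand-in for DCACHE_SHRINK_LIST */
#define T_KILLED	0x2	/* stand-in for DCACHE_DENTRY_KILLED */
#define T_MAY_FREE	0x4	/* stand-in for DCACHE_MAY_FREE */

struct toy_dentry {
	unsigned flags;
	int freed;		/* counts "dentry_free()" calls */
};

static void toy_free(struct toy_dentry *d)
{
	d->freed++;
}

/* dput() path: kill, then free only if the shrinker no longer owns it. */
static void toy_dput_kill(struct toy_dentry *d)
{
	d->flags |= T_KILLED;
	/* ... __d_drop(), dentry_iput(), ->d_release() would run here ... */
	if (d->flags & T_SHRINK_LIST)
		d->flags |= T_MAY_FREE;	/* defer freeing to the shrinker */
	else
		toy_free(d);
}

/* Shrinker path: take the dentry off its private list, then decide. */
static void toy_shrink_one(struct toy_dentry *d)
{
	d->flags &= ~T_SHRINK_LIST;
	if (d->flags & T_KILLED) {
		/* Already killed by dput(); free only if it deferred to us,
		 * otherwise the still-running dput() finishes it off. */
		if (d->flags & T_MAY_FREE)
			toy_free(d);
		return;
	}
	/* Not killed yet: the shrinker kills and frees it itself. */
	d->flags |= T_KILLED;
	toy_free(d);
}
```

Running both orderings shows each dentry is freed once: dput-first leaves the dentry for the shrinker via T_MAY_FREE, while a dentry never put on the list is freed directly by dput.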

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-04-30 15:49                       ` Miklos Szeredi
@ 2014-04-30 15:56                         ` Miklos Szeredi
  2014-04-30 16:03                         ` Al Viro
  1 sibling, 0 replies; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-30 15:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

And a followup patch for that, removing RCU locking from shrink_dentry_list().

I haven't tested any of this at this point, but I'll ask IBM to do some stress
testing.

Thanks,
Miklos
---

From: Miklos Szeredi <mszeredi@suse.cz>
Subject: dcache: don't need rcu in shrink_dentry_list()

Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/dcache.c |   29 ++++-------------------------
 1 file changed, 4 insertions(+), 25 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -822,23 +822,9 @@ static void shrink_dentry_list(struct li
 {
 	struct dentry *dentry, *parent;
 
-	rcu_read_lock();
-	for (;;) {
-		dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
-		if (&dentry->d_lru == list)
-			break; /* empty */
-
-		/*
-		 * Get the dentry lock, and re-verify that the dentry is
-		 * this on the shrinking list. If it is, we know that
-		 * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
-		 */
+	while (!list_empty(list)) {
+		dentry = list_entry(list->prev, struct dentry, d_lru);
 		spin_lock(&dentry->d_lock);
-		if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
-			spin_unlock(&dentry->d_lock);
-			continue;
-		}
-
 		/*
 		 * The dispose list is isolated and dentries are not accounted
 		 * to the LRU here, so we can simply remove it from the list
@@ -854,7 +840,6 @@ static void shrink_dentry_list(struct li
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
-		rcu_read_unlock();
 
 		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
 			bool free_it;
@@ -870,8 +855,6 @@ static void shrink_dentry_list(struct li
 
 			if (free_it)
 				dentry_free(dentry);
-
-			rcu_read_lock();
 			continue;
 		}
 
@@ -879,17 +862,15 @@ static void shrink_dentry_list(struct li
 		/*
 		 * If dentry_kill returns NULL, we have nothing more to do.
 		 */
-		if (!parent) {
-			rcu_read_lock();
+		if (!parent)
 			continue;
-		}
+
 		if (unlikely(parent == dentry)) {
 			/*
 			 * trylocks have failed and d_lock has been held the
 			 * whole time, so it could not have been added to any
 			 * other lists. Just add it back to the shrink list.
 			 */
-			rcu_read_lock();
 			d_shrink_add(dentry, list);
 			spin_unlock(&dentry->d_lock);
 			continue;
@@ -903,9 +884,7 @@ static void shrink_dentry_list(struct li
 		dentry = parent;
 		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
 			dentry = dentry_kill(dentry, 1);
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 }
 
 static enum lru_status


* Re: dcache shrink list corruption?
  2014-04-30 15:49                       ` Miklos Szeredi
  2014-04-30 15:56                         ` Miklos Szeredi
@ 2014-04-30 16:03                         ` Al Viro
  2014-04-30 17:33                           ` Miklos Szeredi
  1 sibling, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 16:03 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> > FWIW, the first two are really straightforward expanding the function
> > into its only callsite.  The last needs more splitup.  Not sure if the
> > following is good enough, but it ought to be at least somewhat cleaner.
> > Combined change is identical to the original, so it doesn't invalidate
> > the testing so far...
> 
> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
> the waiting.  Why not just skip dentries that are presently being handled by
> dentry_kill()?

Who will be freeing them?  If we do that from dentry_kill(), we are back to
needing them removed from shrink list by something called by dput().  And
if we do that from shrink_dentry_list(), we'd damn better wait for
dentry_iput() et.al. to finish.


* Re: dcache shrink list corruption?
  2014-04-30 16:03                         ` Al Viro
@ 2014-04-30 17:33                           ` Miklos Szeredi
  2014-04-30 18:36                             ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-30 17:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 6:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> > FWIW, the first two are really straightforward expanding the function
>> > into its only callsite.  The last needs more splitup.  Not sure if the
>> > following is good enough, but it ought to be at least somewhat cleaner.
>> > Combined change is identical to the original, so it doesn't invalidate
>> > the testing so far...
>>
>> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
>> the waiting.  Why not just skip dentries that are presently being handled by
>> dentry_kill()?
>
> Who will be freeing them?  If we do that from dentry_kill(), we are back to
> needing them removed from shrink list by something called by dput().  And
> if we do that from shrink_dentry_list(), we'd damn better wait for
> dentry_iput() et.al. to finish.

We can do it from dput if the shrinker gets there first and from the
shrinker if dput manages to finish before.  See the updated patch in
the previous mail.

Thanks,
Miklos


* Re: dcache shrink list corruption?
  2014-04-30 17:33                           ` Miklos Szeredi
@ 2014-04-30 18:36                             ` Al Viro
  2014-04-30 18:42                               ` Miklos Szeredi
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 18:36 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> >> > FWIW, the first two are really straightforward expanding the function
> >> > into its only callsite.  The last needs more splitup.  Not sure if the
> >> > following is good enough, but it ought to be at least somewhat cleaner.
> >> > Combined change is identical to the original, so it doesn't invalidate
> >> > the testing so far...
> >>
> >> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
> >> the waiting.  Why not just skip dentries that are presently being handled by
> >> dentry_kill()?
> >
> > Who will be freeing them?  If we do that from dentry_kill(), we are back to
> > needing them removed from shrink list by something called by dput().  And
> > if we do that from shrink_dentry_list(), we'd damn better wait for
> > dentry_iput() et.al. to finish.
> 
> We can do it from dput if the shrinker gets there first and from the
> shrinker if dput manages to finish before.  See the updated patch in
> the previous mail.

Er?  The only patch I see is removal of RCU from shrink_dentry_list(), which
is fine, but doesn't do anything of that sort.  What was the Message-ID?

Let me see if I understand what you are proposing:
	dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
		check DCACHE_MAY_FREE, free if it's been set
	dentry_kill(dentry, 1) seeing that we are on shrinker list
		leave on the list, do the work on killing, retake ->d_lock,
		if we are still on shrinker list
			set DCACHE_MAY_FREE,
		else
			free it

That would probably work...


* Re: dcache shrink list corruption?
  2014-04-30 18:36                             ` Al Viro
@ 2014-04-30 18:42                               ` Miklos Szeredi
  2014-04-30 19:02                                 ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Miklos Szeredi @ 2014-04-30 18:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 8:36 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
>> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> >> > FWIW, the first two are really straightforward expanding the function
>> >> > into its only callsite.  The last needs more splitup.  Not sure if the
>> >> > following is good enough, but it ought to be at least somewhat cleaner.
>> >> > Combined change is identical to the original, so it doesn't invalidate
>> >> > the testing so far...
>> >>
>> >> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
>> >> the waiting.  Why not just skip dentries that are presently being handled by
>> >> dentry_kill()?
>> >
>> > Who will be freeing them?  If we do that from dentry_kill(), we are back to
>> > needing them removed from shrink list by something called by dput().  And
>> > if we do that from shrink_dentry_list(), we'd damn better wait for
>> > dentry_iput() et.al. to finish.
>>
>> We can do it from dput if the shrinker gets there first and from the
>> shrinker if dput manages to finish before.  See the updated patch in
>> the previous mail.
>
> Er?  The only patch I see is removal of RCU from shrink_dentry_list(), which
> is fine, but doesn't do anything of that sort.  What was the Message-ID?

Message-ID: <20140430154958.GC3113@tucsk.piliscsaba.szeredi.hu>

>
> Let me see if I understand what you are proposing:
>         dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED
> >                 check DCACHE_MAY_FREE, free if it's been set
>         dentry_kill(dentry, 1) seeing that we are on shrinker list
>                 leave on the list, do the work on killing, retake ->d_lock,
>                 if we are still on shrinker list
>                         set DCACHE_MAY_FREE,
>                 else
>                         free it
>
> That would probably work...

Exactly.

Thanks,
Miklos


* Re: dcache shrink list corruption?
  2014-04-30 18:42                               ` Miklos Szeredi
@ 2014-04-30 19:02                                 ` Al Viro
  2014-04-30 19:59                                   ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 19:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:

> Message-ID: <20140430154958.GC3113@tucsk.piliscsaba.szeredi.hu>

I see...  Several points:
	* I still think that it's better to do handling of "we see
DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
	* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
list - the caller has just removed it from there and we'd kept ->d_lock
since that point.   What's more, with that scheme we don't need to bother
with can_free at all - just grab ->d_lock after ->d_release() call and check
DCACHE_SHRINK_LIST.  No sense trying to avoid that - in case where we
could just go ahead and free the sucker, there's neither contention nor
anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
is pretty much free.

IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..82d8eb4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,17 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (dentry->d_flags & DCACHE_MAY_FREE) {
+			spin_unlock(&dentry->d_lock);
+			dentry_free(dentry);
+		} else {
+			spin_unlock(&dentry->d_lock);
+		}
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +515,6 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
-			d_shrink_del(dentry);
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +536,14 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		dentry->d_flags |= DCACHE_MAY_FREE;
+		spin_unlock(&dentry->d_lock);
+	} else {
+		spin_unlock(&dentry->d_lock);
+		dentry_free(dentry);
+	}
 	return parent;
 }
 
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)


* Re: dcache shrink list corruption?
  2014-04-30 19:02                                 ` Al Viro
@ 2014-04-30 19:59                                   ` Al Viro
  2014-04-30 20:23                                     ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 19:59 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 08:02:27PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
> 
> > Message-ID: <20140430154958.GC3113@tucsk.piliscsaba.szeredi.hu>
> 
> I see...  Several points:
> 	* I still think that it's better to do handling of "we see
> DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
> 	* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
> list - the caller has just removed it from there and we'd kept ->d_lock
> since that point.   What's more, with that scheme we don't need to bother
> with can_free at all - just grab ->d_lock after ->d_release() call and check
> DCACHE_SHRINK_LIST.  No sense trying to avoid that - in case where we
> could just go ahead and free the sucker, there's neither contention nor
> anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
> is pretty much free.
> 
> IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

Sigh...  One more problem:
                /*
                 * We found an inuse dentry which was not removed from
                 * the LRU because of laziness during lookup. Do not free it.
                 */
                if (dentry->d_lockref.count) {
                        spin_unlock(&dentry->d_lock);
                        continue;
                }
should become
		if (dentry->d_lockref.count > 0) {
			....
instead - lockref_mark_dead() slaps -128 into it, so we'll just leak all
dentries where dput() has come first and decided to leave the suckers
for us.
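The sign of that check matters because lockref_mark_dead() stores a negative sentinel into the count. A toy sketch of the distinction (the struct and `TOY_DEAD` are illustrative stand-ins mirroring the -128 mentioned above, not the kernel's struct lockref):

```c
/* Why "count > 0" rather than "count != 0" for the busy test. */
#include <assert.h>
#include <stdbool.h>

struct toy_lockref { int count; };

#define TOY_DEAD (-128)	/* stand-in for lockref_mark_dead()'s value */

static void toy_mark_dead(struct toy_lockref *l)
{
	l->count = TOY_DEAD;
}

/* Buggy test: a dead (negative) count looks "in use", so it leaks. */
static bool busy_buggy(const struct toy_lockref *l)
{
	return l->count != 0;
}

/* Fixed test: only a positive count means the dentry is really in use. */
static bool busy_fixed(const struct toy_lockref *l)
{
	return l->count > 0;
}
```

With the buggy test, a dentry whose dput() has already marked it dead is skipped forever; the fixed test lets the shrinker proceed to free it.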

Another thing: I don't like what's going on with freeing vs. ->d_lock there.
Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
dereference p after another CPU gets through spin_lock(p)?  Linus?

It can be dealt with by setting DCACHE_RCUACCESS in "let another guy free
it" cases and playing with rcu_read_lock a bit, but I wonder whether we
need to bother - quick look through alpha/sparc/ppc shows that on those
we do not and the same is true for non-paravirt case on x86.  I hadn't
checked what paravirt one does, though, and I certainly hadn't done
exhaustive search for architectures doing something weird...


* Re: dcache shrink list corruption?
  2014-04-30 19:59                                   ` Al Viro
@ 2014-04-30 20:23                                     ` Linus Torvalds
  2014-04-30 20:38                                       ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-04-30 20:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 12:59 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Another thing: I don't like what's going on with freeing vs. ->d_lock there.
> Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
> use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
> dereference p after another CPU gets through spin_lock(p)?  Linus?

spin_unlock() *should* be safe wrt that issue.

But I have to say, I think paravirtualized spinlocks may break that.
They do all kinds of "kick waiters" after releasing the lock.

Doesn't the RCU protection solve that, though? Nobody should be
releasing the dentry under us, afaik..

            Linus


* Re: dcache shrink list corruption?
  2014-04-30 20:23                                     ` Linus Torvalds
@ 2014-04-30 20:38                                       ` Al Viro
  2014-04-30 20:57                                         ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 20:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 01:23:26PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 12:59 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Another thing: I don't like what's going on with freeing vs. ->d_lock there.
> > Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
> > use-after-free of pipe_inode_info".  The question is, can spin_unlock(p)
> > dereference p after another CPU gets through spin_lock(p)?  Linus?
> 
> spin_unlock() *should* be safe wrt that issue.
> 
> But I have to say, I think paravirtualized spinlocks may break that.
> They do all kinds of "kick waiters" after releasing the lock.
> 
> Doesn't the RCU protection solve that, though? Nobody should be
> releasing the dentry under us, afaik..

We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
trigger any amount of IO, for one thing.  We can take it around the
couple of places where do that spin_unlock(&dentry->d_lock) (along with
setting DCACHE_RCUACCESS) - that's what I'd been refering to.  Then this
sucker (tests still running, so far everything seems to survive) becomes
the following (again, on top of 1/6..4/6).  BTW, is there any convenient
way to tell git commit --amend to update the commit date?  Something
like --date=now would be nice, but it isn't accepted...

commit 797ff22681dc969b478ed837787d24dfd2dd2132
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Tue Apr 29 23:52:05 2014 -0400

    dentry_kill(): don't try to remove from shrink list
    
    If the victim in on the shrink list, don't remove it from there.
    If shrink_dentry_list() manages to remove it from the list before
    we are done - fine, we'll just free it as usual.  If not - mark
    it with new flag (DCACHE_MAY_FREE) and leave it there.
    
    Eventually, shrink_dentry_list() will get to it, remove the sucker
    from shrink list and call dentry_kill(dentry, 0).  Which is where
    we'll deal with freeing.
    
    Since now dentry_kill(dentry, 0) may happen after or during
    dentry_kill(dentry, 1), we need to recognize that (by seeing
    DCACHE_DENTRY_KILLED already set), unlock everything
    and either free the sucker (in case DCACHE_MAY_FREE has been
    set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..fa40d26 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,20 @@ relock:
 		goto relock;
 	}
 
+	if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+		if (parent)
+			spin_unlock(&parent->d_lock);
+		if (dentry->d_flags & DCACHE_MAY_FREE) {
+			spin_unlock(&dentry->d_lock);
+			dentry_free(dentry);
+		} else {
+			dentry->d_flags |= DCACHE_RCUACCESS;
+			rcu_read_lock();
+			spin_unlock(&dentry->d_lock);
+			rcu_read_unlock();
+		}
+		return parent;
+	}
 	/*
 	 * The dentry is now unrecoverably dead to the world.
 	 */
@@ -504,8 +518,6 @@ relock:
 	if (dentry->d_flags & DCACHE_LRU_LIST) {
 		if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
 			d_lru_del(dentry);
-		else
-			d_shrink_del(dentry);
 	}
 	/* if it was on the hash then remove it */
 	__d_drop(dentry);
@@ -527,7 +539,16 @@ relock:
 	if (dentry->d_op && dentry->d_op->d_release)
 		dentry->d_op->d_release(dentry);
 
-	dentry_free(dentry);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		dentry->d_flags |= DCACHE_MAY_FREE | DCACHE_RCUACCESS;
+		rcu_read_lock();
+		spin_unlock(&dentry->d_lock);
+		rcu_read_unlock();
+	} else {
+		spin_unlock(&dentry->d_lock);
+		dentry_free(dentry);
+	}
 	return parent;
 }
 
@@ -829,7 +850,7 @@ static void shrink_dentry_list(struct list_head *list)
 		 * We found an inuse dentry which was not removed from
 		 * the LRU because of laziness during lookup. Do not free it.
 		 */
-		if (dentry->d_lockref.count) {
+		if (dentry->d_lockref.count > 0) {
 			spin_unlock(&dentry->d_lock);
 			continue;
 		}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
 #define DCACHE_SYMLINK_TYPE		0x00300000 /* Symlink */
 #define DCACHE_FILE_TYPE		0x00400000 /* Other file type */
 
+#define DCACHE_MAY_FREE			0x00800000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(const struct dentry *dentry)


* Re: dcache shrink list corruption?
  2014-04-30 20:38                                       ` Al Viro
@ 2014-04-30 20:57                                         ` Linus Torvalds
  2014-04-30 21:12                                           ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-04-30 20:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> trigger any amount of IO, for one thing.  We can take it around the
> couple of places where do that spin_unlock(&dentry->d_lock) (along with
> setting DCACHE_RCUACCESS) - that's what I'd been refering to.

Just the last spin_unlock() would be the case that matters, if the
spin_unlock() is done on something that could be freed immediately and
the lock protects and is inside the entity that gets freed.

>   BTW, is there any convenient
> way to tell git commit --amend to update the commit date?  Something
> like --date=now would be nice, but it isn't accepted...

--date="$(date)" works.

The "--date" thing doesn't take the nice "git approxidate" strings,
maybe it should. So you have to give it a "real" date.

              Linus


* Re: dcache shrink list corruption?
  2014-04-30 20:57                                         ` Linus Torvalds
@ 2014-04-30 21:12                                           ` Al Viro
  2014-04-30 22:12                                             ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 21:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> > trigger any amount of IO, for one thing.  We can take it around the
> > couple of places where do that spin_unlock(&dentry->d_lock) (along with
> > setting DCACHE_RCUACCESS) - that's what I'd been refering to.
> 
> Just the last spin_unlock() would be the case that matters, if the
> spin_unlock() is done on something that could be freed immediately and
> the lock protects and is inside the entity that gets freed.

*nod*

There are two such spin_unlock (handover from shrink_dentry_list() to
dput() and the opposite one), but they are all that needs protection -
->d_flags update is outside the rcu-critical area.  I really wonder
if we *can* get there without DCACHE_RCUACCESS having been set, though;
dentry would have to be
	* picked into shrink list (i.e. have had zero refcount at some point)
	* never had been through __d_rehash()
shrink_dentry_list() definitely counts on that being impossible, and it
probably is, but I'm feeling seriously paranoid about the whole area.
I'll finish grepping through the tree and probably drop setting
DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
if it turns out that it is possible and shrink_dentry_list() is fucked...

Tests seem to be running fine so far...

> >   BTW, is there any convenient
> > way to tell git commit --amend to update the commit date?  Something
> > like --date=now would be nice, but it isn't accepted...
> 
> --date="$(date)" works.

Thanks...


* Re: dcache shrink list corruption?
  2014-04-30 21:12                                           ` Al Viro
@ 2014-04-30 22:12                                             ` Al Viro
  2014-04-30 23:04                                               ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 22:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 10:12:06PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> > > trigger any amount of IO, for one thing.  We can take it around the
> > > couple of places where do that spin_unlock(&dentry->d_lock) (along with
> > > setting DCACHE_RCUACCESS) - that's what I'd been refering to.
> > 
> > Just the last spin_unlock() would be the case that matters, if the
> > spin_unlock() is done on something that could be freed immediately and
> > the lock protects and is inside the entity that gets freed.
> 
> *nod*
> 
> There are two such spin_unlock (handover from shrink_dentry_list() to
> dput() and the opposite one), but they are all that needs protection -
> ->d_flags update is outside the rcu-critical area.  I really wonder
> if we *can* get there without DCACHE_RCUACCESS having been set, though;
> dentry would have to be
> 	* picked into shrink list (i.e. have had zero refcount at some point)
> 	* never had been through __d_rehash()
> shrink_dentry_list() definitely counts on that being impossible, and it
> probably is, but I'm feeling seriously paranoid about the whole area.
> I'll finish grepping through the tree and probably drop setting
> DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
> if it turns out that it is possible and shrink_dentry_list() is fucked...

OK, it really can't happen.  The proof is more convoluted than I'd like it,
but it's solid enough, so setting that flag in dentry_kill() handover cases
wasn't needed.  I've just pushed the whole thing to vfs.git#for-linus;
review and testing would be very welcome.  I can repost it one more time,
but the only difference compared to the last variant in this thread is not
bothering with DCACHE_RCUACCESS.

It has survived LTP tests, going through xfstests now...


* Re: dcache shrink list corruption?
  2014-04-30 22:12                                             ` Al Viro
@ 2014-04-30 23:04                                               ` Linus Torvalds
  2014-04-30 23:14                                                 ` Linus Torvalds
  2014-04-30 23:38                                                 ` Al Viro
  0 siblings, 2 replies; 61+ messages in thread
From: Linus Torvalds @ 2014-04-30 23:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 3:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I've just pushed the whole thing to vfs.git#for-linus;
> review and testing would be very welcome.

I have no half-way relevant test-case for this, so I'm hoping people
who have good VFS stress-tests (preferably under memory pressure so
that we get that whole shrinking path) will test.

But it looks fine.

That said, I do hate that RCU read-lock around the final spin-unlock.

Let me go talk to the paravirt people. Maybe they don't need this, and
I don't know exactly *how* they use that lock pointer after the unlock
in the "kick waiters" part. Maybe it's all good.

                     Linus


* Re: dcache shrink list corruption?
  2014-04-30 23:04                                               ` Linus Torvalds
@ 2014-04-30 23:14                                                 ` Linus Torvalds
  2014-04-30 23:43                                                   ` Al Viro
  2014-04-30 23:38                                                 ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-04-30 23:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Let me go talk to the paravirt people. Maybe they don't need this, and
> I don't know exactly *how* they use that lock pointer after the unlock
> in the "kick waiters" part. Maybe it's all good.

.. looking at current users (xen and kvm) it does in fact look all
good. Yes, we "kick" possible waiters after the unlock, but the lock
itself is not touched by that, it just uses the pointer to the lock as
a way to figure out who to kick.

In fact, I kind of knew that, but had forgotten. We very much depend
on spin_unlock being safe wrt immediate deletion already: the
"completion" code very much depends on it. It does a "spin_unlock()"
to release the completion, and it can get reused immediately (it might
be on the stack, it might be in some data structure that gets freed).

So I'd suggest removing that whole RCU thing, because it should be
safe to unlock something that can go away immediately. We'd have huge
problems elsewhere if it wasn't safe.

               Linus


* Re: dcache shrink list corruption?
  2014-04-30 23:04                                               ` Linus Torvalds
  2014-04-30 23:14                                                 ` Linus Torvalds
@ 2014-04-30 23:38                                                 ` Al Viro
  1 sibling, 0 replies; 61+ messages in thread
From: Al Viro @ 2014-04-30 23:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 04:04:58PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 3:12 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > I've just pushed the whole thing to vfs.git#for-linus;
> > review and testing would be very welcome.
> 
> I have no half-way relevant test-case for this, so I'm hoping people
> who have good VFS stress-tests (preferably under memory pressure so
> that we get that whole shrinking path) will test.
> 
> But it looks fine.
> 
> That said, I do hate that RCU read-lock around the final spin-unlock.

So do I, obviously...  After looking through the rest of arch_spin_unlock(),
it looks like the only suspicious one except the paravirt on x86 is
blackfin.  And that might be a misreading - I'm not familiar enough with the
architecture to tell...


* Re: dcache shrink list corruption?
  2014-04-30 23:14                                                 ` Linus Torvalds
@ 2014-04-30 23:43                                                   ` Al Viro
  2014-05-01  0:18                                                     ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-04-30 23:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 04:14:14PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Let me go talk to the paravirt people. Maybe they don't need this, and
> > I don't know exactly *how* they use that lock pointer after the unlock
> > in the "kick waiters" part. Maybe it's all good.
> 
> .. looking at current users (xen and kvm) it does in fact look all
> good. Yes, we "kick" possible waiters after the unlock, but the lock
> itself is not touched by that, it just uses the pointer to the lock as
> a way to figure out who to kick.
> 
> In fact, I kind of knew that, but had forgotten. We very much depend
> on spin_unlock being safe wrt immediate deleton already: the
> "completion" code very much depends on it. It does a "spin_unlock()"
> to release the completion, and it can get reused immediately (it might
> be on the stack, it might be in some data structure that gets freed).
> 
> So I'd suggest removing that whole RCU thing, because it should be
> safe to unlock something that can go away immediately. We'd have huge
> problems elsewhere if it wasn't safe.

OK, done and force-pushed.  Should propagate in a few...


* Re: dcache shrink list corruption?
  2014-04-30 23:43                                                   ` Al Viro
@ 2014-05-01  0:18                                                     ` Linus Torvalds
  2014-05-01  2:51                                                       ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-05-01  0:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 4:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, done and force-pushed.  Should propagate in a few...

That made it more obvious how the DCACHE_MAY_FREE case ends up
working. And in particular, mind rewriting this:

    if (dentry->d_flags & DCACHE_MAY_FREE) {
        spin_unlock(&dentry->d_lock);
        dentry_free(dentry);
    } else {
        spin_unlock(&dentry->d_lock);
    }
    return parent;

as just

    bool free = dentry->d_flags & DCACHE_MAY_FREE;
    spin_unlock(&dentry->d_lock);
    if (free)
        dentry_free(dentry);
    return parent;

instead? In fact, I get the feeling that the other case later on
really fits the same model:

    spin_lock(&dentry->d_lock);
    if (dentry->d_flags & DCACHE_SHRINK_LIST) {
        dentry->d_flags |= DCACHE_MAY_FREE;
        spin_unlock(&dentry->d_lock);
    } else {
        spin_unlock(&dentry->d_lock);
        dentry_free(dentry);
  }

ends up really being better as

    spin_lock(&dentry->d_lock);
    free = 1;
    if (dentry->d_flags & DCACHE_SHRINK_LIST) {
        dentry->d_flags |= DCACHE_MAY_FREE;
        free = 0;
    }
    spin_unlock(&dentry->d_lock);
    if (free)
        dentry_free(dentry);
    return parent;

and then suddenly it looks like we have a common exit sequence from
that dentry_kill() function, no?

(The earlier "unlock_on_failure" exit case is altogether a different case).

I dunno. Maybe not a big deal, but one reason I prefer doing that
"free" flag is because I really tend to prefer the simple case of
lock-unlock pairing cleanly at the same level. NOT the pattern where
you have one lock at one indentation level, paired with multiple
unlocks for all the different cases.

                  Linus


* Re: dcache shrink list corruption?
  2014-05-01  0:18                                                     ` Linus Torvalds
@ 2014-05-01  2:51                                                       ` Al Viro
  2014-05-01  2:59                                                         ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-01  2:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 05:18:23PM -0700, Linus Torvalds wrote:

> and then suddenly it looks like we have a common exit sequence from
> that dentry_kill() function, no?
> 
> (The earlier "unlock_on_failure" exit case is altogether a different case).
> 
> I dunno. Maybe not a big deal, but one reason I prefer doing that
> "free" flag is because I really tend to prefer the simple case of
> lock-unlock pairing cleanly at the same level. NOT the pattern where
> you have one lock at one indentation level, paired with multiple
> unlocks for all the different cases.

Yeah, but... I have such a variant, but the best I could get still generated
code that wasn't particularly nice.  Part might be gcc code generation
sucking for bool, part - extra register pressure...  It has slightly lower
i-cache footprint, though, so it might be worth doing.  Hell knows; that's a
fairly hot codepath, so let's do it that way - I've just pushed an alternative
branch with bool can_free variant; branches in question: vfs.git#for-linus and
vfs.git#dentry_kill-2. Help with profiling is needed; the loads to watch are
the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
for the differences to matter.  Any takers?


* Re: dcache shrink list corruption?
  2014-05-01  2:51                                                       ` Al Viro
@ 2014-05-01  2:59                                                         ` Linus Torvalds
  2014-05-01  3:12                                                           ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-05-01  2:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>       Help with profiling is needed; the loads to watch are
> the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> for the differences to matter.  Any takers?

I really hope there are no such loads, my "lock/unlock pairing"
suggestion was mostly so that the pairing is clearer, not necessarily
for performance.

That said, I'd assume that it might be worth testing at least the
"lots of concurrent lookups of 'simple_dentry_operations' dentries".
So most of /proc, most uses of simple_lookup(). That at least triggers
the dentry_kill() path in dput(), so it should be fairly easy to get
profiles.

But real loads with real filesystems? That sounds harder.

              Linus


* Re: dcache shrink list corruption?
  2014-05-01  2:59                                                         ` Linus Torvalds
@ 2014-05-01  3:12                                                           ` Al Viro
  2014-05-01  9:42                                                             ` Miklos Szeredi
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-01  3:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 07:59:43PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >       Help with profiling is needed; the loads to watch are
> > the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> > for the differences to matter.  Any takers?
> 
> I really hope there are no such loads, my "lock/unlock pairing"
> suggestion was mostly so that the pairing is clearer, not necessarily
> for performance.
> 
> That said, I'd assume that it might be worth testing at least the
> "lots of concurrent lookups of 'simple_dentry_operations' dentries".
> So most of /proc, most uses of simple_lookup(). That at least triggers
> the dentry_kill() path in dput(), so it should be fairly easy to get
> profiles.
> 
> But real loads with real filesystems? That sounds harder.

Well, the simplest way to do that is a bunch of open/unlink/close.  Another
one is cp -rl / rm -rf of a large tree (rmdir(2) does shrink_dcache_parent()). 
For that matter, unmapping an anon shared mapping will trigger the same path.


* Re: dcache shrink list corruption?
  2014-05-01  3:12                                                           ` Al Viro
@ 2014-05-01  9:42                                                             ` Miklos Szeredi
  2014-05-01 14:34                                                               ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Miklos Szeredi @ 2014-05-01  9:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

Two points about latest version (dentry_kill-2):

- Doing anything with dentry->d_parent in case of DCACHE_DENTRY_KILLED looks
  seriously wrong.  Parent has been dealt with, at that point, by the other
  caller, no?

- "bool foo = flag & FLAG" looks suspicious.  Is this guaranteed not to
  overflow?

Thanks,
Miklos


* Re: dcache shrink list corruption?
  2014-05-01  9:42                                                             ` Miklos Szeredi
@ 2014-05-01 14:34                                                               ` Al Viro
  2014-05-01 21:02                                                                 ` Al Viro
                                                                                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Al Viro @ 2014-05-01 14:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote:
> Two points about latest version (dentry_kill-2):
> 
> - Doing anything with dentry->d_parent in case of DCACHE_DENTRY_KILLED looks
>   seriously wrong.  Parent has been dealt with, at that point, by the other
>   caller, no?

In both branches, actually - we should bugger off earlier *and* return NULL
in that case.  Nice catch.

Hmm... I see why it failed to blow up on that.  It *did* trigger, all right -
udev is stepping into that right on boot.  The thing is, check should be
	if ((int)dentry->d_lockref.count > 0)
since the damn thing is unsigned int.  IOW, they did go through handover
and shrink_dentry_list() proceeded to lose them.  And with that braino
fixed, it steps into the extra dput crap just fine.

OK, fixed and pushed (both branches).

> - "bool foo = flag & FLAG" looks suspicious.  Is this guaranteed not to
>   overflow?

What do you mean, overflow?  It's not a 1-bit unsigned int; conversion to
_Bool is different (which is the only reason why it's more than mere
syntax sugar).  See C99 6.3.2.1 ("When any scalar value is converted
to _Bool, the result is 0 if the value compares equal to 0; otherwise,
the result is 1").

That, BTW, is also the reason why _Bool bitfields exist - try
struct {
	_Bool a:1;
	unsigned b:1;
} x;
x.a = 2;
x.b = 2;
if (x.a)
	printf("A");
if (x.b)
	printf("B");
and see what it does.  The first test triggers, same as if (2) would.
The second does not, since conversion to unsigned integer type other
than _Bool gives the value in range of that type comparable to original
modulo (maximal representable + 1).  And 2 modulo 2 is 0...


* Re: dcache shrink list corruption?
  2014-05-01 14:34                                                               ` Al Viro
@ 2014-05-01 21:02                                                                 ` Al Viro
  2014-05-01 21:05                                                                   ` Al Viro
  2014-05-02  8:43                                                                 ` Szeredi Miklos
  2014-05-02 21:04                                                                 ` Linus Torvalds
  2 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-01 21:02 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

*grumble*

Looks like gmail filtering is buggered again.  The previous mail got
bounced by gmail, both for Miklos and for Linus ;-/  Hell knows if
this one gets through; the original is at http://lkml.org/lkml/2014/5/1/235,
bounce being bloody vague, as usual.

This is starting to get really annoying ;-/  Does anybody have any idea
what might be going on?


* Re: dcache shrink list corruption?
  2014-05-01 21:02                                                                 ` Al Viro
@ 2014-05-01 21:05                                                                   ` Al Viro
  2014-05-01 22:52                                                                     ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-01 21:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Thu, May 01, 2014 at 10:02:33PM +0100, Al Viro wrote:
> *grumble*
> 
> Looks like gmail filtering is buggered again.  The previous mail got
> bounced by gmail, both for Miklos and for Linus ;-/  Hell knows if
> this one gets through; the original is at http://lkml.org/lkml/2014/5/1/235,
> bounce being bloody vague, as usual.

... and so has this one, the bounce being

[quote]
A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  torvalds@linux-foundation.org
    SMTP error from remote mail server after end of data:
    host ASPMX.L.GOOGLE.COM [173.194.67.27]: 550-5.7.1 [195.92.253.2      12] Our system has detected that this message is
    550-5.7.1 likely unsolicited mail. To reduce the amount of spam sent to Gmail,
    550-5.7.1 this message has been blocked. Please visit
    550-5.7.1 http://support.google.com/mail/bin/answer.py?hl=en&answer=188131 for
    550 5.7.1 more information. cv1si45381wib.71 - gsmtp
  miklos@szeredi.hu
    SMTP error from remote mail server after end of data:
    host ASPMX.L.GOOGLE.COM [173.194.67.27]: 550-5.7.1 [195.92.253.2      12] Our system has detected that this message is
    550-5.7.1 likely unsolicited mail. To reduce the amount of spam sent to Gmail,
    550-5.7.1 this message has been blocked. Please visit
    550-5.7.1 http://support.google.com/mail/bin/answer.py?hl=en&answer=188131 for
    550 5.7.1 more information. bb7si53528wib.20 - gsmtp
[endquote]

What the hell is going on?


* Re: dcache shrink list corruption?
  2014-05-01 21:05                                                                   ` Al Viro
@ 2014-05-01 22:52                                                                     ` Linus Torvalds
  0 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2014-05-01 22:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Thu, May 1, 2014 at 2:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... and so has this one, the bounce being

I actually got all three messages despite the bounces. But I seem to
have gotten at least the first one through lkml.

Your email is lacking SPF or any other authentication, so it may be
gmail being pissy about that, though:

  Received-SPF: none (google.com: viro@ftp.linux.org.uk does not
designate permitted sender hosts) client-ip=195.92.253.2;

especially if your IP is in a block where other people send spam,
gmail has been known to just say "that ISP is clearly bad".

              Linus


* Re: dcache shrink list corruption?
  2014-04-30  9:15                     ` Miklos Szeredi
@ 2014-05-02  5:51                       ` Al Viro
  2014-05-02  9:00                         ` Szeredi Miklos
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-02  5:51 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > > It seems to survive beating here; testing, review and comments are
> > > welcome.
> > 
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> > 
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
> 
> IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
> tried to reproduce it.

Could you repost their report?  So far I hadn't managed to get actual list
corruption on mainline kernel - it definitely is possible, but on all
testcases so far the race window is too narrow.  So if they have a reproducer
that doesn't take an insane amount of time, I'd really like to see it...


* Re: dcache shrink list corruption?
  2014-05-01 14:34                                                               ` Al Viro
  2014-05-01 21:02                                                                 ` Al Viro
@ 2014-05-02  8:43                                                                 ` Szeredi Miklos
  2014-05-02 21:04                                                                 ` Linus Torvalds
  2 siblings, 0 replies; 61+ messages in thread
From: Szeredi Miklos @ 2014-05-02  8:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Thu, May 1, 2014 at 4:34 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, May 01, 2014 at 11:42:52AM +0200, Miklos Szeredi wrote:

>> - "bool foo = flag & FLAG" looks suspicious.  Is this guaranteed not to
>>   overflow?
>
> What do you mean, overflow?  It's not a 1-bit unsigned int; conversion to
> _Bool is different (which is the only reason why it's more than mere
> syntax sugar).  See C99 6.3.2.1 ("When any scalar value is converted
> to _Bool, the result is 0 if the value compares equal to 0; otherwise,
> the result is 1").

Ah, didn't know that.

Thanks,
Miklos


* Re: dcache shrink list corruption?
  2014-05-02  5:51                       ` Al Viro
@ 2014-05-02  9:00                         ` Szeredi Miklos
  2014-05-02 21:02                           ` Miklos Szeredi
  2014-05-02 21:08                           ` Miklos Szeredi
  0 siblings, 2 replies; 61+ messages in thread
From: Szeredi Miklos @ 2014-05-02  9:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 2, 2014 at 7:51 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Apr 30, 2014 at 11:15:15AM +0200, Miklos Szeredi wrote:

>> IBM is triggering this with the host01 test from the LTP suite.  I haven't yet
>> tried to reproduce it.
>
> Could you repost their report?  So far I hadn't managed to get actual list
> corruption on mainline kernel - it definitely is possible, but on all
> testcases so far the race window is too narrow.  So if they have a reproducer
> that doesn't take an insane amount of time, I'd really like to see it...

The only reproducer they have is LTP/host01, and can trigger the bug
reliably within hours.

The bug is private, but I'll ask if I can repost it.  The first thing
is a warning from the D_FLAG_VERIFY() in d_shrink_del() from
shrink_dentry_list().  They added printks that show that the dentry
has DCACHE_DENTRY_KILLED.

We could ask for a dump, but this is the only rational explanation I
could find for this (and a shrink list with two dentries, with racing
dput on both nicely explains the case where the shrink list's prev
pointer still points to the already killed dentry).

Thanks,
Miklos


* Re: dcache shrink list corruption?
  2014-05-02  9:00                         ` Szeredi Miklos
@ 2014-05-02 21:02                           ` Miklos Szeredi
  2014-05-02 21:08                           ` Miklos Szeredi
  1 sibling, 0 replies; 61+ messages in thread
From: Miklos Szeredi @ 2014-05-02 21:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 02, 2014 at 11:00:04AM +0200, Szeredi Miklos wrote:

> The bug is private, but I'll ask if I can repost it.  The first thing
> is a warning from the D_FLAG_VERIFY() in d_shrink_del() from
> shrink_dentry_list().  They added printks that show that the dentry
> has DCACHE_DENTRY_KILLED.
> 
> We could ask for a dump, but this is the only rational explanation I
> could find for this (and a shrink list with two dentries, with racing
> dput on both nicely explains the case where the shrink list's prev
> pointer still points to the already killed dentry).

Bug details:

==============================================================================
WARNING: at /home/abuild/rpmbuild/BUILD/kernel-default-3.12.15/linux-3.12/fs/dcache.c:392
 Modules linked in: iptable_filter ip_tables x_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd sunrpc fscache af_packet xfs libcrc32c dm_mod autofs4 btrfs xor sr_mod cdrom usb_storage raid6_pq sd_mod crc_t10dif ibmvscsi(X) scsi_transport_srp scsi_tgt ibmveth(X) ohci_pci ohci_hcd ehci_hcd usbcore usb_common sg scsi_mod
 Supported: Yes
 CPU: 7 PID: 25685 Comm: host01 Tainted: G              X 3.12.15-3-default #1
 task: c0000003f7fb68f0 ti: c0000003f31f4000 task.ti: c0000003f31f4000
 NIP: c00000000024552c LR: c0000000002462b8 CTR: c0000000002aad40
 REGS: c0000003f31f7530 TRAP: 0700   Tainted: G              X  (3.12.15-3-default)
 MSR: 8000000100029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24242448  XER: 20000000
 SOFTE: 1
 CFAR: c0000000002454b0

 GPR00: c0000000002462b8 c0000003f31f77b0 c000000000f10fb0 c0000003dfa7f028
 GPR04: 0000000000000001 c000000000e4fe00 0000000000000000 0600000000000000
 GPR08: c000000000e30fb0 0000000000000001 0000000000000000 0000000000000000
 GPR12: 0000000024242442 c00000000fe41880 0000000000000000 0000000000000000
 GPR16: 0000000000000000 0000010013556de0 0000000000006458 0000000000000002
 GPR20: 0000000000000000 c00000000084e940 0000000000000004 c0000003f8afa400
 GPR24: 0000000000006457 c0000003fde09e68 0000000000000000 0000000000000000
 GPR28: c0000003dfa7f028 c0000003dfa7f028 c0000003f31f7948 c0000003dfa7f0a8
 NIP [c00000000024552c] .d_shrink_del+0x9c/0xc0
 LR [c0000000002462b8] .shrink_dentry_list+0xc8/0x160
 PACATMSCRATCH [8000000100009033]
 Call Trace:
 [c0000003f31f77b0] [c0000003f8f1ef80] 0xc0000003f8f1ef80 (unreliable)
 [c0000003f31f7820] [c0000000002462b8] .shrink_dentry_list+0xc8/0x160
 [c0000003f31f78d0] [c000000000246754] .shrink_dcache_parent+0x44/0xa0
 [c0000003f31f7980] [c0000000002b0a0c] .proc_flush_task+0xbc/0x1f0
 [c0000003f31f7a70] [c00000000008e554] .release_task+0x94/0x530
 [c0000003f31f7b50] [c00000000008efb8] .wait_task_zombie+0x5c8/0x750
 [c0000003f31f7c10] [c00000000008fa80] .do_wait+0x120/0x2c0
 [c0000003f31f7cd0] [c000000000091040] .SyS_wait4+0x90/0x130
 [c0000003f31f7dc0] [c0000000000910fc] .SyS_waitpid+0x1c/0x30
 [c0000003f31f7e30] [c000000000009dfc] syscall_exit+0x0/0x7c
 Instruction dump:
 7d09502a 3908ffff 7d09512a 4bdcaf19 60000000 38210070 e8010010 7c0803a6
 4e800020 3d02fff2 8928dfd6 69290001 <0b090000> 2fa90000 41feff80 39200001
 ---[ end trace 56e8481827564dc3 ]---
==============================================================================
I forgot to mention that the TCP test used in the recreates is from the LTP
suite from http://ltp.sourceforge.net

It has been observed that each time we run into it so far, the host01 test is
the one running.
==============================================================================
After installing the kernel with some conditional printk statements in it, the
condition hit and I caught a glimpse of it. I then dropped the system into xmon
to dump out some further info:

When we first detect that the dentry->d_flags is missing these flags
DCACHE_SHRINK_LIST | DCACHE_LRU_LIST we start to dump some information about
the dentry from within shrink_dentry_list() and got:

dentry c0000003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c0000003d6240ce8 name 5 lockcnt -128
dentry c0000003d6240ce8 lockcnt -128
dentry c0000003d6240ce8 name 5 lockcnt -128 flags 1048780
post-shrink-del: dentry c0000003d6240ce8 name 5 lockcnt -128
dentry c0000003d6240ce8 lockcnt -128
dentry c0000003d6240ce8 name 5 lockcnt -128 flags 1048780

and this repeats over and over...

The code (with printks) looks like this:

	for (;;) {
		dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
		if (&dentry->d_lru == list)
			break; /* empty */

		if (print_more)
			printk(KERN_WARNING "dentry %p lockcnt %d\n",
				dentry, dentry->d_lockref.count);

		/*
		 * Get the dentry lock, and re-verify that the dentry is
		 * still on the shrinking list. If it is, we know that
		 * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
		 */
		spin_lock(&dentry->d_lock);
		if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
			spin_unlock(&dentry->d_lock);
			continue;
		}

		/*
		 * Check that both bits are set before calling d_shrink_del().
		 * If we find out they are not, print out some information on
		 * the dentry and enable the tracing on the rest of the code
		 * path.
		 */
		if (D_FLAG_VERIFY(dentry, DCACHE_SHRINK_LIST | DCACHE_LRU_LIST)) {
			printk(KERN_WARNING "dentry %p name %s lockcnt %d flags %d\n",
				dentry, dentry->d_name.name,
				dentry->d_lockref.count,
				dentry->d_flags);
			print_more = true;
		}

		/*
		 * The dispose list is isolated and dentries are not accounted
		 * to the LRU here, so we can simply remove it from the list
		 * here regardless of whether it is referenced or not.
		 */
		d_shrink_del(dentry);

		/*
		 * We found an inuse dentry which was not removed from
		 * the LRU because of laziness during lookup. Do not free it.
		 */
		if (dentry->d_lockref.count) {
			spin_unlock(&dentry->d_lock);
			if (print_more)
				printk(KERN_WARNING "post-shrink-del: dentry %p name %s lockcnt %d\n",
					dentry, dentry->d_name.name,
					dentry->d_lockref.count);
			continue;
		}

and based on the debug output we seem to be processing the same "dentry" each
time. The dentry->d_lockref.count being negative seems bad.

After breaking into xmon and dumping said dentry, it looks like this:

0:mon> d c0000003d6240ce8 200
c0000003d6240ce8 cc00100004000000 0000000000000000  |................|
c0000003d6240cf8 0000000000000000 2851fcfd030000c0  |........(Q......|
c0000003d6240d08 9124000001000000 200d24d6030000c0  |.$...... .$.....|
c0000003d6240d18 0000000000000000 3500740073797374  |........5.t.syst|
c0000003d6240d28 656d73005861004e 00000000ff000000  |ems.Xa.N........|
c0000003d6240d38 0000000000010000 0000000080ffffff  |................|
c0000003d6240d48 8093f900000000c0 80b6a8f8030000c0  |................|
c0000003d6240d58 0000000010010000 0000000000000000  |................|
c0000003d6240d68 680d24d6030000c0 680d24d6030000c0  |h.$.....h.$.....|
c0000003d6240d78 a08d07d6030000c0 489dfe00000000c0  |........H.......|
c0000003d6240d88 880d24d6030000c0 880d24d6030000c0  |..$.......$.....|
c0000003d6240d98 0000000000000000 0000000000000000  |................|
c0000003d6240da8 cc00080002000000 0000000000000000  |................|
c0000003d6240db8 b0b22200000000d0 e8601afd030000c0  |.."......`......|
c0000003d6240dc8 49218a0104000000 e00d24d6030000c0  |I!........$.....|
c0000003d6240dd8 381359d6030000c0 64617461004e4b00  |8.Y.....data.NK.|
c0000003d6240de8 6b0073006f2e3000 002e300054010000  |k.s.o.0...0.T...|
c0000003d6240df8 0000000000000000 0000000000000000  |................|
c0000003d6240e08 00878f05000000d0 80c213f6030000c0  |................|
c0000003d6240e18 0000000000000000 0000000000000000  |................|
c0000003d6240e28 284d24d6030000c0 28ce24d6030000c0  |(M$.....(.$.....|
c0000003d6240e38 38ce24d6030000c0 384d24d6030000c0  |8.$.....8M$.....|
c0000003d6240e48 480e24d6030000c0 480e24d6030000c0  |H.$.....H.$.....|
c0000003d6240e58 0000000000000000 501459d6030000c0  |........P.Y.....|
c0000003d6240e68 cc00080002000000 0000000000000000  |................|
c0000003d6240e78 c0e77100000000d0 28e41ffd030000c0  |..q.....(.......|
c0000003d6240e88 19f7f3ba06000000 a00e24d6030000c0  |..........$.....|
c0000003d6240e98 f09a33d6030000c0 6465766963650000  |..3.....device..|
c0000003d6240ea8 6b007300e84b0e72 86dc4788485bcc3e  |k.s..K.r..G.H[.>|
c0000003d6240eb8 d306777b71e75300 0000000000000000  |..w{q.S.........|
c0000003d6240ec8 009b7000000000c0 80614df6030000c0  |..p......aM.....|
c0000003d6240ed8 cb569d7ccbb353b1 302e71f2030000c0  |.V.|..S.0.q.....|

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: dcache shrink list corruption?
  2014-05-01 14:34                                                               ` Al Viro
  2014-05-01 21:02                                                                 ` Al Viro
  2014-05-02  8:43                                                                 ` Szeredi Miklos
@ 2014-05-02 21:04                                                                 ` Linus Torvalds
  2 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2014-05-02 21:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Thu, May 1, 2014 at 7:34 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, fixed and pushed (both branches).

Al, can you send a real pull request (the "both branches" part in
particular makes me worry about which one you think is right), because
I suspect by now we just need to get this wider testing.

              Linus


* Re: dcache shrink list corruption?
  2014-05-02  9:00                         ` Szeredi Miklos
  2014-05-02 21:02                           ` Miklos Szeredi
@ 2014-05-02 21:08                           ` Miklos Szeredi
  2014-05-02 21:18                             ` Linus Torvalds
  2014-05-02 22:32                             ` Al Viro
  1 sibling, 2 replies; 61+ messages in thread
From: Miklos Szeredi @ 2014-05-02 21:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

There's more of the "delete from shrink list not owned by us" in select_parent().
Proposed patch appended.

And I'm not sure what umount_collect() is supposed to do.  Can other shrinkers
still be active at that point?  That would present other problems, no?

Also somewhat related is the question: how can check_submounts_and_drop() be
guaranteed correct (timely removal of all unused dentries) in the presence of
other shrinkers?

Thanks,
Miklos
----


From: Miklos Szeredi <mszeredi@suse.cz>
Subject: dcache: select_collect(): don't remove from shrink list

Shrink lists are not protected by any lock, so don't remove from an unknown
one.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---
 fs/dcache.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1241,18 +1241,19 @@ static enum d_walk_ret select_collect(vo
 	 * loop in shrink_dcache_parent() might not make any progress
 	 * and loop forever.
 	 */
-	if (dentry->d_lockref.count) {
-		dentry_lru_del(dentry);
-	} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
-		/*
-		 * We can't use d_lru_shrink_move() because we
-		 * need to get the global LRU lock and do the
-		 * LRU accounting.
-		 */
+	if ((dentry->d_flags & (DCACHE_SHRINK_LIST | DCACHE_LRU_LIST)) ==
+							DCACHE_LRU_LIST) {
 		d_lru_del(dentry);
-		d_shrink_add(dentry, &data->dispose);
-		data->found++;
-		ret = D_WALK_NORETRY;
+		if (!dentry->d_lockref.count) {
+			/*
+			 * We can't use d_lru_shrink_move() because we
+			 * need to get the global LRU lock and do the
+			 * LRU accounting.
+			 */
+			d_shrink_add(dentry, &data->dispose);
+			data->found++;
+			ret = D_WALK_NORETRY;
+		}
 	}
 	/*
 	 * We can return to the caller if we have found some (this


* Re: dcache shrink list corruption?
  2014-05-02 21:08                           ` Miklos Szeredi
@ 2014-05-02 21:18                             ` Linus Torvalds
  2014-05-02 22:40                               ` Al Viro
  2014-05-02 22:32                             ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-05-02 21:18 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Al Viro, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> There's more of the "delete from shrink list not owned by us" in select parent.
> Proposed patch appended.

Ahh. Clearly this needs more work before I pull.

             Linus


* Re: dcache shrink list corruption?
  2014-05-02 21:08                           ` Miklos Szeredi
  2014-05-02 21:18                             ` Linus Torvalds
@ 2014-05-02 22:32                             ` Al Viro
  1 sibling, 0 replies; 61+ messages in thread
From: Al Viro @ 2014-05-02 22:32 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 02, 2014 at 11:08:13PM +0200, Miklos Szeredi wrote:
> There's more of the "delete from shrink list not owned by us" in select parent.
> Proposed patch appended.

While it certainly looks like dentry_lru_del() should die, I really wonder if
"let's pretend that dentry isn't there if it's on some other shrink list"
is the right approach.  You've already noticed one problem (check-and-drop
giving false busy indications), but shrink_dcache_parent() has similar
issues.  How about incrementing data->found instead?  That'll end up
rescanning the tree in case if it's not ours; so what?  If it's another
process doing the same shrinking in a subtree, we want to let it do its
job anyway.  Sure, somebody doing mount -o remount in a loop might be
able to stall the living hell out of us, for as long as new non-busy
dentries are being added in our subtree, but the second part in itself
is sufficient; we will keep picking those new non-busy dentries as long
as they keep coming.  And if they do not, eventually they will be all
taken out by us or by those other shrinkers and we'll be done.
 
> And I'm not sure what umount_collect() is supposed to do.  Can other shrinkers
> still be active at that point?  That would present other problems, no?

No other shrinkers - prune_dcache_sb() and shrink_dcache_sb() are also
serialized by ->s_umount, shrink_dcache_parent() and check_submounts_and_drop()
are called only when an active reference is held, which obviously prevents
fs shutdown.

> Also somewhat related is the question: how check_submounts_and_drop() could be
> guaranteed correctness (timely removal of all unsed dentries) in the presence of
> other shrinkers?

Another interesting question is what the hell are shrink_dcache_parent()
callers expecting.  E.g. what's going on in fuse_reverse_inval_entry()?
And what is nilfs_tree_is_busy() about?

FWIW, I'm not at all sure that vfs_rmdir() and vfs_rename() have any reason
to call it these days, and dentry_unhash() one simply ought to die - it's used
only by hpfs unlink() in case it wants to truncate in order to work around
-ENOSPC.  And _that_ won't have any subdirectories to deal with anyway, so
shrink_dcache_parent() there is a no-op, even if we keep the damn helper alive.
The rest of callers also look dubious...


* Re: dcache shrink list corruption?
  2014-05-02 21:18                             ` Linus Torvalds
@ 2014-05-02 22:40                               ` Al Viro
  2014-05-02 23:06                                 ` Al Viro
  2014-05-03  4:26                                 ` Al Viro
  0 siblings, 2 replies; 61+ messages in thread
From: Al Viro @ 2014-05-02 22:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > There's more of the "delete from shrink list not owned by us" in select parent.
> > Proposed patch appended.
> 
> Ahh. Clearly this needs more work before I pull.

*nod*

Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
in there as soon as select_collect() has been dealt with.  I don't think
that the currently posted patch for select_collect() is right, though -
see my reply to parent posting.  Basically, I think we should treat "it's
on the shrink list already" as "increment data->found and keep going".  IOW,
	if (on shrink list) {
		data->found++;
	} else {
		if (on lru list)
			d_lru_del
		if (refcount is zero) {
			d_shrink_add
			data->found++;
		}
	}
	if (data->found)
		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;


* Re: dcache shrink list corruption?
  2014-05-02 22:40                               ` Al Viro
@ 2014-05-02 23:06                                 ` Al Viro
  2014-05-03  4:26                                 ` Al Viro
  1 sibling, 0 replies; 61+ messages in thread
From: Al Viro @ 2014-05-02 23:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > There's more of the "delete from shrink list not owned by us" in select parent.
> > > Proposed patch appended.
> > 
> > Ahh. Clearly this needs more work before I pull.
> 
> *nod*
> 
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
> 	if (on shrink list) {
> 		data->found++;
> 	} else {
> 		if (on lru list)
> 			d_lru_del
> 		if (refcount is zero) {
> 			d_shrink_add
> 			data->found++;
> 		}
> 	}
> 	if (data->found)
> 		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

While we are at it - BUG() in umount_collect() is probably a bad idea.
At that point we are holding ->s_umount, so it guarantees that a lot
of stuff from that point on will get stuck.  Starting with sync(2).
And I really doubt that damage from WARN() instead will be more...


* Re: dcache shrink list corruption?
  2014-05-02 22:40                               ` Al Viro
  2014-05-02 23:06                                 ` Al Viro
@ 2014-05-03  4:26                                 ` Al Viro
  2014-05-03 18:07                                   ` Linus Torvalds
  2014-05-03 18:21                                   ` Al Viro
  1 sibling, 2 replies; 61+ messages in thread
From: Al Viro @ 2014-05-03  4:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 02, 2014 at 11:40:22PM +0100, Al Viro wrote:
> On Fri, May 02, 2014 at 02:18:43PM -0700, Linus Torvalds wrote:
> > On Fri, May 2, 2014 at 2:08 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > There's more of the "delete from shrink list not owned by us" in select parent.
> > > Proposed patch appended.
> > 
> > Ahh. Clearly this needs more work before I pull.
> 
> *nod*
> 
> Besides, I want to put Miklos' "don't bother with RCU in shrink_dentry_list()"
> in there as soon as select_collect() has been dealt with.  I don't think
> that the currently posted patch for select_collect() is right, though -
> see my reply to parent posting.  Basically, I think we should treat "it's
> on the shrink list already" as "increment data->found and keep going".  IOW,
> 	if (on shrink list) {
> 		data->found++;
> 	} else {
> 		if (on lru list)
> 			d_lru_del
> 		if (refcount is zero) {
> 			d_shrink_add
> 			data->found++;
> 		}
> 	}
> 	if (data->found)
> 		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;

See vfs.git#dentry_kill-3; warning - this is completely untested and I would
really like comments on spinning case there (i.e. the one where select_collect()
finds some stuff already on some other shrink list and nothing with zero
refcount that wouldn't be there).  In that case (and it's basically "somebody
else is evicting stuff in our subtree and they'd already picked everything
we want evicted") I just let the loop in check_submounts_and_drop() repeat
(we do have cond_resched() there).  Any better suggestions would be welcome...


* Re: dcache shrink list corruption?
  2014-05-03  4:26                                 ` Al Viro
@ 2014-05-03 18:07                                   ` Linus Torvalds
  2014-05-03 18:25                                     ` Al Viro
  2014-05-03 18:21                                   ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-05-03 18:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Fri, May 2, 2014 at 9:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> really like comments on spinning case there (i.e. the one where select_collect()
> finds some stuff already on some other shrink list and nothing with zero
> refcount that wouldn't be there).

I'm not seeing why you say that there can be no other shrinkers active
during umount (that whole "buggered" thing).

Sure, umount itself should be serialized by the sb lock, so there
should be only one umount dentry collector. But why wouldn't there be
shrinkers active due to memory pressure?

generic_unmount_super() is called by ->kill_sb(), which is done
*before* the superblock shrinker is unregistered.  So any memory
pressure during that will cause dentries to be shrunk other ways.

What am I missing?

                Linus


* Re: dcache shrink list corruption?
  2014-05-03  4:26                                 ` Al Viro
  2014-05-03 18:07                                   ` Linus Torvalds
@ 2014-05-03 18:21                                   ` Al Viro
  2014-05-04  6:29                                     ` Al Viro
  1 sibling, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-03 18:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:

> See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> really like comments on spinning case there (i.e. the one where select_collect()
> finds some stuff already on some other shrink list and nothing with zero
> refcount that wouldn't be there).  In that case (and it's basically "somebody
> else is evicting stuff in our subtree and they'd already picked everything
> we want evicted") I just let the loop in check_submounts_and_drop() repeat
> (we do have cond_resched() there).  Any better suggestions would be welcome...

Hmm...  As a matter of fact, the whole shrink_dcache_for_umount() could
be made a lot saner.  What we need is to reuse shrink_dcache_parent()
and follow it with d_walk() that would just go through whatever remains and
complain about the leaves of that.  For anon roots we'll obviously need to
wrap that into dget and d_drop/dput.

I'm testing that right now; everything seems to be working so far and if
it survives, I'll push that sucker out.  Total since the beginning of
the whole series:
 fs/dcache.c            |  310 ++++++++++++++++++++++++++++++++---------------------------------------------------------
 include/linux/dcache.h |    2 +
 2 files changed, 112 insertions(+), 200 deletions(-)

Anyway, I'm off to finish with acct.c races, then it's time for Eric's
lazy umount on unlink series; this work is getting unpleasantly close
to the areas he's touching, so I'd better get it on top of this stuff
and into -next...


* Re: dcache shrink list corruption?
  2014-05-03 18:07                                   ` Linus Torvalds
@ 2014-05-03 18:25                                     ` Al Viro
  0 siblings, 0 replies; 61+ messages in thread
From: Al Viro @ 2014-05-03 18:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Sat, May 03, 2014 at 11:07:57AM -0700, Linus Torvalds wrote:
> Sure, umount itself should be serialized by the sb lock, so there
> should be only one umount dentry collector. But why wouldn't there be
> shrinkers active due to memory pressure?
> 
> generic_unmount_super() is called by ->kill_sb(), which is done
> *before* the superblock shrinker is unregistered  So any memory
> pressure during that will cause dentries to be shrunk other ways.
> 
> What am I missing?

This:
        if (!grab_super_passive(sb))
                return SHRINK_STOP;
before calling prune_dcache_sb().  grab_super_passive() returns with
->s_umount held shared on success (with down_read_trylock()) and ->kill_sb()
is called only with ->s_umount held exclusive.


* Re: dcache shrink list corruption?
  2014-05-03 18:21                                   ` Al Viro
@ 2014-05-04  6:29                                     ` Al Viro
  2014-05-06 10:17                                       ` Miklos Szeredi
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-04  6:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Sat, May 03, 2014 at 07:21:11PM +0100, Al Viro wrote:
> On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
> 
> > See vfs.git#dentry_kill-3; warning - this is completely untested and I would
> > really like comments on spinning case there (i.e. the one where select_collect()
> > finds some stuff already on some other shrink list and nothing with zero
> > refcount that wouldn't be there).  In that case (and it's basically "somebody
> > else is evicting stuff in our subtree and they'd already picked everything
> > we want evicted") I just let the loop in check_submounts_and_drop() repeat
> > (we do have cond_resched() there).  Any better suggestions would be welcome...
> 
> Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
> be made a lot saner.  What we need is to reuse shrink_dcache_parent()
> and follow it with d_walk() that would just go through whatever remains and
> complain about the leaves of that.  For anon roots we'll obviously need to
> wrap that into dget and d_drop/dput.
> 
> I'm testing that right now; everything seems to be working so far and if
> it survives, I'll push that sucker out.  Total since the beginning of
> the whole series:
>  fs/dcache.c            |  310 ++++++++++++++++++++++++++++++++---------------------------------------------------------
>  include/linux/dcache.h |    2 +
>  2 files changed, 112 insertions(+), 200 deletions(-)

No regressions compared to mainline; force-pushed into vfs#dentry_kill-3.
Review and testing would be very welcome...  Summary for that branch:
(it's *not* a pull request yet; the thing really needs review)
Shortlog:
Al Viro (8):
      fix races between __d_instantiate() and checks of dentry flags
      fold d_kill() and d_free()
      fold try_prune_one_dentry()
      new helper: dentry_free()
      expand the call of dentry_lru_del() in dentry_kill()
      dentry_kill(): don't try to remove from shrink list
      don't remove from shrink list in select_collect()
      more graceful recovery in umount_collect()

Miklos Szeredi (1):
      dcache: don't need rcu in shrink_dentry_list()

Diffstat:
 fs/dcache.c            |  318 +++++++++++++++++++++++++++++------------------------------------------------------------
 fs/namei.c             |    6 +-
 include/linux/dcache.h |    2 +
 3 files changed, 107 insertions(+), 219 deletions(-)



* Re: dcache shrink list corruption?
  2014-05-04  6:29                                     ` Al Viro
@ 2014-05-06 10:17                                       ` Miklos Szeredi
  2014-05-06 14:53                                         ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Miklos Szeredi @ 2014-05-06 10:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Sun, May 4, 2014 at 8:29 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, May 03, 2014 at 07:21:11PM +0100, Al Viro wrote:
>> On Sat, May 03, 2014 at 05:26:04AM +0100, Al Viro wrote:
>>
>> > See vfs.git#dentry_kill-3; warning - this is completely untested and I would
>> > really like comments on spinning case there (i.e. the one where select_collect()
>> > finds some stuff already on some other shrink list and nothing with zero
>> > refcount that wouldn't be there).  In that case (and it's basically "somebody
>> > else is evicting stuff in our subtree and they'd already picked everything
>> > we want evicted") I just let the loop in check_submounts_and_drop() repeat
>> > (we do have cond_resched() there).  Any better suggestions would be welcome...
>>
>> Hmm...  As the matter of fact, the whole shrink_dcache_for_umount() could
>> be made a lot saner.  What we need is to reuse shrink_dcache_parent()
>> and follow it with d_walk() that would just go through whatever remains and
>> complain about the leaves of that.  For anon roots we'll obviously need to
>> wrap that into dget and d_drop/dput.
>>
>> I'm testing that right now; everything seems to be working so far and if
>> it survives, I'll push that sucker out.  Total since the beginning of
>> the whole series:
>>  fs/dcache.c            |  310 ++++++++++++++++++++++++++++++++---------------------------------------------------------
>>  include/linux/dcache.h |    2 +
>>  2 files changed, 112 insertions(+), 200 deletions(-)
>
> No regressions compared to mainline; force-pushed into vfs#dentry_kill-3.
> Review and testing would be very welcome...  Summary for that branch:
> (it's *not* a pull request yet; the thing really needs review)

Patches look okay to me.

Reviewed-by: Miklos Szeredi <mszeredi@suse.cz>

>       dentry_kill(): don't try to remove from shrink list

Backport of this to 3.12 was tested by IBM and apparently fixes the
issue for them (I didn't backport the cleanup patches, only the actual
fix)

>       don't remove from shrink list in select_collect()

I've also asked them to test this, although I think this is even
harder to trigger.  But at least the non-racy codepaths need to be
tested.

Thanks,
Miklos


* Re: dcache shrink list corruption?
  2014-05-06 10:17                                       ` Miklos Szeredi
@ 2014-05-06 14:53                                         ` Linus Torvalds
  2014-05-06 16:52                                           ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-05-06 14:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Al Viro, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Tue, May 6, 2014 at 3:17 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> Patches look okay to me.
>
> Reviewed-by: Miklos Szeredi <mszeredi@suse.cz>
>
>>       dentry_kill(): don't try to remove from shrink list
>
> Backport of this to 3.12 was tested by IBM and apparently fixes the
> issue for them (I didn't backport the cleanup patches only the actual
> fix)

Ok, good.

>>       don't remove from shrink list in select_collect()
>
> I've also asked them to test this, although I think this is even
> harder to trigger.  But at least the non-racy codepaths need to be
> tested.

I'll be incommunicado all next week, so I think I should merge this
all now rather than later.

Al, mind doing a real pull request?

                Linus


* Re: dcache shrink list corruption?
  2014-05-06 14:53                                         ` Linus Torvalds
@ 2014-05-06 16:52                                           ` Al Viro
  2014-05-06 17:01                                             ` Linus Torvalds
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2014-05-06 16:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Tue, May 06, 2014 at 07:53:19AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 3:17 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > Patches look okay to me.
> >
> > Reviewed-by: Miklos Szeredi <mszeredi@suse.cz>
> >
> >>       dentry_kill(): don't try to remove from shrink list
> >
> > Backport of this to 3.12 was tested by IBM and apparently fixes the
> > issue for them (I didn't backport the cleanup patches only the actual
> > fix)
> 
> Ok, good.
> 
> >>       don't remove from shrink list in select_collect()
> >
> > I've also asked them to test this, although I think this is even
> > harder to trigger.  But at least the non-racy codepaths need to be
> > tested.
> 
> I'll be incommunicado all next week, so I think I should merge this
> all now rather than later.
> 
> Al, mind doing a real pull request?

OK...  There's one more thing I would like to put there, if you are going to
be away for the week.  It has sat in -next for a while, and it could stay
there, except that there's a _lot_ of followups touching stuff all over the
tree and I'd obviously prefer those to go into subsystem trees.  Which
means inter-tree dependencies ;-/  Would you be OK if I included that one
into pull request?  It just turns kvfree() into inline and takes it to
mm.h, next to is_vmalloc_addr(); we have *lots* of open-coded instances of
that all over the place.  As a matter of fact, more than half of
is_vmalloc_addr() call sites are of that sort...

commit 3c91dc1ce40f973bbceded3d8ce96bda7c4d480c
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Apr 23 10:13:03 2014 -0400

    nick kvfree() from apparmor/lib
    
    too many open-coded instances
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf9811e..a784964 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,6 +370,17 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 }
 #endif
 
+static inline void kvfree(const void *x)
+{
+	/* include order mess... */
+	extern void kfree(const void *);
+	extern void vfree(const void *);
+	if (is_vmalloc_addr(x))
+		vfree(x);
+	else
+		kfree(x);
+}
+
 static inline void compound_lock(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/security/apparmor/include/apparmor.h b/security/apparmor/include/apparmor.h
index 8fb1488..97130f8 100644
--- a/security/apparmor/include/apparmor.h
+++ b/security/apparmor/include/apparmor.h
@@ -66,7 +66,6 @@ extern int apparmor_initialized __initdata;
 char *aa_split_fqname(char *args, char **ns_name);
 void aa_info_message(const char *str);
 void *__aa_kvmalloc(size_t size, gfp_t flags);
-void kvfree(void *buffer);
 
 static inline void *kvmalloc(size_t size)
 {
diff --git a/security/apparmor/lib.c b/security/apparmor/lib.c
index 6968992..c1827e0 100644
--- a/security/apparmor/lib.c
+++ b/security/apparmor/lib.c
@@ -104,17 +104,3 @@ void *__aa_kvmalloc(size_t size, gfp_t flags)
 	}
 	return buffer;
 }
-
-/**
- * kvfree - free an allocation do by kvmalloc
- * @buffer: buffer to free (MAYBE_NULL)
- *
- * Free a buffer allocated by kvmalloc
- */
-void kvfree(void *buffer)
-{
-	if (is_vmalloc_addr(buffer))
-		vfree(buffer);
-	else
-		kfree(buffer);
-}


* Re: dcache shrink list corruption?
  2014-05-06 16:52                                           ` Al Viro
@ 2014-05-06 17:01                                             ` Linus Torvalds
  2014-05-06 19:15                                               ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2014-05-06 17:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Tue, May 6, 2014 at 9:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK...  There's one more thing I would like to put there, if you are going to
> be away for the week.  It has sat in -next for a while, and it could stay
> there, except that there's a _lot_ of followups touching stuff all over the
> tree and I'd obviously prefer those to go into subsystem trees.  Which
> means inter-tree dependencies ;-/  Would you be OK if I included that one
> into pull request?  It just turns kvfree() into inline and takes it to
> mm.h, next to is_vmalloc_addr();

Is there any particular reason for inlining this? I'd actually rather
not, _especially_ if it means that a lot of "is_vmalloc_addr()" usage
goes away and we may end up with this as a real interface.

But no, I don't hate the patch.

              Linus


* Re: dcache shrink list corruption?
  2014-05-06 17:01                                             ` Linus Torvalds
@ 2014-05-06 19:15                                               ` Al Viro
  0 siblings, 0 replies; 61+ messages in thread
From: Al Viro @ 2014-05-06 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Miklos Szeredi, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel

On Tue, May 06, 2014 at 10:01:21AM -0700, Linus Torvalds wrote:
> On Tue, May 6, 2014 at 9:52 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK...  There's one more thing I would like to put there, if you are going to
> > be away for the week.  It has sat in -next for a while, and it could stay
> > there, except that there's a _lot_ of followups touching stuff all over the
> > tree and I'd obviously prefer those to go into subsystem trees.  Which
> > means inter-tree dependencies ;-/  Would you be OK if I included that one
> > into pull request?  It just turns kvfree() into inline and takes it to
> > mm.h, next to is_vmalloc_addr();
> 
> Is there any particular reason for inlining this? I'd actually rather
> not, _especially_ if it means that a lot of "is_vmalloc_addr()" usage
> goes away and we may end up with this as a real interface.

OK, I can live with that.  dcache fixes + kvfree() (uninlined, exported by
mm/util.c) + posix_acl bugfix from hch.  Please, pull from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

Shortlog:
Al Viro (9):
      fix races between __d_instantiate() and checks of dentry flags
      fold d_kill() and d_free()
      fold try_prune_one_dentry()
      new helper: dentry_free()
      expand the call of dentry_lru_del() in dentry_kill()
      dentry_kill(): don't try to remove from shrink list
      don't remove from shrink list in select_collect()
      more graceful recovery in umount_collect()
      nick kvfree() from apparmor

Christoph Hellwig (1):
      posix_acl: handle NULL ACL in posix_acl_equiv_mode

Miklos Szeredi (1):
      dcache: don't need rcu in shrink_dentry_list()

Diffstat:
 fs/dcache.c                          |  318 ++++++++++++++++++++++++---------------------------------------------------
 fs/namei.c                           |    6 +-
 fs/posix_acl.c                       |    6 ++
 include/linux/dcache.h               |    2 +
 include/linux/mm.h                   |    2 +
 mm/util.c                            |   10 +++
 security/apparmor/include/apparmor.h |    1 -
 security/apparmor/lib.c              |   14 ----
 8 files changed, 125 insertions(+), 234 deletions(-)


^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2014-05-06 19:15 UTC | newest]

Thread overview: 61+ messages
2014-04-29 16:01 dcache shrink list corruption? Miklos Szeredi
2014-04-29 17:43 ` Linus Torvalds
2014-04-29 18:03   ` Miklos Szeredi
2014-04-29 18:16     ` Al Viro
2014-04-29 19:10       ` Al Viro
2014-04-29 21:18         ` Dave Chinner
2014-04-29 21:48           ` Al Viro
2014-04-29 23:04             ` Linus Torvalds
2014-04-29 23:20               ` Al Viro
2014-04-30  2:31                 ` Al Viro
2014-04-30  2:56                   ` Linus Torvalds
2014-04-30  4:04                     ` Al Viro
2014-04-30 15:49                       ` Miklos Szeredi
2014-04-30 15:56                         ` Miklos Szeredi
2014-04-30 16:03                         ` Al Viro
2014-04-30 17:33                           ` Miklos Szeredi
2014-04-30 18:36                             ` Al Viro
2014-04-30 18:42                               ` Miklos Szeredi
2014-04-30 19:02                                 ` Al Viro
2014-04-30 19:59                                   ` Al Viro
2014-04-30 20:23                                     ` Linus Torvalds
2014-04-30 20:38                                       ` Al Viro
2014-04-30 20:57                                         ` Linus Torvalds
2014-04-30 21:12                                           ` Al Viro
2014-04-30 22:12                                             ` Al Viro
2014-04-30 23:04                                               ` Linus Torvalds
2014-04-30 23:14                                                 ` Linus Torvalds
2014-04-30 23:43                                                   ` Al Viro
2014-05-01  0:18                                                     ` Linus Torvalds
2014-05-01  2:51                                                       ` Al Viro
2014-05-01  2:59                                                         ` Linus Torvalds
2014-05-01  3:12                                                           ` Al Viro
2014-05-01  9:42                                                             ` Miklos Szeredi
2014-05-01 14:34                                                               ` Al Viro
2014-05-01 21:02                                                                 ` Al Viro
2014-05-01 21:05                                                                   ` Al Viro
2014-05-01 22:52                                                                     ` Linus Torvalds
2014-05-02  8:43                                                                 ` Szeredi Miklos
2014-05-02 21:04                                                                 ` Linus Torvalds
2014-04-30 23:38                                                 ` Al Viro
2014-04-30  9:15                     ` Miklos Szeredi
2014-05-02  5:51                       ` Al Viro
2014-05-02  9:00                         ` Szeredi Miklos
2014-05-02 21:02                           ` Miklos Szeredi
2014-05-02 21:08                           ` Miklos Szeredi
2014-05-02 21:18                             ` Linus Torvalds
2014-05-02 22:40                               ` Al Viro
2014-05-02 23:06                                 ` Al Viro
2014-05-03  4:26                                 ` Al Viro
2014-05-03 18:07                                   ` Linus Torvalds
2014-05-03 18:25                                     ` Al Viro
2014-05-03 18:21                                   ` Al Viro
2014-05-04  6:29                                     ` Al Viro
2014-05-06 10:17                                       ` Miklos Szeredi
2014-05-06 14:53                                         ` Linus Torvalds
2014-05-06 16:52                                           ` Al Viro
2014-05-06 17:01                                             ` Linus Torvalds
2014-05-06 19:15                                               ` Al Viro
2014-05-02 22:32                             ` Al Viro
2014-04-29 18:17     ` Linus Torvalds
2014-04-29 17:56 ` Al Viro
