* performance delta after VFS i_mutex=>i_rwsem conversion
@ 2016-06-06 20:00 Dave Hansen
  2016-06-06 20:46 ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2016-06-06 20:00 UTC (permalink / raw)
  To: Chen, Tim C, Ingo Molnar, Davidlohr Bueso, Peter Zijlstra (Intel),
	Jason Low, Linus Torvalds, Michel Lespinasse, Paul E. McKenney,
	Waiman Long, Al Viro, LKML

tl;dr: Mutexes spin more than rwsems, which makes mutexes perform
better when contending on ->i_rwsem.  But, mutexes do this at the cost of
not sleeping much, even with tons of contention.

Should we do something to keep rwsem and mutex performance close to each
other?  If so, should mutexes be sleeping more or rwsems sleeping less?

---

I was doing some performance comparisons between 4.5 and 4.7 and noticed
a weird "blip" in unlink[1] performance where it got worse between 4.5
and 4.7:

	https://www.sr71.net/~dave/intel/rwsem-vs-mutex.png

There are two things to notice here:
1. Although the "1" (4.5) and "2" (4.7) lines nearly converge at high
   cpu counts, 4.5 outperforms 4.7 by almost 2x at low contention
2. 4.7 has lots of idle time.  4.5 eats lots of cpu time spinning

That was a 160-thread Westmere (~4 years old) system, but I moved to a
smaller system for some more focused testing (4 cores, Skylake), and
bisected it there down to the i_mutex switch to rwsem.  I tested on two
commits from here on out:

	[d9171b9] 1 commit before 9902af7
	[9902af7] "parallel lookups: actual switch to rwsem"

unlink takes the rwsem for write, so it should see the same level of
contention as the mutex did.  But, at 4 threads, the unlink performance was:

	d9171b9(mutex): 689179
	9902af7(rwsem): 498325 (-27.7%)

I tracked this down to the differences between:

	rwsem_spin_on_owner() - false roughly 1% of the time
	mutex_spin_on_owner() - false roughly 0.05% of the time

The optimistic rwsem and mutex code look quite similar, but there is one
big difference: a hunk of code in rwsem_spin_on_owner() stops the
spinning for rwsems, but isn't present for mutexes in any form:

>         if (READ_ONCE(sem->owner))
>                 return true; /* new owner, continue spinning */
> 
>         /*
>          * When the owner is not set, the lock could be free or
>          * held by readers. Check the counter to verify the
>          * state.
>          */
>         count = READ_ONCE(sem->count); 
>         return (count == 0 || count == RWSEM_WAITING_BIAS);
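
(For comparison, the mutex-side loop just keeps spinning as long as the
sampled owner is still running, and returns "keep spinning" as soon as
the owner changes; there is no equivalent "the count looks reader-held,
give up" check.  Roughly -- paraphrased from memory, not the exact 4.7
source:)

	static noinline
	bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
	{
		bool ret = true;

		rcu_read_lock();
		while (lock->owner == owner) {
			/* owner unchanged: keep spinning while it runs */
			barrier();
			if (!owner->on_cpu || need_resched()) {
				ret = false;
				break;
			}
			cpu_relax_lowlatency();
		}
		rcu_read_unlock();

		return ret;
	}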

If I hack this out, I end up with:

	d9171b9(mutex-original): 689179
	9902af7(rwsem-hacked  ): 671706 (-2.5%)

I think it's safe to say that this accounts for the majority of the
difference in behavior.  It also shows that the _actual_
i_mutex=>i_rwsem conversion is the issue here, and not some other part
of the parallel lookup patches.

So, as it stands today in 4.7-rc1, mutexes end up yielding higher
performance under contention.  But, they don't let the system go very
idle, even under heavy contention, which seems rather wrong.  Should we
be making rwsems spin more, or mutexes spin less?

--

1. The test in question here is creating/unlinking many files in the
same directory:
https://github.com/antonblanchard/will-it-scale/blob/master/tests/unlink1.c
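
For reference, the hot loop of that test is basically just this per
thread (simplified sketch -- the real harness, argument plumbing and
per-thread file naming are in the unlink1.c linked above):

	#include <fcntl.h>
	#include <unistd.h>

	/* each thread creates/unlinks its own file, but all files live in
	 * one shared directory, so every iteration takes that directory's
	 * i_mutex (4.5) / i_rwsem (4.7) for write */
	static void testcase(unsigned long long *iterations, char *tmpfile)
	{
		while (1) {
			int fd = open(tmpfile, O_RDWR | O_CREAT, 0600);

			if (fd >= 0)
				close(fd);
			unlink(tmpfile);
			(*iterations)++;
		}
	}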

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 20:00 performance delta after VFS i_mutex=>i_rwsem conversion Dave Hansen
@ 2016-06-06 20:46 ` Linus Torvalds
  2016-06-06 21:13   ` Waiman Long
  2016-06-06 21:15   ` Al Viro
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-06-06 20:46 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Chen, Tim C, Ingo Molnar, Davidlohr Bueso, Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

On Mon, Jun 6, 2016 at 1:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> I tracked this down to the differences between:
>
>         rwsem_spin_on_owner() - false roughly 1% of the time
>         mutex_spin_on_owner() - false roughly 0.05% of the time
>
> The optimistic rwsem and mutex code look quite similar, but there is one
> big difference: a hunk of code in rwsem_spin_on_owner() stops the
> spinning for rwsems, but isn't present for mutexes in any form:
>
>>         if (READ_ONCE(sem->owner))
>>                 return true; /* new owner, continue spinning */
>>
>>         /*
>>          * When the owner is not set, the lock could be free or
>>          * held by readers. Check the counter to verify the
>>          * state.
>>          */
>>         count = READ_ONCE(sem->count);
>>         return (count == 0 || count == RWSEM_WAITING_BIAS);
>
> If I hack this out, I end up with:
>
>         d9171b9(mutex-original): 689179
>         9902af7(rwsem-hacked  ): 671706 (-2.5%)
>
> I think it's safe to say that this accounts for the majority of the
> difference in behavior.

So my gut feel is that we do want to have the same heuristics for
rwsems and mutexes (well, modulo possible actual semantic differences
due to the whole shared-vs-exclusive issues).

And I also suspect that the mutexes have gotten a lot more performance
tuning done on them, so it's likely the correct thing to try to make
the rwsem match the mutex code rather than the other way around.

I think we had Jason and Davidlohr do mutex work last year, let's see
if they agree on that "yes, the mutex case is the likely more tuned
case" feeling.

The fact that your performance improves when you do that obviously
then also validates the assumption that the mutex spinning is the
better optimized one.

> So, as it stands today in 4.7-rc1, mutexes end up yielding higher
> performance under contention.  But, they don't let the system go very
> idle, even under heavy contention, which seems rather wrong.  Should we
> be making rwsems spin more, or mutexes spin less?

I think performance is what matters. The fact that it performs better
with spinning is a big mark for spinning more.

Being idle under load is _not_ something we should see as a good
thing. Yes, yes, it would be lower power, but lock contention is *not*
a low-power load. Being slow under lock contention just tends to make
for more lock contention, and trying to increase idle time is almost
certainly the wrong thing to do.

Spinning behavior tends to have a secondary advantage too: it is a
hell of a lot nicer to do performance analysis on. So if you get lock
contention on real loads (as opposed to some extreme
unlink-microbenchmark), I think a lot of people will be happier seeing
the spinning behavior just because it helps pinpoint the problem in
ways idling does not.

So I think everything points to: "make rwsems do the same thing
mutexes do". But I'll let it locking maintainers pipe up. Peter? Ingo?

              Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 20:46 ` Linus Torvalds
@ 2016-06-06 21:13   ` Waiman Long
  2016-06-06 21:20     ` Linus Torvalds
  2016-06-08  8:58     ` Ingo Molnar
  2016-06-06 21:15   ` Al Viro
  1 sibling, 2 replies; 21+ messages in thread
From: Waiman Long @ 2016-06-06 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

On 06/06/2016 04:46 PM, Linus Torvalds wrote:
> On Mon, Jun 6, 2016 at 1:00 PM, Dave Hansen<dave.hansen@intel.com>  wrote:
>> I tracked this down to the differences between:
>>
>>          rwsem_spin_on_owner() - false roughly 1% of the time
>>          mutex_spin_on_owner() - false roughly 0.05% of the time
>>
>> The optimistic rwsem and mutex code look quite similar, but there is one
>> big difference: a hunk of code in rwsem_spin_on_owner() stops the
>> spinning for rwsems, but isn't present for mutexes in any form:
>>
>>>          if (READ_ONCE(sem->owner))
>>>                  return true; /* new owner, continue spinning */
>>>
>>>          /*
>>>           * When the owner is not set, the lock could be free or
>>>           * held by readers. Check the counter to verify the
>>>           * state.
>>>           */
>>>          count = READ_ONCE(sem->count);
>>>          return (count == 0 || count == RWSEM_WAITING_BIAS);
>> If I hack this out, I end up with:
>>
>>          d9171b9(mutex-original): 689179
>>          9902af7(rwsem-hacked  ): 671706 (-2.5%)
>>
>> I think it's safe to say that this accounts for the majority of the
>> difference in behavior.
> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
>
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
>
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.

It is probably true that the mutex code has been better tuned as it was 
more widely used. Now that may be changing as a lot of mutexes have been 
changed to rwsems.

> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.
>
>> So, as it stands today in 4.7-rc1, mutexes end up yielding higher
>> performance under contention.  But, they don't let the system go very
>> idle, even under heavy contention, which seems rather wrong.  Should we
>> be making rwsems spin more, or mutexes spin less?
> I think performance is what matters. The fact that it performs better
> with spinning is a big mark for spinning more.
>
> Being idle under load is _not_ something we should see as a good
> thing. Yes, yes, it would be lower power, but lock contention is *not*
> a low-power load. Being slow under lock contention just tends to make
> for more lock contention, and trying to increase idle time is almost
> certainly the wrong thing to do.
>
> Spinning behavior tends to have a secondary advantage too: it is a
> hell of a lot nicer to do performance analysis on. So if you get lock
> contention on real loads (as opposed to some extreme
> unlink-microbenchmark), I think a lot of people will be happier seeing
> the spinning behavior just because it helps pinpoint the problem in
> ways idling does not.
>
> So I think everything points to: "make rwsems do the same thing
> mutexes do". But I'll let it locking maintainers pipe up. Peter? Ingo?
>
>                Linus

The tricky part about optimistic spinning in rwsem is that we don't know 
for sure if any of the lock holding readers is running or not. So we 
don't do spinning when readers have the lock. Currently, we use the 
state of the owner variable as the heuristic to determine if the lock 
owner is a writer (owner) or reader (!owner). However, it is also 
possible that a writer gets the lock, but hasn't set the owner field yet 
while another task samples the owner value in that window, causing it
to abort optimistic spinning.

I do have a patchset that allows us to more accurately determine the 
state of the lock owner.

locking/rwsem: Add reader-owned state to the owner field
http://www.spinics.net/lists/kernel/msg2258572.html
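
The core idea is just to tag ->owner with a sentinel value while readers 
hold the lock, so that a spinner can tell "held by readers" apart from 
"free".  A rough sketch of that (illustrative only, not the exact patch):

	#define RWSEM_READER_OWNED	((struct task_struct *)1UL)

	static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
	{
		/* only write it when needed, so readers don't keep
		 * dirtying the owner cacheline */
		if (READ_ONCE(sem->owner) != RWSEM_READER_OWNED)
			WRITE_ONCE(sem->owner, RWSEM_READER_OWNED);
	}

	static inline bool rwsem_owner_is_writer(struct task_struct *owner)
	{
		return owner && owner != RWSEM_READER_OWNED;
	}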

That should eliminate the performance gap between mutex and rwsem wrt 
spinning when only writers are present. I am hoping that that patchset 
can be queued for 4.8.

Cheers,
Longman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 20:46 ` Linus Torvalds
  2016-06-06 21:13   ` Waiman Long
@ 2016-06-06 21:15   ` Al Viro
  2016-06-06 21:46     ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Al Viro @ 2016-06-06 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 06, 2016 at 01:46:23PM -0700, Linus Torvalds wrote:

> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
> 
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
> 
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.
> 
> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.

FWIW, there's another fun issue on ramfs - dcache_readdir() is doing an
obscene amount of grabbing/releasing ->d_lock and once you take the external
serialization out, parallel getdents load hits contention on *that*.
In spades.  And unlike mutex (or rwsem exclusive), contention on ->d_lock
chews a lot of cycles.  The root cause is the use of cursors - we not only
move them more than we ought to (we do that on each entry reported, rather
than once before return from dcache_readdir()), we can't traverse the real
list entries (which remain nice and stable; another low-hanging fruit is
pointless grabbing ->d_lock on those) without ->d_lock on parent.

I think I have a kinda-sorta solution, but it has a problem.  What I want
to do is
	* list_move() only once per dcache_readdir()
	* ->d_lock taken for that and only for that.
	* list_move() itself surrounded with write_seqcount_{begin,end} on
some seqcount
	* traversal to the next real entry done under rcu_read_lock in a
seqretry loop.

The only problem is where to put that seqcount (unsigned int, really).
->i_dir_seq is an obvious candidate, but that'll need careful profiling
on getdents/lookup mixes...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 21:13   ` Waiman Long
@ 2016-06-06 21:20     ` Linus Torvalds
  2016-06-07  3:22       ` Valdis.Kletnieks
  2016-06-08  8:58     ` Ingo Molnar
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2016-06-06 21:20 UTC (permalink / raw)
  To: Waiman Long
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

On Mon, Jun 6, 2016 at 2:13 PM, Waiman Long <waiman.long@hpe.com> wrote:
>
> The tricky part about optimistic spinning in rwsem is that we don't know for
> sure if any of the lock holding readers is running or not.

I'm not sure how common the reader-vs-writer contention is, at least
for the new inode use. I'm sure you can trigger it with crazy
benchmarks, but I wouldn't worry about it unless people start
complaining.

The writer-writer case is easy to trigger with write-heavy loads (ok,
rename/unlink in this case). Are there real loads where there are lots
of concurrent lookups and writes? I really don't know (note that
"lookup" needs to be uncached and actually hit the lowlevel filesystem
for the locking to even trigger in the first place).

I guess some "concurrent readdir with unlink" load would show that
behavior, but is it _realistic_? No idea. Let's not worry about it too
much until somebody shows a reason to worry.

               Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 21:15   ` Al Viro
@ 2016-06-06 21:46     ` Linus Torvalds
  2016-06-06 22:07       ` Al Viro
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2016-06-06 21:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 6, 2016 at 2:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, there's another fun issue on ramfs - dcache_readdir() is doing an
> obscene amount of grabbing/releasing ->d_lock

Yeah, I saw that report, and I'm not entirely surprised. There was no
real reason to worry about it before, since readdir() couldn't really
cause contention with the previous mutual exclusion at the upper
level.

> I think I have a kinda-sorta solution, but it has a problem.  What I want
> to do is
>         * list_move() only once per dcache_readdir()
>         * ->d_lock taken for that and only for that.
>         * list_move() itself surrounded with write_seqcount_{begin,end} on
> some seqcount
>         * traversal to the next real entry done under rcu_read_lock in a
> seqretry loop.
>
> The only problem is where to put that seqcount (unsigned int, really).
> ->i_dir_seq is an obvious candidate, but that'll need careful profiling
> on getdents/lookup mixes...

Let's look at smaller changes first.

In particular, why the f*ck do we take "next->d_lock" at all, much less twice?

Do we really need that? Both of them seem bogus:

 - the first one only protects the "simple_positive()" test.

    That seems stupid. We hold the parent d_lock _and_ we hold the
parent inode lock for reading, how the hell is simple_positive() going
to change? Yeah, yeah, *maybe* we could catch a lookup that just
happens to add the inode field in process, but it's not a race we even
care about. If we see the inode being non-NULL, it will now *stay*
non-NULL, and we already depend on that (that "d_inode(next)" is then
done without the lock held).

 - the second one only protects "list_move()" of the cursor. But since
it's the child list, the "next->d_lock" thing ends up being
irrelevant. It's the parent dentry lock we need to hold, nothing else.
Not the "next" one.

so I don't see the point of half the d_lock games we play.

And the thing about spinlock contention: having *nested* spinlocks be
contended turns contention into an exponential thing. I really suspect
that if we can just remove the nested spinlock, the dentry->d_lock
contention will go down by a huge amount, because then you completely
remove the "wait on lock while holding another lock" thing, which is
what tends to really suck.

So before you do something fancy, look at the simple parts. I think
getting rid of the nested d_lock might be really simple.

Maybe we might want to do some really careful

     inode = READ_ONCE(dentry->d_inode);

because we're reading the thing without locking, and maybe there's
something else subtle going on. But that really looks fairly
low-hanging, and it might make the problem go away in practice.

              Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 21:46     ` Linus Torvalds
@ 2016-06-06 22:07       ` Al Viro
  2016-06-06 23:50         ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Al Viro @ 2016-06-06 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 06, 2016 at 02:46:44PM -0700, Linus Torvalds wrote:

> Let's look at smaller changes first.
> 
> In particular, why the f*ck do we take "next->d_lock" at all, much less twice?

See upthread re "low-hanging fruit".

> Do we really need that? Both of them seem bogus:
> 
>  - the first one only protects the "simple_positive()" test.
> 
>     That seems stupid. We hold the parent d_lock _and_ we hold the
> parent inode lock for reading, how the hell is simple_positive() going
> to change? Yeah, yeah, *maybe* we could catch a lookup that just
> happens to add the inode field in process, but it's not a race we even
> care about.

Can't happen - it's ramfs and lookups there never end up adding a positive
entry.  So I don't believe that READ_ONCE() or anything of that sort would
be needed.  All transitions from negative to positive happen under exclusive
lock on parent, which gives us all barriers we need.  Transitions from
hashed positive to negative or unhashed also happen only under the same
exclusive lock on parent, which takes care of going in other direction.

> If we see the inode being non-NULL, it will now *stay*
> non-NULL, and we already depend on that (that "d_inode(next)" is then
> done without the lock held).

Like I said - it's stable.

>  - the second one only protects "list_move()" of the cursor. But since
> it's the child list, the "next->d_lock" thing ends up being
> irrelevant. It's the parent dentry lock we need to hold, nothing else.
> Not the "next" one.
> 
> so I don't see the point of half the d_lock games we play.

Yes.

> And the thing about spinlock contention: having *nested* spinlocks be
> contended turns contention into an exponential thing. I really suspect
> that if we can just remove the nested spinlock, the dentry->d_lock
> contention will go down by a huge amount, because then you completely
> remove the "wait on lock while holding another lock" thing, which is
> what tends to really suck.

True in general, but here we really do a lot under that ->d_lock - all
list traversals are under it.  So I suspect that contention on nested
lock is not an issue in that particular load.  It's certainly a separate
commit, so we'll see how much it gives on its own, but I doubt that
it'll be anywhere near enough.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 22:07       ` Al Viro
@ 2016-06-06 23:50         ` Linus Torvalds
  2016-06-06 23:59           ` Linus Torvalds
  2016-06-07  0:40           ` Al Viro
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-06-06 23:50 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML



On Mon, 6 Jun 2016, Al Viro wrote:
> 
> True in general, but here we really do a lot under that ->d_lock - all
> list traversals are under it.  So I suspect that contention on nested
> lock is not an issue in that particular load.  It's certainly a separate
> commit, so we'll see how much it gives on its own, but I doubt that
> it'll be anywhere near enough.

Hmm. Maybe. 

But at least we can try to minimize everything that happens under the 
dentry->d_lock spinlock. 

So how about this patch? It's entirely untested, but it rewrites that 
readdir() function to try to do the minimum possible under the d_lock 
spinlock.

I say "rewrite", because it really is totally different. It's not just 
that the nested "next" locking is gone, it also treats the cursor very 
differently and tries to avoid doing any unnecessary cursor list 
operations.

So instead of "list_move()" at the beginning of traversal, and another for 
each successful dir_emit(), it just updates a "move target". The target is 
stable because it will only ever be a positive directory entry that we 
emitted, which is stable due to having the directory inode held for 
reading.

So now there is only one final "list_move()" at the end, and it's only 
done if we actually emitted a directory entry.

That makes the whole actual loop when we hold the d_lock very tight. So we 
hold the d_lock for truly the minimal possible case, namely the actual 
list traversal.

And sure, we keep dropping and re-taking it, so we'll get a fair amount of 
cacheline ping-pong if we have several CPUs doing this on the same 
directory at the same time. But minimizing the work we do inside the 
spinlock should hopefully mean that we get very little actual spinning.

And I do want to repeat that the patch is entirely untested. It compiles. 
I looked at the assembly it generated. It looks fine to me, but I might 
have had a brainfart and done something completely broken.

			Linus

---

 fs/libfs.c | 55 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3db2721144c2..00a0b3a6b23e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -132,45 +132,54 @@ static inline unsigned char dt_type(struct inode *inode)
 	return (inode->i_mode >> 12) & 15;
 }
 
+static inline int dcache_emit_entry(struct dir_context *ctx, struct dentry *dir, struct dentry *child)
+{
+	int ret;
+	struct inode *inode;
+
+	spin_unlock(&dir->d_lock);
+	inode = d_inode(child);
+	ret = dir_emit(ctx, child->d_name.name, child->d_name.len, inode->i_ino,  dt_type(inode));
+	spin_lock(&dir->d_lock);
+	return ret;
+}
+
 /*
  * Directory is locked and all positive dentries in it are safe, since
  * for ramfs-type trees they can't go away without unlink() or rmdir(),
  * both impossible due to the lock on directory.
  */
-
 int dcache_readdir(struct file *file, struct dir_context *ctx)
 {
-	struct dentry *dentry = file->f_path.dentry;
-	struct dentry *cursor = file->private_data;
-	struct list_head *p, *q = &cursor->d_child;
+	struct dentry *dentry, *cursor;
+	struct list_head *entry, *target;
 
 	if (!dir_emit_dots(file, ctx))
 		return 0;
+
+	dentry = file->f_path.dentry;
 	spin_lock(&dentry->d_lock);
-	if (ctx->pos == 2)
-		list_move(q, &dentry->d_subdirs);
-
-	for (p = q->next; p != &dentry->d_subdirs; p = p->next) {
-		struct dentry *next = list_entry(p, struct dentry, d_child);
-		spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
-		if (!simple_positive(next)) {
-			spin_unlock(&next->d_lock);
+
+	cursor = file->private_data;
+	entry = (ctx->pos == 2) ? &dentry->d_subdirs : &cursor->d_child;
+
+	target = NULL;
+	while ((entry = entry->next) != &dentry->d_subdirs) {
+		struct dentry *child = list_entry(entry, struct dentry, d_child);
+
+		if (!simple_positive(child))
 			continue;
-		}
 
-		spin_unlock(&next->d_lock);
-		spin_unlock(&dentry->d_lock);
-		if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
-			      d_inode(next)->i_ino, dt_type(d_inode(next))))
+		/* This will drop and re-take the dentry lock .. */
+		if (!dcache_emit_entry(ctx, dentry, child))
 			return 0;
-		spin_lock(&dentry->d_lock);
-		spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
-		/* next is still alive */
-		list_move(q, p);
-		spin_unlock(&next->d_lock);
-		p = q;
+
+		/* .. but 'entry' is stable because of the shared rwsem */
+		target = entry;
 		ctx->pos++;
 	}
+	if (target)
+		list_move(&cursor->d_child, target);
 	spin_unlock(&dentry->d_lock);
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 23:50         ` Linus Torvalds
@ 2016-06-06 23:59           ` Linus Torvalds
  2016-06-07  0:29             ` Linus Torvalds
  2016-06-07  0:40           ` Al Viro
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2016-06-06 23:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 6, 2016 at 4:50 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> And I do want to repeat that the patch is entirely untested. It compiles.
> I looked at the assembly it generated. It looks fine to me, but I might
> have had a brainfart and done something completely broken.

Dammit.

I did that whole dcache_emit_entry() thing because we need to do the
final target cursor move regardless, and dcache_emit_entry() will
always return with the spinlock held again.

But then I didn't actually fix the "return" to a "break".

So the "return 0" here:

> +               /* This will drop and re-take the dentry lock .. */
> +               if (!dcache_emit_entry(ctx, dentry, child))
>                         return 0;

is fatal and would return with the spinlock held. It *should* have
been a "break" to exit the readdir loop.

So the patch I sent out was indeed a terminal brainfart, but with that
fix to change that return to a "break" it *might* work.

            Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 23:59           ` Linus Torvalds
@ 2016-06-07  0:29             ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2016-06-07  0:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 6, 2016 at 4:59 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So the patch I sent out was indeed a terminal brainfart, but with that
> fix to change that return to a "break" it *might* work.

It seems to at least survive some trivial testing with that fixed..

              Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 23:50         ` Linus Torvalds
  2016-06-06 23:59           ` Linus Torvalds
@ 2016-06-07  0:40           ` Al Viro
  2016-06-07  0:44             ` Al Viro
                               ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Al Viro @ 2016-06-07  0:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 06, 2016 at 04:50:59PM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 6 Jun 2016, Al Viro wrote:
> > 
> > True in general, but here we really do a lot under that ->d_lock - all
> > list traversals are under it.  So I suspect that contention on nested
> > lock is not an issue in that particular load.  It's certainly a separate
> > commit, so we'll see how much it gives on its own, but I doubt that
> > it'll be anywhere near enough.
> 
> Hmm. Maybe. 
> 
> But at least we can try to minimize everything that happens under the 
> dentry->d_lock spinlock. 
> 
> So how about this patch? It's entirely untested, but it rewrites that 
> readdir() function to try to do the minimum possible under the d_lock 
> spinlock.
> 
> I say "rewrite", because it really is totally different. It's not just 
> that the nested "next" locking is gone, it also treats the cursor very 
> differently and tries to avoid doing any unnecessary cursor list 
> operations.

Similar to what I've got here, except that mine has a couple of helper
functions usable in dcache_dir_lseek() as well:

next_positive(parent, child, n) - returns nth positive child after that one
or NULL if there's less than n such.  NULL as the second argument => search
from the beginning.

move_cursor(cursor, child) - moves cursor immediately past child *or* to
the very end if child is NULL.

The third commit in series will be the lockless replacement for
for next_positive().  move_cursor() is easy - it became simply
	struct dentry *parent = cursor->d_parent;
	unsigned n, *seq = &parent->d_inode->i_dir_seq;
	spin_lock(&parent->d_lock);
        for (;;) {
                n = *seq;
                if (!(n & 1) && cmpxchg(seq, n, n + 1) == n)  
                        break;
                cpu_relax();
        }
        __list_del(cursor->d_child.prev, cursor->d_child.next);
        if (child)
                list_add(&cursor->d_child, &child->d_child);
        else
                list_add_tail(&cursor->d_child, &parent->d_subdirs);
	smp_store_release(seq, n + 2);
	spin_unlock(&parent->d_lock);

with

static struct dentry *next_positive(struct dentry *parent,
                                    struct dentry *child, int count)
{
        struct list_head *p = child ? &child->d_child : &parent->d_subdirs;
        unsigned *seq = &parent->d_inode->i_dir_seq, n;
        do {
                int i = count;
                n = smp_load_acquire(seq) & ~1;
                rcu_read_lock();
                do {
                        p = p->next;
                        if (p == &parent->d_subdirs) {
                                child = NULL;
                                break;
                        }
                        child = list_entry(p, struct dentry, d_child);
                } while (!simple_positive(child) || --i);
                rcu_read_unlock();
        } while (unlikely(smp_load_acquire(seq) != n));
        return child;
}
as initial attempt at lockless next_positive(); barriers are probably wrong,
though...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-07  0:40           ` Al Viro
@ 2016-06-07  0:44             ` Al Viro
  2016-06-07  0:58             ` Al Viro
  2016-06-07  0:58             ` Linus Torvalds
  2 siblings, 0 replies; 21+ messages in thread
From: Al Viro @ 2016-06-07  0:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Tue, Jun 07, 2016 at 01:40:58AM +0100, Al Viro wrote:
> On Mon, Jun 06, 2016 at 04:50:59PM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Mon, 6 Jun 2016, Al Viro wrote:
> > > 
> > > True in general, but here we really do a lot under that ->d_lock - all
> > > list traversals are under it.  So I suspect that contention on nested
> > > lock is not an issue in that particular load.  It's certainly a separate
> > > commit, so we'll see how much it gives on its own, but I doubt that
> > > it'll be anywhere near enough.
> > 
> > Hmm. Maybe. 
> > 
> > But at least we can try to minimize everything that happens under the 
> > dentry->d_lock spinlock. 
> > 
> > So how about this patch? It's entirely untested, but it rewrites that 
> > readdir() function to try to do the minimum possible under the d_lock 
> > spinlock.
> > 
> > I say "rewrite", because it really is totally different. It's not just 
> > that the nested "next" locking is gone, it also treats the cursor very 
> > differently and tries to avoid doing any unnecessary cursor list 
> > operations.
> 
> Similar to what I've got here, except that mine has a couple of helper
> functions usable in dcache_dir_lseek() as well:
> 
> next_positive(parent, child, n) - returns nth positive child after that one
> or NULL if there's less than n such.  NULL as the second argument => search
> from the beginning.
> 
> move_cursor(cursor, child) - moves cursor immediately past child *or* to
> the very end if child is NULL.
> 
> The third commit in series will be the lockless replacement for
> for next_positive().  move_cursor() is easy - it became simply
> 	struct dentry *parent = cursor->d_parent;
> 	unsigned n, *seq = &parent->d_inode->i_dir_seq;
> 	spin_lock(&parent->d_lock);
>         for (;;) {
>                 n = *seq;
>                 if (!(n & 1) && cmpxchg(seq, n, n + 1) == n)  
>                         break;
>                 cpu_relax();
>         }
>         __list_del(cursor->d_child.prev, cursor->d_child.next);
>         if (child)
>                 list_add(&cursor->d_child, &child->d_child);
>         else
>                 list_add_tail(&cursor->d_child, &parent->d_subdirs);
> 	smp_store_release(seq, n + 2);
> 	spin_unlock(&parent->d_lock);
> 
> with
> 
> static struct dentry *next_positive(struct dentry *parent,
>                                     struct dentry *child, int count)
> {
>         struct list_head *p = child ? &child->d_child : &parent->d_subdirs;
>         unsigned *seq = &parent->d_inode->i_dir_seq, n;
>         do {
>                 int i = count;
>                 n = smp_load_acquire(seq) & ~1;
>                 rcu_read_lock();
>                 do {
>                         p = p->next;
>                         if (p == &parent->d_subdirs) {
>                                 child = NULL;
>                                 break;
>                         }
>                         child = list_entry(p, struct dentry, d_child);
>                 } while (!simple_positive(child) || --i);
>                 rcu_read_unlock();
>         } while (unlikely(smp_load_acquire(seq) != n));
>         return child;
> }
> as initial attempt at lockless next_positive(); barriers are probably wrong,
> though...

FWIW,
loff_t dcache_dir_lseek(struct file *file, loff_t offset, int whence)
{
        struct dentry *dentry = file->f_path.dentry;
        switch (whence) {
                case 1:
                        offset += file->f_pos;
                case 0:
                        if (offset >= 0)
                                break;
                default:
                        return -EINVAL;
        }
        if (offset != file->f_pos) {
                file->f_pos = offset;
                if (file->f_pos >= 2) {
                        struct dentry *cursor = file->private_data;
                        loff_t n = file->f_pos - 2;

                        inode_lock_shared(dentry->d_inode);
                        move_cursor(cursor, next_positive(dentry, NULL, n));
                        inode_unlock_shared(dentry->d_inode);
                }
        }
        return offset;
}

and
int dcache_readdir(struct file *file, struct dir_context *ctx)
{
        struct dentry *dentry = file->f_path.dentry;
        struct dentry *cursor = file->private_data;
        struct dentry *child, *next;
        bool moved = false;

        if (!dir_emit_dots(file, ctx))
                return 0;

        child = ctx->pos != 2 ? cursor : NULL;
        while ((next = next_positive(dentry, child, 1)) != NULL) {
                if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
                              d_inode(next)->i_ino, dt_type(d_inode(next))))
                        break;
                moved = true;
                child = next;
                ctx->pos++;
        }
        if (moved)
                move_cursor(cursor, child);
        return 0;
}

is what the methods themselves became.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-07  0:40           ` Al Viro
  2016-06-07  0:44             ` Al Viro
@ 2016-06-07  0:58             ` Al Viro
  2016-06-07  0:58             ` Linus Torvalds
  2 siblings, 0 replies; 21+ messages in thread
From: Al Viro @ 2016-06-07  0:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Tue, Jun 07, 2016 at 01:40:58AM +0100, Al Viro wrote:

> as initial attempt at lockless next_positive(); barriers are probably wrong,
> though...

More than just barriers...  _Absolutely_ untested variant pushed into
vfs.git#untested.dcache_readdir (the last 3 commits in there).  It
compiles, but that's it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-07  0:40           ` Al Viro
  2016-06-07  0:44             ` Al Viro
  2016-06-07  0:58             ` Al Viro
@ 2016-06-07  0:58             ` Linus Torvalds
  2016-06-07  1:19               ` Al Viro
  2 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2016-06-07  0:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 6, 2016 at 5:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> static struct dentry *next_positive(struct dentry *parent,
>                                     struct dentry *child, int count)
> {
>         struct list_head *p = child ? &child->d_child : &parent->d_subdirs;

From your description, you seem to be very confused about what "child
== NULL" means. Here it means that it's a cursor to the beginning, but
in your commentary on move_cursor(), you say "moves cursor immediately
past child *or* to the very end if child is NULL".

That's very confusing. Is NULL beginning or end?

I really think you'd be better off having a special ERR_PTR value for
end, possibly as a flag value in the cursor dentry.

The whole "what does NULL mean" confusion exists inside that
"next_positive" too:

>         unsigned *seq = &parent->d_inode->i_dir_seq, n;
>         do {
>                 int i = count;
>                 n = smp_load_acquire(seq) & ~1;
>                 rcu_read_lock();
>                 do {
>                         p = p->next;
>                         if (p == &parent->d_subdirs) {
>                                 child = NULL;
>                                 break;
>                         }

look, here you return NULL for "end" again. Even though it meant
beginning at the start of the function. Nasty.

Also, may I suggest that there is a very trivial special case for
"next_positive()" that needs no barriers or sequence checking or
anything else: at the very beginning, just load the "->next" pointer,
and if it's a positive entry, you're done. That's going to be the
common case when there _aren't_ crazy multi-threaded readdirs going on,
so it's worth handling separately.
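
Something like this, I mean (completely untested sketch, reusing your
next_positive() naming -- the helper name here is made up):

	/* fast path: if the entry right after 'from' is already a
	 * positive dentry, return it without the seqcount dance */
	static struct dentry *next_positive_fast(struct dentry *parent,
						 struct list_head *from)
	{
		struct list_head *p = READ_ONCE(from->next);

		if (p != &parent->d_subdirs) {
			struct dentry *d = list_entry(p, struct dentry, d_child);

			if (simple_positive(d))
				return d;
		}
		return NULL;	/* fall back to the careful walk */
	}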

In fact, if you have a special value for the case of "cursor is at
end" situation, then for the small directory case that can be handled
with a single getdents call, you'll *never* set the cursor in the
child list at all, which means that the above special case for
next_positive() is actually the common case even for the threaded
situation.

                Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-07  0:58             ` Linus Torvalds
@ 2016-06-07  1:19               ` Al Viro
  0 siblings, 0 replies; 21+ messages in thread
From: Al Viro @ 2016-06-07  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	LKML

On Mon, Jun 06, 2016 at 05:58:53PM -0700, Linus Torvalds wrote:
> From your description, you seem to be very confused about what "child
> == NULL" means. Here it means that it's a cursor to the beginning, but
> in your commentary on move_cursor(), you say "moves cursor immediately
> past child *or* to the very end if child is NULL".
> 
> That's very confusing. Is NULL beginning or end?

The former for argument, the latter for return value...

> >         unsigned *seq = &parent->d_inode->i_dir_seq, n;
> >         do {
> >                 int i = count;
> >                 n = smp_load_acquire(seq) & ~1;
> >                 rcu_read_lock();
> >                 do {
> >                         p = p->next;
> >                         if (p == &parent->d_subdirs) {
> >                                 child = NULL;
> >                                 break;
> >                         }
> 
> look, here you return NULL for "end" again. Even though it meant
> beginning at the start of the function. Nasty.

Actually, reassigning 'child' here was broken, NULL or no NULL - we want
the subsequent retries (if any) to start at the same state.
 
> Also, may I suggest that there is a very trivial special case for
> "next_positive()" that needs no barriers or sequence checking or
> anything else: at the very beginning, just load the "->next" pointer,
> and if it's a positive entry, you're done. That's going to be the
> common case when there _isn't_ crazy multi-threaded readdirs going on,
> so it's worth handling separately.

Point.

> In fact, if you have a special value for the case of "cursor is at
> end" situation, then for the small directory case that can be handled
> with a single getdents call, you'll *never* set the cursor in the
> child list at all, which means that the above special case for
> next_positive() is actually the common case even for the threaded
> situation.

Not really.  Cursor is allocated on the child list in the first place; it's
just that its position is ignored for file->f_pos <= 2.  We could change
that, but I'd rather avoid the headache right now.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 21:20     ` Linus Torvalds
@ 2016-06-07  3:22       ` Valdis.Kletnieks
  2016-06-07 15:22         ` Waiman Long
  0 siblings, 1 reply; 21+ messages in thread
From: Valdis.Kletnieks @ 2016-06-07  3:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Dave Hansen, Chen, Tim C, Ingo Molnar,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

On Mon, 06 Jun 2016 14:20:32 -0700, Linus Torvalds said:

> I guess some "concurrent readdir with unlink" load would show that
> behavior, but is it _realistic_? No idea. Let's not worry about it too
> much until somebody shows a reason to worry.

I've seen Makefiles where 'make clean' does a 'find . -name "*.o" | xargs rm'.

But if somebody is doing that often enough against cache-cold directory trees
to matter, they have bigger problems (like learning how to properly develop
code using 'make').  So unless performance is *so* bad it triggers the lockup
detector, it's probably OK....


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-07  3:22       ` Valdis.Kletnieks
@ 2016-06-07 15:22         ` Waiman Long
  0 siblings, 0 replies; 21+ messages in thread
From: Waiman Long @ 2016-06-07 15:22 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Linus Torvalds, Dave Hansen, Chen, Tim C, Ingo Molnar,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

On 06/06/2016 11:22 PM, Valdis.Kletnieks@vt.edu wrote:
> On Mon, 06 Jun 2016 14:20:32 -0700, Linus Torvalds said:
>
>> I guess some "concurrent readdir with unlink" load would show that
>> behavior, but is it _realistic_? No idea. Let's not worry about it too
>> much until somebody shows a reason to worry.
> I've seen Makefiles where 'make clean' does a 'find . -name "*.o" | xargs rm'.
>
> But if somebody is doing that often enough against cache-cold directory trees
> to matter, they have bigger problems (like learning how to properly develop
> code using 'make').  So unless performance is *so* bad it triggers the lockup
> detector, it's probably OK....
>

The performance won't be very bad. It is just that with the right mix of 
readers and writers, the performance can be a bit worse  with i_rwsem 
than with i_mutex. But in other cases, the performance will be 
comparable or better with i_rwsem.

Regards,
Longman

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-06 21:13   ` Waiman Long
  2016-06-06 21:20     ` Linus Torvalds
@ 2016-06-08  8:58     ` Ingo Molnar
  2016-06-09 10:25       ` Ingo Molnar
  1 sibling, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2016-06-08  8:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Dave Hansen, Chen, Tim C, Ingo Molnar,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML


* Waiman Long <waiman.long@hpe.com> wrote:

> I do have a patchset that allows us to more accurately determine the state of
> the lock owner.
> 
> locking/rwsem: Add reader-owned state to the owner field
> http://www.spinics.net/lists/kernel/msg2258572.html
> 
> That should eliminate the performance gap between mutex and rwsem wrt
> spinning when only writers are present. I am hoping that that patchset can
> be queued for 4.8.

Yeah, so I actually had this series merged for testing last week, but a 
complication with a prereq patch made me unmerge it. But I have no fundamental 
objections, at all.

I also agree with Linus's general observation that we want to make 
down_write()/up_write() match mutex performance characteristics.

I think kernel developers should fundamentally be able to switch between 
mutex_lock()/unlock() and down_write()/up_write() and back, without noticing
any high level behavioral changes.

Any 'reader/writer mixing' artifacts are secondary concerns and we'll sort them 
out as they happen.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-08  8:58     ` Ingo Molnar
@ 2016-06-09 10:25       ` Ingo Molnar
  2016-06-09 18:14         ` Dave Hansen
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2016-06-09 10:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, Dave Hansen, Chen, Tim C, Ingo Molnar,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Waiman Long <waiman.long@hpe.com> wrote:
> 
> > I do have a patchset that allows us to more accurately determine the state of
> > the lock owner.
> > 
> > locking/rwsem: Add reader-owned state to the owner field
> > http://www.spinics.net/lists/kernel/msg2258572.html
> > 
> > That should eliminate the performance gap between mutex and rwsem wrt
> > spinning when only writers are present. I am hoping that that patchset can
> > be queued for 4.8.
> 
> Yeah, so I actually had this series merged for testing last week, but a 
> complication with a prereq patch made me unmerge it. But I have no fundamental 
> objections, at all.
> 
> I also agree with Linus's general observation that we want to make 
> down_write()/up_write() match mutex performance characteristics.
> 
> I think kernel developers should fundamentally be able to switch between 
> mutex_lock()/unlock() and down_write()/up_write() and back, without noticing
> any high level behavioral changes.

Ok, these enhancements are now in the locking tree and are queued up for v4.8:

   git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core

Dave, you might want to check your numbers with these changes: is rwsem 
performance still significantly worse than mutex performance?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-09 10:25       ` Ingo Molnar
@ 2016-06-09 18:14         ` Dave Hansen
  2016-06-09 20:10           ` Chen, Tim C
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Hansen @ 2016-06-09 18:14 UTC (permalink / raw)
  To: Ingo Molnar, Waiman Long
  Cc: Linus Torvalds, Chen, Tim C, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

On 06/09/2016 03:25 AM, Ingo Molnar wrote:
>>> That should eliminate the performance gap between mutex and rwsem wrt
>>> spinning when only writers are present. I am hoping that that patchset can
>>> be queued for 4.8.
>>
>> Yeah, so I actually had this series merged for testing last week, but a 
>> complication with a prereq patch made me unmerge it. But I have no fundamental 
>> objections, at all.
...
> Ok, these enhancements are now in the locking tree and are queued up for v4.8:
> 
>    git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
> 
> Dave, you might want to check your numbers with these changes: is rwsem 
> performance still significantly worse than mutex performance?

It's substantially closer than it was, but there's probably a little
work still to do.  The rwsem still looks to be sleeping a lot more than
the mutex.  Here's where we started:

	https://www.sr71.net/~dave/intel/rwsem-vs-mutex.png

The rwsem peaked lower and earlier than the mutex code.  Now, if we
compare the old (4.7-rc1) rwsem code to the newly-patched rwsem code
(from tip/locking):

> https://www.sr71.net/~dave/intel/bb.html?1=4.7.0-rc1&2=4.7.0-rc1-00127-gd4c3be7

We can see the peak is a bit higher and more importantly, it's more of a
plateau than a sharp peak.  We can also compare the new rwsem code to
the 4.5 code that had the mutex in place:

> https://www.sr71.net/~dave/intel/bb.html?1=4.5.0-rc6&2=4.7.0-rc1-00127-gd4c3be7

rwsems are still a _bit_ below the mutex code at the peak, and they also
seem to be substantially lower during the tail from 20 cpus on up.  The
rwsems are sleeping less than they were before the tip/locking updates,
but they are still idling the CPUs 90% of the time while the mutexes end
up idle 15-20% of the time when all the cpus are contending on the lock.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: performance delta after VFS i_mutex=>i_rwsem conversion
  2016-06-09 18:14         ` Dave Hansen
@ 2016-06-09 20:10           ` Chen, Tim C
  0 siblings, 0 replies; 21+ messages in thread
From: Chen, Tim C @ 2016-06-09 20:10 UTC (permalink / raw)
  To: Hansen, Dave, Ingo Molnar, Waiman Long
  Cc: Linus Torvalds, Ingo Molnar, Davidlohr Bueso,
	Peter Zijlstra (Intel),
	Jason Low, Michel Lespinasse, Paul E. McKenney, Waiman Long,
	Al Viro, LKML

>> Ok, these enhancements are now in the locking tree and are queued up for
>v4.8:
>>
>>    git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>> locking/core
>>
>> Dave, you might want to check your numbers with these changes: is
>> rwsem performance still significantly worse than mutex performance?
>
>It's substantially closer than it was, but there's probably a little work still to do.
>The rwsem still looks to be sleeping a lot more than the mutex.  Here's where
>we started:
>
>	https://www.sr71.net/~dave/intel/rwsem-vs-mutex.png
>
>The rwsem peaked lower and earlier than the mutex code.  Now, if we compare
>the old (4.7-rc1) rwsem code to the newly-patched rwsem code (from
>tip/locking):
>
>> https://www.sr71.net/~dave/intel/bb.html?1=4.7.0-rc1&2=4.7.0-rc1-00127
>> -gd4c3be7
>
>We can see the peak is a bit higher and more importantly, it's more of a plateau
>than a sharp peak.  We can also compare the new rwsem code to the 4.5 code
>that had the mutex in place:
>
>> https://www.sr71.net/~dave/intel/bb.html?1=4.5.0-rc6&2=4.7.0-rc1-00127
>> -gd4c3be7
>
>rwsems are still a _bit_ below the mutex code at the peak, and they also seem
>to be substantially lower during the tail from 20 cpus on up.  The rwsems are
>sleeping less than they were before the tip/locking updates, but they are still
>idling the CPUs 90% of the time while the mutexes end up idle 15-20% of the
>time when all the cpus are contending on the lock.

In Al Viro's conversion, he introduced inode_lock_shared(), which takes
the rwsem for read in lookup_slow().  The rwsem does bail out of optimistic
spinning when readers acquire the lock, so we see a lot fewer optimistic
spinning attempts for the unlink test case.

Whereas earlier, with the mutex, we would keep spinning.

A simple test may be to see if we get similar performance when changing
inode_lock_shared to the writer version, inode_lock.
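
For reference, those wrappers are just thin shims over the i_rwsem
(4.7-era include/linux/fs.h, quoted from memory), so the experiment is
essentially swapping a down_read for a down_write in that path:

	static inline void inode_lock(struct inode *inode)
	{
		down_write(&inode->i_rwsem);
	}

	static inline void inode_lock_shared(struct inode *inode)
	{
		down_read(&inode->i_rwsem);
	}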

That said, hopefully we have a lot more read locking than write locking
(i.e. more path lookups than changes to the path), so the switch to rwsem is
still a win.  I guess the lesson here is that when there is an equal mix of
writers and readers, rwsem can be a bit worse in performance than mutex, as
we don't spin as hard.

Tim

^ permalink raw reply	[flat|nested] 21+ messages in thread
