From: Dave Hansen <dave.hansen@intel.com>
To: "Chen, Tim C" <tim.c.chen@intel.com>,
	Ingo Molnar <mingo@redhat.com>, Davidlohr Bueso <dbueso@suse.de>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Jason Low <jason.low2@hp.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Michel Lespinasse <walken@google.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Waiman Long <waiman.long@hp.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	LKML <linux-kernel@vger.kernel.org>
Subject: performance delta after VFS i_mutex=>i_rwsem conversion
Date: Mon, 6 Jun 2016 13:00:49 -0700
Message-ID: <5755D671.9070908@intel.com>

tl;dr: Mutexes spin more than rwsems, which makes mutexes perform
better when contending ->i_rwsem.  But, mutexes do this at the cost of
not sleeping much, even with tons of contention.

Should we do something to keep rwsem and mutex performance close to each
other?  If so, should mutexes be sleeping more or rwsems sleeping less?

---

I was doing some performance comparisons between 4.5 and 4.7 and noticed
a weird "blip" in unlink[1] performance where it got worse between 4.5
and 4.7:

	https://www.sr71.net/~dave/intel/rwsem-vs-mutex.png

There are two things to notice here:
1. Although the "1" (4.5) and "2" (4.7) lines nearly converge at high
   cpu counts, 4.5 outperforms 4.7 by almost 2x at low contention.
2. 4.7 has lots of idle time.  4.5 eats lots of cpu spinning.

That was a 160-thread Westmere (~4 years old) system, but I moved to a
smaller system (4 cores, Skylake) for more focused testing, and there
bisected the regression down to the i_mutex-to-rwsem switch.  From here
on out, I tested two commits:

	[d9171b9] 1 commit before 9902af7
	[9902af7] "parallel lookups: actual switch to rwsem"

unlink takes the rwsem for write, so it should see the same level of
contention as the mutex did.  But, at 4 threads, the unlink performance was:

	d9171b9(mutex): 689179
	9902af7(rwsem): 498325 (-27.7%)
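
(For reference, the conversion did not change the call sites;
inode_lock() simply went from wrapping i_mutex to taking the write
side of the rwsem.  Roughly, from 4.7's include/linux/fs.h:)

	static inline void inode_lock(struct inode *inode)
	{
		down_write(&inode->i_rwsem);
	}

	static inline void inode_unlock(struct inode *inode)
	{
		up_write(&inode->i_rwsem);
	}

So 4 unlinking threads are 4 writers pounding the parent directory's
i_rwsem, the same way they pounded i_mutex before.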

I tracked this down to the differences between:

	rwsem_spin_on_owner() - false roughly 1% of the time
	mutex_spin_on_owner() - false roughly 0.05% of the time

The optimistic rwsem and mutex code look quite similar, but there is one
big difference: a hunk of code in rwsem_spin_on_owner() stops the
spinning for rwsems, but isn't present for mutexes in any form:

>         if (READ_ONCE(sem->owner))
>                 return true; /* new owner, continue spinning */
> 
>         /*
>          * When the owner is not set, the lock could be free or
>          * held by readers. Check the counter to verify the
>          * state.
>          */
>         count = READ_ONCE(sem->count); 
>         return (count == 0 || count == RWSEM_WAITING_BIAS);
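
For contrast, the mutex spin loop has no such bail-out: when the owner
goes away, it just retries the acquisition and keeps spinning.  Trimmed
down from 4.7's mutex_optimistic_spin() (ww_mutex and the RT-task
check omitted):

	while (true) {
		struct task_struct *owner;

		/* If there's an owner, wait for it to release or sleep. */
		owner = READ_ONCE(lock->owner);
		if (owner && !mutex_spin_on_owner(lock, owner))
			break;	/* owner scheduled out; go sleep */

		/* No (visible) owner: try to take the lock right now. */
		if (mutex_try_to_acquire(lock)) {
			mutex_set_owner(lock);
			osq_unlock(&lock->osq);
			return true;
		}

		/* Didn't get it; spin some more. */
		cpu_relax_lowlatency();
	}

There is no "could be held by readers" case for a mutex, so a NULL
owner never makes it give up.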

If I hack this out, I end up with:

	d9171b9(mutex-original): 689179
	9902af7(rwsem-hacked  ): 671706 (-2.5%)
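
(The hack is nothing clever; roughly, it replaces the count check with
the unconditional answer the mutex code would give:

	-	count = READ_ONCE(sem->count);
	-	return (count == 0 || count == RWSEM_WAITING_BIAS);
	+	return true; /* keep spinning, like mutexes do */

Note this is a diagnostic hack only: it keeps spinning even when the
lock is held by readers who aren't going anywhere.)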

I think it's safe to say that this accounts for the majority of the
difference in behavior.  It also shows that the _actual_
i_mutex=>i_rwsem conversion is the issue here, and not some other part
of the parallel lookup patches.

So, as it stands today in 4.7-rc1, mutexes end up yielding higher
performance under contention.  But they don't let the system go very
idle, even under heavy contention, which seems rather wrong.  Should we
be making rwsems spin more, or mutexes spin less?

--

1. The test in question here is creating/unlinking many files in the
same directory:
https://github.com/antonblanchard/will-it-scale/blob/master/tests/unlink1.c
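
The per-thread loop in that test boils down to something like this
(paraphrased, not verbatim; "tmpdir" stands in for the test's temp
directory):

	char file[PATH_MAX];

	/* every thread's file lives in the same parent directory */
	sprintf(file, "%s/unlink1_%d", tmpdir, getpid());

	while (1) {
		int fd = open(file, O_RDWR | O_CREAT, 0600);
		close(fd);
		unlink(file);	/* write-locks the shared parent */
		(*iterations)++;
	}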
