From: Dave Hansen
To: "Chen, Tim C", Ingo Molnar, Davidlohr Bueso, "Peter Zijlstra (Intel)",
    Jason Low, Linus Torvalds, Michel Lespinasse, "Paul E. McKenney",
    Waiman Long, Al Viro, LKML
Subject: performance delta after VFS i_mutex=>i_rwsem conversion
Message-ID: <5755D671.9070908@intel.com>
Date: Mon, 6 Jun 2016 13:00:49 -0700

tl;dr: Mutexes spin more than rwsems, which makes mutexes perform
better when contending on ->i_rwsem.  But, mutexes do this at the cost
of not sleeping much, even with tons of contention.  Should we do
something to keep rwsem and mutex performance close to each other?  If
so, should mutexes be sleeping more or rwsems sleeping less?

---

I was doing some performance comparisons between 4.5 and 4.7 and
noticed a weird "blip" in unlink[1] performance where it got worse
between 4.5 and 4.7:

	https://www.sr71.net/~dave/intel/rwsem-vs-mutex.png

There are two things to notice here:
1. Although the "1" (4.5) and "2" (4.7) lines nearly converge at high
   cpu counts, 4.5 outperforms 4.7 by almost 2x at low contention.
2. 4.7 has lots of idle time.  4.5 eats lots of cpu spinning.

That was a 160-thread Westmere (~4 years old) system, but I moved to a
smaller system for some more focused testing (4 cores, Skylake), and
bisected it there down to the i_mutex switch to rwsem.  I tested on two
commits from here on out:

	[d9171b9] 1 commit before 9902af7
	[9902af7] "parallel lookups: actual switch to rwsem"

unlink takes the rwsem for write, so it should see the same level of
contention as the mutex did.  But, at 4 threads, the unlink performance
was:

	d9171b9(mutex): 689179
	9902af7(rwsem): 498325 (-27.7%)

I tracked this down to the differences between:

	rwsem_spin_on_owner() - returns false roughly 1% of the time
	mutex_spin_on_owner() - returns false roughly 0.05% of the time

The optimistic rwsem and mutex code look quite similar, but there is
one big difference: a hunk of code in rwsem_spin_on_owner() stops the
spinning for rwsems, but isn't present for mutexes in any form:

>	if (READ_ONCE(sem->owner))
>		return true; /* new owner, continue spinning */
>
>	/*
>	 * When the owner is not set, the lock could be free or
>	 * held by readers. Check the counter to verify the
>	 * state.
>	 */
>	count = READ_ONCE(sem->count);
>	return (count == 0 || count == RWSEM_WAITING_BIAS);

If I hack this out, I end up with:

	d9171b9(mutex-original): 689179
	9902af7(rwsem-hacked  ): 671706 (-2.5%)
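"Hacking it out" means something roughly like the diff below (a sketch
against kernel/locking/rwsem-xadd.c with the hunk context trimmed; the
unconditional "return true" is just the simplest way I can see to mimic
what mutex_spin_on_owner() does once the owner changes, not necessarily
the exact patch you would want to merge):

--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ rwsem_spin_on_owner(): after the sem->owner spin loop @@
-	if (READ_ONCE(sem->owner))
-		return true; /* new owner, continue spinning */
-
-	/*
-	 * When the owner is not set, the lock could be free or
-	 * held by readers. Check the counter to verify the
-	 * state.
-	 */
-	count = READ_ONCE(sem->count);
-	return (count == 0 || count == RWSEM_WAITING_BIAS);
+	/* keep spinning unconditionally, like mutex_spin_on_owner() */
+	return true;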
I think it's safe to say that this accounts for the majority of the
difference in behavior.  It also shows that the _actual_
i_mutex=>i_rwsem conversion is the issue here, and not some other part
of the parallel lookup patches.

So, as it stands today in 4.7-rc1, mutexes end up yielding higher
performance under contention.  But they don't let the system go very
idle, even under heavy contention, which seems rather wrong.  Should we
be making rwsems spin more, or mutexes spin less?

--

1. The test in question here is creating/unlinking many files in the
   same directory:
   > https://github.com/antonblanchard/will-it-scale/blob/master/tests/unlink1.c
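For anyone who doesn't want to pull the will-it-scale tree, the guts of
that test boil down to roughly the loop below (a simplified,
single-process sketch with error handling and the harness stripped out;
the /tmp path, file name, and iteration count here are just
illustrative, the real test at the URL above runs one such loop per
task and reports iterations per interval):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	unsigned long i;
	int fd;

	/* one file per task; contention is on the shared parent directory */
	snprintf(path, sizeof(path), "/tmp/unlink1.%d", getpid());

	for (i = 0; i < 1000000; i++) {
		/*
		 * Both the O_CREAT open() and the unlink() take the parent
		 * directory's ->i_rwsem (formerly ->i_mutex) for write.
		 */
		fd = open(path, O_CREAT | O_RDWR, 0600);
		if (fd >= 0)
			close(fd);
		unlink(path);
	}
	printf("%lu create/unlink iterations\n", i);
	return 0;
}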