Date: Mon, 6 Jun 2016 22:15:23 +0100
From: Al Viro
To: Linus Torvalds
Cc: Dave Hansen, "Chen, Tim C", Ingo Molnar, Davidlohr Bueso,
	"Peter Zijlstra (Intel)", Jason Low, Michel Lespinasse,
	"Paul E. McKenney", Waiman Long, LKML
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
Message-ID: <20160606211522.GF14480@ZenIV.linux.org.uk>
References: <5755D671.9070908@intel.com>

On Mon, Jun 06, 2016 at 01:46:23PM -0700, Linus Torvalds wrote:
> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
>
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
>
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.
>
> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.

FWIW, there's another fun issue on ramfs - dcache_readdir() does an
obscene amount of grabbing/releasing ->d_lock, and once you take the
external serialization out, a parallel getdents load hits contention on
*that*.  In spades.  And unlike a mutex (or an rwsem held exclusive),
contention on ->d_lock chews a lot of cycles.

The root cause is the use of cursors - not only do we move them more
often than we ought to (we do it on each entry reported, rather than
once before returning from dcache_readdir()), we also can't traverse
the real list entries (which remain nice and stable; another
low-hanging fruit is the pointless grabbing of ->d_lock on those)
without holding ->d_lock on the parent.

I think I have a kinda-sorta solution, but it has a problem.  What I
want to do is
	* list_move() only once per dcache_readdir()
	* ->d_lock taken for that and only for that
	* the list_move() itself surrounded with write_seqcount_{begin,end}
	  on some seqcount
	* traversal to the next real entry done under rcu_read_lock() in a
	  seqretry loop.

The only problem is where to put that seqcount (an unsigned int,
really).  ->i_dir_seq is an obvious candidate, but that'll need careful
profiling on getdents/lookup mixes...
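
Very roughly, the readdir side of that scheme could look something like
the following (an illustrative, untested sketch only - next_positive()
and move_cursor() are made-up names, the seqcount is passed in
explicitly because where it actually lives is exactly the open question
above, and the lockless ->next walk glosses over details a real patch
would have to get right):

#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/seqlock.h>
#include <linux/spinlock.h>

/*
 * Find the next positive (i.e. real, non-cursor) child after @from
 * without taking ->d_lock: walk ->d_subdirs under rcu_read_lock() and
 * retry if a concurrent cursor move bumped the seqcount meanwhile.
 */
static struct dentry *next_positive(struct dentry *parent,
				    struct list_head *from,
				    seqcount_t *dir_seq)
{
	struct dentry *found;
	unsigned int seq;

	rcu_read_lock();
	do {
		struct list_head *p = from;

		seq = read_seqcount_begin(dir_seq);
		found = NULL;
		while ((p = READ_ONCE(p->next)) != &parent->d_subdirs) {
			struct dentry *d = list_entry(p, struct dentry,
						      d_child);
			/* cursors are negative, so this skips them */
			if (simple_positive(d)) {
				found = d;
				break;
			}
		}
	} while (read_seqcount_retry(dir_seq, seq));
	rcu_read_unlock();
	return found;
}

/*
 * The one and only list_move() per dcache_readdir() call: park the
 * cursor after the last entry we reported, with ->d_lock held just for
 * this and the seqcount bumped around it so lockless walkers retry.
 */
static void move_cursor(struct dentry *cursor, struct dentry *after,
			seqcount_t *dir_seq)
{
	struct dentry *parent = cursor->d_parent;

	spin_lock(&parent->d_lock);
	write_seqcount_begin(dir_seq);
	if (after)
		list_move(&cursor->d_child, &after->d_child);
	else	/* ran off the end - park the cursor at the tail */
		list_move_tail(&cursor->d_child, &parent->d_subdirs);
	write_seqcount_end(dir_seq);
	spin_unlock(&parent->d_lock);
}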