Date: Mon, 6 Jun 2016 22:15:23 +0100
From: Al Viro
To: Linus Torvalds
Cc: Dave Hansen, "Chen, Tim C", Ingo Molnar, Davidlohr Bueso,
	"Peter Zijlstra (Intel)", Jason Low, Michel Lespinasse,
	"Paul E. McKenney", Waiman Long, LKML
Subject: Re: performance delta after VFS i_mutex=>i_rwsem conversion
Message-ID: <20160606211522.GF14480@ZenIV.linux.org.uk>
References: <5755D671.9070908@intel.com>

On Mon, Jun 06, 2016 at 01:46:23PM -0700, Linus Torvalds wrote:
> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
>
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
>
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.
>
> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.

FWIW, there's another fun issue on ramfs - dcache_readdir() does an
obscene amount of grabbing/releasing ->d_lock, and once you take the
external serialization out, a parallel getdents load hits contention on
*that*.  In spades.  And unlike a mutex (or an rwsem held exclusive),
contention on ->d_lock chews a lot of cycles.

The root cause is the use of cursors - not only do we move them more
often than we ought to (we do it on each entry reported, rather than
once before returning from dcache_readdir()), we also can't traverse
the real list entries (which remain nice and stable; another
low-hanging fruit is the pointless grabbing of ->d_lock on those)
without holding ->d_lock on the parent.

I think I have a kinda-sorta solution, but it has a problem.  What I
want to do is
	* list_move() only once per dcache_readdir()
	* ->d_lock taken for that and only for that
	* the list_move() itself surrounded with write_seqcount_{begin,end}
	  on some seqcount
	* traversal to the next real entry done under rcu_read_lock() in a
	  seqretry loop.

The only problem is where to put that seqcount (an unsigned int,
really).  ->i_dir_seq is an obvious candidate, but that'll need careful
profiling on getdents/lookup mixes...
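
Very roughly, the readdir side of that scheme could look something like
the following (an illustrative, untested sketch only - next_positive()
and move_cursor() are made-up names, the seqcount is passed in
explicitly because where it actually lives is exactly the open question
above, and the lockless ->next walk glosses over details a real patch
would have to get right):

#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/seqlock.h>
#include <linux/spinlock.h>

/*
 * Find the next positive (i.e. real, non-cursor) child after @from
 * without taking ->d_lock: walk ->d_subdirs under rcu_read_lock() and
 * retry if a concurrent cursor move bumped the seqcount meanwhile.
 */
static struct dentry *next_positive(struct dentry *parent,
				    struct list_head *from,
				    seqcount_t *dir_seq)
{
	struct dentry *found;
	unsigned int seq;

	rcu_read_lock();
	do {
		struct list_head *p = from;

		seq = read_seqcount_begin(dir_seq);
		found = NULL;
		while ((p = READ_ONCE(p->next)) != &parent->d_subdirs) {
			struct dentry *d = list_entry(p, struct dentry,
						      d_child);
			/* cursors are negative, so this skips them */
			if (simple_positive(d)) {
				found = d;
				break;
			}
		}
	} while (read_seqcount_retry(dir_seq, seq));
	rcu_read_unlock();
	return found;
}

/*
 * The one and only list_move() per dcache_readdir() call: park the
 * cursor after the last entry we reported, with ->d_lock held just for
 * this and the seqcount bumped around it so lockless walkers retry.
 */
static void move_cursor(struct dentry *cursor, struct dentry *after,
			seqcount_t *dir_seq)
{
	struct dentry *parent = cursor->d_parent;

	spin_lock(&parent->d_lock);
	write_seqcount_begin(dir_seq);
	if (after)
		list_move(&cursor->d_child, &after->d_child);
	else	/* ran off the end - park the cursor at the tail */
		list_move_tail(&cursor->d_child, &parent->d_subdirs);
	write_seqcount_end(dir_seq);
	spin_unlock(&parent->d_lock);
}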