From: Dave Hansen
To: "Chen, Tim C", Ingo Molnar, Davidlohr Bueso, "Peter Zijlstra (Intel)",
    Jason Low, Linus Torvalds, Michel Lespinasse, "Paul E. McKenney",
    Waiman Long, Al Viro, LKML
Subject: performance delta after VFS i_mutex=>i_rwsem conversion
Message-ID: <5755D671.9070908@intel.com>
Date: Mon, 6 Jun 2016 13:00:49 -0700

tl;dr: Mutexes spin more than rwsems, which makes mutexes perform
better when contending on ->i_rwsem.  But, mutexes do this at the cost
of not sleeping much, even with tons of contention.  Should we do
something to keep rwsem and mutex performance close to each other?  If
so, should mutexes be sleeping more or rwsems sleeping less?

---

I was doing some performance comparisons between 4.5 and 4.7 and
noticed a weird "blip" in unlink[1] performance where it got worse
between 4.5 and 4.7:

	https://www.sr71.net/~dave/intel/rwsem-vs-mutex.png

There are two things to notice here:
1. Although the "1" (4.5) and "2" (4.7) lines nearly converge at high
   cpu counts, 4.5 outperforms 4.7 by almost 2x at low contention.
2. 4.7 has lots of idle time.  4.5 eats lots of cpu spinning.

That was a 160-thread Westmere (~4 years old) system, but I moved to a
smaller system for some more focused testing (4 cores, Skylake), and
bisected it there down to the i_mutex switch to rwsem.  I tested on two
commits from here on out:

	[d9171b9] 1 commit before 9902af7
	[9902af7] "parallel lookups: actual switch to rwsem"

unlink takes the rwsem for write, so it should see the same level of
contention as the mutex did.  But, at 4 threads, the unlink performance
was:

	d9171b9(mutex): 689179
	9902af7(rwsem): 498325 (-27.7%)

I tracked this down to the differences between:

	rwsem_spin_on_owner() - returns false roughly 1% of the time
	mutex_spin_on_owner() - returns false roughly 0.05% of the time

The optimistic rwsem and mutex code look quite similar, but there is
one big difference: a hunk of code in rwsem_spin_on_owner() stops the
spinning for rwsems, but isn't present for mutexes in any form:

>	if (READ_ONCE(sem->owner))
>		return true; /* new owner, continue spinning */
>
>	/*
>	 * When the owner is not set, the lock could be free or
>	 * held by readers. Check the counter to verify the
>	 * state.
>	 */
>	count = READ_ONCE(sem->count);
>	return (count == 0 || count == RWSEM_WAITING_BIAS);

If I hack this out, I end up with:

	d9171b9(mutex-original): 689179
	9902af7(rwsem-hacked  ): 671706 (-2.5%)
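"Hacking it out" means something roughly like the diff below (a sketch
against kernel/locking/rwsem-xadd.c with the hunk context trimmed; the
unconditional "return true" is just the simplest way I can see to mimic
what mutex_spin_on_owner() does once the owner changes, not necessarily
the exact patch you would want to merge):

--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ rwsem_spin_on_owner(): after the sem->owner spin loop @@
-	if (READ_ONCE(sem->owner))
-		return true; /* new owner, continue spinning */
-
-	/*
-	 * When the owner is not set, the lock could be free or
-	 * held by readers. Check the counter to verify the
-	 * state.
-	 */
-	count = READ_ONCE(sem->count);
-	return (count == 0 || count == RWSEM_WAITING_BIAS);
+	/* keep spinning unconditionally, like mutex_spin_on_owner() */
+	return true;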
I think it's safe to say that this accounts for the majority of the
difference in behavior.  It also shows that the _actual_
i_mutex=>i_rwsem conversion is the issue here, and not some other part
of the parallel lookup patches.

So, as it stands today in 4.7-rc1, mutexes end up yielding higher
performance under contention.  But they don't let the system go very
idle, even under heavy contention, which seems rather wrong.  Should we
be making rwsems spin more, or mutexes spin less?

--

1. The test in question here is creating/unlinking many files in the
   same directory:
   > https://github.com/antonblanchard/will-it-scale/blob/master/tests/unlink1.c
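For anyone who doesn't want to pull the will-it-scale tree, the guts of
that test boil down to roughly the loop below (a simplified,
single-process sketch with error handling and the harness stripped out;
the /tmp path, file name, and iteration count here are just
illustrative, the real test at the URL above runs one such loop per
task and reports iterations per interval):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	unsigned long i;
	int fd;

	/* one file per task; contention is on the shared parent directory */
	snprintf(path, sizeof(path), "/tmp/unlink1.%d", getpid());

	for (i = 0; i < 1000000; i++) {
		/*
		 * Both the O_CREAT open() and the unlink() take the parent
		 * directory's ->i_rwsem (formerly ->i_mutex) for write.
		 */
		fd = open(path, O_CREAT | O_RDWR, 0600);
		if (fd >= 0)
			close(fd);
		unlink(path);
	}
	printf("%lu create/unlink iterations\n", i);
	return 0;
}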