From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757061AbcBCNcP (ORCPT );
	Wed, 3 Feb 2016 08:32:15 -0500
Received: from foss.arm.com ([217.140.101.70]:34297 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754617AbcBCNcL (ORCPT );
	Wed, 3 Feb 2016 08:32:11 -0500
Date: Wed, 3 Feb 2016 13:32:10 +0000
From: Will Deacon
To: Ingo Molnar
Cc: Linus Torvalds, Boqun Feng, Paul McKenney, Peter Zijlstra,
	"Maciej W. Rozycki", David Daney, Måns Rullgård, Ralf Baechle,
	Linux Kernel Mailing List
Subject: Re: [RFC][PATCH] mips: Fix arch_spin_unlock()
Message-ID: <20160203133210.GC20217@arm.com>
References: <20160202051904.GC1239@fixme-laptop.cn.ibm.com>
 <20160202064433.GG6719@linux.vnet.ibm.com>
 <20160202093440.GD1239@fixme-laptop.cn.ibm.com>
 <20160202175127.GO10166@arm.com>
 <20160202193037.GQ10166@arm.com>
 <20160203083338.GA1772@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160203083338.GA1772@gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 03, 2016 at 09:33:39AM +0100, Ingo Molnar wrote:
> * Will Deacon wrote:
> > On Tue, Feb 02, 2016 at 10:06:36AM -0800, Linus Torvalds wrote:
> > > On Tue, Feb 2, 2016 at 9:51 AM, Will Deacon wrote:
> > > >
> > > > Given that the vast majority of weakly ordered architectures respect
> > > > address dependencies, I would expect all of them to be hurt if they
> > > > were forced to use barrier instructions instead, even those where the
> > > > microarchitecture is fairly strongly ordered in practice.
> > >
> > > I do wonder if it would be all that noticeable, though. I don't think
> > > we've really had benchmarks.
> > >
> > > For example, most of the RCU list traversal shows up on x86 - where
> > > loads are already acquires.
> > > But they show up not because of that, but
> > > because a RCU list traversal is pretty much always going to take the
> > > cache miss.
> > >
> > > So it would actually be interesting to just try it - what happens to
> > > kernel-centric benchmarks (which are already fairly rare) on arm if we
> > > change the rcu_dereference() to be a smp_load_acquire()?
> > >
> > > Because maybe nothing happens at all. I don't think we've ever tried it.
> >
> > FWIW, and this is by no means conclusive, I hacked that up quickly and ran
> > hackbench a few times on the nearest idle arm64 system. The results were
> > consistently ~4% slower using acquire for rcu_dereference.
>
> Could you please double check that? The thing is that hackbench is a
> _notoriously_ unstable workload and very dependent on various small details
> such as kernel image layout and random per-bootup cache/memory layout
> details.

Yup. FWIW, I did try across a few reboots to arrive at the 4% figure, but
it's not conclusive (as I said :).

> In fact I'd suggest to test this via a quick runtime hack like this in
> rcupdate.h:
>
>	extern int panic_timeout;
>
>	...
>
>	if (panic_timeout)
>		smp_load_acquire(p);
>	else
>		typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p);
>
> (or so)

So the problem with this is that a LOAD->LOAD sequence isn't an ordering
hazard on ARM, so you're potentially at the mercy of the branch predictor
as to whether you get an acquire. That's not to say it won't be discarded
as soon as the conditional is resolved, but it could screw up the
benchmarking.

I'd be better off doing some runtime patching, but that's not something I
can knock up in a couple of minutes (so I'll add it to my list).

> and you could get specific numbers of noise estimations via:
>
>	triton:~/tip> perf stat --null --repeat 10 perf bench sched messaging -l 10000
>
> [...]
>
> Performance counter stats for 'perf bench sched messaging -l 10000' (10 runs):
>
>        4.616404309 seconds time elapsed                 ( +- 1.67% )
>
> note that even with a repeat count of 10 runs and a loop count 100 times larger
> than the hackbench default, the intrinsic noise of this workload was still 1.6% -
> and that does not include boot-to-boot systematic noise.

That's helpful, thanks.

Will