Re: [GIT PULL] locking/urgent for v5.12

From: Peter Zijlstra <peterz@infradead.org>
To: Waiman Long <llong@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Borislav Petkov <bp@suse.de>, Ali Saidi <alisaidi@amazon.com>,
	Steve Capper <steve.capper@arm.com>,
	Will Deacon <will@kernel.org>, x86-ml <x86@kernel.org>,
	lkml <linux-kernel@vger.kernel.org>
Subject: Re: [GIT PULL] locking/urgent for v5.12
Date: Mon, 26 Apr 2021 10:04:45 +0200
Message-ID: <YIZ0HZGqLvU+VlYh@hirez.programming.kicks-ass.net>
In-Reply-To: <9017e7b9-cd6e-caa1-b8ba-8d85f4a5d87e@redhat.com>

On Sun, Apr 25, 2021 at 01:06:52PM -0400, Waiman Long wrote:
> On 4/25/21 12:39 PM, Linus Torvalds wrote:

> > > I'm assuming it's because of the switch to try_cmpxchg by PeterZ?
> 
> Yes, try_cmpxchg() requires a variable to hold the new value as well as a
> place to return the actual value before the cmpxchg(). It is just the way
> try_cmpxchg() works.

Right; because it returns a boolean, the value actually found in memory
has to be returned through a pointer argument.
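
To illustrate the difference in calling convention, here is a minimal
user-space sketch built on C11 <stdatomic.h>. The sketch_*() names are
made up for this example and are not the kernel's definitions; the
kernel implements these per architecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical user-space stand-ins, not the kernel implementation.
 *
 * cmpxchg() style: returns the value found in *v; the caller has to
 * compare it against 'old' to learn whether the swap happened.
 */
static int sketch_cmpxchg(atomic_int *v, int old, int new)
{
	/* On failure C11 writes the observed value into 'old'. */
	atomic_compare_exchange_strong(v, &old, new);
	return old;
}

/*
 * try_cmpxchg() style: returns true on success; on failure the
 * observed value is written back through the 'old' pointer.
 */
static bool sketch_try_cmpxchg(atomic_int *v, int *old, int new)
{
	return atomic_compare_exchange_strong(v, old, new);
}
```

With the boolean form the caller needs a local variable (like "cnts")
that holds the expected value and receives the observed one on failure.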

> > > New confusion:
> > >   - Why is the truly non-critical cmpxchg using "try_cmpxchg()", when
> > >     the _first_ cmpxchg - above the loop - is not?

> At least for x86, try_cmpxchg() seems to produce slightly better assembly
> code than the regular cmpxchg(). I guess that may be one of the reasons Peter
> changed it to use try_cmpxchg(). Another reason that I can think of is to
> make the code fit in one line instead of splitting it up into two lines like
> the original version from Ali.

Right, x86 generates slightly better asm (and potentially so does any
architecture that has CAS state in condition codes) while it's a wash
for other architectures (specifically, we checked at the time that arm64
didn't generate worse code).
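
As a rough illustration of where the difference comes from, here are two
retry loops in user-space C11. The helper and function names are
hypothetical, and what the compiler actually emits depends on the
architecture and compiler version. On x86 the CMPXCHG instruction sets
ZF on success, so the boolean form can branch on the flag directly
instead of comparing the returned value again.

```c
#include <stdatomic.h>

/* Hypothetical helper mimicking cmpxchg(): returns the value found in *v. */
static int loop_cmpxchg(atomic_int *v, int old, int new)
{
	atomic_compare_exchange_strong(v, &old, new);
	return old;
}

/* cmpxchg() style: compare the returned value by hand, then reload. */
static int set_flag_cmpxchg(atomic_int *v, int flag)
{
	int old = atomic_load(v);

	for (;;) {
		int seen = loop_cmpxchg(v, old, old | flag);

		if (seen == old)
			return old;	/* the swap happened */
		old = seen;		/* lost a race, retry with the fresh value */
	}
}

/* try_cmpxchg() style: a failed attempt has already refreshed 'old'. */
static int set_flag_try_cmpxchg(atomic_int *v, int flag)
{
	int old = atomic_load(v);

	while (!atomic_compare_exchange_strong(v, &old, old | flag))
		;
	return old;
}
```

Both functions return the value seen just before the flag was set; the
second one simply has no separate compare-and-reload step in the source.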

> > > 
> > > Pre-existing confusion:
> > >   - Why is the code using "atomic_add()" to set a bit?
> > > 
> > > Yeah, yeah, neither of these are *bugs*, but Christ is that code hard
> > > to read. The "use add to set a bit" is valid because of the spinlock
> > > serialization (ie only one add can ever happen), and the
> > > cmpxchg-vs-try_cmpxchg confusion isn't buggy, it's just really really
> > > confusing that that same function is using two different - but
> > > equivalent - cmpxchg things on the same variable literally a couple of
> > > lines apart.
> As you have said, the spinlock serialization makes sure that only 1 writer
> is allowed to do that. I agree that using atomic_or() looks better in this
> case. Both of them are equivalent in this particular case.

Agreed. I think the reason is that the read side does the BIAS add/sub,
and some of that snuck into the write side. AFAIK no arch lacks the
atomic_or() intrinsic. The one that's often an issue is
atomic_fetch_or() (x86 for one has no single instruction for it :/).
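
A small user-space sketch of why add and or are interchangeable here,
and only here. The SKETCH_* constants are illustrative values for a
"writer waiting" bit and a reader bias; they are assumptions for this
example, not copied from the qrwlock headers.

```c
#include <stdatomic.h>

/* Illustrative layout only (assumed values): one flag bit below the
 * reader count. */
#define SKETCH_WAITING	0x100u	/* "a writer is waiting" flag */
#define SKETCH_R_BIAS	0x200u	/* one reader */

/* Set the flag with an add; correct only while the bit is clear. */
static unsigned int sketch_set_by_add(atomic_uint *cnts, unsigned int bit)
{
	atomic_fetch_add(cnts, bit);
	return atomic_load(cnts);
}

/* Set the flag with an or; idempotent, correct in any state. */
static unsigned int sketch_set_by_or(atomic_uint *cnts, unsigned int bit)
{
	atomic_fetch_or(cnts, bit);
	return atomic_load(cnts);
}
```

Starting from three readers and a clear flag, both give the same word.
A second add would carry into the reader count, while a second or
changes nothing; the spinlock guarantees that second add never happens.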

> > > I've pulled this, but can we please
> > > 
> > >   - make *both* of the cmpxchg's use "try_cmpxchg()" (and thus that
> > >     "cnts" variable)?
> Yes, we can certainly change the other cmpxchg() to try_cmpxchg().
> > > 
> > >   - add a comment about _why_ it's doing "atomic_add()" instead of the
> > >     much more logical "atomic_or()", and about how the spinlock serializes
> > >     it
> > > 
> > > I'm assuming the "atomic_add()" is simply because many more
> > > architectures have that as an actual intrinsic atomic. I understand.
> > > But it's really really not obvious from the code.
> > > 
> I will post a patch to make the suggested change to qrwlock.c.

Thanks.