Re: [EXT] Re: [RFC] Disable lockref on arm64

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>,
	Jan Glauber <jglauber@marvell.com>,
	"will.deacon@arm.com" <will.deacon@arm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: [EXT] Re: [RFC] Disable lockref on arm64
Date: Mon, 6 May 2019 10:13:43 -0700	[thread overview]
Message-ID: <CAHk-=whGAef6+mZi-_+rfTDxXrLDw-jrOiy3MNEpLAkC5scGRg@mail.gmail.com> (raw)
In-Reply-To: <20190506061100.GA8465@dc5-eodlnx05.marvell.com>

On Sun, May 5, 2019 at 11:13 PM Jayachandran Chandrasekharan Nair
<jnair@marvell.com> wrote:
>
> > It's not normal, and it's not inevitable.
>
> If you look at the code, the CAS failure is followed by a yield
> before retrying the CAS. Yield on arm64 is expected to be a hint
> to release resources so that other threads/cores can make progress.
> Under heavy contention, I expect the current code to behave the way
> I noted in my last mail, with the issue with fairness as well.

Yes, this is a good point.

It's entirely possibly that _particularly_ for CAS loops - where we
get an updated value that caused the CAS to fail - we should not yield
in between every CAS. Exactly so that we don't force the CPU to flush
the cacheline that it just got and make the CAS loop do lots of
unnecessary work and just keep looping.

That said, I think right now all ARM implementations actually treat
the yield instruction as a pure nop, so I assume this isn't actually
the root of the ThunderX2 problems.

But we could do something like "unroll the cmpxchg loop twice, only
yield every other iteration" exactly for the case wher cpu_relax()
might encourage the CPU to actually release the cacheline.

If you find that something like that would help ThunderX2, we can most
certainly try that kind of thing. It makes sense, and unrolling the
cmpxchg loop once might even help with branch prediction (ie the
common case might be that the *first* cmpxchg succeeds, and that's
what the branch predictors get primed for, but then if the first one
actually fails, we fall through to the contention case, and now maybe
the branch predictor ends up being primed for *that* case for the
second one).

Yes, the above is wild handwaving - but it's the kind of thing we
could easily do if the hardware people have a good argument for them.

I'm not asking for _perfect_ hardware. I'm just asking for hardware to
not be actively antagonistic to something fundamental like a cmpxchg
loop.

> Your larger point seems to be that the hardware has smarter to
> scale standard locking implementations when adding cores, and
> be graceful even in extremely high contention cases. Yes, this
> is something we should be looking at for ThunderX3.

Yes. Also, my point really is that no core should ever be in the
situation that it fetches a cache-line, only to then release it
immediately again without making any progress. You do want a *certain*
amount of stickiness to the cachelines.

Of course, you very much do not want to overdo it - I'm talking about
keeping the cacheline around for tens of cycles, not for hundreds of
cycles. Enough that if you have small code snippets that are in and
out quickly, you don't see bouncing at that level.

Because you *don't* want the stickiness to be such that once one CPU
has the cacheline, it will stay with that CPU (or that socket) as long
as the CPU hammers it all the time.

So for example, in the case of a cmpxchg loop, maybe you'll get a
couple of branch mispredicts the first few times round the loop (first
because the original value read was a plain read and the cacheline was
shared rather than exclusive, but then after that because the branch
predictor saw previous cases where there was no contention, and the
loop exited immediately).

So maybe the first couple of iterations the core might be "slow
enough" to not get the cmpxchg loop working well. But not a lot more
than that (again, unless something special happens, like an interrupt
storm that keeps interrupting the loop).

End result: I do think the cache coherency needs to be a *bit* smarter
than just a "oh, another core wants this line, I will now immediately
try to satisfy it". And yes, those kinds of things get more important
the more cores you have. You don't need a whole lot of smarts if you
have just a couple of cores, because you'll be making progress
regardless.

And note that I really think this kind of support absolutely falls on
hardware. Software simply _cannot_ make some good tuning decision
using backoff or whatever. Yes, yes, I know people have done things
like that historically (Sparc and SunOS were famous for it), but part
of doing them historically was that

 (a) they didn't know better

 (b) the software was often tuned to particular hardware and loads (or
at least a very small set of hardware)

 (c) there wasn't necessarily better hardware options available

but none of the above is true today.

Things like backoff loops in software are not only hw specific, they
tend to have seriously bad fairness behavior (and they depend on
everybody honoring the backoff, which can be a security issue), so
then you have to add even more complexity to deal with the fairness
issues, and your cmpxchg loop becomes something truly disgusting.

Which is why I say: "do it right in hardware".

                   Linus

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel