From: Linus Torvalds <torvalds@linux-foundation.org>
To: David Howells <dhowells@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>,
	Mateusz Guzik <mjguzik@gmail.com>,
	linux-arch <linux-arch@vger.kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	tony.luck@intel.com, viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org,
	Jan Glauber <jan.glauber@gmail.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: Memory transaction instructions
Date: Mon, 16 Jan 2023 08:59:16 -0800
Message-ID: <CAHk-=whjFwzEq0u04=n=t7-kNJdX0HkAOjAMjmLXDDycJ+j9yQ@mail.gmail.com>
In-Reply-To: <1966767.1673878095@warthog.procyon.org.uk>

On Mon, Jan 16, 2023 at 6:08 AM David Howells <dhowells@redhat.com> wrote:
>
> I'm not sure how relevant it is to the topic, but I seem to remember you
> having a go at implementing spinlocks with x86_64 memory transaction
> instructions a while back.  Do you have any thoughts on whether these
> instructions are ever likely to become something we can use?

Nope.

Not only were they buggy, the actual cost of using them was prohibitive too.

The only implementation I ever did was what then became 'lockref' (so
this was about a decade ago), and using transactional memory was
actually quite noticeably *slower* than just using a spinlock in the
common cases (particularly the uncontended case - which is actually
the common one by far, despite some benchmarks).
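
(For context, a rough user-space sketch of the lockref idea - the
names and layout here are made up and it is much simplified, the real
thing lives in lib/lockref.c: pack the lock word and the refcount into
one 64-bit value, bump the count with a single cmpxchg as long as the
lock is free, and fall back to actually taking the spinlock otherwise.)

#include <stdatomic.h>
#include <stdint.h>

struct lockref_sketch {
    /* low 32 bits: tiny lock word, high 32 bits: refcount */
    _Atomic uint64_t lock_count;
};

#define LOCK_MASK ((uint64_t)0xffffffff)

static void get_ref(struct lockref_sketch *lr)
{
    uint64_t old = atomic_load(&lr->lock_count);

    /* Fast path: only while nobody holds the lock. */
    while (!(old & LOCK_MASK)) {
        uint64_t new = old + ((uint64_t)1 << 32);

        /* On failure 'old' is reloaded and we re-check. */
        if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
            return;     /* count bumped, lock never touched */
    }

    /* Slow path: take the real spinlock and update under it
     * (not shown here). */
}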

And while the argument could be that transactional memory has improved
in the last decade (narrator: "It hasn't"), the problem is actually
quite fundamental.

The whole "best case scenario" for transactional memory concept is
literally optimized and designed for the rare circumstance - the case
where you have contention, but where the locking part is basically an
idempotent operation (so "lock+unlock" ends up undoing each other and
can use shared cachelines rather than bouncing the cacheline).
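
(To make that concrete, here's a hedged sketch of lock elision using
the x86 RTM intrinsics - _xbegin/_xend/_xtest from <immintrin.h>,
built with -mrtm. The real_lock/lock_is_held names are invented for
the example and the fallback policy is deliberately naive; real
implementations such as glibc's elide the lock word itself. The point
is only to show the model: the elided path merely *reads* the lock, so
"lock+unlock" never dirties the lock cacheline.)

#include <immintrin.h>
#include <pthread.h>

static pthread_mutex_t real_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_is_held;    /* shadows the mutex for the elided path */

static void elided_lock(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* The lock word goes into our read-set; anyone who writes it
         * (i.e. takes the real lock) aborts our transaction. */
        if (!lock_is_held)
            return;         /* "acquired" without writing anything */
        _xabort(0xff);      /* lock really is held, give up */
    }
    /* Every abort lands here: actually take the lock. */
    pthread_mutex_lock(&real_lock);
    lock_is_held = 1;
}

static void elided_unlock(void)
{
    if (_xtest()) {
        _xend();            /* commit; the lock line stayed shared */
    } else {
        lock_is_held = 0;
        pthread_mutex_unlock(&real_lock);
    }
}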

And contention pretty much happens in one situation, and one situation
only: in a combination of (a) bad locking and (b) benchmarks.

And for the kernel, where we don't have bad locking, and where we
actually use fine-grained locks that are _near_ the data that we are
locking (the lockref of the dcache is obviously one example of that,
but the skbuff queue you mention is almost certainly exactly the same
situation), the lock is right by the data that the lock protects, and
the "shared lock cacheline" model simply does not work. You'll bounce
the data, and most likely you'll touch the same lock cacheline too.
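
(For illustration, what "the lock is right by the data" tends to look
like - a made-up struct, but shaped like the dcache case: the lock
shares a cacheline with the fields it protects, so even a "read-only"
elided lock acquisition has to pull in the very line that the data
writes are going to dirty anyway.)

struct object {
    int           lock;       /* stand-in for a spinlock/lockref    */
    unsigned long refcount;
    void          *payload;   /* protected data, same cacheline as  */
    unsigned long flags;      /*   the lock itself                  */
};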

I was quite excited about transactional memory, but have (obviously)
come to the conclusion that it was one of those "looks good in theory,
but is just garbage in reality" ideas.

So this isn't a "the x86 implementation was bad" issue. This is a
"transactional memory is a bad idea" situation. It complicates the
most important part of the core CPU (the memory pipeline) enormously,
and it doesn't actually work.

Now, with that "transactional memory is garbage" in mind, let me just
mention that there are obviously other circumstances:

 (a) horrendously bad locking

If you have a single centralized lock, and you really have serious
contention on it in real life all the time (and not just under
microbenchmarks), and the data that gets touched is spread out all
over, and you simply cannot fix the locking, then all those
theoretical advantages of transactional memory can actually come into
play.

But again: even then, there's an absolutely huge cost to the memory
pipeline.  So that advantage really doesn't come free. Nobody has ever
gotten transactional memory to actually work well in real life. Not
Intel, not IBM with powerpc HTM.

That should tell you something. The hw complexity cost is very very real.

But Intel actually had some loads where TSX helped. I think it was SAP
HANA, and I really think that what you should take away from that is
that SAP HANA locking is likely horrible garbage inside. But maybe I'm
biased, and maybe SAP HANA is lovely, and maybe it just happened to
work really well for TSX. I've never actually looked at that load, but
if I were a betting man...

 (b) very *limited* transactions are probably a great idea

LL/SC is arguably a form of "transactional memory", just with a single
word. Now, I think atomics are usually better than LL/SC when it
really is just a single word (because most of the time, "single word"
also means "simple semantics", and while a CMPXCHG loop isn't lovely
when the semantics are a bit more complex, it's nice and portable and
works at a higher level than assembler sequences, and as such is
actually a *lot* better than LL/SC in practice).
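
(A generic example of the kind of CMPXCHG loop meant here - plain C11
atomics, not kernel code: the update rule is slightly more than "add
one" (don't take a reference once the count has already hit zero), so
it can't be a single atomic add, but it stays portable C and never
drops down to an architecture-specific LL/SC sequence.)

#include <stdatomic.h>
#include <stdbool.h>

static bool get_unless_zero(_Atomic int *refs)
{
    int old = atomic_load(refs);

    do {
        if (old == 0)
            return false;   /* object already dying, don't resurrect */
        /* on failure 'old' is refreshed and the rule is re-applied */
    } while (!atomic_compare_exchange_weak(refs, &old, old + 1));

    return true;
}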

But a "N-word LL/SC with forward progress guarantees" would be
objectively stronger than atomics, and would allow for doing low-level
data structures in ways that atomics don't. Doubly linked lists
together with a lock check would argue for "N" being ~5 memory
operations. So I do believe in potentially *limited* memory
transactions, when they can be made limited enough that you have

 (1) forward progress guarantees, with no "separate fallback on
contention" garbage, and

 (2) a hw implementation that can stay simple - just a store buffer
and an (architecturally) very limited number of cacheline accesses
that you can actually order in HW so that you don't have ABBA
situations.
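
(The rough count behind the ~5 number above, on an ordinary
doubly-linked-list unlink; this is plain C, and the comments only mark
what a hypothetical N-word transaction would have to cover - no such
hardware primitive is assumed to actually exist.)

struct node {
    struct node *prev, *next;
};

static void unlink_node(struct node *n)
{
    struct node *prev = n->prev;   /* 1: load n->prev        */
    struct node *next = n->next;   /* 2: load n->next        */

    prev->next = next;             /* 3: store prev->next    */
    next->prev = prev;             /* 4: store next->prev    */

    /* 5: the lock word, which the transaction would only need to
     * read and verify as unlocked for the whole operation. */
}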

But the generic kind of transactional memory that Intel tried with TSX
(and IBM with HTM) is pure garbage, and almost certainly always will
be.

And all of the above is obviously just my opinionated input on it. But
I just happen to be right.

                Linus
