[RFC PATCH] powerpc: Optimise barriers for fully ordered atomics

From: Nicholas Piggin <npiggin@gmail.com>
To: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Subject: [RFC PATCH] powerpc: Optimise barriers for fully ordered atomics
Date: Sat, 13 Apr 2024 03:25:26 +1000
Message-ID: <20240412172529.783268-1-npiggin@gmail.com>

"Fully ordered" atomics (RMW that return a value) are said to have a
full barrier before and after the atomic operation. This is implemented
as:

  hwsync
  larx
  ...
  stcx.
  bne-
  hwsync

This is slow on POWER processors because hwsync and stcx. require a
round-trip to the nest (~= L2 cache). The hwsyncs can be avoided with
the sequence:

  lwsync
  larx
  ...
  stcx.
  bne-
  isync

lwsync prevents all reorderings except store/load reordering, so the
larx could be executed ahead of a prior store becoming visible. However,
the stcx. is a store, so it is ordered by the lwsync against all prior
accesses, and if the value in memory has been modified since the larx, it
will fail. So the point at which the larx executes is not a concern,
because the stcx. always verifies that the memory was unchanged.

The isync prevents subsequent instructions being executed before the
stcx. executes, and stcx. is necessarily visible to the system after
it executes, so there is no opportunity for it (or prior stores, thanks
to lwsync) to become visible after a subsequent load or store.
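
For illustration, the proposed sequence can be sketched as a 32-bit
fetch-and-add in GCC-style inline assembly. This is a minimal sketch, not
the kernel's implementation: fetch_add_full() is a hypothetical name, and
the non-PowerPC branch is only an assumed fallback so that the example
builds on other hosts.

```c
/*
 * Sketch only: fetch-and-add with lwsync on entry and isync on exit.
 * fetch_add_full() is a hypothetical helper, not a kernel API.
 */
static inline int fetch_add_full(int *p, int a)
{
#if defined(__powerpc__)
	int old, tmp;

	__asm__ __volatile__(
	"	lwsync\n"		/* entry: order prior accesses before the stwcx. */
	"1:	lwarx	%0,0,%3\n"	/* load word and take a reservation */
	"	add	%1,%0,%4\n"	/* compute the new value */
	"	stwcx.	%1,0,%3\n"	/* store only if the reservation is still held */
	"	bne-	1b\n"		/* reservation lost: retry */
	"	isync\n"		/* exit: hold back later instructions */
	: "=&r" (old), "=&r" (tmp), "+m" (*p)
	: "r" (p), "r" (a)
	: "cc", "memory");

	return old;
#else
	/* Assumed fallback for non-PowerPC hosts: compiler builtin. */
	return __atomic_fetch_add(p, a, __ATOMIC_SEQ_CST);
#endif
}
```

The function returns the value observed by the larx, as a value-returning
RMW does.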

This sequence requires only one L2 round-trip and so is around 2x faster,
as measured on a POWER10 with back-to-back atomic ops on cached memory.
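
A back-to-back measurement of this kind could look roughly like the
following userspace sketch. It is not the harness used for the number
above: ns_per_op(), fetch_add_fn and builtin_fetch_add() are hypothetical
names, and the builtin stands in for whichever barrier sequence is under
test.

```c
#include <time.h>

/* Hypothetical signature for the atomic op being timed. */
typedef int (*fetch_add_fn)(int *p, int a);

/* Stand-in op: whatever the compiler emits for a seq_cst RMW. */
static int builtin_fetch_add(int *p, int a)
{
	return __atomic_fetch_add(p, a, __ATOMIC_SEQ_CST);
}

/*
 * Run `iters` back-to-back atomic increments on one cached location
 * and return the average CPU time per operation in nanoseconds.
 */
static double ns_per_op(fetch_add_fn op, int *counter, unsigned long iters)
{
	clock_t start, end;
	unsigned long i;

	start = clock();
	for (i = 0; i < iters; i++)
		(void)op(counter, 1);
	end = clock();

	return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Passing two implementations of the op in turn, with a large iteration
count, gives a rough per-op comparison. It only covers the uncontended,
cache-hot case.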

[ It remains to be seen whether this is always faster when there is other
activity going on, and whether it is faster on non-POWER CPUs, or perhaps
on older ones like the 970 that might not optimise isync so much. ]

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/synch.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index b0b4c64870d7..0b1718eb9a40 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -60,8 +60,8 @@ static inline void ppc_after_tlbiel_barrier(void)
 	MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
 #define PPC_ACQUIRE_BARRIER	 "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER	 stringify_in_c(LWSYNC) "\n"
-#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(sync) "\n"
-#define PPC_ATOMIC_EXIT_BARRIER	 "\n" stringify_in_c(sync) "\n"
+#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(LWSYNC) "\n"
+#define PPC_ATOMIC_EXIT_BARRIER	 "\n" stringify_in_c(isync) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
 #define PPC_RELEASE_BARRIER
-- 
2.43.0