From: Uros Bizjak <ubizjak@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Russell King <linux@armlinux.org.uk>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	Heiko Carstens <hca@linux.ibm.com>,
	"the arch/x86 maintainers" <x86@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul McKenney <paulmck@linux.vnet.ibm.com>
Subject: Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
Date: Thu, 26 May 2022 10:54:58 +0200
Message-ID: <CAFULd4bc54+_FmJ=f++zzz99mR8r5c11-Y49pz86Yb8G3dyJpA@mail.gmail.com>
In-Reply-To: <CAHk-=wh1XeaxWXG5QziGA4ds918UnW1hO924kusgVB-wGj+9Og@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2593 bytes --]

On Wed, May 25, 2022 at 6:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, May 25, 2022 at 7:40 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro.
> > x86 CMPXCHG instruction returns success in ZF flag, so this
> > change saves a compare after cmpxchg (and related move instruction
> > in front of cmpxchg). The main loop of lockref_get improves from:
>
> Ack on this one regardless of the 32-bit x86 question.
>
> HOWEVER.
>
> I'd like other architectures to pipe up too, because I think right now
> x86 is the only one that implements that "arch_try_cmpxchg()" family
> of operations natively, and I think the generic fallback for when it
> is missing might be kind of nasty.
>
> Maybe it ends up generating ok code, but it's also possible that it
> just didn't matter when it was only used in one place in the
> scheduler.
>
> The lockref_get() case can be quite hot under some loads, it would be
> sad if this made other architectures worse.
>
> Anyway, maybe that try_cmpxchg() fallback is fine, and works out well
> on architectures that use load-locked / store-conditional as-is.

Attached to this message, please find the testcase that analyses the
various CMPXCHG_LOOP variants. There you will find the old, the
fallback and the new cmpxchg loop, together with the corresponding
lockref_get_* functions.

The testcase models the x86 cmpxchg8b instruction and can be compiled
for 64-bit as well as 32-bit targets. As can be seen from the
experiment, the try_cmpxchg fallback creates EXACTLY THE SAME code for
the 64-bit target as the unpatched code. For the 32-bit target, one
extra dead reg-reg 32-bit move remains in the generated fallback
assembly (a compiler artefact of gcc-10.3 when handling double-word
64-bit moves on the x86_32 target).
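
For reference, the attachment is self-contained and can be compiled
with plain gcc for both targets, e.g.:

  gcc -O2 -S lockref-test.c
  gcc -m32 -O2 -S lockref-test.c

so the generated assembly of the three lockref_get_* variants can be
compared directly.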

From the above experiment, we can conclude that the patched lockref.c
compiles to the same code with the try_cmpxchg fallback as the
original code does. I expect the same to be observed on other targets
as well.

When the new code involving the native try_cmpxchg is compiled,
impressive size gains for x86_32 can be seen: the main loop shrinks
from 44 bytes to 30 bytes.

Several transitions from cmpxchg to try_cmpxchg can already be found
in the git repository. The above experiment confirms that the
generated fallback assembly is at least as good as the original
unpatched version, and can be better when the target implements
try_cmpxchg natively (as x86 does, by consuming the ZF result of
cmpxchg). It also looks to me that several other hot spots throughout
the code could be improved by converting them from cmpxchg to
try_cmpxchg; the conversion itself is mechanical, as sketched below.
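
Schematically, using the helpers from the attached testcase (a sketch,
not taken from any particular call site):

	u64 old, new, prev;

	/* Before: compare the cmpxchg64() return value by hand. */
	old = *lockref;
	for (;;) {
		new = old + 1;
		prev = cmpxchg64(lockref, old, new);
		if (likely(prev == old))
			break;
		old = prev;
	}

	/* After: try_cmpxchg64() refreshes 'old' on failure itself. */
	old = *lockref;
	do {
		new = old + 1;
	} while (unlikely(!try_cmpxchg64(lockref, &old, new)));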

Uros.

[-- Attachment #2: lockref-test.c --]
[-- Type: text/x-csrc, Size: 2606 bytes --]

#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

#define __aligned_u64 u64 __attribute__((aligned(8)))

#define LOCK_PREFIX "lock "

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

static inline void cpu_relax(void)
{
	asm volatile("rep; nop" ::: "memory");
}

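/*
 * Legacy interface: cmpxchg64() returns the value found at *ptr, and
 * the caller compares it with 'old' to detect success.  This models
 * x86 cmpxchg8b, which compares EDX:EAX with the 64-bit memory operand
 * and, on a match, stores ECX:EBX there (setting ZF accordingly).
 */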
static inline u64 cmpxchg64(volatile u64 *ptr, u64 old, u64 new)
{
	u64 prev;
	asm volatile(LOCK_PREFIX "cmpxchg8b %1"
		     : "=A" (prev),
		       "+m" (*ptr)
		     : "b" ((u32)new),
		       "c" ((u32)(new >> 32)),
		       "0" (old)
		     : "memory");
	return prev;
}

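/*
 * Old-style loop: compare the returned value with the expected one by
 * hand.  The 't' parameter models the "spinlock not taken" test of the
 * real lockref code.
 */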
#define CMPXCHG_LOOP_OLD(CODE, SUCCESS) do {				\
	int retry = 100;						\
	__aligned_u64 old = *lockref;					\
	while (t) {							\
		__aligned_u64 new = old, prev = old;			\
		CODE							\
		old = cmpxchg64(lockref, old, new);			\
		if (likely(old == prev)) {				\
			SUCCESS;					\
		}							\
		if (!--retry)						\
			break;						\
		cpu_relax();						\
	}								\
} while (0)

void lockref_get_old(u64 *lockref, _Bool t)
{
	CMPXCHG_LOOP_OLD(
		new++;
	,
		return;
	);
}

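/*
 * Generic fallback: emulate try_cmpxchg64() on top of cmpxchg64(),
 * writing the value actually found back through _oldp on failure.
 * This mirrors the kernel's generic try_cmpxchg fallback.
 */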
#define try_cmpxchg64_fallback(_ptr, _oldp, _new)	\
({ \
	typeof(*(_ptr)) *___op = (_oldp), ___o = *___op, ___r; \
	___r = cmpxchg64((_ptr), ___o, (_new)); \
	if (unlikely(___r != ___o)) \
		*___op = ___r; \
	likely(___r == ___o); \
})

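/* The same loop, written against the fallback above. */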
#define CMPXCHG_LOOP_FALLBACK(CODE, SUCCESS) do {				\
	int retry = 100;							\
	__aligned_u64 old = *lockref;						\
	while (t) {								\
		__aligned_u64 new = old;					\
		CODE								\
		if (likely(try_cmpxchg64_fallback(lockref, &old, new))) {	\
			SUCCESS;						\
		}								\
		if (!--retry)							\
			break;							\
		cpu_relax();							\
	}									\
} while (0)

void lockref_get_fallback(u64 *lockref, _Bool t)
{
	CMPXCHG_LOOP_FALLBACK(
		new++;
	,
		return;
	);
}

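/*
 * Native variant: consume the ZF result of cmpxchg8b directly via the
 * "=@ccz" flags-output constraint, so no separate compare instruction
 * is needed.
 */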
static inline _Bool try_cmpxchg64(volatile u64 *ptr, u64 *pold, u64 new)
{
	_Bool success;
	u64 old = *pold;
	asm volatile(LOCK_PREFIX "cmpxchg8b %[ptr]"
		     : "=@ccz" (success),
		       [ptr] "+m" (*ptr),
		       "+A" (old)
		     : "b" ((u32)new),
		       "c" ((u32)(new >> 32))
		     : "memory");

	if (unlikely(!success))
		*pold = old;
	return success;
}

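/* The same loop, written against the native try_cmpxchg64(). */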
#define CMPXCHG_LOOP_NEW(CODE, SUCCESS) do {				\
	int retry = 100;						\
	__aligned_u64 old = *lockref;					\
	while (t) {							\
		__aligned_u64 new = old;				\
		CODE							\
		if (likely(try_cmpxchg64(lockref, &old, new))) {	\
			SUCCESS;					\
		}							\
		if (!--retry)						\
			break;						\
		cpu_relax();						\
	}								\
} while (0)

void lockref_get_new(u64 *lockref, _Bool t)
{
	CMPXCHG_LOOP_NEW(
		new++;
	,
		return;
	);
}
