linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] arm64: Support the TSO memory model
@ 2024-04-11  0:51 Hector Martin
  2024-04-11  0:51 ` [PATCH 1/4] prctl: Introduce PR_{SET,GET}_MEM_MODEL Hector Martin
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Hector Martin @ 2024-04-11  0:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Marc Zyngier, Mark Rutland
  Cc: Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux, Hector Martin

x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
reason, x86 emulation on baseline ARM64 systems requires very expensive
memory model emulation. Having hardware that supports this natively is
therefore very attractive. Such hardware, in fact, exists. This series
adds support for userspace to identify when TSO is available and
toggle it on, if supported.

Some ARM64 CPUs intrinsically implement the TSO memory model, while
others expose it as an IMPDEF control. Apple Silicon SoCs are in the
latter category. Using TSO for x86 emulation on chips that support it
has been shown to provide a massive performance boost [1].

Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
is initially not implemented for any architectures.

Patch 2 implements it for CPUs which are known, to the best of my
knowledge, to always implement the TSO memory model unconditionally.
This uses the cpufeature mechanism to only enable this if *all* cores in
the system meet the requirements.

Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
register across context switches. This register contains IMPDEF flags
related to CPU execution, and on Apple CPUs this is where the runtime
TSO toggle bit is implemented. Other CPUs could conceivably benefit from
this scaffolding if they also use ACTLR_EL1 for things that could
ostensibly be runtime controlled and context-switched. For this to work,
ACTLR_EL1 must have a uniform layout across all cores in the system.

Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
feature is detected (on all CPUs, which also implies the uniform
ACTLR_EL1 layout).

This series has been brewing in the downstream Asahi Linux tree [2] for
a while now, and ships to thousands of users. A subset have been using it
with FEX-Emu, which already supports this feature. This rebase on
v6.9-rc1 is only build-tested (all intermediate commits with and without
the config enabled, on ARM64) but I'll update the downstream branch soon
with this version and get it pushed out to users/testers.

The Apple support works on bare metal and *should* work exactly the same
way on macOS VMs (as alluded to by Zayd in his independent submission [3]),
though I haven't personally verified this. KVM support for this is left
for a future patchset.

(Apologies for the large Cc: list; I want to make sure nobody who got
Cced on Zayd's alternate take is left out of this one.) 

[1] https://fex-emu.com/FEX-2306/
[2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
[3] https://lore.kernel.org/lkml/20240410211652.16640-1-zayd_qumsieh@apple.com/

To: Catalin Marinas <catalin.marinas@arm.com>
To: Will Deacon <will@kernel.org>
To: Marc Zyngier <maz@kernel.org>
To: Mark Rutland <mark.rutland@arm.com>
Cc: Zayd Qumsieh <zayd_qumsieh@apple.com>
Cc: Justin Lu <ih_justin@apple.com>
Cc: Ryan Houdek <Houdek.Ryan@fex-emu.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Miguel Luis <miguel.luis@oracle.com>
Cc: Joey Gouly <joey.gouly@arm.com>
Cc: Christoph Paasch <cpaasch@apple.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Dawei Li <dawei.li@shingroup.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Florent Revest <revest@chromium.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Andy Chiu <andy.chiu@sifive.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Zev Weiss <zev@bewilderbeest.net>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: Asahi Linux <asahi@lists.linux.dev>

Signed-off-by: Hector Martin <marcan@marcan.st>
---
Hector Martin (4):
      prctl: Introduce PR_{SET,GET}_MEM_MODEL
      arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
      arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
      arm64: Implement Apple IMPDEF TSO memory model control

 arch/arm64/Kconfig                        | 14 ++++++
 arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
 arch/arm64/include/asm/cpufeature.h       | 10 +++++
 arch/arm64/include/asm/processor.h        |  3 ++
 arch/arm64/kernel/Makefile                |  3 +-
 arch/arm64/kernel/cpufeature.c            | 11 ++---
 arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
 arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
 arch/arm64/kernel/setup.c                 |  8 ++++
 arch/arm64/tools/cpucaps                  |  2 +
 include/linux/memory_ordering_model.h     | 11 +++++
 include/uapi/linux/prctl.h                |  5 +++
 kernel/sys.c                              | 21 +++++++++
 13 files changed, 229 insertions(+), 6 deletions(-)
---
base-commit: 4cece764965020c22cff7665b18a012006359095
change-id: 20240411-tso-e86fdceb94b8

Best regards,
-- 
Hector Martin <marcan@marcan.st>



* [PATCH 1/4] prctl: Introduce PR_{SET,GET}_MEM_MODEL
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
@ 2024-04-11  0:51 ` Hector Martin
  2024-04-11  0:51 ` [PATCH 2/4] arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs Hector Martin
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Hector Martin @ 2024-04-11  0:51 UTC (permalink / raw)

On some architectures, it is possible to query and/or change the CPU
memory model. This allows userspace to switch to a stricter memory model
for performance reasons, such as when emulating code for another
architecture where that model is the default.

Introduce two prctls to allow userspace to query and set the memory
model for a thread. Two models are initially defined:

- PR_SET_MEM_MODEL_DEFAULT requests the default memory model for the
  architecture.
- PR_SET_MEM_MODEL_TSO requests the x86 TSO memory model.

PR_SET_MEM_MODEL is allowed to set a stricter memory model than
requested if available, in which case it will return successfully. If
the requested memory model cannot be fulfilled, it will return an error.
The memory model that was actually set can be queried by a subsequent
call to PR_GET_MEM_MODEL.

Examples:
- On a CPU with no support for a memory model at least as strong as
  TSO, PR_SET_MEM_MODEL(PR_SET_MEM_MODEL_TSO) fails.
- On a CPU with runtime-configurable TSO support, PR_SET_MEM_MODEL can
  toggle the memory model between DEFAULT and TSO at will.
- On a CPU where the only memory model is at least as strict as TSO,
  PR_GET_MEM_MODEL will return PR_SET_MEM_MODEL_DEFAULT, and
  PR_SET_MEM_MODEL(PR_SET_MEM_MODEL_TSO) will return success but leave
  the memory model at PR_SET_MEM_MODEL_DEFAULT. This implies that the
  default is in fact at least as strict as TSO.
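
A userspace caller might exercise the semantics above roughly as
follows. This is only a minimal sketch and not part of the patch: the
constants are copied from this series' include/uapi/linux/prctl.h change
and defined locally because installed headers will not have them, and
the helper names mem_model_request_tso() and mem_model_query() are
hypothetical. On a kernel without this series, both calls are expected
to fail with EINVAL.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values proposed by this series; not present in released uapi headers. */
#ifndef PR_GET_MEM_MODEL
#define PR_GET_MEM_MODEL		0x6d4d444c
#define PR_SET_MEM_MODEL		0x4d4d444c
#define PR_SET_MEM_MODEL_DEFAULT	0
#define PR_SET_MEM_MODEL_TSO		1
#endif

/*
 * Hypothetical helper: request TSO for the calling thread.
 * Returns 0 on success or a negative errno value. The unused prctl
 * arguments are passed as zero because the proposed handler rejects
 * nonzero values with EINVAL.
 */
static int mem_model_request_tso(void)
{
	if (prctl(PR_SET_MEM_MODEL, (unsigned long)PR_SET_MEM_MODEL_TSO,
		  0UL, 0UL, 0UL) < 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: return the memory model currently in effect
 * (one of PR_SET_MEM_MODEL_*), or a negative errno value.
 */
static int mem_model_query(void)
{
	int ret = prctl(PR_GET_MEM_MODEL, 0UL, 0UL, 0UL, 0UL);

	return ret < 0 ? -errno : ret;
}
```

If mem_model_request_tso() succeeds and mem_model_query() then reports
PR_SET_MEM_MODEL_DEFAULT, the caller is in the third case listed above:
the default model is already at least as strict as TSO.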

Signed-off-by: Hector Martin <marcan@marcan.st>
---
 include/linux/memory_ordering_model.h | 11 +++++++++++
 include/uapi/linux/prctl.h            |  5 +++++
 kernel/sys.c                          | 21 +++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/include/linux/memory_ordering_model.h b/include/linux/memory_ordering_model.h
new file mode 100644
index 000000000000..267a12ca6630
--- /dev/null
+++ b/include/linux/memory_ordering_model.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_MEMORY_ORDERING_MODEL_H
+#define __ASM_MEMORY_ORDERING_MODEL_H
+
+/* Arch hooks to implement the PR_{GET,SET}_MEM_MODEL prctls */
+
+struct task_struct;
+int arch_prctl_mem_model_get(struct task_struct *t);
+int arch_prctl_mem_model_set(struct task_struct *t, unsigned long val);
+
+#endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..961216093f11 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,9 @@ struct prctl_mm_map {
 # define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK	0xc
 # define PR_RISCV_V_VSTATE_CTRL_MASK		0x1f
 
+#define PR_GET_MEM_MODEL	0x6d4d444c
+#define PR_SET_MEM_MODEL	0x4d4d444c
+# define PR_SET_MEM_MODEL_DEFAULT	0
+# define PR_SET_MEM_MODEL_TSO		1
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index f8e543f1e38a..6af659a9f826 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -45,6 +45,7 @@
 #include <linux/version.h>
 #include <linux/ctype.h>
 #include <linux/syscall_user_dispatch.h>
+#include <linux/memory_ordering_model.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2442,6 +2443,16 @@ static int prctl_get_auxv(void __user *addr, unsigned long len)
 	return sizeof(mm->saved_auxv);
 }
 
+int __weak arch_prctl_mem_model_get(struct task_struct *t)
+{
+	return -EINVAL;
+}
+
+int __weak arch_prctl_mem_model_set(struct task_struct *t, unsigned long val)
+{
+	return -EINVAL;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2757,6 +2768,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_RISCV_V_GET_CONTROL:
 		error = RISCV_V_GET_CONTROL();
 		break;
+	case PR_GET_MEM_MODEL:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = arch_prctl_mem_model_get(me);
+		break;
+	case PR_SET_MEM_MODEL:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = arch_prctl_mem_model_set(me, arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;

-- 
2.44.0



* [PATCH 2/4] arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
  2024-04-11  0:51 ` [PATCH 1/4] prctl: Introduce PR_{SET,GET}_MEM_MODEL Hector Martin
@ 2024-04-11  0:51 ` Hector Martin
  2024-04-11  0:51 ` [PATCH 3/4] arm64: Introduce scaffolding to add ACTLR_EL1 to thread state Hector Martin
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Hector Martin @ 2024-04-11  0:51 UTC (permalink / raw)

Some ARM64 implementations are known to always use the TSO memory model.
Add trivial support for the PR_{GET,SET}_MEM_MODEL prctl, which allows
userspace to learn this fact.

Known TSO implementations:
- Nvidia Denver
- Nvidia Carmel
- Fujitsu A64FX
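
The matching rule can be modelled outside the kernel as below. This is
a sketch under stated assumptions rather than the code in this patch:
the implementer and part numbers are the ones I believe
arch/arm64/include/asm/cputype.h uses for these cores, and the names
midr_is_fixed_tso() and system_is_fixed_tso() are hypothetical. It
shows the per-core MIDR check and the "all cores must match" rule
described in the cover letter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MIDR_EL1 layout: implementer in bits [31:24], part number in [15:4]. */
#define MIDR_IMPLEMENTOR(midr)	(((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)	(((midr) >> 4) & 0xfffu)

/* Assumed values, mirroring the kernel's cputype.h definitions. */
#define IMP_NVIDIA	0x4eu
#define IMP_FUJITSU	0x46u
#define PART_DENVER	0x003u
#define PART_CARMEL	0x004u
#define PART_A64FX	0x001u

struct tso_cpu {
	uint32_t imp;
	uint32_t part;
};

/* Cores listed in this commit message as always using TSO. */
static const struct tso_cpu fixed_tso_cpus[] = {
	{ IMP_NVIDIA, PART_DENVER },
	{ IMP_NVIDIA, PART_CARMEL },
	{ IMP_FUJITSU, PART_A64FX },
};

/* True if the MIDR names a listed core; variant and revision are ignored. */
static bool midr_is_fixed_tso(uint32_t midr)
{
	size_t i;

	for (i = 0; i < sizeof(fixed_tso_cpus) / sizeof(fixed_tso_cpus[0]); i++) {
		if (MIDR_IMPLEMENTOR(midr) == fixed_tso_cpus[i].imp &&
		    MIDR_PARTNUM(midr) == fixed_tso_cpus[i].part)
			return true;
	}
	return false;
}

/* Models the system-wide rule: the feature counts only if every core matches. */
static bool system_is_fixed_tso(const uint32_t *midrs, size_t ncpus)
{
	size_t i;

	if (ncpus == 0)
		return false;
	for (i = 0; i < ncpus; i++) {
		if (!midr_is_fixed_tso(midrs[i]))
			return false;
	}
	return true;
}
```

A mixed system, for example one A64FX core next to a core that is not on
the list, fails the system-wide check even though one of its cores
matches.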

Signed-off-by: Hector Martin <marcan@marcan.st>
---
 arch/arm64/Kconfig                    |  9 +++++++++
 arch/arm64/include/asm/cpufeature.h   |  4 ++++
 arch/arm64/kernel/Makefile            |  3 ++-
 arch/arm64/kernel/cpufeature.c        | 11 +++++-----
 arch/arm64/kernel/cpufeature_impdef.c | 38 +++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/process.c           | 24 ++++++++++++++++++++++
 arch/arm64/tools/cpucaps              |  1 +
 7 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b11c98b3e84..f8e66fe44ff4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2162,6 +2162,15 @@ config ARM64_DEBUG_PRIORITY_MASKING
 	  If unsure, say N
 endif # ARM64_PSEUDO_NMI
 
+config ARM64_MEMORY_MODEL_CONTROL
+	bool "Runtime memory model control"
+	help
+	  Some ARM64 CPUs support runtime switching of the CPU memory
+	  model, which can be useful to emulate other CPU architectures
+	  which have different memory models. Say Y to enable support
+	  for the PR_SET_MEM_MODEL/PR_GET_MEM_MODEL prctl() calls on
+	  CPUs with this feature.
+
 config RELOCATABLE
 	bool "Build a relocatable kernel image" if EXPERT
 	select ARCH_HAS_RELR
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 8b904a757bd3..fb215b0e7529 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -1032,6 +1032,10 @@ static inline bool cpu_has_lpa2(void)
 #endif
 }
 
+void __init init_cpucap_indirect_list_impdef(void);
+void __init init_cpucap_indirect_list_from_array(const struct arm64_cpu_capabilities *caps);
+bool cpufeature_matches(u64 reg, const struct arm64_cpu_capabilities *entry);
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 763824963ed1..5eaaee7b8358 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -33,7 +33,8 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
 			   return_address.o cpuinfo.o cpu_errata.o		\
 			   cpufeature.o alternative.o cacheinfo.o		\
 			   smp.o smp_spin_table.o topology.o smccc-call.o	\
-			   syscall.o proton-pack.o idle.o patching.o pi/
+			   syscall.o proton-pack.o idle.o patching.o pi/	\
+			   cpufeature_impdef.o
 
 obj-$(CONFIG_COMPAT)			+= sys32.o signal32.o			\
 					   sys_compat.o
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 56583677c1f2..e39ab93ad683 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1028,7 +1028,7 @@ static void init_cpu_ftr_reg(u32 sys_reg, u64 new)
 extern const struct arm64_cpu_capabilities arm64_errata[];
 static const struct arm64_cpu_capabilities arm64_features[];
 
-static void __init
+void __init
 init_cpucap_indirect_list_from_array(const struct arm64_cpu_capabilities *caps)
 {
 	for (; caps->matches; caps++) {
@@ -1540,8 +1540,8 @@ has_always(const struct arm64_cpu_capabilities *entry, int scope)
 	return true;
 }
 
-static bool
-feature_matches(u64 reg, const struct arm64_cpu_capabilities *entry)
+bool
+cpufeature_matches(u64 reg, const struct arm64_cpu_capabilities *entry)
 {
 	int val, min, max;
 	u64 tmp;
@@ -1594,14 +1594,14 @@ has_user_cpuid_feature(const struct arm64_cpu_capabilities *entry, int scope)
 	if (!mask)
 		return false;
 
-	return feature_matches(val, entry);
+	return cpufeature_matches(val, entry);
 }
 
 static bool
 has_cpuid_feature(const struct arm64_cpu_capabilities *entry, int scope)
 {
 	u64 val = read_scoped_sysreg(entry, scope);
-	return feature_matches(val, entry);
+	return cpufeature_matches(val, entry);
 }
 
 const struct cpumask *system_32bit_el0_cpumask(void)
@@ -3486,6 +3486,7 @@ void __init setup_boot_cpu_features(void)
 	 * handle the boot CPU.
 	 */
 	init_cpucap_indirect_list();
+	init_cpucap_indirect_list_impdef();
 
 	/*
 	 * Detect broken pseudo-NMI. Must be called _before_ the call to
diff --git a/arch/arm64/kernel/cpufeature_impdef.c b/arch/arm64/kernel/cpufeature_impdef.c
new file mode 100644
index 000000000000..bb04a8e3d79d
--- /dev/null
+++ b/arch/arm64/kernel/cpufeature_impdef.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Contains implementation-defined CPU feature definitions.
+ */
+
+#include <asm/cpufeature.h>
+
+#ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
+static bool has_tso_fixed(const struct arm64_cpu_capabilities *entry, int scope)
+{
+	/* List of CPUs that always use the TSO memory model */
+	static const struct midr_range fixed_tso_list[] = {
+		MIDR_ALL_VERSIONS(MIDR_NVIDIA_DENVER),
+		MIDR_ALL_VERSIONS(MIDR_NVIDIA_CARMEL),
+		MIDR_ALL_VERSIONS(MIDR_FUJITSU_A64FX),
+		{ /* sentinel */ }
+	};
+
+	return is_midr_in_range_list(read_cpuid_id(), fixed_tso_list);
+}
+#endif
+
+static const struct arm64_cpu_capabilities arm64_impdef_features[] = {
+#ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
+	{
+		.desc = "TSO memory model (Fixed)",
+		.capability = ARM64_HAS_TSO_FIXED,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_tso_fixed,
+	},
+#endif
+	{},
+};
+
+void __init init_cpucap_indirect_list_impdef(void)
+{
+	init_cpucap_indirect_list_from_array(arm64_impdef_features);
+}
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 4ae31b7af6c3..7920056bad3e 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -41,6 +41,7 @@
 #include <linux/thread_info.h>
 #include <linux/prctl.h>
 #include <linux/stacktrace.h>
+#include <linux/memory_ordering_model.h>
 
 #include <asm/alternative.h>
 #include <asm/compat.h>
@@ -513,6 +514,25 @@ void update_sctlr_el1(u64 sctlr)
 	isb();
 }
 
+#ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
+int arch_prctl_mem_model_get(struct task_struct *t)
+{
+	return PR_SET_MEM_MODEL_DEFAULT;
+}
+
+int arch_prctl_mem_model_set(struct task_struct *t, unsigned long val)
+{
+	if (alternative_has_cap_unlikely(ARM64_HAS_TSO_FIXED) &&
+	    val == PR_SET_MEM_MODEL_TSO)
+		return 0;
+
+	if (val == PR_SET_MEM_MODEL_DEFAULT)
+		return 0;
+
+	return -EINVAL;
+}
+#endif
+
 /*
  * Thread switching.
  */
@@ -651,6 +671,10 @@ void arch_setup_new_exec(void)
 		arch_prctl_spec_ctrl_set(current, PR_SPEC_STORE_BYPASS,
 					 PR_SPEC_ENABLE);
 	}
+
+#ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
+	arch_prctl_mem_model_set(current, PR_SET_MEM_MODEL_DEFAULT);
+#endif
 }
 
 #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 62b2838a231a..daa6b9495402 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -52,6 +52,7 @@ HAS_STAGE2_FWB
 HAS_TCR2
 HAS_TIDCP1
 HAS_TLB_RANGE
+HAS_TSO_FIXED
 HAS_VA52
 HAS_VIRT_HOST_EXTN
 HAS_WFXT

-- 
2.44.0



* [PATCH 3/4] arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
  2024-04-11  0:51 ` [PATCH 1/4] prctl: Introduce PR_{SET,GET}_MEM_MODEL Hector Martin
  2024-04-11  0:51 ` [PATCH 2/4] arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs Hector Martin
@ 2024-04-11  0:51 ` Hector Martin
  2024-04-11  0:51 ` [PATCH 4/4] arm64: Implement Apple IMPDEF TSO memory model control Hector Martin
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Hector Martin @ 2024-04-11  0:51 UTC (permalink / raw)

Some CPUs expose IMPDEF features in ACTLR_EL1 that can be meaningfully
controlled per-thread (like TSO control on Apple cores). Add the basic
scaffolding to save/restore this register as part of context switching.

This mechanism is disabled by default both by config symbol and via a
runtime check, which ensures it is never triggered unless the system is
known to need it for some feature (which also implies that the layout of
ACTLR_EL1 is uniform between all CPU core types).
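
The save/restore behaviour can be illustrated with a small userspace
model. Everything here is a stand-in and not kernel code: fake_actlr_el1
plays the role of the system register, fake_has_actlr_state plays the
role of the runtime check, and struct fake_thread is a hypothetical
reduction of the per-thread state. The point is only to show that the
register is left alone unless the system opts in, and that each thread
gets back the value it last ran with.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the ACTLR_EL1 system register (hypothetical). */
static uint64_t fake_actlr_el1;

/* Stand-in for the runtime check; the kernel derives this from CPU capabilities. */
static bool fake_has_actlr_state;

/* Hypothetical reduction of the per-thread state to the one saved field. */
struct fake_thread {
	uint64_t actlr;
};

/* Save the outgoing thread's register value, then load the incoming one. */
static void fake_actlr_switch(struct fake_thread *prev, struct fake_thread *next)
{
	/* On systems that do not opt in, the register is never touched. */
	if (!fake_has_actlr_state)
		return;

	prev->actlr = fake_actlr_el1;
	fake_actlr_el1 = next->actlr;
}
```

Switching from thread A to B and back to A in this model leaves the
register holding whatever value A last ran with, which is the property
a per-thread control bit depends on.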

Signed-off-by: Hector Martin <marcan@marcan.st>
---
 arch/arm64/Kconfig                  |  3 +++
 arch/arm64/include/asm/cpufeature.h |  5 +++++
 arch/arm64/include/asm/processor.h  |  3 +++
 arch/arm64/kernel/process.c         | 25 +++++++++++++++++++++++++
 arch/arm64/kernel/setup.c           |  8 ++++++++
 5 files changed, 44 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f8e66fe44ff4..9b3593b34cce 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -408,6 +408,9 @@ config KASAN_SHADOW_OFFSET
 config UNWIND_TABLES
 	bool
 
+config ARM64_ACTLR_STATE
+	bool
+
 source "arch/arm64/Kconfig.platforms"
 
 menu "Kernel Features"
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index fb215b0e7529..46ab37f8f4d8 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -909,6 +909,11 @@ static inline unsigned int get_vmid_bits(u64 mmfr1)
 	return 8;
 }
 
+static __always_inline bool system_has_actlr_state(void)
+{
+	return false;
+}
+
 s64 arm64_ftr_safe_value(const struct arm64_ftr_bits *ftrp, s64 new, s64 cur);
 struct arm64_ftr_reg *get_arm64_ftr_reg(u32 sys_id);
 
diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index f77371232d8c..d43c5791a35e 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -184,6 +184,9 @@ struct thread_struct {
 	u64			sctlr_user;
 	u64			svcr;
 	u64			tpidr2_el0;
+#ifdef CONFIG_ARM64_ACTLR_STATE
+	u64			actlr;
+#endif
 };
 
 static inline unsigned int thread_get_vl(struct thread_struct *thread,
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 7920056bad3e..117f80e16aac 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -372,6 +372,11 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		if (system_supports_tpidr2())
 			p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
 
+#ifdef CONFIG_ARM64_ACTLR_STATE
+		if (system_has_actlr_state())
+			p->thread.actlr = read_sysreg(actlr_el1);
+#endif
+
 		if (stack_start) {
 			if (is_compat_thread(task_thread_info(p)))
 				childregs->compat_sp = stack_start;
@@ -533,6 +538,25 @@ int arch_prctl_mem_model_set(struct task_struct *t, unsigned long val)
 }
 #endif
 
+#ifdef CONFIG_ARM64_ACTLR_STATE
+/*
+ * IMPDEF control register ACTLR_EL1 handling. Some CPUs use this to
+ * expose features that can be controlled by userspace.
+ */
+static void actlr_thread_switch(struct task_struct *next)
+{
+	if (!system_has_actlr_state())
+		return;
+
+	current->thread.actlr = read_sysreg(actlr_el1);
+	write_sysreg(next->thread.actlr, actlr_el1);
+}
+#else
+static inline void actlr_thread_switch(struct task_struct *next)
+{
+}
+#endif
+
 /*
  * Thread switching.
  */
@@ -550,6 +574,7 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	ssbs_thread_switch(next);
 	erratum_1418040_thread_switch(next);
 	ptrauth_thread_switch_user(next);
+	actlr_thread_switch(next);
 
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 65a052bf741f..35342f633a85 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -359,6 +359,14 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	 */
 	init_task.thread_info.ttbr0 = phys_to_ttbr(__pa_symbol(reserved_pg_dir));
 #endif
+#ifdef CONFIG_ARM64_ACTLR_STATE
+	/* Store the boot CPU ACTLR_EL1 value as the default. This will only
+	 * be actually restored during context switching iff the platform is
+	 * known to use ACTLR_EL1 for exposable features and its layout is
+	 * known to be the same on all CPUs.
+	 */
+	init_task.thread.actlr = read_sysreg(actlr_el1);
+#endif
 
 	if (boot_args[1] || boot_args[2] || boot_args[3]) {
 		pr_err("WARNING: x1-x3 nonzero in violation of boot protocol:\n"

-- 
2.44.0



* [PATCH 4/4] arm64: Implement Apple IMPDEF TSO memory model control
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
                   ` (2 preceding siblings ...)
  2024-04-11  0:51 ` [PATCH 3/4] arm64: Introduce scaffolding to add ACTLR_EL1 to thread state Hector Martin
@ 2024-04-11  0:51 ` Hector Martin
  2024-04-11  1:37 ` [PATCH 0/4] arm64: Support the TSO memory model Neal Gompa
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Hector Martin @ 2024-04-11  0:51 UTC (permalink / raw)

Apple CPUs may implement the TSO memory model as an optional
configurable mode. This allows x86 emulators to simplify their
load/store handling, greatly increasing performance.

Expose this via the prctl PR_SET_MEM_MODEL_TSO mechanism. We use the
Apple IMPDEF AIDR_EL1 register to check for the availability of TSO
mode, and enable this codepath on all CPUs with an Apple implementer.

This relies on the ACTLR_EL1 thread state scaffolding introduced
earlier.
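
The bit manipulation involved is small enough to model in plain C. The
sketch below is not the kernel implementation: the two bit positions are
taken from the apple_cpufeature.h added by this patch, while the
MEM_MODEL_* names and the functions aidr_has_tso(),
actlr_apply_mem_model() and actlr_to_mem_model() are hypothetical
stand-ins that operate on ordinary integers instead of system registers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as defined in this patch's apple_cpufeature.h. */
#define AIDR_APPLE_TSO	(UINT64_C(1) << 9)
#define ACTLR_APPLE_TSO	(UINT64_C(1) << 1)

/* Hypothetical local names mirroring PR_SET_MEM_MODEL_{DEFAULT,TSO}. */
#define MEM_MODEL_DEFAULT	0
#define MEM_MODEL_TSO		1

/* True if an AIDR_EL1 value advertises the runtime TSO toggle. */
static bool aidr_has_tso(uint64_t aidr)
{
	return (aidr & AIDR_APPLE_TSO) != 0;
}

/*
 * Update a saved per-thread ACTLR value for the requested model,
 * leaving every other bit untouched. Returns 0 or -EINVAL.
 */
static int actlr_apply_mem_model(uint64_t *actlr, unsigned long model)
{
	switch (model) {
	case MEM_MODEL_TSO:
		*actlr |= ACTLR_APPLE_TSO;
		return 0;
	case MEM_MODEL_DEFAULT:
		*actlr &= ~ACTLR_APPLE_TSO;
		return 0;
	default:
		return -EINVAL;
	}
}

/* Report which model a saved ACTLR value corresponds to. */
static int actlr_to_mem_model(uint64_t actlr)
{
	return (actlr & ACTLR_APPLE_TSO) ? MEM_MODEL_TSO : MEM_MODEL_DEFAULT;
}
```

Keeping the update as a read-modify-write of a single bit matters here
because the remaining ACTLR_EL1 bits are IMPDEF and are not meant to be
changed by this control.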

Signed-off-by: Hector Martin <marcan@marcan.st>
---
 arch/arm64/Kconfig                        |  2 ++
 arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++++++++++
 arch/arm64/include/asm/cpufeature.h       |  3 ++-
 arch/arm64/kernel/cpufeature_impdef.c     | 23 +++++++++++++++++++++++
 arch/arm64/kernel/process.c               | 22 ++++++++++++++++++++++
 arch/arm64/tools/cpucaps                  |  1 +
 6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9b3593b34cce..2f3eedd955c9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2167,6 +2167,8 @@ endif # ARM64_PSEUDO_NMI
 
 config ARM64_MEMORY_MODEL_CONTROL
 	bool "Runtime memory model control"
+	default ARCH_APPLE
+	select ARM64_ACTLR_STATE
 	help
 	  Some ARM64 CPUs support runtime switching of the CPU memory
 	  model, which can be useful to emulate other CPU architectures
diff --git a/arch/arm64/include/asm/apple_cpufeature.h b/arch/arm64/include/asm/apple_cpufeature.h
new file mode 100644
index 000000000000..4370d91ffa3e
--- /dev/null
+++ b/arch/arm64/include/asm/apple_cpufeature.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_APPLE_CPUFEATURES_H
+#define __ASM_APPLE_CPUFEATURES_H
+
+#include <linux/bits.h>
+#include <asm/sysreg.h>
+
+#define AIDR_APPLE_TSO_SHIFT	9
+#define AIDR_APPLE_TSO		BIT(9)
+
+#define ACTLR_APPLE_TSO_SHIFT	1
+#define ACTLR_APPLE_TSO		BIT(1)
+
+#endif
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 46ab37f8f4d8..a191000d88c2 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -911,7 +911,8 @@ static inline unsigned int get_vmid_bits(u64 mmfr1)
 
 static __always_inline bool system_has_actlr_state(void)
 {
-	return false;
+	return IS_ENABLED(CONFIG_ARM64_ACTLR_STATE) &&
+		alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE);
 }
 
 s64 arm64_ftr_safe_value(const struct arm64_ftr_bits *ftrp, s64 new, s64 cur);
diff --git a/arch/arm64/kernel/cpufeature_impdef.c b/arch/arm64/kernel/cpufeature_impdef.c
index bb04a8e3d79d..9325d1eb12f4 100644
--- a/arch/arm64/kernel/cpufeature_impdef.c
+++ b/arch/arm64/kernel/cpufeature_impdef.c
@@ -4,8 +4,21 @@
  */
 
 #include <asm/cpufeature.h>
+#include <asm/apple_cpufeature.h>
 
 #ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
+static bool has_apple_feature(const struct arm64_cpu_capabilities *entry, int scope)
+{
+	u64 val;
+	WARN_ON(scope != SCOPE_SYSTEM);
+
+	if (read_cpuid_implementor() != ARM_CPU_IMP_APPLE)
+		return false;
+
+	val = read_sysreg(aidr_el1);
+	return cpufeature_matches(val, entry);
+}
+
 static bool has_tso_fixed(const struct arm64_cpu_capabilities *entry, int scope)
 {
 	/* List of CPUs that always use the TSO memory model */
@@ -22,6 +35,16 @@ static bool has_tso_fixed(const struct arm64_cpu_capabilities *entry, int scope)
 
 static const struct arm64_cpu_capabilities arm64_impdef_features[] = {
 #ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
+	{
+		.desc = "TSO memory model (Apple)",
+		.capability = ARM64_HAS_TSO_APPLE,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_apple_feature,
+		.field_pos = AIDR_APPLE_TSO_SHIFT,
+		.field_width = 1,
+		.sign = FTR_UNSIGNED,
+		.min_field_value = 1,
+	},
 	{
 		.desc = "TSO memory model (Fixed)",
 		.capability = ARM64_HAS_TSO_FIXED,
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 117f80e16aac..34a19ecfb630 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -44,6 +44,7 @@
 #include <linux/memory_ordering_model.h>
 
 #include <asm/alternative.h>
+#include <asm/apple_cpufeature.h>
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
 #include <asm/cacheflush.h>
@@ -522,6 +523,10 @@ void update_sctlr_el1(u64 sctlr)
 #ifdef CONFIG_ARM64_MEMORY_MODEL_CONTROL
 int arch_prctl_mem_model_get(struct task_struct *t)
 {
+	if (alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE) &&
+		t->thread.actlr & ACTLR_APPLE_TSO)
+		return PR_SET_MEM_MODEL_TSO;
+
 	return PR_SET_MEM_MODEL_DEFAULT;
 }
 
@@ -531,6 +536,23 @@ int arch_prctl_mem_model_set(struct task_struct *t, unsigned long val)
 	    val == PR_SET_MEM_MODEL_TSO)
 		return 0;
 
+	if (alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE)) {
+		WARN_ON(!system_has_actlr_state());
+
+		switch (val) {
+		case PR_SET_MEM_MODEL_TSO:
+			t->thread.actlr |= ACTLR_APPLE_TSO;
+			break;
+		case PR_SET_MEM_MODEL_DEFAULT:
+			t->thread.actlr &= ~ACTLR_APPLE_TSO;
+			break;
+		default:
+			return -EINVAL;
+		}
+		write_sysreg(t->thread.actlr, actlr_el1);
+		return 0;
+	}
+
 	if (val == PR_SET_MEM_MODEL_DEFAULT)
 		return 0;
 
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index daa6b9495402..62f9ca9ce44b 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -52,6 +52,7 @@ HAS_STAGE2_FWB
 HAS_TCR2
 HAS_TIDCP1
 HAS_TLB_RANGE
+HAS_TSO_APPLE
 HAS_TSO_FIXED
 HAS_VA52
 HAS_VIRT_HOST_EXTN

-- 
2.44.0



* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
                   ` (3 preceding siblings ...)
  2024-04-11  0:51 ` [PATCH 4/4] arm64: Implement Apple IMPDEF TSO memory model control Hector Martin
@ 2024-04-11  1:37 ` Neal Gompa
  2024-04-11 13:28 ` Will Deacon
  2024-04-16  2:11 ` Zayd Qumsieh
  6 siblings, 0 replies; 30+ messages in thread
From: Neal Gompa @ 2024-04-11  1:37 UTC (permalink / raw)
  To: Hector Martin
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Mark Rutland,
	Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Wed, Apr 10, 2024 at 8:51 PM Hector Martin <marcan@marcan.st> wrote:
>
> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.
>
> Some ARM64 CPUs intrinsically implement the TSO memory model, while
> others expose it as an IMPDEF control. Apple Silicon SoCs are in the
> latter category. Using TSO for x86 emulation on chips that support it
> has been shown to provide a massive performance boost [1].
>
> Patch 1 introduces the PR_{SET,GET}_MEM_MODEL userspace control, which
> is initially not implemented for any architectures.
>
> Patch 2 implements it for CPUs which are known, to the best of my
> knowledge, to always implement the TSO memory model unconditionally.
> This uses the cpufeature mechanism to only enable this if *all* cores in
> the system meet the requirements.
>
> Patch 3 adds the scaffolding necessary to save/restore the ACTLR_EL1
> register across context switches. This register contains IMPDEF flags
> related to CPU execution, and on Apple CPUs this is where the runtime
> TSO toggle bit is implemented. Other CPUs could conceivably benefit from
> this scaffolding if they also use ACTLR_EL1 for things that could
> ostensibly be runtime controlled and context-switched. For this to work,
> ACTLR_EL1 must have a uniform layout across all cores in the system.
>
> Finally, patch 4 implements PR_{SET,GET}_MEM_MODEL for Apple CPUs by
> hooking it up to flip the appropriate ACTLR_EL1 bit when the Apple TSO
> feature is detected (on all CPUs, which also implies the uniform
> ACTLR_EL1 layout).
>
> This series has been brewing in the downstream Asahi Linux tree for a
> while now, and ships to thousands of users. A subset have been using it
> with FEX-Emu, which already supports this feature. This rebase on
> v6.9-rc1 is only build-tested (all intermediate commits with and without
> the config enabled, on ARM64) but I'll update the downstream branch soon
> with this version and get it pushed out to users/testers.
>
> The Apple support works on bare metal and *should* work exactly the same
> way on macOS VMs (as alluded to by Zayd in his independent submission [3]),
> though I haven't personally verified this. KVM support for this is left
> for a future patchset.
>
> (Apologies for the large Cc: list; I want to make sure nobody who got
> Cced on Zayd's alternate take is left out of this one.)
>
> [1] https://fex-emu.com/FEX-2306/
> [2] https://github.com/AsahiLinux/linux/tree/bits/220-tso
> [3] https://lore.kernel.org/lkml/20240410211652.16640-1-zayd_qumsieh@apple.com/
>
> To: Catalin Marinas <catalin.marinas@arm.com>
> To: Will Deacon <will@kernel.org>
> To: Marc Zyngier <maz@kernel.org>
> To: Mark Rutland <mark.rutland@arm.com>
> Cc: Zayd Qumsieh <zayd_qumsieh@apple.com>
> Cc: Justin Lu <ih_justin@apple.com>
> Cc: Ryan Houdek <Houdek.Ryan@fex-emu.org>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: Miguel Luis <miguel.luis@oracle.com>
> Cc: Joey Gouly <joey.gouly@arm.com>
> Cc: Christoph Paasch <cpaasch@apple.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Sami Tolvanen <samitolvanen@google.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Joel Granados <j.granados@samsung.com>
> Cc: Dawei Li <dawei.li@shingroup.cn>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Florent Revest <revest@chromium.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Stefan Roesch <shr@devkernel.io>
> Cc: Andy Chiu <andy.chiu@sifive.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Zev Weiss <zev@bewilderbeest.net>
> Cc: Ondrej Mosnacek <omosnace@redhat.com>
> Cc: Miguel Ojeda <ojeda@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Asahi Linux <asahi@lists.linux.dev>
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
> Hector Martin (4):
>       prctl: Introduce PR_{SET,GET}_MEM_MODEL
>       arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs
>       arm64: Introduce scaffolding to add ACTLR_EL1 to thread state
>       arm64: Implement Apple IMPDEF TSO memory model control
>
>  arch/arm64/Kconfig                        | 14 ++++++
>  arch/arm64/include/asm/apple_cpufeature.h | 15 +++++++
>  arch/arm64/include/asm/cpufeature.h       | 10 +++++
>  arch/arm64/include/asm/processor.h        |  3 ++
>  arch/arm64/kernel/Makefile                |  3 +-
>  arch/arm64/kernel/cpufeature.c            | 11 ++---
>  arch/arm64/kernel/cpufeature_impdef.c     | 61 ++++++++++++++++++++++++++
>  arch/arm64/kernel/process.c               | 71 +++++++++++++++++++++++++++++++
>  arch/arm64/kernel/setup.c                 |  8 ++++
>  arch/arm64/tools/cpucaps                  |  2 +
>  include/linux/memory_ordering_model.h     | 11 +++++
>  include/uapi/linux/prctl.h                |  5 +++
>  kernel/sys.c                              | 21 +++++++++
>  13 files changed, 229 insertions(+), 6 deletions(-)
> ---
> base-commit: 4cece764965020c22cff7665b18a012006359095
> change-id: 20240411-tso-e86fdceb94b8
>

The series looks good to me.

Reviewed-by: Neal Gompa <neal@gompa.dev>



-- 
真実はいつも一つ!/ Always, there's only one truth!

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
                   ` (4 preceding siblings ...)
  2024-04-11  1:37 ` [PATCH 0/4] arm64: Support the TSO memory model Neal Gompa
@ 2024-04-11 13:28 ` Will Deacon
  2024-04-11 14:19   ` Hector Martin
                     ` (2 more replies)
  2024-04-16  2:11 ` Zayd Qumsieh
  6 siblings, 3 replies; 30+ messages in thread
From: Will Deacon @ 2024-04-11 13:28 UTC (permalink / raw)
  To: Hector Martin
  Cc: Catalin Marinas, Marc Zyngier, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

Hi Hector,

On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
> reason, x86 emulation on baseline ARM64 systems requires very expensive
> memory model emulation. Having hardware that supports this natively is
> therefore very attractive. Such hardware, in fact, exists. This series
> adds support for userspace to identify when TSO is available and
> toggle it on, if supported.

I'm probably going to make myself hugely unpopular here, but I have a
strong objection to this patch series as it stands. I firmly believe
that providing a prctl() to query and toggle the memory model to/from
TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

It's not difficult to envisage this TSO switch being abused for native
arm64 applications:

  * A program no longer crashes when TSO is enabled, so the developer
    just toggles TSO to meet a deadline.

  * Some legacy x86 sources are being ported to arm64 but concurrency
    is hard so the developer just enables TSO to (mostly) avoid thinking
    about it.

  * Some binaries in a distribution exhibit instability which goes away
    in TSO mode, so a taskset-like program is used to run them with TSO
    enabled.

In all these cases, we end up with native arm64 applications that will
either fail to load or will crash in subtle ways on CPUs without the TSO
feature. Assuming that the application cannot be fixed, a better
approach would be to recompile using stronger instructions (e.g.
LDAR/STLR) so that at least the resulting binary is portable. Now, it's
true that some existing CPUs are TSO by design (this is a perfectly
valid implementation of the arm64 memory model), but I think there's a
big difference between quietly providing more ordering guarantees than
software may be relying on and providing a mechanism to discover,
request and ultimately rely upon the stronger behaviour.
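
As a rough illustration of the portable alternative mentioned above, the
sketch below publishes a value with C11 release/acquire atomics. On
arm64, compilers typically lower these to STLR and LDAR (or LDAPR), so
the ordering does not depend on the CPU being TSO. The names payload,
ready, producer and run_message_passing are made up for this example
and do not come from the series.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;      /* plain data, published through `ready` */
static atomic_int ready; /* flag written with release, read with acquire */

static void *producer(void *arg)
{
	(void)arg;
	payload = 42;
	/* Release store: orders the payload write before the flag write. */
	atomic_store_explicit(&ready, 1, memory_order_release);
	return NULL;
}

/* Returns the payload observed once the flag is seen set, or -1 on error. */
static int run_message_passing(void)
{
	pthread_t t;
	int seen, rc;

	payload = 0;
	atomic_store_explicit(&ready, 0, memory_order_relaxed);
	if (pthread_create(&t, NULL, producer, NULL) != 0)
		return -1;

	/* Acquire load: orders the flag read before the payload read. */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		;
	seen = payload;

	rc = pthread_join(t, NULL);
	assert(rc == 0);
	(void)rc;
	return seen;
}
```

With relaxed ordering on both sides, the same code could legitimately
observe a stale payload on a weakly ordered CPU while appearing to work
on a TSO one, which is the kind of hidden dependency described above.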

An alternative option is to go down the SPARC RMO route and just enable
TSO statically (although presumably in the firmware) for Apple silicon.
I'm assuming that has a performance impact for native code?

Will

P.S. I briefly pondered the idea of the kernel toggling the bit in the
ELF loader when e.g. it sees an x86 machine type but I suspect that
doesn't really help with existing emulators and you'd still need a way
to tell the emulator whether or not it was enabled.

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11 13:28 ` Will Deacon
@ 2024-04-11 14:19   ` Hector Martin
  2024-04-11 18:43     ` Hector Martin
  2024-04-19 16:58     ` Will Deacon
  2024-05-02  0:16   ` Zayd Qumsieh
  2024-05-07 10:24   ` Alex Bennée
  2 siblings, 2 replies; 30+ messages in thread
From: Hector Martin @ 2024-04-11 14:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Marc Zyngier, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On 2024/04/11 22:28, Will Deacon wrote:
> Hi Hector,
> 
> On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
>> x86 CPUs implement a stricter memory model (TSO) than ARM64. For this
>> reason, x86 emulation on baseline ARM64 systems requires very expensive
>> memory model emulation. Having hardware that supports this natively is
>> therefore very attractive. Such hardware, in fact, exists. This series
>> adds support for userspace to identify when TSO is available and
>> toggle it on, if supported.
> 
> I'm probably going to make myself hugely unpopular here, but I have a
> strong objection to this patch series as it stands. I firmly believe
> that providing a prctl() to query and toggle the memory model to/from
> TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

I honestly doubt this should be a significant concern right now, given
that only a subset of implementations actually support this. Yes,
developers can do stupid stuff, but we already have gone through this
kind of story with other situations (e.g. 16K and 64K page support on
ARM64 breaking 4K assumptions) and things have been fixed over time.

In particular, I highly suspect Asahi Linux and Apple Silicon have done
a lot more good for the ARM64 ecosystem by getting developers to fix
their page size mess than they will do bad by somehow encouraging TSO
abuse. We've even found new memory model issues thanks to the
architecture's deep out-of-order character (remember that mess with
Linux atomics? :-)). So far, in the year+ we've had this patchset
downstream, not a single developer has proposed abusing it for something
that isn't an x86 emulator.

There's a pragmatic argument here: since we need this, and it absolutely
will continue to ship downstream if rejected, it doesn't make much
difference for fragmentation risk, does it? The vast majority of
Linux-on-Mac users are likely to continue running downstream kernels for
the foreseeable future anyway to get newer features and hardware support
faster than they can be upstreamed. So not allowing this upstream
doesn't really change the landscape vis-a-vis being able to abuse this
or not, it just makes our life harder by forcing us to carry more
patches forever.

> It's not difficult to envisage this TSO switch being abused for native
> arm64 applications:
> 
>   * A program no longer crashes when TSO is enabled, so the developer
>     just toggles TSO to meet a deadline.
> 
>   * Some legacy x86 sources are being ported to arm64 but concurrency
>     is hard so the developer just enables TSO to (mostly) avoid thinking
>     about it.

Both of these rely on the developer *knowing* what TSO is and why it
fixes this. I posit that a developer who knows what TSO is is also
likely to know why this is a stupid hack, that they shouldn't be doing
it, and that it won't work on all machines.

> 
>   * Some binaries in a distribution exhibit instability which goes away
>     in TSO mode, so a taskset-like program is used to run them with TSO
>     enabled.

Since the flag is cleared on execve, this third one isn't generally
possible as far as I know.

> In all these cases, we end up with native arm64 applications that will
> either fail to load or will crash in subtle ways on CPUs without the TSO
> feature. Assuming that the application cannot be fixed, a better
> approach would be to recompile using stronger instructions (e.g.
> LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> true that some existing CPUs are TSO by design (this is a perfectly
> valid implementation of the arm64 memory model), but I think there's a
> big difference between quietly providing more ordering guarantees than
> software may be relying on and providing a mechanism to discover,
> request and ultimately rely upon the stronger behaviour.

The problem is "just" using stronger instructions is much more
expensive, as emulators have demonstrated. If TSO didn't serve a
practical purpose I wouldn't be submitting this, but it does. This is
basically non-negotiable for x86 emulation; if this is rejected
upstream, it will forever live as a downstream patch used by the entire
gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
explicitly targeting, given our efforts with microVMs for 4K page size
support and the upcoming Vulkan drivers).

That said, I have a pragmatic proposal here. The "fixed TSO" part of the
implementation should be harmless, since those CPUs would correctly run
poorly-written applications anyway so the API is moot. That leaves Apple
Silicon. Our native kernels are and likely always will be 16K page size,
due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
natively but with very broken functionality including no GPU
acceleration) plus performance differences that favor 16K. How about we
gate the TSO functionality to only be supported on 4K kernel builds?
This would make them only work in 4K VMs on Asahi Linux. We are very
explicitly discouraging people from trying to use the microVMs to work
around page size problems (which they can already do, another
fragmentation problem, anyway); any application which requires the 4K VM
to run that isn't an emulator is already clearly broken and advertising
that fact openly. So, adding TSO to this should be only a marginal risk
of further fragmentation, and it wouldn't allow apps to "sneakily" "just
work" on Apple machines by abusing TSO.

> 
> An alternative option is to go down the SPARC RMO route and just enable
> TSO statically (although presumably in the firmware) for Apple silicon.
> I'm assuming that has a performance impact for native code?

Correct. We already have this as a bootloader option, but it is not
desirable. Plus, userspace code still needs a way to *discover* that TSO
is enabled for correctness, so it can automatically decide whether to
use stronger or weaker instructions.
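
A minimal sketch of what that discovery could look like from an
emulator's side, assuming the prctl interface proposed in patch 1. The
fallback constant values below are assumptions based on this series, not
on a released kernel header, and enum emit_mode and choose_emit_mode()
are hypothetical names.

```c
#include <assert.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: these values mirror the proposed (unmerged) uapi in this
 * series. They are only used if the installed headers lack them.
 */
#ifndef PR_GET_MEM_MODEL
#define PR_GET_MEM_MODEL 0x6d4d444c
#endif
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL 0x4d4d444c
#endif
#ifndef PR_SET_MEM_MODEL_DEFAULT
#define PR_SET_MEM_MODEL_DEFAULT 0
#endif
#ifndef PR_SET_MEM_MODEL_TSO
#define PR_SET_MEM_MODEL_TSO 1
#endif

static_assert(PR_SET_MEM_MODEL_TSO != PR_SET_MEM_MODEL_DEFAULT,
	      "the two memory models must be distinguishable");

/* Hypothetical code generation modes for an x86 emulator's JIT. */
enum emit_mode {
	EMIT_ACQ_REL, /* emulate TSO with stronger instructions */
	EMIT_PLAIN,   /* hardware TSO is active, plain loads/stores suffice */
};

static enum emit_mode choose_emit_mode(void)
{
	/* Try to enable TSO; any failure means we must emulate it. */
	if (prctl(PR_SET_MEM_MODEL, (unsigned long)PR_SET_MEM_MODEL_TSO,
		  0UL, 0UL, 0UL) != 0)
		return EMIT_ACQ_REL;

	/* Confirm the model actually in effect before relying on it. */
	if (prctl(PR_GET_MEM_MODEL, 0UL, 0UL, 0UL, 0UL) !=
	    PR_SET_MEM_MODEL_TSO)
		return EMIT_ACQ_REL;

	return EMIT_PLAIN;
}
```

On a kernel without the series the prctl() call is expected to fail, so
the sketch falls back to the stronger instructions. Since the flag is
cleared on execve, an emulator would presumably run this check in each
new process image rather than inherit the result.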

> 
> Will
> 
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.
> 

- Hector

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11 14:19   ` Hector Martin
@ 2024-04-11 18:43     ` Hector Martin
  2024-04-16  2:22       ` Zayd Qumsieh
  2024-04-19 16:58     ` Will Deacon
  1 sibling, 1 reply; 30+ messages in thread
From: Hector Martin @ 2024-04-11 18:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Marc Zyngier, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux



On 2024/04/11 23:19, Hector Martin wrote:
>>
>> An alternative option is to go down the SPARC RMO route and just enable
>> TSO statically (although presumably in the firmware) for Apple silicon.
>> I'm assuming that has a performance impact for native code?
> 
> Correct. We already have this as a bootloader option, but it is not
> desirable. Plus, userspace code still needs a way to *discover* that TSO
> is enabled for correctness, so it can automatically decide whether to
> use stronger or weaker instructions.

To add some numbers to this (I was just made aware of this paper):

https://www.sra.uni-hannover.de/Publications/2023/tosting-arcs23/wrenger_23_arcs.pdf

Using TSO globally has, on average, a 9% performance hit, so that is
clearly off the table as a general solution.

Meanwhile, more detailed microbenchmarks often show TSO as having better
performance than outright using acquire/release instructions without
TSO. Therefore, just giving up on TSO and using acq/rel semantics for
emulators is also not an acceptable solution.

Additionally, the general load/store instructions on ARM have more
flexible addressing modes than the synchronizing ones, and since general
x86 emulation requires *all* loads and stores to be like this in a
non-TSO model (without much more complex/expensive program analysis to
determine where this can be elided), the perf impact is definitely worse
for emulation (e.g. stack accesses are affected) than for a
microbenchmark where only the "target" test instructions are being modified.

- Hector

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
                   ` (5 preceding siblings ...)
  2024-04-11 13:28 ` Will Deacon
@ 2024-04-16  2:11 ` Zayd Qumsieh
  6 siblings, 0 replies; 30+ messages in thread
From: Zayd Qumsieh @ 2024-04-16  2:11 UTC (permalink / raw)
  To: marcan
  Cc: catalin.marinas, will, maz, mark.rutland, zayd_qumsieh,
	ih_justin, Houdek.Ryan, broonie, ardb, mjguzik,
	anshuman.khandual, oliver.upton, miguel.luis, joey.gouly,
	cpaasch, keescook, samitolvanen, bhe, j.granados, dawei.li, akpm,
	revest, david, shr, andy.chiu, josh, oleg, deller, zev, omosnace,
	ojeda, linux-arm-kernel, linux-kernel, asahi

The patch looks great! :) I have one minor suggestion, though:

>static __always_inline bool system_has_actlr_state(void)
>{
>	return IS_ENABLED(CONFIG_ARM64_ACTLR_STATE) &&
>		alternative_has_cap_unlikely(ARM64_HAS_TSO_APPLE);
>}

ACTLR_EL1.TSO is not exposed for writing on Virtual Machines on all
versions of macOS. However, AIDR_EL1 may still advertise TSO, whether
or not ACTLR_EL1.TSO is writable. Could you modify the patch such that
we check the writability of ACTLR_EL1.TSO in system_has_actlr_state
(or once on startup, and cache it, since reading from AIDR_EL1 causes
a trap to Hypervisor.fwk)?
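
The probe-once-and-cache idea can be sketched in plain C against a
simulated register. Everything here is hypothetical: struct fake_sysreg
stands in for ACTLR_EL1, the TSO bit position is an assumption, and the
model assumes that writes to a non-writable bit are silently ignored
rather than trapping. It is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: bit position chosen for illustration only. */
#define FAKE_ACTLR_TSO (UINT64_C(1) << 1)

/* Hypothetical model of a system register with some read-only bits. */
struct fake_sysreg {
	uint64_t value;
	uint64_t writable_mask;
};

static uint64_t reg_read(const struct fake_sysreg *r)
{
	return r->value;
}

static void reg_write(struct fake_sysreg *r, uint64_t v)
{
	/* Only bits in writable_mask take the new value. */
	r->value = (r->value & ~r->writable_mask) | (v & r->writable_mask);
}

/* Flip one bit, check whether the flip stuck, then restore the register. */
static bool probe_bit_writable(struct fake_sysreg *r, uint64_t bit)
{
	uint64_t saved = reg_read(r);
	bool writable;

	assert(bit != 0 && (bit & (bit - 1)) == 0); /* exactly one bit */
	reg_write(r, saved ^ bit);
	writable = ((reg_read(r) ^ saved) & bit) != 0;
	reg_write(r, saved);
	return writable;
}

/* Cached result: -1 means not probed yet, otherwise 0 or 1. */
static int tso_writable_cache = -1;

static bool tso_is_writable(struct fake_sysreg *r)
{
	if (tso_writable_cache < 0)
		tso_writable_cache = probe_bit_writable(r, FAKE_ACTLR_TSO);
	return tso_writable_cache != 0;
}
```

Caching the result means the probe cost is paid once rather than on
every capability check. A real implementation would also have to decide
on which CPU, and at what point during boot, the probe is allowed to
run.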

Thanks,
Zayd

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11 18:43     ` Hector Martin
@ 2024-04-16  2:22       ` Zayd Qumsieh
  2024-04-19 16:58         ` Will Deacon
  0 siblings, 1 reply; 30+ messages in thread
From: Zayd Qumsieh @ 2024-04-16  2:22 UTC (permalink / raw)
  To: Hector Martin, Will Deacon
  Cc: Catalin Marinas, Marc Zyngier, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

>I'm probably going to make myself hugely unpopular here, but I have a
>strong objection to this patch series as it stands. I firmly believe
>that providing a prctl() to query and toggle the memory model to/from
>TSO is going to lead to subtle fragmentation of arm64 Linux userspace.

It's definitely not our intent to fragment the ecosystem. The goal of
this memory ordering is to simplify emulation layers that benefit from
this. If you have suggestions to reduce the risk of it being misused
outside of emulators, we'd be happy to look into it.

Thanks,
Zayd

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11 14:19   ` Hector Martin
  2024-04-11 18:43     ` Hector Martin
@ 2024-04-19 16:58     ` Will Deacon
  2024-04-20 11:37       ` Marc Zyngier
  2024-04-20 12:13       ` Eric Curtin
  1 sibling, 2 replies; 30+ messages in thread
From: Will Deacon @ 2024-04-19 16:58 UTC (permalink / raw)
  To: Hector Martin
  Cc: Catalin Marinas, Marc Zyngier, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> On 2024/04/11 22:28, Will Deacon wrote:
> >   * Some binaries in a distribution exhibit instability which goes away
> >     in TSO mode, so a taskset-like program is used to run them with TSO
> >     enabled.
> 
> Since the flag is cleared on execve, this third one isn't generally
> possible as far as I know.

Ah ok, I'd missed that. Thanks.

> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
> 
> The problem is "just" using stronger instructions is much more
> expensive, as emulators have demonstrated. If TSO didn't serve a
> practical purpose I wouldn't be submitting this, but it does. This is
> basically non-negotiable for x86 emulation; if this is rejected
> upstream, it will forever live as a downstream patch used by the entire
> gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> explicitly targeting, given our efforts with microVMs for 4K page size
> support and the upcoming Vulkan drivers).

These microVMs sound quite interesting. What exactly are they? Are you
running them under KVM?

Ignoring the mechanism for the time being, would it solve your problem
if you were able to run specific microVMs in TSO mode, or do you *really*
need the VM to have finer-grained control than that? If the whole VM is
running in TSO mode, then my concerns largely disappear, as that's
indistinguishable from running on a hardware implementation that happens
to be TSO.

> That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> implementation should be harmless, since those CPUs would correctly run
> poorly-written applications anyway so the API is moot. That leaves Apple
> Silicon. Our native kernels are and likely always will be 16K page size,
> due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> natively but with very broken functionality including no GPU
> acceleration) plus performance differences that favor 16K. How about we
> gate the TSO functionality to only be supported on 4K kernel builds?
> This would make them only work in 4K VMs on Asahi Linux. We are very
> explicitly discouraging people from trying to use the microVMs to work
> around page size problems (which they can already do, another
> fragmentation problem, anyway); any application which requires the 4K VM
> to run that isn't an emulator is already clearly broken and advertising
> that fact openly. So, adding TSO to this should be only a marginal risk
> of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> work" on Apple machines by abusing TSO.

I appreciate that you're trying to be constructive here, but I don't think
we should tie this to the page size. It's an artificial limitation and I
don't think it really addresses the underlying concerns that I have.

Will

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-16  2:22       ` Zayd Qumsieh
@ 2024-04-19 16:58         ` Will Deacon
  2024-04-19 18:05           ` Catalin Marinas
  0 siblings, 1 reply; 30+ messages in thread
From: Will Deacon @ 2024-04-19 16:58 UTC (permalink / raw)
  To: Zayd Qumsieh
  Cc: Hector Martin, Catalin Marinas, Marc Zyngier, Mark Rutland,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> >I'm probably going to make myself hugely unpopular here, but I have a
> >strong objection to this patch series as it stands. I firmly believe
> >that providing a prctl() to query and toggle the memory model to/from
> >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> 
> It's definitely not our intent to fragment the ecosystem.
> The goal of this memory ordering is to simplify emulation layers that benefit from this.
> If you have suggestions to reduce the risk of it being misused outside of emulators, we'd be happy to look into it.

Once you have exposed this toggle via prctl(), it doesn't really matter
what your intentions were. It will get used outside of emulation layers
and we'll be stuck supporting it.

Will

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-19 16:58         ` Will Deacon
@ 2024-04-19 18:05           ` Catalin Marinas
  0 siblings, 0 replies; 30+ messages in thread
From: Catalin Marinas @ 2024-04-19 18:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Zayd Qumsieh, Hector Martin, Marc Zyngier, Mark Rutland,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Fri, Apr 19, 2024 at 05:58:26PM +0100, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 07:22:41PM -0700, Zayd Qumsieh wrote:
> > >I'm probably going to make myself hugely unpopular here, but I have a
> > >strong objection to this patch series as it stands. I firmly believe
> > >that providing a prctl() to query and toggle the memory model to/from
> > >TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> > 
> > It's definitely not our intent to fragment the ecosystem. The goal
> > of this memory ordering is to simplify emulation layers that benefit
> > from this. If you have suggestions to reduce the risk of it being
> > misused outside of emulators, we'd be happy to look into it.
> 
> Once you have exposed this toggle via prctl(), it doesn't really matter
> what your intentions were. It will get used outside of emulation layers
> and we'll be stuck supporting it.

Just FTR, I fully agree with Will. I'm strongly against this kind of ABI
for a non-architected, implementation defined feature. I can't even tell
exactly what TSO means on the Apple hardware. Is it close to the x86
TSO? Is there a formal memory model for it? Are future Apple (or other
Arm vendor) implementations going to follow exactly the same model to be
able to call it some form of "Apple standard" that deserves an ABI?

So, sorry, I'm going to NAK these approaches proposing imp def features
as generic opt-in mechanisms (the microVMs thing sounds doable though,
to my limited understanding; I guess that would mean running the
emulator in a VM).

-- 
Catalin

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-19 16:58     ` Will Deacon
@ 2024-04-20 11:37       ` Marc Zyngier
  2024-05-02  0:10         ` Zayd Qumsieh
  2024-04-20 12:13       ` Eric Curtin
  1 sibling, 1 reply; 30+ messages in thread
From: Marc Zyngier @ 2024-04-20 11:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: Hector Martin, Catalin Marinas, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Fri, 19 Apr 2024 17:58:09 +0100,
Will Deacon <will@kernel.org> wrote:
> 
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > >   * Some binaries in a distribution exhibit instability which goes away
> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >     enabled.
> > 
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
> 
> Ah ok, I'd missed that. Thanks.
> 
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> > 
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
> 
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?
> 
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.

Since KVM has been mentioned a few times, I'll give my take on this.

Since day 1, it was a conscious decision for KVM/arm64 to emulate the
architecture, and only that -- this is complicated enough. Meaning
that no implementation-defined features should be explicitly exposed
to the guest. So I have no plan to expose any such feature for
userspace to configure TSO or anything else of the sort.

However, that doesn't preclude VMs from running in TSO mode if the HW
is configured as such at boot time. From what I have understood, this
is a per translation regime setting (EL1 and EL2 have separate knobs).

So it should be possible to set ACTLR_EL1.TSO=1 from firmware (using
the non-architected ACTLR_EL12 accessor), and let things work without
touching anything else (KVM doesn't context switch this register and
traps accesses to it). This would keep KVM out of the loop, the host
side would be unaffected, and only VMs would pay the overhead of TSO.

I appreciate that this is not the ideal situation, and very much an
all-or-nothing approach. But that's what we can reasonably manage from
an upstream perspective given the variability of the arm64 ecosystem.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-19 16:58     ` Will Deacon
  2024-04-20 11:37       ` Marc Zyngier
@ 2024-04-20 12:13       ` Eric Curtin
  2024-04-20 12:15         ` Eric Curtin
  2024-05-06 11:21         ` Sergio Lopez Pascual
  1 sibling, 2 replies; 30+ messages in thread
From: Eric Curtin @ 2024-04-20 12:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Hector Martin, Catalin Marinas, Marc Zyngier, Mark Rutland,
	Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux, Sergio Lopez Pascual

On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>
> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > On 2024/04/11 22:28, Will Deacon wrote:
> > >   * Some binaries in a distribution exhibit instability which goes away
> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >     enabled.
> >
> > Since the flag is cleared on execve, this third one isn't generally
> > possible as far as I know.
>
> Ah ok, I'd missed that. Thanks.
>
> > > In all these cases, we end up with native arm64 applications that will
> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > feature. Assuming that the application cannot be fixed, a better
> > > approach would be to recompile using stronger instructions (e.g.
> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > true that some existing CPUs are TSO by design (this is a perfectly
> > > valid implementation of the arm64 memory model), but I think there's a
> > > big difference between quietly providing more ordering guarantees than
> > > software may be relying on and providing a mechanism to discover,
> > > request and ultimately rely upon the stronger behaviour.
> >
> > The problem is "just" using stronger instructions is much more
> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > practical purpose I wouldn't be submitting this, but it does. This is
> > basically non-negotiable for x86 emulation; if this is rejected
> > upstream, it will forever live as a downstream patch used by the entire
> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > explicitly targeting, given our efforts with microVMs for 4K page size
> > support and the upcoming Vulkan drivers).
>
> These microVMs sound quite interesting. What exactly are they? Are you
> running them under KVM?

It's the magic of libkrun. This is one of the git repos in the libkrun
family. It has a wide array of use cases, and I won't do it justice by
trying to explain them all; this is just one repo/tool/use case:

https://github.com/containers/krunvm

https://sinrega.org/running-microvms-on-m1/

CC'ing @Sergio Lopez Pascual the lead of krun in general.

Is mise le meas/Regards,

Eric Curtin

>
> Ignoring the mechanism for the time being, would it solve your problem
> if you were able to run specific microVMs in TSO mode, or do you *really*
> need the VM to have finer-grained control than that? If the whole VM is
> running in TSO mode, then my concerns largely disappear, as that's
> indistinguishable from running on a hardware implementation that happens
> to be TSO.
>
> > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > implementation should be harmless, since those CPUs would correctly run
> > poorly-written applications anyway so the API is moot. That leaves Apple
> > Silicon. Our native kernels are and likely always will be 16K page size,
> > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > natively but with very broken functionality including no GPU
> > acceleration) plus performance differences that favor 16K. How about we
> > gate the TSO functionality to only be supported on 4K kernel builds?
> > This would make them only work in 4K VMs on Asahi Linux. We are very
> > explicitly discouraging people from trying to use the microVMs to work
> > around page size problems (which they can already do, another
> > fragmentation problem, anyway); any application which requires the 4K VM
> > to run that isn't an emulator is already clearly broken and advertising
> > that fact openly. So, adding TSO to this should be only a marginal risk
> > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > work" on Apple machines by abusing TSO.
>
> I appreciate that you're trying to be constructive here, but I don't think
> we should tie this to the page size. It's an artificial limitation and I
> don't think it really addresses the underlying concerns that I have.
>
> Will
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-20 12:13       ` Eric Curtin
@ 2024-04-20 12:15         ` Eric Curtin
  2024-05-06 11:21         ` Sergio Lopez Pascual
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Curtin @ 2024-04-20 12:15 UTC (permalink / raw)
  To: Will Deacon
  Cc: Hector Martin, Catalin Marinas, Marc Zyngier, Mark Rutland,
	Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux, Sergio Lopez Pascual

On Sat, 20 Apr 2024 at 13:13, Eric Curtin <ecurtin@redhat.com> wrote:
>
> On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
> >
> > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > On 2024/04/11 22:28, Will Deacon wrote:
> > > >   * Some binaries in a distribution exhibit instability which goes away
> > > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > > >     enabled.
> > >
> > > Since the flag is cleared on execve, this third one isn't generally
> > > possible as far as I know.
> >
> > Ah ok, I'd missed that. Thanks.
> >
> > > > In all these cases, we end up with native arm64 applications that will
> > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > feature. Assuming that the application cannot be fixed, a better
> > > > approach would be to recompile using stronger instructions (e.g.
> > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > valid implementation of the arm64 memory model), but I think there's a
> > > > big difference between quietly providing more ordering guarantees than
> > > > software may be relying on and providing a mechanism to discover,
> > > > request and ultimately rely upon the stronger behaviour.
> > >
> > > The problem is "just" using stronger instructions is much more
> > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > practical purpose I wouldn't be submitting this, but it does. This is
> > > basically non-negotiable for x86 emulation; if this is rejected
> > > upstream, it will forever live as a downstream patch used by the entire
> > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > support and the upcoming Vulkan drivers).
> >
> > These microVMs sound quite interesting. What exactly are they? Are you
> > running them under KVM?
>
> It's the magic of libkrun. This is one of the git repos in the libkrun
> family. It has a wide array of use cases, and I won't do it justice by
> trying to explain them all; this is just one repo/tool/use case:
>
> https://github.com/containers/krunvm
>
> https://sinrega.org/running-microvms-on-m1/

Sorry for the double post; I meant to share this one for the Asahi
emulator use case. Sergio's blog posts are great in general:

https://sinrega.org/2023-10-06-using-microvms-for-gaming-on-fedora-asahi/

Is mise le meas/Regards,

Eric Curtin

>
> CC'ing @Sergio Lopez Pascual the lead of krun in general.
>
> Is mise le meas/Regards,
>
> Eric Curtin
>
> >
> > Ignoring the mechanism for the time being, would it solve your problem
> > if you were able to run specific microVMs in TSO mode, or do you *really*
> > need the VM to have finer-grained control than that? If the whole VM is
> > running in TSO mode, then my concerns largely disappear, as that's
> > indistinguishable from running on a hardware implementation that happens
> > to be TSO.
> >
> > > That said, I have a pragmatic proposal here. The "fixed TSO" part of the
> > > implementation should be harmless, since those CPUs would correctly run
> > > poorly-written applications anyway so the API is moot. That leaves Apple
> > > Silicon. Our native kernels are and likely always will be 16K page size,
> > > due to a bunch of pain around 16K-only IOMMUs (4K kernels do boot
> > > natively but with very broken functionality including no GPU
> > > acceleration) plus performance differences that favor 16K. How about we
> > > gate the TSO functionality to only be supported on 4K kernel builds?
> > > This would make them only work in 4K VMs on Asahi Linux. We are very
> > > explicitly discouraging people from trying to use the microVMs to work
> > > around page size problems (which they can already do, another
> > > fragmentation problem, anyway); any application which requires the 4K VM
> > > to run that isn't an emulator is already clearly broken and advertising
> > > that fact openly. So, adding TSO to this should be only a marginal risk
> > > of further fragmentation, and it wouldn't allow apps to "sneakily" "just
> > > work" on Apple machines by abusing TSO.
> >
> > I appreciate that you're trying to be constructive here, but I don't think
> > we should tie this to the page size. It's an artificial limitation and I
> > don't think it really addresses the underlying concerns that I have.
> >
> > Will
> >


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-20 11:37       ` Marc Zyngier
@ 2024-05-02  0:10         ` Zayd Qumsieh
  2024-05-02 13:25           ` Marc Zyngier
  0 siblings, 1 reply; 30+ messages in thread
From: Zayd Qumsieh @ 2024-05-02  0:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Catalin Marinas, Mark Rutland, Zayd Qumsieh, Justin Lu,
	Ryan Houdek, Mark Brown, Ard Biesheuvel, Mateusz Guzik,
	Anshuman Khandual, Oliver Upton, Miguel Luis, Joey Gouly,
	Christoph Paasch, Kees Cook, Sami Tolvanen, Baoquan He,
	Joel Granados, Dawei Li, Andrew Morton, Florent Revest,
	David Hildenbrand, Stefan Roesch, Andy Chiu, Josh Triplett,
	Oleg Nesterov, Helge Deller, Zev Weiss, Ondrej Mosnacek,
	Miguel Ojeda, linux-arm-kernel, linux-kernel, Asahi Linux

> On Fri, 19 Apr 2024 17:58:09 +0100,
> Will Deacon <will@kernel.org> wrote:
> > 
> > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > On 2024/04/11 22:28, Will Deacon wrote:
> > > >   * Some binaries in a distribution exhibit instability which goes away
> > > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > > >     enabled.
> > > 
> > > Since the flag is cleared on execve, this third one isn't generally
> > > possible as far as I know.
> > 
> > Ah ok, I'd missed that. Thanks.
> > 
> > > > In all these cases, we end up with native arm64 applications that will
> > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > feature. Assuming that the application cannot be fixed, a better
> > > > approach would be to recompile using stronger instructions (e.g.
> > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > valid implementation of the arm64 memory model), but I think there's a
> > > > big difference between quietly providing more ordering guarantees than
> > > > software may be relying on and providing a mechanism to discover,
> > > > request and ultimately rely upon the stronger behaviour.
> > > 
> > > The problem is "just" using stronger instructions is much more
> > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > practical purpose I wouldn't be submitting this, but it does. This is
> > > basically non-negotiable for x86 emulation; if this is rejected
> > > upstream, it will forever live as a downstream patch used by the entire
> > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > support and the upcoming Vulkan drivers).
> > 
> > These microVMs sound quite interesting. What exactly are they? Are you
> > running them under KVM?
> > 
> > Ignoring the mechanism for the time being, would it solve your problem
> > if you were able to run specific microVMs in TSO mode, or do you *really*
> > need the VM to have finer-grained control than that? If the whole VM is
> > running in TSO mode, then my concerns largely disappear, as that's
> > indistinguishable from running on a hardware implementation that happens
> > to be TSO.
>
> Since KVM has been mentioned a few times, I'll give my take on this.
>
> Since day 1, it was a conscious decision for KVM/arm64 to emulate the
> architecture, and only that -- this is complicated enough. Meaning
> that no implementation-defined features should be explicitly exposed
> to the guest. So I have no plan to expose any such feature for
> userspace to configure TSO or anything else of the sort.

Agreed. We do not intend for TSO mode to be used extensively for EL1; the
intention is for TSO mode to be reserved for userspace applications that
request it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11 13:28 ` Will Deacon
  2024-04-11 14:19   ` Hector Martin
@ 2024-05-02  0:16   ` Zayd Qumsieh
  2024-05-07 10:24   ` Alex Bennée
  2 siblings, 0 replies; 30+ messages in thread
From: Zayd Qumsieh @ 2024-05-02  0:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Marc Zyngier, Mark Rutland, Zayd Qumsieh,
	Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Thu, 11 Apr 2024 14:28:54 +0100,
Will Deacon <will@kernel.org> wrote:
> P.S. I briefly pondered the idea of the kernel toggling the bit in the
> ELF loader when e.g. it sees an x86 machine type but I suspect that
> doesn't really help with existing emulators and you'd still need a way
> to tell the emulator whether or not it was enabled.

This seems promising to me. What do people think of adding an opt-in argument,
option, or similar to binfmt that allows users to mark certain file formats as
"must run under TSO"? And then, the kernel would set the TSO bit when invoking
the interpreter for those file formats. If an emulator decides to create a
non-CPU-emulation thread, then it can use a prctl to disable TSO and switch to
the default ARM memory model. Note that this prctl wouldn't be allowed to
enable TSO - it would only disable it. This way, it is much harder for a
faulty application to be made that relies on TSO, since enabling of TSO is
only done via a binfmt handler that the user must explicitly opt into.

It is true that existing emulators wouldn't be able to benefit from this, but
that's the case no matter the activation mechanism. We can, however, expose a
prctl to get the memory model, so emulators can detect if TSO was enabled for
their threads.

To summarize, I propose two prctls (similar to the ones in the current revision
of the patch series). One to switch from the TSO memory model to the default
ARM one (this is a one-way street). And another to query the current memory
model.
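
A minimal sketch of the semantics proposed above, written as a userspace
model rather than kernel code. Every name in it (the enum, the struct and
the three functions) is hypothetical and exists only for illustration, and
the choice of -EPERM for a refused request is an assumption; the proposal
does not name an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical names; none of these are existing kernel or uapi identifiers. */
enum mem_model { MEM_MODEL_DEFAULT = 0, MEM_MODEL_TSO = 1 };

struct task_model {
	enum mem_model model;
};

/* Models the binfmt handler: the only path that may turn TSO on, at exec time. */
static void exec_with_binfmt(struct task_model *t, int handler_wants_tso)
{
	assert(t != NULL);
	t->model = handler_wants_tso ? MEM_MODEL_TSO : MEM_MODEL_DEFAULT;
}

/* Models the "set" prctl: it can only move a task to the default model. */
static int task_set_mem_model(struct task_model *t, enum mem_model requested)
{
	assert(t != NULL);
	if (requested == MEM_MODEL_TSO)
		return -EPERM;	/* assumed error code: enabling TSO is refused */
	if (requested != MEM_MODEL_DEFAULT)
		return -EINVAL;	/* unknown model value */
	t->model = MEM_MODEL_DEFAULT;
	return 0;
}

/* Models the "get" prctl: lets an emulator detect what it was started with. */
static enum mem_model task_get_mem_model(const struct task_model *t)
{
	assert(t != NULL);
	return t->model;
}
```

In this model, an emulator thread started through a TSO-marked handler
would query the model once, and a helper thread that does no CPU emulation
would request the default model; after that, a request for TSO fails.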

Thanks,
Zayd

P.S. I forgot to CC you in my most recent email to Marc Zyngier just now. 
Sorry, I'm quite new to using mailing lists.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-02  0:10         ` Zayd Qumsieh
@ 2024-05-02 13:25           ` Marc Zyngier
  2024-05-06  8:20             ` Jonas Oberhauser
  0 siblings, 1 reply; 30+ messages in thread
From: Marc Zyngier @ 2024-05-02 13:25 UTC (permalink / raw)
  To: Zayd Qumsieh
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Justin Lu,
	Ryan Houdek, Mark Brown, Ard Biesheuvel, Mateusz Guzik,
	Anshuman Khandual, Oliver Upton, Miguel Luis, Joey Gouly,
	Christoph Paasch, Kees Cook, Sami Tolvanen, Baoquan He,
	Joel Granados, Dawei Li, Andrew Morton, Florent Revest,
	David Hildenbrand, Stefan Roesch, Andy Chiu, Josh Triplett,
	Oleg Nesterov, Helge Deller, Zev Weiss, Ondrej Mosnacek,
	Miguel Ojeda, linux-arm-kernel, linux-kernel, Asahi Linux

[adding Will back to the thread]

On Thu, 02 May 2024 01:10:35 +0100,
Zayd Qumsieh <zayd_qumsieh@apple.com> wrote:
> 
> > On Fri, 19 Apr 2024 17:58:09 +0100,
> > Will Deacon <will@kernel.org> wrote:
> > > 
> > > On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > > > On 2024/04/11 22:28, Will Deacon wrote:
> > > > >   * Some binaries in a distribution exhibit instability which goes away
> > > > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > > > >     enabled.
> > > > 
> > > > Since the flag is cleared on execve, this third one isn't generally
> > > > possible as far as I know.
> > > 
> > > Ah ok, I'd missed that. Thanks.
> > > 
> > > > > In all these cases, we end up with native arm64 applications that will
> > > > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > > > > feature. Assuming that the application cannot be fixed, a better
> > > > > approach would be to recompile using stronger instructions (e.g.
> > > > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > > > > true that some existing CPUs are TSO by design (this is a perfectly
> > > > > valid implementation of the arm64 memory model), but I think there's a
> > > > > big difference between quietly providing more ordering guarantees than
> > > > > software may be relying on and providing a mechanism to discover,
> > > > > request and ultimately rely upon the stronger behaviour.
> > > > 
> > > > The problem is "just" using stronger instructions is much more
> > > > expensive, as emulators have demonstrated. If TSO didn't serve a
> > > > practical purpose I wouldn't be submitting this, but it does. This is
> > > > basically non-negotiable for x86 emulation; if this is rejected
> > > > upstream, it will forever live as a downstream patch used by the entire
> > > > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > > > explicitly targeting, given our efforts with microVMs for 4K page size
> > > > support and the upcoming Vulkan drivers).
> > > 
> > > These microVMs sound quite interesting. What exactly are they? Are you
> > > running them under KVM?
> > > 
> > > Ignoring the mechanism for the time being, would it solve your problem
> > > if you were able to run specific microVMs in TSO mode, or do you *really*
> > > need the VM to have finer-grained control than that? If the whole VM is
> > > running in TSO mode, then my concerns largely disappear, as that's
> > > indistinguishable from running on a hardware implementation that happens
> > > to be TSO.
> >
> > Since KVM has been mentioned a few times, I'll give my take on this.
> >
> > Since day 1, it was a conscious decision for KVM/arm64 to emulate the
> > architecture, and only that -- this is complicated enough. Meaning
> > that no implementation-defined features should be explicitly exposed
> > to the guest. So I have no plan to expose any such feature for
> > userspace to configure TSO or anything else of the sort.
> 
> Agreed. We do not intend for TSO mode to be used extensively for EL1, the
> intention is for TSO mode to be reserved for userspace applications that
> request it.

But that's the same thing for a hypervisor.

For userspace in a VM to make use of any feature, it must be exposed
to the VM as a whole by the host VMM (QEMU, kvmtool, whatever). Which
means having a new userspace ABI, specific to KVM, exposing a feature
for which there is no spec whatsoever. Even worse, you cannot discover
whether the instruction you must use to context switch the ACTLR_EL1
register is implemented. Isn't that great?

And I'm not even talking about the joys of migrating such a VM,
because we have no clue what this bit means on other implementations.
For all we know it causes another CPU to catch fire (or go PDP-endian,
which is basically the same).

Which is why my proposal is for this bit to be set statically for
*all* VMs, and leave the kernel (and KVM) out of the picture
altogether. At least that is something we can reason about (although
someone would need to start thinking of how this particular TSO
implementation composes with the relaxed memory ordering used outside
of the VM and show that they actually lead to correct results for
something such as virtio, for example).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-02 13:25           ` Marc Zyngier
@ 2024-05-06  8:20             ` Jonas Oberhauser
  0 siblings, 0 replies; 30+ messages in thread
From: Jonas Oberhauser @ 2024-05-06  8:20 UTC (permalink / raw)
  To: Marc Zyngier, Zayd Qumsieh
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, Justin Lu,
	Ryan Houdek, Mark Brown, Ard Biesheuvel, Mateusz Guzik,
	Anshuman Khandual, Oliver Upton, Miguel Luis, Joey Gouly,
	Christoph Paasch, Kees Cook, Sami Tolvanen, Baoquan He,
	Joel Granados, Dawei Li, Andrew Morton, Florent Revest,
	David Hildenbrand, Stefan Roesch, Andy Chiu, Josh Triplett,
	Oleg Nesterov, Helge Deller, Zev Weiss, Ondrej Mosnacek,
	Miguel Ojeda, linux-arm-kernel, linux-kernel, Asahi Linux



On 5/2/2024 at 3:25 PM, Marc Zyngier wrote:
> although
> someone would need to start thinking of how this particular TSO
> implementation composes with the relaxed memory ordering used outside
> of the VM and show that they actually lead to correct results for
> something such as virtio, for example

I used to think about this problem space. Composing some kinds of memory
models (e.g., Arm and TSO) is easy; composing others is hard.

I don't know much about virtio, so this may show my naivety, but what
complications could arise from virtio?

Does the "visible behavior" of virtio change depending on the memory
model of the machine it is running on?

At least internally inside virtio it should not cause any problems, since
you are effectively adding some barriers inside some of the virtio threads
(those that are running in the VM).

But if the VM relies on virtio behaving in a "TSO manner" but its behavior
is more relaxed on e.g. Arm, then that could cause issues.

have fun, jonas


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-20 12:13       ` Eric Curtin
  2024-04-20 12:15         ` Eric Curtin
@ 2024-05-06 11:21         ` Sergio Lopez Pascual
  2024-05-06 16:12           ` Marc Zyngier
  1 sibling, 1 reply; 30+ messages in thread
From: Sergio Lopez Pascual @ 2024-05-06 11:21 UTC (permalink / raw)
  To: Eric Curtin, Will Deacon
  Cc: Hector Martin, Catalin Marinas, Marc Zyngier, Mark Rutland,
	Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

Eric Curtin <ecurtin@redhat.com> writes:

> On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>>
>> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
>> > On 2024/04/11 22:28, Will Deacon wrote:
>> > >   * Some binaries in a distribution exhibit instability which goes away
>> > >     in TSO mode, so a taskset-like program is used to run them with TSO
>> > >     enabled.
>> >
>> > Since the flag is cleared on execve, this third one isn't generally
>> > possible as far as I know.
>>
>> Ah ok, I'd missed that. Thanks.
>>
>> > > In all these cases, we end up with native arm64 applications that will
>> > > either fail to load or will crash in subtle ways on CPUs without the TSO
>> > > feature. Assuming that the application cannot be fixed, a better
>> > > approach would be to recompile using stronger instructions (e.g.
>> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
>> > > true that some existing CPUs are TSO by design (this is a perfectly
>> > > valid implementation of the arm64 memory model), but I think there's a
>> > > big difference between quietly providing more ordering guarantees than
>> > > software may be relying on and providing a mechanism to discover,
>> > > request and ultimately rely upon the stronger behaviour.
>> >
>> > The problem is "just" using stronger instructions is much more
>> > expensive, as emulators have demonstrated. If TSO didn't serve a
>> > practical purpose I wouldn't be submitting this, but it does. This is
>> > basically non-negotiable for x86 emulation; if this is rejected
>> > upstream, it will forever live as a downstream patch used by the entire
>> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
>> > explicitly targeting, given our efforts with microVMs for 4K page size
>> > support and the upcoming Vulkan drivers).

In addition to the use case Hector exposed here, there's another,
potentially larger one, which is running x86_64 containers on aarch64
systems, using a combination of both Virtualization and emulation.

In this scenario, both not being able to use TSO for emulation
and having to enable it all the time for the whole VM have a very large
impact on performance (~25% on some workloads).

I understand the concern about the risk of userspace fragmentation, but
I was wondering if we could minimize it to an acceptable level by
narrowing down the context. For instance, since both use cases we're
bringing to the table imply the use of virtualization, we should be able
to restrict PR_SET_MEM_MODEL to only be accepted when running at EL1
(and not in nVHE nor pKVM), returning EINVAL otherwise. This would
heavily discourage users from relying on this feature for native
applications that can run in arbitrary contexts, hence drastically
reducing the fragmentation risk.
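
A minimal sketch of the gating described above, as a standalone model and
not kernel code. The struct, its fields and both functions are hypothetical
names for illustration. In the kernel the two flags would roughly
correspond to what is_kernel_in_hyp_mode() and is_hyp_mode_available()
report, but that mapping is an assumption, and nested virtualization is
ignored here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of how this kernel instance was booted. */
struct exec_ctx {
	bool kernel_at_el2;	/* VHE host: the kernel itself runs at EL2 */
	bool el2_available;	/* booted at EL2: a host, incl. nVHE and pKVM */
};

/* A guest kernel runs at EL1 and never had EL2 handed to it at boot. */
static bool running_as_guest(const struct exec_ctx *c)
{
	assert(c != NULL);
	return !c->kernel_at_el2 && !c->el2_available;
}

/* Models PR_SET_MEM_MODEL being accepted only inside a VM. */
static int set_mem_model_gated(const struct exec_ctx *c, unsigned long model,
			       unsigned long *task_model)
{
	assert(task_model != NULL);
	if (!running_as_guest(c))
		return -EINVAL;	/* host contexts: VHE, nVHE, pKVM */
	*task_model = model;
	return 0;
}
```

With this shape, a native application calling the prctl on a host kernel
always sees EINVAL, so it cannot come to depend on TSO outside a VM.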

We would still need a way to ensure the trap gets to the VMM and for
the VMM to operate on the impdef ACTLR_EL12, but that should be dealt
with in a different series.

Thanks,
Sergio.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-06 11:21         ` Sergio Lopez Pascual
@ 2024-05-06 16:12           ` Marc Zyngier
  2024-05-06 16:20             ` Eric Curtin
  2024-05-06 22:04             ` Sergio Lopez Pascual
  0 siblings, 2 replies; 30+ messages in thread
From: Marc Zyngier @ 2024-05-06 16:12 UTC (permalink / raw)
  To: Sergio Lopez Pascual
  Cc: Eric Curtin, Will Deacon, Hector Martin, Catalin Marinas,
	Mark Rutland, Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown,
	Ard Biesheuvel, Mateusz Guzik, Anshuman Khandual, Oliver Upton,
	Miguel Luis, Joey Gouly, Christoph Paasch, Kees Cook,
	Sami Tolvanen, Baoquan He, Joel Granados, Dawei Li,
	Andrew Morton, Florent Revest, David Hildenbrand, Stefan Roesch,
	Andy Chiu, Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Mon, 06 May 2024 12:21:40 +0100,
Sergio Lopez Pascual <slp@redhat.com> wrote:
> 
> Eric Curtin <ecurtin@redhat.com> writes:
> 
> > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
> >>
> >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> >> > On 2024/04/11 22:28, Will Deacon wrote:
> >> > >   * Some binaries in a distribution exhibit instability which goes away
> >> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> >> > >     enabled.
> >> >
> >> > Since the flag is cleared on execve, this third one isn't generally
> >> > possible as far as I know.
> >>
> >> Ah ok, I'd missed that. Thanks.
> >>
> >> > > In all these cases, we end up with native arm64 applications that will
> >> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> >> > > feature. Assuming that the application cannot be fixed, a better
> >> > > approach would be to recompile using stronger instructions (e.g.
> >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> >> > > true that some existing CPUs are TSO by design (this is a perfectly
> >> > > valid implementation of the arm64 memory model), but I think there's a
> >> > > big difference between quietly providing more ordering guarantees than
> >> > > software may be relying on and providing a mechanism to discover,
> >> > > request and ultimately rely upon the stronger behaviour.
> >> >
> >> > The problem is "just" using stronger instructions is much more
> >> > expensive, as emulators have demonstrated. If TSO didn't serve a
> >> > practical purpose I wouldn't be submitting this, but it does. This is
> >> > basically non-negotiable for x86 emulation; if this is rejected
> >> > upstream, it will forever live as a downstream patch used by the entire
> >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> >> > explicitly targeting, given our efforts with microVMs for 4K page size
> >> > support and the upcoming Vulkan drivers).
> 
> In addition to the use case Hector exposed here, there's another,
> potentially larger one, which is running x86_64 containers on aarch64
> systems, using a combination of both Virtualization and emulation.
> 
> In this scenario, both not being able to use TSO for emulation
> and having to enable it all the time for the whole VM have a very large
> impact on performance (~25% on some workloads).

Well, there is always a price to pay somewhere, and this is the usual
trade-off between performance and maintainability.

> I understand the concern about the risk of userspace fragmentation, but
> I was wondering if we could minimize it to an acceptable level by
> narrowing down the context. For instance, since both use cases we're
> bringing to the table imply the use of Virtualization, we should be able
> to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
> (and not in nVHE nor pKVM), returning EINVAL otherwise. This would
> heavily discourage users from relying on this feature for native
> applications that can run on arbitrary contexts, hence drastically
> reducing the fragmentation risk.

As I explained in another sub-thread[1], I am not prepared to allow
non-architectural state to be exposed to a guest.  I'm also not
prepared to make significant ABI differences between VHE, nVHE, hVHE,
with or without pKVM, because the job of the kernel is to abstract
those differences.

> We would still need a way to ensure the trap gets to the VMM and for
> the VMM to operate on the impdef ACTLR_EL12, but that should be dealt on
> a different series.

The VMM can't use ACTLR_EL12, by the very definition of this register
(the clue is in the name).  You'd have to proxy the write in the
kernel and context-switch it, which means adding non-architectural
state to KVM, breaking VM migration and adding more kludges to the
existing Apple-specific host crap.

Also, let's realise that we are talking about making significant
changes to the arm64 ABI for a platform that is still not fully
supported in the upstream kernel. I have the feeling that changing the
memory model dynamically may not be of the utmost priority until then.

Thanks,

	M.

[1] https://lore.kernel.org/all/867cgcqrb9.wl-maz@kernel.org

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-06 16:12           ` Marc Zyngier
@ 2024-05-06 16:20             ` Eric Curtin
  2024-05-06 22:04             ` Sergio Lopez Pascual
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Curtin @ 2024-05-06 16:20 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Sergio Lopez Pascual, Will Deacon, Hector Martin,
	Catalin Marinas, Mark Rutland, Zayd Qumsieh, Justin Lu,
	Ryan Houdek, Mark Brown, Ard Biesheuvel, Mateusz Guzik,
	Anshuman Khandual, Oliver Upton, Miguel Luis, Joey Gouly,
	Christoph Paasch, Kees Cook, Sami Tolvanen, Baoquan He,
	Joel Granados, Dawei Li, Andrew Morton, Florent Revest,
	David Hildenbrand, Stefan Roesch, Andy Chiu, Josh Triplett,
	Oleg Nesterov, Helge Deller, Zev Weiss, Ondrej Mosnacek,
	Miguel Ojeda, linux-arm-kernel, linux-kernel, Asahi Linux

On Mon, 6 May 2024 at 17:13, Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 06 May 2024 12:21:40 +0100,
> Sergio Lopez Pascual <slp@redhat.com> wrote:
> >
> > Eric Curtin <ecurtin@redhat.com> writes:
> >
> > > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
> > >>
> > >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
> > >> > On 2024/04/11 22:28, Will Deacon wrote:
> > >> > >   * Some binaries in a distribution exhibit instability which goes away
> > >> > >     in TSO mode, so a taskset-like program is used to run them with TSO
> > >> > >     enabled.
> > >> >
> > >> > Since the flag is cleared on execve, this third one isn't generally
> > >> > possible as far as I know.
> > >>
> > >> Ah ok, I'd missed that. Thanks.
> > >>
> > >> > > In all these cases, we end up with native arm64 applications that will
> > >> > > either fail to load or will crash in subtle ways on CPUs without the TSO
> > >> > > feature. Assuming that the application cannot be fixed, a better
> > >> > > approach would be to recompile using stronger instructions (e.g.
> > >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > >> > > true that some existing CPUs are TSO by design (this is a perfectly
> > >> > > valid implementation of the arm64 memory model), but I think there's a
> > >> > > big difference between quietly providing more ordering guarantees than
> > >> > > software may be relying on and providing a mechanism to discover,
> > >> > > request and ultimately rely upon the stronger behaviour.
> > >> >
> > >> > The problem is "just" using stronger instructions is much more
> > >> > expensive, as emulators have demonstrated. If TSO didn't serve a
> > >> > practical purpose I wouldn't be submitting this, but it does. This is
> > >> > basically non-negotiable for x86 emulation; if this is rejected
> > >> > upstream, it will forever live as a downstream patch used by the entire
> > >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
> > >> > explicitly targeting, given our efforts with microVMs for 4K page size
> > >> > support and the upcoming Vulkan drivers).
> >
> > In addition to the use case Hector exposed here, there's another,
> > potentially larger one, which is running x86_64 containers on aarch64
> > systems, using a combination of both Virtualization and emulation.
> >
> > In this scenario, both not being able to use TSO for emulation
> > and having to enable it all the time for the whole VM have a very large
> > impact on performance (~25% on some workloads).
>
> Well, there is always a price to pay somewhere, and this is the usual
> trade-off between performance and maintainability.
>
> > I understand the concern about the risk of userspace fragmentation, but
> > I was wondering if we could minimize it to an acceptable level by
> > narrowing down the context. For instance, since both use cases we're
> > bringing to the table imply the use of Virtualization, we should be able
> > to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
> > (and not in nVHE nor pKVM), returning EINVAL otherwise. This would
> > heavily discourage users from relying on this feature for native
> > applications that can run on arbitrary contexts, hence drastically
> > reducing the fragmentation risk.
>
> As I explained in another sub-thread[1], I am not prepared to allow
> non architectural state to be exposed to a guest.  I'm also not
> prepared to make significant ABI differences between VHE, nVHE, hVHE,
> with or without pKVM, because the job of the kernel is to abstract
> those differences.
>
> > We would still need a way to ensure the trap gets to the VMM and for
> > the VMM to operate on the impdef ACTLR_EL12, but that should be dealt on
> > a different series.
>
> The VMM can't use ACTLR_EL12, by the very definition of this register
> (the clue is in the name).  You'd have to proxy the write in the
> kernel and context-switch it, which means adding non-architectural
> state to KVM, breaking VM migration and adding more kludges to the
> existing Apple-specific host crap.
>
> Also, let's realise that we are talking about making significant
> changes to the arm64 ABI for a platform that is still not fully
> supported in the upstream kernel. I have the feeling that changing the

Note that there are two use cases for this today: bare-metal Linux on
Apple Silicon devices and Linux VMs on macOS. The latter is fully
supported in the upstream kernel.

Apple Silicon devices have a sizeable Linux user base, both because
there is a shortage of decent local ARM development machines for Linux
and because they are decent laptop/desktop SoCs in general, including
for AI. The general performance of the SoC makes it very useful.

Is mise le meas/Regards,

Eric Curtin

> memory model dynamically may not be of the utmost priority until then.
>
> Thanks,
>
>         M.
>
> [1] https://lore.kernel.org/all/867cgcqrb9.wl-maz@kernel.org
>
> --
> Without deviation from the norm, progress is not possible.
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-06 16:12           ` Marc Zyngier
  2024-05-06 16:20             ` Eric Curtin
@ 2024-05-06 22:04             ` Sergio Lopez Pascual
  1 sibling, 0 replies; 30+ messages in thread
From: Sergio Lopez Pascual @ 2024-05-06 22:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Eric Curtin, Will Deacon, Hector Martin, Catalin Marinas,
	Mark Rutland, Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown,
	Ard Biesheuvel, Mateusz Guzik, Anshuman Khandual, Oliver Upton,
	Miguel Luis, Joey Gouly, Christoph Paasch, Kees Cook,
	Sami Tolvanen, Baoquan He, Joel Granados, Dawei Li,
	Andrew Morton, Florent Revest, David Hildenbrand, Stefan Roesch,
	Andy Chiu, Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

Marc Zyngier <maz@kernel.org> writes:

> On Mon, 06 May 2024 12:21:40 +0100,
> Sergio Lopez Pascual <slp@redhat.com> wrote:
>>
>> Eric Curtin <ecurtin@redhat.com> writes:
>>
>> > On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@kernel.org> wrote:
>> >>
>> >> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
>> >> > On 2024/04/11 22:28, Will Deacon wrote:
>> >> > >   * Some binaries in a distribution exhibit instability which goes away
>> >> > >     in TSO mode, so a taskset-like program is used to run them with TSO
>> >> > >     enabled.
>> >> >
>> >> > Since the flag is cleared on execve, this third one isn't generally
>> >> > possible as far as I know.
>> >>
>> >> Ah ok, I'd missed that. Thanks.
>> >>
>> >> > > In all these cases, we end up with native arm64 applications that will
>> >> > > either fail to load or will crash in subtle ways on CPUs without the TSO
>> >> > > feature. Assuming that the application cannot be fixed, a better
>> >> > > approach would be to recompile using stronger instructions (e.g.
>> >> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
>> >> > > true that some existing CPUs are TSO by design (this is a perfectly
>> >> > > valid implementation of the arm64 memory model), but I think there's a
>> >> > > big difference between quietly providing more ordering guarantees than
>> >> > > software may be relying on and providing a mechanism to discover,
>> >> > > request and ultimately rely upon the stronger behaviour.
>> >> >
>> >> > The problem is "just" using stronger instructions is much more
>> >> > expensive, as emulators have demonstrated. If TSO didn't serve a
>> >> > practical purpose I wouldn't be submitting this, but it does. This is
>> >> > basically non-negotiable for x86 emulation; if this is rejected
>> >> > upstream, it will forever live as a downstream patch used by the entire
>> >> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
>> >> > explicitly targeting, given our efforts with microVMs for 4K page size
>> >> > support and the upcoming Vulkan drivers).
>>
>> In addition to the use case Hector exposed here, there's another,
>> potentially larger one, which is running x86_64 containers on aarch64
>> systems, using a combination of both Virtualization and emulation.
>>
>> In this scenario, both not being able to use TSO for emulation
>> and having to enable it all the time for the whole VM have a very large
>> impact on performance (~25% on some workloads).
>
> Well, there is always a price to pay somewhere, and this is the usual
> trade-off between performance and maintainability.

Yes, and given that the impact on performance is so big, I honestly
think it's worth exploring a bit if there's an option that could keep
the maintenance cost at an acceptable level.

>> I understand the concern about the risk of userspace fragmentation, but
>> I was wondering if we could minimize it to an acceptable level by
>> narrowing down the context. For instance, since both use cases we're
>> bringing to the table imply the use of Virtualization, we should be able
>> to restrict PR_SET_MEM_MODEL to only be accepted when running on EL1
>> (and not in nVHE nor pKVM), returning EINVAL otherwise. This would
>> heavily discourage users from relying on this feature for native
>> applications that can run on arbitrary contexts, hence drastically
>> reducing the fragmentation risk.
>
> As I explained in another sub-thread[1], I am not prepared to allow
> non architectural state to be exposed to a guest.  I'm also not
> prepared to make significant ABI differences between VHE, nVHE, hVHE,
> with or without pKVM, because the job of the kernel is to abstract
> those differences.

I understand, makes sense.

>> We would still need a way to ensure the trap gets to the VMM and for
>> the VMM to operate on the impdef ACTLR_EL12, but that should be dealt on
>> a different series.
>
> The VMM can't use ACTLR_EL12, by the very definition of this register
> (the clue is in the name).  You'd have to proxy the write in the
> kernel and context-switch it, which means adding non-architectural
> state to KVM, breaking VM migration and adding more kludges to the
> existing Apple-specific host crap.

I know, I just didn't want to go into details here, because this series
is not touching any of that. But since we're already there, I'd like to
ask: do you think it would be possible and reasonable to deal with
IMPDEF registers outside of KVM, from a platform-specific module,
treating it like a paravirt feature?

In fact, if that would be acceptable, what if we treated this whole
feature as a platform-specific knob leaving both the ARM64 ABI and KVM
(mostly) aside?

I'm thinking of something along these lines:

- Host side:

  * Having vcpu load/put call into some platform-specific module that
    would be in charge of keeping track of the desired state for a
    particular context and adjusting ACTLR_EL12 as needed, relieving KVM
    from this task and avoiding polluting its structs with
    non-architectural state.

  * Either having a kernel handler for the TACR trap that would call
    into the platform-specific module, or allowing the VMM to request
    the kernel to exit to it when that trap is triggered. The latter
    would also require the module to expose a device node with an ioctl
    interface (independent from KVM's) for the VMM to request the
    desired TSO strategy for a particular thread.

  * An alternative to the previous point could be allowing the VMM to
    request KVM to start a VM with HCR_EL2.TACR = 0. This one
    would be way cheaper in CPU time, and would simplify the
    platform-specific module's job to just save/restore ACTLR_EL12 for
    that context, but I guess it could potentially introduce some
    undesired variance between VM configurations. I'm honestly open to
    both options, please let me know if you find one to be better for
    KVM.

- Guest side:

  * Wiring __switch_to() to also call the platform-specific module. Akin
    to what happens with KVM, this one would be in charge of keeping
    track of the threads that want TSO enabled, adjusting ACTLR_EL1
    accordingly.

  * Having the platform-specific module expose a device node with an
    ioctl interface for userspace applications to request TSO to be
    enabled for the current thread.

I think an approach like this would address the ARM64 userspace
fragmentation concerns, relieve KVM from carrying a platform-specific
burden and reduce the maintenance costs to a reasonable level. WDYT?
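
To make the guest-side bookkeeping above a bit more concrete, here is a
minimal userspace simulation of what such a module might track. It is
only a sketch under stated assumptions: the names (hyp_thread,
hyp_tso_request, hyp_tso_switch_to), the bit position and the
fake_actlr_el1 variable are all hypothetical stand-ins, not code from
this series or from any existing kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: placeholder bit position; the real IMPDEF bit is not
 * specified anywhere in this thread. */
#define HYP_ACTLR_TSO_BIT (UINT64_C(1) << 1)

/* Userspace stand-in for the per-CPU ACTLR_EL1 register. */
static uint64_t fake_actlr_el1;

/* Hypothetical per-thread state kept by the platform-specific module. */
struct hyp_thread {
    int tid;
    bool wants_tso;
};

/* What a (hypothetical) ioctl handler would record for the caller. */
static void hyp_tso_request(struct hyp_thread *t, bool enable)
{
    t->wants_tso = enable;
}

/*
 * Hook that a context-switch path would call for the incoming thread.
 * Returns true if the register had to be rewritten, so the expensive
 * system register write is skipped when the state already matches.
 */
static bool hyp_tso_switch_to(const struct hyp_thread *next)
{
    uint64_t want = fake_actlr_el1 & ~HYP_ACTLR_TSO_BIT;

    if (next->wants_tso)
        want |= HYP_ACTLR_TSO_BIT;

    if (want == fake_actlr_el1)
        return false;

    fake_actlr_el1 = want;
    return true;
}
```

The point of returning early when nothing changes is that switches
between two threads with the same TSO preference cost nothing extra;
only transitions between TSO and non-TSO threads touch the register.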

> Also, let's realise that we are talking about making significant
> changes to the arm64 ABI for a platform that is still not fully
> supported in the upstream kernel. I have the feeling that changing the
> memory model dynamically may not be of the utmost priority until then.

Please note this feature will also be used by Linux running in a VM on
macOS under Hypervisor.framework, so Asahi isn't the only platform. This
significantly raises the number of users who could benefit from
emulators being able to operate the TSO knob.

Thanks,
Sergio.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-04-11 13:28 ` Will Deacon
  2024-04-11 14:19   ` Hector Martin
  2024-05-02  0:16   ` Zayd Qumsieh
@ 2024-05-07 10:24   ` Alex Bennée
  2024-05-07 14:52     ` Ard Biesheuvel
  2 siblings, 1 reply; 30+ messages in thread
From: Alex Bennée @ 2024-05-07 10:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Hector Martin, Catalin Marinas, Marc Zyngier, Mark Rutland,
	Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown, Ard Biesheuvel,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

Will Deacon <will@kernel.org> writes:

> Hi Hector,
>
> On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
>> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
>> reason, x86 emulation on baseline ARM64 systems requires very expensive
>> memory model emulation. Having hardware that supports this natively is
>> therefore very attractive. Such hardware, in fact, exists. This series
>> adds support for userspace to identify when TSO is available and
>> toggle it on, if supported.
>
> I'm probably going to make myself hugely unpopular here, but I have a
> strong objection to this patch series as it stands. I firmly believe
> that providing a prctl() to query and toggle the memory model to/from
> TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
>
> It's not difficult to envisage this TSO switch being abused for native
> arm64 applications:
>
>   * A program no longer crashes when TSO is enabled, so the developer
>     just toggles TSO to meet a deadline.
>
>   * Some legacy x86 sources are being ported to arm64 but concurrency
>     is hard so the developer just enables TSO to (mostly) avoid thinking
>     about it.
>
>   * Some binaries in a distribution exhibit instability which goes away
>     in TSO mode, so a taskset-like program is used to run them with TSO
>     enabled.

These all just seem like cases of engineers hiding from their very real
problems. I don't know if it's really the kernel's place to avoid giving
them the foot gun. Would it assuage your concerns at all if we set a
taint flag so bug reports/core dumps indicated we were in a
non-architectural memory mode?

> In all these cases, we end up with native arm64 applications that will
> either fail to load or will crash in subtle ways on CPUs without the TSO
> feature. Assuming that the application cannot be fixed, a better
> approach would be to recompile using stronger instructions (e.g.
> LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> true that some existing CPUs are TSO by design (this is a perfectly
> valid implementation of the arm64 memory model), but I think there's a
> big difference between quietly providing more ordering guarantees than
> software may be relying on and providing a mechanism to discover,
> request and ultimately rely upon the stronger behaviour.

I think the main use case here is for emulation. When we run x86-on-arm
in QEMU we do currently insert lots of extra barrier instructions on
every load and store. If we can probe and set a TSO mode I can assure
you we'll do the right thing ;-)
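
As a rough illustration of "probe and set", an emulator could try the
prctl at startup and fall back to explicit barriers when it fails. This
is a sketch only: the PR_SET_MEM_MODEL constants below are assumed
values for the interface proposed in this series and are not in
mainline headers, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <sys/prctl.h>

/* ASSUMPTION: values for the prctl proposed in this series. They are
 * not part of any released kernel's UAPI headers. */
#ifndef PR_SET_MEM_MODEL
#define PR_SET_MEM_MODEL         0x4d4d444c
#define PR_SET_MEM_MODEL_DEFAULT 0
#define PR_SET_MEM_MODEL_TSO     1
#endif

/* Hypothetical backend modes for an x86-on-arm64 code generator. */
enum barrier_mode {
    BARRIERS_EXPLICIT, /* emit barriers/acquire-release per access */
    BARRIERS_ELIDED    /* hardware TSO is on, plain loads/stores */
};

/* Ask the kernel for TSO; any failure (e.g. EINVAL on a kernel or CPU
 * without support) simply means "not available". */
static bool try_enable_tso(void)
{
    return prctl(PR_SET_MEM_MODEL,
                 (unsigned long)PR_SET_MEM_MODEL_TSO,
                 0UL, 0UL, 0UL) == 0;
}

/* Decide once, at startup, how guest memory accesses get translated. */
static enum barrier_mode pick_barrier_mode(void)
{
    return try_enable_tso() ? BARRIERS_ELIDED : BARRIERS_EXPLICIT;
}
```

On a kernel without the series the call is expected to fail, so the
emulator keeps its current, portable behaviour and nothing breaks.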

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-07 10:24   ` Alex Bennée
@ 2024-05-07 14:52     ` Ard Biesheuvel
  2024-05-09 11:13       ` Catalin Marinas
  0 siblings, 1 reply; 30+ messages in thread
From: Ard Biesheuvel @ 2024-05-07 14:52 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Will Deacon, Hector Martin, Catalin Marinas, Marc Zyngier,
	Mark Rutland, Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Will Deacon <will@kernel.org> writes:
>
> > Hi Hector,
> >
> > On Thu, Apr 11, 2024 at 09:51:19AM +0900, Hector Martin wrote:
> >> x86 CPUs implement a stricter memory model than ARM64 (TSO). For this
> >> reason, x86 emulation on baseline ARM64 systems requires very expensive
> >> memory model emulation. Having hardware that supports this natively is
> >> therefore very attractive. Such hardware, in fact, exists. This series
> >> adds support for userspace to identify when TSO is available and
> >> toggle it on, if supported.
> >
> > I'm probably going to make myself hugely unpopular here, but I have a
> > strong objection to this patch series as it stands. I firmly believe
> > that providing a prctl() to query and toggle the memory model to/from
> > TSO is going to lead to subtle fragmentation of arm64 Linux userspace.
> >
> > It's not difficult to envisage this TSO switch being abused for native
> > arm64 applications:
> >
> >   * A program no longer crashes when TSO is enabled, so the developer
> >     just toggles TSO to meet a deadline.
> >
> >   * Some legacy x86 sources are being ported to arm64 but concurrency
> >     is hard so the developer just enables TSO to (mostly) avoid thinking
> >     about it.
> >
> >   * Some binaries in a distribution exhibit instability which goes away
> >     in TSO mode, so a taskset-like program is used to run them with TSO
> >     enabled.
>
> These all just seem like cases of engineers hiding from their very real
> problems. I don't know if its really the kernels place to avoid giving
> them the foot gun. Would it assuage your concerns at all if we set a
> taint flag so bug reports/core dumps indicated we were in a
> non-architectural memory mode?
>
> > In all these cases, we end up with native arm64 applications that will
> > either fail to load or will crash in subtle ways on CPUs without the TSO
> > feature. Assuming that the application cannot be fixed, a better
> > approach would be to recompile using stronger instructions (e.g.
> > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
> > true that some existing CPUs are TSO by design (this is a perfectly
> > valid implementation of the arm64 memory model), but I think there's a
> > big difference between quietly providing more ordering guarantees than
> > software may be relying on and providing a mechanism to discover,
> > request and ultimately rely upon the stronger behaviour.
>
> I think the main use case here is for emulation. When we run x86-on-arm
> in QEMU we do currently insert lots of extra barrier instructions on
> every load and store. If we can probe and set a TSO mode I can assure
> you we'll do the right thing ;-)
>

Without a public specification of what TSO mode actually entails,
deciding which of those barriers can be dropped is not going to be as
straightforward as you make it out to be.

Apple's TSO mode is vertically integrated with Rosetta, which means
that TSO mode provides whatever Rosetta needs to run x86 code
correctly, and that it could mean different things on different
generations of the micro-architecture. And whether Apple's TSO is the
same as Fujitsu's is anyone's guess afaik.

Running a game and seeing it perform better is great, but it is not
the kind of rigor we usually attempt to apply when adding support for
architectural features. Hopefully, there will be some architectural
support for this in the future, but without any spec that defines the
memory model it implements, I am not convinced we should merge this.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-07 14:52     ` Ard Biesheuvel
@ 2024-05-09 11:13       ` Catalin Marinas
  2024-05-09 12:31         ` Neal Gompa
  0 siblings, 1 reply; 30+ messages in thread
From: Catalin Marinas @ 2024-05-09 11:13 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Alex Bennée, Will Deacon, Hector Martin, Marc Zyngier,
	Mark Rutland, Zayd Qumsieh, Justin Lu, Ryan Houdek, Mark Brown,
	Mateusz Guzik, Anshuman Khandual, Oliver Upton, Miguel Luis,
	Joey Gouly, Christoph Paasch, Kees Cook, Sami Tolvanen,
	Baoquan He, Joel Granados, Dawei Li, Andrew Morton,
	Florent Revest, David Hildenbrand, Stefan Roesch, Andy Chiu,
	Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Tue, May 07, 2024 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
> > I think the main use case here is for emulation. When we run x86-on-arm
> > in QEMU we do currently insert lots of extra barrier instructions on
> > every load and store. If we can probe and set a TSO mode I can assure
> > you we'll do the right thing ;-)
> 
> Without a public specification of what TSO mode actually entails,
> deciding which of those barriers can be dropped is not going to be as
> straight-forward as you make it out to be.
> 
> Apple's TSO mode is vertically integrated with Rosetta, which means
> that TSO mode provides whatever Rosetta needs to run x86 code
> correctly, and that it could mean different things on different
> generations of the micro-architecture. And whether Apple's TSO is the
> same as Fujitsu's is anyone's guess afaik.

Indeed. Apart from using impdef registers, that's what I think is the
second biggest problem with this feature (and the corresponding
patches). We don't know the precise memory model, we can't tell whether
this TSO bit is stored in the TLB. If it is, is it per ASID/VMID? The
other problem Marc raised is: what is the memory model between two CPUs
when only one has the TSO bit set? Does it only break the TSO model or is
there a chance that it also breaks the default relaxed model? What other
TSO flavours are out there, how do they compare with the Apple one?

> Running a game and seeing it perform better is great, but it is not
> the kind of rigor we usually attempt to apply when adding support for
> architectural features. Hopefully, there will be some architectural
> support for this in the future, but without any spec that defines the
> memory model it implements, I am not convinced we should merge this.

There is FEAT_LRCPC (available on Apple Silicon from M2 onwards). Rather
than having a big knob to turn TSO on or off, this feature introduces
instructions that permit a code generator to get the TSO semantics in a
more efficient way (e.g. using LDAPR+STLR instead of the stricter
LDAR+STLR; not sure how well these are implemented on the Apple
Silicon). There are further improvements in FEAT_LRCPC{2,3} (with the
latter adding support for SIMD but not available in hardware yet). So
the direction from Arm is pretty clear, acknowledging that there is a
need for such TSO emulation but not in the way of undocumented impdef
registers. Whether more is needed here, I guess people working on
emulators could reach out to Arm or CPU vendors with suggestions (the
path to the architects is not straightforward, usually legal has a say,
but it's doable, there are formal channels already).
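
For illustration, the per-access approach can be sketched in portable
C11 rather than raw assembly: every guest load becomes an acquire load
and every guest store a release store. On arm64, compilers lower the
release store to STLR and the acquire load to LDAR, or possibly to
LDAPR when built for a target with FEAT_LRCPC (for example with
-march=armv8.3-a+rcpc on recent toolchains). The helper names are
hypothetical, and this is a sketch of the idea, not a claim about how
any particular emulator implements it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helpers for translated guest memory accesses. */
static inline uint32_t guest_load32(_Atomic uint32_t *addr)
{
    /* Acquire: later accesses cannot be reordered before this load. */
    return atomic_load_explicit(addr, memory_order_acquire);
}

static inline void guest_store32(_Atomic uint32_t *addr, uint32_t val)
{
    /* Release: earlier accesses cannot be reordered after this store. */
    atomic_store_explicit(addr, val, memory_order_release);
}

/* Message passing, the classic pattern x86 code gets "for free":
 * write the payload, then raise the flag. */
static void publish(_Atomic uint32_t *data, _Atomic uint32_t *flag,
                    uint32_t val)
{
    guest_store32(data, val);
    guest_store32(flag, 1);
}

/* Returns 1 and fills *out if the flag was observed, else 0. Because
 * the flag load is an acquire, a reader that sees flag == 1 also sees
 * the payload written before it. */
static int try_consume(_Atomic uint32_t *data, _Atomic uint32_t *flag,
                       uint32_t *out)
{
    if (guest_load32(flag) == 0)
        return 0;
    *out = guest_load32(data);
    return 1;
}
```

Store-to-load reordering is still permitted here, which matches what
TSO allows; the cost question raised in this thread is how fast the
hardware executes these instructions compared to plain loads and stores
with a global TSO switch.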

I see the impdef hardware TSO options as temporary until CPU
implementations catch up to architected FEAT_LRCPC*. Given the problems
already stated in this thread, I think such hacks should be carried
downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
currently make an emulation faster than FEAT_LRCPC* but that's feedback
to go to the microarchitects on the implementation (or architects on
what other instructions should be covered).

-- 
Catalin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-09 11:13       ` Catalin Marinas
@ 2024-05-09 12:31         ` Neal Gompa
  2024-05-09 12:56           ` Catalin Marinas
  0 siblings, 1 reply; 30+ messages in thread
From: Neal Gompa @ 2024-05-09 12:31 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Ard Biesheuvel, Alex Bennée, Will Deacon, Hector Martin,
	Marc Zyngier, Mark Rutland, Zayd Qumsieh, Justin Lu, Ryan Houdek,
	Mark Brown, Mateusz Guzik, Anshuman Khandual, Oliver Upton,
	Miguel Luis, Joey Gouly, Christoph Paasch, Kees Cook,
	Sami Tolvanen, Baoquan He, Joel Granados, Dawei Li,
	Andrew Morton, Florent Revest, David Hildenbrand, Stefan Roesch,
	Andy Chiu, Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Thu, May 9, 2024 at 5:13 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Tue, May 07, 2024 at 04:52:30PM +0200, Ard Biesheuvel wrote:
> > On Tue, 7 May 2024 at 12:24, Alex Bennée <alex.bennee@linaro.org> wrote:
> > > I think the main use case here is for emulation. When we run x86-on-arm
> > > in QEMU we do currently insert lots of extra barrier instructions on
> > > every load and store. If we can probe and set a TSO mode I can assure
> > > you we'll do the right thing ;-)
> >
> > Without a public specification of what TSO mode actually entails,
> > deciding which of those barriers can be dropped is not going to be as
> > straight-forward as you make it out to be.
> >
> > Apple's TSO mode is vertically integrated with Rosetta, which means
> > that TSO mode provides whatever Rosetta needs to run x86 code
> > correctly, and that it could mean different things on different
> > generations of the micro-architecture. And whether Apple's TSO is the
> > same as Fujitsu's is anyone's guess afaik.
>
> Indeed. Apart from using impdef registers, that's what I think is the
> second biggest problem with this feature (and the corresponding
> patches). We don't know the precise memory model, we can't tell whether
> this TSO bit is stored in the TLB. If it is, is it per ASID/VMID? The
> other problem Marc raised is what the memory model is between two CPUs
> where only one has the TSO bit set. Does it only break the TSO model or is
> there a chance that it also breaks the default relaxed model? What other
> TSO flavours are out there, how do they compare with the Apple one?
>
> > Running a game and seeing it perform better is great, but it is not
> > the kind of rigor we usually attempt to apply when adding support for
> > architectural features. Hopefully, there will be some architectural
> > support for this in the future, but without any spec that defines the
> > memory model it implements, I am not convinced we should merge this.
>
> There is FEAT_LRCPC (available on Apple Silicon from M2 onwards). Rather
> than having a big knob to turn TSO on or off, this feature introduces
> instructions that permit a code generator to get the TSO semantics in a
> more efficient way (e.g. using LDAPR+STLR instead of the stricter
> LDAR+STLR; not sure how well these are implemented on the Apple
> Silicon). There are further improvements in FEAT_LRCPC{2,3} (with the
> latter adding support for SIMD but not available in hardware yet). So
> the direction from Arm is pretty clear, acknowledging that there is a
> need for such TSO emulation but not in the way of undocumented impdef
> registers. Whether more is needed here, I guess people working on
> emulators could reach out to Arm or CPU vendors with suggestions (the
> path to the architects is not straightforward, usually legal has a say,
> but it's doable, there are formal channels already).
>
> I see the impdef hardware TSO options as temporary until CPU
> implementations catch up to architected FEAT_LRCPC*. Given the problems
> already stated in this thread, I think such hacks should be carried
> downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
> currently make an emulation faster than FEAT_LRCPC* but that's feedback
> to go to the microarchitects on the implementation (or architects on
> what other instructions should be covered).
>

They cannot ever "vanish" because we are supporting every Mx platform
back to the first one. The M1 series will never have FEAT_LRCPC.

I do not think it is unreasonable to support this method when we know
what the CPU platform is and FEAT_LRCPC does not exist.



--
真実はいつも一つ!/ Always, there's only one truth!


* Re: [PATCH 0/4] arm64: Support the TSO memory model
  2024-05-09 12:31         ` Neal Gompa
@ 2024-05-09 12:56           ` Catalin Marinas
  0 siblings, 0 replies; 30+ messages in thread
From: Catalin Marinas @ 2024-05-09 12:56 UTC (permalink / raw)
  To: Neal Gompa
  Cc: Ard Biesheuvel, Alex Bennée, Will Deacon, Hector Martin,
	Marc Zyngier, Mark Rutland, Zayd Qumsieh, Justin Lu, Ryan Houdek,
	Mark Brown, Mateusz Guzik, Anshuman Khandual, Oliver Upton,
	Miguel Luis, Joey Gouly, Christoph Paasch, Kees Cook,
	Sami Tolvanen, Baoquan He, Joel Granados, Dawei Li,
	Andrew Morton, Florent Revest, David Hildenbrand, Stefan Roesch,
	Andy Chiu, Josh Triplett, Oleg Nesterov, Helge Deller, Zev Weiss,
	Ondrej Mosnacek, Miguel Ojeda, linux-arm-kernel, linux-kernel,
	Asahi Linux

On Thu, May 09, 2024 at 06:31:04AM -0600, Neal Gompa wrote:
> On Thu, May 9, 2024 at 5:13 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
> > I see the impdef hardware TSO options as temporary until CPU
> > implementations catch up to architected FEAT_LRCPC*. Given the problems
> > already stated in this thread, I think such hacks should be carried
> > downstream and (hopefully) will eventually vanish. Maybe those TSO knobs
> > currently make an emulation faster than FEAT_LRCPC* but that's feedback
> > to go to the microarchitects on the implementation (or architects on
> > what other instructions should be covered).
> 
> They cannot ever "vanish" because we are supporting every Mx platform
> back to the first one. The M1 series will never have FEAT_LRCPC.

Well, you missed "eventually". It depends on the timeline you have in
mind but, say, 15 years from now there may not be enough M1s around to
make maintaining these patches out-of-tree worthwhile (and they don't
make sense in-tree either because of the lack of standardisation).

> I do not think it is unreasonable to support this method when we know
> what the CPU platform is and FEAT_LRCPC does not exist.

If you want a portable emulator, you had better start supporting
FEAT_LRCPC* (I think FEX does this), ideally detected at run-time with a
fallback to RCsc. Whether you additionally want to support the
non-portable Apple TSO with out-of-tree patches is up to you.
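
As a rough illustration of the run-time detection with an RCsc fallback
described above, here is a minimal sketch in C. It assumes a Linux arm64
userspace where getauxval(AT_HWCAP) reports the FEAT_LRCPC and FEAT_LRCPC2
hwcap bits. The EX_HWCAP_* macros, the enum and the function names are
hypothetical, introduced only for this example; the bit positions are meant to
mirror HWCAP_LRCPC and HWCAP_ILRCPC from the arm64 <asm/hwcap.h> and should be
checked against your kernel headers. FEAT_LRCPC3 is deliberately left out.

```c
#include <stdio.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#endif

/*
 * Assumption: these mirror HWCAP_LRCPC and HWCAP_ILRCPC from the Linux
 * arm64 uapi header <asm/hwcap.h>. They are redefined under hypothetical
 * names so the sketch also builds on non-arm64 hosts.
 */
#define EX_HWCAP_LRCPC  (1UL << 15)   /* FEAT_LRCPC:  LDAPR */
#define EX_HWCAP_ILRCPC (1UL << 26)   /* FEAT_LRCPC2: LDAPUR/STLUR */

/* Hypothetical strategies an emulator's code generator could choose from. */
enum ordering_strategy {
    ORDER_RCSC,   /* fallback: LDAR + STLR */
    ORDER_RCPC,   /* LDAPR + STLR */
    ORDER_RCPC2   /* as above, plus the unscaled-offset forms */
};

/* Pick the weakest ordering primitive the hwcaps say is available. */
enum ordering_strategy pick_strategy(unsigned long hwcap)
{
    if (hwcap & EX_HWCAP_ILRCPC)
        return ORDER_RCPC2;
    if (hwcap & EX_HWCAP_LRCPC)
        return ORDER_RCPC;
    return ORDER_RCSC;
}

/* Read the hwcaps; returns 0 (forcing the RCsc fallback) off arm64 Linux. */
unsigned long read_hwcap(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return getauxval(AT_HWCAP);
#else
    return 0;
#endif
}

/* Human-readable name, e.g. for a startup log line. */
const char *strategy_name(enum ordering_strategy s)
{
    switch (s) {
    case ORDER_RCPC2: return "RCpc (FEAT_LRCPC2)";
    case ORDER_RCPC:  return "RCpc (FEAT_LRCPC)";
    default:          return "RCsc fallback";
    }
}
```

An emulator would call pick_strategy(read_hwcap()) once at startup and select
its load/store lowering accordingly. Any non-portable Apple TSO path would be a
separate, additional check and is not covered here.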

-- 
Catalin


end of thread, last message 2024-05-09 12:56 UTC

Thread overview: 30+ messages
2024-04-11  0:51 [PATCH 0/4] arm64: Support the TSO memory model Hector Martin
2024-04-11  0:51 ` [PATCH 1/4] prctl: Introduce PR_{SET,GET}_MEM_MODEL Hector Martin
2024-04-11  0:51 ` [PATCH 2/4] arm64: Implement PR_{GET,SET}_MEM_MODEL for always-TSO CPUs Hector Martin
2024-04-11  0:51 ` [PATCH 3/4] arm64: Introduce scaffolding to add ACTLR_EL1 to thread state Hector Martin
2024-04-11  0:51 ` [PATCH 4/4] arm64: Implement Apple IMPDEF TSO memory model control Hector Martin
2024-04-11  1:37 ` [PATCH 0/4] arm64: Support the TSO memory model Neal Gompa
2024-04-11 13:28 ` Will Deacon
2024-04-11 14:19   ` Hector Martin
2024-04-11 18:43     ` Hector Martin
2024-04-16  2:22       ` Zayd Qumsieh
2024-04-19 16:58         ` Will Deacon
2024-04-19 18:05           ` Catalin Marinas
2024-04-19 16:58     ` Will Deacon
2024-04-20 11:37       ` Marc Zyngier
2024-05-02  0:10         ` Zayd Qumsieh
2024-05-02 13:25           ` Marc Zyngier
2024-05-06  8:20             ` Jonas Oberhauser
2024-04-20 12:13       ` Eric Curtin
2024-04-20 12:15         ` Eric Curtin
2024-05-06 11:21         ` Sergio Lopez Pascual
2024-05-06 16:12           ` Marc Zyngier
2024-05-06 16:20             ` Eric Curtin
2024-05-06 22:04             ` Sergio Lopez Pascual
2024-05-02  0:16   ` Zayd Qumsieh
2024-05-07 10:24   ` Alex Bennée
2024-05-07 14:52     ` Ard Biesheuvel
2024-05-09 11:13       ` Catalin Marinas
2024-05-09 12:31         ` Neal Gompa
2024-05-09 12:56           ` Catalin Marinas
2024-04-16  2:11 ` Zayd Qumsieh
