* [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
@ 2022-01-31 20:55 Mathieu Desnoyers
  2022-01-31 20:55 ` [RFC PATCH 2/2] selftests/rseq: Implement rseq numa node id field selftest Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-01-31 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Adding the NUMA node id to struct rseq is a straightforward thing to do,
and a good way to figure out if anything in the user-space ecosystem
prevents extending struct rseq.

This NUMA node id field allows memory allocators such as tcmalloc to
take advantage of fast access to the current NUMA node id to perform
NUMA-aware memory allocation.

It is also useful for implementing NUMA-aware user-space mutexes.
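
For illustration, a minimal user-space sketch of how an allocator could pick a
per-node arena with this field (not part of this patch; it assumes the thread
has registered the extended struct rseq at "rseq_area", and the arena helpers
are hypothetical):

	#include <stddef.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/rseq.h>

	/* Hypothetical per-NUMA-node arena allocator. */
	struct arena;
	extern struct arena per_node_arena[];
	extern void *arena_alloc(struct arena *arena, size_t size);

	/*
	 * Thread-local rseq area, registered via rseq(2) with
	 * rseq_len = offsetofend(struct rseq, node_id).
	 */
	extern __thread volatile struct rseq rseq_area;

	static inline int32_t current_numa_node(void)
	{
		int32_t node = rseq_area.node_id;

		if (node < 0) {
			/* rseq uninitialized: fall back to getcpu(2). */
			unsigned int cpu, n;

			if (syscall(__NR_getcpu, &cpu, &n, NULL))
				return 0;
			node = n;
		}
		return node;
	}

	/* NUMA-aware allocation: serve from the arena of the current node. */
	void *numa_local_alloc(size_t size)
	{
		return arena_alloc(&per_node_arena[current_numa_node()], size);
	}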

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/sched.h       |  4 ++
 include/trace/events/rseq.h |  4 +-
 include/uapi/linux/rseq.h   | 54 ++++++++++++++++++++--
 kernel/ptrace.c             |  2 +-
 kernel/rseq.c               | 91 +++++++++++++++++++++++++++----------
 5 files changed, 124 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 508b91d57470..838c9e0b4cae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1291,6 +1291,7 @@ struct task_struct {
 
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
+	u32 rseq_len;
 	u32 rseq_sig;
 	/*
 	 * RmW on rseq_event_mask must be performed atomically
@@ -2260,10 +2261,12 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
 	if (clone_flags & CLONE_VM) {
 		t->rseq = NULL;
+		t->rseq_len = 0;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
 	} else {
 		t->rseq = current->rseq;
+		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
 	}
@@ -2272,6 +2275,7 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 static inline void rseq_execve(struct task_struct *t)
 {
 	t->rseq = NULL;
+	t->rseq_len = 0;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
 }
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index a04a64bc1a00..6bd442697354 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -16,13 +16,15 @@ TRACE_EVENT(rseq_update,
 
 	TP_STRUCT__entry(
 		__field(s32, cpu_id)
+		__field(s32, node_id)
 	),
 
 	TP_fast_assign(
 		__entry->cpu_id = raw_smp_processor_id();
+		__entry->node_id = cpu_to_node(raw_smp_processor_id());
 	),
 
-	TP_printk("cpu_id=%d", __entry->cpu_id)
+	TP_printk("cpu_id=%d node_id=%d", __entry->cpu_id, __entry->node_id)
 );
 
 TRACE_EVENT(rseq_ip_fixup,
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 77ee207623a9..006373a15ebe 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -13,9 +13,9 @@
 #include <linux/types.h>
 #include <asm/byteorder.h>
 
-enum rseq_cpu_id_state {
-	RSEQ_CPU_ID_UNINITIALIZED		= -1,
-	RSEQ_CPU_ID_REGISTRATION_FAILED		= -2,
+enum rseq_id_state {
+	RSEQ_ID_UNINITIALIZED			= -1,
+	RSEQ_ID_REGISTRATION_FAILED		= -2,
 };
 
 enum rseq_flags {
@@ -78,7 +78,7 @@ struct rseq {
 	 * Read by user-space with single-copy atomicity semantics. This
 	 * field should only be read by the thread which registered this
 	 * data structure. Aligned on 32-bit. Values
-	 * RSEQ_CPU_ID_UNINITIALIZED and RSEQ_CPU_ID_REGISTRATION_FAILED
+	 * RSEQ_ID_UNINITIALIZED and RSEQ_ID_REGISTRATION_FAILED
 	 * have a special semantic: the former means "rseq uninitialized",
 	 * and latter means "rseq initialization failed". This value is
 	 * meant to be read within rseq critical sections and compared
@@ -130,6 +130,52 @@ struct rseq {
 	 *     this thread.
 	 */
 	__u32 flags;
+
+	__u32 padding1[3];
+
+	/*
+	 * This is the end of the original rseq ABI.
+	 * This is a valid end of rseq ABI for the purpose of rseq registration
+	 * rseq_len.
+	 * The original rseq ABI uses "sizeof(struct rseq)" on registration,
+	 * thus requiring the padding above.
+	 */
+
+	/*
+	 * Restartable sequences node_id_start field. Updated by the
+	 * kernel. Read by user-space with single-copy atomicity
+	 * semantics. This field should only be read by the thread which
+	 * registered this data structure. Aligned on 32-bit. Always
+	 * contains a value in the range of possible NUMA node IDs, although the
+	 * value may not be the actual current NUMA node ID (e.g. if rseq is not
+	 * initialized). This NUMA node ID number value should always be compared
+	 * against the value of the node_id field before performing a rseq
+	 * commit or returning a value read from a data structure indexed using
+	 * the node_id_start value.
+	 */
+	__u32 node_id_start;
+
+	/*
+	 * Restartable sequences node_id field. Updated by the kernel.
+	 * Read by user-space with single-copy atomicity semantics. This
+	 * field should only be read by the thread which registered this
+	 * data structure. Aligned on 32-bit. Values
+	 * RSEQ_ID_UNINITIALIZED and RSEQ_ID_REGISTRATION_FAILED
+	 * have a special semantic: the former means "rseq uninitialized",
+	 * and the latter means "rseq initialization failed". This value is
+	 * meant to be read within rseq critical sections and compared
+	 * with the node_id_start value previously read, before performing
+	 * the commit instruction, or read and compared with the
+	 * node_id_start value before returning a value loaded from a data
+	 * structure indexed using the node_id_start value.
+	 */
+	__u32 node_id;
+
+	/*
+	 * This is a valid end of rseq ABI for the purpose of rseq registration
+	 * rseq_len. Use the offset immediately after the node_id field as
+	 * rseq_len.
+	 */
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index eea265082e97..f5edde5b7805 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -800,7 +800,7 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
 {
 	struct ptrace_rseq_configuration conf = {
 		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
-		.rseq_abi_size = sizeof(*task->rseq),
+		.rseq_abi_size = task->rseq_len,
 		.signature = task->rseq_sig,
 		.flags = 0,
 	};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 97ac20b4f738..2381a8a1b90d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -81,15 +81,27 @@
  *   F1. <failure>
  */
 
-static int rseq_update_cpu_id(struct task_struct *t)
+static int rseq_update_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id = raw_smp_processor_id();
 	struct rseq __user *rseq = t->rseq;
+	u32 cpu_id = raw_smp_processor_id();
+	u32 node_id;
 
-	if (!user_write_access_begin(rseq, sizeof(*rseq)))
+	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
-	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
-	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+	switch (t->rseq_len) {
+	case offsetofend(struct rseq, node_id):
+		node_id = cpu_to_node(cpu_id);
+		unsafe_put_user(node_id, &rseq->node_id_start, efault_end);
+		unsafe_put_user(node_id, &rseq->node_id, efault_end);
+		fallthrough;
+	case offsetofend(struct rseq, padding1):
+		unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+		unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+		break;
+	default:
+		goto efault_end;
+	}
 	user_write_access_end();
 	trace_rseq_update(t);
 	return 0;
@@ -100,22 +112,42 @@ static int rseq_update_cpu_id(struct task_struct *t)
 	return -EFAULT;
 }
 
-static int rseq_reset_rseq_cpu_id(struct task_struct *t)
+static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+	u32 id_start = 0, id = RSEQ_ID_UNINITIALIZED;
 
-	/*
-	 * Reset cpu_id_start to its initial state (0).
-	 */
-	if (put_user(cpu_id_start, &t->rseq->cpu_id_start))
-		return -EFAULT;
-	/*
-	 * Reset cpu_id to RSEQ_CPU_ID_UNINITIALIZED, so any user coming
-	 * in after unregistration can figure out that rseq needs to be
-	 * registered again.
-	 */
-	if (put_user(cpu_id, &t->rseq->cpu_id))
-		return -EFAULT;
+	switch (t->rseq_len) {
+	case offsetofend(struct rseq, node_id):
+		/*
+		 * Reset node_id_start to its initial state (0).
+		 */
+		if (put_user(id_start, &t->rseq->node_id_start))
+			return -EFAULT;
+		/*
+		 * Reset node_id to RSEQ_ID_UNINITIALIZED, so any user coming in after
+		 * unregistration can figure out that rseq needs to be registered
+		 * again.
+		 */
+		if (put_user(id, &t->rseq->node_id))
+			return -EFAULT;
+		fallthrough;
+	case offsetofend(struct rseq, padding1):
+		/*
+		 * Reset cpu_id_start to its initial state (0).
+		 */
+		if (put_user(id_start, &t->rseq->cpu_id_start))
+			return -EFAULT;
+		/*
+		 * Reset cpu_id to RSEQ_ID_UNINITIALIZED, so any user coming in after
+		 * unregistration can figure out that rseq needs to be registered
+		 * again.
+		 */
+		if (put_user(id, &t->rseq->cpu_id))
+			return -EFAULT;
+		break;
+	default:
+		return -EINVAL;
+	}
 	return 0;
 }
 
@@ -293,7 +325,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 		if (unlikely(ret < 0))
 			goto error;
 	}
-	if (unlikely(rseq_update_cpu_id(t)))
+	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
 	return;
 
@@ -336,15 +368,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		/* Unregister rseq for current thread. */
 		if (current->rseq != rseq || !current->rseq)
 			return -EINVAL;
-		if (rseq_len != sizeof(*rseq))
+		if (rseq_len != current->rseq_len)
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
-		ret = rseq_reset_rseq_cpu_id(current);
+		ret = rseq_reset_rseq_cpu_node_id(current);
 		if (ret)
 			return ret;
 		current->rseq = NULL;
 		current->rseq_sig = 0;
+		current->rseq_len = 0;
 		return 0;
 	}
 
@@ -357,7 +390,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		 * the provided address differs from the prior
 		 * one.
 		 */
-		if (current->rseq != rseq || rseq_len != sizeof(*rseq))
+		if (current->rseq != rseq || rseq_len != current->rseq_len)
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
@@ -369,12 +402,20 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	 * If there was no rseq previously registered,
 	 * ensure the provided rseq is properly aligned and valid.
 	 */
-	if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) ||
-	    rseq_len != sizeof(*rseq))
+	if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)))
 		return -EINVAL;
+	switch (rseq_len) {
+	case offsetofend(struct rseq, node_id):
+		fallthrough;
+	case offsetofend(struct rseq, padding1):
+		break;
+	default:
+		return -EINVAL;
+	}
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 	current->rseq = rseq;
+	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
 	/*
 	 * If rseq was previously inactive, and has just been
-- 
2.17.1



* [RFC PATCH 2/2] selftests/rseq: Implement rseq numa node id field selftest
  2022-01-31 20:55 [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Mathieu Desnoyers
@ 2022-01-31 20:55 ` Mathieu Desnoyers
  2022-01-31 21:19 ` [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Florian Weimer
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-01-31 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Test the NUMA node id extension rseq field. Compare it against the value
returned by the getcpu(2) system call while pinned on a specific core.
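
In essence, after pinning the thread to each allowed CPU, the added assertions
boil down to something like the following sketch (the actual code is in
basic_test.c; the rseq_current_node*() and rseq_node_start() helpers come from
the selftest rseq.h):

	#include <assert.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include "rseq.h"

	static void check_node_on_pinned_cpu(int expected_cpu)
	{
		unsigned int cpu, node;

		if (syscall(__NR_getcpu, &cpu, &node, NULL) == 0) {
			assert(cpu == (unsigned int)expected_cpu);
			assert(rseq_current_node() == node);
			assert(rseq_current_node_raw() == (int32_t)node);
			assert(rseq_node_start() == node);
		}
	}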

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/basic_test.c |  6 +++
 tools/testing/selftests/rseq/rseq-abi.h   | 51 +++++++++++++++++++++--
 tools/testing/selftests/rseq/rseq.c       | 37 ++++++++++++++--
 tools/testing/selftests/rseq/rseq.h       | 40 ++++++++++++++++++
 4 files changed, 127 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
index d8efbfb89193..9601db964b72 100644
--- a/tools/testing/selftests/rseq/basic_test.c
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -22,6 +22,8 @@ void test_cpu_pointer(void)
 	CPU_ZERO(&test_affinity);
 	for (i = 0; i < CPU_SETSIZE; i++) {
 		if (CPU_ISSET(i, &affinity)) {
+			int node;
+
 			CPU_SET(i, &test_affinity);
 			sched_setaffinity(0, sizeof(test_affinity),
 					&test_affinity);
@@ -29,6 +31,10 @@ void test_cpu_pointer(void)
 			assert(rseq_current_cpu() == i);
 			assert(rseq_current_cpu_raw() == i);
 			assert(rseq_cpu_start() == i);
+			node = rseq_fallback_current_node();
+			assert(rseq_current_node() == node);
+			assert(rseq_current_node_raw() == node);
+			assert(rseq_node_start() == node);
 			CPU_CLR(i, &test_affinity);
 		}
 	}
diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index a8c44d9af71f..7aba1cc0990b 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -13,9 +13,9 @@
 #include <linux/types.h>
 #include <asm/byteorder.h>
 
-enum rseq_abi_cpu_id_state {
-	RSEQ_ABI_CPU_ID_UNINITIALIZED			= -1,
-	RSEQ_ABI_CPU_ID_REGISTRATION_FAILED		= -2,
+enum rseq_abi_id_state {
+	RSEQ_ABI_ID_UNINITIALIZED			= -1,
+	RSEQ_ABI_ID_REGISTRATION_FAILED			= -2,
 };
 
 enum rseq_abi_flags {
@@ -146,6 +146,51 @@ struct rseq_abi {
 	 *     this thread.
 	 */
 	__u32 flags;
+	__u32 padding1[3];
+
+	/*
+	 * This is the end of the original rseq ABI.
+	 * This is a valid end of rseq ABI for the purpose of rseq registration
+	 * rseq_len.
+	 * The original rseq ABI uses "sizeof(struct rseq)" on registration,
+	 * thus requiring the padding above.
+	 */
+
+	/*
+	 * Restartable sequences node_id_start field. Updated by the
+	 * kernel. Read by user-space with single-copy atomicity
+	 * semantics. This field should only be read by the thread which
+	 * registered this data structure. Aligned on 32-bit. Always
+	 * contains a value in the range of possible NUMA node IDs, although the
+	 * value may not be the actual current NUMA node ID (e.g. if rseq is not
+	 * initialized). This NUMA node ID number value should always be compared
+	 * against the value of the node_id field before performing a rseq
+	 * commit or returning a value read from a data structure indexed using
+	 * the node_id_start value.
+	 */
+	__u32 node_id_start;
+
+	/*
+	 * Restartable sequences node_id field. Updated by the kernel.
+	 * Read by user-space with single-copy atomicity semantics. This
+	 * field should only be read by the thread which registered this
+	 * data structure. Aligned on 32-bit. Values
+	 * RSEQ_ABI_ID_UNINITIALIZED and RSEQ_ABI_ID_REGISTRATION_FAILED
+	 * have a special semantic: the former means "rseq uninitialized",
+	 * and the latter means "rseq initialization failed". This value is
+	 * meant to be read within rseq critical sections and compared
+	 * with the node_id_start value previously read, before performing
+	 * the commit instruction, or read and compared with the
+	 * node_id_start value before returning a value loaded from a data
+	 * structure indexed using the node_id_start value.
+	 */
+	__u32 node_id;
+
+	/*
+	 * This is a valid end of rseq ABI for the purpose of rseq registration
+	 * rseq_len. Use the offset immediately after the node_id field as
+	 * rseq_len.
+	 */
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _RSEQ_ABI_H */
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 07ba0d463a96..99b5c3b71ef0 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -27,10 +27,20 @@
 #include <signal.h>
 #include <limits.h>
 #include <dlfcn.h>
+#include <stddef.h>
 
 #include "../kselftest.h"
 #include "rseq.h"
 
+#ifndef sizeof_field
+#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+#ifndef offsetofend
+#define offsetofend(TYPE, MEMBER) \
+	(offsetof(TYPE, MEMBER)	+ sizeof_field(TYPE, MEMBER))
+#endif
+
 static const int *libc_rseq_offset_p;
 static const unsigned int *libc_rseq_size_p;
 static const unsigned int *libc_rseq_flags_p;
@@ -49,7 +59,8 @@ static int rseq_ownership;
 
 static
 __thread struct rseq_abi __rseq_abi __attribute__((tls_model("initial-exec"))) = {
-	.cpu_id = RSEQ_ABI_CPU_ID_UNINITIALIZED,
+	.cpu_id = RSEQ_ABI_ID_UNINITIALIZED,
+	.node_id = RSEQ_ABI_ID_UNINITIALIZED,
 };
 
 static int sys_rseq(struct rseq_abi *rseq_abi, uint32_t rseq_len,
@@ -58,6 +69,11 @@ static int sys_rseq(struct rseq_abi *rseq_abi, uint32_t rseq_len,
 	return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
 }
 
+static int sys_getcpu(unsigned *cpu, unsigned *node)
+{
+	return syscall(__NR_getcpu, cpu, node, NULL);
+}
+
 int rseq_available(void)
 {
 	int rc;
@@ -83,7 +99,7 @@ int rseq_register_current_thread(void)
 		/* Treat libc's ownership as a successful registration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq_abi), 0, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, offsetofend(struct rseq_abi, node_id), 0, RSEQ_SIG);
 	if (rc)
 		return -1;
 	assert(rseq_current_cpu_raw() >= 0);
@@ -98,7 +114,7 @@ int rseq_unregister_current_thread(void)
 		/* Treat libc's ownership as a successful unregistration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq_abi), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, offsetofend(struct rseq_abi, node_id), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
 	if (rc)
 		return -1;
 	return 0;
@@ -121,7 +137,7 @@ void rseq_init(void)
 		return;
 	rseq_ownership = 1;
 	rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
-	rseq_size = sizeof(struct rseq_abi);
+	rseq_size = offsetofend(struct rseq_abi, node_id);
 	rseq_flags = 0;
 }
 
@@ -146,3 +162,16 @@ int32_t rseq_fallback_current_cpu(void)
 	}
 	return cpu;
 }
+
+int32_t rseq_fallback_current_node(void)
+{
+	uint32_t cpu_id, node_id;
+	int ret;
+
+	ret = sys_getcpu(&cpu_id, &node_id);
+	if (ret) {
+		perror("sys_getcpu()");
+		return ret;
+	}
+	return (int32_t) node_id;
+}
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 6bd0ac466b4a..6fccc87f9025 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -115,6 +115,11 @@ int rseq_unregister_current_thread(void);
  */
 int32_t rseq_fallback_current_cpu(void);
 
+/*
+ * Restartable sequence fallback for reading the current node number.
+ */
+int32_t rseq_fallback_current_node(void);
+
 /*
  * Values returned can be either the current CPU number, -1 (rseq is
  * uninitialized), or -2 (rseq initialization has failed).
@@ -124,6 +129,15 @@ static inline int32_t rseq_current_cpu_raw(void)
 	return RSEQ_ACCESS_ONCE(rseq_get_abi()->cpu_id);
 }
 
+/*
+ * Values returned can be either the current NUMA node number, -1 (rseq is
+ * uninitialized), or -2 (rseq initialization has failed).
+ */
+static inline int32_t rseq_current_node_raw(void)
+{
+	return RSEQ_ACCESS_ONCE(rseq_get_abi()->node_id);
+}
+
 /*
  * Returns a possible CPU number, which is typically the current CPU.
  * The returned CPU number can be used to prepare for an rseq critical
@@ -140,6 +154,22 @@ static inline uint32_t rseq_cpu_start(void)
 	return RSEQ_ACCESS_ONCE(rseq_get_abi()->cpu_id_start);
 }
 
+/*
+ * Returns a possible NUMA node number, which is typically the current NUMA
+ * node.  The returned NUMA node number can be used to prepare for an rseq
+ * critical section, which will confirm whether the NUMA node number is indeed
+ * the current one, and whether rseq is initialized.
+ *
+ * The NUMA node number returned by rseq_node_start should always be validated
+ * by passing it to a rseq asm sequence, or by comparing it to the return value
+ * of rseq_current_node_raw() if the rseq asm sequence does not need to be
+ * invoked.
+ */
+static inline uint32_t rseq_node_start(void)
+{
+	return RSEQ_ACCESS_ONCE(rseq_get_abi()->node_id_start);
+}
+
 static inline uint32_t rseq_current_cpu(void)
 {
 	int32_t cpu;
@@ -150,6 +180,16 @@ static inline uint32_t rseq_current_cpu(void)
 	return cpu;
 }
 
+static inline uint32_t rseq_current_node(void)
+{
+	int32_t node;
+
+	node = rseq_current_node_raw();
+	if (rseq_unlikely(node < 0))
+		node = rseq_fallback_current_node();
+	return node;
+}
+
 static inline void rseq_clear_rseq_cs(void)
 {
 	RSEQ_WRITE_ONCE(rseq_get_abi()->rseq_cs.arch.ptr, 0);
-- 
2.17.1



* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-01-31 20:55 [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Mathieu Desnoyers
  2022-01-31 20:55 ` [RFC PATCH 2/2] selftests/rseq: Implement rseq numa node id field selftest Mathieu Desnoyers
@ 2022-01-31 21:19 ` Florian Weimer
  2022-01-31 21:26   ` Mathieu Desnoyers
  2022-02-01 11:08   ` Peter Zijlstra
  2022-01-31 21:22 ` Mathieu Desnoyers
  2022-02-01 11:08 ` Peter Zijlstra
  3 siblings, 2 replies; 9+ messages in thread
From: Florian Weimer @ 2022-01-31 21:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David.Laight, carlos, Peter Oskolkov

* Mathieu Desnoyers:

> Adding the NUMA node id to struct rseq is a straightforward thing to do,
> and a good way to figure out if anything in the user-space ecosystem
> prevents extending struct rseq.
>
> This NUMA node id field allows memory allocators such as tcmalloc to
> take advantage of fast access to the current NUMA node id to perform
> NUMA-aware memory allocation.
>
> It is also useful for implementing NUMA-aware user-space mutexes.

It can be used to implement getcpu purely in userspace, too.  I had
planned to hack this together with a node ID cache in TLS, which should
offer pretty much the same functionality (except for weird CPU
topology changes which alter the node ID of a previously used CPU).
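
(Roughly, using the rseq_current_cpu_raw() helper from the kernel selftests
and made-up names, that cached variant would look like this:

	#include <unistd.h>
	#include <sys/syscall.h>
	#include "rseq.h"	/* selftest header, provides rseq_current_cpu_raw() */

	static __thread unsigned int cached_cpu = -1U;
	static __thread unsigned int cached_node;

	int getcpu_cached(unsigned int *cpu, unsigned int *node)
	{
		int c = rseq_current_cpu_raw();

		if (c < 0 || (unsigned int)c != cached_cpu) {
			/* Slow path: refresh the TLS cache with the real syscall. */
			if (syscall(__NR_getcpu, &cached_cpu, &cached_node, NULL))
				return -1;
			c = cached_cpu;
		}
		if (cpu)
			*cpu = c;
		if (node)
			*node = cached_node;
		return 0;
	}

The caveat above shows up here: once a CPU's node has been cached, a topology
change that moves that CPU to another node goes unnoticed for as long as the
thread keeps running on it.)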

However, I do not understand the need for two fields here.  Why isn't
one enough?

One field would also avoid the need to mess with rseq_cpu_id_state,
maintaining API compatibility.


* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-01-31 20:55 [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Mathieu Desnoyers
  2022-01-31 20:55 ` [RFC PATCH 2/2] selftests/rseq: Implement rseq numa node id field selftest Mathieu Desnoyers
  2022-01-31 21:19 ` [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Florian Weimer
@ 2022-01-31 21:22 ` Mathieu Desnoyers
  2022-02-01 11:08 ` Peter Zijlstra
  3 siblings, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-01-31 21:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, paulmck, Boqun Feng,
	H. Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David Laight, carlos, Peter Oskolkov

----- On Jan 31, 2022, at 3:55 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> Adding the NUMA node id to struct rseq is a straightforward thing to do,
> and a good way to figure out if anything in the user-space ecosystem
> prevents extending struct rseq.
> 
> This NUMA node id field allows memory allocators such as tcmalloc to
> take advantage of fast access to the current NUMA node id to perform
> NUMA-aware memory allocation.
> 
> It is also useful for implementing NUMA-aware user-space mutexes.
> 

[...]

> +	__u32 padding1[3];
> +
> +	/*
> +	 * This is the end of the original rseq ABI.
> +	 * This is a valid end of rseq ABI for the purpose of rseq registration
> +	 * rseq_len.
> +	 * The original rseq ABI uses "sizeof(struct rseq)" on registration,
> +	 * thus requiring the padding above.
> +	 */
> +
> +	/*
> +	 * Restartable sequences node_id_start field. Updated by the
> +	 * kernel. Read by user-space with single-copy atomicity
> +	 * semantics. This field should only be read by the thread which
> +	 * registered this data structure. Aligned on 32-bit. Always
> +	 * contains a value in the range of possible NUMA node IDs, although the
> +	 * value may not be the actual current NUMA node ID (e.g. if rseq is not
> +	 * initialized). This NUMA node ID number value should always be compared
> +	 * against the value of the node_id field before performing a rseq
> +	 * commit or returning a value read from a data structure indexed using
> +	 * the node_id_start value.
> +	 */
> +	__u32 node_id_start;

Considering that the same "node id" is shared across various cores, I don't expect
it to be of much use in a rseq critical section comparison. That differs from the
"cpu id" (really the core ID), or the eventual concept of "vcpu id" as developed
internally at Google, which are identifiers which are guaranteed to be unique
within a process, and unchanged, for the duration of the rseq critical section.

Also, having these node_id* fields after the original end of the struct rseq
means user-space would have to check whether glibc's __rseq_size is large
enough to contain those node_id* fields before loading them, which means there
needs to be at least one comparison before using the fields, therefore defeating
the purpose of the "*_id_start" trick.

So for those two reasons, I think just the "node_id" field would be sufficient
(no node_id_start field).

This brings another question though: should we then place the "node_id" field
in the original struct rseq padding or after ?

If we place it in the original padding, then glibc-2.35 would have enough
space to contain this field, but we would need to add a new sys_rseq flag to
query whether the node_id field is supported by the kernel, for use by
applications and glibc.

However, if we choose to place the new node_id field after the original
padding, applications can simply check with the __rseq_size exposed by
glibc to detect whether this field is there and populated. I have a
preference for this last approach as this looks less like a "one-off"
hack, and a more future-proof way to extend struct rseq.
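
Concretely, detection from an application would look more or less like this
(sketch only, using the struct rseq layout proposed here: it relies on the
__rseq_offset/__rseq_size symbols exported by glibc 2.35, on the compiler
providing __builtin_thread_pointer(), and on a user-space definition of
offsetofend()):

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <linux/rseq.h>

	extern const ptrdiff_t __rseq_offset;
	extern const unsigned int __rseq_size;

	#define offsetofend(TYPE, MEMBER) \
		(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

	static inline struct rseq *rseq_area(void)
	{
		/* The registered rseq area lives at thread pointer + __rseq_offset. */
		return (struct rseq *)((char *)__builtin_thread_pointer() + __rseq_offset);
	}

	static inline bool rseq_node_id_available(void)
	{
		/* The field is present iff the registered size covers it. */
		return __rseq_size >= offsetofend(struct rseq, node_id);
	}

	static inline int32_t current_node_id(void)
	{
		if (rseq_node_id_available())
			return rseq_area()->node_id;
		return -1;	/* caller falls back to getcpu(2) */
	}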

Thoughts ?

Thanks,

Mathieu

> +
> +	/*
> +	 * Restartable sequences node_id field. Updated by the kernel.
> +	 * Read by user-space with single-copy atomicity semantics. This
> +	 * field should only be read by the thread which registered this
> +	 * data structure. Aligned on 32-bit. Values
> +	 * RSEQ_ID_UNINITIALIZED and RSEQ_ID_REGISTRATION_FAILED
> +	 * have a special semantic: the former means "rseq uninitialized",
> +	 * and the latter means "rseq initialization failed". This value is
> +	 * meant to be read within rseq critical sections and compared
> +	 * with the node_id_start value previously read, before performing
> +	 * the commit instruction, or read and compared with the
> +	 * node_id_start value before returning a value loaded from a data
> +	 * structure indexed using the node_id_start value.
> +	 */
> +	__u32 node_id;
> +
> +	/*
> +	 * This is a valid end of rseq ABI for the purpose of rseq registration
> +	 * rseq_len. Use the offset immediately after the node_id field as
> +	 * rseq_len.
> +	 */
> } __attribute__((aligned(4 * sizeof(__u64))));
> 
-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-01-31 21:19 ` [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Florian Weimer
@ 2022-01-31 21:26   ` Mathieu Desnoyers
  2022-02-01 11:08   ` Peter Zijlstra
  1 sibling, 0 replies; 9+ messages in thread
From: Mathieu Desnoyers @ 2022-01-31 21:26 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

----- On Jan 31, 2022, at 4:19 PM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>> Adding the NUMA node id to struct rseq is a straightforward thing to do,
>> and a good way to figure out if anything in the user-space ecosystem
>> prevents extending struct rseq.
>>
>> This NUMA node id field allows memory allocators such as tcmalloc to
>> take advantage of fast access to the current NUMA node id to perform
>> NUMA-aware memory allocation.
>>
>> It is also useful for implementing NUMA-aware user-space mutexes.
> 
> It can be used to implement getcpu purely in userspace, too.  I had
> plan to hack this together with a node ID cache in TLS, which should
> offer pretty much the same functionality (except for weird CPU
> topology changes which alter the node ID of a previously used CPU).

I suspect that any approach based on a user-space cache will break with
respect to CRIU. That is one big advantage of using the rseq thread area
for this.

> 
> However, I do not understand the need for two fields here.  Why isn't
> one enough?

As stated in my self-reply, I don't think those two fields are needed
after all.

> 
> One field would also avoid the need to mess with rseq_cpu_id_state,
> maintaining API compatibility.

True. However considering that we plan to remove the buggy "rseq_cs.ptr"
fields from the API, that rseq.h UAPI compatibility does not seem to be
all that relevant.

But still, it's better if we can avoid breaking API, agreed. And the
"node_id_start" does not appear to be needed after all.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-01-31 20:55 [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2022-01-31 21:22 ` Mathieu Desnoyers
@ 2022-02-01 11:08 ` Peter Zijlstra
  2022-02-01 12:07   ` Florian Weimer
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2022-02-01 11:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov

On Mon, Jan 31, 2022 at 03:55:30PM -0500, Mathieu Desnoyers wrote:

> It is also useful for implementing NUMA-aware user-space mutexes.

That really wants a futex extension..


* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-01-31 21:19 ` [RFC PATCH 1/2] rseq: extend struct rseq with numa node id Florian Weimer
  2022-01-31 21:26   ` Mathieu Desnoyers
@ 2022-02-01 11:08   ` Peter Zijlstra
  2022-02-01 12:11     ` Florian Weimer
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2022-02-01 11:08 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Mathieu Desnoyers, linux-kernel, Thomas Gleixner,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Christian Brauner, David.Laight, carlos,
	Peter Oskolkov

On Mon, Jan 31, 2022 at 10:19:09PM +0100, Florian Weimer wrote:
> It can be used to implement getcpu purely in userspace, too.  I had
> planned to hack this together with a node ID cache in TLS, which should
> offer pretty much the same functionality (except for weird CPU
> topology changes which alter the node ID of a previously used CPU).

PowerPC does that quite a lot..


* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-02-01 11:08 ` Peter Zijlstra
@ 2022-02-01 12:07   ` Florian Weimer
  0 siblings, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2022-02-01 12:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Thomas Gleixner,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Christian Brauner, David.Laight, carlos,
	Peter Oskolkov

* Peter Zijlstra:

> On Mon, Jan 31, 2022 at 03:55:30PM -0500, Mathieu Desnoyers wrote:
>
>> It is also useful for implementing NUMA-aware user-space mutexes.
>
> That really wants a futex extension..

Do you mean with hashes per NUMA node?


* Re: [RFC PATCH 1/2] rseq: extend struct rseq with numa node id
  2022-02-01 11:08   ` Peter Zijlstra
@ 2022-02-01 12:11     ` Florian Weimer
  0 siblings, 0 replies; 9+ messages in thread
From: Florian Weimer @ 2022-02-01 12:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathieu Desnoyers, linux-kernel, Thomas Gleixner,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Christian Brauner, David.Laight, carlos,
	Peter Oskolkov

* Peter Zijlstra:

> On Mon, Jan 31, 2022 at 10:19:09PM +0100, Florian Weimer wrote:
>> It can be used to implement getcpu purely in userspace, too.  I had
>> planned to hack this together with a node ID cache in TLS, which should
>> offer pretty much the same functionality (except for weird CPU
>> topology changes which alter the node ID of a previously used CPU).
>
> PowerPC does that quite a lot..

How so?  Why make it especially hard for userspace?

