* [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions
@ 2022-02-18 21:06 Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 01/11] rseq: Introduce feature size and alignment ELF auxiliary vector entries Mathieu Desnoyers
                   ` (10 more replies)
  0 siblings, 11 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Hi,

I'm sending this series out for feedback. There appears to be a lot of
interest in the virtual cpu id feature for use in user-space memory
allocators (e.g. glibc malloc), so I am sending it out as an RFC.

The most interesting patch in here is "sched: Introduce per memory space
current virtual cpu id". So if you want to jump to the meat, go
immediately to that patch.

This series is based on the tip tree core/sched branch [1] at commit
ed3b362d54f0 ("sched/isolation: Split housekeeping cpumask per isolation
features").

Feedback is welcome!

Thanks,

Mathieu

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/

Mathieu Desnoyers (11):
  rseq: Introduce feature size and alignment ELF auxiliary vector
    entries
  rseq: Introduce extensible rseq ABI
  rseq: extend struct rseq with numa node id
  selftests/rseq: Use ELF auxiliary vector for extensible rseq
  selftests/rseq: Implement rseq numa node id field selftest
  lib: invert _find_next_bit source arguments
  lib: implement find_{first,next}_{zero,one}_and_zero_bit
  cpumask: implement cpumask_{first,next}_{zero,one}_and_zero
  sched: Introduce per memory space current virtual cpu id
  rseq: extend struct rseq with per memory space vcpu id
  selftests/rseq: Implement rseq vm_vcpu_id field support

 fs/binfmt_elf.c                           |   5 +
 fs/exec.c                                 |   4 +
 include/linux/cpumask.h                   |  94 ++++++
 include/linux/find.h                      | 123 +++++++-
 include/linux/mm.h                        |  25 ++
 include/linux/mm_types.h                  | 111 +++++++
 include/linux/sched.h                     |   9 +
 include/trace/events/rseq.h               |   4 +-
 include/uapi/linux/auxvec.h               |   2 +
 include/uapi/linux/rseq.h                 |  22 ++
 init/Kconfig                              |   4 +
 kernel/fork.c                             |  15 +-
 kernel/ptrace.c                           |   2 +-
 kernel/rseq.c                             |  60 +++-
 kernel/sched/core.c                       |  82 +++++
 kernel/sched/deadline.c                   |   3 +
 kernel/sched/debug.c                      |  13 +
 kernel/sched/fair.c                       |   1 +
 kernel/sched/rt.c                         |   2 +
 kernel/sched/sched.h                      | 364 ++++++++++++++++++++++
 kernel/sched/stats.c                      |  16 +-
 lib/find_bit.c                            |  17 +-
 tools/include/linux/find.h                |   9 +-
 tools/lib/find_bit.c                      |  17 +-
 tools/testing/selftests/rseq/basic_test.c |   5 +
 tools/testing/selftests/rseq/rseq-abi.h   |  22 ++
 tools/testing/selftests/rseq/rseq.c       |  86 ++++-
 tools/testing/selftests/rseq/rseq.h       |  46 ++-
 28 files changed, 1106 insertions(+), 57 deletions(-)

-- 
2.17.1



* [RFC PATCH v2 01/11] rseq: Introduce feature size and alignment ELF auxiliary vector entries
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 02/11] rseq: Introduce extensible rseq ABI Mathieu Desnoyers
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Export the rseq feature size supported by the kernel as well as the
required allocation alignment for the rseq per-thread area to user-space
through ELF auxiliary vector entries.

This is part of the extensible rseq ABI.
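
For illustration, user-space can query the new entries with
getauxval(3). A minimal sketch (not part of this patch; the fallback
defines cover toolchain headers predating this series):

  #include <stdio.h>
  #include <sys/auxv.h>

  #ifndef AT_RSEQ_FEATURE_SIZE
  #define AT_RSEQ_FEATURE_SIZE  27  /* rseq supported feature size */
  #endif
  #ifndef AT_RSEQ_ALIGN
  #define AT_RSEQ_ALIGN         28  /* rseq allocation alignment */
  #endif

  int main(void)
  {
          /* getauxval() returns 0 on kernels without these entries. */
          unsigned long feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
          unsigned long align = getauxval(AT_RSEQ_ALIGN);

          printf("rseq feature size: %lu, align: %lu\n",
                 feature_size, align);
          return 0;
  }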

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/binfmt_elf.c             | 5 +++++
 include/uapi/linux/auxvec.h | 2 ++
 include/uapi/linux/rseq.h   | 5 +++++
 3 files changed, 12 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 605017eb9349..77776582e76d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,6 +46,7 @@
 #include <linux/cred.h>
 #include <linux/dax.h>
 #include <linux/uaccess.h>
+#include <linux/rseq.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
@@ -286,6 +287,10 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 	if (bprm->have_execfd) {
 		NEW_AUX_ENT(AT_EXECFD, bprm->execfd);
 	}
+#ifdef CONFIG_RSEQ
+	NEW_AUX_ENT(AT_RSEQ_FEATURE_SIZE, offsetof(struct rseq, end));
+	NEW_AUX_ENT(AT_RSEQ_ALIGN, __alignof__(struct rseq));
+#endif
 #undef NEW_AUX_ENT
 	/* AT_NULL is zero; clear the rest too */
 	memset(elf_info, 0, (char *)mm->saved_auxv +
diff --git a/include/uapi/linux/auxvec.h b/include/uapi/linux/auxvec.h
index c7e502bf5a6f..6991c4b8ab18 100644
--- a/include/uapi/linux/auxvec.h
+++ b/include/uapi/linux/auxvec.h
@@ -30,6 +30,8 @@
 				 * differ from AT_PLATFORM. */
 #define AT_RANDOM 25	/* address of 16 random bytes */
 #define AT_HWCAP2 26	/* extension of AT_HWCAP */
+#define AT_RSEQ_FEATURE_SIZE	27	/* rseq supported feature size */
+#define AT_RSEQ_ALIGN		28	/* rseq allocation alignment */
 
 #define AT_EXECFN  31	/* filename of program */
 
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 77ee207623a9..05d3c4cdeb40 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -130,6 +130,11 @@ struct rseq {
 	 *     this thread.
 	 */
 	__u32 flags;
+
+	/*
+	 * Flexible array member at end of structure, after last feature field.
+	 */
+	char end[];
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _UAPI_LINUX_RSEQ_H */
-- 
2.17.1



* [RFC PATCH v2 02/11] rseq: Introduce extensible rseq ABI
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 01/11] rseq: Introduce feature size and alignment ELF auxiliary vector entries Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 03/11] rseq: extend struct rseq with numa node id Mathieu Desnoyers
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Introduce the extensible rseq ABI, where the feature size supported by
the kernel and the required alignment are communicated to user-space
through ELF auxiliary vectors.

This allows user-space to perform rseq registration with a rseq_len of
either 32 bytes (the original struct rseq size, which includes padding)
or larger.

If rseq_len is larger than 32 bytes, then it must be large enough to
contain the feature size communicated to user-space through ELF
auxiliary vectors.
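
For illustration, a hedged user-space registration sketch (assuming
libc does not already own the rseq registration; the signature value
and the 1024-byte over-allocation are arbitrary example choices):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <sys/auxv.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <linux/rseq.h>

  #define MY_RSEQ_SIG   0x53053053      /* example signature */

  /* Over-allocate and over-align so future feature fields fit. */
  static __thread struct rseq my_rseq __attribute__((aligned(1024)));

  static int my_rseq_register(void)
  {
          unsigned long feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
          uint32_t rseq_len;

          if (feature_size <= 20)       /* original feature size, or 0 */
                  rseq_len = 32;        /* original struct rseq size */
          else
                  rseq_len = 1024;      /* covers all supported features */
          return syscall(__NR_rseq, &my_rseq, rseq_len, 0, MY_RSEQ_SIG);
  }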

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/sched.h |  4 ++++
 kernel/ptrace.c       |  2 +-
 kernel/rseq.c         | 33 +++++++++++++++++++++++++++------
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 508b91d57470..838c9e0b4cae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1291,6 +1291,7 @@ struct task_struct {
 
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
+	u32 rseq_len;
 	u32 rseq_sig;
 	/*
 	 * RmW on rseq_event_mask must be performed atomically
@@ -2260,10 +2261,12 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 {
 	if (clone_flags & CLONE_VM) {
 		t->rseq = NULL;
+		t->rseq_len = 0;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
 	} else {
 		t->rseq = current->rseq;
+		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
 	}
@@ -2272,6 +2275,7 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 static inline void rseq_execve(struct task_struct *t)
 {
 	t->rseq = NULL;
+	t->rseq_len = 0;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
 }
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index eea265082e97..f5edde5b7805 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -800,7 +800,7 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
 {
 	struct ptrace_rseq_configuration conf = {
 		.rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
-		.rseq_abi_size = sizeof(*task->rseq),
+		.rseq_abi_size = task->rseq_len,
 		.signature = task->rseq_sig,
 		.flags = 0,
 	};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 97ac20b4f738..46dc5c2ce2b7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -18,6 +18,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE		32
+
 #define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \
 				       RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)
 
@@ -86,10 +89,15 @@ static int rseq_update_cpu_id(struct task_struct *t)
 	u32 cpu_id = raw_smp_processor_id();
 	struct rseq __user *rseq = t->rseq;
 
-	if (!user_write_access_begin(rseq, sizeof(*rseq)))
+	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+	/*
+	 * Additional feature fields added after ORIG_RSEQ_SIZE
+	 * need to be conditionally updated only if
+	 * t->rseq_len != ORIG_RSEQ_SIZE.
+	 */
 	user_write_access_end();
 	trace_rseq_update(t);
 	return 0;
@@ -116,6 +124,11 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
 	 */
 	if (put_user(cpu_id, &t->rseq->cpu_id))
 		return -EFAULT;
+	/*
+	 * Additional feature fields added after ORIG_RSEQ_SIZE
+	 * need to be conditionally reset only if
+	 * t->rseq_len != ORIG_RSEQ_SIZE.
+	 */
 	return 0;
 }
 
@@ -336,7 +349,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		/* Unregister rseq for current thread. */
 		if (current->rseq != rseq || !current->rseq)
 			return -EINVAL;
-		if (rseq_len != sizeof(*rseq))
+		if (rseq_len != current->rseq_len)
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
@@ -345,6 +358,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 			return ret;
 		current->rseq = NULL;
 		current->rseq_sig = 0;
+		current->rseq_len = 0;
 		return 0;
 	}
 
@@ -357,7 +371,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		 * the provided address differs from the prior
 		 * one.
 		 */
-		if (current->rseq != rseq || rseq_len != sizeof(*rseq))
+		if (current->rseq != rseq || rseq_len != current->rseq_len)
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
@@ -366,15 +380,22 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	}
 
 	/*
-	 * If there was no rseq previously registered,
-	 * ensure the provided rseq is properly aligned and valid.
+	 * If there was no rseq previously registered, ensure the provided rseq
+	 * is properly aligned, as communicated to user-space through the ELF
+	 * auxiliary vector AT_RSEQ_ALIGN.
+	 *
+	 * In order to be valid, rseq_len is either the original rseq size, or
+	 * large enough to contain all supported fields, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
 	 */
 	if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) ||
-	    rseq_len != sizeof(*rseq))
+	    rseq_len < ORIG_RSEQ_SIZE ||
+	    (rseq_len != ORIG_RSEQ_SIZE && rseq_len < offsetof(struct rseq, end)))
 		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 	current->rseq = rseq;
+	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
 	/*
 	 * If rseq was previously inactive, and has just been
-- 
2.17.1



* [RFC PATCH v2 03/11] rseq: extend struct rseq with numa node id
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 01/11] rseq: Introduce feature size and alignment ELF auxiliary vector entries Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 02/11] rseq: Introduce extensible rseq ABI Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 04/11] selftests/rseq: Use ELF auxiliary vector for extensible rseq Mathieu Desnoyers
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Adding the NUMA node id to struct rseq is a straightforward thing to do,
and a good way to figure out if anything in the user-space ecosystem
prevents extending struct rseq.

This NUMA node id field allows memory allocators such as tcmalloc to
take advantage of fast access to the current NUMA node id to perform
NUMA-aware memory allocation.

It can also be useful for implementing fast-paths for NUMA-aware
user-space mutexes.

It also allows implementing getcpu(2) purely in user-space.
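
For instance, a hedged sketch of a user-space getcpu() equivalent
("my_rseq" stands for this thread's registered struct rseq, with a
registration large enough to cover the node_id field):

  static inline int my_getcpu(unsigned int *cpu, unsigned int *node)
  {
          /* Fields are updated by the kernel with single-copy atomicity. */
          if (cpu)
                  *cpu = *(volatile __u32 *)&my_rseq.cpu_id;
          if (node)
                  *node = *(volatile __u32 *)&my_rseq.node_id;
          return 0;
  }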

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/trace/events/rseq.h |  4 +++-
 include/uapi/linux/rseq.h   |  8 ++++++++
 kernel/rseq.c               | 19 +++++++++++++------
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index a04a64bc1a00..6bd442697354 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -16,13 +16,15 @@ TRACE_EVENT(rseq_update,
 
 	TP_STRUCT__entry(
 		__field(s32, cpu_id)
+		__field(s32, node_id)
 	),
 
 	TP_fast_assign(
 		__entry->cpu_id = raw_smp_processor_id();
+		__entry->node_id = cpu_to_node(raw_smp_processor_id());
 	),
 
-	TP_printk("cpu_id=%d", __entry->cpu_id)
+	TP_printk("cpu_id=%d node_id=%d", __entry->cpu_id, __entry->node_id)
 );
 
 TRACE_EVENT(rseq_ip_fixup,
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 05d3c4cdeb40..1cb90a435c5c 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -131,6 +131,14 @@ struct rseq {
 	 */
 	__u32 flags;
 
+	/*
+	 * Restartable sequences node_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current NUMA node ID.
+	 */
+	__u32 node_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 46dc5c2ce2b7..cb7d8a5afc82 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,15 +84,17 @@
  *   F1. <failure>
  */
 
-static int rseq_update_cpu_id(struct task_struct *t)
+static int rseq_update_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id = raw_smp_processor_id();
 	struct rseq __user *rseq = t->rseq;
+	u32 cpu_id = raw_smp_processor_id();
+	u32 node_id = cpu_to_node(cpu_id);
 
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -108,9 +110,9 @@ static int rseq_update_cpu_id(struct task_struct *t)
 	return -EFAULT;
 }
 
-static int rseq_reset_rseq_cpu_id(struct task_struct *t)
+static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;
 
 	/*
 	 * Reset cpu_id_start to its initial state (0).
@@ -124,6 +126,11 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
 	 */
 	if (put_user(cpu_id, &t->rseq->cpu_id))
 		return -EFAULT;
+	/*
+	 * Reset node_id to its initial state (0).
+	 */
+	if (put_user(node_id, &t->rseq->node_id))
+		return -EFAULT;
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally reset only if
@@ -306,7 +313,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 		if (unlikely(ret < 0))
 			goto error;
 	}
-	if (unlikely(rseq_update_cpu_id(t)))
+	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
 	return;
 
@@ -353,7 +360,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
-		ret = rseq_reset_rseq_cpu_id(current);
+		ret = rseq_reset_rseq_cpu_node_id(current);
 		if (ret)
 			return ret;
 		current->rseq = NULL;
-- 
2.17.1



* [RFC PATCH v2 04/11] selftests/rseq: Use ELF auxiliary vector for extensible rseq
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 03/11] rseq: extend struct rseq with numa node id Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 05/11] selftests/rseq: Implement rseq numa node id field selftest Mathieu Desnoyers
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers
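
Use the AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN ELF auxiliary vector
entries to detect the rseq feature size supported by the kernel, and
size the rseq registration accordingly: keep the original 32-byte
registration when only the original 20-byte feature set is supported,
and register a larger (1024-byte, suitably over-aligned) thread area
otherwise.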

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/rseq-abi.h |  5 ++
 tools/testing/selftests/rseq/rseq.c     | 68 ++++++++++++++++++++++---
 tools/testing/selftests/rseq/rseq.h     | 15 ++++--
 3 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index a8c44d9af71f..00ac846d85b0 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -146,6 +146,11 @@ struct rseq_abi {
 	 *     this thread.
 	 */
 	__u32 flags;
+
+	/*
+	 * Flexible array member at end of structure, after last feature field.
+	 */
+	char end[];
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _RSEQ_ABI_H */
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 986b9458efb2..506f2b17aea6 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -28,6 +28,8 @@
 #include <limits.h>
 #include <dlfcn.h>
 #include <stddef.h>
+#include <sys/auxv.h>
+#include <linux/auxvec.h>
 
 #include "../kselftest.h"
 #include "rseq.h"
@@ -39,17 +41,35 @@ static const unsigned int *libc_rseq_flags_p;
 /* Offset from the thread pointer to the rseq area.  */
 ptrdiff_t rseq_offset;
 
-/* Size of the registered rseq area.  0 if the registration was
-   unsuccessful.  */
+/*
+ * Size of the registered rseq area.  0 if the registration was
+ * unsuccessful.
+ */
 unsigned int rseq_size = -1U;
 
-/* Flags used during rseq registration.  */
+/* Flags used during rseq registration. */
 unsigned int rseq_flags;
 
+/*
+ * rseq feature size supported by the kernel.  0 if the registration was
+ * unsuccessful.
+ */
+unsigned int rseq_feature_size = -1U;
+
 static int rseq_ownership;
+static int rseq_reg_success;	/* At least one rseq registration has succeeded. */
+
+/* Allocate a large area for the TLS. */
+#define RSEQ_THREAD_AREA_ALLOC_SIZE	1024
+
+/* Original struct rseq feature size is 20 bytes. */
+#define ORIG_RSEQ_FEATURE_SIZE		20
+
+/* Original struct rseq allocation size is 32 bytes. */
+#define ORIG_RSEQ_ALLOC_SIZE		32
 
 static
-__thread struct rseq_abi __rseq_abi __attribute__((tls_model("initial-exec"))) = {
+__thread struct rseq_abi __rseq_abi __attribute__((tls_model("initial-exec"), aligned(RSEQ_THREAD_AREA_ALLOC_SIZE))) = {
 	.cpu_id = RSEQ_ABI_CPU_ID_UNINITIALIZED,
 };
 
@@ -84,10 +104,18 @@ int rseq_register_current_thread(void)
 		/* Treat libc's ownership as a successful registration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq_abi), 0, RSEQ_SIG);
-	if (rc)
+	rc = sys_rseq(&__rseq_abi, rseq_size, 0, RSEQ_SIG);
+	if (rc) {
+		if (RSEQ_READ_ONCE(rseq_reg_success)) {
+			/* Incoherent success/failure within process. */
+			abort();
+		}
+		rseq_size = 0;
+		rseq_feature_size = 0;
 		return -1;
+	}
 	assert(rseq_current_cpu_raw() >= 0);
+	RSEQ_WRITE_ONCE(rseq_reg_success, 1);
 	return 0;
 }
 
@@ -99,12 +127,28 @@ int rseq_unregister_current_thread(void)
 		/* Treat libc's ownership as a successful unregistration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, sizeof(struct rseq_abi), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, rseq_size, RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
 	if (rc)
 		return -1;
 	return 0;
 }
 
+static
+unsigned int get_rseq_feature_size(void)
+{
+	unsigned long auxv_rseq_feature_size, auxv_rseq_align;
+
+	auxv_rseq_align = getauxval(AT_RSEQ_ALIGN);
+	assert(!auxv_rseq_align || auxv_rseq_align <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+
+	auxv_rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
+	assert(!auxv_rseq_feature_size || auxv_rseq_feature_size <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+	if (auxv_rseq_feature_size)
+		return auxv_rseq_feature_size;
+	else
+		return ORIG_RSEQ_FEATURE_SIZE;
+}
+
 static __attribute__((constructor))
 void rseq_init(void)
 {
@@ -116,14 +160,21 @@ void rseq_init(void)
 		rseq_offset = *libc_rseq_offset_p;
 		rseq_size = *libc_rseq_size_p;
 		rseq_flags = *libc_rseq_flags_p;
+		rseq_feature_size = get_rseq_feature_size();
+		if (rseq_feature_size > rseq_size)
+			rseq_feature_size = rseq_size;
 		return;
 	}
 	if (!rseq_available())
 		return;
 	rseq_ownership = 1;
 	rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
-	rseq_size = sizeof(struct rseq_abi);
 	rseq_flags = 0;
+	rseq_feature_size = get_rseq_feature_size();
+	if (rseq_feature_size == ORIG_RSEQ_FEATURE_SIZE)
+		rseq_size = ORIG_RSEQ_ALLOC_SIZE;
+	else
+		rseq_size = RSEQ_THREAD_AREA_ALLOC_SIZE;
 }
 
 static __attribute__((destructor))
@@ -133,6 +184,7 @@ void rseq_exit(void)
 		return;
 	rseq_offset = 0;
 	rseq_size = -1U;
+	rseq_feature_size = -1U;
 	rseq_ownership = 0;
 }
 
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index 9d850b290c2e..e73db2e82a11 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -47,13 +47,20 @@
 
 #include "rseq-thread-pointer.h"
 
-/* Offset from the thread pointer to the rseq area.  */
+/* Offset from the thread pointer to the rseq area. */
 extern ptrdiff_t rseq_offset;
-/* Size of the registered rseq area.  0 if the registration was
-   unsuccessful.  */
+/*
+ * Size of the registered rseq area.  0 if the registration was
+ * unsuccessful.
+ */
 extern unsigned int rseq_size;
-/* Flags used during rseq registration.  */
+/* Flags used during rseq registration. */
 extern unsigned int rseq_flags;
+/*
+ * rseq feature size supported by the kernel.  0 if the registration was
+ * unsuccessful.
+ */
+extern unsigned int rseq_feature_size;
 
 static inline struct rseq_abi *rseq_get_abi(void)
 {
-- 
2.17.1



* [RFC PATCH v2 05/11] selftests/rseq: Implement rseq numa node id field selftest
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 04/11] selftests/rseq: Use ELF auxiliary vector for extensible rseq Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 06/11] lib: invert _find_next_bit source arguments Mathieu Desnoyers
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Test the NUMA node id extension rseq field. Compare it against the value
returned by the getcpu(2) system call while pinned on a specific core.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/basic_test.c |  5 ++++
 tools/testing/selftests/rseq/rseq-abi.h   |  8 ++++++
 tools/testing/selftests/rseq/rseq.c       | 18 +++++++++++++
 tools/testing/selftests/rseq/rseq.h       | 31 +++++++++++++++++++++++
 4 files changed, 62 insertions(+)

diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
index d8efbfb89193..a49b88cb20a3 100644
--- a/tools/testing/selftests/rseq/basic_test.c
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -22,6 +22,8 @@ void test_cpu_pointer(void)
 	CPU_ZERO(&test_affinity);
 	for (i = 0; i < CPU_SETSIZE; i++) {
 		if (CPU_ISSET(i, &affinity)) {
+			int node;
+
 			CPU_SET(i, &test_affinity);
 			sched_setaffinity(0, sizeof(test_affinity),
 					&test_affinity);
@@ -29,6 +31,9 @@ void test_cpu_pointer(void)
 			assert(rseq_current_cpu() == i);
 			assert(rseq_current_cpu_raw() == i);
 			assert(rseq_cpu_start() == i);
+			node = rseq_fallback_current_node();
+			assert(rseq_current_node() == node);
+			assert(rseq_current_node_raw() == node);
 			CPU_CLR(i, &test_affinity);
 		}
 	}
diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index 00ac846d85b0..a1faa9162d52 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -147,6 +147,14 @@ struct rseq_abi {
 	 */
 	__u32 flags;
 
+	/*
+	 * Restartable sequences node_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current NUMA node ID.
+	 */
+	__u32 node_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 506f2b17aea6..470fc0f73e22 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -79,6 +79,11 @@ static int sys_rseq(struct rseq_abi *rseq_abi, uint32_t rseq_len,
 	return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
 }
 
+static int sys_getcpu(unsigned int *cpu, unsigned int *node)
+{
+	return syscall(__NR_getcpu, cpu, node, NULL);
+}
+
 int rseq_available(void)
 {
 	int rc;
@@ -199,3 +204,16 @@ int32_t rseq_fallback_current_cpu(void)
 	}
 	return cpu;
 }
+
+int32_t rseq_fallback_current_node(void)
+{
+	uint32_t cpu_id, node_id;
+	int ret;
+
+	ret = sys_getcpu(&cpu_id, &node_id);
+	if (ret) {
+		perror("sys_getcpu()");
+		return ret;
+	}
+	return (int32_t) node_id;
+}
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index e73db2e82a11..4f1954cd12ff 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -20,6 +20,15 @@
 #include "rseq-abi.h"
 #include "compiler.h"
 
+#ifndef rseq_sizeof_field
+#define rseq_sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
+#endif
+
+#ifndef rseq_offsetofend
+#define rseq_offsetofend(TYPE, MEMBER) \
+	(offsetof(TYPE, MEMBER)	+ rseq_sizeof_field(TYPE, MEMBER))
+#endif
+
 /*
  * Empty code injection macros, override when testing.
  * It is important to consider that the ASM injection macros need to be
@@ -123,6 +132,11 @@ int rseq_unregister_current_thread(void);
  */
 int32_t rseq_fallback_current_cpu(void);
 
+/*
+ * Restartable sequence fallback for reading the current node number.
+ */
+int32_t rseq_fallback_current_node(void);
+
 /*
  * Values returned can be either the current CPU number, -1 (rseq is
  * uninitialized), or -2 (rseq initialization has failed).
@@ -132,6 +146,15 @@ static inline int32_t rseq_current_cpu_raw(void)
 	return RSEQ_ACCESS_ONCE(rseq_get_abi()->cpu_id);
 }
 
+/*
+ * Current NUMA node number.
+ */
+static inline uint32_t rseq_current_node_raw(void)
+{
+	assert((int) rseq_feature_size >= rseq_offsetofend(struct rseq_abi, node_id));
+	return RSEQ_ACCESS_ONCE(rseq_get_abi()->node_id);
+}
+
 /*
  * Returns a possible CPU number, which is typically the current CPU.
  * The returned CPU number can be used to prepare for an rseq critical
@@ -158,6 +181,14 @@ static inline uint32_t rseq_current_cpu(void)
 	return cpu;
 }
 
+static inline uint32_t rseq_current_node(void)
+{
+	if (rseq_likely((int) rseq_feature_size >= rseq_offsetofend(struct rseq_abi, node_id)))
+		return rseq_current_node_raw();
+	else
+		return rseq_fallback_current_node();
+}
+
 static inline void rseq_clear_rseq_cs(void)
 {
 	RSEQ_WRITE_ONCE(rseq_get_abi()->rseq_cs.arch.ptr, 0);
-- 
2.17.1



* [RFC PATCH v2 06/11] lib: invert _find_next_bit source arguments
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 05/11] selftests/rseq: Implement rseq numa node id field selftest Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 07/11] lib: implement find_{first,next}_{zero,one}_and_zero_bit Mathieu Desnoyers
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Apply the bit-invert operations before the AND operation in
_find_next_bit. This allows AND operations on combined bitmasks in
which we search either for set or for cleared bits, e.g.: find the
first bit which is zero in one bitmask AND one in the second bitmask.

The existing find-first-zero-bit use does not pass a second argument,
so whether the inversion is performed before or after the AND operation
does not matter there.
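
The per-word computation thus becomes (sketch of the new convention;
"idx" stands for start / BITS_PER_LONG):

  tmp = addr1[idx] ^ invert_src1;
  if (addr2)
          tmp &= addr2[idx] ^ invert_src2;

Passing invert_src1 = ~0UL and invert_src2 = ~0UL searches for bits
cleared in both bitmaps; invert_src1 = 0UL with invert_src2 = ~0UL
searches for bits set in addr1 and cleared in addr2.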

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/find.h       | 13 +++++++------
 lib/find_bit.c             | 17 ++++++++---------
 tools/include/linux/find.h |  9 +++++----
 tools/lib/find_bit.c       | 17 ++++++++---------
 4 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 5bb6db213bcb..41941cb9cad7 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -10,7 +10,8 @@
 
 extern unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+		unsigned long start, unsigned long invert_src1,
+		unsigned long invert_src2, unsigned long le);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0);
 }
 #endif
 
@@ -71,7 +72,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0);
 }
 #endif
 
@@ -99,7 +100,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0);
 }
 #endif
 
@@ -247,7 +248,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 1);
 }
 #endif
 
@@ -266,7 +267,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 1);
 }
 #endif
 
diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..73e78565e691 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -25,23 +25,23 @@
 /*
  * This is a common helper function for find_next_bit, find_next_zero_bit, and
  * find_next_and_bit. The differences are:
- *  - The "invert" argument, which is XORed with each fetched word before
- *    searching it for one bits.
+ *  - The "invert_src1" and "invert_src2" arguments, which are XORed with
+ *    each source word before applying the 'and' operator.
  *  - The optional "addr2", which is anded with "addr1" if present.
  */
 unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
+		unsigned long start, unsigned long invert_src1,
+		unsigned long invert_src2, unsigned long le)
 {
 	unsigned long tmp, mask;
 
 	if (unlikely(start >= nbits))
 		return nbits;
 
-	tmp = addr1[start / BITS_PER_LONG];
+	tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
 	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
+		tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;
 
 	/* Handle 1st word. */
 	mask = BITMAP_FIRST_WORD_MASK(start);
@@ -57,10 +57,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 		if (start >= nbits)
 			return nbits;
 
-		tmp = addr1[start / BITS_PER_LONG];
+		tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
 		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
+			tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;
 	}
 
 	if (le)
diff --git a/tools/include/linux/find.h b/tools/include/linux/find.h
index 47e2bd6c5174..5ab0c95086ad 100644
--- a/tools/include/linux/find.h
+++ b/tools/include/linux/find.h
@@ -10,7 +10,8 @@
 
 extern unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
+		unsigned long start, unsigned long invert_src1,
+		unsigned long invert_src2, unsigned long le);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_first_and_bit(const unsigned long *addr1,
 					 const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0);
 }
 #endif
 
@@ -71,7 +72,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 		return val ? __ffs(val) : size;
 	}
 
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0);
 }
 #endif
 
@@ -99,7 +100,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 		return val == ~0UL ? size : ffz(val);
 	}
 
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0);
 }
 #endif
 
diff --git a/tools/lib/find_bit.c b/tools/lib/find_bit.c
index ba4b8d94e004..4176232de7f9 100644
--- a/tools/lib/find_bit.c
+++ b/tools/lib/find_bit.c
@@ -24,13 +24,14 @@
 /*
  * This is a common helper function for find_next_bit, find_next_zero_bit, and
  * find_next_and_bit. The differences are:
- *  - The "invert" argument, which is XORed with each fetched word before
- *    searching it for one bits.
+ *  - The "invert_src1" and "invert_src2" arguments, which are XORed with
+ *    each source word before applying the 'and' operator.
  *  - The optional "addr2", which is anded with "addr1" if present.
  */
 unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le)
+		unsigned long start, unsigned long invert_src1,
+		unsigned long invert_src2, unsigned long le)
 {
 	unsigned long tmp, mask;
 	(void) le;
@@ -38,10 +39,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 	if (unlikely(start >= nbits))
 		return nbits;
 
-	tmp = addr1[start / BITS_PER_LONG];
+	tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
 	if (addr2)
-		tmp &= addr2[start / BITS_PER_LONG];
-	tmp ^= invert;
+		tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;
 
 	/* Handle 1st word. */
 	mask = BITMAP_FIRST_WORD_MASK(start);
@@ -64,10 +64,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
 		if (start >= nbits)
 			return nbits;
 
-		tmp = addr1[start / BITS_PER_LONG];
+		tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
 		if (addr2)
-			tmp &= addr2[start / BITS_PER_LONG];
-		tmp ^= invert;
+			tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;
 	}
 
 #if (0)
-- 
2.17.1



* [RFC PATCH v2 07/11] lib: implement find_{first,next}_{zero,one}_and_zero_bit
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (5 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 06/11] lib: invert _find_next_bit source arguments Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 08/11] cpumask: implement cpumask_{first,next}_{zero,one}_and_zero Mathieu Desnoyers
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Allow finding the first or next bit within two input bitmasks which is
either:

- zero in both bitmasks,
- respectively one in the first bitmask and zero in the second.
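
For instance (usage sketch; "mask1", "mask2", "nbits" and "prev" are
illustrative):

  /* First bit set in mask1 and cleared in mask2: */
  bit = find_first_one_and_zero_bit(mask1, mask2, nbits);

  /* Next bit cleared in both masks, searching from prev + 1: */
  bit = find_next_zero_and_zero_bit(mask1, mask2, nbits, prev + 1);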

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/find.h | 110 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

diff --git a/include/linux/find.h b/include/linux/find.h
index 41941cb9cad7..9b6edd00b45e 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -76,6 +76,66 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
 }
 #endif
 
+#ifndef find_next_one_and_zero_bit
+/**
+ * find_next_one_and_zero_bit - find the next bit which is one in addr1 and zero in addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next bit set in addr1 and cleared in addr2
+ * If no corresponding bits meet this criterion, returns @size.
+ */
+static inline
+unsigned long find_next_one_and_zero_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = *addr1 & ~*addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr1, addr2, size, offset, 0UL, ~0UL, 0);
+}
+#endif
+
+#ifndef find_next_zero_and_zero_bit
+/**
+ * find_next_zero_and_zero_bit - find the next bit which is zero in addr1 and addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next bit cleared in addr1 and addr2
+ * If no corresponding bits meet this criterion, returns @size.
+ */
+static inline
+unsigned long find_next_zero_and_zero_bit(const unsigned long *addr1,
+		const unsigned long *addr2, unsigned long size,
+		unsigned long offset)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val;
+
+		if (unlikely(offset >= size))
+			return size;
+
+		val = ~*addr1 & ~*addr2 & GENMASK(size - 1, offset);
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr1, addr2, size, offset, ~0UL, ~0UL, 0);
+}
+#endif
+
 #ifndef find_next_zero_bit
 /**
  * find_next_zero_bit - find the next cleared bit in a memory region
@@ -173,6 +233,56 @@ unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
 }
 #endif
 
+#ifndef find_first_one_and_zero_bit
+/**
+ * find_first_one_and_zero_bit - find the first bit which is one in addr1 and zero in addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the first bit set in addr1 and cleared in addr2
+ * If no corresponding bits meet this criterion, returns @size.
+ */
+static inline
+unsigned long find_first_one_and_zero_bit(const unsigned long *addr1,
+				 const unsigned long *addr2,
+				 unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & ~*addr2 & GENMASK(size - 1, 0);
+
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr1, addr2, size, 0, 0UL, ~0UL, 0);
+}
+#endif
+
+#ifndef find_first_zero_and_zero_bit
+/**
+ * find_first_zero_and_zero_bit - find the first bit which is zero in addr1 and addr2
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the first bit cleared in addr1 and addr2
+ * If no corresponding bits meet this criterion, returns @size.
+ */
+static inline
+unsigned long find_first_zero_and_zero_bit(const unsigned long *addr1,
+				 const unsigned long *addr2,
+				 unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = ~*addr1 & ~*addr2 & GENMASK(size - 1, 0);
+
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_next_bit(addr1, addr2, size, 0, ~0UL, ~0UL, 0);
+}
+#endif
+
 #ifndef find_last_bit
 /**
  * find_last_bit - find the last set bit in a memory region
-- 
2.17.1



* [RFC PATCH v2 08/11] cpumask: implement cpumask_{first,next}_{zero,one}_and_zero
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (6 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 07/11] lib: implement find_{first,next}_{zero,one}_and_zero_bit Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id Mathieu Desnoyers
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Allow finding the first or next bit within two input cpumasks which is
either:

- zero in both cpumasks,
- respectively one in the first cpumask and zero in the second.
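
For instance (usage sketch; "srcp1", "srcp2" and "n" are illustrative):

  /* First cpu cleared in both masks: */
  cpu = cpumask_first_zero_and_zero(srcp1, srcp2);

  /* Next cpu set in srcp1 and cleared in srcp2, after cpu n: */
  cpu = cpumask_next_one_and_zero(n, srcp1, srcp2);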

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/linux/cpumask.h | 94 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 64dae70d31f5..040476134557 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -134,6 +134,18 @@ static inline unsigned int cpumask_first_and(const struct cpumask *srcp1,
 	return 0;
 }
 
+static inline unsigned int cpumask_first_one_and_zero(const struct cpumask *srcp1,
+					     const struct cpumask *srcp2)
+{
+	return 0;
+}
+
+static inline unsigned int cpumask_first_zero_and_zero(const struct cpumask *srcp1,
+					     const struct cpumask *srcp2)
+{
+	return 0;
+}
+
 static inline unsigned int cpumask_last(const struct cpumask *srcp)
 {
 	return 0;
@@ -157,6 +169,20 @@ static inline unsigned int cpumask_next_and(int n,
 	return n+1;
 }
 
+static inline unsigned int cpumask_next_one_and_zero(int n,
+					     const struct cpumask *srcp1,
+					     const struct cpumask *srcp2)
+{
+	return n+1;
+}
+
+static inline unsigned int cpumask_next_zero_and_zero(int n,
+					     const struct cpumask *srcp1,
+					     const struct cpumask *srcp2)
+{
+	return n+1;
+}
+
 static inline unsigned int cpumask_next_wrap(int n, const struct cpumask *mask,
 					     int start, bool wrap)
 {
@@ -230,6 +256,36 @@ unsigned int cpumask_first_and(const struct cpumask *srcp1, const struct cpumask
 	return find_first_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
 }
 
+/**
+ * cpumask_first_one_and_zero - return the first cpu from *srcp1 & ~*srcp2
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no cpus match in both.
+ */
+static inline
+unsigned int cpumask_first_one_and_zero(const struct cpumask *srcp1,
+					const struct cpumask *srcp2)
+{
+	return find_first_one_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
+					   nr_cpumask_bits);
+}
+
+/**
+ * cpumask_first_zero_and_zero - return the first cpu from ~*srcp1 & ~*srcp2
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no cpus match in both.
+ */
+static inline
+unsigned int cpumask_first_zero_and_zero(const struct cpumask *srcp1,
+					 const struct cpumask *srcp2)
+{
+	return find_first_zero_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
+					    nr_cpumask_bits);
+}
+
 /**
  * cpumask_last - get the last CPU in a cpumask
  * @srcp:	- the cpumask pointer
@@ -258,6 +314,44 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 	return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
+/**
+ * cpumask_next_one_and_zero - return the next cpu from *srcp1 & ~*srcp2
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no cpus match in both.
+ */
+static inline
+unsigned int cpumask_next_one_and_zero(int n, const struct cpumask *srcp1,
+				       const struct cpumask *srcp2)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_one_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
+					  nr_cpumask_bits, n+1);
+}
+
+/**
+ * cpumask_next_zero_and_zero - return the next cpu from ~*srcp1 & ~*srcp2
+ * @n: the cpu prior to the place to search (ie. return will be > @n)
+ * @srcp1: the first input
+ * @srcp2: the second input
+ *
+ * Returns >= nr_cpu_ids if no cpus match in both.
+ */
+static inline
+unsigned int cpumask_next_zero_and_zero(int n, const struct cpumask *srcp1,
+					const struct cpumask *srcp2)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpumask_check(n);
+	return find_next_zero_and_zero_bit(cpumask_bits(srcp1), cpumask_bits(srcp2),
+					   nr_cpumask_bits, n+1);
+}
+
 int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
 int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
-- 
2.17.1



* [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (7 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 08/11] cpumask: implement cpumask_{first,next}_{zero,one}_and_zero Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-21 17:38   ` [RFC PATCH v3 " Mathieu Desnoyers
  2022-02-25 17:35   ` [RFC PATCH v2 " Jonathan Corbet
  2022-02-18 21:06 ` [RFC PATCH v2 10/11] rseq: extend struct rseq with per memory space vcpu id Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 11/11] selftests/rseq: Implement rseq vm_vcpu_id field support Mathieu Desnoyers
  10 siblings, 2 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory space. If a memory space has fewer threads than
cores, or is limited to run on few cores concurrently through sched
affinity or cgroup cpusets, the virtual cpu ids will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.

The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
  cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
  restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.

One of the key concerns from scheduler maintainers is the overhead
associated with additional atomic operations in the scheduler fast-path.
In order to save an atomic set-bit and an atomic clear-bit operation on
the scheduler context-switch fast path, the following optimizations are
implemented:

1) On context switch between threads belonging to the same memory space,
   transfer the mm_vcpu_id from prev to next without any atomic ops.
   This takes care of use-cases involving frequent context switch
   between threads belonging to the same memory space.

2) Threads belonging to a memory space with single user (mm_users==1)
   can be assigned mm_vcpu_id=0 without any atomic operation on the
   scheduler fast-path. In non-NUMA, when a memory space goes from
   single to multi-threaded, lazily allocate the vcpu_id 0 in the mm
   vcpu mask. This takes care of all single-threaded use-cases
   involving context switching between threads belonging to different
   memory spaces.

   With NUMA, the single-threaded memory space scenario is still
   special-cased to eliminate all atomic operations on the fast path,
   but rather than returning vcpu_id=0, return the current numa_node_id
   to allow single-threaded memory spaces to keep good numa locality.
   On systems where the number of cpu ids is lower than the number of
   numa node ids, pick the first cpu in the node cpumask rather than the
   node ID.

3) Introduce a per-runqueue cache containing { mm, vcpu_id } entries.
   Keep track of the recently allocated vcpu_id for each mm rather than
   freeing them immediately. This eliminates most atomic ops when
   context switching back and forth between threads belonging to
   different memory spaces in multi-threaded scenarios (many processes,
   each with many threads).
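
A hedged sketch of the resulting context-switch fast path for the
incoming task; rq_vcpu_cache_lookup_or_alloc() is an illustrative name
for the slow path, not the exact implementation:

  if (prev->mm == next->mm) {
          /* 1) Same memory space: transfer the vcpu id, no atomic ops. */
          next->mm_vcpu = prev->mm_vcpu;
          prev->mm_vcpu = -1;
  } else if (atomic_read(&next->mm->mm_vcpu_users) == 1) {
          /* 2) Single user: vcpu id 0 (or a numa-local id), no atomic ops. */
          next->mm_vcpu = 0;
  } else {
          /*
           * 3) Look up the per-runqueue { mm, vcpu_id } cache before
           * falling back to an atomic bit-set in mm_vcpumask(mm).
           */
          next->mm_vcpu = rq_vcpu_cache_lookup_or_alloc(rq, next->mm);
  }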

Credit goes to Paul Turner (Google) for the vcpu_id idea. This feature
is implemented based on discussions with Paul Turner and Peter Oskolkov
(Google), but I took the liberty of implementing the scheduler fast-path
optimizations and my own NUMA-awareness scheme. Rumor has it that Google
has been running an rseq vcpu_id extension internally in production for
a year. The tcmalloc source code indeed has comments hinting at a
vcpu_id prototype extension to the rseq system call [1].

schedstats:

* perf bench sched messaging (single instance, multi-process):

On sched-switch:
  single-threaded vcpu-id:                99.985 %
  transfer between threads:                0     %
  runqueue cache hit:                      0.015 %
  runqueue cache eviction (bit-clear):     0     %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0     %

On release mm:
  vcpu-id remove (bit-clear):              0     %

On migration:
  vcpu-id remove (bit-clear):              0     %

* perf bench sched messaging -t (single instance, multi-thread):

On sched-switch:
  single-threaded vcpu-id:                 0.128 %
  transfer between threads:               98.260 %
  runqueue cache hit:                      1.075 %
  runqueue cache eviction (bit-clear):     0.001 %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0.269 %

On release mm:
  vcpu-id remove (bit-clear):              0.161 %

On migration:
  vcpu-id remove (bit-clear):              0.107 %

* perf bench sched messaging -t (two instances, multi-thread):

On sched-switch:
  single-threaded vcpu-id:                 0.081 %
  transfer between threads:               89.512 %
  runqueue cache hit:                      9.659 %
  runqueue cache eviction (bit-clear):     0.003 %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0.374 %

On release mm:
  vcpu-id remove (bit-clear):              0.243 %

On migration:
  vcpu-id remove (bit-clear):              0.129 %

* perf bench sched pipe (one instance, multi-process):

On sched-switch:
  single-threaded vcpu-id:                99.993 %
  transfer between threads:                0.001 %
  runqueue cache hit:                      0.002 %
  runqueue cache eviction (bit-clear):     0     %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0.002 %

On release mm:
  vcpu-id remove (bit-clear):              0     %

On migration:
  vcpu-id remove (bit-clear):              0.002 %

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/exec.c                |   4 +
 include/linux/mm.h       |  25 +++
 include/linux/mm_types.h | 111 ++++++++++++
 include/linux/sched.h    |   5 +
 init/Kconfig             |   4 +
 kernel/fork.c            |  15 +-
 kernel/sched/core.c      |  82 +++++++++
 kernel/sched/deadline.c  |   3 +
 kernel/sched/debug.c     |  13 ++
 kernel/sched/fair.c      |   1 +
 kernel/sched/rt.c        |   2 +
 kernel/sched/sched.h     | 364 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/stats.c     |  16 +-
 13 files changed, 642 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..7b7520b63e95 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,10 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	mm_init_vcpu_users(mm);
+	mm_init_vcpumask(mm);
+	mm_init_node_vcpumask(mm);
+	sched_vcpu_activate_mm(tsk, mm);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..6ca8a4a85fcd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3374,5 +3374,30 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	return t->mm_vcpu;
+}
+#else
+static inline void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the mm vcpu feature is
+	 * disabled. This provides functional per-cpu data structure accesses
+	 * in user-space, although it won't provide the memory usage benefits.
+	 */
+	return raw_smp_processor_id();
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9db36dc5d4cf..40fcc526396f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
 
 #include <asm/mmu.h>
 
@@ -502,6 +503,20 @@ struct mm_struct {
 		 */
 		atomic_t mm_count;
 
+#ifdef CONFIG_SCHED_MM_VCPU
+		/**
+		 * @mm_vcpu_users: The number of references to &struct mm_struct
+		 * from user-space threads.
+		 *
+		 * Initialized to 1 for the first thread with a reference to
+		 * the mm. Incremented for each thread getting a reference to the
+		 * mm, and decremented on mm release from user-space threads.
+		 * Used to enable single-threaded mm_vcpu accounting (when == 1).
+		 */
+
+		atomic_t mm_vcpu_users;
+#endif
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -659,6 +674,102 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_SCHED_MM_VCPU
+/* Future-safe accessor for struct mm_struct's vcpu_mask. */
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm;
+
+	vcpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	/* Skip cpu_bitmap */
+	vcpu_bitmap += cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_vcpu_users(struct mm_struct *mm)
+{
+	atomic_set(&mm->mm_vcpu_users, 1);
+}
+
+static inline void mm_init_vcpumask(struct mm_struct *mm)
+{
+	cpumask_clear(mm_vcpumask(mm));
+}
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+	return cpumask_size();
+}
+
+#else
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+	return NULL;
+}
+
+static inline void mm_init_vcpu_users(struct mm_struct *mm) { }
+static inline void mm_init_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_SCHED_MM_VCPU) && defined(CONFIG_NUMA)
+/*
+ * Layout of node vcpumasks:
+ * - node_alloc vcpumask:        cpumask tracking which vcpu_id were
+ *                               allocated (across nodes) in this
+ *                               memory space.
+ * - node vcpumask[nr_node_ids]: per-node cpumask tracking which vcpu_id
+ *                               were allocated in this memory space.
+ */
+static inline cpumask_t *mm_node_alloc_vcpumask(struct mm_struct *mm)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm_vcpumask(mm);
+
+	/* Skip mm_vcpumask */
+	vcpu_bitmap += cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned int node)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm_node_alloc_vcpumask(mm);
+
+	/* Skip node alloc vcpumask */
+	vcpu_bitmap += cpumask_size();
+	vcpu_bitmap += node * cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_node_vcpumask(struct mm_struct *mm)
+{
+	unsigned int node;
+
+	if (num_possible_nodes() == 1)
+		return;
+	cpumask_clear(mm_node_alloc_vcpumask(mm));
+	for (node = 0; node < nr_node_ids; node++)
+		cpumask_clear(mm_node_vcpumask(mm, node));
+}
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+	if (num_possible_nodes() == 1)
+		return 0;
+	return (nr_node_ids + 1) * cpumask_size();
+}
+#else
+static inline void mm_init_node_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+	return 0;
+}
+#endif
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 838c9e0b4cae..c400d44f8716 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1300,6 +1300,11 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_SCHED_MM_VCPU
+	int				mm_vcpu;	/* Current vcpu in mm */
+	int				vcpu_mm_active;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..6bd40f303a0d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1023,6 +1023,10 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config SCHED_MM_VCPU
+	def_bool y
+	depends on SMP && RSEQ
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..78fcf3277540 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -970,6 +970,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+
+#ifdef CONFIG_SCHED_MM_VCPU
+	tsk->mm_vcpu = 0;
+	tsk->vcpu_mm_active = 0;
+#endif
 	return tsk;
 
 free_stack:
@@ -1079,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	mm_init_vcpu_users(mm);
+	mm_init_vcpumask(mm);
+	mm_init_node_vcpumask(mm);
 	return mm;
 
 fail_nocontext:
@@ -1380,6 +1388,8 @@ static int wait_for_vfork_done(struct task_struct *child,
  */
 static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
+	sched_vcpu_release_mm(tsk, mm);
+
 	uprobe_free_utask(tsk);
 
 	/* Get rid of any cached register state */
@@ -1499,10 +1509,12 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
+		sched_vcpu_get_mm(tsk, mm);
 	} else {
 		mm = dup_mm(tsk, current->mm);
 		if (!mm)
 			return -ENOMEM;
+		sched_vcpu_dup_mm(tsk, mm);
 	}
 
 	tsk->mm = mm;
@@ -2901,7 +2913,8 @@ void __init proc_caches_init(void)
 	 * dynamically sized based on the maximum CPU number this system
 	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
-	mm_size = sizeof(struct mm_struct) + cpumask_size();
+	mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_vcpumask_size() +
+		  mm_node_vcpumask_size();
 
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
 			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e08b02e0cd5..70bf2899c9b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2267,6 +2267,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 	lockdep_assert_rq_held(rq);
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+	rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
 	set_task_cpu(p, new_cpu);
 	rq_unlock(rq, rf);
 
@@ -2454,6 +2455,7 @@ int push_cpu_stop(void *arg)
 	// XXX validate p is still the highest prio task
 	if (task_rq(p) == rq) {
 		deactivate_task(rq, p, 0);
+		rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
 		set_task_cpu(p, lowest_rq->cpu);
 		activate_task(lowest_rq, p, 0);
 		resched_curr(lowest_rq);
@@ -3093,6 +3095,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
 		rq_pin_lock(dst_rq, &drf);
 
 		deactivate_task(src_rq, p, 0);
+		rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
 		set_task_cpu(p, cpu);
 		activate_task(dst_rq, p, 0);
 		check_preempt_curr(dst_rq, p, 0);
@@ -3716,6 +3719,8 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
+	if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+		rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false);
 	__smp_call_single_queue(cpu, &p->wake_entry.llist);
 }
 
@@ -4125,6 +4130,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 		wake_flags |= WF_MIGRATED;
 		psi_ttwu_dequeue(p);
+		rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false);
 		set_task_cpu(p, cpu);
 	}
 #else
@@ -4796,6 +4802,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
+	switch_mm_vcpu(rq, prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5922,6 +5929,7 @@ static bool try_steal_cookie(int this, int that)
 			goto next;
 
 		deactivate_task(src, p, 0);
+		rq_vcpu_cache_remove_mm_locked(src, p->mm, false);
 		set_task_cpu(p, this);
 		activate_task(dst, p, 0);
 
@@ -10927,3 +10935,77 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
 {
         trace_sched_update_nr_running_tp(rq, count);
 }
+
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!mm)
+		return;
+	WARN_ON_ONCE(t != current);
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	t->vcpu_mm_active = 0;
+	atomic_dec(&mm->mm_vcpu_users);
+	rq_vcpu_cache_remove_mm_locked(rq, mm, true);
+	rq_unlock_irqrestore(rq, &rf);
+	t->mm_vcpu = -1;
+	preempt_enable();
+}
+
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm)
+{
+	WARN_ON_ONCE(t != current);
+	preempt_disable();
+	t->vcpu_mm_active = 1;
+	/* No need to reserve in cpumask because single-threaded. */
+	t->mm_vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
+	preempt_enable();
+}
+
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm)
+{
+	int vcpu, mm_vcpu_users;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	preempt_disable();
+	rq = this_rq();
+	t->vcpu_mm_active = 1;
+	mm_vcpu_users = atomic_read(&mm->mm_vcpu_users);
+	atomic_inc(&mm->mm_vcpu_users);
+	t->mm_vcpu = -1;
+	vcpu = current->mm_vcpu;
+	rq_lock_irqsave(rq, &rf);
+	/* On transition from 1 to 2 mm users, reserve vcpu ids. */
+	if (mm_vcpu_users == 1) {
+		mm_vcpu_reserve_nodes(mm);
+		rq_vcpu_cache_remove_mm_locked(rq, mm, true);
+		current->mm_vcpu = __mm_vcpu_get(rq, mm);
+		rq_vcpu_cache_add(rq, mm, current->mm_vcpu);
+		/*
+		 * __mm_vcpu_get could get a different vcpu after going
+		 * multi-threaded, then back single-threaded, then
+		 * multi-threaded on a NUMA configuration using the first CPU
+		 * matching the NUMA node as single-threaded vcpu, with
+		 * leftover vcpu_id matching the NUMA node set from when this
+		 * task was multithreaded.
+		 */
+		if (current->mm_vcpu != vcpu)
+			rseq_set_notify_resume(current);
+	}
+	rq_unlock_irqrestore(rq, &rf);
+	preempt_enable();
+}
+
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm)
+{
+	preempt_disable();
+	t->vcpu_mm_active = 1;
+	t->mm_vcpu = -1;
+	preempt_enable();
+}
+#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d2c072b0ef01..f4f394db2db8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -655,6 +655,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
 	raw_spin_unlock(&dl_b->lock);
 
+	rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
 	set_task_cpu(p, later_rq->cpu);
 	double_unlock_balance(later_rq, rq);
 
@@ -2290,6 +2291,7 @@ static int push_dl_task(struct rq *rq)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false);
 	set_task_cpu(next_task, later_rq->cpu);
 
 	/*
@@ -2386,6 +2388,7 @@ static void pull_dl_task(struct rq *this_rq)
 				push_task = get_push_task(src_rq);
 			} else {
 				deactivate_task(src_rq, p, 0);
+				rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
 				set_task_cpu(p, this_cpu);
 				activate_task(this_rq, p, 0);
 				dmin = p->dl.deadline;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 102d6f70e84d..3b44f8dd064d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -763,6 +763,19 @@ do {									\
 		P(sched_goidle);
 		P(ttwu_count);
 		P(ttwu_local);
+#undef P
+#define P(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, schedstat_val(rq->n));
+		P(nr_vcpu_thread_transfer);
+		P(nr_vcpu_cache_hit);
+		P(nr_vcpu_cache_evict);
+		P(nr_vcpu_cache_discard_wrong_node);
+		P(nr_vcpu_allocate);
+		P(nr_vcpu_allocate_node_reuse);
+		P(nr_vcpu_allocate_node_new);
+		P(nr_vcpu_allocate_node_rebalance);
+		P(nr_vcpu_allocate_node_steal);
+		P(nr_vcpu_remove_release_mm);
+		P(nr_vcpu_remove_migrate);
 	}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcbd3110c687..9c8b88e57315 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7817,6 +7817,7 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 	lockdep_assert_rq_held(env->src_rq);
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
+	rq_vcpu_cache_remove_mm_locked(env->src_rq, p->mm, false);
 	set_task_cpu(p, env->dst_cpu);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7b4f4fbbb404..fd37e23612f9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,6 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 	resched_curr(lowest_rq);
@@ -2379,6 +2380,7 @@ static void pull_rt_task(struct rq *this_rq)
 				push_task = get_push_task(src_rq);
 			} else {
 				deactivate_task(src_rq, p, 0);
+				rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
 				set_task_cpu(p, this_cpu);
 				activate_task(this_rq, p, 0);
 				resched = true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3da5718cd641..5034a7372452 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -916,6 +916,19 @@ struct uclamp_rq {
 DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
+#ifdef CONFIG_SCHED_MM_VCPU
+# define RQ_VCPU_CACHE_SIZE	8
+struct rq_vcpu_entry {
+	struct mm_struct *mm;	/* NULL if unset */
+	int vcpu_id;
+};
+
+struct rq_vcpu_cache {
+	struct rq_vcpu_entry entry[RQ_VCPU_CACHE_SIZE];
+	unsigned int head;
+};
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -1086,6 +1099,19 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+
+	unsigned long long	nr_vcpu_single_thread;
+	unsigned long long	nr_vcpu_thread_transfer;
+	unsigned long long	nr_vcpu_cache_hit;
+	unsigned long long	nr_vcpu_cache_evict;
+	unsigned long long	nr_vcpu_cache_discard_wrong_node;
+	unsigned long long	nr_vcpu_allocate;
+	unsigned long long	nr_vcpu_allocate_node_reuse;
+	unsigned long long	nr_vcpu_allocate_node_new;
+	unsigned long long	nr_vcpu_allocate_node_rebalance;
+	unsigned long long	nr_vcpu_allocate_node_steal;
+	unsigned long long	nr_vcpu_remove_release_mm;
+	unsigned long long	nr_vcpu_remove_migrate;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
@@ -1116,6 +1142,10 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+
+#ifdef CONFIG_SCHED_MM_VCPU
+	struct rq_vcpu_cache	vcpu_cache;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3137,3 +3167,337 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
+#ifdef CONFIG_SCHED_MM_VCPU
+static inline int __mm_vcpu_get_single_node(struct mm_struct *mm)
+{
+	struct cpumask *cpumask;
+	int vcpu;
+
+	cpumask = mm_vcpumask(mm);
+	/* Atomically reserve lowest available vcpu number. */
+	do {
+		vcpu = cpumask_first_zero(cpumask);
+		if (vcpu >= nr_cpu_ids)
+			return -1;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	return vcpu;
+}
+
+#ifdef CONFIG_NUMA
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+	if (num_possible_nodes() == 1)
+		return true;
+	return cpumask_test_cpu(vcpu_id, mm_node_vcpumask(mm, numa_node_id()));
+}
+
+static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm)
+{
+	struct cpumask *cpumask = mm_vcpumask(mm),
+		       *node_cpumask = mm_node_vcpumask(mm, numa_node_id()),
+		       *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+	unsigned int node;
+	int vcpu;
+
+	if (num_possible_nodes() == 1)
+		return __mm_vcpu_get_single_node(mm);
+
+	/*
+	 * Try to atomically reserve lowest available vcpu number within those
+	 * already reserved for this NUMA node.
+	 */
+	do {
+		vcpu = cpumask_first_one_and_zero(node_cpumask, cpumask);
+		if (vcpu >= nr_cpu_ids)
+			goto alloc_numa;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	schedstat_inc(rq->nr_vcpu_allocate_node_reuse);
+	goto end;
+
+alloc_numa:
+	/*
+	 * Try to atomically reserve lowest available vcpu number within those
+	 * not already allocated for numa nodes.
+	 */
+	do {
+		vcpu = cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask);
+		if (vcpu >= nr_cpu_ids)
+			goto numa_update;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	cpumask_set_cpu(vcpu, node_cpumask);
+	cpumask_set_cpu(vcpu, node_alloc_cpumask);
+	schedstat_inc(rq->nr_vcpu_allocate_node_new);
+	goto end;
+
+numa_update:
+	/*
+	 * NUMA node id configuration changed for at least one CPU in the system.
+	 * We need to steal a currently unused vcpu_id from an overprovisioned
+	 * node for our current node. Userspace must handle the fact that the
+	 * node id associated with this vcpu_id may change due to node ID
+	 * reconfiguration.
+	 *
+	 * Count how many possible cpus are attached to each (other) node id,
+	 * and compare this with the per-mm node vcpumask cpu count. Steal
+	 * from a node whose vcpumask has more vcpu_ids set than the node has cpus.
+	 */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_vcpumask(mm, node);
+		if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+			/* Try to steal from this node. */
+			do {
+				vcpu = cpumask_first_one_and_zero(iter_cpumask, cpumask);
+				if (vcpu >= nr_cpu_ids)
+					goto steal_fail;
+			} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+			cpumask_clear_cpu(vcpu, iter_cpumask);
+			cpumask_set_cpu(vcpu, node_cpumask);
+			schedstat_inc(rq->nr_vcpu_allocate_node_rebalance);
+			goto end;
+		}
+	}
+
+steal_fail:
+	/*
+	 * Our attempt at gracefully stealing a vcpu_id from another
+	 * overprovisioned NUMA node failed. Fall back to grabbing the first
+	 * available vcpu_id.
+	 */
+	do {
+		vcpu = cpumask_first_zero(cpumask);
+		if (vcpu >= nr_cpu_ids)
+			return -1;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	/* Steal vcpu from its numa node mask. */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_vcpumask(mm, node);
+		if (cpumask_test_cpu(vcpu, iter_cpumask)) {
+			cpumask_clear_cpu(vcpu, iter_cpumask);
+			break;
+		}
+	}
+	cpumask_set_cpu(vcpu, node_cpumask);
+	schedstat_inc(rq->nr_vcpu_allocate_node_steal);
+end:
+	return vcpu;
+}
+
+static inline int mm_vcpu_first_node_vcpu(int node)
+{
+	int vcpu;
+
+	if (likely(nr_cpu_ids >= nr_node_ids))
+		return node;
+	vcpu = cpumask_first(cpumask_of_node(node));
+	if (vcpu >= nr_cpu_ids)
+		return -1;
+	return vcpu;
+}
+
+/*
+ * Single-threaded processes observe a mapping of vcpu_id->node_id where
+ * the vcpu_id returned corresponds to mm_vcpu_first_node_vcpu(). When going
+ * from single to multi-threaded, reserve this same mapping so it stays
+ * invariant.
+ */
+static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm)
+{
+	struct cpumask *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+	int node, other_node;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask = mm_node_vcpumask(mm, node);
+		int vcpu = mm_vcpu_first_node_vcpu(node);
+
+		/* Skip nodes that have no CPU associated with them. */
+		if (vcpu < 0)
+			continue;
+		cpumask_set_cpu(vcpu, iter_cpumask);
+		cpumask_set_cpu(vcpu, node_alloc_cpumask);
+		for (other_node = 0; other_node < nr_node_ids; other_node++) {
+			if (other_node == node)
+				continue;
+			cpumask_clear_cpu(vcpu, mm_node_vcpumask(mm, other_node));
+		}
+	}
+}
+#else
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+	return true;
+}
+static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm)
+{
+	return __mm_vcpu_get_single_node(mm);
+}
+static inline int mm_vcpu_first_node_vcpu(int node)
+{
+	return 0;
+}
+static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm) { }
+#endif
+
+static inline void __mm_vcpu_put(struct mm_struct *mm, int vcpu)
+{
+	if (vcpu < 0)
+		return;
+	cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
+}
+
+static inline struct rq_vcpu_entry *rq_vcpu_cache_lookup(struct rq *rq, struct mm_struct *mm)
+{
+	struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+	int i;
+
+	for (i = 0; i < RQ_VCPU_CACHE_SIZE; i++) {
+		struct rq_vcpu_entry *entry = &vcpu_cache->entry[i];
+
+		if (entry->mm == mm)
+			return entry;
+	}
+	return NULL;
+}
+
+/* Removal from cache simply leaves an unused hole. */
+static inline int rq_vcpu_cache_lookup_remove(struct rq *rq, struct mm_struct *mm)
+{
+	struct rq_vcpu_entry *entry = rq_vcpu_cache_lookup(rq, mm);
+
+	if (!entry)
+		return -1;
+	entry->mm = NULL;	/* Remove from cache */
+	return entry->vcpu_id;
+}
+
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq, struct mm_struct *mm,
+						  bool release_mm)
+{
+	int vcpu;
+
+	if (!mm)
+		return;
+	/*
+	 * Do not remove the cache entry for a runqueue that runs a task which
+	 * currently uses the target mm.
+	 */
+	if (!release_mm && rq->curr->mm == mm)
+		return;
+	vcpu = rq_vcpu_cache_lookup_remove(rq, mm);
+	if (vcpu < 0)
+		return;
+	if (release_mm)
+		schedstat_inc(rq->nr_vcpu_remove_release_mm);
+	else
+		schedstat_inc(rq->nr_vcpu_remove_migrate);
+	__mm_vcpu_put(mm, vcpu);
+}
+
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm,
+					   bool release_mm)
+{
+	struct rq_flags rf;
+
+	rq_lock_irqsave(rq, &rf);
+	rq_vcpu_cache_remove_mm_locked(rq, mm, release_mm);
+	rq_unlock_irqrestore(rq, &rf);
+}
+
+/*
+ * Add at head, move head forward. Cheap LRU cache.
+ * Only need to clear the vcpu mask bit from its own mm_vcpumask(mm) when we
+ * overwrite an old entry from the cache. Note that this is not needed if the
+ * overwritten entry is an unused hole. This access to the old_mm from an
+ * unrelated thread requires that the cache entry for a given mm gets pruned from
+ * the cache when a task is dequeued from the runqueue.
+ */
+static inline void rq_vcpu_cache_add(struct rq *rq, struct mm_struct *mm,
+				     int vcpu_id)
+{
+	struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+	struct mm_struct *old_mm;
+	struct rq_vcpu_entry *entry;
+	unsigned int pos;
+
+	pos = vcpu_cache->head;
+	entry = &vcpu_cache->entry[pos];
+	old_mm = entry->mm;
+	if (old_mm) {
+		schedstat_inc(rq->nr_vcpu_cache_evict);
+		__mm_vcpu_put(old_mm, entry->vcpu_id);
+	}
+	entry->mm = mm;
+	entry->vcpu_id = vcpu_id;
+	vcpu_cache->head = (pos + 1) % RQ_VCPU_CACHE_SIZE;
+}
+
+static inline int mm_vcpu_get(struct rq *rq, struct task_struct *t)
+{
+	struct rq_vcpu_entry *entry;
+	struct mm_struct *mm = t->mm;
+	int vcpu;
+
+	/* Skip allocation if mm is single-threaded. */
+	if (atomic_read(&mm->mm_vcpu_users) == 1) {
+		schedstat_inc(rq->nr_vcpu_single_thread);
+		vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
+		goto end;
+	}
+	entry = rq_vcpu_cache_lookup(rq, mm);
+	if (likely(entry)) {
+		vcpu = entry->vcpu_id;
+		if (likely(mm_node_vcpumask_test_cpu(mm, vcpu))) {
+			schedstat_inc(rq->nr_vcpu_cache_hit);
+			goto end;
+		} else {
+			schedstat_inc(rq->nr_vcpu_cache_discard_wrong_node);
+			entry->mm = NULL;	/* Remove from cache */
+			__mm_vcpu_put(mm, vcpu);
+		}
+	}
+	schedstat_inc(rq->nr_vcpu_allocate);
+	vcpu = __mm_vcpu_get(rq, mm);
+	rq_vcpu_cache_add(rq, mm, vcpu);
+end:
+	return vcpu;
+}
+
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+				  struct task_struct *next)
+{
+	if (!(next->flags & PF_EXITING) && !(next->flags & PF_KTHREAD) &&
+	    next->mm && next->vcpu_mm_active) {
+		if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) &&
+				prev->mm == next->mm && prev->vcpu_mm_active &&
+				mm_node_vcpumask_test_cpu(next->mm, prev->mm_vcpu)) {
+			/*
+			 * Switching between threads with the same mm. Simply pass the
+			 * vcpu token along to the next thread.
+			 */
+			schedstat_inc(rq->nr_vcpu_thread_transfer);
+			next->mm_vcpu = prev->mm_vcpu;
+		} else {
+			next->mm_vcpu = mm_vcpu_get(rq, next);
+		}
+	}
+	if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) &&
+	    prev->mm && prev->vcpu_mm_active)
+		prev->mm_vcpu = -1;
+}
+
+#else
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+				  struct task_struct *next) { }
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq,
+						  struct mm_struct *mm,
+						  bool release_mm) { }
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm,
+						  bool release_mm) { }
+#endif
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..027d0caf2d14 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -134,12 +134,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->nr_vcpu_single_thread,
+		    rq->nr_vcpu_thread_transfer,
+		    rq->nr_vcpu_cache_hit,
+		    rq->nr_vcpu_cache_evict,
+		    rq->nr_vcpu_cache_discard_wrong_node,
+		    rq->nr_vcpu_allocate,
+		    rq->nr_vcpu_allocate_node_reuse,
+		    rq->nr_vcpu_allocate_node_new,
+		    rq->nr_vcpu_allocate_node_rebalance,
+		    rq->nr_vcpu_allocate_node_steal,
+		    rq->nr_vcpu_remove_release_mm,
+		    rq->nr_vcpu_remove_migrate);
 
 		seq_printf(seq, "\n");
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 10/11] rseq: extend struct rseq with per memory space vcpu id
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (8 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  2022-02-18 21:06 ` [RFC PATCH v2 11/11] selftests/rseq: Implement rseq vm_vcpu_id field support Mathieu Desnoyers
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

If a memory space has fewer threads than cores, or is limited to run on
a few cores concurrently through sched affinity or cgroup cpusets, the
virtual cpu ids will be values close to 0, thus allowing efficient use
of user-space memory for per-cpu data structures.
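
For illustration only (not part of this patch): a minimal user-space
sketch of consuming the new field, assuming a kernel with this series
applied and that rseq_area points at the current thread's registered
rseq area, registered with a length large enough to cover vm_vcpu_id
(e.g. through the extensible registration added earlier in the series):

#include <stdint.h>
#include <linux/rseq.h>		/* provides vm_vcpu_id with this series applied */

/* Assumed to be set up at thread registration time (not shown). */
static __thread volatile struct rseq *rseq_area;

static inline uint32_t current_vm_vcpu_id(void)
{
	/* Single-copy atomicity: a plain volatile load is sufficient. */
	return rseq_area->vm_vcpu_id;
}

/* Per-vcpu counters indexed by vcpu id instead of raw cpu number. */
static long counters[256];	/* example size; >= number of possible cpus */

static inline void count_event(void)
{
	counters[current_vm_vcpu_id()]++;	/* unsynchronized, illustration only */
}

A memory allocator would index its per-vcpu caches the same way,
keeping the hot entries packed near the start of the array instead of
spread across the whole possible cpu id range.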

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/uapi/linux/rseq.h |  9 +++++++++
 kernel/rseq.c             | 10 +++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 1cb90a435c5c..77a136586ac6 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -139,6 +139,15 @@ struct rseq {
 	 */
 	__u32 node_id;
 
+	/*
+	 * Restartable sequences vm_vcpu_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+	 * (allocated uniquely within a memory space).
+	 */
+	__u32 vm_vcpu_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index cb7d8a5afc82..1b00339c341b 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -89,12 +89,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 	struct rseq __user *rseq = t->rseq;
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
+	u32 vm_vcpu_id = task_mm_vcpu_id(t);
 
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
+	unsafe_put_user(vm_vcpu_id, &rseq->vm_vcpu_id, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -112,7 +114,8 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 
 static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;
+	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
+	    vm_vcpu_id = 0;
 
 	/*
 	 * Reset cpu_id_start to its initial state (0).
@@ -131,6 +134,11 @@ static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 	 */
 	if (put_user(node_id, &t->rseq->node_id))
 		return -EFAULT;
+	/*
+	 * Reset vm_vcpu_id to its initial state (0).
+	 */
+	if (put_user(vm_vcpu_id, &t->rseq->vm_vcpu_id))
+		return -EFAULT;
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally reset only if
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 11/11] selftests/rseq: Implement rseq vm_vcpu_id field support
  2022-02-18 21:06 [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
                   ` (9 preceding siblings ...)
  2022-02-18 21:06 ` [RFC PATCH v2 10/11] rseq: extend struct rseq with per memory space vcpu id Mathieu Desnoyers
@ 2022-02-18 21:06 ` Mathieu Desnoyers
  10 siblings, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-18 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/rseq-abi.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index a1faa9162d52..1ee4740ebe94 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -155,6 +155,15 @@ struct rseq_abi {
 	 */
 	__u32 node_id;
 
+	/*
+	 * Restartable sequences vm_vcpu_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+	 * (allocated uniquely within a memory space).
+	 */
+	__u32 vm_vcpu_id;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH v3 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-18 21:06 ` [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id Mathieu Desnoyers
@ 2022-02-21 17:38   ` Mathieu Desnoyers
  2022-02-25 17:35   ` [RFC PATCH v2 " Jonathan Corbet
  1 sibling, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-21 17:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory space. If a memory space has fewer threads than
cores, or is limited to run on a few cores concurrently through sched
affinity or cgroup cpusets, the virtual cpu ids will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.
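
To put rough numbers on that claim (hypothetical figures for
illustration, not measurements from this series): with a 4 kB per-cpu
allocator cache on a 256-core server, indexing by raw cpu number means
provisioning 256 * 4 kB = 1 MB per memory space. A container restricted
to 8 CPUs and running 4 threads sees vcpu ids staying around
min(nr_threads, nr_allowed_cpus) = 4, so indexing by vcpu id touches on
the order of 4 * 4 kB = 16 kB instead.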

The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
  cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
  restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.
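
The last item, combined with the NUMA invariance guarantee above, also
lets user-space cache the vcpu_id-to-node association the first time it
observes it. A minimal sketch (illustration only; the
current_vm_vcpu_id() and current_numa_node_id() helpers are assumed to
read the vm_vcpu_id and node_id rseq fields of the current thread, they
are not part of this series):

#include <stdint.h>

#define MAX_VCPUS	4096	/* upper bound on possible cpu ids, system dependent */

/* 0 means "not observed yet", otherwise node id + 1. */
static uint16_t vcpu_node_plus_one[MAX_VCPUS];

/* Assumed helpers reading the current thread's rseq fields. */
extern uint32_t current_vm_vcpu_id(void);
extern uint32_t current_numa_node_id(void);

static uint32_t vcpu_to_node(uint32_t vcpu)
{
	uint16_t cached = vcpu_node_plus_one[vcpu];

	if (!cached) {
		/*
		 * First observation: the node associated with this vcpu id
		 * will not change (short of a kernel-level NUMA
		 * reconfiguration), so it can be cached; concurrent updates
		 * store the same value. A robust implementation would
		 * re-check vm_vcpu_id after reading node_id to make sure
		 * both were sampled without a migration in between.
		 */
		cached = (uint16_t)(current_numa_node_id() + 1);
		vcpu_node_plus_one[vcpu] = cached;
	}
	return cached - 1;
}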

One of the key concerns from scheduler maintainers is the overhead
associated with additional atomic operations in the scheduler fast-path.
In order to save one atomic set bit and one atomic clear bit on the
scheduler context switch fast path, the following optimizations are
implemented:

1) On context switch between threads belonging to the same memory space,
   transfer the mm_vcpu_id from prev to next without any atomic ops.
   This takes care of use-cases involving frequent context switch
   between threads belonging to the same memory space.

2) Threads belonging to a memory space with single user (mm_users==1)
   can be assigned mm_vcpu_id=0 without any atomic operation on the
   scheduler fast-path. In non-NUMA, when a memory space goes from
   single to multi-threaded, lazily allocate the vcpu_id 0 in the mm
   vcpu mask. This takes care of all single-threaded use-cases
   involving context switching between threads belonging to different
   memory spaces.

   With NUMA, the single-threaded memory space scenario is still
   special-cased to eliminate all atomic operations on the fast path,
   but rather than returning vcpu_id=0, return the current numa_node_id
   to allow single-threaded memory spaces to keep good numa locality.
   On systems where the number of cpus ids is lower than the number of
   numa node ids, pick the first cpu in the node cpumask rather than the
   node ID.

3) Introduce a per-runqueue cache containing { mm, vcpu_id } entries.
   Keep track of the recently allocated vcpu_id for each mm rather than
   freeing them immediately. This eliminates most atomic ops when
   context switching back and forth between threads belonging to
   different memory spaces in multi-threaded scenarios (many processes,
   each with many threads).

The credit goes to Paul Turner (Google) for the vcpu_id idea. This
feature is implemented based on the discussions with Paul Turner and
Peter Oskolkov (Google), but I took the liberty to implement scheduler
fast-path optimizations and my own NUMA-awareness scheme. Rumor has it
that Google has been running an rseq vcpu_id extension internally in
production for a year. The tcmalloc source code indeed has comments
hinting at a vcpu_id prototype extension to the rseq system call [1].

schedstats:

* perf bench sched messaging (single instance, multi-process):

On sched-switch:
  single-threaded vcpu-id:                99.985 %
  transfer between threads:                0     %
  runqueue cache hit:                      0.015 %
  runqueue cache eviction (bit-clear):     0     %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0     %

On release mm:
  vcpu-id remove (bit-clear):              0     %

On migration:
  vcpu-id remove (bit-clear):              0     %

* perf bench sched messaging -t (single instance, multi-thread):

On sched-switch:
  single-threaded vcpu-id:                 0.128 %
  transfer between threads:               98.260 %
  runqueue cache hit:                      1.075 %
  runqueue cache eviction (bit-clear):     0.001 %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0.269 %

On release mm:
  vcpu-id remove (bit-clear):              0.161 %

On migration:
  vcpu-id remove (bit-clear):              0.107 %

* perf bench sched messaging -t (two instances, multi-thread):

On sched-switch:
  single-threaded vcpu-id:                 0.081 %
  transfer between threads:               89.512 %
  runqueue cache hit:                      9.659 %
  runqueue cache eviction (bit-clear):     0.003 %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0.374 %

On release mm:
  vcpu-id remove (bit-clear):              0.243 %

On migration:
  vcpu-id remove (bit-clear):              0.129 %

* perf bench sched pipe (one instance, multi-process):

On sched-switch:
  single-threaded vcpu-id:                99.993 %
  transfer between threads:                0.001 %
  runqueue cache hit:                      0.002 %
  runqueue cache eviction (bit-clear):     0     %
  runqueue cache discard (bit-clear):      0     %
  vcpu-id allocation (bit-set):            0.002 %

On release mm:
  vcpu-id remove (bit-clear):              0     %

On migration:
  vcpu-id remove (bit-clear):              0.002 %

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
Changes since v2:
- Fix a race between migration and release_mm, leading to a mm pointer
  leak in the runqueue vcpu cache: clearing the cache on migration
  should validate that the target mm is not the current mm on the target
  runqueue _and_ that the current task on that runqueue is not exiting
  or half-way through exec.
- Introduce struct vcpu_domain to encapsulate the cpumasks and user
  count. This will make it easier to use this from pid namespaces as
  well in the future.
- Add next_vm_vcpu field to sched_switch tracepoint.
---
 fs/exec.c                    |   3 +
 include/linux/mm.h           |  40 ++++
 include/linux/mm_types.h     |   1 +
 include/linux/sched.h        |   5 +
 include/linux/vcpu.h         |  80 ++++++++
 include/linux/vcpu_types.h   |  28 +++
 include/trace/events/sched.h |   7 +-
 init/Kconfig                 |   4 +
 kernel/fork.c                |  13 +-
 kernel/sched/core.c          |  83 ++++++++
 kernel/sched/deadline.c      |   3 +
 kernel/sched/debug.c         |  13 ++
 kernel/sched/fair.c          |   1 +
 kernel/sched/rt.c            |   2 +
 kernel/sched/sched.h         | 375 +++++++++++++++++++++++++++++++++++
 kernel/sched/stats.c         |  16 +-
 16 files changed, 669 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/vcpu.h
 create mode 100644 include/linux/vcpu_types.h

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..4c562fa13332 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -66,6 +66,7 @@
 #include <linux/io_uring.h>
 #include <linux/syscall_user_dispatch.h>
 #include <linux/coredump.h>
+#include <linux/vcpu.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1006,6 +1007,8 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
+	vcpu_domain_init(mm_vcpu_domain(mm));
+	vcpu_domain_activate(tsk, mm_vcpu_domain(mm));
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..9d44d64cbc7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3374,5 +3374,45 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
 }
 #endif
 
+#ifdef CONFIG_VCPU_DOMAIN
+static inline struct vcpu_domain *mm_vcpu_domain(struct mm_struct *mm)
+{
+	unsigned long vcpu_bitmap = (unsigned long)mm;
+
+	if (!mm)
+		return NULL;
+	vcpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	/* Skip cpu_bitmap */
+	vcpu_bitmap += cpumask_size();
+	return (struct vcpu_domain *)vcpu_bitmap;
+}
+void vcpu_domain_release(struct task_struct *t, struct vcpu_domain *domain);
+void vcpu_domain_activate(struct task_struct *t, struct vcpu_domain *domain);
+void vcpu_domain_get(struct task_struct *t, struct vcpu_domain *domain);
+void vcpu_domain_dup(struct task_struct *t, struct vcpu_domain *domain);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	return t->mm_vcpu;
+}
+#else
+static inline struct vcpu_domain *mm_vcpu_domain(struct mm_struct *mm)
+{
+	return NULL;
+}
+static inline void vcpu_domain_release(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline void vcpu_domain_activate(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline void vcpu_domain_get(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline void vcpu_domain_dup(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the mm vcpu feature is
+	 * disabled. This provides functional per-cpu data structure accesses
+	 * in user-space, although it won't provide the memory usage benefits.
+	 */
+	return raw_smp_processor_id();
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9db36dc5d4cf..ef23400aee51 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/nodemask.h>
 
 #include <asm/mmu.h>
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 838c9e0b4cae..96df9e3fc7af 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1300,6 +1300,11 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_VCPU_DOMAIN
+	int				mm_vcpu;	/* Current vcpu in mm */
+	int				vcpu_domain_active;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
diff --git a/include/linux/vcpu.h b/include/linux/vcpu.h
new file mode 100644
index 000000000000..252fc7573025
--- /dev/null
+++ b/include/linux/vcpu.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VCPU_H
+#define _LINUX_VCPU_H
+
+#include <linux/cpumask.h>
+#include <linux/nodemask.h>
+#include <linux/vcpu_types.h>
+
+#ifdef CONFIG_VCPU_DOMAIN
+static inline unsigned int vcpu_domain_vcpumask_size(void)
+{
+	return cpumask_size();
+}
+static inline cpumask_t *vcpu_domain_vcpumask(struct vcpu_domain *vcpu_domain)
+{
+	return vcpu_domain->vcpumasks;
+}
+
+# ifdef CONFIG_NUMA
+static inline unsigned int vcpu_domain_node_vcpumask_size(void)
+{
+	if (num_possible_nodes() == 1)
+		return 0;
+	return (nr_node_ids + 1) * cpumask_size();
+}
+static inline cpumask_t *vcpu_domain_node_alloc_vcpumask(struct vcpu_domain *vcpu_domain)
+{
+	unsigned long vcpu_bitmap = (unsigned long)vcpu_domain_vcpumask(vcpu_domain);
+
+	/* Skip vcpumask */
+	vcpu_bitmap += cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+static inline cpumask_t *vcpu_domain_node_vcpumask(struct vcpu_domain *vcpu_domain, unsigned int node)
+{
+	unsigned long vcpu_bitmap = (unsigned long)vcpu_domain_node_alloc_vcpumask(vcpu_domain);
+
+	/* Skip node alloc vcpumask */
+	vcpu_bitmap += cpumask_size();
+	vcpu_bitmap += node * cpumask_size();
+	return (struct cpumask *)vcpu_bitmap;
+}
+static inline void vcpu_domain_node_init(struct vcpu_domain *vcpu_domain)
+{
+	unsigned int node;
+
+	if (num_possible_nodes() == 1)
+		return;
+	cpumask_clear(vcpu_domain_node_alloc_vcpumask(vcpu_domain));
+	for (node = 0; node < nr_node_ids; node++)
+		cpumask_clear(vcpu_domain_node_vcpumask(vcpu_domain, node));
+}
+# else /* CONFIG_NUMA */
+static inline unsigned int vcpu_domain_node_vcpumask_size(void)
+{
+	return 0;
+}
+static inline void vcpu_domain_node_init(struct vcpu_domain *vcpu_domain) { }
+# endif /* CONFIG_NUMA */
+
+static inline unsigned int vcpu_domain_size(void)
+{
+	return offsetof(struct vcpu_domain, vcpumasks) + vcpu_domain_vcpumask_size() +
+	       vcpu_domain_node_vcpumask_size();
+}
+static inline void vcpu_domain_init(struct vcpu_domain *vcpu_domain)
+{
+	atomic_set(&vcpu_domain->users, 1);
+	cpumask_clear(vcpu_domain_vcpumask(vcpu_domain));
+	vcpu_domain_node_init(vcpu_domain);
+}
+#else /* CONFIG_VCPU_DOMAIN */
+static inline unsigned int vcpu_domain_size(void)
+{
+	return 0;
+}
+static inline void vcpu_domain_init(struct vcpu_domain *vcpu_domain) { }
+#endif /* CONFIG_VCPU_DOMAIN */
+
+#endif /* _LINUX_VCPU_H */
diff --git a/include/linux/vcpu_types.h b/include/linux/vcpu_types.h
new file mode 100644
index 000000000000..36e520c4e287
--- /dev/null
+++ b/include/linux/vcpu_types.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VCPU_TYPES_H
+#define _LINUX_VCPU_TYPES_H
+
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+
+struct vcpu_domain {
+	/**
+	 * @users: The number of references to &struct vcpu_domain from
+	 * user-space threads.
+	 *
+	 * Initialized to 1 for the first thread with a reference to
+	 * the domain. Incremented for each thread getting a reference to the
+	 * domain, and decremented on domain release from user-space threads.
+	 * Used to enable single-threaded domain vcpu accounting (when == 1).
+	 */
+	atomic_t users;
+	/*
+	 * Layout of vcpumasks:
+	 * - vcpumask (cpumask_size()),
+	 * - node_alloc_vcpumask (cpumask_size(), NUMA=y only),
+	 * - array of nr_node_ids node_vcpumask (each cpumask_size(), NUMA=y only).
+	 */
+	cpumask_t vcpumasks[];
+};
+
+#endif /* _LINUX_VCPU_TYPES_H */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 94640482cfe7..050cb749ca1a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -233,6 +233,7 @@ TRACE_EVENT(sched_switch,
 		__array(	char,	next_comm,	TASK_COMM_LEN	)
 		__field(	pid_t,	next_pid			)
 		__field(	int,	next_prio			)
+		__field(	pid_t,	next_vm_vcpu			)
 	),
 
 	TP_fast_assign(
@@ -243,10 +244,11 @@ TRACE_EVENT(sched_switch,
 		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
+		__entry->next_vm_vcpu	= task_mm_vcpu_id(next);
 		/* XXX SCHED_DEADLINE */
 	),
 
-	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+	TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d next_vm_vcpu=%d",
 		__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
 
 		(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
@@ -262,7 +264,8 @@ TRACE_EVENT(sched_switch,
 		  "R",
 
 		__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
-		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+		__entry->next_comm, __entry->next_pid, __entry->next_prio,
+		__entry->next_vm_vcpu)
 );
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..8d504675d3a3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1023,6 +1023,10 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config VCPU_DOMAIN
+	def_bool y
+	depends on SMP && RSEQ
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..97bc7dcadafe 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
 #include <linux/scs.h>
 #include <linux/io_uring.h>
 #include <linux/bpf.h>
+#include <linux/vcpu.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -970,6 +971,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+
+#ifdef CONFIG_VCPU_DOMAIN
+	tsk->mm_vcpu = 0;
+	tsk->vcpu_domain_active = 0;
+#endif
 	return tsk;
 
 free_stack:
@@ -1079,6 +1085,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 		goto fail_nocontext;
 
 	mm->user_ns = get_user_ns(user_ns);
+	vcpu_domain_init(mm_vcpu_domain(mm));
 	return mm;
 
 fail_nocontext:
@@ -1380,6 +1387,8 @@ static int wait_for_vfork_done(struct task_struct *child,
  */
 static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 {
+	vcpu_domain_release(tsk, mm_vcpu_domain(mm));
+
 	uprobe_free_utask(tsk);
 
 	/* Get rid of any cached register state */
@@ -1499,10 +1508,12 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
+		vcpu_domain_get(tsk, mm_vcpu_domain(mm));
 	} else {
 		mm = dup_mm(tsk, current->mm);
 		if (!mm)
 			return -ENOMEM;
+		vcpu_domain_dup(tsk, mm_vcpu_domain(mm));
 	}
 
 	tsk->mm = mm;
@@ -2901,7 +2912,7 @@ void __init proc_caches_init(void)
 	 * dynamically sized based on the maximum CPU number this system
 	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
-	mm_size = sizeof(struct mm_struct) + cpumask_size();
+	mm_size = sizeof(struct mm_struct) + cpumask_size() + vcpu_domain_size();
 
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
 			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e08b02e0cd5..3894822d548a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -16,6 +16,7 @@
 #include <linux/blkdev.h>
 #include <linux/kcov.h>
 #include <linux/scs.h>
+#include <linux/vcpu.h>
 
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2267,6 +2268,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
 	lockdep_assert_rq_held(rq);
 
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+	rq_vcpu_domain_migrate_locked(rq, p);
 	set_task_cpu(p, new_cpu);
 	rq_unlock(rq, rf);
 
@@ -2454,6 +2456,7 @@ int push_cpu_stop(void *arg)
 	// XXX validate p is still the highest prio task
 	if (task_rq(p) == rq) {
 		deactivate_task(rq, p, 0);
+		rq_vcpu_domain_migrate_locked(rq, p);
 		set_task_cpu(p, lowest_rq->cpu);
 		activate_task(lowest_rq, p, 0);
 		resched_curr(lowest_rq);
@@ -3093,6 +3096,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
 		rq_pin_lock(dst_rq, &drf);
 
 		deactivate_task(src_rq, p, 0);
+		rq_vcpu_domain_migrate_locked(src_rq, p);
 		set_task_cpu(p, cpu);
 		activate_task(dst_rq, p, 0);
 		check_preempt_curr(dst_rq, p, 0);
@@ -3716,6 +3720,8 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	WRITE_ONCE(rq->ttwu_pending, 1);
+	if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+		rq_vcpu_domain_migrate_locked(task_rq(p), p);
 	__smp_call_single_queue(cpu, &p->wake_entry.llist);
 }
 
@@ -4125,6 +4131,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 		wake_flags |= WF_MIGRATED;
 		psi_ttwu_dequeue(p);
+		rq_vcpu_domain_migrate(task_rq(p), p);
 		set_task_cpu(p, cpu);
 	}
 #else
@@ -4796,6 +4803,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
+	switch_mm_vcpu(rq, prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5922,6 +5930,7 @@ static bool try_steal_cookie(int this, int that)
 			goto next;
 
 		deactivate_task(src, p, 0);
+		rq_vcpu_domain_migrate_locked(src, p);
 		set_task_cpu(p, this);
 		activate_task(dst, p, 0);
 
@@ -10927,3 +10936,77 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
 {
         trace_sched_update_nr_running_tp(rq, count);
 }
+
+#ifdef CONFIG_VCPU_DOMAIN
+void vcpu_domain_release(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+	struct rq_flags rf;
+	struct rq *rq;
+
+	if (!vcpu_domain)
+		return;
+	WARN_ON_ONCE(t != current);
+	preempt_disable();
+	rq = this_rq();
+	rq_lock_irqsave(rq, &rf);
+	t->vcpu_domain_active = 0;
+	atomic_dec(&vcpu_domain->users);
+	rq_vcpu_cache_remove_vcpu_domain_locked(rq, vcpu_domain, true);
+	rq_unlock_irqrestore(rq, &rf);
+	t->mm_vcpu = -1;
+	preempt_enable();
+}
+
+void vcpu_domain_activate(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+	WARN_ON_ONCE(t != current);
+	preempt_disable();
+	t->vcpu_domain_active = 1;
+	/* No need to reserve in cpumask because single-threaded. */
+	t->mm_vcpu = vcpu_domain_vcpu_first_node_vcpu(numa_node_id());
+	preempt_enable();
+}
+
+void vcpu_domain_get(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+	int vcpu, vcpu_users;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	preempt_disable();
+	rq = this_rq();
+	t->vcpu_domain_active = 1;
+	vcpu_users = atomic_read(&vcpu_domain->users);
+	atomic_inc(&vcpu_domain->users);
+	t->mm_vcpu = -1;
+	vcpu = current->mm_vcpu;
+	rq_lock_irqsave(rq, &rf);
+	/* On transition from 1 to 2 vcpu domain users, reserve vcpu ids. */
+	if (vcpu_users == 1) {
+		vcpu_domain_vcpu_reserve_nodes(vcpu_domain);
+		rq_vcpu_cache_remove_vcpu_domain_locked(rq, vcpu_domain, true);
+		current->mm_vcpu = __vcpu_domain_vcpu_get(rq, vcpu_domain);
+		rq_vcpu_cache_add(rq, vcpu_domain, current->mm_vcpu);
+		/*
+		 * __vcpu_domain_vcpu_get could get a different vcpu after
+		 * going multi-threaded, then back single-threaded, then
+		 * multi-threaded on a NUMA configuration using the first CPU
+		 * matching the NUMA node as single-threaded vcpu, with
+		 * leftover vcpu_id matching the NUMA node set from when this
+		 * task was multithreaded.
+		 */
+		if (current->mm_vcpu != vcpu)
+			rseq_set_notify_resume(current);
+	}
+	rq_unlock_irqrestore(rq, &rf);
+	preempt_enable();
+}
+
+void vcpu_domain_dup(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+	preempt_disable();
+	t->vcpu_domain_active = 1;
+	t->mm_vcpu = -1;
+	preempt_enable();
+}
+#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d2c072b0ef01..056933d83469 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -655,6 +655,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
 	raw_spin_unlock(&dl_b->lock);
 
+	rq_vcpu_domain_migrate_locked(rq, p);
 	set_task_cpu(p, later_rq->cpu);
 	double_unlock_balance(later_rq, rq);
 
@@ -2290,6 +2291,7 @@ static int push_dl_task(struct rq *rq)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	rq_vcpu_domain_migrate_locked(rq, next_task);
 	set_task_cpu(next_task, later_rq->cpu);
 
 	/*
@@ -2386,6 +2388,7 @@ static void pull_dl_task(struct rq *this_rq)
 				push_task = get_push_task(src_rq);
 			} else {
 				deactivate_task(src_rq, p, 0);
+				rq_vcpu_domain_migrate_locked(src_rq, p);
 				set_task_cpu(p, this_cpu);
 				activate_task(this_rq, p, 0);
 				dmin = p->dl.deadline;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 102d6f70e84d..0b30946de2a4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -763,6 +763,19 @@ do {									\
 		P(sched_goidle);
 		P(ttwu_count);
 		P(ttwu_local);
+#undef P
+#define P(n) SEQ_printf(m, "  .%-30s: %Ld\n", #n, schedstat_val(rq->n));
+		P(nr_vcpu_thread_transfer);
+		P(nr_vcpu_cache_hit);
+		P(nr_vcpu_cache_evict);
+		P(nr_vcpu_cache_discard_wrong_node);
+		P(nr_vcpu_allocate);
+		P(nr_vcpu_allocate_node_reuse);
+		P(nr_vcpu_allocate_node_new);
+		P(nr_vcpu_allocate_node_rebalance);
+		P(nr_vcpu_allocate_node_steal);
+		P(nr_vcpu_remove_release);
+		P(nr_vcpu_remove_migrate);
 	}
 #undef P
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcbd3110c687..c43d97c79237 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7817,6 +7817,7 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 	lockdep_assert_rq_held(env->src_rq);
 
 	deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
+	rq_vcpu_domain_migrate_locked(env->src_rq, p);
 	set_task_cpu(p, env->dst_cpu);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7b4f4fbbb404..4922a87d5527 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,6 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	}
 
 	deactivate_task(rq, next_task, 0);
+	rq_vcpu_domain_migrate_locked(rq, next_task);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 	resched_curr(lowest_rq);
@@ -2379,6 +2380,7 @@ static void pull_rt_task(struct rq *this_rq)
 				push_task = get_push_task(src_rq);
 			} else {
 				deactivate_task(src_rq, p, 0);
+				rq_vcpu_domain_migrate_locked(src_rq, p);
 				set_task_cpu(p, this_cpu);
 				activate_task(this_rq, p, 0);
 				resched = true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3da5718cd641..abdd48d1ffe6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -66,6 +66,7 @@
 #include <linux/syscalls.h>
 #include <linux/task_work.h>
 #include <linux/tsacct_kern.h>
+#include <linux/vcpu.h>
 
 #include <asm/tlb.h>
 
@@ -916,6 +917,19 @@ struct uclamp_rq {
 DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
 #endif /* CONFIG_UCLAMP_TASK */
 
+#ifdef CONFIG_VCPU_DOMAIN
+# define RQ_VCPU_CACHE_SIZE	8
+struct rq_vcpu_entry {
+	struct vcpu_domain *vcpu_domain;	/* NULL if unset */
+	int vcpu_id;
+};
+
+struct rq_vcpu_cache {
+	struct rq_vcpu_entry entry[RQ_VCPU_CACHE_SIZE];
+	unsigned int head;
+};
+#endif
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -1086,6 +1100,19 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+
+	unsigned long long	nr_vcpu_single_thread;
+	unsigned long long	nr_vcpu_thread_transfer;
+	unsigned long long	nr_vcpu_cache_hit;
+	unsigned long long	nr_vcpu_cache_evict;
+	unsigned long long	nr_vcpu_cache_discard_wrong_node;
+	unsigned long long	nr_vcpu_allocate;
+	unsigned long long	nr_vcpu_allocate_node_reuse;
+	unsigned long long	nr_vcpu_allocate_node_new;
+	unsigned long long	nr_vcpu_allocate_node_rebalance;
+	unsigned long long	nr_vcpu_allocate_node_steal;
+	unsigned long long	nr_vcpu_remove_release;
+	unsigned long long	nr_vcpu_remove_migrate;
 #endif
 
 #ifdef CONFIG_CPU_IDLE
@@ -1116,6 +1143,10 @@ struct rq {
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
 #endif
+
+#ifdef CONFIG_VCPU_DOMAIN
+	struct rq_vcpu_cache	vcpu_cache;
+#endif
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3137,3 +3168,347 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
+#ifdef CONFIG_VCPU_DOMAIN
+static inline int __vcpu_domain_vcpu_get_single_node(struct vcpu_domain *vcpu_domain)
+{
+	struct cpumask *cpumask;
+	int vcpu;
+
+	cpumask = vcpu_domain_vcpumask(vcpu_domain);
+	/* Atomically reserve lowest available vcpu number. */
+	do {
+		vcpu = cpumask_first_zero(cpumask);
+		if (vcpu >= nr_cpu_ids)
+			return -1;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	return vcpu;
+}
+
+#ifdef CONFIG_NUMA
+static inline bool vcpu_domain_node_vcpumask_test_cpu(struct vcpu_domain *vcpu_domain, int vcpu_id)
+{
+	if (num_possible_nodes() == 1)
+		return true;
+	return cpumask_test_cpu(vcpu_id, vcpu_domain_node_vcpumask(vcpu_domain, numa_node_id()));
+}
+
+static inline int __vcpu_domain_vcpu_get(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+	struct cpumask *cpumask = vcpu_domain_vcpumask(vcpu_domain),
+		       *node_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, numa_node_id()),
+		       *node_alloc_cpumask = vcpu_domain_node_alloc_vcpumask(vcpu_domain);
+	unsigned int node;
+	int vcpu;
+
+	if (num_possible_nodes() == 1)
+		return __vcpu_domain_vcpu_get_single_node(vcpu_domain);
+
+	/*
+	 * Try to atomically reserve lowest available vcpu number within those
+	 * already reserved for this NUMA node.
+	 */
+	do {
+		vcpu = cpumask_first_one_and_zero(node_cpumask, cpumask);
+		if (vcpu >= nr_cpu_ids)
+			goto alloc_numa;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	schedstat_inc(rq->nr_vcpu_allocate_node_reuse);
+	goto end;
+
+alloc_numa:
+	/*
+	 * Try to atomically reserve lowest available vcpu number within those
+	 * not already allocated for numa nodes.
+	 */
+	do {
+		vcpu = cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask);
+		if (vcpu >= nr_cpu_ids)
+			goto numa_update;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	cpumask_set_cpu(vcpu, node_cpumask);
+	cpumask_set_cpu(vcpu, node_alloc_cpumask);
+	schedstat_inc(rq->nr_vcpu_allocate_node_new);
+	goto end;
+
+numa_update:
+	/*
+	 * NUMA node id configuration changed for at least one CPU in the system.
+	 * We need to steal a currently unused vcpu_id from an overprovisioned
+	 * node for our current node. Userspace must handle the fact that the
+	 * node id associated with this vcpu_id may change due to node ID
+	 * reconfiguration.
+	 *
+	 * Count how many possible cpus are attached to each (other) node id,
+	 * Count how many possible cpus are attached to each (other) node id,
+	 * and compare this with the per-domain node vcpumask weight. Steal
+	 * from a node whose mask holds more vcpu ids than it has cpus.
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, node);
+		if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+			/* Try to steal from this node. */
+			do {
+				vcpu = cpumask_first_one_and_zero(iter_cpumask, cpumask);
+				if (vcpu >= nr_cpu_ids)
+					goto steal_fail;
+			} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+			cpumask_clear_cpu(vcpu, iter_cpumask);
+			cpumask_set_cpu(vcpu, node_cpumask);
+			schedstat_inc(rq->nr_vcpu_allocate_node_rebalance);
+			goto end;
+		}
+	}
+
+steal_fail:
+	/*
+	 * Our attempt at gracefully stealing a vcpu_id from another
+	 * overprovisioned NUMA node failed. Fall back to grabbing the first
+	 * available vcpu_id.
+	 */
+	do {
+		vcpu = cpumask_first_zero(cpumask);
+		if (vcpu >= nr_cpu_ids)
+			return -1;
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	/* Steal vcpu from its numa node mask. */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, node);
+		if (cpumask_test_cpu(vcpu, iter_cpumask)) {
+			cpumask_clear_cpu(vcpu, iter_cpumask);
+			break;
+		}
+	}
+	cpumask_set_cpu(vcpu, node_cpumask);
+	schedstat_inc(rq->nr_vcpu_allocate_node_steal);
+end:
+	return vcpu;
+}
+
+static inline int vcpu_domain_vcpu_first_node_vcpu(int node)
+{
+	int vcpu;
+
+	if (likely(nr_cpu_ids >= nr_node_ids))
+		return node;
+	vcpu = cpumask_first(cpumask_of_node(node));
+	if (vcpu >= nr_cpu_ids)
+		return -1;
+	return vcpu;
+}
+
+/*
+ * Single-threaded processes observe a node_id->vcpu_id mapping in which the
+ * vcpu_id corresponds to vcpu_domain_vcpu_first_node_vcpu(). When going from
+ * single- to multi-threaded, reserve this same mapping so it stays
+ * invariant.
+ */
+static inline void vcpu_domain_vcpu_reserve_nodes(struct vcpu_domain *vcpu_domain)
+{
+	struct cpumask *node_alloc_cpumask = vcpu_domain_node_alloc_vcpumask(vcpu_domain);
+	int node, other_node;
+
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, node);
+		int vcpu = vcpu_domain_vcpu_first_node_vcpu(node);
+
+		/* Skip nodes that have no CPU associated with them. */
+		if (vcpu < 0)
+			continue;
+		cpumask_set_cpu(vcpu, iter_cpumask);
+		cpumask_set_cpu(vcpu, node_alloc_cpumask);
+		for (other_node = 0; other_node < nr_node_ids; other_node++) {
+			if (other_node == node)
+				continue;
+			cpumask_clear_cpu(vcpu, vcpu_domain_node_vcpumask(vcpu_domain, other_node));
+		}
+	}
+}
+#else
+static inline bool vcpu_domain_node_vcpumask_test_cpu(struct vcpu_domain *vcpu_domain,
+						      int vcpu_id)
+{
+	return true;
+}
+static inline int __vcpu_domain_vcpu_get(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+	return __vcpu_domain_vcpu_get_single_node(vcpu_domain);
+}
+static inline int vcpu_domain_vcpu_first_node_vcpu(int node)
+{
+	return 0;
+}
+static inline void vcpu_domain_vcpu_reserve_nodes(struct vcpu_domain *vcpu_domain) { }
+#endif
+
+static inline void __vcpu_domain_vcpu_put(struct vcpu_domain *vcpu_domain, int vcpu)
+{
+	if (vcpu < 0)
+		return;
+	cpumask_clear_cpu(vcpu, vcpu_domain_vcpumask(vcpu_domain));
+}
+
+static inline struct rq_vcpu_entry *rq_vcpu_cache_lookup(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+	struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+	int i;
+
+	for (i = 0; i < RQ_VCPU_CACHE_SIZE; i++) {
+		struct rq_vcpu_entry *entry = &vcpu_cache->entry[i];
+
+		if (entry->vcpu_domain == vcpu_domain)
+			return entry;
+	}
+	return NULL;
+}
+
+/* Removal from cache simply leaves an unused hole. */
+static inline int rq_vcpu_cache_lookup_remove(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+	struct rq_vcpu_entry *entry = rq_vcpu_cache_lookup(rq, vcpu_domain);
+
+	if (!entry)
+		return -1;
+	entry->vcpu_domain = NULL;	/* Remove from cache */
+	return entry->vcpu_id;
+}
+
+static inline void rq_vcpu_cache_remove_vcpu_domain_locked(struct rq *rq, struct vcpu_domain *vcpu_domain,
+							   bool release)
+{
+	int vcpu;
+
+	if (!vcpu_domain)
+		return;
+	/*
+	 * Do not remove the cache entry for a runqueue that runs a task which
+	 * currently uses the target mm.
+	 */
+	if (!release && rq->curr->vcpu_domain_active && mm_vcpu_domain(rq->curr->mm) == vcpu_domain)
+		return;
+	vcpu = rq_vcpu_cache_lookup_remove(rq, vcpu_domain);
+	if (vcpu < 0)
+		return;
+	if (release)
+		schedstat_inc(rq->nr_vcpu_remove_release);
+	else
+		schedstat_inc(rq->nr_vcpu_remove_migrate);
+	__vcpu_domain_vcpu_put(vcpu_domain, vcpu);
+}
+
+static inline void rq_vcpu_cache_remove_vcpu_domain(struct rq *rq, struct vcpu_domain *vcpu_domain,
+						    bool release)
+{
+	struct rq_flags rf;
+
+	rq_lock_irqsave(rq, &rf);
+	rq_vcpu_cache_remove_vcpu_domain_locked(rq, vcpu_domain, release);
+	rq_unlock_irqrestore(rq, &rf);
+}
+
+static inline void rq_vcpu_domain_migrate_locked(struct rq *rq, struct task_struct *t)
+{
+	rq_vcpu_cache_remove_vcpu_domain_locked(rq, mm_vcpu_domain(t->mm), false);
+}
+
+static inline void rq_vcpu_domain_migrate(struct rq *rq, struct task_struct *t)
+{
+	rq_vcpu_cache_remove_vcpu_domain(rq, mm_vcpu_domain(t->mm), false);
+}
+
+/*
+ * Add at head, move head forward. Cheap LRU cache.
+ * The vcpu mask bit only needs to be cleared from its own domain when an old
+ * entry is overwritten in the cache; this is not needed if the overwritten
+ * entry is an unused hole. Because old_vcpu_domain is accessed from an
+ * unrelated thread, the cache entry for a given vcpu_domain must be pruned
+ * from the cache when a task is dequeued from the runqueue.
+ */
+static inline void rq_vcpu_cache_add(struct rq *rq, struct vcpu_domain *vcpu_domain,
+				     int vcpu_id)
+{
+	struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+	struct vcpu_domain *old_vcpu_domain;
+	struct rq_vcpu_entry *entry;
+	unsigned int pos;
+
+	pos = vcpu_cache->head;
+	entry = &vcpu_cache->entry[pos];
+	old_vcpu_domain = entry->vcpu_domain;
+	if (old_vcpu_domain) {
+		schedstat_inc(rq->nr_vcpu_cache_evict);
+		__vcpu_domain_vcpu_put(old_vcpu_domain, entry->vcpu_id);
+	}
+	entry->vcpu_domain = vcpu_domain;
+	entry->vcpu_id = vcpu_id;
+	vcpu_cache->head = (pos + 1) % RQ_VCPU_CACHE_SIZE;
+}
+
+static inline int vcpu_domain_vcpu_get(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+	struct rq_vcpu_entry *entry;
+	int vcpu;
+
+	/* Skip allocation if the vcpu domain has a single user (single-threaded). */
+	if (atomic_read(&vcpu_domain->users) == 1) {
+		schedstat_inc(rq->nr_vcpu_single_thread);
+		vcpu = vcpu_domain_vcpu_first_node_vcpu(numa_node_id());
+		goto end;
+	}
+	entry = rq_vcpu_cache_lookup(rq, vcpu_domain);
+	if (likely(entry)) {
+		vcpu = entry->vcpu_id;
+		if (likely(vcpu_domain_node_vcpumask_test_cpu(vcpu_domain, vcpu))) {
+			schedstat_inc(rq->nr_vcpu_cache_hit);
+			goto end;
+		} else {
+			schedstat_inc(rq->nr_vcpu_cache_discard_wrong_node);
+			entry->vcpu_domain = NULL;	/* Remove from cache */
+			__vcpu_domain_vcpu_put(vcpu_domain, vcpu);
+		}
+	}
+	schedstat_inc(rq->nr_vcpu_allocate);
+	vcpu = __vcpu_domain_vcpu_get(rq, vcpu_domain);
+	rq_vcpu_cache_add(rq, vcpu_domain, vcpu);
+end:
+	return vcpu;
+}
+
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+				  struct task_struct *next)
+{
+	if (!(next->flags & PF_KTHREAD) && next->vcpu_domain_active && next->mm) {
+		if (!(prev->flags & PF_KTHREAD) && prev->vcpu_domain_active &&
+		    prev->mm == next->mm &&
+		    vcpu_domain_node_vcpumask_test_cpu(mm_vcpu_domain(next->mm), prev->mm_vcpu)) {
+			/*
+			 * Switching between threads with the same mm. Simply pass the
+			 * vcpu token along to the next thread.
+			 */
+			schedstat_inc(rq->nr_vcpu_thread_transfer);
+			next->mm_vcpu = prev->mm_vcpu;
+		} else {
+			next->mm_vcpu = vcpu_domain_vcpu_get(rq, mm_vcpu_domain(next->mm));
+		}
+	}
+	if (!(prev->flags & PF_KTHREAD) && prev->vcpu_domain_active && prev->mm)
+		prev->mm_vcpu = -1;
+}
+
+#else
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+				  struct task_struct *next) { }
+static inline void rq_vcpu_cache_remove_vcpu_domain_locked(struct rq *rq,
+							   struct vcpu_domain *domain,
+							   bool release) { }
+static inline void rq_vcpu_cache_remove_vcpu_domain(struct rq *rq, struct vcpu_domain *domain,
+						    bool release) { }
+static inline void rq_vcpu_domain_migrate_locked(struct rq *rq, struct task_struct *t) { }
+static inline void rq_vcpu_domain_migrate(struct rq *rq, struct task_struct *t) { }
+#endif
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..a0bf4eaae296 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -134,12 +134,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->nr_vcpu_single_thread,
+		    rq->nr_vcpu_thread_transfer,
+		    rq->nr_vcpu_cache_hit,
+		    rq->nr_vcpu_cache_evict,
+		    rq->nr_vcpu_cache_discard_wrong_node,
+		    rq->nr_vcpu_allocate,
+		    rq->nr_vcpu_allocate_node_reuse,
+		    rq->nr_vcpu_allocate_node_new,
+		    rq->nr_vcpu_allocate_node_rebalance,
+		    rq->nr_vcpu_allocate_node_steal,
+		    rq->nr_vcpu_remove_release,
+		    rq->nr_vcpu_remove_migrate);
 
 		seq_printf(seq, "\n");
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-18 21:06 ` [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id Mathieu Desnoyers
  2022-02-21 17:38   ` [RFC PATCH v3 " Mathieu Desnoyers
@ 2022-02-25 17:35   ` Jonathan Corbet
  2022-02-25 17:56     ` Mathieu Desnoyers
  1 sibling, 1 reply; 19+ messages in thread
From: Jonathan Corbet @ 2022-02-25 17:35 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> This feature allows the scheduler to expose a current virtual cpu id
> to user-space. This virtual cpu id is within the possible cpus range,
> and is temporarily (and uniquely) assigned while threads are actively
> running within a memory space. If a memory space has fewer threads than
> cores, or is limited to run on few cores concurrently through sched
> affinity or cgroup cpusets, the virtual cpu ids will be values close
> to 0, thus allowing efficient use of user-space memory for per-cpu
> data structures.

So I have one possibly (probably) dumb question: if I'm writing a
program to make use of virtual CPU IDs, how do I know what the maximum
ID will be?  It seems like one of the advantages of this mechanism would
be not having to be prepared for anything in the physical ID space, but
is there any guarantee that the virtual-ID space will be smaller?
Something like "no larger than the number of threads", say?

Thanks,

jon

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-25 17:35   ` [RFC PATCH v2 " Jonathan Corbet
@ 2022-02-25 17:56     ` Mathieu Desnoyers
  2022-02-25 18:15       ` Jonathan Corbet
  2022-02-25 21:21       ` Mathieu Desnoyers
  0 siblings, 2 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-25 17:56 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Peter Oskolkov

----- On Feb 25, 2022, at 12:35 PM, Jonathan Corbet corbet@lwn.net wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> 
>> This feature allows the scheduler to expose a current virtual cpu id
>> to user-space. This virtual cpu id is within the possible cpus range,
>> and is temporarily (and uniquely) assigned while threads are actively
>> running within a memory space. If a memory space has fewer threads than
>> cores, or is limited to run on few cores concurrently through sched
>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>> to 0, thus allowing efficient use of user-space memory for per-cpu
>> data structures.
> 
> So I have one possibly (probably) dumb question: if I'm writing a
> program to make use of virtual CPU IDs, how do I know what the maximum
> ID will be?  It seems like one of the advantages of this mechanism would
> be not having to be prepared for anything in the physical ID space, but
> is there any guarantee that the virtual-ID space will be smaller?
> Something like "no larger than the number of threads", say?

Hi Jonathan,

This is a very relevant question. Let me quote my answer to Florian from
the last round of review of this series:

Some effective upper bounds for the number of vcpu ids observable in a process:

- sysconf(3) _SC_NPROCESSORS_CONF,
- the number of threads which exist concurrently in the process,
- the number of cpus in the cpu affinity mask applied by sched_setaffinity,
  except in corner-case situations such as cpu hotplug removing all cpus from
  the affinity set,
- cgroup cpuset "partition" limits,

Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
additional cores from the rest of the system if they are idle, therefore
allowing the number of concurrent threads to go beyond the specified limit.

AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
Those are two mechanisms both affecting the scheduler task placement.

I would expect the user-space code to use some sensible upper bound as a
hint about how many per-vcpu data structure elements to expect (and how many
to pre-allocate), but have a "lazy initialization" fall-back in case the
vcpu id goes up to the number of configured processors - 1. And I suspect
that even the number of configured processors may change with CRIU.
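
For illustration, here is a minimal sketch of that approach (assuming a
hypothetical rseq_current_vm_vcpu_id() helper reading the vm_vcpu_id field
introduced by this series; a real fast path would use an rseq critical
section rather than an atomic add):

#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed accessor for the struct rseq vm_vcpu_id field; not an existing API. */
extern int rseq_current_vm_vcpu_id(void);

struct percpu_counter {
	_Atomic long *slots;	/* one slot per possible vcpu id */
	long nr_slots;
};

static int percpu_counter_init(struct percpu_counter *c)
{
	/* Upper bound hint: number of configured processors. */
	c->nr_slots = sysconf(_SC_NPROCESSORS_CONF);
	if (c->nr_slots < 1)
		return -1;
	/*
	 * Pre-allocate zeroed slots up to the hint; heavier per-vcpu state
	 * could instead be lazily initialized on first use of each slot.
	 */
	c->slots = calloc(c->nr_slots, sizeof(*c->slots));
	return c->slots ? 0 : -1;
}

static void percpu_counter_inc(struct percpu_counter *c)
{
	int vcpu = rseq_current_vm_vcpu_id();

	/* Fall back to slot 0 if the id is unavailable or out of range. */
	if (vcpu < 0 || vcpu >= c->nr_slots)
		vcpu = 0;
	atomic_fetch_add_explicit(&c->slots[vcpu], 1, memory_order_relaxed);
}

Because vcpu ids are dense and temporarily unique among running threads, the
common case touches only the low-numbered slots even on large machines.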

If the above explanation makes sense (please let me know if I am wrong
or missed something), I suspect I should add it to the commit message.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-25 17:56     ` Mathieu Desnoyers
@ 2022-02-25 18:15       ` Jonathan Corbet
  2022-02-25 18:39         ` Mathieu Desnoyers
  2022-02-25 21:21       ` Mathieu Desnoyers
  1 sibling, 1 reply; 19+ messages in thread
From: Jonathan Corbet @ 2022-02-25 18:15 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Peter Oskolkov

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> Some effective upper bounds for the number of vcpu ids observable in a process:
>
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,
> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
>   except in corner-case situations such as cpu hotplug removing all cpus from
>   the affinity set,
> - cgroup cpuset "partition" limits,
>
> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
> additional cores from the rest of the system if they are idle, therefore
> allowing the number of concurrent threads to go beyond the specified limit.
>
> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
> Those are two mechanisms both affecting the scheduler task placement.
>
> I would expect the user-space code to use some sensible upper bound as a
> hint about how many per-vcpu data structure elements to expect (and how many
> to pre-allocate), but have a "lazy initialization" fall-back in case the
> vcpu id goes up to the number of configured processors - 1. And I suspect
> that even the number of configured processors may change with CRIU.
>
> If the above explanation makes sense (please let me know if I am wrong
> or missed something), I suspect I should add it to the commit message.

That helps, thanks.  I do think that something like this belongs in the
changelog - or, even better, in the upcoming restartable-sequences
section in the userspace-api documentation :)

Thanks,

jon

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-25 18:15       ` Jonathan Corbet
@ 2022-02-25 18:39         ` Mathieu Desnoyers
  2022-02-25 19:24           ` Jonathan Corbet
  0 siblings, 1 reply; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-25 18:39 UTC (permalink / raw)
  To: Jonathan Corbet, linux-man
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Peter Oskolkov

----- On Feb 25, 2022, at 1:15 PM, Jonathan Corbet corbet@lwn.net wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> 
>> Some effective upper bounds for the number of vcpu ids observable in a process:
>>
>> - sysconf(3) _SC_NPROCESSORS_CONF,
>> - the number of threads which exist concurrently in the process,
>> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
>>   except in corner-case situations such as cpu hotplug removing all cpus from
>>   the affinity set,
>> - cgroup cpuset "partition" limits,
>>
>> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
>> additional cores from the rest of the system if they are idle, therefore
>> allowing the number of concurrent threads to go beyond the specified limit.
>>
>> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
>> Those are two mechanisms both affecting the scheduler task placement.
>>
>> I would expect the user-space code to use some sensible upper bound as a
>> hint about how many per-vcpu data structure elements to expect (and how many
>> to pre-allocate), but have a "lazy initialization" fall-back in case the
>> vcpu id goes up to the number of configured processors - 1. And I suspect
>> that even the number of configured processors may change with CRIU.
>>
>> If the above explanation makes sense (please let me know if I am wrong
>> or missed something), I suspect I should add it to the commit message.
> 
> That helps, thanks.  I do think that something like this belongs in the
> changelog - or, even better, in the upcoming restartable-sequences
> section in the userspace-api documentation :)

Just to confirm, when you say "userspace-api documentation" do you refer to
man pages?

I made a few attempts at upstreaming an rseq.2 man page in 2020, but I have
been stuck waiting for feedback from Michael Kerrisk since then.

So for the moment I'm maintaining a rseq.2 man page here:

https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2

I'd gladly accept some help to improve the documentation of rseq.

Thanks,

Mathieu

> 
> Thanks,
> 
> jon

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-25 18:39         ` Mathieu Desnoyers
@ 2022-02-25 19:24           ` Jonathan Corbet
  0 siblings, 0 replies; 19+ messages in thread
From: Jonathan Corbet @ 2022-02-25 19:24 UTC (permalink / raw)
  To: Mathieu Desnoyers, linux-man
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Peter Oskolkov

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:

> ----- On Feb 25, 2022, at 1:15 PM, Jonathan Corbet corbet@lwn.net wrote:
>> That helps, thanks.  I do think that something like this belongs in the
>> changelog - or, even better, in the upcoming restartable-sequences
>> section in the userspace-api documentation :)
>
> Just to confirm, when you say "userspace-api documentation" do you refer to
> man pages?

No, I meant Documentation/userspace-api/.  But yes, even having the man
page in place would be a good step in the right direction.

Thanks,

jon

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  2022-02-25 17:56     ` Mathieu Desnoyers
  2022-02-25 18:15       ` Jonathan Corbet
@ 2022-02-25 21:21       ` Mathieu Desnoyers
  1 sibling, 0 replies; 19+ messages in thread
From: Mathieu Desnoyers @ 2022-02-25 21:21 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Peter Oskolkov

----- On Feb 25, 2022, at 12:56 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Feb 25, 2022, at 12:35 PM, Jonathan Corbet corbet@lwn.net wrote:
> 
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> 
>>> This feature allows the scheduler to expose a current virtual cpu id
>>> to user-space. This virtual cpu id is within the possible cpus range,
>>> and is temporarily (and uniquely) assigned while threads are actively
>>> running within a memory space. If a memory space has fewer threads than
>>> cores, or is limited to run on few cores concurrently through sched
>>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>>> to 0, thus allowing efficient use of user-space memory for per-cpu
>>> data structures.
>> 
>> So I have one possibly (probably) dumb question: if I'm writing a
>> program to make use of virtual CPU IDs, how do I know what the maximum
>> ID will be?  It seems like one of the advantages of this mechanism would
>> be not having to be prepared for anything in the physical ID space, but
>> is there any guarantee that the virtual-ID space will be smaller?
>> Something like "no larger than the number of threads", say?
> 
> Hi Jonathan,
> 
> This is a very relevant question. Let me quote what I answered to Florian
> on the last round of review for this series:
> 
> Some effective upper bounds for the number of vcpu ids observable in a process:
> 
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,

One small detail I forgot to mention: on a NUMA system, a single-threaded
process will typically observe vcpu_id == numa_node_id. So it can jump around
between vcpu_id values depending on which NUMA node it runs on at the moment.

So the vcpu_id is not strictly bound by the number of concurrently running
threads.
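
A toy illustration of this (same hypothetical rseq_current_vm_vcpu_id()
accessor as above; assumes a machine where cpu 0 sits on node 0 and cpu 1
on node 1):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

extern int rseq_current_vm_vcpu_id(void);	/* assumed accessor */

static int vcpu_id_on_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* Pin the (only) thread to the given cpu, then sample the id. */
	if (sched_setaffinity(0, sizeof(set), &set))
		return -1;
	return rseq_current_vm_vcpu_id();
}

int main(void)
{
	/* Typically prints 0 then 1, with a single thread throughout. */
	printf("vcpu id on cpu0: %d\n", vcpu_id_on_cpu(0));
	printf("vcpu id on cpu1: %d\n", vcpu_id_on_cpu(1));
	return 0;
}

Nothing here is load-bearing; it only shows why sizing per-vcpu arrays by
thread count alone is insufficient on NUMA systems.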

Thanks,

Mathieu

> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
>  except in corner-case situations such as cpu hotplug removing all cpus from
>  the affinity set,
> - cgroup cpuset "partition" limits,
> 
> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
> additional cores from the rest of the system if they are idle, therefore
> allowing the number of concurrent threads to go beyond the specified limit.
> 
> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
> Those are two mechanisms both affecting the scheduler task placement.
> 
> I would expect the user-space code to use some sensible upper bound as a
> hint about how many per-vcpu data structure elements to expect (and how many
> to pre-allocate), but have a "lazy initialization" fall-back in case the
> vcpu id goes up to the number of configured processors - 1. And I suspect
> that even the number of configured processors may change with CRIU.
> 
> If the above explanation makes sense (please let me know if I am wrong
> or missed something), I suspect I should add it to the commit message.
> 
> Thanks,
> 
> Mathieu
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 19+ messages in thread
