linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 1/3] Introduce per thread group current virtual cpu id
@ 2022-02-01 19:25 Mathieu Desnoyers
  2022-02-01 19:25 ` [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id Mathieu Desnoyers
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-01 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a thread group. If a thread group has fewer threads than
cores, or is limited to run on few cores concurrently through sched
affinity or cgroup cpusets, the virtual cpu ids will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.

This feature is meant to be exposed by a new rseq thread area field.
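
An illustrative user-space consumer (sketch only; it assumes the tg_vcpu_id
rseq field added in patch 2, headers with this series applied, and a
per-thread registered rseq area named __rseq_abi as in the selftests):

#include <stdint.h>
#include <linux/rseq.h>		/* struct rseq with the new tg_vcpu_id field */

#define MAX_VCPUS	512	/* any bound >= number of possible cpus */

extern __thread volatile struct rseq __rseq_abi;

static uint64_t per_vcpu_count[MAX_VCPUS];

static void count_event(void)
{
	uint32_t vcpu = __rseq_abi.tg_vcpu_id;	/* single-copy atomic read */

	/* A real user would update this inside an rseq critical section. */
	per_vcpu_count[vcpu]++;
}

Because ids are assigned from 0 upward, only the first few array entries are
ever touched when the thread group runs on few cores.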

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 fs/exec.c                    |  4 +++
 include/linux/sched.h        |  4 +++
 include/linux/sched/signal.h | 49 ++++++++++++++++++++++++++++++++++++
 init/Kconfig                 | 14 +++++++++++
 kernel/sched/core.c          |  2 ++
 5 files changed, 73 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..bc9a8c5f17f4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1153,6 +1153,10 @@ static int de_thread(struct task_struct *tsk)
 	sig->group_exec_task = NULL;
 	sig->notify_count = 0;
 
+	/* Release possibly high vcpu id, get vcpu id 0. */
+	tg_vcpu_put(tsk);
+	tg_vcpu_get(tsk);
+
 no_thread_group:
 	/* we have changed execution domain */
 	tsk->exit_signal = SIGCHLD;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 838c9e0b4cae..0f199daed26a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1300,6 +1300,10 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
+	int				tg_vcpu;	/* Current vcpu in thread group */
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index b6ecb9fc4cd2..c87e7ad5a1ea 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -244,6 +244,12 @@ struct signal_struct {
 						 * and may have inconsistent
 						 * permissions.
 						 */
+#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
+	/*
+	 * Mask of allocated vcpu ids within the thread group.
+	 */
+	cpumask_t			vcpu_mask;
+#endif
 } __randomize_layout;
 
 /*
@@ -742,4 +748,47 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
+static inline void tg_vcpu_get(struct task_struct *t)
+{
+	struct cpumask *cpumask = &t->signal->vcpu_mask;
+	unsigned int vcpu;
+
+	if (t->flags & PF_KTHREAD)
+		return;
+	/* Atomically reserve lowest available vcpu number. */
+	do {
+		vcpu = cpumask_first_zero(cpumask);
+		WARN_ON_ONCE(vcpu >= nr_cpu_ids);
+	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
+	t->tg_vcpu = vcpu;
+}
+
+static inline void tg_vcpu_put(struct task_struct *t)
+{
+	if (t->flags & PF_KTHREAD)
+		return;
+	cpumask_clear_cpu(t->tg_vcpu, &t->signal->vcpu_mask);
+	t->tg_vcpu = 0;
+}
+
+static inline int task_tg_vcpu_id(struct task_struct *t)
+{
+	return t->tg_vcpu;
+}
+#else
+static inline void tg_vcpu_get(struct task_struct *t) { }
+static inline void tg_vcpu_put(struct task_struct *t) { }
+static inline int task_tg_vcpu_id(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the thread group vcpu
+	 * feature is disabled. This provides functional per-cpu data structure
+	 * accesses in user-space, although it won't provide the memory usage
+	 * benefits.
+	 */
+	return raw_smp_processor_id();
+}
+#endif
+
 #endif /* _LINUX_SCHED_SIGNAL_H */
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..5f72b4212a33 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1023,6 +1023,20 @@ config RT_GROUP_SCHED
 
 endif #CGROUP_SCHED
 
+config SCHED_THREAD_GROUP_VCPU
+	bool "Provide per-thread-group virtual cpu id"
+	depends on SMP
+	default n
+	help
+	  This feature allows the scheduler to expose a current virtual cpu id
+	  to user-space. This virtual cpu id is within the possible cpus range,
+	  and is temporarily (and uniquely) assigned while threads are actively
+	  running within a thread group. If a thread group has fewer threads than
+	  cores, or is limited to run on few cores concurrently through sched
+	  affinity or cgroup cpusets, the virtual cpu ids will be values close
+	  to 0, thus allowing efficient use of user-space memory for per-cpu
+	  data structures.
+
 config UCLAMP_TASK_GROUP
 	bool "Utilization clamping per group of tasks"
 	depends on CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4ae00e52d1..2690e80977b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4795,6 +4795,8 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	rseq_preempt(prev);
+	tg_vcpu_put(prev);
+	tg_vcpu_get(next);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 19:25 [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Mathieu Desnoyers
@ 2022-02-01 19:25 ` Mathieu Desnoyers
  2022-02-01 20:03   ` Florian Weimer
  2022-02-01 19:25 ` [RFC PATCH 3/3] selftests/rseq: Implement rseq tg_vcpu_id field support Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-01 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

If a thread group has fewer threads than cores, or is limited to run on
few cores concurrently through sched affinity or cgroup cpusets, the
virtual cpu ids will be values close to 0, thus allowing efficient use
of user-space memory for per-cpu data structures.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 include/uapi/linux/rseq.h | 15 +++++++++++++++
 kernel/rseq.c             | 16 +++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 386c25b5bbdb..d687ac79e62c 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -154,6 +154,21 @@ struct rseq {
 	 * rseq_len. Use the offset immediately after the node_id field as
 	 * rseq_len.
 	 */
+
+	/*
+	 * Restartable sequences tg_vcpu_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+	 * (allocated uniquely within thread group).
+	 */
+	__u32 tg_vcpu_id;
+
+	/*
+	 * This is a valid end of rseq ABI for the purpose of rseq registration
+	 * rseq_len. Use the offset immediately after the tg_vcpu_id field as
+	 * rseq_len.
+	 */
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 13f6d0419f31..37b43735a400 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -86,10 +86,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 	struct rseq __user *rseq = t->rseq;
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
+	u32 tg_vcpu_id = task_tg_vcpu_id(t);
 
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 	switch (t->rseq_len) {
+	case offsetofend(struct rseq, tg_vcpu_id):
+		unsafe_put_user(tg_vcpu_id, &rseq->tg_vcpu_id, efault_end);
+		fallthrough;
 	case offsetofend(struct rseq, node_id):
 		unsafe_put_user(node_id, &rseq->node_id, efault_end);
 		fallthrough;
@@ -112,9 +116,17 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 
 static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
 {
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;
+	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
+	    tg_vcpu_id = 0;
 
 	switch (t->rseq_len) {
+	case offsetofend(struct rseq, tg_vcpu_id):
+		/*
+		 * Reset tg_vcpu_id to its initial state (0).
+		 */
+		if (put_user(tg_vcpu_id, &t->rseq->tg_vcpu_id))
+			return -EFAULT;
+		fallthrough;
 	case offsetofend(struct rseq, node_id):
 		/*
 		 * Reset node_id to its initial state (0).
@@ -396,6 +408,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 	if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)))
 		return -EINVAL;
 	switch (rseq_len) {
+	case offsetofend(struct rseq, tg_vcpu_id):
+		fallthrough;
 	case offsetofend(struct rseq, node_id):
 		fallthrough;
 	case offsetofend(struct rseq, padding1):
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 3/3] selftests/rseq: Implement rseq tg_vcpu_id field support
  2022-02-01 19:25 [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Mathieu Desnoyers
  2022-02-01 19:25 ` [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id Mathieu Desnoyers
@ 2022-02-01 19:25 ` Mathieu Desnoyers
  2022-02-01 19:49 ` [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Peter Oskolkov
  2022-02-02 11:23 ` Peter Zijlstra
  3 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-01 19:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Mathieu Desnoyers

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/rseq-abi.h | 15 +++++++++++++++
 tools/testing/selftests/rseq/rseq.c     |  6 +++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index 850827e8d089..929183c2b3c0 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -169,6 +169,21 @@ struct rseq_abi {
 	 * rseq_len. Use the offset immediately after the node_id field as
 	 * rseq_len.
 	 */
+
+	/*
+	 * Restartable sequences tg_vcpu_id field. Updated by the kernel. Read by
+	 * user-space with single-copy atomicity semantics. This field should
+	 * only be read by the thread which registered this data structure.
+	 * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+	 * (allocated uniquely within thread group).
+	 */
+	__u32 tg_vcpu_id;
+
+	/*
+	 * This is a valid end of rseq ABI for the purpose of rseq registration
+	 * rseq_len. Use the offset immediately after the tg_vcpu_id field as
+	 * rseq_len.
+	 */
 } __attribute__((aligned(4 * sizeof(__u64))));
 
 #endif /* _RSEQ_ABI_H */
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 4b0e68051db8..c8d30e770d59 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -88,7 +88,7 @@ int rseq_register_current_thread(void)
 		/* Treat libc's ownership as a successful registration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, rseq_offsetofend(struct rseq_abi, node_id), 0, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, rseq_offsetofend(struct rseq_abi, tg_vcpu_id), 0, RSEQ_SIG);
 	if (rc)
 		return -1;
 	assert(rseq_current_cpu_raw() >= 0);
@@ -103,7 +103,7 @@ int rseq_unregister_current_thread(void)
 		/* Treat libc's ownership as a successful unregistration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, rseq_offsetofend(struct rseq_abi, node_id), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, rseq_offsetofend(struct rseq_abi, tg_vcpu_id), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
 	if (rc)
 		return -1;
 	return 0;
@@ -126,7 +126,7 @@ void rseq_init(void)
 		return;
 	rseq_ownership = 1;
 	rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
-	rseq_size = rseq_offsetofend(struct rseq_abi, node_id);
+	rseq_size = rseq_offsetofend(struct rseq_abi, tg_vcpu_id);
 	rseq_flags = 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 1/3] Introduce per thread group current virtual cpu id
  2022-02-01 19:25 [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Mathieu Desnoyers
  2022-02-01 19:25 ` [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id Mathieu Desnoyers
  2022-02-01 19:25 ` [RFC PATCH 3/3] selftests/rseq: Implement rseq tg_vcpu_id field support Mathieu Desnoyers
@ 2022-02-01 19:49 ` Peter Oskolkov
  2022-02-01 21:00   ` Mathieu Desnoyers
  2022-02-02 11:23 ` Peter Zijlstra
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Oskolkov @ 2022-02-01 19:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Linux Kernel Mailing List, Thomas Gleixner,
	Paul E . McKenney, Boqun Feng, H . Peter Anvin, Paul Turner,
	linux-api, Christian Brauner, Florian Weimer, David.Laight,
	carlos, Chris Kennelly

On Tue, Feb 1, 2022 at 11:26 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This feature allows the scheduler to expose a current virtual cpu id
> to user-space. This virtual cpu id is within the possible cpus range,
> and is temporarily (and uniquely) assigned while threads are actively
> running within a thread group. If a thread group has fewer threads than
> cores, or is limited to run on few cores concurrently through sched
> affinity or cgroup cpusets, the virtual cpu ids will be values close
> to 0, thus allowing efficient use of user-space memory for per-cpu
> data structures.

Why per thread group and not per mm? The main use case is for
per-(v)cpu memory allocation logic, so it seems having this feature
per mm is more appropriate?

>
> This feature is meant to be exposed by a new rseq thread area field.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> ---
>  fs/exec.c                    |  4 +++
>  include/linux/sched.h        |  4 +++
>  include/linux/sched/signal.h | 49 ++++++++++++++++++++++++++++++++++++
>  init/Kconfig                 | 14 +++++++++++
>  kernel/sched/core.c          |  2 ++
>  5 files changed, 73 insertions(+)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..bc9a8c5f17f4 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1153,6 +1153,10 @@ static int de_thread(struct task_struct *tsk)
>         sig->group_exec_task = NULL;
>         sig->notify_count = 0;
>
> +       /* Release possibly high vcpu id, get vcpu id 0. */
> +       tg_vcpu_put(tsk);
> +       tg_vcpu_get(tsk);
> +
>  no_thread_group:
>         /* we have changed execution domain */
>         tsk->exit_signal = SIGCHLD;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 838c9e0b4cae..0f199daed26a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1300,6 +1300,10 @@ struct task_struct {
>         unsigned long rseq_event_mask;
>  #endif
>
> +#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
> +       int                             tg_vcpu;        /* Current vcpu in thread group */
> +#endif
> +
>         struct tlbflush_unmap_batch     tlb_ubc;
>
>         union {
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index b6ecb9fc4cd2..c87e7ad5a1ea 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -244,6 +244,12 @@ struct signal_struct {
>                                                  * and may have inconsistent
>                                                  * permissions.
>                                                  */
> +#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
> +       /*
> +        * Mask of allocated vcpu ids within the thread group.
> +        */
> +       cpumask_t                       vcpu_mask;

We use a pointer for the mask (in struct mm). Adds complexity around
alloc/free, though. Just FYI.

> +#endif
>  } __randomize_layout;
>
>  /*
> @@ -742,4 +748,47 @@ static inline unsigned long rlimit_max(unsigned int limit)
>         return task_rlimit_max(current, limit);
>  }
>
> +#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
> +static inline void tg_vcpu_get(struct task_struct *t)
> +{
> +       struct cpumask *cpumask = &t->signal->vcpu_mask;
> +       unsigned int vcpu;
> +
> +       if (t->flags & PF_KTHREAD)
> +               return;
> +       /* Atomically reserve lowest available vcpu number. */
> +       do {
> +               vcpu = cpumask_first_zero(cpumask);
> +               WARN_ON_ONCE(vcpu >= nr_cpu_ids);
> +       } while (cpumask_test_and_set_cpu(vcpu, cpumask));
> +       t->tg_vcpu = vcpu;
> +}
> +
> +static inline void tg_vcpu_put(struct task_struct *t)
> +{
> +       if (t->flags & PF_KTHREAD)
> +               return;
> +       cpumask_clear_cpu(t->tg_vcpu, &t->signal->vcpu_mask);
> +       t->tg_vcpu = 0;
> +}
> +
> +static inline int task_tg_vcpu_id(struct task_struct *t)
> +{
> +       return t->tg_vcpu;
> +}
> +#else
> +static inline void tg_vcpu_get(struct task_struct *t) { }
> +static inline void tg_vcpu_put(struct task_struct *t) { }
> +static inline int task_tg_vcpu_id(struct task_struct *t)
> +{
> +       /*
> +        * Use the processor id as a fall-back when the thread group vcpu
> +        * feature is disabled. This provides functional per-cpu data structure
> +        * accesses in user-space, although it won't provide the memory usage
> +        * benefits.
> +        */
> +       return raw_smp_processor_id();
> +}
> +#endif
> +
>  #endif /* _LINUX_SCHED_SIGNAL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index e9119bf54b1f..5f72b4212a33 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1023,6 +1023,20 @@ config RT_GROUP_SCHED
>
>  endif #CGROUP_SCHED
>
> +config SCHED_THREAD_GROUP_VCPU
> +       bool "Provide per-thread-group virtual cpu id"
> +       depends on SMP
> +       default n
> +       help
> +         This feature allows the scheduler to expose a current virtual cpu id
> +         to user-space. This virtual cpu id is within the possible cpus range,
> +         and is temporarily (and uniquely) assigned while threads are actively
> +         running within a thread group. If a thread group has fewer threads than
> +         cores, or is limited to run on few cores concurrently through sched
> +         affinity or cgroup cpusets, the virtual cpu ids will be values close
> +         to 0, thus allowing efficient use of user-space memory for per-cpu
> +         data structures.
> +
>  config UCLAMP_TASK_GROUP
>         bool "Utilization clamping per group of tasks"
>         depends on CGROUP_SCHED
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2e4ae00e52d1..2690e80977b1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4795,6 +4795,8 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>         sched_info_switch(rq, prev, next);
>         perf_event_task_sched_out(prev, next);
>         rseq_preempt(prev);
> +       tg_vcpu_put(prev);
> +       tg_vcpu_get(next);

Doing this for all tasks on all context switches will most likely be
too expensive. We do it only for tasks that explicitly asked for this
feature during their rseq registration, and still the tight loop in
our equivalent of tg_vcpu_get() is occasionally noticeable (lots of
short wakeups can lead to the loop thrashing around).

Again, our approach is more complicated as a result.

>         fire_sched_out_preempt_notifiers(prev, next);
>         kmap_local_sched_out();
>         prepare_task(next);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 19:25 ` [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id Mathieu Desnoyers
@ 2022-02-01 20:03   ` Florian Weimer
  2022-02-01 20:22     ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2022-02-01 20:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David.Laight, carlos, Peter Oskolkov

* Mathieu Desnoyers:

> If a thread group has fewer threads than cores, or is limited to run on
> few cores concurrently through sched affinity or cgroup cpusets, the
> virtual cpu ids will be values close to 0, thus allowing efficient use
> of user-space memory for per-cpu data structures.

From a userspace programmer perspective, what's a good way to obtain a
reasonable upper bound for the possible tg_vcpu_id values?

I believe not all users of cgroup cpusets change the affinity mask.

> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 13f6d0419f31..37b43735a400 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -86,10 +86,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>  	struct rseq __user *rseq = t->rseq;
>  	u32 cpu_id = raw_smp_processor_id();
>  	u32 node_id = cpu_to_node(cpu_id);
> +	u32 tg_vcpu_id = task_tg_vcpu_id(t);
>  
>  	if (!user_write_access_begin(rseq, t->rseq_len))
>  		goto efault;
>  	switch (t->rseq_len) {
> +	case offsetofend(struct rseq, tg_vcpu_id):
> +		unsafe_put_user(tg_vcpu_id, &rseq->tg_vcpu_id, efault_end);
> +		fallthrough;
>  	case offsetofend(struct rseq, node_id):
>  		unsafe_put_user(node_id, &rseq->node_id, efault_end);
>  		fallthrough;

Is the switch really useful?  I suspect it's faster to just write as
much as possible all the time.  The switch should be well-predictable
if running uniform userspace, but still …

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 20:03   ` Florian Weimer
@ 2022-02-01 20:22     ` Mathieu Desnoyers
  2022-02-01 20:32       ` Florian Weimer
  0 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-01 20:22 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

----- On Feb 1, 2022, at 3:03 PM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>> If a thread group has fewer threads than cores, or is limited to run on
>> few cores concurrently through sched affinity or cgroup cpusets, the
>> virtual cpu ids will be values close to 0, thus allowing efficient use
>> of user-space memory for per-cpu data structures.
> 
> From a userspace programmer perspective, what's a good way to obtain a
> reasonable upper bound for the possible tg_vcpu_id values?

Some effective upper bounds:

- sysconf(3) _SC_NPROCESSORS_CONF,
- the number of threads which exist concurrently in the process,
- the number of cpus in the cpu affinity mask applied by sched_setaffinity,
  except in corner-case situations such as cpu hotplug removing all cpus from
  the affinity set,
- cgroup cpuset "partition" limits,

Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
additional cores from the rest of the system if they are idle, therefore
allowing the number of concurrent threads to go beyond the specified limit.

> 
> I believe not all users of cgroup cpusets change the affinity mask.

AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
Those are two mechanisms both affecting the scheduler task placement.

I would expect the user-space code to use some sensible upper bound as a
hint about how many per-vcpu data structure elements to expect (and how many
to pre-allocate), but have a "lazy initialization" fall-back in case the
vcpu id goes up to the number of configured processors - 1. And I suspect
that even the number of configured processors may change with CRIU.
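
A minimal sketch of that strategy (illustrative; error handling and the
thread-safety needed on the lazy path are omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct vcpu_slot { long data; };	/* per-vcpu data structure element */

static struct vcpu_slot **slots;	/* one pointer per possible vcpu id */
static long nr_slots;			/* hard upper bound */

static void slots_init(void)
{
	cpu_set_t set;
	long hint;

	nr_slots = sysconf(_SC_NPROCESSORS_CONF);
	/* Pre-allocation hint: number of cpus in the current affinity mask. */
	hint = sched_getaffinity(0, sizeof(set), &set) ? nr_slots : CPU_COUNT(&set);
	slots = calloc(nr_slots, sizeof(*slots));
	for (long i = 0; i < hint; i++)
		slots[i] = calloc(1, sizeof(**slots));
}

static struct vcpu_slot *slot_for(uint32_t vcpu)
{
	if (!slots[vcpu])	/* lazy fall-back for ids beyond the hint */
		slots[vcpu] = calloc(1, sizeof(**slots));
	return slots[vcpu];
}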

> 
>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>> index 13f6d0419f31..37b43735a400 100644
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -86,10 +86,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>>  	struct rseq __user *rseq = t->rseq;
>>  	u32 cpu_id = raw_smp_processor_id();
>>  	u32 node_id = cpu_to_node(cpu_id);
>> +	u32 tg_vcpu_id = task_tg_vcpu_id(t);
>>  
>>  	if (!user_write_access_begin(rseq, t->rseq_len))
>>  		goto efault;
>>  	switch (t->rseq_len) {
>> +	case offsetofend(struct rseq, tg_vcpu_id):
>> +		unsafe_put_user(tg_vcpu_id, &rseq->tg_vcpu_id, efault_end);
>> +		fallthrough;
>>  	case offsetofend(struct rseq, node_id):
>>  		unsafe_put_user(node_id, &rseq->node_id, efault_end);
>>  		fallthrough;
> 
> Is the switch really useful?  I suspect it's faster to just write as
> much as possible all the time.  The switch should be well-predictable
> if running uniform userspace, but still …

The switch ensures the kernel doesn't try to write to a memory area beyond
the rseq size which has been registered by user-space. So it seems to be
useful to ensure we don't corrupt user-space memory. Or am I missing your
point ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 20:22     ` Mathieu Desnoyers
@ 2022-02-01 20:32       ` Florian Weimer
  2022-02-01 21:20         ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2022-02-01 20:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

* Mathieu Desnoyers:

> ----- On Feb 1, 2022, at 3:03 PM, Florian Weimer fw@deneb.enyo.de wrote:
>
>> * Mathieu Desnoyers:
>> 
>>> If a thread group has fewer threads than cores, or is limited to run on
>>> few cores concurrently through sched affinity or cgroup cpusets, the
>>> virtual cpu ids will be values close to 0, thus allowing efficient use
>>> of user-space memory for per-cpu data structures.
>> 
>> From a userspace programmer perspective, what's a good way to obtain a
>> reasonable upper bound for the possible tg_vcpu_id values?
>
> Some effective upper bounds:
>
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,
> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
>   except in corner-case situations such as cpu hotplug removing all cpus from
>   the affinity set,
> - cgroup cpuset "partition" limits,

Affinity masks and _SC_NPROCESSORS_CONF can be off by more than an
order of magnitude compared to the cgroup cpuset, I think, and those
aren't even that atypical configurations.

The number of concurrent threads sounds more tractable, but I'm
worried about things creating threads behind libc's back (perhaps
io_uring?).  So it couldn't be a hard upper bound.

I'm worried about querying anything cgroup-related because these APIs
have a reputation for being slow, convoluted, and unstable
(effectively not subject to the “don't break userspace” rule).
Hopefully I'm wrong about that.

>> I believe not all users of cgroup cpusets change the affinity mask.
>
> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
> Those are two mechanisms both affecting the scheduler task placement.

There are container hosts out there that synthesize an affinity mask
that matches the CPU allocation, assuming that anyone who calls
sched_getaffinity only does so for counting the number of set bits.

> I would expect the user-space code to use some sensible upper bound as a
> hint about how many per-vcpu data structure elements to expect (and how many
> to pre-allocate), but have a "lazy initialization" fall-back in case the
> vcpu id goes up to the number of configured processors - 1. And I suspect
> that even the number of configured processors may change with CRIU.

Sounds reasonable.

>> Is the switch really useful?  I suspect it's faster to just write as
>> much as possible all the time.  The switch should be well-predictable
>> if running uniform userspace, but still …
>
> The switch ensures the kernel doesn't try to write to a memory area beyond
> the rseq size which has been registered by user-space. So it seems to be
> useful to ensure we don't corrupt user-space memory. Or am I missing your
> point ?

Due to the alignment, I think you'd only ever see 32 and 64 bytes for
now?

I'd appreciate it if you could put the maximum supported size and possibly
the alignment in the auxiliary vector, so that we don't have to issue rseq
system calls in a loop on process startup.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 1/3] Introduce per thread group current virtual cpu id
  2022-02-01 19:49 ` [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Peter Oskolkov
@ 2022-02-01 21:00   ` Mathieu Desnoyers
  0 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-01 21:00 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Chris Kennelly

----- On Feb 1, 2022, at 2:49 PM, Peter Oskolkov posk@posk.io wrote:

> On Tue, Feb 1, 2022 at 11:26 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This feature allows the scheduler to expose a current virtual cpu id
>> to user-space. This virtual cpu id is within the possible cpus range,
>> and is temporarily (and uniquely) assigned while threads are actively
>> running within a thread group. If a thread group has fewer threads than
>> cores, or is limited to run on few cores concurrently through sched
>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>> to 0, thus allowing efficient use of user-space memory for per-cpu
>> data structures.
> 
> Why per thread group and not per mm? The main use case is for
> per-(v)cpu memory allocation logic, so it seems having this feature
> per mm is more appropriate?

Good point, yes, per-mm would be more appropriate.

So I guess that from a userspace perspective, the rseq field could become
"__u32 vm_vcpu; /* Current vcpu within memory space. */"

[...]

>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index b6ecb9fc4cd2..c87e7ad5a1ea 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -244,6 +244,12 @@ struct signal_struct {
>>                                                  * and may have inconsistent
>>                                                  * permissions.
>>                                                  */
>> +#ifdef CONFIG_SCHED_THREAD_GROUP_VCPU
>> +       /*
>> +        * Mask of allocated vcpu ids within the thread group.
>> +        */
>> +       cpumask_t                       vcpu_mask;
> 
> We use a pointer for the mask (in struct mm). Adds complexity around
> alloc/free, though. Just FYI.

It does make sense if this is opt-in.

[...]

>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2e4ae00e52d1..2690e80977b1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4795,6 +4795,8 @@ prepare_task_switch(struct rq *rq, struct task_struct
>> *prev,
>>         sched_info_switch(rq, prev, next);
>>         perf_event_task_sched_out(prev, next);
>>         rseq_preempt(prev);
>> +       tg_vcpu_put(prev);
>> +       tg_vcpu_get(next);
> 
> Doing this for all tasks on all context switches will most likely be
> too expensive. We do it only for tasks that explicitly asked for this
> feature during their rseq registration, and still the tight loop in
> our equivalent of tg_vcpu_get() is occasionally noticeable (lots of
> short wakeups can lead to the loop thrashing around).
> 
> Again, our approach is more complicated as a result.

I suspect that the overhead of tg_vcpu_get is quite small for processes
which work on only a few cores, but becomes noticeable when processes have
many threads and are massively parallel (not affined to only a few cores).

When the feature is disabled, we can always fall-back on the value returned
by raw_smp_processor_id() and use that as a "vm-vcpu-id" value.

Whether the vm-vcpu-id or the processor id is used needs to be a consensus
across all threads from all processes using a mm at a given time.

There appears to be a tradeoff here, and I wonder how this should be presented
to users. A few possible options:

- vm-vcpu feature is opt-in (default off) or opt-out (default on),
- whether vm-vcpu is enabled for a process could be selected at runtime by the
  process, either at process initialization (single thread, single mm user)
  and/or while the process is multi-threaded (requires more synchronization),
- if we find a way to move automatically between vm-vcpu-id and processor id as
  information source for all threads tied to a mm when we reach a threshold number of
  parallel threads, then I suspect we could have the best of both worlds. But it's not
  clear to me how to achieve this.

Thoughts ?

Thanks,

Mathieu


> 
>>         fire_sched_out_preempt_notifiers(prev, next);
>>         kmap_local_sched_out();
>>         prepare_task(next);
>> --
>> 2.17.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 20:32       ` Florian Weimer
@ 2022-02-01 21:20         ` Mathieu Desnoyers
  2022-02-01 21:30           ` Florian Weimer
  0 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-01 21:20 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

----- On Feb 1, 2022, at 3:32 PM, Florian Weimer fw@deneb.enyo.de wrote:
[...]
> 
>>> Is the switch really useful?  I suspect it's faster to just write as
>>> much as possible all the time.  The switch should be well-predictable
>>> if running uniform userspace, but still …
>>
>> The switch ensures the kernel doesn't try to write to a memory area beyond
>> the rseq size which has been registered by user-space. So it seems to be
>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>> point ?
> 
> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
> now?

Yes, but I would expect the rseq registration arguments to have a rseq_len
of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
feature to be supported (but not the following features).

Then, as we append additional features as follow-up fields, those
eventually become requested by glibc by increasing the requested size.

Then it's kind of weird to receive a registration size which is not
aligned on 32 bytes, but then use internal knowledge of the structure
alignment in the kernel code to write beyond the requested size. And all
this in a case where we are returning to user-space after a preemption,
so I don't expect this extra switch/case to cause significant overhead.

> 
> I'd appreciate it if you could put the maximum supported size and possibly
> the alignment in the auxiliary vector, so that we don't have to issue rseq
> system calls in a loop on process startup.

Yes, it's a good idea. I'm not too familiar with the auxiliary vector.
Are we talking about the kernel's

fs/binfmt_elf.c:fill_auxv_note()

?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 21:20         ` Mathieu Desnoyers
@ 2022-02-01 21:30           ` Florian Weimer
  2022-02-02  1:32             ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: Florian Weimer @ 2022-02-01 21:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

* Mathieu Desnoyers:

> ----- On Feb 1, 2022, at 3:32 PM, Florian Weimer fw@deneb.enyo.de wrote:
> [...]
>> 
>>>> Is the switch really useful?  I suspect it's faster to just write as
>>>> much as possible all the time.  The switch should be well-predictable
>>>> if running uniform userspace, but still …
>>>
>>> The switch ensures the kernel doesn't try to write to a memory area beyond
>>> the rseq size which has been registered by user-space. So it seems to be
>>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>>> point ?
>> 
>> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
>> now?
>
> Yes, but I would expect the rseq registration arguments to have a rseq_len
> of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
> feature to be supported (but not the following features).

But if rseq is managed by libc, it really has to use the full size
unconditionally.  I would even expect that eventually, the kernel only
supports the initial 32, maybe 64 for a few early extensions, and the
size indicated by the auxiliary vector.

Not all of that area would be ABI, some of it would be used by the
vDSO only and opaque to userspace applications (with applications/libcs
passing __rseq_offset as an argument to these functions).

>> I'd appreciate it if you could put the maximum supported size and possibly
>> the alignment in the auxiliary vector, so that we don't have to issue rseq
>> system calls in a loop on process startup.
>
> Yes, it's a good idea. I'm not too familiar with the auxiliary vector.
> Are we talking about the kernel's
>
> fs/binfmt_elf.c:fill_auxv_note()
>
> ?

Indeed.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-01 21:30           ` Florian Weimer
@ 2022-02-02  1:32             ` Mathieu Desnoyers
  2022-02-03 15:53               ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-02  1:32 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

----- On Feb 1, 2022, at 4:30 PM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>> ----- On Feb 1, 2022, at 3:32 PM, Florian Weimer fw@deneb.enyo.de wrote:
>> [...]
>>> 
>>>>> Is the switch really useful?  I suspect it's faster to just write as
>>>>> much as possible all the time.  The switch should be well-predictable
>>>>> if running uniform userspace, but still …
>>>>
>>>> The switch ensures the kernel doesn't try to write to a memory area beyond
>>>> the rseq size which has been registered by user-space. So it seems to be
>>>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>>>> point ?
>>> 
>>> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
>>> now?
>>
>> Yes, but I would expect the rseq registration arguments to have a rseq_len
>> of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
>> feature to be supported (but not the following features).
> 
> But if rseq is managed by libc, it really has to use the full size
> unconditionally.  I would even expect that eventually, the kernel only
>> supports the initial 32, maybe 64 for a few early extensions, and the
> size indicated by the auxiliary vector.
> 
> Not all of that area would be ABI, some of it would be used by the
> vDSO only and opaque to userspace applications (with applications/libcs
> passing __rseq_offset as an argument to these functions).
> 

I think one aspect leading to our misunderstanding here is the distinction
between the size of the rseq area _allocation_, and the offset after the last
field supported by the given kernel.

With this in mind, let's state a bit more clearly our expected aux. vector
extensibility scheme.

With CONFIG_RSEQ=y, the kernel would pass the following information through
the ELF auxv:

- rseq allocation size (auxv_rseq_alloc_size),
- rseq allocation alignment (auxv_rseq_alloc_align),
- offset after the end of the last rseq field supported by this kernel (auxv_rseq_offset_end),

We always have auxv_rseq_alloc_size >= auxv_rseq_offset_end.

I would expect libc to use this information to allocate a memory area
at least auxv_rseq_alloc_size in size, with an alignment respecting
auxv_rseq_alloc_align. It would use a value >= auxv_rseq_alloc_size
as rseq_len argument for the rseq registration.

But I would expect libc to use the auxv_rseq_offset_end value to populate __rseq_size,
so rseq users can rely on this to check whether the fields they are trying to access
are indeed populated by the kernel.

Of course, the kernel would still allow the original 32-byte rseq_len argument
for the rseq registration, so the original ABI still works. It would however
reject any rseq registration with size smaller than auxv_rseq_alloc_size (other
than the 32-byte special-case).

Is that in line with what you have in mind ? Do we really need to expose those 3
auxv variables independently or can we somehow remove auxv_rseq_alloc_size and
use auxv_rseq_offset_end as a min value for allocation instead ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 1/3] Introduce per thread group current virtual cpu id
  2022-02-01 19:25 [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2022-02-01 19:49 ` [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Peter Oskolkov
@ 2022-02-02 11:23 ` Peter Zijlstra
  2022-02-02 13:48   ` Mathieu Desnoyers
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2022-02-02 11:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov

On Tue, Feb 01, 2022 at 02:25:38PM -0500, Mathieu Desnoyers wrote:

> +static inline void tg_vcpu_get(struct task_struct *t)
> +{
> +	struct cpumask *cpumask = &t->signal->vcpu_mask;
> +	unsigned int vcpu;
> +
> +	if (t->flags & PF_KTHREAD)
> +		return;
> +	/* Atomically reserve lowest available vcpu number. */
> +	do {
> +		vcpu = cpumask_first_zero(cpumask);
> +		WARN_ON_ONCE(vcpu >= nr_cpu_ids);
> +	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
> +	t->tg_vcpu = vcpu;
> +}
> +
> +static inline void tg_vcpu_put(struct task_struct *t)
> +{
> +	if (t->flags & PF_KTHREAD)
> +		return;
> +	cpumask_clear_cpu(t->tg_vcpu, &t->signal->vcpu_mask);
> +	t->tg_vcpu = 0;
> +}

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2e4ae00e52d1..2690e80977b1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4795,6 +4795,8 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	sched_info_switch(rq, prev, next);
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
> +	tg_vcpu_put(prev);
> +	tg_vcpu_get(next);


URGGHHH!!! that's *2* atomics extra on the context switch path. Worse,
that's on a line that's trivially contended with a few threads.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 1/3] Introduce per thread group current virtual cpu id
  2022-02-02 11:23 ` Peter Zijlstra
@ 2022-02-02 13:48   ` Mathieu Desnoyers
  0 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-02 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, paulmck, Boqun Feng,
	H. Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David Laight, carlos, Peter Oskolkov

----- On Feb 2, 2022, at 6:23 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Feb 01, 2022 at 02:25:38PM -0500, Mathieu Desnoyers wrote:
> 
>> +static inline void tg_vcpu_get(struct task_struct *t)
>> +{
>> +	struct cpumask *cpumask = &t->signal->vcpu_mask;
>> +	unsigned int vcpu;
>> +
>> +	if (t->flags & PF_KTHREAD)
>> +		return;
>> +	/* Atomically reserve lowest available vcpu number. */
>> +	do {
>> +		vcpu = cpumask_first_zero(cpumask);
>> +		WARN_ON_ONCE(vcpu >= nr_cpu_ids);
>> +	} while (cpumask_test_and_set_cpu(vcpu, cpumask));
>> +	t->tg_vcpu = vcpu;
>> +}
>> +
>> +static inline void tg_vcpu_put(struct task_struct *t)
>> +{
>> +	if (t->flags & PF_KTHREAD)
>> +		return;
>> +	cpumask_clear_cpu(t->tg_vcpu, &t->signal->vcpu_mask);
>> +	t->tg_vcpu = 0;
>> +}
> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2e4ae00e52d1..2690e80977b1 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4795,6 +4795,8 @@ prepare_task_switch(struct rq *rq, struct task_struct
>> *prev,
>>  	sched_info_switch(rq, prev, next);
>>  	perf_event_task_sched_out(prev, next);
>>  	rseq_preempt(prev);
>> +	tg_vcpu_put(prev);
>> +	tg_vcpu_get(next);
> 
> 
> URGGHHH!!! that's *2* atomics extra on the context switch path. Worse,
> that's on a line that's trivially contended with a few threads.

There is one obvious optimization that just begs to be done here: when
switching between threads belonging to the same process, we can simply
take the vcpu_id tag of the prev thread and use it for next,
without requiring any atomic operation.
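
A minimal sketch of that idea, assuming the tg_vcpu helpers from patch 1
(prepare_task_switch() would call this instead of the put/get pair):

/* Skip the atomic release/reserve when prev and next share a thread group. */
static inline void tg_vcpu_switch(struct task_struct *prev,
				  struct task_struct *next)
{
	if (!(prev->flags & PF_KTHREAD) && !(next->flags & PF_KTHREAD) &&
	    next->signal == prev->signal) {
		/* The id stays reserved in vcpu_mask; just hand it over. */
		next->tg_vcpu = prev->tg_vcpu;
		return;
	}
	tg_vcpu_put(prev);
	tg_vcpu_get(next);
}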

This only leaves the overhead of added atomics when scheduling between
threads which belong to different processes. Does it matter as much ?
If it's the case, then we should really scratch our heads a little more
to come up with improvements.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
  2022-02-02  1:32             ` Mathieu Desnoyers
@ 2022-02-03 15:53               ` Mathieu Desnoyers
  0 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2022-02-03 15:53 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, David Laight, carlos, Peter Oskolkov

----- On Feb 1, 2022, at 8:32 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Feb 1, 2022, at 4:30 PM, Florian Weimer fw@deneb.enyo.de wrote:
> 
>> * Mathieu Desnoyers:
>> 
>>> ----- On Feb 1, 2022, at 3:32 PM, Florian Weimer fw@deneb.enyo.de wrote:
>>> [...]
>>>> 
>>>>>> Is the switch really useful?  I suspect it's faster to just write as
>>>>>> much as possible all the time.  The switch should be well-predictable
>>>>>> if running uniform userspace, but still …
>>>>>
>>>>> The switch ensures the kernel doesn't try to write to a memory area beyond
>>>>> the rseq size which has been registered by user-space. So it seems to be
>>>>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>>>>> point ?
>>>> 
>>>> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
>>>> now?
>>>
>>> Yes, but I would expect the rseq registration arguments to have a rseq_len
>>> of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
>>> feature to be supported (but not the following features).
>> 
>> But if rseq is managed by libc, it really has to use the full size
>> unconditionally.  I would even expect that eventually, the kernel only
>> supports the initial 32, maybe 64 for a few early extensions, and the
>> size indicated by the auxiliary vector.
>> 
>> Not all of that area would be ABI, some of it would be used by the
>> vDSO only and opaque to userspace applications (with applications/libcs
>> passing __rseq_offset as an argument to these functions).
>> 
> 
> I think one aspect leading to our misunderstanding here is the distinction
> between the size of the rseq area _allocation_, and the offset after the last
> field supported by the given kernel.
> 
> With this in mind, let's state a bit more clearly our expected aux. vector
> extensibility scheme.
> 
> With CONFIG_RSEQ=y, the kernel would pass the following information through
> the ELF auxv:
> 
> - rseq allocation size (auxv_rseq_alloc_size),
> - rseq allocation alignment (auxv_rseq_alloc_align),
> - offset after the end of the last rseq field supported by this kernel
> (auxv_rseq_offset_end),
> 
> We always have auxv_rseq_alloc_size >= auxv_rseq_offset_end.
> 
> I would expect libc to use this information to allocate a memory area
> at least auxv_rseq_alloc_size in size, with an alignment respecting
> auxv_rseq_alloc_align. It would use a value >= auxv_rseq_alloc_size
> as rseq_len argument for the rseq registration.
> 
> But I would expect libc to use the auxv_rseq_offset_end value to populate
> __rseq_size,
> so rseq users can rely on this to check whether the fields they are trying to
> access
> are indeed populated by the kernel.
> 
> Of course, the kernel would still allow the original 32-byte rseq_len argument
> for the rseq registration, so the original ABI still works. It would however
> reject any rseq registration with size smaller than auxv_rseq_alloc_size (other
> than the 32-byte special-case).
> 
> Is that in line with what you have in mind ? Do we really need to expose those 3
> auxv variables independently or can we somehow remove auxv_rseq_alloc_size and
> use auxv_rseq_offset_end as a min value for allocation instead ?

After giving all this some more thoughts, I think we can extend struct rseq
cleanly without adding any "padding1" fields at the end of the existing structure
(which would be effectively wasting very useful hot cache line space).

Here is what I have in mind:

We consider separately the "size" and the "feature_size" of the rseq structure.

- The "size" is really the size for memory allocation (includes padding),
- The "feature_size" is the offset after the last supported feature field.

So for the original struct rseq, size=32 bytes and feature_size=20 bytes
(offsetofend(struct rseq, flags)).

The kernel can expose this "supported rseq feature size" value through the ELF auxiliary
vector (with a new AT_RSEQ_FEATURE_SIZE). It would also expose a new AT_RSEQ_ALIGN for the
minimal allocation alignment required by the kernel. Those can be queried by user-space
through getauxval(3).

glibc can add a new const unsigned int __rseq_feature_size symbol in a future release which
will support extended rseq structures. This is the symbol I expect rseq users to check
at least once in the program's lifetime before they try to access rseq fields beyond
the originally populated 20 bytes.

The rseq_len argument passed to sys_rseq would really be the allocated size for the rseq
area (as it is today, considering that the kernel expects sizeof(struct rseq)). Typically,
I would expect glibc to take the rseq feature_size value and round it up to a value which
is a multiple of the AT_RSEQ_ALIGN. That allocation size would be used to populate
__rseq_size.
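
In user-space terms, a sketch of that allocation (AT_RSEQ_FEATURE_SIZE and
AT_RSEQ_ALIGN are the proposed auxv entries; their values are assumed here):

#include <stdlib.h>
#include <string.h>
#include <sys/auxv.h>

#ifndef AT_RSEQ_FEATURE_SIZE
# define AT_RSEQ_FEATURE_SIZE	27	/* proposed auxv entry, value assumed */
# define AT_RSEQ_ALIGN		28	/* proposed auxv entry, value assumed */
#endif

static void *alloc_rseq_area(size_t *len)
{
	size_t feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
	size_t align = getauxval(AT_RSEQ_ALIGN);
	void *area;

	if (!feature_size)
		feature_size = 20;	/* original ABI: offsetofend(struct rseq, flags) */
	if (!align)
		align = 32;		/* original ABI allocation size and alignment */
	/* Round the feature size up to a multiple of the required alignment. */
	*len = (feature_size + align - 1) & ~(align - 1);
	area = aligned_alloc(align, *len);
	return area ? memset(area, 0, *len) : NULL;
}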

Am I missing something ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-02-03 15:53 UTC | newest]

Thread overview: 14+ messages
2022-02-01 19:25 [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Mathieu Desnoyers
2022-02-01 19:25 ` [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id Mathieu Desnoyers
2022-02-01 20:03   ` Florian Weimer
2022-02-01 20:22     ` Mathieu Desnoyers
2022-02-01 20:32       ` Florian Weimer
2022-02-01 21:20         ` Mathieu Desnoyers
2022-02-01 21:30           ` Florian Weimer
2022-02-02  1:32             ` Mathieu Desnoyers
2022-02-03 15:53               ` Mathieu Desnoyers
2022-02-01 19:25 ` [RFC PATCH 3/3] selftests/rseq: Implement rseq tg_vcpu_id field support Mathieu Desnoyers
2022-02-01 19:49 ` [RFC PATCH 1/3] Introduce per thread group current virtual cpu id Peter Oskolkov
2022-02-01 21:00   ` Mathieu Desnoyers
2022-02-02 11:23 ` Peter Zijlstra
2022-02-02 13:48   ` Mathieu Desnoyers
