linux-api.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Extend rseq with sched_state field
@ 2023-05-17 15:26 Mathieu Desnoyers
  2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-17 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Mathieu Desnoyers

This prototype extends struct rseq with a new sched_state field, which
contains an "on-cpu" flag kept up-to-date by the scheduler.

It is meant to be used by userspace adaptive mutexes to decide between
busy-wait and futex wait system call (releasing the CPU) behaviors based
on the current state of the mutex owner.

The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.
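
To illustrate, the userspace pattern I have in mind is roughly the
following sketch (try_lock(), futex_wait(), spins and MAX_BUSY_LOOP are
illustrative placeholders; the concrete example is the rseq_mutex
selftest in patch 4):

	/* Spin only while the lock owner is observed running on a CPU. */
	while (!try_lock(lock)) {
		struct rseq_abi *owner = RSEQ_ACCESS_ONCE(lock->owner);

		if (owner &&
		    (RSEQ_ACCESS_ONCE(owner->sched_state) & RSEQ_SCHED_STATE_ON_CPU) &&
		    spins++ < MAX_BUSY_LOOP)
			rseq_barrier();		/* busy-wait, e.g. cpu_relax() */
		else
			futex_wait(lock);	/* owner not running: release the CPU */
	}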

This is in response to the LWN coverage (https://lwn.net/Articles/931789/)
of the "Adaptive spinning in user space" unscheduled slot presented by
André Almeida at the 2023 Open Source Summit North America.

Feedback is welcome!

Mathieu

Mathieu Desnoyers (4):
  rseq: Add sched_state field to struct rseq
  selftests/rseq: Add sched_state rseq field and getter
  selftests/rseq: Implement sched state test program
  selftests/rseq: Implement rseq_mutex test program

 include/linux/sched.h                         |  12 ++
 include/uapi/linux/rseq.h                     |  17 +++
 kernel/rseq.c                                 |  14 ++
 tools/testing/selftests/rseq/.gitignore       |   2 +
 tools/testing/selftests/rseq/Makefile         |   3 +-
 tools/testing/selftests/rseq/rseq-abi.h       |  17 +++
 tools/testing/selftests/rseq/rseq.h           |   5 +
 tools/testing/selftests/rseq/rseq_mutex.c     | 120 ++++++++++++++++++
 .../testing/selftests/rseq/sched_state_test.c |  71 +++++++++++
 9 files changed, 260 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/rseq_mutex.c
 create mode 100644 tools/testing/selftests/rseq/sched_state_test.c

-- 
2.25.1



* [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
@ 2023-05-17 15:26 ` Mathieu Desnoyers
  2023-05-17 16:03   ` Davidlohr Bueso
                     ` (2 more replies)
  2023-05-17 15:26 ` [RFC PATCH 2/4] selftests/rseq: Add sched_state rseq field and getter Mathieu Desnoyers
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-17 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Mathieu Desnoyers, Florian Weimer

Expose the "on-cpu" state for each thread through struct rseq to allow
adaptive mutexes to decide more accurately between busy-waiting and
calling sys_futex() to release the CPU, based on the on-cpu state of the
mutex owner.

It is only provided as an optimization hint, because there is no
guarantee that the page holding this field is faulted-in when the
scheduler updates it, and therefore the scheduler may very well fail to
clear the on-cpu state on preemption. This is expected to be rare
though, and is resolved as soon as the task returns to user-space.

The goal is to improve use-cases where the duration of the critical
sections for a given lock follows a multi-modal distribution, preventing
statistical guesses from doing a good job at choosing between busy-wait
and futex wait behavior.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org
---
 include/linux/sched.h     | 12 ++++++++++++
 include/uapi/linux/rseq.h | 17 +++++++++++++++++
 kernel/rseq.c             | 14 ++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..c7e9248134c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
 	rseq_handle_notify_resume(ksig, regs);
 }
 
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
+
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (t->rseq)
+		__rseq_set_sched_state(t, state);
+}
+
 /* rseq_preempt() requires preemption to be disabled. */
 static inline void rseq_preempt(struct task_struct *t)
 {
 	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
 	rseq_set_notify_resume(t);
+	rseq_set_sched_state(t, 0);
 }
 
 /* rseq_migrate() requires preemption to be disabled. */
@@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
 }
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+}
 static inline void rseq_preempt(struct task_struct *t)
 {
 }
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..c6d8537e23ca 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,13 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_sched_state {
+	/*
+	 * Task is currently running on a CPU if bit is set.
+	 */
+	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
+};
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -148,6 +155,16 @@ struct rseq {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * Restartable sequences sched_state field. Updated by the kernel. Read
+	 * by user-space with single-copy atomicity semantics. This field can
+	 * be read by any userspace thread. 32-bit aligned. Contains a
+	 * bitmask of enum rseq_sched_state. This field is provided as a hint
+	 * by the scheduler, and requires that the page holding struct rseq is
+	 * faulted-in for the state update to be performed by the scheduler.
+	 */
+	__u32 sched_state;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..b2eb3bbaa9ef 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 mm_cid = task_mm_cid(t);
+	u32 sched_state = RSEQ_SCHED_STATE_ON_CPU;
 
 	WARN_ON_ONCE((int) mm_cid < 0);
 	if (!user_write_access_begin(rseq, t->rseq_len))
@@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
+	unsafe_put_user(sched_state, &rseq->sched_state, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+/*
+ * Attempt to update rseq scheduler state.
+ */
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	pagefault_disable();
+	(void) put_user(state, &t->rseq->sched_state);
+	pagefault_enable();
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
-- 
2.25.1



* [RFC PATCH 2/4] selftests/rseq: Add sched_state rseq field and getter
  2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
  2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
@ 2023-05-17 15:26 ` Mathieu Desnoyers
  2023-05-17 15:26 ` [RFC PATCH 3/4] selftests/rseq: Implement sched state test program Mathieu Desnoyers
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-17 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Mathieu Desnoyers

Extend struct rseq in the rseq selftests to include the sched_state
field. Implement a getter function for this field.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/rseq-abi.h | 17 +++++++++++++++++
 tools/testing/selftests/rseq/rseq.h     |  5 +++++
 2 files changed, 22 insertions(+)

diff --git a/tools/testing/selftests/rseq/rseq-abi.h b/tools/testing/selftests/rseq/rseq-abi.h
index fb4ec8a75dd4..15ec333e9eec 100644
--- a/tools/testing/selftests/rseq/rseq-abi.h
+++ b/tools/testing/selftests/rseq/rseq-abi.h
@@ -37,6 +37,13 @@ enum rseq_abi_cs_flags {
 		(1U << RSEQ_ABI_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_sched_state {
+	/*
+	 * Task is currently running on a CPU if bit is set.
+	 */
+	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
+};
+
 /*
  * struct rseq_abi_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -164,6 +171,16 @@ struct rseq_abi {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * Restartable sequences sched_state field. Updated by the kernel. Read
+	 * by user-space with single-copy atomicity semantics. This field can
+	 * be read by any userspace thread. 32-bit aligned. Contains a
+	 * bitmask of enum rseq_sched_state. This field is provided as a hint
+	 * by the scheduler, and requires that the page holding struct rseq is
+	 * faulted-in for the state update to be performed by the scheduler.
+	 */
+	__u32 sched_state;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index d7364ea4d201..348e9385cb2b 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -236,6 +236,11 @@ static inline void rseq_prepare_unload(void)
 	rseq_clear_rseq_cs();
 }
 
+static inline uint32_t rseq_current_sched_state(void)
+{
+	return RSEQ_ACCESS_ONCE(rseq_get_abi()->sched_state);
+}
+
 static inline __attribute__((always_inline))
 int rseq_cmpeqv_storev(enum rseq_mo rseq_mo, enum rseq_percpu_mode percpu_mode,
 		       intptr_t *v, intptr_t expect,
-- 
2.25.1



* [RFC PATCH 3/4] selftests/rseq: Implement sched state test program
  2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
  2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
  2023-05-17 15:26 ` [RFC PATCH 2/4] selftests/rseq: Add sched_state rseq field and getter Mathieu Desnoyers
@ 2023-05-17 15:26 ` Mathieu Desnoyers
  2023-05-17 15:26 ` [RFC PATCH 4/4] selftests/rseq: Implement rseq_mutex " Mathieu Desnoyers
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-17 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Mathieu Desnoyers

This is a small test program which can be altered to show whether the
target thread is on-cpu or not, depending on whether it loops on
poll() or does a busy-loop.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/.gitignore       |  1 +
 tools/testing/selftests/rseq/Makefile         |  2 +-
 .../testing/selftests/rseq/sched_state_test.c | 71 +++++++++++++++++++
 3 files changed, 73 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/sched_state_test.c

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce..a8db9f7a7cec 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -9,3 +9,4 @@ param_test_compare_twice
 param_test_mm_cid
 param_test_mm_cid_benchmark
 param_test_mm_cid_compare_twice
+sched_state_test
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index b357ba24af06..7c8f4f2be74c 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -14,7 +14,7 @@ OVERRIDE_TARGETS = 1
 
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
-		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
+		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice sched_state_test
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
diff --git a/tools/testing/selftests/rseq/sched_state_test.c b/tools/testing/selftests/rseq/sched_state_test.c
new file mode 100644
index 000000000000..47aeee6d980f
--- /dev/null
+++ b/tools/testing/selftests/rseq/sched_state_test.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+#include <poll.h>
+
+#include "rseq.h"
+
+static struct rseq_abi *target_thread;
+
+//TODO:
+//Use rseq c.s. and rseq fence to protect access to remote thread's rseq_abi.
+
+static
+void show_sched_state(struct rseq_abi *rseq_thread)
+{
+	uint32_t state;
+
+	state = rseq_thread->sched_state;
+	printf("Target thread: ON_CPU=%d\n", !!(state & RSEQ_SCHED_STATE_ON_CPU));
+}
+
+
+static
+void *test_thread(void *arg)
+{
+	int i;
+
+	for (i = 0; i < 1000; i++) {
+		show_sched_state(target_thread);
+		(void) poll(NULL, 0, 100);
+	}
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	pthread_t test_thread_id;
+	int i;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto init_thread_error;
+	}
+	target_thread = rseq_get_abi();
+
+	pthread_create(&test_thread_id, NULL, test_thread, NULL);
+
+	for (i = 0; i < 1000000000; i++)
+		rseq_barrier();
+	//for (i = 0; i < 10000; i++)
+	//	(void) poll(NULL, 0, 75);
+
+	pthread_join(test_thread_id, NULL);
+
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		goto init_thread_error;
+	}
+	return 0;
+
+init_thread_error:
+	return -1;
+}
-- 
2.25.1



* [RFC PATCH 4/4] selftests/rseq: Implement rseq_mutex test program
  2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
                   ` (2 preceding siblings ...)
  2023-05-17 15:26 ` [RFC PATCH 3/4] selftests/rseq: Implement sched state test program Mathieu Desnoyers
@ 2023-05-17 15:26 ` Mathieu Desnoyers
  2023-05-17 16:07 ` [RFC PATCH 0/4] Extend rseq with sched_state field Davidlohr Bueso
  2023-05-17 18:36 ` Steven Rostedt
  5 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-17 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Mathieu Desnoyers

Example use of the rseq sched state.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
 tools/testing/selftests/rseq/.gitignore   |   1 +
 tools/testing/selftests/rseq/Makefile     |   3 +-
 tools/testing/selftests/rseq/rseq_mutex.c | 120 ++++++++++++++++++++++
 3 files changed, 123 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rseq/rseq_mutex.c

diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index a8db9f7a7cec..38d5b2fe5905 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -10,3 +10,4 @@ param_test_mm_cid
 param_test_mm_cid_benchmark
 param_test_mm_cid_compare_twice
 sched_state_test
+rseq_mutex
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 7c8f4f2be74c..a9d7ceb5b79b 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -14,7 +14,8 @@ OVERRIDE_TARGETS = 1
 
 TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
 		param_test_benchmark param_test_compare_twice param_test_mm_cid \
-		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice sched_state_test
+		param_test_mm_cid_benchmark param_test_mm_cid_compare_twice sched_state_test \
+		rseq_mutex
 
 TEST_GEN_PROGS_EXTENDED = librseq.so
 
diff --git a/tools/testing/selftests/rseq/rseq_mutex.c b/tools/testing/selftests/rseq/rseq_mutex.c
new file mode 100644
index 000000000000..7e580fc6eebf
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq_mutex.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: LGPL-2.1
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+#include <poll.h>
+
+#include "rseq.h"
+
+#define RSEQ_MUTEX_MAX_BUSY_LOOP	100
+
+struct rseq_mutex {
+	/*
+	 * NULL: unlocked. When non-NULL, owner points to per-thread rseq_abi
+	 * of owner thread.
+	 */
+	struct rseq_abi *owner;
+};
+
+static struct rseq_mutex lock = { .owner = NULL };
+
+static int testvar;
+
+static void rseq_lock_slowpath(struct rseq_mutex *lock)
+{
+	int i = 0;
+
+	for (;;) {
+		struct rseq_abi *expected = NULL, *self = rseq_get_abi();
+
+		if (__atomic_compare_exchange_n(&lock->owner, &expected, self, false,
+						    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+			break;
+		//TODO: use rseq critical section to protect dereference of owner thread's
+		//rseq_abi, combined with rseq fence at thread reclaim.
+		if ((RSEQ_ACCESS_ONCE(expected->sched_state) & RSEQ_SCHED_STATE_ON_CPU) &&
+		    i < RSEQ_MUTEX_MAX_BUSY_LOOP) {
+			rseq_barrier();			/* busy-wait, e.g. cpu_relax(). */
+			i++;
+		} else {
+			//TODO: Enqueue waiter in a wait-queue, and integrate
+			//with sys_futex rather than waiting for 10ms.
+			(void) poll(NULL, 0, 10);	/* wait 10ms */
+		}
+	}
+}
+
+static void rseq_lock(struct rseq_mutex *lock)
+{
+	struct rseq_abi *expected = NULL, *self = rseq_get_abi();
+
+	if (rseq_likely(__atomic_compare_exchange_n(&lock->owner, &expected, self, false,
+						    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)))
+		return;
+	rseq_lock_slowpath(lock);
+}
+
+static void rseq_unlock(struct rseq_mutex *lock)
+{
+	__atomic_store_n(&lock->owner, NULL, __ATOMIC_RELEASE);
+	//TODO: integrate with sys_futex and wakeup oldest waiter.
+}
+
+static
+void *test_thread(void *arg)
+{
+	int i;
+
+	if (rseq_register_current_thread()) {
+		fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+
+	for (i = 0; i < 1000; i++) {
+		int var;
+
+		rseq_lock(&lock);
+		var = RSEQ_READ_ONCE(testvar);
+		if (var) {
+			fprintf(stderr, "Unexpected value %d\n", var);
+			abort();
+		}
+		RSEQ_WRITE_ONCE(testvar, 1);
+		if (!(i % 10))
+			(void) poll(NULL, 0, 10);
+		else
+			rseq_barrier();
+		RSEQ_WRITE_ONCE(testvar, 0);
+		rseq_unlock(&lock);
+	}
+
+	if (rseq_unregister_current_thread()) {
+		fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+			errno, strerror(errno));
+		abort();
+	}
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	int nr_threads = 5;
+	pthread_t test_thread_id[nr_threads];
+	int i;
+
+	for (i = 0; i < nr_threads; i++) {
+		pthread_create(&test_thread_id[i], NULL, test_thread, NULL);
+	}
+
+	for (i = 0; i < nr_threads; i++) {
+		pthread_join(test_thread_id[i], NULL);
+	}
+
+	return 0;
+}
-- 
2.25.1



* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
@ 2023-05-17 16:03   ` Davidlohr Bueso
  2023-05-18 21:49   ` Boqun Feng
  2023-05-19 20:51   ` Noah Goldstein
  2 siblings, 0 replies; 17+ messages in thread
From: Davidlohr Bueso @ 2023-05-17 16:03 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, Florian Weimer

On Wed, 17 May 2023, Mathieu Desnoyers wrote:

>Expose the "on-cpu" state for each thread through struct rseq to allow
>adaptive mutexes to decide more accurately between busy-waiting and
>calling sys_futex() to release the CPU, based on the on-cpu state of the
>mutex owner.

Oh yeah moving the spin stuff out of the kernel is much nicer.

>It is only provided as an optimization hint, because there is no
>guarantee that the page holding this field is faulted-in when the
>scheduler updates it, and therefore the scheduler may very well fail to
>clear the on-cpu state on preemption. This is expected to be rare
>though, and is resolved as soon as the task returns to user-space.
>
>The goal is to improve use-cases where the duration of the critical
>sections for a given lock follows a multi-modal distribution, preventing
>statistical guesses from doing a good job at choosing between busy-wait
>and futex wait behavior.
>
>Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>Cc: Jonathan Corbet <corbet@lwn.net>
>Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>Cc: Carlos O'Donell <carlos@redhat.com>
>Cc: Florian Weimer <fweimer@redhat.com>
>Cc: libc-alpha@sourceware.org
>---
> include/linux/sched.h     | 12 ++++++++++++
> include/uapi/linux/rseq.h | 17 +++++++++++++++++
> kernel/rseq.c             | 14 ++++++++++++++
> 3 files changed, 43 insertions(+)

I.e., previous in-kernel efforts:

  kernel/futex.c             |  675 ++++++++++++++++++++++++++++++++++++++------
  kernel/futex.c             |  572 ++++++++++++++++++++++++++++++++++++-------------

Thanks,
Davidlohr


* Re: [RFC PATCH 0/4] Extend rseq with sched_state field
  2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
                   ` (3 preceding siblings ...)
  2023-05-17 15:26 ` [RFC PATCH 4/4] selftests/rseq: Implement rseq_mutex " Mathieu Desnoyers
@ 2023-05-17 16:07 ` Davidlohr Bueso
  2023-05-17 18:36 ` Steven Rostedt
  5 siblings, 0 replies; 17+ messages in thread
From: Davidlohr Bueso @ 2023-05-17 16:07 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, longman

+Cc Waiman.

On Wed, 17 May 2023, Mathieu Desnoyers wrote:

>This prototype extends struct rseq with a new sched_state field, which
>contains an "on-cpu" flag kept up-to-date by the scheduler.
>
>It is meant to be used by userspace adaptive mutexes to decide between
>busy-wait and futex wait system call (releasing the CPU) behaviors based
>on the current state of the mutex owner.
>
>The goal is to improve use-cases where the duration of the critical
>sections for a given lock follows a multi-modal distribution, preventing
>statistical guesses from doing a good job at choosing between busy-wait
>and futex wait behavior.
>
>This is in response to the LWN coverage (https://lwn.net/Articles/931789/)
>of the "Adaptive spinning in user space" unscheduled slot presented by
>André Almeida at the 2023 Open Source Summit North America.
>
>Feedback is welcome!
>
>Mathieu
>
>Mathieu Desnoyers (4):
>  rseq: Add sched_state field to struct rseq
>  selftests/rseq: Add sched_state rseq field and getter
>  selftests/rseq: Implement sched state test program
>  selftests/rseq: Implement rseq_mutex test program
>
> include/linux/sched.h                         |  12 ++
> include/uapi/linux/rseq.h                     |  17 +++
> kernel/rseq.c                                 |  14 ++
> tools/testing/selftests/rseq/.gitignore       |   2 +
> tools/testing/selftests/rseq/Makefile         |   3 +-
> tools/testing/selftests/rseq/rseq-abi.h       |  17 +++
> tools/testing/selftests/rseq/rseq.h           |   5 +
> tools/testing/selftests/rseq/rseq_mutex.c     | 120 ++++++++++++++++++
> .../testing/selftests/rseq/sched_state_test.c |  71 +++++++++++
> 9 files changed, 260 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/rseq/rseq_mutex.c
> create mode 100644 tools/testing/selftests/rseq/sched_state_test.c
>
>--
>2.25.1
>


* Re: [RFC PATCH 0/4] Extend rseq with sched_state field
  2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
                   ` (4 preceding siblings ...)
  2023-05-17 16:07 ` [RFC PATCH 0/4] Extend rseq with sched_state field Davidlohr Bueso
@ 2023-05-17 18:36 ` Steven Rostedt
  5 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2023-05-17 18:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	libc-alpha, Jonathan Corbet

On Wed, 17 May 2023 11:26:50 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> This prototype extends struct rseq with a new sched_state field, which
> contains an "on-cpu" flag kept up-to-date by the scheduler.
> 
> It is meant to be used by userspace adaptive mutexes to decide between
> busy-wait and futex wait system call (releasing the CPU) behaviors based
> on the current state of the mutex owner.

Woot!

I'm so glad I talked Jon into hanging around for André's impromptu talk.
He's the one that brought up rseq (as he mentioned in his article), and I
guess you read that :-)

Unfortunately, I'm in the process of a lot of trips and
meetings/presentations over the next few weeks, and I will not be able to
look into this. But I hope that André could test it out.

Cheers!

-- Steve


* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
  2023-05-17 16:03   ` Davidlohr Bueso
@ 2023-05-18 21:49   ` Boqun Feng
  2023-05-19 14:15     ` Mathieu Desnoyers
  2023-05-19 20:51   ` Noah Goldstein
  2 siblings, 1 reply; 17+ messages in thread
From: Boqun Feng @ 2023-05-18 21:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Florian Weimer

On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
> Expose the "on-cpu" state for each thread through struct rseq to allow
> adaptive mutexes to decide more accurately between busy-waiting and
> calling sys_futex() to release the CPU, based on the on-cpu state of the
> mutex owner.
> 
> It is only provided as an optimization hint, because there is no
> guarantee that the page holding this field is faulted-in when the
> scheduler updates it, and therefore the scheduler may very well fail to
> clear the on-cpu state on preemption. This is expected to be rare
> though, and is resolved as soon as the task returns to user-space.
> 
> The goal is to improve use-cases where the duration of the critical
> sections for a given lock follows a multi-modal distribution, preventing
> statistical guesses from doing a good job at choosing between busy-wait
> and futex wait behavior.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Carlos O'Donell <carlos@redhat.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: libc-alpha@sourceware.org
> ---
>  include/linux/sched.h     | 12 ++++++++++++
>  include/uapi/linux/rseq.h | 17 +++++++++++++++++
>  kernel/rseq.c             | 14 ++++++++++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eed5d65b8d1f..c7e9248134c1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>  	rseq_handle_notify_resume(ksig, regs);
>  }
>  
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> +
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (t->rseq)
> +		__rseq_set_sched_state(t, state);
> +}
> +
>  /* rseq_preempt() requires preemption to be disabled. */
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>  	rseq_set_notify_resume(t);
> +	rseq_set_sched_state(t, 0);
>  }
>  
>  /* rseq_migrate() requires preemption to be disabled. */
> @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>  				       struct pt_regs *regs)
>  {
>  }
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +}
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  }
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index c233aae5eac9..c6d8537e23ca 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>  		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>  };
>  
> +enum rseq_sched_state {
> +	/*
> +	 * Task is currently running on a CPU if bit is set.
> +	 */
> +	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
> +};
> +
>  /*
>   * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>   * contained within a single cache-line. It is usually declared as
> @@ -148,6 +155,16 @@ struct rseq {
>  	 */
>  	__u32 mm_cid;
>  
> +	/*
> +	 * Restartable sequences sched_state field. Updated by the kernel. Read
> +	 * by user-space with single-copy atomicity semantics. This field can
> +	 * be read by any userspace thread. 32-bit aligned. Contains a

Maybe this is a premature optimization, but since most of the time the
bit would be read by another thread, does it make sense putting the
"sched_state" into a different cache line to avoid false sharing?

Also, we could have a "sched_state_local" and a "sched_state_remote" for
different usages (local reads vs remote reads).
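
Something along these lines is what I am thinking of (layout and names
purely hypothetical):

	struct rseq {
		...
		__u32 sched_state_local;	/* written and read by the owner thread */
		...
		/* On its own cache line, only polled by remote waiters. */
		__u32 sched_state_remote __attribute__((aligned(64)));
	};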

Regards,
Boqun

> +	 * bitmask of enum rseq_sched_state. This field is provided as a hint
> +	 * by the scheduler, and requires that the page holding struct rseq is
> +	 * faulted-in for the state update to be performed by the scheduler.
> +	 */
> +	__u32 sched_state;
> +
>  	/*
>  	 * Flexible array member at end of structure, after last feature field.
>  	 */
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe679..b2eb3bbaa9ef 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>  	u32 cpu_id = raw_smp_processor_id();
>  	u32 node_id = cpu_to_node(cpu_id);
>  	u32 mm_cid = task_mm_cid(t);
> +	u32 sched_state = RSEQ_SCHED_STATE_ON_CPU;
>  
>  	WARN_ON_ONCE((int) mm_cid < 0);
>  	if (!user_write_access_begin(rseq, t->rseq_len))
> @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>  	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
>  	unsafe_put_user(node_id, &rseq->node_id, efault_end);
>  	unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
> +	unsafe_put_user(sched_state, &rseq->sched_state, efault_end);
>  	/*
>  	 * Additional feature fields added after ORIG_RSEQ_SIZE
>  	 * need to be conditionally updated only if
> @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  	force_sigsegv(sig);
>  }
>  
> +/*
> + * Attempt to update rseq scheduler state.
> + */
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (unlikely(t->flags & PF_EXITING))
> +		return;
> +	pagefault_disable();
> +	(void) put_user(state, &t->rseq->sched_state);
> +	pagefault_enable();
> +}
> +
>  #ifdef CONFIG_DEBUG_RSEQ
>  
>  /*
> -- 
> 2.25.1
> 


* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-18 21:49   ` Boqun Feng
@ 2023-05-19 14:15     ` Mathieu Desnoyers
  2023-05-19 17:18       ` Boqun Feng
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-19 14:15 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Florian Weimer

On 2023-05-18 17:49, Boqun Feng wrote:
> On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
[...]
>> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
>> index c233aae5eac9..c6d8537e23ca 100644
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>>   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>>   };
>>   
>> +enum rseq_sched_state {
>> +	/*
>> +	 * Task is currently running on a CPU if bit is set.
>> +	 */
>> +	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
>> +};
[...]
>>   
>> +	/*
>> +	 * Restartable sequences sched_state field. Updated by the kernel. Read
>> +	 * by user-space with single-copy atomicity semantics. This field can
>> +	 * be read by any userspace thread. 32-bit aligned. Contains a
> 
> Maybe this is a premature optimization, but since most of the time the
> bit would be read by another thread, does it make sense putting the
> "sched_state" into a different cache line to avoid false sharing?

I'm puzzled by your optimization proposal, so I'll say it outright: I'm 
probably missing something.

I agree that false-sharing would be an issue if various threads would 
contend for updating any field within this cache line.

But the only thread responsible for updating this cache line's fields is 
the current thread, either from userspace (stores to rseq_abi->rseq_cs) 
or from the kernel (usually on return to userspace, except for this new 
ON_CPU bit clear on context switch).

The other threads busy-waiting on the content of this sched_state field 
will only load from it, never store. And they will only busy-wait on it 
as long as the current task runs. When that task gets preempted, other 
threads will notice the flag change and use sys_futex instead.

So the very worst I can think of in terms of pattern causing cache 
coherency traffic due to false-sharing is if the lock owner happens to 
have lots of rseq critical sections as well, causing it to repeatedly 
store to the rseq_abi->rseq_cs field, which is in the same cache line.

But even then I'm wondering if this really matters, because each of 
those stores to rseq_cs would only slow down loads from other threads 
which will need to retry busy-wait anyway if the on-cpu flag is still set.

So, what am I missing ? Is this heavy use of rseq critical sections 
while the lock is held the scenario you are concerned about ?

Note that the heavy cache-line bouncing in my test-case happens on the 
lock structure (cmpxchg expecting NULL, setting the current thread 
rseq_get_abi() pointer on success). There are probably better ways to 
implement that part, it is currently just a simple prototype showcasing 
the approach.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-19 14:15     ` Mathieu Desnoyers
@ 2023-05-19 17:18       ` Boqun Feng
  2023-05-23 14:10         ` Mathieu Desnoyers
  0 siblings, 1 reply; 17+ messages in thread
From: Boqun Feng @ 2023-05-19 17:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Florian Weimer

On Fri, May 19, 2023 at 10:15:11AM -0400, Mathieu Desnoyers wrote:
> On 2023-05-18 17:49, Boqun Feng wrote:
> > On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote:
> [...]
> > > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> > > index c233aae5eac9..c6d8537e23ca 100644
> > > --- a/include/uapi/linux/rseq.h
> > > +++ b/include/uapi/linux/rseq.h
> > > @@ -37,6 +37,13 @@ enum rseq_cs_flags {
> > >   		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
> > >   };
> > > +enum rseq_sched_state {
> > > +	/*
> > > +	 * Task is currently running on a CPU if bit is set.
> > > +	 */
> > > +	RSEQ_SCHED_STATE_ON_CPU		= (1U << 0),
> > > +};
> [...]
> > > +	/*
> > > +	 * Restartable sequences sched_state field. Updated by the kernel. Read
> > > +	 * by user-space with single-copy atomicity semantics. This field can
> > > +	 * be read by any userspace thread. 32-bit aligned. Contains a
> > 
> > Maybe this is a premature optimization, but since most of the time the
> > bit would be read by another thread, does it make sense putting the
> > "sched_state" into a different cache line to avoid false sharing?
> 
> I'm puzzled by your optimization proposal, so I'll say it outright: I'm
> probably missing something.
> 

Maybe it's me who is missing something ;-)

> I agree that false-sharing would be an issue if various threads would
> contend for updating any field within this cache line.
> 
> But the only thread responsible for updating this cache line's fields is the
> current thread, either from userspace (stores to rseq_abi->rseq_cs) or from
> the kernel (usually on return to userspace, except for this new ON_CPU bit
> clear on context switch).
> 
> The other threads busy-waiting on the content of this sched_state field will
> only load from it, never store. And they will only busy-wait on it as long

But their loads can change the cache line state from "Exclusive" to
"Shared" (using MESI terms), right? And that could delay the stores of
the current thread.

> as the current task runs. When that task gets preempted, other threads will
> notice the flag change and use sys_futex instead.
> 
> So the very worst I can think of in terms of pattern causing cache coherency
> traffic due to false-sharing is if the lock owner happens to have lots of
> rseq critical sections as well, causing it to repeatedly store to the
> rseq_abi->rseq_cs field, which is in the same cache line.
> 
> But even then I'm wondering if this really matters, because each of those
> stores to rseq_cs would only slow down loads from other threads which will
> need to retry busy-wait anyway if the on-cpu flag is still set.
> 
> So, what am I missing ? Is this heavy use of rseq critical sections while
> the lock is held the scenario you are concerned about ?
> 

The case in my mind is the opposite direction: the loads from other
threads delay the stores to rseq_cs on the current thread, which I
assume are usually a fast path. For example:

	CPU 1				CPU 2

	lock(foo); // holding a lock
	rseq_start():
	  <CPU 1 own the cache line exclusively>
	  				lock(foo):
					  <fail to get foo>
					  <check whether the lock owner is on CPU>
					  <cache line becames shared>
	  ->rseq_cs = .. // Need to invalidate the cache line on other CPU

But as you mentioned, there is only one updater here (the current
thread), so maybe it doesn't matter... but since it's a userspace ABI,
I cannot help thinking "what if there is another bit that has a
different usage pattern introduced in the future", so..

> Note that the heavy cache-line bouncing in my test-case happens on the lock
> structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi()
> pointer on success). There are probably better ways to implement that part,
> it is currently just a simple prototype showcasing the approach.
> 

Yeah.. that's a little strange, I guess you can just read the lock
owner's rseq_abi, for example:

	rseq_lock_slowpath() {
		struct rseq_abi *other_rseq = lock->owner;

		if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) {
			...
		}
	}

?

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 


* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
  2023-05-17 16:03   ` Davidlohr Bueso
  2023-05-18 21:49   ` Boqun Feng
@ 2023-05-19 20:51   ` Noah Goldstein
  2023-05-23 12:49     ` Mathieu Desnoyers
  2 siblings, 1 reply; 17+ messages in thread
From: Noah Goldstein @ 2023-05-19 20:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, Florian Weimer

On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> Expose the "on-cpu" state for each thread through struct rseq to allow
> adaptive mutexes to decide more accurately between busy-waiting and
> calling sys_futex() to release the CPU, based on the on-cpu state of the
> mutex owner.
>
> It is only provided as an optimization hint, because there is no
> guarantee that the page holding this field is faulted-in when the
> scheduler updates it, and therefore the scheduler may very well fail to
> clear the on-cpu state on preemption. This is expected to be rare
> though, and is resolved as soon as the task returns to user-space.
>
> The goal is to improve use-cases where the duration of the critical
> sections for a given lock follows a multi-modal distribution, preventing
> statistical guesses from doing a good job at choosing between busy-wait
> and futex wait behavior.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> Cc: Carlos O'Donell <carlos@redhat.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: libc-alpha@sourceware.org
> ---
>  include/linux/sched.h     | 12 ++++++++++++
>  include/uapi/linux/rseq.h | 17 +++++++++++++++++
>  kernel/rseq.c             | 14 ++++++++++++++
>  3 files changed, 43 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eed5d65b8d1f..c7e9248134c1 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>         rseq_handle_notify_resume(ksig, regs);
>  }
>
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> +
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +       if (t->rseq)
> +               __rseq_set_sched_state(t, state);
> +}
> +
>  /* rseq_preempt() requires preemption to be disabled. */
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>         __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>         rseq_set_notify_resume(t);
> +       rseq_set_sched_state(t, 0);

Should rseq_migrate also be made to update the cpu_id of the new core?
I imagine the usage of this will be something along the lines of:

if(!on_cpu(mutex->owner_rseq_struct) &&
   cpu(mutex->owner_rseq_struct) == this_threads_cpu)
   // goto futex

So I would think updating on migrate would be useful as well.


>  }
>
>  /* rseq_migrate() requires preemption to be disabled. */
> @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>                                        struct pt_regs *regs)
>  {
>  }
> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +}
>  static inline void rseq_preempt(struct task_struct *t)
>  {
>  }
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index c233aae5eac9..c6d8537e23ca 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -37,6 +37,13 @@ enum rseq_cs_flags {
>                 (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
>  };
>
> +enum rseq_sched_state {
> +       /*
> +        * Task is currently running on a CPU if bit is set.
> +        */
> +       RSEQ_SCHED_STATE_ON_CPU         = (1U << 0),
> +};
> +
>  /*
>   * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>   * contained within a single cache-line. It is usually declared as
> @@ -148,6 +155,16 @@ struct rseq {
>          */
>         __u32 mm_cid;
>
> +       /*
> +        * Restartable sequences sched_state field. Updated by the kernel. Read
> +        * by user-space with single-copy atomicity semantics. This field can
> +        * be read by any userspace thread. 32-bit aligned. Contains a
> +        * bitmask of enum rseq_sched_state. This field is provided as a hint
> +        * by the scheduler, and requires that the page holding struct rseq is
> +        * faulted-in for the state update to be performed by the scheduler.
> +        */
> +       __u32 sched_state;
> +
>         /*
>          * Flexible array member at end of structure, after last feature field.
>          */
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe679..b2eb3bbaa9ef 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>         u32 cpu_id = raw_smp_processor_id();
>         u32 node_id = cpu_to_node(cpu_id);
>         u32 mm_cid = task_mm_cid(t);
> +       u32 sched_state = RSEQ_SCHED_STATE_ON_CPU;
>
>         WARN_ON_ONCE((int) mm_cid < 0);
>         if (!user_write_access_begin(rseq, t->rseq_len))
> @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>         unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
>         unsafe_put_user(node_id, &rseq->node_id, efault_end);
>         unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
> +       unsafe_put_user(sched_state, &rseq->sched_state, efault_end);
>         /*
>          * Additional feature fields added after ORIG_RSEQ_SIZE
>          * need to be conditionally updated only if
> @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>         force_sigsegv(sig);
>  }
>
> +/*
> + * Attempt to update rseq scheduler state.
> + */
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +       if (unlikely(t->flags & PF_EXITING))
> +               return;
> +       pagefault_disable();
> +       (void) put_user(state, &t->rseq->sched_state);
> +       pagefault_enable();
> +}
> +
>  #ifdef CONFIG_DEBUG_RSEQ
>
>  /*
> --
> 2.25.1
>


* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-19 20:51   ` Noah Goldstein
@ 2023-05-23 12:49     ` Mathieu Desnoyers
  2023-05-23 16:32       ` Noah Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-23 12:49 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, Florian Weimer

On 2023-05-19 16:51, Noah Goldstein wrote:
> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> Expose the "on-cpu" state for each thread through struct rseq to allow
>> adaptive mutexes to decide more accurately between busy-waiting and
>> calling sys_futex() to release the CPU, based on the on-cpu state of the
>> mutex owner.
>>
>> It is only provided as an optimization hint, because there is no
>> guarantee that the page holding this field is faulted-in when the
>> scheduler updates it, and therefore the scheduler may very well fail to
>> clear the on-cpu state on preemption. This is expected to be rare
>> though, and is resolved as soon as the task returns to user-space.
>>
>> The goal is to improve use-cases where the duration of the critical
>> sections for a given lock follows a multi-modal distribution, preventing
>> statistical guesses from doing a good job at choosing between busy-wait
>> and futex wait behavior.
>>
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>> Cc: Carlos O'Donell <carlos@redhat.com>
>> Cc: Florian Weimer <fweimer@redhat.com>
>> Cc: libc-alpha@sourceware.org
>> ---
>>   include/linux/sched.h     | 12 ++++++++++++
>>   include/uapi/linux/rseq.h | 17 +++++++++++++++++
>>   kernel/rseq.c             | 14 ++++++++++++++
>>   3 files changed, 43 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index eed5d65b8d1f..c7e9248134c1 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>>          rseq_handle_notify_resume(ksig, regs);
>>   }
>>
>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
>> +
>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
>> +{
>> +       if (t->rseq)
>> +               __rseq_set_sched_state(t, state);
>> +}
>> +
>>   /* rseq_preempt() requires preemption to be disabled. */
>>   static inline void rseq_preempt(struct task_struct *t)
>>   {
>>          __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>>          rseq_set_notify_resume(t);
>> +       rseq_set_sched_state(t, 0);
> 
> Should rseq_migrate also be made to update the cpu_id of the new core?
> I imagine the usage of this will be something along the lines of:
> 
> if(!on_cpu(mutex->owner_rseq_struct) &&
>     cpu(mutex->owner_rseq_struct) == this_threads_cpu)
>     // goto futex
> 
> So I would think updating on migrate would be useful as well.

I don't think we want to act differently based on the cpu on which the 
owner is queued.

If the mutex owner is not on-cpu, and queued on the same cpu as the 
current thread, we indeed want to call sys_futex WAIT.

If the mutex owner is not on-cpu, but queued on a different cpu than the 
current thread, we *still* want to call sys_futex WAIT, because 
busy-waiting for a thread which is queued but not currently running is 
wasteful.
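
In other words, the decision I have in mind keys off the on-cpu bit
only, along the lines of this sketch (busy_wait() and futex_wait() are
placeholders):

	uint32_t state = RSEQ_ACCESS_ONCE(owner_rseq->sched_state);

	if (state & RSEQ_SCHED_STATE_ON_CPU)
		busy_wait();	/* owner is running right now */
	else
		futex_wait();	/* owner queued or blocked, on whatever CPU */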

Or am I missing something ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-19 17:18       ` Boqun Feng
@ 2023-05-23 14:10         ` Mathieu Desnoyers
  0 siblings, 0 replies; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-23 14:10 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	H . Peter Anvin, Paul Turner, linux-api, Christian Brauner,
	Florian Weimer, David.Laight, carlos, Peter Oskolkov,
	Alexander Mikhalitsyn, Chris Kennelly, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida, libc-alpha, Steven Rostedt,
	Jonathan Corbet, Florian Weimer

On 2023-05-19 13:18, Boqun Feng wrote:
[...]
> 
> The case in my mind is the opposite direction: the loads from other
> threads delay the stores to rseq_cs on the current thread, which I
> assume are usually a fast path. For example:

Yes, OK, you are correct. And I just validated on my end that busy-waiting
repeatedly loading from a cache line does slow down the concurrent stores
to other variables on that cache line significantly (at least on my
Intel(R) Core(TM) i7-8650U). Small reproducer provided at the end of
this email. Results:

compudj@thinkos:~/test$ time ./test-cacheline -d
thread id : 140242706274048, pid 16940
thread id : 140242697881344, pid 16940

real	0m4.145s
user	0m8.289s
sys	0m0.000s

compudj@thinkos:~/test$ time ./test-cacheline -s
thread id : 139741482387200, pid 16950
thread id : 139741473994496, pid 16950

real	0m4.573s
user	0m9.147s
sys	0m0.000s


> 
> 	CPU 1				CPU 2
> 
> 	lock(foo); // holding a lock
> 	rseq_start():
> 	  <CPU 1 own the cache line exclusively>
> 	  				lock(foo):
> 					  <fail to get foo>
> 					  <check whether the lock owner is on CPU>
> 					  <cache line becames shared>
> 	  ->rseq_cs = .. // Need to invalidate the cache line on other CPU
> 
> But as you mentioned, there is only one updater here (the current
> thread), so maybe it doesn't matter... but since it's a userspace ABI,
> I cannot help thinking "what if there is another bit that has a
> different usage pattern introduced in the future", so..

Yes, however we have to be careful about how we introduce this considering
that the rseq feature extensions are "append only" to the structure feature
size exported by the kernel to userspace through getauxval(3).
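
For instance, userspace would be expected to gate its use of a new field
on the exported feature size, roughly as follows (sketch; assumes
sched_state is appended after mm_cid as in this series, and that
AT_RSEQ_FEATURE_SIZE from <linux/auxvec.h> is available):

	#include <stdbool.h>
	#include <stddef.h>
	#include <sys/auxv.h>

	static bool have_sched_state(void)
	{
		unsigned long feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);

		/* Usable only if covered by the kernel's feature size. */
		return feature_size >= offsetof(struct rseq_abi, sched_state)
				       + sizeof(__u32);
	}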

So if we decide that we create a big hole right in the middle of the rseq_abi
for cacheline alignment, that's a possibility, but we'd really be wasting an
entire cacheline for a single bit.

Another possibility would be to add a level of indirection: we could have a field
in struct rseq which is either a pointer or offset from the thread_pointer() to
the on-cpu bit, which would sit in a different cache line. It would be up to
glibc to allocate space for it, possibly at the end of the rseq_abi field.
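
Roughly (hypothetical sketch, not a concrete ABI proposal):

	struct rseq {
		...
		/*
		 * Offset from thread_pointer() to a __u32 holding the
		 * on-cpu bit, placed by the allocator (e.g. glibc) on a
		 * cache line not shared with rseq_cs.
		 */
		__s32 sched_state_offset;
	};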

> 
>> Note that the heavy cache-line bouncing in my test-case happens on the lock
>> structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi()
>> pointer on success). There are probably better ways to implement that part,
>> it is currently just a simple prototype showcasing the approach.
>>
> 
> Yeah.. that's a little strange, I guess you can just read the lock
> owner's rseq_abi, for example:
> 
> 	rseq_lock_slowpath() {
> 		struct rseq_abi *other_rseq = lock->owner;
> 
> 		if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) {
> 			...
> 		}
> 	}

Yes, I don't think the load of the owner pointer needs to be part of the
cmpxchg per se. It could be done with a plain load on the slow path.

This way we would not require that the owner id and the lock state share
the same content, which would allow much more freedom for the fast-path
semantics.
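
For instance (a sketch reusing rseq_get_abi() and RSEQ_ACCESS_ONCE()
from the selftests headers; the sched_state field and flag name follow
this series, the rest is illustrative):

#include <stdbool.h>
#include <stdint.h>
#include "rseq.h"	/* selftests header */

struct rseq_lock {
	uint32_t state;			/* 0 = free, 1 = locked */
	struct rseq_abi *owner;		/* hint only, written by the holder */
};

static bool rseq_lock_trylock(struct rseq_lock *lock)
{
	uint32_t expected = 0;

	if (!__atomic_compare_exchange_n(&lock->state, &expected, 1, false,
				__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return false;
	/* Plain store, decoupled from the atomic: good enough for a hint. */
	__atomic_store_n(&lock->owner, rseq_get_abi(), __ATOMIC_RELAXED);
	return true;
}

static bool rseq_lock_owner_on_cpu(struct rseq_lock *lock)
{
	struct rseq_abi *owner = __atomic_load_n(&lock->owner, __ATOMIC_RELAXED);

	return owner && (RSEQ_ACCESS_ONCE(owner->sched_state) &
			RSEQ_SCHED_STATE_ON_CPU);
}

The short window where lock->owner is stale right after the cmpxchg is
harmless for a spin-vs-wait heuristic.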

Thanks,

Mathieu

> 
> ?
> 
> Regards,
> Boqun
> 
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

Reproducer:

/*
  * cacheline testing (exclusive vs shared store speed)
  *
  * build with gcc -O2 -pthread -o test-cacheline test-cacheline.c
  *
  * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  * License: MIT
  */

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <rseq/rseq.h>

#define NR_THREADS 2

/* aligned(256) keeps "test" and "test2" on distinct cache lines. */
struct test {
	int a;
	int b;
} __attribute__((aligned(256)));

enum testcase {
	TEST_SAME_CACHELINE,
	TEST_OTHER_CACHELINE,
};

static enum testcase testcase;
static int test_stop, test_go;
static struct test test, test2;

static
void *testthread(void *arg)
{
	long nr = (long)arg;

         printf("thread id : %lu, pid %lu\n", pthread_self(), getpid());

	__atomic_add_fetch(&test_go, 1, __ATOMIC_RELAXED);
	while (RSEQ_READ_ONCE(test_go) < NR_THREADS)
		rseq_barrier();
	if (nr == 0) {
		switch (testcase) {
		case TEST_SAME_CACHELINE:
			while (!RSEQ_READ_ONCE(test_stop))
				(void) RSEQ_READ_ONCE(test.a);
			break;
		case TEST_OTHER_CACHELINE:
			while (!RSEQ_READ_ONCE(test_stop))
				(void) RSEQ_READ_ONCE(test2.a);
			break;
		}
	} else if (nr == 1) {
		unsigned long long i;

		for (i = 0; i < 16000000000ULL; i++)
			RSEQ_WRITE_ONCE(test.b, i);
		RSEQ_WRITE_ONCE(test_stop, 1);
	}
	return NULL;
}

static
void show_usage(char **argv)
{
	fprintf(stderr, "Usage: %s <OPTIONS>\n", argv[0]);
	fprintf(stderr, "OPTIONS:\n");
	fprintf(stderr, "	[-s] Same cacheline\n");
	fprintf(stderr, "	[-d] Different cacheline\n");
}

static
int parse_args(int argc, char **argv)
{
	if (argc != 2 || argv[1][0] != '-') {
		show_usage(argv);
		return -1;
	}
	switch (argv[1][1]) {
	case 's':
		testcase = TEST_SAME_CACHELINE;
		break;
	case 'd':
		testcase = TEST_OTHER_CACHELINE;
		break;
	default:
		show_usage(argv);
		return -1;
	}
	return 0;
}

int main(int argc, char **argv)
{
	pthread_t testid[NR_THREADS];
	void *tret;
	int i, err;

	if (parse_args(argc, argv))
		exit(1);

	for (i = 0; i < NR_THREADS; i++) {
		err = pthread_create(&testid[i], NULL, testthread,
				(void *)(long)i);
		if (err != 0)
			exit(1);
	}

	for (i = 0; i < NR_THREADS; i++) {
		err = pthread_join(testid[i], &tret);
		if (err != 0)
			exit(1);
	}

	return 0;
}



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-23 12:49     ` Mathieu Desnoyers
@ 2023-05-23 16:32       ` Noah Goldstein
  2023-05-23 17:30         ` Mathieu Desnoyers
  0 siblings, 1 reply; 17+ messages in thread
From: Noah Goldstein @ 2023-05-23 16:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, Florian Weimer

On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2023-05-19 16:51, Noah Goldstein wrote:
> > On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> > <libc-alpha@sourceware.org> wrote:
> >>
> >> Expose the "on-cpu" state for each thread through struct rseq to allow
> >> adaptative mutexes to decide more accurately between busy-waiting and
> >> calling sys_futex() to release the CPU, based on the on-cpu state of the
> >> mutex owner.
> >>
> >> It is only provided as an optimization hint, because there is no
> >> guarantee that the page containing this field is in the page cache, and
> >> therefore the scheduler may very well fail to clear the on-cpu state on
> >> preemption. This is expected to be rare though, and is resolved as soon
> >> as the task returns to user-space.
> >>
> >> The goal is to improve use-cases where the duration of the critical
> >> sections for a given lock follows a multi-modal distribution, preventing
> >> statistical guesses from doing a good job at choosing between busy-wait
> >> and futex wait behavior.
> >>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >> Cc: Jonathan Corbet <corbet@lwn.net>
> >> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> >> Cc: Carlos O'Donell <carlos@redhat.com>
> >> Cc: Florian Weimer <fweimer@redhat.com>
> >> Cc: libc-alpha@sourceware.org
> >> ---
> >>   include/linux/sched.h     | 12 ++++++++++++
> >>   include/uapi/linux/rseq.h | 17 +++++++++++++++++
> >>   kernel/rseq.c             | 14 ++++++++++++++
> >>   3 files changed, 43 insertions(+)
> >>
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index eed5d65b8d1f..c7e9248134c1 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
> >>          rseq_handle_notify_resume(ksig, regs);
> >>   }
> >>
> >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> >> +
> >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> >> +{
> >> +       if (t->rseq)
> >> +               __rseq_set_sched_state(t, state);
> >> +}
> >> +
> >>   /* rseq_preempt() requires preemption to be disabled. */
> >>   static inline void rseq_preempt(struct task_struct *t)
> >>   {
> >>          __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> >>          rseq_set_notify_resume(t);
> >> +       rseq_set_sched_state(t, 0);
> >
> > Should rseq_migrate also be made to update the cpu_id of the new core?
> > I imagine the usage of this will be something along the lines of:
> >
> > if(!on_cpu(mutex->owner_rseq_struct) &&
> >     cpu(mutex->owner_rseq_struct) == this_threads_cpu)
> >     // goto futex
> >
> > So I would think updating on migrate would be useful as well.
>
> I don't think we want to act differently based on the cpu on which the
> owner is queued.
>
> If the mutex owner is not on-cpu, and queued on the same cpu as the
> current thread, we indeed want to call sys_futex WAIT.
>
> If the mutex owner is not on-cpu, but queued on a different cpu than the
> current thread, we *still* want to call sys_futex WAIT, because
> busy-waiting for a thread which is queued but not currently running is
> wasteful.
>
I think this is less clear. In some cases, sure, but not always. Going
to the futex has more latency than userland waits, and if the system is
not busy (other than the one process) most likely less latency than
yield. Also, going to the futex requires a syscall on unlock.

For example, if the critical section is expected to be very small, it
would be easy to imagine the lock being better implemented with:
while(is_locked)
  if (owner->on_cpu || owner->cpu != my_cpu)
    exponential backoff
  else
    yield

It's not that "just go to futex" doesn't ever make sense, but I don't
think it's fair to say that's *always* the case.

Looking at the kernel code, it doesn't seem to be a particularly high
cost to keep the CPU field updated during migration, so it seems like a
"why not?" kind of question.
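
Rendered as C against the rseq_lock layout sketched earlier in the
thread (the backoff cap is an arbitrary illustration; cpu_id is the
existing rseq field):

#include <sched.h>
#include <stdint.h>

static void wait_for_owner(struct rseq_lock *lock, uint32_t my_cpu)
{
	unsigned int backoff = 1;

	while (__atomic_load_n(&lock->state, __ATOMIC_RELAXED) != 0) {
		struct rseq_abi *owner = __atomic_load_n(&lock->owner,
				__ATOMIC_RELAXED);

		if (!owner ||
		    (RSEQ_ACCESS_ONCE(owner->sched_state) &
		     RSEQ_SCHED_STATE_ON_CPU) ||
		    RSEQ_ACCESS_ONCE(owner->cpu_id) != my_cpu) {
			/* Owner running, or runnable elsewhere: back off. */
			for (unsigned int i = 0; i < backoff; i++)
				rseq_barrier();
			if (backoff < 1024)
				backoff <<= 1;
		} else {
			/* Owner preempted on our own CPU: let it run. */
			sched_yield();
		}
	}
}
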
> Or am I missing something ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-23 16:32       ` Noah Goldstein
@ 2023-05-23 17:30         ` Mathieu Desnoyers
  2023-05-23 20:10           ` Noah Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Mathieu Desnoyers @ 2023-05-23 17:30 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, Florian Weimer

On 2023-05-23 12:32, Noah Goldstein wrote:
> On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2023-05-19 16:51, Noah Goldstein wrote:
>>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>>
>>>> Expose the "on-cpu" state for each thread through struct rseq to allow
>>>> adaptative mutexes to decide more accurately between busy-waiting and
>>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
>>>> mutex owner.
>>>>
>>>> It is only provided as an optimization hint, because there is no
>>>> guarantee that the page containing this field is in the page cache, and
>>>> therefore the scheduler may very well fail to clear the on-cpu state on
>>>> preemption. This is expected to be rare though, and is resolved as soon
>>>> as the task returns to user-space.
>>>>
>>>> The goal is to improve use-cases where the duration of the critical
>>>> sections for a given lock follows a multi-modal distribution, preventing
>>>> statistical guesses from doing a good job at choosing between busy-wait
>>>> and futex wait behavior.
>>>>
>>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
>>>> Cc: Carlos O'Donell <carlos@redhat.com>
>>>> Cc: Florian Weimer <fweimer@redhat.com>
>>>> Cc: libc-alpha@sourceware.org
>>>> ---
>>>>    include/linux/sched.h     | 12 ++++++++++++
>>>>    include/uapi/linux/rseq.h | 17 +++++++++++++++++
>>>>    kernel/rseq.c             | 14 ++++++++++++++
>>>>    3 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index eed5d65b8d1f..c7e9248134c1 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
>>>>           rseq_handle_notify_resume(ksig, regs);
>>>>    }
>>>>
>>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
>>>> +
>>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
>>>> +{
>>>> +       if (t->rseq)
>>>> +               __rseq_set_sched_state(t, state);
>>>> +}
>>>> +
>>>>    /* rseq_preempt() requires preemption to be disabled. */
>>>>    static inline void rseq_preempt(struct task_struct *t)
>>>>    {
>>>>           __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>>>>           rseq_set_notify_resume(t);
>>>> +       rseq_set_sched_state(t, 0);
>>>
>>> Should rseq_migrate also be made to update the cpu_id of the new core?
>>> I imagine the usage of this will be something along the lines of:
>>>
>>> if(!on_cpu(mutex->owner_rseq_struct) &&
>>>      cpu(mutex->owner_rseq_struct) == this_threads_cpu)
>>>      // goto futex
>>>
>>> So I would think updating on migrate would be useful as well.
>>
>> I don't think we want to act differently based on the cpu on which the
>> owner is queued.
>>
>> If the mutex owner is not on-cpu, and queued on the same cpu as the
>> current thread, we indeed want to call sys_futex WAIT.
>>
>> If the mutex owner is not on-cpu, but queued on a different cpu than the
>> current thread, we *still* want to call sys_futex WAIT, because
>> busy-waiting for a thread which is queued but not currently running is
>> wasteful.
>>
> I think this is less clear. In some cases, sure, but not always. Going
> to the futex has more latency than userland waits, and if the system is
> not busy (other than the one process) most likely less latency than
> yield. Also, going to the futex requires a syscall on unlock.
> 
> For example, if the critical section is expected to be very small, it
> would be easy to imagine the lock being better implemented with:
> while(is_locked)
>    if (owner->on_cpu || owner->cpu != my_cpu)
>      exponential backoff
>    else
>      yield
> 
> It's not that "just go to futex" doesn't ever make sense, but I don't
> think it's fair to say that's *always* the case.
> 
> Looking at the kernel code, it doesn't seem to be a particularly high
> cost to keep the CPU field updated during migration, so it seems like a
> "why not?" kind of question.

We already have the owner rseq_abi cpu_id field populated on every 
return-to-userspace. I wonder if it's really relevant that migration 
populates an updated value in this field immediately ? It's another case 
where this would be provided as a hint updated only if the struct rseq 
is in the page cache, because AFAIU the scheduler migration path cannot 
take a page fault.

Also, if a thread bounces around many runqueues before being scheduled 
again, we would be adding those useless stores to the rseq_abi structure 
at each migration between runqueues.

Given this would add some complexity to the scheduler migration code, I 
would want to see metrics/benchmarks showing that it indeed improves 
real-world use-cases before adding this to the rseq ABI.

It's not only a question of added lines of code as of today, but also a 
question of added userspace ABI guarantees which can prevent future 
scheduler optimizations. I'm *very* careful about keeping those to a 
strict minimum, which I hope Peter Zijlstra appreciates.

Thanks,

Mathieu


>> Or am I missing something ?
>>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq
  2023-05-23 17:30         ` Mathieu Desnoyers
@ 2023-05-23 20:10           ` Noah Goldstein
  0 siblings, 0 replies; 17+ messages in thread
From: Noah Goldstein @ 2023-05-23 20:10 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David.Laight, carlos,
	Peter Oskolkov, Alexander Mikhalitsyn, Chris Kennelly,
	Ingo Molnar, Darren Hart, Davidlohr Bueso, André Almeida,
	libc-alpha, Steven Rostedt, Jonathan Corbet, Florian Weimer

On Tue, May 23, 2023 at 12:30 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2023-05-23 12:32, Noah Goldstein wrote:
> > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> On 2023-05-19 16:51, Noah Goldstein wrote:
> >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha
> >>> <libc-alpha@sourceware.org> wrote:
> >>>>
> >>>> Expose the "on-cpu" state for each thread through struct rseq to allow
> >>>> adaptative mutexes to decide more accurately between busy-waiting and
> >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the
> >>>> mutex owner.
> >>>>
> >>>> It is only provided as an optimization hint, because there is no
> >>>> guarantee that the page containing this field is in the page cache, and
> >>>> therefore the scheduler may very well fail to clear the on-cpu state on
> >>>> preemption. This is expected to be rare though, and is resolved as soon
> >>>> as the task returns to user-space.
> >>>>
> >>>> The goal is to improve use-cases where the duration of the critical
> >>>> sections for a given lock follows a multi-modal distribution, preventing
> >>>> statistical guesses from doing a good job at choosing between busy-wait
> >>>> and futex wait behavior.
> >>>>
> >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> >>>> Cc: Jonathan Corbet <corbet@lwn.net>
> >>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
> >>>> Cc: Carlos O'Donell <carlos@redhat.com>
> >>>> Cc: Florian Weimer <fweimer@redhat.com>
> >>>> Cc: libc-alpha@sourceware.org
> >>>> ---
> >>>>    include/linux/sched.h     | 12 ++++++++++++
> >>>>    include/uapi/linux/rseq.h | 17 +++++++++++++++++
> >>>>    kernel/rseq.c             | 14 ++++++++++++++
> >>>>    3 files changed, 43 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >>>> index eed5d65b8d1f..c7e9248134c1 100644
> >>>> --- a/include/linux/sched.h
> >>>> +++ b/include/linux/sched.h
> >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
> >>>>           rseq_handle_notify_resume(ksig, regs);
> >>>>    }
> >>>>
> >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
> >>>> +
> >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
> >>>> +{
> >>>> +       if (t->rseq)
> >>>> +               __rseq_set_sched_state(t, state);
> >>>> +}
> >>>> +
> >>>>    /* rseq_preempt() requires preemption to be disabled. */
> >>>>    static inline void rseq_preempt(struct task_struct *t)
> >>>>    {
> >>>>           __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> >>>>           rseq_set_notify_resume(t);
> >>>> +       rseq_set_sched_state(t, 0);
> >>>
> >>> Should rseq_migrate also be made to update the cpu_id of the new core?
> >>> I imagine the usage of this will be something along the lines of:
> >>>
> >>> if(!on_cpu(mutex->owner_rseq_struct) &&
> >>>      cpu(mutex->owner_rseq_struct) == this_threads_cpu)
> >>>      // goto futex
> >>>
> >>> So I would think updating on migrate would be useful as well.
> >>
> >> I don't think we want to act differently based on the cpu on which the
> >> owner is queued.
> >>
> >> If the mutex owner is not on-cpu, and queued on the same cpu as the
> >> current thread, we indeed want to call sys_futex WAIT.
> >>
> >> If the mutex owner is not on-cpu, but queued on a different cpu than the
> >> current thread, we *still* want to call sys_futex WAIT, because
> >> busy-waiting for a thread which is queued but not currently running is
> >> wasteful.
> >>
> > I think this is less clear. In some cases, sure, but not always. Going
> > to the futex has more latency than userland waits, and if the system is
> > not busy (other than the one process) most likely less latency than
> > yield. Also, going to the futex requires a syscall on unlock.
> >
> > For example, if the critical section is expected to be very small, it
> > would be easy to imagine the lock being better implemented with:
> > while(is_locked)
> >    if (owner->on_cpu || owner->cpu != my_cpu)
> >      exponential backoff
> >    else
> >      yield
> >
> > It's not that "just go to futex" doesn't ever make sense, but I don't
> > think it's fair to say that's *always* the case.
> >
> > Looking at the kernel code, it doesn't seem to be a particularly high
> > cost to keep the CPU field updated during migration, so it seems like a
> > "why not?" kind of question.
>
> We already have the owner rseq_abi cpu_id field populated on every
> return-to-userspace. I wonder if it's really relevant that migration
> populates an updated value in this field immediately ? It's another case
> where this would be provided as a hint updated only if the struct rseq
> is in the page cache, because AFAIU the scheduler migration path cannot
> take a page fault.
>

Ah, that's a good point. And presumably the probability that the page is in
the cache goes down a fair bit as the task stays idle / bounced around for
longer.

> Also, if a thread bounces around many runqueues before being scheduled
> again, we would be adding those useless stores to the rseq_abi structure
> at each migration between runqueues.
>
> Given this would add some complexity to the scheduler migration code, I
> would want to see metrics/benchmarks showing that it indeed improves
> real-world use-cases before adding this to the rseq ABI.
>
> It's not only a question of added lines of code as of today, but also a
> question of added userspace ABI guarantees which can prevent future
> scheduler optimizations. I'm *very* careful about keeping those to a
> strict minimum, which I hope Peter Zijlstra appreciates.

Well, this entire thing is more of a hint than a guarantee. Even on_cpu
is only updated if the page happens to be in the page cache, so I don't
see how you could ever *have* to do anything.

But fair enough; I thought I'd throw the idea out there, and there seem to
be enough valid concerns to make it not such a good idea.
>
> Thanks,
>
> Mathieu
>
>
> >> Or am I missing something ?
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> https://www.efficios.com
> >>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-05-23 20:11 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-17 15:26 [RFC PATCH 0/4] Extend rseq with sched_state field Mathieu Desnoyers
2023-05-17 15:26 ` [RFC PATCH 1/4] rseq: Add sched_state field to struct rseq Mathieu Desnoyers
2023-05-17 16:03   ` Davidlohr Bueso
2023-05-18 21:49   ` Boqun Feng
2023-05-19 14:15     ` Mathieu Desnoyers
2023-05-19 17:18       ` Boqun Feng
2023-05-23 14:10         ` Mathieu Desnoyers
2023-05-19 20:51   ` Noah Goldstein
2023-05-23 12:49     ` Mathieu Desnoyers
2023-05-23 16:32       ` Noah Goldstein
2023-05-23 17:30         ` Mathieu Desnoyers
2023-05-23 20:10           ` Noah Goldstein
2023-05-17 15:26 ` [RFC PATCH 2/4] selftests/rseq: Add sched_state rseq field and getter Mathieu Desnoyers
2023-05-17 15:26 ` [RFC PATCH 3/4] selftests/rseq: Implement sched state test program Mathieu Desnoyers
2023-05-17 15:26 ` [RFC PATCH 4/4] selftests/rseq: Implement rseq_mutex " Mathieu Desnoyers
2023-05-17 16:07 ` [RFC PATCH 0/4] Extend rseq with sched_state field Davidlohr Bueso
2023-05-17 18:36 ` Steven Rostedt
