linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3 v0.2] RFC: sched/UMCG
@ 2021-07-08 19:46 Peter Oskolkov
  2021-07-08 19:46 ` [RFC PATCH 1/3 v0.2] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-08 19:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrei Vagin, Jim Newsome, Jann Horn

This is another attempt at implementing UMCG, based on
discussion in https://lore.kernel.org/patchwork/cover/1433967/

Most of the "why" is covered here (some details are obsolete):
https://lore.kernel.org/patchwork/cover/1433967/#1632328

At a high level, UMCG servers/workers provide the foundation
for an M:N threading library, as described in the link above.

In addition, servers without workers can be used as "basic"
UMCG tasks if wait/wake/context-switch are the only desired
operations.

Joel Fernandes has also once mentioned that he had a use case
for a wake+bring-the-wakee-to-the-current-CPU operation,
so this is now also supported via UMCG_WF_CURRENT_CPU flag
(patch 3).
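
As a sketch of the flag contract for sys_umcg_wait() (the flag
values are the ones defined in patch 3; the validation helper
itself is my illustration of the documented rule that
UMCG_WF_CURRENT_CPU requires UMCG_WAIT_WAKE_ONLY, not code from
the patch):

```c
#include <assert.h>
#include <stdbool.h>

#define UMCG_WAIT_WAKE_ONLY	1	/* from include/uapi/linux/umcg.h */
#define UMCG_WF_CURRENT_CPU	2

/* Illustrative only: checks the documented sys_umcg_wait() flag rules. */
static bool umcg_wait_flags_valid(unsigned int flags)
{
	if (flags & ~(UMCG_WAIT_WAKE_ONLY | UMCG_WF_CURRENT_CPU))
		return false;	/* unknown flag bits */
	if ((flags & UMCG_WF_CURRENT_CPU) && !(flags & UMCG_WAIT_WAKE_ONLY))
		return false;	/* CURRENT_CPU is only valid with WAKE_ONLY */
	return true;
}
```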

Patch 1: add WF_CURRENT_CPU and tweak ttwu - same as last time
Patch 2: add X86_64 helpers to work atomically with userspace values
Patch 3: implement UMCG kernel-side

In this version of the patchset I used only userspace/TLS
data, as suggested by Peter Zijlstra. With the exception
of one issue (see patch 3 commit message) everything seems
to be working great.

This TLS-only approach makes the userspace code a bit more
involved, so I'm not posting libumcg/selftests with this
patchset, to keep the focus on the kernel side.


TODO:
  - put atomic helpers from patch 2 into their proper place (unless
    keeping them in kernel/sched/umcg.h is OK)
  - fix the wake server issue in preempt disable block (see patch 3)
  - implement timeout handling
  - implement worker preemption
  - more testing
  - manpages, docs, and similar
  - attach libumcg and selftest patches


Peter Oskolkov (3):
  sched: add WF_CURRENT_CPU and externise ttwu
  sched/umcg: RFC: add userspace atomic helpers
  sched/umcg: RFC: implement UMCG syscalls

 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/sched.h                  |   6 +
 include/linux/syscalls.h               |   4 +
 include/uapi/asm-generic/unistd.h      |   8 +-
 include/uapi/linux/umcg.h              | 246 +++++++++++++
 init/Kconfig                           |  10 +
 kernel/exit.c                          |   7 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |  20 +-
 kernel/sched/fair.c                    |   4 +
 kernel/sched/sched.h                   |  15 +-
 kernel/sched/umcg.c                    | 481 +++++++++++++++++++++++++
 kernel/sched/umcg.h                    | 271 ++++++++++++++
 kernel/sys_ni.c                        |   4 +
 14 files changed, 1068 insertions(+), 11 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c
 create mode 100644 kernel/sched/umcg.h

--
2.25.1



* [RFC PATCH 1/3 v0.2] sched: add WF_CURRENT_CPU and externise ttwu
  2021-07-08 19:46 [RFC PATCH 0/3 v0.2] RFC: sched/UMCG Peter Oskolkov
@ 2021-07-08 19:46 ` Peter Oskolkov
  2021-07-08 19:46 ` [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers Peter Oskolkov
  2021-07-08 19:46 ` [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls Peter Oskolkov
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-08 19:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrei Vagin, Jim Newsome, Jann Horn

Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases such as UMCG.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of kernel/sched/core.c.
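
The effect on CPU selection can be modeled in userspace as
follows (a sketch only: the WF_* values match kernel/sched/sched.h
after this patch, but the cpumask is reduced to a plain bitmask
and the fallback path is stubbed out):

```c
#include <assert.h>

#define WF_TTWU		0x08	/* from kernel/sched/sched.h */
#define WF_CURRENT_CPU	0x80	/* added by this patch */

/* Model of the select_task_rq_fair() shortcut: on a TTWU wakeup with
 * WF_CURRENT_CPU set, pick the waker's CPU if the wakee's affinity
 * allows it; otherwise fall through to normal selection (stubbed as
 * returning prev_cpu here). */
static int select_cpu_model(int wake_flags, int cur_cpu, int prev_cpu,
			    unsigned long allowed_mask)
{
	if ((wake_flags & WF_TTWU) &&
	    (wake_flags & WF_CURRENT_CPU) &&
	    (allowed_mask & (1UL << cur_cpu)))
		return cur_cpu;		/* wake on the waker's CPU */

	return prev_cpu;		/* normal selection path */
}
```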

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 15 +++++++++------
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c22cd026440..293f5801bf81 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3680,8 +3680,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11d22943753f..16a9c93e6e82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6836,6 +6836,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);

+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a1c6aeb9165..80de6836f8ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2031,13 +2031,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }

 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */

-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */

 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3037,6 +3038,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
--
2.25.1



* [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-08 19:46 [RFC PATCH 0/3 v0.2] RFC: sched/UMCG Peter Oskolkov
  2021-07-08 19:46 ` [RFC PATCH 1/3 v0.2] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
@ 2021-07-08 19:46 ` Peter Oskolkov
  2021-07-08 21:12   ` Jann Horn
  2021-07-09  8:01   ` Peter Zijlstra
  2021-07-08 19:46 ` [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls Peter Oskolkov
  2 siblings, 2 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-08 19:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrei Vagin, Jim Newsome, Jann Horn

Add helper functions to work atomically with userspace 32/64-bit
values - there are some .*futex.* named helpers, but they are not
exactly what is needed for UMCG; I haven't found anything else I
could use, so I rolled my own.

At the moment only X86_64 is supported.

Note: the helpers should probably go into arch/ somewhere; I have
them in kernel/sched/umcg.h temporarily for convenience. Please
let me know where I should put them and how to name them.
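
As a rough model of the contract these helpers implement (return 0
when @prev matched and the swap happened; return -EAGAIN with the
observed value written back through @prev on a mismatch), here is a
hedged userspace sketch using C11 atomics in place of the uaccess
machinery; the function name and MODEL_EAGAIN constant are
illustrative, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdatomic.h>

#define MODEL_EAGAIN (-11)	/* stand-in for -EAGAIN */

/* Userspace model of umcg_cmpxchg_64_user()'s return contract. */
static int model_cmpxchg_64(_Atomic uint64_t *addr, uint64_t *prev,
			    uint64_t val)
{
	uint64_t expected = *prev;

	if (atomic_compare_exchange_strong(addr, &expected, val))
		return 0;	/* swap done; *prev equals the old value */

	*prev = expected;	/* mismatch: report the value actually seen */
	return MODEL_EAGAIN;
}
```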

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 kernel/sched/umcg.h | 264 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 264 insertions(+)
 create mode 100644 kernel/sched/umcg.h

diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
new file mode 100644
index 000000000000..aa8fb24964ed
--- /dev/null
+++ b/kernel/sched/umcg.h
@@ -0,0 +1,264 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _KERNEL_SCHED_UMCG_H
+#define _KERNEL_SCHED_UMCG_H
+
+#ifdef CONFIG_UMCG
+#ifdef CONFIG_X86_64
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include <asm/asm.h>
+#include <linux/atomic.h>
+
+/* TODO: move atomic operations below into arch/ headers */
+static inline int umcg_atomic_cmpxchg_32(u32 *uval, u32 __user *uaddr,
+						u32 oldval, u32 newval)
+{
+	int ret = 0;
+
+	if (!user_access_begin(uaddr, sizeof(u32)))
+		return -EFAULT;
+	asm volatile("\n"
+		"1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
+		"2:\n"
+		"\t.section .fixup, \"ax\"\n"
+		"3:\tmov     %3, %0\n"
+		"\tjmp     2b\n"
+		"\t.previous\n"
+		_ASM_EXTABLE_UA(1b, 3b)
+		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+		: "i" (-EFAULT), "r" (newval), "1" (oldval)
+		: "memory"
+	);
+	user_access_end();
+	*uval = oldval;
+	return ret;
+}
+
+static inline int umcg_atomic_cmpxchg_64(u64 *uval, u64 __user *uaddr,
+						u64 oldval, u64 newval)
+{
+	int ret = 0;
+
+	if (!user_access_begin(uaddr, sizeof(u64)))
+		return -EFAULT;
+	asm volatile("\n"
+		"1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
+		"2:\n"
+		"\t.section .fixup, \"ax\"\n"
+		"3:\tmov     %3, %0\n"
+		"\tjmp     2b\n"
+		"\t.previous\n"
+		_ASM_EXTABLE_UA(1b, 3b)
+		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+		: "i" (-EFAULT), "r" (newval), "1" (oldval)
+		: "memory"
+	);
+	user_access_end();
+	*uval = oldval;
+	return ret;
+}
+
+static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
+{
+	struct mm_struct *mm = current->mm;
+	int ret;
+
+	mmap_read_lock(mm);
+	ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
+			NULL);
+	mmap_read_unlock(mm);
+
+	return ret < 0 ? ret : 0;
+}
+
+static inline int umcg_get_user_32(u32 __user *uaddr, u32 *val)
+{
+	while (true) {
+		int ret;
+		u32 out;
+
+		pagefault_disable();
+		ret = __get_user(out, uaddr);
+		pagefault_enable();
+
+		if (!ret) {
+			*val = out;
+			return 0;
+		}
+
+		if (WARN_ONCE(ret != -EFAULT, "Unexpected error"))
+			return -EFAULT;
+
+		ret = fix_pagefault((unsigned long)uaddr, false);
+		if (ret)
+			return -EFAULT;
+	}
+}
+
+/**
+ * umcg_cmpxchg_32_user - compare_exchange 32-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int umcg_cmpxchg_32_user(u32 __user *uaddr, u32 *prev, u32 val)
+{
+	while (true) {
+		int ret;
+		u32 expected = *prev;
+
+		pagefault_disable();
+		ret = umcg_atomic_cmpxchg_32(prev, uaddr, expected, val);
+		pagefault_enable();
+
+		if (!ret)
+			return *prev == expected ? 0 : -EAGAIN;
+
+		if (WARN_ONCE(ret != -EFAULT, "Unexpected error"))
+			return -EFAULT;
+
+		ret = fix_pagefault((unsigned long)uaddr, true);
+		if (ret)
+			return -EFAULT;
+	}
+}
+
+/**
+ * umcg_cmpxchg_64_user - compare_exchange 64-bit values
+ *
+ * Return:
+ * 0 - OK
+ * -EFAULT: memory access error
+ * -EAGAIN: @expected did not match; consult @prev
+ */
+static inline int umcg_cmpxchg_64_user(u64 __user *uaddr, u64 *prev, u64 val)
+{
+	while (true) {
+		int ret;
+		u64 expected = *prev;
+
+		pagefault_disable();
+		ret = umcg_atomic_cmpxchg_64(prev, uaddr, expected, val);
+		pagefault_enable();
+
+		if (!ret)
+			return *prev == expected ? 0 : -EAGAIN;
+
+		if (WARN_ONCE(ret != -EFAULT, "Unexpected error"))
+			return -EFAULT;
+
+		ret = fix_pagefault((unsigned long)uaddr, true);
+		if (ret)
+			return -EFAULT;
+	}
+}
+
+/**
+ * atomic_stack_push_user - push a node onto the stack
+ * @head - a pointer to the head of the stack;
+ * @node - a pointer to the node to push.
+ *
+ * Push a node onto a single-linked list (stack). Atomicity/correctness
+ * is guaranteed by locking the head via settings its first bit (assuming
+ * the pointer is aligned).
+ *
+ * Return: 0 on success, -EFAULT on error.
+ */
+static inline int atomic_stack_push_user(u64 __user *head, u64 __user *node)
+{
+	while (true) {
+		int ret;
+		u64 first;
+
+		smp_mb();  /* Make the read below clean. */
+		if (get_user(first, head))
+			return -EFAULT;
+
+		if (first & 1UL) {
+			cpu_relax();
+			continue;  /* first is being deleted. */
+		}
+
+		if (put_user(first, node))
+			return -EFAULT;
+		smp_mb();  /* Make the write above visible. */
+
+		ret = umcg_cmpxchg_64_user(head, &first, (u64)node);
+		if (!ret)
+			return 0;
+
+		if (ret == -EAGAIN) {
+			cpu_relax();
+			continue;
+		}
+
+		if (WARN_ONCE(ret != -EFAULT, "unexpected umcg_cmpxchg result"))
+			return -EFAULT;
+
+		return -EFAULT;
+	}
+}
+
+/**
+ * atomic_stack_pop_user - pop a node from the stack
+ * @head - a pointer to the head of the stack;
+ * @value - a pointer to where to store the popped value.
+ *
+ * Pop a node from a single-linked list (stack). Atomicity/correctness
+ * is guaranteed by locking the head via setting its first bit (assuming
+ * the pointer is aligned).
+ *
+ * Note: on success, @value should be cast to (u64 __user *) if not zero.
+ *
+ * Return: 0 on success, -EFAULT on error.
+ */
+static inline int atomic_stack_pop_user(u64 __user *head, u64 *value)
+{
+	while (true) {
+		int ret;
+		u64 next, first;
+
+		smp_mb();  /* Make the read below clean. */
+		if (get_user(first, head))
+			return -EFAULT;
+
+		if (!first) {
+			*value = 0UL;
+			return 0;
+		}
+
+		if (first & 1UL) {
+			cpu_relax();
+			continue;  /* first is being deleted. */
+		}
+
+		ret = umcg_cmpxchg_64_user(head, &first, first | 1UL);
+		if (ret == -EAGAIN) {
+			cpu_relax();
+			continue;
+		}
+
+		if (ret)
+			return -EFAULT;
+
+		if (get_user(next, (u64 __user *)first))
+			return -EFAULT;
+
+		first |= 1UL;
+
+		ret = umcg_cmpxchg_64_user(head, &first, next);
+		if (ret)
+			return -EFAULT;
+
+		*value = first & ~1UL;
+		return 0;
+	}
+}
+#endif  /* CONFIG_X86_64 */
+#endif  /* CONFIG_UMCG */
+#endif  /* _KERNEL_SCHED_UMCG_H */
--
2.25.1
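
For reference, the lock-bit protocol used by atomic_stack_push_user()
and atomic_stack_pop_user() above can be sketched in plain userspace C
(a single-threaded model: C11 atomics stand in for the uaccess
helpers, node links are 8-byte-aligned u64 cells, and all names here
are mine, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdatomic.h>

/* Push: wait while the head's low bit (the "pop in progress" lock) is
 * set, link the node to the current head, then swing the head. */
static int stack_push(_Atomic uint64_t *head, uint64_t *node)
{
	for (;;) {
		uint64_t first = atomic_load(head);

		if (first & 1UL)
			continue;	/* head locked by a concurrent pop */

		*node = first;		/* node->next = old head */
		if (atomic_compare_exchange_strong(head, &first,
						   (uint64_t)node))
			return 0;
	}
}

/* Pop: lock the head by setting its low bit, read the node's next
 * link, then unlock by swinging the head to the next node. */
static int stack_pop(_Atomic uint64_t *head, uint64_t *value)
{
	for (;;) {
		uint64_t first = atomic_load(head);

		if (!first) {
			*value = 0;	/* empty stack */
			return 0;
		}
		if (first & 1UL)
			continue;	/* another pop holds the lock */

		if (!atomic_compare_exchange_strong(head, &first,
						    first | 1UL))
			continue;	/* lost the race; retry */

		uint64_t next = *(uint64_t *)first;	/* node->next */
		uint64_t locked = first | 1UL;

		atomic_compare_exchange_strong(head, &locked, next);
		*value = first;
		return 0;
	}
}
```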



* [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-08 19:46 [RFC PATCH 0/3 v0.2] RFC: sched/UMCG Peter Oskolkov
  2021-07-08 19:46 ` [RFC PATCH 1/3 v0.2] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
  2021-07-08 19:46 ` [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers Peter Oskolkov
@ 2021-07-08 19:46 ` Peter Oskolkov
  2021-07-11 16:35   ` Peter Oskolkov
  2021-07-11 18:29   ` Thierry Delisle
  2 siblings, 2 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-08 19:46 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrei Vagin, Jim Newsome, Jann Horn

Define struct umcg_task and sys_umcg_ctl/sys_umcg_wait syscalls.

This is another attempt at implementing UMCG, based on
discussion in https://lore.kernel.org/patchwork/cover/1433967/

Most of the "why" is covered here (some details are obsolete):
https://lore.kernel.org/patchwork/cover/1433967/#1632328

I'll update the commit message with more "why" when the general
approach is ACKed at a high level.

The "how": in this patch I used the approach suggested by
peterz@ in the discussion linked above; specifically,
only a single

struct umcg_task __user *umcg_task

pointer is added to struct task_struct.

I tried to make the inline doc-comments clearer and more detailed
this time - hopefully they answer the "how" better now.

Pretty much everything works, with one issue: when a worker
blocks, we need to wake its server in umcg_wq_worker_sleeping
called from sched_submit_work within preempt_disable block.
As the server_tid is set by the userspace, it can point to
a running/spinning task, and so wake_server will hang waiting
for ttwu() to succeed. I do not think this is appropriate,
but I am not sure at the moment how to resolve this. Any suggestions?

Other than the issue above, I need to implement timeout handling
and do more testing before this is ready for a fully detailed
technical review - at the moment this is just an RFC.

And also worker preemption - haven't looked at it at all yet.
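
The state machine documented in include/uapi/linux/umcg.h below can
be encoded as a small validity check. This is my sketch of the
documented transitions, not code from the patch; the constants match
the header:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define UMCG_TASK_RUNNING	1	/* from include/uapi/linux/umcg.h */
#define UMCG_TASK_IDLE		2
#define UMCG_TASK_BLOCKED	3
#define UMCG_TF_STATE_MASK	0xff
#define UMCG_TF_LOCKED		(1 << 8)

/* Returns true iff "from -> to" is one of the documented transitions:
 * RUNNING -> IDLE, RUNNING -> BLOCKED, IDLE -> RUNNING, BLOCKED -> IDLE. */
static bool umcg_transition_valid(uint32_t from, uint32_t to)
{
	from &= UMCG_TF_STATE_MASK;	/* strip state flags, bits 8-15 */
	to   &= UMCG_TF_STATE_MASK;

	switch (from) {
	case UMCG_TASK_RUNNING:
		return to == UMCG_TASK_IDLE || to == UMCG_TASK_BLOCKED;
	case UMCG_TASK_IDLE:
		return to == UMCG_TASK_RUNNING;	/* IDLE -> BLOCKED never happens */
	case UMCG_TASK_BLOCKED:
		return to == UMCG_TASK_IDLE;	/* BLOCKED -> RUNNING never happens */
	default:
		return false;
	}
}
```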

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/sched.h                  |   6 +
 include/linux/syscalls.h               |   4 +
 include/uapi/asm-generic/unistd.h      |   8 +-
 include/uapi/linux/umcg.h              | 246 +++++++++++++
 init/Kconfig                           |  10 +
 kernel/exit.c                          |   7 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |  17 +-
 kernel/sched/umcg.c                    | 481 +++++++++++++++++++++++++
 kernel/sched/umcg.h                    |   7 +
 kernel/sys_ni.c                        |   4 +
 12 files changed, 790 insertions(+), 3 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ce18119ea0d0..0c6c7fd72b0b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,8 @@
 444	common	landlock_create_ruleset	sys_landlock_create_ruleset
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
+447	common	umcg_ctl		sys_umcg_ctl
+448	common	umcg_wait		sys_umcg_wait

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 50db9496c99d..185ad1cdde77 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;

 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1223,6 +1224,10 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif

+#ifdef CONFIG_UMCG
+	struct umcg_task __user *umcg_task;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	union {
@@ -1599,6 +1604,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..f3e1ef8d842f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;

 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1050,6 +1051,9 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
 asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
+

 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..1a4c9ac0e296 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,14 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)

+#define __NR_umcg_ctl 447
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 448
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
+
+
 #undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 449

 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..2fa69dd9e841
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,246 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ */
+
+/*
+ * UMCG task states, the first 8 bits. The states represent the user space
+ * point of view.
+ *
+ * All UMCG tasks (servers, workers, basic tasks) can be either
+ * RUNNING, i.e. doing useful work, or IDLE, i.e. have no work assigned
+ * to them and thus blocked in sys_umcg_wait().
+ *
+ * In addition, when a UMCG worker blocks in the kernel (e.g. on I/O),
+ * it is marked as BLOCKED; when a BLOCKED worker completes its blocking
+ * operation, it is marked as IDLE and added to idle_workers list (see
+ * struct umcg_task below).
+ *
+ * UMCG servers and basic tasks continue to be considered as RUNNING
+ * even if they are blocked in the kernel in any way other than in
+ * sys_umcg_wait().
+ *
+ * State transitions:
+ *
+ * RUNNING => IDLE:   the current RUNNING task becomes IDLE by calling
+ *                    sys_umcg_wait();
+ * IDLE => RUNNING:   - another worker, server, or a basic UMCG task called
+ *                      sys_umcg_wait() with self->next_tid pointing to the
+ *                      task transitioning from IDLE to RUNNING (mostly
+ *                      applies to workers and basic tasks);
+ *                    - the userspace marked an IDLE task as RUNNING and
+ *                      sent a signal to it (thus interrupting sys_umcg_wait);
+ *                    - servers: the kernel wakes an IDLE server from
+ *                      idle_servers list when a BLOCKED worker becomes IDLE
+ *                      (see below);
+ *                    - servers: the kernel wakes an IDLE server that
+ *                      is "attached" to a RUNNING worker when the worker
+ *                      becomes BLOCKED;
+ * RUNNING => BLOCKED: when a RUNNING UMCG worker blocks in the kernel,
+ *                     the kernel marks it as BLOCKED (and wakes its server);
+ * BLOCKED => IDLE:    when a BLOCKED UMCG worker finishes its blocking
+ *                     operation, the kernel marks it as IDLE, adds it to
+ *                     the list of idle workers (see struct umcg_task) and
+ *                     wakes an idle server from the list of idle servers, if
+ *                     available.
+ *
+ * Note 1: only the transitions listed above are valid; these state
+ *         transitions are not valid and do not happen:
+ *         - IDLE => BLOCKED
+ *         - BLOCKED => RUNNING
+ *
+ * Note 2: only UMCG workers (UMCG tasks registered with UMCG_CTL_WORKER
+ *         flag set) are subject to block/wake detection logic;
+ *
+ * Note 3: if a worker has UMCG_TF_LOCKED state flag set, it behaves as
+ *         a server or a basic task, i.e. the block/wake detection is
+ *         disabled (this is a UMCG equivalent of task_lock() or
+ *         preempt_disable()). UMCG_TF_LOCKED flag is cleared by the kernel
+ *         when the worker goes to sleep in umcg_wait().
+ *
+ *
+ * Note 4: when a state transition is initiated by the userspace via a call
+ *         to sys_umcg_wait(), it is the userspace's responsibility to
+ *         change the value of the umcg_task.state field; when a state
+ *         transition is initiated by the kernel during worker block/wake
+ *         handling, it is the kernel who marks the worker as BLOCKED
+ *         or IDLE, and the server as RUNNING.
+ */
+#define UMCG_TASK_NONE			0
+#define UMCG_TASK_RUNNING		1
+#define UMCG_TASK_IDLE			2
+#define UMCG_TASK_BLOCKED		3
+
+#define UMCG_TF_STATE_MASK		0xff
+
+/* UMCG task state flags, bits 8-15 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace; workers with UMCG_TF_LOCKED set
+ * do not become BLOCKED and do not wake their attached server.
+ */
+#define UMCG_TF_LOCKED			(1 << 8)
+/* UMCG_TF_PREEMPTED: not currently used, but will be added later. */
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * UMCG tasks can be:
+ *
+ * - UMCG workers: must have a UMCG server assigned when running; the
+ *                 server is woken when the worker blocks;
+ *                 has PF_UMCG_WORKER task flag set in task_struct.
+ *
+ *                 Both @idle_servers_ptr and @idle_workers_ptr are !NULL
+ *                 when running or calling sys_umcg_wait().
+ *
+ *                 A worker's state can be:
+ *                 - RUNNING: is schedulable by the kernel, has a server
+ *                            assigned in @server_tid;
+ *                 - IDLE: not schedulable by the kernel; can be
+ *                         context-switched into via sys_umcg_wait;
+ *                 - BLOCKED: blocked in the kernel (e.g. on I/O).
+ *
+ * - UMCG servers: @idle_servers_ptr && !@idle_workers_ptr && !@server_tid.
+ *
+ *                 A server's state can be:
+ *                 - RUNNING: behaves like a "normal" task: is schedulable
+ *                            by the kernel, can block on I/O, etc.
+ *                 - IDLE: not schedulable by the kernel;
+ *                         - if @next_tid != 0, the @next_tid must point
+ *                           to a RUNNING worker;
+ *                         - if @next_tid == 0, the server must be added
+ *                           to @idle_servers_ptr (by the userspace).
+ *
+ * - UMCG basic tasks: !@idle_servers_ptr && !@idle_workers_ptr && !server_tid.
+ *
+ *                 A basic UMCG task can either be IDLE or RUNNING, and it
+ *                 does not participate in server/worker logic. From the
+ *                 kernel point of view, UMCG basic tasks and servers are
+ *                 almost indistinguishable.
+ *
+ *                 TODO: does it make sense to mention UMCG basic tasks here
+ *                       at all, given that the kernel does not currently
+ *                       treat them differently?
+ *                       - pro: it is a supported use case to have
+ *                              UMCG tasks that can call sys_umcg_wait()
+ *                              to sleep, wake, or context-switch;
+ *                       - con: UMCG servers can do exactly the same; the
+ *                              only difference, from the kernel point of
+ *                              view, is that servers can be pointed to by
+ *                              server_tid field below, or added to
+ *                              idle_servers_ptr list (by the userspace).
+ *
+ * See sys_umcg_ctl documentation for a detailed explanation of how
+ * UMCG task types are determined.
+ *
+ * See sys_umcg_wait documentation for a detailed explanation of server/worker
+ * interactions, and for basic task behavior.
+ *
+ * Once a UMCG task is registered as one of the three types
+ * (see sys_umcg_ctl), it may not change its type.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state: the current state of the UMCG task described by this struct.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  7: task state;
+	 *   bits  8 - 15: state flags;
+	 *   bits 16 - 23: reserved; must be zeroes;
+	 *   bits 24 - 31: for userspace use.
+	 */
+	uint32_t	state;			/* r/w */
+
+	/**
+	 * @api_version: the version of UMCG API the userspace would like
+	 *               to use. Must be set before calling umcg_ctl
+	 *               and must not be changed afterwards.
+	 */
+	uint32_t	api_version;		/* r   */
+
+	/**
+	 * @server_tid: the TID of the server UMCG task that should be
+	 *              woken when this WORKER becomes BLOCKED. Can be zero.
+	 *
+	 *              If this is a UMCG server, @server_tid should
+	 *              contain the TID of @self - it will be used to find
+	 *              the task_struct to wake when pulled from
+	 *              @idle_servers.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	uint32_t	server_tid;		/* r   */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	uint32_t	next_tid;		/* r   */
+
+	/**
+	 * @idle_servers_ptr: a single-linked list pointing to the list
+	 *                    of idle servers. Can be NULL.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * userspace adds items to the list, the kernel removes them.
+	 *
+	 * TODO: describe how the list works.
+	 */
+	uint64_t	idle_servers_ptr;	/* r/w */
+
+	/**
+	 * @idle_workers_ptr: a single-linked list pointing to the list
+	 *                    of idle workers. Can be NULL.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, the userspace removes them.
+	 */
+	uint64_t	idle_workers_ptr;	/* r/w */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ *
+ * See sys_umcg_ctl documentation for more details.
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ *                       (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY must be set.
+ */
+enum umcg_wait_flag {
+	UMCG_WAIT_WAKE_ONLY = 1,
+	UMCG_WF_CURRENT_CPU = 2,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index a61c92066c2e..c15a50a61ba6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1662,6 +1662,16 @@ config MEMBARRIER

 	  If unsure, say Y.

+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on X86_64
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..dc8398558d87 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -744,6 +744,13 @@ void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");

+#ifdef CONFIG_UMCG
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
+	tsk->umcg_task = NULL;
+#endif
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..e4e481eee1b7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 293f5801bf81..f7ddeed72e30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@

 #include "pelt.h"
 #include "smp.h"
+#include "umcg.h"

 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
@@ -3961,6 +3962,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+#ifdef CONFIG_UMCG
+	p->umcg_task = NULL;
+	p->flags &= ~PF_UMCG_WORKER;
+#endif
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -5975,10 +5980,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * in the possible wakeup of a kworker and because wq_worker_sleeping()
 	 * requires it.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		preempt_disable();
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+#ifdef CONFIG_UMCG
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
+#endif
 		else
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
@@ -5997,9 +6006,13 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+#ifdef CONFIG_UMCG
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
+#endif
 		else
 			io_wq_worker_running(tsk);
 	}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..f83afc8d4827
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,481 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+#include <linux/freezer.h>
+
+#include "sched.h"
+#include "umcg.h"
+
+static int umcg_validate_version(u32 api_version)
+{
+	if (api_version == 1)
+		return 0;
+	return 1;
+}
+
+/**
+ * sys_umcg_ctl: (un)register a task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - self->state must be UMCG_TASK_IDLE
+ *              - @flags & UMCG_CTL_WORKER
+ *         UMCG servers and basic tasks:
+ *              - self->state must be UMCG_TASK_RUNNING
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *
+ *         All tasks:
+ *              - self->api_version must match one of the supported API
+ *                versions
+ *              - self->server_tid must be zero
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a RUNNING server or basic task; an IDLE
+ *         worker will be added to idle_workers_ptr, and the worker put to
+ *         sleep; an idle server from idle_servers_ptr will be woken, if any.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task is
+ *           a UMCG worker, the userspace is responsible for waking its server.
+ *
+ * Return:
+ * 0                - success
+ * > 0              - the highest supported API version if @self->api_version
+ *                    is not supported (when registering)
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+	struct umcg_task ut;
+	int ret;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			current->flags &= ~PF_UMCG_WORKER;
+
+		current->umcg_task = NULL;
+		return 0;
+	}
+
+	/* Register the current task as a UMCG task. */
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	ret = umcg_validate_version(ut.api_version);
+	if (ret)
+		return ret;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if (ut.state != UMCG_TASK_IDLE)
+			return -EINVAL;
+
+		current->umcg_task = self;
+		current->flags |= PF_UMCG_WORKER;
+
+		umcg_wq_worker_running(current);  /* Will sleep. */
+		return 0;
+	}
+
+	/* This is a server/basic task. */
+	if (ut.state != UMCG_TASK_RUNNING)
+		return -EINVAL;
+
+	current->umcg_task = self;
+	return 0;
+}
+
+/* Sleep until interrupted or self.state becomes RUNNING. */
+static int umcg_idle_loop(struct umcg_task __user *self, u64 abs_timeout)
+{
+
+	if (abs_timeout)
+		return -EOPNOTSUPP;
+
+	/* Unlock the worker, if locked. */
+	if (current->flags & PF_UMCG_WORKER) {
+		u32 umcg_state;
+
+		if (get_user(umcg_state, &self->state))
+			return -EFAULT;
+
+		if ((umcg_state & UMCG_TF_LOCKED) && umcg_cmpxchg_32_user(
+					&self->state, &umcg_state,
+					umcg_state & ~UMCG_TF_LOCKED))
+			return -EFAULT;
+	}
+
+	while (true) {
+		u32 umcg_state;
+
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		freezable_schedule();
+
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (get_user(umcg_state, &self->state))
+			return -EFAULT;
+
+		if ((umcg_state & UMCG_TF_STATE_MASK) == UMCG_TASK_RUNNING)
+			return 0;
+	}
+}
+
+/* Try to wake up. May be called with preempt_disable set. */
+static int umcg_do_wake(u32 next_tid, int wake_flags)
+{
+	struct task_struct *next;
+	int ttwu_ok;
+
+	rcu_read_lock();
+	next = find_task_by_vpid(next_tid);
+	if (!next || !(READ_ONCE(next->umcg_task))) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	ttwu_ok = try_to_wake_up(next, TASK_NORMAL, wake_flags);
+	rcu_read_unlock();
+
+	if (!ttwu_ok) {
+		cpu_relax();
+		return -EAGAIN;
+	}
+
+	return 0;
+}
+
+/* Loop until ttwu succeeds. May be called with preempt_disable set. */
+static int umcg_do_wake_loop(u32 next_tid, int wake_flags)
+{
+	/*
+	 * TODO: the loop below can cause a soft lockup when called from
+	 *       sched_submit_work/umcg_wq_worker_sleeping. Needs fixing.
+	 */
+	while (true) {
+		int ret = umcg_do_wake(next_tid, wake_flags);
+
+		if (!ret)
+			return 0;
+
+		if (ret == -EAGAIN)
+			continue;  /* umcg_do_wake calls cpu_relax */
+
+		return ret;
+	}
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one.
+ */
+static int umcg_do_context_switch(struct umcg_task __user *self, u32 next_tid,
+					u64 abs_timeout)
+{
+	struct task_struct *next;
+	int ttwu_ok;
+
+	if (abs_timeout)
+		return -EOPNOTSUPP;
+
+	rcu_read_lock();
+	next = find_task_by_vpid(next_tid);
+	if (!next) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	/* TODO: instead of wake + sleep, do a context switch. */
+	ttwu_ok = try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU);
+	rcu_read_unlock();
+
+	if (!ttwu_ok) {
+		cpu_relax();
+		return -EAGAIN;
+	}
+
+	return umcg_idle_loop(self, abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: sleep the current task and/or wake another task.
+ * @flags:        zero or a value from enum umcg_wait_flag.
+ * @abs_timeout:  when to wake the task; zero for no timeout. NOT SUPPORTED yet.
+ *
+ * @self->state must be UMCG_TASK_IDLE (where @self is struct umcg_task of
+ * the current task) if !(@flags & UMCG_WAIT_WAKE_ONLY).
+ *
+ * If @self->next_tid is not zero, it must point to a UMCG task blocked in
+ * sys_umcg_wait(). The userspace must have changed its state from IDLE to
+ * RUNNING before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the current
+ * task is put to sleep.
+ *
+ * If this is a worker (PF_UMCG_WORKER is set), and @self->next_tid is zero,
+ * the server assigned to this worker (@self->server_tid) will be
+ * woken/switched-to; same rules apply (the server must be waiting in
+ * sys_umcg_wait(); its state must be RUNNING now).
+ *
+ * If @self->next_tid points to a UMCG worker, it must have its server_tid
+ * set, with the server blocked in sys_umcg_wait().
+ *
+ * Return:
+ * 0             - OK;
+ * -EAGAIN       - the task to wake is not sleeping (yet?);
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task or of the task to wake;
+ * -ESRCH        - the task to wake was not found or is not a UMCG task;
+ * -EINVAL       - another error happened (e.g. bad @flags, or the current
+ *                 task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+	struct umcg_task __user *self = current->umcg_task;
+	u32 next_tid;
+
+	if (!self)
+		return -EINVAL;
+
+	if (get_user(next_tid, &self->next_tid))
+		return -EFAULT;
+
+	if (flags & UMCG_WAIT_WAKE_ONLY) {
+		if (!next_tid || abs_timeout)
+			return -EINVAL;
+
+		flags &= ~UMCG_WAIT_WAKE_ONLY;
+		if (flags & ~UMCG_WF_CURRENT_CPU)
+			return -EINVAL;
+
+		return umcg_do_wake(next_tid, flags & UMCG_WF_CURRENT_CPU ?
+				WF_CURRENT_CPU : 0);
+	}
+
+	if (flags)
+		return -EINVAL;
+
+	if (!next_tid && current->flags & PF_UMCG_WORKER) {
+		if (get_user(next_tid, &self->server_tid))
+			return -EFAULT;
+	}
+
+	if (next_tid)
+		return umcg_do_context_switch(self, next_tid, abs_timeout);
+
+	return umcg_idle_loop(self, abs_timeout);
+}
+
+#define umcg_die() \
+do {                                                             \
+	pr_warn("UMCG: umcg_die: %s:%d\n", __FILE__, __LINE__);  \
+	force_sig(SIGSEGV);                                      \
+} while (0)
+
+/* Wake the server; may be called within preempt_disable section. */
+static bool wake_server(struct umcg_task __user *ut_server, u32 server_tid)
+{
+	u32 state = UMCG_TASK_IDLE;
+
+	if (WARN_ON(!(ut_server || server_tid)))
+		return false;
+
+	if (!ut_server) {
+		struct task_struct *tsk;
+
+		rcu_read_lock();
+		tsk = find_task_by_vpid(server_tid);
+		if (tsk)
+			ut_server = READ_ONCE(tsk->umcg_task);
+		rcu_read_unlock();
+
+		if (!ut_server)
+			return false;
+	}
+
+	if (!server_tid && get_user(server_tid, &ut_server->server_tid))
+		return false;
+
+	if (umcg_cmpxchg_32_user(&ut_server->state, &state,
+			UMCG_TASK_RUNNING)) {
+		umcg_die();
+		return false;
+	}
+
+	/* TODO: make a smarter context switch when available. */
+	return umcg_do_wake_loop(server_tid, WF_CURRENT_CPU) == 0;
+}
+
+/*
+ * Change the worker's state RUNNING => BLOCKED or BLOCKED => IDLE, depending
+ * on context (@sleeping).
+ *
+ * May be called with preempt_disable.
+ *
+ * Returns true to continue; false to abort.
+ */
+static bool wq_worker_change_state(struct umcg_task __user *ut_worker,
+		bool sleeping)
+{
+	u32 prev_state, next_state;
+	int ret;
+
+	if (umcg_get_user_32(&ut_worker->state, &prev_state)) {
+		umcg_die();
+		return false;
+	}
+
+	if (prev_state & UMCG_TF_LOCKED)
+		return false;
+
+	if (sleeping) {
+		if ((prev_state & UMCG_TF_STATE_MASK) !=
+				UMCG_TASK_RUNNING)
+			return false;
+	} else {
+		if ((prev_state & UMCG_TF_STATE_MASK) ==
+				UMCG_TASK_RUNNING) {
+			/*
+			 * Workers with servers attached should
+			 * pass through; workers without servers
+			 * should wait.
+			 */
+			u32 server_tid;
+
+			if (umcg_get_user_32(&ut_worker->server_tid,
+						&server_tid)) {
+				umcg_die();
+				return false;
+			}
+
+			if (server_tid)
+				return false;
+		}
+	}
+
+	next_state = prev_state & ~UMCG_TF_STATE_MASK;
+	next_state |= sleeping ? UMCG_TASK_BLOCKED : UMCG_TASK_IDLE;
+
+	ret = umcg_cmpxchg_32_user(&ut_worker->state, &prev_state,
+			next_state);
+
+	if (!ret)
+		return true;
+
+	umcg_die();
+	return false;
+}
+
+/* Called from sched_submit_work() with preempt_disable. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u32 server_tid;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+		return;
+
+	/* Step one: mark the worker as BLOCKED */
+	if (!wq_worker_change_state(ut_worker, true))
+		return;
+
+	/* Step two: wake the server, if any. */
+	if (umcg_get_user_32(&ut_worker->server_tid, &server_tid)) {
+		umcg_die();
+		return;
+	}
+
+	if (!server_tid)
+		return;
+
+	/* TODO: make a smarter context switch when available. */
+	if (!wake_server(NULL, server_tid))
+		umcg_die();
+}
+
+/* Called from sched_update_worker(). May sleep. */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 head, popped_server = 0;	/* 0: no idle server popped */
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+		return;
+
+	/*
+	 * Remove the workqueue flag to avoid triggering
+	 * umcg_wq_worker_sleeping.
+	 */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	/* Step one: mark the worker as IDLE */
+	if (!wq_worker_change_state(ut_worker, false))
+		goto out;
+
+	/* Step two: add the worker to the idle_workers list. */
+	if (get_user(head, &ut_worker->idle_workers_ptr))
+		goto die;
+
+	if (!head)
+		goto die;
+
+	if (atomic_stack_push_user((u64 __user *)head,
+			(u64 __user *)&ut_worker->idle_workers_ptr))
+		goto die;
+
+	/* Step three: wake an idle server, if any. */
+	if (get_user(head, &ut_worker->idle_servers_ptr))
+		goto die;
+
+	if (head && atomic_stack_pop_user((u64 __user *)head, &popped_server))
+		goto die;
+
+	if (popped_server) {
+		struct umcg_task __user *ut_server = container_of(
+				(u64 *)popped_server,
+				struct umcg_task, idle_servers_ptr);
+
+		if (!wake_server(ut_server, 0))
+			goto die;
+	}
+
+	/* Step four: sleep until woken by a server. */
+	umcg_idle_loop(ut_worker, 0);
+
+out:
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die:
+	umcg_die();
+}
diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
index aa8fb24964ed..3078e14f1f03 100644
--- a/kernel/sched/umcg.h
+++ b/kernel/sched/umcg.h
@@ -12,6 +12,13 @@
 #include <asm/asm.h>
 #include <linux/atomic.h>

+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
 /* TODO: move atomic operations below into arch/ headers */
 static inline int umcg_atomic_cmpxchg_32(u32 *uval, u32 __user *uaddr,
 						u32 oldval, u32 newval)
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..cd1be6356e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);

+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */

 /* mm/fadvise.c */
--
2.25.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-08 19:46 ` [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers Peter Oskolkov
@ 2021-07-08 21:12   ` Jann Horn
  2021-07-09  4:01     ` Peter Oskolkov
  2021-07-09  8:01   ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Jann Horn @ 2021-07-08 21:12 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrei Vagin, Jim Newsome

On Thu, Jul 8, 2021 at 9:46 PM Peter Oskolkov <posk@posk.io> wrote:
> Add helper functions to work atomically with userspace 32/64 bit values -
> there are some .*futex.* named helpers, but they are not exactly
> what is needed for UMCG; I haven't found what else I could use, so I
> rolled these.
>
> At the moment only X86_64 is supported.
>
> Note: the helpers should probably go into arch/ somewhere; I have
> them in kernel/sched/umcg.h temporarily for convenience. Please
> let me know where I should put them and how to name them.

Instead of open-coding spinlocks in userspace memory like this (which
some of the reviewers will probably dislike because it will have
issues around priority inversion and such), I wonder whether you could
use an actual futex as your underlying locking primitive?

The most straightforward way to do that would probably be to make the
head structure in userspace look roughly like this?

struct umcg_head {
  u64 head_ptr;
  u32 lock;
};

and then from kernel code, you could build a fastpath that directly
calls cmpxchg_futex_value_locked() and build a fallback based on
do_futex(), or something like that.

There is precedent for using futex from inside the kernel to
communicate with userspace: See mm_release(), which calls do_futex()
with FUTEX_WAKE for the clear_child_tid feature.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-08 21:12   ` Jann Horn
@ 2021-07-09  4:01     ` Peter Oskolkov
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-09  4:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrei Vagin, Jim Newsome

On Thu, Jul 8, 2021 at 2:12 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Jul 8, 2021 at 9:46 PM Peter Oskolkov <posk@posk.io> wrote:
> > Add helper functions to work atomically with userspace 32/64 bit values -
> > there are some .*futex.* named helpers, but they are not exactly
> > what is needed for UMCG; I haven't found what else I could use, so I
> > rolled these.
> >
> > At the moment only X86_64 is supported.
> >
> > Note: the helpers should probably go into arch/ somewhere; I have
> > them in kernel/sched/umcg.h temporarily for convenience. Please
> > let me know where I should put them and how to name them.
>
> Instead of open-coding spinlocks in userspace memory like this (which
> some of the reviewers will probably dislike because it will have
> issues around priority inversion and such), I wonder whether you could
> use an actual futex as your underlying locking primitive?
>
> The most straightforward way to do that would probably be to make the
> head structure in userspace look roughly like this?
>
> struct umcg_head {
>   u64 head_ptr;
>   u32 lock;
> };
>
> and then from kernel code, you could build a fastpath that directly
> calls cmpxchg_futex_value_locked() and build a fallback based on
> do_futex(), or something like that.
>
> There is precedent for using futex from inside the kernel to
> communicate with userspace: See mm_release(), which calls do_futex()
> with FUTEX_WAKE for the clear_child_tid feature.

Hi Jann,

Thanks for the note!

The approach you suggest will require locking on every operation, I
believe, while in the scheme I have, pushes/inserts are lock-free as
long as there are no concurrent pops/deletes. And the kernel does
mostly pushes (waking workers, and there can be a lot of workers),
while pops are rare: they are done for idle servers, the number of
servers has no reason to substantially exceed the number of CPUs, and
any contention there will be very short-lived; the userspace, in turn,
pops the entire stack of idle workers in one op (so a short lock as
well). So I think my approach scales better. And priority inversion
should not matter here, because this is userspace scheduling: the
userspace scheduler should worry about it, not the kernel.

Am I missing something?

Thanks,
Peter

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-08 19:46 ` [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers Peter Oskolkov
  2021-07-08 21:12   ` Jann Horn
@ 2021-07-09  8:01   ` Peter Zijlstra
  2021-07-09 16:57     ` Peter Oskolkov
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-07-09  8:01 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api,
	Paul Turner, Ben Segall, Peter Oskolkov, Joel Fernandes,
	Andrei Vagin, Jim Newsome, Jann Horn

On Thu, Jul 08, 2021 at 12:46:37PM -0700, Peter Oskolkov wrote:

> +static inline int umcg_atomic_cmpxchg_64(u64 *uval, u64 __user *uaddr,
> +						u64 oldval, u64 newval)
> +{
> +	int ret = 0;
> +
> +	if (!user_access_begin(uaddr, sizeof(u64)))
> +		return -EFAULT;
> +	asm volatile("\n"
> +		"1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
> +		"2:\n"
> +		"\t.section .fixup, \"ax\"\n"
> +		"3:\tmov     %3, %0\n"
> +		"\tjmp     2b\n"
> +		"\t.previous\n"
> +		_ASM_EXTABLE_UA(1b, 3b)
> +		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
> +		: "i" (-EFAULT), "r" (newval), "1" (oldval)
> +		: "memory"
> +	);
> +	user_access_end();
> +	*uval = oldval;
> +	return ret;
> +}

> +static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret;
> +
> +	mmap_read_lock(mm);
> +	ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> +			NULL);
> +	mmap_read_unlock(mm);
> +
> +	return ret < 0 ? ret : 0;
> +}

> +static inline int umcg_cmpxchg_64_user(u64 __user *uaddr, u64 *prev, u64 val)
> +{
> +	while (true) {
> +		int ret;
> +		u64 expected = *prev;
> +
> +		pagefault_disable();
> +		ret = umcg_atomic_cmpxchg_64(prev, uaddr, expected, val);
> +		pagefault_enable();
> +
> +		if (!ret)
> +			return *prev == expected ? 0 : -EAGAIN;
> +
> +		if (WARN_ONCE(ret != -EFAULT, "Unexpected error"))
> +			return -EFAULT;
> +
> +		ret = fix_pagefault((unsigned long)uaddr, true);
> +		if (ret)
> +			return -EFAULT;
> +	}
> +}
> +
> +/**
> + * atomic_stack_push_user - push a node onto the stack
> + * @head - a pointer to the head of the stack;
> + * @node - a pointer to the node to push.
> + *
> + * Push a node onto a single-linked list (stack). Atomicity/correctness
> + * is guaranteed by locking the head via settings its first bit (assuming
> + * the pointer is aligned).
> + *
> + * Return: 0 on success, -EFAULT on error.
> + */
> +static inline int atomic_stack_push_user(u64 __user *head, u64 __user *node)
> +{
> +	while (true) {
> +		int ret;
> +		u64 first;
> +
> +		smp_mb();  /* Make the read below clean. */
> +		if (get_user(first, head))
> +			return -EFAULT;
> +
> +		if (first & 1UL) {
> +			cpu_relax();
> +			continue;  /* first is being deleted. */
> +		}
> +
> +		if (put_user(first, node))
> +			return -EFAULT;
> +		smp_mb();  /* Make the write above visible. */
> +
> +		ret = umcg_cmpxchg_64_user(head, &first, (u64)node);
> +		if (!ret)
> +			return 0;
> +
> +		if (ret == -EAGAIN) {
> +			cpu_relax();
> +			continue;
> +		}
> +
> +		if (WARN_ONCE(ret != -EFAULT, "unexpected umcg_cmpxchg result"))
> +			return -EFAULT;
> +
> +		return -EFAULT;
> +	}
> +}


This is horrible... Jann is absolutely right, you do not, *ever* do
userspace spinlocks. What's wrong with the trivial lockless single
linked list approach?

On top of that, you really want to avoid all that endless stac/clac
nonsense and only have that once, at the outer edges of things.

Something like the *completely* untested below (except it needs lots of
extra gunk to support compilers without asm-goto-output, and more widths
and ...


#define __try_cmpxchg_user_size(ptr, oldp, new, size, label)		\
({									\
	_Bool __success;						\
	__chk_user_ptr(ptr);						\
	__typeof__(ptr) _old = (__typeof__(ptr))(oldp);			\
	__typeof__(*(ptr)) __old = *_old;				\
	__typeof__(*(ptr)) __new = (new);				\
	switch (size) {							\
	case 8:								\
		volatile u64 *__ptr = (volatile u64 *)(ptr);		\
		asm_volatile_goto("1: " LOCK_PREFIX "cmpxchgq %[new], %[ptr]" \
				  CC_SET(z)				\
				  _ASM_EXTABLE_UA(1b, %l[label])	\
				  : CC_OUT(x) (__success),		\
				    [ptr] "+m" (*__ptr),		\
				    [old] "+a" (__old),			\
				  : [new] "r" (__new)			\
				  : "memory"				\
				  : label);				\
		break;							\
	}								\
	if (unlikely(!__success))					\
		*_old = __old;						\
	__success;							\
})

#define unsafe_try_cmpxchg_user(ptr, oldp, new, label)			\
	__try_cmpxchg_user_size((ptr), (oldp), (new), sizeof(*(ptr)), label);

int user_llist_add(u64 __user *new, u64 __user *head)
{
	u64 first;
	int ret;

	if (unlikely(!access_ok(new, sizeof(*new)) ||
		     !access_ok(head, sizeof(*head))))
		return -EFAULT;

again:
	__uaccess_begin_nospec();

	unsafe_get_user(first, head, Efault_head);
	do {
		unsafe_put_user(first, new, Efault_new);
	} while (!unsafe_try_cmpxchg_user(head, &first, new, Efault_head));

	user_access_end();

	return 0;

Efault_new:
	user_access_end();

	ret = fixup_fault(new);
	if (ret < 0)
		return ret;

	goto again;

Efault_head:
	user_access_end();

	ret = fixup_fault(head);
	if (ret < 0)
		return ret;

	goto again;
}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-09  8:01   ` Peter Zijlstra
@ 2021-07-09 16:57     ` Peter Oskolkov
  2021-07-09 17:33       ` Peter Oskolkov
  2021-07-13 16:10       ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-09 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Joel Fernandes, Andrei Vagin,
	Jim Newsome, Jann Horn

On Fri, Jul 9, 2021 at 1:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 08, 2021 at 12:46:37PM -0700, Peter Oskolkov wrote:
>
> > +static inline int umcg_atomic_cmpxchg_64(u64 *uval, u64 __user *uaddr,
> > +                                             u64 oldval, u64 newval)
> > +{
> > +     int ret = 0;
> > +
> > +     if (!user_access_begin(uaddr, sizeof(u64)))
> > +             return -EFAULT;
> > +     asm volatile("\n"
> > +             "1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
> > +             "2:\n"
> > +             "\t.section .fixup, \"ax\"\n"
> > +             "3:\tmov     %3, %0\n"
> > +             "\tjmp     2b\n"
> > +             "\t.previous\n"
> > +             _ASM_EXTABLE_UA(1b, 3b)
> > +             : "+r" (ret), "=a" (oldval), "+m" (*uaddr)
> > +             : "i" (-EFAULT), "r" (newval), "1" (oldval)
> > +             : "memory"
> > +     );
> > +     user_access_end();
> > +     *uval = oldval;
> > +     return ret;
> > +}
>
> > +static inline int fix_pagefault(unsigned long uaddr, bool write_fault)
> > +{
> > +     struct mm_struct *mm = current->mm;
> > +     int ret;
> > +
> > +     mmap_read_lock(mm);
> > +     ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
> > +                     NULL);
> > +     mmap_read_unlock(mm);
> > +
> > +     return ret < 0 ? ret : 0;
> > +}
>
> > +static inline int umcg_cmpxchg_64_user(u64 __user *uaddr, u64 *prev, u64 val)
> > +{
> > +     while (true) {
> > +             int ret;
> > +             u64 expected = *prev;
> > +
> > +             pagefault_disable();
> > +             ret = umcg_atomic_cmpxchg_64(prev, uaddr, expected, val);
> > +             pagefault_enable();
> > +
> > +             if (!ret)
> > +                     return *prev == expected ? 0 : -EAGAIN;
> > +
> > +             if (WARN_ONCE(ret != -EFAULT, "Unexpected error"))
> > +                     return -EFAULT;
> > +
> > +             ret = fix_pagefault((unsigned long)uaddr, true);
> > +             if (ret)
> > +                     return -EFAULT;
> > +     }
> > +}
> > +
> > +/**
> > + * atomic_stack_push_user - push a node onto the stack
> > + * @head - a pointer to the head of the stack;
> > + * @node - a pointer to the node to push.
> > + *
> > + * Push a node onto a single-linked list (stack). Atomicity/correctness
> > + * is guaranteed by locking the head via setting its first bit (assuming
> > + * the pointer is aligned).
> > + *
> > + * Return: 0 on success, -EFAULT on error.
> > + */
> > +static inline int atomic_stack_push_user(u64 __user *head, u64 __user *node)
> > +{
> > +     while (true) {
> > +             int ret;
> > +             u64 first;
> > +
> > +             smp_mb();  /* Make the read below clean. */
> > +             if (get_user(first, head))
> > +                     return -EFAULT;
> > +
> > +             if (first & 1UL) {
> > +                     cpu_relax();
> > +                     continue;  /* first is being deleted. */
> > +             }
> > +
> > +             if (put_user(first, node))
> > +                     return -EFAULT;
> > +             smp_mb();  /* Make the write above visible. */
> > +
> > +             ret = umcg_cmpxchg_64_user(head, &first, (u64)node);
> > +             if (!ret)
> > +                     return 0;
> > +
> > +             if (ret == -EAGAIN) {
> > +                     cpu_relax();
> > +                     continue;
> > +             }
> > +
> > +             if (WARN_ONCE(ret != -EFAULT, "unexpected umcg_cmpxchg result"))
> > +                     return -EFAULT;
> > +
> > +             return -EFAULT;
> > +     }
> > +}
>
>
> This is horrible... Jann is absolutely right, you do not, *ever* do
> userspace spinlocks. What's wrong with the trivial lockless single
> linked list approach?

I'm not sure how to get around the ABA problem with a lockless single
linked list: https://en.wikipedia.org/wiki/ABA_problem

Our semantics can probably be relied on to prevent ABA from happening
with idle workers (popped/removed by the userspace), but idle servers
can trivially have it:

(initial state): head => Server A => Server B => NULL

task1 :
- read head (get A), read A (get B), {/* delayed */}

tasks 2-3-4:
- pop A, pop B, push A

task 1:
- cmp_xchg(head, A /* old */, B /* new */) - succeeds, now B is in the
list erroneously

>
> On top of that, you really want to avoid all that endless stac/clac
> nonsense and only have that once, at the outer edges of things.
>
> Something like the *completely* untested below (except it needs lots of
> extra gunk to support compilers without asm-goto-output, and more widths
> and ...

I'll try the approach you suggest below once it is clear how to deal
with the ABA issue (and the wake server issue raised in patch 3).

>
>
> #define __try_cmpxchg_user_size(ptr, oldp, new, size, label)            \
> ({                                                                      \
>         _Bool __success;                                                \
>         __chk_user_ptr(ptr);                                            \
>         __typeof__(ptr) _old = (__typeof__(ptr))(oldp);                 \
>         __typeof__(*(ptr)) __old = *_old;                               \
>         __typeof__(*(ptr)) __new = (new);                               \
>         switch (size) {                                                 \
>         case 8:                                                         \
>                 volatile u64 *__ptr = (volatile u64 *)(ptr);            \
>                 asm_volatile_goto("1: " LOCK_PREFIX "cmpxchgq %[new], %[ptr]" \
>                                   CC_SET(z)                             \
>                                   _ASM_EXTABLE_UA(1b, %l[label])        \
>                                   : CC_OUT(z) (__success),              \
>                                     [ptr] "+m" (*__ptr),                \
>                                     [old] "+a" (__old)                  \
>                                   : [new] "r" (__new)                   \
>                                   : "memory"                            \
>                                   : label);                             \
>                 break;                                                  \
>         }                                                               \
>         if (unlikely(!__success))                                       \
>                 *_old = __old;                                          \
>         __success;                                                      \
> })
>
> #define unsafe_try_cmpxchg_user(ptr, oldp, new, label)                  \
>         __try_cmpxchg_user_size((ptr), (oldp), (new), sizeof(*(ptr)), label)
>
> int user_llist_add(u64 __user *new, u64 __user *head)
> {
>         u64 first;
>         int ret;
>
>         if (unlikely(!access_ok(new, sizeof(*new)) ||
>                      !access_ok(head, sizeof(*head))))
>                 return -EFAULT;
>
> again:
>         __uaccess_begin_nospec();
>
>         unsafe_get_user(first, head, Efault_head);
>         do {
>                 unsafe_put_user(first, new, Efault_new);
>         } while (!unsafe_try_cmpxchg_user(head, &first, new, Efault_head));
>
>         user_access_end();
>
>         return 0;
>
> Efault_new:
>         user_access_end();
>
>         ret = fixup_fault(new);
>         if (ret < 0)
>                 return ret;
>
>         goto again;
>
> Efault_head:
>         user_access_end();
>
>         ret = fixup_fault(head);
>         if (ret < 0)
>                 return ret;
>
>         goto again;
> }

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-09 16:57     ` Peter Oskolkov
@ 2021-07-09 17:33       ` Peter Oskolkov
  2021-07-13 16:10       ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-09 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Joel Fernandes, Andrei Vagin,
	Jim Newsome, Jann Horn

On Fri, Jul 9, 2021 at 9:57 AM Peter Oskolkov <posk@google.com> wrote:
>
> On Fri, Jul 9, 2021 at 1:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
[...]

> > What's wrong with the trivial lockless single
> > linked list approach?
>
> I'm not sure how to get around the ABA problem with a lockless single
> linked list: https://en.wikipedia.org/wiki/ABA_problem
>
> Our semantics can probably be relied on to prevent ABA from happening
> with idle workers (popped/removed by the userspace), but idle servers
> can trivially have it:
>
> (initial state): head => Server A => Server B => NULL
>
> task1 :
> - read head (get A), read A (get B), {/* delayed */}
>
> tasks 2-3-4:
> - pop A, pop B, push A
>
> task 1:
> - cmp_xchg(head, A /* old */, B /* new */) - succeeds, now B is in the
> list erroneously

I think the kernel can just mark "popped" servers as such (by setting
the lowest bit of their "next" pointer) without actually removing them
from the list, letting the userspace do the cleanup/GC. With the number
of idle servers hard-limited at 2*NUM_CPUs or similar, this may work
out OK. I'll work on this approach. Please have a look at patch 3. :)

>
> >
> > On top of that, you really want to avoid all that endless stac/clac
> > nonsense and only have that once, at the outer edges of things.
> >
> > Something like the *completely* untested below (except it needs lots of
> > extra gunk to support compilers without asm-goto-output, and more widths
> > and ...
>
> I'll try the approach you suggest below once it is clear how to deal
> with the ABA issue (and the wake server issue raised in patch 3).
>

[...]


* Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-08 19:46 ` [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls Peter Oskolkov
@ 2021-07-11 16:35   ` Peter Oskolkov
  2021-07-11 18:29   ` Thierry Delisle
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-11 16:35 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Joel Fernandes,
	Andrei Vagin, Jim Newsome, Jann Horn

On Thu, Jul 8, 2021 at 12:46 PM Peter Oskolkov <posk@posk.io> wrote:
[...]

> Pretty much everything works, with one issue: when a worker
> blocks, we need to wake its server in umcg_wq_worker_sleeping
> called from sched_submit_work within preempt_disable block.
> As the server_tid is set by the userspace, it can point to
> a running/spinning task, and so wake_server will hang waiting
> for ttwu() to succeed. I do not think this is appropriate,
> but I am not sure at the moment how to resolve this. Any suggestions?

[...]

I think I can solve this by carefully ordering state changes (both
umcg state and task state) and maybe sending a signal to the wakee if
not enough. I'll try this approach in v0.3.


* Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-08 19:46 ` [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls Peter Oskolkov
  2021-07-11 16:35   ` Peter Oskolkov
@ 2021-07-11 18:29   ` Thierry Delisle
  2021-07-12 15:40     ` Peter Oskolkov
  1 sibling, 1 reply; 17+ messages in thread
From: Thierry Delisle @ 2021-07-11 18:29 UTC (permalink / raw)
  To: posk
  Cc: avagin, bsegall, jannh, jnewsome, joel, linux-api, linux-kernel,
	mingo, peterz, pjt, posk, tglx, Peter Buhr, Martin Karsten

 > Let's move the discussion to the new thread.

I'm happy to start a new thread. I'm re-responding to my last post
because many of my questions are still unanswered.

 > + * State transitions:
 > + *
 > + * RUNNING => IDLE:   the current RUNNING task becomes IDLE by calling
 > + *                    sys_umcg_wait();
 >
 > [...]
 >
 > +/**
 > + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
 > + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
 > + * @UMCG_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
 > + *                       (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY must be set.
 > + */
 > +enum umcg_wait_flag {
 > +    UMCG_WAIT_WAKE_ONLY = 1,
 > +    UMCG_WF_CURRENT_CPU = 2,
 > +};

What is the purpose of using sys_umcg_wait without next_tid or with
UMCG_WAIT_WAKE_ONLY? It looks like Java's park/unpark semantics to me,
that is, worker threads can use this for synchronization and mutual
exclusion. In this case, how do these compare to using
FUTEX_WAIT/FUTEX_WAKE?


 > +struct umcg_task {
 > [...]
 > +    /**
 > +     * @server_tid: the TID of the server UMCG task that should be
 > +     *              woken when this WORKER becomes BLOCKED. Can be zero.
 > +     *
 > +     *              If this is a UMCG server, @server_tid should
 > +     *              contain the TID of @self - it will be used to find
 > +     *              the task_struct to wake when pulled from
 > +     *              @idle_servers.
 > +     *
 > +     * Read-only for the kernel, read/write for the userspace.
 > +     */
 > +    uint32_t    server_tid;        /* r   */
 > [...]
 > +    /**
 > +     * @idle_servers_ptr: a single-linked list pointing to the list
 > +     *                    of idle servers. Can be NULL.
 > +     *
 > +     * Readable/writable by both the kernel and the userspace: the
 > +     * userspace adds items to the list, the kernel removes them.
 > +     *
 > +     * TODO: describe how the list works.
 > +     */
 > +    uint64_t    idle_servers_ptr;    /* r/w */
 > [...]
 > +} __attribute__((packed, aligned(8 * sizeof(__u64))));

From the comments and by elimination, I'm guessing that idle_servers_ptr
is somehow used by servers to block until some worker threads become
idle. However, I do not understand how the userspace is expected to use
it. I also do not understand if these link fields form a stack or a
queue, and where the head is.


 > +/**
 > + * sys_umcg_ctl: (un)register a task as a UMCG task.
 > + * @flags:       ORed values from enum umcg_ctl_flag; see below;
 > + * @self:        a pointer to struct umcg_task that describes this
 > + *               task and governs the behavior of sys_umcg_wait if
 > + *               registering; must be NULL if unregistering.
 > + *
 > + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
 > + *         UMCG workers:
 > + *              - self->state must be UMCG_TASK_IDLE
 > + *              - @flags & UMCG_CTL_WORKER
 > + *
 > + *         If the conditions above are met, sys_umcg_ctl() immediately
 > + *         returns if the registered task is a RUNNING server or basic
 > + *         task; an IDLE worker will be added to idle_workers_ptr, and
 > + *         the worker put to sleep; an idle server from idle_servers_ptr
 > + *         will be woken, if any.

This approach to creating UMCG workers concerns me a little. My
understanding is that, in general, the number of servers controls the
amount of parallelism in the program. But in the case of creating new
UMCG workers, the new threads only respect the M:N threading model
after sys_umcg_ctl has blocked. What does this mean for applications
that create thousands of short-lived tasks? Are users expected to
create pools of reusable UMCG workers?


I would suggest adding at least one uint64_t field to the struct
umcg_task that is left as-is by the kernel. This allows implementers of
user-space schedulers to add scheduler-specific data structures to the
threads without needing some kind of table on the side.



* Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-11 18:29   ` Thierry Delisle
@ 2021-07-12 15:40     ` Peter Oskolkov
  2021-07-12 21:44       ` Thierry Delisle
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-12 15:40 UTC (permalink / raw)
  To: Thierry Delisle
  Cc: posk, avagin, bsegall, jannh, jnewsome, joel, linux-api,
	linux-kernel, mingo, peterz, pjt, tglx, Peter Buhr,
	Martin Karsten

On Sun, Jul 11, 2021 at 11:29 AM Thierry Delisle <tdelisle@uwaterloo.ca> wrote:
>
>  > Let's move the discussion to the new thread.
>
> I'm happy to start a new thread. I'm re-responding to my last post
> because many of my questions are still unanswered.
>
>  > + * State transitions:
>  > + *
>  > + * RUNNING => IDLE:   the current RUNNING task becomes IDLE by calling
>  > + *                    sys_umcg_wait();
>  >
>  > [...]
>  >
>  > +/**
>  > + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
>  > + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
>  > + * @UMCG_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
>  > + *                       (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY must be set.
>  > + */
>  > +enum umcg_wait_flag {
>  > +    UMCG_WAIT_WAKE_ONLY = 1,
>  > +    UMCG_WF_CURRENT_CPU = 2,
>  > +};
>
> What is the purpose of using sys_umcg_wait without next_tid or with
> UMCG_WAIT_WAKE_ONLY? It looks like Java's park/unpark semantics to me,
> that is, worker threads can use this for synchronization and mutual
> exclusion. In this case, how do these compare to using
> FUTEX_WAIT/FUTEX_WAKE?

sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
wakes it. These are standard sched operations. If they are emulated
via futexes, fast context switching will require something like
FUTEX_SWAP that was NACKed last year.

>
>
>  > +struct umcg_task {
>  > [...]
>  > +    /**
>  > +     * @server_tid: the TID of the server UMCG task that should be
>  > +     *              woken when this WORKER becomes BLOCKED. Can be zero.
>  > +     *
>  > +     *              If this is a UMCG server, @server_tid should
>  > +     *              contain the TID of @self - it will be used to find
>  > +     *              the task_struct to wake when pulled from
>  > +     *              @idle_servers.
>  > +     *
>  > +     * Read-only for the kernel, read/write for the userspace.
>  > +     */
>  > +    uint32_t    server_tid;        /* r   */
>  > [...]
>  > +    /**
>  > +     * @idle_servers_ptr: a single-linked list pointing to the list
>  > +     *                    of idle servers. Can be NULL.
>  > +     *
>  > +     * Readable/writable by both the kernel and the userspace: the
>  > +     * userspace adds items to the list, the kernel removes them.
>  > +     *
>  > +     * TODO: describe how the list works.
>  > +     */
>  > +    uint64_t    idle_servers_ptr;    /* r/w */
>  > [...]
>  > +} __attribute__((packed, aligned(8 * sizeof(__u64))));
>
> From the comments and by elimination, I'm guessing that idle_servers_ptr
> is somehow used by servers to block until some worker threads become
> idle. However, I do not understand how the userspace is expected to use
> it. I also do not understand if these link fields form a stack or a
> queue, and where the head is.

When a server has nothing to do (no work to run), it is put into IDLE
state and added to the list. The kernel wakes an IDLE server if a
blocked worker unblocks.

>
>
>  > +/**
>  > + * sys_umcg_ctl: (un)register a task as a UMCG task.
>  > + * @flags:       ORed values from enum umcg_ctl_flag; see below;
>  > + * @self:        a pointer to struct umcg_task that describes this
>  > + *               task and governs the behavior of sys_umcg_wait if
>  > + *               registering; must be NULL if unregistering.
>  > + *
>  > + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
>  > + *         UMCG workers:
>  > + *              - self->state must be UMCG_TASK_IDLE
>  > + *              - @flags & UMCG_CTL_WORKER
>  > + *
>  > + *         If the conditions above are met, sys_umcg_ctl() immediately
>  > + *         returns if the registered task is a RUNNING server or basic
>  > + *         task; an IDLE worker will be added to idle_workers_ptr, and
>  > + *         the worker put to sleep; an idle server from idle_servers_ptr
>  > + *         will be woken, if any.
>
> This approach to creating UMCG workers concerns me a little. My
> understanding is that, in general, the number of servers controls the
> amount of parallelism in the program. But in the case of creating new
> UMCG workers, the new threads only respect the M:N threading model
> after sys_umcg_ctl has blocked. What does this mean for applications
> that create thousands of short-lived tasks? Are users expected to
> create pools of reusable UMCG workers?

Yes: task/thread creation is not as lightweight as just posting work
items onto a preexisting pool of workers.

>
>
> I would suggest adding at least one uint64_t field to the struct
> umcg_task that is left as-is by the kernel. This allows implementers of
> user-space schedulers to add scheduler-specific data structures to the
> threads without needing some kind of table on the side.

This is usually achieved by embedding the kernel struct into a larger
userspace/TLS struct. For example:

struct umcg_task_user {
  struct umcg_task umcg_task;
  extra_user_data d1;
  extra_user_ptr p1;
  /* etc. */
} __aligned(...);


* Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-12 15:40     ` Peter Oskolkov
@ 2021-07-12 21:44       ` Thierry Delisle
  2021-07-12 23:31         ` Peter Oskolkov
  0 siblings, 1 reply; 17+ messages in thread
From: Thierry Delisle @ 2021-07-12 21:44 UTC (permalink / raw)
  To: posk
  Cc: avagin, bsegall, jannh, jnewsome, joel, linux-api, linux-kernel,
	mingo, mkarsten, pabuhr, peterz, pjt, posk, tdelisle, tglx

 > sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
 > wakes it. These are standard sched operations. If they are emulated
 > via futexes, fast context switching will require something like
 > FUTEX_SWAP that was NACKed last year.

I understand these wait and wake semantics and the need for the fast
context-switch (swap). As I see it, you need 3 operations:

- SWAP: context-switch directly to a different thread, no scheduler involved
- WAIT: block current thread, go back to server thread
- WAKE: unblock target thread, add it to scheduler, e.g. through
         idle_workers_ptr

There is no existing syscall to handle SWAP, so I agree sys_umcg_wait is
needed for this to work.

However, there already exists sys_futex to handle WAIT and WAKE. When a
worker calls either sys_futex FUTEX_WAIT or sys_umcg_wait with
next_tid == NULL, in both cases the worker will block, SWAP to the
server and wait for FUTEX_WAKE or UMCG_WAIT_WAKE_ONLY, respectively.
It's not obvious to me that there would be a performance difference,
and the semantics seem to be the same to me.

So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?

Is the idea to support workers directly context-switching among each other,
without involving server threads and without going through idle_servers_ptr?

If so, can you explain some of the intended state transitions in this case.


 > > However, I do not understand how the userspace is expected to use
 > > it. I also do not understand if these link fields form a stack or a
 > > queue, and where the head is.
 >
 > When a server has nothing to do (no work to run), it is put into IDLE
 > state and added to the list. The kernel wakes an IDLE server if a
 > blocked worker unblocks.

 From the code in umcg_wq_worker_running (Step 3), I am guessing users are
expected to provide a global head somewhere in memory and
umcg_task.idle_servers_ptr points to the head of the list for all workers.
Servers are then added in user space using atomic_stack_push_user. Is this
correct? I did not find any documentation on the list head.

I like the idea that each worker thread points to a given list; it
allows the possibility of separate containers with their own
independent servers, workers and scheduling. However, it seems that the
list itself could be implemented using existing kernel APIs, for
example a futex or an eventfd. Like so:

struct umcg_task {
      [...]

      /**
       * @idle_futex_ptr: pointer to a futex user for idle server threads.
       *
       * When waking a worker, the kernel decrements the pointed-to
       * futex value if it is non-zero and wakes a server if the
       * decrement occurred.
       *
       * Server threads that have no work to do should increment the futex
       * value and FUTEX_WAIT.
       */
      uint64_t    idle_futex_ptr;    /* r/w */

      [...]
} __attribute__((packed, aligned(8 * sizeof(__u64))));

I believe the futex approach, like the list, has the advantage that
when there are no idle servers, checking the list requires no locking.
I don't know if that can be achieved with eventfd.



* Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-12 21:44       ` Thierry Delisle
@ 2021-07-12 23:31         ` Peter Oskolkov
  2021-07-13 14:02           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-12 23:31 UTC (permalink / raw)
  To: Thierry Delisle
  Cc: avagin, bsegall, jannh, jnewsome, joel, linux-api, linux-kernel,
	mingo, mkarsten, pabuhr, peterz, pjt, posk, tglx

On Mon, Jul 12, 2021 at 2:44 PM Thierry Delisle <tdelisle@uwaterloo.ca> wrote:
>
>  > sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
>  > wakes it. These are standard sched operations. If they are emulated
>  > via futexes, fast context switching will require something like
>  > FUTEX_SWAP that was NACKed last year.
>
> I understand these wait and wake semantics and the need for the fast
> context-switch(swap). As I see it, you need 3 operations:
>
> - SWAP: context-switch directly to a different thread, no scheduler involved
> - WAIT: block current thread, go back to server thread
> - WAKE: unblock target thread, add it to scheduler, e.g. through
>          idle_workers_ptr
>
> There is no existing syscalls to handle SWAP, so I agree sys_umcg_wait is
> needed for this to work.
>
> However, there already exists sys_futex to handle WAIT and WAKE. When a
> worker calls either sys_futex FUTEX_WAIT or sys_umcg_wait with
> next_tid == NULL, in both cases the worker will block, SWAP to the
> server and wait for FUTEX_WAKE or UMCG_WAIT_WAKE_ONLY, respectively.
> It's not obvious to me that there would be a performance difference,
> and the semantics seem to be the same to me.
>
> So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?

Because the approach you described has been tried last year and was NACKed:
https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/

In short, futex maintainers do not want to touch the existing futex
code at all other than for bugfixes. No new futex functionality,
period. See e.g. futex2 efforts:
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/

> Is the idea to support workers directly context-switching among each other,
> without involving server threads and without going through idle_servers_ptr?

Yes.

> If so, can you explain some of the intended state transitions in this case.

Cooperative scheduling: workers can yield to each other (i.e. swap).
This allows building very fast concurrency primitives (mutexes,
producer-consumer queues, etc.). For example, in a producer-consumer
queue the producer can prepare an item and context-switch to the
consumer on-CPU, synchronously: this can be done much faster than
waking the consumer asynchronously, and helps a lot to reduce latency
(throughput? not so much).

>
>
>  > > However, I do not understand how the userspace is expected to use
>  > > it. I also do not understand if these link fields form a stack or a
>  > > queue, and where the head is.
>  >
>  > When a server has nothing to do (no work to run), it is put into IDLE
>  > state and added to the list. The kernel wakes an IDLE server if a
>  > blocked worker unblocks.
>
>  From the code in umcg_wq_worker_running (Step 3), I am guessing users are
> expected to provide a global head somewhere in memory and
> umcg_task.idle_servers_ptr points to the head of the list for all workers.
> Servers are then added in user space using atomic_stack_push_user. Is this
> correct? I did not find any documentation on the list head.

This is going to change somewhat. I'll post a new patchset soon-ish
(later this week?)

>
> I like the idea that each worker thread points to a given list; it
> allows the possibility of separate containers with their own
> independent servers, workers and scheduling. However, it seems that the
> list itself could be implemented using existing kernel APIs, for
> example a futex or an eventfd. Like so:
>
> struct umcg_task {
>       [...]
>
>       /**
>        * @idle_futex_ptr: pointer to a futex user for idle server threads.
>        *
>        * When waking a worker, the kernel decrements the pointed-to
>        * futex value if it is non-zero and wakes a server if the
>        * decrement occurred.
>        *
>        * Server threads that have no work to do should increment the futex
>        * value and FUTEX_WAIT.
>        */
>       uint64_t    idle_futex_ptr;    /* r/w */
>
>       [...]
> } __attribute__((packed, aligned(8 * sizeof(__u64))));
>
> I believe the futex approach, like the list, has the advantage that
> when there are no idle servers, checking the list requires no locking.
> I don't know if that can be achieved with eventfd.

As mentioned above, a futex-based solution was rejected by
maintainers. Believe me, I tried. Not only tried, we have a working
userspace scheduling stack on top of FUTEX_SWAP deployed internally at
Google, with some actual usage (mostly testing/debugging workloads). I
suggest we stop discussing futexes in this context - I do not see the
maintainers changing their position. And the approach used in this
patchset seems to work (and I actually like how the code is shaping
up).


* Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
  2021-07-12 23:31         ` Peter Oskolkov
@ 2021-07-13 14:02           ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2021-07-13 14:02 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Thierry Delisle, avagin, bsegall, jannh, jnewsome, joel,
	linux-api, linux-kernel, mingo, mkarsten, pabuhr, pjt, posk,
	tglx

On Mon, Jul 12, 2021 at 04:31:01PM -0700, Peter Oskolkov wrote:
> On Mon, Jul 12, 2021 at 2:44 PM Thierry Delisle <tdelisle@uwaterloo.ca> wrote:

> > So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?
> 
> Because the approach you described has been tried last year and was NACKed:
> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
> 
> In short, futex maintainers do not want to touch the existing futex
> code at all other than for bugfixes. No new futex functionality,
> period. See e.g. futex2 efforts:
> https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/

These are two orthogonal issues. We do not want to make the futex
multiplex monster worse, but that's not the reason for rejecting
FUTEX_SWAP.

The problem with FUTEX_SWAP is that it doesn't even begin to solve the
posed problem, namely N:M threading that natively allows blocking
syscalls (IOW without wrapping all syscalls).

This means we need kernel->user notification of tasks that block and
wakeup, such that the userspace scheduler can adequately react. This is
not something that sanely fits in futex.

It also requires an additional kernel side block point such that tasks
that blocked in-kernel, will not resume userspace when the userspace
scheduler decided to run another task in its stead.

These things are what resulted in UMCG.


* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-09 16:57     ` Peter Oskolkov
  2021-07-09 17:33       ` Peter Oskolkov
@ 2021-07-13 16:10       ` Peter Zijlstra
  2021-07-13 17:14         ` Peter Oskolkov
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2021-07-13 16:10 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Joel Fernandes, Andrei Vagin,
	Jim Newsome, Jann Horn

On Fri, Jul 09, 2021 at 09:57:32AM -0700, Peter Oskolkov wrote:
> On Fri, Jul 9, 2021 at 1:02 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > This is horrible... Jann is absolutely right, you do not, *ever* do
> > userspace spinlocks. What's wrong with the trivial lockless single
> > linked list approach?
> 
> I'm not sure how to get around the ABA problem with a lockless single
> linked list: https://en.wikipedia.org/wiki/ABA_problem

I'm familiar with the problem. I'm just not sure how we got there.

Last time we had umcg_blocked_ptr / umcg_runnable_ptr which were kernel
append, user clear single linked lists used for RUNNING->BLOCKED and
BLOCKED->RUNNABLE notifications.

But those seem gone, instead we now have idle_servers_ptr /
idle_workers_ptr. I've not yet fully digested things, but they seem to
implement some 'SMP' policy. Can we please forget about the whole SMP
thing for now and focus on getting the basics sorted?

So if we implement the bits outlined here:

  https://lore.kernel.org/linux-api/20210609125435.GA68187@worktop.programming.kicks-ass.net/
  https://lore.kernel.org/linux-api/YMJTyVVdylyHtkeW@hirez.programming.kicks-ass.net/

Then what is missing for full N:1 (aka UP)?

One thing I've considered is that we might want to add u64 timestamps
for the various state transitions, such that userspace can compute
elapsed time which is useful for more dynamic scheduling policies.

Another is preemption; I'm thinking we can drive that with signals, but
that does give complications -- signals are super gross API wise.
Another method would be much preferred I think. We could add another
syscall which allows userspace (from eg. SIGALRM/SIGPROF/SIGVTALRM) to
force a worker to do a RUNNING->RUNNABLE transition and schedule back to
the server.


Then lets consider N:M (aka SMP). The basics of SMP is sharing work
between servers. For a large part userspace can already do that by
keeping a shared ready queue. Servers that go idle pick up work,
re-assign it to themselves and go.

The pain-point seems to be *efficient* BLOCKED->RUNNABLE notifications
across servers. Inefficient options include the userspace servers
iterating all known other servers and trying to steal their
umcg_runnable_ptr and processing it. This is 'difficult' in that there
is no natural wakeup, and hence letting a server go IDLE will increase
latency and destroy work conservation.

The alternative seems to be to let the kernel do the server iteration,
looking for a RUNNING one and using that umcg_runnable_ptr field and
kicking it. For that we can set up an immutable linked list between
struct umcg_task's, a circular single linked list that the kernel
iterates until it's back where it started. Then there is no dynamic
state.

Hmm?


* Re: [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers
  2021-07-13 16:10       ` Peter Zijlstra
@ 2021-07-13 17:14         ` Peter Oskolkov
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Oskolkov @ 2021-07-13 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Joel Fernandes, Andrei Vagin,
	Jim Newsome, Jann Horn

On Tue, Jul 13, 2021 at 9:10 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 09, 2021 at 09:57:32AM -0700, Peter Oskolkov wrote:
> > On Fri, Jul 9, 2021 at 1:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > This is horrible... Jann is absolutely right, you do not, *ever* do
> > > userspace spinlocks. What's wrong with the trivial lockless single
> > > linked list approach?
> >
> > I'm not sure how to get around the ABA problem with a lockless single
> > linked list: https://en.wikipedia.org/wiki/ABA_problem
>
> I'm familiar with the problem. I'm just not sure how we got there.
>
> Last time we had umcg_blocked_ptr / umcg_runnable_ptr which were kernel
> append, user clear single linked lists used for RUNNING->BLOCKED and
> BLOCKED->RUNNABLE notifications.
>
> But those seem gone, instead we now have idle_servers_ptr /
> idle_workers_ptr. I've not yet fully digested things, but they seem to
> implement some 'SMP' policy. Can we please forget about the whole SMP
> thing for now and focus on getting the basics sorted?

Hmm... yes, I was always working on the M:N model, i.e. multiple
servers, i.e. SMP. Apologies if this was not clear.

I think I'm close to having "SMP basics sorted" - I believe it will now
take me longer to go back to "UP" and then extend that to SMP, and the
result would probably not be as clean as having SMP there from the
beginning. Just give me another couple of days, please!

>
> So if we implement the bits outlined here:
>
>   https://lore.kernel.org/linux-api/20210609125435.GA68187@worktop.programming.kicks-ass.net/
>   https://lore.kernel.org/linux-api/YMJTyVVdylyHtkeW@hirez.programming.kicks-ass.net/
>
> Then what is missing for full N:1 (aka UP)?
>
> One thing I've considered is that we might want to add u64 timestamps
> for the various state transitions, such that userspace can compute
> elapsed time which is useful for more dynamic scheduling policies.

Tagging state transitions with unique-per-task tags (counters) would
simplify a lot of things in userspace, as these tags can be used to
easily sort out block/wakeup/swap races. If the tags have timestamp
semantics (e.g. nanoseconds), even better: as you suggest, scheduling
can be tweaked, and tracing/instrumentation will naturally use them.

Synchronizing state+timestamp updates will be a challenge unless we
share a single 64-bit field for state + timestamp: 6 bytes for the
timestamp and two bytes for the state (I think one byte for the state
is too tight, although it would be enough initially). 6 bytes of nanos
wrap around in about 3.26 days (2^48 ns), so uniqueness should not be
an issue; and userspace can always easily restore the upper two bytes
of the timestamp if needed.

So... should I convert u32 state into u64 state+timestamp field?
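A minimal sketch of the packing described above, assuming one possible
layout (state in the low 16 bits, the low 48 bits of the nanosecond
clock above it - the actual layout would be whatever the uapi settles
on):

```c
#include <stdint.h>

/* Hypothetical layout: low 16 bits hold the state, high 48 bits hold
 * the low 48 bits of a nanosecond timestamp. 2^48 ns is ~3.26 days,
 * matching the "cycles over in a couple of days" estimate above. */
#define UMCG_STATE_BITS 16
#define UMCG_STATE_MASK ((1ULL << UMCG_STATE_BITS) - 1)

static inline uint64_t umcg_pack(uint16_t state, uint64_t ts_ns)
{
	/* The shift silently truncates ts_ns to its low 48 bits. */
	return (ts_ns << UMCG_STATE_BITS) | state;
}

static inline uint16_t umcg_state(uint64_t word)
{
	return (uint16_t)(word & UMCG_STATE_MASK);
}

static inline uint64_t umcg_ts(uint64_t word)
{
	return word >> UMCG_STATE_BITS; /* low 48 bits of the ns clock */
}
```

The point of packing both into one u64 is that a single atomic
load/store/cmpxchg then updates state and timestamp together, with no
window where another thread sees a state paired with a stale timestamp.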

>
> Another is preemption; I'm thinking we can drive that with signals, but
> that does give complications -- signals are super gross API wise.
> Another method would be much preferred I think. We could add another
> syscall which allows userspace (from eg. SIGALRM/SIGPROF/SIGVTALRM) to
> force a worker to do a RUNNING->RUNNABLE transition and schedule back to
> the server.

I haven't looked into preemption yet, so I'll try any approach you
suggest (after the "basics" are sorted out).

> Then lets consider N:M (aka SMP). The basics of SMP is sharing work
> between servers. For a large part userspace can already do that by
> keeping a shared ready queue. Servers that go idle pick up a work,
> re-assign it to themselves and go.
>
> The pain-point seems to be *efficient* BLOCKED->RUNNABLE notifications
> across servers. Inefficient options include the userspace servers
> iterating all known other servers and trying to steal their
> umcg_runnable_ptr and processing it. This is 'difficult' in that there
> is no natural wakeup and hence letting a server do IDLE will increase
> latency and destroy work conservation.
>
> The alternative seems to be to let the kernel do the server iteration,
> looking for a RUNNING one and using that umcg_runnable_ptr field and
> kicking it. For that we can set up an immutable linked list between
> struct umcg_task's, a circular single linked list that the kernel
> iterates until it's back where it started. Then there is no dynamic
> state.

In full-utilization scenarios, which are typical in our use cases
(custom scheduling is not that useful if there are idle CPUs/servers),
the kernel will quite often find no idle servers, so having a dynamic
list/stack of idle servers is more efficient. I'll post what I have in
mind (the kernel marks servers as deleted, userspace does cleanup/GC)
soon.
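For comparison, the immutable circular list quoted above would let the
kernel walk the server ring with no dynamic state at all. A sketch of
that iteration (struct and field names are purely illustrative, not the
actual uapi):

```c
#include <stddef.h>
#include <stdint.h>

enum { UMCG_TASK_NONE, UMCG_TASK_RUNNING, UMCG_TASK_IDLE };

struct umcg_task {
	uint32_t state;
	struct umcg_task *next_server; /* immutable once the ring is built */
};

/* Walk the ring starting at @start until a RUNNING server is found or
 * we are back where we started. Because the links never change, no
 * locking or reclamation scheme is needed for the traversal itself. */
static struct umcg_task *find_running_server(struct umcg_task *start)
{
	struct umcg_task *t = start;

	do {
		if (t->state == UMCG_TASK_RUNNING)
			return t;
		t = t->next_server;
	} while (t != start);

	return NULL; /* no RUNNING server on the ring */
}
```

The trade-off raised in the reply above is visible here: under full
utilization the walk visits every server only to return NULL, which is
what makes a dynamic idle-server list attractive despite its
reclamation complexity.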



Thread overview: 17+ messages
-- links below jump to the message on this page --
2021-07-08 19:46 [RFC PATCH 0/3 v0.2] RFC: sched/UMCG Peter Oskolkov
2021-07-08 19:46 ` [RFC PATCH 1/3 v0.2] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
2021-07-08 19:46 ` [RFC PATCH 2/3 v0.2] sched/umcg: RFC: add userspace atomic helpers Peter Oskolkov
2021-07-08 21:12   ` Jann Horn
2021-07-09  4:01     ` Peter Oskolkov
2021-07-09  8:01   ` Peter Zijlstra
2021-07-09 16:57     ` Peter Oskolkov
2021-07-09 17:33       ` Peter Oskolkov
2021-07-13 16:10       ` Peter Zijlstra
2021-07-13 17:14         ` Peter Oskolkov
2021-07-08 19:46 ` [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls Peter Oskolkov
2021-07-11 16:35   ` Peter Oskolkov
2021-07-11 18:29   ` Thierry Delisle
2021-07-12 15:40     ` Peter Oskolkov
2021-07-12 21:44       ` Thierry Delisle
2021-07-12 23:31         ` Peter Oskolkov
2021-07-13 14:02           ` Peter Zijlstra
