* [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups
@ 2021-11-22 21:13 Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 1/6] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

User Managed Concurrency Groups (UMCG) is an M:N threading
subsystem/toolkit that lets user space application developers implement
in-process user space schedulers.

This v0.9.1 patchset is the same as v0.9, except that u32/u64 in
uapi/linux/umcg.h are replaced with __u32/__u64, as the kernel test
robot/LKP does not recognize u32/u64 in uapi headers.

v0.9 is v0.8 rebased on top of the current tip/sched/core,
with a fix in umcg_update_state() for an issue reported by Tao Zhou.

Key changes from patchset v0.7:
https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/:

- added libumcg tools/lib/umcg;
- worker "wakeup" is reworked so that it is now purely a userspace op,
  instead of waking the thread only for it to block again immediately
  on its return to userspace;
- a couple of minor fixes and refactorings.

These big things remain to be addressed (in no particular order):
- support tracing/debugging
- make context switches faster (see umcg_do_context_switch in umcg.c)
- support other architectures
- clean up and post selftests in tools/testing/selftests/umcg/
- allow cross-mm wakeups (securely)


Peter Oskolkov (6):
  sched/umcg: add WF_CURRENT_CPU and externise ttwu
  mm, x86/uaccess: add userspace atomic helpers
  sched/umcg: implement UMCG syscalls
  sched/umcg, lib/umcg: implement libumcg
  sched/umcg: add Documentation/userspace-api/umcg.txt
  sched/umcg, lib/umcg: add tools/lib/umcg/libumcg.txt

 Documentation/userspace-api/umcg.txt   |  598 ++++++++++++
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 arch/x86/include/asm/uaccess_64.h      |   93 ++
 fs/exec.c                              |    1 +
 include/linux/sched.h                  |   71 ++
 include/linux/syscalls.h               |    3 +
 include/linux/uaccess.h                |   46 +
 include/uapi/asm-generic/unistd.h      |    7 +-
 include/uapi/linux/umcg.h              |  137 +++
 init/Kconfig                           |   10 +
 kernel/entry/common.c                  |    4 +-
 kernel/exit.c                          |    5 +
 kernel/sched/Makefile                  |    1 +
 kernel/sched/core.c                    |   12 +-
 kernel/sched/fair.c                    |    4 +
 kernel/sched/sched.h                   |   15 +-
 kernel/sched/umcg.c                    |  949 +++++++++++++++++++
 kernel/sys_ni.c                        |    4 +
 mm/maccess.c                           |  264 ++++++
 tools/lib/umcg/.gitignore              |    4 +
 tools/lib/umcg/Makefile                |   11 +
 tools/lib/umcg/libumcg.c               | 1202 ++++++++++++++++++++++++
 tools/lib/umcg/libumcg.h               |  299 ++++++
 tools/lib/umcg/libumcg.txt             |  438 +++++++++
 24 files changed, 4168 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/userspace-api/umcg.txt
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c
 create mode 100644 tools/lib/umcg/.gitignore
 create mode 100644 tools/lib/umcg/Makefile
 create mode 100644 tools/lib/umcg/libumcg.c
 create mode 100644 tools/lib/umcg/libumcg.h
 create mode 100644 tools/lib/umcg/libumcg.txt


base-commit: cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf
--
2.25.1



* [PATCH v0.9.1 1/6] sched/umcg: add WF_CURRENT_CPU and externise ttwu
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
@ 2021-11-22 21:13 ` Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 2/6] mm, x86/uaccess: add userspace atomic helpers Peter Oskolkov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

Add a WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases such as UMCG.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.
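
As a rough illustration (not part of this patch), a caller outside of
sched/core.c can then wake a chosen task onto the waker's CPU along the
lines below. This mirrors what the UMCG code later in this series does
in umcg_ttwu(); the function name here is made up for the sketch:

  /* Sketch: wake the task identified by @next_tid on the current CPU. */
  static int wake_on_current_cpu(u32 next_tid)
  {
          struct task_struct *next;

          rcu_read_lock();
          next = find_task_by_vpid(next_tid);
          if (!next) {
                  rcu_read_unlock();
                  return -ESRCH;
          }
          /* WF_CURRENT_CPU advises select_task_rq_fair() to keep the wakee here. */
          try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU);
          rcu_read_unlock();
          return 0;
  }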

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 15 +++++++++------
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index beaa8be6241e..5344aa0afe5a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 884f29d07963..399422e6479b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);

+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eb971151e7e4..5e1ecf89c12b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }

 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */

-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */

 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
--
2.25.1



* [PATCH v0.9.1 2/6] mm, x86/uaccess: add userspace atomic helpers
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 1/6] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
@ 2021-11-22 21:13 ` Peter Oskolkov
  2021-11-24 14:31   ` Peter Zijlstra
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

In addition to futexes needing to do atomic operations on userspace
memory, a second use case is now in the works (UMCG, see
https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/),
so a generic facility to perform these operations has been called for
(see https://lore.kernel.org/all/87ilyk9xc0.ffs@tglx/).

Add a set of generic helpers to perform 32/64-bit xchg and cmpxchg
operations on userspace memory. Also implement the required
architecture-specific support on x86_64.
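
As an illustration of the intended use (a sketch only; uaddr, old_state
and new_state are made-up names), the may-fault 64-bit helper follows
the usual compare-exchange pattern:

  u64 expected = old_state;
  int ret;

  /* Atomically set *uaddr to new_state iff it still equals expected. */
  ret = cmpxchg_user_64(uaddr, &expected, new_state);
  if (ret == -EAGAIN) {
          /* Lost the race: expected now holds the current value of *uaddr. */
  } else if (ret) {
          /* -EFAULT or -EINVAL: bad or misaligned user pointer. */
  }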

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 arch/x86/include/asm/uaccess_64.h |  93 +++++++++++
 include/linux/uaccess.h           |  46 ++++++
 mm/maccess.c                      | 264 ++++++++++++++++++++++++++++++
 3 files changed, 403 insertions(+)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e04d771..41e2f96d3ec4 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -79,4 +79,97 @@ __copy_from_user_flushcache(void *dst, const void __user *src, unsigned size)
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
+
+#define ARCH_HAS_ATOMIC_UACCESS_HELPERS 1
+
+static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
+						u32 oldval, u32 newval)
+{
+	int ret = 0;
+
+	asm volatile("\n"
+		"1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
+		"2:\n"
+		"\t.section .fixup, \"ax\"\n"
+		"3:\tmov     %3, %0\n"
+		"\tjmp     2b\n"
+		"\t.previous\n"
+		_ASM_EXTABLE_UA(1b, 3b)
+		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+		: "i" (-EFAULT), "r" (newval), "1" (oldval)
+		: "memory"
+	);
+	*uval = oldval;
+	return ret;
+}
+
+static inline int __try_cmpxchg_user_64(u64 *uval, u64 __user *uaddr,
+						u64 oldval, u64 newval)
+{
+	int ret = 0;
+
+	asm volatile("\n"
+		"1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
+		"2:\n"
+		"\t.section .fixup, \"ax\"\n"
+		"3:\tmov     %3, %0\n"
+		"\tjmp     2b\n"
+		"\t.previous\n"
+		_ASM_EXTABLE_UA(1b, 3b)
+		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
+		: "i" (-EFAULT), "r" (newval), "1" (oldval)
+		: "memory"
+	);
+	*uval = oldval;
+	return ret;
+}
+
+static inline int __try_xchg_user_32(u32 *oval, u32 __user *uaddr, u32 newval)
+{
+	u32 oldval = 0;
+	int ret = 0;
+
+	asm volatile("\n"
+		"1:\txchgl %0, %2\n"
+		"2:\n"
+		"\t.section .fixup, \"ax\"\n"
+		"3:\tmov     %3, %1\n"
+		"\tjmp     2b\n"
+		"\t.previous\n"
+		_ASM_EXTABLE_UA(1b, 3b)
+		: "=r" (oldval), "=r" (ret), "+m" (*uaddr)
+		: "i" (-EFAULT), "0" (newval), "1" (0)
+	);
+
+	if (ret)
+		return ret;
+
+	*oval = oldval;
+	return 0;
+}
+
+static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
+{
+	u64 oldval = 0;
+	int ret = 0;
+
+	asm volatile("\n"
+		"1:\txchgq %0, %2\n"
+		"2:\n"
+		"\t.section .fixup, \"ax\"\n"
+		"3:\tmov     %3, %1\n"
+		"\tjmp     2b\n"
+		"\t.previous\n"
+		_ASM_EXTABLE_UA(1b, 3b)
+		: "=r" (oldval), "=r" (ret), "+m" (*uaddr)
+		: "i" (-EFAULT), "0" (newval), "1" (0)
+	);
+
+	if (ret)
+		return ret;
+
+	*oval = oldval;
+	return 0;
+}
+
 #endif /* _ASM_X86_UACCESS_64_H */
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ac0394087f7d..dcb3ac093075 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -408,4 +408,50 @@ void __noreturn usercopy_abort(const char *name, const char *detail,
 			       unsigned long len);
 #endif

+#ifdef ARCH_HAS_ATOMIC_UACCESS_HELPERS
+/**
+ * cmpxchg_user_[32|64][_nofault|]() - compare_exchange 32/64-bit values
+ * @uaddr:     Destination address, in user space;
+ * @curr_val:  Source address, in kernel space;
+ * @new_val:   The value to write to the destination address.
+ *
+ * This is the standard cmpxchg: atomically: compare *@uaddr to *@curr_val;
+ * if the values match, write @new_val to @uaddr, return 0; if the values
+ * do not match, write *@uaddr to @curr_val, return -EAGAIN.
+ *
+ * The _nofault versions don't fault and can be used in
+ * atomic/preempt-disabled contexts.
+ *
+ * Return:
+ * 0      : OK/success;
+ * -EINVAL: @uaddr is not properly aligned ('may fault' versions only);
+ * -EFAULT: memory access error (including mis-aligned @uaddr in _nofault);
+ * -EAGAIN: *@curr_val did not match *@uaddr.
+ */
+int cmpxchg_user_32_nofault(u32 __user *uaddr, u32 *curr_val, u32 new_val);
+int cmpxchg_user_64_nofault(u64 __user *uaddr, u64 *curr_val, u64 new_val);
+int cmpxchg_user_32(u32 __user *uaddr, u32 *curr_val, u32 new_val);
+int cmpxchg_user_64(u64 __user *uaddr, u64 *curr_val, u64 new_val);
+
+/**
+ * xchg_user_[32|64][_nofault|]() - exchange 32/64-bit values
+ * @uaddr:   Destination address, in user space;
+ * @val:     Source address, in kernel space.
+ *
+ * This is the standard atomic xchg: exchange values pointed to by @uaddr and @val.
+ *
+ * The _nofault versions don't fault and can be used in
+ * atomic/preempt-disabled contexts.
+ *
+ * Return:
+ * 0      : OK/success;
+ * -EINVAL: @uaddr is not properly aligned ('may fault' versions only);
+ * -EFAULT: memory access error (including mis-aligned @uaddr in _nofault).
+ */
+int xchg_user_32_nofault(u32 __user *uaddr, u32 *val);
+int xchg_user_64_nofault(u64 __user *uaddr, u64 *val);
+int xchg_user_32(u32 __user *uaddr, u32 *val);
+int xchg_user_64(u64 __user *uaddr, u64 *val);
+#endif		/* ARCH_HAS_ATOMIC_UACCESS_HELPERS */
+
 #endif		/* __LINUX_UACCESS_H__ */
diff --git a/mm/maccess.c b/mm/maccess.c
index d3f1a1f0b1c1..620556b11550 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -335,3 +335,267 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count)

 	return ret;
 }
+
+#ifdef ARCH_HAS_ATOMIC_UACCESS_HELPERS
+
+static int fix_pagefault(unsigned long uaddr, bool write_fault, int bytes)
+{
+	struct mm_struct *mm = current->mm;
+	int ret;
+
+	mmap_read_lock(mm);
+	ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
+			NULL);
+	mmap_read_unlock(mm);
+
+	return ret < 0 ? ret : 0;
+}
+
+int cmpxchg_user_32_nofault(u32 __user *uaddr, u32 *curr_val, u32 new_val)
+{
+	int ret = -EFAULT;
+	u32 __old = *curr_val;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
+		pagefault_enable();
+		return -EFAULT;
+	}
+	ret = __try_cmpxchg_user_32(curr_val, uaddr, __old, new_val);
+	user_access_end();
+
+	if (!ret)
+		ret =  *curr_val == __old ? 0 : -EAGAIN;
+
+	pagefault_enable();
+	return ret;
+}
+
+int cmpxchg_user_64_nofault(u64 __user *uaddr, u64 *curr_val, u64 new_val)
+{
+	int ret = -EFAULT;
+	u64 __old = *curr_val;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
+		pagefault_enable();
+		return -EFAULT;
+	}
+	ret = __try_cmpxchg_user_64(curr_val, uaddr, __old, new_val);
+	user_access_end();
+
+	if (!ret)
+		ret =  *curr_val == __old ? 0 : -EAGAIN;
+
+	pagefault_enable();
+
+	return ret;
+}
+
+int cmpxchg_user_32(u32 __user *uaddr, u32 *curr_val, u32 new_val)
+{
+	int ret = -EFAULT;
+	u32 __old = *curr_val;
+
+	/* Validate proper alignment. */
+	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
+			((unsigned long)curr_val % sizeof(*curr_val))))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	while (true) {
+		ret = -EFAULT;
+		if (!user_access_begin(uaddr, sizeof(*uaddr)))
+			break;
+
+		ret = __try_cmpxchg_user_32(curr_val, uaddr, __old, new_val);
+		user_access_end();
+
+		if (!ret) {
+			ret =  *curr_val == __old ? 0 : -EAGAIN;
+			break;
+		}
+
+		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+			break;
+	}
+
+	pagefault_enable();
+	return ret;
+}
+
+int cmpxchg_user_64(u64 __user *uaddr, u64 *curr_val, u64 new_val)
+{
+	int ret = -EFAULT;
+	u64 __old = *curr_val;
+
+	/* Validate proper alignment. */
+	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
+			((unsigned long)curr_val % sizeof(*curr_val))))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	while (true) {
+		ret = -EFAULT;
+		if (!user_access_begin(uaddr, sizeof(*uaddr)))
+			break;
+
+		ret = __try_cmpxchg_user_64(curr_val, uaddr, __old, new_val);
+		user_access_end();
+
+		if (!ret) {
+			ret =  *curr_val == __old ? 0 : -EAGAIN;
+			break;
+		}
+
+		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+			break;
+	}
+
+	pagefault_enable();
+	return ret;
+}
+
+/**
+ * xchg_user_[32|64][_nofault|]() - exchange 32/64-bit values
+ * @uaddr:   Destination address, in user space;
+ * @val:     Source address, in kernel space.
+ *
+ * This is the standard atomic xchg: exchange values pointed to by @uaddr and @val.
+ *
+ * The _nofault versions don't fault and can be used in
+ * atomic/preempt-disabled contexts.
+ *
+ * Return:
+ * 0      : OK/success;
+ * -EINVAL: @uaddr is not properly aligned ('may fault' versions only);
+ * -EFAULT: memory access error (including mis-aligned @uaddr in _nofault).
+ */
+int xchg_user_32_nofault(u32 __user *uaddr, u32 *val)
+{
+	int ret;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
+		pagefault_enable();
+		return -EFAULT;
+	}
+
+	ret = __try_xchg_user_32(val, uaddr, *val);
+	user_access_end();
+
+	pagefault_enable();
+
+	return ret;
+}
+
+int xchg_user_64_nofault(u64 __user *uaddr, u64 *val)
+{
+	int ret;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
+		pagefault_enable();
+		return -EFAULT;
+	}
+
+	ret = __try_xchg_user_64(val, uaddr, *val);
+	user_access_end();
+
+	pagefault_enable();
+
+	return ret;
+}
+
+int xchg_user_32(u32 __user *uaddr, u32 *val)
+{
+	int ret = -EFAULT;
+
+	/* Validate proper alignment. */
+	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
+			((unsigned long)val % sizeof(*val))))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	while (true) {
+		ret = -EFAULT;
+		if (!user_access_begin(uaddr, sizeof(*uaddr)))
+			break;
+
+		ret = __try_xchg_user_32(val, uaddr, *val);
+		user_access_end();
+
+		if (!ret)
+			break;
+
+		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+			break;
+	}
+
+	pagefault_enable();
+
+	return ret;
+}
+
+int xchg_user_64(u64 __user *uaddr, u64 *val)
+{
+	int ret = -EFAULT;
+
+	/* Validate proper alignment. */
+	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
+			((unsigned long)val % sizeof(*val))))
+		return -EINVAL;
+
+	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
+		return -EFAULT;
+
+	pagefault_disable();
+
+	while (true) {
+		ret = -EFAULT;
+		if (!user_access_begin(uaddr, sizeof(*uaddr)))
+			break;
+
+		ret = __try_xchg_user_64(val, uaddr, *val);
+		user_access_end();
+
+		if (!ret)
+			break;
+
+		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
+			break;
+	}
+
+	pagefault_enable();
+
+	return ret;
+}
+#endif		/* ARCH_HAS_ATOMIC_UACCESS_HELPERS */
--
2.25.1



* [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 1/6] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 2/6] mm, x86/uaccess: add userspace atomic helpers Peter Oskolkov
@ 2021-11-22 21:13 ` Peter Oskolkov
  2021-11-24 18:36   ` kernel test robot
                     ` (5 more replies)
  2021-11-22 21:13 ` [PATCH v0.9.1 4/6] sched/umcg, lib/umcg: implement libumcg Peter Oskolkov
                   ` (3 subsequent siblings)
  6 siblings, 6 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

Define struct umcg_task and two syscalls: sys_umcg_ctl() and sys_umcg_wait().

User Managed Concurrency Groups is an M:N threading toolkit that allows
constructing user space schedulers designed to efficiently manage
heterogeneous in-process workloads while maintaining high CPU
utilization (95%+).

In addition, M:N threading and cooperative user space scheduling
enable a synchronous coding style and better cache locality compared
to the asynchronous callback/continuation style of programming.

The UMCG kernel API is built around the following ideas:

* UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
* UMCG worker: a task/thread representing "application threads", to be
  scheduled over servers;
* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
  server or a worker) can be in;
* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
  can be ORed with the task state to communicate additional information to
  the kernel;
* struct umcg_task: a per-task userspace set of data fields, usually
  residing in the TLS, that fully reflects the current task's UMCG state
  and controls the way the kernel manages the task;
* sys_umcg_ctl(): a syscall used to register the current task/thread as a
  server or a worker, or to unregister a UMCG task;
* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
  wake another task, potentially context-switching between the two tasks
  on-CPU synchronously.
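
To make the above concrete, a minimal userspace registration sketch
could look as follows (illustrative only: the syscall numbers, flags and
states come from the sibling patches, while the raw syscall() wrapper
and the TLS placement are assumptions of this sketch):

  /* Sketch: register the current thread as a UMCG server.
   * Assumes <linux/umcg.h>, <sys/syscall.h>, <unistd.h>, <string.h>.
   */
  static __thread struct umcg_task umcg_self;    /* usually lives in TLS */

  static long umcg_register_server(void)
  {
          memset(&umcg_self, 0, sizeof(umcg_self));
          /* Servers register as RUNNING; a worker would instead set
           * UMCG_TASK_BLOCKED and pass UMCG_CTL_REGISTER | UMCG_CTL_WORKER.
           */
          umcg_self.state_ts = UMCG_TASK_RUNNING;
          return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
  }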

In short, servers can be thought of as CPUs over which application
threads (workers) are scheduled; at any one time a worker is either:
- RUNNING: has a server and is schedulable by the kernel;
- BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
- IDLE: is not blocked, but cannot be scheduled by the kernel to
  run because it has no server assigned to it (e.g. because all
  available servers are busy "running" other workers).

Usually the number of servers in a process is equal to the number of
CPUs available to the kernel if the process is supposed to consume
the whole machine, or less than the number of CPUs available if the
process is sharing the machine with other workloads. The number of
workers in a process can grow very large: tens of thousands is normal;
hundreds of thousands and more (millions) is something that would
be desirable to achieve in the future, as lightweight userspace
threads in Java and Go easily scale to millions, and UMCG workers
are (intended to be) conceptually similar to those.

Detailed use cases and API behavior are provided in
Documentation/userspace-api/umcg.txt (see sibling patches).

Some high-level implementation notes:

UMCG tasks (workers and servers) are "tagged" with struct umcg_task
residing in userspace (usually in TLS) to facilitate kernel/userspace
communication. This makes the kernel-side code much simpler (see e.g.
the implementation of sys_umcg_wait), but also requires some careful
uaccess handling and page pinning (see below).

The main UMCG server/worker interaction looks like:

a. worker W1 is RUNNING, with a server S attached to it sleeping
   in IDLE state;
b. worker W1 blocks in the kernel, e.g. on I/O;
c. the kernel marks W1 as BLOCKED, the attached server S
   as RUNNING, and wakes S (the "block detection" event);
d. the server now picks another IDLE worker W2 to run: marks
   W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait();
e. when the blocking operation of W1 completes, the worker
   is marked by the kernel as IDLE and added to the idle workers list
   (see struct umcg_task) for the userspace to pick up and
   later run (the "wake detection" event).

While there are additional operations such as worker-to-worker
context switch, preemption, workers "yielding", etc., the "workflow"
above is the main worker/server interaction that drives the
implementation.
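
Step (d) of the workflow above, expressed as a rough userspace sketch
(error handling and the atomic state-update details are elided;
pick_idle_worker(), set_state() and the *_tid/*_self variables are
hypothetical helpers and names):

  /* Server S: hand the CPU over to an idle worker W2 and go idle. */
  struct umcg_task *w2 = pick_idle_worker();      /* hypothetical */

  w2->next_tid = server_tid;              /* a RUNNING worker points at its server */
  set_state(w2, UMCG_TASK_RUNNING);       /* W2: IDLE -> RUNNING */
  server_self->next_tid = w2_tid;         /* the task to context-switch into */
  set_state(server_self, UMCG_TASK_IDLE); /* S: RUNNING -> IDLE */
  syscall(__NR_umcg_wait, 0, 0);          /* sleep until the next block/wake event */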

Specifically:

- most operations are conceptually context switches:
    - scheduling a worker: a running server goes to sleep and "runs"
      a worker in its place;
    - block detection: worker is descheduled, and its server is woken;
    - wake detection: woken worker, running in the kernel, is descheduled,
      and if there is an idle server, it is woken to process the wake
      detection event;
- to facilitate low scheduling latencies and cache locality, most
  server/worker interactions described above are performed synchronously
  "on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
  the context switches are simulated by putting the switch-out task to
  sleep and waking the switch-into task on the same cpu, it is very much
  the long-term goal of this project to make the context switch much
  lighter, by tweaking runtime accounting and, maybe, even bypassing
  __schedule();
- worker blocking is detected in a hook to sched_submit_work; as mentioned
  above, the server is to be woken on the same CPU, synchronously;
  this code may not pagefault, so to access the worker's and the server's
  userspace memory (struct umcg_task) without faulting, the pages containing
  their struct umcg_task are pinned when the worker exits to the userspace,
  and unpinned when the worker is descheduled;
- worker wakeup is detected in a hook to sched_update_worker, and processed
  in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
  pagefault on the wakeup path;
- worker preemption is implemented by the userspace tagging the worker
  with UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
  on the exit to usermode the worker is intercepted and its server is woken
  (see Documentation/userspace-api/umcg.txt for more details);
- each state change is tagged with a unique timestamp (of MONOTONIC
  variety), so that
    - scheduling instrumentation is naturally available;
    - racing state changes are easily detected and ABA issues are
      avoided;
  see umcg_update_state() in umcg.c for implementation details, and
  Documentation/userspace-api/umcg.txt for a higher-level
  description.
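
For reference, the state/timestamp packing mentioned above boils down to
the following (a restatement of umcg_update_state() and the uapi header,
not new logic; "next" and "expected" are the cmpxchg operands):

  /*
   * state_ts layout (64 bits):
   *   bits  0 -  5: task state (NONE/RUNNING/IDLE/BLOCKED)
   *   bits  6 -  7: state flags (LOCKED, PREEMPTED)
   *   bits  8 - 17: reserved/userspace bits
   *   bits 18 - 63: 46-bit CLOCK_MONOTONIC timestamp, 16ns units
   */
  u64 ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;

  ts &= (1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1;
  next = (next & UMCG_TASK_STATE_MASK_FULL) |
         (ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
  /* ...followed by cmpxchg_user_64(&self->state_ts, &expected, next). */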

The previous version of the patchset can be found at
https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/,
which contains some additional context and links to earlier discussions.

More details are available in Documentation/userspace-api/umcg.txt
in sibling patches, and in doc-comments in the code.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 fs/exec.c                              |   1 +
 include/linux/sched.h                  |  71 ++
 include/linux/syscalls.h               |   3 +
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/umcg.h              | 137 ++++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |   4 +-
 kernel/exit.c                          |   5 +
 kernel/sched/Makefile                  |   1 +
 kernel/sched/core.c                    |   9 +-
 kernel/sched/umcg.c                    | 949 +++++++++++++++++++++++++
 kernel/sys_ni.c                        |   4 +
 13 files changed, 1199 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index fe8f8dd157b4..f09f96bb7f35 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/exec.c b/fs/exec.c
index 537d92c41105..1749f0f74fed 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binprm *bprm,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2e261adb8ea..dc9a8b8c5761 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;

 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,12 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif

+#ifdef CONFIG_UMCG
+	struct umcg_task __user	*umcg_task;
+	struct page		*pinned_umcg_worker_page;  /* self */
+	struct page		*pinned_umcg_server_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;

 	union {
@@ -1687,6 +1694,13 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2287,6 +2301,63 @@ static inline void rseq_execve(struct task_struct *t)

 #endif

+#ifdef CONFIG_UMCG
+
+void umcg_handle_resuming_worker(void);
+void umcg_handle_exiting_worker(void);
+void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
+static inline void umcg_handle_notify_resume(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_handle_resuming_worker();
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_handle_exiting_worker();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+void umcg_wq_worker_sleeping(struct task_struct *tsk);
+void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else  /* CONFIG_UMCG */
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_notify_resume(void)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
+#endif
+
 #ifdef CONFIG_DEBUG_RSEQ

 void rseq_syscall(struct pt_regs *regs);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 528a478dbda8..424a4686be74 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;

 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);

 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 4557a8b6086f..6d29b3896d4c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)

+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
 #undef __NR_syscalls
-#define __NR_syscalls 450
+
+#define __NR_syscalls 452

 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..cd9f60002821
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ */
+#define UMCG_TASK_NONE			0ULL
+#define UMCG_TASK_RUNNING		1ULL
+#define UMCG_TASK_IDLE			2ULL
+#define UMCG_TASK_BLOCKED		3ULL
+
+/* UMCG task state flags, bits 6-7 */
+
+/*
+ * UMCG_TF_LOCKED: locked by the userspace in preparation for calling sys_umcg_wait().
+ */
+#define UMCG_TF_LOCKED			(1ULL << 6)
+
+/*
+ * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
+ */
+#define UMCG_TF_PREEMPTED		(1ULL << 7)
+
+/* The first six bits: RUNNING, IDLE, or BLOCKED. */
+#define UMCG_TASK_STATE_MASK		0x3fULL
+
+/* The full state mask: the first 18 bits. */
+#define UMCG_TASK_STATE_MASK_FULL	0x3ffffULL
+
+/*
+ * The number of bits reserved for UMCG state timestamp in
+ * struct umcg_task.state_ts.
+ */
+#define UMCG_STATE_TIMESTAMP_BITS	46
+
+/* The number of bits truncated from UMCG state timestamp. */
+#define UMCG_STATE_TIMESTAMP_GRANULARITY	4
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state_ts: the current state of the UMCG task described by
+	 *            this struct, with a unique timestamp indicating
+	 *            when the last state change happened.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  5: task state;
+	 *   bits  6 -  7: state flags;
+	 *   bits  8 - 12: reserved; must be zeroes;
+	 *   bits 13 - 17: for userspace use;
+	 *   bits 18 - 63: timestamp (see below).
+	 *
+	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+	 * See Documentation/userspace-api/umcg.txt for details.
+	 */
+	__u64	state_ts;		/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero.
+	 *
+	 * Running UMCG workers must have next_tid set to point to IDLE
+	 * UMCG servers.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;		/* r   */
+
+	__u32	flags;			/* Reserved; must be zero. */
+
+	/**
+	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, the userspace removes them.
+	 */
+	__u64	idle_workers_ptr;	/* r/w */
+
+	/**
+	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
+	 *                       Readonly.
+	 */
+	__u64	idle_server_tid_ptr;	/* r   */
+} __attribute__((packed, aligned(8 * sizeof(__u64))));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+/**
+ * enum umcg_wait_flag - flags to pass to sys_umcg_wait
+ * @UMCG_WAIT_WAKE_ONLY:      wake @self->next_tid, don't put @self to sleep;
+ * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
+ *                            (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
+ *                            must be set.
+ */
+enum umcg_wait_flag {
+	UMCG_WAIT_WAKE_ONLY			= 1,
+	UMCG_WAIT_WF_CURRENT_CPU		= 2,
+};
+
+/* See Documentation/userspace-api/umcg.txt.*/
+#define UMCG_IDLE_NODE_PENDING (1ULL)
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/init/Kconfig b/init/Kconfig
index 036b750e8d8a..365802b25100 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1693,6 +1693,16 @@ config MEMBARRIER

 	  If unsure, say Y.

+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on X86_64
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index d5a61d565ad5..62453772a0c7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -171,8 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);

-		if (ti_work & _TIF_NOTIFY_RESUME)
+		if (ti_work & _TIF_NOTIFY_RESUME) {
+			umcg_handle_notify_resume();
 			tracehook_notify_resume(regs);
+		}

 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..4bdd51c75aee 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");

+	/* Turn off UMCG sched hooks. */
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -786,6 +790,7 @@ void __noreturn do_exit(long code)

 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */
+	umcg_handle_exit();

 	/* sync mm's RSS info before statistics gathering */
 	if (tsk->mm)
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index c7421f2d05e1..c03eea9bc738 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5344aa0afe5a..26362cfcee84 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4269,6 +4269,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+	umcg_clear_child(p);
 }

 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6327,9 +6328,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 	}
@@ -6347,9 +6350,11 @@ static inline void sched_submit_work(struct task_struct *tsk)

 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
 		else
 			io_wq_worker_running(tsk);
 	}
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..8f43a9f786c1
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,949 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include "sched.h"
+
+/**
+ * get_user_nofault - get user value without sleeping.
+ *
+ * get_user() might sleep and therefore cannot be used in preempt-disabled
+ * regions.
+ */
+#define get_user_nofault(out, uaddr)			\
+({							\
+	int ret = -EFAULT;				\
+							\
+	if (access_ok((uaddr), sizeof(*(uaddr)))) {	\
+		pagefault_disable();			\
+							\
+		if (!__get_user((out), (uaddr)))	\
+			ret = 0;			\
+							\
+		pagefault_enable();			\
+	}						\
+	ret;						\
+})
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ *                 and its server.
+ *
+ * The pages are pinned when the worker exits to the userspace and unpinned
+ * when the worker is in sched_submit_work(), i.e. when the worker is
+ * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
+ * are pinned at any one time across the whole system.
+ *
+ * The pinning is needed so that going-to-sleep workers can access
+ * their and their servers' userspace umcg_task structs without page faults,
+ * as the code path can be executed in the context of a pagefault, with
+ * mm lock held.
+ */
+static int umcg_pin_pages(u32 server_tid)
+{
+	struct umcg_task __user *worker_ut = current->umcg_task;
+	struct umcg_task __user *server_ut = NULL;
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(server_tid);
+	/* Server/worker interaction is allowed only within the same mm. */
+	if (tsk && current->mm == tsk->mm)
+		server_ut = READ_ONCE(tsk->umcg_task);
+	rcu_read_unlock();
+
+	if (!server_ut)
+		return -EINVAL;
+
+	tsk = current;
+
+	/* worker_ut is stable, don't need to repin */
+	if (!tsk->pinned_umcg_worker_page)
+		if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
+					&tsk->pinned_umcg_worker_page))
+			return -EFAULT;
+
+	/* server_ut may change, need to repin */
+	if (tsk->pinned_umcg_server_page) {
+		unpin_user_page(tsk->pinned_umcg_server_page);
+		tsk->pinned_umcg_server_page = NULL;
+	}
+
+	if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
+				&tsk->pinned_umcg_server_page))
+		return -EFAULT;
+
+	return 0;
+}
+
+static void umcg_unpin_pages(void)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->pinned_umcg_worker_page)
+		unpin_user_page(tsk->pinned_umcg_worker_page);
+	if (tsk->pinned_umcg_server_page)
+		unpin_user_page(tsk->pinned_umcg_server_page);
+
+	tsk->pinned_umcg_worker_page = NULL;
+	tsk->pinned_umcg_server_page = NULL;
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+
+		/* These can be simple writes - see the comment above. */
+		tsk->pinned_umcg_worker_page = NULL;
+		tsk->pinned_umcg_server_page = NULL;
+		tsk->flags &= ~PF_UMCG_WORKER;
+	}
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+	umcg_clear_task(tsk);
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}
+
+/**
+ * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+ * @state_ts   - points to the state_ts member of struct umcg_task to update;
+ * @expected   - the expected value of state_ts, including the timestamp;
+ * @desired    - the desired value of state_ts, state part only;
+ * @may_fault  - whether to use normal or _nofault cmpxchg.
+ *
+ * The function is basically cmpxchg(state_ts, expected, desired), with extra
+ * code to set the timestamp in @desired.
+ */
+static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+				bool may_fault)
+{
+	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+	/* Cut higher order bits. */
+	next_ts &= (1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1;
+
+	if (next_ts == curr_ts)
+		++next_ts;
+
+	/* Remove an old timestamp, if any. */
+	desired &= UMCG_TASK_STATE_MASK_FULL;
+
+	/* Set the new timestamp. */
+	desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+	if (may_fault)
+		return cmpxchg_user_64(state_ts, expected, desired);
+
+	return cmpxchg_user_64_nofault(state_ts, expected, desired);
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server; a worker will be added to
+ *         idle_workers_ptr, and the worker put to sleep; an idle server
+ *         from idle_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, the userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0                - success
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
+{
+	struct umcg_task ut;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			umcg_handle_exiting_worker();
+		else
+			umcg_clear_task(current);
+
+		return 0;
+	}
+
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid)
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;
+
+		/* Trigger umcg_handle_resuming_worker() */
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	} else {
+		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+	}
+
+	return 0;
+}
+
+/**
+ * handle_timedout_worker - make sure the worker is added to idle_workers
+ *                          upon a "clean" timeout.
+ */
+static int handle_timedout_worker(struct umcg_task __user *self)
+{
+	u64 curr_state, next_state;
+	int ret;
+
+	if (get_user(curr_state, &self->state_ts))
+		return -EFAULT;
+
+	if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
+		/* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
+
+		next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+		next_state |= UMCG_TASK_BLOCKED;
+
+		ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
+		if (ret)
+			return ret;
+
+		return -ETIMEDOUT;
+	}
+
+	return 0;  /* Not really timed out. */
+}
+
+/*
+ * umcg_should_idle - return true if tasks with @state should block in
+ *                    umcg_idle_loop().
+ */
+static bool umcg_should_idle(u64 state)
+{
+	switch (state & UMCG_TASK_STATE_MASK) {
+	case UMCG_TASK_RUNNING:
+		return state & UMCG_TF_LOCKED;
+	case UMCG_TASK_IDLE:
+		return !(state & UMCG_TF_LOCKED);
+	case UMCG_TASK_BLOCKED:
+		return false;
+	default:
+		WARN_ONCE(true, "unknown UMCG task state");
+		return false;
+	}
+}
+
+/**
+ * umcg_idle_loop - sleep until !umcg_should_idle() or a timeout expires
+ * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
+ *
+ * The function marks the current task as INTERRUPTIBLE and calls
+ * freezable_schedule().
+ *
+ * Note: because UMCG workers should not be running WITHOUT attached servers,
+ *       and because servers should not be running WITH attached workers,
+ *       the function returns only on fatal signal pending and ignores/flushes
+ *       all other signals.
+ */
+static int umcg_idle_loop(u64 abs_timeout)
+{
+	int ret;
+	struct page *pinned_page = NULL;
+	struct hrtimer_sleeper timeout;
+	struct umcg_task __user *self = current->umcg_task;
+	const bool worker = current->flags & PF_UMCG_WORKER;
+
+	/* Clear PF_UMCG_WORKER to elide workqueue handlers. */
+	if (worker)
+		current->flags &= ~PF_UMCG_WORKER;
+
+	if (abs_timeout) {
+		hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
+				HRTIMER_MODE_ABS);
+
+		hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
+				current->timer_slack_ns);
+	}
+
+	while (true) {
+		u64 umcg_state;
+
+		/*
+		 * We need to read from userspace _after_ the task is marked
+		 * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
+		 * but faulting is not allowed; so we try a fast no-fault read,
+		 * and if it fails, pin the page temporarily.
+		 */
+retry_once:
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		/* Order set_current_state above with get_user below. */
+		smp_mb();
+		ret = -EFAULT;
+		if (get_user_nofault(umcg_state, &self->state_ts)) {
+			set_current_state(TASK_RUNNING);
+
+			if (pinned_page)
+				goto out;
+			else if (1 != pin_user_pages_fast((unsigned long)self,
+						1, 0, &pinned_page))
+					goto out;
+
+			goto retry_once;
+		}
+
+		if (pinned_page) {
+			unpin_user_page(pinned_page);
+			pinned_page = NULL;
+		}
+
+		ret = 0;
+		if (!umcg_should_idle(umcg_state)) {
+			set_current_state(TASK_RUNNING);
+			goto out;
+		}
+
+		if (abs_timeout)
+			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+		if (!abs_timeout || timeout.task)
+			freezable_schedule();
+
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * Check for timeout before checking the state, as workers
+		 * are not going to return from freezable_schedule() unless
+		 * they are RUNNING.
+		 */
+		ret = -ETIMEDOUT;
+		if (abs_timeout && !timeout.task)
+			goto out;
+
+		/* Order set_current_state above with get_user below. */
+		smp_mb();
+		ret = -EFAULT;
+		if (get_user(umcg_state, &self->state_ts))
+			goto out;
+
+		ret = 0;
+		if (!umcg_should_idle(umcg_state))
+			goto out;
+
+		ret = -EINTR;
+		if (fatal_signal_pending(current))
+			goto out;
+
+		if (signal_pending(current))
+			flush_signals(current);
+	}
+
+out:
+	if (pinned_page) {
+		unpin_user_page(pinned_page);
+		pinned_page = NULL;
+	}
+	if (abs_timeout) {
+		hrtimer_cancel(&timeout.timer);
+		destroy_hrtimer_on_stack(&timeout.timer);
+	}
+	if (worker) {
+		current->flags |= PF_UMCG_WORKER;
+
+		if (ret == -ETIMEDOUT)
+			ret = handle_timedout_worker(self);
+
+		/* Workers must go through workqueue handlers upon wakeup. */
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+	}
+	return ret;
+}
+
+/**
+ * umcg_wakeup_allowed - check whether @current can wake @tsk.
+ *
+ * Currently a placeholder that allows wakeups within a single process
+ * only (same mm). In the future the requirement will be relaxed (securely).
+ */
+static bool umcg_wakeup_allowed(struct task_struct *tsk)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
+		return true;
+
+	return false;
+}
+
+/*
+ * Try to wake up. May be called with preempt_disable set. May be called
+ * cross-process.
+ *
+ * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
+ *       ordering logic.
+ */
+static int umcg_ttwu(u32 next_tid, int wake_flags)
+{
+	struct task_struct *next;
+
+	rcu_read_lock();
+	next = find_task_by_vpid(next_tid);
+	if (!next || !umcg_wakeup_allowed(next)) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	/* The result of ttwu below is ignored. */
+	try_to_wake_up(next, TASK_NORMAL, wake_flags);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+/*
+ * At the moment, umcg_do_context_switch simply wakes up @next with
+ * WF_CURRENT_CPU and puts the current task to sleep.
+ *
+ * In the future an optimization will be added to adjust runtime accounting
+ * so that from the kernel scheduling perspective the two tasks are
+ * essentially treated as one. In addition, the context switch may be performed
+ * right here on the fast path, instead of going through the wake/wait pair.
+ */
+static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
+{
+	int ret;
+
+	ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
+	if (ret)
+		return ret;
+
+	return umcg_idle_loop(abs_timeout);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags:        zero or a value from enum umcg_wait_flag.
+ * @abs_timeout:  when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task)
+ * if !(@flags & UMCG_WAIT_WAKE_ONLY) (also see umcg_idle_loop and
+ * umcg_should_idle above).
+ *
+ * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
+ * The userspace must have changed its state from IDLE to RUNNING
+ * before calling sys_umcg_wait() in the current task. This "next"
+ * task will be woken (context-switched-to on the fast path) when the
+ * current task is put to sleep.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ *
+ * Return:
+ * 0             - OK;
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task;
+ * -ESRCH        - the task to wake not found or not a UMCG task;
+ * -EINVAL       - another error happened (e.g. bad @flags, or the current
+ *                 task is not a UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
+{
+	struct umcg_task __user *self = current->umcg_task;
+	u32 next_tid;
+
+	if (!self)
+		return -EINVAL;
+
+	if (get_user(next_tid, &self->next_tid))
+		return -EFAULT;
+
+	if (flags & UMCG_WAIT_WAKE_ONLY) {
+		if (!next_tid || abs_timeout)
+			return -EINVAL;
+
+		flags &= ~UMCG_WAIT_WAKE_ONLY;
+		if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
+			return -EINVAL;
+
+		return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
+					WF_CURRENT_CPU : 0);
+	}
+
+	/* Unlock the worker, if locked. */
+	if (current->flags & PF_UMCG_WORKER) {
+		u64 umcg_state;
+
+		if (get_user(umcg_state, &self->state_ts))
+			return -EFAULT;
+
+		if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
+					&self->state_ts, &umcg_state,
+					umcg_state & ~UMCG_TF_LOCKED, true))
+			return -EFAULT;
+	}
+
+	if (next_tid)
+		return umcg_do_context_switch(next_tid, abs_timeout);
+
+	return umcg_idle_loop(abs_timeout);
+}
+
+/*
+ * NOTE: all code below is called from the sched_submit_work() /
+ *       sched_update_worker() paths or from the syscall exit to usermode
+ *       loop, so all errors result in the termination of the current
+ *       task (via SIGKILL).
+ */
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_nofault(u32 server_tid)
+{
+	struct umcg_task __user *ut_server = NULL;
+	struct task_struct *tsk;
+	int ret = -EINVAL;
+	u64 state;
+
+	rcu_read_lock();
+
+	tsk = find_task_by_vpid(server_tid);
+	/* Server/worker interaction is allowed only within the same mm. */
+	if (tsk && current->mm == tsk->mm)
+		ut_server = READ_ONCE(tsk->umcg_task);
+
+	if (!ut_server)
+		goto out_rcu;
+
+	ret = -EFAULT;
+	if (get_user_nofault(state, &ut_server->state_ts))
+		goto out_rcu;
+
+	ret = -EAGAIN;
+	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
+		goto out_rcu;
+
+	ret = umcg_update_state(&ut_server->state_ts, &state,
+			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
+			false);
+
+	if (ret)
+		goto out_rcu;
+
+	try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+
+out_rcu:
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server_may_fault(u32 server_tid)
+{
+	struct umcg_task __user *ut_server = NULL;
+	struct task_struct *tsk;
+	int ret = -EINVAL;
+	u64 state;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(server_tid);
+	if (tsk && current->mm == tsk->mm)
+		ut_server = READ_ONCE(tsk->umcg_task);
+	rcu_read_unlock();
+
+	if (!ut_server)
+		return -EINVAL;
+
+	if (get_user(state, &ut_server->state_ts))
+		return -EFAULT;
+
+	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
+		return -EAGAIN;
+
+	ret = umcg_update_state(&ut_server->state_ts, &state,
+			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
+			true);
+	if (ret)
+		return ret;
+
+	/*
+	 * umcg_ttwu() will call find_task_by_vpid() again, but we cannot
+	 * avoid this, as get_user() cannot be called from within an RCU
+	 * read-side critical section.
+	 */
+	return umcg_ttwu(server_tid, WF_CURRENT_CPU);
+}
+
+/*
+ * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
+ */
+static int umcg_wake_idle_server(u32 server_tid, bool may_fault)
+{
+	int ret = umcg_wake_idle_server_nofault(server_tid);
+
+	if (!ret)
+		return 0;
+
+	if (!may_fault || ret != -EFAULT)
+		return ret;
+
+	return umcg_wake_idle_server_may_fault(server_tid);
+}
+
+/*
+ * Called in sched_submit_work() context for UMCG workers. In the common case,
+ * the worker's state changes RUNNING => BLOCKED, and its server's state
+ * changes IDLE => RUNNING, and the server is ttwu-ed.
+ *
+ * Under some conditions (e.g. the worker is "locked", see
+ * Documentation/userspace-api/umcg.txt for more details), the
+ * function does nothing.
+ *
+ * The function is called with preempt disabled to make sure the retry_once
+ * logic below works correctly.
+ */
+static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 curr_state, next_state;
+	bool retried = false;
+	u32 tid;
+	int ret;
+
+	*server_tid = 0;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker."))
+		return;
+
+	/* If the worker has no server, do nothing. */
+	if (unlikely(!tsk->pinned_umcg_server_page))
+		return;
+
+	if (get_user_nofault(curr_state, &ut_worker->state_ts))
+		goto die;
+
+	/*
+	 * The userspace is allowed to concurrently change a RUNNING worker's
+	 * state only once in a "short" period of time, so we retry state
+	 * change at most once. As this retry block is within a
+	 * preempt_disable region, "short" is truly short here.
+	 *
+	 * See Documentation/userspace-api/umcg.txt for details.
+	 */
+retry_once:
+	if (curr_state & UMCG_TF_LOCKED)
+		return;
+
+	if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING,
+			"Unexpected UMCG worker state."))
+		goto die;
+
+	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+	next_state |= UMCG_TASK_BLOCKED;
+
+	ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
+	if (ret == -EAGAIN) {
+		if (retried)
+			goto die;
+
+		retried = true;
+		goto retry_once;
+	}
+	if (ret)
+		goto die;
+
+	smp_mb();  /* Order state read/write above and getting next_tid below. */
+	if (get_user_nofault(tid, &ut_worker->next_tid))
+		goto die;
+
+	*server_tid = tid;
+	return;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/* Called from sched_submit_work(). Must not fault/sleep. */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	u32 server_tid;
+
+	/*
+	 * Disable preemption so that retry_once in process_sleeping_worker
+	 * works properly.
+	 */
+	preempt_disable();
+	process_sleeping_worker(tsk, &server_tid);
+	preempt_enable();
+
+	if (server_tid) {
+		int ret = umcg_wake_idle_server_nofault(server_tid);
+
+		if (ret && ret != -EAGAIN)
+			goto die;
+	}
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	umcg_unpin_pages();
+}
+
+/**
+ * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
+ *
+ * Returns true on success, false on a fatal failure.
+ *
+ * See Documentation/userspace-api/umcg.txt for details.
+ */
+static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+{
+	u64 __user *node = &ut_worker->idle_workers_ptr;
+	u64 __user *head_ptr;
+	u64 first = (u64)node;
+	u64 head;
+
+	if (get_user(head, node) || !head)
+		return false;
+
+	head_ptr = (u64 __user *)head;
+
+	/* Mark the worker as pending. */
+	if (put_user(UMCG_IDLE_NODE_PENDING, node))
+		return false;
+
+	/* Make the head point to the worker. */
+	if (xchg_user_64(head_ptr, &first))
+		return false;
+
+	/* Make the worker point to the previous head. */
+	if (put_user(first, node))
+		return false;
+
+	return true;
+}
+
+/**
+ * get_idle_server - retrieve an idle server, if present.
+ *
+ * Returns true on success, false on a fatal failure.
+ */
+static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
+{
+	u64 server_tid_ptr;
+	u32 tid;
+
+	/* Empty result is OK. */
+	*server_tid = 0;
+
+	if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
+		return false;
+
+	if (!server_tid_ptr)
+		return false;
+
+	tid = 0;
+	if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
+		return false;
+
+	*server_tid = tid;
+	return true;
+}
+
+/*
+ * Returns true to wait for the userspace to schedule this worker, false
+ * to return to the userspace.
+ *
+ * In the common case, a BLOCKED worker is marked IDLE and enqueued
+ * to idle_workers_ptr list. The idle server is woken (if present).
+ *
+ * If a RUNNING worker was preempted (UMCG_TF_PREEMPTED is set), the
+ * worker is also moved to the IDLE state and its server is woken.
+ *
+ * Sets @server_tid to point to the server to be woken if the worker
+ * is going to sleep; sets @server_tid to point to the server assigned
+ * to this RUNNING worker if the worker is to return to the userspace.
+ */
+static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
+{
+	struct umcg_task __user *ut_worker = tsk->umcg_task;
+	u64 curr_state, next_state;
+
+	*server_tid = 0;
+
+	if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
+		return false;
+
+	if (fatal_signal_pending(tsk))
+		return false;
+
+	if (get_user(curr_state, &ut_worker->state_ts))
+		goto die;
+
+	if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
+		u32 tid;
+
+		/* Wakeup: wait but don't enqueue. */
+		if (curr_state & UMCG_TF_LOCKED)
+			return true;
+
+		smp_mb();  /* Order getting state and getting server_tid */
+		if (get_user(tid, &ut_worker->next_tid))
+			goto die;
+
+		if (!tid)
+			/* RUNNING workers must have servers. */
+			goto die;
+
+		*server_tid = tid;
+
+		/* pass-through: RUNNING with a server. */
+		if (!(curr_state & UMCG_TF_PREEMPTED))
+			return false;
+
+		/*
+		 * Fallthrough to mark the worker IDLE: the worker is
+		 * PREEMPTED.
+		 */
+	} else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
+			(curr_state & UMCG_TF_LOCKED)))
+		/* The worker prepares to sleep or to unregister. */
+		return false;
+
+	if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
+		goto die;
+
+	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
+	next_state |= UMCG_TASK_IDLE;
+
+	if (umcg_update_state(&ut_worker->state_ts, &curr_state,
+			next_state, true))
+		goto die;
+
+	if (!enqueue_idle_worker(ut_worker))
+		goto die;
+
+	smp_mb();  /* Order enqueuing the worker with getting the server. */
+	if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
+		goto die;
+
+	return true;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+	return false;
+}
+
+/*
+ * Called from sched_update_worker(): defer all work until later, as
+ * sched_update_worker() may be called with in-kernel locks held.
+ */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+}
+
+/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
+void umcg_handle_resuming_worker(void)
+{
+	u32 server_tid;
+
+	/* Avoid recursion by removing PF_UMCG_WORKER */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	do {
+		bool should_wait;
+
+		should_wait = process_waking_worker(current, &server_tid);
+		if (!should_wait)
+			break;
+
+		if (server_tid) {
+			int ret = umcg_wake_idle_server(server_tid, true);
+
+			if (ret && ret != -EAGAIN)
+				goto die;
+		}
+
+		umcg_idle_loop(0);
+	} while (true);
+
+	if (!server_tid)
+		/* No server => no reason to pin pages. */
+		umcg_unpin_pages();
+	else if (umcg_pin_pages(server_tid))
+		goto die;
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	current->flags |= PF_UMCG_WORKER;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d1944258cfc0..82d233aa2648 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);

+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */

 /* mm/fadvise.c */
--
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v0.9.1 4/6] sched/umcg, lib/umcg: implement libumcg
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
                   ` (2 preceding siblings ...)
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
@ 2021-11-22 21:13 ` Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 5/6] sched/umcg: add Documentation/userspace-api/umcg.txt Peter Oskolkov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

Implement libumcg in tools/lib/umcg. Define higher-level UMCG
API that hides kernel-level UMCG API intricacies.

As a higher-level API, libumcg makes subtle changes to server/worker
interactions, compared to the kernel UMCG API, and introduces
the following new concepts:

- UMCG Group: a collection of servers and workers in a process
  that can interact with each other; UMCG groups are useful to
  partition servers and workers within a process in order to, for
  example, affine work to specific NUMA nodes;
- UMCG basic tasks: these are UMCG servers, from the kernel point
  of view; they do not interact with UMCG workers and thus
  do not need to be in UMCG groups; they are used for cooperative
  wait/wake/swap operations.

The main difference between server/worker interaction in libumcg
and the kernel-side UMCG API is that a wakeup can be queued:
if umcg_wake() is called on a RUNNING UMCG task, the fact is
recorded in the userspace, and when the task calls umcg_wait()
or umcg_swap(), the wakeup is consumed and the task is not
marked IDLE (see the usage sketch below).

Libumcg exports the following API:
        umcg_enabled()
        umcg_get_utid()
        umcg_set_task_tag()
        umcg_get_task_tag()
        umcg_create_group()
        umcg_destroy_group()
        umcg_register_basic_task()
        umcg_register_worker()
        umcg_register_server()
        umcg_unregister_task()
        umcg_wait()
        umcg_wake()
        umcg_swap()
        umcg_get_idle_worker()
        umcg_run_worker()
        umcg_preempt_worker()
        umcg_get_time_ns()

See tools/lib/umcg/libumcg.txt for details.
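
As a purely illustrative sketch (not taken from the selftests or the
library itself), two basic tasks could use the wait/wake API roughly
as follows; a wakeup delivered while the target is still RUNNING is
queued and consumed by the target's next umcg_wait():

        /* Thread A: a registered basic task. */
        umcg_tid a = umcg_register_basic_task(0);
        ...
        umcg_wait(0);   /* Blocks until woken; returns immediately
                         * if a wakeup has already been queued. */
        ...
        umcg_unregister_task();

        /* Thread B: also a registered UMCG task; @a was shared by A. */
        umcg_wake(a, false);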

Notes:
- this is still somewhat work-in-progress: while the kernel side
  code has been more or less stable over the last couple of months,
  the userspace side of things is less so;
- while libumcg is intended to be the main/primary/only direct user
  of the kernel UMCG API, at the moment the implementation is geared
  more towards testing and correctness than live production
  usage, with a lot of asserts and similar development helpers;
- I have a number of umcg selftests that I plan to clean up and
  post shortly.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/lib/umcg/.gitignore |    4 +
 tools/lib/umcg/Makefile   |   11 +
 tools/lib/umcg/libumcg.c  | 1202 +++++++++++++++++++++++++++++++++++++
 tools/lib/umcg/libumcg.h  |  299 +++++++++
 4 files changed, 1516 insertions(+)
 create mode 100644 tools/lib/umcg/.gitignore
 create mode 100644 tools/lib/umcg/Makefile
 create mode 100644 tools/lib/umcg/libumcg.c
 create mode 100644 tools/lib/umcg/libumcg.h

diff --git a/tools/lib/umcg/.gitignore b/tools/lib/umcg/.gitignore
new file mode 100644
index 000000000000..ea55ae666041
--- /dev/null
+++ b/tools/lib/umcg/.gitignore
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+libumcg.a
+libumcg.o
+
diff --git a/tools/lib/umcg/Makefile b/tools/lib/umcg/Makefile
new file mode 100644
index 000000000000..cef44b681cb1
--- /dev/null
+++ b/tools/lib/umcg/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -g -I../../include/ -I../../../usr/include/ -Wall -Werror
+
+libumcg.a: libumcg.o
+	ar rc libumcg.a libumcg.o
+
+libumcg.o: libumcg.c
+
+clean :
+	rm libumcg.a libumcg.o
diff --git a/tools/lib/umcg/libumcg.c b/tools/lib/umcg/libumcg.c
new file mode 100644
index 000000000000..8f8dfd515712
--- /dev/null
+++ b/tools/lib/umcg/libumcg.c
@@ -0,0 +1,1202 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include "libumcg.h"
+
+#include <assert.h>
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <threads.h>
+#include <time.h>
+
+#include <linux/kernel.h>
+
+static int sys_umcg_ctl(uint32_t flags, struct umcg_task *umcg_task)
+{
+	return syscall(__NR_umcg_ctl, flags, umcg_task);
+}
+
+static int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)
+{
+	return syscall(__NR_umcg_wait, flags, abs_timeout);
+}
+
+bool umcg_enabled(void)
+{
+	int ret = sys_umcg_ctl(UMCG_CTL_REGISTER, NULL);
+
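+	/*
+	 * sys_umcg_ctl() is expected to reject a NULL struct umcg_task
+	 * with EINVAL if the UMCG syscalls are available; if they are
+	 * not, the call fails with ENOSYS instead.
+	 */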
+	if (ret && errno == EINVAL)
+		return true;
+
+	return false;
+}
+
+uint64_t umcg_get_time_ns(void)
+{
+	struct timespec ts;
+
+	if (clock_gettime(CLOCK_REALTIME, &ts)) {
+		fprintf(stderr, "clock_gettime failed\n");
+		abort();
+	}
+
+	return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
+}
+
+struct umcg_task_tls;
+
+/**
+ * struct umcg_group - describes UMCG group.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ */
+struct umcg_group {
+	/**
+	 * @idle_workers_head: points to the kernel-side list of idle
+	 *                     workers, i.e. the address of this field
+	 *                     is passed to the kernel in
+	 *                     struct umcg_task.idle_workers_ptr.
+	 */
+	uint64_t		idle_workers_head;
+
+	/**
+	 * @nr_tasks: the number of tasks (servers and workers) registered
+	 *            in this group.
+	 */
+	uint64_t		nr_tasks;
+
+	/**
+	 * @idle_worker_lock: protects @idle_workers below.
+	 */
+	pthread_spinlock_t	idle_worker_lock;
+
+	/**
+	 * @idle_server_lock: protects @idle_servers below.
+	 */
+	pthread_spinlock_t	idle_server_lock;
+
+	/**
+	 * @idle_workers: points to the userspace-side list of idle workers.
+	 *
+	 * When a server polls for an idle worker via umcg_poll_worker(),
+	 * the server first consults @idle_workers; if the list is empty,
+	 * the value of the variable is swapped with @idle_workers_head.
+	 */
+	uint64_t		*idle_workers;
+
+	/**
+	 * @idle_servers: points to the userspace-side list of idle servers.
+	 *
+	 * When a server polls for an idle worker via umcg_poll_worker(),
+	 * and none is available, the server is added to the list and blocks
+	 * via sys_umcg_wait().
+	 */
+	struct umcg_task_tls	*idle_servers;
+
+	/**
+	 * @idle_server_tid: the TID of one of the idle servers.
+	 *
+	 * The address of this field is passed to the kernel in
+	 * struct umcg_task.idle_server_tid_ptr.
+	 */
+	uint32_t		idle_server_tid;
+} __attribute((aligned(8)));
+
+/**
+ * struct umcg_task_tls - per thread struct used to identify/manage UMCG tasks
+ *
+ * Each UMCG task requires an instance of struct umcg_task passed to
+ * sys_umcg_ctl. This struct contains it, as well as several additional
+ * fields useful for the userspace UMCG API.
+ *
+ * The alignment is driven by the alignment of struct umcg_task.
+ */
+struct umcg_task_tls {
+	struct umcg_task	umcg_task;
+	struct umcg_group	*group;  /* read only */
+	umcg_tid		peer;    /* server or worker or UMCG_NONE */
+	umcg_tid		self;    /* read only */
+	intptr_t		tag;
+	pid_t			tid;     /* read only */
+	bool			worker;  /* read only */
+
+	struct umcg_task_tls	*next;   /* used in group->idle_servers */
+} __attribute((aligned(8 * sizeof(uint64_t))));
+
+static thread_local struct umcg_task_tls *umcg_task_tls;
+
+umcg_tid umcg_get_utid(void)
+{
+	return (umcg_tid)&umcg_task_tls;
+}
+
+static struct umcg_task_tls *utid_to_utls(umcg_tid utid)
+{
+	assert(utid != UMCG_NONE);
+	return *(struct umcg_task_tls **)utid;
+}
+
+uint64_t umcg_get_task_state(umcg_tid task)
+{
+	struct umcg_task_tls *utls = utid_to_utls(task);
+	uint64_t state;
+
+	if (!utls)
+		return UMCG_TASK_NONE;
+
+	state = atomic_load_explicit(&utls->umcg_task.state_ts, memory_order_acquire);
+	return state & UMCG_TASK_STATE_MASK_FULL;
+}
+
+/* Update the state variable, set new timestamp. */
+static bool umcg_update_state(uint64_t *state, uint64_t *prev, uint64_t next)
+{
+	uint64_t prev_ts = (*prev) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+	struct timespec now;
+	uint64_t next_ts;
+	int res;
+
+	/*
+	 * clock_gettime(CLOCK_MONOTONIC, ...) takes less than 20ns on a
+	 * typical Intel processor on average, even when run concurrently,
+	 * so the overhead is low enough for most applications.
+	 *
+	 * If this is still too high, `next_ts = prev_ts + 1` should work
+	 * as well. The only real requirement is that the "timestamps" are
+	 * unique per thread within a reasonable time frame.
+	 */
+	res = clock_gettime(CLOCK_MONOTONIC, &now);
+	assert(!res);
+	next_ts = (now.tv_sec * NSEC_PER_SEC + now.tv_nsec) >>
+		UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+	/* Cut higher order bits. */
+	next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
+
+	if (next_ts == prev_ts)
+		++next_ts;
+
+#ifndef NDEBUG
+	if (prev_ts > next_ts) {
+		fprintf(stderr, "%s: time goes back: prev_ts: %lu "
+				"next_ts: %lu diff: %lu\n", __func__,
+				prev_ts, next_ts, prev_ts - next_ts);
+	}
+#endif
+
+	/* Remove old timestamp, if any. */
+	next &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
+
+	/* Set the new timestamp. */
+	next |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+	/*
+	 * TODO: review whether memory order below can be weakened to
+	 * memory_order_acq_rel for success and memory_order_acquire for
+	 * failure.
+	 */
+	return atomic_compare_exchange_strong_explicit(state, prev, next,
+			memory_order_seq_cst, memory_order_seq_cst);
+}
+
+static bool umcg_worker_in_idle_queue(umcg_tid worker)
+{
+	struct umcg_task_tls *worker_utls = utid_to_utls(worker);
+	struct umcg_task *worker_ut = &worker_utls->umcg_task;
+
+	assert(worker_utls->worker);
+
+	return (uint64_t)&worker_utls->group->idle_workers_head !=
+		atomic_load_explicit(&worker_ut->idle_workers_ptr,
+					memory_order_acquire);
+}
+
+void umcg_set_task_tag(umcg_tid utid, intptr_t tag)
+{
+	utid_to_utls(utid)->tag = tag;
+}
+
+intptr_t umcg_get_task_tag(umcg_tid utid)
+{
+	return utid_to_utls(utid)->tag;
+}
+
+static bool try_task_lock(struct umcg_task_tls *task, uint64_t expected_state,
+				uint64_t new_state)
+{
+	uint64_t next;
+	uint64_t prev = atomic_load_explicit(&task->umcg_task.state_ts,
+				memory_order_acquire);
+
+	if (prev & UMCG_TF_LOCKED)
+		return false;
+
+	if ((prev & UMCG_TASK_STATE_MASK) != expected_state)
+		return false;
+
+	next = (prev & ~UMCG_TASK_STATE_MASK) | new_state | UMCG_TF_LOCKED;
+	return umcg_update_state((uint64_t *)&task->umcg_task.state_ts, &prev, next);
+}
+
+static void task_lock(struct umcg_task_tls *task, uint64_t expected_state,
+			uint64_t new_state)
+{
+	int loop_counter = 0;
+
+	while (!try_task_lock(task, expected_state, new_state))
+		assert(++loop_counter < 1000 * 1000 * 100);
+}
+
+static void task_unlock(struct umcg_task_tls *task, uint64_t expected_state,
+		uint64_t new_state)
+{
+	bool ok;
+	uint64_t next;
+	uint64_t prev = atomic_load_explicit((uint64_t *)&task->umcg_task.state_ts,
+					memory_order_acquire);
+
+	next = ((prev & ~UMCG_TASK_STATE_MASK_FULL) | new_state) & ~UMCG_TF_LOCKED;
+	assert(next != prev);
+	assert((prev & UMCG_TASK_STATE_MASK_FULL & ~UMCG_TF_LOCKED) == expected_state);
+
+	ok = umcg_update_state((uint64_t *)&task->umcg_task.state_ts, &prev, next);
+	assert(ok);
+}
+
+umcg_tid umcg_register_basic_task(intptr_t tag)
+{
+	int ret;
+
+	if (umcg_task_tls != NULL) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	umcg_task_tls = malloc(sizeof(struct umcg_task_tls));
+	if (!umcg_task_tls) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+	memset(umcg_task_tls, 0, sizeof(struct umcg_task_tls));
+
+	umcg_task_tls->umcg_task.state_ts = UMCG_TASK_RUNNING;
+	umcg_task_tls->self = (umcg_tid)&umcg_task_tls;
+	umcg_task_tls->tag = tag;
+	umcg_task_tls->tid = gettid();
+
+	ret = sys_umcg_ctl(UMCG_CTL_REGISTER, &umcg_task_tls->umcg_task);
+	if (ret) {
+		free(umcg_task_tls);
+		umcg_task_tls = NULL;
+		errno = ret;
+		return UMCG_NONE;
+	}
+
+	return umcg_task_tls->self;
+}
+
+static umcg_tid umcg_register_task_in_group(umcg_t group_id, intptr_t tag,
+						bool server)
+{
+	int ret;
+	uint32_t self_tid;
+	struct umcg_group *group;
+	struct umcg_task_tls *curr;
+
+	if (group_id == UMCG_NONE) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	if (umcg_task_tls != NULL) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	group = (struct umcg_group *)group_id;
+
+	curr = malloc(sizeof(struct umcg_task_tls));
+	if (!curr) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+	memset(curr, 0, sizeof(struct umcg_task_tls));
+
+	self_tid = gettid();
+	curr->umcg_task.state_ts = server ? UMCG_TASK_RUNNING : UMCG_TASK_BLOCKED;
+	curr->umcg_task.idle_server_tid_ptr = server ? 0UL :
+		(uint64_t)&group->idle_server_tid;
+	curr->umcg_task.idle_workers_ptr =
+		(uint64_t)&group->idle_workers_head;
+	curr->group = group;
+	curr->tag = tag;
+	curr->tid = self_tid;
+	curr->self = (umcg_tid)&umcg_task_tls;
+	curr->worker = !server;
+
+	/*
+	 * Need to set umcg_task_tls before registering, as a server
+	 * may pick up this worker immediately, and use @self.
+	 */
+	atomic_store_explicit(&umcg_task_tls, curr, memory_order_release);
+
+	ret = sys_umcg_ctl(server ? UMCG_CTL_REGISTER :
+					UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
+				&curr->umcg_task);
+	if (ret) {
+		free(curr);
+		errno = ret;
+		atomic_store_explicit(&umcg_task_tls, NULL, memory_order_release);
+		return UMCG_NONE;
+	}
+
+	atomic_fetch_add_explicit(&group->nr_tasks, 1, memory_order_relaxed);
+
+	return umcg_task_tls->self;
+}
+
+umcg_tid umcg_register_worker(umcg_t group_id, intptr_t tag)
+{
+	return umcg_register_task_in_group(group_id, tag, false);
+}
+
+umcg_tid umcg_register_server(umcg_t group_id, intptr_t tag)
+{
+	return umcg_register_task_in_group(group_id, tag, true);
+}
+
+int umcg_unregister_task(void)
+{
+	int ret;
+
+	if (!umcg_task_tls) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	/* If this is a worker, wake the server. */
+	if (umcg_task_tls->worker) {
+		struct umcg_task_tls *curr = umcg_task_tls;
+		struct umcg_task_tls *utls_server;
+
+		task_lock(curr, UMCG_TASK_RUNNING, UMCG_TASK_IDLE);
+		utls_server = utid_to_utls(curr->peer);
+		assert(utls_server->tid == atomic_load_explicit(
+					&curr->umcg_task.next_tid,
+					memory_order_acquire));
+		curr->peer = UMCG_NONE;
+		atomic_store_explicit(&curr->umcg_task.next_tid, 0,
+				memory_order_release);
+
+		utls_server->peer = UMCG_NONE;
+		atomic_store_explicit(&utls_server->umcg_task.next_tid, 0,
+					memory_order_release);
+
+		/* Keep the worker locked to avoid needing the server. */
+		if (utls_server) {
+			curr->worker = false;  /* umcg_wake tries to lock */
+			ret = umcg_wake(utls_server->self, false);
+			assert(!ret || errno == ESRCH);
+		}
+	}
+
+	ret = sys_umcg_ctl(UMCG_CTL_UNREGISTER, NULL);
+	if (ret) {
+		errno = ret;
+		return -1;
+	}
+
+	if (umcg_task_tls->group)
+		atomic_fetch_sub_explicit(&umcg_task_tls->group->nr_tasks, 1,
+						memory_order_relaxed);
+
+	free(umcg_task_tls);
+	atomic_store_explicit(&umcg_task_tls, NULL, memory_order_release);
+	return 0;
+}
+
+/* Helper return codes. */
+enum umcg_prepare_op_result {
+	UMCG_OP_DONE,
+	UMCG_OP_SYS,
+	UMCG_OP_AGAIN,
+	UMCG_OP_ERROR
+};
+
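+/*
+ * Prepare the current task for umcg_wait(): a RUNNING task is marked
+ * IDLE (workers are additionally locked and flagged as "in wait") and
+ * UMCG_OP_SYS is returned; if a wakeup has been queued, it is consumed,
+ * the task stays RUNNING, and UMCG_OP_DONE is returned.
+ */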
+static enum umcg_prepare_op_result umcg_prepare_wait_may_lock(void)
+{
+	struct umcg_task *ut;
+	uint64_t prev_state, next_state;
+
+	if (!umcg_task_tls) {
+		errno = EINVAL;
+		return UMCG_OP_ERROR;
+	}
+
+	ut = &umcg_task_tls->umcg_task;
+
+	prev_state = atomic_load_explicit(&ut->state_ts, memory_order_acquire);
+	next_state = umcg_task_tls->worker ?
+		UMCG_TASK_IDLE | UMCG_TF_LOCKED | UMCG_UTF_WORKER_IN_WAIT :
+		UMCG_TASK_IDLE;
+	if (((prev_state & UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_RUNNING) &&
+		umcg_update_state((uint64_t *)&ut->state_ts, &prev_state, next_state))
+		return UMCG_OP_SYS;
+
+	if ((prev_state & UMCG_TASK_STATE_MASK_FULL) !=
+			(UMCG_TASK_RUNNING | UMCG_UTF_WAKEUP_QUEUED)) {
+#ifndef NDEBUG
+		fprintf(stderr, "libumcg: unexpected state before wait: %lu\n",
+				prev_state);
+		assert(false);
+#endif
+		errno = EINVAL;
+		return UMCG_OP_ERROR;
+	}
+
+	if (umcg_update_state((uint64_t *)&ut->state_ts, &prev_state, UMCG_TASK_RUNNING))
+		return UMCG_OP_DONE;
+
+#ifndef NDEBUG
+	/* Raced with another wait/wake? This is not supported. */
+	fprintf(stderr, "libumcg: failed to remove the wakeup flag: %lu\n",
+			prev_state);
+	assert(false);
+#endif
+	errno = EINVAL;
+	return UMCG_OP_ERROR;
+}
+
+/* Always return -1 because the user needs to see ETIMEDOUT in errno */
+static int handle_timedout(void)
+{
+	struct umcg_task *ut = &umcg_task_tls->umcg_task;
+	uint64_t umcg_state;
+
+retry:
+	/* Restore RUNNING state if the task is still IDLE. */
+	umcg_state = atomic_load_explicit(&ut->state_ts,
+			memory_order_acquire);
+	if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
+		return -1;
+
+	assert((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE);
+
+	if (umcg_update_state((uint64_t *)&ut->state_ts, &umcg_state, UMCG_TASK_RUNNING))
+		return -1;
+
+	/* A wakeup could have been queued. */
+	goto retry;
+}
+
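+/*
+ * Call sys_umcg_wait(), retrying on spurious EINTR wakeups. For a worker
+ * with an attached server, the server is first marked RUNNING, as
+ * required by sys_umcg_wait() when next_tid is set.
+ */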
+static int umcg_do_wait(uint64_t timeout)
+{
+	struct umcg_task *ut = &umcg_task_tls->umcg_task;
+	uint32_t flags = 0;
+
+	/* If this is a worker, need to change the state of the server. */
+	if (umcg_task_tls->worker &&
+			atomic_load_explicit(&ut->next_tid, memory_order_acquire)) {
+		bool ok;
+		struct umcg_task *server_ut =
+			&utid_to_utls(umcg_task_tls->peer)->umcg_task;
+		uint64_t server_state = atomic_load_explicit(&server_ut->state_ts,
+				memory_order_acquire);
+
+		assert((server_state & UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_IDLE);
+		ok = umcg_update_state((uint64_t *)&server_ut->state_ts,
+				&server_state, UMCG_TASK_RUNNING);
+		assert(ok);
+	} else if (!umcg_task_tls->worker)
+		atomic_store_explicit(&ut->next_tid, 0, memory_order_release);
+
+	do {
+		uint64_t umcg_state;
+		int ret;
+
+		ret = sys_umcg_wait(flags, timeout);
+		if (!ret)
+			return 0;
+
+		if (ret && errno == EINTR) {
+			umcg_state = atomic_load_explicit(&ut->state_ts,
+					memory_order_acquire) & UMCG_TASK_STATE_MASK;
+			if (umcg_state == UMCG_TASK_RUNNING)
+				return 0;
+			continue;
+		}
+
+		if (errno == ETIMEDOUT)
+			return handle_timedout();
+
+		return -1;
+	} while (true);
+}
+
+int umcg_wait(uint64_t timeout)
+{
+	switch (umcg_prepare_wait_may_lock()) {
+	case UMCG_OP_DONE:
+		return 0;
+	case UMCG_OP_SYS:
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	default:
+		assert(false);
+		return -1;
+	}
+
+	return umcg_do_wait(timeout);
+}
+
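+/*
+ * Add @utls to the group's kernel-side idle worker list, following the
+ * same protocol as the kernel: mark the node pending, swap it into the
+ * list head, then point it at the previous head.
+ */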
+static void enqueue_idle_worker(struct umcg_task_tls *utls)
+{
+	struct umcg_task *ut = &utls->umcg_task;
+	uint64_t *node = (uint64_t *)&ut->idle_workers_ptr;
+	uint64_t head = *node;
+	uint64_t *head_ptr = (uint64_t *)head;
+	uint64_t first = (uint64_t)node;
+
+	assert(utls->worker);
+	assert(&utls->group->idle_workers_head == head_ptr);
+
+	/* Mark the worker as pending. */
+	atomic_store_explicit(node, UMCG_IDLE_NODE_PENDING, memory_order_release);
+
+	/* Make the head point to the worker. */
+	first = atomic_exchange_explicit(head_ptr, first, memory_order_acq_rel);
+
+	/* Make the worker point to the previous head. */
+	atomic_store_explicit(node, first, memory_order_release);
+}
+
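+/*
+ * Prepare @next_utls to be woken (@for_swap: to be switched into):
+ * depending on its current state, either queue a wakeup, move a worker
+ * waiting in sys_umcg_wait() to the idle worker list (both UMCG_OP_DONE),
+ * or request an actual sys_umcg_wait() wakeup (UMCG_OP_SYS).
+ * UMCG_OP_AGAIN asks the caller to retry.
+ */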
+static enum umcg_prepare_op_result umcg_prepare_wake_may_lock(
+		struct umcg_task_tls *next_utls, bool for_swap)
+{
+	struct umcg_task *next_ut = &next_utls->umcg_task;
+	uint64_t curr_state, next_state;
+	enum umcg_prepare_op_result result = UMCG_OP_DONE;
+	bool enqueue_worker = false;
+
+	curr_state = atomic_load_explicit(&next_ut->state_ts, memory_order_acquire);
+
+	if (curr_state & (UMCG_TF_LOCKED | UMCG_UTF_WAKEUP_QUEUED))
+		return UMCG_OP_AGAIN;
+
+	/* Start with RUNNING tasks. */
+	if ((curr_state & UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_RUNNING)
+		next_state = UMCG_TASK_RUNNING | UMCG_UTF_WAKEUP_QUEUED;
+	else if (curr_state & UMCG_UTF_WORKER_IN_WAIT) {
+		/* Next, check workers in wait. */
+		assert(next_utls->worker);
+		assert((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE);
+
+		if (for_swap) {
+			next_state = UMCG_TASK_RUNNING;
+			result = UMCG_OP_SYS;
+		} else {
+			next_state = UMCG_TASK_IDLE;
+			enqueue_worker = true;
+		}
+	} else if ((curr_state & UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_IDLE) {
+		/* Next, check IDLE tasks. */
+		if (next_utls->worker) {
+			if (for_swap) {
+				next_state = UMCG_TASK_RUNNING | UMCG_TF_LOCKED;
+				result = UMCG_OP_SYS;
+			} else {
+				return UMCG_OP_AGAIN;
+			}
+		} else {
+			atomic_store_explicit(&next_utls->umcg_task.next_tid,
+						0, memory_order_release);
+			next_state = UMCG_TASK_RUNNING;
+			result = UMCG_OP_SYS;
+		}
+	} else {
+		/* Finally, deal with BLOCKED workers. */
+		assert((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_BLOCKED);
+		assert(next_utls->worker);
+
+		return UMCG_OP_AGAIN;
+	}
+
+	if (umcg_update_state((uint64_t *)&next_ut->state_ts, &curr_state, next_state)) {
+		if (enqueue_worker)
+			enqueue_idle_worker(next_utls);
+		return result;
+	}
+
+	return UMCG_OP_AGAIN;
+}
+
+static int umcg_do_wake_or_swap(uint32_t next_tid, bool should_wait,
+				uint64_t timeout, bool wf_current_cpu,
+				struct umcg_task_tls *next_utls)
+{
+	struct umcg_task *ut;
+	uint32_t flags = 0;
+	uint32_t server_tid = 0;
+	int ret;
+
+	/* wf_current_cpu is possible only in wake-only scenarios. */
+	assert(!should_wait || !wf_current_cpu);
+	assert(umcg_task_tls != NULL);
+
+	ut = &umcg_task_tls->umcg_task;
+
+	/*
+	 * This is a worker waking another task: lock it so that next_tid
+	 * is not interpreted as a server if this worker pagefaults.
+	 */
+	if (umcg_task_tls->worker && !should_wait) {
+		server_tid = atomic_load_explicit(&ut->next_tid,
+				memory_order_acquire);
+		assert(server_tid);
+		assert(utid_to_utls(umcg_task_tls->peer)->tid == server_tid);
+		task_lock(umcg_task_tls, UMCG_TASK_RUNNING, UMCG_TASK_IDLE);
+	}
+
+	atomic_store_explicit(&ut->next_tid, next_tid, memory_order_release);
+
+	if (!should_wait)
+		flags |= UMCG_WAIT_WAKE_ONLY;
+	if (wf_current_cpu)
+		flags |= UMCG_WAIT_WF_CURRENT_CPU;
+
+	if (next_utls && next_utls->worker)
+		task_unlock(next_utls, UMCG_TASK_RUNNING, UMCG_TASK_RUNNING);
+	ret = sys_umcg_wait(flags, should_wait ? timeout : 0);
+
+	/* If we locked this worker, unlock it. */
+	if (server_tid) {
+		atomic_store_explicit(&ut->next_tid, server_tid,
+				memory_order_release);
+		task_unlock(umcg_task_tls, UMCG_TASK_IDLE, UMCG_TASK_RUNNING);
+	}
+
+	if (ret && errno == ETIMEDOUT)
+		return handle_timedout();
+
+	return ret;
+}
+
+int umcg_wake(umcg_tid next, bool wf_current_cpu)
+{
+	struct umcg_task_tls *utls = utid_to_utls(next);
+	uint64_t loop_counter = 0;
+
+	if (!utls) {
+		errno = EINVAL;
+		return -1;
+	}
+
+again:
+	assert(++loop_counter < (1ULL << 31));
+	switch (umcg_prepare_wake_may_lock(utls, false /* for_swap */)) {
+	case UMCG_OP_DONE:
+		return 0;
+	case UMCG_OP_SYS:
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	case UMCG_OP_AGAIN:
+		goto again;
+	default:
+		assert(false);
+		return -1;
+	}
+
+	return umcg_do_wake_or_swap(utls->tid, false, 0, wf_current_cpu, utls);
+}
+
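+/*
+ * During a worker-to-worker umcg_swap(), hand the current worker's
+ * server over to @next.
+ */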
+static void transfer_server_locked(struct umcg_task_tls *next)
+{
+	struct umcg_task_tls *curr = umcg_task_tls;
+	struct umcg_task_tls *server = utid_to_utls(curr->peer);
+
+	atomic_thread_fence(memory_order_acquire);
+	assert(curr->worker);
+	assert(next->worker);
+	assert(curr->peer != UMCG_NONE);
+	assert(next->peer == UMCG_NONE);
+
+	next->peer = curr->peer;
+	curr->peer = UMCG_NONE;
+	next->umcg_task.next_tid = curr->umcg_task.next_tid;
+	curr->umcg_task.next_tid = 0;
+
+	server->peer = next->self;
+	server->umcg_task.next_tid = next->tid;
+	atomic_thread_fence(memory_order_release);
+}
+
+int umcg_swap(umcg_tid next, uint64_t timeout)
+{
+	struct umcg_task_tls *utls = utid_to_utls(next);
+	bool should_wake, should_wait;
+	uint64_t loop_counter = 0;
+
+	assert(umcg_task_tls);
+
+again:
+	assert(++loop_counter < (1ULL << 31));
+	switch (umcg_prepare_wake_may_lock(utls, true /* for_swap */)) {
+	case UMCG_OP_DONE:
+		should_wake = false;
+		break;
+	case UMCG_OP_SYS:
+		should_wake = true;
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	case UMCG_OP_AGAIN:
+		goto again;
+	default:
+		assert(false);
+	}
+
+	switch (umcg_prepare_wait_may_lock()) {
+	case UMCG_OP_DONE:
+		should_wait = false;
+		break;
+	case UMCG_OP_SYS:
+		should_wait = true;
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	default:
+		assert(false);
+	}
+
+	if (should_wait && should_wake && umcg_task_tls->worker)
+		transfer_server_locked(utls);
+
+	if (should_wake)
+		return umcg_do_wake_or_swap(utls->tid, should_wait, timeout,
+				false, utls);
+
+	if (should_wait)
+		return umcg_do_wait(timeout);
+
+	return 0;
+}
+
+/* A noop SIGUSR1 handler, used in worker preemption. */
+static void sigusr_handler(int signum)
+{
+}
+
+umcg_t umcg_create_group(uint32_t flags)
+{
+	struct umcg_group *group;
+	int res;
+
+	if (flags && flags != UMCG_GROUP_ENABLE_PREEMPTION) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	group = malloc(sizeof(struct umcg_group));
+	if (!group) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+
+	memset(group, 0, sizeof(*group));
+
+	res = pthread_spin_init(&group->idle_worker_lock, PTHREAD_PROCESS_PRIVATE);
+	if (res) {
+		errno = res;
+		goto error;
+	}
+
+	res = pthread_spin_init(&group->idle_server_lock, PTHREAD_PROCESS_PRIVATE);
+	if (res) {
+		errno = res;
+		res = pthread_spin_destroy(&group->idle_worker_lock);
+		assert(!res);
+		goto error;
+	}
+
+	if (flags & UMCG_GROUP_ENABLE_PREEMPTION) {
+		if (SIG_ERR == signal(SIGUSR1, sigusr_handler)) {
+			res = pthread_spin_destroy(&group->idle_worker_lock);
+			assert(!res);
+			res = pthread_spin_destroy(&group->idle_server_lock);
+			assert(!res);
+			goto error;
+		}
+	}
+
+	return (intptr_t)group;
+
+error:
+	free(group);
+	return UMCG_NONE;
+}
+
+int umcg_destroy_group(umcg_t umcg)
+{
+	int res;
+	struct umcg_group *group = (struct umcg_group *)umcg;
+
+	if (atomic_load_explicit(&group->nr_tasks, memory_order_acquire)) {
+		errno = EBUSY;
+		return -1;
+	}
+
+	res = pthread_spin_destroy(&group->idle_worker_lock);
+	assert(!res);
+	res = pthread_spin_destroy(&group->idle_server_lock);
+	assert(!res);
+
+	free(group);
+	return 0;
+}
+
+static void detach_worker(void)
+{
+	struct umcg_task_tls *server_utls = umcg_task_tls;
+	struct umcg_task_tls *worker_utls;
+
+	assert(server_utls->group != NULL);
+
+	atomic_thread_fence(memory_order_acquire);
+	if (!server_utls->peer)
+		return;
+
+	worker_utls = utid_to_utls(server_utls->peer);
+	assert(server_utls->peer == worker_utls->self);
+	assert(worker_utls->peer == server_utls->self);
+
+	umcg_task_tls->umcg_task.next_tid = 0;
+	worker_utls->umcg_task.next_tid = 0;
+	worker_utls->peer = UMCG_NONE;
+	server_utls->peer = UMCG_NONE;
+
+	atomic_thread_fence(memory_order_release);
+}
+
+umcg_tid umcg_run_worker(umcg_tid worker)
+{
+	struct umcg_task_tls *worker_utls = utid_to_utls(worker);
+	struct umcg_task_tls *server_utls = umcg_task_tls;
+	struct umcg_task *server_ut = &umcg_task_tls->umcg_task;
+	struct umcg_task *worker_ut;
+	uint64_t curr_state, next_state;
+	int ret;
+	bool ok;
+
+	assert(server_utls->group != NULL);
+	assert(server_utls->group == worker_utls->group);
+	assert(worker_utls->worker);
+
+	atomic_thread_fence(memory_order_acquire);
+	assert(server_utls->peer == UMCG_NONE);
+	assert(worker_utls->peer == UMCG_NONE);
+
+	worker_ut = &worker_utls->umcg_task;
+
+	assert(!umcg_worker_in_idle_queue(worker));
+
+	/*
+	 * Mark the server IDLE before marking the worker RUNNING: preemption
+	 * can happen immediately after the worker is marked RUNNING.
+	 */
+	curr_state = atomic_load_explicit((uint64_t *)&server_ut->state_ts,
+			memory_order_acquire);
+	assert((curr_state & UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_RUNNING);
+	ok = umcg_update_state((uint64_t *)&server_ut->state_ts, &curr_state,
+			UMCG_TASK_IDLE);
+	assert(ok);
+
+	/* Lock the worker in preparation to run it. */
+	curr_state = atomic_load_explicit((uint64_t *)&worker_ut->state_ts,
+			memory_order_acquire);
+	assert((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE);
+	assert(!(curr_state & UMCG_TF_LOCKED));
+	next_state = curr_state & UMCG_UTF_WAKEUP_QUEUED ?
+		UMCG_TASK_RUNNING | UMCG_UTF_WAKEUP_QUEUED :
+		UMCG_TASK_RUNNING;
+	ok = umcg_update_state((uint64_t *)&worker_ut->state_ts, &curr_state,
+			next_state | UMCG_TF_LOCKED);
+
+	assert(ok);
+
+	/* Attach the server to the worker. */
+	atomic_thread_fence(memory_order_acquire);
+	server_ut->next_tid = worker_utls->tid;
+	worker_ut->next_tid = server_utls->tid;
+	worker_utls->peer = server_utls->self;
+	server_utls->peer = worker;
+
+	atomic_thread_fence(memory_order_release);
+	task_unlock(worker_utls, next_state, next_state);
+
+	ret = sys_umcg_wait(0, 0);
+
+	atomic_thread_fence(memory_order_acquire);
+	if (!server_utls->peer) {
+		assert(server_ut->next_tid == 0);
+		/*
+		 * The worker woke early due to umcg_state change
+		 * and unregistered/exited.
+		 */
+		assert(!ret || errno == ESRCH);
+		errno = 0;
+		return UMCG_NONE;
+	}
+
+	assert(!ret);
+
+	/* Detach the server from the worker. */
+	worker_utls = utid_to_utls(server_utls->peer);
+	detach_worker();
+
+	return worker_utls->self;
+}
+
+int umcg_preempt_worker(umcg_tid worker)
+{
+	struct umcg_task_tls *worker_utls = utid_to_utls(worker);
+	struct umcg_task *worker_ut = &worker_utls->umcg_task;
+	uint32_t worker_tid = worker_utls->tid;
+	uint64_t curr_state;
+	int ret;
+
+	curr_state = atomic_load_explicit(&worker_ut->state_ts,
+			memory_order_acquire);
+	if ((curr_state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING) {
+		errno = EAGAIN;
+		return -1;
+	}
+
+	if (!umcg_update_state((uint64_t *)&worker_ut->state_ts, &curr_state,
+			UMCG_TASK_RUNNING | UMCG_TF_PREEMPTED)) {
+		errno = EAGAIN;
+		return -1;
+	}
+
+	/*
+	 * It is possible that this thread is descheduled here, the worker
+	 * pagefaults, wakes up, and then exits; in this case tgkill() below
+	 * will fail with errno == ESRCH.
+	 */
+	ret = tgkill(getpid(), worker_tid, SIGUSR1);
+	assert(!ret || errno == ESRCH);
+	return 0;
+}
+
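+/*
+ * Wake one server parked on the group's idle server list, if any.
+ */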
+static void wake_idle_server(void)
+{
+	struct umcg_group *group = umcg_task_tls->group;
+	int res;
+
+	res = pthread_spin_lock(&group->idle_server_lock);
+	assert(!res);
+
+	if (group->idle_servers) {
+		struct umcg_task_tls *server = group->idle_servers;
+
+		group->idle_servers = server->next;
+		server->next = NULL;
+
+		assert((atomic_load_explicit(&server->umcg_task.state_ts,
+						memory_order_acquire) &
+					UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_IDLE);
+
+		res = umcg_wake(server->self, false);
+		assert(!res);
+	}
+
+	res = pthread_spin_unlock(&group->idle_server_lock);
+	assert(!res);
+}
+
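+/*
+ * Pop one idle worker: first from the userspace list cached in
+ * group->idle_workers; if that is empty, atomically detach the
+ * kernel-side list at group->idle_workers_head and retry.
+ */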
+static umcg_tid get_idle_worker(void)
+{
+	struct umcg_group *group = umcg_task_tls->group;
+	umcg_tid result = UMCG_NONE;
+	uint64_t *head;
+	int res;
+
+	res = pthread_spin_lock(&group->idle_worker_lock);
+	assert(!res);
+
+	head = group->idle_workers;
+
+once_again:
+	/* First, check the userspace idle worker list. */
+	if (head) {
+		uint64_t next;
+		struct umcg_task *worker;
+		struct umcg_task_tls *worker_utls;
+
+		worker = container_of((__u64 *)head, struct umcg_task, idle_workers_ptr);
+		worker_utls = container_of(worker, struct umcg_task_tls, umcg_task);
+
+		/* Spin while the worker is pending. */
+		do {
+			next = atomic_load_explicit(head, memory_order_acquire);
+		} while (next == UMCG_IDLE_NODE_PENDING);
+
+		/* Wait for the worker's server to detach in umcg_run_worker(). */
+		while (atomic_load_explicit(&worker_utls->peer,
+						memory_order_relaxed))
+			;
+
+		/* Pull the worker out of the idle worker list. */
+		group->idle_workers = (uint64_t *)next;
+		atomic_store_explicit(&worker->idle_workers_ptr,
+				(uint64_t)&group->idle_workers_head,
+				memory_order_release);
+
+		if (next)
+			wake_idle_server();
+
+		result = worker_utls->self;
+		goto out;
+	}
+
+	/*
+	 * Get the kernel's idle worker list.
+	 *
+	 * TODO: review whether memory order below can be weakened to
+	 * memory_order_acq_rel.
+	 */
+	head = (uint64_t *)atomic_exchange_explicit(&group->idle_workers_head,
+			0ULL, memory_order_seq_cst);
+
+	if (!head)
+		goto out;
+
+	group->idle_workers = head;
+	goto once_again;
+
+out:
+	res = pthread_spin_unlock(&group->idle_worker_lock);
+	assert(!res);
+
+	return result;
+}
+
+static void enqueue_idle_server(void)
+{
+	struct umcg_task_tls *server = umcg_task_tls;
+	struct umcg_group *group = server->group;
+	int res;
+
+	res = pthread_spin_lock(&group->idle_server_lock);
+	assert(!res);
+
+	assert(server->next == NULL);
+	assert((atomic_load_explicit(&server->umcg_task.state_ts,
+					memory_order_acquire) &
+				UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_IDLE);
+
+	server->next = group->idle_servers;
+	group->idle_servers = server;
+
+	res = pthread_spin_unlock(&group->idle_server_lock);
+	assert(!res);
+}
+
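+/*
+ * Mark the current server IDLE and either become the idle server the
+ * kernel wakes (group->idle_server_tid) or park on the group's idle
+ * server list. Returns an idle worker if one appeared while preparing
+ * to sleep, UMCG_NONE otherwise.
+ */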
+static umcg_tid idle_server_wait(void)
+{
+	struct umcg_task_tls *server_utls = umcg_task_tls;
+	struct umcg_task *server_ut = &umcg_task_tls->umcg_task;
+	uint32_t server_tid = server_utls->tid;
+	struct umcg_group *group = umcg_task_tls->group;
+	umcg_tid worker;
+	uint32_t prev = 0ULL;
+	uint64_t state;
+	bool ok;
+
+	state = atomic_load_explicit((uint64_t *)&server_ut->state_ts, memory_order_acquire);
+	assert((state & UMCG_TASK_STATE_MASK_FULL) == UMCG_TASK_RUNNING);
+	ok = umcg_update_state((uint64_t *)&server_ut->state_ts, &state, UMCG_TASK_IDLE);
+	assert(ok);
+
+	/*
+	 * Try to become THE idle server that the kernel will wake.
+	 *
+	 * TODO: review whether memory order below can be weakened to
+	 * memory_order_acq_rel for success and memory_order_acquire
+	 * for failure.
+	 */
+	ok = atomic_compare_exchange_strong_explicit(&group->idle_server_tid,
+			&prev, server_tid,
+			memory_order_seq_cst, memory_order_seq_cst);
+
+	if (!ok) {
+		assert(prev != server_tid);
+		enqueue_idle_server();
+		umcg_do_wait(0);
+		assert(server_utls->next == NULL);
+
+		return UMCG_NONE;
+	}
+
+	/* Ensure no idle workers are enqueued before going to sleep. */
+	worker = get_idle_worker();
+
+	if (worker) {
+		state = atomic_load_explicit(&server_ut->state_ts,
+				memory_order_acquire);
+		if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING) {
+			ok = umcg_update_state((uint64_t *)&server_ut->state_ts,
+					&state, UMCG_TASK_RUNNING);
+			assert(ok || ((state & UMCG_TASK_STATE_MASK_FULL) ==
+						UMCG_TASK_RUNNING));
+		}
+	} else
+		umcg_do_wait(0);
+
+	/*
+	 * If the server calls umcg_get_idle_worker() in a loop, the worker
+	 * that pulled the server at step N (and thus zeroed idle_server_tid)
+	 * may wake the server at step N+1 without cleaning idle_server_tid,
+	 * so the server needs to clean idle_server_tid in case this happens.
+	 *
+	 * TODO: review whether memory order below can be weakened to
+	 * memory_order_acq_rel for success and memory_order_acquire
+	 * for failure.
+	 */
+	prev = server_tid;
+	ok = atomic_compare_exchange_strong_explicit(
+				&group->idle_server_tid, &prev, 0UL,
+				memory_order_seq_cst, memory_order_seq_cst);
+	assert(ok || (prev != server_tid));
+	return worker;
+}
+
+umcg_tid umcg_get_idle_worker(bool wait)
+{
+	umcg_tid result = UMCG_NONE;
+
+	assert(umcg_task_tls->peer == UMCG_NONE);
+	assert((atomic_load_explicit(&umcg_task_tls->umcg_task.state_ts,
+				memory_order_acquire) & UMCG_TASK_STATE_MASK_FULL) ==
+			UMCG_TASK_RUNNING);
+
+	do {
+		result = get_idle_worker();
+
+		if (result || !wait)
+			break;
+
+		result = idle_server_wait();
+	} while (!result);
+
+	assert((atomic_load_explicit(&umcg_task_tls->umcg_task.state_ts,
+				memory_order_acquire) & UMCG_TASK_STATE_MASK_FULL) ==
+			UMCG_TASK_RUNNING);
+	return result;
+}
diff --git a/tools/lib/umcg/libumcg.h b/tools/lib/umcg/libumcg.h
new file mode 100644
index 000000000000..8d97d4032667
--- /dev/null
+++ b/tools/lib/umcg/libumcg.h
@@ -0,0 +1,299 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LIBUMCG_H
+#define __LIBUMCG_H
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <limits.h>
+#include <unistd.h>
+#include <linux/types.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <syscall.h>
+#include <time.h>
+
+#include <linux/umcg.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * LIBUMCG provides userspace UMCG API that hides some of the intricacies
+ * of sys_umcg_ctl() and sys_umcg_wait() syscalls.
+ *
+ * Note that this API is still quite low level and is designed as
+ * a toolkit for building higher-level userspace schedulers.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ */
+
+typedef intptr_t umcg_t;   /* UMCG group ID. */
+typedef intptr_t umcg_tid; /* UMCG thread ID. */
+
+#define UMCG_NONE	(0)
+
+/**
+ * umcg_enabled - indicates whether UMCG syscalls are available.
+ */
+bool umcg_enabled(void);
+
+/**
+ * umcg_get_utid - return the UMCG ID of the current thread.
+ *
+ * The function always succeeds, and the returned ID is guaranteed to be
+ * stable over the life of the thread.
+ *
+ * The ID is NOT guaranteed to be unique over the life of the process.
+ */
+umcg_tid umcg_get_utid(void);
+
+/**
+ * umcg_set_task_tag - add an arbitrary tag to a registered UMCG task.
+ *
+ * Note: not thread-safe: the user is responsible for proper memory fencing.
+ */
+void umcg_set_task_tag(umcg_tid utid, intptr_t tag);
+
+/**
+ * umcg_get_task_tag - get the task tag. Returns zero if none set.
+ *
+ * Note: not thread-safe: the user is responsible for proper memory fencing.
+ */
+intptr_t umcg_get_task_tag(umcg_tid utid);
+
+/**
+ * enum umcg_create_group_flag - flags to pass to umcg_create_group
+ * @UMCG_GROUP_ENABLE_PREEMPTION: enable worker preemption.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ */
+enum umcg_create_group_flag {
+	UMCG_GROUP_ENABLE_PREEMPTION	= 1
+};
+
+/**
+ * umcg_create_group - create a UMCG group
+ * @flags:             a combination of values from enum umcg_create_group_flag
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the group, to be used in e.g. umcg_register.
+ */
+umcg_t umcg_create_group(uint32_t flags);
+
+/**
+ * umcg_destroy_group - destroy a UMCG group
+ * @umcg:               ID of the group to destroy
+ *
+ * The group must be empty (no server or worker threads).
+ *
+ * Return:
+ * 0            - Ok
+ * -1           - an error occurred. Check errno.
+ *                errno == EBUSY: the group still has server or worker threads
+ */
+int umcg_destroy_group(umcg_t umcg);
+
+/**
+ * umcg_register_basic_task - register the current thread as a UMCG basic task
+ * @tag:          An arbitrary tag to be associated with the task.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
+ *                 to match the value returned by umcg_get_utid).
+ */
+umcg_tid umcg_register_basic_task(intptr_t tag);
+
+/**
+ * umcg_register_worker - register the current thread as a UMCG worker
+ * @group_id:      The ID of the UMCG group the thread should join;
+ * @tag:           an arbitrary tag to be associated with the task.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
+ *                 to match the value returned by umcg_get_utid).
+ */
+umcg_tid umcg_register_worker(umcg_t group_id, intptr_t tag);
+
+/**
+ * umcg_register_server - register the current thread as a UMCG server
+ * @group_id:      The ID of the UMCG group the thread should join;
+ * @tag:           an arbitrary tag to be associated with the task.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
+ *                 to match the value returned by umcg_get_utid).
+ */
+umcg_tid umcg_register_server(umcg_t group_id, intptr_t tag);
+
+/**
+ * umcg_unregister_task - unregister the current thread.
+ *
+ * Return:
+ * 0              - OK
+ * -1             - the current thread is not a UMCG thread
+ */
+int umcg_unregister_task(void);
+
+/**
+ * umcg_wait - block the current thread
+ * @timeout:   absolute timeout in nanoseconds
+ *
+ * Blocks the current thread, which must have been registered via one of
+ * the umcg_register_*() functions, until it is woken via umcg_wake() or
+ * swapped into via umcg_swap(). If the current
+ * thread has a wakeup queued (see umcg_wake), returns zero immediately,
+ * consuming the wakeup.
+ *
+ * Return:
+ * 0         - OK, the thread was woken;
+ * -1        - did not wake normally;
+ *               errno:
+ *                 ETIMEDOUT: the timeout expired
+ *                 EINTR: interrupted
+ *                 EINVAL: some other error occurred
+ */
+int umcg_wait(uint64_t timeout);
+
+/**
+ * umcg_wake - wake @next; non-blocking.
+ * @next:            ID of the thread to wake;
+ * @wf_current_cpu:  an advisory hint indicating that the current thread
+ *                   is going to block in the immediate future and that
+ *                   the wakee should be woken on the current CPU;
+ *
+ * If @next is blocked via umcg_wait or umcg_swap, wake it if @next is
+ * a server or a basic task; if @next is a worker, it will be queued
+ * in the idle worker list. If @next is running, queue the wakeup,
+ * so that a future block of @next will consume the wakeup and will not block.
+ *
+ * umcg_wake can queue at most one wakeup; if waking or queueing a wakeup
+ * is not possible, umcg_wake will SPIN.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ *
+ * Return:
+ * 0         - OK, @next has woken, or a wakeup has been queued;
+ * -1        - an error occurred.
+ */
+int umcg_wake(umcg_tid next, bool wf_current_cpu);
+
+/**
+ * umcg_swap - wake @next, put the current thread to sleep
+ * @next:      ID of the thread to wake
+ * @timeout:   absolute timeout in ns
+ *
+ * umcg_swap is semantically equivalent to
+ *
+ *     int ret = umcg_wake(next, true);
+ *     if (ret)
+ *             return ret;
+ *     return umcg_wait(timeout);
+ *
+ * but may do a synchronous context switch into @next on the current CPU.
+ *
+ * Note: if @next is a worker, it must be IDLE, but not in the idle worker list.
+ * See tools/lib/umcg/libumcg.txt for details.
+ */
+int umcg_swap(umcg_tid next, uint64_t timeout);
+
+/**
+ * umcg_get_idle_worker - get an idle worker, if available
+ * @wait: if true, block until an idle worker becomes available
+ *
+ * The current thread must be a UMCG server. If there is a list/queue of
+ * waiting IDLE workers in the server's group, umcg_get_idle_worker
+ * picks one; if there are no IDLE workers, the current thread sleeps in
+ * the idle server queue if @wait is true.
+ *
+ * Note: servers waiting for idle workers must NOT be woken via umcg_wake(),
+ *       as this will leave them in inconsistent state.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ *
+ * Return:
+ * UMCG_NONE         - an error occurred; check errno;
+ * != UMCG_NONE      - a RUNNABLE worker.
+ */
+umcg_tid umcg_get_idle_worker(bool wait);
+
+/**
+ * umcg_run_worker - run @worker as a UMCG server
+ * @worker:          the ID of a RUNNABLE worker to run
+ *
+ * The current thread must be a UMCG "server".
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ *
+ * Return:
+ * UMCG_NONE    - if errno == 0, the last worker the server was running
+ *                unregistered itself; if errno != 0, an error occurred
+ * != UMCG_NONE - the ID of the last worker the server was running before
+ *                the worker was blocked or preempted.
+ */
+umcg_tid umcg_run_worker(umcg_tid worker);
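+
+/*
+ * Illustrative only (not part of the API, and not taken from the
+ * selftests): assuming worker threads have been registered in @group
+ * elsewhere, a minimal server loop could look like the sketch below;
+ * umcg_run_worker() returns when its worker blocks, is preempted, or
+ * unregisters itself:
+ *
+ *	umcg_tid worker;
+ *
+ *	umcg_register_server(group, 0);
+ *	while ((worker = umcg_get_idle_worker(true)) != UMCG_NONE)
+ *		umcg_run_worker(worker);
+ *	umcg_unregister_task();
+ */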
+
+/**
+ * umcg_preempt_worker - preempt a RUNNING worker.
+ * @worker:          the ID of a RUNNING worker to preempt.
+ *
+ * See tools/lib/umcg/libumcg.txt for details.
+ *
+ * Return:
+ * 0        - Ok;
+ * -1       - an error occurred; check errno and `man tgkill()`. In addition
+ *            to tgkill() errors, EAGAIN is also returned if the worker
+ *            is not in RUNNING state (in this case tgkill() was not called).
+ */
+int umcg_preempt_worker(umcg_tid worker);
+
+/**
+ * umcg_get_task_state - return the UMCG state of @task, including state
+ * flags, without the timestamp.
+ *
+ * Note that in most situations the state value can be changed at any time
+ * by a concurrent thread, so this function is exposed for debugging/testing
+ * purposes only.
+ */
+uint64_t umcg_get_task_state(umcg_tid task);
+
+#ifndef NSEC_PER_SEC
+#define NSEC_PER_SEC	1000000000L
+#endif
+
+/**
+ * umcg_get_time_ns - returns the absolute current time in nanoseconds.
+ *
+ * The function uses CLOCK_REALTIME; the returned value can be used
+ * to set absolute timeouts for umcg_wait() and umcg_swap().
+ */
+uint64_t umcg_get_time_ns(void);
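+
+/*
+ * Illustrative only: umcg_wait() and umcg_swap() take absolute
+ * timeouts, so "10 ms from now" could be expressed as
+ *
+ *	umcg_wait(umcg_get_time_ns() + 10 * 1000 * 1000);
+ */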
+
+/**
+ * UMCG userspace-only task state flag: wakeup queued.
+ *
+ * see umcg_wake() above.
+ */
+#define UMCG_UTF_WAKEUP_QUEUED	(1ULL << 17)
+
+/**
+ * UMCG userspace-only task state flag: worker in sys_umcg_wait().
+ *
+ * IDLE workers can be in two substates:
+ * - waiting in sys_umcg_wait(): in this case UTF_WORKER_IN_WAIT flag is set;
+ * - waiting in the idle worker list: in this case the flag is not set.
+ *
+ * If the worker is IDLE in sys_umcg_wait, umcg_wake() clears the flag
+ * and adds the worker to the idle worker list.
+ *
+ * If the worker is IDLE in the idle worker list, umcg_wake() sets
+ * the wakeup queued flag.
+ */
+#define UMCG_UTF_WORKER_IN_WAIT	(1ULL << 16)
+
+#endif  /* __LIBUMCG_H */
--
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v0.9.1 5/6] sched/umcg: add Documentation/userspace-api/umcg.txt
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
                   ` (3 preceding siblings ...)
  2021-11-22 21:13 ` [PATCH v0.9.1 4/6] sched/umcg, lib/umcg: implement libumcg Peter Oskolkov
@ 2021-11-22 21:13 ` Peter Oskolkov
  2021-11-22 21:13 ` [PATCH v0.9.1 6/6] sched/umcg, lib/umcg: add tools/lib/umcg/libumcg.txt Peter Oskolkov
  2021-11-24 14:06 ` [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Zijlstra
  6 siblings, 0 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

Document User Managed Concurrency Groups syscalls, data structures,
state transitions, etc. of the UMCG kernel API.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 Documentation/userspace-api/umcg.txt | 598 +++++++++++++++++++++++++++
 1 file changed, 598 insertions(+)
 create mode 100644 Documentation/userspace-api/umcg.txt

diff --git a/Documentation/userspace-api/umcg.txt b/Documentation/userspace-api/umcg.txt
new file mode 100644
index 000000000000..539b7c6a8962
--- /dev/null
+++ b/Documentation/userspace-api/umcg.txt
@@ -0,0 +1,598 @@
+UMCG API (KERNEL)
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers implement
+in-process user space schedulers.
+
+See tools/lib/umcg/libumcg.txt for the LIBUMCG API, as opposed to the UMCG
+API (kernel) described here. The first three subsections are the same in
+both documents.
+
+
+CONTENTS
+
+    WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+    REQUIREMENTS
+    WHY THE TWO APIS: UMCG (KERNEL) AND LIBUMCG (USERSPACE)?
+    UMCG API (KERNEL)
+    SERVERS
+    WORKERS
+    UMCG TASK STATES
+    STRUCT UMCG_TASK
+    SYS_UMCG_CTL()
+    SYS_UMCG_WAIT()
+    STATE TRANSITIONS
+    SERVER-ONLY USE CASES
+
+
+WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+
+The Linux kernel's CFS scheduler is designed for the "common" use case, with
+efficiency/throughput in mind. Work isolation and workloads of different
+"urgency" are addressed by tools such as cgroups, CPU affinity, priorities,
+etc., which are difficult or impossible to efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands of requests
+per second; some of these requests may have strong response latency
+requirements as they serve live user requests (e.g. login authentication);
+some of these requests may not care much about latency but must be served
+within a certain time period (e.g. an hourly aggregate usage report); some
+of these requests are to be served only on a best-effort basis and can be
+NACKed under high load (e.g. an exploratory research/hypothesis testing
+workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different users;
+for example, user A may "reserve" 1 CPU for their high-priority/low-latency
+requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
+as many best-effort requests as possible, which may or may not be served,
+depending on the DBMS load. Besides, the best-effort work, started when the
+load was low, may need to be delayed if suddenly a large amount of
+higher-priority work arrives. With hundreds or thousands of users like
+this, it is very difficult to guarantee the application's responsiveness
+using standard Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed before a
+certain deadline dictated by frame rendering schedule, while other work
+items can be delayed; some work may need to be cancelled/discarded because
+the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that allows
+constructing user space schedulers designed to efficiently manage
+heterogeneous in-process workloads described above while maintaining high
+CPU utilization (95%+).
+
+
+REQUIREMENTS
+
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with the
+number of threads not exceeding the number of available CPUs. Although this
+approach works, it is quite difficult to develop and maintain such a
+system, as, for example, small continuations are difficult to piece
+together when debugging. Besides, such asynchronous callback-based systems
+tend to be somewhat cache-inefficient, as continuations can get scheduled
+on any CPU regardless of cache locality.
+
+M:N threading and cooperative user space scheduling enables controlled CPU
+usage (minimal OS preemption), synchronous coding style, and better cache
+locality.
+
+Specifically:
+
+* a variable/fluctuating number M of "application" threads should be
+  "scheduled over" a relatively fixed number N of "kernel" threads, where
+  N is less than or equal to the number of CPUs available;
+* only those application threads that are attached to kernel threads are
+  scheduled "on CPU";
+* application threads should be able to cooperatively yield to each other;
+* when an application thread blocks in kernel (e.g. in I/O), this becomes
+  a scheduling event ("block") that the userspace scheduler should be able
+  to efficiently detect, and reassign a waiting application thread to the
+  freed "kernel" thread;
+* when a blocked application thread wakes (e.g. its I/O operation
+  completes), this event ("wake") should also be detectable by the
+  userspace scheduler, which should be able to either quickly dispatch the
+  newly woken thread to an idle "kernel" thread or, if all "kernel"
+  threads are busy, put it in the waiting queue;
+* in addition to the above, it would be extremely useful for a separate
+  in-process "watchdog" facility to be able to monitor the state of each
+  of the M+N threads, and to intervene in case of runaway workloads
+  (interrupt/preempt).
+
+
+WHY THE TWO APIS: UMCG (KERNEL) AND LIBUMCG (USERSPACE)?
+
+UMCG syscalls, sys_umcg_ctl() and sys_umcg_wait(), are designed to make
+the kernel-side UMCG implementation as lightweight as possible. LIBUMCG,
+on the other hand, is designed to expose the key abstractions to users
+in a much more usable, higher-level way.
+
+See tools/lib/umcg/libumcg.txt for more details on LIBUMCG API.
+
+
+UMCG API (KERNEL)
+
+Based on the requirements above, the UMCG API (kernel) is built around
+the following ideas:
+
+* UMCG server: a task/thread representing "kernel threads", or CPUs from
+  the requirements above;
+* UMCG worker: a task/thread representing "application threads", to be
+  scheduled over servers;
+* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
+  server or a worker) can be in;
+* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
+  can be ORed with the task state to communicate additional information to
+  the kernel;
+* struct umcg_task: a per-task userspace set of data fields, usually
+  residing in the TLS, that fully reflects the current task's UMCG state
+  and controls the way the kernel manages the task;
+* sys_umcg_ctl(): a syscall used to register the current task/thread as a
+  server or a worker, or to unregister a UMCG task;
+* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
+  wake another task, potentially context-switching between the two tasks
+  on-CPU synchronously.
+
+
+SERVERS
+
+When a task/thread is registered as a server, it is in RUNNING state and
+behaves like any other normal task/thread. In addition, servers can
+interact with other UMCG tasks via sys_umcg_wait():
+
+* servers can voluntarily suspend their execution (wait), becoming IDLE;
+* servers can wake other IDLE servers;
+* servers can context-switch between each other.
+
+Note that if a server blocks in the kernel not via sys_umcg_wait(), it
+still retains its RUNNING state.
+
+
+WORKERS
+
+A worker cannot be RUNNING without having a server associated with it, so
+when a task is first registered as a worker, it enters the IDLE state.
+
+* a worker becomes RUNNING when a server calls sys_umcg_wait to
+  context-switch into it; the server goes IDLE, and the worker becomes
+  RUNNING in its place;
+* when a RUNNING worker blocks in the kernel, it becomes BLOCKED, its
+  associated server becomes RUNNING and the server's sys_umcg_wait() call
+  from the bullet above returns; this transition is sometimes called
+  "block detection";
+* when the blocking operation of a BLOCKED worker completes, the worker
+  becomes IDLE and is added to the list of idle workers; if there is an
+  idle server waiting, the kernel wakes it; this transition is sometimes
+  called "wake detection";
+* RUNNING workers can voluntarily suspend their execution (wait),
+  becoming IDLE; their associated servers are woken;
+* a RUNNING worker can context-switch with an IDLE worker; the server of
+  the switched-out worker is transferred to the switched-in worker;
+* any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
+  this is a server running the worker as described in the first bullet in
+  this list, the worker remains IDLE but is added to the idle workers list;
+  this "wake" operation exists for completeness, to make sure
+  wait/wake/context-switch operations are available for all UMCG tasks;
+* the userspace can preempt a RUNNING worker by marking it
+  RUNNING|PREEMPTED and sending a signal to it; the userspace should have
+  installed a NOP signal handler for the signal; the kernel will then
+  transition the worker into IDLE|PREEMPTED state and wake its associated
+  server.
+
+
+UMCG TASK STATES
+
+Important: all state transitions described below involve at least two
+steps: the change of the state field in struct umcg_task, for example
+RUNNING to IDLE, and the corresponding change in struct task_struct state,
+for example a transition between the task running on CPU and being
+descheduled and removed from the kernel runqueue. The key principle of UMCG
+API design is that the party initiating the state transition modifies the
+state variable.
+
+For example, a task going IDLE first changes its state from RUNNING to IDLE
+in the userspace and then calls sys_umcg_wait(), which completes the
+transition.
+
+Note on documentation: in include/uapi/linux/umcg.h, task states have the
+form UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, etc. In this document these are
+usually referred to simply as RUNNING and BLOCKED, unless it creates
+ambiguity. Task state flags, e.g. UMCG_TF_PREEMPTED, are treated similarly.
+
+UMCG task states reflect the view from the userspace, rather than from the
+kernel. There are three fundamental task states:
+
+* RUNNING: indicates that the task is schedulable by the kernel; applies
+  to both servers and workers;
+* IDLE: indicates that the task is not schedulable by the kernel (see
+  umcg_idle_loop() in kernel/sched/umcg.c); applies to both servers and
+  workers;
+* BLOCKED: indicates that the worker is blocked in the kernel; does not
+  apply to servers.
+
+In addition to the three states above, two state flags help with state
+transitions:
+
+* LOCKED: the userspace is preparing the worker for a state transition
+  and "locks" the worker until the worker is ready for the kernel to act
+  on the state transition; used similarly to preempt_disable or
+  irq_disable in the kernel; applies only to workers in RUNNING or IDLE
+  state; RUNNING|LOCKED means "this worker is about to become RUNNING",
+  while IDLE|LOCKED means "this worker is about to become IDLE or
+  unregister";
+* PREEMPTED: the userspace indicates it wants the worker to be preempted;
+  there are no situations when both LOCKED and PREEMPTED flags are set at
+  the same time.
+
+
+STRUCT UMCG_TASK
+
+From include/uapi/linux/umcg.h:
+
+struct umcg_task {
+      uint64_t        state_ts;               /* r/w */
+      uint32_t        next_tid;               /* r   */
+      uint32_t        flags;                  /* reserved */
+      uint64_t        idle_workers_ptr;       /* r/w */
+      uint64_t        idle_server_tid_ptr;    /* r*  */
+};
+
+Each UMCG task is identified by struct umcg_task, which is provided to the
+kernel when the task is registered via sys_umcg_ctl().
+
+* uint64_t state_ts: the current state of the task this struct
+  identifies, as described in the previous section, combined with a
+  unique timestamp indicating when the last state change happened.
+
+  Readable/writable by both the kernel and the userspace.
+
+    bits  0 -  5: task state (RUNNING, IDLE, BLOCKED);
+    bits  6 -  7: state flags (LOCKED, PREEMPTED);
+    bits  8 - 12: reserved; must be zeroes;
+    bits 13 - 17: for userspace use;
+    bits 18 - 63: timestamp.
+
+   Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
+
+   It is highly beneficial to tag each state change with a unique
+   timestamp:
+
+   - timestamps will naturally provide instrumentation to measure
+     scheduling delays, both in the kernel and in the userspace;
+   - uniqueness of timestamps (modulo overflow) guarantees that state
+     change races, especially ABA races, are easily detected and avoided.
+
+   Each timestamp represents the moment in time the state change happened,
+   in nanoseconds, with the lower 4 bits and the upper 14 bits stripped.
+
+   In this document 'umcg_task.state' is often used to talk about
+   'umcg_task.state_ts' field, as timestamps do not carry semantic
+   meaning at the moment.
+
+   This is how umcg_task.state_ts is updated in the kernel:
+
+    /* kernel side */
+    /**
+     * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
+     * @state_ts   - points to the state_ts member of struct umcg_task to update;
+     * @expected   - the expected value of state_ts, including the timestamp;
+     * @desired    - the desired value of state_ts, state part only;
+     * @may_fault  - whether to use normal or _nofault cmpxchg.
+     *
+     * The function is basically cmpxchg(state_ts, expected, desired), with extra
+     * code to set the timestamp in @desired.
+     */
+    static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
+                                    bool may_fault)
+    {
+            u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
+            u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+
+            /* Cut higher order bits. */
+            next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
+
+            if (next_ts == curr_ts)
+                    ++next_ts;
+
+            /* Remove an old timestamp, if any. */
+            desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
+
+            /* Set the new timestamp. */
+            desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
+
+            if (may_fault)
+                    return cmpxchg_user_64(state_ts, expected, desired);
+
+            return cmpxchg_user_64_nofault(state_ts, expected, desired);
+    }
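+
+   For illustration, a userspace scheduler can decode state_ts according
+   to the bit layout above. The sketch below is not part of the API: the
+   constant names are assumptions mirroring the layout (the uapi header
+   may already provide equivalents). This and the later userspace sketches
+   in this document assume <stdint.h>, <stddef.h>, <signal.h>, <string.h>,
+   <sys/syscall.h>, <sys/types.h>, <unistd.h> and the uapi header
+   (assumed to be installed as <linux/umcg.h>) are included.
+
+    /* userspace side; a minimal sketch, constant names are assumptions */
+    #define UMCG_TASK_STATE_MASK    0x003fULL    /* bits  0 -  5 */
+    #define UMCG_TASK_FLAGS_MASK    0x00c0ULL    /* bits  6 -  7 */
+    #define UMCG_TS_SHIFT           18           /* bits 18 - 63 */
+
+    static uint64_t umcg_state_of(uint64_t state_ts)
+    {
+            return state_ts & UMCG_TASK_STATE_MASK;
+    }
+
+    static uint64_t umcg_flags_of(uint64_t state_ts)
+    {
+            return state_ts & UMCG_TASK_FLAGS_MASK;
+    }
+
+    static uint64_t umcg_ts_ns_of(uint64_t state_ts)
+    {
+            /* 46-bit timestamp at 16ns granularity (see above). */
+            return (state_ts >> UMCG_TS_SHIFT) << 4;
+    }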
+
+* uint32_t next_tid: contains the TID of the task to context-switch-into
+  in sys_umcg_wait(); can be zero; writable by the userspace, readable by
+  the kernel; if this is a RUNNING worker, this field contains the TID of
+  the server that should be woken when this worker blocks; see
+  sys_umcg_wait() for more details;
+
+* uint32_t flags: reserved; must be zero.
+
+* uint64_t idle_workers_ptr: this field forms a singly-linked list of
+  idle workers: all RUNNING workers have this field set to point to the
+  head of the list (a pointer variable in the userspace).
+
+  When a worker's blocking operation in the kernel completes, the kernel
+  changes the worker's state from BLOCKED to IDLE and adds the worker to
+  the top of the list of idle workers using this logic:
+
+    /* kernel side */
+    /**
+     * enqueue_idle_worker - push an idle worker onto idle_workers_ptr
+     * list/stack.
+     *
+     * Returns true on success, false on a fatal failure.
+     */
+    static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
+    {
+        u64 __user *node = &ut_worker->idle_workers_ptr;
+        u64 __user *head_ptr;
+        u64 first = (u64)node;
+        u64 head;
+
+        if (get_user_nosleep(head, node) || !head)
+                return false;
+
+        head_ptr = (u64 __user *)head;
+
+        if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
+                return false;
+
+        if (xchg_user_64(head_ptr, &first))
+                return false;
+
+        if (put_user_nosleep(first, node))
+                return false;
+
+        return true;
+    }
+
+  In the userspace the list is cleared atomically using this logic:
+
+    /* userspace side */
+    uint64_t *idle_workers;
+
+    /* @head points to the list head variable; detach the whole list. */
+    idle_workers = (uint64_t *)atomic_exchange(head, 0);
+
+  The userspace re-points workers' idle_workers_ptr to the list head
+  variable before the worker is allowed to become RUNNING again.
+
+  When processing the idle workers list, the userspace should wait for
+  workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared (see
+  enqueue_idle_worker() above).
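+
+  For illustration only, a consumer of the detached list could look like
+  the sketch below. It assumes that an empty list/end of list is denoted
+  by zero (matching the atomic_exchange(head, 0) above) and that each list
+  node is the idle_workers_ptr field of the corresponding struct umcg_task:
+
+    /* userspace side; a minimal sketch */
+    uint64_t *node = idle_workers;    /* the list detached above */
+
+    while (node) {
+            struct umcg_task *worker;
+            uint64_t next;
+
+            /* Wait for the kernel to finish linking this node. */
+            do {
+                    next = __atomic_load_n(node, __ATOMIC_ACQUIRE);
+            } while (next == UMCG_IDLE_NODE_PENDING);
+
+            worker = (struct umcg_task *)((char *)node -
+                            offsetof(struct umcg_task, idle_workers_ptr));
+            /* ... hand @worker over to the userspace scheduler ... */
+
+            node = (uint64_t *)next;
+    }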
+
+* uint64_t idle_server_tid_ptr: points to a variable in the userspace
+  that contains the TID of an idle server, i.e. a server in IDLE state
+  waiting in sys_umcg_wait(); read-only; workers must have this field
+  set; not used in servers.
+
+  When a worker's blocking operation in the kernel completes, the kernel
+  changes the worker's state from BLOCKED to IDLE, adds the worker to the
+  list of idle workers, and wakes the idle server if present; the kernel
+  atomically exchanges (*idle_server_tid_ptr) with 0, thus waking the idle
+  server, if present, only once. See State transitions below for more
+  details.
+
+
+SYS_UMCG_CTL()
+
+int sys_umcg_ctl(uint32_t flags, struct umcg_task *self) is used to
+register or unregister the current task as a worker or server. Flags can be
+one of the following:
+
+    UMCG_CTL_REGISTER: register a server;
+    UMCG_CTL_REGISTER | UMCG_CTL_WORKER: register a worker;
+    UMCG_CTL_UNREGISTER: unregister the current server or worker.
+
+When registering a task, self must point to struct umcg_task describing
+this server or worker; the pointer must remain valid until the task is
+unregistered.
+
+When registering a server, self->state must be RUNNING; all other fields in
+self must be zeroes.
+
+When registering a worker, self->state must be BLOCKED;
+self->idle_server_tid_ptr and self->idle_workers_ptr must be valid pointers
+as described in struct umcg_task; self->next_tid must be zero.
+
+When unregistering a task, self must be NULL.
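+
+For illustration, registering the current thread as a server might look
+like the sketch below. It is not part of the API: it assumes the uapi
+definitions (UMCG_CTL_REGISTER, UMCG_TASK_RUNNING, struct umcg_task) are
+available via the installed headers, that the syscall number is exported
+as __NR_umcg_ctl, and it omits error handling:
+
+    /* userspace side; a minimal sketch, error handling omitted */
+    static __thread struct umcg_task self;
+
+    static long register_server(void)
+    {
+            memset(&self, 0, sizeof(self));
+            self.state_ts = UMCG_TASK_RUNNING;    /* all other fields zero */
+
+            return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &self);
+    }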
+
+
+SYS_UMCG_WAIT()
+
+int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
+registered UMCG servers and workers: struct umcg_task *self provided to
+sys_umcg_ctl() when registering the current task is consulted in addition
+to flags and abs_timeout parameters.
+
+The function can be used to perform one of the three operations:
+
+* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
+  task to sleep (see the sketch after the flag list below);
+* wake: if self->next_tid is not zero, and flags & UMCG_WAIT_WAKE_ONLY,
+  the task identified by next_tid is woken;
+* context switch: if self->next_tid is not zero, and !(flags &
+  UMCG_WAIT_WAKE_ONLY), the current task is put to sleep and the next task
+  is woken, synchronously switching between the tasks on the current CPU
+  on the fast path.
+
+Flags can be zero or a combination of the following values:
+
+* UMCG_WAIT_WAKE_ONLY: wake the next task, don't put the current task to
+  sleep;
+* UMCG_WAIT_WF_CURRENT_CPU: wake the next task on the current CPU; this
+  flag has an effect only if UMCG_WAIT_WAKE_ONLY is set: context switching
+  is always attempted on the current CPU.
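+
+For illustration, the "wait" operation above reduces to a state change
+followed by the syscall. The sketch below reuses the assumed constants
+from the STRUCT UMCG_TASK section, assumes UMCG_TASK_IDLE is the uapi
+name of the IDLE state and __NR_umcg_wait the syscall number, and omits
+the timestamp refresh a complete implementation would perform:
+
+    /* userspace side; a minimal sketch, error handling omitted */
+    static long server_wait(struct umcg_task *self, uint64_t abs_timeout)
+    {
+            uint64_t cur = __atomic_load_n(&self->state_ts, __ATOMIC_ACQUIRE);
+            uint64_t next = (cur & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_IDLE;
+
+            /* RUNNING => IDLE is done by the waiter itself (see above). */
+            if (!__atomic_compare_exchange_n(&self->state_ts, &cur, next,
+                                             false, __ATOMIC_RELEASE,
+                                             __ATOMIC_RELAXED))
+                    return -1;
+
+            return syscall(__NR_umcg_wait, 0, abs_timeout);
+    }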
+
+The section below provides more details on how servers and workers interact
+via sys_umcg_wait(), during worker block/wake events, and during worker
+preemption.
+
+
+STATE TRANSITIONS
+
+As mentioned above, the key principle of UMCG state transitions is that the
+party initiating the state transition modifies the state of affected tasks.
+
+Below, "TASK:STATE" indicates a task T, where T can be either W for worker
+or S for server, in state S, where S can be one of the three states,
+potentially ORed with a state flag. Each individual state transition is an
+atomic operation (cmpxchg) unless indicated otherwise. Also note that the
+order of state transitions is important and is part of the contract between
+the userspace and the kernel. The kernel is free to kill the task (SIGKILL)
+if the contract is broken.
+
+Some worker state transitions below include adding the LOCKED flag to the
+worker state. This is done to indicate to the kernel that the worker is
+transitioning state and should not participate in the block/wake detection
+routines, which can happen due to interrupts/pagefaults/signals.
+
+IDLE|LOCKED means that a running worker is preparing to sleep, so
+interrupts should not lead to server wakeup; RUNNING|LOCKED means that an
+idle worker is going to be "scheduled to run", but may not yet have its
+server set up properly.
+
+The key invariant: a RUNNING worker (not LOCKED) must have a server
+assigned to it.
+
+Key state transitions:
+
+* server to worker context switch ("schedule a worker to run"):
+  S:RUNNING+W:IDLE => S:IDLE+W:RUNNING:
+        in the userspace, in the context of the server S running:
+            S:RUNNING => S:IDLE (mark self as idle)
+            W:IDLE => W:RUNNING|LOCKED (mark the worker as running)
+            W.next_tid := S.tid; S.next_tid := W.tid (link the server with
+                the worker)
+            W:RUNNING|LOCKED => W:RUNNING (unlock the worker)
+            S: sys_umcg_wait() (make the syscall)
+        the kernel context switches from the server to the worker; the
+        server sleeps until it becomes RUNNING during one of the
+        transitions below;
+
+* worker to server context switch (worker "yields"): S:IDLE+W:RUNNING =>
+  S:RUNNING+W:IDLE:
+        in the userspace, in the context of the worker W running (note that
+        a running worker has its next_tid set to point to its server):
+            W:RUNNING => W:IDLE|LOCKED (mark self as idle)
+            S:IDLE => S:RUNNING (mark the server as running)
+            W: sys_umcg_wait() (make the syscall)
+        the kernel removes the LOCKED flag from the worker's state and
+        context switches from the worker to the server; the worker sleeps
+        until it becomes RUNNING;
+
+* worker to worker context switch: W1:RUNNING+W2:IDLE =>
+  W1:IDLE+W2:RUNNING:
+        in the userspace, in the context of W1 running:
+            W2:IDLE => W2:RUNNING|LOCKED (mark W2 as running)
+            W1:RUNNING => W1:IDLE|LOCKED (mark self as idle)
+            W2.next_tid := W1.next_tid; S.next_tid := W2.tid (transfer the
+                server W1 => W2)
+            W1:next_tid := W2.tid (indicate that W1 should context-switch
+                into W2)
+            W2:RUNNING|LOCKED => W2:RUNNING (unlock W2)
+            W1: sys_umcg_wait() (make the syscall)
+        same as above, the kernel removes the LOCKED flag from the W1's
+        state and context switches to next_tid;
+
+* worker wakeup: W:IDLE => W:IDLE, W queued into the idle worker list:
+        in the userspace, a server S can wake a worker W sleeping in
+        sys_umcg_wait() without "running" it. This is a purely
+        userspace operation that adds the worker to the idle worker list.
+
+* block detection: worker blocks in the kernel: S:IDLE+W:RUNNING =>
+  S:RUNNING+W:BLOCKED:
+        when a worker blocks in the kernel in RUNNING state (not LOCKED),
+        before descheduling the task from the CPU the kernel performs
+        these operations:
+            W:RUNNING => W:BLOCKED
+            S := W.next_tid
+            S:IDLE => S:RUNNING
+            try_to_wake_up(S)
+        if any of the first three operations above fail, the worker is
+        killed via SIGKILL. Note that ttwu(S) is not required to succeed,
+        as the server may still be transitioning to sleep in
+        sys_umcg_wait(); before actually putting the server to sleep its
+        UMCG state is checked and, if it is RUNNING, sys_umcg_wait()
+        returns to the userspace;
+        if the worker has its LOCKED flag set, block detection does not
+        trigger, as the worker is assumed to be in the userspace
+        scheduling code.
+
+* wake detection: worker wakes in the kernel: W:BLOCKED => W:IDLE:
+        all workers' returns to the userspace are intercepted:
+            start: (a label)
+            if W:RUNNING & W.next_tid != 0: let the worker exit to the
+                userspace, as this is a RUNNING worker with a server;
+            W:* => W:IDLE (workers that were blocked, or that were woken
+                without a server, are not allowed to return to the
+                userspace);
+            the worker is appended to W.idle_workers_ptr idle workers list;
+            S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S:RUNNING;
+                ttwu(S)
+            idle_loop(W): this is the same idle loop that sys_umcg_wait()
+                uses: it breaks only when the worker becomes RUNNING; when
+                the idle loop exits, it is assumed that the userspace has
+                properly removed the worker from the idle workers list
+                before marking it RUNNING;
+            goto start; (repeat from the beginning).
+
+        the logic above is a bit more complicated in the presence of
+        LOCKED or PREEMPTED flags, but the main invariants
+        stay the same:
+            only RUNNING workers with servers assigned are allowed to run
+                in the userspace (unless LOCKED);
+            newly IDLE workers are added to the idle workers list; any
+                user-initiated state change assumes the userspace
+                properly removed the worker from the list;
+            as with block detection, any "breach of contract" by the
+                userspace will result in the task termination via SIGKILL.
+
+* worker preemption: S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED:
+        when the userspace wants to preempt a RUNNING worker, it changes
+        its state, atomically, RUNNING => RUNNING|PREEMPTED and sends a
+        signal to the worker via tgkill(); the signal handler, previously
+        set up by the userspace, can be a NOP (note that only RUNNING
+        workers can be preempted); a userspace sketch of this operation
+        follows this list;
+
+        if the worker, at the moment the signal arrives, is still running
+        on-CPU in the userspace, the "wake detection" code is triggered;
+        in addition to what was described above, it checks whether the
+        worker is in RUNNING|PREEMPTED state:
+            W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED
+            S := W.next_tid
+            S:IDLE => S:RUNNING
+            try_to_wake_up(S)
+
+        if the signal arrives after the worker blocks in the kernel,
+        "block detection" happens as described above, with the
+        following change:
+            W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED
+            S := W.next_tid
+            S:IDLE => S:RUNNING
+            try_to_wake_up(S)
+
+        in any case, the worker's server is woken, with its attached
+        worker (S.next_tid) either in BLOCKED|PREEMPTED or IDLE|PREEMPTED
+        state.
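+
+As referenced in the worker preemption bullet above, a userspace sketch of
+the preemption operation could look as follows. It reuses the assumed
+constants from the STRUCT UMCG_TASK section; SIGUSR1 as the preemption
+signal and the NOP handler for it are choices made by the userspace, not
+part of the API; the timestamp refresh and error handling are omitted:
+
+    /* userspace side; a minimal sketch, error handling omitted */
+    static long preempt_worker(struct umcg_task *worker, pid_t tgid,
+                               pid_t tid)
+    {
+            uint64_t cur = __atomic_load_n(&worker->state_ts,
+                                           __ATOMIC_ACQUIRE);
+
+            /* Only RUNNING workers can be preempted. */
+            if ((cur & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
+                    return -1;
+
+            /* RUNNING => RUNNING|PREEMPTED */
+            if (!__atomic_compare_exchange_n(&worker->state_ts, &cur,
+                                             cur | UMCG_TF_PREEMPTED,
+                                             false, __ATOMIC_RELEASE,
+                                             __ATOMIC_RELAXED))
+                    return -1;
+
+            /* The handler for SIGUSR1 is assumed to be a NOP. */
+            return syscall(SYS_tgkill, tgid, tid, SIGUSR1);
+    }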
+
+
+SERVER-ONLY USE CASES
+
+Some workloads/applications may benefit from fast and synchronous on-CPU
+user-initiated context switches without the need for full userspace
+scheduling (block/wake detection). These applications can use "standalone"
+UMCG servers to wait/wake/context-switch. At the moment only in-process
+operations are allowed. In the future this restriction will be lifted,
+and wait/wake/context-switch operations between servers in related processes
+will be permitted (when it is safe to do so, e.g. if the processes belong
+to the same user and/or cgroup).
+
+These "worker-less" operations involve trivial RUNNING <==> IDLE state
+changes, not discussed here for brevity.
--
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v0.9.1 6/6] sched/umcg, lib/umcg: add tools/lib/umcg/libumcg.txt
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
                   ` (4 preceding siblings ...)
  2021-11-22 21:13 ` [PATCH v0.9.1 5/6] sched/umcg: add Documentation/userspace-api/umcg.txt Peter Oskolkov
@ 2021-11-22 21:13 ` Peter Oskolkov
  2021-11-24 14:06 ` [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Zijlstra
  6 siblings, 0 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-22 21:13 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-mm, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Andrei Vagin, Jann Horn, Thierry Delisle

Document libumcg.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/lib/umcg/libumcg.txt | 438 +++++++++++++++++++++++++++++++++++++
 1 file changed, 438 insertions(+)
 create mode 100644 tools/lib/umcg/libumcg.txt

diff --git a/tools/lib/umcg/libumcg.txt b/tools/lib/umcg/libumcg.txt
new file mode 100644
index 000000000000..06f509bf5341
--- /dev/null
+++ b/tools/lib/umcg/libumcg.txt
@@ -0,0 +1,438 @@
+LIBUMCG API (USERSPACE)
+
+User Managed Concurrency Groups (UMCG) is an M:N threading
+subsystem/toolkit that lets user space application developers implement
+in-process user space schedulers.
+
+See Documentation/userspace-api/umcg.txt for UMCG API (kernel), as opposed
+to LIBUMCG API described here. The first three subsections are the
+same in both documents.
+
+
+CONTENTS
+
+    WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+    REQUIREMENTS
+    WHY THE TWO APIS: UMCG (KERNEL) AND LIBUMCG (USERSPACE)?
+    LIBUMCG API (USERSPACE)
+        SERVERS
+        WORKERS
+        BASIC UMCG TASKS
+    LIBUMCG API
+        umcg_t
+        umcg_tid
+        UMCG_NONE
+        umcg_enabled()
+        umcg_get_utid()
+        umcg_set_task_tag()
+        umcg_get_task_tag()
+        umcg_create_group()
+        umcg_destroy_group()
+        umcg_register_basic_task()
+        umcg_register_worker()
+        umcg_register_server()
+        umcg_unregister_task()
+        umcg_wait()
+        umcg_wake()
+        umcg_swap()
+        umcg_get_idle_worker()
+        umcg_run_worker()
+        umcg_preempt_worker()
+        umcg_get_time_ns()
+
+
+WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
+
+The Linux kernel's CFS scheduler is designed for the "common" use case, with
+efficiency/throughput in mind. Work isolation and workloads of different
+"urgency" are addressed by tools such as cgroups, CPU affinity, priorities,
+etc., which are difficult or impossible to efficiently use in-process.
+
+For example, a single DBMS process may receive tens of thousands of requests
+per second; some of these requests may have strong response latency
+requirements as they serve live user requests (e.g. login authentication);
+some of these requests may not care much about latency but must be served
+within a certain time period (e.g. an hourly aggregate usage report); some
+of these requests are to be served only on a best-effort basis and can be
+NACKed under high load (e.g. an exploratory research/hypothesis testing
+workload).
+
+Beyond different work item latency/throughput requirements as outlined
+above, the DBMS may need to provide certain guarantees to different users;
+for example, user A may "reserve" 1 CPU for their high-priority/low-latency
+requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
+as many best-effort requests as possible, which may or may not be served,
+depending on the DBMS load. Besides, the best-effort work, started when the
+load was low, may need to be delayed if suddenly a large amount of
+higher-priority work arrives. With hundreds or thousands of users like
+this, it is very difficult to guarantee the application's responsiveness
+using standard Linux tools while maintaining high CPU utilization.
+
+Gaming is another use case: some in-process work must be completed before a
+certain deadline dictated by frame rendering schedule, while other work
+items can be delayed; some work may need to be cancelled/discarded because
+the deadline has passed; etc.
+
+User Managed Concurrency Groups is an M:N threading toolkit that allows
+constructing user space schedulers designed to efficiently manage
+heterogeneous in-process workloads described above while maintaining high
+CPU utilization (95%+).
+
+
+REQUIREMENTS
+
+One relatively established way to design high-efficiency, low-latency
+systems is to split all work into small on-cpu work items, with
+asynchronous I/O and continuations, all executed on a thread pool with the
+number of threads not exceeding the number of available CPUs. Although this
+approach works, it is quite difficult to develop and maintain such a
+system, as, for example, small continuations are difficult to piece
+together when debugging. Besides, such asynchronous callback-based systems
+tend to be somewhat cache-inefficient, as continuations can get scheduled
+on any CPU regardless of cache locality.
+
+M:N threading and cooperative user space scheduling enables controlled CPU
+usage (minimal OS preemption), synchronous coding style, and better cache
+locality.
+
+Specifically:
+
+* a variable/fluctuating number M of "application" threads should be
+  "scheduled over" a relatively fixed number N of "kernel" threads, where
+  N is less than or equal to the number of CPUs available;
+* only those application threads that are attached to kernel threads are
+  scheduled "on CPU";
+* application threads should be able to cooperatively yield to each other;
+* when an application thread blocks in kernel (e.g. in I/O), this becomes
+  a scheduling event ("block") that the userspace scheduler should be able
+  to efficiently detect, and reassign a waiting application thread to the
+  freed "kernel" thread;
+* when a blocked application thread wakes (e.g. its I/O operation
+  completes), this event ("wake") should also be detectable by the
+  userspace scheduler, which should be able to either quickly dispatch the
+  newly woken thread to an idle "kernel" thread or, if all "kernel"
+  threads are busy, put it in the waiting queue;
+* in addition to the above, it would be extremely useful for a separate
+  in-process "watchdog" facility to be able to monitor the state of each
+  of the M+N threads, and to intervene in case of runaway workloads
+  (interrupt/preempt).
+
+
+WHY THE TWO APIS: UMCG (KERNEL) AND LIBUMCG (USERSPACE)?
+
+UMCG syscalls, sys_umcg_ctl() and sys_umcg_wait(), are designed to make
+the kernel-side UMCG implementation as lightweight as possible. LIBUMCG,
+on the other hand, is designed to expose the key abstractions to users
+in a much more usable, higher-level way.
+
+See Documentation/userspace-api/umcg.txt for more details on
+UMCG API (kernel).
+
+Please note that LIBUMCG API is itself a rather low-level API intended
+to be used to construct higher-level userspace schedulers.
+
+Note: to avoid confusion, in this document "UMCG servers/workers" refer to
+UMCG tasks when considered in the context of the kernel UMCG API (syscalls),
+while "LIBUMCG servers/workers" refer to the same tasks when considered
+in the context of the userspace LIBUMCG API outlined below. When the
+distinction is not important, "UMCG servers/workers" is used generically.
+
+
+LIBUMCG API (USERSPACE)
+
+Based on the requirements above, the LIBUMCG API (userspace) is built
+around the following ideas:
+
+* UMCG server: a thread representing "kernel threads", or CPUs from
+  the requirements above;
+* UMCG worker: a thread representing "application threads", to be
+  scheduled over servers;
+* UMCG group: a collection of servers and workers that can interact with
+  each other; a single process may contain several UMCG groups (e.g. a
+  group per NUMA node);
+* a set of functions (API) that allows workers to be "scheduled" over
+  servers and to interact with one another cooperatively.
+
+
+LIBUMCG SERVERS
+
+When a thread is registered as a server, it behaves like any other normal
+thread.
+
+Servers can interact with other servers in the same UMCG group:
+
+* servers can voluntarily suspend their execution by calling umcg_wait();
+* servers can wake other servers by calling umcg_wake();
+* servers can context-switch between each other by calling umcg_swap().
+
+Servers can also interact with workers in their UMCG group:
+
+* servers can schedule ("run") workers in their place by calling
+  umcg_run_worker(); when the worker blocks, the function returns;
+* servers can query for workers that finished their blocking operations
+  by calling umcg_get_idle_worker();
+* servers can force running workers into the idle state, waking the
+  servers that were running those workers, by calling umcg_preempt_worker().
+
+
+LIBUMCG WORKERS
+
+A worker cannot be running without having a server associated with it, so
+when a task is first registered as a worker, it is blocked until a server
+"runs" it (new workers are added to the idle worker list).
+
+Workers can interact with other workers in their UMCG group:
+
+* workers can voluntarily suspend their execution by calling umcg_wait();
+* workers can wake other workers by calling umcg_wake();
+* workers can context-switch between each other by calling umcg_swap().
+
+
+LIBUMCG BASIC UMCG TASKS
+
+If the application is only interested in server-to-server interactions,
+it does not need to create a UMCG group and may register a server as a
+"basic UMCG task". Same umcg_[wait|wake|swap] functions are available.
+
+
+LIBUMCG API: umcg_t
+
+umcg_t is an opaque pointer indicating a UMCG group.
+
+
+LIBUMCG API: umcg_tid
+
+umcg_tid is an opaque pointer indicating a UMCG task (a basic task,
+a server, or a worker).
+
+
+LIBUMCG API: UMCG_NONE
+
+UMCG_NONE holds a NULL value for variables of type umcg_t or umcg_tid.
+
+
+LIBUMCG API: umcg_enabled()
+
+bool umcg_enabled(void) - returns true if the running kernel exposes
+                          UMCG kernel API (sys_umcg_ctl and sys_umcg_wait).
+
+
+LIBUMCG API: umcg_get_utid
+
+umcg_tid umcg_get_utid(void) - returns the umcg_tid value identifying
+                               the current thread as a UMCG task. The value
+                               is guaranteed to be stable over the life
+                               of the thread, but may be reused between
+                               different threads (it is a pointer to a TLS
+                               variable).
+
+
+LIBUMCG API: umcg_set_task_tag()
+
+void umcg_set_task_tag(umcg_tid utid, intptr_t tag) - a helper function
+                       used to associate an arbitrary user-provided value
+                       with a umcg task/thread.
+
+
+LIBUMCG API: umcg_get_task_tag()
+
+intptr_t umcg_get_task_tag(umcg_tid utid) - returns a previously set
+                           task tag, or zero.
+
+
+LIBUMCG API: umcg_create_group()
+
+umcg_t umcg_create_group(uint32_t flags) - create a UMCG group.
+
+    UMCG servers and workers at LIBUMCG level must belong to the same UMCG
+    group to interact; note that this is different from UMCG kernel API,
+    where servers and workers can all interact within the same process.
+
+    UMCG groups are used to partition UMCG tasks (servers and workers) within
+    a process, e.g. to allow NUMA-aware scheduling.
+
+
+LIBUMCG API: umcg_destroy_group()
+
+int umcg_destroy_group(umcg_t umcg) - destroy a UMCG group. The group must
+                       be empty, i.e. all its servers and workers must have
+                       unregistered.
+
+
+LIBUMCG API: umcg_register_basic_task()
+
+umcg_tid umcg_register_basic_task(intptr_t tag) - register the current
+                                  thread as a basic LIBUMCG task.
+
+    Basic LIBUMCG tasks do not belong to UMCG groups, and thus cannot
+    interact with LIBUMCG workers.
+
+    At the kernel level basic LIBUMCG tasks are servers.
+
+
+LIBUMCG API: umcg_register_worker()
+
+umcg_tid umcg_register_worker(umcg_t group_id, intptr_t tag) - register
+                              the current thread as a LIBUMCG worker in a group.
+
+    LIBUMCG workers, once registered, can forget about being UMCG workers,
+    as the only difference vs "normal" threads is that now workers are
+    scheduled not by the kernel, but by servers in their UMCG group, which
+    happens "transparently" to UMCG workers.
+
+    LIBUMCG workers may call umcg_[wait|wake|swap] to cooperatively share
+    workload with other LIBUMCG workers in the same group.
+
+    Note: at the moment UMCG workers, once registered, cannot receive
+    non-fatal signals.
+
+
+LIBUMCG API: umcg_register_server()
+
+umcg_tid umcg_register_server(umcg_t group_id, intptr_t tag) - register
+                              the current thread as a LIBUMCG server in a group.
+
+    LIBUMCG servers schedule LIBUMCG workers in the same group via
+    umcg_get_idle_worker(), umcg_run_worker(), and umcg_preempt_worker().
+    See descriptions of these functions below for more details.
+
+
+LIBUMCG API: umcg_unregister_task()
+
+int umcg_unregister_task(void) - unregister the current thread as a UMCG task.
+
+   A thread can be only one type of a UMCG task at a time. Once unregistered,
+   a thread can register again as a different UMCG task type, in the same
+   or a different group.
+
+
+LIBUMCG API: umcg_wait()
+
+int umcg_wait(uint64_t timeout) - block the current UMCG task until
+              the timeout expires or it is woken via umcg_wake() or umcg_swap().
+
+    All UMCG task types can call umcg_wait().
+
+
+LIBUMCG API: umcg_wake()
+
+int umcg_wake(umcg_tid next, bool wf_current_cpu) - wake a UMCG task.
+
+    Wake a umcg task if it is blocked as a result of calling umcg_wait() or
+    umcg_swap(). If @next is NOT blocked in umcg_wait() or umcg_swap(),
+    it will be marked as "wakeup queued"; when @next calls umcg_wait() or
+    umcg_swap() later, the "wakeup queued" flag will be removed and
+    the function will not block.
+
+    Only one wakeup can be queued per task, so calling umcg_wake for
+    a task with a wakeup queued will spin (in the userspace) until the
+    wakeup flag is cleared.
+
+    If @next is a worker blocked in umcg_wait(), the worker is added
+    to the idle workers list so that it will then be picked up by
+    umcg_get_idle_worker().
+
+    Note that "wakeup queued" is a purely LIBUMCG (userspace) concept:
+    the kernel (UMCG kernel API) is unaware of it.
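+
+    For illustration, two threads registered as basic LIBUMCG tasks can
+    block and wake each other with umcg_wait()/umcg_wake(). The sketch
+    below is not part of the API: it assumes libumcg.h is included, that
+    a timeout of 0 means "no timeout", that @arg points to the peer's
+    umcg_tid (obtained via umcg_get_utid() in the peer thread), and it
+    omits error handling and thread setup:
+
+        /* a minimal sketch; error handling and thread setup omitted */
+        static void *ping(void *arg)
+        {
+                umcg_tid peer;
+
+                umcg_register_basic_task(0);
+
+                peer = *(umcg_tid *)arg;  /* the peer's umcg_get_utid() */
+
+                umcg_wake(peer, false);   /* wake the peer, or queue a wakeup */
+                umcg_wait(0);             /* block until the peer wakes us */
+
+                umcg_unregister_task();
+                return NULL;
+        }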
+
+
+LIBUMCG API: umcg_swap()
+
+int umcg_swap(umcg_tid next, uint64_t timeout) - block the current task; wake @next.
+
+    umcg_swap() can be used for server-to-server or worker-to-worker
+    context switches, but NOT server-to-worker or worker-to-server.
+    If a server wants to context switch with a worker, the server
+    should call umcg_run_worker(). If a worker wants to context switch
+    to its server, it should call umcg_wait().
+
+    In server-to-server context switches, the switching-out server
+    is RUNNING; the switching-in server can either be IDLE or RUNNING.
+    If the switching-in server is running, it will have a wakeup queued.
+    If the switching-out server has a wakeup queued, umcg_swap() will
+    consume the wakeup.
+
+    In worker-to-worker context switches, the "normal" behavior is that
+    the RUNNING switching-out worker becomes IDLE, its server is
+    transferred to the IDLE switching-in worker, and the switching-in
+    worker becomes RUNNING.
+
+    The switching-in worker MUST NOT be in the idle worker list: it
+    can either be in umcg_wait(), or pulled out of the idle worker list
+    previously.
+
+    Same wakeup-queued rules apply to swapping workers as they apply
+    to swapping servers.
+
+    Note that while with servers umcg_swap() is technically equivalent
+    to { umcg_wake(); umcg_wait(); }, with possible on-cpu optimizations,
+    with workers there is a difference, as umcg_swap() will RUN an
+    idle worker, while umcg_wake() will add it to the idle worker list.
+
+    This difference, however, is transparent for workers: workers
+    engaging in cooperative scheduling via wait/wake/swap will observe
+    exactly the same behavior as if they were servers or basic UMCG tasks.
+
+
+LIBUMCG API: umcg_get_idle_worker()
+
+umcg_tid umcg_get_idle_worker(bool wait) - get an idle worker from
+                                           the idle worker list.
+
+    Servers can query for unblocked workers by calling umcg_get_idle_worker().
+
+    There are two idle worker lists per UMCG group; the kernel-side list,
+    as described in Documentation/userspace-api/umcg.txt, and the
+    userspace-side list. umcg_get_idle_worker() first checks the userspace
+    list and, if it is not empty, returns the first available idle worker,
+    removing it from the list.
+
+    If the userspace list is empty, the function swaps it with the kernel-side
+    list of idle workers, and then checks the userspace list again.
+
+    If @wait is true, the function blocks if there are no idle workers
+    available. The server will be added to the group's idle server list.
+    The server may also be pointed at by struct umcg_task.idle_server_tid_ptr.
+
+    It is safe to call umcg_get_idle_worker() concurrently.
+
+
+LIBUMCG API: umcg_run_worker()
+
+umcg_tid umcg_run_worker(umcg_tid worker) - run the worker.
+
+    Servers "run" workers by calling umcg_run_worker(). @worker must be
+    IDLE, and must NOT be in the idle worker list. I.e. a server may
+    run a worker that is blocked in umcg_wait() either if umcg_wake()
+    has NOT been called on the @worker, or if umcg_wake() HAS been
+    called and the worker was then returned via umcg_get_idle_worker().
+
+    umcg_run_worker() will block; if the worker the server is running
+    swaps with another worker, the server will get reassigned to that
+    new worker.
+
+    When the worker the server is running blocks, umcg_run_worker returns
+    the worker's umcg_tid, or UMCG_NONE if the worker unregisters.
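+
+    For illustration, a trivial single-server scheduling loop combining
+    umcg_get_idle_worker() and umcg_run_worker() could look like the
+    sketch below (libumcg.h assumed included; group/worker setup omitted;
+    the policy of always running the next idle worker is a placeholder,
+    not a recommendation):
+
+        /* a minimal sketch; error handling omitted */
+        static void server_loop(void)
+        {
+                for (;;) {
+                        umcg_tid worker = umcg_get_idle_worker(true);
+
+                        if (worker == UMCG_NONE)
+                                continue;  /* nothing to run; check errno */
+
+                        /*
+                         * Runs @worker until it blocks, yields or
+                         * unregisters; a blocked worker will reappear via
+                         * umcg_get_idle_worker() once its blocking
+                         * operation completes.
+                         */
+                        umcg_run_worker(worker);
+                }
+        }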
+
+
+LIBUMCG API: umcg_preempt_worker()
+
+int umcg_preempt_worker(umcg_tid worker) - preempt a RUNNING worker.
+
+    The function interrupts a RUNNING worker and wakes its server. If
+    the worker is not RUNNING (i.e. BLOCKED or IDLE), the function
+    returns an error with errno set to EAGAIN.
+
+    The group the worker belongs to must be created with
+    UMCG_GROUP_ENABLE_PREEMPTION flag set.
+
+    The function may be called from any thread in the process that
+    @worker belongs to.
+
+
+LIBUMCG API: umcg_get_time_ns()
+
+uint64_t umcg_get_time_ns(void) - return the current absolute time.
+
+    This function can be used to calculate the absolute timeouts passed
+    to umcg_wait() and umcg_swap().
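+
+    For example, a 100 millisecond wait could be expressed as in the
+    sketch below (NSEC_PER_SEC is provided by libumcg.h; error handling
+    omitted):
+
+        /* a minimal sketch; error handling omitted */
+        uint64_t deadline = umcg_get_time_ns() + NSEC_PER_SEC / 10;
+
+        umcg_wait(deadline);    /* returns when woken, or once the
+                                   deadline passes */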
--
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups
  2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
                   ` (5 preceding siblings ...)
  2021-11-22 21:13 ` [PATCH v0.9.1 6/6] sched/umcg, lib/umcg: add tools/lib/umcg/libumcg.txt Peter Oskolkov
@ 2021-11-24 14:06 ` Peter Zijlstra
  2021-11-24 16:28   ` Peter Oskolkov
  6 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 14:06 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:21PM -0800, Peter Oskolkov wrote:
> User Managed Concurrency Groups (UMCG) is an M:N threading
> subsystem/toolkit that lets user space application developers implement
> in-process user space schedulers.
> 
> This v0.9.1 patchset is the same as v0.9, where u32/u64 in
> uapi/linux/umcg.h are replaced with __u32/__u64, as test robot/lkp
> does not recognize u32/u64 for some reason.
> 
> v0.9 is v0.8 rebased on top of the current tip/sched/core,
> with a fix in umcg_update_state of an issue reported by Tao Zhou.
> 
> Key changes from patchset v0.7:
> https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/:
> 
> - added libumcg tools/lib/umcg;
> - worker "wakeup" is reworked so that it is now purely a userspace op,
>   instead of waking the thread in order for it to block on return
>   to the userspace immediately;
> - a couple of minor fixes and refactorings.
> 
> These big things remain to be addressed (in no particular order):
> - support tracing/debugging
> - make context switches faster (see umcg_do_context_switch in umcg.c)
> - support other architectures
> - cleanup and post selftests in tools/testing/selftests/umcg/
> - allow cross-mm wakeups (securely)

*groan*... so these patches do *NOT* support the very thing this all
started with, namely block + wakeup notifications. I'm really not sure
how that happened, as that was the sole purpose of the exercise.

Aside of that, the whole uaccess stuff is horrific :-( I'll reply to
that email separately, but the alternative is also included in the
random hackery below.

I'm still trying to make sense of it all, but I'm really not seeing how
any of this satisfies the initial goals, also it is once again 100% new
code :/

---
 arch/x86/Kconfig                  |    1 
 arch/x86/include/asm/uaccess.h    |  106 +++++++++++++++
 arch/x86/include/asm/uaccess_64.h |   93 -------------
 include/linux/entry-common.h      |    2 
 include/linux/sched.h             |   29 ++--
 include/linux/thread_info.h       |    2 
 include/linux/uaccess.h           |   46 ------
 init/Kconfig                      |    7 -
 kernel/entry/common.c             |   11 +
 kernel/sched/umcg.c               |  231 ++++++++++++++++++++-------------
 mm/maccess.c                      |  264 --------------------------------------
 11 files changed, 278 insertions(+), 514 deletions(-)

Index: linux-2.6/arch/x86/include/asm/uaccess_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/uaccess_64.h
+++ linux-2.6/arch/x86/include/asm/uaccess_64.h
@@ -79,97 +79,4 @@ __copy_from_user_flushcache(void *dst, c
 	kasan_check_write(dst, size);
 	return __copy_user_flushcache(dst, src, size);
 }
-
-#define ARCH_HAS_ATOMIC_UACCESS_HELPERS 1
-
-static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
-						u32 oldval, u32 newval)
-{
-	int ret = 0;
-
-	asm volatile("\n"
-		"1:\t" LOCK_PREFIX "cmpxchgl %4, %2\n"
-		"2:\n"
-		"\t.section .fixup, \"ax\"\n"
-		"3:\tmov     %3, %0\n"
-		"\tjmp     2b\n"
-		"\t.previous\n"
-		_ASM_EXTABLE_UA(1b, 3b)
-		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
-		: "i" (-EFAULT), "r" (newval), "1" (oldval)
-		: "memory"
-	);
-	*uval = oldval;
-	return ret;
-}
-
-static inline int __try_cmpxchg_user_64(u64 *uval, u64 __user *uaddr,
-						u64 oldval, u64 newval)
-{
-	int ret = 0;
-
-	asm volatile("\n"
-		"1:\t" LOCK_PREFIX "cmpxchgq %4, %2\n"
-		"2:\n"
-		"\t.section .fixup, \"ax\"\n"
-		"3:\tmov     %3, %0\n"
-		"\tjmp     2b\n"
-		"\t.previous\n"
-		_ASM_EXTABLE_UA(1b, 3b)
-		: "+r" (ret), "=a" (oldval), "+m" (*uaddr)
-		: "i" (-EFAULT), "r" (newval), "1" (oldval)
-		: "memory"
-	);
-	*uval = oldval;
-	return ret;
-}
-
-static inline int __try_xchg_user_32(u32 *oval, u32 __user *uaddr, u32 newval)
-{
-	u32 oldval = 0;
-	int ret = 0;
-
-	asm volatile("\n"
-		"1:\txchgl %0, %2\n"
-		"2:\n"
-		"\t.section .fixup, \"ax\"\n"
-		"3:\tmov     %3, %1\n"
-		"\tjmp     2b\n"
-		"\t.previous\n"
-		_ASM_EXTABLE_UA(1b, 3b)
-		: "=r" (oldval), "=r" (ret), "+m" (*uaddr)
-		: "i" (-EFAULT), "0" (newval), "1" (0)
-	);
-
-	if (ret)
-		return ret;
-
-	*oval = oldval;
-	return 0;
-}
-
-static inline int __try_xchg_user_64(u64 *oval, u64 __user *uaddr, u64 newval)
-{
-	u64 oldval = 0;
-	int ret = 0;
-
-	asm volatile("\n"
-		"1:\txchgq %0, %2\n"
-		"2:\n"
-		"\t.section .fixup, \"ax\"\n"
-		"3:\tmov     %3, %1\n"
-		"\tjmp     2b\n"
-		"\t.previous\n"
-		_ASM_EXTABLE_UA(1b, 3b)
-		: "=r" (oldval), "=r" (ret), "+m" (*uaddr)
-		: "i" (-EFAULT), "0" (newval), "1" (0)
-	);
-
-	if (ret)
-		return ret;
-
-	*oval = oldval;
-	return 0;
-}
-
 #endif /* _ASM_X86_UACCESS_64_H */
Index: linux-2.6/include/linux/uaccess.h
===================================================================
--- linux-2.6.orig/include/linux/uaccess.h
+++ linux-2.6/include/linux/uaccess.h
@@ -408,50 +408,4 @@ void __noreturn usercopy_abort(const cha
 			       unsigned long len);
 #endif
 
-#ifdef ARCH_HAS_ATOMIC_UACCESS_HELPERS
-/**
- * cmpxchg_user_[32|64][_nofault|]() - compare_exchange 32/64-bit values
- * @uaddr:     Destination address, in user space;
- * @curr_val:  Source address, in kernel space;
- * @new_val:   The value to write to the destination address.
- *
- * This is the standard cmpxchg: atomically: compare *@uaddr to *@curr_val;
- * if the values match, write @new_val to @uaddr, return 0; if the values
- * do not match, write *@uaddr to @curr_val, return -EAGAIN.
- *
- * The _nofault versions don't fault and can be used in
- * atomic/preempt-disabled contexts.
- *
- * Return:
- * 0      : OK/success;
- * -EINVAL: @uaddr is not properly aligned ('may fault' versions only);
- * -EFAULT: memory access error (including mis-aligned @uaddr in _nofault);
- * -EAGAIN: @old did not match.
- */
-int cmpxchg_user_32_nofault(u32 __user *uaddr, u32 *curr_val, u32 new_val);
-int cmpxchg_user_64_nofault(u64 __user *uaddr, u64 *curr_val, u64 new_val);
-int cmpxchg_user_32(u32 __user *uaddr, u32 *curr_val, u32 new_val);
-int cmpxchg_user_64(u64 __user *uaddr, u64 *curr_val, u64 new_val);
-
-/**
- * xchg_user_[32|64][_nofault|]() - exchange 32/64-bit values
- * @uaddr:   Destination address, in user space;
- * @val:     Source address, in kernel space.
- *
- * This is the standard atomic xchg: exchange values pointed to by @uaddr and @val.
- *
- * The _nofault versions don't fault and can be used in
- * atomic/preempt-disabled contexts.
- *
- * Return:
- * 0      : OK/success;
- * -EINVAL: @uaddr is not properly aligned ('may fault' versions only);
- * -EFAULT: memory access error (including mis-aligned @uaddr in _nofault).
- */
-int xchg_user_32_nofault(u32 __user *uaddr, u32 *val);
-int xchg_user_64_nofault(u64 __user *uaddr, u64 *val);
-int xchg_user_32(u32 __user *uaddr, u32 *val);
-int xchg_user_64(u64 __user *uaddr, u64 *val);
-#endif		/* ARCH_HAS_ATOMIC_UACCESS_HELPERS */
-
 #endif		/* __LINUX_UACCESS_H__ */
Index: linux-2.6/kernel/sched/umcg.c
===================================================================
--- linux-2.6.orig/kernel/sched/umcg.c
+++ linux-2.6/kernel/sched/umcg.c
@@ -39,6 +39,10 @@
  *                 and its server.
  *
  * The pages are pinned when the worker exits to the userspace and unpinned
+ *
+ * XXX exit is wrong; must pin on syscall-entry. Otherwise the pin is of
+ * XXX unbounded duration.
+ *
  * when the worker is in sched_submit_work(), i.e. when the worker is
  * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
  * are pinned at any one time across the whole system.
@@ -67,10 +71,12 @@ static int umcg_pin_pages(u32 server_tid
 	tsk = current;
 
 	/* worker_ut is stable, don't need to repin */
-	if (!tsk->pinned_umcg_worker_page)
-		if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
-					&tsk->pinned_umcg_worker_page))
+	// XXX explain, this should never be so
+	if (!tsk->pinned_umcg_worker_page) {
+		if (pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
+					&tsk->pinned_umcg_worker_page) != 1)
 			return -EFAULT;
+	}
 
 	/* server_ut may change, need to repin */
 	if (tsk->pinned_umcg_server_page) {
@@ -78,8 +84,8 @@ static int umcg_pin_pages(u32 server_tid
 		tsk->pinned_umcg_server_page = NULL;
 	}
 
-	if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
-				&tsk->pinned_umcg_server_page))
+	if (pin_user_pages_fast((unsigned long)server_ut, 1, 0,
+				&tsk->pinned_umcg_server_page) != 1)
 		return -EFAULT;
 
 	return 0;
@@ -89,13 +95,14 @@ static void umcg_unpin_pages(void)
 {
 	struct task_struct *tsk = current;
 
-	if (tsk->pinned_umcg_worker_page)
+	if (tsk->pinned_umcg_worker_page) {
 		unpin_user_page(tsk->pinned_umcg_worker_page);
-	if (tsk->pinned_umcg_server_page)
+		tsk->pinned_umcg_worker_page = NULL;
+	}
+	if (tsk->pinned_umcg_server_page) {
 		unpin_user_page(tsk->pinned_umcg_server_page);
-
-	tsk->pinned_umcg_worker_page = NULL;
-	tsk->pinned_umcg_server_page = NULL;
+		tsk->pinned_umcg_server_page = NULL;
+	}
 }
 
 static void umcg_clear_task(struct task_struct *tsk)
@@ -137,12 +144,18 @@ void umcg_handle_exiting_worker(void)
  *
  * The function is basically cmpxchg(state_ts, expected, desired), with extra
  * code to set the timestamp in @desired.
+ *
+ * XXX I don't understand the need for this complexity; umcg_task is only 4
+ * XXX u64's long, that means there's 4 more in the cacheline, why can't the
+ * XXX timestamp get its own word?
+ *
+ * XXX Also, is a single timestamp sufficient?
  */
-static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
-				bool may_fault)
+static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired)
 {
 	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
 	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
+	bool success;
 
 	/* Cut higher order bits. */
 	next_ts &= (1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1;
@@ -156,10 +169,17 @@ static int umcg_update_state(u64 __user
 	/* Set the new timestamp. */
 	desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
 
-	if (may_fault)
-		return cmpxchg_user_64(state_ts, expected, desired);
+	if (!user_access_begin(state_ts, sizeof(*state_ts)))
+		return -EFAULT;
+
+	success = __try_cmpxchg_user((u64 *)state_ts, expected, desired, Efault);
+	user_access_end();
+
+	return success ? 0 : -EAGAIN;
 
-	return cmpxchg_user_64_nofault(state_ts, expected, desired);
+Efault:
+	user_access_end();
+	return -EFAULT;
 }
 
 /**
@@ -233,8 +253,7 @@ SYSCALL_DEFINE2(umcg_ctl, u32, flags, st
 		WRITE_ONCE(current->umcg_task, self);
 		current->flags |= PF_UMCG_WORKER;
 
-		/* Trigger umcg_handle_resuming_worker() */
-		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		set_syscall_work(SYSCALL_UMCG);
 	} else {
 		if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
 			return -EINVAL;
@@ -263,7 +282,7 @@ static int handle_timedout_worker(struct
 		next_state = curr_state & ~UMCG_TASK_STATE_MASK;
 		next_state |= UMCG_TASK_BLOCKED;
 
-		ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
+		ret = umcg_update_state(&self->state_ts, &curr_state, next_state);
 		if (ret)
 			return ret;
 
@@ -324,7 +343,7 @@ static int umcg_idle_loop(u64 abs_timeou
 				current->timer_slack_ns);
 	}
 
-	while (true) {
+	for (;;) {
 		u64 umcg_state;
 
 		/*
@@ -333,22 +352,18 @@ static int umcg_idle_loop(u64 abs_timeou
 		 * but faulting is not allowed; so we try a fast no-fault read,
 		 * and if it fails, pin the page temporarily.
 		 */
-retry_once:
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		/* Order set_current_state above with get_user below. */
-		smp_mb();
 		ret = -EFAULT;
 		if (get_user_nofault(umcg_state, &self->state_ts)) {
-			set_current_state(TASK_RUNNING);
-
 			if (pinned_page)
-				goto out;
-			else if (1 != pin_user_pages_fast((unsigned long)self,
-						1, 0, &pinned_page))
-					goto out;
+				break;
 
-			goto retry_once;
+			if (pin_user_pages_fast((unsigned long)self,
+						1, 0, &pinned_page) != 1)
+				break;
+
+			continue;
 		}
 
 		if (pinned_page) {
@@ -357,10 +372,8 @@ retry_once:
 		}
 
 		ret = 0;
-		if (!umcg_should_idle(umcg_state)) {
-			set_current_state(TASK_RUNNING);
-			goto out;
-		}
+		if (!umcg_should_idle(umcg_state))
+			break;
 
 		if (abs_timeout)
 			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
@@ -368,7 +381,6 @@ retry_once:
 		if (!abs_timeout || timeout.task)
 			freezable_schedule();
 
-		__set_current_state(TASK_RUNNING);
 
 		/*
 		 * Check for timeout before checking the state, as workers
@@ -377,27 +389,26 @@ retry_once:
 		 */
 		ret = -ETIMEDOUT;
 		if (abs_timeout && !timeout.task)
-			goto out;
+			break;
 
-		/* Order set_current_state above with get_user below. */
-		smp_mb();
 		ret = -EFAULT;
 		if (get_user(umcg_state, &self->state_ts))
-			goto out;
+			break;
 
 		ret = 0;
 		if (!umcg_should_idle(umcg_state))
-			goto out;
+			break;
 
 		ret = -EINTR;
 		if (fatal_signal_pending(current))
-			goto out;
+			break;
 
+		// XXX this *cannot* be right, a process can lose signals this way.
 		if (signal_pending(current))
 			flush_signals(current);
 	}
+	__set_current_state(TASK_RUNNING);
 
-out:
 	if (pinned_page) {
 		unpin_user_page(pinned_page);
 		pinned_page = NULL;
@@ -428,10 +439,7 @@ static bool umcg_wakeup_allowed(struct t
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
-	if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
-		return true;
-
-	return false;
+	return tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task);
 }
 
 /*
@@ -459,26 +467,6 @@ static int umcg_ttwu(u32 next_tid, int w
 	return 0;
 }
 
-/*
- * At the moment, umcg_do_context_switch simply wakes up @next with
- * WF_CURRENT_CPU and puts the current task to sleep.
- *
- * In the future an optimization will be added to adjust runtime accounting
- * so that from the kernel scheduling perspective the two tasks are
- * essentially treated as one. In addition, the context switch may be performed
- * right here on the fast path, instead of going through the wake/wait pair.
- */
-static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
-{
-	int ret;
-
-	ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
-	if (ret)
-		return ret;
-
-	return umcg_idle_loop(abs_timeout);
-}
-
 /**
  * sys_umcg_wait: put the current task to sleep and/or wake another task.
  * @flags:        zero or a value from enum umcg_wait_flag.
@@ -509,6 +497,7 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 {
 	struct umcg_task __user *self = current->umcg_task;
 	u32 next_tid;
+	int ret;
 
 	if (!self)
 		return -EINVAL;
@@ -535,14 +524,17 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 		if (get_user(umcg_state, &self->state_ts))
 			return -EFAULT;
 
-		if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
-					&self->state_ts, &umcg_state,
-					umcg_state & ~UMCG_TF_LOCKED, true))
+		if ((umcg_state & UMCG_TF_LOCKED) &&
+		    umcg_update_state(&self->state_ts, &umcg_state,
+				      umcg_state & ~UMCG_TF_LOCKED))
 			return -EFAULT;
 	}
 
-	if (next_tid)
-		return umcg_do_context_switch(next_tid, abs_timeout);
+	if (next_tid) {
+		ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
+		if (ret)
+			return ret;
+	}
 
 	return umcg_idle_loop(abs_timeout);
 }
@@ -581,9 +573,10 @@ static int umcg_wake_idle_server_nofault
 	if ((state & UMCG_TASK_STATE_MASK) != UMCG_TASK_IDLE)
 		goto out_rcu;
 
+	pagefault_disable();
 	ret = umcg_update_state(&ut_server->state_ts, &state,
-			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
-			false);
+				(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING);
+	pagefault_enable();
 
 	if (ret)
 		goto out_rcu;
@@ -621,8 +614,7 @@ static int umcg_wake_idle_server_may_fau
 		return -EAGAIN;
 
 	ret = umcg_update_state(&ut_server->state_ts, &state,
-			(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING,
-			true);
+				(state & ~UMCG_TASK_STATE_MASK) | UMCG_TASK_RUNNING);
 	if (ret)
 		return ret;
 
@@ -690,6 +682,11 @@ static void process_sleeping_worker(stru
 	 *
 	 * See Documentation/userspace-api/umcg.txt for details.
 	 */
+
+	// XXX this seems like a super gross hack, please explain more.
+	// XXX ideally we kill this LOCKED bit entirely, that just smells
+	// XXX worse than fish gone bad.
+
 retry_once:
 	if (curr_state & UMCG_TF_LOCKED)
 		return;
@@ -701,7 +698,9 @@ retry_once:
 	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
 	next_state |= UMCG_TASK_BLOCKED;
 
-	ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
+	pagefault_disable();
+	ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state);
+	pagefault_enable();
 	if (ret == -EAGAIN) {
 		if (retried)
 			goto die;
@@ -712,6 +711,8 @@ retry_once:
 	if (ret)
 		goto die;
 
+	// XXX write a real ordering comment, see ttwu() for examples
+	// XXX idem for all other barriers in this file.
 	smp_mb();  /* Order state read/write above and getting next_tid below. */
 	if (get_user_nofault(tid, &ut_worker->next_tid))
 		goto die;
@@ -739,7 +740,6 @@ void umcg_wq_worker_sleeping(struct task
 
 	if (server_tid) {
 		int ret = umcg_wake_idle_server_nofault(server_tid);
-
 		if (ret && ret != -EAGAIN)
 			goto die;
 	}
@@ -777,14 +777,25 @@ static bool enqueue_idle_worker(struct u
 		return false;
 
 	/* Make the head point to the worker. */
-	if (xchg_user_64(head_ptr, &first))
+	if (!user_access_begin(head_ptr, sizeof(*head_ptr)))
 		return false;
 
+	first = __xchg_user(head_ptr, (u64)node, Efault);
+	user_access_end();
+
+	// XXX vCPU goes on a holiday here and userspace is left
+	// XXX with a broken list, cmpxchg based list-add is safer
+	// XXX that way
+
 	/* Make the worker point to the previous head. */
 	if (put_user(first, node))
 		return false;
 
 	return true;
+
+Efault:
+	user_access_end();
+	return false;
 }
 
 /**
@@ -807,11 +818,18 @@ static bool get_idle_server(struct umcg_
 		return false;
 
 	tid = 0;
-	if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
+	if (!user_access_begin((u32 __user *)server_tid_ptr, sizeof(u32)))
 		return false;
 
+	tid = __xchg_user((u32 *)server_tid_ptr, 0, Efault);
+	user_access_end();
+
 	*server_tid = tid;
 	return true;
+
+Efault:
+	user_access_end();
+	return false;
 }
 
 /*
@@ -870,9 +888,10 @@ static bool process_waking_worker(struct
 		 * PREEMPTED.
 		 */
 	} else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
-			(curr_state & UMCG_TF_LOCKED)))
+			    (curr_state & UMCG_TF_LOCKED))) {
 		/* The worker prepares to sleep or to unregister. */
 		return false;
+	}
 
 	if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
 		goto die;
@@ -880,8 +899,7 @@ static bool process_waking_worker(struct
 	next_state = curr_state & ~UMCG_TASK_STATE_MASK;
 	next_state |= UMCG_TASK_IDLE;
 
-	if (umcg_update_state(&ut_worker->state_ts, &curr_state,
-			next_state, true))
+	if (umcg_update_state(&ut_worker->state_ts, &curr_state, next_state))
 		goto die;
 
 	if (!enqueue_idle_worker(ut_worker))
@@ -905,18 +923,53 @@ die:
  */
 void umcg_wq_worker_running(struct task_struct *tsk)
 {
-	set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+	// XXX this cannot be right, userspace needs to know we're blocked
+	// XXX also, this was exactly what we had those pins for!
+
+	add self to blocked list();
+       	change state();
+	possibly wake next_tid();
+
+	umcg_unpin_pages();
+
+	// and then we go sleep.... the umcg_sys_exit() handler will then
+	// notify userspace we've woken up again and, if available, kick some
+	// idle thread to pick us up.
 }
 
-/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
-void umcg_handle_resuming_worker(void)
+void umcg_sys_enter(struct pt_regs *regs)
 {
 	u32 server_tid;
 
 	/* Avoid recursion by removing PF_UMCG_WORKER */
 	current->flags &= ~PF_UMCG_WORKER;
 
-	do {
+	// XXX wth did umcg_task::server_tid go?
+
+	if (!server_tid)
+		umcg_unpin_pages();
+	else if (umcg_pin_pages(server_tid))
+		goto die;
+
+	goto out;
+
+die:
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+out:
+	current->flags |= PF_UMCG_WORKER;
+}
+
+void umcg_sys_exit(struct pt_regs *regs)
+{
+	u32 server_tid;
+
+	umcg_unpin_pages();
+
+	/* Avoid recursion by removing PF_UMCG_WORKER */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	for (;;) {
 		bool should_wait;
 
 		should_wait = process_waking_worker(current, &server_tid);
@@ -931,13 +984,7 @@ void umcg_handle_resuming_worker(void)
 		}
 
 		umcg_idle_loop(0);
-	} while (true);
-
-	if (!server_tid)
-		/* No server => no reason to pin pages. */
-		umcg_unpin_pages();
-	else if (umcg_pin_pages(server_tid))
-		goto die;
+	}
 
 	goto out;
 
Index: linux-2.6/mm/maccess.c
===================================================================
--- linux-2.6.orig/mm/maccess.c
+++ linux-2.6/mm/maccess.c
@@ -335,267 +335,3 @@ long strnlen_user_nofault(const void __u
 
 	return ret;
 }
-
-#ifdef ARCH_HAS_ATOMIC_UACCESS_HELPERS
-
-static int fix_pagefault(unsigned long uaddr, bool write_fault, int bytes)
-{
-	struct mm_struct *mm = current->mm;
-	int ret;
-
-	mmap_read_lock(mm);
-	ret = fixup_user_fault(mm, uaddr, write_fault ? FAULT_FLAG_WRITE : 0,
-			NULL);
-	mmap_read_unlock(mm);
-
-	return ret < 0 ? ret : 0;
-}
-
-int cmpxchg_user_32_nofault(u32 __user *uaddr, u32 *curr_val, u32 new_val)
-{
-	int ret = -EFAULT;
-	u32 __old = *curr_val;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
-		pagefault_enable();
-		return -EFAULT;
-	}
-	ret = __try_cmpxchg_user_32(curr_val, uaddr, __old, new_val);
-	user_access_end();
-
-	if (!ret)
-		ret =  *curr_val == __old ? 0 : -EAGAIN;
-
-	pagefault_enable();
-	return ret;
-}
-
-int cmpxchg_user_64_nofault(u64 __user *uaddr, u64 *curr_val, u64 new_val)
-{
-	int ret = -EFAULT;
-	u64 __old = *curr_val;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
-		pagefault_enable();
-		return -EFAULT;
-	}
-	ret = __try_cmpxchg_user_64(curr_val, uaddr, __old, new_val);
-	user_access_end();
-
-	if (!ret)
-		ret =  *curr_val == __old ? 0 : -EAGAIN;
-
-	pagefault_enable();
-
-	return ret;
-}
-
-int cmpxchg_user_32(u32 __user *uaddr, u32 *curr_val, u32 new_val)
-{
-	int ret = -EFAULT;
-	u32 __old = *curr_val;
-
-	/* Validate proper alignment. */
-	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
-			((unsigned long)curr_val % sizeof(*curr_val))))
-		return -EINVAL;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	while (true) {
-		ret = -EFAULT;
-		if (!user_access_begin(uaddr, sizeof(*uaddr)))
-			break;
-
-		ret = __try_cmpxchg_user_32(curr_val, uaddr, __old, new_val);
-		user_access_end();
-
-		if (!ret) {
-			ret =  *curr_val == __old ? 0 : -EAGAIN;
-			break;
-		}
-
-		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
-			break;
-	}
-
-	pagefault_enable();
-	return ret;
-}
-
-int cmpxchg_user_64(u64 __user *uaddr, u64 *curr_val, u64 new_val)
-{
-	int ret = -EFAULT;
-	u64 __old = *curr_val;
-
-	/* Validate proper alignment. */
-	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
-			((unsigned long)curr_val % sizeof(*curr_val))))
-		return -EINVAL;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	while (true) {
-		ret = -EFAULT;
-		if (!user_access_begin(uaddr, sizeof(*uaddr)))
-			break;
-
-		ret = __try_cmpxchg_user_64(curr_val, uaddr, __old, new_val);
-		user_access_end();
-
-		if (!ret) {
-			ret =  *curr_val == __old ? 0 : -EAGAIN;
-			break;
-		}
-
-		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
-			break;
-	}
-
-	pagefault_enable();
-	return ret;
-}
-
-/**
- * xchg_user_[32|64][_nofault|]() - exchange 32/64-bit values
- * @uaddr:   Destination address, in user space;
- * @val:     Source address, in kernel space.
- *
- * This is the standard atomic xchg: exchange values pointed to by @uaddr and @val.
- *
- * The _nofault versions don't fault and can be used in
- * atomic/preempt-disabled contexts.
- *
- * Return:
- * 0      : OK/success;
- * -EINVAL: @uaddr is not properly aligned ('may fault' versions only);
- * -EFAULT: memory access error (including mis-aligned @uaddr in _nofault).
- */
-int xchg_user_32_nofault(u32 __user *uaddr, u32 *val)
-{
-	int ret;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
-		pagefault_enable();
-		return -EFAULT;
-	}
-
-	ret = __try_xchg_user_32(val, uaddr, *val);
-	user_access_end();
-
-	pagefault_enable();
-
-	return ret;
-}
-
-int xchg_user_64_nofault(u64 __user *uaddr, u64 *val)
-{
-	int ret;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	if (!user_access_begin(uaddr, sizeof(*uaddr))) {
-		pagefault_enable();
-		return -EFAULT;
-	}
-
-	ret = __try_xchg_user_64(val, uaddr, *val);
-	user_access_end();
-
-	pagefault_enable();
-
-	return ret;
-}
-
-int xchg_user_32(u32 __user *uaddr, u32 *val)
-{
-	int ret = -EFAULT;
-
-	/* Validate proper alignment. */
-	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
-			((unsigned long)val % sizeof(*val))))
-		return -EINVAL;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	while (true) {
-		ret = -EFAULT;
-		if (!user_access_begin(uaddr, sizeof(*uaddr)))
-			break;
-
-		ret = __try_xchg_user_32(val, uaddr, *val);
-		user_access_end();
-
-		if (!ret)
-			break;
-
-		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
-			break;
-	}
-
-	pagefault_enable();
-
-	return ret;
-}
-
-int xchg_user_64(u64 __user *uaddr, u64 *val)
-{
-	int ret = -EFAULT;
-
-	/* Validate proper alignment. */
-	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
-			((unsigned long)val % sizeof(*val))))
-		return -EINVAL;
-
-	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
-		return -EFAULT;
-
-	pagefault_disable();
-
-	while (true) {
-		ret = -EFAULT;
-		if (!user_access_begin(uaddr, sizeof(*uaddr)))
-			break;
-
-		ret = __try_xchg_user_64(val, uaddr, *val);
-		user_access_end();
-
-		if (!ret)
-			break;
-
-		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
-			break;
-	}
-
-	pagefault_enable();
-
-	return ret;
-}
-#endif		/* ARCH_HAS_ATOMIC_UACCESS_HELPERS */
Index: linux-2.6/include/linux/entry-common.h
===================================================================
--- linux-2.6.orig/include/linux/entry-common.h
+++ linux-2.6/include/linux/entry-common.h
@@ -42,11 +42,13 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
Index: linux-2.6/include/linux/thread_info.h
===================================================================
--- linux-2.6.orig/include/linux/thread_info.h
+++ linux-2.6/include/linux/thread_info.h
@@ -46,6 +46,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_UMCG,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +56,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_UMCG	BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
 #endif
 
 #include <asm/thread_info.h>
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -2303,9 +2303,10 @@ static inline void rseq_execve(struct ta
 
 #ifdef CONFIG_UMCG
 
-void umcg_handle_resuming_worker(void);
-void umcg_handle_exiting_worker(void);
-void umcg_clear_child(struct task_struct *tsk);
+extern void umcg_sys_enter(struct pt_regs *regs);
+extern void umcg_sys_exit(struct pt_regs *regs);
+extern void umcg_handle_exiting_worker(void);
+extern void umcg_clear_child(struct task_struct *tsk);
 
 /* Called by bprm_execve() in fs/exec.c. */
 static inline void umcg_execve(struct task_struct *tsk)
@@ -2314,13 +2315,6 @@ static inline void umcg_execve(struct ta
 		umcg_clear_child(tsk);
 }
 
-/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
-static inline void umcg_handle_notify_resume(void)
-{
-	if (current->flags & PF_UMCG_WORKER)
-		umcg_handle_resuming_worker();
-}
-
 /* Called by do_exit() in kernel/exit.c. */
 static inline void umcg_handle_exit(void)
 {
@@ -2332,18 +2326,23 @@ static inline void umcg_handle_exit(void
  * umcg_wq_worker_[sleeping|running] are called in core.c by
  * sched_submit_work() and sched_update_worker().
  */
-void umcg_wq_worker_sleeping(struct task_struct *tsk);
-void umcg_wq_worker_running(struct task_struct *tsk);
+extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
+extern void umcg_wq_worker_running(struct task_struct *tsk);
 
 #else  /* CONFIG_UMCG */
 
-static inline void umcg_clear_child(struct task_struct *tsk)
+static inline void umcg_sys_enter(struct pt_regs *regs)
 {
 }
-static inline void umcg_execve(struct task_struct *tsk)
+
+static inline void umcg_sys_exit(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_clear_child(struct task_struct *tsk)
 {
 }
-static inline void umcg_handle_notify_resume(void)
+static inline void umcg_execve(struct task_struct *tsk)
 {
 }
 static inline void umcg_handle_exit(void)
Index: linux-2.6/kernel/entry/common.c
===================================================================
--- linux-2.6.orig/kernel/entry/common.c
+++ linux-2.6/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 
 #include "common.h"
 
@@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
 	if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, syscall);
 
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_enter(regs);
+
 	syscall_enter_audit(regs, syscall);
 
 	return ret ? : syscall;
@@ -171,10 +175,8 @@ static unsigned long exit_to_user_mode_l
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);
 
-		if (ti_work & _TIF_NOTIFY_RESUME) {
-			umcg_handle_notify_resume();
+		if (ti_work & _TIF_NOTIFY_RESUME)
 			tracehook_notify_resume(regs);
-		}
 
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
@@ -255,6 +257,9 @@ static void syscall_exit_work(struct pt_
 	step = report_single_step(work);
 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
 		arch_syscall_exit_tracehook(regs, step);
+
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_exit(regs);
 }
 
 /*
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -248,6 +248,7 @@ config X86
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UMCG			if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_SMT			if SMP
Index: linux-2.6/arch/x86/include/asm/uaccess.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/uaccess.h
+++ linux-2.6/arch/x86/include/asm/uaccess.h
@@ -341,6 +341,37 @@ do {									\
 		     : [umem] "m" (__m(addr))				\
 		     : : label)
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm_volatile_goto("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     _ASM_EXTABLE_UA(1b, %l[label])			\
+		     : CC_OUT(z) (success),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc"					\
+		     : label);						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
+
+#define __xchg_user_asm(itype, _ptr, _val, label)	({		\
+	__typeof__(*(_ptr)) __ret = (_val);				\
+	asm_volatile_goto("\n"						\
+			"1: " LOCK_PREFIX "xchg"itype" %[var], %[ptr]\n"\
+			_ASM_EXTABLE_UA(1b, %l[label])			\
+			: [var] "+r" (__ret),				\
+			  [ptr] "+m" (*(_ptr))				\
+			:						\
+			: "memory", "cc"				\
+			: label);					\
+	__ret;						})
+
 #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 #ifdef CONFIG_X86_32
@@ -411,8 +442,83 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       [efault] "i" (-EFAULT), "0" (err))
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	int __err = 0;							\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     CC_SET(z)						\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : CC_OUT(z) (success),				\
+		       [errout] "+r" (__err),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new),				\
+		       [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
+#define __xchg_user_asm(itype, _ptr, _val, label)	({		\
+	int __err = 0;							\
+	__typeof__(*(_ptr)) __ret = (_val);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "xchg"itype" %[var], %[ptr]\n"	\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : [ptr] "+m" (*(_ptr)),				\
+		       [var] "+r" (__ret),				\
+		       [errout] "+r" (__err)				\
+		     : [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	__ret;						})
+
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
+extern void __try_cmpxchg_user_wrong_size(void);
+extern void __xchg_user_wrong_size(void);
+
+#define __try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4:	__ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	case 8:	__ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	default: __try_cmpxchg_user_wrong_size();			\
+	}								\
+	__ret;						})
+
+#define __xchg_user(_ptr, _nval, _label)		({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4: __ret = __xchg_user_asm("l", (_ptr), (_nval), _label);	\
+		break;							\
+	case 8: __ret = __xchg_user_asm("q", (_ptr), (_nval), _label);	\
+		break;							\
+	default: __xchg_user_wrong_size();				\
+	}								\
+	__ret;						})
+
 /* FIXME: this hack is definitely wrong -AK */
 struct __large_struct { unsigned long buf[100]; };
 #define __m(x) (*(struct __large_struct __user *)(x))
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -1693,9 +1693,14 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config HAVE_UMCG
+	def_bool n
+
 config UMCG
 	bool "Enable User Managed Concurrency Groups API"
-	depends on X86_64
+	depends on 64BIT
+	depends on GENERIC_ENTRY
+	depends on HAVE_UMCG
 	default n
 	help
 	  Enable User Managed Concurrency Groups API, which form the basis

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 2/6] mm, x86/uaccess: add userspace atomic helpers
  2021-11-22 21:13 ` [PATCH v0.9.1 2/6] mm, x86/uaccess: add userspace atomic helpers Peter Oskolkov
@ 2021-11-24 14:31   ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 14:31 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:23PM -0800, Peter Oskolkov wrote:

> +static inline int __try_cmpxchg_user_32(u32 *uval, u32 __user *uaddr,
> +						u32 oldval, u32 newval)

That's a 'try_cmpxchg' function but *NOT* the try_cmpxchg semantics;
please don't do that. Fixed in the below.
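
For the record, the two calling conventions look like this, using the
plain in-kernel cmpxchg()/try_cmpxchg() on a kernel u64 'v' (the names
'v', 'old', 'new' and 'cur' are invented for illustration; none of this
is from the patch):

	u64 old, new;

	/* cmpxchg(): returns the value it observed; the caller compares. */
	old = READ_ONCE(v);
	for (;;) {
		u64 cur = cmpxchg(&v, old, old + 1);
		if (cur == old)
			break;		/* swap happened */
		old = cur;		/* retry with the value we saw */
	}

	/* try_cmpxchg(): returns bool, updates 'old' in place on failure. */
	old = READ_ONCE(v);
	do {
		new = old + 1;
	} while (!try_cmpxchg(&v, &old, new));

That is, a 'try' helper is expected to report success/failure and hand
the observed value back through the pointer, not return only an error
code.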

> +int cmpxchg_user_64(u64 __user *uaddr, u64 *curr_val, u64 new_val)
> +{
> +	int ret = -EFAULT;
> +	u64 __old = *curr_val;
> +
> +	/* Validate proper alignment. */
> +	if (unlikely(((unsigned long)uaddr % sizeof(*uaddr)) ||
> +			((unsigned long)curr_val % sizeof(*curr_val))))
> +		return -EINVAL;
> +
> +	if (unlikely(!access_ok(uaddr, sizeof(*uaddr))))
> +		return -EFAULT;
> +
> +	pagefault_disable();
> +
> +	while (true) {
> +		ret = -EFAULT;
> +		if (!user_access_begin(uaddr, sizeof(*uaddr)))
> +			break;
> +
> +		ret = __try_cmpxchg_user_64(curr_val, uaddr, __old, new_val);
> +		user_access_end();
> +
> +		if (!ret) {
> +			ret =  *curr_val == __old ? 0 : -EAGAIN;
> +			break;
> +		}
> +
> +		if (fix_pagefault((unsigned long)uaddr, true, sizeof(*uaddr)) < 0)
> +			break;
> +	}
> +
> +	pagefault_enable();
> +	return ret;
> +}

Please, just read what you wrote. This scored *really* high on the
WTF'o'meter.

That is aside of:

 - that user_access_begin() includes access_ok().
 - the fact that having SMAP *inside* a cmpxchg loop is ridiculous.
 - that you write cmpxchg inside a loop, but it isn't actually a cmpxchg-loop.

No the real problem is:

 - you *DISABLE* pagefaults
 - you force the exception handler
 - you manually fix up the fault

while you could've just done the op and let the fault handler do its
thing, that whole function is pointless.



So as a penance for not having looked at this before I wrote you the
replacement. The asm-goto-output variant isn't actually compile tested,
but the old complicated thing is. Also, I'm >.< close to merging the
series that kills .fixup for x86, but the fixup (pun intended) should be
trivial.

Usage can be gleaned from the bigger patch I sent you in reply to 0/, but
TL;DR:

	if (!user_access_begin(uptr, sizeof(u64)))
		return -EFAULT;

	unsafe_get_user(old, uptr, Efault);
	do {
		new = func(old);
	} while (!__try_cmpxchg_user(uptr, &old, new, Efault));

	user_access_end();

	return 0;

Efault:
	user_access_end();
	return -EFAULT;


Then if called within pagefault_disable(), it'll return -EFAULT more
readily; if called without it, it'll just take the fault and try to fix
it up if at all possible.
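
IOW a nofault variant is just the same pattern bracketed by
pagefault_disable()/pagefault_enable(), something like the below
(sketch only, untested; it reuses the uptr/old/new names and the
Efault label from the TL;DR above):

	int ret;

	pagefault_disable();

	if (!user_access_begin(uptr, sizeof(u64))) {
		pagefault_enable();
		return -EFAULT;
	}

	/* one attempt only; -EAGAIN if the value changed under us */
	if (__try_cmpxchg_user(uptr, &old, new, Efault))
		ret = 0;
	else
		ret = -EAGAIN;

	user_access_end();
	pagefault_enable();
	return ret;

Efault:
	user_access_end();
	pagefault_enable();
	return -EFAULT;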


---
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 33a68407def3..909c48083c4f 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -341,6 +341,37 @@ do {									\
 		     : [umem] "m" (__m(addr))				\
 		     : : label)
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm_volatile_goto("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     _ASM_EXTABLE_UA(1b, %l[label])			\
+		     : CC_OUT(z) (success),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc"					\
+		     : label);						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
+
+#define __xchg_user_asm(itype, _ptr, _val, label)	({		\
+	__typeof__(*(_ptr)) __ret = (_val);				\
+	asm_volatile_goto("\n"						\
+			"1: " LOCK_PREFIX "xchg"itype" %[var], %[ptr]\n"\
+			_ASM_EXTABLE_UA(1b, %l[label])			\
+			: [var] "+r" (__ret),				\
+			  [ptr] "+m" (*(_ptr))				\
+			:						\
+			: "memory", "cc"				\
+			: label);					\
+	__ret;						})
+
 #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 #ifdef CONFIG_X86_32
@@ -411,8 +442,83 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       [efault] "i" (-EFAULT), "0" (err))
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	int __err = 0;							\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     CC_SET(z)						\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : CC_OUT(z) (success),				\
+		       [errout] "+r" (__err),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new),				\
+		       [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
+#define __xchg_user_asm(itype, _ptr, _val, label)	({		\
+	int __err = 0;							\
+	__typeof__(*(_ptr)) __ret = (_val);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "xchg"itype" %[var], %[ptr]\n"	\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : [ptr] "+m" (*(_ptr)),				\
+		       [var] "+r" (__ret),				\
+		       [errout] "+r" (__err)				\
+		     : [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	__ret;						})
+
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
+extern void __try_cmpxchg_user_wrong_size(void);
+extern void __xchg_user_wrong_size(void);
+
+#define __try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4:	__ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	case 8:	__ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	default: __try_cmpxchg_user_wrong_size();			\
+	}								\
+	__ret;						})
+
+#define __xchg_user(_ptr, _nval, _label)		({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4: __ret = __xchg_user_asm("l", (_ptr), (_nval), _label);	\
+		break;							\
+	case 8: __ret = __xchg_user_asm("q", (_ptr), (_nval), _label);	\
+		break;							\
+	default: __xchg_user_wrong_size();				\
+	}								\
+	__ret;						})
+
 /* FIXME: this hack is definitely wrong -AK */
 struct __large_struct { unsigned long buf[100]; };
 #define __m(x) (*(struct __large_struct __user *)(x))

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups
  2021-11-24 14:06 ` [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Zijlstra
@ 2021-11-24 16:28   ` Peter Oskolkov
  2021-11-24 17:20     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-24 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Wed, Nov 24, 2021 at 6:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 22, 2021 at 01:13:21PM -0800, Peter Oskolkov wrote:
> > User Managed Concurrency Groups (UMCG) is an M:N threading
> > subsystem/toolkit that lets user space application developers implement
> > in-process user space schedulers.
> >
> > This v0.9.1 patchset is the same as v0.9, where u32/u64 in
> > uapi/linux/umcg.h are replaced with __u32/__u64, as test robot/lkp
> > does not recognize u32/u64 for some reason.
> >
> > v0.9 is v0.8 rebased on top of the current tip/sched/core,
> > with a fix in umcg_update_state of an issue reported by Tao Zhou.
> >
> > Key changes from patchset v0.7:
> > https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/:
> >
> > - added libumcg tools/lib/umcg;
> > - worker "wakeup" is reworked so that it is now purely a userspace op,
> >   instead of waking the thread in order for it to block on return
> >   to the userspace immediately;
> > - a couple of minor fixes and refactorings.
> >
> > These big things remain to be addressed (in no particular order):
> > - support tracing/debugging
> > - make context switches faster (see umcg_do_context_switch in umcg.c)
> > - support other architectures
> > - cleanup and post selftests in tools/testing/selftests/umcg/
> > - allow cross-mm wakeups (securely)
>
> *groan*... so these patches do *NOT* support the very thing this all
> started with, namely block + wakeup notifications. I'm really not sure
> how that happened, as that was the sole purpose of the exercise.

I'm not sure why you say this - in-process block/wakeup is very much
supported - please see the third patch. Cross-process (cross-mm)
wakeups are not supported at the moment, as the security story has to
be fleshed out.

>
> Aside of that, the whole uaccess stuff is horrific :-( I'll reply to
> that email separately, but the alternative is also included in the
> random hackery below.

Thanks - I'll try to make uaccess more to your liking, unless you say
the whole thing is a no-go.

>
> I'm still trying to make sense of it all, but I'm really not seeing how
> any of this satisfies the initial goals, also it is once again 100% new
> code :/

I believe the initial goals of in-process block/wakeup detection,
on-cpu context switching, etc. are all achieved here. Re: new code:
the code in the third patch evolved into what it is today based on
feedback/discussions in this list.

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups
  2021-11-24 16:28   ` Peter Oskolkov
@ 2021-11-24 17:20     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 17:20 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Wed, Nov 24, 2021 at 08:28:43AM -0800, Peter Oskolkov wrote:
> On Wed, Nov 24, 2021 at 6:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 22, 2021 at 01:13:21PM -0800, Peter Oskolkov wrote:
> > > User Managed Concurrency Groups (UMCG) is an M:N threading
> > > subsystem/toolkit that lets user space application developers implement
> > > in-process user space schedulers.
> > >
> > > This v0.9.1 patchset is the same as v0.9, where u32/u64 in
> > > uapi/linux/umcg.h are replaced with __u32/__u64, as test robot/lkp
> > > does not recognize u32/u64 for some reason.
> > >
> > > v0.9 is v0.8 rebased on top of the current tip/sched/core,
> > > with a fix in umcg_update_state of an issue reported by Tao Zhou.
> > >
> > > Key changes from patchset v0.7:
> > > https://lore.kernel.org/all/20211012232522.714898-1-posk@google.com/:
> > >
> > > - added libumcg tools/lib/umcg;
> > > - worker "wakeup" is reworked so that it is now purely a userspace op,
> > >   instead of waking the thread in order for it to block on return
> > >   to the userspace immediately;
> > > - a couple of minor fixes and refactorings.
> > >
> > > These big things remain to be addressed (in no particular order):
> > > - support tracing/debugging
> > > - make context switches faster (see umcg_do_context_switch in umcg.c)
> > > - support other architectures
> > > - cleanup and post selftests in tools/testing/selftests/umcg/
> > > - allow cross-mm wakeups (securely)
> >
> > *groan*... so these patches do *NOT* support the very thing this all
> > started with, namely block + wakeup notifications. I'm really not sure
> > how that happened, as that was the sole purpose of the exercise.
> 
> I'm not sure why you say this - in-process block/wakeup is very much
> supported - please see the third patch. Cross-process (cross-mm)
> wakeups are not supported at the moment, as the security story has to
> be fleshed out.

I seem to have gotten submit and update work confused. I'll go stare
more. For some reason I find it very hard to read this stuff.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
@ 2021-11-24 18:36   ` kernel test robot
  2021-11-24 20:08   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2021-11-24 18:36 UTC (permalink / raw)
  To: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, linux-kernel, linux-api
  Cc: llvm, kbuild-all, Linux Memory Management List, Paul Turner

Hi Peter,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf]

url:    https://github.com/0day-ci/linux/commits/Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211123-051525
base:   cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf
config: arm64-randconfig-r031-20211124 (https://download.01.org/0day-ci/archive/20211125/202111250209.9dBNZjdP-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 67a1c45def8a75061203461ab0060c75c864df1c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/942655474fa2cd59ea3d11a1cc03775dd79a508e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Peter-Oskolkov/sched-mm-x86-uaccess-implement-User-Managed-Concurrency-Groups/20211123-051525
        git checkout 942655474fa2cd59ea3d11a1cc03775dd79a508e
        # save the config file to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:34:1: note: expanded from here
   __arm64_sys_recvmsg
   ^
   kernel/sys_ni.c:257:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:263:1: warning: no previous prototype for function '__arm64_sys_mremap' [-Wmissing-prototypes]
   COND_SYSCALL(mremap);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:39:1: note: expanded from here
   __arm64_sys_mremap
   ^
   kernel/sys_ni.c:263:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:266:1: warning: no previous prototype for function '__arm64_sys_add_key' [-Wmissing-prototypes]
   COND_SYSCALL(add_key);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:40:1: note: expanded from here
   __arm64_sys_add_key
   ^
   kernel/sys_ni.c:266:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:267:1: warning: no previous prototype for function '__arm64_sys_request_key' [-Wmissing-prototypes]
   COND_SYSCALL(request_key);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:41:1: note: expanded from here
   __arm64_sys_request_key
   ^
   kernel/sys_ni.c:267:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:268:1: warning: no previous prototype for function '__arm64_sys_keyctl' [-Wmissing-prototypes]
   COND_SYSCALL(keyctl);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:42:1: note: expanded from here
   __arm64_sys_keyctl
   ^
   kernel/sys_ni.c:268:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:272:1: warning: no previous prototype for function '__arm64_sys_landlock_create_ruleset' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_create_ruleset);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:47:1: note: expanded from here
   __arm64_sys_landlock_create_ruleset
   ^
   kernel/sys_ni.c:272:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:273:1: warning: no previous prototype for function '__arm64_sys_landlock_add_rule' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_add_rule);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:48:1: note: expanded from here
   __arm64_sys_landlock_add_rule
   ^
   kernel/sys_ni.c:273:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:274:1: warning: no previous prototype for function '__arm64_sys_landlock_restrict_self' [-Wmissing-prototypes]
   COND_SYSCALL(landlock_restrict_self);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:49:1: note: expanded from here
   __arm64_sys_landlock_restrict_self
   ^
   kernel/sys_ni.c:274:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
>> kernel/sys_ni.c:277:1: warning: no previous prototype for function '__arm64_sys_umcg_ctl' [-Wmissing-prototypes]
   COND_SYSCALL(umcg_ctl);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:50:1: note: expanded from here
   __arm64_sys_umcg_ctl
   ^
   kernel/sys_ni.c:277:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
>> kernel/sys_ni.c:278:1: warning: no previous prototype for function '__arm64_sys_umcg_wait' [-Wmissing-prototypes]
   COND_SYSCALL(umcg_wait);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:51:1: note: expanded from here
   __arm64_sys_umcg_wait
   ^
   kernel/sys_ni.c:278:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:283:1: warning: no previous prototype for function '__arm64_sys_fadvise64_64' [-Wmissing-prototypes]
   COND_SYSCALL(fadvise64_64);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:52:1: note: expanded from here
   __arm64_sys_fadvise64_64
   ^
   kernel/sys_ni.c:283:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:286:1: warning: no previous prototype for function '__arm64_sys_swapon' [-Wmissing-prototypes]
   COND_SYSCALL(swapon);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:53:1: note: expanded from here
   __arm64_sys_swapon
   ^
   kernel/sys_ni.c:286:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:287:1: warning: no previous prototype for function '__arm64_sys_swapoff' [-Wmissing-prototypes]
   COND_SYSCALL(swapoff);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:54:1: note: expanded from here
   __arm64_sys_swapoff
   ^
   kernel/sys_ni.c:287:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:288:1: warning: no previous prototype for function '__arm64_sys_mprotect' [-Wmissing-prototypes]
   COND_SYSCALL(mprotect);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:55:1: note: expanded from here
   __arm64_sys_mprotect
   ^
   kernel/sys_ni.c:288:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:289:1: warning: no previous prototype for function '__arm64_sys_msync' [-Wmissing-prototypes]
   COND_SYSCALL(msync);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:56:1: note: expanded from here
   __arm64_sys_msync
   ^
   kernel/sys_ni.c:289:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:290:1: warning: no previous prototype for function '__arm64_sys_mlock' [-Wmissing-prototypes]
   COND_SYSCALL(mlock);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:57:1: note: expanded from here
   __arm64_sys_mlock
   ^
   kernel/sys_ni.c:290:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   arch/arm64/include/asm/syscall_wrapper.h:76:13: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                      ^
   kernel/sys_ni.c:291:1: warning: no previous prototype for function '__arm64_sys_munlock' [-Wmissing-prototypes]
   COND_SYSCALL(munlock);
   ^
   arch/arm64/include/asm/syscall_wrapper.h:76:25: note: expanded from macro 'COND_SYSCALL'
           asmlinkage long __weak __arm64_sys_##name(const struct pt_regs *regs)   \
                                  ^
   <scratch space>:58:1: note: expanded from here
   __arm64_sys_munlock
   ^
   kernel/sys_ni.c:291:1: note: declare 'static' if the function is not intended to be used outside of this translation unit


vim +/__arm64_sys_umcg_ctl +277 kernel/sys_ni.c

   275	
   276	/* kernel/sched/umcg.c */
 > 277	COND_SYSCALL(umcg_ctl);
 > 278	COND_SYSCALL(umcg_wait);
   279	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
  2021-11-24 18:36   ` kernel test robot
@ 2021-11-24 20:08   ` Peter Zijlstra
  2021-11-24 21:32     ` Peter Zijlstra
  2021-11-25 17:28     ` Peter Oskolkov
  2021-11-24 21:19   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 20:08 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> +	/**
> +	 * @state_ts: the current state of the UMCG task described by
> +	 *            this struct, with a unique timestamp indicating
> +	 *            when the last state change happened.
> +	 *
> +	 * Readable/writable by both the kernel and the userspace.
> +	 *
> +	 * UMCG task state:
> +	 *   bits  0 -  5: task state;
> +	 *   bits  6 -  7: state flags;
> +	 *   bits  8 - 12: reserved; must be zeroes;
> +	 *   bits 13 - 17: for userspace use;
> +	 *   bits 18 - 63: timestamp (see below).
> +	 *
> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> +	 * See Documentation/userspace-api/umcg.txt for detals.
> +	 */
> +	__u64	state_ts;		/* r/w */
> +
> +	/**
> +	 * @next_tid: the TID of the UMCG task that should be context-switched
> +	 *            into in sys_umcg_wait(). Can be zero.
> +	 *
> +	 * Running UMCG workers must have next_tid set to point to IDLE
> +	 * UMCG servers.
> +	 *
> +	 * Read-only for the kernel, read/write for the userspace.
> +	 */
> +	__u32	next_tid;		/* r   */
> +
> +	__u32	flags;			/* Reserved; must be zero. */
> +
> +	/**
> +	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> +	 *
> +	 * Readable/writable by both the kernel and the userspace: the
> +	 * kernel adds items to the list, the userspace removes them.
> +	 */
> +	__u64	idle_workers_ptr;	/* r/w */
> +
> +	/**
> +	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> +	 *                       Readonly.
> +	 */
> +	__u64	idle_server_tid_ptr;	/* r   */
> +} __attribute__((packed, aligned(8 * sizeof(__u64))));

The thing is; I really don't see how this is supposed to be used. Where
did the blocked and runnable list go ?

I also don't see why the kernel cares about idle workers at all; that
seems something userspace can sort itself just fine.

The whole next_tid thing seems confused too, how can it be the next task
when it must be the server? Also, what if there isn't an idle server?

This just all isn't making any sense to me.

^ permalink raw reply	[flat|nested] 44+ messages in thread
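
A minimal userspace sketch of how the packed state_ts word quoted above could
be decoded, assuming the bit layout documented in the patch (task state in
bits 0-5, state flags in bits 6-7, 46-bit timestamp in bits 18-63 at 16ns
granularity); the helper names are illustrative and not part of the proposed
uapi:

#include <stdint.h>

#define UMCG_STATE_MASK		0x3fULL	/* bits  0 -  5: task state  */
#define UMCG_FLAGS_SHIFT	6
#define UMCG_FLAGS_MASK		0x3ULL	/* bits  6 -  7: state flags */
#define UMCG_TS_SHIFT		18	/* bits 18 - 63: timestamp   */
#define UMCG_TS_GRANULARITY_NS	16	/* 16ns per timestamp unit   */

static inline uint64_t umcg_task_state(uint64_t state_ts)
{
	return state_ts & UMCG_STATE_MASK;
}

static inline uint64_t umcg_state_flags(uint64_t state_ts)
{
	return (state_ts >> UMCG_FLAGS_SHIFT) & UMCG_FLAGS_MASK;
}

/* CLOCK_MONOTONIC nanoseconds of the last recorded state change. */
static inline uint64_t umcg_state_change_ns(uint64_t state_ts)
{
	return (state_ts >> UMCG_TS_SHIFT) * UMCG_TS_GRANULARITY_NS;
}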

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
  2021-11-24 18:36   ` kernel test robot
  2021-11-24 20:08   ` Peter Zijlstra
@ 2021-11-24 21:19   ` Peter Zijlstra
  2021-11-26 21:11     ` Thomas Gleixner
  2021-11-24 21:41   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 21:19 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:

> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.

> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> +				bool may_fault)
> +{
> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;

I'm still very hesitant to use ktime (fear the HPET); but I suppose it
makes sense to use a time base that's accessible to userspace. Was
MONOTONIC_RAW considered?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-24 20:08   ` Peter Zijlstra
@ 2021-11-24 21:32     ` Peter Zijlstra
  2021-11-25 17:28     ` Peter Oskolkov
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 21:32 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Wed, Nov 24, 2021 at 09:08:23PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> > +/**
> > + * struct umcg_task - controls the state of UMCG tasks.
> > + *
> > + * The struct is aligned at 64 bytes to ensure that it fits into
> > + * a single cache line.
> > + */
> > +struct umcg_task {
> > +	/**
> > +	 * @state_ts: the current state of the UMCG task described by
> > +	 *            this struct, with a unique timestamp indicating
> > +	 *            when the last state change happened.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace.
> > +	 *
> > +	 * UMCG task state:
> > +	 *   bits  0 -  5: task state;
> > +	 *   bits  6 -  7: state flags;
> > +	 *   bits  8 - 12: reserved; must be zeroes;
> > +	 *   bits 13 - 17: for userspace use;
> > +	 *   bits 18 - 63: timestamp (see below).
> > +	 *
> > +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> > +	 * See Documentation/userspace-api/umcg.txt for details.
> > +	 */
> > +	__u64	state_ts;		/* r/w */
> > +
> > +	/**
> > +	 * @next_tid: the TID of the UMCG task that should be context-switched
> > +	 *            into in sys_umcg_wait(). Can be zero.
> > +	 *
> > +	 * Running UMCG workers must have next_tid set to point to IDLE
> > +	 * UMCG servers.
> > +	 *
> > +	 * Read-only for the kernel, read/write for the userspace.
> > +	 */
> > +	__u32	next_tid;		/* r   */
> > +
> > +	__u32	flags;			/* Reserved; must be zero. */
> > +
> > +	/**
> > +	 * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace: the
> > +	 * kernel adds items to the list, the userspace removes them.
> > +	 */
> > +	__u64	idle_workers_ptr;	/* r/w */
> > +
> > +	/**
> > +	 * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> > +	 *                       Readonly.
> > +	 */
> > +	__u64	idle_server_tid_ptr;	/* r   */
> > +} __attribute__((packed, aligned(8 * sizeof(__u64))));
> 
> The thing is; I really don't see how this is supposed to be used. Where
> did the blocked and runnable list go ?
> 
> I also don't see why the kernel cares about idle workers at all; that
> seems something userspace can sort itself just fine.
> 
> The whole next_tid thing seems confused too, how can it be the next task
> when it must be the server? Also, what if there isn't an idle server?
> 
> This just all isn't making any sense to me.

Oooh, someone made things super confusing by doing s/runnable/idle/ on
the whole thing :-( That only took me most of the day to figure out.
Naming is important, don't mess about with stuff like this.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
                     ` (2 preceding siblings ...)
  2021-11-24 21:19   ` Peter Zijlstra
@ 2021-11-24 21:41   ` Peter Zijlstra
  2021-11-24 21:58   ` Peter Zijlstra
  2021-11-24 22:18   ` Peter Zijlstra
  5 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 21:41 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +	while (true) {

(you have 2 inf. loops in umcg and you chose a different expression for each)

> +		u64 umcg_state;
> +
> +		/*
> +		 * We need to read from userspace _after_ the task is marked
> +		 * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
> +		 * but faulting is not allowed; so we try a fast no-fault read,
> +		 * and if it fails, pin the page temporarily.
> +		 */

That comment is misleading! Faulting *is* allowed, but it can scribble
__state. If faulting would not be allowed, you wouldn't be able to call
pin_user_pages_fast().

> +retry_once:
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
> +		/* Order set_current_state above with get_user below. */
> +		smp_mb();

And just in case you hadn't yet seen, that smp_mb() is implied by
set_current_state().

> +		ret = -EFAULT;
> +		if (get_user_nofault(umcg_state, &self->state_ts)) {
> +			set_current_state(TASK_RUNNING);
> +
> +			if (pinned_page)
> +				goto out;
> +			else if (1 != pin_user_pages_fast((unsigned long)self,
> +						1, 0, &pinned_page))

That else is pointless, and that '1 != foo' coding style is evil.

> +					goto out;
> +
> +			goto retry_once;
> +		}

And, as you could've seen from the big patch, all that goto isn't
actually needed here, break / continue seem to be sufficient.

> +
> +		if (pinned_page) {
> +			unpin_user_page(pinned_page);
> +			pinned_page = NULL;
> +		}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
                     ` (3 preceding siblings ...)
  2021-11-24 21:41   ` Peter Zijlstra
@ 2021-11-24 21:58   ` Peter Zijlstra
  2021-11-24 22:18   ` Peter Zijlstra
  5 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 21:58 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +	if (abs_timeout) {
> +		hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
> +				HRTIMER_MODE_ABS);

Using CLOCK_REALTIME timers while the rest of the thing runs off of
CLOCK_MONOTONIC doesn't seem to make sense to me. Why would you want to
have timeouts subject to DST shifts and crap like that?

^ permalink raw reply	[flat|nested] 44+ messages in thread
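
As a sketch of the point above: with state timestamps and timeouts on the
same clock, a caller would build the absolute timeout from CLOCK_MONOTONIC as
well. The umcg_wait() wrapper below is hypothetical and only illustrates how
such an abs_timeout could be computed:

#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper around sys_umcg_wait(flags, abs_timeout). */
extern long umcg_wait(uint32_t flags, uint64_t abs_timeout_ns);

/* Absolute CLOCK_MONOTONIC deadline delta_ns from now. */
static uint64_t umcg_deadline_ns(uint64_t delta_ns)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec + delta_ns;
}

/* Example: wait at most 10ms to be switched back to RUNNING. */
static long umcg_wait_10ms(void)
{
	return umcg_wait(0, umcg_deadline_ns(10 * 1000 * 1000));
}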

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
                     ` (4 preceding siblings ...)
  2021-11-24 21:58   ` Peter Zijlstra
@ 2021-11-24 22:18   ` Peter Zijlstra
  5 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-24 22:18 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> +die:
> +	pr_warn("%s: killing task %d\n", __func__, current->pid);
> +	force_sig(SIGKILL);

That pr_warn() might need to be pr_warn_ratelimited() in order to not be
a system log DoS.

Because, AFAICT, you can craft userspace to trigger this arbitrarily
often, just spawn a worker and make it misbehave.


^ permalink raw reply	[flat|nested] 44+ messages in thread
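
For reference, the rate-limited variant being suggested is a one-line change
(sketch only):

die:
	pr_warn_ratelimited("%s: killing task %d\n", __func__, current->pid);
	force_sig(SIGKILL);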

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-24 20:08   ` Peter Zijlstra
  2021-11-24 21:32     ` Peter Zijlstra
@ 2021-11-25 17:28     ` Peter Oskolkov
  2021-11-26 17:09       ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-25 17:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

Thanks, Peter, for the review!

Some of your comments, like ratelimiting pr_warn and removing gotos,
are obvious in how to address them, so I'll just do that and won't
mention them here. Some comments are less clear re: what should be
done about them, so I have them below with my own comments/questions.

At a higher level, I get that the uaccess patch is bad and needs
serious changes. But based on your comments on this main patch so far,
it looks like the overall approach did not raise many objections - is
it so? Have you finished reviewing the patch?

Please also look at my questions/comments below.

Thanks,
Peter

[...]
> > +struct umcg_task {
[...]

>
> The thing is; I really don't see how this is supposed to be used. Where
> did the blocked and runnable list go ?
>
> I also don't see why the kernel cares about idle workers at all; that
> seems something userspace can sort itself just fine.
>
> The whole next_tid thing seems confused too, how can it be the next task
> when it must be the server? Also, what if there isn't an idle server?
>
> This just all isn't making any sense to me.

Based on your later comments I assume it is clearer now. The doc patch
5 has a lot of extra explanations and examples. Please let me know if
something is still unclear here.

> I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> makes sense to use a time base that's accessible to userspace. Was
> MONOTONIC_RAW considered?

I believe it was considered. I'll re-consider it, and add a comment if
the new consideration arrives at the same conclusion.

> Using CLOCK_REALTIME timers while the rest of the thing runs off of
> CLOCK_MONOTONIC doesn't seem to make sense to me. Why would you want to
> have timeouts subject to DST shifts and crap like that?

Yes, these should be the same if at all possible. I'll definitely
reconsider what clock to use in both timeouts and state timestamps.

> Oooh, someone made things super confusing by doing s/runnable/idle/ on
> the whole thing :-( That only took me most of the day to figure out.
> Naming is important, don't mess about with stuff like this.

I clearly remember I had four states: blocked, pending, runnable,
running (I still believe that four states better reflect what is going
on here). The current blocked/idle/running is the result of an early
discussion. Something along the lines of:

<start of a recollection>
pending workers (=unblocked workers that the userspace still thinks
are blocked) are better named as idle; also the kernel does not really
care about what userspace thinks, so idle workers and runnable workers
are the same from the kernel point of view, so let's have one state
for these workers, not two.
<end of the recollection>

Please let me know if you want me to change anything here. I'll gladly
name workers on the idle worker list as idle (or whatever you prefer),
and workers that the userspace took out of the list as "runnable".
Just as a FYI, workers blocked in umcg_wait() will also be called
"runnable" then, as they are sitting in umcg_idle_loop() and can be
woken or swapped into.

^ permalink raw reply	[flat|nested] 44+ messages in thread
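
For illustration, the two naming schemes under discussion, as a sketch (not
uapi; the four-state variant is the earlier one recalled above):

/* Earlier proposal: keep the userspace-visible distinction. */
enum umcg_worker_state_v1 { W_BLOCKED, W_PENDING, W_RUNNABLE, W_RUNNING };

/* Current patchset: the kernel does not care whether userspace has noticed
 * the unblock yet, so PENDING and RUNNABLE collapse into a single state. */
enum umcg_worker_state_v2 { WS_BLOCKED, WS_IDLE, WS_RUNNING };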

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-25 17:28     ` Peter Oskolkov
@ 2021-11-26 17:09       ` Peter Zijlstra
  2021-11-26 21:08         ` Thomas Gleixner
                           ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-26 17:09 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Thu, Nov 25, 2021 at 09:28:49AM -0800, Peter Oskolkov wrote:

> it looks like the overall approach did not raise many objections - is
> it so? Have you finished reviewing the patch?

I've been trying to make sense of it, and while doing so deleted a bunch
of things and rewrote the rest.

Things that went *poof*:

 - wait_wake_only
 - server_tid_ptr (now: server_tid)
 - state_ts (now: state,blocked_ts,runnable_ts)

I've also changed next_tid to only be used as a context switch target,
never to find the server to enqueue the runnable tasks on.

All xchg() users seem to have disappeared.

Signals should now be handled, after which it'll go back to waiting on
RUNNING.

The code could fairly easily be changed to work on 32bit, big-endian is
the tricky bit, for now 64bit only.

Anyway, I only *think* the below code will work (it compiles with gcc-10
and gcc-11) but I've not yet come around to writing/updating the
userspace part, so it might explode on first contact -- I'll try that
next week if you don't beat me to it.

That said, the below code seems somewhat sensible to me (I would say,
having written it :), but I'm fairly sure I killed some capabilities the
other thing had (notably the first two items above).

If you want either of them restored, can you please give a use-case for
them? Because I cannot seem to think of any sane cases for either
wait_wake_only or server_tid_ptr.

Anyway, in large order it's very like what you did, but it's different
in pretty much all details.

Of note, it now has 5 hooks: sys_enter, pre-schedule, post-schedule
(still nop), sys_exit and notify_resume.

---
Subject: sched: User Mode Concurrency Groups
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Nov 26 17:24:27 CET 2021

XXX split and changelog

Originally-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -248,6 +248,7 @@ config X86
 	select HAVE_RSEQ
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select HAVE_UMCG			if X86_64
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_SMT			if SMP
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -371,6 +371,8 @@
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
+450	common	umcg_ctl		sys_umcg_ctl
+451	common	umcg_wait		sys_umcg_wait
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* reenable singlestep on user return*/
 #define TIF_SSBD		5	/* Speculative store bypass disable */
+#define TIF_UMCG		6	/* UMCG return to user hook */
 #define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
 #define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SSBD		(1 << TIF_SSBD)
+#define _TIF_UMCG		(1 << TIF_UMCG)
 #define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
 #define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -341,6 +341,24 @@ do {									\
 		     : [umem] "m" (__m(addr))				\
 		     : : label)
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm_volatile_goto("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     _ASM_EXTABLE_UA(1b, %l[label])			\
+		     : CC_OUT(z) (success),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new)				\
+		     : "memory", "cc"					\
+		     : label);						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 #ifdef CONFIG_X86_32
@@ -411,6 +429,34 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       [efault] "i" (-EFAULT), "0" (err))
 
+#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)	({	\
+	int __err = 0;							\
+	bool success;							\
+	__typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);		\
+	__typeof__(*(_ptr)) __old = *_old;				\
+	__typeof__(*(_ptr)) __new = (_new);				\
+	asm volatile("\n"						\
+		     "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
+		     CC_SET(z)						\
+		     "2:\n"						\
+		     ".pushsection .fixup,\"ax\"\n"			\
+		     "3:	mov %[efault], %[errout]\n"		\
+		     "		jmp 2b\n"				\
+		     ".popsection\n"					\
+		     _ASM_EXTABLE_UA(1b, 3b)				\
+		     : CC_OUT(z) (success),				\
+		       [errout] "+r" (__err),				\
+		       [ptr] "+m" (*_ptr),				\
+		       [old] "+a" (__old)				\
+		     : [new] "r" (__new),				\
+		       [efault] "i" (-EFAULT)				\
+		     : "memory", "cc");					\
+	if (unlikely(__err))						\
+		goto label;						\
+	if (unlikely(!success))						\
+		*_old = __old;						\
+	likely(success);					})
+
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
 /* FIXME: this hack is definitely wrong -AK */
@@ -505,6 +551,21 @@ do {										\
 } while (0)
 #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
 
+extern void __try_cmpxchg_user_wrong_size(void);
+
+#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({		\
+	__typeof__(*(_ptr)) __ret;					\
+	switch (sizeof(__ret)) {					\
+	case 4:	__ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	case 8:	__ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),	\
+					       (_nval), _label);	\
+		break;							\
+	default: __try_cmpxchg_user_wrong_size();			\
+	}								\
+	__ret;						})
+
 /*
  * We want the unsafe accessors to always be inlined and use
  * the error labels - thus the macro games.
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 	rseq_execve(current);
+	umcg_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	return retval;
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -22,6 +22,10 @@
 # define _TIF_UPROBE			(0)
 #endif
 
+#ifndef _TIF_UMCG
+# define _TIF_UMCG			(0)
+#endif
+
 /*
  * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
  */
@@ -42,11 +46,13 @@
 				 SYSCALL_WORK_SYSCALL_EMU |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 ARCH_SYSCALL_WORK_ENTER)
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
+				 SYSCALL_WORK_SYSCALL_UMCG |		\
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -67,6 +67,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1294,6 +1295,15 @@ struct task_struct {
 	unsigned long rseq_event_mask;
 #endif
 
+#ifdef CONFIG_UMCG
+	clockid_t		umcg_clock;
+	struct umcg_task __user	*umcg_task;
+	struct page		*umcg_worker_page;
+	struct task_struct	*umcg_server;
+	struct umcg_task __user *umcg_server_task;
+	struct page		*umcg_server_page;
+#endif
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	union {
@@ -1687,6 +1697,13 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+
+#ifdef CONFIG_UMCG
+#define PF_UMCG_WORKER		0x01000000	/* UMCG worker */
+#else
+#define PF_UMCG_WORKER		0x00000000
+#endif
+
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
@@ -2285,6 +2302,67 @@ static inline void rseq_execve(struct ta
 {
 }
 
+#endif
+
+#ifdef CONFIG_UMCG
+
+extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
+extern void umcg_sys_exit(struct pt_regs *regs);
+extern void umcg_notify_resume(struct pt_regs *regs);
+extern void umcg_worker_exit(void);
+extern void umcg_clear_child(struct task_struct *tsk);
+
+/* Called by bprm_execve() in fs/exec.c. */
+static inline void umcg_execve(struct task_struct *tsk)
+{
+	if (tsk->umcg_task)
+		umcg_clear_child(tsk);
+}
+
+/* Called by do_exit() in kernel/exit.c. */
+static inline void umcg_handle_exit(void)
+{
+	if (current->flags & PF_UMCG_WORKER)
+		umcg_worker_exit();
+}
+
+/*
+ * umcg_wq_worker_[sleeping|running] are called in core.c by
+ * sched_submit_work() and sched_update_worker().
+ */
+extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
+extern void umcg_wq_worker_running(struct task_struct *tsk);
+
+#else  /* CONFIG_UMCG */
+
+static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+}
+
+static inline void umcg_sys_exit(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_notify_resume(struct pt_regs *regs)
+{
+}
+
+static inline void umcg_clear_child(struct task_struct *tsk)
+{
+}
+static inline void umcg_execve(struct task_struct *tsk)
+{
+}
+static inline void umcg_handle_exit(void)
+{
+}
+static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+}
+static inline void umcg_wq_worker_running(struct task_struct *tsk)
+{
+}
+
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -72,6 +72,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(in
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
 asmlinkage long sys_memfd_secret(unsigned int flags);
+asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
+asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
 
 /*
  * Architecture-specific system calls
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -46,6 +46,7 @@ enum syscall_work_bit {
 	SYSCALL_WORK_BIT_SYSCALL_AUDIT,
 	SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
 	SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
+	SYSCALL_WORK_BIT_SYSCALL_UMCG,
 };
 
 #define SYSCALL_WORK_SECCOMP		BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -55,6 +56,7 @@ enum syscall_work_bit {
 #define SYSCALL_WORK_SYSCALL_AUDIT	BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
 #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
 #define SYSCALL_WORK_SYSCALL_EXIT_TRAP	BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
+#define SYSCALL_WORK_SYSCALL_UMCG	BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
 #endif
 
 #include <asm/thread_info.h>
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
 #define __NR_futex_waitv 449
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 
+#define __NR_umcg_ctl 450
+__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
+#define __NR_umcg_wait 451
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
 #undef __NR_syscalls
-#define __NR_syscalls 450
+
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/types.h>
+
+/*
+ * UMCG: User Managed Concurrency Groups.
+ *
+ * Syscalls (see kernel/sched/umcg.c):
+ *      sys_umcg_ctl()  - register/unregister UMCG tasks;
+ *      sys_umcg_wait() - wait/wake/context-switch.
+ *
+ * struct umcg_task (below): controls the state of UMCG tasks.
+ */
+
+/*
+ * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
+ * The states represent the user space point of view.
+ *
+ *   ,--------(TF_PREEMPT + notify_resume)-------. ,------------.
+ *   |                                           v |            |
+ * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
+ *   ^                                           | ^            |
+ *   `--------------(sys_umcg_wait)--------------' `------------'
+ *
+ */
+#define UMCG_TASK_NONE			0x0000U
+#define UMCG_TASK_RUNNING		0x0001U
+#define UMCG_TASK_RUNNABLE		0x0002U
+#define UMCG_TASK_BLOCKED		0x0003U
+
+#define UMCG_TASK_MASK			0x00ffU
+
+/*
+ * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
+ *
+ * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
+ * return-to-user (eg signal) will perform the equivalent of sys_umcg_wait() on
+ * it. That is, it will wake next_tid/server_tid, transfer to RUNNABLE and
+ * enqueue on the server's runnable list.
+ *
+ */
+#define UMCG_TF_PREEMPT			0x0100U
+
+#define UMCG_TF_MASK			0xff00U
+
+#define UMCG_TASK_ALIGN			64
+
+/**
+ * struct umcg_task - controls the state of UMCG tasks.
+ *
+ * The struct is aligned at 64 bytes to ensure that it fits into
+ * a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state_ts: the current state of the UMCG task described by
+	 *            this struct, with a unique timestamp indicating
+	 *            when the last state change happened.
+	 *
+	 * Readable/writable by both the kernel and the userspace.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  7: task state;
+	 *   bits  8 - 15: state flags;
+	 *   bits 16 - 31: for userspace use;
+	 */
+	__u32	state;				/* r/w */
+
+	/**
+	 * @next_tid: the TID of the UMCG task that should be context-switched
+	 *            into in sys_umcg_wait(). Can be zero, in which case
+	 *            it'll switch to server_tid.
+	 *
+	 * @server_tid: the TID of the UMCG server that hosts this task,
+	 *		when RUNNABLE this task will get added to its
+	 *		runnable_workers_ptr list.
+	 *
+	 * Read-only for the kernel, read/write for the userspace.
+	 */
+	__u32	next_tid;			/* r   */
+	__u32	server_tid;			/* r   */
+
+	__u32	__hole[1];
+
+	/*
+	 * Timestamps for when last we became BLOCKED, RUNNABLE, in CLOCK_MONOTONIC.
+	 */
+	__u64	blocked_ts;			/*   w */
+	__u64   runnable_ts;			/*   w */
+
+	/**
+	 * @runnable_workers_ptr: a single-linked list of runnable workers.
+	 *
+	 * Readable/writable by both the kernel and the userspace: the
+	 * kernel adds items to the list, userspace removes them.
+	 */
+	__u64	runnable_workers_ptr;		/* r/w */
+
+	__u64	__zero[3];
+
+} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
+
+/**
+ * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
+ * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
+ * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
+ * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
+ */
+enum umcg_ctl_flag {
+	UMCG_CTL_REGISTER	= 0x00001,
+	UMCG_CTL_UNREGISTER	= 0x00002,
+	UMCG_CTL_WORKER		= 0x10000,
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1693,6 +1693,21 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config HAVE_UMCG
+	bool
+
+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	depends on 64BIT
+	depends on GENERIC_ENTRY
+	depends on HAVE_UMCG
+	default n
+	help
+	  Enable User Managed Concurrency Groups API, which form the basis
+	  for an in-process M:N userspace scheduling framework.
+	  At the moment this is an experimental/RFC feature that is not
+	  guaranteed to be backward-compatible.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 
 #include "common.h"
 
@@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
 	if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, syscall);
 
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_enter(regs, syscall);
+
 	syscall_enter_audit(regs, syscall);
 
 	return ret ? : syscall;
@@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
-
+	do {
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & _TIF_NEED_RESCHED)
@@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
 		if (ti_work & _TIF_PATCH_PENDING)
 			klp_update_patch_state(current);
 
+		/* must be before handle_signal_work(); terminates on sigpending */
+		if (ti_work & _TIF_UMCG)
+			umcg_notify_resume(regs);
+
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
 			handle_signal_work(regs, ti_work);
 
@@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
 		tick_nohz_user_enter_prepare();
 
 		ti_work = READ_ONCE(current_thread_info()->flags);
-	}
+	} while (ti_work & EXIT_TO_USER_MODE_WORK);
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
@@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
 	/* Flush pending rcuog wakeup before the last need_resched() check */
 	tick_nohz_user_enter_prepare();
 
-	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
+	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
@@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_
 	step = report_single_step(work);
 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
 		arch_syscall_exit_tracehook(regs, step);
+
+	if (work & SYSCALL_WORK_SYSCALL_UMCG)
+		umcg_sys_exit(regs);
 }
 
 /*
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
 	if (unlikely(!tsk->pid))
 		panic("Attempted to kill the idle task!");
 
+	/* Turn off UMCG sched hooks. */
+	if (unlikely(tsk->flags & PF_UMCG_WORKER))
+		tsk->flags &= ~PF_UMCG_WORKER;
+
 	/*
 	 * If do_exit is called because this processes oopsed, it's possible
 	 * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
@@ -786,6 +790,7 @@ void __noreturn do_exit(long code)
 
 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */
+	umcg_handle_exit();
 
 	/* sync mm's RSS info before statistics gathering */
 	if (tsk->mm)
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
@@ -4270,6 +4269,7 @@ static void __sched_fork(unsigned long c
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+	umcg_clear_child(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -6328,9 +6328,11 @@ static inline void sched_submit_work(str
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
 	 */
-	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (task_flags & PF_WQ_WORKER)
 			wq_worker_sleeping(tsk);
+		else if (task_flags & PF_UMCG_WORKER)
+			umcg_wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 	}
@@ -6348,9 +6350,11 @@ static inline void sched_submit_work(str
 
 static void sched_update_worker(struct task_struct *tsk)
 {
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
+		else if (tsk->flags & PF_UMCG_WORKER)
+			umcg_wq_worker_running(tsk);
 		else
 			io_wq_worker_running(tsk);
 	}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
-
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(st
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,744 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ *
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+#include <asm/syscall.h>
+
+#include "sched.h"
+
+static struct task_struct *umcg_get_task(u32 tid)
+{
+	struct task_struct *tsk = NULL;
+
+	if (tid) {
+		rcu_read_lock();
+		tsk = find_task_by_vpid(tid);
+		if (tsk && current->mm == tsk->mm && tsk->umcg_task)
+			get_task_struct(tsk);
+		else
+			tsk = NULL;
+		rcu_read_unlock();
+	}
+
+	return tsk;
+}
+
+/**
+ * umcg_pin_pages: pin pages containing struct umcg_task of this worker
+ *                 and its server.
+ */
+static int umcg_pin_pages(void)
+{
+	struct task_struct *server = NULL, *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	int server_tid;
+
+	if (tsk->umcg_worker_page ||
+	    tsk->umcg_server_page ||
+	    tsk->umcg_server_task ||
+	    tsk->umcg_server)
+		return -EBUSY;
+
+	if (get_user(server_tid, &self->server_tid))
+		return -EFAULT;
+
+	server = umcg_get_task(server_tid);
+	if (!server)
+		return -EINVAL;
+
+	if (pin_user_pages_fast((unsigned long)self, 1, 0,
+				&tsk->umcg_worker_page) != 1)
+		goto clear_self;
+
+	/* must cache due to possible concurrent change vs access_ok() */
+	tsk->umcg_server_task = server->umcg_task;
+	if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0,
+				&tsk->umcg_server_page) != 1)
+		goto clear_server;
+
+	tsk->umcg_server = server;
+
+	return 0;
+
+clear_server:
+	tsk->umcg_server_task = NULL;
+	tsk->umcg_server_page = NULL;
+
+	unpin_user_page(tsk->umcg_worker_page);
+clear_self:
+	tsk->umcg_worker_page = NULL;
+	put_task_struct(server);
+
+	return -EFAULT;
+}
+
+static void umcg_unpin_pages(void)
+{
+	struct task_struct *tsk = current;
+
+	if (tsk->umcg_server) {
+		unpin_user_page(tsk->umcg_worker_page);
+		tsk->umcg_worker_page = NULL;
+
+		unpin_user_page(tsk->umcg_server_page);
+		tsk->umcg_server_page = NULL;
+		tsk->umcg_server_task = NULL;
+
+		put_task_struct(tsk->umcg_server);
+		tsk->umcg_server = NULL;
+	}
+}
+
+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+		tsk->umcg_server = NULL;
+
+		/* These can be simple writes - see the comment above. */
+		tsk->umcg_worker_page = NULL;
+		tsk->umcg_server_page = NULL;
+		tsk->umcg_server_task = NULL;
+
+		tsk->flags &= ~PF_UMCG_WORKER;
+		clear_task_syscall_work(tsk, SYSCALL_UMCG);
+		clear_tsk_thread_flag(tsk, TIF_UMCG);
+	}
+}
+
+/* Called for a forked or execve-ed child. */
+void umcg_clear_child(struct task_struct *tsk)
+{
+	umcg_clear_task(tsk);
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_worker_exit(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}
+
+/*
+ * Do a state transition, @from -> @to, and possible read @next after that.
+ *
+ * Will clear UMCG_TF_PREEMPT.
+ *
+ * When @to == {BLOCKED,RUNNABLE}, update timestamps.
+ *
+ * Returns:
+ *   0: success
+ *   -EAGAIN: when self->state != @from
+ *   -EFAULT
+ */
+static int umcg_update_state(struct task_struct *tsk, u32 from, u32 to, u32 *next)
+{
+	struct umcg_task *self = tsk->umcg_task;
+	u32 old, new;
+	u64 now;
+
+	if (to >= UMCG_TASK_RUNNABLE) {
+		switch (tsk->umcg_clock) {
+		case CLOCK_REALTIME:      now = ktime_get_real_ns();     break;
+		case CLOCK_MONOTONIC:     now = ktime_get_ns();          break;
+		case CLOCK_BOOTTIME:      now = ktime_get_boottime_ns(); break;
+		case CLOCK_TAI:           now = ktime_get_clocktai_ns(); break;
+		}
+	}
+
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(old, &self->state, Efault);
+	do {
+		if ((old & UMCG_TASK_MASK) != from)
+			goto fail;
+
+		new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
+		new |= to & UMCG_TASK_MASK;
+
+	} while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
+
+	if (to == UMCG_TASK_BLOCKED)
+		unsafe_put_user(now, &self->blocked_ts, Efault);
+	if (to == UMCG_TASK_RUNNABLE)
+		unsafe_put_user(now, &self->runnable_ts, Efault);
+
+	if (next)
+		unsafe_get_user(*next, &self->next_tid, Efault);
+
+	user_access_end();
+	return 0;
+
+fail:
+	user_access_end();
+	return -EAGAIN;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+/* Called from syscall enter path */
+void umcg_sys_enter(struct pt_regs *regs, long syscall)
+{
+	/* avoid recursion vs our own syscalls */
+	if (syscall == __NR_umcg_wait ||
+	    syscall == __NR_umcg_ctl)
+		return;
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		goto die;
+
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+static int umcg_wake_task(struct task_struct *tsk)
+{
+	int ret = umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+	if (ret)
+		return ret;
+
+	try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+	return 0;
+}
+
+/*
+ * Wake @next_tid or server.
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_wake_next(struct task_struct *tsk, u32 next_tid)
+{
+	struct task_struct *next = NULL;
+	int ret;
+
+	next = umcg_get_task(next_tid);
+	/*
+	 * umcg_wake_task(next) might fault; if we cannot fault, we'll eat it
+	 * and 'spuriously' not wake @next_tid but instead try and wake the
+	 * server.
+	 *
+	 * XXX: we can fix this by adding umcg_next_page to umcg_pin_pages().
+	 *
+	 * umcg_wake_task() can also fail due to next not having the right
+	 * state, then too will we try and wake the server.
+	 *
+	 * If we cannot wake the server due to state issues, too bad.
+	 */
+	if (!next || umcg_wake_task(next)) {
+		ret = umcg_wake_task(tsk->umcg_server);
+		if (ret == -EFAULT)
+			goto out;
+	}
+	ret = 0;
+out:
+	if (next)
+		put_task_struct(next);
+
+	return ret;
+}
+
+/* pre-schedule() */
+void umcg_wq_worker_sleeping(struct task_struct *tsk)
+{
+	int next_tid;
+
+	/* Must not fault, mmap_sem might be held. */
+	pagefault_disable();
+
+	if (WARN_ON_ONCE(!tsk->umcg_server))
+		goto die;
+
+	if (umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, &next_tid))
+		goto die;
+
+	if (umcg_wake_next(tsk, next_tid))
+		goto die;
+
+	pagefault_enable();
+
+	/*
+	 * We're going to sleep, make sure to unpin the pages, this ensures
+	 * the pins are temporary.
+	 */
+	umcg_unpin_pages();
+
+	return;
+
+die:
+	pagefault_enable();
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/* post-schedule() */
+void umcg_wq_worker_running(struct task_struct *tsk)
+{
+	/* nothing here, see umcg_sys_exit() */
+}
+
+/*
+ * Enqueue @tsk on its server's runnable list
+ *
+ * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
+ *
+ * cmpxchg based single linked list add such that list integrity is never
+ * violated.  Userspace *MUST* remove it from the list before changing ->state.
+ * As such, we must change state to RUNNABLE before enqueue.
+ *
+ * Returns:
+ *   0: success
+ *   -EFAULT
+ */
+static int umcg_enqueue_runnable(struct task_struct *tsk)
+{
+	struct umcg_task __user *server = tsk->umcg_server_task;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u64 self_ptr = (unsigned long)self;
+	u64 first_ptr;
+
+	/*
+	 * umcg_pin_pages() did access_ok() on both pointers, use self here
+	 * only because __user_access_begin() isn't available in generic code.
+	 */
+	if (!user_access_begin(self, sizeof(*self)))
+		return -EFAULT;
+
+	unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
+	do {
+		unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
+	} while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
+
+	user_access_end();
+	return 0;
+
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
+/*
+ * umcg_wait: Wait for ->state to become RUNNING
+ *
+ * Returns:
+ *   0: success
+ *   -EINTR: pending signal
+ *   -EINVAL: ->state is not {RUNNABLE,RUNNING}
+ *   -ETIMEDOUT
+ *   -EFAULT
+ */
+int umcg_wait(u64 timo)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	struct hrtimer_sleeper timeout;
+	struct page *page = NULL;
+	u32 state;
+	int ret;
+
+	if (timo) {
+		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
+					      HRTIMER_MODE_ABS);
+		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
+					     tsk->timer_slack_ns);
+	}
+
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+
+		/*
+		 * Faults can block and scribble our wait state.
+		 */
+		pagefault_disable();
+		if (get_user(state, &self->state)) {
+			pagefault_enable();
+
+			ret = -EFAULT;
+			if (page) {
+				unpin_user_page(page);
+				page = NULL;
+				break;
+			}
+
+			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
+				page = NULL;
+				break;
+			}
+
+			continue;
+		}
+
+		if (page) {
+			unpin_user_page(page);
+			page = NULL;
+		}
+		pagefault_enable();
+
+		state &= UMCG_TASK_MASK;
+		if (state != UMCG_TASK_RUNNABLE) {
+			ret = 0;
+			if (state == UMCG_TASK_RUNNING)
+				break;
+
+			ret = -EINVAL;
+			break;
+		}
+
+		if (timo)
+			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
+
+		freezable_schedule();
+
+		ret = -ETIMEDOUT;
+		if (timo && !timeout.task)
+			break;
+	}
+	__set_current_state(TASK_RUNNING);
+
+	if (timo) {
+		hrtimer_cancel(&timeout.timer);
+		destroy_hrtimer_on_stack(&timeout.timer);
+	}
+
+	return ret;
+}
+
+void umcg_sys_exit(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	long syscall = syscall_get_nr(tsk, regs);
+
+	if (syscall == __NR_umcg_wait)
+		return;
+
+	/*
+	 * sys_umcg_ctl() will get here without having called umcg_sys_enter()
+	 * as such it will look like a syscall that blocked.
+	 */
+
+	if (tsk->umcg_server) {
+		/*
+		 * Didn't block, we're done.
+		 */
+		umcg_unpin_pages();
+		return;
+	}
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (umcg_pin_pages())
+		goto die;
+
+	if (umcg_update_state(tsk, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE, NULL))
+		goto die_unpin;
+
+	if (umcg_enqueue_runnable(tsk))
+		goto die_unpin;
+
+	/* server might not be runnable, too bad */
+	if (umcg_wake_task(tsk->umcg_server) == -EFAULT)
+		goto die_unpin;
+
+	umcg_unpin_pages();
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+		goto die;
+
+	case -EINTR:
+		/* notify_resume will continue the wait after the signal */
+		break;
+	default:
+		break;
+	}
+
+	current->flags |= PF_UMCG_WORKER;
+
+	return;
+
+die_unpin:
+	umcg_unpin_pages();
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+void umcg_notify_resume(struct pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	u32 state, next_tid;
+
+	/* avoid recursion vs schedule() */
+	current->flags &= ~PF_UMCG_WORKER;
+
+	if (get_user(state, &self->state))
+		goto die;
+
+	state &= UMCG_TASK_MASK | UMCG_TF_MASK;
+	if (state == UMCG_TASK_RUNNING)
+		goto done;
+
+	if (state & UMCG_TF_PREEMPT) {
+		umcg_pin_pages();
+
+		if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
+				      UMCG_TASK_RUNNABLE, &next_tid))
+			goto die_unpin;
+
+		if (umcg_enqueue_runnable(tsk))
+			goto die_unpin;
+
+		if (umcg_wake_next(tsk, next_tid))
+			goto die_unpin;
+
+		umcg_unpin_pages();
+	}
+
+	switch (umcg_wait(0)) {
+	case -EFAULT:
+	case -EINVAL:
+	case -ETIMEDOUT: /* how!?! */
+		goto die;
+
+	case -EINTR:
+		/* we'll continue the wait after the signal */
+		break;
+	default:
+		break;
+	}
+
+done:
+	current->flags |= PF_UMCG_WORKER;
+	return;
+
+die_unpin:
+	umcg_unpin_pages();
+die:
+	current->flags |= PF_UMCG_WORKER;
+	pr_warn("%s: killing task %d\n", __func__, current->pid);
+	force_sig(SIGKILL);
+}
+
+/**
+ * sys_umcg_wait: put the current task to sleep and/or wake another task.
+ * @flags:        zero or a value from enum umcg_wait_flag.
+ * @abs_timeout:  when to wake the task, in nanoseconds; zero for no timeout.
+ *
+ *
+ *
+ * Returns:
+ * 0             - OK;
+ * -ETIMEDOUT    - the timeout expired;
+ * -EFAULT       - failed accessing struct umcg_task __user of the current
+ *                 task, the server or next.
+ * -ESRCH        - the task to wake not found or not a UMCG task;
+ * -EINVAL       - another error happened (e.g. the current task is not a
+ *                 UMCG task, etc.)
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
+{
+	struct task_struct *next, *tsk = current;
+	struct umcg_task __user *self = tsk->umcg_task;
+	bool worker = tsk->flags & PF_UMCG_WORKER;
+	u32 next_tid;
+	int ret;
+
+	if (!self || flags)
+		return -EINVAL;
+
+	if (worker)
+		tsk->flags &= ~PF_UMCG_WORKER;
+
+	/* see umcg_sys_{enter,exit}() */
+	umcg_pin_pages();
+
+	ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
+	if (ret)
+		goto unpin;
+
+	next = umcg_get_task(next_tid);
+	if (!next) {
+		ret = -ESRCH;
+		goto unblock;
+	}
+
+	if (worker) {
+		ret = umcg_enqueue_runnable(tsk);
+		if (ret)
+			goto put_task;
+	}
+
+	ret = umcg_wake_task(next);
+	if (ret)
+		goto put_task;
+
+	put_task_struct(next);
+	umcg_unpin_pages();
+
+	ret = umcg_wait(timo);
+	switch (ret) {
+	case -EINTR:	/* umcg_notify_resume() will continue the wait */
+	case 0:		/* all done */
+		ret = 0;
+		break;
+
+	default:
+		/*
+		 * If this fails you get to keep the pieces; you'll get stuck
+		 * in umcg_notify_resume().
+		 */
+		umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+		break;
+	}
+out:
+	if (worker)
+		tsk->flags |= PF_UMCG_WORKER;
+	return ret;
+
+put_task:
+	put_task_struct(next);
+unblock:
+	umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
+unpin:
+	umcg_unpin_pages();
+	goto out;
+}
+
+/**
+ * sys_umcg_ctl: (un)register the current task as a UMCG task.
+ * @flags:       ORed values from enum umcg_ctl_flag; see below;
+ * @self:        a pointer to struct umcg_task that describes this
+ *               task and governs the behavior of sys_umcg_wait if
+ *               registering; must be NULL if unregistering.
+ *
+ * @flags & UMCG_CTL_REGISTER: register a UMCG task:
+ *         UMCG workers:
+ *              - @flags & UMCG_CTL_WORKER
+ *              - self->state must be UMCG_TASK_BLOCKED
+ *         UMCG servers:
+ *              - !(@flags & UMCG_CTL_WORKER)
+ *              - self->state must be UMCG_TASK_RUNNING
+ *
+ *         All tasks:
+ *              - self->next_tid must be zero
+ *
+ *         If the conditions above are met, sys_umcg_ctl() immediately returns
+ *         if the registered task is a server; a worker will be added to
+ *         runnable_workers_ptr, and the worker put to sleep; a runnable server
+ *         from runnable_server_tid_ptr will be woken, if present.
+ *
+ * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
+ *           is a UMCG worker, the userspace is responsible for waking its
+ *           server (before or after calling sys_umcg_ctl).
+ *
+ * Return:
+ * 0                - success
+ * -EFAULT          - failed to read @self
+ * -EINVAL          - some other error occurred
+ */
+SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
+{
+	struct umcg_task ut;
+
+	if ((unsigned long)self % UMCG_TASK_ALIGN)
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_UNREGISTER) {
+		if (self || !current->umcg_task)
+			return -EINVAL;
+
+		if (current->flags & PF_UMCG_WORKER)
+			umcg_worker_exit();
+		else
+			umcg_clear_task(current);
+
+		return 0;
+	}
+
+	if (!(flags & UMCG_CTL_REGISTER))
+		return -EINVAL;
+
+	switch (which_clock) {
+	case CLOCK_REALTIME:
+	case CLOCK_MONOTONIC:
+	case CLOCK_BOOTTIME:
+	case CLOCK_TAI:
+		current->umcg_clock = which_clock;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	flags &= ~UMCG_CTL_REGISTER;
+	if (flags && flags != UMCG_CTL_WORKER)
+		return -EINVAL;
+
+	if (current->umcg_task || !self)
+		return -EINVAL;
+
+	if (copy_from_user(&ut, self, sizeof(ut)))
+		return -EFAULT;
+
+	if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
+		return -EINVAL;
+
+	if (flags == UMCG_CTL_WORKER) {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		current->flags |= PF_UMCG_WORKER;	/* hook schedule() */
+		set_syscall_work(SYSCALL_UMCG);		/* hook syscall */
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_sys_exit() will transition to RUNNABLE and wait */
+
+	} else {
+		if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
+			return -EINVAL;
+
+		WRITE_ONCE(current->umcg_task, self);
+		set_thread_flag(TIF_UMCG);		/* hook return-to-user */
+
+		/* umcg_notify_resume() would block if not RUNNING */
+	}
+
+	return 0;
+}
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);
 
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_ctl);
+COND_SYSCALL(umcg_wait);
+
 /* arch/example/kernel/sys_example.c */
 
 /* mm/fadvise.c */

^ permalink raw reply	[flat|nested] 44+ messages in thread
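
To make the runnable-list protocol in the patch above concrete, a minimal
userspace sketch of a server draining its runnable_workers_ptr list (64-bit
only, as in the patch). The struct mirrors the proposed uapi layout; the
drain/walk helpers are illustrative only. Per the comment in
umcg_enqueue_runnable(), a worker must be taken off the list before userspace
changes its ->state; to actually run a worker, the server would then set its
own next_tid to that worker's TID and call sys_umcg_wait():

#include <stdint.h>

/* Userspace mirror of the proposed struct umcg_task layout. */
struct umcg_task {
	uint32_t state;
	uint32_t next_tid;
	uint32_t server_tid;
	uint32_t __hole[1];
	uint64_t blocked_ts;
	uint64_t runnable_ts;
	uint64_t runnable_workers_ptr;
	uint64_t __zero[3];
} __attribute__((packed, aligned(64)));

/* Atomically detach the whole list the kernel has been pushing onto;
 * the kernel's cmpxchg-based enqueue simply retries against the new head. */
static struct umcg_task *server_take_runnable(struct umcg_task *server)
{
	uint64_t head = __atomic_exchange_n(&server->runnable_workers_ptr,
					    0ULL, __ATOMIC_ACQ_REL);
	return (struct umcg_task *)head;
}

/* Walk the detached list; entries are unlinked here, before any state
 * change, so list integrity is preserved as the kernel expects. */
static void server_collect_runnable(struct umcg_task *server)
{
	struct umcg_task *w = server_take_runnable(server);

	while (w) {
		struct umcg_task *next =
			(struct umcg_task *)w->runnable_workers_ptr;

		/* hand 'w' to the userspace scheduler's run queue here */
		w = next;
	}
}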

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 17:09       ` Peter Zijlstra
@ 2021-11-26 21:08         ` Thomas Gleixner
  2021-11-26 21:59           ` Peter Zijlstra
  2021-11-26 22:16         ` Peter Zijlstra
  2021-11-29  0:29         ` Peter Oskolkov
  2 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2021-11-26 21:08 UTC (permalink / raw)
  To: Peter Zijlstra, Peter Oskolkov
  Cc: Ingo Molnar, Andrew Morton, Dave Hansen, Andy Lutomirski,
	Linux Memory Management List, Linux Kernel Mailing List,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov, Andrei Vagin,
	Jann Horn, Thierry Delisle

On Fri, Nov 26 2021 at 18:09, Peter Zijlstra wrote:
> +
> +	if (timo) {
> +		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> +					      HRTIMER_MODE_ABS);
> +		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> +					     tsk->timer_slack_ns);
> +	}
> +
> +	for (;;) {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +
> +		ret = -EINTR;
> +		if (signal_pending(current))
> +			break;
> +
> +		/*
> +		 * Faults can block and scribble our wait state.
> +		 */
> +		pagefault_disable();
> +		if (get_user(state, &self->state)) {
> +			pagefault_enable();
> +
> +			ret = -EFAULT;
> +			if (page) {
> +				unpin_user_page(page);
> +				page = NULL;
> +				break;
> +			}
> +
> +			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> +				page = NULL;
> +				break;
> +			}
> +
> +			continue;
> +		}
> +
> +		if (page) {
> +			unpin_user_page(page);
> +			page = NULL;
> +		}
> +		pagefault_enable();
> +
> +		state &= UMCG_TASK_MASK;
> +		if (state != UMCG_TASK_RUNNABLE) {
> +			ret = 0;
> +			if (state == UMCG_TASK_RUNNING)
> +				break;
> +
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		if (timo)
> +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> +		freezable_schedule();

You can replace the whole hrtimer foo with

                if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
                                                    tsk->timer_slack_ns,
                                                    HRTIMER_MODE_ABS,
                                                    tsk->umcg_clock)) {
                        ret = -ETIMEDOUT;
                        break;
                }

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-24 21:19   ` Peter Zijlstra
@ 2021-11-26 21:11     ` Thomas Gleixner
  2021-11-26 21:52       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2021-11-26 21:11 UTC (permalink / raw)
  To: Peter Zijlstra, Peter Oskolkov
  Cc: Ingo Molnar, Andrew Morton, Dave Hansen, Andy Lutomirski,
	linux-mm, linux-kernel, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
> On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
>
>> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
>
>> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
>> +				bool may_fault)
>> +{
>> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
>> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
>
> I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> makes sense to use a time base that's accessible to userspace. Was
> MONOTONIC_RAW considered?

MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
solve the HPET crap either.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 21:11     ` Thomas Gleixner
@ 2021-11-26 21:52       ` Peter Zijlstra
  2021-11-29 22:07         ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-26 21:52 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Fri, Nov 26, 2021 at 10:11:17PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
> > On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
> >
> >> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> >
> >> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> >> +				bool may_fault)
> >> +{
> >> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> >> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> >
> > I'm still very hesitant to use ktime (fear the HPET); but I suppose it
> > makes sense to use a time base that's accessible to userspace. Was
> > MONOTONIC_RAW considered?
> 
> MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
> solve the HPET crap either.

But its ns are of equal size to sched_clock(), if both share TSC IIRC.
Whereas MONOTONIC, being subject to ntp rate stuff, has differently
sized ns.

The only time that's relevant though is when you're going to mix these
timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
interesting.

But yeah, not being able to sleep on it ruins the party.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 21:08         ` Thomas Gleixner
@ 2021-11-26 21:59           ` Peter Zijlstra
  2021-11-26 22:07             ` Peter Zijlstra
  2021-11-27  0:45             ` Thomas Gleixner
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-26 21:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 18:09, Peter Zijlstra wrote:
> > +
> > +	if (timo) {
> > +		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> > +					      HRTIMER_MODE_ABS);
> > +		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> > +					     tsk->timer_slack_ns);
> > +	}
> > +
> > +	for (;;) {
> > +		set_current_state(TASK_INTERRUPTIBLE);
> > +
> > +		ret = -EINTR;
> > +		if (signal_pending(current))
> > +			break;
> > +
> > +		/*
> > +		 * Faults can block and scribble our wait state.
> > +		 */
> > +		pagefault_disable();
> > +		if (get_user(state, &self->state)) {
> > +			pagefault_enable();
> > +
> > +			ret = -EFAULT;
> > +			if (page) {
> > +				unpin_user_page(page);
> > +				page = NULL;
> > +				break;
> > +			}
> > +
> > +			if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> > +				page = NULL;
> > +				break;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		if (page) {
> > +			unpin_user_page(page);
> > +			page = NULL;
> > +		}
> > +		pagefault_enable();
> > +
> > +		state &= UMCG_TASK_MASK;
> > +		if (state != UMCG_TASK_RUNNABLE) {
> > +			ret = 0;
> > +			if (state == UMCG_TASK_RUNNING)
> > +				break;
> > +
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +
> > +		if (timo)
> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> > +
> > +		freezable_schedule();
> 
> You can replace the whole hrtimer foo with
> 
>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
>                                                     tsk->timer_slack_ns,
>                                                     HRTIMER_MODE_ABS,
>                                                     tsk->umcg_clock)) {
>                         ret = -ETIMEDOUT;
>                         break;
>                 }

That seems to lose the freezable crud.. then again, since we're
interruptible, that shouldn't matter. Lemme go do that.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 21:59           ` Peter Zijlstra
@ 2021-11-26 22:07             ` Peter Zijlstra
  2021-11-27  0:45             ` Thomas Gleixner
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-26 22:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Fri, Nov 26, 2021 at 10:59:44PM +0100, Peter Zijlstra wrote:

> That seems to lose the freezable crud.. then again, since we're
> interruptible, that shouldn't matter. Lemme go do that.


---

--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -52,7 +52,7 @@ static int umcg_pin_pages(void)
 
 	server = umcg_get_task(server_tid);
 	if (!server)
-		return -EINVAL;
+		return -ESRCH;
 
 	if (pin_user_pages_fast((unsigned long)self, 1, 0,
 				&tsk->umcg_worker_page) != 1)
@@ -358,18 +358,10 @@ int umcg_wait(u64 timo)
 {
 	struct task_struct *tsk = current;
 	struct umcg_task __user *self = tsk->umcg_task;
-	struct hrtimer_sleeper timeout;
 	struct page *page = NULL;
 	u32 state;
 	int ret;
 
-	if (timo) {
-		hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
-					      HRTIMER_MODE_ABS);
-		hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
-					     tsk->timer_slack_ns);
-	}
-
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -415,22 +407,16 @@ int umcg_wait(u64 timo)
 			break;
 		}
 
-		if (timo)
-			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
-
-		freezable_schedule();
-
-		ret = -ETIMEDOUT;
-		if (timo && !timeout.task)
+		if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
+						    tsk->timer_slack_ns,
+						    HRTIMER_MODE_ABS,
+						    tsk->umcg_clock)) {
+			ret = -ETIMEDOUT;
 			break;
+		}
 	}
 	__set_current_state(TASK_RUNNING);
 
-	if (timo) {
-		hrtimer_cancel(&timeout.timer);
-		destroy_hrtimer_on_stack(&timeout.timer);
-	}
-
 	return ret;
 }
 
@@ -515,7 +501,8 @@ void umcg_notify_resume(struct pt_regs *
 		goto done;
 
 	if (state & UMCG_TF_PREEMPT) {
-		umcg_pin_pages();
+		if (umcg_pin_pages())
+			goto die;
 
 		if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
 				      UMCG_TASK_RUNNABLE, &next_tid))
@@ -586,7 +573,9 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 		tsk->flags &= ~PF_UMCG_WORKER;
 
 	/* see umcg_sys_{enter,exit}() */
-	umcg_pin_pages();
+	ret = umcg_pin_pages();
+	if (ret)
+		return ret;
 
 	ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
 	if (ret)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 17:09       ` Peter Zijlstra
  2021-11-26 21:08         ` Thomas Gleixner
@ 2021-11-26 22:16         ` Peter Zijlstra
  2021-11-27  1:16           ` Thomas Gleixner
  2021-11-29  0:29         ` Peter Oskolkov
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-26 22:16 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Fri, Nov 26, 2021 at 06:09:10PM +0100, Peter Zijlstra wrote:

> @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
>  	 * Before returning to user space ensure that all pending work
>  	 * items have been completed.
>  	 */
> -	while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +	do {
>  		local_irq_enable_exit_to_user(ti_work);
>  
>  		if (ti_work & _TIF_NEED_RESCHED)
> @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
>  		if (ti_work & _TIF_PATCH_PENDING)
>  			klp_update_patch_state(current);
>  
> +		/* must be before handle_signal_work(); terminates on sigpending */
> +		if (ti_work & _TIF_UMCG)
> +			umcg_notify_resume(regs);
> +
>  		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>  			handle_signal_work(regs, ti_work);
>  
> @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
>  		tick_nohz_user_enter_prepare();
>  
>  		ti_work = READ_ONCE(current_thread_info()->flags);
> -	}
> +	} while (ti_work & EXIT_TO_USER_MODE_WORK);
>  
>  	/* Return the latest work state for arch_exit_to_user_mode() */
>  	return ti_work;
> @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
>  	/* Flush pending rcuog wakeup before the last need_resched() check */
>  	tick_nohz_user_enter_prepare();
>  
> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>  
>  	arch_exit_to_user_mode_prepare(regs, ti_work);

Thomas, since you're looking at this: I'm not quite sure I got this
right. The intent is that when _TIF_UMCG is set (and it is never cleared
until the task unregisters), umcg_notify_resume() is called at least once.

The thinking is that if umcg_wait() gets interrupted, we'll drop out,
handle the signal and then resume the wait, which can obviously happen
any number of times.

It's just that I'm never quite sure where signal crud happens; I'm
assuming handle_signal_work() simply mucks about with regs (sets sp and
ip etc.. to the signal stack) and drops out of kernel mode, and on
re-entry we do this whole merry cycle once again. But I never actually
dug that deep.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 21:59           ` Peter Zijlstra
  2021-11-26 22:07             ` Peter Zijlstra
@ 2021-11-27  0:45             ` Thomas Gleixner
  2021-11-29 15:05               ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2021-11-27  0:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Fri, Nov 26 2021 at 22:59, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
>> > +		if (timo)
>> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
>> > +
>> > +		freezable_schedule();
>> 
>> You can replace the whole hrtimer foo with
>> 
>>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
>>                                                     tsk->timer_slack_ns,
>>                                                     HRTIMER_MODE_ABS,
>>                                                     tsk->umcg_clock)) {
>>                         ret = -ETIMEDOUT;
>>                         break;
>>                 }
>
> That seems to lose the freezable crud.. then again, since we're
> interruptible, that shouldn't matter. Lemme go do that.

We could add a freezable wrapper for that if necessary.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 22:16         ` Peter Zijlstra
@ 2021-11-27  1:16           ` Thomas Gleixner
  2021-11-29 15:07             ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2021-11-27  1:16 UTC (permalink / raw)
  To: Peter Zijlstra, Peter Oskolkov
  Cc: Ingo Molnar, Andrew Morton, Dave Hansen, Andy Lutomirski,
	Linux Memory Management List, Linux Kernel Mailing List,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov, Andrei Vagin,
	Jann Horn, Thierry Delisle

On Fri, Nov 26 2021 at 23:16, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 06:09:10PM +0100, Peter Zijlstra wrote:
>>  
>> -	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
>> +	if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>>  		ti_work = exit_to_user_mode_loop(regs, ti_work);
>>  
>>  	arch_exit_to_user_mode_prepare(regs, ti_work);
>
> Thomas, since you're looking at this: I'm not quite sure I got this
> right. The intent is that when _TIF_UMCG is set (and it is never cleared
> until the task unregisters), umcg_notify_resume() is called at least once.

Right.

> The thinking is that if umcg_wait() gets interrupted, we'll drop out,
> handle the signal and then resume the wait, which can obviously happen
> any number of times.

Right.

> It's just that I'm never quite sure where signal crud happens; I'm
> assuming handle_signal_work() simply mucks about with regs (sets sp and
> ip etc.. to the signal stack) and drops out of kernel mode, and on
> re-entry we do this whole merry cycle once again. But I never actually
> dug that deep.

Yes. It sets up the signal frame and, once the loop is left because there
are no more TIF flags to handle, it drops back to user space into the
signal handler. That returns to the kernel via sys_[rt_]sigreturn()
which undoes the regs damage either by restoring the previous state or
fiddling it to restart the syscall instead of dropping back to user
space.

So yes, this should work, but I hate the sticky nature of TIF_UMCG. I
have no real good idea how to avoid that yet, but let me think about it
some more.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 17:09       ` Peter Zijlstra
  2021-11-26 21:08         ` Thomas Gleixner
  2021-11-26 22:16         ` Peter Zijlstra
@ 2021-11-29  0:29         ` Peter Oskolkov
  2021-11-29 16:41           ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-29  0:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Fri, Nov 26, 2021 at 9:09 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 25, 2021 at 09:28:49AM -0800, Peter Oskolkov wrote:
>
> > it looks like the overall approach did not raise many objections - is
> > it so? Have you finished reviewing the patch?
>
> I've been trying to make sense of it, and while doing so deleted a bunch
> of things and rewrote the rest.

Thanks a lot, Peter! If we can get this in, and work the kinks out
later, that would be great!

>
> Things that went *poof*:
>
>  - wait_wake_only
>  - server_tid_ptr (now: server_tid)
>  - state_ts (now: state,blocked_ts,runnable_ts)
>
> I've also changed next_tid to only be used as a context switch target,
> never to find the server to enqueue the runnable tasks on.
>
> All xchg() users seem to have disappeared.
>
> Signals should now be handled, after which it'll go back to waiting on
> RUNNING.
>
> The code could fairly easily be changed to work on 32bit, big-endian is
> the tricky bit, for now 64bit only.
>
> Anyway, I only *think* the below code will work (it compiles with gcc-10
> and gcc-11) but I've not yet come around to writing/updating the
> userspace part, so it might explode on first contact -- I'll try that
> next week if you don't beat me to it.

It'll take me some time to fully test this (got some other stuff to
look at at the moment); some notes are below. I'd prefer you to merge
whatever you believe is working, and to later adjust things that need
adjusting, rather than keep the endless stream of patchsets that go
nowhere.

>
> That said, the below code seems somewhat sensible to me (I would say,
> having written it :), but I'm fairly sure I killed some capabilities the
> other thing had (notably the first two items above).
>
> If you want either of them restored, can you please give a use-case for
> them? Because I cannot seem to think of any sane cases for either
> wait_wake_only or server_tid_ptr.

wait_wake_only is not needed if you have both next_tid and server_tid,
as your patch does. In my version of the patch, next_tid is the same as
server_tid, so the flag is needed to indicate to the kernel that
next_tid is the wakee, not the server.

re: (idle_)server_tid_ptr: it seems that you assume that blocked
workers keep their servers, while in my patch they "lose them" once
they block, and so there should be a global idle server pointer to
wake the server in my scheme (if there is an idle one). The main
difference is that in my approach a server has only a single, running,
worker assigned to it, while in your approach it can have a number of
blocked/idle workers to take care of as well.

The main difference between our approaches, as I see it: in my
approach if a worker is running, its server is sleeping, period. If we
have N servers, and N running workers, there are no servers to wake
when a previously blocked worker finishes its blocking op. In your
approach, it seems that N servers each have a bunch of workers
pointing at them, and a single worker running. If a previously blocked
worker wakes up, it wakes the server it was assigned to previously,
and so now we have more than N physical tasks/threads running: N
workers and the woken server. This is not ideal: if the process is
affined to only N CPUs, that means a worker will be preempted to let
the woken server run, which is somewhat against the goal of letting
the workers run more or less uninterrupted. This is not deal-breaking,
but maybe something to keep in mind.

Another big concern I have is that you removed UMCG_TF_LOCKED. I
definitely needed it to guard workers during "sched work" in the
userspace in my approach. I'm not sure if the flag is absolutely
needed with your approach, but most likely it is - the kernel-side
scheduler does lock tasks and runqueues and disables interrupts and
migrations and other things so that the scheduling logic is not
hijacked by concurrent stuff. Why do you assume that the userspace
scheduling code does not need similar protections?

In summary, again, I'm fine with your patch/approach getting in,
provided things like UMCG_TF_LOCKED are considered later.




>
> Anyway, in large order it's very like what you did, but it's different
> in pretty much all details.
>
> Of note, it now has 5 hooks: sys_enter, pre-schedule, post-schedule
> (still nop), sys_exit and notify_resume.
>
> ---
> Subject: sched: User Mode Concurency Groups
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri Nov 26 17:24:27 CET 2021
>
> XXX split and changelog
>
> Originally-by: Peter Oskolkov <posk@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -248,6 +248,7 @@ config X86
>         select HAVE_RSEQ
>         select HAVE_SYSCALL_TRACEPOINTS
>         select HAVE_UNSTABLE_SCHED_CLOCK
> +       select HAVE_UMCG                        if X86_64
>         select HAVE_USER_RETURN_NOTIFIER
>         select HAVE_GENERIC_VDSO
>         select HOTPLUG_SMT                      if SMP
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -371,6 +371,8 @@
>  447    common  memfd_secret            sys_memfd_secret
>  448    common  process_mrelease        sys_process_mrelease
>  449    common  futex_waitv             sys_futex_waitv
> +450    common  umcg_ctl                sys_umcg_ctl
> +451    common  umcg_wait               sys_umcg_wait
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -83,6 +83,7 @@ struct thread_info {
>  #define TIF_NEED_RESCHED       3       /* rescheduling necessary */
>  #define TIF_SINGLESTEP         4       /* reenable singlestep on user return*/
>  #define TIF_SSBD               5       /* Speculative store bypass disable */
> +#define TIF_UMCG               6       /* UMCG return to user hook */
>  #define TIF_SPEC_IB            9       /* Indirect branch speculation mitigation */
>  #define TIF_SPEC_L1D_FLUSH     10      /* Flush L1D on mm switches (processes) */
>  #define TIF_USER_RETURN_NOTIFY 11      /* notify kernel of userspace return */
> @@ -107,6 +108,7 @@ struct thread_info {
>  #define _TIF_NEED_RESCHED      (1 << TIF_NEED_RESCHED)
>  #define _TIF_SINGLESTEP                (1 << TIF_SINGLESTEP)
>  #define _TIF_SSBD              (1 << TIF_SSBD)
> +#define _TIF_UMCG              (1 << TIF_UMCG)
>  #define _TIF_SPEC_IB           (1 << TIF_SPEC_IB)
>  #define _TIF_SPEC_L1D_FLUSH    (1 << TIF_SPEC_L1D_FLUSH)
>  #define _TIF_USER_RETURN_NOTIFY        (1 << TIF_USER_RETURN_NOTIFY)
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -341,6 +341,24 @@ do {                                                                       \
>                      : [umem] "m" (__m(addr))                           \
>                      : : label)
>
> +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)        ({      \
> +       bool success;                                                   \
> +       __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);              \
> +       __typeof__(*(_ptr)) __old = *_old;                              \
> +       __typeof__(*(_ptr)) __new = (_new);                             \
> +       asm_volatile_goto("\n"                                          \
> +                    "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
> +                    _ASM_EXTABLE_UA(1b, %l[label])                     \
> +                    : CC_OUT(z) (success),                             \
> +                      [ptr] "+m" (*_ptr),                              \
> +                      [old] "+a" (__old)                               \
> +                    : [new] "r" (__new)                                \
> +                    : "memory", "cc"                                   \
> +                    : label);                                          \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                        })
> +
>  #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
>  #ifdef CONFIG_X86_32
> @@ -411,6 +429,34 @@ do {                                                                       \
>                      : [umem] "m" (__m(addr)),                          \
>                        [efault] "i" (-EFAULT), "0" (err))
>
> +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label)        ({      \
> +       int __err = 0;                                                  \
> +       bool success;                                                   \
> +       __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold);              \
> +       __typeof__(*(_ptr)) __old = *_old;                              \
> +       __typeof__(*(_ptr)) __new = (_new);                             \
> +       asm volatile("\n"                                               \
> +                    "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\
> +                    CC_SET(z)                                          \
> +                    "2:\n"                                             \
> +                    ".pushsection .fixup,\"ax\"\n"                     \
> +                    "3:        mov %[efault], %[errout]\n"             \
> +                    "          jmp 2b\n"                               \
> +                    ".popsection\n"                                    \
> +                    _ASM_EXTABLE_UA(1b, 3b)                            \
> +                    : CC_OUT(z) (success),                             \
> +                      [errout] "+r" (__err),                           \
> +                      [ptr] "+m" (*_ptr),                              \
> +                      [old] "+a" (__old)                               \
> +                    : [new] "r" (__new),                               \
> +                      [efault] "i" (-EFAULT)                           \
> +                    : "memory", "cc");                                 \
> +       if (unlikely(__err))                                            \
> +               goto label;                                             \
> +       if (unlikely(!success))                                         \
> +               *_old = __old;                                          \
> +       likely(success);                                        })
> +
>  #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
>  /* FIXME: this hack is definitely wrong -AK */
> @@ -505,6 +551,21 @@ do {                                                                               \
>  } while (0)
>  #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT
>
> +extern void __try_cmpxchg_user_wrong_size(void);
> +
> +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({         \
> +       __typeof__(*(_ptr)) __ret;                                      \
> +       switch (sizeof(__ret)) {                                        \
> +       case 4: __ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp),    \
> +                                              (_nval), _label);        \
> +               break;                                                  \
> +       case 8: __ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp),    \
> +                                              (_nval), _label);        \
> +               break;                                                  \
> +       default: __try_cmpxchg_user_wrong_size();                       \
> +       }                                                               \
> +       __ret;                                          })
> +
>  /*
>   * We want the unsafe accessors to always be inlined and use
>   * the error labels - thus the macro games.
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp
>         current->fs->in_exec = 0;
>         current->in_execve = 0;
>         rseq_execve(current);
> +       umcg_execve(current);
>         acct_update_integrals(current);
>         task_numa_free(current, false);
>         return retval;
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -22,6 +22,10 @@
>  # define _TIF_UPROBE                   (0)
>  #endif
>
> +#ifndef _TIF_UMCG
> +# define _TIF_UMCG                     (0)
> +#endif
> +
>  /*
>   * SYSCALL_WORK flags handled in syscall_enter_from_user_mode()
>   */
> @@ -42,11 +46,13 @@
>                                  SYSCALL_WORK_SYSCALL_EMU |             \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
> +                                SYSCALL_WORK_SYSCALL_UMCG |            \
>                                  ARCH_SYSCALL_WORK_ENTER)
>  #define SYSCALL_WORK_EXIT      (SYSCALL_WORK_SYSCALL_TRACEPOINT |      \
>                                  SYSCALL_WORK_SYSCALL_TRACE |           \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
> +                                SYSCALL_WORK_SYSCALL_UMCG |            \
>                                  SYSCALL_WORK_SYSCALL_EXIT_TRAP |       \
>                                  ARCH_SYSCALL_WORK_EXIT)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -67,6 +67,7 @@ struct sighand_struct;
>  struct signal_struct;
>  struct task_delay_info;
>  struct task_group;
> +struct umcg_task;
>
>  /*
>   * Task state bitmask. NOTE! These bits are also
> @@ -1294,6 +1295,15 @@ struct task_struct {
>         unsigned long rseq_event_mask;
>  #endif
>
> +#ifdef CONFIG_UMCG
> +       clockid_t               umcg_clock;
> +       struct umcg_task __user *umcg_task;
> +       struct page             *umcg_worker_page;
> +       struct task_struct      *umcg_server;
> +       struct umcg_task __user *umcg_server_task;
> +       struct page             *umcg_server_page;
> +#endif
> +
>         struct tlbflush_unmap_batch     tlb_ubc;
>
>         union {
> @@ -1687,6 +1697,13 @@ extern struct pid *cad_pid;
>  #define PF_KTHREAD             0x00200000      /* I am a kernel thread */
>  #define PF_RANDOMIZE           0x00400000      /* Randomize virtual address space */
>  #define PF_SWAPWRITE           0x00800000      /* Allowed to write to swap */
> +
> +#ifdef CONFIG_UMCG
> +#define PF_UMCG_WORKER         0x01000000      /* UMCG worker */
> +#else
> +#define PF_UMCG_WORKER         0x00000000
> +#endif
> +
>  #define PF_NO_SETAFFINITY      0x04000000      /* Userland is not allowed to meddle with cpus_mask */
>  #define PF_MCE_EARLY           0x08000000      /* Early kill for mce process policy */
>  #define PF_MEMALLOC_PIN                0x10000000      /* Allocation context constrained to zones which allow long term pinning. */
> @@ -2285,6 +2302,67 @@ static inline void rseq_execve(struct ta
>  {
>  }
>
> +#endif
> +
> +#ifdef CONFIG_UMCG
> +
> +extern void umcg_sys_enter(struct pt_regs *regs, long syscall);
> +extern void umcg_sys_exit(struct pt_regs *regs);
> +extern void umcg_notify_resume(struct pt_regs *regs);
> +extern void umcg_worker_exit(void);
> +extern void umcg_clear_child(struct task_struct *tsk);
> +
> +/* Called by bprm_execve() in fs/exec.c. */
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +       if (tsk->umcg_task)
> +               umcg_clear_child(tsk);
> +}
> +
> +/* Called by do_exit() in kernel/exit.c. */
> +static inline void umcg_handle_exit(void)
> +{
> +       if (current->flags & PF_UMCG_WORKER)
> +               umcg_worker_exit();
> +}
> +
> +/*
> + * umcg_wq_worker_[sleeping|running] are called in core.c by
> + * sched_submit_work() and sched_update_worker().
> + */
> +extern void umcg_wq_worker_sleeping(struct task_struct *tsk);
> +extern void umcg_wq_worker_running(struct task_struct *tsk);
> +
> +#else  /* CONFIG_UMCG */
> +
> +static inline void umcg_sys_enter(struct pt_regs *regs, long syscall)
> +{
> +}
> +
> +static inline void umcg_sys_exit(struct pt_regs *regs)
> +{
> +}
> +
> +static inline void umcg_notify_resume(struct pt_regs *regs)
> +{
> +}
> +
> +static inline void umcg_clear_child(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_handle_exit(void)
> +{
> +}
> +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +}
> +
>  #endif
>
>  #ifdef CONFIG_DEBUG_RSEQ
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -72,6 +72,7 @@ struct open_how;
>  struct mount_attr;
>  struct landlock_ruleset_attr;
>  enum landlock_rule_type;
> +struct umcg_task;
>
>  #include <linux/types.h>
>  #include <linux/aio_abi.h>
> @@ -1057,6 +1058,8 @@ asmlinkage long sys_landlock_add_rule(in
>                 const void __user *rule_attr, __u32 flags);
>  asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
>  asmlinkage long sys_memfd_secret(unsigned int flags);
> +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock);
> +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
>
>  /*
>   * Architecture-specific system calls
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -46,6 +46,7 @@ enum syscall_work_bit {
>         SYSCALL_WORK_BIT_SYSCALL_AUDIT,
>         SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
>         SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP,
> +       SYSCALL_WORK_BIT_SYSCALL_UMCG,
>  };
>
>  #define SYSCALL_WORK_SECCOMP           BIT(SYSCALL_WORK_BIT_SECCOMP)
> @@ -55,6 +56,7 @@ enum syscall_work_bit {
>  #define SYSCALL_WORK_SYSCALL_AUDIT     BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
>  #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
>  #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP)
> +#define SYSCALL_WORK_SYSCALL_UMCG      BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG)
>  #endif
>
>  #include <asm/thread_info.h>
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -883,8 +883,13 @@ __SYSCALL(__NR_process_mrelease, sys_pro
>  #define __NR_futex_waitv 449
>  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>
> +#define __NR_umcg_ctl 450
> +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
> +#define __NR_umcg_wait 451
> +__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
>  #undef __NR_syscalls
> -#define __NR_syscalls 450
> +
> +#define __NR_syscalls 452
>
>  /*
>   * 32 bit systems traditionally used different
> --- /dev/null
> +++ b/include/uapi/linux/umcg.h
> @@ -0,0 +1,117 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * UMCG: User Managed Concurrency Groups.
> + *
> + * Syscalls (see kernel/sched/umcg.c):
> + *      sys_umcg_ctl()  - register/unregister UMCG tasks;
> + *      sys_umcg_wait() - wait/wake/context-switch.
> + *
> + * struct umcg_task (below): controls the state of UMCG tasks.
> + */
> +
> +/*
> + * UMCG task states, the first 8 bits of struct umcg_task.state.
> + * The states represent the user space point of view.
> + *
> + *   ,--------(TF_PREEMPT + notify_resume)-------. ,------------.
> + *   |                                           v |            |
> + * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE  (signal + notify_resume)
> + *   ^                                           | ^            |
> + *   `--------------(sys_umcg_wait)--------------' `------------'
> + *
> + */
> +#define UMCG_TASK_NONE                 0x0000U
> +#define UMCG_TASK_RUNNING              0x0001U
> +#define UMCG_TASK_RUNNABLE             0x0002U
> +#define UMCG_TASK_BLOCKED              0x0003U
> +
> +#define UMCG_TASK_MASK                 0x00ffU
> +
> +/*
> + * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted.
> + *
> + * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent
> + * return-to-user (eg signal) will perform the equivalent of sys_umcg_wait() on
> + * it. That is, it will wake next_tid/server_tid, transition to RUNNABLE and
> + * enqueue on the server's runnable list.
> + *
> + */
> +#define UMCG_TF_PREEMPT                        0x0100U
> +
> +#define UMCG_TF_MASK                   0xff00U
> +
> +#define UMCG_TASK_ALIGN                        64
> +
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> +       /**
> +        * @state: the current state of the UMCG task described by
> +        *         this struct.
> +        *
> +        * Readable/writable by both the kernel and the userspace.
> +        *
> +        * UMCG task state:
> +        *   bits  0 -  7: task state;
> +        *   bits  8 - 15: state flags;
> +        *   bits 16 - 31: for userspace use;
> +        */
> +       __u32   state;                          /* r/w */
> +
> +       /**
> +        * @next_tid: the TID of the UMCG task that should be context-switched
> +        *            into in sys_umcg_wait(). Can be zero, in which case
> +        *            it'll switch to server_tid.
> +        *
> +        * @server_tid: the TID of the UMCG server that hosts this task;
> +        *              when RUNNABLE, this task will get added to its
> +        *              runnable_workers_ptr list.
> +        *
> +        * Read-only for the kernel, read/write for the userspace.
> +        */
> +       __u32   next_tid;                       /* r   */
> +       __u32   server_tid;                     /* r   */
> +
> +       __u32   __hole[1];
> +
> +       /*
> +        * Timestamps for when we last became BLOCKED / RUNNABLE, in the clock chosen via sys_umcg_ctl().
> +        */
> +       __u64   blocked_ts;                     /*   w */
> +       __u64   runnable_ts;                    /*   w */
> +
> +       /**
> +        * @runnable_workers_ptr: a single-linked list of runnable workers.
> +        *
> +        * Readable/writable by both the kernel and the userspace: the
> +        * kernel adds items to the list, userspace removes them.
> +        */
> +       __u64   runnable_workers_ptr;           /* r/w */
> +
> +       __u64   __zero[3];
> +
> +} __attribute__((packed, aligned(UMCG_TASK_ALIGN)));
> +
> +/**
> + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
> + * @UMCG_CTL_REGISTER:   register the current task as a UMCG task
> + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> + * @UMCG_CTL_WORKER:     register the current task as a UMCG worker
> + */
> +enum umcg_ctl_flag {
> +       UMCG_CTL_REGISTER       = 0x00001,
> +       UMCG_CTL_UNREGISTER     = 0x00002,
> +       UMCG_CTL_WORKER         = 0x10000,
> +};
> +
> +#endif /* _UAPI_LINUX_UMCG_H */
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1693,6 +1693,21 @@ config MEMBARRIER
>
>           If unsure, say Y.
>
> +config HAVE_UMCG
> +       bool
> +
> +config UMCG
> +       bool "Enable User Managed Concurrency Groups API"
> +       depends on 64BIT
> +       depends on GENERIC_ENTRY
> +       depends on HAVE_UMCG
> +       default n
> +       help
> +         Enable User Managed Concurrency Groups API, which form the basis
> +         for an in-process M:N userspace scheduling framework.
> +         At the moment this is an experimental/RFC feature that is not
> +         guaranteed to be backward-compatible.
> +
>  config KALLSYMS
>         bool "Load all symbols for debugging/ksymoops" if EXPERT
>         default y
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/sched.h>
>
>  #include "common.h"
>
> @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p
>         if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
>                 trace_sys_enter(regs, syscall);
>
> +       if (work & SYSCALL_WORK_SYSCALL_UMCG)
> +               umcg_sys_enter(regs, syscall);
> +
>         syscall_enter_audit(regs, syscall);
>
>         return ret ? : syscall;
> @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l
>          * Before returning to user space ensure that all pending work
>          * items have been completed.
>          */
> -       while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +       do {
>                 local_irq_enable_exit_to_user(ti_work);
>
>                 if (ti_work & _TIF_NEED_RESCHED)
> @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l
>                 if (ti_work & _TIF_PATCH_PENDING)
>                         klp_update_patch_state(current);
>
> +               /* must be before handle_signal_work(); terminates on sigpending */
> +               if (ti_work & _TIF_UMCG)
> +                       umcg_notify_resume(regs);
> +
>                 if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>                         handle_signal_work(regs, ti_work);
>
> @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l
>                 tick_nohz_user_enter_prepare();
>
>                 ti_work = READ_ONCE(current_thread_info()->flags);
> -       }
> +       } while (ti_work & EXIT_TO_USER_MODE_WORK);
>
>         /* Return the latest work state for arch_exit_to_user_mode() */
>         return ti_work;
> @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st
>         /* Flush pending rcuog wakeup before the last need_resched() check */
>         tick_nohz_user_enter_prepare();
>
> -       if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
> +       if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG)))
>                 ti_work = exit_to_user_mode_loop(regs, ti_work);
>
>         arch_exit_to_user_mode_prepare(regs, ti_work);
> @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_
>         step = report_single_step(work);
>         if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
>                 arch_syscall_exit_tracehook(regs, step);
> +
> +       if (work & SYSCALL_WORK_SYSCALL_UMCG)
> +               umcg_sys_exit(regs);
>  }
>
>  /*
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -749,6 +749,10 @@ void __noreturn do_exit(long code)
>         if (unlikely(!tsk->pid))
>                 panic("Attempted to kill the idle task!");
>
> +       /* Turn off UMCG sched hooks. */
> +       if (unlikely(tsk->flags & PF_UMCG_WORKER))
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +
>         /*
>          * If do_exit is called because this processes oopsed, it's possible
>          * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> @@ -786,6 +790,7 @@ void __noreturn do_exit(long code)
>
>         io_uring_files_cancel();
>         exit_signals(tsk);  /* sets PF_EXITING */
> +       umcg_handle_exit();
>
>         /* sync mm's RSS info before statistics gathering */
>         if (tsk->mm)
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
>  obj-$(CONFIG_CPU_ISOLATION) += isolation.o
>  obj-$(CONFIG_PSI) += psi.o
>  obj-$(CONFIG_SCHED_CORE) += core_sched.o
> +obj-$(CONFIG_UMCG) += umcg.o
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3977,8 +3977,7 @@ bool ttwu_state_match(struct task_struct
>   * Return: %true if @p->state changes (an actual wakeup was done),
>   *        %false otherwise.
>   */
> -static int
> -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> +int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  {
>         unsigned long flags;
>         int cpu, success = 0;
> @@ -4270,6 +4269,7 @@ static void __sched_fork(unsigned long c
>         p->wake_entry.u_flags = CSD_TYPE_TTWU;
>         p->migration_pending = NULL;
>  #endif
> +       umcg_clear_child(p);
>  }
>
>  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6328,9 +6328,11 @@ static inline void sched_submit_work(str
>          * If a worker goes to sleep, notify and ask workqueue whether it
>          * wants to wake up a task to maintain concurrency.
>          */
> -       if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +       if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
>                 if (task_flags & PF_WQ_WORKER)
>                         wq_worker_sleeping(tsk);
> +               else if (task_flags & PF_UMCG_WORKER)
> +                       umcg_wq_worker_sleeping(tsk);
>                 else
>                         io_wq_worker_sleeping(tsk);
>         }
> @@ -6348,9 +6350,11 @@ static inline void sched_submit_work(str
>
>  static void sched_update_worker(struct task_struct *tsk)
>  {
> -       if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> +       if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
>                 if (tsk->flags & PF_WQ_WORKER)
>                         wq_worker_running(tsk);
> +               else if (tsk->flags & PF_UMCG_WORKER)
> +                       umcg_wq_worker_running(tsk);
>                 else
>                         io_wq_worker_running(tsk);
>         }
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6890,6 +6890,10 @@ select_task_rq_fair(struct task_struct *
>         if (wake_flags & WF_TTWU) {
>                 record_wakee(p);
>
> +               if ((wake_flags & WF_CURRENT_CPU) &&
> +                   cpumask_test_cpu(cpu, p->cpus_ptr))
> +                       return cpu;
> +
>                 if (sched_energy_enabled()) {
>                         new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>                         if (new_cpu >= 0)
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s
>  }
>
>  /* Wake flags. The first three directly map to some SD flag value */
> -#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
> -#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
> -#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
> -
> -#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
> -#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
> -#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
> +#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
> +#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
> +#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
> +
> +#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
> +#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
> +#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
> +#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
>
>  #ifdef CONFIG_SMP
>  static_assert(WF_EXEC == SD_BALANCE_EXEC);
> @@ -3076,6 +3077,8 @@ static inline bool is_per_cpu_kthread(st
>  extern void swake_up_all_locked(struct swait_queue_head *q);
>  extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
>
> +extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  extern int preempt_dynamic_mode;
>  extern int sched_dynamic_mode(const char *str);
> --- /dev/null
> +++ b/kernel/sched/umcg.c
> @@ -0,0 +1,744 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * User Managed Concurrency Groups (UMCG).
> + *
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/umcg.h>
> +
> +#include <asm/syscall.h>
> +
> +#include "sched.h"
> +
> +static struct task_struct *umcg_get_task(u32 tid)
> +{
> +       struct task_struct *tsk = NULL;
> +
> +       if (tid) {
> +               rcu_read_lock();
> +               tsk = find_task_by_vpid(tid);
> +               if (tsk && current->mm == tsk->mm && tsk->umcg_task)
> +                       get_task_struct(tsk);
> +               else
> +                       tsk = NULL;
> +               rcu_read_unlock();
> +       }
> +
> +       return tsk;
> +}
> +
> +/**
> + * umcg_pin_pages: pin pages containing struct umcg_task of this worker
> + *                 and its server.
> + */
> +static int umcg_pin_pages(void)
> +{
> +       struct task_struct *server = NULL, *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       int server_tid;
> +
> +       if (tsk->umcg_worker_page ||
> +           tsk->umcg_server_page ||
> +           tsk->umcg_server_task ||
> +           tsk->umcg_server)
> +               return -EBUSY;
> +
> +       if (get_user(server_tid, &self->server_tid))
> +               return -EFAULT;
> +
> +       server = umcg_get_task(server_tid);
> +       if (!server)
> +               return -EINVAL;
> +
> +       if (pin_user_pages_fast((unsigned long)self, 1, 0,
> +                               &tsk->umcg_worker_page) != 1)
> +               goto clear_self;
> +
> +       /* must cache due to possible concurrent change vs access_ok() */
> +       tsk->umcg_server_task = server->umcg_task;
> +       if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0,
> +                               &tsk->umcg_server_page) != 1)
> +               goto clear_server;
> +
> +       tsk->umcg_server = server;
> +
> +       return 0;
> +
> +clear_server:
> +       tsk->umcg_server_task = NULL;
> +       tsk->umcg_server_page = NULL;
> +
> +       unpin_user_page(tsk->umcg_worker_page);
> +clear_self:
> +       tsk->umcg_worker_page = NULL;
> +       put_task_struct(server);
> +
> +       return -EFAULT;
> +}
> +
> +static void umcg_unpin_pages(void)
> +{
> +       struct task_struct *tsk = current;
> +
> +       if (tsk->umcg_server) {
> +               unpin_user_page(tsk->umcg_worker_page);
> +               tsk->umcg_worker_page = NULL;
> +
> +               unpin_user_page(tsk->umcg_server_page);
> +               tsk->umcg_server_page = NULL;
> +               tsk->umcg_server_task = NULL;
> +
> +               put_task_struct(tsk->umcg_server);
> +               tsk->umcg_server = NULL;
> +       }
> +}
> +
> +static void umcg_clear_task(struct task_struct *tsk)
> +{
> +       /*
> +        * This is either called for the current task, or for a newly forked
> +        * task that is not yet running, so we don't need strict atomicity
> +        * below.
> +        */
> +       if (tsk->umcg_task) {
> +               WRITE_ONCE(tsk->umcg_task, NULL);
> +               tsk->umcg_server = NULL;
> +
> +               /* These can be simple writes - see the comment above. */
> +               tsk->umcg_worker_page = NULL;
> +               tsk->umcg_server_page = NULL;
> +               tsk->umcg_server_task = NULL;
> +
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +               clear_task_syscall_work(tsk, SYSCALL_UMCG);
> +               clear_tsk_thread_flag(tsk, TIF_UMCG);
> +       }
> +}
> +
> +/* Called for a forked or execve-ed child. */
> +void umcg_clear_child(struct task_struct *tsk)
> +{
> +       umcg_clear_task(tsk);
> +}
> +
> +/* Called by both normally (unregister) and abnormally exiting workers. */
> +void umcg_worker_exit(void)
> +{
> +       umcg_unpin_pages();
> +       umcg_clear_task(current);
> +}
> +
> +/*
> + * Do a state transition, @from -> @to, and possibly read @next after that.
> + *
> + * Will clear UMCG_TF_PREEMPT.
> + *
> + * When @to == {BLOCKED,RUNNABLE}, update timestamps.
> + *
> + * Returns:
> + *   0: success
> + *   -EAGAIN: when self->state != @from
> + *   -EFAULT
> + */
> +static int umcg_update_state(struct task_struct *tsk, u32 from, u32 to, u32 *next)
> +{
> +       struct umcg_task *self = tsk->umcg_task;
> +       u32 old, new;
> +       u64 now;
> +
> +       if (to >= UMCG_TASK_RUNNABLE) {
> +               switch (tsk->umcg_clock) {
> +               case CLOCK_REALTIME:      now = ktime_get_real_ns();     break;
> +               case CLOCK_MONOTONIC:     now = ktime_get_ns();          break;
> +               case CLOCK_BOOTTIME:      now = ktime_get_boottime_ns(); break;
> +               case CLOCK_TAI:           now = ktime_get_clocktai_ns(); break;
> +               }
> +       }
> +
> +       if (!user_access_begin(self, sizeof(*self)))
> +               return -EFAULT;
> +
> +       unsafe_get_user(old, &self->state, Efault);
> +       do {
> +               if ((old & UMCG_TASK_MASK) != from)
> +                       goto fail;
> +
> +               new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
> +               new |= to & UMCG_TASK_MASK;
> +
> +       } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
> +
> +       if (to == UMCG_TASK_BLOCKED)
> +               unsafe_put_user(now, &self->blocked_ts, Efault);
> +       if (to == UMCG_TASK_RUNNABLE)
> +               unsafe_put_user(now, &self->runnable_ts, Efault);
> +
> +       if (next)
> +               unsafe_get_user(*next, &self->next_tid, Efault);
> +
> +       user_access_end();
> +       return 0;
> +
> +fail:
> +       user_access_end();
> +       return -EAGAIN;
> +
> +Efault:
> +       user_access_end();
> +       return -EFAULT;
> +}
> +
> +/* Called from syscall enter path */
> +void umcg_sys_enter(struct pt_regs *regs, long syscall)
> +{
> +       /* avoid recursion vs our own syscalls */
> +       if (syscall == __NR_umcg_wait ||
> +           syscall == __NR_umcg_ctl)
> +               return;
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (umcg_pin_pages())
> +               goto die;
> +
> +       current->flags |= PF_UMCG_WORKER;
> +       return;
> +
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +static int umcg_wake_task(struct task_struct *tsk)
> +{
> +       int ret = umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +       if (ret)
> +               return ret;
> +
> +       try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
> +       return 0;
> +}
> +
> +/*
> + * Wake @next_tid or server.
> + *
> + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
> + *
> + * Returns:
> + *   0: success
> + *   -EFAULT
> + */
> +static int umcg_wake_next(struct task_struct *tsk, u32 next_tid)
> +{
> +       struct task_struct *next = NULL;
> +       int ret;
> +
> +       next = umcg_get_task(next_tid);
> +       /*
> +        * umcg_wake_task(next) might fault; if we cannot fault, we'll eat it
> +        * and 'spuriously' not wake @next_tid but instead try and wake the
> +        * server.
> +        *
> +        * XXX: we can fix this by adding umcg_next_page to umcg_pin_pages().
> +        *
> +        * umcg_wake_task() can also fail due to next not having the right
> +        * state, then too will we try and wake the server.
> +        *
> +        * If we cannot wake the server due to state issues, too bad.
> +        */
> +       if (!next || umcg_wake_task(next)) {
> +               ret = umcg_wake_task(tsk->umcg_server);
> +               if (ret == -EFAULT)
> +                       goto out;
> +       }
> +       ret = 0;
> +out:
> +       if (next)
> +               put_task_struct(next);
> +
> +       return ret;
> +}
> +
> +/* pre-schedule() */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +       int next_tid;
> +
> +       /* Must not fault, mmap_sem might be held. */
> +       pagefault_disable();
> +
> +       if (WARN_ON_ONCE(!tsk->umcg_server))
> +               goto die;
> +
> +       if (umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, &next_tid))
> +               goto die;
> +
> +       if (umcg_wake_next(tsk, next_tid))
> +               goto die;
> +
> +       pagefault_enable();
> +
> +       /*
> +        * We're going to sleep; make sure to unpin the pages. This ensures
> +        * the pins are temporary.
> +        */
> +       umcg_unpin_pages();
> +
> +       return;
> +
> +die:
> +       pagefault_enable();
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +/* post-schedule() */
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +       /* nothing here, see umcg_sys_exit() */
> +}
> +
> +/*
> + * Enqueue @tsk on its server's runnable list
> + *
> + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server.
> + *
> + * cmpxchg-based singly linked list add such that list integrity is never
> + * violated.  Userspace *MUST* remove it from the list before changing ->state.
> + * As such, we must change state to RUNNABLE before enqueue.
> + *
> + * Returns:
> + *   0: success
> + *   -EFAULT
> + */
> +static int umcg_enqueue_runnable(struct task_struct *tsk)
> +{
> +       struct umcg_task __user *server = tsk->umcg_server_task;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       u64 self_ptr = (unsigned long)self;
> +       u64 first_ptr;
> +
> +       /*
> +        * umcg_pin_pages() did access_ok() on both pointers, use self here
> +        * only because __user_access_begin() isn't available in generic code.
> +        */
> +       if (!user_access_begin(self, sizeof(*self)))
> +               return -EFAULT;
> +
> +       unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault);
> +       do {
> +               unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault);
> +       } while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault));
> +
> +       user_access_end();
> +       return 0;
> +
> +Efault:
> +       user_access_end();
> +       return -EFAULT;
> +}
> +
> +/*
> + * umcg_wait: Wait for ->state to become RUNNING
> + *
> + * Returns:
> + *   0: success
> + *   -EINTR: pending signal
> + *   -EINVAL: ->state is not {RUNNABLE,RUNNING}
> + *   -ETIMEDOUT
> + *   -EFAULT
> + */
> +int umcg_wait(u64 timo)
> +{
> +       struct task_struct *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       struct hrtimer_sleeper timeout;
> +       struct page *page = NULL;
> +       u32 state;
> +       int ret;
> +
> +       if (timo) {
> +               hrtimer_init_sleeper_on_stack(&timeout, tsk->umcg_clock,
> +                                             HRTIMER_MODE_ABS);
> +               hrtimer_set_expires_range_ns(&timeout.timer, (s64)timo,
> +                                            tsk->timer_slack_ns);
> +       }
> +
> +       for (;;) {
> +               set_current_state(TASK_INTERRUPTIBLE);
> +
> +               ret = -EINTR;
> +               if (signal_pending(current))
> +                       break;
> +
> +               /*
> +                * Faults can block and scribble our wait state.
> +                */
> +               pagefault_disable();
> +               if (get_user(state, &self->state)) {
> +                       pagefault_enable();
> +
> +                       ret = -EFAULT;
> +                       if (page) {
> +                               unpin_user_page(page);
> +                               page = NULL;
> +                               break;
> +                       }
> +
> +                       if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) {
> +                               page = NULL;
> +                               break;
> +                       }
> +
> +                       continue;
> +               }
> +
> +               if (page) {
> +                       unpin_user_page(page);
> +                       page = NULL;
> +               }
> +               pagefault_enable();
> +
> +               state &= UMCG_TASK_MASK;
> +               if (state != UMCG_TASK_RUNNABLE) {
> +                       ret = 0;
> +                       if (state == UMCG_TASK_RUNNING)
> +                               break;
> +
> +                       ret = -EINVAL;
> +                       break;
> +               }
> +
> +               if (timo)
> +                       hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> +               freezable_schedule();
> +
> +               ret = -ETIMEDOUT;
> +               if (timo && !timeout.task)
> +                       break;
> +       }
> +       __set_current_state(TASK_RUNNING);
> +
> +       if (timo) {
> +               hrtimer_cancel(&timeout.timer);
> +               destroy_hrtimer_on_stack(&timeout.timer);
> +       }
> +
> +       return ret;
> +}
> +
> +void umcg_sys_exit(struct pt_regs *regs)
> +{
> +       struct task_struct *tsk = current;
> +       long syscall = syscall_get_nr(tsk, regs);
> +
> +       if (syscall == __NR_umcg_wait)
> +               return;
> +
> +       /*
> +        * sys_umcg_ctl() will get here without having called umcg_sys_enter();
> +        * as such it will look like a syscall that blocked.
> +        */
> +
> +       if (tsk->umcg_server) {
> +               /*
> +                * Didn't block, we're done.
> +                */
> +               umcg_unpin_pages();
> +               return;
> +       }
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (umcg_pin_pages())
> +               goto die;
> +
> +       if (umcg_update_state(tsk, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE, NULL))
> +               goto die_unpin;
> +
> +       if (umcg_enqueue_runnable(tsk))
> +               goto die_unpin;
> +
> +       /* server might not be runnable, too bad */
> +       if (umcg_wake_task(tsk->umcg_server) == -EFAULT)
> +               goto die_unpin;
> +
> +       umcg_unpin_pages();
> +
> +       switch (umcg_wait(0)) {
> +       case -EFAULT:
> +       case -EINVAL:
> +       case -ETIMEDOUT: /* how!?! */
> +               goto die;
> +
> +       case -EINTR:
> +               /* notify_resume will continue the wait after the signal */
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       current->flags |= PF_UMCG_WORKER;
> +
> +       return;
> +
> +die_unpin:
> +       umcg_unpin_pages();
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +void umcg_notify_resume(struct pt_regs *regs)
> +{
> +       struct task_struct *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       u32 state, next_tid;
> +
> +       /* avoid recursion vs schedule() */
> +       current->flags &= ~PF_UMCG_WORKER;
> +
> +       if (get_user(state, &self->state))
> +               goto die;
> +
> +       state &= UMCG_TASK_MASK | UMCG_TF_MASK;
> +       if (state == UMCG_TASK_RUNNING)
> +               goto done;
> +
> +       if (state & UMCG_TF_PREEMPT) {
> +               umcg_pin_pages();
> +
> +               if (umcg_update_state(tsk, UMCG_TASK_RUNNING,
> +                                     UMCG_TASK_RUNNABLE, &next_tid))
> +                       goto die_unpin;
> +
> +               if (umcg_enqueue_runnable(tsk))
> +                       goto die_unpin;
> +
> +               if (umcg_wake_next(tsk, next_tid))
> +                       goto die_unpin;
> +
> +               umcg_unpin_pages();
> +       }
> +
> +       switch (umcg_wait(0)) {
> +       case -EFAULT:
> +       case -EINVAL:
> +       case -ETIMEDOUT: /* how!?! */
> +               goto die;
> +
> +       case -EINTR:
> +                /* we'll continue the wait after the signal */
> +               break;
> +       default:
> +               break;
> +       }
> +
> +done:
> +       current->flags |= PF_UMCG_WORKER;
> +       return;
> +
> +die_unpin:
> +       umcg_unpin_pages();
> +die:
> +       current->flags |= PF_UMCG_WORKER;
> +       pr_warn("%s: killing task %d\n", __func__, current->pid);
> +       force_sig(SIGKILL);
> +}
> +
> +/**
> + * sys_umcg_wait: put the current task to sleep and/or wake another task.
> + * @flags:        zero or a value from enum umcg_wait_flag.
> + * @timo:         absolute timeout, in nanoseconds of the clock chosen at
> + *                registration; zero for no timeout.
> + *
> + * Returns:
> + * 0             - OK;
> + * -ETIMEDOUT    - the timeout expired;
> + * -EFAULT       - failed accessing struct umcg_task __user of the current
> + *                 task, the server or next.
> + * -ESRCH        - the task to wake not found or not a UMCG task;
> + * -EINVAL       - another error happened (e.g. the current task is not a
> + *                 UMCG task, etc.)
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo)
> +{
> +       struct task_struct *next, *tsk = current;
> +       struct umcg_task __user *self = tsk->umcg_task;
> +       bool worker = tsk->flags & PF_UMCG_WORKER;
> +       u32 next_tid;
> +       int ret;
> +
> +       if (!self || flags)
> +               return -EINVAL;
> +
> +       if (worker)
> +               tsk->flags &= ~PF_UMCG_WORKER;
> +
> +       /* see umcg_sys_{enter,exit}() */
> +       umcg_pin_pages();
> +
> +       ret = umcg_update_state(tsk, UMCG_TASK_RUNNING, UMCG_TASK_RUNNABLE, &next_tid);
> +       if (ret)
> +               goto unpin;
> +
> +       next = umcg_get_task(next_tid);
> +       if (!next) {
> +               ret = -ESRCH;
> +               goto unblock;
> +       }
> +
> +       if (worker) {
> +               ret = umcg_enqueue_runnable(tsk);
> +               if (ret)
> +                       goto put_task;
> +       }
> +
> +       ret = umcg_wake_task(next);
> +       if (ret)
> +               goto put_task;
> +
> +       put_task_struct(next);
> +       umcg_unpin_pages();
> +
> +       ret = umcg_wait(timo);
> +       switch (ret) {
> +       case -EINTR:    /* umcg_notify_resume() will continue the wait */
> +       case 0:         /* all done */
> +               ret = 0;
> +               break;
> +
> +       default:
> +               /*
> +                * If this fails you get to keep the pieces; you'll get stuck
> +                * in umcg_notify_resume().
> +                */
> +               umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +               break;
> +       }
> +out:
> +       if (worker)
> +               tsk->flags |= PF_UMCG_WORKER;
> +       return ret;
> +
> +put_task:
> +       put_task_struct(next);
> +unblock:
> +       umcg_update_state(tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING, NULL);
> +unpin:
> +       umcg_unpin_pages();
> +       goto out;
> +}
> +
> +/**
> + * sys_umcg_ctl: (un)register the current task as a UMCG task.
> + * @flags:       ORed values from enum umcg_ctl_flag; see below;
> + * @self:        a pointer to struct umcg_task that describes this
> + *               task and governs the behavior of sys_umcg_wait if
> + *               registering; must be NULL if unregistering.
> + * @which_clock: clock to use for the state timestamps and for
> + *               sys_umcg_wait timeouts; one of CLOCK_REALTIME,
> + *               CLOCK_MONOTONIC, CLOCK_BOOTTIME or CLOCK_TAI.
> + *
> + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> + *         UMCG workers:
> + *              - @flags & UMCG_CTL_WORKER
> + *              - self->state must be UMCG_TASK_BLOCKED
> + *         UMCG servers:
> + *              - !(@flags & UMCG_CTL_WORKER)
> + *              - self->state must be UMCG_TASK_RUNNING
> + *
> + *         All tasks:
> + *              - self->next_tid must be zero
> + *
> + *         If the conditions above are met, sys_umcg_ctl() immediately returns
> + *         if the registered task is a server; a worker will be added to
> + *         runnable_workers_ptr, and the worker put to sleep; a runnable server
> + *         from runnable_server_tid_ptr will be woken, if present.
> + *
> + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
> + *           is a UMCG worker, userspace is responsible for waking its
> + *           server (before or after calling sys_umcg_ctl).
> + *
> + * Return:
> + * 0                - success
> + * -EFAULT          - failed to read @self
> + * -EINVAL          - some other error occurred
> + */
> +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock)
> +{
> +       struct umcg_task ut;
> +
> +       if ((unsigned long)self % UMCG_TASK_ALIGN)
> +               return -EINVAL;
> +
> +       if (flags == UMCG_CTL_UNREGISTER) {
> +               if (self || !current->umcg_task)
> +                       return -EINVAL;
> +
> +               if (current->flags & PF_UMCG_WORKER)
> +                       umcg_worker_exit();
> +               else
> +                       umcg_clear_task(current);
> +
> +               return 0;
> +       }
> +
> +       if (!(flags & UMCG_CTL_REGISTER))
> +               return -EINVAL;
> +
> +       switch (which_clock) {
> +       case CLOCK_REALTIME:
> +       case CLOCK_MONOTONIC:
> +       case CLOCK_BOOTTIME:
> +       case CLOCK_TAI:
> +               current->umcg_clock = which_clock;
> +               break;
> +
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       flags &= ~UMCG_CTL_REGISTER;
> +       if (flags && flags != UMCG_CTL_WORKER)
> +               return -EINVAL;
> +
> +       if (current->umcg_task || !self)
> +               return -EINVAL;
> +
> +       if (copy_from_user(&ut, self, sizeof(ut)))
> +               return -EFAULT;
> +
> +       if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2])
> +               return -EINVAL;
> +
> +       if (flags == UMCG_CTL_WORKER) {
> +               if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
> +                       return -EINVAL;
> +
> +               WRITE_ONCE(current->umcg_task, self);
> +               current->flags |= PF_UMCG_WORKER;       /* hook schedule() */
> +               set_syscall_work(SYSCALL_UMCG);         /* hook syscall */
> +               set_thread_flag(TIF_UMCG);              /* hook return-to-user */
> +
> +               /* umcg_sys_exit() will transition to RUNNABLE and wait */
> +
> +       } else {
> +               if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
> +                       return -EINVAL;
> +
> +               WRITE_ONCE(current->umcg_task, self);
> +               set_thread_flag(TIF_UMCG);              /* hook return-to-user */
> +
> +               /* umcg_notify_resume() would block if not RUNNING */
> +       }
> +
> +       return 0;
> +}
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -273,6 +273,10 @@ COND_SYSCALL(landlock_create_ruleset);
>  COND_SYSCALL(landlock_add_rule);
>  COND_SYSCALL(landlock_restrict_self);
>
> +/* kernel/sched/umcg.c */
> +COND_SYSCALL(umcg_ctl);
> +COND_SYSCALL(umcg_wait);
> +
>  /* arch/example/kernel/sys_example.c */
>
>  /* mm/fadvise.c */
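
For reference, a minimal userspace sketch of the registration flow documented
in sys_umcg_ctl() above. It assumes the uapi header added by this patchset
(<linux/umcg.h>) and the syscall numbers from the patched kernel headers, and
it uses the field names as spelled in the code above (::state, ::server_tid),
which may differ from the v0.9.1 uapi header; it is an illustration, not part
of the patch.

	#include <linux/umcg.h>
	#include <sys/syscall.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <unistd.h>

	static struct umcg_task *umcg_task_alloc(void)
	{
		struct umcg_task *ut;

		/* sys_umcg_ctl() rejects pointers not aligned to UMCG_TASK_ALIGN */
		if (posix_memalign((void **)&ut, UMCG_TASK_ALIGN, sizeof(*ut)))
			abort();
		memset(ut, 0, sizeof(*ut));	/* next_tid, __hole, __zero must be 0 */
		return ut;
	}

	/* Register the calling thread as a server; servers register RUNNING. */
	static struct umcg_task *umcg_register_server(void)
	{
		struct umcg_task *self = umcg_task_alloc();

		self->state = UMCG_TASK_RUNNING;
		if (syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, self, CLOCK_MONOTONIC))
			abort();
		return self;
	}

	/*
	 * Register the calling thread as a worker; workers register BLOCKED,
	 * show up on their server's runnable_workers_ptr list and sleep until
	 * a server runs them.
	 */
	static struct umcg_task *umcg_register_worker(uint32_t server_tid)
	{
		struct umcg_task *self = umcg_task_alloc();

		self->server_tid = server_tid;	/* assumed: worker needs an owner */
		self->state = UMCG_TASK_BLOCKED;
		if (syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER | UMCG_CTL_WORKER,
			    self, CLOCK_MONOTONIC))
			abort();
		return self;
	}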

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-27  0:45             ` Thomas Gleixner
@ 2021-11-29 15:05               ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-29 15:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Sat, Nov 27, 2021 at 01:45:20AM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 22:59, Peter Zijlstra wrote:
> > On Fri, Nov 26, 2021 at 10:08:14PM +0100, Thomas Gleixner wrote:
> >> > +		if (timo)
> >> > +			hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> >> > +
> >> > +		freezable_schedule();
> >> 
> >> You can replace the whole hrtimer foo with
> >> 
> >>                 if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL,
> >>                                                     tsk->timer_slack_ns,
> >>                                                     HRTIMER_MODE_ABS,
> >>                                                     tsk->umcg_clock)) {
> >>                 	ret = -ETIMEOUT;
> >>                         break;
> >>                 }
> >
> > That seems to loose the freezable crud.. then again, since we're
> > interruptible, that shouldn't matter. Lemme go do that.
> 
> We could add a freezable wrapper for that if necessary.

I should just finish rewriting that freezer crap and then we can delete
it all :-) But I don't think that's needed in this case, as long as
we're interruptible we'll pass through the signal path which has a
try_to_freezer() in it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-27  1:16           ` Thomas Gleixner
@ 2021-11-29 15:07             ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-29 15:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Sat, Nov 27, 2021 at 02:16:43AM +0100, Thomas Gleixner wrote:

> So yes, this should work, but I hate the sticky nature of TIF_UMCG. I
> have no real good idea how to avoid that yet, but let me think about it
> some more.

Yeah, that, I couldn't come up with anything saner either.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29  0:29         ` Peter Oskolkov
@ 2021-11-29 16:41           ` Peter Zijlstra
  2021-11-29 17:34             ` Peter Oskolkov
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-29 16:41 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:

> wait_wake_only is not needed if you have both next_tid and server_tid,
> as your patch has. In my version of the patch, next_tid is the same as
> server_tid, so the flag is needed to indicate to the kernel that
> next_tid is the wakee, not the server.

Ah, okay.

> re: (idle_)server_tid_ptr: it seems that you assume that blocked
> workers keep their servers, while in my patch they "lose them" once
> they block, and so there should be a global idle server pointer to
> wake the server in my scheme (if there is an idle one). The main
> difference is that in my approach a server has only a single, running,
> worker assigned to it, while in your approach it can have a number of
> blocked/idle workers to take care of as well.

Correct; I've been thinking in analogues of the way we schedule CPUs.
Each CPU has a ready/run queue along with the current task.
Fundamentally, the RUNNABLE tasks need to go somewhere when all servers
are busy. So at that point the previous server is as good a place as
any.

Now, I sympathise with a blocked task not having a relation; I often
argue the same, since we have wakeup balancing etc. And I've not really
thought about how to best do wakeup-balancing, also see below.

> The main difference between our approaches, as I see it: in my
> approach if a worker is running, its server is sleeping, period. If we
> have N servers, and N running workers, there are no servers to wake
> when a previously blocked worker finishes its blocking op. In your
> approach, it seems that N servers have each a bunch of workers
> pointing at them, and a single worker running. If a previously blocked
> worker wakes up, it wakes the server it was assigned to previously,

Right; it does that. It can check the ::state of its current task,
possibly set TF_PREEMPT or just go back to sleep.

> and so now we have more than N physical tasks/threads running: N
> workers and the woken server. This is not ideal: if the process is
> affined to only N CPUs, that means a worker will be preempted to let
> the woken server run, which is somewhat against the goal of letting
> the workers run more or less uninterrupted. This is not deal breaking,
> but maybe something to keep in mind.

I suppose it's easy enough to make this behaviour configurable though;
simply enqueue and not wake.... Hmm.. how would this worker know if the
server was 'busy' or not? The whole 'current' thing is a user-space
construct. I suppose that's what your pointer was for? Puts an actual
idle server in there, if there is one. Let me ponder that a bit.

However, do note this whole scheme fundamentally has some of that, the
moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
all tasks, they can consume however much time the syscall needs there.

Also, timeout on sys_umcg_wait() gets you the exact same situation (or
worse, multiple running workers).

> Another big concern I have is that you removed UMCG_TF_LOCKED. I

OOh yes, I forgot to mention that. I couldn't figure out what it was
supposed to do.

> definitely needed it to guard workers during "sched work" in the
> userspace in my approach. I'm not sure if the flag is absolutely
> needed with your approach, but most likely it is - the kernel-side
> scheduler does lock tasks and runqueues and disables interrupts and
> migrations and other things so that the scheduling logic is not
> hijacked by concurrent stuff. Why do you assume that the userspace
> scheduling code does not need similar protections?

I've not yet come across a case where this is needed. Migration for
instance is possible when RUNNABLE, simply write ::server_tid before
::state. Userspace just needs to make sure who actually owns the task,
but it can do that outside of this state.
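
In code that is just (a sketch with a stand-in struct and C11 atomics, not
kernel code; the state value is made up): publish ::server_tid, then ::state
with a release store, so the two writes cannot be observed in the other order
by an acquiring reader.

	#include <stdatomic.h>
	#include <stdint.h>

	#define UMCG_TASK_RUNNABLE	2u	/* illustrative value */

	struct uworker {
		_Atomic uint32_t state;
		uint32_t next_tid;
		uint32_t server_tid;
	};

	/* Migrate a RUNNABLE worker: write ::server_tid first, ::state second. */
	static inline void umcg_migrate(struct uworker *w, uint32_t new_server_tid)
	{
		w->server_tid = new_server_tid;
		atomic_store_explicit(&w->state, UMCG_TASK_RUNNABLE,
				      memory_order_release);
	}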

But like I said; I've not yet done the userspace part (and I lost most
of today trying to install a new machine), so perhaps I'll run into it
soon enough.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 16:41           ` Peter Zijlstra
@ 2021-11-29 17:34             ` Peter Oskolkov
  2021-11-29 21:08               ` Peter Zijlstra
  2021-12-06 11:47               ` Peter Zijlstra
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-29 17:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:
>
> > wait_wake_only is not needed if you have both next_tid and server_tid,
> > as your patch has. In my version of the patch, next_tid is the same as
> > server_tid, so the flag is needed to indicate to the kernel that
> > next_tid is the wakee, not the server.
>
> Ah, okay.
>
> > re: (idle_)server_tid_ptr: it seems that you assume that blocked
> > workers keep their servers, while in my patch they "lose them" once
> > they block, and so there should be a global idle server pointer to
> > wake the server in my scheme (if there is an idle one). The main
> > difference is that in my approach a server has only a single, running,
> > worker assigned to it, while in your approach it can have a number of
> > blocked/idle workers to take care of as well.
>
> Correct; I've been thinking in analogues of the way we schedule CPUs.
> Each CPU has a ready/run queue along with the current task.
> Fundamentally, the RUNNABLE tasks need to go somewhere when all servers
> are busy. So at that point the previous server is as good a place as
> any.
>
> Now, I sympathise with a blocked task not having a relation; I often
> argue the same, since we have wakeup balancing etc. And I've not really
> thought about how to best do wakeup-balancing, also see below.
>
> > The main difference between our approaches, as I see it: in my
> > approach if a worker is running, its server is sleeping, period. If we
> > have N servers, and N running workers, there are no servers to wake
> > when a previously blocked worker finishes its blocking op. In your
> > approach, it seems that N servers have each a bunch of workers
> > pointing at them, and a single worker running. If a previously blocked
> > worker wakes up, it wakes the server it was assigned to previously,
>
> Right; it does that. It can check the ::state of its current task,
> possibly set TF_PREEMPT or just go back to sleep.
>
> > and so now we have more than N physical tasks/threads running: N
> > workers and the woken server. This is not ideal: if the process is
> > affined to only N CPUs, that means a worker will be preempted to let
> > the woken server run, which is somewhat against the goal of letting
> > the workers run more or less uninterrupted. This is not deal breaking,
> > but maybe something to keep in mind.
>
> I suppose it's easy enough to make this behaviour configurable though;
> simply enqueue and not wake.... Hmm.. how would this worker know if the
> server was 'busy' or not? The whole 'current' thing is a user-space
> construct. I suppose that's what your pointer was for? Puts an actual
> idle server in there, if there is one. Let me ponder that a bit.

Yes, the idle_server_ptr was there to point to an idle server; this
naturally did wakeup balancing.
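
For readers without the earlier patchset at hand, the mechanism was roughly
the following (a sketch reconstructed from the description above, not the
posted code): a single shared word holds the tid of an idle server, or 0.

	#include <stdatomic.h>
	#include <stdint.h>

	static _Atomic uint32_t idle_server_tid;	/* 0: no server parked */

	/* Server with nothing to run: park ourselves so wakeups can find us. */
	static int server_park(uint32_t self_tid)
	{
		uint32_t expected = 0;

		/* only one server parks here at a time; on success the caller
		 * marks itself RUNNABLE and calls sys_umcg_wait() */
		return atomic_compare_exchange_strong(&idle_server_tid,
						      &expected, self_tid);
	}

	/* Wakeup path: a worker became runnable; claim the idle server, if any. */
	static uint32_t claim_idle_server(void)
	{
		return atomic_exchange(&idle_server_tid, 0);	/* 0 if none */
	}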

>
> However, do note this whole scheme fundamentally has some of that, the
> moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> all tasks, they can consume however much time the syscall needs there.
>
> Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> worse, multiple running workers).

It should not. Timed out workers should be added to the runnable list
and not become running unless a server chooses so. So sys_umcg_wait()
with a timeout should behave similarly to a normal sleep, in that the
server is woken upon the worker blocking, and upon the worker wakeup
the worker is added to the woken workers list and waits for a server
to run it. The only difference is that in a sleep the worker becomes
BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
time.

Why then have sys_umcg_wait() with a timeout at all, instead of
calling nanosleep()? Because the worker in sys_umcg_wait() can be
context-switched into by another worker, or made running by a server;
if the worker is in nanosleep(), it just sleeps.

>
> > Another big concern I have is that you removed UMCG_TF_LOCKED. I
>
> OOh yes, I forgot to mention that. I couldn't figure out what it was
> supposed to do.
>
> > definitely needed it to guard workers during "sched work" in the
> > userspace in my approach. I'm not sure if the flag is absolutely
> > needed with your approach, but most likely it is - the kernel-side
> > scheduler does lock tasks and runqueues and disables interrupts and
> > migrations and other things so that the scheduling logic is not
> > hijacked by concurrent stuff. Why do you assume that the userspace
> > scheduling code does not need similar protections?
>
> I've not yet come across a case where this is needed. Migration for
> instance is possible when RUNNABLE, simply write ::server_tid before
> ::state. Userspace just needs to make sure who actually owns the task,
> but it can do that outside of this state.
>
> But like I said; I've not yet done the userspace part (and I lost most
> of today trying to install a new machine), so perhaps I'll run into it
> soon enough.

The most obvious scenario where I needed locking is when worker A
wants to context switch into worker B, while another worker C wants to
context switch into worker A, and worker A pagefaults. This involves:

worker A context: worker A context switches into worker B:

- worker B::server_tid = worker A::server_tid
- worker A::server_tid = none
- worker A::state = runnable
- worker B::state = running
- worker A::next_tid = worker B
- worker A calls sys_umcg_wait()

worker B context: before the above completes, worker C wants to
context switch into worker A, with similar steps.

"interrupt context": in the middle of the mess above, worker A pagefaults

Too many moving parts. UMCG_TF_LOCKED helped me make this mess
manageable. Maybe without pagefaults clever ordering of the operations
listed above could make things work, but pagefaults mess things badly,
so some kind of "preempt_disable()" for the userspace scheduling code
was needed, and UMCG_TF_LOCKED was the solution I had.
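
In code, the guarded sequence looked roughly like this (a sketch of the idea
only; the flag value and helpers are illustrative, this is not the earlier
patch verbatim):

	#include <stdatomic.h>
	#include <stdint.h>

	#define UMCG_TASK_RUNNING	1u	/* illustrative values */
	#define UMCG_TASK_RUNNABLE	2u
	#define UMCG_TF_LOCKED		0x0400u

	struct uworker {
		_Atomic uint32_t state;
		uint32_t next_tid;
		uint32_t server_tid;
		uint32_t tid;
	};

	static void switch_a_to_b(struct uworker *a, struct uworker *b)
	{
		/* enter "sched code": a pagefault/interrupt hitting A now must
		 * not be treated as A blocking */
		atomic_fetch_or(&a->state, UMCG_TF_LOCKED);

		b->server_tid = a->server_tid;	/* hand the server over */
		a->server_tid = 0;
		a->next_tid   = b->tid;		/* wakee for the kernel */
		atomic_store(&b->state, UMCG_TASK_RUNNING);

		/* leave "sched code": drop LOCKED and become RUNNABLE in one
		 * store, then sys_umcg_wait() performs the actual switch */
		atomic_store(&a->state, UMCG_TASK_RUNNABLE);
		/* syscall(__NR_umcg_wait, 0, 0); */
	}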




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 17:34             ` Peter Oskolkov
@ 2021-11-29 21:08               ` Peter Zijlstra
  2021-11-29 21:29                 ` Peter Zijlstra
  2021-11-29 23:38                 ` Peter Oskolkov
  2021-12-06 11:47               ` Peter Zijlstra
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-29 21:08 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > However, do note this whole scheme fundamentally has some of that, the
> > moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> > all tasks, they can consume however much time the syscall needs there.
> >
> > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > worse, multiple running workers).
> 
> It should not. Timed out workers should be added to the runnable list
> and not become running unless a server chooses so. So sys_umcg_wait()
> with a timeout should behave similarly to a normal sleep, in that the
> server is woken upon the worker blocking, and upon the worker wakeup
> the worker is added to the woken workers list and waits for a server
> to run it. The only difference is that in a sleep the worker becomes
> BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> time.

OK, that's somewhat subtle and I hadn't gotten that either.

Currently it return -ETIMEDOUT in RUNNING state for both server and
worker callers.

Let me go fix that then.

> > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> >
> > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > supposed to do.
> >
> > > definitely needed it to guard workers during "sched work" in the
> > > userspace in my approach. I'm not sure if the flag is absolutely
> > > needed with your approach, but most likely it is - the kernel-side
> > > scheduler does lock tasks and runqueues and disables interrupts and
> > > migrations and other things so that the scheduling logic is not
> > > hijacked by concurrent stuff. Why do you assume that the userspace
> > > scheduling code does not need similar protections?
> >
> > I've not yet come across a case where this is needed. Migration for
> > instance is possible when RUNNABLE, simply write ::server_tid before
> > ::state. Userspace just needs to make sure who actually owns the task,
> > but it can do that outside of this state.
> >
> > But like I said; I've not yet done the userspace part (and I lost most
> > of today trying to install a new machine), so perhaps I'll run into it
> > soon enough.
> 
> The most obvious scenario where I needed locking is when worker A
> wants to context switch into worker B, while another worker C wants to
> context switch into worker A, and worker A pagefaults. This involves:
> 
> worker A context: worker A context switches into worker B:
> 
> - worker B::server_tid = worker A::server_tid
> - worker A::server_tid = none
> - worker A::state = runnable
> - worker B::state = running
> - worker A::next_tid = worker B
> - worker A calls sys_umcg_wait()
> 
> worker B context: before the above completes, worker C wants to
> context switch into worker A, with similar steps.
> 
> "interrupt context": in the middle of the mess above, worker A pagefaults
> 
> Too many moving parts. UMCG_TF_LOCKED helped me make this mess
> manageable. Maybe without pagefaults clever ordering of the operations
> listed above could make things work, but pagefaults mess things badly,
> so some kind of "preempt_disable()" for the userspace scheduling code
> was needed, and UMCG_TF_LOCKED was the solution I had.

I'm not sure I'm following. For this to be true A and C must be running
on a different server right?

So we have something like:

	S0 running A			S1 running B

Therefore:

	S0::state == RUNNABLE		S1::state == RUNNABLE
	A::server_tid == S0.tid		B::server_tid == S1.tid
	A::state == RUNNING		B::state == RUNNING

Now, you want A to switch to C, therefore C had better be with S0, eg we
have:

	C::server_tid == S0.tid
	C::state == RUNNABLE

So then A does:

	A::next_tid = C.tid;
	sys_umcg_wait();

Which will:

	pin(A);
	pin(S0);

	cmpxchg(A::state, RUNNING, RUNNABLE);

	next_tid = A::next_tid; // C

	enqueue(S0::runnable, A);

At which point B steals S0's runnable queue, and tries to make A go.

					runnable = xchg(S0::runnable_list_ptr, NULL); // == A
					A::server_tid = S1.tid;
					B::next_tid = A.tid;
					sys_umcg_wait();

	wake(C)
	  cmpxchg(C::state, RUNNABLE, RUNNING); <-- *fault*


Something like that, right?

What currently happens is that S0 goes back to S0 and S1 ends up in A.
That is, if, for any reason we fail to wake next_tid, we'll wake
server_tid.

So then S0 wakes up and gets to re-evaluate life. If it has another
worker it can go run that, otherwise it can try and steal a worker
somewhere or just idle out.

Now arguably, the only reason A->C can fault is because C is garbage, at
which point your program is malformed and it doesn't matter what
happens one way or the other.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 21:08               ` Peter Zijlstra
@ 2021-11-29 21:29                 ` Peter Zijlstra
  2021-11-29 23:38                 ` Peter Oskolkov
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-29 21:29 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Nov 29, 2021 at 10:08:41PM +0100, Peter Zijlstra wrote:
> I'm not sure I'm following. For this to be true A and C must be running
> on a different server right?
> 
> So we have something like:
> 
> 	S0 running A			S1 running B
> 
> Therefore:
> 
> 	S0::state == RUNNABLE		S1::state == RUNNABLE
> 	A::server_tid == S0.tid		B::server_tid == S1.tid
> 	A::state == RUNNING		B::state == RUNNING
> 
> Now, you want A to switch to C, therefore C had better be with S0, eg we
> have:
> 
> 	C::server_tid == S0.tid
> 	C::state == RUNNABLE
> 
> So then A does:
> 
> 	A::next_tid = C.tid;
> 	sys_umcg_wait();
> 
> Which will:
> 
> 	pin(A);
> 	pin(S0);
> 
> 	cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> 	next_tid = A::next_tid; // C
> 
> 	enqueue(S0::runnable, A);
> 
> At which point B steals S0's runnable queue, and tries to make A go.
> 
> 					runnable = xchg(S0::runnable_list_ptr, NULL); // == A
> 					A::server_tid = S1.tid;
> 					B::next_tid = A.tid;
> 					sys_umcg_wait();
> 
> 	wake(C)
> 	  cmpxchg(C::state, RUNNABLE, RUNNING); <-- *fault*
> 
> 
> Something like that, right?

And note that there's an XXX in the code about exactly this case; it has
a question whether we want to add pin(next) to umcg_pin_pages().

That would not in fact help here, because sys_umcg_wait() is faultable
and the only reason it'll return -EFAULT is because, as stated below, C
is garbage. But it does make a difference for when we do something like:

	self->next_tid = someone;
	sys_something_we_expect_to_block();
	// handle not blocking

Because in that case userspace must have taken 'someone' from the
runnable queue and made it 'next', but then we'll not wake next but the
server, which then needs to figure out something went sideways.

So I'm tempted to add that optional 3rd pin, simply to reduce the
failure cases.

> What currently happens is that S0 goes back to S0 and S1 ends up in A.
> That is, if, for any reason we fail to wake next_tid, we'll wake
> server_tid.
> 
> So then S0 wakes up and gets to re-evaluate life. If it has another
> worker it can go run that, otherwise it can try and steal a worker
> somewhere or just idle out.
> 
> Now arguably, the only reason A->C can fault is because C is garbage, at
> which point your program is malformed and it doesn't matter what
> happens one way or the other.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-26 21:52       ` Peter Zijlstra
@ 2021-11-29 22:07         ` Thomas Gleixner
  2021-11-29 22:22           ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2021-11-29 22:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Fri, Nov 26 2021 at 22:52, Peter Zijlstra wrote:
> On Fri, Nov 26, 2021 at 10:11:17PM +0100, Thomas Gleixner wrote:
>> On Wed, Nov 24 2021 at 22:19, Peter Zijlstra wrote:
>> > On Mon, Nov 22, 2021 at 01:13:24PM -0800, Peter Oskolkov wrote:
>> >
>> >> +	 * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
>> >
>> >> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
>> >> +				bool may_fault)
>> >> +{
>> >> +	u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
>> >> +	u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
>> >
>> > I'm still very hesitant to use ktime (fear the HPET); but I suppose it
>> > makes sense to use a time base that's accessible to userspace. Was
>> > MONOTONIC_RAW considered?
>> 
>> MONOTONIC_RAW is not really useful as you can't sleep on it and it won't
>> solve the HPET crap either.
>
> But it's ns are of equal size to sched_clock(), if both share TSC IIRC.
> Whereas MONOTONIC, being subject to ntp rate stuff, has differently
> sized ns.

The size is the same, i.e. 1 bit per nanosecond :)

> The only time that's relevant though is when you're going to mix these
> timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
> interesting.

Uuurg. If you want to go towards CLOCK_THREAD_CPUTIME_ID, that's going
to be really nasty. Actually you can sleep on that clock, but that's a
completely different universe. If anything like that is desired then we
need to rewrite that posix CPU timer muck completely with all the bells
and whistels and race conditions attached to it. *Shudder*

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 22:07         ` Thomas Gleixner
@ 2021-11-29 22:22           ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-11-29 22:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Oskolkov, Ingo Molnar, Andrew Morton, Dave Hansen,
	Andy Lutomirski, linux-mm, linux-kernel, linux-api, Paul Turner,
	Ben Segall, Peter Oskolkov, Andrei Vagin, Jann Horn,
	Thierry Delisle

On Mon, Nov 29, 2021 at 11:07:07PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 26 2021 at 22:52, Peter Zijlstra wrote:

> The size is the same, i.e. 1 bit per nanosecond :)

:-)

> > The only time that's relevant though is when you're going to mix these
> > timestamps with CLOCK_THREAD_CPUTIME_ID, which might just be
> > interesting.
> 
> Uuurg. If you want to go towards CLOCK_THREAD_CPUTIME_ID, that's going
> to be really nasty. Actually you can sleep on that clock, but that's a
> completely different universe. If anything like that is desired then we
> need to rewrite that posix CPU timer muck completely with all the bells
> and whistels and race conditions attached to it. *Shudder*

Oh, I wasn't thinking anything as terrible as that. Sleeping on that
clock is fundamentally daft since it doesn't run when the task is
sleeping; consider trying to sleep on your own runtime :-)

I was only considering combining THREAD_CPUTIME timestamps with the
UMCG timestamps to compute how much unmanaged time there was, or other
such things.

Anyway, let's forget I brought this up and assume that for practical
purposes all [ns] are of equal length.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 21:08               ` Peter Zijlstra
  2021-11-29 21:29                 ` Peter Zijlstra
@ 2021-11-29 23:38                 ` Peter Oskolkov
  2021-12-06 11:32                   ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Oskolkov @ 2021-11-29 23:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
[...]
> > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > >
> > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > supposed to do.
[...]
>
> So then A does:
>
>         A::next_tid = C.tid;
>         sys_umcg_wait();
>
> Which will:
>
>         pin(A);
>         pin(S0);
>
>         cmpxchg(A::state, RUNNING, RUNNABLE);

Hmm.... That's another difference between your patch and mine: my
approach was "the side that initiates the change updates the state".
So in my code the userspace changes the current task's state RUNNING
=> RUNNABLE and the next task's state, or the server's state, RUNNABLE
=> RUNNING before calling sys_umcg_wait(). The kernel changed worker
states to BLOCKED/RUNNABLE during block/wake detection, and marked
servers RUNNING when waking them during block/wake detection; but all
applicable state changes for sys_umcg_wait() happen in the userspace.

The reasoning behind this approach was:
- do in kernel only that which cannot be done in the userspace, to
make the kernel code smaller/simpler
- similar to how futexes work: futex_wait does not change the futex
value to the desired value, but just checks whether the futex value
matches the desired value
- similar to how futexes work, concurrent state changes can happen in
the userspace without calling into the kernel at all
    for example:
        - (a): worker A goes to sleep into sys_umcg_wait()
        - (b): worker B wants to context switch into worker A "a moment" later
        - due to preemption/interrupts/pagefaults/whatnot, (b) happens
in reality before (a)
    in my patchset, the situation above happily resolves in the
userspace so that worker A keeps running without ever calling
sys_umcg_wait().

Again, I don't think this is deal breaking, and your approach will
work, just a bit less efficiently in some cases :)
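
Spelled out, the futex-like fast path described above is roughly (a sketch of
the idea, not the posted code; state values illustrative):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdint.h>

	#define UMCG_TASK_RUNNING	1u	/* illustrative values */
	#define UMCG_TASK_RUNNABLE	2u

	/* Worker A about to wait: the initiating side does RUNNING -> RUNNABLE. */
	static void worker_wait(_Atomic uint32_t *a_state)
	{
		uint32_t expected = UMCG_TASK_RUNNING;

		if (!atomic_compare_exchange_strong(a_state, &expected,
						    UMCG_TASK_RUNNABLE))
			return;		/* state changed under us; keep running */

		if (atomic_load(a_state) != UMCG_TASK_RUNNABLE)
			return;		/* (b) happened before (a): keep running,
					 * no syscall at all */

		/* syscall(__NR_umcg_wait, 0, 0): like futex_wait(), the kernel
		 * only sleeps if ::state still matches RUNNABLE */
	}

	/* Worker B switching into A: RUNNABLE -> RUNNING, done in userspace. */
	static bool switch_into(_Atomic uint32_t *a_state)
	{
		uint32_t expected = UMCG_TASK_RUNNABLE;

		return atomic_compare_exchange_strong(a_state, &expected,
						      UMCG_TASK_RUNNING);
	}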

I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
A transfers its server to worker B that A intends to context switch
into, and then worker A pagefaults or gets interrupted before calling
sys_umcg_wait()? The server will be woken up and will see that it is
assigned to worker B; now what? If worker A is "locked" before the
whole thing starts, the pagefault/interrupt will not trigger
block/wake detection, worker A will keep RUNNING for all intended
purposes, and eventually will call sys_umcg_wait() as it had
intended...

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 23:38                 ` Peter Oskolkov
@ 2021-12-06 11:32                   ` Peter Zijlstra
  2021-12-06 12:04                     ` Peter Zijlstra
  2021-12-13 13:55                     ` Peter Zijlstra
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-12-06 11:32 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle


Sorry, I haven't been feeling too well and as such procrastinated on this
because thinking is required :/ Trying to pick up the bits.

On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:
> [...]
> > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > > >
> > > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > > supposed to do.
> [...]
> >
> > So then A does:
> >
> >         A::next_tid = C.tid;
> >         sys_umcg_wait();
> >
> > Which will:
> >
> >         pin(A);
> >         pin(S0);
> >
> >         cmpxchg(A::state, RUNNING, RUNNABLE);
> 
> Hmm.... That's another difference between your patch and mine: my
> approach was "the side that initiates the change updates the state".
> So in my code the userspace changes the current task's state RUNNING
> => RUNNABLE and the next task's state,

I couldn't make that work for wakeups; when a thread blocks in a
random syscall there is no userspace to wake the next thread. And since
it seems required in this case, it's easier and more consistent to
always do it.

> or the server's state, RUNNABLE
> => RUNNING before calling sys_umcg_wait().

Yes, this is indeed required; I've found the same when trying to build
the userspace server loop. And yes, I'm starting to see where you're
coming from.

> I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
> A transfers its server to worker B that A intends to context switch

	S0 running A

Therefore:

	S0::state == RUNNABLE
	A::server_tid = S0.tid
	A::state == RUNNING

you want A to switch to B, therefore:

	B::state == RUNNABLE

if B is not yet on S0 then:

	B::server_tid = S0.tid;

finally:

0:
	A::next_tid = B.tid;
1:
	A::state = RUNNABLE:
2:
	sys_umcg_wait();
3:

> into, and then worker A pagefaults or gets interrupted before calling
> sys_umcg_wait()?

So the problem is tripping umcg_notify_resume() on the labels 1 and 2,
right? Tripping it on 0 and 3 is trivially correct.

If we trip it on 1 and !(A::state & TF_PREEMPT), then nothing, since
::state == RUNNING we'll just continue onwards and all is well. That is,
nothing has happened yet.

However, if we trip it on 2: we're screwed. Because at that point
::state is scribbled.

> The server will be woken up and will see that it is
> assigned to worker B; now what? If worker A is "locked" before the
> whole thing starts, the pagefault/interrupt will not trigger
> block/wake detection, worker A will keep RUNNING for all intended
> purposes, and eventually will call sys_umcg_wait() as it had
> intended...

No, the failure case is different; umcg_notify_resume() will simply
block A until someone sets A::state == RUNNING and kicks it, which will
be no-one.

Now, the above situation is actually simple to fix, but it gets more
interesting when we're using sys_umcg_wait() to build wait primitives.
Because in that case we get stuff like:

	for (;;) {
		self->state = RUNNABLE;
		smp_mb();
		if (cond)
			break;
		sys_umcg_wait();
	}
	self->state = RUNNING;

And we really need to not block and also not do sys_umcg_wait() early.

So yes, I agree that we need a special case here that ensures
umcg_notify_resume() doesn't block. Let me ponder naming and comments.
Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Now, obviously if you do a random syscall anywhere around here, you get
to keep the pieces :-)



I've also added ::next_tid to the whole umcg_pin_pages() thing, and made
it so that ::next_tid gets cleared when it's been used. That way things
like:

	self->next_tid = pick_from_runqueue();
	sys_that_is_expected_to_sleep();
	if (self->next_tid) {
		return_to_runqueue(self->next_tid);
		self->next_tid = 0;
	}

Are much simpler to manage. Either it did sleep and ::next_tid is
consumed, or it didn't sleep and it needs to be returned to the
runqueue.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-11-29 17:34             ` Peter Oskolkov
  2021-11-29 21:08               ` Peter Zijlstra
@ 2021-12-06 11:47               ` Peter Zijlstra
  2022-01-19 17:26                 ` Peter Oskolkov
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-12-06 11:47 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > worse, multiple running workers).
> 
> It should not. Timed out workers should be added to the runnable list
> and not become running unless a server chooses so. So sys_umcg_wait()
> with a timeout should behave similarly to a normal sleep, in that the
> server is woken upon the worker blocking, and upon the worker wakeup
> the worker is added to the woken workers list and waits for a server
> to run it. The only difference is that in a sleep the worker becomes
> BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> time.
> 
> Why then have sys_umcg_wait() with a timeout at all, instead of
> calling nanosleep()? Because the worker in sys_umcg_wait() can be
> context-switched into by another worker, or made running by a server;
> if the worker is in nanosleep(), it just sleeps.

I've been trying to figure out the semantics of that timeout thing, and
I can't seem to make sense of it.

Consider two workers:

	S0 running A				S1 running B

therefore:

	S0::state == RUNNABLE			S1::state == RUNNABLE
	A::server_tid == S0.tid			B::server_tid = S1.tid
	A::state == RUNNING			B::state == RUNNING

Doing:

	self->state = RUNNABLE;			self->state = RUNNABLE;
	sys_umcg_wait(0);			sys_umcg_wait(10);
	  umcg_enqueue_runnable()		  umcg_enqueue_runnable()
	  umcg_wake()				  umcg_wake()
	  umcg_wait()				  umcg_wait()
						    hrtimer_start()

In both cases we get the exact same outcome:

	A::state == RUNNABLE			B::state == RUNNABLE
	S0::state == RUNNING			S1::state == RUNNING
	S0::runnable_ptr == &A			S1::runnable_ptr = &B


Which is, AFAICT, the exact state you wanted to achieve, except B now
has an active timer, but what do you want it to do when that goes off?

I'm tempted to say workers cannot have timeout, and servers can use it
to wake themselves.
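
E.g. something like this on the server side (a sketch; the timeout is absolute
time on the clock picked at registration, as umcg_wait() above expects):

	#include <stdint.h>
	#include <time.h>

	static uint64_t abs_ns_from_now(clockid_t clk, uint64_t delta_ns)
	{
		struct timespec ts;

		clock_gettime(clk, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ull +
		       (uint64_t)ts.tv_nsec + delta_ns;
	}

	/*
	 * In the server loop, with self->state already set to RUNNABLE:
	 *
	 *	syscall(__NR_umcg_wait, 0, abs_ns_from_now(CLOCK_MONOTONIC, poll_ns));
	 *
	 * -ETIMEDOUT then just means the server woke itself; rescan
	 * runnable_workers_ptr and go around again.
	 */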

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-12-06 11:32                   ` Peter Zijlstra
@ 2021-12-06 12:04                     ` Peter Zijlstra
  2021-12-13 13:55                     ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-12-06 12:04 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Dec 06, 2021 at 12:32:22PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> > On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@infradead.org> wrote:

> Now, the above situation is actually simple to fix, but it gets more
> interesting when we're using sys_umcg_wait() to build wait primitives.
> Because in that case we get stuff like:
> 
> 	for (;;) {
> 		self->state = RUNNABLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		sys_umcg_wait();
> 	}
> 	self->state = RUNNING;
> 
> And we really need to not block and also not do sys_umcg_wait() early.
> 
> So yes, I agree that we need a special case here that ensures
> umcg_notify_resume() doesn't block. Let me ponder naming and comments.
> Either a TF_COND_WAIT or a whole new state. I can't decide yet.

Hurmph... OTOH since self above hasn't actually done anything yet, it
isn't reported as runnable yet, and so for all intents and purposes the
userspace state thinks it's running (which is true) and nobody should be
trying a concurrent wakeup and there aren't any races.

Bah, now I'm confused again :-) Let me go think more.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-12-06 11:32                   ` Peter Zijlstra
  2021-12-06 12:04                     ` Peter Zijlstra
@ 2021-12-13 13:55                     ` Peter Zijlstra
  1 sibling, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2021-12-13 13:55 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Andrew Morton, Dave Hansen,
	Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Dec 06, 2021 at 12:32:22PM +0100, Peter Zijlstra wrote:
> 
> Sorry, I haven't been feeling too well and as such procrastinated on this
> because thinking is required :/ Trying to pick up the bits.

*sigh* and yet another week gone... someone was unhappy about refcount_t.


> No, the failure case is different; umcg_notify_resume() will simply
> block A until someone sets A::state == RUNNING and kicks it, which will
> be no-one.
> 
> Now, the above situation is actually simple to fix, but it gets more
> interesting when we're using sys_umcg_wait() to build wait primitives.
> Because in that case we get stuff like:
> 
> 	for (;;) {
> 		self->state = RUNNABLE;
> 		smp_mb();
> 		if (cond)
> 			break;
> 		sys_umcg_wait();
> 	}
> 	self->state = RUNNING;
> 
> And we really need to not block and also not do sys_umcg_wait() early.
> 
> So yes, I agree that we need a special case here that ensures
> umcg_notify_resume() doesn't block. Let me ponder naming and comments.
> Either a TF_COND_WAIT or a whole new state. I can't decide yet.
> 
> Now, obviously if you do a random syscall anywhere around here, you get
> to keep the pieces :-)

Something like so I suppose..

--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -42,6 +42,32 @@
  *
  */
 #define UMCG_TF_PREEMPT			0x0100U
+/*
+ * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait()
+ *
+ * Enables server loops like (vs umcg_sys_exit()):
+ *
+ *   for(;;) {
+ *	self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
+ *	// smp_mb() implied by xchg()
+ *
+ *	runnable_ptr = xchg(self->runnable_workers_ptr, NULL);
+ *	while (runnable_ptr) {
+ *		next = runnable_ptr->runnable_workers_ptr;
+ *
+ *		umcg_server_add_runnable(self, runnable_ptr);
+ *
+ *		runnable_ptr = next;
+ *	}
+ *
+ *	self->next = umcg_server_pick_next(self);
+ *	sys_umcg_wait(0, 0);
+ *   }
+ *
+ * without a signal or interrupt in between setting umcg_task::state and
+ * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume().
+ */
+#define UMCG_TF_COND_WAIT		0x0200U
 
 #define UMCG_TF_MASK			0xff00U
 
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -180,7 +180,7 @@ void umcg_worker_exit(void)
 /*
  * Do a state transition, @from -> @to, and possible read @next after that.
  *
- * Will clear UMCG_TF_PREEMPT.
+ * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT.
  *
  * When @to == {BLOCKED,RUNNABLE}, update timestamps.
  *
@@ -216,7 +216,8 @@ static int umcg_update_state(struct task
 		if ((old & UMCG_TASK_MASK) != from)
 			goto fail;
 
-		new = old & ~(UMCG_TASK_MASK | UMCG_TF_PREEMPT);
+		new = old & ~(UMCG_TASK_MASK |
+			      UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT);
 		new |= to & UMCG_TASK_MASK;
 
 	} while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault));
@@ -567,11 +568,13 @@ void umcg_notify_resume(struct pt_regs *
 	if (state == UMCG_TASK_RUNNING)
 		goto done;
 
-	// XXX can get here when:
-	//
-	// self->state = RUNNABLE
-	// <signal>
-	// sys_umcg_wait();
+	/*
+	 * See comment at UMCG_TF_COND_WAIT; TL;DR: user *will* call
+	 * sys_umcg_wait() and signals/interrupts shouldn't block
+	 * return-to-user.
+	 */
+	if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT))
+		goto done;
 
 	if (state & UMCG_TF_PREEMPT) {
 		if (umcg_pin_pages())
@@ -658,6 +661,13 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
 	if (ret)
 		goto unblock;
 
+	/*
+	 * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE.
+	 */
+	ret = umcg_update_state(self, tsk, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE);
+	if (ret)
+		goto unpin;
+
 	if (worker) {
 		ret = umcg_enqueue_runnable(tsk);
 		if (ret)
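
As an illustration (not part of the patch), the worker side of the wait
primitive discussed at the top of this mail could then be sketched in
userspace C as below; my_cond_wait(), the direct syscall() invocation,
__NR_umcg_wait and the plain `state` field follow the pseudo-code in this
thread and the headers as patched above, and may not match the final uapi:

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>		/* the uapi header as patched above */

	static void my_cond_wait(struct umcg_task *self, int *cond)
	{
		for (;;) {
			/*
			 * Announce the upcoming sys_umcg_wait(); the SEQ_CST
			 * ordering on this store and the load below plays the
			 * role of the smp_mb() in the pseudo-code, and the
			 * kernel clears UMCG_TF_COND_WAIT again in
			 * umcg_update_state().
			 */
			__atomic_store_n(&self->state,
					 UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT,
					 __ATOMIC_SEQ_CST);

			if (__atomic_load_n(cond, __ATOMIC_SEQ_CST))
				break;

			/* flags == 0, abs_timeout == 0: plain wait */
			syscall(__NR_umcg_wait, 0, 0);
		}

		/* condition satisfied; resume running without a server round-trip */
		__atomic_store_n(&self->state, UMCG_TASK_RUNNING,
				 __ATOMIC_SEQ_CST);
	}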

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2021-12-06 11:47               ` Peter Zijlstra
@ 2022-01-19 17:26                 ` Peter Oskolkov
  2022-01-20 11:07                   ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Oskolkov @ 2022-01-19 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Andrei Vagin, Jann Horn, Thierry Delisle

On Mon, Dec 6, 2021 at 3:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> > On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > > worse, multiple running workers).
> >
> > It should not. Timed out workers should be added to the runnable list
> > and not become running unless a server chooses so. So sys_umcg_wait()
> > with a timeout should behave similarly to a normal sleep, in that the
> > server is woken upon the worker blocking, and upon the worker wakeup
> > the worker is added to the woken workers list and waits for a server
> > to run it. The only difference is that in a sleep the worker becomes
> > BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> > time.
> >
> > Why then have sys_umcg_wait() with a timeout at all, instead of
> > calling nanosleep()? Because the worker in sys_umcg_wait() can be
> > context-switched into by another worker, or made running by a server;
> > if the worker is in nanosleep(), it just sleeps.
>
> I've been trying to figure out the semantics of that timeout thing, and
> I can't seem to make sense of it.
>
> Consider two workers:
>
>         S0 running A                            S1 running B
>
> therefore:
>
>         S0::state == RUNNABLE                   S1::state == RUNNABLE
>         A::server_tid == S0.tid                 B::server_tid = S1.tid
>         A::state == RUNNING                     B::state == RUNNING
>
> Doing:
>
>         self->state = RUNNABLE;                 self->state = RUNNABLE;
>         sys_umcg_wait(0);                       sys_umcg_wait(10);
>           umcg_enqueue_runnable()                 umcg_enqueue_runnable()

sys_umcg_wait() should not enqueue the worker as runnable; workers are
enqueued to indicate wakeup events.

>           umcg_wake()                             umcg_wake()
>           umcg_wait()                             umcg_wait()
>                                                     hrtimer_start()
>
> In both cases we get the exact same outcome:
>
>         A::state == RUNNABLE                    B::state == RUNNABLE
>         S0::state == RUNNING                    S1::state == RUNNING
>         S0::runnable_ptr == &A                  S1::runnable_ptr = &B

So without sys_umcg_wait() enqueueing the worker onto the runnable-workers
list, the state now is

         A::state == RUNNABLE                    B::state == RUNNABLE
         S0::state == RUNNING                    S1::state == RUNNING
         S0::runnable_ptr == NULL                S1::runnable_ptr == NULL

>
>
> Which is, AFAICT, the exact state you wanted to achieve, except B now
> has an active timer, but what do you want it to do when that goes?

When the timer fires, _then_ B is enqueued onto the runnable-workers list,
so the state becomes

         A::state == RUNNABLE                    B::state == RUNNABLE
         S0::state == RUNNING                    S1::state == RUNNING
         S0::runnable_ptr == NULL                S1::runnable_ptr == &B

So worker timeouts in sys_umcg_wait are treated as wakeup events, with
the difference that when the worker is eventually scheduled by a
server, sys_umcg_wait returns with ETIMEDOUT.
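
As an illustration (not part of the patchset), a worker-side timed wait
under these semantics could be sketched as follows; my_timed_wait(), the
direct syscall() invocation, __NR_umcg_wait and the field/flag names are
assumptions based on the headers as patched earlier in this thread:

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/umcg.h>

	/*
	 * Returns -ETIMEDOUT when the timer fired, 0 otherwise (other error
	 * handling elided); either way the worker only runs again once a
	 * server (or another worker) has scheduled it.
	 */
	static int my_timed_wait(struct umcg_task *self, __u64 abs_timeout)
	{
		/* see UMCG_TF_COND_WAIT earlier in the thread */
		__atomic_store_n(&self->state,
				 UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT,
				 __ATOMIC_SEQ_CST);

		if (syscall(__NR_umcg_wait, 0, abs_timeout) == -1 &&
		    errno == ETIMEDOUT)
			return -ETIMEDOUT;

		return 0;
	}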

>
> I'm tempted to say workers cannot have timeout, and servers can use it
> to wake themselves.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
  2022-01-19 17:26                 ` Peter Oskolkov
@ 2022-01-20 11:07                   ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2022-01-20 11:07 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, Andrew Morton,
	Dave Hansen, Andy Lutomirski, Linux Memory Management List,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Andrei Vagin, Jann Horn, Thierry Delisle

On Wed, Jan 19, 2022 at 09:26:41AM -0800, Peter Oskolkov wrote:
> On Mon, Dec 6, 2021 at 3:47 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Nov 29, 2021 at 09:34:49AM -0800, Peter Oskolkov wrote:
> > > On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> > > > worse, multiple running workers).
> > >
> > > It should not. Timed out workers should be added to the runnable list
> > > and not become running unless a server chooses so. So sys_umcg_wait()
> > > with a timeout should behave similarly to a normal sleep, in that the
> > > server is woken upon the worker blocking, and upon the worker wakeup
> > > the worker is added to the woken workers list and waits for a server
> > > to run it. The only difference is that in a sleep the worker becomes
> > > BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
> > > time.
> > >
> > > Why then have sys_umcg_wait() with a timeout at all, instead of
> > > calling nanosleep()? Because the worker in sys_umcg_wait() can be
> > > context-switched into by another worker, or made running by a server;
> > > if the worker is in nanosleep(), it just sleeps.
> >
> > I've been trying to figure out the semantics of that timeout thing, and
> > I can't seem to make sense of it.
> >
> > Consider two workers:
> >
> >         S0 running A                            S1 running B
> >
> > therefore:
> >
> >         S0::state == RUNNABLE                   S1::state == RUNNABLE
> >         A::server_tid == S0.tid                 B::server_tid = S1.tid
> >         A::state == RUNNING                     B::state == RUNNING
> >
> > Doing:
> >
> >         self->state = RUNNABLE;                 self->state = RUNNABLE;
> >         sys_umcg_wait(0);                       sys_umcg_wait(10);
> >           umcg_enqueue_runnable()                 umcg_enqueue_runnable()
> 
> sys_umcg_wait() should not enqueue the worker as runnable; workers are
> enqueued to indicate wakeup events.

Oooh... I see.

> So worker timeouts in sys_umcg_wait are treated as wakeup events, with
> the difference that when the worker is eventually scheduled by a
> server, sys_umcg_wait returns with ETIMEDOUT.

Right.. OK, let me go fold and polish what I have now before I go change
things again though.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2022-01-20 11:08 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-22 21:13 [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Oskolkov
2021-11-22 21:13 ` [PATCH v0.9.1 1/6] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
2021-11-22 21:13 ` [PATCH v0.9.1 2/6] mm, x86/uaccess: add userspace atomic helpers Peter Oskolkov
2021-11-24 14:31   ` Peter Zijlstra
2021-11-22 21:13 ` [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls Peter Oskolkov
2021-11-24 18:36   ` kernel test robot
2021-11-24 20:08   ` Peter Zijlstra
2021-11-24 21:32     ` Peter Zijlstra
2021-11-25 17:28     ` Peter Oskolkov
2021-11-26 17:09       ` Peter Zijlstra
2021-11-26 21:08         ` Thomas Gleixner
2021-11-26 21:59           ` Peter Zijlstra
2021-11-26 22:07             ` Peter Zijlstra
2021-11-27  0:45             ` Thomas Gleixner
2021-11-29 15:05               ` Peter Zijlstra
2021-11-26 22:16         ` Peter Zijlstra
2021-11-27  1:16           ` Thomas Gleixner
2021-11-29 15:07             ` Peter Zijlstra
2021-11-29  0:29         ` Peter Oskolkov
2021-11-29 16:41           ` Peter Zijlstra
2021-11-29 17:34             ` Peter Oskolkov
2021-11-29 21:08               ` Peter Zijlstra
2021-11-29 21:29                 ` Peter Zijlstra
2021-11-29 23:38                 ` Peter Oskolkov
2021-12-06 11:32                   ` Peter Zijlstra
2021-12-06 12:04                     ` Peter Zijlstra
2021-12-13 13:55                     ` Peter Zijlstra
2021-12-06 11:47               ` Peter Zijlstra
2022-01-19 17:26                 ` Peter Oskolkov
2022-01-20 11:07                   ` Peter Zijlstra
2021-11-24 21:19   ` Peter Zijlstra
2021-11-26 21:11     ` Thomas Gleixner
2021-11-26 21:52       ` Peter Zijlstra
2021-11-29 22:07         ` Thomas Gleixner
2021-11-29 22:22           ` Peter Zijlstra
2021-11-24 21:41   ` Peter Zijlstra
2021-11-24 21:58   ` Peter Zijlstra
2021-11-24 22:18   ` Peter Zijlstra
2021-11-22 21:13 ` [PATCH v0.9.1 4/6] sched/umcg, lib/umcg: implement libumcg Peter Oskolkov
2021-11-22 21:13 ` [PATCH v0.9.1 5/6] sched/umcg: add Documentation/userspace-api/umcg.txt Peter Oskolkov
2021-11-22 21:13 ` [PATCH v0.9.1 6/6] sched/umcg, lib/umcg: add tools/lib/umcg/libumcg.txt Peter Oskolkov
2021-11-24 14:06 ` [PATCH v0.9.1 0/6] sched,mm,x86/uaccess: implement User Managed Concurrency Groups Peter Zijlstra
2021-11-24 16:28   ` Peter Oskolkov
2021-11-24 17:20     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).