linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
@ 2021-05-20 18:36 Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 1/9] sched/umcg: add UMCG syscall stubs and CONFIG_UMCG Peter Oskolkov
                   ` (11 more replies)
  0 siblings, 12 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

As indicated earlier in the FUTEX_SWAP patchset:

https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/

"Google Fibers" is a userspace scheduling framework
used widely and successfully at Google to improve in-process workload
isolation and response latencies. We are working on open-sourcing
this framework, and UMCG (User-Managed Concurrency Groups) kernel
patches are intended as the foundation of this.

This patchset is "early preview/RFC" - an earlier version of the
"core UMCG API" was discussed offlist, and I was asked to post
"what I have" to LKML before I consider the work ready for a full
review.

Notes:
- the first six patches cover "core UMCG API" and are more "ready"
  than the last three, in the sense that I expect to see few, if any,
  material changes to them, unless the whole approach is NACKed;
- the last three patches cover "server/worker UMCG API" and need
  more work and testing:
  - while I'm not aware of any specific issues with them, I have not
    implemented and/or tested, yet, many important use cases, such as:
    - tracing
    - signals/interrupts
    - explicit preemption of workers other than cooperative wait/swap
  - comments/documentation is missing in many important places, or
    maybe even wrong/outdated.

As such, please pay more attention to the high-level intended behavior
and design than to things like patch organization, contents of commit
messages, comments or indentation, especially in the last three patches.

Unless the feedback here points to a different approach, my next step
is to add timeout handling to sys_umcg_wait/sys_umcg_swap, as this
will open up a lot of Google-internal tests that cover most of
use/corner cases other than explicit preemption of workers (Google
Fibers use cooperative scheduling features only). Then I'll
work on issues uncovered by those tests. Then I'll address preemption
and tracing.


This work is loosely based on Google-internal SwitchTo and SwitchTo
Groups kernel patches developed by Paul Turner and Ben Segall.

Peter Oskolkov (9):
  sched/umcg: add UMCG syscall stubs and CONFIG_UMCG
  sched/umcg: add uapi/linux/umcg.h and sched/umcg.c
  sched: add WF_CURRENT_CPU and externise ttwu
  sched/umcg: implement core UMCG API
  lib/umcg: implement UMCG core API for userspace
  selftests/umcg: add UMCG core API selftest
  sched/umcg: add UMCG server/worker API (early RFC)
  lib/umcg: add UMCG server/worker API (early RFC)
  selftests/umcg: add UMCG server/worker API selftest

 arch/x86/entry/syscalls/syscall_64.tbl        |   11 +
 include/linux/mm_types.h                      |    5 +
 include/linux/sched.h                         |    7 +-
 include/linux/syscalls.h                      |   14 +
 include/uapi/asm-generic/unistd.h             |   25 +-
 include/uapi/linux/umcg.h                     |   70 ++
 init/Kconfig                                  |   10 +
 kernel/fork.c                                 |   11 +
 kernel/sched/Makefile                         |    1 +
 kernel/sched/core.c                           |   17 +-
 kernel/sched/fair.c                           |    4 +
 kernel/sched/sched.h                          |   15 +-
 kernel/sched/umcg.c                           | 1114 +++++++++++++++++
 kernel/sched/umcg.h                           |   96 ++
 kernel/sys_ni.c                               |   13 +
 mm/init-mm.c                                  |    4 +
 tools/lib/umcg/.gitignore                     |    4 +
 tools/lib/umcg/Makefile                       |   11 +
 tools/lib/umcg/libumcg.c                      |  572 +++++++++
 tools/lib/umcg/libumcg.h                      |  262 ++++
 tools/testing/selftests/umcg/.gitignore       |    3 +
 tools/testing/selftests/umcg/Makefile         |   15 +
 tools/testing/selftests/umcg/umcg_core_test.c |  347 +++++
 tools/testing/selftests/umcg/umcg_test.c      |  475 +++++++
 24 files changed, 3096 insertions(+), 10 deletions(-)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c
 create mode 100644 kernel/sched/umcg.h
 create mode 100644 tools/lib/umcg/.gitignore
 create mode 100644 tools/lib/umcg/Makefile
 create mode 100644 tools/lib/umcg/libumcg.c
 create mode 100644 tools/lib/umcg/libumcg.h
 create mode 100644 tools/testing/selftests/umcg/.gitignore
 create mode 100644 tools/testing/selftests/umcg/Makefile
 create mode 100644 tools/testing/selftests/umcg/umcg_core_test.c
 create mode 100644 tools/testing/selftests/umcg/umcg_test.c

-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 1/9] sched/umcg: add UMCG syscall stubs and CONFIG_UMCG
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 2/9] sched/umcg: add uapi/linux/umcg.h and sched/umcg.c Peter Oskolkov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

User Managed Concurrency Groups is a fast context switching and
in-process userspace scheduling framework.

Two main use cases are security sandboxes and userspace scheduling.

Security sandboxes: fast X-process context switching will open up a
bunch of light-weight security tools, e.g. gVisor, or Tor Project's
Shadow simulator, to more use cases.

In-process userspace scheduling is used extensively at Google to provide
latency control and isolation guarantees for diverse workloads while
maintaining high CPU utilization.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl | 11 +++++++++++
 include/uapi/asm-generic/unistd.h      | 25 ++++++++++++++++++++++++-
 init/Kconfig                           | 10 ++++++++++
 kernel/sys_ni.c                        | 13 +++++++++++++
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ecd551b08d05..2e984a77eb23 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,17 @@
 444	common	landlock_create_ruleset	sys_landlock_create_ruleset
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
+447	common	umcg_api_version	sys_umcg_api_version
+448	common	umcg_register_task	sys_umcg_register_task
+449	common	umcg_unregister_task	sys_umcg_unregister_task
+450	common	umcg_wait		sys_umcg_wait
+451	common	umcg_wake		sys_umcg_wake
+452	common	umcg_swap		sys_umcg_swap
+453	common	umcg_create_group	sys_umcg_create_group
+454	common	umcg_destroy_group	sys_umcg_destroy_group
+455	common	umcg_poll_worker	sys_umcg_poll_worker
+456	common	umcg_run_worker		sys_umcg_run_worker
+457	common	umcg_preempt_worker	sys_umcg_preempt_worker
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..cb8504e7ae07 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,31 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 
+#define __NR_umcg_api_version 447
+__SYSCALL(__NR_umcg_api_version, sys_umcg_api_version)
+#define __NR_umcg_register_task 448
+__SYSCALL(__NR_umcg_register_task, sys_umcg_register_task)
+#define __NR_umcg_unregister_task 449
+__SYSCALL(__NR_umcg_unregister_task, sys_umcg_unregister_task)
+#define __NR_umcg_wait 450
+__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
+#define __NR_umcg_wake 451
+__SYSCALL(__NR_umcg_wake, sys_umcg_wake)
+#define __NR_umcg_swap 452
+__SYSCALL(__NR_umcg_swap, sys_umcg_swap)
+#define __NR_umcg_create_group 453
+__SYSCALL(__NR_umcg_create_group, sys_umcg_create_group)
+#define __NR_umcg_destroy_group 454
+__SYSCALL(__NR_umcg_destroy_group, sys_umcg_destroy_group)
+#define __NR_umcg_poll_worker 455
+__SYSCALL(__NR_umcg_poll_worker, sys_umcg_poll_worker)
+#define __NR_umcg_run_worker 456
+__SYSCALL(__NR_umcg_run_worker, sys_umcg_run_worker)
+#define __NR_umcg_preempt_worker 457
+__SYSCALL(__NR_umcg_preempt_worker, sys_umcg_preempt_worker)
+
 #undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
diff --git a/init/Kconfig b/init/Kconfig
index 1ea12c64e4c9..bfac88dd5d73 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1661,6 +1661,16 @@ config MEMBARRIER
 
 	  If unsure, say Y.
 
+config UMCG
+	bool "Enable User Managed Concurrency Groups API"
+	default n
+	help
+	  Enable UMCG core wait/wake/swap operations as well as UMCG
+	  group/server/worker API. The core API is useful for fast IPC
+	  and context switching, while the group/server/worker API, together
+	  with the core API, form the basis for an in-process M:N userspace
+	  scheduling framework implemented in lib/umcg.
+
 config KALLSYMS
 	bool "Load all symbols for debugging/ksymoops" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..fea55aa0222a 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -272,6 +272,19 @@ COND_SYSCALL(landlock_create_ruleset);
 COND_SYSCALL(landlock_add_rule);
 COND_SYSCALL(landlock_restrict_self);
 
+/* kernel/sched/umcg.c */
+COND_SYSCALL(umcg_api_version);
+COND_SYSCALL(umcg_register_task);
+COND_SYSCALL(umcg_unregister_task);
+COND_SYSCALL(umcg_wait);
+COND_SYSCALL(umcg_wake);
+COND_SYSCALL(umcg_swap);
+COND_SYSCALL(umcg_create_group);
+COND_SYSCALL(umcg_destroy_group);
+COND_SYSCALL(umcg_poll_worker);
+COND_SYSCALL(umcg_run_worker);
+COND_SYSCALL(umcg_preempt_worker);
+
 /* arch/example/kernel/sys_example.c */
 
 /* mm/fadvise.c */
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 2/9] sched/umcg: add uapi/linux/umcg.h and sched/umcg.c
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 1/9] sched/umcg: add UMCG syscall stubs and CONFIG_UMCG Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 3/9] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Introduce the uapi UMCG header file and document core UMCG API syscalls.

It is sometimes useful to separate the discussion of API from
the implementation details, and it seems to be the case here.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/linux/syscalls.h  |   9 +++
 include/uapi/linux/umcg.h |  70 +++++++++++++++++++
 kernel/sched/Makefile     |   1 +
 kernel/sched/umcg.c       | 143 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 223 insertions(+)
 create mode 100644 include/uapi/linux/umcg.h
 create mode 100644 kernel/sched/umcg.c

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..15de3e34ccee 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -71,6 +71,7 @@ struct open_how;
 struct mount_attr;
 struct landlock_ruleset_attr;
 enum landlock_rule_type;
+struct umcg_task;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -1050,6 +1051,14 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
 asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long umcg_api_version(u32 api_version, u32 flags);
+asmlinkage long umcg_register_task(u32 api_version, u32 flags, u32 group_id,
+					struct umcg_task __user *umcg_task);
+asmlinkage long umcg_unregister_task(u32 flags);
+asmlinkage long umcg_wait(u32 flags, const struct __kernel_timespec __user *timeout);
+asmlinkage long umcg_wake(u32 flags, u32 next_tid);
+asmlinkage long umcg_swap(u32 wake_flags, u32 next_tid, u32 wait_flags,
+				const struct __kernel_timespec __user *timeout);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
new file mode 100644
index 000000000000..6c59563b41b2
--- /dev/null
+++ b/include/uapi/linux/umcg.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_UMCG_H
+#define _UAPI_LINUX_UMCG_H
+
+#include <linux/limits.h>
+#include <linux/types.h>
+
+/*
+ * UMCG task states, the first 8 bits.
+ */
+#define UMCG_TASK_NONE			0
+/* UMCG server states. */
+#define UMCG_TASK_POLLING		1
+#define UMCG_TASK_SERVING		2
+#define UMCG_TASK_PROCESSING		3
+/* UMCG worker states. */
+#define UMCG_TASK_RUNNABLE		4
+#define UMCG_TASK_RUNNING		5
+#define UMCG_TASK_BLOCKED		6
+#define UMCG_TASK_UNBLOCKED		7
+
+/* UMCG task state flags, bits 8-15 */
+#define UMCG_TF_WAKEUP_QUEUED		(1 << 8)
+
+/*
+ * Unused at the moment flags reserved for features to be introduced
+ * in the near future.
+ */
+#define UMCG_TF_PREEMPT_DISABLED	(1 << 9)
+#define UMCG_TF_PREEMPTED		(1 << 10)
+
+#define UMCG_NOID	UINT_MAX
+
+/**
+ * struct umcg_task - controls the state of UMCG-enabled tasks.
+ *
+ * While at the moment only one field is present (@state), in future
+ * versions additional fields will be added, e.g. for the userspace to
+ * provide performance-improving hints and for the kernel to export sched
+ * stats.
+ *
+ * The struct is aligned at 32 bytes to ensure that even with future additions
+ * it fits into a single cache line.
+ */
+struct umcg_task {
+	/**
+	 * @state: the current state of the UMCG task described by this struct.
+	 *
+	 * UMCG task state:
+	 *   bits  0 -  7: task state;
+	 *   bits  8 - 15: state flags;
+	 *   bits 16 - 23: reserved; must be zeroes;
+	 *   bits 24 - 31: for userspace use.
+	 */
+	uint32_t	state;
+} __attribute((packed, aligned(4 * sizeof(uint64_t))));
+
+/**
+ * enum umcg_register_flag - flags for sys_umcg_register
+ * @UMCG_REGISTER_CORE_TASK:  Register a UMCG core task (not part of a group);
+ * @UMCG_REGISTER_WORKER:     Register a UMCG worker task;
+ * @UMCG_REGISTER_SERVER:     Register a UMCG server task.
+ */
+enum umcg_register_flag {
+	UMCG_REGISTER_CORE_TASK		= 0,
+	UMCG_REGISTER_WORKER		= 1,
+	UMCG_REGISTER_SERVER		= 2
+};
+
+#endif /* _UAPI_LINUX_UMCG_H */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 978fcfca5871..e4e481eee1b7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
 obj-$(CONFIG_PSI) += psi.o
 obj-$(CONFIG_SCHED_CORE) += core_sched.o
+obj-$(CONFIG_UMCG) += umcg.o
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
new file mode 100644
index 000000000000..b8195cfdb76a
--- /dev/null
+++ b/kernel/sched/umcg.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * User Managed Concurrency Groups (UMCG).
+ */
+
+#include <linux/syscalls.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/umcg.h>
+
+/**
+ * sys_umcg_api_version - query UMCG API versions that are supported.
+ * @api_version:          Requested API version.
+ * @flags:                Reserved.
+ *
+ * Return:
+ * 0                    - the @api_version is supported;
+ * > 0                  - the maximum supported version of UMCG API if
+ *                        the requested version is not supported.
+ * -EINVAL              - @flags is not zero.
+ *
+ * NOTE: the kernel may drop support for older/deprecated API versions,
+ * so a return of X does not indicate that any version less than X is
+ * supported.
+ */
+SYSCALL_DEFINE2(umcg_api_version, u32, api_version, u32, flags)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_umcg_register_task - register the current task as a UMCG task.
+ * @api_version:       The expected/desired API version of the syscall.
+ * @flags:             One of enum umcg_register_flag.
+ * @group_id:          UMCG Group ID. UMCG_NOID for Core tasks.
+ * @umcg_task:         The control struct for the current task.
+ *                     umcg_task->state must be UMCG_TASK_NONE.
+ *
+ * Register the current task as a UMCG task. If this is a core UMCG task,
+ * the syscall marks it as RUNNING and returns immediately.
+ *
+ * If this is a UMCG worker, the syscall marks it UNBLOCKED, and proceeds
+ * with the normal UNBLOCKED worker logic.
+ *
+ * If this is a UMCG server, the syscall immediately returns.
+ *
+ * Return:
+ * 0            - Ok;
+ * -EOPNOTSUPP  - the API version is not supported;
+ * -EINVAL      - one of the parameters is wrong;
+ * -EFAULT      - failed to access @umcg_task.
+ */
+SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
+		struct umcg_task __user *, umcg_task)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_umcg_unregister_task - unregister the current task as a UMCG task.
+ * @flags: reserved.
+ *
+ * Return:
+ * 0       - Ok;
+ * -EINVAL - the current task is not a UMCG task.
+ */
+SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_umcg_wait - block the current task (if all condtions are met).
+ * @flags:         Reserved.
+ * @timeout:       The absolute timeout of the wait. Not supported yet.
+ *                 Must be NULL.
+ *
+ * Sleep until woken, interrupted, or @timeout expires.
+ *
+ * Return:
+ * 0           - Ok;
+ * -EFAULT     - failed to read struct umcg_task assigned to this task
+ *               via sys_umcg_register();
+ * -EAGAIN     - try again;
+ * -EINTR      - signal pending;
+ * -EOPNOTSUPP - @timeout != NULL (not supported yet).
+ * -EINVAL     - a parameter or a member of struct umcg_task has a wrong value.
+ */
+SYSCALL_DEFINE2(umcg_wait, u32, flags,
+		const struct __kernel_timespec __user *, timeout)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_umcg_wake - wake @next_tid task blocked in sys_umcg_wait.
+ * @flags:         Reserved.
+ * @next_tid:      The ID of the task to wake.
+ *
+ * Wake @next identified by @next_tid. @next must be either a UMCG core
+ * task or a UMCG worker task.
+ *
+ * Return:
+ * 0           - Ok;
+ * -EFAULT     - failed to read struct umcg_task assigned to next;
+ * -ESRCH      - @next_tid did not identify a task;
+ * -EAGAIN     - try again;
+ * -EINVAL     - a parameter or a member of next->umcg_task has a wrong value.
+ */
+SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_umcg_swap - wake next, put current to sleep.
+ * @wake_flags:    Reserved.
+ * @next_tid:      The ID of the task to wake.
+ * @wait_flags:    Reserved.
+ * @timeout:       The absolute timeout of the wait. Not supported yet.
+ *
+ * sys_umcg_swap() is semantically equivalent to this code fragment:
+ *
+ *     ret = sys_umcg_wake(wake_flags, next_tid);
+ *     if (ret)
+ *             return ret;
+ *     return sys_umcg_wait(wait_flags, timeout);
+ *
+ * The function attempts to wake @next on the current CPU.
+ *
+ * The current and the next tasks must both be either UMCG core tasks,
+ * or two UMCG workers belonging to the same UMCG group. In the latter
+ * case the UMCG server task that is "running" the current task will
+ * be transferred to the next task.
+ *
+ * Return: see the code snippet above.
+ */
+SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
+		const struct __kernel_timespec __user *, timeout)
+{
+	return -ENOSYS;
+}
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 3/9] sched: add WF_CURRENT_CPU and externise ttwu
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 1/9] sched/umcg: add UMCG syscall stubs and CONFIG_UMCG Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 2/9] sched/umcg: add uapi/linux/umcg.h and sched/umcg.c Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases such as UMCG.

In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 kernel/sched/core.c  |  3 +--
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h | 15 +++++++++------
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d2527239c3e..88506bc2617f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3683,8 +3683,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
  * Return: %true if @p->state changes (an actual wakeup was done),
  *	   %false otherwise.
  */
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 {
 	unsigned long flags;
 	int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 161b92aa1c79..e55256bbb60b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6764,6 +6764,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		if ((wake_flags & WF_CURRENT_CPU) &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f0194cee0ba..205d05571d9e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2027,13 +2027,14 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 }
 
 /* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC     0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK     0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU     0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
+#define WF_EXEC         0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK         0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU         0x08 /* Wakeup;            maps to SD_BALANCE_WAKE */
 
-#define WF_SYNC     0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
-#define WF_ON_CPU   0x40 /* Wakee is on_cpu */
+#define WF_SYNC         0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED     0x20 /* Internal use, task got migrated */
+#define WF_ON_CPU       0x40 /* Wakee is on_cpu */
+#define WF_CURRENT_CPU  0x80 /* Prefer to move the wakee to the current CPU. */
 
 #ifdef CONFIG_SMP
 static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3018,6 +3019,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 extern void swake_up_all_locked(struct swait_queue_head *q);
 extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
 
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 extern int preempt_dynamic_mode;
 extern int sched_dynamic_mode(const char *str);
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (2 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 3/9] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-21 19:06   ` Andrei Vagin
                     ` (2 more replies)
  2021-05-20 18:36 ` [RFC PATCH v0.1 5/9] lib/umcg: implement UMCG core API for userspace Peter Oskolkov
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Implement version 1 of core UMCG API (wait/wake/swap).

As has been outlined in
https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/,
efficient and synchronous on-CPU context switching is key
to enabling two broad use cases: in-process M:N userspace scheduling
and fast X-process RPCs for security wrappers.

High-level design considerations/approaches used:
- wait & wake can race with each other;
- offload as much work as possible to libumcg in tools/lib/umcg,
  specifically:
  - most state changes, e.g. RUNNABLE <=> RUNNING, are done in
    the userspace (libumcg);
  - retries are offloaded to the userspace.

This implementation misses timeout handling in sys_umcg_wait
and sys_umcg_swap, which will be added in version 2.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/linux/sched.h |   7 +-
 kernel/sched/core.c   |   3 +
 kernel/sched/umcg.c   | 237 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/umcg.h   |  42 ++++++++
 4 files changed, 282 insertions(+), 7 deletions(-)
 create mode 100644 kernel/sched/umcg.h

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c7e7d50e2fdc..fc4b8775f514 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -66,6 +66,7 @@ struct sighand_struct;
 struct signal_struct;
 struct task_delay_info;
 struct task_group;
+struct umcg_task_data;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -778,6 +779,10 @@ struct task_struct {
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
 
+#ifdef CONFIG_UMCG
+	struct umcg_task_data __rcu	*umcg_task_data;
+#endif
+
 	/* Per-thread vma caching: */
 	struct vmacache			vmacache;
 
@@ -1022,7 +1027,7 @@ struct task_struct {
 	u64				parent_exec_id;
 	u64				self_exec_id;
 
-	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
+	/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy, umcg: */
 	spinlock_t			alloc_lock;
 
 	/* Protection of the PI data structures: */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 88506bc2617f..462104f13c28 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3964,6 +3964,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 #endif
+#ifdef CONFIG_UMCG
+	rcu_assign_pointer(p->umcg_task_data, NULL);
+#endif
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
index b8195cfdb76a..2d718433c773 100644
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -4,11 +4,23 @@
  * User Managed Concurrency Groups (UMCG).
  */
 
+#include <linux/freezer.h>
 #include <linux/syscalls.h>
 #include <linux/types.h>
 #include <linux/uaccess.h>
 #include <linux/umcg.h>
 
+#include "sched.h"
+#include "umcg.h"
+
+static int __api_version(u32 requested)
+{
+	if (requested == 1)
+		return 0;
+
+	return 1;
+}
+
 /**
  * sys_umcg_api_version - query UMCG API versions that are supported.
  * @api_version:          Requested API version.
@@ -26,7 +38,52 @@
  */
 SYSCALL_DEFINE2(umcg_api_version, u32, api_version, u32, flags)
 {
-	return -ENOSYS;
+	if (flags)
+		return -EINVAL;
+
+	return __api_version(api_version);
+}
+
+static int get_state(struct umcg_task __user *ut, u32 *state)
+{
+	return get_user(*state, (u32 __user *)ut);
+}
+
+static int put_state(struct umcg_task __user *ut, u32 state)
+{
+	return put_user(state, (u32 __user *)ut);
+}
+
+static int register_core_task(u32 api_version, struct umcg_task __user *umcg_task)
+{
+	struct umcg_task_data *utd;
+	u32 state;
+
+	if (get_state(umcg_task, &state))
+		return -EFAULT;
+
+	if (state != UMCG_TASK_NONE)
+		return -EINVAL;
+
+	utd = kzalloc(sizeof(struct umcg_task_data), GFP_KERNEL);
+	if (!utd)
+		return -EINVAL;
+
+	utd->self = current;
+	utd->umcg_task = umcg_task;
+	utd->task_type = UMCG_TT_CORE;
+	utd->api_version = api_version;
+
+	if (put_state(umcg_task, UMCG_TASK_RUNNING)) {
+		kfree(utd);
+		return -EFAULT;
+	}
+
+	task_lock(current);
+	rcu_assign_pointer(current->umcg_task_data, utd);
+	task_unlock(current);
+
+	return 0;
 }
 
 /**
@@ -54,7 +111,20 @@ SYSCALL_DEFINE2(umcg_api_version, u32, api_version, u32, flags)
 SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
 		struct umcg_task __user *, umcg_task)
 {
-	return -ENOSYS;
+	if (__api_version(api_version))
+		return -EOPNOTSUPP;
+
+	if (rcu_access_pointer(current->umcg_task_data) || !umcg_task)
+		return -EINVAL;
+
+	switch (flags) {
+	case UMCG_REGISTER_CORE_TASK:
+		if (group_id != UMCG_NOID)
+			return -EINVAL;
+		return register_core_task(api_version, umcg_task);
+	default:
+		return -EINVAL;
+	}
 }
 
 /**
@@ -67,7 +137,75 @@ SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
  */
 SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
 {
-	return -ENOSYS;
+	struct umcg_task_data *utd;
+	int ret = -EINVAL;
+
+	rcu_read_lock();
+	utd = rcu_dereference(current->umcg_task_data);
+
+	if (!utd || flags)
+		goto out;
+
+	task_lock(current);
+	rcu_assign_pointer(current->umcg_task_data, NULL);
+	task_unlock(current);
+
+	ret = 0;
+
+out:
+	rcu_read_unlock();
+	if (!ret && utd) {
+		synchronize_rcu();
+		kfree(utd);
+	}
+	return ret;
+}
+
+static int do_context_switch(struct task_struct *next)
+{
+	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
+
+	/*
+	 * It is important to set_current_state(TASK_INTERRUPTIBLE) before
+	 * waking @next, as @next may immediately try to wake current back
+	 * (e.g. current is a server, @next is a worker that immediately
+	 * blocks or waits), and this next wakeup must not be lost.
+	 */
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	WRITE_ONCE(utd->in_wait, true);
+
+	if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU))
+		return -EAGAIN;
+
+	freezable_schedule();
+
+	WRITE_ONCE(utd->in_wait, false);
+
+	if (signal_pending(current))
+		return -EINTR;
+
+	return 0;
+}
+
+static int do_wait(void)
+{
+	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
+
+	if (!utd)
+		return -EINVAL;
+
+	WRITE_ONCE(utd->in_wait, true);
+
+	set_current_state(TASK_INTERRUPTIBLE);
+	freezable_schedule();
+
+	WRITE_ONCE(utd->in_wait, false);
+
+	if (signal_pending(current))
+		return -EINTR;
+
+	return 0;
 }
 
 /**
@@ -90,7 +228,23 @@ SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
 SYSCALL_DEFINE2(umcg_wait, u32, flags,
 		const struct __kernel_timespec __user *, timeout)
 {
-	return -ENOSYS;
+	struct umcg_task_data *utd;
+
+	if (flags)
+		return -EINVAL;
+	if (timeout)
+		return -EOPNOTSUPP;
+
+	rcu_read_lock();
+	utd = rcu_dereference(current->umcg_task_data);
+	if (!utd) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	rcu_read_unlock();
+
+	return do_wait();
 }
 
 /**
@@ -110,7 +264,39 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
  */
 SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
 {
-	return -ENOSYS;
+	struct umcg_task_data *next_utd;
+	struct task_struct *next;
+	int ret = -EINVAL;
+
+	if (!next_tid)
+		return -EINVAL;
+	if (flags)
+		return -EINVAL;
+
+	next = find_get_task_by_vpid(next_tid);
+	if (!next)
+		return -ESRCH;
+
+	rcu_read_lock();
+	next_utd = rcu_dereference(next->umcg_task_data);
+	if (!next_utd)
+		goto out;
+
+	if (!READ_ONCE(next_utd->in_wait)) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	ret = wake_up_process(next);
+	put_task_struct(next);
+	if (ret)
+		ret = 0;
+	else
+		ret = -EAGAIN;
+
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 /**
@@ -139,5 +325,44 @@ SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
 SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
 		const struct __kernel_timespec __user *, timeout)
 {
-	return -ENOSYS;
+	struct umcg_task_data *curr_utd;
+	struct umcg_task_data *next_utd;
+	struct task_struct *next;
+	int ret = -EINVAL;
+
+	rcu_read_lock();
+	curr_utd = rcu_dereference(current->umcg_task_data);
+
+	if (!next_tid || wake_flags || wait_flags || !curr_utd)
+		goto out;
+
+	if (timeout) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+
+	next = find_get_task_by_vpid(next_tid);
+	if (!next) {
+		ret = -ESRCH;
+		goto out;
+	}
+
+	next_utd = rcu_dereference(next->umcg_task_data);
+	if (!next_utd) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!READ_ONCE(next_utd->in_wait)) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	rcu_read_unlock();
+
+	return do_context_switch(next);
+
+out:
+	rcu_read_unlock();
+	return ret;
 }
diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
new file mode 100644
index 000000000000..6791d570f622
--- /dev/null
+++ b/kernel/sched/umcg.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _KERNEL_SCHED_UMCG_H
+#define _KERNEL_SCHED_UMCG_H
+
+#ifdef CONFIG_UMCG
+
+#include <linux/sched.h>
+#include <linux/umcg.h>
+
+enum umcg_task_type {
+	UMCG_TT_CORE	= 1,
+	UMCG_TT_SERVER	= 2,
+	UMCG_TT_WORKER	= 3
+};
+
+struct umcg_task_data {
+	/* umcg_task != NULL. Never changes. */
+	struct umcg_task __user		*umcg_task;
+
+	/* The task that owns this umcg_task_data. Never changes. */
+	struct task_struct		*self;
+
+	/* Core task, server, or worker. Never changes. */
+	enum umcg_task_type		task_type;
+
+	/*
+	 * The API version used to register this task. If this is a
+	 * worker or a server, must equal group->api_version.
+	 *
+	 * Never changes.
+	 */
+	u32 api_version;
+
+	/*
+	 * Used by wait/wake routines to handle races. Written only by current.
+	 */
+	bool				in_wait;
+};
+
+#endif  /* CONFIG_UMCG */
+#endif  /* _KERNEL_SCHED_UMCG_H */
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 5/9] lib/umcg: implement UMCG core API for userspace
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (3 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 6/9] selftests/umcg: add UMCG core API selftest Peter Oskolkov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

UMCG (= User Managed Concurrency Groups) kernel API
is designed to be minimalistic and requires tightly
coupled userspace code to make it easy to use.

Add userspace UMCG core API to achieve this goal.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/lib/umcg/.gitignore |   4 +
 tools/lib/umcg/Makefile   |  11 ++
 tools/lib/umcg/libumcg.c  | 350 ++++++++++++++++++++++++++++++++++++++
 tools/lib/umcg/libumcg.h  | 154 +++++++++++++++++
 4 files changed, 519 insertions(+)
 create mode 100644 tools/lib/umcg/.gitignore
 create mode 100644 tools/lib/umcg/Makefile
 create mode 100644 tools/lib/umcg/libumcg.c
 create mode 100644 tools/lib/umcg/libumcg.h

diff --git a/tools/lib/umcg/.gitignore b/tools/lib/umcg/.gitignore
new file mode 100644
index 000000000000..ea55ae666041
--- /dev/null
+++ b/tools/lib/umcg/.gitignore
@@ -0,0 +1,4 @@
+PDX-License-Identifier: GPL-2.0-only
+libumcg.a
+libumcg.o
+
diff --git a/tools/lib/umcg/Makefile b/tools/lib/umcg/Makefile
new file mode 100644
index 000000000000..fa53fd5a851a
--- /dev/null
+++ b/tools/lib/umcg/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -g -I../../../usr/include/ -I../../include/
+
+libumcg.a: libumcg.o
+	ar rc libumcg.a libumcg.o
+
+libumcg.o: libumcg.c
+
+clean :
+	rm libumcg.a libumcg.o
diff --git a/tools/lib/umcg/libumcg.c b/tools/lib/umcg/libumcg.c
new file mode 100644
index 000000000000..b177fb1d4b17
--- /dev/null
+++ b/tools/lib/umcg/libumcg.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include "libumcg.h"
+
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <threads.h>
+
+/* UMCG API version supported by this library. */
+static const uint32_t umcg_api_version = 1;
+
+struct umcg_group {
+	uint32_t group_id;
+};
+
+/**
+ * struct umcg_task_tls - per thread struct used to identify/manage UMCG tasks
+ *
+ * Each UMCG task requires an instance of struct umcg_task passed to
+ * sys_umcg_register. This struct contains it, as well as several additional
+ * fields.
+ */
+struct umcg_task_tls {
+	struct umcg_task	umcg_task;
+	umcg_tid		self;
+	intptr_t		tag;
+	pid_t			tid;
+
+} __attribute((aligned(4 * sizeof(uint64_t))));
+
+static thread_local struct umcg_task_tls *umcg_task_tls;
+
+umcg_tid umcg_get_utid(void)
+{
+	return (umcg_tid)&umcg_task_tls;
+}
+
+static umcg_tid umcg_task_to_utid(struct umcg_task *ut)
+{
+	if (!ut)
+		return UMCG_NONE;
+
+	return ((struct umcg_task_tls *)ut)->self;
+}
+
+static struct umcg_task_tls *utid_to_utls(umcg_tid utid)
+{
+	if (!utid || !*(struct umcg_task_tls **)utid) {
+		fprintf(stderr, "utid_to_utls: NULL\n");
+		/* Kill the process rather than corrupt memory. */
+		raise(SIGKILL);
+		return NULL;
+	}
+	return *(struct umcg_task_tls **)utid;
+}
+
+void umcg_set_task_tag(umcg_tid utid, intptr_t tag)
+{
+	utid_to_utls(utid)->tag = tag;
+}
+
+intptr_t umcg_get_task_tag(umcg_tid utid)
+{
+	return utid_to_utls(utid)->tag;
+}
+
+umcg_tid umcg_register_core_task(intptr_t tag)
+{
+	int ret;
+
+	if (umcg_task_tls != NULL) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	umcg_task_tls = malloc(sizeof(struct umcg_task_tls));
+	if (!umcg_task_tls) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+
+	umcg_task_tls->umcg_task.state = UMCG_TASK_NONE;
+	umcg_task_tls->self = (umcg_tid)&umcg_task_tls;
+	umcg_task_tls->tag = tag;
+	umcg_task_tls->tid = gettid();
+
+	ret = sys_umcg_register_task(umcg_api_version, UMCG_REGISTER_CORE_TASK,
+			UMCG_NOID, &umcg_task_tls->umcg_task);
+	if (ret) {
+		free(umcg_task_tls);
+		umcg_task_tls = NULL;
+		errno = ret;
+		return UMCG_NONE;
+	}
+
+	return umcg_task_tls->self;
+}
+
+int umcg_unregister_task(void)
+{
+	int ret;
+
+	if (!umcg_task_tls) {
+		errno = EINVAL;
+		return -1;
+	}
+
+	ret = sys_umcg_unregister_task(0);
+	if (ret) {
+		errno = ret;
+		return -1;
+	}
+
+	free(umcg_task_tls);
+	atomic_store_explicit(&umcg_task_tls, NULL, memory_order_seq_cst);
+	return 0;
+}
+
+/* Helper return codes. */
+enum umcg_prepare_op_result {
+	UMCG_OP_DONE,
+	UMCG_OP_SYS,
+	UMCG_OP_AGAIN,
+	UMCG_OP_ERROR
+};
+
+static enum umcg_prepare_op_result umcg_prepare_wait(void)
+{
+	struct umcg_task *ut;
+	uint32_t umcg_state;
+	int ret;
+
+	if (!umcg_task_tls) {
+		errno = EINVAL;
+		return UMCG_OP_ERROR;
+	}
+
+	ut = &umcg_task_tls->umcg_task;
+
+	umcg_state = UMCG_TASK_RUNNING;
+	if (atomic_compare_exchange_strong_explicit(&ut->state,
+			&umcg_state, UMCG_TASK_RUNNABLE,
+			memory_order_seq_cst, memory_order_seq_cst))
+		return UMCG_OP_SYS;
+
+	if (umcg_state != (UMCG_TASK_RUNNING | UMCG_TF_WAKEUP_QUEUED)) {
+		fprintf(stderr, "libumcg: unexpected state before wait: %u\n",
+				umcg_state);
+		errno = EINVAL;
+		return UMCG_OP_ERROR;
+	}
+
+	if (atomic_compare_exchange_strong_explicit(&ut->state,
+			&umcg_state, UMCG_TASK_RUNNING,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		return UMCG_OP_DONE;
+	}
+
+	/* Raced with another wait/wake? This is not supported. */
+	fprintf(stderr, "libumcg: failed to remove the wakeup flag: %u\n",
+			umcg_state);
+	errno = EINVAL;
+	return UMCG_OP_ERROR;
+}
+
+static int umcg_do_wait(const struct timespec *timeout)
+{
+	uint32_t umcg_state;
+	int ret;
+
+	do {
+		ret = sys_umcg_wait(0, timeout);
+		if (ret != 0 && errno != EAGAIN)
+			return ret;
+
+		umcg_state = atomic_load_explicit(
+				&umcg_task_tls->umcg_task.state,
+				memory_order_acquire);
+	} while (umcg_state == UMCG_TASK_RUNNABLE);
+
+	return 0;
+}
+
+int umcg_wait(const struct timespec *timeout)
+{
+	switch (umcg_prepare_wait()) {
+	case UMCG_OP_DONE:
+		return 0;
+	case UMCG_OP_SYS:
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	default:
+		fprintf(stderr, "Unknown pre_op result.\n");
+		exit(1);
+		return -1;
+	}
+
+	return umcg_do_wait(timeout);
+}
+
+static enum umcg_prepare_op_result umcg_prepare_wake(struct umcg_task_tls *utls)
+{
+	struct umcg_task *ut = &utls->umcg_task;
+	uint32_t umcg_state, next_state;
+
+	next_state = UMCG_TASK_RUNNING;
+	umcg_state = UMCG_TASK_RUNNABLE;
+	if (atomic_compare_exchange_strong_explicit(&ut->state,
+			&umcg_state, next_state,
+			memory_order_seq_cst, memory_order_seq_cst))
+		return UMCG_OP_SYS;
+
+	if (umcg_state != UMCG_TASK_RUNNING) {
+		if (umcg_state == (UMCG_TASK_RUNNING | UMCG_TF_WAKEUP_QUEUED)) {
+			/*
+			 * With ping-pong mutual swapping using wake/wait
+			 * without synchronization this can happen.
+			 */
+			return UMCG_OP_AGAIN;
+		}
+		fprintf(stderr, "libumcg: unexpected state in umcg_wake(): %u\n",
+				umcg_state);
+		errno = EINVAL;
+		return UMCG_OP_ERROR;
+	}
+
+	if (atomic_compare_exchange_strong_explicit(&ut->state,
+			&umcg_state, UMCG_TASK_RUNNING | UMCG_TF_WAKEUP_QUEUED,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		return UMCG_OP_DONE;
+	}
+
+	if (umcg_state != UMCG_TASK_RUNNABLE) {
+		fprintf(stderr, "libumcg: unexpected state in umcg_wake (1): %u\n",
+				umcg_state);
+		errno = EINVAL;
+		return UMCG_OP_ERROR;
+	}
+
+	return UMCG_OP_AGAIN;
+}
+
+static int umcg_do_wake_or_swap(struct umcg_task_tls *next_utls,
+				uint64_t prev_wait_counter, bool should_wait,
+				const struct timespec *timeout)
+{
+	int ret;
+
+again:
+
+	if (should_wait)
+		ret = sys_umcg_swap(0, next_utls->tid, 0, timeout);
+	else
+		ret = sys_umcg_wake(0, next_utls->tid);
+
+	if (ret && errno == EAGAIN)
+		goto again;
+
+	return ret;
+}
+
+int umcg_wake(umcg_tid next)
+{
+	struct umcg_task_tls *utls = *(struct umcg_task_tls **)next;
+	uint64_t prev_wait_counter;
+
+	if (!utls) {
+		errno = EINVAL;
+		return -1;
+	}
+
+again:
+	switch (umcg_prepare_wake(utls)) {
+	case UMCG_OP_DONE:
+		return 0;
+	case UMCG_OP_SYS:
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	case UMCG_OP_AGAIN:
+		goto again;
+	default:
+		fprintf(stderr, "libumcg: unknown pre_op result.\n");
+		exit(1);
+		return -1;
+	}
+
+	return umcg_do_wake_or_swap(utls, prev_wait_counter, false, NULL);
+}
+
+int umcg_swap(umcg_tid next, const struct timespec *timeout)
+{
+	struct umcg_task_tls *utls = *(struct umcg_task_tls **)next;
+	bool should_wake, should_wait;
+	uint64_t prev_wait_counter;
+	int ret;
+
+	if (!utls) {
+		errno = EINVAL;
+		return -1;
+	}
+
+again:
+	switch (umcg_prepare_wake(utls)) {
+	case UMCG_OP_DONE:
+		should_wake = false;
+		break;
+	case UMCG_OP_SYS:
+		should_wake = true;
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	case UMCG_OP_AGAIN:
+		goto again;
+	default:
+		fprintf(stderr, "lubumcg: unknown pre_op result.\n");
+		exit(1);
+		return -1;
+	}
+
+	switch (umcg_prepare_wait()) {
+	case UMCG_OP_DONE:
+		should_wait = false;
+		break;
+	case UMCG_OP_SYS:
+		should_wait = true;
+		break;
+	case UMCG_OP_ERROR:
+		return -1;
+	default:
+		fprintf(stderr, "lubumcg: unknown pre_op result.\n");
+		exit(1);
+		return -1;
+	}
+
+	if (should_wake)
+		return umcg_do_wake_or_swap(utls, prev_wait_counter,
+				should_wait, timeout);
+
+	if (should_wait)
+		return umcg_do_wait(timeout);
+
+	return 0;
+}
diff --git a/tools/lib/umcg/libumcg.h b/tools/lib/umcg/libumcg.h
new file mode 100644
index 000000000000..31ef786d1965
--- /dev/null
+++ b/tools/lib/umcg/libumcg.h
@@ -0,0 +1,154 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LIBUMCG_H
+#define __LIBUMCG_H
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <limits.h>
+#include <unistd.h>
+#include <linux/types.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <syscall.h>
+#include <time.h>
+
+#include <linux/umcg.h>
+
+static int sys_umcg_api_version(uint32_t requested_api_version, uint32_t flags)
+{
+	return syscall(__NR_umcg_api_version, requested_api_version, flags);
+}
+
+static int sys_umcg_register_task(uint32_t api_version, uint32_t flags,
+		uint32_t group_id, struct umcg_task *umcg_task)
+{
+	return syscall(__NR_umcg_register_task, api_version, flags, group_id,
+			umcg_task);
+}
+
+static int sys_umcg_unregister_task(uint32_t flags)
+{
+	return syscall(__NR_umcg_unregister_task, flags);
+}
+
+static int sys_umcg_wait(uint32_t flags, const struct timespec *timeout)
+{
+	return syscall(__NR_umcg_wait, flags, timeout);
+}
+
+static int sys_umcg_wake(uint32_t flags, uint32_t next_tid)
+{
+	return syscall(__NR_umcg_wake, flags, next_tid);
+}
+
+static int sys_umcg_swap(uint32_t wake_flags, uint32_t next_tid,
+		uint32_t wait_flags, const struct timespec *timeout)
+{
+	return syscall(__NR_umcg_swap, wake_flags, next_tid,
+			wait_flags, timeout);
+}
+
+typedef intptr_t umcg_tid; /* UMCG thread ID. */
+
+#define UMCG_NONE	(0)
+
+/**
+ * umcg_get_utid - return the UMCG ID of the current thread.
+ *
+ * The function always succeeds, and the returned ID is guaranteed to be
+ * stable over the life of the thread (and multiple
+ * umcg_register/umcg_unregister calls).
+ *
+ * The ID is NOT guaranteed to be unique over the life of the process.
+ */
+umcg_tid umcg_get_utid(void);
+
+/**
+ * umcg_set_task_tag - add an arbitrary tag to a registered UMCG task.
+ *
+ * Note: non-thread-safe: the user is responsible for proper memory fencing.
+ */
+void umcg_set_task_tag(umcg_tid utid, intptr_t tag);
+
+/*
+ * umcg_get_task_tag - get the task tag. Returns zero if none set.
+ *
+ * Note: non-thread-safe: the user is responsible for proper memory fencing.
+ */
+intptr_t umcg_get_task_tag(umcg_tid utid);
+
+/**
+ * umcg_register_core_task - register the current thread as a UMCG core task
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
+ *                 to match the value returned by umcg_get_utid).
+ */
+umcg_tid umcg_register_core_task(intptr_t tag);
+
+/**
+ * umcg_unregister_task - unregister the current thread
+ *
+ * Return:
+ * 0              - OK
+ * -1             - the current thread is not a UMCG thread
+ */
+int umcg_unregister_task(void);
+
+/**
+ * umcg_wait - block the current thread
+ * @timeout:   absolute timeout (not supported at the moment)
+ *
+ * Blocks the current thread, which must have been registered via umcg_register,
+ * until it is waken via umcg_wake or swapped into via umcg_swap. If the current
+ * thread has a wakeup queued (see umcg_wake), returns zero immediately,
+ * consuming the wakeup.
+ *
+ * Return:
+ * 0         - OK, the thread was waken;
+ * -1        - did not wake normally;
+ *               errno:
+ *                 EINTR: interrupted
+ *                 EINVAL: some other error occurred
+ */
+int umcg_wait(const struct timespec *timeout);
+
+/**
+ * umcg_wake - wake @next
+ * @next:      ID of the thread to wake (IDs are returned by umcg_register).
+ *
+ * If @next is blocked via umcg_wait, or umcg_swap, wake it. If @next is
+ * running, queue the wakeup, so that a future block of @next will consume
+ * the wakeup but will not block.
+ *
+ * umcg_wake is non-blocking, but may retry a few times to make sure @next
+ * has indeed woken.
+ *
+ * umcg_wake can queue at most one wakeup; if @next has a wakeup queued,
+ * an error is returned.
+ *
+ * Return:
+ * 0         - OK, @next has woken, or a wakeup has been queued;
+ * -1        - an error occurred.
+ */
+int umcg_wake(umcg_tid next);
+
+/**
+ * umcg_swap - wake @next, put the current thread to sleep
+ * @next:      ID of the thread to wake
+ * @timeout:   absolute timeout (not supported at the moment)
+ *
+ * umcg_swap is semantically equivalent to
+ *
+ *     int ret = umcg_wake(next);
+ *     if (ret)
+ *             return ret;
+ *     return umcg_wait(timeout);
+ *
+ * but may do a synchronous context switch into @next on the current CPU.
+ */
+int umcg_swap(umcg_tid next, const struct timespec *timeout);
+
+#endif  /* __LIBUMCG_H */
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 6/9] selftests/umcg: add UMCG core API selftest
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (4 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 5/9] lib/umcg: implement UMCG core API for userspace Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC) Peter Oskolkov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Add UMCG core API selftests. In particular, test that
umcg_wait/umcg_wake/umcg_swap behave correctly when racing
with each other.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/umcg/.gitignore       |   2 +
 tools/testing/selftests/umcg/Makefile         |  13 +
 tools/testing/selftests/umcg/umcg_core_test.c | 347 ++++++++++++++++++
 3 files changed, 362 insertions(+)
 create mode 100644 tools/testing/selftests/umcg/.gitignore
 create mode 100644 tools/testing/selftests/umcg/Makefile
 create mode 100644 tools/testing/selftests/umcg/umcg_core_test.c

diff --git a/tools/testing/selftests/umcg/.gitignore b/tools/testing/selftests/umcg/.gitignore
new file mode 100644
index 000000000000..89cca24e5907
--- /dev/null
+++ b/tools/testing/selftests/umcg/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+umcg_core_test
diff --git a/tools/testing/selftests/umcg/Makefile b/tools/testing/selftests/umcg/Makefile
new file mode 100644
index 000000000000..b151098e2ed1
--- /dev/null
+++ b/tools/testing/selftests/umcg/Makefile
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+TOOLSDIR := $(abspath ../../..)
+LIBUMCGDIR := $(TOOLSDIR)/lib/umcg
+
+CFLAGS += -g -O0 -I$(LIBUMCGDIR) -I$(TOOLSDIR)/include/ -I../../../../usr/include/
+LDLIBS += -lpthread -static
+
+TEST_GEN_PROGS := umcg_core_test
+
+include ../lib.mk
+
+$(OUTPUT)/umcg_core_test: umcg_core_test.c $(LIBUMCGDIR)/libumcg.c
diff --git a/tools/testing/selftests/umcg/umcg_core_test.c b/tools/testing/selftests/umcg/umcg_core_test.c
new file mode 100644
index 000000000000..4dc20131ace7
--- /dev/null
+++ b/tools/testing/selftests/umcg/umcg_core_test.c
@@ -0,0 +1,347 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include "libumcg.h"
+
+#include <pthread.h>
+#include <stdatomic.h>
+
+#include "../kselftest_harness.h"
+
+#define CHECK_CONFIG()						\
+{								\
+	int ret = sys_umcg_api_version(1, 0);			\
+	if (ret == -1 && errno == ENOSYS)			\
+		SKIP(return, "CONFIG_UMCG not set");	\
+}
+
+TEST(umcg_api_version) {
+	CHECK_CONFIG();
+	ASSERT_EQ(0, sys_umcg_api_version(1, 0));
+	ASSERT_EQ(1, sys_umcg_api_version(1234, 0));
+}
+
+/* Test that forked children of UMCG enabled tasks are not UMCG enabled. */
+TEST(register_and_fork) {
+	CHECK_CONFIG();
+	pid_t pid;
+	int wstatus;
+	umcg_tid utid;
+
+	/* umcg_unregister should fail without registering earlier. */
+	ASSERT_NE(0, umcg_unregister_task());
+
+	utid = umcg_register_core_task(0);
+	ASSERT_TRUE(utid != UMCG_NONE);
+
+	pid = fork();
+	if (pid == 0) {
+		/* This is child. umcg_unregister_task() should fail. */
+		if (!umcg_unregister_task()) {
+			fprintf(stderr, "umcg_unregister_task() succeeded in "
+					"the forked child.\n");
+			exit(1);
+		}
+		exit(0);
+	}
+
+	ASSERT_EQ(pid, waitpid(pid, &wstatus, 0));
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(0, WEXITSTATUS(wstatus));
+	ASSERT_EQ(0, umcg_unregister_task());
+}
+
+struct test_waiter_args {
+	umcg_tid	utid;
+	bool		stop;
+	bool		waiting;
+};
+
+/* Thread FN for the test waiter: calls umcg_wait() in a loop until stopped. */
+static void *test_waiter_threadfn(void *arg)
+{
+	struct test_waiter_args *args = (struct test_waiter_args *)arg;
+	uint64_t counter = 0;
+
+	atomic_store_explicit(&args->utid, umcg_register_core_task(0),
+			memory_order_relaxed);
+	if (!args->utid) {
+		fprintf(stderr, "umcg_register_core_task failed: %d.\n", errno);
+		exit(1);
+	}
+
+	while (!atomic_load_explicit(&args->stop, memory_order_seq_cst)) {
+		bool expected = false;
+
+		if (!atomic_compare_exchange_strong_explicit(&args->waiting,
+					&expected, true,
+					memory_order_seq_cst,
+					memory_order_seq_cst)) {
+			fprintf(stderr, "Failed to set waiting flag.\n");
+			exit(1);
+		}
+
+		++counter;
+		if (counter % 5 == 0)
+			usleep(1);  /* Trigger a race with ucmg_wake(). */
+
+		if (umcg_wait(NULL)) {
+			fprintf(stderr, "umcg_wait failed: %d.\n", errno);
+			exit(1);
+		}
+	}
+
+	if (umcg_unregister_task()) {
+		fprintf(stderr, "umcg_register_core_task failed: %d.\n", errno);
+		exit(1);
+	}
+
+	return (void *)counter;
+}
+
+/* Test wake/wait pair racing with each other. */
+TEST(umcg_wake_wait) {
+	CHECK_CONFIG();
+	struct test_waiter_args args;
+	const int steps = 10000;
+	bool expected = true;
+	void *result;
+	pthread_t t;
+	int ret;
+
+	args.utid = UMCG_NONE;
+	args.stop = false;
+	args.waiting = false;
+
+	ASSERT_EQ(0, pthread_create(&t, NULL, &test_waiter_threadfn, &args));
+
+	while (!atomic_load_explicit(&args.utid, memory_order_relaxed))
+		;
+
+	for (int step = 0; step < steps; ++step) {
+		/* Spin until the waiter indicates it is going to wait. */
+		while (!atomic_compare_exchange_weak_explicit(&args.waiting,
+					&expected, false,
+					memory_order_seq_cst,
+					memory_order_seq_cst)) {
+			expected = true;
+		}
+
+		ASSERT_EQ(0, umcg_wake(args.utid));
+	}
+
+	/* Carefully shut down. */
+	expected = true;
+	while (!atomic_compare_exchange_weak_explicit(&args.waiting, &expected,
+			false, memory_order_seq_cst, memory_order_seq_cst)) {
+		expected = true;
+	}
+	atomic_store_explicit(&args.stop, true, memory_order_seq_cst);
+	ret = umcg_wake(args.utid);
+
+	/* If the worker immediately exits upon wake, we may get ESRCH. */
+	ASSERT_TRUE((ret == 0) || (errno == ESRCH));
+
+	ASSERT_EQ(0, pthread_join(t, &result));
+	ASSERT_EQ(steps + 1, (uint64_t)result);
+}
+
+struct test_ping_pong_args {
+	bool		ping;  /* Is this worker doing pings or pongs? */
+	umcg_tid	utid_self;
+	umcg_tid	utid_peer;
+	int		steps;
+	bool		use_swap;  /* Use umcg_swap or wake/wait. */
+	bool		payload;   /* call gettid() if true at each iteration. */
+
+	/*
+	 * It is not allowed to wake a task that has a wakeup queued, so
+	 * normally the test "softly" synchronizes ping and pong tasks so
+	 * that pong calls umcg_wait() to wait for the first ping.
+	 *
+	 * However, it is allowed to do mutual umcg_swap(), so in the
+	 * test flavor when both ping and pong tasks use swaps we also
+	 * run the test without pong waiting for the initial ping.
+	 */
+	bool		pong_waits;
+};
+
+/* Thread FN for ping-pong workers. */
+static void *test_ping_pong_threadfn(void *arg)
+{
+	struct test_ping_pong_args *args = (struct test_ping_pong_args *)arg;
+	struct timespec start, stop;
+	int counter;
+
+	atomic_store_explicit(&args->utid_self, umcg_register_core_task(0),
+			memory_order_relaxed);
+	if (!args->utid_self) {
+		fprintf(stderr, "umcg_register_core_task failed: %d.\n", errno);
+		exit(1);
+	}
+
+	while (!atomic_load_explicit(&args->utid_peer, memory_order_acquire))
+		;
+
+	if (args->pong_waits && !args->ping) {
+		/* This is pong: we sleep first. */
+		if (umcg_wait(NULL)) {
+			fprintf(stderr, "umcg_wait failed: %d.\n", errno);
+			exit(1);
+		}
+	}
+
+	if (args->ping) {  /* The "ping" measures the running time. */
+		if (clock_gettime(CLOCK_MONOTONIC, &start)) {
+			fprintf(stderr, "clock_gettime() failed.\n");
+			exit(1);
+		}
+	}
+
+	for (counter = 0; counter < args->steps; ++counter) {
+		int ret;
+
+		if (args->payload)
+			gettid();
+
+		if (args->use_swap) {
+			ret = umcg_swap(args->utid_peer, NULL);
+		} else {
+			ret = umcg_wake(args->utid_peer);
+			if (!ret)
+				ret = umcg_wait(NULL);
+		}
+
+		if (ret) {
+			if (args->use_swap)
+				fprintf(stderr, "umcg_swap failed: %d.\n", errno);
+			else
+				fprintf(stderr, "umcg_wake/wait failed: %d.\n", errno);
+			exit(1);
+		}
+	}
+
+	if (args->ping) {
+		uint64_t duration;
+
+		if (clock_gettime(CLOCK_MONOTONIC, &stop)) {
+			fprintf(stderr, "clock_gettime() failed.\n");
+			exit(1);
+		}
+
+		duration = (stop.tv_sec - start.tv_sec) * 1000000000LL +
+		stop.tv_nsec - start.tv_nsec;
+		printf("completed %d ping-pong iterations in %lu ns: "
+				"%lu ns per context switch\n",
+			args->steps, duration, duration / (args->steps * 2));
+	}
+
+	if (args->pong_waits && args->ping) {
+		/* This is ping: we wake pong at the end. */
+		if (umcg_wake(args->utid_peer)) {
+			fprintf(stderr, "umcg_wake failed: %d.\n", errno);
+			exit(1);
+		}
+	}
+
+	if (umcg_unregister_task()) {
+		fprintf(stderr, "umcg_unregister_task failed: %d.\n", errno);
+		exit(1);
+	}
+
+	return NULL;
+}
+
+enum ping_pong_flavor {
+	NO_SWAPS,	/* Use wake/wait pairs on both sides. */
+	ONE_SWAP,	/* Use wake/wait on one side and swap on the other. */
+	ALL_SWAPS	/* Use swaps on both sides. */
+};
+
+static void test_ping_pong_flavored(enum ping_pong_flavor flavor,
+		bool pong_waits, bool payload)
+{
+	struct test_ping_pong_args ping, pong;
+	pthread_t ping_t, pong_t;
+	const int STEPS = 100000;
+
+	ping.ping = true;
+	ping.utid_self = UMCG_NONE;
+	ping.utid_peer = UMCG_NONE;
+	ping.steps = STEPS;
+	ping.pong_waits = pong_waits;
+	ping.payload = payload;
+
+	pong.ping = false;
+	pong.utid_self = UMCG_NONE;
+	pong.utid_peer = UMCG_NONE;
+	pong.steps = STEPS;
+	pong.pong_waits = pong_waits;
+	pong.payload = payload;
+
+	switch (flavor) {
+	case NO_SWAPS:
+		ping.use_swap = false;
+		pong.use_swap = false;
+		break;
+	case ONE_SWAP:
+		ping.use_swap = true;
+		pong.use_swap = false;
+		break;
+	case ALL_SWAPS:
+		ping.use_swap = true;
+		pong.use_swap = true;
+		break;
+	default:
+		fprintf(stderr, "Unknown ping/pong flavor.\n");
+		exit(1);
+	}
+
+	if (pthread_create(&ping_t, NULL, &test_ping_pong_threadfn, &ping)) {
+		fprintf(stderr, "pthread_create(ping) failed.\n");
+		exit(1);
+	}
+
+	while (!atomic_load_explicit(&ping.utid_self, memory_order_relaxed))
+		;
+	pong.utid_peer = ping.utid_self;
+
+	if (pthread_create(&pong_t, NULL, &test_ping_pong_threadfn, &pong)) {
+		fprintf(stderr, "pthread_create(pong) failed.\n");
+		exit(1);
+	}
+
+	while (!atomic_load_explicit(&pong.utid_self, memory_order_relaxed))
+		;
+	atomic_store_explicit(&ping.utid_peer, pong.utid_self,
+			memory_order_relaxed);
+
+	pthread_join(ping_t, NULL);
+	pthread_join(pong_t, NULL);
+}
+
+TEST(umcg_ping_pong_no_swaps_nop) {
+	CHECK_CONFIG();
+	test_ping_pong_flavored(NO_SWAPS, true, false);
+}
+TEST(umcg_ping_pong_one_swap_nop) {
+	CHECK_CONFIG();
+	test_ping_pong_flavored(ONE_SWAP, true, false);
+}
+TEST(umcg_ping_pong_all_swaps_nop) {
+	CHECK_CONFIG();
+	test_ping_pong_flavored(ALL_SWAPS, true, false);
+}
+TEST(umcg_ping_pong_all_swaps_loose_nop) {
+	CHECK_CONFIG();
+	test_ping_pong_flavored(ALL_SWAPS, false, false);
+}
+TEST(umcg_ping_pong_no_swaps_payload) {
+	CHECK_CONFIG();
+	test_ping_pong_flavored(NO_SWAPS, true, true);
+}
+TEST(umcg_ping_pong_all_swaps_payload) {
+	CHECK_CONFIG();
+	test_ping_pong_flavored(ALL_SWAPS, true, true);
+}
+
+TEST_HARNESS_MAIN
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC)
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (5 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 6/9] selftests/umcg: add UMCG core API selftest Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-21 20:17   ` Andrei Vagin
  2021-05-20 18:36 ` [RFC PATCH v0.1 8/9] lib/umcg: " Peter Oskolkov
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Implement UMCG server/worker API.

This is an early RFC patch - the code seems working, but
more testing is needed. Gaps I plan to address before this
is ready for a detailed review:

- preemption/interrupt handling;
- better documentation/comments;
- tracing;
- additional testing;
- corner cases like abnormal process/task termination;
- in some cases where I kill the task (umcg_segv), returning
an error may be more appropriate.

All in all, please focus more on the high-level approach
and less on things like variable names, (doc) comments, or indentation.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/linux/mm_types.h |   5 +
 include/linux/syscalls.h |   5 +
 kernel/fork.c            |  11 +
 kernel/sched/core.c      |  11 +
 kernel/sched/umcg.c      | 764 ++++++++++++++++++++++++++++++++++++++-
 kernel/sched/umcg.h      |  54 +++
 mm/init-mm.c             |   4 +
 7 files changed, 845 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..5ca7b7d55775 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -562,6 +562,11 @@ struct mm_struct {
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
 #endif
+
+#ifdef CONFIG_UMCG
+	spinlock_t umcg_lock;
+	struct list_head umcg_groups;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 15de3e34ccee..2781659daaf1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1059,6 +1059,11 @@ asmlinkage long umcg_wait(u32 flags, const struct __kernel_timespec __user *time
 asmlinkage long umcg_wake(u32 flags, u32 next_tid);
 asmlinkage long umcg_swap(u32 wake_flags, u32 next_tid, u32 wait_flags,
 				const struct __kernel_timespec __user *timeout);
+asmlinkage long umcg_create_group(u32 api_version, u64, flags);
+asmlinkage long umcg_destroy_group(u32 group_id);
+asmlinkage long umcg_poll_worker(u32 flags, struct umcg_task __user **ut);
+asmlinkage long umcg_run_worker(u32 flags, u32 worker_tid,
+		struct umcg_task __user **ut);
 
 /*
  * Architecture-specific system calls
diff --git a/kernel/fork.c b/kernel/fork.c
index ace4631b5b54..3a2a7950df8e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1026,6 +1026,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
+#ifdef CONFIG_UMCG
+	spin_lock_init(&mm->umcg_lock);
+	INIT_LIST_HEAD(&mm->umcg_groups);
+#endif
 	mm->core_state = NULL;
 	mm_pgtables_bytes_init(mm);
 	mm->map_count = 0;
@@ -1102,6 +1106,13 @@ static inline void __mmput(struct mm_struct *mm)
 		list_del(&mm->mmlist);
 		spin_unlock(&mmlist_lock);
 	}
+#ifdef CONFIG_UMCG
+	if (!list_empty(&mm->umcg_groups)) {
+		spin_lock(&mm->umcg_lock);
+		list_del(&mm->umcg_groups);
+		spin_unlock(&mm->umcg_lock);
+	}
+#endif
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	mmdrop(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 462104f13c28..e657a35655b1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@
 
 #include "pelt.h"
 #include "smp.h"
+#include "umcg.h"
 
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
@@ -6012,10 +6013,20 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	 */
 	if (blk_needs_flush_plug(tsk))
 		blk_schedule_flush_plug(tsk);
+
+#ifdef CONFIG_UMCG
+	if (rcu_access_pointer(tsk->umcg_task_data))
+		umcg_on_block();
+#endif
 }
 
 static void sched_update_worker(struct task_struct *tsk)
 {
+#ifdef CONFIG_UMCG
+	if (rcu_access_pointer(tsk->umcg_task_data))
+		umcg_on_wake();
+#endif
+
 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
index 2d718433c773..38cba772322d 100644
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -21,6 +21,12 @@ static int __api_version(u32 requested)
 	return 1;
 }
 
+static int umcg_segv(int res)
+{
+	force_sig(SIGSEGV);
+	return res;
+}
+
 /**
  * sys_umcg_api_version - query UMCG API versions that are supported.
  * @api_version:          Requested API version.
@@ -54,6 +60,78 @@ static int put_state(struct umcg_task __user *ut, u32 state)
 	return put_user(state, (u32 __user *)ut);
 }
 
+static void umcg_lock_pair(struct task_struct *server,
+		struct task_struct *worker)
+{
+	spin_lock(&server->alloc_lock);
+	spin_lock_nested(&worker->alloc_lock, SINGLE_DEPTH_NESTING);
+}
+
+static void umcg_unlock_pair(struct task_struct *server,
+		struct task_struct *worker)
+{
+	spin_unlock(&worker->alloc_lock);
+	spin_unlock(&server->alloc_lock);
+}
+
+static void umcg_detach_peer(void)
+{
+	struct task_struct *server, *worker;
+	struct umcg_task_data *utd;
+
+	rcu_read_lock();
+	task_lock(current);
+	utd = rcu_dereference(current->umcg_task_data);
+
+	if (!utd || !rcu_dereference(utd->peer)) {
+		task_unlock(current);
+		goto out;
+	}
+
+	switch (utd->task_type) {
+	case UMCG_TT_SERVER:
+		server = current;
+		worker = rcu_dereference(utd->peer);
+		break;
+
+	case UMCG_TT_WORKER:
+		worker = current;
+		server = rcu_dereference(utd->peer);
+		break;
+
+	default:
+		task_unlock(current);
+		printk(KERN_WARNING "umcg_detach_peer: unexpected task type");
+		umcg_segv(0);
+		goto out;
+	}
+	task_unlock(current);
+
+	if (!server || !worker)
+		goto out;
+
+	umcg_lock_pair(server, worker);
+
+	utd = rcu_dereference(server->umcg_task_data);
+	if (WARN_ON(!utd)) {
+		umcg_segv(0);
+		goto out_pair;
+	}
+	rcu_assign_pointer(utd->peer, NULL);
+
+	utd = rcu_dereference(worker->umcg_task_data);
+	if (WARN_ON(!utd)) {
+		umcg_segv(0);
+		goto out_pair;
+	}
+	rcu_assign_pointer(utd->peer, NULL);
+
+out_pair:
+	umcg_unlock_pair(server, worker);
+out:
+	rcu_read_unlock();
+}
+
 static int register_core_task(u32 api_version, struct umcg_task __user *umcg_task)
 {
 	struct umcg_task_data *utd;
@@ -73,6 +151,7 @@ static int register_core_task(u32 api_version, struct umcg_task __user *umcg_tas
 	utd->umcg_task = umcg_task;
 	utd->task_type = UMCG_TT_CORE;
 	utd->api_version = api_version;
+	RCU_INIT_POINTER(utd->peer, NULL);
 
 	if (put_state(umcg_task, UMCG_TASK_RUNNING)) {
 		kfree(utd);
@@ -86,6 +165,105 @@ static int register_core_task(u32 api_version, struct umcg_task __user *umcg_tas
 	return 0;
 }
 
+static int add_task_to_group(u32 api_version, u32 group_id,
+		struct umcg_task __user *umcg_task,
+		enum umcg_task_type task_type, u32 new_state)
+{
+	struct mm_struct *mm = current->mm;
+	struct umcg_task_data *utd = NULL;
+	struct umcg_group *group = NULL;
+	struct umcg_group *list_entry;
+	int ret = -EINVAL;
+	u32 state;
+
+	if (get_state(umcg_task, &state))
+		return -EFAULT;
+
+	if (state != UMCG_TASK_NONE)
+		return -EINVAL;
+
+	if (put_state(umcg_task, new_state))
+		return -EFAULT;
+
+retry_once:
+	rcu_read_lock();
+	list_for_each_entry_rcu(list_entry, &mm->umcg_groups, list) {
+		if (list_entry->group_id == group_id) {
+			group = list_entry;
+			break;
+		}
+	}
+
+	if (!group || group->api_version != api_version)
+		goto out_rcu;
+
+	spin_lock(&group->lock);
+	if (group->nr_tasks < 0)  /* The groups is being destroyed. */
+		goto out_group;
+
+	if (!utd) {
+		utd = kzalloc(sizeof(struct umcg_task_data), GFP_NOWAIT);
+		if (!utd) {
+			spin_unlock(&group->lock);
+			rcu_read_unlock();
+
+			utd = kzalloc(sizeof(struct umcg_task_data), GFP_KERNEL);
+			if (!utd) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			goto retry_once;
+		}
+	}
+
+	utd->self = current;
+	utd->group = group;
+	utd->umcg_task = umcg_task;
+	utd->task_type = task_type;
+	utd->api_version = api_version;
+	RCU_INIT_POINTER(utd->peer, NULL);
+
+	INIT_LIST_HEAD(&utd->list);
+	group->nr_tasks++;
+
+	task_lock(current);
+	rcu_assign_pointer(current->umcg_task_data, utd);
+	task_unlock(current);
+
+	ret = 0;
+
+out_group:
+	spin_unlock(&group->lock);
+
+out_rcu:
+	rcu_read_unlock();
+	if (ret && utd)
+		kfree(utd);
+
+out:
+	if (ret)
+		put_state(umcg_task, UMCG_TASK_NONE);
+	else
+		schedule();  /* Trigger umcg_on_wake(). */
+
+	return ret;
+}
+
+static int register_worker(u32 api_version, u32 group_id,
+		struct umcg_task __user *umcg_task)
+{
+	return add_task_to_group(api_version, group_id, umcg_task,
+				UMCG_TT_WORKER, UMCG_TASK_UNBLOCKED);
+}
+
+static int register_server(u32 api_version, u32 group_id,
+		struct umcg_task __user *umcg_task)
+{
+	return add_task_to_group(api_version, group_id, umcg_task,
+				UMCG_TT_SERVER, UMCG_TASK_PROCESSING);
+}
+
 /**
  * sys_umcg_register_task - register the current task as a UMCG task.
  * @api_version:       The expected/desired API version of the syscall.
@@ -122,6 +300,10 @@ SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
 		if (group_id != UMCG_NOID)
 			return -EINVAL;
 		return register_core_task(api_version, umcg_task);
+	case UMCG_REGISTER_WORKER:
+		return register_worker(api_version, group_id, umcg_task);
+	case UMCG_REGISTER_SERVER:
+		return register_server(api_version, group_id, umcg_task);
 	default:
 		return -EINVAL;
 	}
@@ -146,9 +328,39 @@ SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
 	if (!utd || flags)
 		goto out;
 
+	if (!utd->group) {
+		ret = 0;
+		goto out;
+	}
+
+	if (utd->task_type == UMCG_TT_WORKER) {
+		struct task_struct *server = rcu_dereference(utd->peer);
+
+		if (server) {
+			umcg_detach_peer();
+			if (WARN_ON(!wake_up_process(server))) {
+				umcg_segv(0);
+				goto out;
+			}
+		}
+	} else {
+		if (WARN_ON(utd->task_type != UMCG_TT_SERVER)) {
+			umcg_segv(0);
+			goto out;
+		}
+
+		umcg_detach_peer();
+	}
+
+	spin_lock(&utd->group->lock);
 	task_lock(current);
+
 	rcu_assign_pointer(current->umcg_task_data, NULL);
+
+	--utd->group->nr_tasks;
+
 	task_unlock(current);
+	spin_unlock(&utd->group->lock);
 
 	ret = 0;
 
@@ -164,6 +376,7 @@ SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
 static int do_context_switch(struct task_struct *next)
 {
 	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
+	bool prev_wait_flag;  /* See comment in do_wait() below. */
 
 	/*
 	 * It is important to set_current_state(TASK_INTERRUPTIBLE) before
@@ -173,34 +386,51 @@ static int do_context_switch(struct task_struct *next)
 	 */
 	set_current_state(TASK_INTERRUPTIBLE);
 
-	WRITE_ONCE(utd->in_wait, true);
-
+	prev_wait_flag = utd->in_wait;
+	if (!prev_wait_flag)
+		WRITE_ONCE(utd->in_wait, true);
+	
 	if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU))
 		return -EAGAIN;
 
 	freezable_schedule();
 
-	WRITE_ONCE(utd->in_wait, false);
+	if (!prev_wait_flag)
+		WRITE_ONCE(utd->in_wait, false);
 
 	if (signal_pending(current))
 		return -EINTR;
 
+	/* TODO: deal with non-fatal interrupts. */
 	return 0;
 }
 
 static int do_wait(void)
 {
 	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
+	/*
+	 * freezable_schedule() below can recursively call do_wait() if
+	 * this is a worker that needs a server. As the wait flag is only
+	 * used by the outermost wait/wake (and swap) syscalls, modify it only
+	 * in the outermost do_wait() instead of using a counter.
+	 *
+	 * Note that the nesting level is at most two, as utd->in_workqueue
+	 * is used to prevent further nesting.
+	 */
+	bool prev_wait_flag;
 
 	if (!utd)
 		return -EINVAL;
 
-	WRITE_ONCE(utd->in_wait, true);
+	prev_wait_flag = utd->in_wait;
+	if (!prev_wait_flag)
+		WRITE_ONCE(utd->in_wait, true);
 
 	set_current_state(TASK_INTERRUPTIBLE);
 	freezable_schedule();
 
-	WRITE_ONCE(utd->in_wait, false);
+	if (!prev_wait_flag)
+		WRITE_ONCE(utd->in_wait, false);
 
 	if (signal_pending(current))
 		return -EINTR;
@@ -214,7 +444,7 @@ static int do_wait(void)
  * @timeout:       The absolute timeout of the wait. Not supported yet.
  *                 Must be NULL.
  *
- * Sleep until woken, interrupted, or @timeout expires.
+ * Sleep until woken or @timeout expires.
  *
  * Return:
  * 0           - Ok;
@@ -229,6 +459,7 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
 		const struct __kernel_timespec __user *, timeout)
 {
 	struct umcg_task_data *utd;
+	struct task_struct *server = NULL;
 
 	if (flags)
 		return -EINVAL;
@@ -242,8 +473,14 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
 		return -EINVAL;
 	}
 
+	if (utd->task_type == UMCG_TT_WORKER)
+		server = rcu_dereference(utd->peer);
+
 	rcu_read_unlock();
 
+	if (server)
+		return do_context_switch(server);
+
 	return do_wait();
 }
 
@@ -252,7 +489,7 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
  * @flags:         Reserved.
  * @next_tid:      The ID of the task to wake.
  *
- * Wake @next identified by @next_tid. @next must be either a UMCG core
+ * Wake task next identified by @next_tid. @next must be either a UMCG core
  * task or a UMCG worker task.
  *
  * Return:
@@ -265,7 +502,7 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
 SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
 {
 	struct umcg_task_data *next_utd;
-	struct task_struct *next;
+	struct task_struct *next, *next_peer;
 	int ret = -EINVAL;
 
 	if (!next_tid)
@@ -282,11 +519,29 @@ SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
 	if (!next_utd)
 		goto out;
 
+	if (next_utd->task_type == UMCG_TT_SERVER)
+		goto out;
+
 	if (!READ_ONCE(next_utd->in_wait)) {
 		ret = -EAGAIN;
 		goto out;
 	}
 
+	next_peer = rcu_dereference(next_utd->peer);
+	if (next_peer) {
+		if (next_peer == current)
+			umcg_detach_peer();
+		else {
+			/*
+			 * Waking a worker with an assigned server is not
+			 * permitted, unless the waking is done by the assigned
+			 * server.
+			 */
+			umcg_segv(0);
+			goto out;
+		}
+	}
+
 	ret = wake_up_process(next);
 	put_task_struct(next);
 	if (ret)
@@ -348,7 +603,7 @@ SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
 	}
 
 	next_utd = rcu_dereference(next->umcg_task_data);
-	if (!next_utd) {
+	if (!next_utd || next_utd->group != curr_utd->group) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -358,6 +613,25 @@ SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
 		goto out;
 	}
 
+	/* Move the server from curr to next, if appropriate. */
+	if (curr_utd->task_type == UMCG_TT_WORKER) {
+		struct task_struct *server = rcu_dereference(curr_utd->peer);
+		if (server) {
+			struct umcg_task_data *server_utd =
+				rcu_dereference(server->umcg_task_data);
+
+			if (rcu_access_pointer(next_utd->peer)) {
+				ret = -EAGAIN;
+				goto out;
+			}
+			umcg_detach_peer();
+			umcg_lock_pair(server, next);
+			rcu_assign_pointer(server_utd->peer, next);
+			rcu_assign_pointer(next_utd->peer, server);
+			umcg_unlock_pair(server, next);
+		}
+	}
+
 	rcu_read_unlock();
 
 	return do_context_switch(next);
@@ -366,3 +640,475 @@ SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
 	rcu_read_unlock();
 	return ret;
 }
+
+/**
+ * sys_umcg_create_group - create a UMCG group
+ * @api_version:           Requested API version.
+ * @flags:                 Reserved.
+ *
+ * Return:
+ * >= 0                - the group ID
+ * -EOPNOTSUPP         - @api_version is not supported
+ * -EINVAL             - @flags is not valid
+ * -ENOMEM             - not enough memory
+ */
+SYSCALL_DEFINE2(umcg_create_group, u32, api_version, u64, flags)
+{
+	int ret;
+	struct umcg_group *group;
+	struct umcg_group *list_entry;
+	struct mm_struct *mm = current->mm;
+
+	if (flags)
+		return -EINVAL;
+
+	if (__api_version(api_version))
+		return -EOPNOTSUPP;
+
+	group = kzalloc(sizeof(struct umcg_group), GFP_KERNEL);
+	if (!group)
+		return -ENOMEM;
+
+	spin_lock_init(&group->lock);
+	INIT_LIST_HEAD(&group->list);
+	INIT_LIST_HEAD(&group->waiters);
+	group->flags = flags;
+	group->api_version = api_version;
+
+	spin_lock(&mm->umcg_lock);
+
+	list_for_each_entry_rcu(list_entry, &mm->umcg_groups, list) {
+		if (list_entry->group_id >= group->group_id)
+			group->group_id = list_entry->group_id + 1;
+	}
+
+	list_add_rcu(&mm->umcg_groups, &group->list);
+
+	ret = group->group_id;
+	spin_unlock(&mm->umcg_lock);
+
+	return ret;
+}
+
+/**
+ * sys_umcg_destroy_group - destroy a UMCG group
+ * @group_id: The ID of the group to destroy.
+ *
+ * The group must be empty, i.e. have no registered servers or workers.
+ *
+ * Return:
+ * 0       - success;
+ * -ESRCH  - group not found;
+ * -EBUSY  - the group has registered workers or servers.
+ */
+SYSCALL_DEFINE1(umcg_destroy_group, u32, group_id)
+{
+	int ret = 0;
+	struct umcg_group *group = NULL;
+	struct umcg_group *list_entry;
+	struct mm_struct *mm = current->mm;
+
+	spin_lock(&mm->umcg_lock);
+	list_for_each_entry_rcu(list_entry, &mm->umcg_groups, list) {
+		if (list_entry->group_id == group_id) {
+			group = list_entry;
+			break;
+		}
+	}
+
+	if (group == NULL) {
+		ret = -ESRCH;
+		goto out;
+	}
+
+	spin_lock(&group->lock);
+
+	if (group->nr_tasks > 0) {
+		ret = -EBUSY;
+		spin_unlock(&group->lock);
+		goto out;
+	}
+
+	/* Tell group rcu readers that the group is going to be deleted. */
+	group->nr_tasks = -1;
+
+	spin_unlock(&group->lock);
+
+	list_del_rcu(&group->list);
+	kfree_rcu(group, rcu);
+
+out:
+	spin_unlock(&mm->umcg_lock);
+	return ret;
+}
+
+/**
+ * sys_umcg_poll_worker - poll an UNBLOCKED worker
+ * @flags: reserved;
+ * @ut:    the control struct umcg_task of the polled worker.
+ *
+ * The current task must be a UMCG server in POLLING state; if there are
+ * UNBLOCKED workers in the server's group, take the earliest queued,
+ * mark the worker as RUNNABLE.and return.
+ *
+ * If there are no unblocked workers, the syscall waits for one to become
+ * available.
+ *
+ * Return:
+ * 0       - Ok;
+ * -EINTR  - a signal was received;
+ * -EINVAL - one of the parameters is wrong, or a precondition was not met.
+ */
+SYSCALL_DEFINE2(umcg_poll_worker, u32, flags, struct umcg_task __user **, ut)
+{
+	struct umcg_group *group;
+	struct task_struct *worker;
+	struct task_struct *server = current;
+	struct umcg_task __user *result;
+	struct umcg_task_data *worker_utd, *server_utd;
+
+	if (flags)
+		return -EINVAL;
+
+	rcu_read_lock();
+
+	server_utd = rcu_dereference(server->umcg_task_data);
+
+	if (!server_utd || server_utd->task_type != UMCG_TT_SERVER) {
+		rcu_read_unlock();
+		return -EINVAL;
+	}
+
+	umcg_detach_peer();
+
+	group = server_utd->group;
+
+	spin_lock(&group->lock);
+
+	if (group->nr_waiting_workers == 0) {  /* Queue the server. */
+		++group->nr_waiting_pollers;
+		list_add_tail(&server_utd->list, &group->waiters);
+		set_current_state(TASK_INTERRUPTIBLE);
+		spin_unlock(&group->lock);
+		rcu_read_unlock();
+
+		freezable_schedule();
+
+		rcu_read_lock();
+		server_utd = rcu_dereference(server->umcg_task_data);
+
+		if (!list_empty(&server_utd->list)) {
+			spin_lock(&group->lock);
+			list_del_init(&server_utd->list);
+			--group->nr_waiting_pollers;
+			spin_unlock(&group->lock);
+		}
+
+		if (signal_pending(current)) {
+			rcu_read_unlock();
+			return -EINTR;
+		}
+
+		worker = rcu_dereference(server_utd->peer);
+		if (worker) {
+			worker_utd = rcu_dereference(worker->umcg_task_data);
+			result = worker_utd->umcg_task;
+		} else
+			result = NULL;
+
+		rcu_read_unlock();
+
+		if (put_user(result, ut))
+			return umcg_segv(-EFAULT);
+		return 0;
+	}
+
+	/* Pick up the first worker. */
+	worker_utd = list_first_entry(&group->waiters, struct umcg_task_data,
+					list);
+	list_del_init(&worker_utd->list);
+	worker = worker_utd->self;
+	--group->nr_waiting_workers;
+
+	umcg_lock_pair(server, worker);
+	spin_unlock(&group->lock);
+
+	if (WARN_ON(rcu_access_pointer(server_utd->peer) ||
+			rcu_access_pointer(worker_utd->peer))) {
+		/* This is unexpected. */
+		rcu_read_unlock();
+		return umcg_segv(-EINVAL);
+	}
+	rcu_assign_pointer(server_utd->peer, worker);
+	rcu_assign_pointer(worker_utd->peer, current);
+
+	umcg_unlock_pair(server, worker);
+
+	result = worker_utd->umcg_task;
+	rcu_read_unlock();
+
+	if (put_state(result, UMCG_TASK_RUNNABLE))
+		return umcg_segv(-EFAULT);
+
+	if (put_user(result, ut))
+		return umcg_segv(-EFAULT);
+
+	return 0;
+}
+
+/**
+ * sys_umcg_run_worker - "run" a RUNNABLE worker as a server
+ * @flags:       reserved;
+ * @worker_tid:  tid of the worker to run;
+ * @ut:          the control struct umcg_task of the worker that blocked
+ *               during this "run".
+ *
+ * The worker must be in RUNNABLE state. The server (=current task)
+ * wakes the worker and blocks; when the worker, or one of the workers
+ * in umcg_swap chain, blocks, the server is woken and the syscall returns
+ * with ut indicating the blocked worker.
+ *
+ * If the worker exits or unregisters itself, the syscall succeeds with
+ * ut == NULL.
+ *
+ * Return:
+ * 0       - Ok;
+ * -EINTR  - a signal was received;
+ * -EINVAL - one of the parameters is wrong, or a precondition was not met.
+ */
+SYSCALL_DEFINE3(umcg_run_worker, u32, flags, u32, worker_tid,
+		struct umcg_task __user **, ut)
+{
+	int ret = -EINVAL;
+	struct task_struct *worker;
+	struct task_struct *server = current;
+	struct umcg_task __user *result = NULL;
+	struct umcg_task_data *worker_utd;
+	struct umcg_task_data *server_utd;
+	struct umcg_task __user *server_ut;
+	struct umcg_task __user *worker_ut;
+
+	if (!ut)
+		return -EINVAL;
+
+	rcu_read_lock();
+	server_utd = rcu_dereference(server->umcg_task_data);
+
+	if (!server_utd || server_utd->task_type != UMCG_TT_SERVER)
+		goto out_rcu;
+
+	if (flags)
+		goto out_rcu;
+
+	worker = find_get_task_by_vpid(worker_tid);
+	if (!worker) {
+		ret = -ESRCH;
+		goto out_rcu;
+	}
+
+	worker_utd = rcu_dereference(worker->umcg_task_data);
+	if (!worker_utd)
+		goto out_rcu;
+
+	if (!READ_ONCE(worker_utd->in_wait)) {
+		ret = -EAGAIN;
+		goto out_rcu;
+	}
+
+	if (server_utd->group != worker_utd->group)
+		goto out_rcu;
+
+	if (rcu_access_pointer(server_utd->peer) != worker)
+		umcg_detach_peer();
+
+	if (!rcu_access_pointer(server_utd->peer)) {
+		umcg_lock_pair(server, worker);
+		WARN_ON(worker_utd->peer);
+		rcu_assign_pointer(server_utd->peer, worker);
+		rcu_assign_pointer(worker_utd->peer, server);
+		umcg_unlock_pair(server, worker);
+	}
+
+	server_ut = server_utd->umcg_task;
+	worker_ut = server_utd->umcg_task;
+
+	rcu_read_unlock();
+
+	ret = do_context_switch(worker);
+	if (ret)
+		return ret;
+
+	rcu_read_lock();
+	worker = rcu_dereference(server_utd->peer);
+	if (worker) {
+		worker_utd = rcu_dereference(worker->umcg_task_data);
+		if (worker_utd)
+			result = worker_utd->umcg_task;
+	}
+	rcu_read_unlock();
+
+	if (put_user(result, ut))
+		return -EFAULT;
+	return 0;
+
+out_rcu:
+	rcu_read_unlock();
+	return ret;
+}
+
+void umcg_on_block(void)
+{
+	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
+	struct umcg_task __user *ut;
+	struct task_struct *server;
+	u32 state;
+
+	if (utd->task_type != UMCG_TT_WORKER || utd->in_workqueue)
+		return;
+
+	ut = utd->umcg_task;
+
+	if (get_user(state, (u32 __user *)ut)) {
+		if (signal_pending(current))
+			return;
+		umcg_segv(0);
+		return;
+	}
+
+	if (state != UMCG_TASK_RUNNING)
+		return;
+
+	state = UMCG_TASK_BLOCKED;
+	if (put_user(state, (u32 __user *)ut)) {
+		umcg_segv(0);
+		return;
+	}
+
+	rcu_read_lock();
+	server = rcu_dereference(utd->peer);
+	rcu_read_unlock();
+
+	if (server)
+		WARN_ON(!try_to_wake_up(server, TASK_NORMAL, WF_CURRENT_CPU));
+}
+
+/* Return true to return to the user, false to keep waiting. */
+static bool process_unblocked_worker(void)
+{
+	struct umcg_task_data *utd;
+	struct umcg_group *group;
+
+	rcu_read_lock();
+
+	utd = rcu_dereference(current->umcg_task_data);
+	group = utd->group;
+
+	spin_lock(&group->lock);
+	if (!list_empty(&utd->list)) {
+		/* This was a spurious wakeup or an interrupt, do nothing. */
+		spin_unlock(&group->lock);
+		rcu_read_unlock();
+		do_wait();
+		return false;
+	}
+
+	if (group->nr_waiting_pollers > 0) {  /* Wake a server. */
+		struct task_struct *server;
+		struct umcg_task_data *server_utd = list_first_entry(
+				&group->waiters, struct umcg_task_data, list);
+
+		list_del_init(&server_utd->list);
+		server = server_utd->self;
+		--group->nr_waiting_pollers;
+
+		umcg_lock_pair(server, current);
+		spin_unlock(&group->lock);
+
+		if (WARN_ON(server_utd->peer || utd->peer)) {
+			umcg_segv(0);
+			return true;
+		}
+		rcu_assign_pointer(server_utd->peer, current);
+		rcu_assign_pointer(utd->peer, server);
+
+		umcg_unlock_pair(server, current);
+		rcu_read_unlock();
+
+		if (put_state(utd->umcg_task, UMCG_TASK_RUNNABLE)) {
+			umcg_segv(0);
+			return true;
+		}
+
+		do_context_switch(server);
+		return false;
+	}
+
+	/* Add to the queue. */
+	++group->nr_waiting_workers;
+	list_add_tail(&utd->list, &group->waiters);
+	spin_unlock(&group->lock);
+	rcu_read_unlock();
+
+	do_wait();
+
+	smp_rmb();
+	if (!list_empty(&utd->list)) {
+		spin_lock(&group->lock);
+		list_del_init(&utd->list);
+		--group->nr_waiting_workers;
+		spin_unlock(&group->lock);
+	}
+
+	return false;
+}
+
+void umcg_on_wake(void)
+{
+	struct umcg_task_data *utd;
+	struct umcg_task __user *ut;
+	bool should_break = false;
+
+	/* current->umcg_task_data is modified only from current. */
+	utd = rcu_access_pointer(current->umcg_task_data);
+	if (utd->task_type != UMCG_TT_WORKER || utd->in_workqueue)
+		return;
+
+	do {
+		u32 state;
+
+		if (fatal_signal_pending(current))
+			return;
+
+		if (signal_pending(current))
+			return;
+
+		ut = utd->umcg_task;
+
+		if (get_state(ut, &state)) {
+			if (signal_pending(current))
+				return;
+			goto segv;
+		}
+
+		if (state == UMCG_TASK_RUNNING && rcu_access_pointer(utd->peer))
+			return;
+
+		if (state == UMCG_TASK_BLOCKED || state == UMCG_TASK_RUNNING) {
+			state = UMCG_TASK_UNBLOCKED;
+			if (put_state(ut, state))
+				goto segv;
+		} else if (state != UMCG_TASK_UNBLOCKED) {
+			goto segv;
+		}
+
+		utd->in_workqueue = true;
+		should_break = process_unblocked_worker();
+		utd->in_workqueue = false;
+		if (should_break)
+			return;
+
+	} while (!should_break);
+
+segv:
+	umcg_segv(0);
+}
diff --git a/kernel/sched/umcg.h b/kernel/sched/umcg.h
index 6791d570f622..92012a1674ab 100644
--- a/kernel/sched/umcg.h
+++ b/kernel/sched/umcg.h
@@ -8,6 +8,34 @@
 #include <linux/sched.h>
 #include <linux/umcg.h>
 
+struct umcg_group {
+	struct list_head list;
+	u32 group_id;     /* Never changes. */
+	u32 api_version;  /* Never changes. */
+	u64 flags;        /* Never changes. */
+
+	spinlock_t lock;
+
+	/*
+	 * One of the counters below is always zero. The non-zero counter
+	 * indicates the number of elements in @waiters below.
+	 */
+	int nr_waiting_workers;
+	int nr_waiting_pollers;
+
+	/*
+	 * The list below either contains UNBLOCKED workers waiting
+	 * for the userspace to poll or run them if nr_waiting_workers > 0,
+	 *  or polling servers waiting for unblocked workers if
+	 *  nr_waiting_pollers > 0.
+	 */
+	struct list_head waiters;
+
+	int nr_tasks;  /* The total number of tasks registered. */
+
+	struct rcu_head rcu;
+};
+
 enum umcg_task_type {
 	UMCG_TT_CORE	= 1,
 	UMCG_TT_SERVER	= 2,
@@ -32,11 +60,37 @@ struct umcg_task_data {
 	 */
 	u32 api_version;
 
+	/* NULL for core API tasks. Never changes. */
+	struct umcg_group		*group;
+
+	/*
+	 * If this is a server task, points to its assigned worker, if any;
+	 * if this is a worker task, points to its assigned server, if any.
+	 *
+	 * Protected by alloc_lock of the task owning this struct.
+	 *
+	 * Always either NULL, or the server and the worker point to each other.
+	 * Locking order: first lock the server, then the worker.
+	 *
+	 * Either the worker or the server should be the current task when
+	 * this field is changed, with the exception of sys_umcg_swap.
+	 */
+	struct task_struct __rcu	*peer;
+
+	/* Used in umcg_group.waiters. */
+	struct list_head		list;
+
+	/* Used by curr in umcg_on_block/wake to prevent nesting/recursion. */
+	bool				in_workqueue;
+
 	/*
 	 * Used by wait/wake routines to handle races. Written only by current.
 	 */
 	bool				in_wait;
 };
 
+void umcg_on_block(void);
+void umcg_on_wake(void);
+
 #endif  /* CONFIG_UMCG */
 #endif  /* _KERNEL_SCHED_UMCG_H */
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 153162669f80..85e4a8ecfd91 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -36,6 +36,10 @@ struct mm_struct init_mm = {
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
+#ifdef CONFIG_UMCG
+	.umcg_lock	= __SPIN_LOCK_UNLOCKED(init_mm.umcg_lock),
+	.umcg_groups	= LIST_HEAD_INIT(init_mm.umcg_groups),
+#endif
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= CPU_BITS_NONE,
 	INIT_MM_CONTEXT(init_mm)
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 8/9] lib/umcg: add UMCG server/worker API (early RFC)
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (6 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC) Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 18:36 ` [RFC PATCH v0.1 9/9] selftests/umcg: add UMCG server/worker API selftest Peter Oskolkov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Add userspace UMCG server/worker API.

This is an early RFC patch, with a lot of changes expected on the way.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/lib/umcg/libumcg.c | 222 +++++++++++++++++++++++++++++++++++++++
 tools/lib/umcg/libumcg.h | 108 +++++++++++++++++++
 2 files changed, 330 insertions(+)

diff --git a/tools/lib/umcg/libumcg.c b/tools/lib/umcg/libumcg.c
index b177fb1d4b17..a11c2fc9e6e1 100644
--- a/tools/lib/umcg/libumcg.c
+++ b/tools/lib/umcg/libumcg.c
@@ -101,6 +101,86 @@ umcg_tid umcg_register_core_task(intptr_t tag)
 	return umcg_task_tls->self;
 }
 
+umcg_tid umcg_register_worker(umcg_t group_id, intptr_t tag)
+{
+	int ret;
+	struct umcg_group *group;
+
+	if (group_id == UMCG_NONE) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	if (umcg_task_tls != NULL) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	group = (struct umcg_group *)group_id;
+
+	umcg_task_tls = malloc(sizeof(struct umcg_task_tls));
+	if (!umcg_task_tls) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+
+	umcg_task_tls->umcg_task.state = UMCG_TASK_NONE;
+	umcg_task_tls->self = (umcg_tid)&umcg_task_tls;
+	umcg_task_tls->tag = tag;
+	umcg_task_tls->tid = gettid();
+
+	ret = sys_umcg_register_task(umcg_api_version, UMCG_REGISTER_WORKER,
+			group->group_id, &umcg_task_tls->umcg_task);
+	if (ret) {
+		free(umcg_task_tls);
+		umcg_task_tls = NULL;
+		errno = ret;
+		return UMCG_NONE;
+	}
+
+	return umcg_task_tls->self;
+}
+
+umcg_tid umcg_register_server(umcg_t group_id, intptr_t tag)
+{
+	int ret;
+	struct umcg_group *group;
+
+	if (group_id == UMCG_NONE) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	if (umcg_task_tls != NULL) {
+		errno = EINVAL;
+		return UMCG_NONE;
+	}
+
+	group = (struct umcg_group *)group_id;
+
+	umcg_task_tls = malloc(sizeof(struct umcg_task_tls));
+	if (!umcg_task_tls) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+
+	umcg_task_tls->umcg_task.state = UMCG_TASK_NONE;
+	umcg_task_tls->self = (umcg_tid)&umcg_task_tls;
+	umcg_task_tls->tag = tag;
+	umcg_task_tls->tid = gettid();
+
+	ret = sys_umcg_register_task(umcg_api_version, UMCG_REGISTER_SERVER,
+			group->group_id, &umcg_task_tls->umcg_task);
+	if (ret) {
+		free(umcg_task_tls);
+		umcg_task_tls = NULL;
+		errno = ret;
+		return UMCG_NONE;
+	}
+
+	return umcg_task_tls->self;
+}
+
 int umcg_unregister_task(void)
 {
 	int ret;
@@ -348,3 +428,145 @@ int umcg_swap(umcg_tid next, const struct timespec *timeout)
 
 	return 0;
 }
+
+umcg_t umcg_create_group(uint32_t flags)
+{
+	int res = sys_umcg_create_group(umcg_api_version, flags);
+	struct umcg_group *group;
+
+	if (res < 0) {
+		errno = -res;
+		return -1;
+	}
+
+	group = malloc(sizeof(struct umcg_group));
+	if (!group) {
+		errno = ENOMEM;
+		return UMCG_NONE;
+	}
+
+	group->group_id = res;
+	return (intptr_t)group;
+}
+
+int umcg_destroy_group(umcg_t umcg)
+{
+	int res;
+	struct umcg_group *group = (struct umcg_group *)umcg;
+
+	res = sys_umcg_destroy_group(group->group_id);
+	if (res) {
+		errno = -res;
+		return -1;
+	}
+
+	free(group);
+	return 0;
+}
+
+umcg_tid umcg_poll_worker(void)
+{
+	struct umcg_task *server_ut = &umcg_task_tls->umcg_task;
+	struct umcg_task *worker_ut;
+	uint32_t expected_state;
+	int ret;
+
+	expected_state = UMCG_TASK_PROCESSING;
+	if (!atomic_compare_exchange_strong_explicit(&server_ut->state,
+			&expected_state, UMCG_TASK_POLLING,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		fprintf(stderr, "umcg_poll_worker: wrong server state before: %u\n",
+				expected_state);
+		exit(1);
+		return UMCG_NONE;
+	}
+	ret = sys_umcg_poll_worker(0, &worker_ut);
+
+	expected_state = UMCG_TASK_POLLING;
+	if (!atomic_compare_exchange_strong_explicit(&server_ut->state,
+			&expected_state, UMCG_TASK_PROCESSING,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		fprintf(stderr, "umcg_poll_worker: wrong server state after: %u\n",
+				expected_state);
+		exit(1);
+		return UMCG_NONE;
+	}
+
+	if (ret) {
+		fprintf(stderr, "sys_umcg_poll_worker: unexpected result %d\n",
+				errno);
+		exit(1);
+		return UMCG_NONE;
+	}
+
+	return umcg_task_to_utid(worker_ut);
+}
+
+umcg_tid umcg_run_worker(umcg_tid worker)
+{
+	struct umcg_task_tls *worker_utls;
+	struct umcg_task *server_ut = &umcg_task_tls->umcg_task;
+	struct umcg_task *worker_ut;
+	uint32_t expected_state;
+	int ret;
+
+	worker_utls = atomic_load_explicit((struct umcg_task_tls **)worker,
+			memory_order_seq_cst);
+	if (!worker_utls)
+		return UMCG_NONE;
+
+	worker_ut = &worker_utls->umcg_task;
+
+	expected_state = UMCG_TASK_RUNNABLE;
+	if (!atomic_compare_exchange_strong_explicit(&worker_ut->state,
+			&expected_state, UMCG_TASK_RUNNING,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		fprintf(stderr, "umcg_run_worker: wrong worker state: %u\n",
+				expected_state);
+		exit(1);
+		return UMCG_NONE;
+	}
+
+	expected_state = UMCG_TASK_PROCESSING;
+	if (!atomic_compare_exchange_strong_explicit(&server_ut->state,
+			&expected_state, UMCG_TASK_SERVING,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		fprintf(stderr, "umcg_run_worker: wrong server state: %u\n",
+				expected_state);
+		exit(1);
+		return UMCG_NONE;
+	}
+
+again:
+	ret = sys_umcg_run_worker(0, worker_utls->tid, &worker_ut);
+	if (ret && errno == EAGAIN)
+		goto again;
+
+	if (ret) {
+		fprintf(stderr, "umcg_run_worker failed: %d %d\n", ret, errno);
+		return UMCG_NONE;
+	}
+
+	expected_state = UMCG_TASK_SERVING;
+	if (!atomic_compare_exchange_strong_explicit(&server_ut->state,
+			&expected_state, UMCG_TASK_PROCESSING,
+			memory_order_seq_cst, memory_order_seq_cst)) {
+		fprintf(stderr, "umcg_run_worker: wrong server state: %u\n",
+				expected_state);
+		exit(1);
+		return UMCG_NONE;
+	}
+
+	return umcg_task_to_utid(worker_ut);
+}
+
+uint32_t umcg_get_task_state(umcg_tid task)
+{
+	struct umcg_task_tls *utls = atomic_load_explicit(
+			(struct umcg_task_tls **)task, memory_order_seq_cst);
+
+	if (!utls)
+		return UMCG_TASK_NONE;
+
+	return atomic_load_explicit(&utls->umcg_task.state, memory_order_relaxed);
+}
diff --git a/tools/lib/umcg/libumcg.h b/tools/lib/umcg/libumcg.h
index 31ef786d1965..4307bc0bd08e 100644
--- a/tools/lib/umcg/libumcg.h
+++ b/tools/lib/umcg/libumcg.h
@@ -49,6 +49,28 @@ static int sys_umcg_swap(uint32_t wake_flags, uint32_t next_tid,
 			wait_flags, timeout);
 }
 
+static int32_t sys_umcg_create_group(uint32_t api_version, uint32_t flags)
+{
+	return syscall(__NR_umcg_create_group, api_version, flags);
+}
+
+static int sys_umcg_destroy_group(int32_t group_id)
+{
+	return syscall(__NR_umcg_destroy_group, group_id);
+}
+
+static int sys_umcg_poll_worker(uint32_t flags, struct umcg_task **ut)
+{
+	return syscall(__NR_umcg_poll_worker, flags, ut);
+}
+
+static int sys_umcg_run_worker(uint32_t flags, uint32_t worker_tid,
+		struct umcg_task **ut)
+{
+	return syscall(__NR_umcg_run_worker, flags, worker_tid, ut);
+}
+
+typedef intptr_t umcg_t;   /* UMCG group ID. */
 typedef intptr_t umcg_tid; /* UMCG thread ID. */
 
 #define UMCG_NONE	(0)
@@ -88,6 +110,28 @@ intptr_t umcg_get_task_tag(umcg_tid utid);
  */
 umcg_tid umcg_register_core_task(intptr_t tag);
 
+/**
+ * umcg_register_worker - register the current thread as a UMCG worker
+ * @group_id:      The ID of the UMCG group the thread should join.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
+ *                 to match the value returned by umcg_get_utid).
+ */
+umcg_tid umcg_register_worker(umcg_t group_id, intptr_t tag);
+
+/**
+ * umcg_register_server - register the current thread as a UMCG server
+ * @group_id:      The ID of the UMCG group the thread should join.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
+ *                 to match the value returned by umcg_get_utid).
+ */
+umcg_tid umcg_register_server(umcg_t group_id, intptr_t tag);
+
 /**
  * umcg_unregister_task - unregister the current thread
  *
@@ -151,4 +195,68 @@ int umcg_wake(umcg_tid next);
  */
 int umcg_swap(umcg_tid next, const struct timespec *timeout);
 
+/**
+ * umcg_create_group - create a UMCG group
+ * @flags:             Reserved.
+ *
+ * UMCG groups have worker and server threads.
+ *
+ * Worker threads are either RUNNABLE/RUNNING "on behalf" of server threads
+ * (see umcg_run_worker), or are BLOCKED/UNBLOCKED. A worker thread can be
+ * running only if it is attached to a server thread (interrupts can
+ * complicate the matter - TBD).
+ *
+ * Server threads are either blocked while running worker threads or are
+ * blocked waiting for available (=UNBLOCKED) workers. A server thread
+ * can "run" only one worker thread.
+ *
+ * Return:
+ * UMCG_NONE     - an error occurred. Check errno.
+ * != UMCG_NONE  - the ID of the group, to be used in e.g. umcg_register.
+ */
+umcg_t umcg_create_group(uint32_t flags);
+
+/**
+ * umcg_destroy_group - destroy a UMCG group
+ * @umcg:               ID of the group to destroy
+ *
+ * The group must be empty (no server or worker threads).
+ *
+ * Return:
+ * 0            - Ok
+ * -1           - an error occurred. Check errno.
+ *                errno == EAGAIN: the group has server or worker threads
+ */
+int umcg_destroy_group(umcg_t umcg);
+
+/**
+ * umcg_poll_worker - wait for the first available UNBLOCKED worker
+ *
+ * The current thread must be a UMCG server. If there is a list/queue of
+ * waiting UNBLOCKED workers in the server's group, umcg_poll_worker
+ * picks the longest waiting one; if there are no UNBLOCKED workers, the
+ * current thread sleeps in the polling queue.
+ *
+ * Return:
+ * UMCG_NONE         - an error occurred; check errno;
+ * != UMCG_NONE      - a RUNNABLE worker.
+ */
+umcg_tid umcg_poll_worker(void);
+
+/**
+ * umcg_run_worker - run @worker as a UMCG server
+ * @worker:          the ID of a RUNNABLE worker to run
+ *
+ * The current thread must be a UMCG "server".
+ *
+ * Return:
+ * UMCG_NONE    - if errno == 0, the last worker the server was running
+ *                unregistered itself; if errno != 0, an error occurred
+ * != UMCG_NONE - the ID of the last worker the server was running before
+ *                the worker was blocked or preempted.
+ */
+umcg_tid umcg_run_worker(umcg_tid worker);
+
+uint32_t umcg_get_task_state(umcg_tid task);
+
 #endif  /* __LIBUMCG_H */
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH v0.1 9/9] selftests/umcg: add UMCG server/worker API selftest
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (7 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 8/9] lib/umcg: " Peter Oskolkov
@ 2021-05-20 18:36 ` Peter Oskolkov
  2021-05-20 21:17 ` [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Jonathan Corbet
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Add UMCG server/worker API selftests. These are only basic
tests, they do not cover many important use cases/conditions.

More to come.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/umcg/.gitignore  |   1 +
 tools/testing/selftests/umcg/Makefile    |   4 +-
 tools/testing/selftests/umcg/umcg_test.c | 475 +++++++++++++++++++++++
 3 files changed, 479 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/umcg/umcg_test.c

diff --git a/tools/testing/selftests/umcg/.gitignore b/tools/testing/selftests/umcg/.gitignore
index 89cca24e5907..f488ec82882a 100644
--- a/tools/testing/selftests/umcg/.gitignore
+++ b/tools/testing/selftests/umcg/.gitignore
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 umcg_core_test
+umcg_test
diff --git a/tools/testing/selftests/umcg/Makefile b/tools/testing/selftests/umcg/Makefile
index b151098e2ed1..916897d82e53 100644
--- a/tools/testing/selftests/umcg/Makefile
+++ b/tools/testing/selftests/umcg/Makefile
@@ -6,8 +6,10 @@ LIBUMCGDIR := $(TOOLSDIR)/lib/umcg
 CFLAGS += -g -O0 -I$(LIBUMCGDIR) -I$(TOOLSDIR)/include/ -I../../../../usr/include/
 LDLIBS += -lpthread -static
 
-TEST_GEN_PROGS := umcg_core_test
+TEST_GEN_PROGS := umcg_core_test umcg_test
 
 include ../lib.mk
 
 $(OUTPUT)/umcg_core_test: umcg_core_test.c $(LIBUMCGDIR)/libumcg.c
+
+$(OUTPUT)/umcg_test: umcg_test.c $(LIBUMCGDIR)/libumcg.c
diff --git a/tools/testing/selftests/umcg/umcg_test.c b/tools/testing/selftests/umcg/umcg_test.c
new file mode 100644
index 000000000000..2c01a61ec3f4
--- /dev/null
+++ b/tools/testing/selftests/umcg/umcg_test.c
@@ -0,0 +1,475 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include "libumcg.h"
+
+#include <pthread.h>
+#include <stdatomic.h>
+
+#include "../kselftest_harness.h"
+
+#define CHECK_CONFIG()						\
+{								\
+	int ret = sys_umcg_api_version(1, 0);	\
+								\
+	if (ret == -1 && errno == ENOSYS)			\
+		SKIP(return, "CONFIG_UMCG not set");	\
+}
+
+struct worker_args {
+	umcg_t		group;  /* Which group the worker should join. */
+	umcg_tid	utid;   /* This worker's utid. */
+	void *(*thread_fn)(void *);  /* Function to run. */
+	void		*thread_arg;
+	intptr_t	tag;
+};
+
+static void validate_state(umcg_tid utid, u32 expected, const char *ctx)
+{
+	u32 state = umcg_get_task_state(utid);
+
+	if (state == expected)
+		return;
+
+	fprintf(stderr, "BAD state for %ld: expected: %u; got: %u; ctx :%s\n",
+			utid, expected, state, ctx);
+	exit(1);
+}
+
+static void *worker_fn(void *arg)
+{
+	void *result;
+	umcg_tid utid;
+	struct worker_args *args = (struct worker_args *)arg;
+
+	validate_state(umcg_get_utid(), UMCG_TASK_NONE, "worker_fn start");
+
+	atomic_thread_fence(memory_order_acquire);
+	atomic_store_explicit(&args->utid, umcg_get_utid(),
+			memory_order_seq_cst);
+
+	utid = umcg_register_worker(args->group, args->tag);
+	if (args->utid != utid) {
+		fprintf(stderr, "umcg_register_worker failed.\n");
+		exit(1);
+	}
+	validate_state(umcg_get_utid(), UMCG_TASK_RUNNING, "worker_fn in");
+
+	/* Fence args->thread_arg */
+	atomic_thread_fence(memory_order_acquire);
+
+	result = args->thread_fn(args->thread_arg);
+	validate_state(umcg_get_utid(), UMCG_TASK_RUNNING, "worker_fn out");
+
+	if (umcg_unregister_task()) {
+		fprintf(stderr, "umcg_unregister_task failed.\n");
+		exit(1);
+	}
+	validate_state(umcg_get_utid(), UMCG_TASK_NONE, "worker_fn finish");
+
+	return result;
+}
+
+static void *simple_running_worker(void *arg)
+{
+	bool *checkpoint = (bool *)arg;
+
+	atomic_store_explicit(checkpoint, true, memory_order_relaxed);
+	return NULL;
+}
+
+TEST(umcg_poll_run_test) {
+	pthread_t worker;
+	bool checkpoint = false;
+	struct worker_args worker_args;
+
+	CHECK_CONFIG();
+
+	worker_args.utid = UMCG_NONE;
+	worker_args.group = umcg_create_group(0);
+	ASSERT_NE(UMCG_NONE, worker_args.group);
+
+	worker_args.thread_fn = &simple_running_worker;
+	worker_args.thread_arg = &checkpoint;
+	worker_args.tag = 0;
+
+	ASSERT_EQ(0, pthread_create(&worker, NULL, &worker_fn, &worker_args));
+
+	/* Wait for the worker to start. */
+	while (UMCG_NONE == atomic_load_explicit(&worker_args.utid,
+				memory_order_relaxed))
+		;
+
+	/*
+	 * Make sure that the worker does not checkpoint until the server
+	 * runs it.
+	 */
+	usleep(1000);
+	ASSERT_FALSE(atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	ASSERT_NE(0, umcg_register_server(worker_args.group, 0));
+
+	/*
+	 * Run the worker until it exits. Need to loop because the worker
+	 * may pagefault and wake the server.
+	 */
+	do {
+		u32 state;
+
+		/* Poll the worker. */
+		ASSERT_EQ(worker_args.utid, umcg_poll_worker());
+		validate_state(worker_args.utid, UMCG_TASK_RUNNABLE, "wns poll");
+
+		umcg_tid utid = umcg_run_worker(worker_args.utid);
+		if (utid == UMCG_NONE) {
+			ASSERT_EQ(0, errno);
+			break;
+		}
+
+		ASSERT_EQ(utid, worker_args.utid);
+
+		state = umcg_get_task_state(utid);
+		ASSERT_TRUE(state == UMCG_TASK_BLOCKED || UMCG_TASK_UNBLOCKED);
+	} while (true);
+
+	ASSERT_TRUE(atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	/* Can't destroy group while this thread still belongs to it. */
+	ASSERT_NE(0, umcg_destroy_group(worker_args.group));
+	ASSERT_EQ(0, umcg_unregister_task());
+	ASSERT_EQ(0, umcg_destroy_group(worker_args.group));
+	ASSERT_EQ(0, pthread_join(worker, NULL));
+}
+
+static void *sleeping_worker(void *arg)
+{
+	int *checkpoint = (int *)arg;
+
+	atomic_store_explicit(checkpoint, 1, memory_order_relaxed);
+	usleep(2000);
+	atomic_store_explicit(checkpoint, 2, memory_order_relaxed);
+
+	return NULL;
+}
+
+TEST(umcg_sleep_test) {
+	pthread_t worker;
+	u32 state;
+	int checkpoint = 0;
+	struct worker_args worker_args;
+
+	CHECK_CONFIG();
+
+	worker_args.utid = UMCG_NONE;
+	worker_args.group = umcg_create_group(0);
+	ASSERT_NE(UMCG_NONE, worker_args.group);
+
+	worker_args.thread_fn = &sleeping_worker;
+	worker_args.thread_arg = &checkpoint;
+	worker_args.tag = 0;
+
+	ASSERT_EQ(0, pthread_create(&worker, NULL, &worker_fn, &worker_args));
+
+	/* Wait for the worker to start. */
+	while (UMCG_NONE == atomic_load_explicit(&worker_args.utid,
+				memory_order_relaxed))
+		;
+
+	/*
+	 * Make sure that the worker does not checkpoint until the server
+	 * runs it.
+	 */
+	usleep(1000);
+	ASSERT_EQ(0, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	validate_state(umcg_get_utid(), UMCG_TASK_NONE, "sws prereg");
+
+	ASSERT_NE(0, umcg_register_server(worker_args.group, 0));
+
+	validate_state(umcg_get_utid(), UMCG_TASK_PROCESSING, "sws postreg");
+
+	/*
+	 * Run the worker until it checkpoints 1. Need to loop because
+	 * the worker may pagefault and wake the server.
+	 */
+	do {
+		ASSERT_EQ(worker_args.utid, umcg_poll_worker());
+		validate_state(worker_args.utid, UMCG_TASK_RUNNABLE,
+				"sws poll");
+
+		umcg_tid utid = umcg_run_worker(worker_args.utid);
+		ASSERT_EQ(utid, worker_args.utid);
+	} while (1 != atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	state = umcg_get_task_state(worker_args.utid);
+	ASSERT_TRUE(state == UMCG_TASK_BLOCKED || UMCG_TASK_UNBLOCKED);
+	validate_state(umcg_get_utid(), UMCG_TASK_PROCESSING, "sws mid");
+
+	/* The worker cannot reach checkpoint 2 without the server running it. */
+	usleep(2000);
+	ASSERT_EQ(1, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	state = umcg_get_task_state(worker_args.utid);
+	ASSERT_TRUE(state == UMCG_TASK_BLOCKED || UMCG_TASK_UNBLOCKED);
+
+	/* Run the worker until it exits. */
+	do {
+		ASSERT_EQ(worker_args.utid, umcg_poll_worker());
+		umcg_tid utid = umcg_run_worker(worker_args.utid);
+		if (utid == UMCG_NONE) {
+			ASSERT_EQ(0, errno);
+			break;
+		}
+
+		ASSERT_EQ(utid, worker_args.utid);
+	} while (true);
+
+	/* The final check and cleanup. */
+	ASSERT_EQ(2, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+	validate_state(umcg_get_utid(), UMCG_TASK_PROCESSING, "sws preunreg");
+	ASSERT_EQ(0, pthread_join(worker, NULL));
+	ASSERT_EQ(0, umcg_unregister_task());
+	validate_state(umcg_get_utid(), UMCG_TASK_NONE, "sws postunreg");
+	ASSERT_EQ(0, umcg_destroy_group(worker_args.group));
+}
+
+static void *waiting_worker(void *arg)
+{
+	int *checkpoint = (int *)arg;
+
+	atomic_store_explicit(checkpoint, 1, memory_order_relaxed);
+	if (umcg_wait(NULL)) {
+		fprintf(stderr, "umcg_wait() failed.\n");
+		exit(1);
+	}
+	atomic_store_explicit(checkpoint, 2, memory_order_relaxed);
+
+	return NULL;
+}
+
+TEST(umcg_wait_wake_test) {
+	pthread_t worker;
+	int checkpoint = 0;
+	struct worker_args worker_args;
+
+	CHECK_CONFIG();
+
+	worker_args.utid = UMCG_NONE;
+	worker_args.group = umcg_create_group(0);
+	ASSERT_NE(UMCG_NONE, worker_args.group);
+
+	worker_args.thread_fn = &waiting_worker;
+	worker_args.thread_arg = &checkpoint;
+	worker_args.tag = 0;
+
+	ASSERT_EQ(0, pthread_create(&worker, NULL, &worker_fn, &worker_args));
+
+	/* Wait for the worker to start. */
+	while (UMCG_NONE == atomic_load_explicit(&worker_args.utid,
+				memory_order_relaxed))
+		;
+
+	/*
+	 * Make sure that the worker does not checkpoint until the server
+	 * runs it.
+	 */
+	usleep(1000);
+	ASSERT_EQ(0, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	ASSERT_NE(0, umcg_register_server(worker_args.group, 0));
+
+	/*
+	 * Run the worker until it checkpoints 1. Need to loop because
+	 * the worker may pagefault and wake the server.
+	 */
+	do {
+		ASSERT_EQ(worker_args.utid, umcg_poll_worker());
+		ASSERT_EQ(worker_args.utid, umcg_run_worker(worker_args.utid));
+	} while (1 != atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	validate_state(worker_args.utid, UMCG_TASK_RUNNABLE, "wait_wake wait");
+
+	/* The worker cannot reach checkpoint 2 without the server waking it. */
+	usleep(2000);
+	ASSERT_EQ(1, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+	validate_state(worker_args.utid, UMCG_TASK_RUNNABLE, "wait_wake wait");
+
+
+	ASSERT_EQ(0, umcg_wake(worker_args.utid));
+
+	/*
+	 * umcg_wake() above marks the worker as RUNNING; it will become
+	 * UNBLOCKED upon wakeup as it does not have a server. But this may
+	 * be delayed.
+	 */
+	while (umcg_get_task_state(worker_args.utid) != UMCG_TASK_UNBLOCKED)
+		;
+
+	/* The worker cannot reach checkpoint 2 without the server running it. */
+	usleep(2000);
+	ASSERT_EQ(1, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+
+	/* Run the worker until it exits. */
+	do {
+		ASSERT_EQ(worker_args.utid, umcg_poll_worker());
+		umcg_tid utid = umcg_run_worker(worker_args.utid);
+		if (utid == UMCG_NONE) {
+			ASSERT_EQ(0, errno);
+			break;
+		}
+
+		ASSERT_EQ(utid, worker_args.utid);
+	} while (true);
+
+	/* The final check and cleanup. */
+	ASSERT_EQ(2, atomic_load_explicit(&checkpoint, memory_order_relaxed));
+	ASSERT_EQ(0, pthread_join(worker, NULL));
+	ASSERT_EQ(0, umcg_unregister_task());
+	ASSERT_EQ(0, umcg_destroy_group(worker_args.group));
+}
+
+static void *swapping_worker(void *arg)
+{
+	umcg_tid next;
+
+	atomic_thread_fence(memory_order_acquire);
+	next = (umcg_tid)arg;
+
+	if (next == UMCG_NONE) {
+		if (0 != umcg_wait(NULL)) {
+			fprintf(stderr, "swapping_worker: umcg_wait failed\n");
+			exit(1);
+		}
+	} else {
+		if (0 != umcg_swap(next, NULL)) {
+			fprintf(stderr, "swapping_worker: umcg_swap failed\n");
+			exit(1);
+		}
+	}
+
+	return NULL;
+}
+
+TEST(umcg_swap_test) {
+	const int n_workers = 10;
+	struct worker_args *worker_args;
+	int swap_chain_wakeups = 0;
+	umcg_tid utid = UMCG_NONE;
+	bool *workers_polled;
+	pthread_t *workers;
+	umcg_t group_id;
+	int idx;
+
+	CHECK_CONFIG();
+
+	group_id = umcg_create_group(0);
+	ASSERT_NE(UMCG_NONE, group_id);
+
+	workers = malloc(n_workers * sizeof(pthread_t));
+	worker_args = malloc(n_workers * sizeof(struct worker_args));
+	workers_polled = malloc(n_workers * sizeof(bool));
+	if (!workers || !worker_args || !workers_polled) {
+		fprintf(stderr, "malloc failed\n");
+		exit(1);
+	}
+
+	memset(worker_args, 0, n_workers * sizeof(struct worker_args));
+
+	/* Start workers. All will block in umcg_register_worker(). */
+	for (idx = 0; idx < n_workers; ++idx) {
+		workers_polled[idx] = false;
+
+		worker_args[idx].group = group_id;
+		worker_args[idx].thread_fn = &swapping_worker;
+		worker_args[idx].tag = idx;
+		atomic_thread_fence(memory_order_release);
+
+		ASSERT_EQ(0, pthread_create(&workers[idx], NULL, &worker_fn,
+					&worker_args[idx]));
+	}
+
+	/* Wait for all workers to update their utids. */
+	for (idx = 0; idx < n_workers; ++idx) {
+		uint64_t counter = 0;
+		while (UMCG_NONE == atomic_load_explicit(&worker_args[idx].utid,
+					memory_order_seq_cst)) {
+			++counter;
+			if (!(counter % 1000000))
+				fprintf(stderr, "looping for utid: %d %lu\n",
+						idx, counter);
+		}
+	}
+
+	/* Update worker args. */
+	for (idx = 0; idx < (n_workers - 1); ++idx) {
+		worker_args[idx].thread_arg = (void *)worker_args[idx + 1].utid;
+	}
+	atomic_thread_fence(memory_order_release);
+
+	ASSERT_NE(0, umcg_register_server(group_id, 0));
+
+	/* Poll workers. */
+	for (idx = 0; idx < n_workers; ++idx) {
+		utid = umcg_poll_worker();
+
+		ASSERT_NE(UMCG_NONE, utid);
+		workers_polled[umcg_get_task_tag(utid)] = true;
+
+		validate_state(utid, UMCG_TASK_RUNNABLE, "swap poll");
+	}
+
+	/* Check that all workers have been polled. */
+	for (idx = 0; idx < n_workers; ++idx) {
+		ASSERT_TRUE(workers_polled[idx]);
+	}
+
+	/* Run the first worker; the swap chain will lead to the last worker. */
+	utid = worker_args[0].utid;
+	idx = 0;
+	do {
+		uint32_t state;
+
+		utid = umcg_run_worker(utid);
+		if (utid == worker_args[n_workers - 1].utid &&
+				umcg_get_task_state(utid) == UMCG_TASK_RUNNABLE)
+			break;
+
+		/* There can be an occasional mid-swap wakeup due to pagefault. */
+		++swap_chain_wakeups;
+
+		/* Validate progression. */
+		ASSERT_GE(umcg_get_task_tag(utid), idx);
+		idx = umcg_get_task_tag(utid);
+
+		/* Validate state. */
+		state = umcg_get_task_state(utid);
+		ASSERT_TRUE(state == UMCG_TASK_BLOCKED ||
+				state == UMCG_TASK_UNBLOCKED);
+
+		ASSERT_EQ(utid, umcg_poll_worker());
+	} while (true);
+
+	ASSERT_LT(swap_chain_wakeups, 4);
+	if (swap_chain_wakeups)
+		fprintf(stderr, "WARNING: %d swap chain wakeups\n",
+				swap_chain_wakeups);
+
+	/* Finally run/release all workers. */
+	for (idx = 0; idx < n_workers; ++idx) {
+		utid = worker_args[idx].utid;
+		do {
+			utid = umcg_run_worker(utid);
+			if (utid) {
+				ASSERT_EQ(utid, worker_args[idx].utid);
+				ASSERT_EQ(utid, umcg_poll_worker());
+			}
+		} while (utid != UMCG_NONE);
+	}
+
+	/* Cleanup. */
+	for (idx = 0; idx < n_workers; ++idx)
+		ASSERT_EQ(0, pthread_join(workers[idx], NULL));
+	ASSERT_EQ(0, umcg_unregister_task());
+	ASSERT_EQ(0, umcg_destroy_group(group_id));
+}
+
+TEST_HARNESS_MAIN
-- 
2.31.1.818.g46aad6cb9e-goog


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (8 preceding siblings ...)
  2021-05-20 18:36 ` [RFC PATCH v0.1 9/9] selftests/umcg: add UMCG server/worker API selftest Peter Oskolkov
@ 2021-05-20 21:17 ` Jonathan Corbet
  2021-05-20 21:38   ` Peter Oskolkov
       [not found] ` <CAEWA0a72SvpcuN4ov=98T3uWtExPCr7BQePOgjkqD1ofWKEASw@mail.gmail.com>
  2021-06-09 12:54 ` Peter Zijlstra
  11 siblings, 1 reply; 35+ messages in thread
From: Jonathan Corbet @ 2021-05-20 21:17 UTC (permalink / raw)
  To: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	linux-kernel, linux-api
  Cc: Paul Turner, Ben Segall, Peter Oskolkov, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

Peter Oskolkov <posk@google.com> writes:

> As indicated earlier in the FUTEX_SWAP patchset:
>
> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
>
> "Google Fibers" is a userspace scheduling framework
> used widely and successfully at Google to improve in-process workload
> isolation and response latencies. We are working on open-sourcing
> this framework, and UMCG (User-Managed Concurrency Groups) kernel
> patches are intended as the foundation of this.

So I have to ask...is there *any* documentation out there on what this
is and how people are supposed to use it?  Shockingly, typing "Google
fibers" into Google leads to a less than fully joyful outcome...  This
won't be easy for anybody to review if they have to start by
reverse-engineering what it's supposed to do.

Thanks,

jon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-20 21:17 ` [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Jonathan Corbet
@ 2021-05-20 21:38   ` Peter Oskolkov
  2021-05-21  0:15     ` Randy Dunlap
  2021-05-21 15:08     ` Jonathan Corbet
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-20 21:38 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

On Thu, May 20, 2021 at 2:17 PM Jonathan Corbet <corbet@lwn.net> wrote:
>
> Peter Oskolkov <posk@google.com> writes:
>
> > As indicated earlier in the FUTEX_SWAP patchset:
> >
> > https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
> >
> > "Google Fibers" is a userspace scheduling framework
> > used widely and successfully at Google to improve in-process workload
> > isolation and response latencies. We are working on open-sourcing
> > this framework, and UMCG (User-Managed Concurrency Groups) kernel
> > patches are intended as the foundation of this.
>
> So I have to ask...is there *any* documentation out there on what this
> is and how people are supposed to use it?  Shockingly, typing "Google
> fibers" into Google leads to a less than fully joyful outcome...  This
> won't be easy for anybody to review if they have to start by
> reverse-engineering what it's supposed to do.

Hi Jonathan,

There is this Linux Plumbers video: https://www.youtube.com/watch?v=KXuZi9aeGTw
And the pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf

I did not reference them in the patchset because links to sites other
than kernel.org are strongly discouraged... I will definitely add a
documentation patch.

Feel free to reach out to me directly or through this LKML thread if
you have any questions.

Do you think a documentation patch would be useful at this point, as
opposed to a free-form email discussion?

Thanks,
Peter

>
> Thanks,
>
> jon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-20 21:38   ` Peter Oskolkov
@ 2021-05-21  0:15     ` Randy Dunlap
  2021-05-21  8:04       ` Peter Zijlstra
  2021-05-21 15:08     ` Jonathan Corbet
  1 sibling, 1 reply; 35+ messages in thread
From: Randy Dunlap @ 2021-05-21  0:15 UTC (permalink / raw)
  To: Peter Oskolkov, Jonathan Corbet
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

On 5/20/21 2:38 PM, Peter Oskolkov wrote:
> On Thu, May 20, 2021 at 2:17 PM Jonathan Corbet <corbet@lwn.net> wrote:
>>
>> Peter Oskolkov <posk@google.com> writes:
>>
>>> As indicated earlier in the FUTEX_SWAP patchset:
>>>
>>> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
>>>
>>> "Google Fibers" is a userspace scheduling framework
>>> used widely and successfully at Google to improve in-process workload
>>> isolation and response latencies. We are working on open-sourcing
>>> this framework, and UMCG (User-Managed Concurrency Groups) kernel
>>> patches are intended as the foundation of this.
>>
>> So I have to ask...is there *any* documentation out there on what this
>> is and how people are supposed to use it?  Shockingly, typing "Google
>> fibers" into Google leads to a less than fully joyful outcome...  This
>> won't be easy for anybody to review if they have to start by
>> reverse-engineering what it's supposed to do.
> 
> Hi Jonathan,
> 
> There is this Linux Plumbers video: https://www.youtube.com/watch?v=KXuZi9aeGTw
> And the pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
> 
> I did not reference them in the patchset because links to sites other
> than kernel.org are strongly discouraged... I will definitely add a
> documentation patch.

Certainly for links to email, we prefer to use lore.kernel.org archives.
Are links to other sites discouraged?  If so, that's news to me.


> Feel free to reach out to me directly or through this LKML thread if
> you have any questions.
> 
> Do you think a documentation patch would be useful at this point, as
> opposed to a free-form email discussion?

thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-21  0:15     ` Randy Dunlap
@ 2021-05-21  8:04       ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2021-05-21  8:04 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Peter Oskolkov, Jonathan Corbet, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

On Thu, May 20, 2021 at 05:15:41PM -0700, Randy Dunlap wrote:

> > There is this Linux Plumbers video: https://www.youtube.com/watch?v=KXuZi9aeGTw
> > And the pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
> > 
> > I did not reference them in the patchset because links to sites other
> > than kernel.org are strongly discouraged... I will definitely add a
> > documentation patch.
> 
> Certainly for links to email, we prefer to use lore.kernel.org archives.
> Are links to other sites discouraged?  If so, that's news to me.

Discouraged in so far as that when an email solely references external
resources and doesn't bother to summarize or otherwise recap the
contents in the email proper; I'll ignore the whole thing.

Basically, if I have to click a link to figure out basic information of
a patch series, the whole thing is a fail and goes into the bit bucket.

That said; I have no objection against having links, as long as they're
not used to convey the primary information that _should_ be in the
cover letter and/or changelogs.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-20 21:38   ` Peter Oskolkov
  2021-05-21  0:15     ` Randy Dunlap
@ 2021-05-21 15:08     ` Jonathan Corbet
  2021-05-21 16:03       ` Peter Oskolkov
  1 sibling, 1 reply; 35+ messages in thread
From: Jonathan Corbet @ 2021-05-21 15:08 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

Peter Oskolkov <posk@google.com> writes:

> On Thu, May 20, 2021 at 2:17 PM Jonathan Corbet <corbet@lwn.net> wrote:
>>
>> Peter Oskolkov <posk@google.com> writes:
>>
>> > As indicated earlier in the FUTEX_SWAP patchset:
>> >
>> > https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
>> >
>> > "Google Fibers" is a userspace scheduling framework
>> > used widely and successfully at Google to improve in-process workload
>> > isolation and response latencies. We are working on open-sourcing
>> > this framework, and UMCG (User-Managed Concurrency Groups) kernel
>> > patches are intended as the foundation of this.
>>
>> So I have to ask...is there *any* documentation out there on what this
>> is and how people are supposed to use it?  Shockingly, typing "Google
>> fibers" into Google leads to a less than fully joyful outcome...  This
>> won't be easy for anybody to review if they have to start by
>> reverse-engineering what it's supposed to do.
>
> Hi Jonathan,
>
> There is this Linux Plumbers video: https://www.youtube.com/watch?v=KXuZi9aeGTw
> And the pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
>
> I did not reference them in the patchset because links to sites other
> than kernel.org are strongly discouraged... I will definitely add a
> documentation patch.

I did look at those - but a presentation from 2013 is going to be of
limited relevance for a 2021 patch set.  In particular, the syscall API
appears to have evolved considerably since then.

> Feel free to reach out to me directly or through this LKML thread if
> you have any questions.
>
> Do you think a documentation patch would be useful at this point, as
> opposed to a free-form email discussion?

Documentation patches can help to guide that discussion; they also need
to be reviewed as well.  So yes, I think they should be present from the
beginning.  But then, that's the position I'm supposed to take :)  This
is a big change to the kernel's system-call API, I don't think that
there can be a proper discussion of that without a description of what
you're trying to do.

Thanks,

jon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-21 15:08     ` Jonathan Corbet
@ 2021-05-21 16:03       ` Peter Oskolkov
  2021-05-21 19:17         ` Jonathan Corbet
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-21 16:03 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

On Fri, May 21, 2021 at 8:08 AM Jonathan Corbet <corbet@lwn.net> wrote:

[...]
> Documentation patches can help to guide that discussion; they also need
> to be reviewed as well.  So yes, I think they should be present from the
> beginning.  But then, that's the position I'm supposed to take :)  This
> is a big change to the kernel's system-call API, I don't think that
> there can be a proper discussion of that without a description of what
> you're trying to do.

Hi Jon,

There are doc comments in patches 2 and 7 in umcg.c documenting the
new syscalls. That said, I'll prepare a separate doc patch - I guess
I'll add Documentation/scheduler/umcg.rst, unless you tell me there is
a better place to do that. ETA mid-to-late next week.

Thanks,
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
@ 2021-05-21 19:06   ` Andrei Vagin
  2021-05-21 21:31     ` Jann Horn
  2021-05-21 19:32   ` Andy Lutomirski
  2021-05-21 21:33   ` Jann Horn
  2 siblings, 1 reply; 35+ messages in thread
From: Andrei Vagin @ 2021-05-21 19:06 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Thu, May 20, 2021 at 11:36:09AM -0700, Peter Oskolkov wrote:
> @@ -67,7 +137,75 @@ SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
>   */
>  SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
>  {
> -	return -ENOSYS;
> +	struct umcg_task_data *utd;
> +	int ret = -EINVAL;
> +
> +	rcu_read_lock();
> +	utd = rcu_dereference(current->umcg_task_data);
> +
> +	if (!utd || flags)
> +		goto out;
> +
> +	task_lock(current);
> +	rcu_assign_pointer(current->umcg_task_data, NULL);
> +	task_unlock(current);
> +
> +	ret = 0;
> +
> +out:
> +	rcu_read_unlock();
> +	if (!ret && utd) {
> +		synchronize_rcu();

synchronize_rcu is expensive. Do we really need to call it here? Can we
use kfree_rcu?

Where is task->umcg_task_data freed when a task is destroyed?

> +		kfree(utd);
> +	}
> +	return ret;
> +}
> +

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
       [not found] ` <CAEWA0a72SvpcuN4ov=98T3uWtExPCr7BQePOgjkqD1ofWKEASw@mail.gmail.com>
@ 2021-05-21 19:13   ` Peter Oskolkov
  2021-05-21 23:08     ` Jann Horn
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-21 19:13 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Jim Newsome

On Fri, May 21, 2021 at 11:44 AM Andrei Vagin <avagin@google.com> wrote:
>
>
>
> On Thu, May 20, 2021 at 11:36 AM Peter Oskolkov <posk@google.com> wrote:
>>
>> As indicated earlier in the FUTEX_SWAP patchset:
>>
>> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
>
>
> Hi Peter,
>
> Do you have benchmark results? How fast is it compared with futex_swap and the google switchto?

Hi Andrei,

I did not run benchmarks on the same machine/kernel, but umcg_swap
between "core" tasks (your use case for gVisor) should be somewhat
faster than futex_swap, as there is no reading from the userspace and
no futex hash lookup/dequeue ops; umcg_swap should be slower than
switchto_switch because umcg_swap does go through ttwu+schedule, which
switchto_switch bypasses.

I expect that if UMCG is merged in a form similar to what I posted, we
will explore how to make UMCG context switches faster in later
patches.

Thanks,
Peter

>
> Thanks,
> Andrei

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-21 16:03       ` Peter Oskolkov
@ 2021-05-21 19:17         ` Jonathan Corbet
  2021-05-27  0:06           ` Peter Oskolkov
  0 siblings, 1 reply; 35+ messages in thread
From: Jonathan Corbet @ 2021-05-21 19:17 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

Peter Oskolkov <posk@google.com> writes:

> On Fri, May 21, 2021 at 8:08 AM Jonathan Corbet <corbet@lwn.net> wrote:
>
> [...]
>> Documentation patches can help to guide that discussion; they also need
>> to be reviewed as well.  So yes, I think they should be present from the
>> beginning.  But then, that's the position I'm supposed to take :)  This
>> is a big change to the kernel's system-call API, I don't think that
>> there can be a proper discussion of that without a description of what
>> you're trying to do.
>
> Hi Jon,
>
> There are doc comments in patches 2 and 7 in umcg.c documenting the
> new syscalls. That said, I'll prepare a separate doc patch - I guess
> I'll add Documentation/scheduler/umcg.rst, unless you tell me there is
> a better place to do that. ETA mid-to-late next week.

Yes, I saw those; they are a bit terse at best.  What are the "worker
states"?  What's a "UMCG group"?  Yes, all this can be worked out by
pounding one's head against the code for long enough, but you're asking
a fair amount of your reviewers.

A good overall description would be nice, perhaps for the userspace-api
book.  But *somebody* is also going to have to write real man pages for
all these system calls; if you provided those, the result should be a
good description of how you expect this subsystem to work.

Thanks,

jon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
  2021-05-21 19:06   ` Andrei Vagin
@ 2021-05-21 19:32   ` Andy Lutomirski
  2021-05-21 22:01     ` Peter Oskolkov
  2021-05-21 21:33   ` Jann Horn
  2 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2021-05-21 19:32 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, LKML, Linux API,
	Paul Turner, Ben Segall, Peter Oskolkov, Joel Fernandes,
	Andrew Morton, Andrei Vagin, Jim Newsome

On Thu, May 20, 2021 at 11:36 AM Peter Oskolkov <posk@google.com> wrote:
>
> Implement version 1 of core UMCG API (wait/wake/swap).
>
> As has been outlined in
> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/,
> efficient and synchronous on-CPU context switching is key
> to enabling two broad use cases: in-process M:N userspace scheduling
> and fast X-process RPCs for security wrappers.
>
> High-level design considerations/approaches used:
> - wait & wake can race with each other;
> - offload as much work as possible to libumcg in tools/lib/umcg,
>   specifically:
>   - most state changes, e.g. RUNNABLE <=> RUNNING, are done in
>     the userspace (libumcg);
>   - retries are offloaded to the userspace.

Do you have some perf numbers as to how long a UMCG context switch
takes compared to a normal one?

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC)
  2021-05-20 18:36 ` [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC) Peter Oskolkov
@ 2021-05-21 20:17   ` Andrei Vagin
  0 siblings, 0 replies; 35+ messages in thread
From: Andrei Vagin @ 2021-05-21 20:17 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, linux-kernel,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Thu, May 20, 2021 at 11:36:12AM -0700, Peter Oskolkov wrote:
> Implement UMCG server/worker API.
> 
> This is an early RFC patch - the code seems working, but
> more testing is needed. Gaps I plan to address before this
> is ready for a detailed review:
> 
> - preemption/interrupt handling;
> - better documentation/comments;
> - tracing;
> - additional testing;
> - corner cases like abnormal process/task termination;
> - in some cases where I kill the task (umcg_segv), returning
> an error may be more appropriate.
> 
> All in all, please focus more on the high-level approach
> and less on things like variable names, (doc) comments, or indentation.
> 
> Signed-off-by: Peter Oskolkov <posk@google.com>
> ---
>  include/linux/mm_types.h |   5 +
>  include/linux/syscalls.h |   5 +
>  kernel/fork.c            |  11 +
>  kernel/sched/core.c      |  11 +
>  kernel/sched/umcg.c      | 764 ++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/umcg.h      |  54 +++
>  mm/init-mm.c             |   4 +
>  7 files changed, 845 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6613b26a8894..5ca7b7d55775 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -562,6 +562,11 @@ struct mm_struct {
>  #ifdef CONFIG_IOMMU_SUPPORT
>  		u32 pasid;
>  #endif
> +
> +#ifdef CONFIG_UMCG
> +	spinlock_t umcg_lock;
> +	struct list_head umcg_groups;
> +#endif
>  	} __randomize_layout;
>  
>  	/*
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 15de3e34ccee..2781659daaf1 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1059,6 +1059,11 @@ asmlinkage long umcg_wait(u32 flags, const struct __kernel_timespec __user *time
>  asmlinkage long umcg_wake(u32 flags, u32 next_tid);
>  asmlinkage long umcg_swap(u32 wake_flags, u32 next_tid, u32 wait_flags,
>  				const struct __kernel_timespec __user *timeout);
> +asmlinkage long umcg_create_group(u32 api_version, u64, flags);
> +asmlinkage long umcg_destroy_group(u32 group_id);
> +asmlinkage long umcg_poll_worker(u32 flags, struct umcg_task __user **ut);
> +asmlinkage long umcg_run_worker(u32 flags, u32 worker_tid,
> +		struct umcg_task __user **ut);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ace4631b5b54..3a2a7950df8e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1026,6 +1026,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	seqcount_init(&mm->write_protect_seq);
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_UMCG
> +	spin_lock_init(&mm->umcg_lock);
> +	INIT_LIST_HEAD(&mm->umcg_groups);
> +#endif
>  	mm->core_state = NULL;
>  	mm_pgtables_bytes_init(mm);
>  	mm->map_count = 0;
> @@ -1102,6 +1106,13 @@ static inline void __mmput(struct mm_struct *mm)
>  		list_del(&mm->mmlist);
>  		spin_unlock(&mmlist_lock);
>  	}
> +#ifdef CONFIG_UMCG
> +	if (!list_empty(&mm->umcg_groups)) {
> +		spin_lock(&mm->umcg_lock);
> +		list_del(&mm->umcg_groups);

I am not sure that I understand what is going on here. umsg_groups is
the head of a group list. list_del is usually called on list entries.

Should we enumirate all groups here and destroy them?

> +		spin_unlock(&mm->umcg_lock);
> +	}
> +#endif
>  	if (mm->binfmt)
>  		module_put(mm->binfmt->module);
>  	mmdrop(mm);

...

> +/**
> + * sys_umcg_create_group - create a UMCG group
> + * @api_version:           Requested API version.
> + * @flags:                 Reserved.
> + *
> + * Return:
> + * >= 0                - the group ID
> + * -EOPNOTSUPP         - @api_version is not supported
> + * -EINVAL             - @flags is not valid
> + * -ENOMEM             - not enough memory
> + */
> +SYSCALL_DEFINE2(umcg_create_group, u32, api_version, u64, flags)
> +{
> +	int ret;
> +	struct umcg_group *group;
> +	struct umcg_group *list_entry;
> +	struct mm_struct *mm = current->mm;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (__api_version(api_version))
> +		return -EOPNOTSUPP;
> +
> +	group = kzalloc(sizeof(struct umcg_group), GFP_KERNEL);
> +	if (!group)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&group->lock);
> +	INIT_LIST_HEAD(&group->list);
> +	INIT_LIST_HEAD(&group->waiters);
> +	group->flags = flags;
> +	group->api_version = api_version;
> +
> +	spin_lock(&mm->umcg_lock);
> +
> +	list_for_each_entry_rcu(list_entry, &mm->umcg_groups, list) {
> +		if (list_entry->group_id >= group->group_id)
> +			group->group_id = list_entry->group_id + 1;
> +	}

pls take into account that we need to be able to save and restore umcg
groups from user-space. There is the CRIU project that allows to
checkpoint/restore processes.

> +
> +	list_add_rcu(&mm->umcg_groups, &group->list);

I think it should be:

	list_add_rcu(&group->list, &mm->umcg_groups);
> +
> +	ret = group->group_id;
> +	spin_unlock(&mm->umcg_lock);
> +
> +	return ret;
> +}
> +

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-21 19:06   ` Andrei Vagin
@ 2021-05-21 21:31     ` Jann Horn
  2021-05-21 22:03       ` Peter Oskolkov
  0 siblings, 1 reply; 35+ messages in thread
From: Jann Horn @ 2021-05-21 21:31 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Peter Oskolkov, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	kernel list, Linux API, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Fri, May 21, 2021 at 9:09 PM Andrei Vagin <avagin@gmail.com> wrote:
> On Thu, May 20, 2021 at 11:36:09AM -0700, Peter Oskolkov wrote:
> > @@ -67,7 +137,75 @@ SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
> >   */
> >  SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
> >  {
> > -     return -ENOSYS;
> > +     struct umcg_task_data *utd;
> > +     int ret = -EINVAL;
> > +
> > +     rcu_read_lock();
> > +     utd = rcu_dereference(current->umcg_task_data);
> > +
> > +     if (!utd || flags)
> > +             goto out;
> > +
> > +     task_lock(current);
> > +     rcu_assign_pointer(current->umcg_task_data, NULL);
> > +     task_unlock(current);
> > +
> > +     ret = 0;
> > +
> > +out:
> > +     rcu_read_unlock();
> > +     if (!ret && utd) {
> > +             synchronize_rcu();
>
> synchronize_rcu is expensive. Do we really need to call it here? Can we
> use kfree_rcu?
>
> Where is task->umcg_task_data freed when a task is destroyed?

or executed - the umcg stuff includes a userspace pointer, so it
probably shouldn't normally be kept around across execve?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
  2021-05-21 19:06   ` Andrei Vagin
  2021-05-21 19:32   ` Andy Lutomirski
@ 2021-05-21 21:33   ` Jann Horn
  2021-06-09 13:01     ` Peter Zijlstra
  2 siblings, 1 reply; 35+ messages in thread
From: Jann Horn @ 2021-05-21 21:33 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, kernel list,
	Linux API, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Thu, May 20, 2021 at 8:36 PM Peter Oskolkov <posk@google.com> wrote:
> Implement version 1 of core UMCG API (wait/wake/swap).
>
> As has been outlined in
> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/,
> efficient and synchronous on-CPU context switching is key
> to enabling two broad use cases: in-process M:N userspace scheduling
> and fast X-process RPCs for security wrappers.
>
> High-level design considerations/approaches used:
> - wait & wake can race with each other;
> - offload as much work as possible to libumcg in tools/lib/umcg,
>   specifically:
>   - most state changes, e.g. RUNNABLE <=> RUNNING, are done in
>     the userspace (libumcg);
>   - retries are offloaded to the userspace.
[...]
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
[...]
> +static int get_state(struct umcg_task __user *ut, u32 *state)
> +{
> +       return get_user(*state, (u32 __user *)ut);

Why the cast instead of get_user(*state, &ut->state)?
And maybe do this inline instead of adding a separate helper for it?

> +}
> +
> +static int put_state(struct umcg_task __user *ut, u32 state)
> +{
> +       return put_user(state, (u32 __user *)ut);
> +}
[...]
> +static int do_context_switch(struct task_struct *next)
> +{
> +       struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +       /*
> +        * It is important to set_current_state(TASK_INTERRUPTIBLE) before
> +        * waking @next, as @next may immediately try to wake current back
> +        * (e.g. current is a server, @next is a worker that immediately
> +        * blocks or waits), and this next wakeup must not be lost.
> +        */
> +       set_current_state(TASK_INTERRUPTIBLE);
> +
> +       WRITE_ONCE(utd->in_wait, true);
> +
> +       if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU))
> +               return -EAGAIN;
> +
> +       freezable_schedule();
> +
> +       WRITE_ONCE(utd->in_wait, false);
> +
> +       if (signal_pending(current))
> +               return -EINTR;

What is this -EINTR supposed to tell userspace? We can't tell whether
we were woken up by a signal or by do_context_switch() or the
umcg_wake syscall, right? If we're woken by another thread calling
do_context_switch() and then get a signal immediately afterwards,
can't that lead to a lost wakeup?

I don't know whether trying to track the origin of the wakeup is a
workable approach here; you might have to instead do cmpxchg() on the
->in_wait field and give it three states (default, waiting-for-wake
and successfully-woken)?
Or you give up on trying to figure out who woke you, just always
return zero, and let userspace deal with figuring out whether the
wakeup was real or not. I don't know whether that'd be acceptable.

> +       return 0;
> +}
> +
> +static int do_wait(void)
> +{
> +       struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +       if (!utd)
> +               return -EINVAL;
> +
> +       WRITE_ONCE(utd->in_wait, true);
> +
> +       set_current_state(TASK_INTERRUPTIBLE);
> +       freezable_schedule();
> +
> +       WRITE_ONCE(utd->in_wait, false);
> +
> +       if (signal_pending(current))
> +               return -EINTR;
> +
> +       return 0;
>  }
>
>  /**
> @@ -90,7 +228,23 @@ SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
>  SYSCALL_DEFINE2(umcg_wait, u32, flags,
>                 const struct __kernel_timespec __user *, timeout)
>  {
> -       return -ENOSYS;
> +       struct umcg_task_data *utd;
> +
> +       if (flags)
> +               return -EINVAL;
> +       if (timeout)
> +               return -EOPNOTSUPP;
> +
> +       rcu_read_lock();
> +       utd = rcu_dereference(current->umcg_task_data);
> +       if (!utd) {
> +               rcu_read_unlock();
> +               return -EINVAL;
> +       }
> +
> +       rcu_read_unlock();

rcu_access_pointer() instead of the locking and unlocking?

> +       return do_wait();
>  }
>
>  /**
> @@ -110,7 +264,39 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags,
>   */
>  SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
>  {
> -       return -ENOSYS;
> +       struct umcg_task_data *next_utd;
> +       struct task_struct *next;
> +       int ret = -EINVAL;
> +
> +       if (!next_tid)
> +               return -EINVAL;
> +       if (flags)
> +               return -EINVAL;
> +
> +       next = find_get_task_by_vpid(next_tid);
> +       if (!next)
> +               return -ESRCH;
> +       rcu_read_lock();

Wouldn't it be more efficient to replace the last 4 lines with the following?

rcu_read_lock();
next = find_task_by_vpid(next_tid);
if (!next) {
  err = -ESRCH;
  goto out;
}

Then you don't need to use refcounting here...

> +       next_utd = rcu_dereference(next->umcg_task_data);
> +       if (!next_utd)
> +               goto out;
> +
> +       if (!READ_ONCE(next_utd->in_wait)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +
> +       ret = wake_up_process(next);
> +       put_task_struct(next);

... and you'd be able to drop this put_task_struct(), too.

> +       if (ret)
> +               ret = 0;
> +       else
> +               ret = -EAGAIN;
> +
> +out:
> +       rcu_read_unlock();
> +       return ret;
>  }
>
>  /**
> @@ -139,5 +325,44 @@ SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
>  SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
>                 const struct __kernel_timespec __user *, timeout)
>  {
> -       return -ENOSYS;
> +       struct umcg_task_data *curr_utd;
> +       struct umcg_task_data *next_utd;
> +       struct task_struct *next;
> +       int ret = -EINVAL;
> +
> +       rcu_read_lock();
> +       curr_utd = rcu_dereference(current->umcg_task_data);
> +
> +       if (!next_tid || wake_flags || wait_flags || !curr_utd)
> +               goto out;
> +
> +       if (timeout) {
> +               ret = -EOPNOTSUPP;
> +               goto out;
> +       }
> +
> +       next = find_get_task_by_vpid(next_tid);
> +       if (!next) {
> +               ret = -ESRCH;
> +               goto out;
> +       }

There isn't any type of access check here, right? Any task can wake up
any other task? That feels a bit weird to me - and if you want to keep
it as-is, it should probably at least be documented that any task on
the system can send you spurious wakeups if you opt in to umcg.

In contrast, shared futexes can avoid this because they get their
access control implicitly from the VMA.

> +       next_utd = rcu_dereference(next->umcg_task_data);
> +       if (!next_utd) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (!READ_ONCE(next_utd->in_wait)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +
> +       rcu_read_unlock();
> +
> +       return do_context_switch(next);

It looks like the refcount of the target task is incremented but never
decremented, so this probably currently leaks references?

I'd maybe try to split do_context_switch() into two parts, one that
does the non-blocking waking of another task and one that does the
sleeping. Then you can avoid taking a reference on the task as above -
this is supposed to be a really hot fastpath, so it's a good idea to
avoid atomic instructions if possible, right?



> +out:
> +       rcu_read_unlock();
> +       return ret;
>  }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-21 19:32   ` Andy Lutomirski
@ 2021-05-21 22:01     ` Peter Oskolkov
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-21 22:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, LKML, Linux API,
	Paul Turner, Ben Segall, Peter Oskolkov, Joel Fernandes,
	Andrew Morton, Andrei Vagin, Jim Newsome

On Fri, May 21, 2021 at 12:32 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, May 20, 2021 at 11:36 AM Peter Oskolkov <posk@google.com> wrote:
> >
> > Implement version 1 of core UMCG API (wait/wake/swap).
> >
> > As has been outlined in
> > https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/,
> > efficient and synchronous on-CPU context switching is key
> > to enabling two broad use cases: in-process M:N userspace scheduling
> > and fast X-process RPCs for security wrappers.
> >
> > High-level design considerations/approaches used:
> > - wait & wake can race with each other;
> > - offload as much work as possible to libumcg in tools/lib/umcg,
> >   specifically:
> >   - most state changes, e.g. RUNNABLE <=> RUNNING, are done in
> >     the userspace (libumcg);
> >   - retries are offloaded to the userspace.
>
> Do you have some perf numbers as to how long a UMCG context switch
> takes compared to a normal one?

I'm not sure what is a "normal context switch" in this context. Futex
wakeup on a remote idle CPU takes 5-10usec; an on-CPU UMCG context
switch takes less than 1usec; futex wake + futex wait on the same CPU
(taskset ***) takes about 1-1.5usec in my benchmarks.

>
> --Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-21 21:31     ` Jann Horn
@ 2021-05-21 22:03       ` Peter Oskolkov
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-21 22:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	kernel list, Linux API, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Fri, May 21, 2021 at 2:32 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 21, 2021 at 9:09 PM Andrei Vagin <avagin@gmail.com> wrote:
> > On Thu, May 20, 2021 at 11:36:09AM -0700, Peter Oskolkov wrote:
> > > @@ -67,7 +137,75 @@ SYSCALL_DEFINE4(umcg_register_task, u32, api_version, u32, flags, u32, group_id,
> > >   */
> > >  SYSCALL_DEFINE1(umcg_unregister_task, u32, flags)
> > >  {
> > > -     return -ENOSYS;
> > > +     struct umcg_task_data *utd;
> > > +     int ret = -EINVAL;
> > > +
> > > +     rcu_read_lock();
> > > +     utd = rcu_dereference(current->umcg_task_data);
> > > +
> > > +     if (!utd || flags)
> > > +             goto out;
> > > +
> > > +     task_lock(current);
> > > +     rcu_assign_pointer(current->umcg_task_data, NULL);
> > > +     task_unlock(current);
> > > +
> > > +     ret = 0;
> > > +
> > > +out:
> > > +     rcu_read_unlock();
> > > +     if (!ret && utd) {
> > > +             synchronize_rcu();
> >
> > synchronize_rcu is expensive. Do we really need to call it here? Can we
> > use kfree_rcu?
> >
> > Where is task->umcg_task_data freed when a task is destroyed?
>
> or executed - the umcg stuff includes a userspace pointer, so it
> probably shouldn't normally be kept around across execve?

Ack - thanks for these and other comments. Please keep them coming.
I'll address them in v0.2.

Thanks,
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-21 19:13   ` Peter Oskolkov
@ 2021-05-21 23:08     ` Jann Horn
  0 siblings, 0 replies; 35+ messages in thread
From: Jann Horn @ 2021-05-21 23:08 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Andrei Vagin, Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Jim Newsome

On Fri, May 21, 2021 at 9:14 PM Peter Oskolkov <posk@google.com> wrote:
> On Fri, May 21, 2021 at 11:44 AM Andrei Vagin <avagin@google.com> wrote:
> > On Thu, May 20, 2021 at 11:36 AM Peter Oskolkov <posk@google.com> wrote:
> >>
> >> As indicated earlier in the FUTEX_SWAP patchset:
> >>
> >> https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
> >
> >
> > Hi Peter,
> >
> > Do you have benchmark results? How fast is it compared with futex_swap and the google switchto?
>
> Hi Andrei,
>
> I did not run benchmarks on the same machine/kernel, but umcg_swap
> between "core" tasks (your use case for gVisor) should be somewhat
> faster than futex_swap, as there is no reading from the userspace and
> no futex hash lookup/dequeue ops;

The futex code currently creates and destroys hash table elements on
wait/wake, which does involve locking, but you could probably avoid
that if you built a faster futex variant optimized for the
single-waiter case that uses a bit more kernel memory to keep a
persistent hash table element (with RCU freeing) per pre-registered
lock address around? Whether that'd be significantly faster, I don't
know.


(As a sidenote, the futex code could slow down if the number of futex
buckets isn't well-calibrated - meaning you have something like >200
distinct futex addresses per CPU core, see futex_init(). Then
futex_init() probably needs to be tuned a bit. Actually, on my work
laptop, this is what I see right now (not counting multiple waiters on
the same address in the same process, since they intentionally occupy
the same bucket):

# for tasks_dir in /proc/*/task; do cat $tasks_dir/*/syscall | grep
'^202 ' | cut -d' ' -f2 | sort | uniq; done | wc -l
1193
# cat /sys/devices/system/cpu/possible
0-3
# gdb -core=/proc/kcore -ex "print ((unsigned long *)(0x$(grep
__futex_data /proc/kallsyms | cut -d' ' -f1)))[1]" -batch
[...]
$1 = 1024

So the load factor of the futex hash table on this machine right now
is ~117%, which I think is quite a bit higher than you'd normally want
in a hash table? I don't know how representative that is though. Seems
to mostly come from the tons of Chrome processes.)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-21 19:17         ` Jonathan Corbet
@ 2021-05-27  0:06           ` Peter Oskolkov
  2021-05-27 15:41             ` Jonathan Corbet
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Oskolkov @ 2021-05-27  0:06 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

[-- Attachment #1: Type: text/plain, Size: 22004 bytes --]

On Fri, May 21, 2021 at 12:17 PM Jonathan Corbet <corbet@lwn.net> wrote:

[...]

> What are the "worker
> states"?  What's a "UMCG group"?  Yes, all this can be worked out by
> pounding one's head against the code for long enough, but you're asking
> a fair amount of your reviewers.
>
> A good overall description would be nice, perhaps for the userspace-api
> book.  But *somebody* is also going to have to write real man pages for
> all these system calls; if you provided those, the result should be a
> good description of how you expect this subsystem to work.

Hi Jon,

I've pasted below the content of umcg.rst file that I'll add as a doc
patch to the next version of the patchset. I've also attached a PDF
version of the file rendered, in case it is useful. I also think that
it is a bit early for manpages - I expect the API and/or behavior to
change quite a bit before this is all merged. I'm also not completely
sure whether the manpages should describe the "porcelain API" or the
"plumbing API" (see the .rst below).

Please let me know if you have any questions, comments, or suggestions.

Thanks,
Peter


======================================
User Managed Concurrency Groups (UMCG)
======================================

User Managed Concurrency Groups (UMCG) is an M:N threading
subsystem/toolkit that lets user space application developers
implement in-process user space schedulers.

.. contents:: :local:

Why? Heterogeneous in-process workloads
=======================================
Linux kernel's CFS scheduler is designed for the "common" use case,
with efficiency/throughput in mind. Work isolation and workloads of
different "urgency" are addressed by tools such as cgroups, CPU
affinity, priorities, etc., which are difficult or impossible to
efficiently use in-process.

For example, a single DBMS process may receive tens of thousands
requests per second; some of these requests may have strong response
latency requirements as they serve live user requests (e.g. login
authentication); some of these requests may not care much about
latency but must be served within a certain time period (e.g. an
hourly aggregate usage report); some of these requests are to be
served only on a best-effort basis and can be NACKed under high load
(e.g. an exploratory research/hypothesis testing workload).

Beyond different work item latency/throughput requirements as outlined
above, the DBMS may need to provide certain guarantees to different
users; for example, user A may "reserve" 1 CPU for their
high-priority/low latency requests, 2 CPUs for mid-level throughput
workloads, and be allowed to send as many best-effort requests as
possible, which may or may not be served, depending on the DBMS load.
Besides, the best-effort work, started when the load was low, may need
to be delayed if suddenly a large amount of higher-priority work
arrives. With hundreds or thousands of users like this, it is very
difficult to guarantee the application's responsiveness using standard
Linux tools while maintaining high CPU utilization.

Gaming is another use case: some in-process work must be completed
before a certain deadline dictated by frame rendering schedule, while
other work items can be delayed; some work may need to be
cancelled/discarded because the deadline has passed; etc.

User Managed Concurrency Groups is an M:N threading toolkit that
allows constructing user space schedulers designed to efficiently
manage heterogeneous in-process workloads described above while
maintaining high CPU utilization (95%+).

Requirements
============
One relatively established way to design high-efficiency, low-latency
systems is to split all work into small on-cpu work items, with
asynchronous I/O and continuations, all executed on a thread pool with
the number of threads not exceeding the number of available CPUs.
Although this approach works, it is quite difficult to develop and
maintain such a system, as, for example, small continuations are
difficult to piece together when debugging. Besides, such asynchronous
callback-based systems tend to be somewhat cache-inefficient, as
continuations can get scheduled on any CPU regardless of cache
locality.

M:N threading and cooperative user space scheduling enables controlled
CPU usage (minimal OS preemption), synchronous coding style, and
better cache locality.

Specifically:

- a variable/fluctuating number M of "application" threads should be
"scheduled over" a relatively fixed number N of "kernel" threads,
where N is less than or equal to the number of CPUs available;
- only those application threads that are attached to kernel threads
are scheduled "on CPU";
- application threads should be able to cooperatively yield to each other;
- when an application thread blocks in kernel (e.g. in I/O), this
becomes a scheduling event ("block") that the userspace scheduler
should be able to efficiently detect, and reassign a waiting
application thread to the freeded "kernel" thread;
- when a blocked application thread wakes (e.g. its I/O operation
completes), this even ("wake") should also be detectable by the
userspace scheduler, which should be able to either quickly dispatch
the newly woken thread to an idle "kernel" thread or, if all "kernel"
threads are busy, put it in the waiting queue;
- in addition to the above, it would be extremely useful for a
separate in-process "watchdog" facility to be able to monitor the
state of each of the M+N threads, and to intervene in case of runaway
workloads (interrupt/preempt).

The building blocks
===================
Main Objects
------------
Based on the requirements above, UMCG exposes the following "objects":

- server tasks/threads: these are the N "kernel" threads from the
requirements section above;
- worker tasks/threads: these are the M application threads from the
requirements section above;
- UMCG groups: all UMCG worker and server threads belong to a UMCG
group; a process (a shared MM) can have multiple groups; workers and
servers must belong to the same UMCG group to interact (note: multiple
groups per process can be useful to e.g. partition scheduling per NUMA
node).

Key operations/API
------------------
As described above, the framework/toolkit must be able to efficiently
process block/wake events, run workers, and provide cooperative worker
scheduling facilities. As such, there are five runtime operations (not
including control/support facilities), two for servers (explicit
scheduling), and three for workers (cooperative scheduling):

- (server) run_worker(): a server specifies which worker to run; the
call blocks; when the worker the server is running blocks in the
kernel, the call returns, telling the server that its worker has
blocked;
- (server) poll_worker(): a server polls the kernel for workers whose
blocking operations has completed; the call returns the worker who
woke the earliest (or blocks until there is such a worker);
- (worker) wait(): a worker can cooperatively "yield"; its attached
server's run_worker() call returns;
- (worker) wake(): any task/thread can "wake" a worker that has yielded;
- (worker) swap(): a running worker can cooperatively "swap" its
server with another worker.

Detailed state transitions are described below.

umcg_context_switch
-------------------
This subsection explains some kernel-internal details.

It is important to emphasize that for a userspace scheduling framework
to be of use, it is essential that common scenarios such as

- worker W1 blocks, wakes its serving server S which then runs worker W2
- worker W1 swaps into worker W2
- worker W unblocks, wakes a polling server S which then runs it

be as fast and efficient as possible. If these operations mean simply
"wake remote task T and go to sleep", with T scheduled on a remote
(idle) CPU, the overall performance of the system will be poor due to
wakeup delays and cache locality issues (e.g. a server on CPU_A
processes a worker blocked on CPU_B).

umcg_context_switch() is basically "wake remote task T on the current
CPU and context switch into it"; it is a kernel-internal function that
most operations outlined above use; it is exposed to the userspace
indirectly via run_worker() (= context switch from the current server
to the worker) and swap() (= context switch between two workers).

Initially, umcg_context_switch() is implemented by adding
WF_CURRENT_CPU flag that is passed to ttwu; this change of the "wake
remotely and go to sleep" operation to "wake on the current CPU and go
to sleep" reduces the overall latency of swap about 8x on average
(6-10 usec to less than 1 usec).

Another use case: fast IPC
==========================
Fast, synchronous on-CPU context switching can also be used for fast
IPC (cross-process). For example, a typical security wrapper
intercepts syscalls of an untrusted process, consults with external
(out-of-process) "syscall firewall", and then delivers the allow/deny
decision back (or the remote process actually proxies the syscall
execution on behalf of the monitored process). This roundtrip is
usually relatively slow, consuming at least 5-10 usec, as it involves
waking a task on a remote CPU. A fast on-CPU context switch not only
helps with the wakeup latency, but also has beneficial cache locality
properties.

UMCG addresses this use case by providing another type of UMCG task: a
"core" task. A core UMCG task can be thought of as a UMCG worker that
does not belong to a UMCG group and that runs without a UMCG server
attached to it, but that has access to the same UMCG worker
operations, namely wait/wake/swap.

Userspace API
=============

This section outlines the key components of UMCG API.

UMCG task states and state transitions
--------------------------------------

At a high level, UMCG is a task/thread managing/scheduling framework.
The following task states are defined in uapi/linux/umcg.h (extra
state flags are omitted here):

.. code-block:: C

 #define UMCG_TASK_NONE 0
 /* UMCG server states. */
 #define UMCG_TASK_POLLING 1
 #define UMCG_TASK_SERVING 2
 #define UMCG_TASK_PROCESSING 3
 /* UMCG worker states. */
 #define UMCG_TASK_RUNNABLE 4
 #define UMCG_TASK_RUNNING 5
 #define UMCG_TASK_BLOCKED 6
 #define UMCG_TASK_UNBLOCKED 7

Server states and state transitions are easy:

- UMCG_TASK_POLLING: the server task is blocked in umcg_poll_worker(),
waiting for an UNBLOCKED worker;
- UMCG_TASK_SERVING: the server task is blocked in umcg_run_worker(),
serving a RUNNING worker;
- UMCG_TASK_PROCESSING: the server task is running in the userspace,
presumably processing worker events.

Worker states and state transitions are more complicated:

- UMCG_TASK_RUNNING: the worker task is runnable/schedulable from the
OS scheduler point of view (and is most likely running on a CPU);
- UMCG_TASK_RUNNABLE: this is a special worker state indicating that
the worker is not runnable/schedulable by the OS scheduler, but can be
scheduled by the user space scheduler; it is a sort of "voluntarily
blocked" state;
- UMCG_TASK_BLOCKED: a previously RUNNING worker involuntarily blocked
in the kernel, e.g. in a synchronous I/O operation; not
runnable/schedulable by the OS scheduler;
- UMCG_TASK_UNBLOCKED: the blocking operation of a BLOCKED worker has
completed, but the user space has not yet been notified of the event;
not runnable/schedulable by the OS scheduler.

State transitions:

::

 RUNNABLE => RUNNING : umcg_run_worker()
 RUNNABLE => RUNNING : umcg_swap() (RUNNABLE next becomes RUNNING)

 RUNNING  => RUNNABLE: umcg_swap() (RUNNING current becomes RUNNABLE)
 RUNNING  => RUNNABLE: umcg_wait() (RUNNING current becomes RUNNABLE)

 RUNNING   => BLOCKED  : the worker blocks in the kernel
 BLOCKED   => UNBLOCKED: the worker's blocking operation has completed
 UNBLOCKED => RUNNABLE : umcg_poll_worker()

 RUNNABLE  => UNBLOCKED: umcg_wake()

Block/wake events are delivered to server tasks in userspace by waking
them from their blocking operations:

- umcg_run_worker() returns (wakes the server) when the worker either
blocks voluntarily via umcg_wait(), RUNNABLE, or involuntarily,
BLOCKED;
- umcg_poll_worker() returns (wakes the server) when a new UNBLOCKED
worker becomes available; if more then one UNBLOCKED worker is
present, they are queued and umcg_poll_worker() returns immediately,
with the longest waiting worker.

Several optimized state transitions will be possible in later versions
of UMCG API. For example, it will be possible to call
umcg_run_worker() on a BLOCKED worker so that BLOCKED => UNBLOCKED =>
RUNNABLE => RUNNING chain of states is "short cut" and happens behind
the scenes. Or umcg_poll_worker() will be able to immediately run the
polled worker, expediting UNBLOCKED => RUNNABLE => RUNNING.

The state transitions described above become somewhat more complicated
in the presence of interrupts/signals, and the exact behavior in the
presence of signals/interrupts is still work-in-progress.

UMCG core tasks have only two states: RUNNABLE and RUNNING;
block/unblock detection logic is not applicable: if a UMCG core task
blocks on I/O, it is still considered RUNNING. In a way, UMCG core
tasks participate in cooperative scheduling, in that they can yield to
each other via umcg_wait() and umcg_swap() and wake RUNNABLE tasks via
umcg_wake().

Please note that while UMCG server/worker tasks must belong to the
same UMCG group, and thus the same user process, UMCG core tasks can
interact with each other across process boundaries.

UMCG API
--------

- Runtime/scheduling:

  - Server API (explicit user space scheduling):

    - umcg_run_worker
    - umcg_poll_worker

  - Core/Worker API (cooperative scheduling):

    - umcg_wait
    - umcg_wake
    - umcg_swap

- Management:

  - umcg_create_group
  - umcg_destroy_group
  - umcg_register_task
  - umcg_unregister_task

API levels: porcelain and plumbing
----------------------------------
Similarly to `Git Porcelain and Plumbing API`_, UMCG exposes two API
"surfaces": a higher-level "porcelain" API via libumcg, and a
lower-level "plumbing" API via syscalls.

.. _Git Porcelain and Plumbing API:
https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

This design helps keep UMCG syscalls relatively lightweight, while
hiding many implementation details from end users inside libumcg. It
can be useful to think of libumcg as the only true UMCG API for
end-users, while UMCG syscalls are a low-level toolkit that enables
the higher-level API.

Porcelain API: libumcg
----------------------
Located in $KDIR/tools/lib/umcg, libumcg exposes the following key API
functions in libumcg.h (some non-essential helper functions are
omitted here):

.. code-block:: C

 typedef intptr_t umcg_t;   /* UMCG group ID. */
 typedef intptr_t umcg_tid; /* UMCG thread ID. */

 #define UMCG_NONE (0)

 /**
  * umcg_get_utid - return the UMCG ID of the current thread.
  *
  * The function always succeeds, and the returned ID is guaranteed to be
  * stable over the life of the thread (and multiple
  * umcg_register/umcg_unregister calls).
  *
  * The ID is NOT guaranteed to be unique over the life of the process.
  */
 umcg_tid umcg_get_utid(void);

 /**
  * umcg_register_core_task - register the current thread as a UMCG core task
  *
  * Return:
  * UMCG_NONE     - an error occurred. Check errno.
  * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
  *                 to match the value returned by umcg_get_utid).
  */
 umcg_tid umcg_register_core_task(intptr_t tag);

 /**
  * umcg_register_worker - register the current thread as a UMCG worker
  * @group_id:      The ID of the UMCG group the thread should join.
  *
  * Return:
  * UMCG_NONE     - an error occurred. Check errno.
  * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
  *                 to match the value returned by umcg_get_utid).
  */
 umcg_tid umcg_register_worker(umcg_t group_id, intptr_t tag);

 /**
  * umcg_register_server - register the current thread as a UMCG server
  * @group_id:      The ID of the UMCG group the thread should join.
  *
  * Return:
  * UMCG_NONE     - an error occurred. Check errno.
  * != UMCG_NONE  - the ID of the thread to be used with UMCG API (guaranteed
  *                 to match the value returned by umcg_get_utid).
  */
 umcg_tid umcg_register_server(umcg_t group_id, intptr_t tag);

 /**
  * umcg_unregister_task - unregister the current thread
  *
  * Return:
  * 0              - OK
  * -1             - the current thread is not a UMCG thread
  */
 int umcg_unregister_task(void);

 /**
  * umcg_wait - block the current thread
  * @timeout:   absolute timeout (not supported at the moment)
  *
  * Blocks the current thread, which must have been registered via
umcg_register,
  * until it is woken via umcg_wake or swapped into via umcg_swap. If
the current
  * thread has a wakeup queued (see umcg_wake), returns zero immediately,
  * consuming the wakeup.
  *
  * Return:
  * 0         - OK, the thread was waken;
  * -1        - did not wake normally;
  *               errno:
  *                 EINTR: interrupted
  *                 EINVAL: some other error occurred
  */
 int umcg_wait(const struct timespec *timeout);

 /**
  * umcg_wake - wake @next
  * @next:      ID of the thread to wake (IDs are returned by umcg_register).
  *
  * If @next is blocked via umcg_wait, or umcg_swap, wake it. If @next is
  * running, queue the wakeup, so that a future block of @next will consume
  * the wakeup but will not block.
  *
  * umcg_wake is non-blocking, but may retry a few times to make sure @next
  * has indeed woken.
  *
  * umcg_wake can queue at most one wakeup; if @next has a wakeup queued,
  * an error is returned.
  *
  * Return:
  * 0         - OK, @next has woken, or a wakeup has been queued;
  * -1        - an error occurred.
  */
 int umcg_wake(umcg_tid next);

 /**
  * umcg_swap - wake @next, put the current thread to sleep
  * @next:      ID of the thread to wake
  * @timeout:   absolute timeout (not supported at the moment)
  *
  * umcg_swap is semantically equivalent to
  *
  *     int ret = umcg_wake(next);
  *     if (ret)
  *             return ret;
  *     return umcg_wait(timeout);
  *
  * but may do a synchronous context switch into @next on the current CPU.
  */
 int umcg_swap(umcg_tid next, const struct timespec *timeout);

 /**
  * umcg_create_group - create a UMCG group
  * @flags:             Reserved.
  *
  * UMCG groups have worker and server threads.
  *
  * Worker threads are either RUNNABLE/RUNNING "on behalf" of server threads
  * (see umcg_run_worker), or are BLOCKED/UNBLOCKED. A worker thread can be
  * running only if it is attached to a server thread (interrupts can
  * complicate the matter - TBD).
  *
  * Server threads are either blocked while running worker threads or are
  * blocked waiting for available (=UNBLOCKED) workers. A server thread
  * can "run" only one worker thread.
  *
  * Return:
  * UMCG_NONE     - an error occurred. Check errno.
  * != UMCG_NONE  - the ID of the group, to be used in e.g. umcg_register.
  */
 umcg_t umcg_create_group(uint32_t flags);

 /**
  * umcg_destroy_group - destroy a UMCG group
  * @umcg:               ID of the group to destroy
  *
  * The group must be empty (no server or worker threads).
  *
  * Return:
  * 0            - Ok
  * -1           - an error occurred. Check errno.
  *                errno == EAGAIN: the group has server or worker threads
  */
 int umcg_destroy_group(umcg_t umcg);

 /**
  * umcg_poll_worker - wait for the first available UNBLOCKED worker
  *
  * The current thread must be a UMCG server. If there is a list/queue of
  * waiting UNBLOCKED workers in the server's group, umcg_poll_worker
  * picks the longest waiting one; if there are no UNBLOCKED workers, the
  * current thread sleeps in the polling queue.
  *
  * Return:
  * UMCG_NONE         - an error occurred; check errno;
  * != UMCG_NONE      - a RUNNABLE worker.
  */
 umcg_tid umcg_poll_worker(void);

 /**
  * umcg_run_worker - run @worker as a UMCG server
  * @worker:          the ID of a RUNNABLE worker to run
  *
  * The current thread must be a UMCG "server".
  *
  * Return:
  * UMCG_NONE    - if errno == 0, the last worker the server was running
  *                unregistered itself; if errno != 0, an error occurred
  * != UMCG_NONE - the ID of the last worker the server was running before
  *                the worker was blocked or preempted.
  */
 umcg_tid umcg_run_worker(umcg_tid worker);

 /**
  * umcg_get_task-state - return the current UMCG state of @task
  * @task:                the ID of a UMCG task
  *
  * Return: one of UMCG_TASK_*** values defined in uapi/linux/umcg.h
  */
 uint32_t umcg_get_task_state(umcg_tid task);

Plumbing API: sys_umcg_*** syscalls
-----------------------------------

The following new Linux syscalls are proposed:

::

 sys_umcg_api_version
 sys_umcg_register_task
 sys_umcg_unregister_task
 sys_umcg_wait
 sys_umcg_wake
 sys_umcg_swap
 sys_umcg_create_group
 sys_umcg_destroy_group
 sys_umcg_poll_worker
 sys_umcg_run_worker
 sys_umcg_preempt_worker

They closely resemble umcg_*** functions from libumcg, but are more
lightweight. For example, some of the state transitions described
above are technically handled in the user space (libumcg), with the
syscalls doing the absolute minimum. The exact boundary of what is
done in the kernel and what is delegated to the userspace is still
work-in-progress. End users are supposed to use libumcg rather than
call sys_umcg_*** directly.

sys_umcg_preempt_worker() is a placeholder for a new syscall to be
added in the future.

A brief historical note
=======================

In 2012-2013 Paul Turner and his team at Google developed SwitchTo and
SwitchTo Groups Linux kernel extensions, on top of which a C++ user
space scheduling framework called "Google Fibers" is built; it is used
widely and successfully at Google.

UMCG core API (wait/wake/swap) is based on SwitchTo API, while the
overall UMCG API resembles SwitchTo Groups API.

[-- Attachment #2: 2021-05-26_UMCG_API.pdf --]
[-- Type: application/pdf, Size: 219371 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-27  0:06           ` Peter Oskolkov
@ 2021-05-27 15:41             ` Jonathan Corbet
  0 siblings, 0 replies; 35+ messages in thread
From: Jonathan Corbet @ 2021-05-27 15:41 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Peter Zijlstra, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, linux-api, Paul Turner, Ben Segall,
	Peter Oskolkov, Joel Fernandes, Andrew Morton, Andrei Vagin,
	Jim Newsome

Peter Oskolkov <posk@google.com> writes:

> I've pasted below the content of umcg.rst file that I'll add as a doc
> patch to the next version of the patchset. I've also attached a PDF
> version of the file rendered, in case it is useful. I also think that
> it is a bit early for manpages - I expect the API and/or behavior to
> change quite a bit before this is all merged. I'm also not completely
> sure whether the manpages should describe the "porcelain API" or the
> "plumbing API" (see the .rst below).
>
> Please let me know if you have any questions, comments, or suggestions.

So this is very helpful, thanks.  I've been through it once, and have
some overall comments.

 - Parts of it read more like a requirements document.  A document going
   into the kernel should describe what the code actually does, not what
   we think it should be.

 - I would make a serious effort to get a handle on the terminology.
   The term "kernel thread" has a meaning other than the one you give
   it; saying "kernel thread" here will lead to a lot of confusion.  I
   hesitate to suggest terms because I'm terrible at naming (just ask my
   kids), but I would pick clear and concise names for your "server
   threads" and "worker threads", then stick to them.

 - The library documentation is good to have, but it will really be
   necessary to document the system calls as well.  *That* is the part
   that the kernel community will have to support forever if this is
   merged.

Thanks,

jon

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
                   ` (10 preceding siblings ...)
       [not found] ` <CAEWA0a72SvpcuN4ov=98T3uWtExPCr7BQePOgjkqD1ofWKEASw@mail.gmail.com>
@ 2021-06-09 12:54 ` Peter Zijlstra
  2021-06-09 20:18   ` Peter Oskolkov
  11 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-09 12:54 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, linux-api,
	Paul Turner, Ben Segall, Peter Oskolkov, Joel Fernandes,
	Andrew Morton, Andrei Vagin, Jim Newsome


Quoting random parts of the first few patches folded.

You present an API without explaining, *at*all*, how it's supposed to be
used and I can't seem to figure it out from the implementation either :/

> Index: linux-2.6/arch/x86/entry/syscalls/syscall_64.tbl
> ===================================================================
> --- linux-2.6.orig/arch/x86/entry/syscalls/syscall_64.tbl
> +++ linux-2.6/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,17 @@
>  444	common	landlock_create_ruleset	sys_landlock_create_ruleset
>  445	common	landlock_add_rule	sys_landlock_add_rule
>  446	common	landlock_restrict_self	sys_landlock_restrict_self

> +447	common	umcg_api_version	sys_umcg_api_version
> +448	common	umcg_register_task	sys_umcg_register_task
> +449	common	umcg_unregister_task	sys_umcg_unregister_task

I think we can do away with the api_version thing and frob that in
register. Also, do we really need unregister over just letting a task
exit? Is there a sane use-case where task goes in and out of service?

> +450	common	umcg_wait		sys_umcg_wait
> +451	common	umcg_wake		sys_umcg_wake

Right, except I'm confused by the proposed implementation. I thought the
whole point was to let UMCG tasks block in kernel, at which point we'd
change their state to BLOCKED and have userspace select another task to
run. Such BLOCKED tasks would then also be captured before they return
to userspace, i.e. the whole admission scheduler thing.

I don't see any of that in these patches. So what are they actually
implementing? I can't find enough clues to tell :-(

> +452	common	umcg_swap		sys_umcg_swap

You're presenting it like a pure optimization, but IIRC this is what
enables us to frob the scheduler state to ensure the whole thing is seen
(to the rest of the system) as the M server tasks, instead of the
constellation of N+M worker and server tasks.

Also, you're not doing any of the frobbing required.

> +453	common	umcg_create_group	sys_umcg_create_group
> +454	common	umcg_destroy_group	sys_umcg_destroy_group

This is basically needed for cross-server things, right? What we in the
kernel would call SMP. Some thoughts on that below.

> +455	common	umcg_poll_worker	sys_umcg_poll_worker

Shouldn't this be called idle or something, instead of poll, the whole
point of having this syscall is to that you can indeed go idle.
Userspace can implement polling just fine without help:

	for (;;) {
		struct umcg_task *runnable = xchg(me->umcg_runnable_ptr, NULL);
		if (runnable) {
			// put them on a list and run one
		}
		cpu_relax();
	}

comes to mind (see below).

> +456	common	umcg_run_worker		sys_umcg_run_worker

This I'm confused about again.. there is no fundamental difference
between a worker or server, they're all the same. 

> +457	common	umcg_preempt_worker	sys_umcg_preempt_worker

And that's magic, we'll get to it..

> Index: linux-2.6/include/uapi/linux/umcg.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/include/uapi/linux/umcg.h
> @@ -0,0 +1,70 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/limits.h>
> +#include <linux/types.h>
> +
> +/*
> + * UMCG task states, the first 8 bits.

All that needs a state transition diagram included

> + */
> +#define UMCG_TASK_NONE			0
> +/* UMCG server states. */
> +#define UMCG_TASK_POLLING		1
> +#define UMCG_TASK_SERVING		2
> +#define UMCG_TASK_PROCESSING		3

I get POLLING, although per the above, this probably wants to be IDLE.

What are the other two again? That is, along with the diagram, each
state wants a description.

> +/* UMCG worker states. */
> +#define UMCG_TASK_RUNNABLE		4
> +#define UMCG_TASK_RUNNING		5
> +#define UMCG_TASK_BLOCKED		6
> +#define UMCG_TASK_UNBLOCKED		7

Weird order, also I can't remember why we need the UNBLOCKED, isn't that
the same as the RUNNABLE, or did we want to distinguish the state were
we're no longer BLOCKED but the user scheduler hasn't yet put us on it's
ready queue (IOW, we're on the runnable_ptr list, see below).

> +
> +/* UMCG task state flags, bits 8-15 */
> +#define UMCG_TF_WAKEUP_QUEUED		(1 << 8)
> +
> +/*
> + * Unused at the moment flags reserved for features to be introduced
> + * in the near future.
> + */
> +#define UMCG_TF_PREEMPT_DISABLED	(1 << 9)
> +#define UMCG_TF_PREEMPTED		(1 << 10)
> +
> +#define UMCG_NOID	UINT_MAX
> +
> +/**
> + * struct umcg_task - controls the state of UMCG-enabled tasks.
> + *
> + * While at the moment only one field is present (@state), in future
> + * versions additional fields will be added, e.g. for the userspace to
> + * provide performance-improving hints and for the kernel to export sched
> + * stats.
> + *
> + * The struct is aligned at 32 bytes to ensure that even with future additions
> + * it fits into a single cache line.
> + */
> +struct umcg_task {
> +	/**
> +	 * @state: the current state of the UMCG task described by this struct.
> +	 *
> +	 * UMCG task state:
> +	 *   bits  0 -  7: task state;
> +	 *   bits  8 - 15: state flags;
> +	 *   bits 16 - 23: reserved; must be zeroes;
> +	 *   bits 24 - 31: for userspace use.
> +	 */
> +	uint32_t	state;
> +} __attribute((packed, aligned(4 * sizeof(uint64_t))));

So last time I really looked at this it looked something like this:

struct umcg_task {
        u32     umcg_status;            /* r/w */
        u32     umcg_server_tid;        /* r   */
        u32     umcg_next_tid;          /* r   */
        u32     __hole__;
        u64     umcg_blocked_ptr;       /*   w */
        u64     umcg_runnable_ptr;      /*   w */
};

(where r/w is from the kernel's pov)
(also see uapi/linux/rseq.h's ptr magic)

So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
path to capture these tasks blocking. The umcg_sleeping() hook added
there would:

    put_user(BLOCKED, umcg_task->umcg_status);

    tid = get_user(umcg_task->next_tid);
    if (!tid)
	tid = get_user(umcg_task->umcg_server_tid);
    umcg_server = find_task(tid);

    /* append to blocked list */
    umcg_task->umcg_blocked_ptr = umcg_server->umcg_blocked_ptr;
    umcg_server->umcg_blocked_ptr = umcg_task;

    // with some user_cmpxchg() sprinkled on to make it an atomic single
    // linked list, we can borrow from futex_atomic_cmpxchg_inatomic().

    /* capture return to user */
    add_task_work(current, &current->umcg->task_work, TWA_RESUME);

    umcg_server->state = RUNNING;
    wake_up_process(umcg_server);

That task_work would, as the comment says, capture the return to user,
and do something like:

    put_user(RUNNABLE, umcg_task->umcg_status);

    tid = get_user(umcg_task->umcg_server_tid);
    umcg_server = find_task(tid);

    /* append to runable list */
    umcg_task->umcg_runnable_ptr = umcg_server->umcg_runnable_ptr;
    umcg_server->umcg_runnable_ptr = umcg_task;
    // same as above, this wants some user cmpxchg

    umcg_wait();

And for that we had something like:

void umcg_wait(void)
{
	u32 state;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (get_user(state, current->umcg->state))
			break;
		if (state == UMCG_RUNNING)
			break;
		if (signal_pending(current))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
}

Which would wait until the userspace admission logic lets us rip by
setting state to RUNNING and prodding us with a sharp stick.


This all ensures that when a UMCG task goes to sleep, we mark ourselves
BLOCKED, we add ourselves to a user visible blocked list and wake the
owner of that blocked list.

We can either pre-select some task to run after us (next_tid) or it'll
pick the dedicated server task we're assigned to (server_tid).

Any time a task wakes up, it needs to check the blocked list and update
userspace ready queues and the sort, after which it can either run
things if it's a worker or pick another task to run if that's its work
(a server isn't special in this regard).

This was the absolute bare minimum, and I'm not seeing any of that here.
Nor an explanation of what there actually is :/


On top of this there's 'fun' questions about signals, ptrace and
umcg_preemption to be answered.

I think we want to allow signals to happen to UMCG RUNNABLE tasks, but
have them resume umcg_wait() on sigreturn.

I've not re-read the discussion with tglx on ptrace, he had some cute
corner cases IIRC.

The whole preemption thing should be doable with a task_work. Basically
check if the victim is RUNNING, send it TWA_SIGNAL to handle the task
work, the task_work would attempt a RUNNING->RUNNABLE (cmpxchg)
transition, success thereof needs to be propagated back to the syscall
and returned.

Adding preemption also means you have to deal with appending to
runnable_ptr list when the server isn't reaily available (most times).


Now on to those group things; they would basically replace the above
server_tid with a group/list of related server tasks, right? So why not
do so, litearlly:

struct umcg_task {
        u32     umcg_status;            /* r/w */
        u32     umcg_next_tid;          /* r   */
        u64     umcg_server_ptr;        /* r   */
        u64     umcg_blocked_ptr;       /*   w */
        u64     umcg_runnable_ptr;      /*   w */
};

Then have the kernel iterate the umcg_server_ptr list, looking for an
available (RUNNING or IDLE) server, also see the preemption point above.

This does, however, require a umcg_task to pid translation, which we've
so far avoided :/ OTOH it makes that grouping crud a user problem and we
can make the syscalls go away (and I that CRUI would like this better
too).

> +static int do_context_switch(struct task_struct *next)
> +{
> +	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +	/*
> +	 * It is important to set_current_state(TASK_INTERRUPTIBLE) before
> +	 * waking @next, as @next may immediately try to wake current back
> +	 * (e.g. current is a server, @next is a worker that immediately
> +	 * blocks or waits), and this next wakeup must not be lost.
> +	 */
> +	set_current_state(TASK_INTERRUPTIBLE);
> +
> +	WRITE_ONCE(utd->in_wait, true);
> +
> +	if (!try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU))
> +		return -EAGAIN;
> +
> +	freezable_schedule();
> +
> +	WRITE_ONCE(utd->in_wait, false);
> +
> +	if (signal_pending(current))
> +		return -EINTR;
> +
> +	return 0;
> +}
> +
> +static int do_wait(void)
> +{
> +	struct umcg_task_data *utd = rcu_access_pointer(current->umcg_task_data);
> +
> +	if (!utd)
> +		return -EINVAL;
> +
> +	WRITE_ONCE(utd->in_wait, true);
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	freezable_schedule();
> +
> +	WRITE_ONCE(utd->in_wait, false);
> +
> +	if (signal_pending(current))
> +		return -EINTR;
> +
> +	return 0;
> +}

Both these are fundamentally buggered for not having a loop.

> +/**
> + * sys_umcg_wait - block the current task (if all condtions are met).
> + * @flags:         Reserved.
> + * @timeout:       The absolute timeout of the wait. Not supported yet.
> + *                 Must be NULL.
> + *
> + * Sleep until woken, interrupted, or @timeout expires.
> + *
> + * Return:
> + * 0           - Ok;
> + * -EFAULT     - failed to read struct umcg_task assigned to this task
> + *               via sys_umcg_register();
> + * -EAGAIN     - try again;
> + * -EINTR      - signal pending;
> + * -EOPNOTSUPP - @timeout != NULL (not supported yet).
> + * -EINVAL     - a parameter or a member of struct umcg_task has a wrong value.
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags,
> +		const struct __kernel_timespec __user *, timeout)

I despise timespec, tglx?

> +{
> +	struct umcg_task_data *utd;
> +
> +	if (flags)
> +		return -EINVAL;
> +	if (timeout)
> +		return -EOPNOTSUPP;
> +
> +	rcu_read_lock();
> +	utd = rcu_dereference(current->umcg_task_data);
> +	if (!utd) {
> +		rcu_read_unlock();
> +		return -EINVAL;
> +	}
> +
> +	rcu_read_unlock();

What Jann said.

> +
> +	return do_wait();
> +}
> +
> +/**
> + * sys_umcg_wake - wake @next_tid task blocked in sys_umcg_wait.
> + * @flags:         Reserved.
> + * @next_tid:      The ID of the task to wake.
> + *
> + * Wake @next identified by @next_tid. @next must be either a UMCG core
> + * task or a UMCG worker task.
> + *
> + * Return:
> + * 0           - Ok;
> + * -EFAULT     - failed to read struct umcg_task assigned to next;
> + * -ESRCH      - @next_tid did not identify a task;
> + * -EAGAIN     - try again;
> + * -EINVAL     - a parameter or a member of next->umcg_task has a wrong value.
> + */
> +SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
> +{
> +	struct umcg_task_data *next_utd;
> +	struct task_struct *next;
> +	int ret = -EINVAL;
> +
> +	if (!next_tid)
> +		return -EINVAL;
> +	if (flags)
> +		return -EINVAL;
> +
> +	next = find_get_task_by_vpid(next_tid);
> +	if (!next)
> +		return -ESRCH;
> +
> +	rcu_read_lock();
> +	next_utd = rcu_dereference(next->umcg_task_data);
> +	if (!next_utd)
> +		goto out;
> +
> +	if (!READ_ONCE(next_utd->in_wait)) {
> +		ret = -EAGAIN;
> +		goto out;
> +	}

I'm thining this might want to be a user cmpxchg from RUNNABLE->RUNNING.

You need to deal with concurrent wakeups.

> +
> +	ret = wake_up_process(next);
> +	put_task_struct(next);
> +	if (ret)
> +		ret = 0;
> +	else
> +		ret = -EAGAIN;
> +
> +out:
> +	rcu_read_unlock();
> +	return ret;
> +}




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API
  2021-05-21 21:33   ` Jann Horn
@ 2021-06-09 13:01     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-09 13:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, kernel list,
	Linux API, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Fri, May 21, 2021 at 11:33:14PM +0200, Jann Horn wrote:
> >  SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
> >  {
> > -       return -ENOSYS;
> > +       struct umcg_task_data *next_utd;
> > +       struct task_struct *next;
> > +       int ret = -EINVAL;
> > +
> > +       if (!next_tid)
> > +               return -EINVAL;
> > +       if (flags)
> > +               return -EINVAL;
> > +
> > +       next = find_get_task_by_vpid(next_tid);
> > +       if (!next)
> > +               return -ESRCH;
> > +       rcu_read_lock();
> 
> Wouldn't it be more efficient to replace the last 4 lines with the following?
> 
> rcu_read_lock();
> next = find_task_by_vpid(next_tid);
> if (!next) {
>   err = -ESRCH;
>   goto out;
> }

This wakeup crud needs to modify the umcg->state, which is a user
variable. That can't be done under RCU. Weirdly the proposed code
doesn't actually do any of that for undocumented raisins :/

> Then you don't need to use refcounting here...
> 
> > +       next_utd = rcu_dereference(next->umcg_task_data);
> > +       if (!next_utd)
> > +               goto out;
> > +
> > +       if (!READ_ONCE(next_utd->in_wait)) {
> > +               ret = -EAGAIN;
> > +               goto out;
> > +       }
> > +
> > +       ret = wake_up_process(next);
> > +       put_task_struct(next);
> 
> ... and you'd be able to drop this put_task_struct(), too.
> 
> > +       if (ret)
> > +               ret = 0;
> > +       else
> > +               ret = -EAGAIN;
> > +
> > +out:
> > +       rcu_read_unlock();
> > +       return ret;
> >  }
> >
> >  /**
> > @@ -139,5 +325,44 @@ SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
> >  SYSCALL_DEFINE4(umcg_swap, u32, wake_flags, u32, next_tid, u32, wait_flags,
> >                 const struct __kernel_timespec __user *, timeout)
> >  {
> > -       return -ENOSYS;
> > +       struct umcg_task_data *curr_utd;
> > +       struct umcg_task_data *next_utd;
> > +       struct task_struct *next;
> > +       int ret = -EINVAL;
> > +
> > +       rcu_read_lock();
> > +       curr_utd = rcu_dereference(current->umcg_task_data);
> > +
> > +       if (!next_tid || wake_flags || wait_flags || !curr_utd)
> > +               goto out;
> > +
> > +       if (timeout) {
> > +               ret = -EOPNOTSUPP;
> > +               goto out;
> > +       }
> > +
> > +       next = find_get_task_by_vpid(next_tid);
> > +       if (!next) {
> > +               ret = -ESRCH;
> > +               goto out;
> > +       }
> 
> There isn't any type of access check here, right? Any task can wake up
> any other task? That feels a bit weird to me - and if you want to keep
> it as-is, it should probably at least be documented that any task on
> the system can send you spurious wakeups if you opt in to umcg.

You can only send wakeups to other UMCG thingies, per the
next->umcg_task_data check below. That said..

> In contrast, shared futexes can avoid this because they get their
> access control implicitly from the VMA.

Every task must expect spurious wakups at all times, always (for
TASK_NORMAL wakeups that is). There's plenty ways to generate them.

> > +       next_utd = rcu_dereference(next->umcg_task_data);
> > +       if (!next_utd) {
> > +               ret = -EINVAL;
> > +               goto out;
> > +       }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-06-09 12:54 ` Peter Zijlstra
@ 2021-06-09 20:18   ` Peter Oskolkov
  2021-06-10 18:02     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Oskolkov @ 2021-06-09 20:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Wed, Jun 9, 2021 at 5:55 AM Peter Zijlstra <peterz@infradead.org> wrote:

Finally, a high-level review - thanks a lot, Peter! My comments below,
and two high-level "important questions" at the end of my reply (with
some less important questions here and there).

[...]

> You present an API without explaining, *at*all*, how it's supposed to be
> used and I can't seem to figure it out from the implementation either :/

I tried to explain it in the doc patch that I followed up with:
https://lore.kernel.org/patchwork/cover/1433967/#1632328

Or do you mean it more narrowly, i.e. I do not explain syscalls in
detail? This assessment I agree with - my approach was/is to finalize
the userpace API (libumcg) first, and make the userspace vs kernel
decisions later.

For example, you wonder why there is no looping in umcg_wait
(do_wait). This is because the looping happens in the userspace in
libumcg. My overall approach was to make the syscalls as simple as
possible and push extra logic to the userspace.

It seems that this approach is not resonating with kernel
developers/maintainers - you are the third person asking why there is
no looping in sys_umcg_wait, despite the fact that I explicitly
mentioned pushing it out to the userspace.

Let me try to make my case once more here.

umcg_wait/umcg_wake: the RUNNABLE/RUNNING state changes, checks, and
looping happen in the userspace (libumcg - see umcg_wait/umcg_wake in
patch 5 here: https://lore.kernel.org/patchwork/patch/1433971/), while
the syscalls simply sleep/wake. I find doing it in the userspace is
much simpler and easier than in the kernel, as state reads and writes
are just atomic memory accesses; in the kernel it becomes much more
difficult - rcu locked sections, tasks locked, etc.

On the other hand I agree that having syscalls more logically
complete, in the sense that they do not require much hand-holding and
retries from the userspace, is probably better from the API design
perspective. My worry here is that state validation and retries in the
userspace are unavoidable, and so going the usual way we will end up
with retry loops both in the kernel and in the userspace.

So I pose this IMPORTANT QUESTION #1 to you that I hope to get a clear
answer to: it is strongly preferable to have syscalls be "logically
complete" in the sense that they retry things internally, and in
generally try to cover all possible corner cases; or, alternatively,
is it OK to make syscalls lightweight but "logically incomplete", and
have the accompanied userspace wrappers do all of the heavy lifting
re: state changes/validation, retries, etc.?

I see two additional benefits of thin/lightweight syscalls:
- reading userspace state is needed much less often (e.g. my umcg_wait
and umcg_wake syscalls do not access userspace data at all - also see
my "second important question" below)
- looping in the kernel, combined with reading/writing to userspace
memory, can easily lead to spinning in the kernel (e.g. trying to
atomically change a variable and looping until succeeding)

A clear answer one way or the other will help a lot!

[...]

> > +448  common  umcg_register_task      sys_umcg_register_task
> > +449  common  umcg_unregister_task    sys_umcg_unregister_task
>
> I think we can do away with the api_version thing and frob that in
> register.

Ok, will do.

> Also, do we really need unregister over just letting a task
> exit? Is there a sane use-case where task goes in and out of service?

I do not know of a specific use case here. On the other hand, I do not
know of a specific use case to unregister RSEQ, but the capability is
there. Maybe the assumption is that the userspace memory passed to the
kernel in register() may be freed before the task exits, and so there
should be a way to tell the kernel to no longer use it?

>
> > +450  common  umcg_wait               sys_umcg_wait
> > +451  common  umcg_wake               sys_umcg_wake
>
> Right, except I'm confused by the proposed implementation. I thought the
> whole point was to let UMCG tasks block in kernel, at which point we'd
> change their state to BLOCKED and have userspace select another task to
> run. Such BLOCKED tasks would then also be captured before they return
> to userspace, i.e. the whole admission scheduler thing.
>
> I don't see any of that in these patches. So what are they actually
> implementing? I can't find enough clues to tell :-(

As I mentioned above, state changes are done in libumcg in userspace
here: https://lore.kernel.org/patchwork/cover/1433967/#1632328

If you insist this logic should live in the kernel, I'll do it (grudgingly).

>
> > +452  common  umcg_swap               sys_umcg_swap
>
> You're presenting it like a pure optimization, but IIRC this is what
> enables us to frob the scheduler state to ensure the whole thing is seen
> (to the rest of the system) as the M server tasks, instead of the
> constellation of N+M worker and server tasks.

Yes, you recall it correctly.

> Also, you're not doing any of the frobbing required.

This is because I consider the frobbing a (very) nice to have rather
than a required feature, and so I am hoping to argue about how to
properly do it in later patchsets. This whole thing (UMCG) will be
extremely useful even without runtime accounting hacking and whatnot,
and so I hope to have everything else settled and tested and merged
before we spend another several weeks/months trying to make the
frobbing perfect.

>
> > +453  common  umcg_create_group       sys_umcg_create_group
> > +454  common  umcg_destroy_group      sys_umcg_destroy_group
>
> This is basically needed for cross-server things, right? What we in the
> kernel would call SMP. Some thoughts on that below.

Yes, right.

>
> > +455  common  umcg_poll_worker        sys_umcg_poll_worker
>
> Shouldn't this be called idle or something, instead of poll, the whole
> point of having this syscall is to that you can indeed go idle.

That's another way of looking at it. Yes, this means the server idles
until a worker becomes available. How would you call it? umcg_idle()?

> Userspace can implement polling just fine without help:
>
>         for (;;) {
>                 struct umcg_task *runnable = xchg(me->umcg_runnable_ptr, NULL);
>                 if (runnable) {
>                         // put them on a list and run one
>                 }
>                 cpu_relax();
>         }
>
> comes to mind (see below).
>
> > +456  common  umcg_run_worker         sys_umcg_run_worker
>
> This I'm confused about again.. there is no fundamental difference
> between a worker or server, they're all the same.

I don't see it this way. A server runs (on CPU) by itself and blocks
when there is a worker attached; a worker runs (on CPU) only when it
has a (blocked) server attached to it and, when the worker blocks, its
server detaches and runs another worker. So workers and servers are
the opposite of each other.

>
> > +457  common  umcg_preempt_worker     sys_umcg_preempt_worker
>
> And that's magic, we'll get to it..
>
> > Index: linux-2.6/include/uapi/linux/umcg.h
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/include/uapi/linux/umcg.h
> > @@ -0,0 +1,70 @@
> > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> > +#ifndef _UAPI_LINUX_UMCG_H
> > +#define _UAPI_LINUX_UMCG_H
> > +
> > +#include <linux/limits.h>
> > +#include <linux/types.h>
> > +
> > +/*
> > + * UMCG task states, the first 8 bits.
>
> All that needs a state transition diagram included

I will add it. For now the doc patch can be consulted:
https://lore.kernel.org/patchwork/cover/1433967/#1632328

>
> > + */
> > +#define UMCG_TASK_NONE                       0
> > +/* UMCG server states. */
> > +#define UMCG_TASK_POLLING            1
> > +#define UMCG_TASK_SERVING            2
> > +#define UMCG_TASK_PROCESSING         3
>
> I get POLLING, although per the above, this probably wants to be IDLE.

Ack.

>
> What are the other two again? That is, along with the diagram, each
> state wants a description.

SERVING: the server is blocked, its attached worker is running
PROCESSING: the server is running (= processing a block or wake
event), has no running worker attached

Both of these states are different from POLLING/IDLE and from each other.

>
> > +/* UMCG worker states. */
> > +#define UMCG_TASK_RUNNABLE           4
> > +#define UMCG_TASK_RUNNING            5
> > +#define UMCG_TASK_BLOCKED            6
> > +#define UMCG_TASK_UNBLOCKED          7
>
> Weird order, also I can't remember why we need the UNBLOCKED, isn't that
> the same as the RUNNABLE, or did we want to distinguish the state were
> we're no longer BLOCKED but the user scheduler hasn't yet put us on it's
> ready queue (IOW, we're on the runnable_ptr list, see below).

Yes, UNBLOCKED it a transitory state meaning the worker's blocking
operation has completed, but the wake event hasn't been delivered to
the userspace yet (and so the worker it not yet RUNNABLE)

[...]

> > +struct umcg_task {
> > +     /**
> > +      * @state: the current state of the UMCG task described by this struct.
> > +      *
> > +      * UMCG task state:
> > +      *   bits  0 -  7: task state;
> > +      *   bits  8 - 15: state flags;
> > +      *   bits 16 - 23: reserved; must be zeroes;
> > +      *   bits 24 - 31: for userspace use.
> > +      */
> > +     uint32_t        state;
> > +} __attribute((packed, aligned(4 * sizeof(uint64_t))));
>
> So last time I really looked at this it looked something like this:
>
> struct umcg_task {
>         u32     umcg_status;            /* r/w */
>         u32     umcg_server_tid;        /* r   */
>         u32     umcg_next_tid;          /* r   */
>         u32     __hole__;
>         u64     umcg_blocked_ptr;       /*   w */
>         u64     umcg_runnable_ptr;      /*   w */
> };
>
> (where r/w is from the kernel's pov)
> (also see uapi/linux/rseq.h's ptr magic)

I tried doing it this way, i.e. to only have only userspace struct
added (without kernel-only data), and I found it really cumbersome and
inconvenient and much slower than the proposed implementation. For
example, when a worker blocks, it seems working with "struct
task_struct *peer" to get to the worker's server is easy and
straightforward; reading server_tid from userspace, then looking up
the task and only then doing what is needed (change state and wakeup)
is ... unnecessary? Also validating things becomes really important
but difficult (what if the user put something weird in
umcg_server_tid? or the ptr fields?). In my proposed implementation
only the state is user-writable, and it does not really affect most of
the kernel-side work.

Why do you think everything should be in the userspace memory?

>
> So a PF_UMCG_WORKER would be added to sched_submit_work()'s PF_*_WORKER
> path to capture these tasks blocking. The umcg_sleeping() hook added
> there would:
>
>     put_user(BLOCKED, umcg_task->umcg_status);
>
>     tid = get_user(umcg_task->next_tid);
>     if (!tid)
>         tid = get_user(umcg_task->umcg_server_tid);
>     umcg_server = find_task(tid);
>
>     /* append to blocked list */
>     umcg_task->umcg_blocked_ptr = umcg_server->umcg_blocked_ptr;
>     umcg_server->umcg_blocked_ptr = umcg_task;
>
>     // with some user_cmpxchg() sprinkled on to make it an atomic single
>     // linked list, we can borrow from futex_atomic_cmpxchg_inatomic().
>
>     /* capture return to user */
>     add_task_work(current, &current->umcg->task_work, TWA_RESUME);
>
>     umcg_server->state = RUNNING;
>     wake_up_process(umcg_server);
>
> That task_work would, as the comment says, capture the return to user,
> and do something like:
>
>     put_user(RUNNABLE, umcg_task->umcg_status);
>
>     tid = get_user(umcg_task->umcg_server_tid);
>     umcg_server = find_task(tid);
>
>     /* append to runable list */
>     umcg_task->umcg_runnable_ptr = umcg_server->umcg_runnable_ptr;
>     umcg_server->umcg_runnable_ptr = umcg_task;
>     // same as above, this wants some user cmpxchg
>
>     umcg_wait();
>
> And for that we had something like:
>
> void umcg_wait(void)
> {
>         u32 state;
>
>         for (;;) {
>                 set_current_state(TASK_INTERRUPTIBLE);
>                 if (get_user(state, current->umcg->state))
>                         break;
>                 if (state == UMCG_RUNNING)
>                         break;
>                 if (signal_pending(current))
>                         break;
>                 schedule();
>         }
>         __set_current_state(TASK_RUNNING);
> }
>
> Which would wait until the userspace admission logic lets us rip by
> setting state to RUNNING and prodding us with a sharp stick.
>
>
> This all ensures that when a UMCG task goes to sleep, we mark ourselves
> BLOCKED, we add ourselves to a user visible blocked list and wake the
> owner of that blocked list.
>
> We can either pre-select some task to run after us (next_tid) or it'll
> pick the dedicated server task we're assigned to (server_tid).
>
> Any time a task wakes up, it needs to check the blocked list and update
> userspace ready queues and the sort, after which it can either run
> things if it's a worker or pick another task to run if that's its work
> (a server isn't special in this regard).
>
> This was the absolute bare minimum, and I'm not seeing any of that here.
> Nor an explanation of what there actually is :/
>
>
> On top of this there's 'fun' questions about signals, ptrace and
> umcg_preemption to be answered.
>
> I think we want to allow signals to happen to UMCG RUNNABLE tasks, but
> have them resume umcg_wait() on sigreturn.
>
> I've not re-read the discussion with tglx on ptrace, he had some cute
> corner cases IIRC.
>
> The whole preemption thing should be doable with a task_work. Basically
> check if the victim is RUNNING, send it TWA_SIGNAL to handle the task
> work, the task_work would attempt a RUNNING->RUNNABLE (cmpxchg)
> transition, success thereof needs to be propagated back to the syscall
> and returned.
>
> Adding preemption also means you have to deal with appending to
> runnable_ptr list when the server isn't reaily available (most times).
>
>
> Now on to those group things; they would basically replace the above
> server_tid with a group/list of related server tasks, right? So why not
> do so, litearlly:
>
> struct umcg_task {
>         u32     umcg_status;            /* r/w */
>         u32     umcg_next_tid;          /* r   */
>         u64     umcg_server_ptr;        /* r   */
>         u64     umcg_blocked_ptr;       /*   w */
>         u64     umcg_runnable_ptr;      /*   w */
> };
>
> Then have the kernel iterate the umcg_server_ptr list, looking for an
> available (RUNNING or IDLE) server, also see the preemption point above.
>
> This does, however, require a umcg_task to pid translation, which we've
> so far avoided :/ OTOH it makes that grouping crud a user problem and we
> can make the syscalls go away (and I that CRUI would like this better
> too).

All of the code above assumes userspace-only data. I did not look into
every detail of your suggestions because I want to make sure we first
agree on this: do we keep every bit of information in the userspace
(other than "struct umcg_task __user *" pointer in task_struct) or do
we have some kernel-only details as well?

So IMPORTANT QUESTION #2: why would we want to keep __everything__ in
the userspace memory? I understand that CRIU would like this, but
given that the implementation would at a minimum have to

1. read a umcg_server_ptr (points to the server's umcg_task)
2. get the server tid out of it (presumably by reading a field from
the server's umcg_task; what if the tid is wrong?)
3. do a tid lookup

to get a task_struct pointer, it will be slower; I am also not sure it
call be done safely at all: with kernel-side data and I can do rcu
locking, task locking, etc. to ensure that the value I got does not
change while I'm working with it; with userspace data, a lot of races
will have to be specially coded for that can be easily handled by
kernel-side rcu locks or spin locks... Maybe this is just my ignorance
showing, and indeed things can be done simply and easily with
userspace-only data, but I am not sure how.

A common example:

- worker W1 with server S1 calls umcg_wait()
- worker W2 with server S2 calls umcg_swap(W1)

If due to preemption and other concurrency weirdness the two syscalls
above race with each other, each trying to change the server assigned
to W1. I can easily handle the race by doing kernel-side locking;
without kernel-side locking (cannot do rcu locks and/or spin locks
while accessing userspace data) I am not sure how to handle the race.
Maybe it is possible with careful atomic writes to states and looping
to handle this specific race (what if the userspace antagonistically
writes to the same location? will it force the syscall to spin
indefinitely?); but with proper locking many potential races can be
handled; with atomic ops and looping it is more difficult... Will we
have to add a lock to struct umcg_task? And acquire it from the kernel
side? And worry about spinning forever?

>
> > +static int do_context_switch(struct task_struct *next)
> > +{
[...]
> > +}
> > +
> > +static int do_wait(void)
> > +{
[...]
> > +}
>
> Both these are fundamentally buggered for not having a loop.

As I mentioned above, the loop is in the userpace.

[...]
> > +SYSCALL_DEFINE2(umcg_wait, u32, flags,
> > +             const struct __kernel_timespec __user *, timeout)
>
> I despise timespec, tglx?

What are the alternatives? I just picked what the futex code uses.

[...]
> > +SYSCALL_DEFINE2(umcg_wake, u32, flags, u32, next_tid)
> > +{
[...]
>
> I'm thinking this might want to be a user cmpxchg from RUNNABLE->RUNNING.
>
> You need to deal with concurrent wakeups.

This is done in the userspace - much easier to do it there...

In summary, two IMPORTANT QUESTIONS:

1. thin vs fat syscalls: can we push some code/logic to the userspace
(state changes, looping/retries), or do we insist on syscalls handling
everything? Please have in mind that even if we choose the second
approach (fat syscalls), the userspace will most likely still have to
do everything it does under the first option just to handle
signals/interrupts (i.e. unscheduled wakeups);
2. kernel-side data vs userspace-only: can we avoid having kernel-side
data? More specifically, what alternatives to rcu_read_lock and/or
task_lock are available when working with userspace data?

When these two questions are answered to everybody's satisfaction, we
can discuss this patchset/library/API in more detail.

Thanks,
Peter

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-06-09 20:18   ` Peter Oskolkov
@ 2021-06-10 18:02     ` Peter Zijlstra
  2021-06-10 20:06       ` Peter Oskolkov
  2021-07-07 17:45       ` Thierry Delisle
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2021-06-10 18:02 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Wed, Jun 09, 2021 at 01:18:59PM -0700, Peter Oskolkov wrote:
> On Wed, Jun 9, 2021 at 5:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> 
> Finally, a high-level review - thanks a lot, Peter! My comments below,
> and two high-level "important questions" at the end of my reply (with
> some less important questions here and there).
> 
> [...]
> 
> > You present an API without explaining, *at*all*, how it's supposed to be
> > used and I can't seem to figure it out from the implementation either :/
> 
> I tried to explain it in the doc patch that I followed up with:
> https://lore.kernel.org/patchwork/cover/1433967/#1632328

Urgh, you write RST :-( That sorta helps, but I'm still unclear on a
number of things, more below.

> Or do you mean it more narrowly, i.e. I do not explain syscalls in
> detail? This assessment I agree with - my approach was/is to finalize
> the userpace API (libumcg) first, and make the userspace vs kernel
> decisions later.

Yeah, I couldn't figure out how to use the syscalls and thus how to
interpret their implementation. A little more in the way of comments
would've been helpful.

> For example, you wonder why there is no looping in umcg_wait
> (do_wait). This is because the looping happens in the userspace in
> libumcg. My overall approach was to make the syscalls as simple as
> possible and push extra logic to the userspace.

So a simple comment on the syscall that says:

  Userspace is expected to do:

	do {
		sys_umcg_wait();
	} while (smp_load_acquire(&umcg_task->state) != RUNNING);

would've made all the difference. It provides context.

> It seems that this approach is not resonating with kernel
> developers/maintainers - you are the third person asking why there is
> no looping in sys_umcg_wait, despite the fact that I explicitly
> mentioned pushing it out to the userspace.

We've been trained, through years of 'funny' bugs, to go 'BUG BUG BUG'
when schedule() is not in a loop. And pushing the loop to userspace has
me all on edge for being 'weird'.

> Let me try to make my case once more here.
> 
> umcg_wait/umcg_wake: the RUNNABLE/RUNNING state changes, checks, and
> looping happen in the userspace (libumcg - see umcg_wait/umcg_wake in
> patch 5 here: https://lore.kernel.org/patchwork/patch/1433971/), while
> the syscalls simply sleep/wake. I find doing it in the userspace is
> much simpler and easier than in the kernel, as state reads and writes
> are just atomic memory accesses; in the kernel it becomes much more
> difficult - rcu locked sections, tasks locked, etc.

Small difficulties as far as things go I think. The worst part is having
to do arch asm for the userspace cmpxchg. Luckily we can crib/share with
futex there.

> On the other hand I agree that having syscalls more logically
> complete, in the sense that they do not require much hand-holding and
> retries from the userspace, is probably better from the API design
> perspective. My worry here is that state validation and retries in the
> userspace are unavoidable, and so going the usual way we will end up
> with retry loops both in the kernel and in the userspace.

Can you expand on where you'd see the need for userspace to retry?

The canonical case in my mind is where a task, that's been BLOCKED in
kernelspace transitions to UNBLOCK/RUNNABLE in return-to-user and waits
for RUNNING.

Once it gets RUNNING, userspace can assume it can just go. It will never
have to re-check, because there's no way RUNNING can go away again. The
only way for RUNNING to become anything else, is setting it yourself
and/or doing a syscall.

Also, by having the BLOCKED thing block properly in return-to-user, you
don't have to wrap *all* the userspace syscall invocations. If you let
it return early, you get to wrap syscalls, which is both fragile and
bad for performance.

> So I pose this IMPORTANT QUESTION #1 to you that I hope to get a clear
> answer to: it is strongly preferable to have syscalls be "logically
> complete" in the sense that they retry things internally, and in
> generally try to cover all possible corner cases; or, alternatively,
> is it OK to make syscalls lightweight but "logically incomplete", and
> have the accompanied userspace wrappers do all of the heavy lifting
> re: state changes/validation, retries, etc.?

Intuitively I'd go with complete. I'd have never even considered the
incomplete option. But let me try and get my head around the incomplete
cases.

Oooh, I found the BLOCKED stuff, you hid it inside the grouping patch,
that makes no sense :-( Reason I'm looking is that I don't see how you
get around the blocked and runnable lists. You have to tell userspace
about them.

FWIW: I think you placed umcg_on_block() wrong, it needs to be before
the terrible PI thing. Also, like said, please avoid yet another branch
here by using PF_UMCG_WORKER.

> I see two additional benefits of thin/lightweight syscalls:
> - reading userspace state is needed much less often (e.g. my umcg_wait
> and umcg_wake syscalls do not access userspace data at all - also see
> my "second important question" below)

It is also broken I think, best I can make of it is somsething like
this:

    WAIT						WAKE

    if (smp_load_acquire(&state) == RUNNING)
	return;

							state = RUNNING;


    do {
      sys_umcg_wait()
      {
	in_wait = true;
							sys_umcg_wake()
							{
							  if (in_wait)
							    wake_up_process()
							}
	set_current_state(INTERRUPTIBLE);
	schedule();
	in_wait = false;
      }
    } while (smp_load_acquire(&state) != RUNNING);


missed wakeup, 'forever' stuck. You have to check your blocking
condition between setting state and scheduling. And if you do that, you
have a 'fat' syscall again.

> - looping in the kernel, combined with reading/writing to userspace
> memory, can easily lead to spinning in the kernel (e.g. trying to
> atomically change a variable and looping until succeeding)

I don't imagine spinning in kernel or userspace matters.

> > Also, do we really need unregister over just letting a task
> > exit? Is there a sane use-case where task goes in and out of service?
> 
> I do not know of a specific use case here. On the other hand, I do not
> know of a specific use case to unregister RSEQ, but the capability is
> there. Maybe the assumption is that the userspace memory passed to the
> kernel in register() may be freed before the task exits, and so there
> should be a way to tell the kernel to no longer use it?

Fair enough I suppose.

> >
> > > +450  common  umcg_wait               sys_umcg_wait
> > > +451  common  umcg_wake               sys_umcg_wake
> >
> > Right, except I'm confused by the proposed implementation. I thought the
> > whole point was to let UMCG tasks block in kernel, at which point we'd
> > change their state to BLOCKED and have userspace select another task to
> > run. Such BLOCKED tasks would then also be captured before they return
> > to userspace, i.e. the whole admission scheduler thing.
> >
> > I don't see any of that in these patches. So what are they actually
> > implementing? I can't find enough clues to tell :-(
> 
> As I mentioned above, state changes are done in libumcg in userspace
> here: https://lore.kernel.org/patchwork/cover/1433967/#1632328
> 
> If you insist this logic should live in the kernel, I'll do it (grudgingly).

So you have some of it, I just didn't find it because it's hidding in
that grouping thing.

> > > +452  common  umcg_swap               sys_umcg_swap
> >
> > You're presenting it like a pure optimization, but IIRC this is what
> > enables us to frob the scheduler state to ensure the whole thing is seen
> > (to the rest of the system) as the M server tasks, instead of the
> > constellation of N+M worker and server tasks.
> 
> Yes, you recall it correctly.
> 
> > Also, you're not doing any of the frobbing required.
> 
> This is because I consider the frobbing a (very) nice to have rather
> than a required feature, and so I am hoping to argue about how to
> properly do it in later patchsets. This whole thing (UMCG) will be
> extremely useful even without runtime accounting hacking and whatnot,
> and so I hope to have everything else settled and tested and merged
> before we spend another several weeks/months trying to make the
> frobbing perfect.

Sure, not saying you need the frobbing from the get-go, but it's a much
stronger argument for having the API in the first place. So mentioning
this property (along with a TODO) is a stronger justification.

This goes to *why again. It's fairly easy to see what from the code, but
code rarely explains why.

That said; if we do: @next_pid, we might be able to do away with this. A
!RUNING transition will attempt to wake-and-switch to @next_tid. This is
BLOCKED from syscall or explicit using umcg_wait().

> > > +455  common  umcg_poll_worker        sys_umcg_poll_worker
> >
> > Shouldn't this be called idle or something, instead of poll, the whole
> > point of having this syscall is to that you can indeed go idle.
> 
> That's another way of looking at it. Yes, this means the server idles
> until a worker becomes available. How would you call it? umcg_idle()?

I'm trying to digest the thing; it's doing *far* more than just idling,
but yes, sys_umcg_idle() or something.

> > This I'm confused about again.. there is no fundamental difference
> > between a worker or server, they're all the same.
> 
> I don't see it this way. A server runs (on CPU) by itself and blocks
> when there is a worker attached; a worker runs (on CPU) only when it
> has a (blocked) server attached to it and, when the worker blocks, its
> server detaches and runs another worker. So workers and servers are
> the opposite of each other.

So I was viewing the server more like the idle thread, its 'work' is
idle, which is always available.

> > > + */
> > > +#define UMCG_TASK_NONE                       0
> > > +/* UMCG server states. */
> > > +#define UMCG_TASK_POLLING            1
> > > +#define UMCG_TASK_SERVING            2
> > > +#define UMCG_TASK_PROCESSING         3
> >
> > I get POLLING, although per the above, this probably wants to be IDLE.
> 
> Ack.
> 
> >
> > What are the other two again? That is, along with the diagram, each
> > state wants a description.
> 
> SERVING: the server is blocked, its attached worker is running
> PROCESSING: the server is running (= processing a block or wake
> event), has no running worker attached
> 
> Both of these states are different from POLLING/IDLE and from each other.

But if we view the server as the worker with work 'idle', then serving
becomes RUNNABLE and PROCESSING becomes RUNNING, right?

And sys_run_worker(next); becomes:

	self->state = RUNNABLE;
	self->next_tid = next;
	sys_umcg_wait();

The question is if we need an explicit IDLE state along with calling
sys_umcg_idle(). I can't seem to make up my mind on that.

> > > +/* UMCG worker states. */
> > > +#define UMCG_TASK_RUNNABLE           4
> > > +#define UMCG_TASK_RUNNING            5
> > > +#define UMCG_TASK_BLOCKED            6
> > > +#define UMCG_TASK_UNBLOCKED          7
> >
> > Weird order, also I can't remember why we need the UNBLOCKED, isn't that
> > the same as the RUNNABLE, or did we want to distinguish the state were
> > we're no longer BLOCKED but the user scheduler hasn't yet put us on it's
> > ready queue (IOW, we're on the runnable_ptr list, see below).
> 
> Yes, UNBLOCKED it a transitory state meaning the worker's blocking
> operation has completed, but the wake event hasn't been delivered to
> the userspace yet (and so the worker it not yet RUNNABLE)

So if I understand the proposal correctly the only possible option is
something like:

	for (;;) {
		next = user_sched_pick();
		if (next) {
			sys_umcg_run(next);
			continue;
		}

		sys_umcg_poll(&next);
		if (next) {
			next->state = RUNNABLE;
			user_sched_enqueue(next);
		}
	}

This seems incapable of implementing generic scheduling policies and has
a hard-coded FIFO policy.

The poll() thing cannot differentiate between: 'find new task' and 'go
idle'. So you cannot keep running it until all new tasks are found.

But you basically get to do a syscall to discover every new task, while
the other proposal gets you a user visible list of new tasks, no
syscalls needed at all.

It's also not quite clear to me what you do about RUNNING->BLOCKED, how
does the userspace scheduler know to dequeue a task?


My proposal gets you something like:

	for (;;) {
		self->state = RUNNABLE;
		self->next_tid = 0; // next == self == server -> idle

		p = xchg(self->blocked_ptr, NULL);
		while (p) {
			n = new->blocked_ptr;
			user_sched_dequeue(p);
			p = n;
		}

		// Worker can have unblocked again before we got here,
		// hence we need to process blocked before runnable.
		// Worker cannot have blocked again, since we didn't
		// know it was runnable, hence it cannot have ran again.

		p = xchg(self->runnable_ptr, NULL);
		while (p) {
			n = new->runnable_ptr;
			user_sched_enqeue(p);
			p = n;
		}

		n = user_sched_pick();
		if (n)
			self->next_tid = n->tid;

		// new self->*_ptr state will have changed self->state
		// to RUNNING and we'll not switch to ->next.

		sys_umcg_wait();

		// self->state == RUNNING
	}

This allows you to implement arbitrary policies and instantly works with
preemption once we implement that. Preemption would put the running
worker in RUNNABLE, mark the server RUNNING and switch.

Hmm, looking at it written out like that, we don't need sys_umcg_wake(),
sys_umcg_swap() at all.

Anyway, and this is how I got here, UNBLOCKED is not required because we
cannot run it before we've observed it RUNNABLE. Yes the state exists
where it's no longer BLOCKED, and it's not yet on the runqueue, but when
we don't know it's RUNNABLE we'll not pick it, so its moot.

> > So last time I really looked at this it looked something like this:
> >
> > struct umcg_task {
> >         u32     umcg_status;            /* r/w */
> >         u32     umcg_server_tid;        /* r   */
> >         u32     umcg_next_tid;          /* r   */
> >         u32     __hole__;
> >         u64     umcg_blocked_ptr;       /*   w */
> >         u64     umcg_runnable_ptr;      /*   w */
> > };
> >
> > (where r/w is from the kernel's pov)
> > (also see uapi/linux/rseq.h's ptr magic)
> 
> I tried doing it this way, i.e. to only have only userspace struct
> added (without kernel-only data), and I found it really cumbersome and
> inconvenient and much slower than the proposed implementation.

> For example, when a worker blocks, it seems working with "struct
> task_struct *peer" to get to the worker's server is easy and
> straightforward; reading server_tid from userspace, then looking up
> the task and only then doing what is needed (change state and wakeup)
> is ... unnecessary? 

Is find_task_by_vpid() really that slow? The advantage of having it in
userspace is that you can very easily change 'affinities' of the
workers. You can simply set ->server_tid and it goes elsewhere.

> Also validating things becomes really important
> but difficult (what if the user put something weird in
> umcg_server_tid? or the ptr fields?).

If find_task_by_vpid() returns NULL, we return -ESRCH. If the user
cmpxchg returns -EFAULT we pass along the message. If userspace put a
valid but crap pointer in it, userspace gets to keep the pieces.

> In my proposed implementation only the state is user-writable, and it
> does not really affect most of the kernel-side work.
> 
> Why do you think everything should be in the userspace memory?

Because then we avoid all the kernel state and userspace gets to have
all the state without endless syscalls.

Note that with the proposal, per the above, we're at:

enum {
	UMCG_STATE_RUNNING,
	UMCG_STATE_RUNABLE,
	UMCG_STATE_BLOCKED,
};

struct umcg_task {
        u32     umcg_status;            /* r/w */
        u32     umcg_server_tid;        /* r   */
        u32     umcg_next_tid;          /* r   */
        u32     umcg_tid;		/* r   */
        u64     umcg_blocked_ptr;       /*   w */
        u64     umcg_runnable_ptr;      /*   w */
};

/*
 * Register current's UMCG state.
 */
sys_umcg_register(struct umcg_task *self, unsigned int flags);

/*
 * Just 'cause.
 */
sys_umcg_unregister(struct umcg_task *self)

/*
 * UMCG context switch.
 */
sys_umcg_wait(u64 time, unsigned int flags)
{
	unsigned int state = RUNNABLE;
	unsigned int tid;

	if (self->state == RUNNING)
		return;

	tid = self->next_tid;
	if (!tid)
		tid = self->server_tid;

	if (tid == self->server_tid && tid == self->tid)
		return umcg_idle(time, flags);

	next = find_process_by_pid(tid);
	if (!next) {
		return -ESRCH;

	ret = user_try_cmpxchg(next->umcg->state, &state, RUNNING);
	if (!ret)
		ret = -EBUSY;
	if (ret < 0)
		return ret;

	return umcg_switch_to(next);
}

With this (and the BLOCKING bits outlined last time) we can implement
full N:1 userspace scheduling (UP).

( Note that so far we assume all UMCG workers share the same address
  space, otherwise the user_try_cmpxchg() doesn't work. )

And I _think_ you can do the whole SMP thing in userspace as well, just
have the servers share queue state and reassign ->server_tid where
needed. No additional syscalls required.

> All of the code above assumes userspace-only data. I did not look into
> every detail of your suggestions because I want to make sure we first
> agree on this: do we keep every bit of information in the userspace
> (other than "struct umcg_task __user *" pointer in task_struct) or do
> we have some kernel-only details as well?

Most of the kernel state you seem to have implemented seems to limit
flexibility / implement specific policy. All because apparently
find_task_by_vpid() is considered expensive?

You've enangled the whole BLOCKING stuff with the SMP stuff. And by
putting that state in the kernel you've limited flexibility.

Also, if you don't have kernel state it can't go out of sync and cause
problems.

> So IMPORTANT QUESTION #2: why would we want to keep __everything__ in
> the userspace memory? I understand that CRIU would like this, but
> given that the implementation would at a minimum have to
> 
> 1. read a umcg_server_ptr (points to the server's umcg_task)
> 2. get the server tid out of it (presumably by reading a field from
> the server's umcg_task; what if the tid is wrong?)
> 3. do a tid lookup

So if we leave SMP as an exercise in scheduling queue management, And
implement the above, then you need:

 - copy_from_user()/get_user() for the first 4 words
 - find_task_by_vpid()

that gets you a task pointer, then we get to update a blocked_ptr.

If anything goes wrong, simply return an error and let userspace sort it
out.

> to get a task_struct pointer, it will be slower; I am also not sure it
> call be done safely at all: with kernel-side data and I can do rcu
> locking, task locking, etc. to ensure that the value I got does not
> change while I'm working with it; with userspace data, a lot of races
> will have to be specially coded for that can be easily handled by
> kernel-side rcu locks or spin locks... Maybe this is just my ignorance
> showing, and indeed things can be done simply and easily with
> userspace-only data, but I am not sure how.
> 
> A common example:
> 
> - worker W1 with server S1 calls umcg_wait()
> - worker W2 with server S2 calls umcg_swap(W1)
> 
> If due to preemption and other concurrency weirdness the two syscalls
> above race with each other, each trying to change the server assigned
> to W1. I can easily handle the race by doing kernel-side locking;
> without kernel-side locking (cannot do rcu locks and/or spin locks
> while accessing userspace data) I am not sure how to handle the race.
> Maybe it is possible with careful atomic writes to states and looping
> to handle this specific race (what if the userspace antagonistically
> writes to the same location? will it force the syscall to spin
> indefinitely?); but with proper locking many potential races can be
> handled; with atomic ops and looping it is more difficult... Will we
> have to add a lock to struct umcg_task? And acquire it from the kernel
> side? And worry about spinning forever?

What would you want locking for? I really don't see a problem here.

Both blocked_ptr and runnable_ptr are cmpxchg single-linked-lists. Yes
they can spin a little, but that's not a new problem, futex has all
that.

And ->state only needs single cmpxchg ops, no loops, either we got the
wakeup, or we didn't. The rest is done with memory ordering:


	server				worker

	self->state = RUNNABLE;		self->state = BLOCKED;

	head = xchg(list, NULL)		add_to_list(self, &server->blocked_ptr);

					if (try_cmpxchg_user(&server->umcg->state, RUNNABLE, RUNNING) > 0)
	sys_umcg_wait()				wake_up_process(server);


Either server sees the add, or we see it's RUNNABLE and wake it up
(or both).

If anything on the BLOCKED side goes wrong (bad pointers, whatever),
have it segfault.

<edit> Ooh, I forgot you can't just go wake the server when it's running
something else... so that does indeed need more states/complication, the
ordering argument stands though. We'll need something like
self->current_tid or somesuch </edit>


> > > +SYSCALL_DEFINE2(umcg_wait, u32, flags,
> > > +             const struct __kernel_timespec __user *, timeout)
> >
> > I despise timespec, tglx?
> 
> What are the alternatives? I just picked what the futex code uses.

u64 nanoseconds. Not sure tglx really wants to do that though, but
still, timespec is a terrible thing.

> In summary, two IMPORTANT QUESTIONS:
> 
> 1. thin vs fat syscalls: can we push some code/logic to the userspace
> (state changes, looping/retries), or do we insist on syscalls handling
> everything? 

Well, the way I see it it's a trade of what is handled where. I get a
smaller API (although I'm sure I've forgotten something trivial again
that wrecks everything <edit> I did :/ </edit>) and userspace gets to
deal with all of SMP and scheduling policies.

You hard-coded a global-fifo and had a enormous number of syscalls and
needed to wrap every syscall invocation in order to fix up the return.

> Please have in mind that even if we choose the second
> approach (fat syscalls), the userspace will most likely still have to
> do everything it does under the first option just to handle
> signals/interrupts (i.e. unscheduled wakeups);

IIRC sigreturn goes back into the kernel and we can resume blocking
there.

> 2. kernel-side data vs userspace-only: can we avoid having kernel-side
> data? More specifically, what alternatives to rcu_read_lock and/or
> task_lock are available when working with userspace data?

What would you want locked and why?


Anyway, this email is far too long again (basically took me all day :/),
hope it helps a bit. Thomas is stuck fixing XSAVE disasters, but I'll
ask him to chime in once that's done.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-06-10 18:02     ` Peter Zijlstra
@ 2021-06-10 20:06       ` Peter Oskolkov
  2021-07-07 17:45       ` Thierry Delisle
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-06-10 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List,
	linux-api, Paul Turner, Ben Segall, Peter Oskolkov,
	Joel Fernandes, Andrew Morton, Andrei Vagin, Jim Newsome

On Thu, Jun 10, 2021 at 11:02 AM Peter Zijlstra <peterz@infradead.org> wrote:

Thanks a lot for the detailed reply!

I'll try again the data-in-userspace-only route (= everything in TLS):
if you are right and everything can be done without needing to lock
anything - great!

The last time I tried it I could not do it properly/safely, though,
because I could not fix races without rcu and/or spin locking stuff,
which was impossible with data in the userspace. I don't remember the
specifics now, though...

Thanks,
Peter

>
> On Wed, Jun 09, 2021 at 01:18:59PM -0700, Peter Oskolkov wrote:
> > On Wed, Jun 9, 2021 at 5:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Finally, a high-level review - thanks a lot, Peter! My comments below,
> > and two high-level "important questions" at the end of my reply (with
> > some less important questions here and there).
> >
> > [...]
> >
> > > You present an API without explaining, *at*all*, how it's supposed to be
> > > used and I can't seem to figure it out from the implementation either :/
> >
> > I tried to explain it in the doc patch that I followed up with:
> > https://lore.kernel.org/patchwork/cover/1433967/#1632328
>
> Urgh, you write RST :-( That sorta helps, but I'm still unclear on a
> number of things, more below.
>
> > Or do you mean it more narrowly, i.e. I do not explain syscalls in
> > detail? This assessment I agree with - my approach was/is to finalize
> > the userpace API (libumcg) first, and make the userspace vs kernel
> > decisions later.
>
> Yeah, I couldn't figure out how to use the syscalls and thus how to
> interpret their implementation. A little more in the way of comments
> would've been helpful.
>
> > For example, you wonder why there is no looping in umcg_wait
> > (do_wait). This is because the looping happens in the userspace in
> > libumcg. My overall approach was to make the syscalls as simple as
> > possible and push extra logic to the userspace.
>
> So a simple comment on the syscall that says:
>
>   Userspace is expected to do:
>
>         do {
>                 sys_umcg_wait();
>         } while (smp_load_acquire(&umcg_task->state) != RUNNING);
>
> would've made all the difference. It provides context.
>
> > It seems that this approach is not resonating with kernel
> > developers/maintainers - you are the third person asking why there is
> > no looping in sys_umcg_wait, despite the fact that I explicitly
> > mentioned pushing it out to the userspace.
>
> We've been trained, through years of 'funny' bugs, to go 'BUG BUG BUG'
> when schedule() is not in a loop. And pushing the loop to userspace has
> me all on edge for being 'weird'.
>
> > Let me try to make my case once more here.
> >
> > umcg_wait/umcg_wake: the RUNNABLE/RUNNING state changes, checks, and
> > looping happen in the userspace (libumcg - see umcg_wait/umcg_wake in
> > patch 5 here: https://lore.kernel.org/patchwork/patch/1433971/), while
> > the syscalls simply sleep/wake. I find doing it in the userspace is
> > much simpler and easier than in the kernel, as state reads and writes
> > are just atomic memory accesses; in the kernel it becomes much more
> > difficult - rcu locked sections, tasks locked, etc.
>
> Small difficulties as far as things go I think. The worst part is having
> to do arch asm for the userspace cmpxchg. Luckily we can crib/share with
> futex there.
>
> > On the other hand I agree that having syscalls more logically
> > complete, in the sense that they do not require much hand-holding and
> > retries from the userspace, is probably better from the API design
> > perspective. My worry here is that state validation and retries in the
> > userspace are unavoidable, and so going the usual way we will end up
> > with retry loops both in the kernel and in the userspace.
>
> Can you expand on where you'd see the need for userspace to retry?
>
> The canonical case in my mind is where a task, that's been BLOCKED in
> kernelspace transitions to UNBLOCK/RUNNABLE in return-to-user and waits
> for RUNNING.
>
> Once it gets RUNNING, userspace can assume it can just go. It will never
> have to re-check, because there's no way RUNNING can go away again. The
> only way for RUNNING to become anything else, is setting it yourself
> and/or doing a syscall.
>
> Also, by having the BLOCKED thing block properly in return-to-user, you
> don't have to wrap *all* the userspace syscall invocations. If you let
> it return early, you get to wrap syscalls, which is both fragile and
> bad for performance.
>
> > So I pose this IMPORTANT QUESTION #1 to you that I hope to get a clear
> > answer to: it is strongly preferable to have syscalls be "logically
> > complete" in the sense that they retry things internally, and in
> > generally try to cover all possible corner cases; or, alternatively,
> > is it OK to make syscalls lightweight but "logically incomplete", and
> > have the accompanied userspace wrappers do all of the heavy lifting
> > re: state changes/validation, retries, etc.?
>
> Intuitively I'd go with complete. I'd have never even considered the
> incomplete option. But let me try and get my head around the incomplete
> cases.
>
> Oooh, I found the BLOCKED stuff, you hid it inside the grouping patch,
> that makes no sense :-( Reason I'm looking is that I don't see how you
> get around the blocked and runnable lists. You have to tell userspace
> about them.
>
> FWIW: I think you placed umcg_on_block() wrong, it needs to be before
> the terrible PI thing. Also, like said, please avoid yet another branch
> here by using PF_UMCG_WORKER.
>
> > I see two additional benefits of thin/lightweight syscalls:
> > - reading userspace state is needed much less often (e.g. my umcg_wait
> > and umcg_wake syscalls do not access userspace data at all - also see
> > my "second important question" below)
>
> It is also broken I think, best I can make of it is somsething like
> this:
>
>     WAIT                                                WAKE
>
>     if (smp_load_acquire(&state) == RUNNING)
>         return;
>
>                                                         state = RUNNING;
>
>
>     do {
>       sys_umcg_wait()
>       {
>         in_wait = true;
>                                                         sys_umcg_wake()
>                                                         {
>                                                           if (in_wait)
>                                                             wake_up_process()
>                                                         }
>         set_current_state(INTERRUPTIBLE);
>         schedule();
>         in_wait = false;
>       }
>     } while (smp_load_acquire(&state) != RUNNING);
>
>
> missed wakeup, 'forever' stuck. You have to check your blocking
> condition between setting state and scheduling. And if you do that, you
> have a 'fat' syscall again.
>
> > - looping in the kernel, combined with reading/writing to userspace
> > memory, can easily lead to spinning in the kernel (e.g. trying to
> > atomically change a variable and looping until succeeding)
>
> I don't imagine spinning in kernel or userspace matters.
>
> > > Also, do we really need unregister over just letting a task
> > > exit? Is there a sane use-case where task goes in and out of service?
> >
> > I do not know of a specific use case here. On the other hand, I do not
> > know of a specific use case to unregister RSEQ, but the capability is
> > there. Maybe the assumption is that the userspace memory passed to the
> > kernel in register() may be freed before the task exits, and so there
> > should be a way to tell the kernel to no longer use it?
>
> Fair enough I suppose.
>
> > >
> > > > +450  common  umcg_wait               sys_umcg_wait
> > > > +451  common  umcg_wake               sys_umcg_wake
> > >
> > > Right, except I'm confused by the proposed implementation. I thought the
> > > whole point was to let UMCG tasks block in kernel, at which point we'd
> > > change their state to BLOCKED and have userspace select another task to
> > > run. Such BLOCKED tasks would then also be captured before they return
> > > to userspace, i.e. the whole admission scheduler thing.
> > >
> > > I don't see any of that in these patches. So what are they actually
> > > implementing? I can't find enough clues to tell :-(
> >
> > As I mentioned above, state changes are done in libumcg in userspace
> > here: https://lore.kernel.org/patchwork/cover/1433967/#1632328
> >
> > If you insist this logic should live in the kernel, I'll do it (grudgingly).
>
> So you have some of it, I just didn't find it because it's hidding in
> that grouping thing.
>
> > > > +452  common  umcg_swap               sys_umcg_swap
> > >
> > > You're presenting it like a pure optimization, but IIRC this is what
> > > enables us to frob the scheduler state to ensure the whole thing is seen
> > > (to the rest of the system) as the M server tasks, instead of the
> > > constellation of N+M worker and server tasks.
> >
> > Yes, you recall it correctly.
> >
> > > Also, you're not doing any of the frobbing required.
> >
> > This is because I consider the frobbing a (very) nice to have rather
> > than a required feature, and so I am hoping to argue about how to
> > properly do it in later patchsets. This whole thing (UMCG) will be
> > extremely useful even without runtime accounting hacking and whatnot,
> > and so I hope to have everything else settled and tested and merged
> > before we spend another several weeks/months trying to make the
> > frobbing perfect.
>
> Sure, not saying you need the frobbing from the get-go, but it's a much
> stronger argument for having the API in the first place. So mentioning
> this property (along with a TODO) is a stronger justification.
>
> This goes to *why again. It's fairly easy to see what from the code, but
> code rarely explains why.
>
> That said; if we do: @next_pid, we might be able to do away with this. A
> !RUNING transition will attempt to wake-and-switch to @next_tid. This is
> BLOCKED from syscall or explicit using umcg_wait().
>
> > > > +455  common  umcg_poll_worker        sys_umcg_poll_worker
> > >
> > > Shouldn't this be called idle or something, instead of poll, the whole
> > > point of having this syscall is to that you can indeed go idle.
> >
> > That's another way of looking at it. Yes, this means the server idles
> > until a worker becomes available. How would you call it? umcg_idle()?
>
> I'm trying to digest the thing; it's doing *far* more than just idling,
> but yes, sys_umcg_idle() or something.
>
> > > This I'm confused about again.. there is no fundamental difference
> > > between a worker or server, they're all the same.
> >
> > I don't see it this way. A server runs (on CPU) by itself and blocks
> > when there is a worker attached; a worker runs (on CPU) only when it
> > has a (blocked) server attached to it and, when the worker blocks, its
> > server detaches and runs another worker. So workers and servers are
> > the opposite of each other.
>
> So I was viewing the server more like the idle thread, its 'work' is
> idle, which is always available.
>
> > > > + */
> > > > +#define UMCG_TASK_NONE                       0
> > > > +/* UMCG server states. */
> > > > +#define UMCG_TASK_POLLING            1
> > > > +#define UMCG_TASK_SERVING            2
> > > > +#define UMCG_TASK_PROCESSING         3
> > >
> > > I get POLLING, although per the above, this probably wants to be IDLE.
> >
> > Ack.
> >
> > >
> > > What are the other two again? That is, along with the diagram, each
> > > state wants a description.
> >
> > SERVING: the server is blocked, its attached worker is running
> > PROCESSING: the server is running (= processing a block or wake
> > event), has no running worker attached
> >
> > Both of these states are different from POLLING/IDLE and from each other.
>
> But if we view the server as the worker with work 'idle', then serving
> becomes RUNNABLE and PROCESSING becomes RUNNING, right?
>
> And sys_run_worker(next); becomes:
>
>         self->state = RUNNABLE;
>         self->next_tid = next;
>         sys_umcg_wait();
>
> The question is if we need an explicit IDLE state along with calling
> sys_umcg_idle(). I can't seem to make up my mind on that.
>
> > > > +/* UMCG worker states. */
> > > > +#define UMCG_TASK_RUNNABLE           4
> > > > +#define UMCG_TASK_RUNNING            5
> > > > +#define UMCG_TASK_BLOCKED            6
> > > > +#define UMCG_TASK_UNBLOCKED          7
> > >
> > > Weird order, also I can't remember why we need the UNBLOCKED, isn't that
> > > the same as the RUNNABLE, or did we want to distinguish the state were
> > > we're no longer BLOCKED but the user scheduler hasn't yet put us on it's
> > > ready queue (IOW, we're on the runnable_ptr list, see below).
> >
> > Yes, UNBLOCKED it a transitory state meaning the worker's blocking
> > operation has completed, but the wake event hasn't been delivered to
> > the userspace yet (and so the worker it not yet RUNNABLE)
>
> So if I understand the proposal correctly the only possible option is
> something like:
>
>         for (;;) {
>                 next = user_sched_pick();
>                 if (next) {
>                         sys_umcg_run(next);
>                         continue;
>                 }
>
>                 sys_umcg_poll(&next);
>                 if (next) {
>                         next->state = RUNNABLE;
>                         user_sched_enqueue(next);
>                 }
>         }
>
> This seems incapable of implementing generic scheduling policies and has
> a hard-coded FIFO policy.
>
> The poll() thing cannot differentiate between: 'find new task' and 'go
> idle'. So you cannot keep running it until all new tasks are found.
>
> But you basically get to do a syscall to discover every new task, while
> the other proposal gets you a user visible list of new tasks, no
> syscalls needed at all.
>
> It's also not quite clear to me what you do about RUNNING->BLOCKED, how
> does the userspace scheduler know to dequeue a task?
>
>
> My proposal gets you something like:
>
>         for (;;) {
>                 self->state = RUNNABLE;
>                 self->next_tid = 0; // next == self == server -> idle
>
>                 p = xchg(self->blocked_ptr, NULL);
>                 while (p) {
>                         n = new->blocked_ptr;
>                         user_sched_dequeue(p);
>                         p = n;
>                 }
>
>                 // Worker can have unblocked again before we got here,
>                 // hence we need to process blocked before runnable.
>                 // Worker cannot have blocked again, since we didn't
>                 // know it was runnable, hence it cannot have ran again.
>
>                 p = xchg(self->runnable_ptr, NULL);
>                 while (p) {
>                         n = new->runnable_ptr;
>                         user_sched_enqeue(p);
>                         p = n;
>                 }
>
>                 n = user_sched_pick();
>                 if (n)
>                         self->next_tid = n->tid;
>
>                 // new self->*_ptr state will have changed self->state
>                 // to RUNNING and we'll not switch to ->next.
>
>                 sys_umcg_wait();
>
>                 // self->state == RUNNING
>         }
>
> This allows you to implement arbitrary policies and instantly works with
> preemption once we implement that. Preemption would put the running
> worker in RUNNABLE, mark the server RUNNING and switch.
>
> Hmm, looking at it written out like that, we don't need sys_umcg_wake(),
> sys_umcg_swap() at all.
>
> Anyway, and this is how I got here, UNBLOCKED is not required because we
> cannot run it before we've observed it RUNNABLE. Yes the state exists
> where it's no longer BLOCKED, and it's not yet on the runqueue, but when
> we don't know it's RUNNABLE we'll not pick it, so its moot.
>
> > > So last time I really looked at this it looked something like this:
> > >
> > > struct umcg_task {
> > >         u32     umcg_status;            /* r/w */
> > >         u32     umcg_server_tid;        /* r   */
> > >         u32     umcg_next_tid;          /* r   */
> > >         u32     __hole__;
> > >         u64     umcg_blocked_ptr;       /*   w */
> > >         u64     umcg_runnable_ptr;      /*   w */
> > > };
> > >
> > > (where r/w is from the kernel's pov)
> > > (also see uapi/linux/rseq.h's ptr magic)
> >
> > I tried doing it this way, i.e. to only have only userspace struct
> > added (without kernel-only data), and I found it really cumbersome and
> > inconvenient and much slower than the proposed implementation.
>
> > For example, when a worker blocks, it seems working with "struct
> > task_struct *peer" to get to the worker's server is easy and
> > straightforward; reading server_tid from userspace, then looking up
> > the task and only then doing what is needed (change state and wakeup)
> > is ... unnecessary?
>
> Is find_task_by_vpid() really that slow? The advantage of having it in
> userspace is that you can very easily change 'affinities' of the
> workers. You can simply set ->server_tid and it goes elsewhere.
>
> > Also validating things becomes really important
> > but difficult (what if the user put something weird in
> > umcg_server_tid? or the ptr fields?).
>
> If find_task_by_vpid() returns NULL, we return -ESRCH. If the user
> cmpxchg returns -EFAULT we pass along the message. If userspace put a
> valid but crap pointer in it, userspace gets to keep the pieces.
>
> > In my proposed implementation only the state is user-writable, and it
> > does not really affect most of the kernel-side work.
> >
> > Why do you think everything should be in the userspace memory?
>
> Because then we avoid all the kernel state and userspace gets to have
> all the state without endless syscalls.
>
> Note that with the proposal, per the above, we're at:
>
> enum {
>         UMCG_STATE_RUNNING,
>         UMCG_STATE_RUNABLE,
>         UMCG_STATE_BLOCKED,
> };
>
> struct umcg_task {
>         u32     umcg_status;            /* r/w */
>         u32     umcg_server_tid;        /* r   */
>         u32     umcg_next_tid;          /* r   */
>         u32     umcg_tid;               /* r   */
>         u64     umcg_blocked_ptr;       /*   w */
>         u64     umcg_runnable_ptr;      /*   w */
> };
>
> /*
>  * Register current's UMCG state.
>  */
> sys_umcg_register(struct umcg_task *self, unsigned int flags);
>
> /*
>  * Just 'cause.
>  */
> sys_umcg_unregister(struct umcg_task *self)
>
> /*
>  * UMCG context switch.
>  */
> sys_umcg_wait(u64 time, unsigned int flags)
> {
>         unsigned int state = RUNNABLE;
>         unsigned int tid;
>
>         if (self->state == RUNNING)
>                 return;
>
>         tid = self->next_tid;
>         if (!tid)
>                 tid = self->server_tid;
>
>         if (tid == self->server_tid && tid == self->tid)
>                 return umcg_idle(time, flags);
>
>         next = find_process_by_pid(tid);
>         if (!next) {
>                 return -ESRCH;
>
>         ret = user_try_cmpxchg(next->umcg->state, &state, RUNNING);
>         if (!ret)
>                 ret = -EBUSY;
>         if (ret < 0)
>                 return ret;
>
>         return umcg_switch_to(next);
> }
>
> With this (and the BLOCKING bits outlined last time) we can implement
> full N:1 userspace scheduling (UP).
>
> ( Note that so far we assume all UMCG workers share the same address
>   space, otherwise the user_try_cmpxchg() doesn't work. )
>
> And I _think_ you can do the whole SMP thing in userspace as well, just
> have the servers share queue state and reassign ->server_tid where
> needed. No additional syscalls required.
>
> > All of the code above assumes userspace-only data. I did not look into
> > every detail of your suggestions because I want to make sure we first
> > agree on this: do we keep every bit of information in the userspace
> > (other than "struct umcg_task __user *" pointer in task_struct) or do
> > we have some kernel-only details as well?
>
> Most of the kernel state you seem to have implemented seems to limit
> flexibility / implement specific policy. All because apparently
> find_task_by_vpid() is considered expensive?
>
> You've enangled the whole BLOCKING stuff with the SMP stuff. And by
> putting that state in the kernel you've limited flexibility.
>
> Also, if you don't have kernel state it can't go out of sync and cause
> problems.
>
> > So IMPORTANT QUESTION #2: why would we want to keep __everything__ in
> > the userspace memory? I understand that CRIU would like this, but
> > given that the implementation would at a minimum have to
> >
> > 1. read a umcg_server_ptr (points to the server's umcg_task)
> > 2. get the server tid out of it (presumably by reading a field from
> > the server's umcg_task; what if the tid is wrong?)
> > 3. do a tid lookup
>
> So if we leave SMP as an exercise in scheduling queue management, And
> implement the above, then you need:
>
>  - copy_from_user()/get_user() for the first 4 words
>  - find_task_by_vpid()
>
> that gets you a task pointer, then we get to update a blocked_ptr.
>
> If anything goes wrong, simply return an error and let userspace sort it
> out.
>
> > to get a task_struct pointer, it will be slower; I am also not sure it
> > call be done safely at all: with kernel-side data and I can do rcu
> > locking, task locking, etc. to ensure that the value I got does not
> > change while I'm working with it; with userspace data, a lot of races
> > will have to be specially coded for that can be easily handled by
> > kernel-side rcu locks or spin locks... Maybe this is just my ignorance
> > showing, and indeed things can be done simply and easily with
> > userspace-only data, but I am not sure how.
> >
> > A common example:
> >
> > - worker W1 with server S1 calls umcg_wait()
> > - worker W2 with server S2 calls umcg_swap(W1)
> >
> > If due to preemption and other concurrency weirdness the two syscalls
> > above race with each other, each trying to change the server assigned
> > to W1. I can easily handle the race by doing kernel-side locking;
> > without kernel-side locking (cannot do rcu locks and/or spin locks
> > while accessing userspace data) I am not sure how to handle the race.
> > Maybe it is possible with careful atomic writes to states and looping
> > to handle this specific race (what if the userspace antagonistically
> > writes to the same location? will it force the syscall to spin
> > indefinitely?); but with proper locking many potential races can be
> > handled; with atomic ops and looping it is more difficult... Will we
> > have to add a lock to struct umcg_task? And acquire it from the kernel
> > side? And worry about spinning forever?
>
> What would you want locking for? I really don't see a problem here.
>
> Both blocked_ptr and runnable_ptr are cmpxchg single-linked-lists. Yes
> they can spin a little, but that's not a new problem, futex has all
> that.
>
> And ->state only needs single cmpxchg ops, no loops, either we got the
> wakeup, or we didn't. The rest is done with memory ordering:
>
>
>         server                          worker
>
>         self->state = RUNNABLE;         self->state = BLOCKED;
>
>         head = xchg(list, NULL)         add_to_list(self, &server->blocked_ptr);
>
>                                         if (try_cmpxchg_user(&server->umcg->state, RUNNABLE, RUNNING) > 0)
>         sys_umcg_wait()                         wake_up_process(server);
>
>
> Either server sees the add, or we see it's RUNNABLE and wake it up
> (or both).
>
> If anything on the BLOCKED side goes wrong (bad pointers, whatever),
> have it segfault.
>
> <edit> Ooh, I forgot you can't just go wake the server when it's running
> something else... so that does indeed need more states/complication, the
> ordering argument stands though. We'll need something like
> self->current_tid or somesuch </edit>
>
>
> > > > +SYSCALL_DEFINE2(umcg_wait, u32, flags,
> > > > +             const struct __kernel_timespec __user *, timeout)
> > >
> > > I despise timespec, tglx?
> >
> > What are the alternatives? I just picked what the futex code uses.
>
> u64 nanoseconds. Not sure tglx really wants to do that though, but
> still, timespec is a terrible thing.
>
> > In summary, two IMPORTANT QUESTIONS:
> >
> > 1. thin vs fat syscalls: can we push some code/logic to the userspace
> > (state changes, looping/retries), or do we insist on syscalls handling
> > everything?
>
> Well, the way I see it it's a trade of what is handled where. I get a
> smaller API (although I'm sure I've forgotten something trivial again
> that wrecks everything <edit> I did :/ </edit>) and userspace gets to
> deal with all of SMP and scheduling policies.
>
> You hard-coded a global-fifo and had a enormous number of syscalls and
> needed to wrap every syscall invocation in order to fix up the return.
>
> > Please have in mind that even if we choose the second
> > approach (fat syscalls), the userspace will most likely still have to
> > do everything it does under the first option just to handle
> > signals/interrupts (i.e. unscheduled wakeups);
>
> IIRC sigreturn goes back into the kernel and we can resume blocking
> there.
>
> > 2. kernel-side data vs userspace-only: can we avoid having kernel-side
> > data? More specifically, what alternatives to rcu_read_lock and/or
> > task_lock are available when working with userspace data?
>
> What would you want locked and why?
>
>
> Anyway, this email is far too long again (basically took me all day :/),
> hope it helps a bit. Thomas is stuck fixing XSAVE disasters, but I'll
> ask him to chime in once that's done.
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-06-10 18:02     ` Peter Zijlstra
  2021-06-10 20:06       ` Peter Oskolkov
@ 2021-07-07 17:45       ` Thierry Delisle
  2021-07-08 21:44         ` Peter Oskolkov
  1 sibling, 1 reply; 35+ messages in thread
From: Thierry Delisle @ 2021-07-07 17:45 UTC (permalink / raw)
  To: peterz
  Cc: akpm, avagin, bsegall, jnewsome, joel, linux-api, linux-kernel,
	mingo, pjt, posk, posk, tglx, Peter Buhr, Martin Karsten

Hi,
I wanted to way-in on this. I am one of the main developer's on the Cforall
programming language (https://cforall.uwaterloo.ca), which implements 
its own
M:N user-threading runtime. I want to state that this RFC is an interesting
feature, which we would be able to take advantage of immediately, assuming
performance and flexibility closely match state-of-the-art implementations.

Precisely, we would benefit from two aspects of User Managed Control Groups:

1. user-level threads would become regular pthreads, so that gdb, valgrind,
    ptrace and TLS works normally, etc.

2. The user-space scheduler can react on user-threads blocking in the 
kernel.

However, we would need to look at performance issues like thread 
creation and
context switch to know if your scheme is performant with user-level 
threading.
We are also conscious about use cases that involve a very high (100Ks to 
1Ms)
number of concurrent sessions and thus threads.

Note, our team published a comprehensive look at M:N threading in ACM
Sigmetrics 2020: https://doi.org/10.1145/3379483, which highlights the
expected performance of M:N threading, and another look at high-performance
control flow in SP&E 2021: 
https://onlinelibrary.wiley.com/doi/10.1002/spe.2925


  > > Yes, UNBLOCKED it a transitory state meaning the worker's blocking
  > > operation has completed, but the wake event hasn't been delivered to
  > > the userspace yet (and so the worker it not yet RUNNABLE)
  >
  > So if I understand the proposal correctly the only possible option is
  > something like:
  >
  >     for (;;) {
  >         next = user_sched_pick();
  >         if (next) {
  >             sys_umcg_run(next);
  >             continue;
  >         }
  >
  >         sys_umcg_poll(&next);
  >         if (next) {
  >             next->state = RUNNABLE;
  >             user_sched_enqueue(next);
  >         }
  >     }
  >
  > This seems incapable of implementing generic scheduling policies and has
  > a hard-coded FIFO policy.
  >
  > The poll() thing cannot differentiate between: 'find new task' and 'go
  > idle'. So you cannot keep running it until all new tasks are found.
  >
  > But you basically get to do a syscall to discover every new task, while
  > the other proposal gets you a user visible list of new tasks, no
  > syscalls needed at all.

I agree strongly with this comment, sys_umcg_poll() does not appear to be
flexible enough for generic policies. I also suspect it would become a
bottleneck in any SMP scheduler due to this central serial data-structure.


  > But you basically get to do a syscall to discover every new task, while
  > the other proposal gets you a user visible list of new tasks, no
  > syscalls needed at all.
  >
  > It's also not quite clear to me what you do about RUNNING->BLOCKED, how
  > does the userspace scheduler know to dequeue a task?

In the schedulers we have implemented, threads are dequeued *before* being
run. That is, the head of the queue is not the currently running thread.

If the currently running threads need to be in the scheduler data-structure,
I believe it can be dequeued immediately after sys_umcg_run() has returned.
More on this below.


  > My proposal gets you something like:
  >
  > [...]
  >
  > struct umcg_task {
  >    u32 umcg_status;            /* r/w */
  >    u32 umcg_server_tid;        /* r   */
  >    u32 umcg_next_tid;          /* r   */
  >    u32 umcg_tid;               /* r   */
  >    u64 umcg_blocked_ptr;       /*   w */
  >    u64 umcg_runnable_ptr;      /*   w */
  > };

I believe this approach may work, but could you elaborate on it? I 
wasn't able
to find a more complete description.

For example, I fail to see what purpose the umcg_blocked_ptr serves. 
When could
it contain anything other then a single element that is already pointed
to by "n" in the proposed loop? The only case I can come up with, is if a
worker thread tries to context switch directly to another worker thread. 
But in
that case, I do not know what state that second worker would need to be 
in for
this operation to be correct. Is the objective to allow the scheduler to be
invoked from worker threads?

Also, what is the purpose of umcg_status being writable by the user-space?
(I'm assuming status == state)? The code in sys_umcg_wait suggests it is for
managing potential out-of-order wakes and waits, but the kernel should 
be able
to handle them already, the same way FUTEX_WAKE and FUTEX_WAIT are handled.
When would these state transition not be handled by the kernel?

I would also point out that creating worker threads as regular pthreads and
then converting them to worker threads sounds less then ideal. It would
probably be preferable directly appended new worker threads to the
umcg_runnable_ptr list without scheduling them in the kernel. It makes the
placement of the umcg_task trickier but maintains a stronger M:N model.

Finally, I would recommend adding a 64-bit user pointer to umcg_task that is
neither read nor written from the kernel. These kind of fields are always
useful for implementers.

Thank you for your time,

Thierry


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset
  2021-07-07 17:45       ` Thierry Delisle
@ 2021-07-08 21:44         ` Peter Oskolkov
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Oskolkov @ 2021-07-08 21:44 UTC (permalink / raw)
  To: Thierry Delisle
  Cc: linux-api, linux-kernel, pjt, posk, Peter Buhr, Martin Karsten

On Wed, Jul 7, 2021 at 10:45 AM Thierry Delisle <tdelisle@uwaterloo.ca> wrote:
>
> Hi,
> I wanted to way-in on this. I am one of the main developer's on the Cforall
> programming language (https://cforall.uwaterloo.ca), which implements
> its own
> M:N user-threading runtime. I want to state that this RFC is an interesting
> feature, which we would be able to take advantage of immediately, assuming
> performance and flexibility closely match state-of-the-art implementations.

Hi Thierry,

Thank you for your message! I just posted a new version/approach:

https://lore.kernel.org/lkml/20210708194638.128950-1-posk@google.com/

Let's move the discussion to the new thread.

Thanks,
Peter

[...]

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-07-08 21:44 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-20 18:36 [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 1/9] sched/umcg: add UMCG syscall stubs and CONFIG_UMCG Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 2/9] sched/umcg: add uapi/linux/umcg.h and sched/umcg.c Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 3/9] sched: add WF_CURRENT_CPU and externise ttwu Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 4/9] sched/umcg: implement core UMCG API Peter Oskolkov
2021-05-21 19:06   ` Andrei Vagin
2021-05-21 21:31     ` Jann Horn
2021-05-21 22:03       ` Peter Oskolkov
2021-05-21 19:32   ` Andy Lutomirski
2021-05-21 22:01     ` Peter Oskolkov
2021-05-21 21:33   ` Jann Horn
2021-06-09 13:01     ` Peter Zijlstra
2021-05-20 18:36 ` [RFC PATCH v0.1 5/9] lib/umcg: implement UMCG core API for userspace Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 6/9] selftests/umcg: add UMCG core API selftest Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 7/9] sched/umcg: add UMCG server/worker API (early RFC) Peter Oskolkov
2021-05-21 20:17   ` Andrei Vagin
2021-05-20 18:36 ` [RFC PATCH v0.1 8/9] lib/umcg: " Peter Oskolkov
2021-05-20 18:36 ` [RFC PATCH v0.1 9/9] selftests/umcg: add UMCG server/worker API selftest Peter Oskolkov
2021-05-20 21:17 ` [RFC PATCH v0.1 0/9] UMCG early preview/RFC patchset Jonathan Corbet
2021-05-20 21:38   ` Peter Oskolkov
2021-05-21  0:15     ` Randy Dunlap
2021-05-21  8:04       ` Peter Zijlstra
2021-05-21 15:08     ` Jonathan Corbet
2021-05-21 16:03       ` Peter Oskolkov
2021-05-21 19:17         ` Jonathan Corbet
2021-05-27  0:06           ` Peter Oskolkov
2021-05-27 15:41             ` Jonathan Corbet
     [not found] ` <CAEWA0a72SvpcuN4ov=98T3uWtExPCr7BQePOgjkqD1ofWKEASw@mail.gmail.com>
2021-05-21 19:13   ` Peter Oskolkov
2021-05-21 23:08     ` Jann Horn
2021-06-09 12:54 ` Peter Zijlstra
2021-06-09 20:18   ` Peter Oskolkov
2021-06-10 18:02     ` Peter Zijlstra
2021-06-10 20:06       ` Peter Oskolkov
2021-07-07 17:45       ` Thierry Delisle
2021-07-08 21:44         ` Peter Oskolkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).