* [RFC][PATCH 0/3] sched: User Managed Concurrency Groups @ 2021-12-14 20:44 Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 1/3] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Zijlstra ` (4 more replies) 0 siblings, 5 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-14 20:44 UTC (permalink / raw) To: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot Cc: linux-kernel, linux-mm, linux-api, x86, peterz, pjt, posk, avagin, jannh, tdelisle, posk Hi, This is actually tested code; but still missing the SMP wake-to-idle machinery. I still need to think about that. I'll post my test-hack as a reply, but basically it does co-operative and preemptive UP-like user scheduling. Patches go on top of tip/master as they rely on the .fixup removal recently merged in tip/x86/core. Also, I still need to audit a bunch of mm code, because I'm not sure things are actually as well behaved as this code supposes they are. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC][PATCH 1/3] sched/umcg: add WF_CURRENT_CPU and externise ttwu 2021-12-14 20:44 [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra @ 2021-12-14 20:44 ` Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() Peter Zijlstra ` (3 subsequent siblings) 4 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-14 20:44 UTC (permalink / raw) To: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot Cc: linux-kernel, linux-mm, linux-api, x86, peterz, pjt, posk, avagin, jannh, tdelisle, posk From: Peter Oskolkov <posk@posk.io> Add WF_CURRENT_CPU wake flag that advices the scheduler to move the wakee to the current CPU. This is useful for fast on-CPU context switching use cases such as UMCG. In addition, make ttwu external rather than static so that the flag could be passed to it from outside of sched/core.c. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211122211327.5931-2-posk@google.com --- kernel/sched/core.c | 3 +-- kernel/sched/fair.c | 4 ++++ kernel/sched/sched.h | 17 ++++++++++------- 3 files changed, 15 insertions(+), 9 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3980,8 +3980,7 @@ bool ttwu_state_match(struct task_struct * Return: %true if @p->state changes (an actual wakeup was done), * %false otherwise. */ -static int -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) +int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) { unsigned long flags; int cpu, success = 0; --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6838,6 +6838,10 @@ select_task_rq_fair(struct task_struct * if (wake_flags & WF_TTWU) { record_wakee(p); + if ((wake_flags & WF_CURRENT_CPU) && + cpumask_test_cpu(cpu, p->cpus_ptr)) + return cpu; + if (sched_energy_enabled()) { new_cpu = find_energy_efficient_cpu(p, prev_cpu); if (new_cpu >= 0) --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2052,13 +2052,14 @@ static inline int task_on_rq_migrating(s } /* Wake flags. The first three directly map to some SD flag value */ -#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */ -#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */ -#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */ - -#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */ -#define WF_MIGRATED 0x20 /* Internal use, task got migrated */ -#define WF_ON_CPU 0x40 /* Wakee is on_cpu */ +#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */ +#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */ +#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */ + +#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */ +#define WF_MIGRATED 0x20 /* Internal use, task got migrated */ +#define WF_ON_CPU 0x40 /* Wakee is on_cpu */ +#define WF_CURRENT_CPU 0x80 /* Prefer to move the wakee to the current CPU. */ #ifdef CONFIG_SMP static_assert(WF_EXEC == SD_BALANCE_EXEC); @@ -3112,6 +3113,8 @@ static inline bool is_per_cpu_kthread(st extern void swake_up_all_locked(struct swait_queue_head *q); extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait); +extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags); + #ifdef CONFIG_PREEMPT_DYNAMIC extern int preempt_dynamic_mode; extern int sched_dynamic_mode(const char *str); ^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() 2021-12-14 20:44 [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 1/3] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Zijlstra @ 2021-12-14 20:44 ` Peter Zijlstra 2021-12-20 17:30 ` Sean Christopherson 2021-12-14 20:44 ` [RFC][PATCH 3/3] sched: User Mode Concurency Groups Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2021-12-14 20:44 UTC (permalink / raw) To: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot Cc: linux-kernel, linux-mm, linux-api, x86, peterz, pjt, posk, avagin, jannh, tdelisle, posk Do try_cmpxchg() loops on userspace addresses. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/include/asm/uaccess.h | 57 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -342,6 +342,24 @@ do { \ : [umem] "m" (__m(addr)) \ : : label) +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label) ({ \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm_volatile_goto("\n" \ + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ + _ASM_EXTABLE_UA(1b, %l[label]) \ + : CC_OUT(z) (success), \ + [ptr] "+m" (*_ptr), \ + [old] "+a" (__old) \ + : [new] "r" (__new) \ + : "memory", "cc" \ + : label); \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) + #else // !CONFIG_CC_HAS_ASM_GOTO_OUTPUT #ifdef CONFIG_X86_32 @@ -407,6 +425,30 @@ do { \ : [umem] "m" (__m(addr)), \ "0" (err)) +#define __try_cmpxchg_user_asm(itype, _ptr, _pold, _new, label) ({ \ + int __err = 0; \ + bool success; \ + __typeof__(_ptr) _old = (__typeof__(_ptr))(_pold); \ + __typeof__(*(_ptr)) __old = *_old; \ + __typeof__(*(_ptr)) __new = (_new); \ + asm volatile("\n" \ + "1: " LOCK_PREFIX "cmpxchg"itype" %[new], %[ptr]\n"\ + CC_SET(z) \ + "2:\n" \ + _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, \ + %[errout]) \ + : CC_OUT(z) (success), \ + [errout] "+r" (__err), \ + [ptr] "+m" (*_ptr), \ + [old] "+a" (__old) \ + : [new] "r" (__new) \ + : "memory", "cc"); \ + if (unlikely(__err)) \ + goto label; \ + if (unlikely(!success)) \ + *_old = __old; \ + likely(success); }) + #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT /* FIXME: this hack is definitely wrong -AK */ @@ -501,6 +543,21 @@ do { \ } while (0) #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT +extern void __try_cmpxchg_user_wrong_size(void); + +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ + __typeof__(*(_ptr)) __ret; \ + switch (sizeof(__ret)) { \ + case 4: __ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + case 8: __ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp), \ + (_nval), _label); \ + break; \ + default: __try_cmpxchg_user_wrong_size(); \ + } \ + __ret; }) + /* * We want the unsafe accessors to always be inlined and use * the error labels - thus the macro games. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() 2021-12-14 20:44 ` [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() Peter Zijlstra @ 2021-12-20 17:30 ` Sean Christopherson 2021-12-21 11:17 ` Peter Zijlstra 0 siblings, 1 reply; 40+ messages in thread From: Sean Christopherson @ 2021-12-20 17:30 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle, posk On Tue, Dec 14, 2021, Peter Zijlstra wrote: > Do try_cmpxchg() loops on userspace addresses. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > @@ -501,6 +543,21 @@ do { \ > } while (0) > #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT > > +extern void __try_cmpxchg_user_wrong_size(void); > + > +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ > + __typeof__(*(_ptr)) __ret; \ > + switch (sizeof(__ret)) { \ > + case 4: __ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp), \ > + (_nval), _label); \ > + break; \ > + case 8: __ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp), \ > + (_nval), _label); \ > + break; \ Can we add support for 1-byte and 2-byte cmpxchg, and for using cmpxchg8b to handle 8-byte operations in 32-bit mode? Support for all the flavors (except 16-byte) would allow KVM to use this in an emulator path that currently kmaps the target. I'd be more than happy to help test the result. Thanks! > + default: __try_cmpxchg_user_wrong_size(); \ > + } \ > + __ret; }) > + > /* > * We want the unsafe accessors to always be inlined and use > * the error labels - thus the macro games. > > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() 2021-12-20 17:30 ` Sean Christopherson @ 2021-12-21 11:17 ` Peter Zijlstra 0 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-21 11:17 UTC (permalink / raw) To: Sean Christopherson Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle, posk On Mon, Dec 20, 2021 at 05:30:05PM +0000, Sean Christopherson wrote: > On Tue, Dec 14, 2021, Peter Zijlstra wrote: > > Do try_cmpxchg() loops on userspace addresses. > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > --- > > @@ -501,6 +543,21 @@ do { \ > > } while (0) > > #endif // CONFIG_CC_HAS_ASM_GOTO_OUTPUT > > > > +extern void __try_cmpxchg_user_wrong_size(void); > > + > > +#define unsafe_try_cmpxchg_user(_ptr, _oldp, _nval, _label) ({ \ > > + __typeof__(*(_ptr)) __ret; \ > > + switch (sizeof(__ret)) { \ > > + case 4: __ret = __try_cmpxchg_user_asm("l", (_ptr), (_oldp), \ > > + (_nval), _label); \ > > + break; \ > > + case 8: __ret = __try_cmpxchg_user_asm("q", (_ptr), (_oldp), \ > > + (_nval), _label); \ > > + break; \ > > Can we add support for 1-byte and 2-byte cmpxchg, and for using cmpxchg8b to handle > 8-byte operations in 32-bit mode? Support for all the flavors (except 16-byte) > would allow KVM to use this in an emulator path that currently kmaps the target. > I'd be more than happy to help test the result. Sure, no problem. I currently still need to audit parts of mm/ and do the smp-wake-idle bits before I repost -- that and take a xmas break ofcourse :-) So it'll be a while before I repost this. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2021-12-14 20:44 [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 1/3] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() Peter Zijlstra @ 2021-12-14 20:44 ` Peter Zijlstra 2021-12-21 17:19 ` Peter Oskolkov 2021-12-24 11:27 ` Peter Zijlstra 2021-12-14 21:00 ` [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra 2021-12-15 3:46 ` Peter Oskolkov 4 siblings, 2 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-14 20:44 UTC (permalink / raw) To: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot Cc: linux-kernel, linux-mm, linux-api, x86, peterz, pjt, posk, avagin, jannh, tdelisle, posk User Managed Concurrency Groups is an M:N threading toolkit that allows constructing user space schedulers designed to efficiently manage heterogeneous in-process workloads while maintaining high CPU utilization (95%+). XXX moar changelog explaining how this is moar awesome than traditional user-space threading. The big thing that's missing is the SMP wake-to-remote-idle. The big assumption this whole thing is build on is that pin_user_pages() preserves user mappings in so far that pagefault_disable() will never generate EFAULT (unless the user does munmap() in which case it can keep the pieces). shrink_page_list() does page_maybe_dma_pinned() before try_to_unmap() and as such seems to respect this constraint. unmap_and_move() however seems willing to unmap otherwise pinned (and hence unmigratable) pages. This might need fixing. Originally-by: Peter Oskolkov <posk@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- arch/x86/Kconfig | 1 arch/x86/entry/syscalls/syscall_64.tbl | 3 arch/x86/include/asm/thread_info.h | 2 fs/exec.c | 1 include/linux/entry-common.h | 6 include/linux/sched.h | 86 +++ include/linux/syscalls.h | 4 include/linux/thread_info.h | 2 include/uapi/asm-generic/unistd.h | 9 include/uapi/linux/umcg.h | 143 +++++ init/Kconfig | 15 kernel/entry/common.c | 18 kernel/exit.c | 5 kernel/sched/Makefile | 1 kernel/sched/core.c | 9 kernel/sched/umcg.c | 868 +++++++++++++++++++++++++++++++++ kernel/sys_ni.c | 5 17 files changed, 1171 insertions(+), 7 deletions(-) --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -248,6 +248,7 @@ config X86 select HAVE_RSEQ select HAVE_SYSCALL_TRACEPOINTS select HAVE_UNSTABLE_SCHED_CLOCK + select HAVE_UMCG if X86_64 select HAVE_USER_RETURN_NOTIFIER select HAVE_GENERIC_VDSO select HOTPLUG_SMT if SMP --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -371,6 +371,9 @@ 447 common memfd_secret sys_memfd_secret 448 common process_mrelease sys_process_mrelease 449 common futex_waitv sys_futex_waitv +450 common umcg_ctl sys_umcg_ctl +451 common umcg_wait sys_umcg_wait +452 common umcg_kick sys_umcg_kick # # Due to a historical design error, certain syscalls are numbered differently --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -83,6 +83,7 @@ struct thread_info { #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ #define TIF_SSBD 5 /* Speculative store bypass disable */ +#define TIF_UMCG 6 /* UMCG return to user hook */ #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ @@ -107,6 +108,7 @@ struct thread_info { #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_SSBD (1 << TIF_SSBD) +#define _TIF_UMCG (1 << TIF_UMCG) #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) #define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH) #define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY) --- a/fs/exec.c +++ b/fs/exec.c @@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp current->fs->in_exec = 0; current->in_execve = 0; rseq_execve(current); + umcg_execve(current); acct_update_integrals(current); task_numa_free(current, false); return retval; --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -22,6 +22,10 @@ # define _TIF_UPROBE (0) #endif +#ifndef _TIF_UMCG +# define _TIF_UMCG (0) +#endif + /* * SYSCALL_WORK flags handled in syscall_enter_from_user_mode() */ @@ -42,11 +46,13 @@ SYSCALL_WORK_SYSCALL_EMU | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_UMCG | \ ARCH_SYSCALL_WORK_ENTER) #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ SYSCALL_WORK_SYSCALL_TRACE | \ SYSCALL_WORK_SYSCALL_AUDIT | \ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ + SYSCALL_WORK_SYSCALL_UMCG | \ SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ ARCH_SYSCALL_WORK_EXIT) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -67,6 +67,7 @@ struct sighand_struct; struct signal_struct; struct task_delay_info; struct task_group; +struct umcg_task; /* * Task state bitmask. NOTE! These bits are also @@ -1294,6 +1295,23 @@ struct task_struct { unsigned long rseq_event_mask; #endif +#ifdef CONFIG_UMCG + /* setup by sys_umcg_ctrl() */ + clockid_t umcg_clock; + struct umcg_task __user *umcg_task; + + /* setup by umcg_pin_enter() */ + struct page *umcg_worker_page; + + struct task_struct *umcg_server; + struct umcg_task __user *umcg_server_task; + struct page *umcg_server_page; + + struct task_struct *umcg_next; + struct umcg_task __user *umcg_next_task; + struct page *umcg_next_page; +#endif + struct tlbflush_unmap_batch tlb_ubc; union { @@ -1687,6 +1705,13 @@ extern struct pid *cad_pid; #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ + +#ifdef CONFIG_UMCG +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */ +#else +#define PF_UMCG_WORKER 0x00000000 +#endif + #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */ @@ -2294,6 +2319,67 @@ static inline void rseq_execve(struct ta { } +#endif + +#ifdef CONFIG_UMCG + +extern void umcg_sys_enter(struct pt_regs *regs, long syscall); +extern void umcg_sys_exit(struct pt_regs *regs); +extern void umcg_notify_resume(struct pt_regs *regs); +extern void umcg_worker_exit(void); +extern void umcg_clear_child(struct task_struct *tsk); + +/* Called by bprm_execve() in fs/exec.c. */ +static inline void umcg_execve(struct task_struct *tsk) +{ + if (tsk->umcg_task) + umcg_clear_child(tsk); +} + +/* Called by do_exit() in kernel/exit.c. */ +static inline void umcg_handle_exit(void) +{ + if (current->flags & PF_UMCG_WORKER) + umcg_worker_exit(); +} + +/* + * umcg_wq_worker_[sleeping|running] are called in core.c by + * sched_submit_work() and sched_update_worker(). + */ +extern void umcg_wq_worker_sleeping(struct task_struct *tsk); +extern void umcg_wq_worker_running(struct task_struct *tsk); + +#else /* CONFIG_UMCG */ + +static inline void umcg_sys_enter(struct pt_regs *regs, long syscall) +{ +} + +static inline void umcg_sys_exit(struct pt_regs *regs) +{ +} + +static inline void umcg_notify_resume(struct pt_regs *regs) +{ +} + +static inline void umcg_clear_child(struct task_struct *tsk) +{ +} +static inline void umcg_execve(struct task_struct *tsk) +{ +} +static inline void umcg_handle_exit(void) +{ +} +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk) +{ +} +static inline void umcg_wq_worker_running(struct task_struct *tsk) +{ +} + #endif #ifdef CONFIG_DEBUG_RSEQ --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -72,6 +72,7 @@ struct open_how; struct mount_attr; struct landlock_ruleset_attr; enum landlock_rule_type; +struct umcg_task; #include <linux/types.h> #include <linux/aio_abi.h> @@ -1057,6 +1058,9 @@ asmlinkage long sys_landlock_add_rule(in const void __user *rule_attr, __u32 flags); asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); asmlinkage long sys_memfd_secret(unsigned int flags); +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock); +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout); +asmlinkage long sys_umcg_kick(u32 flags, pid_t tid); /* * Architecture-specific system calls --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -46,6 +46,7 @@ enum syscall_work_bit { SYSCALL_WORK_BIT_SYSCALL_AUDIT, SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH, SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP, + SYSCALL_WORK_BIT_SYSCALL_UMCG, }; #define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) @@ -55,6 +56,7 @@ enum syscall_work_bit { #define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH) #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP) +#define SYSCALL_WORK_SYSCALL_UMCG BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG) #endif #include <asm/thread_info.h> --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -883,8 +883,15 @@ __SYSCALL(__NR_process_mrelease, sys_pro #define __NR_futex_waitv 449 __SYSCALL(__NR_futex_waitv, sys_futex_waitv) +#define __NR_umcg_ctl 450 +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl) +#define __NR_umcg_wait 451 +__SYSCALL(__NR_umcg_wait, sys_umcg_wait) +#define __NR_umcg_kick 452 +__SYSCALL(__NR_umcg_kick, sys_umcg_kick) + #undef __NR_syscalls -#define __NR_syscalls 450 +#define __NR_syscalls 453 /* * 32 bit systems traditionally used different --- /dev/null +++ b/include/uapi/linux/umcg.h @@ -0,0 +1,143 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_UMCG_H +#define _UAPI_LINUX_UMCG_H + +#include <linux/types.h> + +/* + * UMCG: User Managed Concurrency Groups. + * + * Syscalls (see kernel/sched/umcg.c): + * sys_umcg_ctl() - register/unregister UMCG tasks; + * sys_umcg_wait() - wait/wake/context-switch. + * sys_umcg_kick() - prod a UMCG task + * + * struct umcg_task (below): controls the state of UMCG tasks. + */ + +/* + * UMCG task states, the first 6 bits of struct umcg_task.state_ts. + * The states represent the user space point of view. + * + * ,--------(TF_PREEMPT + notify_resume)-------. ,------------. + * | v | | + * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE (signal + notify_resume) + * ^ | ^ | + * `--------------(sys_umcg_wait)--------------' `------------' + * + */ +#define UMCG_TASK_NONE 0x0000U +#define UMCG_TASK_RUNNING 0x0001U +#define UMCG_TASK_RUNNABLE 0x0002U +#define UMCG_TASK_BLOCKED 0x0003U + +#define UMCG_TASK_MASK 0x00ffU + +/* + * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted. + * + * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent + * return-to-user (eg sys_umcg_kick()) will perform the equivalent of + * sys_umcg_wait() on it. That is, it will wake next_tid/server_tid, transfer + * to RUNNABLE and enqueue on the server's runnable list. + */ +#define UMCG_TF_PREEMPT 0x0100U +/* + * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait() + * + * Enables server loops like (vs umcg_sys_exit()): + * + * for(;;) { + * self->status = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; + * // smp_mb() implied by xchg() + * + * runnable_ptr = xchg(self->runnable_workers_ptr, NULL); + * while (runnable_ptr) { + * next = runnable_ptr->runnable_workers_ptr; + * + * umcg_server_add_runnable(self, runnable_ptr); + * + * runnable_ptr = next; + * } + * + * self->next = umcg_server_pick_next(self); + * sys_umcg_wait(0, 0); + * } + * + * without a signal or interrupt in between setting umcg_task::state and + * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume(). + */ +#define UMCG_TF_COND_WAIT 0x0200U + +#define UMCG_TF_MASK 0xff00U + +#define UMCG_TASK_ALIGN 64 + +/** + * struct umcg_task - controls the state of UMCG tasks. + * + * The struct is aligned at 64 bytes to ensure that it fits into + * a single cache line. + */ +struct umcg_task { + /** + * @state_ts: the current state of the UMCG task described by + * this struct, with a unique timestamp indicating + * when the last state change happened. + * + * Readable/writable by both the kernel and the userspace. + * + * UMCG task state: + * bits 0 - 7: task state; + * bits 8 - 15: state flags; + * bits 16 - 31: for userspace use; + */ + __u32 state; /* r/w */ + + /** + * @next_tid: the TID of the UMCG task that should be context-switched + * into in sys_umcg_wait(). Can be zero, in which case + * it'll switch to server_tid. + * + * @server_tid: the TID of the UMCG server that hosts this task, + * when RUNNABLE this task will get added to it's + * runnable_workers_ptr list. + * + * Read-only for the kernel, read/write for the userspace. + */ + __u32 next_tid; /* r */ + __u32 server_tid; /* r */ + + __u32 __hole[1]; + + /* + * Timestamps for when last we became BLOCKED, RUNNABLE, in CLOCK_MONOTONIC. + */ + __u64 blocked_ts; /* w */ + __u64 runnable_ts; /* w */ + + /** + * @runnable_workers_ptr: a single-linked list of runnable workers. + * + * Readable/writable by both the kernel and the userspace: the + * kernel adds items to the list, userspace removes them. + */ + __u64 runnable_workers_ptr; /* r/w */ + + __u64 __zero[3]; + +} __attribute__((packed, aligned(UMCG_TASK_ALIGN))); + +/** + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl + * @UMCG_CTL_REGISTER: register the current task as a UMCG task + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task + * @UMCG_CTL_WORKER: register the current task as a UMCG worker + */ +enum umcg_ctl_flag { + UMCG_CTL_REGISTER = 0x00001, + UMCG_CTL_UNREGISTER = 0x00002, + UMCG_CTL_WORKER = 0x10000, +}; + +#endif /* _UAPI_LINUX_UMCG_H */ --- a/init/Kconfig +++ b/init/Kconfig @@ -1686,6 +1686,21 @@ config MEMBARRIER If unsure, say Y. +config HAVE_UMCG + bool + +config UMCG + bool "Enable User Managed Concurrency Groups API" + depends on 64BIT + depends on GENERIC_ENTRY + depends on HAVE_UMCG + default n + help + Enable User Managed Concurrency Groups API, which form the basis + for an in-process M:N userspace scheduling framework. + At the moment this is an experimental/RFC feature that is not + guaranteed to be backward-compatible. + config KALLSYMS bool "Load all symbols for debugging/ksymoops" if EXPERT default y --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -6,6 +6,7 @@ #include <linux/livepatch.h> #include <linux/audit.h> #include <linux/tick.h> +#include <linux/sched.h> #include "common.h" @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) trace_sys_enter(regs, syscall); + if (work & SYSCALL_WORK_SYSCALL_UMCG) + umcg_sys_enter(regs, syscall); + syscall_enter_audit(regs, syscall); return ret ? : syscall; @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l * Before returning to user space ensure that all pending work * items have been completed. */ - while (ti_work & EXIT_TO_USER_MODE_WORK) { - + do { local_irq_enable_exit_to_user(ti_work); if (ti_work & _TIF_NEED_RESCHED) @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l if (ti_work & _TIF_PATCH_PENDING) klp_update_patch_state(current); + /* must be before handle_signal_work(); terminates on sigpending */ + if (ti_work & _TIF_UMCG) + umcg_notify_resume(regs); + if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) handle_signal_work(regs, ti_work); @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l tick_nohz_user_enter_prepare(); ti_work = read_thread_flags(); - } + } while (ti_work & EXIT_TO_USER_MODE_WORK); /* Return the latest work state for arch_exit_to_user_mode() */ return ti_work; @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st /* Flush pending rcuog wakeup before the last need_resched() check */ tick_nohz_user_enter_prepare(); - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) + if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG))) ti_work = exit_to_user_mode_loop(regs, ti_work); arch_exit_to_user_mode_prepare(regs, ti_work); @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_ step = report_single_step(work); if (step || work & SYSCALL_WORK_SYSCALL_TRACE) arch_syscall_exit_tracehook(regs, step); + + if (work & SYSCALL_WORK_SYSCALL_UMCG) + umcg_sys_exit(regs); } /* --- a/kernel/exit.c +++ b/kernel/exit.c @@ -749,6 +749,10 @@ void __noreturn do_exit(long code) if (unlikely(!tsk->pid)) panic("Attempted to kill the idle task!"); + /* Turn off UMCG sched hooks. */ + if (unlikely(tsk->flags & PF_UMCG_WORKER)) + tsk->flags &= ~PF_UMCG_WORKER; + /* * If do_exit is called because this processes oopsed, it's possible * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before @@ -786,6 +790,7 @@ void __noreturn do_exit(long code) io_uring_files_cancel(); exit_signals(tsk); /* sets PF_EXITING */ + umcg_handle_exit(); /* sync mm's RSS info before statistics gathering */ if (tsk->mm) --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o obj-$(CONFIG_CPU_ISOLATION) += isolation.o obj-$(CONFIG_PSI) += psi.o obj-$(CONFIG_SCHED_CORE) += core_sched.o +obj-$(CONFIG_UMCG) += umcg.o --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4272,6 +4272,7 @@ static void __sched_fork(unsigned long c p->wake_entry.u_flags = CSD_TYPE_TTWU; p->migration_pending = NULL; #endif + umcg_clear_child(p); } DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); @@ -6330,9 +6331,11 @@ static inline void sched_submit_work(str * If a worker goes to sleep, notify and ask workqueue whether it * wants to wake up a task to maintain concurrency. */ - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { if (task_flags & PF_WQ_WORKER) wq_worker_sleeping(tsk); + else if (task_flags & PF_UMCG_WORKER) + umcg_wq_worker_sleeping(tsk); else io_wq_worker_sleeping(tsk); } @@ -6350,9 +6353,11 @@ static inline void sched_submit_work(str static void sched_update_worker(struct task_struct *tsk) { - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { if (tsk->flags & PF_WQ_WORKER) wq_worker_running(tsk); + else if (tsk->flags & PF_UMCG_WORKER) + umcg_wq_worker_running(tsk); else io_wq_worker_running(tsk); } --- /dev/null +++ b/kernel/sched/umcg.c @@ -0,0 +1,868 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * User Managed Concurrency Groups (UMCG). + * + */ + +#include <linux/syscalls.h> +#include <linux/types.h> +#include <linux/uaccess.h> +#include <linux/umcg.h> + +#include <asm/syscall.h> + +#include "sched.h" + +static struct task_struct *umcg_get_task(u32 tid) +{ + struct task_struct *tsk = NULL; + + if (tid) { + rcu_read_lock(); + tsk = find_task_by_vpid(tid); + if (tsk && current->mm == tsk->mm && tsk->umcg_task) + get_task_struct(tsk); + else + tsk = NULL; + rcu_read_unlock(); + } + + return tsk; +} + +/** + * umcg_pin_pages: pin pages containing struct umcg_task of + * this task, its server (possibly this task again) + * and the next (possibly NULL). + */ +static int umcg_pin_pages(void) +{ + struct task_struct *server = NULL, *next = NULL, *tsk = current; + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + int server_tid, next_tid; + int ret; + + /* must not have stale state */ + if (WARN_ON_ONCE(tsk->umcg_worker_page || + tsk->umcg_server_page || + tsk->umcg_next_page || + tsk->umcg_server_task || + tsk->umcg_next_task || + tsk->umcg_server || + tsk->umcg_next)) + return -EBUSY; + + ret = -EFAULT; + if (pin_user_pages_fast((unsigned long)self, 1, 0, + &tsk->umcg_worker_page) != 1) + goto clear_self; + + if (get_user(server_tid, &self->server_tid)) + goto unpin_self; + + ret = -ESRCH; + server = umcg_get_task(server_tid); + if (!server) + goto unpin_self; + + ret = -EFAULT; + /* must cache due to possible concurrent change vs access_ok() */ + tsk->umcg_server_task = READ_ONCE(server->umcg_task); + if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0, + &tsk->umcg_server_page) != 1) + goto clear_server; + + tsk->umcg_server = server; + + if (get_user(next_tid, &self->next_tid)) + goto unpin_server; + + if (!next_tid) + goto done; + + ret = -ESRCH; + next = umcg_get_task(next_tid); + if (!next) + goto unpin_server; + + ret = -EFAULT; + tsk->umcg_next_task = READ_ONCE(next->umcg_task); + if (pin_user_pages_fast((unsigned long)tsk->umcg_next_task, 1, 0, + &tsk->umcg_next_page) != 1) + goto clear_next; + + tsk->umcg_next = next; + +done: + return 0; + +clear_next: + tsk->umcg_next_task = NULL; + tsk->umcg_next_page = NULL; + +unpin_server: + unpin_user_page(tsk->umcg_server_page); + +clear_server: + tsk->umcg_server_task = NULL; + tsk->umcg_server_page = NULL; + +unpin_self: + unpin_user_page(tsk->umcg_worker_page); +clear_self: + tsk->umcg_worker_page = NULL; + + return ret; +} + +static void umcg_unpin_pages(void) +{ + struct task_struct *tsk = current; + + if (tsk->umcg_server) { + unpin_user_page(tsk->umcg_worker_page); + tsk->umcg_worker_page = NULL; + + unpin_user_page(tsk->umcg_server_page); + tsk->umcg_server_page = NULL; + tsk->umcg_server_task = NULL; + + put_task_struct(tsk->umcg_server); + tsk->umcg_server = NULL; + + if (tsk->umcg_next) { + unpin_user_page(tsk->umcg_next_page); + tsk->umcg_next_page = NULL; + tsk->umcg_next_task = NULL; + + put_task_struct(tsk->umcg_next); + tsk->umcg_next = NULL; + } + } +} + +static void umcg_clear_task(struct task_struct *tsk) +{ + /* + * This is either called for the current task, or for a newly forked + * task that is not yet running, so we don't need strict atomicity + * below. + */ + if (tsk->umcg_task) { + WRITE_ONCE(tsk->umcg_task, NULL); + tsk->umcg_worker_page = NULL; + + tsk->umcg_server = NULL; + tsk->umcg_server_page = NULL; + tsk->umcg_server_task = NULL; + + tsk->umcg_next = NULL; + tsk->umcg_next_page = NULL; + tsk->umcg_next_task = NULL; + + tsk->flags &= ~PF_UMCG_WORKER; + clear_task_syscall_work(tsk, SYSCALL_UMCG); + clear_tsk_thread_flag(tsk, TIF_UMCG); + } +} + +/* Called for a forked or execve-ed child. */ +void umcg_clear_child(struct task_struct *tsk) +{ + umcg_clear_task(tsk); +} + +/* Called both by normally (unregister) and abnormally exiting workers. */ +void umcg_worker_exit(void) +{ + umcg_unpin_pages(); + umcg_clear_task(current); +} + +/* + * Do a state transition: @from -> @to. + * + * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT. + * + * When @to == {BLOCKED,RUNNABLE}, update timestamps. + * + * Returns: + * 0: success + * -EAGAIN: when self->state != @from + * -EFAULT + */ +static int umcg_update_state(struct task_struct *tsk, + struct umcg_task __user *self, + u32 from, u32 to) +{ + u32 old, new; + u64 now; + + if (to >= UMCG_TASK_RUNNABLE) { + switch (tsk->umcg_clock) { + case CLOCK_REALTIME: now = ktime_get_real_ns(); break; + case CLOCK_MONOTONIC: now = ktime_get_ns(); break; + case CLOCK_BOOTTIME: now = ktime_get_boottime_ns(); break; + case CLOCK_TAI: now = ktime_get_clocktai_ns(); break; + } + } + + if (!user_access_begin(self, sizeof(*self))) + return -EFAULT; + + unsafe_get_user(old, &self->state, Efault); + do { + if ((old & UMCG_TASK_MASK) != from) + goto fail; + + new = old & ~(UMCG_TASK_MASK | + UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT); + new |= to & UMCG_TASK_MASK; + + } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault)); + + if (to == UMCG_TASK_BLOCKED) + unsafe_put_user(now, &self->blocked_ts, Efault); + if (to == UMCG_TASK_RUNNABLE) + unsafe_put_user(now, &self->runnable_ts, Efault); + + user_access_end(); + return 0; + +fail: + user_access_end(); + return -EAGAIN; + +Efault: + user_access_end(); + return -EFAULT; +} + +#define __UMCG_DIE(stmt, reason) do { \ + stmt; \ + pr_warn_ratelimited("%s: killing task %s/%d because: " reason "\n",\ + __func__, current->comm, current->pid); \ + force_sig(SIGKILL); \ + return; \ +} while (0) + +#define UMCG_DIE(reason) __UMCG_DIE(,reason) +#define UMCG_DIE_PF(reason) __UMCG_DIE(pagefault_enable(), reason) +#define UMCG_DIE_UNPIN(reason) __UMCG_DIE(umcg_unpin_pages(), reason) + +/* Called from syscall enter path */ +void umcg_sys_enter(struct pt_regs *regs, long syscall) +{ + /* avoid recursion vs our own syscalls */ + if (syscall == __NR_umcg_wait || + syscall == __NR_umcg_ctl) + return; + + /* avoid recursion vs schedule() */ + current->flags &= ~PF_UMCG_WORKER; + + /* + * Pin all the state on sys_enter() such that we can rely on it + * from dodgy contexts. It is either unpinned from pre-schedule() + * or sys_exit(), whichever comes first, thereby ensuring the pin + * is temporary. + */ + if (umcg_pin_pages()) + UMCG_DIE("pin"); + + current->flags |= PF_UMCG_WORKER; +} + +static int umcg_wake_task(struct task_struct *tsk, struct umcg_task __user *self) +{ + int ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); + if (ret) + return ret; + + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU); + return 0; +} + +static int umcg_wake_next(struct task_struct *tsk) +{ + int ret = umcg_wake_task(tsk->umcg_next, tsk->umcg_next_task); + if (ret) + return ret; + + /* + * If userspace sets umcg_task::next_tid, it needs to remove + * that task from the ready-queue to avoid another server + * selecting it. However, that also means it needs to put it + * back in case it went unused. + * + * By clearing the field on use, userspace can detect this case + * and DTRT. + */ + if (put_user(0u, &tsk->umcg_task->next_tid)) + return -EFAULT; + + return 0; +} + +static int umcg_wake_server(struct task_struct *tsk) +{ + int ret = umcg_wake_task(tsk->umcg_server, tsk->umcg_server_task); + switch (ret) { + case 0: + case -EAGAIN: + /* + * Server could have timed-out or already be running + * due to a runnable enqueue. See umcg_sys_exit(). + */ + break; + + default: + return ret; + } + + return 0; +} + +/* + * Wake ::next_tid or ::server_tid. + * + * Must be called in umcg_pin_pages() context, relies on + * tsk->umcg_{server,next}. + * + * Returns: + * 0: success + * -EAGAIN + * -EFAULT + */ +static int umcg_wake(struct task_struct *tsk) +{ + if (tsk->umcg_next) + return umcg_wake_next(tsk); + + return umcg_wake_server(tsk); +} + +/* pre-schedule() */ +void umcg_wq_worker_sleeping(struct task_struct *tsk) +{ + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + + /* Must not fault, mmap_sem might be held. */ + pagefault_disable(); + + if (WARN_ON_ONCE(!tsk->umcg_server)) + UMCG_DIE_PF("no server"); + + if (umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED)) + UMCG_DIE_PF("state"); + + if (umcg_wake(tsk)) + UMCG_DIE_PF("wake"); + + pagefault_enable(); + + /* + * We're going to sleep, make sure to unpin the pages, this ensures + * the pins are temporary. Also see umcg_sys_exit(). + */ + umcg_unpin_pages(); +} + +/* post-schedule() */ +void umcg_wq_worker_running(struct task_struct *tsk) +{ + /* nothing here, see umcg_sys_exit() */ +} + +/* + * Enqueue @tsk on it's server's runnable list + * + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server. + * + * cmpxchg based single linked list add such that list integrity is never + * violated. Userspace *MUST* remove it from the list before changing ->state. + * As such, we must change state to RUNNABLE before enqueue. + * + * Returns: + * 0: success + * -EFAULT + */ +static int umcg_enqueue_runnable(struct task_struct *tsk) +{ + struct umcg_task __user *server = tsk->umcg_server_task; + struct umcg_task __user *self = tsk->umcg_task; + u64 self_ptr = (unsigned long)self; + u64 first_ptr; + + /* + * umcg_pin_pages() did access_ok() on both pointers, use self here + * only because __user_access_begin() isn't available in generic code. + */ + if (!user_access_begin(self, sizeof(*self))) + return -EFAULT; + + unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault); + do { + unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault); + } while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault)); + + user_access_end(); + return 0; + +Efault: + user_access_end(); + return -EFAULT; +} + +/* + * umcg_wait: Wait for ->state to become RUNNING + * + * Returns: + * 0 - success + * -EINTR - pending signal + * -EINVAL - ::state is not {RUNNABLE,RUNNING} + * -ETIMEDOUT + * -EFAULT + */ +int umcg_wait(u64 timo) +{ + struct task_struct *tsk = current; + struct umcg_task __user *self = tsk->umcg_task; + struct page *page = NULL; + u32 state; + int ret; + + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + + ret = -EINTR; + if (signal_pending(current)) + break; + + /* + * Faults can block and scribble our wait state. + */ + pagefault_disable(); + if (get_user(state, &self->state)) { + pagefault_enable(); + + ret = -EFAULT; + if (page) { + unpin_user_page(page); + page = NULL; + break; + } + + if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) { + page = NULL; + break; + } + + continue; + } + + if (page) { + unpin_user_page(page); + page = NULL; + } + pagefault_enable(); + + state &= UMCG_TASK_MASK; + if (state != UMCG_TASK_RUNNABLE) { + ret = 0; + if (state == UMCG_TASK_RUNNING) + break; + + ret = -EINVAL; + break; + } + + if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL, + tsk->timer_slack_ns, + HRTIMER_MODE_ABS, + tsk->umcg_clock)) { + ret = -ETIMEDOUT; + break; + } + } + __set_current_state(TASK_RUNNING); + + return ret; +} + +void umcg_sys_exit(struct pt_regs *regs) +{ + struct task_struct *tsk = current; + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + long syscall = syscall_get_nr(tsk, regs); + + if (syscall == __NR_umcg_wait) + return; + + /* + * sys_umcg_ctl() will get here without having called umcg_sys_enter() + * as such it will look like a syscall that blocked. + */ + + if (tsk->umcg_server) { + /* + * Didn't block, we done. + */ + umcg_unpin_pages(); + return; + } + + /* avoid recursion vs schedule() */ + current->flags &= ~PF_UMCG_WORKER; + + if (umcg_pin_pages()) + UMCG_DIE("pin"); + + if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE)) + UMCG_DIE_UNPIN("state"); + + if (umcg_enqueue_runnable(tsk)) + UMCG_DIE_UNPIN("enqueue"); + + /* Server might not be RUNNABLE, means it's already running */ + if (umcg_wake_server(tsk)) + UMCG_DIE_UNPIN("wake-server"); + + umcg_unpin_pages(); + + switch (umcg_wait(0)) { + case -EFAULT: + case -EINVAL: + case -ETIMEDOUT: /* how!?! */ + default: + UMCG_DIE("wait"); + + case 0: + case -EINTR: + /* notify_resume will continue the wait after the signal */ + break; + } + + current->flags |= PF_UMCG_WORKER; +} + +void umcg_notify_resume(struct pt_regs *regs) +{ + struct task_struct *tsk = current; + struct umcg_task __user *self = tsk->umcg_task; + bool worker = tsk->flags & PF_UMCG_WORKER; + u32 state; + + /* avoid recursion vs schedule() */ + if (worker) + current->flags &= ~PF_UMCG_WORKER; + + if (get_user(state, &self->state)) + UMCG_DIE("get-state"); + + state &= UMCG_TASK_MASK | UMCG_TF_MASK; + if (state == UMCG_TASK_RUNNING) + goto done; + + /* + * See comment at UMCG_TF_COND_WAIT; TL;DR: user *will* call + * sys_umcg_wait() and signals/interrupts shouldn't block + * return-to-user. + */ + if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT)) + goto done; + + if (state & UMCG_TF_PREEMPT) { + if (umcg_pin_pages()) + UMCG_DIE("pin"); + + if (umcg_update_state(tsk, self, + UMCG_TASK_RUNNING, + UMCG_TASK_RUNNABLE)) + UMCG_DIE_UNPIN("state"); + + if (umcg_enqueue_runnable(tsk)) + UMCG_DIE_UNPIN("enqueue"); + + /* + * XXX do we want a preemption consuming ::next_tid ? + * I'm currently leaning towards no. + */ + if (umcg_wake_server(tsk)) + UMCG_DIE_UNPIN("wake-server"); + + umcg_unpin_pages(); + } + + switch (umcg_wait(0)) { + case -EFAULT: + case -EINVAL: + case -ETIMEDOUT: /* how!?! */ + default: + UMCG_DIE("wait"); + + case 0: + case -EINTR: + /* we'll will resume the wait after the signal */ + break; + } + +done: + if (worker) + current->flags |= PF_UMCG_WORKER; +} + +/** + * sys_umcg_kick: makes a UMCG task cycle through umcg_notify_resume() + * + * Returns: + * 0 - Ok; + * -ESRCH - not a related UMCG task + * -EINVAL - another error happened (unknown flags, etc..) + */ +SYSCALL_DEFINE2(umcg_kick, u32, flags, pid_t, tid) +{ + struct task_struct *task = umcg_get_task(tid); + if (!task) + return -ESRCH; + + if (flags) + return -EINVAL; + +#ifdef CONFIG_SMP + smp_send_reschedule(task_cpu(task)); +#endif + + return 0; +} + +/** + * sys_umcg_wait: transfer running context + * + * Block until RUNNING. Userspace must already set RUNNABLE to deal with the + * sleep condition races (see TF_COND_WAIT). + * + * Will wake either ::next_tid or ::server_tid to take our place. If this is a + * server then not setting ::next_tid will wake self. + * + * Returns: + * 0 - OK; + * -ETIMEDOUT - the timeout expired; + * -ERANGE - the timeout is out of range (worker); + * -EAGAIN - ::state wasn't RUNNABLE, concurrent wakeup; + * -EFAULT - failed accessing struct umcg_task __user of the current + * task, the server or next; + * -ESRCH - the task to wake not found or not a UMCG task; + * -EINVAL - another error happened (e.g. the current task is not a + * UMCG task, etc.) + */ +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo) +{ + struct task_struct *tsk = current; + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); + bool worker = tsk->flags & PF_UMCG_WORKER; + int ret; + + if (!self || flags) + return -EINVAL; + + if (worker) { + tsk->flags &= ~PF_UMCG_WORKER; + if (timo) + return -ERANGE; + } + + /* see umcg_sys_{enter,exit}() syscall exceptions */ + ret = umcg_pin_pages(); + if (ret) + goto unblock; + + /* + * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE. + */ + ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE); + if (ret) + goto unpin; + + if (worker) { + ret = umcg_enqueue_runnable(tsk); + if (ret) + goto unpin; + } + + if (worker) + ret = umcg_wake(tsk); + else if (tsk->umcg_next) + ret = umcg_wake_next(tsk); + + if (ret) { + /* + * XXX already enqueued ourself on ::server_tid; failing now + * leaves the lot in an inconsistent state since it'll also + * unblock self in order to return the error. !?!? + */ + goto unpin; + } + + umcg_unpin_pages(); + + ret = umcg_wait(timo); + switch (ret) { + case 0: /* all done */ + case -EINTR: /* umcg_notify_resume() will continue the wait */ + ret = 0; + break; + + default: + goto unblock; + } +out: + if (worker) + tsk->flags |= PF_UMCG_WORKER; + return ret; + +unpin: + umcg_unpin_pages(); +unblock: + /* + * Workers will still block in umcg_notify_resume() before they can + * consume their error, servers however need to get the error asap. + * + * Still, things might be unrecoverably screwy after this. Not our + * problem. + */ + if (!worker) + umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); + goto out; +} + +/** + * sys_umcg_ctl: (un)register the current task as a UMCG task. + * @flags: ORed values from enum umcg_ctl_flag; see below; + * @self: a pointer to struct umcg_task that describes this + * task and governs the behavior of sys_umcg_wait if + * registering; must be NULL if unregistering. + * + * @flags & UMCG_CTL_REGISTER: register a UMCG task: + * + * UMCG workers: + * - @flags & UMCG_CTL_WORKER + * - self->state must be UMCG_TASK_BLOCKED + * + * UMCG servers: + * - !(@flags & UMCG_CTL_WORKER) + * - self->state must be UMCG_TASK_RUNNING + * + * All tasks: + * - self->server_tid must be a valid server + * - self->next_tid must be zero + * + * If the conditions above are met, sys_umcg_ctl() immediately returns + * if the registered task is a server. If the registered task is a + * worker it will be added to it's server's runnable_workers_ptr list + * and the server will be woken. + * + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task + * is a UMCG worker, the userspace is responsible for waking its + * server (before or after calling sys_umcg_ctl). + * + * Return: + * 0 - success + * -EFAULT - failed to read @self + * -EINVAL - some other error occurred + * -ESRCH - no such server_tid + */ +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock) +{ + struct task_struct *server; + struct umcg_task ut; + + if ((unsigned long)self % UMCG_TASK_ALIGN) + return -EINVAL; + + if (flags & ~(UMCG_CTL_REGISTER | + UMCG_CTL_UNREGISTER | + UMCG_CTL_WORKER)) + return -EINVAL; + + if (flags == UMCG_CTL_UNREGISTER) { + if (self || !current->umcg_task) + return -EINVAL; + + if (current->flags & PF_UMCG_WORKER) + umcg_worker_exit(); + else + umcg_clear_task(current); + + return 0; + } + + if (!(flags & UMCG_CTL_REGISTER)) + return -EINVAL; + + flags &= ~UMCG_CTL_REGISTER; + + switch (which_clock) { + case CLOCK_REALTIME: + case CLOCK_MONOTONIC: + case CLOCK_BOOTTIME: + case CLOCK_TAI: + current->umcg_clock = which_clock; + break; + + default: + return -EINVAL; + } + + if (current->umcg_task || !self) + return -EINVAL; + + if (copy_from_user(&ut, self, sizeof(ut))) + return -EFAULT; + + if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2]) + return -EINVAL; + + rcu_read_lock(); + server = find_task_by_vpid(ut.server_tid); + if (server && server->mm == current->mm) { + if (flags == UMCG_CTL_WORKER) { + if (!server->umcg_task || + (server->flags & PF_UMCG_WORKER)) + server = NULL; + } else { + if (server != current) + server = NULL; + } + } else { + server = NULL; + } + rcu_read_unlock(); + + if (!server) + return -ESRCH; + + if (flags == UMCG_CTL_WORKER) { + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED) + return -EINVAL; + + WRITE_ONCE(current->umcg_task, self); + current->flags |= PF_UMCG_WORKER; /* hook schedule() */ + set_syscall_work(SYSCALL_UMCG); /* hook syscall */ + set_thread_flag(TIF_UMCG); /* hook return-to-user */ + + /* umcg_sys_exit() will transition to RUNNABLE and wait */ + + } else { + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING) + return -EINVAL; + + WRITE_ONCE(current->umcg_task, self); + set_thread_flag(TIF_UMCG); /* hook return-to-user */ + + /* umcg_notify_resume() would block if not RUNNING */ + } + + return 0; +} --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -273,6 +273,11 @@ COND_SYSCALL(landlock_create_ruleset); COND_SYSCALL(landlock_add_rule); COND_SYSCALL(landlock_restrict_self); +/* kernel/sched/umcg.c */ +COND_SYSCALL(umcg_ctl); +COND_SYSCALL(umcg_wait); +COND_SYSCALL(umcg_kick); + /* arch/example/kernel/sys_example.c */ /* mm/fadvise.c */ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2021-12-14 20:44 ` [RFC][PATCH 3/3] sched: User Mode Concurency Groups Peter Zijlstra @ 2021-12-21 17:19 ` Peter Oskolkov 2022-01-14 14:09 ` Peter Zijlstra 2022-01-17 13:04 ` Peter Zijlstra 2021-12-24 11:27 ` Peter Zijlstra 1 sibling, 2 replies; 40+ messages in thread From: Peter Oskolkov @ 2021-12-21 17:19 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Tue, Dec 14, 2021 at 09:44:48PM +0100, Peter Zijlstra wrote: I have looked at the code now in relative detail, and I have some comments below. They are relatively minor, so I'm not against having this merged, assuming we can work on remaining issues later. And, of course, the higher-level question re: waking servers on worker wakeup remains. The main concern, as I mention below, is that my version of this ensures that no more than one worker can be RUNNING per server, while this API essentially lets the server spawn a new worker on each unblock event, resulting in many workers running per single server. But we can wave this away as the userspace problem/error, I guess. I do need to test this patch(set) on a (semi)real workload to see if all use cases are covered properly, so what I'll do next is I'll hook this up to our userspace scheduling code and do extensive testing (with appropriate tweaks to the kernel code). Due to the sheer amount of userspace work this will entail, and the holidays, this will take at least several weeks. > User Managed Concurrency Groups is an M:N threading toolkit that allows > constructing user space schedulers designed to efficiently manage > heterogeneous in-process workloads while maintaining high CPU > utilization (95%+). > > XXX moar changelog explaining how this is moar awesome than > traditional user-space threading. > > The big thing that's missing is the SMP wake-to-remote-idle. I think the SMP remote-wake can be easily emulated in the userspace: when a server detects a wakeup due to a worker becoming RUNNABLE, rather than the current worker becoming BLOCKED, the server can do the wake-to-remote-idle thing in the userspace. Yes, this will be slower than what I have in my version of the patchset, but we can live with it for now, and do something later if a benchmark shows something really compelling. > > The big assumption this whole thing is build on is that > pin_user_pages() preserves user mappings in so far that > pagefault_disable() will never generate EFAULT (unless the user does > munmap() in which case it can keep the pieces). > > shrink_page_list() does page_maybe_dma_pinned() before try_to_unmap() > and as such seems to respect this constraint. > > unmap_and_move() however seems willing to unmap otherwise pinned (and > hence unmigratable) pages. This might need fixing. > > Originally-by: Peter Oskolkov <posk@google.com> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > --- > arch/x86/Kconfig | 1 > arch/x86/entry/syscalls/syscall_64.tbl | 3 > arch/x86/include/asm/thread_info.h | 2 > fs/exec.c | 1 > include/linux/entry-common.h | 6 > include/linux/sched.h | 86 +++ > include/linux/syscalls.h | 4 > include/linux/thread_info.h | 2 > include/uapi/asm-generic/unistd.h | 9 > include/uapi/linux/umcg.h | 143 +++++ > init/Kconfig | 15 > kernel/entry/common.c | 18 > kernel/exit.c | 5 > kernel/sched/Makefile | 1 > kernel/sched/core.c | 9 > kernel/sched/umcg.c | 868 +++++++++++++++++++++++++++++++++ > kernel/sys_ni.c | 5 > 17 files changed, 1171 insertions(+), 7 deletions(-) > > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -248,6 +248,7 @@ config X86 > select HAVE_RSEQ > select HAVE_SYSCALL_TRACEPOINTS > select HAVE_UNSTABLE_SCHED_CLOCK > + select HAVE_UMCG if X86_64 > select HAVE_USER_RETURN_NOTIFIER > select HAVE_GENERIC_VDSO > select HOTPLUG_SMT if SMP > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -371,6 +371,9 @@ > 447 common memfd_secret sys_memfd_secret > 448 common process_mrelease sys_process_mrelease > 449 common futex_waitv sys_futex_waitv > +450 common umcg_ctl sys_umcg_ctl > +451 common umcg_wait sys_umcg_wait > +452 common umcg_kick sys_umcg_kick > > # > # Due to a historical design error, certain syscalls are numbered differently > --- a/arch/x86/include/asm/thread_info.h > +++ b/arch/x86/include/asm/thread_info.h > @@ -83,6 +83,7 @@ struct thread_info { > #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ > #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ > #define TIF_SSBD 5 /* Speculative store bypass disable */ > +#define TIF_UMCG 6 /* UMCG return to user hook */ > #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ > #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ > #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ > @@ -107,6 +108,7 @@ struct thread_info { > #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) > #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) > #define _TIF_SSBD (1 << TIF_SSBD) > +#define _TIF_UMCG (1 << TIF_UMCG) > #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) > #define _TIF_SPEC_L1D_FLUSH (1 << TIF_SPEC_L1D_FLUSH) > #define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY) > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1838,6 +1838,7 @@ static int bprm_execve(struct linux_binp > current->fs->in_exec = 0; > current->in_execve = 0; > rseq_execve(current); > + umcg_execve(current); > acct_update_integrals(current); > task_numa_free(current, false); > return retval; > --- a/include/linux/entry-common.h > +++ b/include/linux/entry-common.h > @@ -22,6 +22,10 @@ > # define _TIF_UPROBE (0) > #endif > > +#ifndef _TIF_UMCG > +# define _TIF_UMCG (0) > +#endif > + > /* > * SYSCALL_WORK flags handled in syscall_enter_from_user_mode() > */ > @@ -42,11 +46,13 @@ > SYSCALL_WORK_SYSCALL_EMU | \ > SYSCALL_WORK_SYSCALL_AUDIT | \ > SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ > + SYSCALL_WORK_SYSCALL_UMCG | \ > ARCH_SYSCALL_WORK_ENTER) > #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \ > SYSCALL_WORK_SYSCALL_TRACE | \ > SYSCALL_WORK_SYSCALL_AUDIT | \ > SYSCALL_WORK_SYSCALL_USER_DISPATCH | \ > + SYSCALL_WORK_SYSCALL_UMCG | \ > SYSCALL_WORK_SYSCALL_EXIT_TRAP | \ > ARCH_SYSCALL_WORK_EXIT) > > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -67,6 +67,7 @@ struct sighand_struct; > struct signal_struct; > struct task_delay_info; > struct task_group; > +struct umcg_task; > > /* > * Task state bitmask. NOTE! These bits are also > @@ -1294,6 +1295,23 @@ struct task_struct { > unsigned long rseq_event_mask; > #endif > > +#ifdef CONFIG_UMCG > + /* setup by sys_umcg_ctrl() */ > + clockid_t umcg_clock; > + struct umcg_task __user *umcg_task; > + > + /* setup by umcg_pin_enter() */ > + struct page *umcg_worker_page; > + > + struct task_struct *umcg_server; > + struct umcg_task __user *umcg_server_task; > + struct page *umcg_server_page; > + > + struct task_struct *umcg_next; > + struct umcg_task __user *umcg_next_task; > + struct page *umcg_next_page; > +#endif > + > struct tlbflush_unmap_batch tlb_ubc; > > union { > @@ -1687,6 +1705,13 @@ extern struct pid *cad_pid; > #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ > #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ > + > +#ifdef CONFIG_UMCG > +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */ > +#else > +#define PF_UMCG_WORKER 0x00000000 > +#endif > + > #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ > #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ > #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */ > @@ -2294,6 +2319,67 @@ static inline void rseq_execve(struct ta > { > } > > +#endif > + > +#ifdef CONFIG_UMCG > + > +extern void umcg_sys_enter(struct pt_regs *regs, long syscall); > +extern void umcg_sys_exit(struct pt_regs *regs); > +extern void umcg_notify_resume(struct pt_regs *regs); > +extern void umcg_worker_exit(void); > +extern void umcg_clear_child(struct task_struct *tsk); > + > +/* Called by bprm_execve() in fs/exec.c. */ > +static inline void umcg_execve(struct task_struct *tsk) > +{ > + if (tsk->umcg_task) > + umcg_clear_child(tsk); > +} > + > +/* Called by do_exit() in kernel/exit.c. */ > +static inline void umcg_handle_exit(void) > +{ > + if (current->flags & PF_UMCG_WORKER) > + umcg_worker_exit(); > +} > + > +/* > + * umcg_wq_worker_[sleeping|running] are called in core.c by > + * sched_submit_work() and sched_update_worker(). > + */ > +extern void umcg_wq_worker_sleeping(struct task_struct *tsk); > +extern void umcg_wq_worker_running(struct task_struct *tsk); > + > +#else /* CONFIG_UMCG */ > + > +static inline void umcg_sys_enter(struct pt_regs *regs, long syscall) > +{ > +} > + > +static inline void umcg_sys_exit(struct pt_regs *regs) > +{ > +} > + > +static inline void umcg_notify_resume(struct pt_regs *regs) > +{ > +} > + > +static inline void umcg_clear_child(struct task_struct *tsk) > +{ > +} > +static inline void umcg_execve(struct task_struct *tsk) > +{ > +} > +static inline void umcg_handle_exit(void) > +{ > +} > +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk) > +{ > +} > +static inline void umcg_wq_worker_running(struct task_struct *tsk) > +{ > +} > + > #endif > > #ifdef CONFIG_DEBUG_RSEQ > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -72,6 +72,7 @@ struct open_how; > struct mount_attr; > struct landlock_ruleset_attr; > enum landlock_rule_type; > +struct umcg_task; > > #include <linux/types.h> > #include <linux/aio_abi.h> > @@ -1057,6 +1058,9 @@ asmlinkage long sys_landlock_add_rule(in > const void __user *rule_attr, __u32 flags); > asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags); > asmlinkage long sys_memfd_secret(unsigned int flags); > +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self, clockid_t which_clock); > +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout); > +asmlinkage long sys_umcg_kick(u32 flags, pid_t tid); > > /* > * Architecture-specific system calls > --- a/include/linux/thread_info.h > +++ b/include/linux/thread_info.h > @@ -46,6 +46,7 @@ enum syscall_work_bit { > SYSCALL_WORK_BIT_SYSCALL_AUDIT, > SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH, > SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP, > + SYSCALL_WORK_BIT_SYSCALL_UMCG, > }; > > #define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP) > @@ -55,6 +56,7 @@ enum syscall_work_bit { > #define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT) > #define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH) > #define SYSCALL_WORK_SYSCALL_EXIT_TRAP BIT(SYSCALL_WORK_BIT_SYSCALL_EXIT_TRAP) > +#define SYSCALL_WORK_SYSCALL_UMCG BIT(SYSCALL_WORK_BIT_SYSCALL_UMCG) > #endif > > #include <asm/thread_info.h> > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -883,8 +883,15 @@ __SYSCALL(__NR_process_mrelease, sys_pro > #define __NR_futex_waitv 449 > __SYSCALL(__NR_futex_waitv, sys_futex_waitv) > > +#define __NR_umcg_ctl 450 > +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl) > +#define __NR_umcg_wait 451 > +__SYSCALL(__NR_umcg_wait, sys_umcg_wait) > +#define __NR_umcg_kick 452 > +__SYSCALL(__NR_umcg_kick, sys_umcg_kick) > + > #undef __NR_syscalls > -#define __NR_syscalls 450 > +#define __NR_syscalls 453 > > /* > * 32 bit systems traditionally used different > --- /dev/null > +++ b/include/uapi/linux/umcg.h > @@ -0,0 +1,143 @@ > +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ > +#ifndef _UAPI_LINUX_UMCG_H > +#define _UAPI_LINUX_UMCG_H > + > +#include <linux/types.h> > + > +/* > + * UMCG: User Managed Concurrency Groups. > + * > + * Syscalls (see kernel/sched/umcg.c): > + * sys_umcg_ctl() - register/unregister UMCG tasks; > + * sys_umcg_wait() - wait/wake/context-switch. > + * sys_umcg_kick() - prod a UMCG task > + * > + * struct umcg_task (below): controls the state of UMCG tasks. > + */ > + > +/* > + * UMCG task states, the first 6 bits of struct umcg_task.state_ts. > + * The states represent the user space point of view. > + * > + * ,--------(TF_PREEMPT + notify_resume)-------. ,------------. > + * | v | | > + * RUNNING -(schedule)-> BLOCKED -(sys_exit)-> RUNNABLE (signal + notify_resume) > + * ^ | ^ | > + * `--------------(sys_umcg_wait)--------------' `------------' > + * > + */ > +#define UMCG_TASK_NONE 0x0000U > +#define UMCG_TASK_RUNNING 0x0001U > +#define UMCG_TASK_RUNNABLE 0x0002U > +#define UMCG_TASK_BLOCKED 0x0003U > + > +#define UMCG_TASK_MASK 0x00ffU > + > +/* > + * UMCG_TF_PREEMPT: userspace indicates the worker should be preempted. > + * > + * Must only be set on UMCG_TASK_RUNNING; once set, any subsequent > + * return-to-user (eg sys_umcg_kick()) will perform the equivalent of > + * sys_umcg_wait() on it. That is, it will wake next_tid/server_tid, transfer > + * to RUNNABLE and enqueue on the server's runnable list. > + */ > +#define UMCG_TF_PREEMPT 0x0100U > +/* > + * UMCG_TF_COND_WAIT: indicate the task *will* call sys_umcg_wait() > + * > + * Enables server loops like (vs umcg_sys_exit()): > + * > + * for(;;) { > + * self->status = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; > + * // smp_mb() implied by xchg() > + * > + * runnable_ptr = xchg(self->runnable_workers_ptr, NULL); > + * while (runnable_ptr) { > + * next = runnable_ptr->runnable_workers_ptr; > + * > + * umcg_server_add_runnable(self, runnable_ptr); > + * > + * runnable_ptr = next; > + * } > + * > + * self->next = umcg_server_pick_next(self); > + * sys_umcg_wait(0, 0); > + * } > + * > + * without a signal or interrupt in between setting umcg_task::state and > + * sys_umcg_wait() resulting in an infinite wait in umcg_notify_resume(). > + */ > +#define UMCG_TF_COND_WAIT 0x0200U > + > +#define UMCG_TF_MASK 0xff00U > + > +#define UMCG_TASK_ALIGN 64 > + > +/** > + * struct umcg_task - controls the state of UMCG tasks. > + * > + * The struct is aligned at 64 bytes to ensure that it fits into > + * a single cache line. > + */ > +struct umcg_task { > + /** > + * @state_ts: the current state of the UMCG task described by > + * this struct, with a unique timestamp indicating > + * when the last state change happened. > + * > + * Readable/writable by both the kernel and the userspace. > + * > + * UMCG task state: > + * bits 0 - 7: task state; > + * bits 8 - 15: state flags; > + * bits 16 - 31: for userspace use; > + */ > + __u32 state; /* r/w */ > + > + /** > + * @next_tid: the TID of the UMCG task that should be context-switched > + * into in sys_umcg_wait(). Can be zero, in which case > + * it'll switch to server_tid. > + * > + * @server_tid: the TID of the UMCG server that hosts this task, > + * when RUNNABLE this task will get added to it's > + * runnable_workers_ptr list. > + * > + * Read-only for the kernel, read/write for the userspace. > + */ > + __u32 next_tid; /* r */ > + __u32 server_tid; /* r */ > + > + __u32 __hole[1]; > + > + /* > + * Timestamps for when last we became BLOCKED, RUNNABLE, in CLOCK_MONOTONIC. > + */ > + __u64 blocked_ts; /* w */ > + __u64 runnable_ts; /* w */ > + > + /** > + * @runnable_workers_ptr: a single-linked list of runnable workers. > + * > + * Readable/writable by both the kernel and the userspace: the > + * kernel adds items to the list, userspace removes them. > + */ > + __u64 runnable_workers_ptr; /* r/w */ > + > + __u64 __zero[3]; > + > +} __attribute__((packed, aligned(UMCG_TASK_ALIGN))); > + > +/** > + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl > + * @UMCG_CTL_REGISTER: register the current task as a UMCG task > + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task > + * @UMCG_CTL_WORKER: register the current task as a UMCG worker > + */ > +enum umcg_ctl_flag { > + UMCG_CTL_REGISTER = 0x00001, > + UMCG_CTL_UNREGISTER = 0x00002, > + UMCG_CTL_WORKER = 0x10000, > +}; > + > +#endif /* _UAPI_LINUX_UMCG_H */ > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1686,6 +1686,21 @@ config MEMBARRIER > > If unsure, say Y. > > +config HAVE_UMCG > + bool > + > +config UMCG > + bool "Enable User Managed Concurrency Groups API" > + depends on 64BIT > + depends on GENERIC_ENTRY > + depends on HAVE_UMCG > + default n > + help > + Enable User Managed Concurrency Groups API, which form the basis > + for an in-process M:N userspace scheduling framework. > + At the moment this is an experimental/RFC feature that is not > + guaranteed to be backward-compatible. > + > config KALLSYMS > bool "Load all symbols for debugging/ksymoops" if EXPERT > default y > --- a/kernel/entry/common.c > +++ b/kernel/entry/common.c > @@ -6,6 +6,7 @@ > #include <linux/livepatch.h> > #include <linux/audit.h> > #include <linux/tick.h> > +#include <linux/sched.h> > > #include "common.h" > > @@ -76,6 +77,9 @@ static long syscall_trace_enter(struct p > if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) > trace_sys_enter(regs, syscall); > > + if (work & SYSCALL_WORK_SYSCALL_UMCG) > + umcg_sys_enter(regs, syscall); > + > syscall_enter_audit(regs, syscall); > > return ret ? : syscall; > @@ -155,8 +159,7 @@ static unsigned long exit_to_user_mode_l > * Before returning to user space ensure that all pending work > * items have been completed. > */ > - while (ti_work & EXIT_TO_USER_MODE_WORK) { > - > + do { > local_irq_enable_exit_to_user(ti_work); > > if (ti_work & _TIF_NEED_RESCHED) > @@ -168,6 +171,10 @@ static unsigned long exit_to_user_mode_l > if (ti_work & _TIF_PATCH_PENDING) > klp_update_patch_state(current); > > + /* must be before handle_signal_work(); terminates on sigpending */ > + if (ti_work & _TIF_UMCG) > + umcg_notify_resume(regs); > + > if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) > handle_signal_work(regs, ti_work); > > @@ -188,7 +195,7 @@ static unsigned long exit_to_user_mode_l > tick_nohz_user_enter_prepare(); > > ti_work = read_thread_flags(); > - } > + } while (ti_work & EXIT_TO_USER_MODE_WORK); > > /* Return the latest work state for arch_exit_to_user_mode() */ > return ti_work; > @@ -203,7 +210,7 @@ static void exit_to_user_mode_prepare(st > /* Flush pending rcuog wakeup before the last need_resched() check */ > tick_nohz_user_enter_prepare(); > > - if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) > + if (unlikely(ti_work & (EXIT_TO_USER_MODE_WORK | _TIF_UMCG))) > ti_work = exit_to_user_mode_loop(regs, ti_work); > > arch_exit_to_user_mode_prepare(regs, ti_work); > @@ -253,6 +260,9 @@ static void syscall_exit_work(struct pt_ > step = report_single_step(work); > if (step || work & SYSCALL_WORK_SYSCALL_TRACE) > arch_syscall_exit_tracehook(regs, step); > + > + if (work & SYSCALL_WORK_SYSCALL_UMCG) > + umcg_sys_exit(regs); > } > > /* > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -749,6 +749,10 @@ void __noreturn do_exit(long code) > if (unlikely(!tsk->pid)) > panic("Attempted to kill the idle task!"); > > + /* Turn off UMCG sched hooks. */ > + if (unlikely(tsk->flags & PF_UMCG_WORKER)) > + tsk->flags &= ~PF_UMCG_WORKER; > + > /* > * If do_exit is called because this processes oopsed, it's possible > * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before > @@ -786,6 +790,7 @@ void __noreturn do_exit(long code) > > io_uring_files_cancel(); > exit_signals(tsk); /* sets PF_EXITING */ > + umcg_handle_exit(); > > /* sync mm's RSS info before statistics gathering */ > if (tsk->mm) > --- a/kernel/sched/Makefile > +++ b/kernel/sched/Makefile > @@ -41,3 +41,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o > obj-$(CONFIG_CPU_ISOLATION) += isolation.o > obj-$(CONFIG_PSI) += psi.o > obj-$(CONFIG_SCHED_CORE) += core_sched.o > +obj-$(CONFIG_UMCG) += umcg.o > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4272,6 +4272,7 @@ static void __sched_fork(unsigned long c > p->wake_entry.u_flags = CSD_TYPE_TTWU; > p->migration_pending = NULL; > #endif > + umcg_clear_child(p); > } > > DEFINE_STATIC_KEY_FALSE(sched_numa_balancing); > @@ -6330,9 +6331,11 @@ static inline void sched_submit_work(str > * If a worker goes to sleep, notify and ask workqueue whether it > * wants to wake up a task to maintain concurrency. > */ > - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) { > + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { > if (task_flags & PF_WQ_WORKER) > wq_worker_sleeping(tsk); > + else if (task_flags & PF_UMCG_WORKER) > + umcg_wq_worker_sleeping(tsk); > else > io_wq_worker_sleeping(tsk); > } > @@ -6350,9 +6353,11 @@ static inline void sched_submit_work(str > > static void sched_update_worker(struct task_struct *tsk) > { > - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { > + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) { > if (tsk->flags & PF_WQ_WORKER) > wq_worker_running(tsk); > + else if (tsk->flags & PF_UMCG_WORKER) > + umcg_wq_worker_running(tsk); > else > io_wq_worker_running(tsk); > } > --- /dev/null > +++ b/kernel/sched/umcg.c > @@ -0,0 +1,868 @@ > +// SPDX-License-Identifier: GPL-2.0-only > + > +/* > + * User Managed Concurrency Groups (UMCG). > + * > + */ > + > +#include <linux/syscalls.h> > +#include <linux/types.h> > +#include <linux/uaccess.h> > +#include <linux/umcg.h> > + > +#include <asm/syscall.h> > + > +#include "sched.h" > + > +static struct task_struct *umcg_get_task(u32 tid) > +{ > + struct task_struct *tsk = NULL; > + > + if (tid) { > + rcu_read_lock(); > + tsk = find_task_by_vpid(tid); > + if (tsk && current->mm == tsk->mm && tsk->umcg_task) This essentially limits all operations to a single mm/process. Fine for now, but a fast remote context switch is also a valid use case. It is not directly related to userspace scheduling, so I'm just mentioning it here. Maybe server-to-server cross-process context switches should be allowed for the same user/cgroup? (Again, this is for later to consider). > + get_task_struct(tsk); > + else > + tsk = NULL; > + rcu_read_unlock(); > + } > + > + return tsk; > +} > + > +/** > + * umcg_pin_pages: pin pages containing struct umcg_task of > + * this task, its server (possibly this task again) > + * and the next (possibly NULL). > + */ > +static int umcg_pin_pages(void) > +{ > + struct task_struct *server = NULL, *next = NULL, *tsk = current; > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > + int server_tid, next_tid; > + int ret; > + > + /* must not have stale state */ > + if (WARN_ON_ONCE(tsk->umcg_worker_page || > + tsk->umcg_server_page || > + tsk->umcg_next_page || > + tsk->umcg_server_task || > + tsk->umcg_next_task || > + tsk->umcg_server || > + tsk->umcg_next)) > + return -EBUSY; > + > + ret = -EFAULT; > + if (pin_user_pages_fast((unsigned long)self, 1, 0, > + &tsk->umcg_worker_page) != 1) > + goto clear_self; > + > + if (get_user(server_tid, &self->server_tid)) > + goto unpin_self; > + > + ret = -ESRCH; > + server = umcg_get_task(server_tid); > + if (!server) > + goto unpin_self; > + > + ret = -EFAULT; > + /* must cache due to possible concurrent change vs access_ok() */ > + tsk->umcg_server_task = READ_ONCE(server->umcg_task); > + if (pin_user_pages_fast((unsigned long)tsk->umcg_server_task, 1, 0, > + &tsk->umcg_server_page) != 1) > + goto clear_server; > + > + tsk->umcg_server = server; > + > + if (get_user(next_tid, &self->next_tid)) > + goto unpin_server; > + > + if (!next_tid) > + goto done; > + > + ret = -ESRCH; > + next = umcg_get_task(next_tid); > + if (!next) > + goto unpin_server; > + > + ret = -EFAULT; > + tsk->umcg_next_task = READ_ONCE(next->umcg_task); > + if (pin_user_pages_fast((unsigned long)tsk->umcg_next_task, 1, 0, > + &tsk->umcg_next_page) != 1) > + goto clear_next; > + > + tsk->umcg_next = next; > + > +done: > + return 0; > + > +clear_next: > + tsk->umcg_next_task = NULL; > + tsk->umcg_next_page = NULL; > + > +unpin_server: > + unpin_user_page(tsk->umcg_server_page); > + > +clear_server: > + tsk->umcg_server_task = NULL; > + tsk->umcg_server_page = NULL; > + > +unpin_self: > + unpin_user_page(tsk->umcg_worker_page); > +clear_self: > + tsk->umcg_worker_page = NULL; > + > + return ret; > +} > + > +static void umcg_unpin_pages(void) > +{ > + struct task_struct *tsk = current; > + > + if (tsk->umcg_server) { > + unpin_user_page(tsk->umcg_worker_page); > + tsk->umcg_worker_page = NULL; > + > + unpin_user_page(tsk->umcg_server_page); > + tsk->umcg_server_page = NULL; > + tsk->umcg_server_task = NULL; > + > + put_task_struct(tsk->umcg_server); > + tsk->umcg_server = NULL; > + > + if (tsk->umcg_next) { > + unpin_user_page(tsk->umcg_next_page); > + tsk->umcg_next_page = NULL; > + tsk->umcg_next_task = NULL; > + > + put_task_struct(tsk->umcg_next); > + tsk->umcg_next = NULL; > + } > + } > +} > + > +static void umcg_clear_task(struct task_struct *tsk) > +{ > + /* > + * This is either called for the current task, or for a newly forked > + * task that is not yet running, so we don't need strict atomicity > + * below. > + */ > + if (tsk->umcg_task) { > + WRITE_ONCE(tsk->umcg_task, NULL); > + tsk->umcg_worker_page = NULL; > + > + tsk->umcg_server = NULL; > + tsk->umcg_server_page = NULL; > + tsk->umcg_server_task = NULL; > + > + tsk->umcg_next = NULL; > + tsk->umcg_next_page = NULL; > + tsk->umcg_next_task = NULL; > + > + tsk->flags &= ~PF_UMCG_WORKER; > + clear_task_syscall_work(tsk, SYSCALL_UMCG); > + clear_tsk_thread_flag(tsk, TIF_UMCG); > + } > +} > + > +/* Called for a forked or execve-ed child. */ > +void umcg_clear_child(struct task_struct *tsk) > +{ > + umcg_clear_task(tsk); > +} > + > +/* Called both by normally (unregister) and abnormally exiting workers. */ > +void umcg_worker_exit(void) > +{ > + umcg_unpin_pages(); > + umcg_clear_task(current); > +} > + > +/* > + * Do a state transition: @from -> @to. > + * > + * Will clear UMCG_TF_PREEMPT, UMCG_TF_COND_WAIT. > + * > + * When @to == {BLOCKED,RUNNABLE}, update timestamps. > + * > + * Returns: > + * 0: success > + * -EAGAIN: when self->state != @from > + * -EFAULT > + */ > +static int umcg_update_state(struct task_struct *tsk, > + struct umcg_task __user *self, > + u32 from, u32 to) > +{ > + u32 old, new; > + u64 now; > + > + if (to >= UMCG_TASK_RUNNABLE) { > + switch (tsk->umcg_clock) { > + case CLOCK_REALTIME: now = ktime_get_real_ns(); break; > + case CLOCK_MONOTONIC: now = ktime_get_ns(); break; > + case CLOCK_BOOTTIME: now = ktime_get_boottime_ns(); break; > + case CLOCK_TAI: now = ktime_get_clocktai_ns(); break; > + } > + } > + > + if (!user_access_begin(self, sizeof(*self))) > + return -EFAULT; > + > + unsafe_get_user(old, &self->state, Efault); > + do { > + if ((old & UMCG_TASK_MASK) != from) > + goto fail; > + > + new = old & ~(UMCG_TASK_MASK | > + UMCG_TF_PREEMPT | UMCG_TF_COND_WAIT); > + new |= to & UMCG_TASK_MASK; > + > + } while (!unsafe_try_cmpxchg_user(&self->state, &old, new, Efault)); > + > + if (to == UMCG_TASK_BLOCKED) > + unsafe_put_user(now, &self->blocked_ts, Efault); > + if (to == UMCG_TASK_RUNNABLE) > + unsafe_put_user(now, &self->runnable_ts, Efault); > + > + user_access_end(); > + return 0; > + > +fail: > + user_access_end(); > + return -EAGAIN; > + > +Efault: > + user_access_end(); > + return -EFAULT; > +} > + > +#define __UMCG_DIE(stmt, reason) do { \ > + stmt; \ > + pr_warn_ratelimited("%s: killing task %s/%d because: " reason "\n",\ > + __func__, current->comm, current->pid); \ > + force_sig(SIGKILL); \ > + return; \ > +} while (0) > + > +#define UMCG_DIE(reason) __UMCG_DIE(,reason) > +#define UMCG_DIE_PF(reason) __UMCG_DIE(pagefault_enable(), reason) > +#define UMCG_DIE_UNPIN(reason) __UMCG_DIE(umcg_unpin_pages(), reason) > + > +/* Called from syscall enter path */ > +void umcg_sys_enter(struct pt_regs *regs, long syscall) > +{ > + /* avoid recursion vs our own syscalls */ > + if (syscall == __NR_umcg_wait || > + syscall == __NR_umcg_ctl) > + return; > + > + /* avoid recursion vs schedule() */ > + current->flags &= ~PF_UMCG_WORKER; > + > + /* > + * Pin all the state on sys_enter() such that we can rely on it > + * from dodgy contexts. It is either unpinned from pre-schedule() > + * or sys_exit(), whichever comes first, thereby ensuring the pin > + * is temporary. > + */ > + if (umcg_pin_pages()) > + UMCG_DIE("pin"); > + > + current->flags |= PF_UMCG_WORKER; > +} > + > +static int umcg_wake_task(struct task_struct *tsk, struct umcg_task __user *self) > +{ > + int ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); > + if (ret) > + return ret; > + > + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU); > + return 0; > +} > + > +static int umcg_wake_next(struct task_struct *tsk) > +{ > + int ret = umcg_wake_task(tsk->umcg_next, tsk->umcg_next_task); > + if (ret) > + return ret; > + > + /* > + * If userspace sets umcg_task::next_tid, it needs to remove > + * that task from the ready-queue to avoid another server > + * selecting it. However, that also means it needs to put it > + * back in case it went unused. > + * > + * By clearing the field on use, userspace can detect this case > + * and DTRT. > + */ > + if (put_user(0u, &tsk->umcg_task->next_tid)) > + return -EFAULT; > + > + return 0; > +} > + > +static int umcg_wake_server(struct task_struct *tsk) > +{ > + int ret = umcg_wake_task(tsk->umcg_server, tsk->umcg_server_task); > + switch (ret) { > + case 0: > + case -EAGAIN: > + /* > + * Server could have timed-out or already be running > + * due to a runnable enqueue. See umcg_sys_exit(). > + */ > + break; > + > + default: > + return ret; > + } > + > + return 0; > +} > + > +/* > + * Wake ::next_tid or ::server_tid. > + * > + * Must be called in umcg_pin_pages() context, relies on > + * tsk->umcg_{server,next}. > + * > + * Returns: > + * 0: success > + * -EAGAIN > + * -EFAULT > + */ > +static int umcg_wake(struct task_struct *tsk) > +{ > + if (tsk->umcg_next) > + return umcg_wake_next(tsk); > + > + return umcg_wake_server(tsk); > +} > + > +/* pre-schedule() */ > +void umcg_wq_worker_sleeping(struct task_struct *tsk) > +{ > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > + > + /* Must not fault, mmap_sem might be held. */ > + pagefault_disable(); > + > + if (WARN_ON_ONCE(!tsk->umcg_server)) > + UMCG_DIE_PF("no server"); We can get here if a running worker (no pinned pages) gets a pagefault in the userspace. Is umcg_sys_enter() called for pagefaults? If not, we should not kill the worker; also the userspace won't be able to detect this worker blocking on a pagefault... Why don't you like my approach of pinning pages on exit_to_userspace and unpinning on going to sleep? Yes, the pins will last longer, but only for scheduled on CPU tasks, so bounded both by time and number (of course, if umcg_sys_enter() is called on pagefaults/signals/other interrupts, pinning in umcg_sys_enter() is better). On the other hand, doing nothing on pagefaults and similar, and having to only worry about blocking in syscalls, does make things much simpler (no unexpected concurrency and such). I think most of the things you found complicated in my patchset, other than the SMP remote-idle wakeup, were driven by making sure spurious pagefaults are properly handled. I can't tell now whether keeping workers RUNNING during pagefaults vs waking their servers to run pending workers is a net gain or loss re: performance. I'll have to benchmark this when my large test is ready. > + > + if (umcg_update_state(tsk, self, UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED)) > + UMCG_DIE_PF("state"); > + > + if (umcg_wake(tsk)) > + UMCG_DIE_PF("wake"); So this works _if_ pagefaults do not go here, as a task switching to next will do so in sys_umcg_wait and thus elide this code. But if pagefaults do trigger this code, this is wrong, as the server should be woken on a pagefault, not next: the worker setting next_tid will then call sys_umcg_wait(), expecting to context-switch. So: if pagefaults lead here via umcg_sys_enter(), we should wake the server here, not next. If pagefaults do not trigger umcg_sys_enter(), then we should not kill the task above with "no server". > + > + pagefault_enable(); > + > + /* > + * We're going to sleep, make sure to unpin the pages, this ensures > + * the pins are temporary. Also see umcg_sys_exit(). > + */ > + umcg_unpin_pages(); > +} > + > +/* post-schedule() */ > +void umcg_wq_worker_running(struct task_struct *tsk) > +{ > + /* nothing here, see umcg_sys_exit() */ > +} > + > +/* > + * Enqueue @tsk on it's server's runnable list > + * > + * Must be called in umcg_pin_pages() context, relies on tsk->umcg_server. > + * > + * cmpxchg based single linked list add such that list integrity is never > + * violated. Userspace *MUST* remove it from the list before changing ->state. > + * As such, we must change state to RUNNABLE before enqueue. > + * > + * Returns: > + * 0: success > + * -EFAULT > + */ > +static int umcg_enqueue_runnable(struct task_struct *tsk) > +{ > + struct umcg_task __user *server = tsk->umcg_server_task; > + struct umcg_task __user *self = tsk->umcg_task; > + u64 self_ptr = (unsigned long)self; > + u64 first_ptr; > + > + /* > + * umcg_pin_pages() did access_ok() on both pointers, use self here > + * only because __user_access_begin() isn't available in generic code. > + */ > + if (!user_access_begin(self, sizeof(*self))) > + return -EFAULT; > + > + unsafe_get_user(first_ptr, &server->runnable_workers_ptr, Efault); > + do { > + unsafe_put_user(first_ptr, &self->runnable_workers_ptr, Efault); > + } while (!unsafe_try_cmpxchg_user(&server->runnable_workers_ptr, &first_ptr, self_ptr, Efault)); > + > + user_access_end(); > + return 0; > + > +Efault: > + user_access_end(); > + return -EFAULT; > +} > + > +/* > + * umcg_wait: Wait for ->state to become RUNNING > + * > + * Returns: > + * 0 - success > + * -EINTR - pending signal > + * -EINVAL - ::state is not {RUNNABLE,RUNNING} > + * -ETIMEDOUT > + * -EFAULT > + */ > +int umcg_wait(u64 timo) > +{ > + struct task_struct *tsk = current; > + struct umcg_task __user *self = tsk->umcg_task; > + struct page *page = NULL; > + u32 state; > + int ret; > + > + for (;;) { > + set_current_state(TASK_INTERRUPTIBLE); > + > + ret = -EINTR; > + if (signal_pending(current)) > + break; > + > + /* > + * Faults can block and scribble our wait state. > + */ > + pagefault_disable(); > + if (get_user(state, &self->state)) { > + pagefault_enable(); > + > + ret = -EFAULT; > + if (page) { > + unpin_user_page(page); > + page = NULL; > + break; > + } > + > + if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) { I believe that the task should not be TASK_INTERRUPTIBLE here, as pin_user_pages_fast may fault, and might_fault complains via __might_sleep. > + page = NULL; > + break; > + } > + > + continue; > + } > + > + if (page) { > + unpin_user_page(page); > + page = NULL; > + } > + pagefault_enable(); > + > + state &= UMCG_TASK_MASK; > + if (state != UMCG_TASK_RUNNABLE) { > + ret = 0; > + if (state == UMCG_TASK_RUNNING) > + break; > + > + ret = -EINVAL; > + break; > + } > + > + if (!schedule_hrtimeout_range_clock(timo ? &timo : NULL, > + tsk->timer_slack_ns, > + HRTIMER_MODE_ABS, > + tsk->umcg_clock)) { > + ret = -ETIMEDOUT; > + break; > + } > + } > + __set_current_state(TASK_RUNNING); > + > + return ret; > +} > + > +void umcg_sys_exit(struct pt_regs *regs) > +{ > + struct task_struct *tsk = current; > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > + long syscall = syscall_get_nr(tsk, regs); > + > + if (syscall == __NR_umcg_wait) > + return; > + > + /* > + * sys_umcg_ctl() will get here without having called umcg_sys_enter() > + * as such it will look like a syscall that blocked. > + */ > + > + if (tsk->umcg_server) { > + /* > + * Didn't block, we done. > + */ > + umcg_unpin_pages(); > + return; > + } > + > + /* avoid recursion vs schedule() */ > + current->flags &= ~PF_UMCG_WORKER; > + > + if (umcg_pin_pages()) > + UMCG_DIE("pin"); > + > + if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE)) > + UMCG_DIE_UNPIN("state"); > + > + if (umcg_enqueue_runnable(tsk)) > + UMCG_DIE_UNPIN("enqueue"); > + > + /* Server might not be RUNNABLE, means it's already running */ > + if (umcg_wake_server(tsk)) > + UMCG_DIE_UNPIN("wake-server"); So this here breaks the assumption that servers+workers never run on more CPUs than the number of servers, which I've gone through a lot of pain to ensure in my patchset. I think the assumption is based on the idea that a process using UMCG will get affined to N CPUs, will have N servers and a number of workers, and they will all happily cooperate and not get any extra threads running. Of course the pretty picture was not completely true, as the unblocked tasks do consume extra threads in the kernel, though never in the userspace. So this patch may result in all servers running due to wakeups in umcg_sys_exit(), with also their currently designated workers running as well, so the userspace may see N+N running threads. For now I think this may be OK, but as I mentioned above, I need to run a larger test with a real workload to see if anything is missing. What does worry me is that in this wakeup the server calls sys_umcg_wait() with another worker in next_tid, so now the server will have two workers running: the current kernel API seems to allow this to happen. In my patchset the invariant that no more than one worker running per server was enforced by the kernel. > + > + umcg_unpin_pages(); > + > + switch (umcg_wait(0)) { > + case -EFAULT: > + case -EINVAL: > + case -ETIMEDOUT: /* how!?! */ > + default: This "default" coming before "case 0" below feels weird... can we do switch (umcg_wait()) { case 0: case -EINTR: /* ... */ break; default: UMCG_DIE("wait"); } > + UMCG_DIE("wait"); > + > + case 0: > + case -EINTR: > + /* notify_resume will continue the wait after the signal */ > + break; > + } > + > + current->flags |= PF_UMCG_WORKER; > +} > + > +void umcg_notify_resume(struct pt_regs *regs) > +{ > + struct task_struct *tsk = current; > + struct umcg_task __user *self = tsk->umcg_task; > + bool worker = tsk->flags & PF_UMCG_WORKER; > + u32 state; > + > + /* avoid recursion vs schedule() */ > + if (worker) > + current->flags &= ~PF_UMCG_WORKER; > + > + if (get_user(state, &self->state)) > + UMCG_DIE("get-state"); > + > + state &= UMCG_TASK_MASK | UMCG_TF_MASK; > + if (state == UMCG_TASK_RUNNING) > + goto done; > + > + /* > + * See comment at UMCG_TF_COND_WAIT; TL;DR: user *will* call > + * sys_umcg_wait() and signals/interrupts shouldn't block > + * return-to-user. > + */ > + if (state == (UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT)) > + goto done; > + > + if (state & UMCG_TF_PREEMPT) { > + if (umcg_pin_pages()) > + UMCG_DIE("pin"); > + > + if (umcg_update_state(tsk, self, > + UMCG_TASK_RUNNING, > + UMCG_TASK_RUNNABLE)) > + UMCG_DIE_UNPIN("state"); > + > + if (umcg_enqueue_runnable(tsk)) > + UMCG_DIE_UNPIN("enqueue"); > + > + /* > + * XXX do we want a preemption consuming ::next_tid ? > + * I'm currently leaning towards no. I don't think so: preemption is a sched-type event, so a server should handle it; next_tid has nothing to do with it. > + */ > + if (umcg_wake_server(tsk)) > + UMCG_DIE_UNPIN("wake-server"); > + > + umcg_unpin_pages(); > + } > + > + switch (umcg_wait(0)) { > + case -EFAULT: > + case -EINVAL: > + case -ETIMEDOUT: /* how!?! */ > + default: > + UMCG_DIE("wait"); Same suggestion re: putting case 0/-EINTR first. > + > + case 0: > + case -EINTR: > + /* we'll will resume the wait after the signal */ > + break; > + } > + > +done: > + if (worker) > + current->flags |= PF_UMCG_WORKER; > +} > + > +/** > + * sys_umcg_kick: makes a UMCG task cycle through umcg_notify_resume() > + * > + * Returns: > + * 0 - Ok; > + * -ESRCH - not a related UMCG task > + * -EINVAL - another error happened (unknown flags, etc..) > + */ > +SYSCALL_DEFINE2(umcg_kick, u32, flags, pid_t, tid) > +{ > + struct task_struct *task = umcg_get_task(tid); > + if (!task) > + return -ESRCH; > + > + if (flags) > + return -EINVAL; > + > +#ifdef CONFIG_SMP > + smp_send_reschedule(task_cpu(task)); > +#endif > + > + return 0; > +} > + > +/** > + * sys_umcg_wait: transfer running context > + * > + * Block until RUNNING. Userspace must already set RUNNABLE to deal with the > + * sleep condition races (see TF_COND_WAIT). > + * > + * Will wake either ::next_tid or ::server_tid to take our place. If this is a > + * server then not setting ::next_tid will wake self. > + * > + * Returns: > + * 0 - OK; > + * -ETIMEDOUT - the timeout expired; > + * -ERANGE - the timeout is out of range (worker); > + * -EAGAIN - ::state wasn't RUNNABLE, concurrent wakeup; > + * -EFAULT - failed accessing struct umcg_task __user of the current > + * task, the server or next; > + * -ESRCH - the task to wake not found or not a UMCG task; > + * -EINVAL - another error happened (e.g. the current task is not a > + * UMCG task, etc.) > + */ > +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo) > +{ > + struct task_struct *tsk = current; > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > + bool worker = tsk->flags & PF_UMCG_WORKER; > + int ret; > + > + if (!self || flags) > + return -EINVAL; > + > + if (worker) { > + tsk->flags &= ~PF_UMCG_WORKER; > + if (timo) > + return -ERANGE; Worker sleeps timing out is a valid and a real use case. Similar to futex timeouts, mutex timeouts, condvar timeouts. I do not believe there is a fundamental problem here, so I'll add worker timeout handling in my larger test. In addition, shouldn't we NOT clear PF_UMCG_WORKER flag if we return an error? > + } > + > + /* see umcg_sys_{enter,exit}() syscall exceptions */ > + ret = umcg_pin_pages(); I do not think we need to pin pages for servers, only for workers. Yes, this makes things easier/simpler, so ok for now, but maybe later we will need to be a bit more fine-grained here. > + if (ret) > + goto unblock; > + > + /* > + * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE. > + */ > + ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE); > + if (ret) > + goto unpin; > + > + if (worker) { > + ret = umcg_enqueue_runnable(tsk); > + if (ret) > + goto unpin; > + } > + > + if (worker) Should this "if" be merged with the one above? > + ret = umcg_wake(tsk); > + else if (tsk->umcg_next) > + ret = umcg_wake_next(tsk); > + > + if (ret) { > + /* > + * XXX already enqueued ourself on ::server_tid; failing now > + * leaves the lot in an inconsistent state since it'll also > + * unblock self in order to return the error. !?!? > + */ It looks like only EFAULT can be here. I'd ensure that, and then just DIE. > + goto unpin; > + } > + > + umcg_unpin_pages(); > + > + ret = umcg_wait(timo); > + switch (ret) { > + case 0: /* all done */ > + case -EINTR: /* umcg_notify_resume() will continue the wait */ > + ret = 0; > + break; Why not let workers have timeouts, and keep -ETIMEDOUT here? Just set UMCG_TF_PREEMPT, or another flag with similar behavior, and umcg_notify_resume will properly wake the server? > + > + default: > + goto unblock; > + } > +out: > + if (worker) > + tsk->flags |= PF_UMCG_WORKER; > + return ret; > + > +unpin: > + umcg_unpin_pages(); > +unblock: > + /* > + * Workers will still block in umcg_notify_resume() before they can > + * consume their error, servers however need to get the error asap. > + * > + * Still, things might be unrecoverably screwy after this. Not our > + * problem. I think we should explicitly document the unrecoverable screwiness of errors here, so that the userspace proactively kills itself to avoid badness. The only reason that returning an error here is mildly preferable to just killing the task (we already do that in other places) is to give the userspace an opportunity to log an error, with more state/info than we can do here. > + */ > + if (!worker) > + umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); > + goto out; > +} > + > +/** > + * sys_umcg_ctl: (un)register the current task as a UMCG task. > + * @flags: ORed values from enum umcg_ctl_flag; see below; > + * @self: a pointer to struct umcg_task that describes this > + * task and governs the behavior of sys_umcg_wait if > + * registering; must be NULL if unregistering. @which_clock is not documented. Why do we need the option in the first place? > + * > + * @flags & UMCG_CTL_REGISTER: register a UMCG task: > + * > + * UMCG workers: > + * - @flags & UMCG_CTL_WORKER > + * - self->state must be UMCG_TASK_BLOCKED > + * > + * UMCG servers: > + * - !(@flags & UMCG_CTL_WORKER) > + * - self->state must be UMCG_TASK_RUNNING > + * > + * All tasks: > + * - self->server_tid must be a valid server > + * - self->next_tid must be zero > + * > + * If the conditions above are met, sys_umcg_ctl() immediately returns > + * if the registered task is a server. If the registered task is a > + * worker it will be added to it's server's runnable_workers_ptr list > + * and the server will be woken. > + * > + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task > + * is a UMCG worker, the userspace is responsible for waking its > + * server (before or after calling sys_umcg_ctl). > + * > + * Return: > + * 0 - success > + * -EFAULT - failed to read @self > + * -EINVAL - some other error occurred > + * -ESRCH - no such server_tid > + */ > +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock) > +{ > + struct task_struct *server; > + struct umcg_task ut; > + > + if ((unsigned long)self % UMCG_TASK_ALIGN) > + return -EINVAL; > + > + if (flags & ~(UMCG_CTL_REGISTER | > + UMCG_CTL_UNREGISTER | > + UMCG_CTL_WORKER)) > + return -EINVAL; > + > + if (flags == UMCG_CTL_UNREGISTER) { > + if (self || !current->umcg_task) > + return -EINVAL; > + > + if (current->flags & PF_UMCG_WORKER) > + umcg_worker_exit(); The server should be woken here. Imagine: one server, one worker. The server is sleeping, the worker is running. The worker unregisters, the server keeps sleeping forever? I'm OK re: NOT waking the server if the worker thread exits without unregistering, as this is the userspace breaking the contract/protocol. But here we do need to notify the server. At the minimum so that the server can schedule a worker to run in its place. (Why is this important? Worker count can fluctuate considerably: on load spikes many new workers may be created, and later in quiet times they exit to free resources.) > + else > + umcg_clear_task(current); > + > + return 0; > + } > + > + if (!(flags & UMCG_CTL_REGISTER)) > + return -EINVAL; > + > + flags &= ~UMCG_CTL_REGISTER; > + > + switch (which_clock) { > + case CLOCK_REALTIME: > + case CLOCK_MONOTONIC: > + case CLOCK_BOOTTIME: > + case CLOCK_TAI: > + current->umcg_clock = which_clock; > + break; > + > + default: > + return -EINVAL; > + } > + > + if (current->umcg_task || !self) > + return -EINVAL; > + > + if (copy_from_user(&ut, self, sizeof(ut))) > + return -EFAULT; > + > + if (ut.next_tid || ut.__hole[0] || ut.__zero[0] || ut.__zero[1] || ut.__zero[2]) > + return -EINVAL; > + > + rcu_read_lock(); > + server = find_task_by_vpid(ut.server_tid); > + if (server && server->mm == current->mm) { > + if (flags == UMCG_CTL_WORKER) { > + if (!server->umcg_task || > + (server->flags & PF_UMCG_WORKER)) > + server = NULL; > + } else { > + if (server != current) > + server = NULL; > + } > + } else { > + server = NULL; > + } > + rcu_read_unlock(); > + > + if (!server) > + return -ESRCH; > + > + if (flags == UMCG_CTL_WORKER) { > + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED) > + return -EINVAL; > + > + WRITE_ONCE(current->umcg_task, self); > + current->flags |= PF_UMCG_WORKER; /* hook schedule() */ > + set_syscall_work(SYSCALL_UMCG); /* hook syscall */ > + set_thread_flag(TIF_UMCG); /* hook return-to-user */ Too many flags, I'd say. Not a big deal, just a mild preference: I have only a single flag. > + > + /* umcg_sys_exit() will transition to RUNNABLE and wait */ > + > + } else { > + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING) > + return -EINVAL; > + > + WRITE_ONCE(current->umcg_task, self); > + set_thread_flag(TIF_UMCG); /* hook return-to-user */ Why do we need to hook server's return to user? I did not need it in my version. > + > + /* umcg_notify_resume() would block if not RUNNING */ > + } > + > + return 0; > +} > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -273,6 +273,11 @@ COND_SYSCALL(landlock_create_ruleset); > COND_SYSCALL(landlock_add_rule); > COND_SYSCALL(landlock_restrict_self); > > +/* kernel/sched/umcg.c */ > +COND_SYSCALL(umcg_ctl); > +COND_SYSCALL(umcg_wait); > +COND_SYSCALL(umcg_kick); > + > /* arch/example/kernel/sys_example.c */ > > /* mm/fadvise.c */ > > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2021-12-21 17:19 ` Peter Oskolkov @ 2022-01-14 14:09 ` Peter Zijlstra 2022-01-14 15:16 ` Peter Zijlstra ` (4 more replies) 2022-01-17 13:04 ` Peter Zijlstra 1 sibling, 5 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-14 14:09 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle Hi! I've seen you send a new version based on this, but I figured I ought to reply to this first. On Tue, Dec 21, 2021 at 05:19:00PM +0000, Peter Oskolkov wrote: > > +/* pre-schedule() */ > > +void umcg_wq_worker_sleeping(struct task_struct *tsk) > > +{ > > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > > + > > + /* Must not fault, mmap_sem might be held. */ > > + pagefault_disable(); > > + > > + if (WARN_ON_ONCE(!tsk->umcg_server)) > > + UMCG_DIE_PF("no server"); > > We can get here if a running worker (no pinned pages) gets a pagefault > in the userspace. Is umcg_sys_enter() called for pagefaults? If not, > we should not kill the worker; also the userspace won't be able to > detect this worker blocking on a pagefault... Ufff.. good one. No #PF doesn't pass through sys_enter, I'll have to go fix that. > Why don't you like my approach of pinning pages on exit_to_userspace > and unpinning on going to sleep? Yes, the pins will last longer, > but only for scheduled on CPU tasks, so bounded both by time and number > (of course, if umcg_sys_enter() is called on pagefaults/signals/other > interrupts, pinning in umcg_sys_enter() is better). Well, in general I would not call userspace bounded. There's plenty userspace that doesn't do syscalls for indeterminate amounts of time. Now, such userspace might not be the immediate target for UMCG, but we also should not rule it out. Having been an mm/ developer in a previous lifetime, I still think page-pins should be as short as possible. They can get in the way of other things, like CMA. > On the other hand, doing nothing on pagefaults and similar, and having > to only worry about blocking in syscalls, does make things much simpler > (no unexpected concurrency and such). I think most of the things > you found complicated in my patchset, other than the SMP remote-idle wakeup, > were driven by making sure spurious pagefaults are properly handled. > > I can't tell now whether keeping workers RUNNING during pagefaults > vs waking their servers to run pending workers is a net gain or loss > re: performance. I'll have to benchmark this when my large test is ready. I'll go fix the non syscall things that can schedule. > > +int umcg_wait(u64 timo) > > +{ > > + struct task_struct *tsk = current; > > + struct umcg_task __user *self = tsk->umcg_task; > > + struct page *page = NULL; > > + u32 state; > > + int ret; > > + > > + for (;;) { > > + set_current_state(TASK_INTERRUPTIBLE); > > + > > + ret = -EINTR; > > + if (signal_pending(current)) > > + break; > > + > > + /* > > + * Faults can block and scribble our wait state. > > + */ > > + pagefault_disable(); > > + if (get_user(state, &self->state)) { > > + pagefault_enable(); > > + > > + ret = -EFAULT; > > + if (page) { > > + unpin_user_page(page); > > + page = NULL; > > + break; > > + } > > + > > + if (pin_user_pages_fast((unsigned long)self, 1, 0, &page) != 1) { > > I believe that the task should not be TASK_INTERRUPTIBLE here, > as pin_user_pages_fast may fault, and might_fault complains via __might_sleep. Fair enough; can easily mark the task __set_current_state(TASK_RUNNING) right near pagefault_enable() or something. > > + page = NULL; > > + break; > > + } > > + > > + continue; > > + } > > +void umcg_sys_exit(struct pt_regs *regs) > > +{ > > + struct task_struct *tsk = current; > > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > > + long syscall = syscall_get_nr(tsk, regs); > > + > > + if (syscall == __NR_umcg_wait) > > + return; > > + > > + /* > > + * sys_umcg_ctl() will get here without having called umcg_sys_enter() > > + * as such it will look like a syscall that blocked. > > + */ > > + > > + if (tsk->umcg_server) { > > + /* > > + * Didn't block, we done. > > + */ > > + umcg_unpin_pages(); > > + return; > > + } > > + > > + /* avoid recursion vs schedule() */ > > + current->flags &= ~PF_UMCG_WORKER; > > + > > + if (umcg_pin_pages()) > > + UMCG_DIE("pin"); > > + > > + if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE)) > > + UMCG_DIE_UNPIN("state"); > > + > > + if (umcg_enqueue_runnable(tsk)) > > + UMCG_DIE_UNPIN("enqueue"); > > + > > + /* Server might not be RUNNABLE, means it's already running */ > > + if (umcg_wake_server(tsk)) > > + UMCG_DIE_UNPIN("wake-server"); > > So this here breaks the assumption that servers+workers never run > on more CPUs than the number of servers, which I've gone through > a lot of pain to ensure in my patchset. Yes, but you also completely wrecked signals afaict. But yes, this preemption thing also causes that, which is why I proposed that LAZY crud earlier, but I never got that in a shape I was happy with -- it quickly becomes a mess :/ > I think the assumption is based on the idea that a process > using UMCG will get affined to N CPUs, will have N servers and > a number of workers, and they will all happily cooperate and not > get any extra threads running. > > Of course the pretty picture was not completely true, as the unblocked > tasks do consume extra threads in the kernel, though never in the > userspace. Right, there is some unmanaged time anyway. > So this patch may result in all servers running due to wakeups > in umcg_sys_exit(), with also their currently designated workers > running as well, so the userspace may see N+N running threads. I think this was already true, the servers could be running and all workers could be woken from their in-kernel slumber, entering unmamanged time, seeing N+M running tasks as worst possible case. But yes, the 2N case is more common now. > For now I think this may be OK, but as I mentioned above, I need to > run a larger test with a real workload to see if anything is missing. > > What does worry me is that in this wakeup the server calls sys_umcg_wait() > with another worker in next_tid, so now the server will have two > workers running: the current kernel API seems to allow this to happen. > In my patchset the invariant that no more than one worker running > per server was enforced by the kernel. So one of the things I've started, but didn't finished, is to forward port the Proxy-Execution patches to a current kernel and play with the PE+UMCG interaction. Thinking about that interaction I've ran into that exact problem. The 'nice' solution is to actually block the worker, but that's also the slow solution :/ The other solution seems to be to keep kernel state; track the current worker associated with the server. I haven't (so far) done that due to my futex trauma. So yes, the current API can be used to do the wrong thing, but the kernel doesn't care and you get to keep the pieces in userspace. And I much prefer user pieces over kernel pieces. > > + > > + umcg_unpin_pages(); > > + > > + switch (umcg_wait(0)) { > > + case -EFAULT: > > + case -EINVAL: > > + case -ETIMEDOUT: /* how!?! */ > > + default: > > This "default" coming before "case 0" below feels weird... can we do > > switch (umcg_wait()) { > case 0: > case -EINTR: > /* ... */ > break; > default: > UMCG_DIE("wait"); > } Sure. > > + /* > > + * XXX do we want a preemption consuming ::next_tid ? > > + * I'm currently leaning towards no. > > I don't think so: preemption is a sched-type event, so a server > should handle it; next_tid has nothing to do with it. We agree, I'll update the comment. > > +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, timo) > > +{ > > + struct task_struct *tsk = current; > > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > > + bool worker = tsk->flags & PF_UMCG_WORKER; > > + int ret; > > + > > + if (!self || flags) > > + return -EINVAL; > > + > > + if (worker) { > > + tsk->flags &= ~PF_UMCG_WORKER; > > + if (timo) > > + return -ERANGE; > > Worker sleeps timing out is a valid and a real use case. Similar > to futex timeouts, mutex timeouts, condvar timeouts. I do not believe > there is a fundamental problem here, so I'll add worker timeout > handling in my larger test. I don't understand worker timeout, also see: https://lkml.kernel.org/r/Ya34S2JCQg+81h4t@hirez.programming.kicks-ass.net > In addition, shouldn't we NOT clear PF_UMCG_WORKER flag if we > return an error? Why? Userspace can do umcg_ctl() if they want, no? > > + } > > + > > + /* see umcg_sys_{enter,exit}() syscall exceptions */ > > + ret = umcg_pin_pages(); > > I do not think we need to pin pages for servers, only for workers. Yes, > this makes things easier/simpler, so ok for now, but maybe later we will > need to be a bit more fine-grained here. Right. > > + if (ret) > > + goto unblock; > > + > > + /* > > + * Clear UMCG_TF_COND_WAIT *and* check state == RUNNABLE. > > + */ > > + ret = umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNABLE); > > + if (ret) > > + goto unpin; > > + > > + if (worker) { > > + ret = umcg_enqueue_runnable(tsk); > > + if (ret) > > + goto unpin; > > + } > > + > > + if (worker) > > Should this "if" be merged with the one above? Yes, I think I've done that at least once, but clearly it didn't stick. Ah, here it is: https://lkml.kernel.org/r/Ybm+HJzkO%2F0BB4Va@hirez.programming.kicks-ass.net but since that LAZY thing didn't live that cleanup seems to have gone out the window too. > > + ret = umcg_wake(tsk); > > + else if (tsk->umcg_next) > > + ret = umcg_wake_next(tsk); > > + > > + if (ret) { > > + /* > > + * XXX already enqueued ourself on ::server_tid; failing now > > + * leaves the lot in an inconsistent state since it'll also > > + * unblock self in order to return the error. !?!? > > + */ > > It looks like only EFAULT can be here. I'd ensure that, and then just DIE. Can also be -EAGAIN if the target task isn't in an expected state. I also wanted to avoid DIE from the syscalls(). DIE really isn't nice, we shouldn't do it if it can be avoided. > > + goto unpin; > > + } > > + > > + umcg_unpin_pages(); > > + > > + ret = umcg_wait(timo); > > + switch (ret) { > > + case 0: /* all done */ > > + case -EINTR: /* umcg_notify_resume() will continue the wait */ > > + ret = 0; > > + break; > > Why not let workers have timeouts, and keep -ETIMEDOUT here? Just set > UMCG_TF_PREEMPT, or another flag with similar behavior, and > umcg_notify_resume will properly wake the server? I really don't understand timeouts on workers, see above. TF_PREEMPT must only be set on RUNNING, but if we're in wait, we're RUNNABLE. > > + > > + default: > > + goto unblock; > > + } > > +out: > > + if (worker) > > + tsk->flags |= PF_UMCG_WORKER; > > + return ret; > > + > > +unpin: > > + umcg_unpin_pages(); > > +unblock: > > + /* > > + * Workers will still block in umcg_notify_resume() before they can > > + * consume their error, servers however need to get the error asap. > > + * > > + * Still, things might be unrecoverably screwy after this. Not our > > + * problem. > > I think we should explicitly document the unrecoverable screwiness > of errors here, so that the userspace proactively kills itself > to avoid badness. The only reason that returning an error here is > mildly preferable to just killing the task (we already do that > in other places) is to give the userspace an opportunity to > log an error, with more state/info than we can do here. Bah, I should've written a better comment, because I can't quite remember the case I had in mind. Also, again from the LAZY patch, I think we can actually do better in some of the cases here. Specifically, currently we'll enqueue on ::runnable_workers_ptr and fail waking ::next_tid and leave it at that. While I think waking ::server_tid in that case makes sense. I'll go prod at this. > > + */ > > + if (!worker) > > + umcg_update_state(tsk, self, UMCG_TASK_RUNNABLE, UMCG_TASK_RUNNING); > > + goto out; > > +} > > + > > +/** > > + * sys_umcg_ctl: (un)register the current task as a UMCG task. > > + * @flags: ORed values from enum umcg_ctl_flag; see below; > > + * @self: a pointer to struct umcg_task that describes this > > + * task and governs the behavior of sys_umcg_wait if > > + * registering; must be NULL if unregistering. > > @which_clock is not documented. Why do we need the option in the first > place? Well, you had CLOCK_REALTIME, which I think is quite daft, but Thomas also wanted CLOCK_TAI, so here we are. I'll add the comment. > > +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock) > > +{ > > + struct task_struct *server; > > + struct umcg_task ut; > > + > > + if ((unsigned long)self % UMCG_TASK_ALIGN) > > + return -EINVAL; > > + > > + if (flags & ~(UMCG_CTL_REGISTER | > > + UMCG_CTL_UNREGISTER | > > + UMCG_CTL_WORKER)) > > + return -EINVAL; > > + > > + if (flags == UMCG_CTL_UNREGISTER) { > > + if (self || !current->umcg_task) > > + return -EINVAL; > > + > > + if (current->flags & PF_UMCG_WORKER) > > + umcg_worker_exit(); > > The server should be woken here. Imagine: one server, one worker. > The server is sleeping, the worker is running. The worker unregisters, > the server keeps sleeping forever? > > I'm OK re: NOT waking the server if the worker thread exits without > unregistering, as this is the userspace breaking the contract/protocol. > But here we do need to notify the server. At the minimum so that the > server can schedule a worker to run in its place. > > (Why is this important? Worker count can fluctuate considerably: > on load spikes many new workers may be created, and later in > quiet times they exit to free resources.) Fair enough. Will do. > > + if (flags == UMCG_CTL_WORKER) { > > + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED) > > + return -EINVAL; > > + > > + WRITE_ONCE(current->umcg_task, self); > > + current->flags |= PF_UMCG_WORKER; /* hook schedule() */ > > + set_syscall_work(SYSCALL_UMCG); /* hook syscall */ > > + set_thread_flag(TIF_UMCG); /* hook return-to-user */ > > Too many flags, I'd say. Not a big deal, just a mild preference: > I have only a single flag. Yeah, you overloaded TIF_NOTIFY_RESUME, I prefer an explicit flag. And the syscall things already have their own flag word, so that simply needs another one. > > + > > + /* umcg_sys_exit() will transition to RUNNABLE and wait */ > > + > > + } else { > > + if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING) > > + return -EINVAL; > > + > > + WRITE_ONCE(current->umcg_task, self); > > + set_thread_flag(TIF_UMCG); /* hook return-to-user */ > > Why do we need to hook server's return to user? I did not need it in my > version. Signals; server could be blocked in sys_umcg_wait() and get a signal, then we should resume waiting after the signal. I hate signals as much as the next guy, but we gotta do something with them. Anyway, let me go poke at this code again.. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-14 14:09 ` Peter Zijlstra @ 2022-01-14 15:16 ` Peter Zijlstra 2022-01-14 17:15 ` Peter Zijlstra ` (3 subsequent siblings) 4 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-14 15:16 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Fri, Jan 14, 2022 at 03:09:55PM +0100, Peter Zijlstra wrote: > > I think the assumption is based on the idea that a process > > using UMCG will get affined to N CPUs, will have N servers and > > a number of workers, and they will all happily cooperate and not > > get any extra threads running. > > > > Of course the pretty picture was not completely true, as the unblocked > > tasks do consume extra threads in the kernel, though never in the > > userspace. > > Right, there is some unmanaged time anyway. Also, since we force wake to the same CPU, and overlapping runtime is 'short', they should all stick to the same CPU, even if we don't pin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-14 14:09 ` Peter Zijlstra 2022-01-14 15:16 ` Peter Zijlstra @ 2022-01-14 17:15 ` Peter Zijlstra 2022-01-17 11:35 ` Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-14 17:15 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Fri, Jan 14, 2022 at 03:09:55PM +0100, Peter Zijlstra wrote: > > Hi! > > I've seen you send a new version based on this, but I figured I ought to > reply to this first. > > On Tue, Dec 21, 2021 at 05:19:00PM +0000, Peter Oskolkov wrote: > > > > +/* pre-schedule() */ > > > +void umcg_wq_worker_sleeping(struct task_struct *tsk) > > > +{ > > > + struct umcg_task __user *self = READ_ONCE(tsk->umcg_task); > > > + > > > + /* Must not fault, mmap_sem might be held. */ > > > + pagefault_disable(); > > > + > > > + if (WARN_ON_ONCE(!tsk->umcg_server)) > > > + UMCG_DIE_PF("no server"); > > > > We can get here if a running worker (no pinned pages) gets a pagefault > > in the userspace. Is umcg_sys_enter() called for pagefaults? If not, > > we should not kill the worker; also the userspace won't be able to > > detect this worker blocking on a pagefault... > > Ufff.. good one. No #PF doesn't pass through sys_enter, I'll have to go > fix that. Something like the below I think.. it builds, but I've not yet tested it. --- --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -73,18 +73,6 @@ DECLARE_BITMAP(system_vectors, NR_VECTORS); -static inline void cond_local_irq_enable(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); -} - -static inline void cond_local_irq_disable(struct pt_regs *regs) -{ - if (regs->flags & X86_EFLAGS_IF) - local_irq_disable(); -} - __always_inline int is_valid_bugaddr(unsigned long addr) { if (addr < TASK_SIZE_MAX) @@ -177,9 +165,9 @@ static void do_error_trap(struct pt_regs if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) != NOTIFY_STOP) { - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); do_trap(trapnr, signr, str, regs, error_code, sicode, addr); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } } @@ -300,7 +288,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_alignment_ if (!user_mode(regs)) die("Split lock detected\n", regs, error_code); - local_irq_enable(); + irqentry_irq_enable(regs); if (handle_user_split_lock(regs, error_code)) goto out; @@ -309,7 +297,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_alignment_ error_code, BUS_ADRALN, NULL); out: - local_irq_disable(); + irqentry_irq_disable(regs); } #ifdef CONFIG_VMAP_STACK @@ -473,14 +461,14 @@ DEFINE_IDTENTRY(exc_bounds) if (notify_die(DIE_TRAP, "bounds", regs, 0, X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP) return; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); if (!user_mode(regs)) die("bounds", regs, 0); do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, 0, 0, NULL); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } enum kernel_gp_hint { @@ -567,7 +555,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_pr unsigned long gp_addr; int ret; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); if (static_cpu_has(X86_FEATURE_UMIP)) { if (user_mode(regs) && fixup_umip_exception(regs)) @@ -638,7 +626,7 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_pr die_addr(desc, regs, error_code, gp_addr); exit: - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } static bool do_int3(struct pt_regs *regs) @@ -665,9 +653,9 @@ static void do_int3_user(struct pt_regs if (do_int3(regs)) return; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, 0, 0, NULL); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } DEFINE_IDTENTRY_RAW(exc_int3) @@ -1003,7 +991,7 @@ static __always_inline void exc_debug_us goto out; /* It's safe to allow irq's after DR6 has been saved */ - local_irq_enable(); + irqentry_irq_enable(regs); if (v8086_mode(regs)) { handle_vm86_trap((struct kernel_vm86_regs *)regs, 0, X86_TRAP_DB); @@ -1020,7 +1008,7 @@ static __always_inline void exc_debug_us send_sigtrap(regs, 0, get_si_code(dr6)); out_irq: - local_irq_disable(); + irqentry_irq_disable(regs); out: instrumentation_end(); irqentry_exit_to_user_mode(regs); @@ -1064,7 +1052,7 @@ static void math_error(struct pt_regs *r char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" : "simd exception"; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); if (!user_mode(regs)) { if (fixup_exception(regs, trapnr, 0, 0)) @@ -1099,7 +1087,7 @@ static void math_error(struct pt_regs *r force_sig_fault(SIGFPE, si_code, (void __user *)uprobe_get_trap_addr(regs)); exit: - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); } DEFINE_IDTENTRY(exc_coprocessor_error) @@ -1160,7 +1148,7 @@ static bool handle_xfd_event(struct pt_r if (WARN_ON(!user_mode(regs))) return false; - local_irq_enable(); + irqentry_irq_enable(regs); err = xfd_enable_feature(xfd_err); @@ -1173,7 +1161,7 @@ static bool handle_xfd_event(struct pt_r break; } - local_irq_disable(); + irqentry_irq_disable(regs); return true; } @@ -1188,12 +1176,12 @@ DEFINE_IDTENTRY(exc_device_not_available if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) { struct math_emu_info info = { }; - cond_local_irq_enable(regs); + irqentry_irq_enable(regs); info.regs = regs; math_emulate(&info); - cond_local_irq_disable(regs); + irqentry_irq_disable(regs); return; } #endif --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1209,6 +1209,12 @@ do_kern_addr_fault(struct pt_regs *regs, NOKPROBE_SYMBOL(do_kern_addr_fault); /* + * EFLAGS[3] is unused and ABI defined to be 0, use it to store IRQ state, + * because do_user_addr_fault() is too convoluted to track things. + */ +#define X86_EFLAGS_MISC (1UL << 3) + +/* * Handle faults in the user portion of the address space. Nothing in here * should check X86_PF_USER without a specific justification: for almost * all purposes, we should treat a normal kernel access to user memory @@ -1290,13 +1296,11 @@ void do_user_addr_fault(struct pt_regs * * User-mode registers count as a user access even for any * potential system fault or CPU buglet: */ - if (user_mode(regs)) { - local_irq_enable(); + if (user_mode(regs)) flags |= FAULT_FLAG_USER; - } else { - if (regs->flags & X86_EFLAGS_IF) - local_irq_enable(); - } + + irqentry_irq_enable(regs); + regs->flags |= X86_EFLAGS_MISC; perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); @@ -1483,14 +1487,10 @@ handle_page_fault(struct pt_regs *regs, do_kern_addr_fault(regs, error_code, address); } else { do_user_addr_fault(regs, error_code, address); - /* - * User address page fault handling might have reenabled - * interrupts. Fixing up all potential exit points of - * do_user_addr_fault() and its leaf functions is just not - * doable w/o creating an unholy mess or turning the code - * upside down. - */ - local_irq_disable(); + if (regs->flags & X86_EFLAGS_MISC) { + regs->flags &= ~X86_EFLAGS_MISC; + irqentry_irq_disable(regs); + } } } --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -7,6 +7,7 @@ #include <linux/syscalls.h> #include <linux/seccomp.h> #include <linux/sched.h> +#include <asm/ptrace.h> #include <asm/entry-common.h> @@ -218,6 +219,24 @@ static inline void local_irq_disable_exi } #endif +static inline void irqentry_irq_enable(struct pt_regs *regs) +{ + if (!regs_irqs_disabled(regs)) { + local_irq_enable(); + if (user_mode(regs) && (current->flags & PF_UMCG_WORKER)) + umcg_sys_enter(regs, 0); + } +} + +static inline void irqentry_irq_disable(struct pt_regs *regs) +{ + if (!regs_irqs_disabled(regs)) { + if (user_mode(regs) && (current->flags & PF_UMCG_WORKER)) + umcg_sys_exit(regs); + local_irq_disable(); + } +} + /** * arch_exit_to_user_mode_work - Architecture specific TIF work for exit * to user mode. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-14 14:09 ` Peter Zijlstra 2022-01-14 15:16 ` Peter Zijlstra 2022-01-14 17:15 ` Peter Zijlstra @ 2022-01-17 11:35 ` Peter Zijlstra 2022-01-17 12:22 ` Peter Zijlstra 2022-01-17 12:12 ` Peter Zijlstra 2022-01-18 10:09 ` Peter Zijlstra 4 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2022-01-17 11:35 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Fri, Jan 14, 2022 at 03:09:55PM +0100, Peter Zijlstra wrote: > > > +SYSCALL_DEFINE3(umcg_ctl, u32, flags, struct umcg_task __user *, self, clockid_t, which_clock) > > > +{ > > > + struct task_struct *server; > > > + struct umcg_task ut; > > > + > > > + if ((unsigned long)self % UMCG_TASK_ALIGN) > > > + return -EINVAL; > > > + > > > + if (flags & ~(UMCG_CTL_REGISTER | > > > + UMCG_CTL_UNREGISTER | > > > + UMCG_CTL_WORKER)) > > > + return -EINVAL; > > > + > > > + if (flags == UMCG_CTL_UNREGISTER) { > > > + if (self || !current->umcg_task) > > > + return -EINVAL; > > > + > > > + if (current->flags & PF_UMCG_WORKER) > > > + umcg_worker_exit(); > > > > The server should be woken here. Imagine: one server, one worker. > > The server is sleeping, the worker is running. The worker unregisters, > > the server keeps sleeping forever? > > > > I'm OK re: NOT waking the server if the worker thread exits without > > unregistering, as this is the userspace breaking the contract/protocol. > > But here we do need to notify the server. At the minimum so that the > > server can schedule a worker to run in its place. > > > > (Why is this important? Worker count can fluctuate considerably: > > on load spikes many new workers may be created, and later in > > quiet times they exit to free resources.) > > Fair enough. Will do. Something like so then... --- --- a/kernel/sched/umcg.c +++ b/kernel/sched/umcg.c @@ -185,13 +185,6 @@ void umcg_clear_child(struct task_struct umcg_clear_task(tsk); } -/* Called both by normally (unregister) and abnormally exiting workers. */ -void umcg_worker_exit(void) -{ - umcg_unpin_pages(); - umcg_clear_task(current); -} - /* * Do a state transition: @from -> @to. * @@ -748,32 +741,43 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u * sys_umcg_ctl: (un)register the current task as a UMCG task. * @flags: ORed values from enum umcg_ctl_flag; see below; * @self: a pointer to struct umcg_task that describes this - * task and governs the behavior of sys_umcg_wait if - * registering; must be NULL if unregistering. + * task and governs the behavior of sys_umcg_wait. * @which_clock: clockid to use for timestamps and timeouts * * @flags & UMCG_CTL_REGISTER: register a UMCG task: * - * UMCG workers: - * - @flags & UMCG_CTL_WORKER - * - self->state must be UMCG_TASK_BLOCKED - * - * UMCG servers: - * - !(@flags & UMCG_CTL_WORKER) - * - self->state must be UMCG_TASK_RUNNING - * - * All tasks: - * - self->server_tid must be a valid server - * - self->next_tid must be zero - * - * If the conditions above are met, sys_umcg_ctl() immediately returns - * if the registered task is a server. If the registered task is a - * worker it will be added to it's server's runnable_workers_ptr list - * and the server will be woken. - * - * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task - * is a UMCG worker, the userspace is responsible for waking its - * server (before or after calling sys_umcg_ctl). + * UMCG workers: + * - @flags & UMCG_CTL_WORKER + * - self->state must be UMCG_TASK_BLOCKED + * + * UMCG servers: + * - !(@flags & UMCG_CTL_WORKER) + * - self->state must be UMCG_TASK_RUNNING + * + * All tasks: + * - self->server_tid must be a valid server + * - self->next_tid must be zero + * + * If the conditions above are met, sys_umcg_ctl() immediately returns + * if the registered task is a server. If the registered task is a + * worker it will be added to it's server's runnable_workers_ptr list + * and the server will be woken. + * + * @flags & UMCG_CTL_UNREGISTER: unregister a UMCG task. + * + * UMCG workers: + * - @flags & UMCG_CTL_WORKER + * + * UMCG servers: + * - !(@flags & UMCG_CTL_WORKER) + * + * All tasks: + * - self must match with UMCG_CTL_REGISTER + * - self->state must be UMCG_TASK_RUNNING + * - self->server_tid must be a valid server + * + * If the conditions above are met, sys_umcg_ctl() will change state to + * UMCG_TASK_NONE, and for workers, wake either next or server. * * Return: * 0 - success @@ -794,16 +798,31 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st UMCG_CTL_WORKER)) return -EINVAL; - if (flags == UMCG_CTL_UNREGISTER) { - if (self || !current->umcg_task) + if (flags & UMCG_CTL_UNREGISTER) { + int ret; + + if (!self || self != current->umcg_task) return -EINVAL; - if (current->flags & PF_UMCG_WORKER) { - umcg_worker_exit(); - // XXX wake server - } else - umcg_clear_task(current); + current->flags &= ~PF_UMCG_WORKER; + ret = umcg_pin_pages(); + if (ret) { + current->flags |= PF_UMCG_WORKER; + return ret; + } + + ret = umcg_update_state(current, self, UMCG_TASK_RUNNING, UMCG_TASK_NONE); + if (ret) { + current->flags |= PF_UMCG_WORKER; + return ret; + } + + if (current->flags & PF_UMCG_WORKER) + umcg_wake(current); + + umcg_unpin_pages(); + umcg_clear_task(current); return 0; } ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-17 11:35 ` Peter Zijlstra @ 2022-01-17 12:22 ` Peter Zijlstra 0 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-17 12:22 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Mon, Jan 17, 2022 at 12:35:29PM +0100, Peter Zijlstra wrote: > @@ -794,16 +798,31 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st > UMCG_CTL_WORKER)) > return -EINVAL; > > - if (flags == UMCG_CTL_UNREGISTER) { > - if (self || !current->umcg_task) > + if (flags & UMCG_CTL_UNREGISTER) { > + int ret; > + > + if (!self || self != current->umcg_task) > return -EINVAL; > > - if (current->flags & PF_UMCG_WORKER) { > - umcg_worker_exit(); > - // XXX wake server > - } else > - umcg_clear_task(current); > + current->flags &= ~PF_UMCG_WORKER; > > + ret = umcg_pin_pages(); > + if (ret) { > + current->flags |= PF_UMCG_WORKER; > + return ret; > + } > + > + ret = umcg_update_state(current, self, UMCG_TASK_RUNNING, UMCG_TASK_NONE); > + if (ret) { > + current->flags |= PF_UMCG_WORKER; > + return ret; > + } Sorry, should obv only restore PF_UMCG_WORKER for workers.. /me fixes > + > + if (current->flags & PF_UMCG_WORKER) > + umcg_wake(current); > + > + umcg_unpin_pages(); > + umcg_clear_task(current); > return 0; > } > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-14 14:09 ` Peter Zijlstra ` (2 preceding siblings ...) 2022-01-17 11:35 ` Peter Zijlstra @ 2022-01-17 12:12 ` Peter Zijlstra 2022-01-18 10:09 ` Peter Zijlstra 4 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-17 12:12 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Fri, Jan 14, 2022 at 03:09:55PM +0100, Peter Zijlstra wrote: > On Tue, Dec 21, 2021 at 05:19:00PM +0000, Peter Oskolkov wrote: > > > + /* > > > + * Workers will still block in umcg_notify_resume() before they can > > > + * consume their error, servers however need to get the error asap. > > > + * > > > + * Still, things might be unrecoverably screwy after this. Not our > > > + * problem. > > > > I think we should explicitly document the unrecoverable screwiness > > of errors here, so that the userspace proactively kills itself > > to avoid badness. The only reason that returning an error here is > > mildly preferable to just killing the task (we already do that > > in other places) is to give the userspace an opportunity to > > log an error, with more state/info than we can do here. > > Bah, I should've written a better comment, because I can't quite > remember the case I had in mind. Also, again from the LAZY patch, I > think we can actually do better in some of the cases here. > > Specifically, currently we'll enqueue on ::runnable_workers_ptr and fail > waking ::next_tid and leave it at that. While I think waking > ::server_tid in that case makes sense. > > I'll go prod at this. Is anybody actually planning to use ::next_tid for workers? My current thinking is that much of the problems here stem from that. Let me ponder more.. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-14 14:09 ` Peter Zijlstra ` (3 preceding siblings ...) 2022-01-17 12:12 ` Peter Zijlstra @ 2022-01-18 10:09 ` Peter Zijlstra 2022-01-18 18:19 ` Peter Oskolkov 4 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2022-01-18 10:09 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Fri, Jan 14, 2022 at 03:09:55PM +0100, Peter Zijlstra wrote: > On Tue, Dec 21, 2021 at 05:19:00PM +0000, Peter Oskolkov wrote: > > What does worry me is that in this wakeup the server calls sys_umcg_wait() > > with another worker in next_tid, so now the server will have two > > workers running: the current kernel API seems to allow this to happen. > > In my patchset the invariant that no more than one worker running > > per server was enforced by the kernel. > > So one of the things I've started, but didn't finished, is to forward > port the Proxy-Execution patches to a current kernel and play with the > PE+UMCG interaction. > > Thinking about that interaction I've ran into that exact problem. > > The 'nice' solution is to actually block the worker, but that's also the > slow solution :/ > > The other solution seems to be to keep kernel state; track the current > worker associated with the server. I haven't (so far) done that due to > my futex trauma. > > So yes, the current API can be used to do the wrong thing, but the > kernel doesn't care and you get to keep the pieces in userspace. And I > much prefer user pieces over kernel pieces. So I think I came up with a 3rd option; since the TID range is 30 bits (per FUTEX_TID_MASK) we have those same two top bits to play with. So we write into server::next_tid the tid of the worker we want to wake; and we currently have it cleared such that we can distinguish between the case where sys_umcg_wait() returned an error before or after waking it. However; we can use one of the 2 remaining bits to indicate the worker is woken, let's say bit 31. This then has server::next_tid always containing the current tid, even when the server has a 'spurious' wakeup for other things. Then all we need to do is modify the state check when the bit is set to ensure we don't wake the worker again if we race sys_umcg_wait() with a worker blocking. A bit like this, I suppose... (incompete patch in that it relies on a whole pile of local changes that might or might not live). --- a/include/uapi/linux/umcg.h +++ b/include/uapi/linux/umcg.h @@ -94,6 +94,8 @@ struct umcg_task { */ __u32 state; /* r/w */ +#define UMCG_TID_RUNNING 0x80000000U +#define UMCG_TID_MASK 0x3fffffffU /** * @next_tid: the TID of the UMCG task that should be context-switched * into in sys_umcg_wait(). Can be zero, in which case --- a/kernel/sched/umcg.c +++ b/kernel/sched/umcg.c @@ -20,7 +20,7 @@ static struct task_struct *umcg_get_task if (tid) { rcu_read_lock(); - tsk = find_task_by_vpid(tid); + tsk = find_task_by_vpid(tid & UMCG_TID_MASK); if (tsk && current->mm == tsk->mm && tsk->umcg_task) get_task_struct(tsk); else @@ -289,27 +291,6 @@ static int umcg_wake_task(struct task_st return 0; } -static int umcg_wake_next(struct task_struct *tsk) -{ - int ret = umcg_wake_task(tsk->umcg_next, tsk->umcg_next_task); - if (ret) - return ret; - - /* - * If userspace sets umcg_task::next_tid, it needs to remove - * that task from the ready-queue to avoid another server - * selecting it. However, that also means it needs to put it - * back in case it went unused. - * - * By clearing the field on use, userspace can detect this case - * and DTRT. - */ - if (put_user(0u, &tsk->umcg_task->next_tid)) - return -EFAULT; - - return 0; -} - static int umcg_wake_server(struct task_struct *tsk) { int ret = umcg_wake_task(tsk->umcg_server, tsk->umcg_server_task); @@ -637,6 +599,48 @@ SYSCALL_DEFINE2(umcg_kick, u32, flags, p return 0; } +static int umcg_wake_next(struct task_struct *tsk, struct umcg_task __user *self) +{ + struct umcg_task __user *next_task; + struct task_struct *next; + u32 next_tid, state; + int ret; + + if (get_user(next_tid, &self->next_tid)) + return -EFAULT; + + next = umcg_get_task(next_tid); + if (!next) + return -ESRCH; + + next_task = READ_ONCE(next->umcg_task); + + if (next_tid & UMCG_TID_RUNNING) { + ret = -EFAULT; + if (get_user(state, &next_task->state)) + goto put_next; + + ret = 0; + if ((state & UMCG_TASK_MASK) != UMCG_TASK_RUNNING) + ret = -EAGAIN; + + } else { + ret = umcg_wake_task(next, next_task); + if (ret) + goto put_next; + + ret = -EFAULT; + if (put_user(next_tid | UMCG_TID_RUNNING, &self->next_tid)) + goto put_next; + + ret = 0; + } + +put_next: + put_task_struct(next); + return ret; +} + /** * sys_umcg_wait: transfer running context * And once we have this, we can add sanity checks that server::next_tid is what it should be for ever worker moving away from RUNNING state (which depends on the assumption that all threads are in the same PID namespace). Does this make sense? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-18 10:09 ` Peter Zijlstra @ 2022-01-18 18:19 ` Peter Oskolkov 2022-01-19 8:47 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Peter Oskolkov @ 2022-01-18 18:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Tue, Jan 18, 2022 at 2:09 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Jan 14, 2022 at 03:09:55PM +0100, Peter Zijlstra wrote: > > On Tue, Dec 21, 2021 at 05:19:00PM +0000, Peter Oskolkov wrote: > > > > What does worry me is that in this wakeup the server calls sys_umcg_wait() > > > with another worker in next_tid, so now the server will have two > > > workers running: the current kernel API seems to allow this to happen. > > > In my patchset the invariant that no more than one worker running > > > per server was enforced by the kernel. > > > > So one of the things I've started, but didn't finished, is to forward > > port the Proxy-Execution patches to a current kernel and play with the > > PE+UMCG interaction. > > > > Thinking about that interaction I've ran into that exact problem. > > > > The 'nice' solution is to actually block the worker, but that's also the > > slow solution :/ > > > > The other solution seems to be to keep kernel state; track the current > > worker associated with the server. I haven't (so far) done that due to > > my futex trauma. > > > > So yes, the current API can be used to do the wrong thing, but the > > kernel doesn't care and you get to keep the pieces in userspace. And I > > much prefer user pieces over kernel pieces. > > So I think I came up with a 3rd option; since the TID range is 30 bits > (per FUTEX_TID_MASK) we have those same two top bits to play with. > > So we write into server::next_tid the tid of the worker we want to wake; > and we currently have it cleared such that we can distinguish between > the case where sys_umcg_wait() returned an error before or after waking > it. > > However; we can use one of the 2 remaining bits to indicate the worker > is woken, let's say bit 31. This then has server::next_tid always > containing the current tid, even when the server has a 'spurious' wakeup > for other things. > > Then all we need to do is modify the state check when the bit is set to > ensure we don't wake the worker again if we race sys_umcg_wait() with a > worker blocking. > > A bit like this, I suppose... (incompete patch in that it relies on a > whole pile of local changes that might or might not live). > > --- a/include/uapi/linux/umcg.h > +++ b/include/uapi/linux/umcg.h > @@ -94,6 +94,8 @@ struct umcg_task { > */ > __u32 state; /* r/w */ > > +#define UMCG_TID_RUNNING 0x80000000U > +#define UMCG_TID_MASK 0x3fffffffU > /** > * @next_tid: the TID of the UMCG task that should be context-switched > * into in sys_umcg_wait(). Can be zero, in which case > --- a/kernel/sched/umcg.c > +++ b/kernel/sched/umcg.c > @@ -20,7 +20,7 @@ static struct task_struct *umcg_get_task > > if (tid) { > rcu_read_lock(); > - tsk = find_task_by_vpid(tid); > + tsk = find_task_by_vpid(tid & UMCG_TID_MASK); > if (tsk && current->mm == tsk->mm && tsk->umcg_task) > get_task_struct(tsk); > else > @@ -289,27 +291,6 @@ static int umcg_wake_task(struct task_st > return 0; > } > > -static int umcg_wake_next(struct task_struct *tsk) > -{ > - int ret = umcg_wake_task(tsk->umcg_next, tsk->umcg_next_task); > - if (ret) > - return ret; > - > - /* > - * If userspace sets umcg_task::next_tid, it needs to remove > - * that task from the ready-queue to avoid another server > - * selecting it. However, that also means it needs to put it > - * back in case it went unused. > - * > - * By clearing the field on use, userspace can detect this case > - * and DTRT. > - */ > - if (put_user(0u, &tsk->umcg_task->next_tid)) > - return -EFAULT; > - > - return 0; > -} > - > static int umcg_wake_server(struct task_struct *tsk) > { > int ret = umcg_wake_task(tsk->umcg_server, tsk->umcg_server_task); > @@ -637,6 +599,48 @@ SYSCALL_DEFINE2(umcg_kick, u32, flags, p > return 0; > } > > +static int umcg_wake_next(struct task_struct *tsk, struct umcg_task __user *self) > +{ > + struct umcg_task __user *next_task; > + struct task_struct *next; > + u32 next_tid, state; > + int ret; > + > + if (get_user(next_tid, &self->next_tid)) > + return -EFAULT; > + > + next = umcg_get_task(next_tid); > + if (!next) > + return -ESRCH; > + > + next_task = READ_ONCE(next->umcg_task); > + > + if (next_tid & UMCG_TID_RUNNING) { > + ret = -EFAULT; > + if (get_user(state, &next_task->state)) > + goto put_next; > + > + ret = 0; > + if ((state & UMCG_TASK_MASK) != UMCG_TASK_RUNNING) > + ret = -EAGAIN; > + > + } else { > + ret = umcg_wake_task(next, next_task); > + if (ret) > + goto put_next; > + > + ret = -EFAULT; > + if (put_user(next_tid | UMCG_TID_RUNNING, &self->next_tid)) > + goto put_next; > + > + ret = 0; > + } > + > +put_next: > + put_task_struct(next); > + return ret; > +} > + > /** > * sys_umcg_wait: transfer running context > * > > > And once we have this, we can add sanity checks that server::next_tid is > what it should be for ever worker moving away from RUNNING state (which > depends on the assumption that all threads are in the same PID > namespace). > > > Does this make sense? This is a long reply, touching several active discussion topics: - cooperative userspace scheduling - next_tid in workers - worker timeouts - signals and the general approach =========== cooperative userspace scheduling I think an important question to clear is about having worker timeouts and worker-to-worker context switches (next_tid in workers). These are actually fundamental, core features of cooperative user-space scheduling. Yes, if UMCG is viewed simply as enabling what currently exists in the kernel to be done in the userspace, then yes, worker timeouts and worker.next_tid seem unneeded, and a runqueue per server, with a single RUNNING worker per runqueue, makes the most natural approach, the one that is taken in this new and improved patchset. However, our experience of running many production workloads with inprocess userspace scheduling over the last ~8 years indicate that cooperative userspace scheduling features, built on top of worker-to-worker context switches (with timeouts), are equally important in driving performance gains and adoption. ============= worker-to-worker context switches One example: absl::Mutex (https://abseil.io/about/design/mutex) has google-internal extensions that are "fiber aware". More specifically, consider this situation: - worker W1 acqured the mutex and is doing its work - worker W2 calls mutex::lock() mutex::lock(), being aware of workers, understands that W2 is going to sleep; so instead of just doing so, waking the server, and letting the server figure out what to run in place of the sleeping worker, mutex::lock() calls into the userspace scheduler in the context of W2 running, and the userspace scheduler then picks W3 to run and does W2->W3 context switch. The optimization above replaces W2->Server and Server->W3 context switches with a single W2->W3 context switch, which is a material performance gain. In addition, when W1 calls mutex::unlock(), the scheduling code determines that W2 is waiting on the mutex, and thus calls W2::wake() from the context of running W1 (you asked earlier why do we need "WAKE_ONLY"). There are several other similar "cooperative" worker-to-worker context switches used in "handcrafted" userspace synchronization primitives, the most obvious is a producer-consumer queue: the producer (W1) prepares an item or several, and context-switches to the consumer (W2) for the latter to consume the items. If the producer has more items to produce, it will call W2::wake() instead of context-switching into it. I hope the above discussion explains why worker-to-worker context switches, as well as workers waking another workers, are useful/needed. ================ worker timeouts Timeouts now are easy to explain: mutex::lock() and condvar::wait() have timeouts, so workers waiting on these primitives naturally wait with timeouts; if sys_umcg_wait() supports worker timeouts, this is it, all is simple; if it does not, the userspace now has to implement the whole timeout machinery, in order to wake these sleeping workers when their timeouts expire. =========== signals and the general approach My version of the patchset has all of these things working. What it does not have, compared to the new approach we are discussing here, is runqueues per server and proper signal handling (and potential integration with proxy execution). Runqueues per server, in the LAZY mode, are easy to emulate in my patchset: nothing prevents the userspace to partition workers among servers, and have servers that "own" their workers to be pointed at by idle_server_tid_ptr. The only thing that is missing is proper treating of signals. But my patchset does ensure a single running worker per server, had pagefaults and preemptions sorted out, etc. Basically, everything works except signals. This patchet has issues with pagefaults, worker timeouts, worker-to-worker context switches (do workers move runqueues when they context switch?), etc. And my patchset now actually looks smaller and simpler, on the kernel side, that what this patchset is shaping up to be. What if I fix signals in my patchset? I think the way you deal with signals will work in my approach equally well; I'll also use umcg_kick() to preempt workers instead of sending them a signal. What do you think? If signals work in my patchset, why this new approach has that my version does not? Thanks, Peter ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-18 18:19 ` Peter Oskolkov @ 2022-01-19 8:47 ` Peter Zijlstra 2022-01-19 17:33 ` Peter Oskolkov 2022-01-19 8:51 ` Peter Zijlstra 2022-01-19 8:59 ` Peter Zijlstra 2 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2022-01-19 8:47 UTC (permalink / raw) To: Peter Oskolkov Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Tue, Jan 18, 2022 at 10:19:21AM -0800, Peter Oskolkov wrote: > ============= worker-to-worker context switches > > One example: absl::Mutex (https://abseil.io/about/design/mutex) has > google-internal extensions that are "fiber aware". More specifically, > consider this situation: > > - worker W1 acqured the mutex and is doing its work > - worker W2 calls mutex::lock() > mutex::lock(), being aware of workers, understands that W2 is going to sleep; > so instead of just doing so, waking the server, and letting > the server figure out what to run in place of the sleeping worker, > mutex::lock() > calls into the userspace scheduler in the context of W2 running, and the > userspace scheduler then picks W3 to run and does W2->W3 context switch. > > The optimization above replaces W2->Server and Server->W3 context switches > with a single W2->W3 context switch, which is a material performance gain. Yes, I've also already reconsidered. Things like pipelines and other fixed order scheduling policies will greatly benefit from worker-to-worker switching. But I think all of them are explicit. That is, we can limit the ::next_tid usage to sys_umcg_wait() and never look at it for implicit blocks. > In addition, when W1 calls mutex::unlock(), the scheduling code determines > that W2 is waiting on the mutex, and thus calls W2::wake() from the context of > running W1 (you asked earlier why do we need "WAKE_ONLY"). This I'm not at all convinced on. That sounds like it will violate the 1:1 thing. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-19 8:47 ` Peter Zijlstra @ 2022-01-19 17:33 ` Peter Oskolkov 0 siblings, 0 replies; 40+ messages in thread From: Peter Oskolkov @ 2022-01-19 17:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Wed, Jan 19, 2022 at 12:47 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Jan 18, 2022 at 10:19:21AM -0800, Peter Oskolkov wrote: > > ============= worker-to-worker context switches > > > > One example: absl::Mutex (https://abseil.io/about/design/mutex) has > > google-internal extensions that are "fiber aware". More specifically, > > consider this situation: > > > > - worker W1 acqured the mutex and is doing its work > > - worker W2 calls mutex::lock() > > mutex::lock(), being aware of workers, understands that W2 is going to sleep; > > so instead of just doing so, waking the server, and letting > > the server figure out what to run in place of the sleeping worker, > > mutex::lock() > > calls into the userspace scheduler in the context of W2 running, and the > > userspace scheduler then picks W3 to run and does W2->W3 context switch. > > > > The optimization above replaces W2->Server and Server->W3 context switches > > with a single W2->W3 context switch, which is a material performance gain. > > Yes, I've also already reconsidered. Things like pipelines and other > fixed order scheduling policies will greatly benefit from > worker-to-worker switching. > > But I think all of them are explicit. That is, we can limit the > ::next_tid usage to sys_umcg_wait() and never look at it for implicit > blocks. Yes, of course - when a worker blocks, its server gets notified. > > > In addition, when W1 calls mutex::unlock(), the scheduling code determines > > that W2 is waiting on the mutex, and thus calls W2::wake() from the context of > > running W1 (you asked earlier why do we need "WAKE_ONLY"). > > This I'm not at all convinced on. That sounds like it will violate the > 1:1 thing. wake_only is a wakeup event, meaning the worker gets added to the wake queue, not scheduled on a CPU; we don't have to implement it in the kernel, though - the userspace may keep its own wake queue for workers like this. So feel free to ignore this operation. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-18 18:19 ` Peter Oskolkov 2022-01-19 8:47 ` Peter Zijlstra @ 2022-01-19 8:51 ` Peter Zijlstra 2022-01-19 8:59 ` Peter Zijlstra 2 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-19 8:51 UTC (permalink / raw) To: Peter Oskolkov Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Tue, Jan 18, 2022 at 10:19:21AM -0800, Peter Oskolkov wrote: > ================ worker timeouts > > Timeouts now are easy to explain: mutex::lock() and condvar::wait() have > timeouts, so workers waiting on these primitives naturally wait with timeouts; > if sys_umcg_wait() supports worker timeouts, this is it, all is simple; if > it does not, the userspace now has to implement the whole timeout machinery, > in order to wake these sleeping workers when their timeouts expire. I still have absolutely no idea what you're on about here. Please reply to the email on that subject: https://lkml.kernel.org/r/Ya34S2JCQg+81h4t@hirez.programming.kicks-ass.net What would you have the timeout actually do? Talk about the state transitions. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-18 18:19 ` Peter Oskolkov 2022-01-19 8:47 ` Peter Zijlstra 2022-01-19 8:51 ` Peter Zijlstra @ 2022-01-19 8:59 ` Peter Zijlstra 2022-01-19 17:52 ` Peter Oskolkov 2 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2022-01-19 8:59 UTC (permalink / raw) To: Peter Oskolkov Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Tue, Jan 18, 2022 at 10:19:21AM -0800, Peter Oskolkov wrote: > =========== signals and the general approach > > My version of the patchset has all of these things working. What it > does not have, > compared to the new approach we are discussing here, is runqueues per server > and proper signal handling (and potential integration with proxy execution). > > Runqueues per server, in the LAZY mode, are easy to emulate in my patchset: > nothing prevents the userspace to partition workers among servers, and have > servers that "own" their workers to be pointed at by idle_server_tid_ptr. > > The only thing that is missing is proper treating of signals. But my patchset > does ensure a single running worker per server, had pagefaults and preemptions > sorted out, etc. Basically, everything works except signals. This patchet > has issues with pagefaults, Already fixed pagefaults per: YeGvovSckivQnKX8@hirez.programming.kicks-ass.net > worker timeouts I still have no clear answer as to what you actually want there. > , worker-to-worker context > switches (do workers move runqueues when they context switch?), etc. Not in kernel, if they need to be migrated, userspace needs to do that. > And my patchset now actually looks smaller and simpler, on the kernel side, > that what this patchset is shaping up to be. > > What if I fix signals in my patchset? I think the way you deal with signals > will work in my approach equally well; I'll also use umcg_kick() to preempt > workers instead of sending them a signal. > > What do you think? I still absolutely hate how long you do page pinning, it *will* wreck things like CMA which are somewhat latency critical for silly things like Android camera apps and who knows what else. You've also forgotten about this: YcWutpu7BDeG+dQ2@hirez.programming.kicks-ass.net That's not optional given how you're using page-pinning. Also, I think we need at least one direct access to the page after getting the pin in order to make it work. That also very much limits it to Anon pages. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-19 8:59 ` Peter Zijlstra @ 2022-01-19 17:52 ` Peter Oskolkov 2022-01-20 10:37 ` Peter Zijlstra 0 siblings, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2022-01-19 17:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Wed, Jan 19, 2022 at 1:00 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Jan 18, 2022 at 10:19:21AM -0800, Peter Oskolkov wrote: > > > =========== signals and the general approach > > > > My version of the patchset has all of these things working. What it > > does not have, > > compared to the new approach we are discussing here, is runqueues per server > > and proper signal handling (and potential integration with proxy execution). > > > > Runqueues per server, in the LAZY mode, are easy to emulate in my patchset: > > nothing prevents the userspace to partition workers among servers, and have > > servers that "own" their workers to be pointed at by idle_server_tid_ptr. > > > > The only thing that is missing is proper treating of signals. But my patchset > > does ensure a single running worker per server, had pagefaults and preemptions > > sorted out, etc. Basically, everything works except signals. This patchet > > has issues with pagefaults, > > Already fixed pagefaults per: > > YeGvovSckivQnKX8@hirez.programming.kicks-ass.net Could you, please, post an updated RFC when you have a chance? Thanks! > > > worker timeouts > > I still have no clear answer as to what you actually want there. > > > , worker-to-worker context > > switches (do workers move runqueues when they context switch?), etc. > > Not in kernel, if they need to be migrated, userspace needs to do that. > > > And my patchset now actually looks smaller and simpler, on the kernel side, > > that what this patchset is shaping up to be. > > > > What if I fix signals in my patchset? I think the way you deal with signals > > will work in my approach equally well; I'll also use umcg_kick() to preempt > > workers instead of sending them a signal. > > > > What do you think? > > I still absolutely hate how long you do page pinning, it *will* wreck > things like CMA which are somewhat latency critical for silly things > like Android camera apps and who knows what else. > > You've also forgotten about this: > > YcWutpu7BDeG+dQ2@hirez.programming.kicks-ass.net > > That's not optional given how you're using page-pinning. Also, I think > we need at least one direct access to the page after getting the pin in > order to make it work. > > That also very much limits it to Anon pages. I can use the same mm/page pinning strategy as you do. But then our patchsets will be quite similar, I guess, with the difference being server wakeups with RUNNING workers vs "lazy" idle_server_tid_ptr. So OK, let's continue with your approach. If you could post a new RFC with the memory/paging fixes in it, I'll then add worker timeouts, as I outlined in a separate email ~ 30min ago, and continue with my integration/testing. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2022-01-19 17:52 ` Peter Oskolkov @ 2022-01-20 10:37 ` Peter Zijlstra 0 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-20 10:37 UTC (permalink / raw) To: Peter Oskolkov Cc: Peter Oskolkov, mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, avagin, jannh, tdelisle On Wed, Jan 19, 2022 at 09:52:30AM -0800, Peter Oskolkov wrote: > Could you, please, post an updated RFC when you have a chance? Thanks! I was working on that, I'll try and get it done today. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2021-12-21 17:19 ` Peter Oskolkov 2022-01-14 14:09 ` Peter Zijlstra @ 2022-01-17 13:04 ` Peter Zijlstra 1 sibling, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2022-01-17 13:04 UTC (permalink / raw) To: Peter Oskolkov Cc: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle On Tue, Dec 21, 2021 at 05:19:00PM +0000, Peter Oskolkov wrote: > On Tue, Dec 14, 2021 at 09:44:48PM +0100, Peter Zijlstra wrote: > > +static struct task_struct *umcg_get_task(u32 tid) > > +{ > > + struct task_struct *tsk = NULL; > > + > > + if (tid) { > > + rcu_read_lock(); > > + tsk = find_task_by_vpid(tid); > > + if (tsk && current->mm == tsk->mm && tsk->umcg_task) > > This essentially limits all operations to a single mm/process. > Fine for now, but a fast remote context switch is also a valid use > case. It is not directly related to userspace scheduling, so I'm > just mentioning it here. Maybe server-to-server cross-process > context switches should be allowed for the same user/cgroup? (Again, > this is for later to consider). Doing cross-address-space UMCG will be a massive effort, too much (pretty much everything) in this implementation assumes things are directly addressable. > > + get_task_struct(tsk); > > + else > > + tsk = NULL; > > + rcu_read_unlock(); > > + } > > + > > + return tsk; > > +} ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 3/3] sched: User Mode Concurency Groups 2021-12-14 20:44 ` [RFC][PATCH 3/3] sched: User Mode Concurency Groups Peter Zijlstra 2021-12-21 17:19 ` Peter Oskolkov @ 2021-12-24 11:27 ` Peter Zijlstra 1 sibling, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-24 11:27 UTC (permalink / raw) To: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot Cc: linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle, posk On Tue, Dec 14, 2021 at 09:44:48PM +0100, Peter Zijlstra wrote: > The big assumption this whole thing is build on is that > pin_user_pages() preserves user mappings in so far that > pagefault_disable() will never generate EFAULT (unless the user does > munmap() in which case it can keep the pieces). > > shrink_page_list() does page_maybe_dma_pinned() before try_to_unmap() > and as such seems to respect this constraint. > > unmap_and_move() however seems willing to unmap otherwise pinned (and > hence unmigratable) pages. This might need fixing. AFAICT this should mostly do,.. I still need to check if get_user_pages_fast() is itself sufficient to avoid all races or if we need to strengthen/augment that too. --- mm/migrate.c | 10 +++++++++- mm/mprotect.c | 6 ++++++ 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index cf25b00f03c8..3850b33c64eb 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1472,7 +1472,15 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, nr_subpages = thp_nr_pages(page); cond_resched(); - if (PageHuge(page)) + /* + * If the page has a pin then expected_page_refs() will + * not match and the whole migration will fail later + * anyway, fail early and preserve the mappings. + */ + if (page_maybe_dma_pinned(page)) + rc = -EAGAIN; + + else if (PageHuge(page)) rc = unmap_and_move_huge_page(get_new_page, put_new_page, private, page, pass > 2, mode, reason, diff --git a/mm/mprotect.c b/mm/mprotect.c index 9a105fce5aeb..093db725d39f 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -105,6 +105,12 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (page_is_file_lru(page) && PageDirty(page)) continue; + /* + * Can't migrate pinned pages, avoid touching them. + */ + if (page_maybe_dma_pinned(page)) + continue; + /* * Don't mess with PTEs if page is already on the node * a single-threaded process is running on. ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-14 20:44 [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra ` (2 preceding siblings ...) 2021-12-14 20:44 ` [RFC][PATCH 3/3] sched: User Mode Concurency Groups Peter Zijlstra @ 2021-12-14 21:00 ` Peter Zijlstra 2021-12-15 3:46 ` Peter Oskolkov 4 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-14 21:00 UTC (permalink / raw) To: mingo, tglx, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot Cc: linux-kernel, linux-mm, linux-api, x86, pjt, posk, avagin, jannh, tdelisle, posk On Tue, Dec 14, 2021 at 09:44:45PM +0100, Peter Zijlstra wrote: > I'll post my test-hack as a reply, but basically it does co-operative and > preemptive UP-like user scheduling. It's pretty rough, but seems to work. Defaults to co-operative and switches to preemptive when ran with an (any!) argument. --- // gcc -Itools/include/ -o umcg umcg.c -lpthread #define _GNU_SOURCE #include <unistd.h> #include <sys/types.h> #include <sys/syscall.h> #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <errno.h> #ifndef __NR_umcg_ctl #define __NR_umcg_ctl 450 #define __NR_umcg_wait 451 #define __NR_umcg_kick 452 #endif #include <linux/list.h> #include "include/uapi/linux/umcg.h" /* syscall wrappers */ static inline int sys_umcg_ctl(u32 flags, struct umcg_task *self, clockid_t which_clock) { return syscall(__NR_umcg_ctl, flags, self, which_clock); } static inline int sys_umcg_wait(u32 flags, u64 timo) { return syscall(__NR_umcg_wait, flags, timo); } static inline int sys_umcg_kick(u32 flags, pid_t tid) { return syscall(__NR_umcg_kick, flags, tid); } /* the 'foo' scheduler */ struct foo_task { struct umcg_task task; struct list_head node; pid_t tid; }; struct foo_server { struct umcg_task task; struct list_head node; pid_t tid; struct foo_task *cur; }; void foo_add(struct foo_server *server, struct umcg_task *t) { struct foo_task *foo = container_of(t, struct foo_task, task); t->runnable_workers_ptr = 0ULL; list_add_tail(&foo->node, &server->node); } struct foo_task *foo_pick_next(struct foo_server *server) { struct foo_task *first = NULL; if (list_empty(&server->node)) return first; first = list_first_entry(&server->node, struct foo_task, node); list_del(&first->node); return first; } #define NSEC_PER_SEC 1000000000ULL u64 foo_time(void) { struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); return (unsigned long long)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec; } void foo_yield(struct umcg_task *self) { self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; sys_umcg_wait(0, 0); } #define TICK_NSEC NSEC_PER_SEC volatile bool foo_preemptible = false; /* our workers */ /* always running worker */ void *worker_fn0(void *arg) { struct foo_server *server = arg; struct foo_task task = { }; unsigned long i; int ret; task.tid = gettid(); task.task.server_tid = server->tid; task.task.state = UMCG_TASK_BLOCKED; printf("A == %d\n", gettid()); ret = sys_umcg_ctl(UMCG_CTL_REGISTER|UMCG_CTL_WORKER, &task.task, CLOCK_MONOTONIC); if (ret) { perror("umcg_ctl(A): "); exit(-1); } for (;;) { int x = i++; if (!(x % 1000000)) { putchar('.'); fflush(stdout); } /* co-operative or preemptible */ if (!foo_preemptible && !(x % 10000000)) foo_yield(&task.task); } return NULL; } /* event driven worker */ void *worker_fn1(void *arg) { struct foo_server *server = arg; struct foo_task task = { }; int ret; task.tid = gettid(); task.task.server_tid = server->tid; task.task.state = UMCG_TASK_BLOCKED; printf("B == %d\n", gettid()); ret = sys_umcg_ctl(UMCG_CTL_REGISTER|UMCG_CTL_WORKER, &task.task, CLOCK_MONOTONIC); if (ret) { perror("umcg_ctl(B): "); exit(-1); } for (;;) { printf("B\n"); fflush(stdout); sleep(2); } return NULL; } void *worker_fn2(void *arg) { struct foo_server *server = arg; struct foo_task task = { }; int ret; task.tid = gettid(); task.task.server_tid = server->tid; task.task.state = UMCG_TASK_BLOCKED; printf("C == %d\n", gettid()); ret = sys_umcg_ctl(UMCG_CTL_REGISTER|UMCG_CTL_WORKER, &task.task, CLOCK_MONOTONIC); if (ret) { perror("umcg_ctl(C): "); exit(-1); } for (;;) { printf("C\n"); fflush(stdout); sleep(3); } return NULL; } /* the server */ int main(int argc, char **argv) { struct umcg_task *runnable_ptr, *next; struct foo_server server = { }; pthread_t worker[3]; u64 timeout = 0; int ret; printf("server == %d\n", gettid()); fflush(stdout); server.tid = gettid(); INIT_LIST_HEAD(&server.node); server.task.server_tid = gettid(); server.task.state = UMCG_TASK_RUNNING; ret = sys_umcg_ctl(UMCG_CTL_REGISTER, &server.task, CLOCK_MONOTONIC); if (ret) { perror("umcg_ctl: "); exit(-1); } pthread_create(&worker[0], NULL, worker_fn0, &server); pthread_create(&worker[1], NULL, worker_fn1, &server); pthread_create(&worker[2], NULL, worker_fn2, &server); if (argc > 1) { foo_preemptible = true; /* * setup preemption tick */ timeout = foo_time() + TICK_NSEC; } for (;;) { /* * Mark the server as runnable first, so we can detect * additions to the runnable list after we read it. */ server.task.state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; /* * comsume the runnable notification list and add * the tasks to our local runqueue. */ runnable_ptr = (void*)__atomic_exchange_n(&server.task.runnable_workers_ptr, NULL, __ATOMIC_SEQ_CST); while (runnable_ptr) { next = (void *)runnable_ptr->runnable_workers_ptr; foo_add(&server, runnable_ptr); runnable_ptr = next; } /* * If we've got a current running task, the server might have * gotten a 'spurious' wakeup to pick up new runnable tasks. * * In this case, don't pick a new task (possible * wakeup-preemption point, not implemented here). * * Note: even tough this RUNNING test is racy, if it blocks * after we'll get a RUNNABLE notification which will clear our * RUNNABLE state and sys_umcg_wait() will -EAGAIN. */ if (server.cur && server.cur->task.state == UMCG_TASK_RUNNING) { /* * Assert ::next_tid is clear, it should have been * consumed. */ if (server.task.next_tid) { printf("current running, but still have next_tid\n"); exit(-1); } putchar('x'); fflush(stdout); } else { /* * Pick the next task... */ server.cur = foo_pick_next(&server); server.task.next_tid = server.cur ? server.cur->tid : 0; printf("pick: %d\n", server.task.next_tid); fflush(stdout); } /* * And switch... */ ret = sys_umcg_wait(0, timeout); /* * If we did set ::next_tid but it hasn't been consumed by the * syscall due to failure, make sure to put the task back on * the runqueue, lest we leak it. */ if (server.task.next_tid) { foo_add(&server, &server.cur->task); server.cur = NULL; server.task.next_tid = 0; } if (!ret) continue; switch (errno) { case EAGAIN: /* * Got a wakeup, try again. */ continue; case ETIMEDOUT: /* * timeout: drive preemption */ putchar('t'); fflush(stdout); /* * Next tick.. */ timeout += TICK_NSEC; /* * If we have a current, cmpxchg set TF_PREEMPT and on success * send it a signal to kick it into the kernel such that * it might re-report itself runnable. */ if (server.cur) { struct foo_task *t = server.cur; u32 val = UMCG_TASK_RUNNING; u32 new = UMCG_TASK_RUNNING | UMCG_TF_PREEMPT; if (__atomic_compare_exchange_n(&t->task.state, &val, new, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { sys_umcg_kick(0, t->tid); } } /* * Either way around, if the cmpxchg * failed the task will have blocked * and we should re-start the loop. */ continue; default: printf("errno: %d\n", errno); perror("wait:"); exit(-1); } } return 0; } ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-14 20:44 [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra ` (3 preceding siblings ...) 2021-12-14 21:00 ` [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra @ 2021-12-15 3:46 ` Peter Oskolkov 2021-12-15 10:06 ` Peter Zijlstra 2021-12-15 10:44 ` Peter Zijlstra 4 siblings, 2 replies; 40+ messages in thread From: Peter Oskolkov @ 2021-12-15 3:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <peterz@infradead.org> wrote: > > Hi, > > This is actually tested code; but still missing the SMP wake-to-idle machinery. > I still need to think about that. Thanks, Peter! At a first glance, your main patch does not look much smaller than mine, and I thought the whole point of re-doing it was to throw away extra features and make things smaller/simpler... Anyway, I'll test your patchset over the next week or so and let you know if anything really needed is missing (other than waking an idle server if there is one on a worker wakeup; this piece is definitely needed). > > I'll post my test-hack as a reply, but basically it does co-operative and > preemptive UP-like user scheduling. > > Patches go on top of tip/master as they rely on the .fixup removal > recently merged in tip/x86/core. > > Also, I still need to audit a bunch of mm code, because I'm not sure things are > actually as well behaved as this code supposes they are. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 3:46 ` Peter Oskolkov @ 2021-12-15 10:06 ` Peter Zijlstra 2021-12-15 13:03 ` Peter Zijlstra 2021-12-15 17:56 ` Peter Oskolkov 2021-12-15 10:44 ` Peter Zijlstra 1 sibling, 2 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 10:06 UTC (permalink / raw) To: Peter Oskolkov Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote: > On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > Hi, > > > > This is actually tested code; but still missing the SMP wake-to-idle machinery. > > I still need to think about that. > > Thanks, Peter! > > At a first glance, your main patch does not look much smaller than > mine, and I thought the whole point of re-doing it was to throw away > extra features and make things smaller/simpler... Well, simpler was the goal. I didn't really focus on size much. It isn't really big to begin with. But yes, it has 5 hooks now, 3 syscalls and lots of comments and all that under 900 lines, not bad I'd say. Also I think you wanted something like this? I'm not sure of the LAZY name, but I can't seem to come up with anything saner atm. --- --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1297,6 +1297,7 @@ struct task_struct { #ifdef CONFIG_UMCG /* setup by sys_umcg_ctrl() */ + u32 umcg_flags; clockid_t umcg_clock; struct umcg_task __user *umcg_task; --- a/include/uapi/linux/umcg.h +++ b/include/uapi/linux/umcg.h @@ -133,11 +133,13 @@ struct umcg_task { * @UMCG_CTL_REGISTER: register the current task as a UMCG task * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task * @UMCG_CTL_WORKER: register the current task as a UMCG worker + * @UMCG_CTL_LAZY: don't wake server on runnable enqueue */ enum umcg_ctl_flag { UMCG_CTL_REGISTER = 0x00001, UMCG_CTL_UNREGISTER = 0x00002, UMCG_CTL_WORKER = 0x10000, + UMCG_CTL_LAZY = 0x20000, }; #endif /* _UAPI_LINUX_UMCG_H */ --- a/kernel/sched/umcg.c +++ b/kernel/sched/umcg.c @@ -416,6 +416,27 @@ static int umcg_enqueue_runnable(struct } /* + * Enqueue tsk to it's server's runnable list and wake the server for pickup if + * so desired. Notable LAZY workers will not wake the server and rely on the + * server to do pickup whenever it naturally runs next. + * + * Returns: + * 0: success + * -EFAULT + */ +static int umcg_enqueue_and_wake(struct task_struct *tsk, bool force) +{ + int ret = umcg_enqueue_runnable(tsk); + if (ret) + return ret; + + if (force || !(tsk->umcg_flags & UMCG_CTL_LAZY)) + ret = umcg_wake_server(tsk); + + return ret; +} + +/* * umcg_wait: Wait for ->state to become RUNNING * * Returns: @@ -522,12 +543,8 @@ void umcg_sys_exit(struct pt_regs *regs) if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE)) UMCG_DIE_UNPIN("state"); - if (umcg_enqueue_runnable(tsk)) - UMCG_DIE_UNPIN("enqueue"); - - /* Server might not be RUNNABLE, means it's already running */ - if (umcg_wake_server(tsk)) - UMCG_DIE_UNPIN("wake-server"); + if (umcg_enqueue_and_wake(tsk, false)) + UMCG_DIE_UNPIN("enqueue-and-wake"); umcg_unpin_pages(); @@ -582,15 +599,11 @@ void umcg_notify_resume(struct pt_regs * UMCG_TASK_RUNNABLE)) UMCG_DIE_UNPIN("state"); - if (umcg_enqueue_runnable(tsk)) - UMCG_DIE_UNPIN("enqueue"); - /* - * XXX do we want a preemption consuming ::next_tid ? - * I'm currently leaning towards no. + * Preemption relies on waking the server on enqueue. */ - if (umcg_wake_server(tsk)) - UMCG_DIE_UNPIN("wake-server"); + if (umcg_enqueue_and_wake(tsk, true)) + UMCG_DIE_UNPIN("enqueue-and-wake"); umcg_unpin_pages(); } @@ -686,23 +699,15 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u goto unpin; if (worker) { - ret = umcg_enqueue_runnable(tsk); + ret = umcg_enqueue_and_wake(tsk, !tsk->umcg_next); if (ret) goto unpin; } - if (worker) - ret = umcg_wake(tsk); - else if (tsk->umcg_next) + if (tsk->umcg_next) { ret = umcg_wake_next(tsk); - - if (ret) { - /* - * XXX already enqueued ourself on ::server_tid; failing now - * leaves the lot in an inconsistent state since it'll also - * unblock self in order to return the error. !?!? - */ - goto unpin; + if (ret) + goto unpin; } umcg_unpin_pages(); @@ -783,7 +788,8 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st if (flags & ~(UMCG_CTL_REGISTER | UMCG_CTL_UNREGISTER | - UMCG_CTL_WORKER)) + UMCG_CTL_WORKER | + UMCG_CTL_LAZY)) return -EINVAL; if (flags == UMCG_CTL_UNREGISTER) { @@ -827,7 +833,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st rcu_read_lock(); server = find_task_by_vpid(ut.server_tid); if (server && server->mm == current->mm) { - if (flags == UMCG_CTL_WORKER) { + if (flags & UMCG_CTL_WORKER) { if (!server->umcg_task || (server->flags & PF_UMCG_WORKER)) server = NULL; @@ -843,10 +849,11 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st if (!server) return -ESRCH; - if (flags == UMCG_CTL_WORKER) { + if (flags & UMCG_CTL_WORKER) { if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED) return -EINVAL; + current->umcg_flags = flags & UMCG_CTL_LAZY; WRITE_ONCE(current->umcg_task, self); current->flags |= PF_UMCG_WORKER; /* hook schedule() */ set_syscall_work(SYSCALL_UMCG); /* hook syscall */ @@ -858,6 +865,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING) return -EINVAL; + current->umcg_flags = 0; WRITE_ONCE(current->umcg_task, self); set_thread_flag(TIF_UMCG); /* hook return-to-user */ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 10:06 ` Peter Zijlstra @ 2021-12-15 13:03 ` Peter Zijlstra 2021-12-15 17:56 ` Peter Oskolkov 1 sibling, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 13:03 UTC (permalink / raw) To: Peter Oskolkov Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 11:06:20AM +0100, Peter Zijlstra wrote: > On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote: > > On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > Hi, > > > > > > This is actually tested code; but still missing the SMP wake-to-idle machinery. > > > I still need to think about that. > > > > Thanks, Peter! > > > > At a first glance, your main patch does not look much smaller than > > mine, and I thought the whole point of re-doing it was to throw away > > extra features and make things smaller/simpler... > > Well, simpler was the goal. I didn't really focus on size much. It isn't > really big to begin with. > > But yes, it has 5 hooks now, 3 syscalls and lots of comments and all > that under 900 lines, not bad I'd say. > > Also I think you wanted something like this? I'm not sure of the LAZY > name, but I can't seem to come up with anything saner atm. > > > --- > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1297,6 +1297,7 @@ struct task_struct { > > #ifdef CONFIG_UMCG > /* setup by sys_umcg_ctrl() */ > + u32 umcg_flags; > clockid_t umcg_clock; > struct umcg_task __user *umcg_task; > > --- a/include/uapi/linux/umcg.h > +++ b/include/uapi/linux/umcg.h > @@ -133,11 +133,13 @@ struct umcg_task { > * @UMCG_CTL_REGISTER: register the current task as a UMCG task > * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task > * @UMCG_CTL_WORKER: register the current task as a UMCG worker > + * @UMCG_CTL_LAZY: don't wake server on runnable enqueue > */ > enum umcg_ctl_flag { > UMCG_CTL_REGISTER = 0x00001, > UMCG_CTL_UNREGISTER = 0x00002, > UMCG_CTL_WORKER = 0x10000, > + UMCG_CTL_LAZY = 0x20000, > }; > > #endif /* _UAPI_LINUX_UMCG_H */ > --- a/kernel/sched/umcg.c > +++ b/kernel/sched/umcg.c > @@ -416,6 +416,27 @@ static int umcg_enqueue_runnable(struct > } > > /* > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > + * so desired. Notable LAZY workers will not wake the server and rely on the > + * server to do pickup whenever it naturally runs next. > + * > + * Returns: > + * 0: success > + * -EFAULT > + */ > +static int umcg_enqueue_and_wake(struct task_struct *tsk, bool force) > +{ > + int ret = umcg_enqueue_runnable(tsk); > + if (ret) > + return ret; > + > + if (force || !(tsk->umcg_flags & UMCG_CTL_LAZY)) > + ret = umcg_wake_server(tsk); > + > + return ret; > +} Aah, this has a problem when the server is otherwise idle. I think we need that TF_IDLE thing for this too. Let me go write a test-case for all this. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 10:06 ` Peter Zijlstra 2021-12-15 13:03 ` Peter Zijlstra @ 2021-12-15 17:56 ` Peter Oskolkov 2021-12-15 18:18 ` Peter Zijlstra 2021-12-15 18:25 ` Peter Zijlstra 1 sibling, 2 replies; 40+ messages in thread From: Peter Oskolkov @ 2021-12-15 17:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote: > > On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > Hi, > > > > > > This is actually tested code; but still missing the SMP wake-to-idle machinery. > > > I still need to think about that. > > > > Thanks, Peter! > > > > At a first glance, your main patch does not look much smaller than > > mine, and I thought the whole point of re-doing it was to throw away > > extra features and make things smaller/simpler... > > Well, simpler was the goal. I didn't really focus on size much. It isn't > really big to begin with. > > But yes, it has 5 hooks now, 3 syscalls and lots of comments and all > that under 900 lines, not bad I'd say. My patchset had three hooks and two syscalls, and fewer new fields added to task_struct... And similarly around 900 lines on the kernel side in the main patch. So I am not sure why you believe that your approach is simpler, unless there was something fundamentally wrong with my approach. But tglx@ looked into it, and his remarks were more about comments and the commit message and smaller things at a function level, like an unneeded goto, than about the overall design... > > Also I think you wanted something like this? I'm not sure of the LAZY > name, but I can't seem to come up with anything saner atm. > [...] > /* > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > + * so desired. Notable LAZY workers will not wake the server and rely on the > + * server to do pickup whenever it naturally runs next. No, I never suggested we needed per-server runnable queues: in all my patchsets I had a single list of idle (runnable) workers. [...] From another message: >> Anyway, I'll test your patchset over the next week or so and let you >> know if anything really needed is missing (other than waking an idle >> server if there is one on a worker wakeup; this piece is definitely > needed). > Right, so the problem I'm having is that a single idle server ptr like > before can trivially miss waking annother idle server. I believe the approach I used in my patchset, suggested by Thierry Delisle, works. In short, there is a single idle server ptr for the kernel to work with. The userspace maintains a list of idle servers. If the ptr is NULL, the list is empty. When the kernel wakes the idle server it sees, the server reaps the runnable worker list and wakes another idle server from the userspace list, if available. This newly woken idle server repoints the ptr to itself, checks the runnable worker list, to avoid missing a woken worker, then goes to sleep. Why do you think this approach is not OK? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 17:56 ` Peter Oskolkov @ 2021-12-15 18:18 ` Peter Zijlstra 2021-12-15 19:49 ` Peter Oskolkov 2021-12-15 18:25 ` Peter Zijlstra 1 sibling, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 18:18 UTC (permalink / raw) To: Peter Oskolkov Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > > Right, so the problem I'm having is that a single idle server ptr like > > before can trivially miss waking annother idle server. > > I believe the approach I used in my patchset, suggested by Thierry > Delisle, works. > > In short, there is a single idle server ptr for the kernel to work > with. The userspace maintains a list of idle servers. If the ptr is > NULL, the list is empty. When the kernel wakes the idle server it > sees, the server reaps the runnable worker list and wakes another idle > server from the userspace list, if available. This newly woken idle > server repoints the ptr to itself, checks the runnable worker list, to > avoid missing a woken worker, then goes to sleep. > > Why do you think this approach is not OK? Suppose at least 4 servers, 2 idle, 2 working. Now, the first of the working servers (lets call it S0) gets a wakeup (say Ta), it finds the idle server (say S3) and consumes it, sticking Ta on S3 and kicking it alive. Concurrently and loosing the race the other working server (S1) gets a wake-up from Tb, like said, it lost, no idle server, so Tb goes on the queue of S1. So then S3 wakes, finds Ta and they live happily ever after. While S2 and Tb fail to meet one another and both linger in sadness. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 18:18 ` Peter Zijlstra @ 2021-12-15 19:49 ` Peter Oskolkov 2021-12-15 22:25 ` Peter Zijlstra 0 siblings, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2021-12-15 19:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 10:19 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > > > > Right, so the problem I'm having is that a single idle server ptr like > > > before can trivially miss waking annother idle server. > > > > I believe the approach I used in my patchset, suggested by Thierry > > Delisle, works. > > > > In short, there is a single idle server ptr for the kernel to work > > with. The userspace maintains a list of idle servers. If the ptr is > > NULL, the list is empty. When the kernel wakes the idle server it > > sees, the server reaps the runnable worker list and wakes another idle > > server from the userspace list, if available. This newly woken idle > > server repoints the ptr to itself, checks the runnable worker list, to > > avoid missing a woken worker, then goes to sleep. > > > > Why do you think this approach is not OK? > > Suppose at least 4 servers, 2 idle, 2 working. > > Now, the first of the working servers (lets call it S0) gets a wakeup > (say Ta), it finds the idle server (say S3) and consumes it, sticking Ta > on S3 and kicking it alive. TL;DR: our models are different here. In your model a single server can have a bunch of workers interacting with it; in my model only a single RUNNING worker is assigned to a server, which it wakes when it blocks. More details: "Working servers" cannot get wakeups, because a "working server" has a single RUNNING worker attached to it. When a worker blocks, it wakes its attached server and becomes a detached blocked worker (same is true if the worker is "preempted": it blocks and wakes its assigned server). Blocked workers upon wakeup do this, in order: - always add themselves to the runnable worker list (the list is shared among ALL servers, it is NOT per server); - wake a server pointed to by idle_server_ptr, if not NULL; - sleep, waiting for a wakeup from a server; Server S, upon becoming IDLE (no worker to run, or woken on idle server list) does this, in order, in userspace (simplified, see umcg_get_idle_worker() in https://lore.kernel.org/lkml/20211122211327.5931-5-posk@google.com/): - take a userspace (spin) lock (so the steps below are all within a single critical section): - compare_xchg(idle_server_ptr, NULL, S); - if failed, there is another server in idle_server_ptr, so S adds itself to the userspace idle server list, releases the lock, goes to sleep; - if succeeded: - check the runnable worker list; - if empty, release the lock, sleep; - if not empty: - get the list - xchg(idle_server_ptr, NULL) (either S removes itself, or a worker in the kernel does it first, does not matter); - release the lock; - wake server S1 on idle server list. S1 goes through all of these steps. The protocol above serializes the userspace dealing with the idle server ptr/list. Wakeups in the kernel will be caught if there are idle servers. Yes, the protocol in the userspace is complicated (more complicated than outlined above, as the reaped idle/runnable worker list from the kernel is added to the userspace idle/runnable worker list), but the kernel side is very simple. I've tested this interaction extensively, I'm reasonably sure that no worker wakeups are lost. > > Concurrently and loosing the race the other working server (S1) gets a > wake-up from Tb, like said, it lost, no idle server, so Tb goes on the > queue of S1. > > So then S3 wakes, finds Ta and they live happily ever after. > > While S2 and Tb fail to meet one another and both linger in sadness. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 19:49 ` Peter Oskolkov @ 2021-12-15 22:25 ` Peter Zijlstra 2021-12-15 23:26 ` Peter Oskolkov 0 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 22:25 UTC (permalink / raw) To: Peter Oskolkov Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 11:49:51AM -0800, Peter Oskolkov wrote: > TL;DR: our models are different here. In your model a single server > can have a bunch of workers interacting with it; in my model only a > single RUNNING worker is assigned to a server, which it wakes when it > blocks. So part of the problem is that none of that was evident from the code. It is also completely different from the scheduler code it lives in, making it double confusing. After having read the code, I still had no clue what so ever how it was supposed to be used. Which is where my reverse engineering started :/ > More details: > > "Working servers" cannot get wakeups, because a "working server" has a > single RUNNING worker attached to it. When a worker blocks, it wakes > its attached server and becomes a detached blocked worker (same is > true if the worker is "preempted": it blocks and wakes its assigned > server). But who would do the preemption if the server isn't allowed to run? > Blocked workers upon wakeup do this, in order: > > - always add themselves to the runnable worker list (the list is > shared among ALL servers, it is NOT per server); That seems like a scalability issue. And, as said, it is completely alien when compared to the way Linux itself does scheduling. > - wake a server pointed to by idle_server_ptr, if not NULL; > - sleep, waiting for a wakeup from a server; > > Server S, upon becoming IDLE (no worker to run, or woken on idle > server list) does this, in order, in userspace (simplified, see > umcg_get_idle_worker() in > https://lore.kernel.org/lkml/20211122211327.5931-5-posk@google.com/): > - take a userspace (spin) lock (so the steps below are all within a > single critical section): Don't ever suggest userspace spinlocks, they're horrible crap. > - compare_xchg(idle_server_ptr, NULL, S); > - if failed, there is another server in idle_server_ptr, so S adds > itself to the userspace idle server list, releases the lock, goes to > sleep; > - if succeeded: > - check the runnable worker list; > - if empty, release the lock, sleep; > - if not empty: > - get the list > - xchg(idle_server_ptr, NULL) (either S removes itself, or > a worker in the kernel does it first, does not matter); > - release the lock; > - wake server S1 on idle server list. S1 goes through all > of these steps. > > The protocol above serializes the userspace dealing with the idle > server ptr/list. Wakeups in the kernel will be caught if there are > idle servers. Yes, the protocol in the userspace is complicated (more > complicated than outlined above, as the reaped idle/runnable worker > list from the kernel is added to the userspace idle/runnable worker > list), but the kernel side is very simple. I've tested this > interaction extensively, I'm reasonably sure that no worker wakeups > are lost. Sure, but also seems somewhat congestion prone :/ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 22:25 ` Peter Zijlstra @ 2021-12-15 23:26 ` Peter Oskolkov 2021-12-16 13:23 ` Thomas Gleixner 0 siblings, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2021-12-15 23:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 2:25 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Dec 15, 2021 at 11:49:51AM -0800, Peter Oskolkov wrote: > > > TL;DR: our models are different here. In your model a single server > > can have a bunch of workers interacting with it; in my model only a > > single RUNNING worker is assigned to a server, which it wakes when it > > blocks. > > So part of the problem is that none of that was evident from the code. > It is also completely different from the scheduler code it lives in, > making it double confusing. > > After having read the code, I still had no clue what so ever how it was > supposed to be used. Which is where my reverse engineering started :/ I posted a doc patch: https://lore.kernel.org/lkml/20211122211327.5931-6-posk@google.com/ a lib patch with userspace code: https://lore.kernel.org/lkml/20211122211327.5931-5-posk@google.com/ and a doc patch for the lib/userspace code: https://lore.kernel.org/lkml/20211122211327.5931-7-posk@google.com/ I spent at least two weeks polishing the lib patch and the docs, much more if previous patchsets are to be taken into account. Yes, they are confusing, and most likely answer all of the wrong questions, but I did try to make my approach as clear as possible... I apologize if that was not very successful... > > > More details: > > > > "Working servers" cannot get wakeups, because a "working server" has a > > single RUNNING worker attached to it. When a worker blocks, it wakes > > its attached server and becomes a detached blocked worker (same is > > true if the worker is "preempted": it blocks and wakes its assigned > > server). > > But who would do the preemption if the server isn't allowed to run? > > > Blocked workers upon wakeup do this, in order: > > > > - always add themselves to the runnable worker list (the list is > > shared among ALL servers, it is NOT per server); > > That seems like a scalability issue. And, as said, it is completely > alien when compared to the way Linux itself does scheduling. > > > - wake a server pointed to by idle_server_ptr, if not NULL; > > - sleep, waiting for a wakeup from a server; > > > > Server S, upon becoming IDLE (no worker to run, or woken on idle > > server list) does this, in order, in userspace (simplified, see > > umcg_get_idle_worker() in > > https://lore.kernel.org/lkml/20211122211327.5931-5-posk@google.com/): > > - take a userspace (spin) lock (so the steps below are all within a > > single critical section): > > Don't ever suggest userspace spinlocks, they're horrible crap. This can easily be a mutex, not really important (although for very short critical sections with only memory reads/writes, like here, spin locks often perform better, in our experience). > > > - compare_xchg(idle_server_ptr, NULL, S); > > - if failed, there is another server in idle_server_ptr, so S adds > > itself to the userspace idle server list, releases the lock, goes to > > sleep; > > - if succeeded: > > - check the runnable worker list; > > - if empty, release the lock, sleep; > > - if not empty: > > - get the list > > - xchg(idle_server_ptr, NULL) (either S removes itself, or > > a worker in the kernel does it first, does not matter); > > - release the lock; > > - wake server S1 on idle server list. S1 goes through all > > of these steps. > > > > The protocol above serializes the userspace dealing with the idle > > server ptr/list. Wakeups in the kernel will be caught if there are > > idle servers. Yes, the protocol in the userspace is complicated (more > > complicated than outlined above, as the reaped idle/runnable worker > > list from the kernel is added to the userspace idle/runnable worker > > list), but the kernel side is very simple. I've tested this > > interaction extensively, I'm reasonably sure that no worker wakeups > > are lost. > > Sure, but also seems somewhat congestion prone :/ The whole critical section under the loc is just several memory read/write operations, so very short. And workers are removed from the kernel's list of runnable/woken workers all at once; and the server processing the runnable worker list knows how many of them are now available to run, so the appropriate number of idle servers can be woken (not yet implemented in my lib patch). So yes, this can be a bottleneck, but there are ways to make it less and less likely (by making the userspace more complicated; but this is not a concern). ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 23:26 ` Peter Oskolkov @ 2021-12-16 13:23 ` Thomas Gleixner 0 siblings, 0 replies; 40+ messages in thread From: Thomas Gleixner @ 2021-12-16 13:23 UTC (permalink / raw) To: Peter Oskolkov, Peter Zijlstra Cc: Peter Oskolkov, Ingo Molnar, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle Peter, On Wed, Dec 15 2021 at 15:26, Peter Oskolkov wrote: > On Wed, Dec 15, 2021 at 2:25 PM Peter Zijlstra <peterz@infradead.org> wrote: >> > - take a userspace (spin) lock (so the steps below are all within a >> > single critical section): >> >> Don't ever suggest userspace spinlocks, they're horrible crap. > > This can easily be a mutex, not really important (although for very > short critical sections with only memory reads/writes, like here, spin > locks often perform better, in our experience). Performance may be better, but user space spinlocks have a fundamental problem: They are prone to live locks. That's completely independent of the length of the critical section, it even can be empty. There are ways to avoid that, but that needs a very careful design on the application/library level and at the system configuration level (priorities, affinities ...). And even then, there are trival ways to break that, e.g. via CPU hotplug. So no, for something of general use, they are a complete NONO. People who think they know what they are doing have the source and can replace them if they feel the need to do so. Thanks, tglx ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 17:56 ` Peter Oskolkov 2021-12-15 18:18 ` Peter Zijlstra @ 2021-12-15 18:25 ` Peter Zijlstra 2021-12-15 21:04 ` Peter Oskolkov 1 sibling, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 18:25 UTC (permalink / raw) To: Peter Oskolkov Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > /* > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > > + * so desired. Notable LAZY workers will not wake the server and rely on the > > + * server to do pickup whenever it naturally runs next. > > No, I never suggested we needed per-server runnable queues: in all my > patchsets I had a single list of idle (runnable) workers. This is not about the idle servers.. So without the LAZY thing on, a previously blocked task hitting sys_exit will enqueue itself on the runnable list and wake the server for pickup. IIRC you didn't like the server waking while it was still running another task, but instead preferred to have it pick up the newly enqueued task when next it ran. LAZY enables that.. *however* it does need to wake the server when it is idle, otherwise they'll all sit there waiting for one another. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 18:25 ` Peter Zijlstra @ 2021-12-15 21:04 ` Peter Oskolkov 2021-12-15 23:16 ` Peter Zijlstra 0 siblings, 1 reply; 40+ messages in thread From: Peter Oskolkov @ 2021-12-15 21:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > /* > > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > > > + * so desired. Notable LAZY workers will not wake the server and rely on the > > > + * server to do pickup whenever it naturally runs next. > > > > No, I never suggested we needed per-server runnable queues: in all my > > patchsets I had a single list of idle (runnable) workers. > > This is not about the idle servers.. > > So without the LAZY thing on, a previously blocked task hitting sys_exit > will enqueue itself on the runnable list and wake the server for pickup. How can a blocked task hit sys_exit()? Shouldn't it be RUNNING? Anyway, servers and workers are supposed to unregister before exiting, so if they call sys_exit() they break the agreement; in my patch I just clear all umcg-related state and proceed, without waking the server: the user broke the protocol, let them figure out what happened: +static void umcg_clear_task(struct task_struct *tsk) +{ + /* + * This is either called for the current task, or for a newly forked + * task that is not yet running, so we don't need strict atomicity + * below. + */ + if (tsk->umcg_task) { + WRITE_ONCE(tsk->umcg_task, NULL); + + /* These can be simple writes - see the comment above. */ + tsk->pinned_umcg_worker_page = NULL; + tsk->pinned_umcg_server_page = NULL; + tsk->flags &= ~PF_UMCG_WORKER; + } +} + +/* Called both by normally (unregister) and abnormally exiting workers. */ +void umcg_handle_exiting_worker(void) +{ + umcg_unpin_pages(); + umcg_clear_task(current); +} > > IIRC you didn't like the server waking while it was still running > another task, but instead preferred to have it pick up the newly > enqueued task when next it ran. Yes, this is the model I have, as I outlined in another email. I understand that having queues per-CPU/per-server is how it is done in the kernel, both for historical reasons (before multiprocessing there was a single queue/cpu) and for throughput (per-cpu runqueues are individually faster than a global one). However, this model is known to lag in presence of load spikes (long per-cpu queues with some CPUs idle), and is not really easy to work with given the use cases this whole userspace scheduling effort is trying to address: multiple priorities and work isolation: these are easy to address directly with a scheduler that has a global view rather than multiple per-cpu/per-server schedulers/queues that try to coordinate. I can even claim (without proof, just a hunch, based on how I would code this) that strict scheduling policies around priority and isolation (e.g. never run work item A if work item B becomes runnable, unless work item A is already running) cannot be enforced without a global scheduler, so per-cpu/per-server queues do not really fit the use case here... > > LAZY enables that.. *however* it does need to wake the server when it is > idle, otherwise they'll all sit there waiting for one another. If all servers are busy running workers, then it is not up to the kernel to "preempt" them in my model: the userspace can set up another thread/task to preempt a misbehaving worker, which will wake the server attached to it. But in practice there are always workers blocking in the kernel, which wakes their servers, which then reap the woken/runnable workers list, so well-behaving code does not need this. Yes, sometimes the code does not behave well, e.g. a worker grabs a spinlock, blocks in the kernel, its server runs another worker that starts spinning on the spinlock; but this is fixable by making the spinlock aware of our stuff: either the worker who got the lock is marked as LOCKED and so does not release its server (one of the reasons I have this flag), or the lock itself becomes sleepable (e.g. after spinning a bit it calls into a futex wait). And so we need to figure out this high-level thing first: do we go with the per-server worker queues/lists, or do we go with the approach I use in my patchset? It seems to me that the kernel-side code in my patchset is not more complicated than your patchset is shaping up to be, and some things are actually easier to accomplish, like having a single idle_server_ptr vs this LAZY and/or server "preemption" behavior that you have. Again, I'm OK with having it your way if all needed features are covered, but I think we should be explicit about why per-server/per-cpu model is chosen vs the one I proposed, especially as it seems the kernel side code is not really simpler in the end. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 21:04 ` Peter Oskolkov @ 2021-12-15 23:16 ` Peter Zijlstra 2021-12-15 23:31 ` Peter Oskolkov 0 siblings, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 23:16 UTC (permalink / raw) To: Peter Oskolkov Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 01:04:33PM -0800, Peter Oskolkov wrote: > On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > > > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > /* > > > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > > > > + * so desired. Notable LAZY workers will not wake the server and rely on the > > > > + * server to do pickup whenever it naturally runs next. > > > > > > No, I never suggested we needed per-server runnable queues: in all my > > > patchsets I had a single list of idle (runnable) workers. > > > > This is not about the idle servers.. > > > > So without the LAZY thing on, a previously blocked task hitting sys_exit > > will enqueue itself on the runnable list and wake the server for pickup. > > How can a blocked task hit sys_exit()? Shouldn't it be RUNNING? Task was RUNNING, hits schedule() after passing through sys_enter(). this marks it BLOCKED. Task wakes again and proceeds to sys_exit(), at which point it's marked RUNNABLE and put on the runnable list. After which it'll kick the server to process said list. > Anyway, servers and workers are supposed to unregister before exiting, > so if they call sys_exit() they break the agreement; in my patch I > just clear all umcg-related state and proceed, without waking the > server: the user broke the protocol, let them figure out what > happened: No violation of anything there. The time between returning from schedule() and sys_exit() is unmanaged. sys_exit() is the spot where we regain control. > > IIRC you didn't like the server waking while it was still running > > another task, but instead preferred to have it pick up the newly > > enqueued task when next it ran. > > Yes, this is the model I have, as I outlined in another email. I > understand that having queues per-CPU/per-server is how it is done in > the kernel, both for historical reasons (before multiprocessing there > was a single queue/cpu) and for throughput (per-cpu runqueues are > individually faster than a global one). However, this model is known > to lag in presence of load spikes (long per-cpu queues with some CPUs > idle), and is not really easy to work with given the use cases this > whole userspace scheduling effort is trying to address: Well, that's *your* use-case. I'm fairly sure there's more people that want to use this thing. > multiple > priorities and work isolation: these are easy to address directly with > a scheduler that has a global view rather than multiple > per-cpu/per-server schedulers/queues that try to coordinate. You can trivially create this, even if the underlying thing is per-server. Simply have a lock and shared data structure between the servers. Even in the kernel, it should be mostly trivial to create a global policy. The only tricky bit (in the kernel) is the whole affinity muck, but userspace doesn't *need* to do even that. > > LAZY enables that.. *however* it does need to wake the server when it is > > idle, otherwise they'll all sit there waiting for one another. > > If all servers are busy running workers, then it is not up to the > kernel to "preempt" them in my model: the userspace can set up another > thread/task to preempt a misbehaving worker, which will wake the > server attached to it. So the way I'm seeing things is that the server *is* the 'CPU'. A UP machine cannot rely on another CPU to make preemption happen. Also, preemption is very much not about misbehaviour. Wakeup can cause a preemption event if the woken task is deemed higher priority than the current running one for example. And time based preemption is definitely also a thing wrt resource distribution. > But in practice there are always workers > blocking in the kernel, which wakes their servers, which then reap the > woken/runnable workers list, so well-behaving code does not need this. This seems to discount pure computational workloads. > And so we need to figure out this high-level thing first: do we go > with the per-server worker queues/lists, or do we go with the approach > I use in my patchset? It seems to me that the kernel-side code in my > patchset is not more complicated than your patchset is shaping up to > be, and some things are actually easier to accomplish, like having a > single idle_server_ptr vs this LAZY and/or server "preemption" > behavior that you have. > > Again, I'm OK with having it your way if all needed features are > covered, but I think we should be explicit about why > per-server/per-cpu model is chosen vs the one I proposed, especially > as it seems the kernel side code is not really simpler in the end. So I went with a UP first approach. I made single server preemption driven scheduling work first (both tick and wakeup-preemption are supported). The whole LAZY thing is only meant to supress some of that (notably wakeup preemption), but you're right in that it's not really nice. I got it working, but I'm not particularly happy with it either. Having the sys_enter/sys_exit hooks also made the page pins short lived, and signals much simpler to handle. You're destroying signals IIUC. So I see no fundamental reason why userspace cannot do something like: struct umcg_task *current = NULL; for (;;) { self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; runnable_ptr = (void *)__atomic_exchange_n(&self->runnable_workers_ptr, NULL, __ATOMIC_SEQ_CST); pthread_mutex_lock(&global_queue.lock); while (runnable_ptr) { next = (void *)runnable_ptr->runnable_workers_ptr; enqueue_task(&global_queue, runnable_ptr); runnable_ptr = next; } /* complicated bit about current already running goes here */ current = pick_task(&global_queue); self->next_tid = current ? current->tid : 0; unlock: pthread_mutex_unlock(&global_queue.lock); ret = sys_umcg_wait(0, 0); pthread_mutex_lock(&global_queue.lock); /* umcg_wait() didn't switch, make sure to return the task */ if (self->next_tid) { enqueue_task(&global_queue, current); current = NULL; } pthread_mutex_unlock(&global_queue.lock); // do something with @ret } to get global scheduling and all the contention^Wgoodness related to it. Except, of course, it's more complicated, but I think the idea's clear enough. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 23:16 ` Peter Zijlstra @ 2021-12-15 23:31 ` Peter Oskolkov 0 siblings, 0 replies; 40+ messages in thread From: Peter Oskolkov @ 2021-12-15 23:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 3:16 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Dec 15, 2021 at 01:04:33PM -0800, Peter Oskolkov wrote: > > On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > > > > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > /* > > > > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > > > > > + * so desired. Notable LAZY workers will not wake the server and rely on the > > > > > + * server to do pickup whenever it naturally runs next. > > > > > > > > No, I never suggested we needed per-server runnable queues: in all my > > > > patchsets I had a single list of idle (runnable) workers. > > > > > > This is not about the idle servers.. > > > > > > So without the LAZY thing on, a previously blocked task hitting sys_exit > > > will enqueue itself on the runnable list and wake the server for pickup. > > > > How can a blocked task hit sys_exit()? Shouldn't it be RUNNING? > > Task was RUNNING, hits schedule() after passing through sys_enter(). > this marks it BLOCKED. Task wakes again and proceeds to sys_exit(), at > which point it's marked RUNNABLE and put on the runnable list. After > which it'll kick the server to process said list. > Ah, you are talking about sys_exit hook; sorry, I thought you talked about the exit() syscall. [...] > > Well, that's *your* use-case. I'm fairly sure there's more people that > want to use this thing. > > > multiple > > priorities and work isolation: these are easy to address directly with > > a scheduler that has a global view rather than multiple > > per-cpu/per-server schedulers/queues that try to coordinate. > > You can trivially create this, even if the underlying thing is > per-server. Simply have a lock and shared data structure between the > servers. > > Even in the kernel, it should be mostly trivial to create a global > policy. The only tricky bit (in the kernel) is the whole affinity muck, > but userspace doesn't *need* to do even that. > > > > LAZY enables that.. *however* it does need to wake the server when it is > > > idle, otherwise they'll all sit there waiting for one another. > > > > If all servers are busy running workers, then it is not up to the > > kernel to "preempt" them in my model: the userspace can set up another > > thread/task to preempt a misbehaving worker, which will wake the > > server attached to it. > > So the way I'm seeing things is that the server *is* the 'CPU'. A UP > machine cannot rely on another CPU to make preemption happen. > > Also, preemption is very much not about misbehaviour. Wakeup can cause a > preemption event if the woken task is deemed higher priority than the > current running one for example. > > And time based preemption is definitely also a thing wrt resource > distribution. > > > But in practice there are always workers > > blocking in the kernel, which wakes their servers, which then reap the > > woken/runnable workers list, so well-behaving code does not need this. > > This seems to discount pure computational workloads. > > > And so we need to figure out this high-level thing first: do we go > > with the per-server worker queues/lists, or do we go with the approach > > I use in my patchset? It seems to me that the kernel-side code in my > > patchset is not more complicated than your patchset is shaping up to > > be, and some things are actually easier to accomplish, like having a > > single idle_server_ptr vs this LAZY and/or server "preemption" > > behavior that you have. > > > > Again, I'm OK with having it your way if all needed features are > > covered, but I think we should be explicit about why > > per-server/per-cpu model is chosen vs the one I proposed, especially > > as it seems the kernel side code is not really simpler in the end. > > So I went with a UP first approach. I made single server preemption > driven scheduling work first (both tick and wakeup-preemption are > supported). I agree that the UP approach is better than the LAZY one if we have per-server/per-cpu worker queues. > > The whole LAZY thing is only meant to supress some of that (notably > wakeup preemption), but you're right in that it's not really nice. I got > it working, but I'm not particularly happy with it either. > > Having the sys_enter/sys_exit hooks also made the page pins short lived, > and signals much simpler to handle. You're destroying signals IIUC. > > > So I see no fundamental reason why userspace cannot do something like: > > struct umcg_task *current = NULL; > > for (;;) { > self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT; > > runnable_ptr = (void *)__atomic_exchange_n(&self->runnable_workers_ptr, > NULL, __ATOMIC_SEQ_CST); > > pthread_mutex_lock(&global_queue.lock); > while (runnable_ptr) { > next = (void *)runnable_ptr->runnable_workers_ptr; > enqueue_task(&global_queue, runnable_ptr); > runnable_ptr = next; > } > > /* complicated bit about current already running goes here */ > > current = pick_task(&global_queue); > self->next_tid = current ? current->tid : 0; > unlock: > pthread_mutex_unlock(&global_queue.lock); > > ret = sys_umcg_wait(0, 0); > > pthread_mutex_lock(&global_queue.lock); > /* umcg_wait() didn't switch, make sure to return the task */ > if (self->next_tid) { > enqueue_task(&global_queue, current); > current = NULL; > } > pthread_mutex_unlock(&global_queue.lock); > > // do something with @ret > } > > to get global scheduling and all the contention^Wgoodness related to it. > Except, of course, it's more complicated, but I think the idea's clear > enough. Let me spend some time and see if I can make all of this work together beyond simple tests. With the upcoming holidays and some other things I am busy with, this may take more than a week, I'm afraid... ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 3:46 ` Peter Oskolkov 2021-12-15 10:06 ` Peter Zijlstra @ 2021-12-15 10:44 ` Peter Zijlstra 2021-12-15 13:49 ` Matthew Wilcox 1 sibling, 1 reply; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 10:44 UTC (permalink / raw) To: Peter Oskolkov Cc: Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote: > Anyway, I'll test your patchset over the next week or so and let you > know if anything really needed is missing (other than waking an idle > server if there is one on a worker wakeup; this piece is definitely > needed). Right, so the problem I'm having is that a single idle server ptr like before can trivially miss waking annother idle server. Suppose: umcg::idle_server_tid_ptr Then the enqueue_and_wake() thing from the last patch would: idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0); to consume the tid, and then use that to enqueue and wake. But what if a second wakeup happens right after that? There might be a second idle server, but we'll never find it, because userspace hasn't had time to update the field again. Alternatively, we do a linked list of servers, but then every such wakeup needs to iterate the whole list, looking for one that has UMCG_TF_IDLE set, or something like that, but that lookup is bad for performance. So I'm really not sure what way to go yet. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 10:44 ` Peter Zijlstra @ 2021-12-15 13:49 ` Matthew Wilcox 2021-12-15 17:54 ` Peter Zijlstra 0 siblings, 1 reply; 40+ messages in thread From: Matthew Wilcox @ 2021-12-15 13:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 11:44:49AM +0100, Peter Zijlstra wrote: > On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote: > > > Anyway, I'll test your patchset over the next week or so and let you > > know if anything really needed is missing (other than waking an idle > > server if there is one on a worker wakeup; this piece is definitely > > needed). > > Right, so the problem I'm having is that a single idle server ptr like > before can trivially miss waking annother idle server. > > Suppose: > > umcg::idle_server_tid_ptr > > Then the enqueue_and_wake() thing from the last patch would: > > idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0); > > to consume the tid, and then use that to enqueue and wake. But what if a > second wakeup happens right after that? There might be a second idle > server, but we'll never find it, because userspace hasn't had time to > update the field again. > > Alternatively, we do a linked list of servers, but then every such > wakeup needs to iterate the whole list, looking for one that has > UMCG_TF_IDLE set, or something like that, but that lookup is bad for > performance. > > So I'm really not sure what way to go yet. 1. Linked lists are fugly and bad for the CPU. 2. I'm not sure how big the 'N' in 'M:N' is supposed to be. Might be one per hardware thread? So it could be hundreds-to-thousands, depending on the scale of system. 3. The interface between user-kernel could be an array of idle tids, maybe 16 entries long (16 * 4 = 64 bytes, just one cacheline). As a server finishes work, it looks for a 0 tid in the batch and stores its tid in the slot (cmpxchg, I guess, since the array will be shared between processes). If there are no free slots in the array, then we definitely have 16 threads already waiting for work, so it can park itself in whatever data structure userspace wants to use to manage idle servers. It's up to userspace to decide when to repopulate the array of available servers from its data structure of idle servers. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups 2021-12-15 13:49 ` Matthew Wilcox @ 2021-12-15 17:54 ` Peter Zijlstra 0 siblings, 0 replies; 40+ messages in thread From: Peter Zijlstra @ 2021-12-15 17:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Peter Oskolkov, Ingo Molnar, Thomas Gleixner, juri.lelli, Vincent Guittot, dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot, Linux Kernel Mailing List, Linux Memory Management List, linux-api, x86, Paul Turner, Peter Oskolkov, Andrei Vagin, Jann Horn, Thierry Delisle On Wed, Dec 15, 2021 at 01:49:28PM +0000, Matthew Wilcox wrote: > On Wed, Dec 15, 2021 at 11:44:49AM +0100, Peter Zijlstra wrote: > > On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote: > > > > > Anyway, I'll test your patchset over the next week or so and let you > > > know if anything really needed is missing (other than waking an idle > > > server if there is one on a worker wakeup; this piece is definitely > > > needed). > > > > Right, so the problem I'm having is that a single idle server ptr like > > before can trivially miss waking annother idle server. > > > > Suppose: > > > > umcg::idle_server_tid_ptr > > > > Then the enqueue_and_wake() thing from the last patch would: > > > > idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0); > > > > to consume the tid, and then use that to enqueue and wake. But what if a > > second wakeup happens right after that? There might be a second idle > > server, but we'll never find it, because userspace hasn't had time to > > update the field again. > > > > Alternatively, we do a linked list of servers, but then every such > > wakeup needs to iterate the whole list, looking for one that has > > UMCG_TF_IDLE set, or something like that, but that lookup is bad for > > performance. > > > > So I'm really not sure what way to go yet. > > 1. Linked lists are fugly and bad for the CPU. Absolutely.. although a stack might work, except for that ABA issue (and contention). > 2. I'm not sure how big the 'N' in 'M:N' is supposed to be. Might be > one per hardware thread? So it could be hundreds-to-thousands, > depending on the scale of system. Typically yes, one server task per hardware thread. Now, I'm also fairly sure you don't want excessive cross-node traffic for this stuff, so that puts a limit on things as well. > 3. The interface between user-kernel could be an array of idle tids, > maybe 16 entries long (16 * 4 = 64 bytes, just one cacheline). As a > server finishes work, it looks for a 0 tid in the batch and stores > its tid in the slot (cmpxchg, I guess, since the array will be shared > between processes). If there are no free slots in the array, then we > definitely have 16 threads already waiting for work, so it can park itself > in whatever data structure userspace wants to use to manage idle servers. > It's up to userspace to decide when to repopulate the array of available > servers from its data structure of idle servers. Right, a tid array might work. Could even have userspace specify the length, then it can do the trade-offs all on it's own. Either a fixed location for each server and a larger array, or clever things, whatever they want. I suppose I'll code up the variable length array, we have space for that. ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2022-01-20 10:37 UTC | newest] Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-12-14 20:44 [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 1/3] sched/umcg: add WF_CURRENT_CPU and externise ttwu Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 2/3] x86/uaccess: Implement unsafe_try_cmpxchg_user() Peter Zijlstra 2021-12-20 17:30 ` Sean Christopherson 2021-12-21 11:17 ` Peter Zijlstra 2021-12-14 20:44 ` [RFC][PATCH 3/3] sched: User Mode Concurency Groups Peter Zijlstra 2021-12-21 17:19 ` Peter Oskolkov 2022-01-14 14:09 ` Peter Zijlstra 2022-01-14 15:16 ` Peter Zijlstra 2022-01-14 17:15 ` Peter Zijlstra 2022-01-17 11:35 ` Peter Zijlstra 2022-01-17 12:22 ` Peter Zijlstra 2022-01-17 12:12 ` Peter Zijlstra 2022-01-18 10:09 ` Peter Zijlstra 2022-01-18 18:19 ` Peter Oskolkov 2022-01-19 8:47 ` Peter Zijlstra 2022-01-19 17:33 ` Peter Oskolkov 2022-01-19 8:51 ` Peter Zijlstra 2022-01-19 8:59 ` Peter Zijlstra 2022-01-19 17:52 ` Peter Oskolkov 2022-01-20 10:37 ` Peter Zijlstra 2022-01-17 13:04 ` Peter Zijlstra 2021-12-24 11:27 ` Peter Zijlstra 2021-12-14 21:00 ` [RFC][PATCH 0/3] sched: User Managed Concurrency Groups Peter Zijlstra 2021-12-15 3:46 ` Peter Oskolkov 2021-12-15 10:06 ` Peter Zijlstra 2021-12-15 13:03 ` Peter Zijlstra 2021-12-15 17:56 ` Peter Oskolkov 2021-12-15 18:18 ` Peter Zijlstra 2021-12-15 19:49 ` Peter Oskolkov 2021-12-15 22:25 ` Peter Zijlstra 2021-12-15 23:26 ` Peter Oskolkov 2021-12-16 13:23 ` Thomas Gleixner 2021-12-15 18:25 ` Peter Zijlstra 2021-12-15 21:04 ` Peter Oskolkov 2021-12-15 23:16 ` Peter Zijlstra 2021-12-15 23:31 ` Peter Oskolkov 2021-12-15 10:44 ` Peter Zijlstra 2021-12-15 13:49 ` Matthew Wilcox 2021-12-15 17:54 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).