* [PATCH for 5.9 0/3] FUTEX_SWAP (tip/locking/core)
@ 2020-07-22 23:45 Peter Oskolkov

From: Peter Oskolkov <posk@google.com>
To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
    Peter Zijlstra, Darren Hart, Vincent Guittot
Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

As Paul Turner presented at LPC in 2013 ...

- pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
- video: https://www.youtube.com/watch?v=KXuZi9aeGTw

... Google has developed an M:N userspace threading subsystem backed by the
Google-private SwitchTo Linux kernel API (page 17 in the pdf referenced
above). This subsystem provides latency-sensitive services at Google with
fine-grained user-space control/scheduling over what is running when, and it
is used widely internally (under the names "schedulers" or "fibers").

This patchset is the first step to open-source this work. As explained
in the linked pdf and video, the SwitchTo API has three core operations: wait,
resume, and swap (= switch). So this patchset adds a FUTEX_SWAP operation
that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
on top of which user-space threading libraries can be built.

Another common use case for FUTEX_SWAP is RPC-style message passing between
tasks: task/thread T1 prepares a message, wakes T2 to work on it, and waits
for the results; when T2 is done, it wakes T1 and waits for more work to
arrive. Currently the simplest way to implement this is:

  a. T1: futex-wake T2, futex-wait
  b. T2: wakes, does what it has been woken to do
  c. T2: futex-wake T1, futex-wait

With FUTEX_SWAP, steps a and c above can each be reduced to one futex
operation that runs 5-10 times faster.
Patches in this patchset:

Patch 1: introduce the FUTEX_SWAP futex operation that, internally, does
         wake + wait. The purpose of this patch is to work out the API.
Patch 2: a first rough attempt to make FUTEX_SWAP faster than what
         wake + wait can do.
Patch 3: a selftest that can also be used to benchmark FUTEX_SWAP vs
         FUTEX_WAKE + FUTEX_WAIT.

Peter Oskolkov (3):
  futex: introduce FUTEX_SWAP operation
  futex/sched: add wake_up_process_prefer_current_cpu, use in FUTEX_SWAP
  selftests/futex: add futex_swap selftest

 include/linux/sched.h                         |   1 +
 include/uapi/linux/futex.h                    |   2 +
 kernel/futex.c                                |  89 ++++++--
 kernel/sched/core.c                           |   5 +
 kernel/sched/fair.c                           |   3 +
 kernel/sched/sched.h                          |   1 +
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   1 +
 .../selftests/futex/functional/futex_swap.c   | 209 ++++++++++++++++++
 .../selftests/futex/include/futextest.h       |  19 ++
 10 files changed, 315 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_swap.c

--
2.25.1
* [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-22 23:45 Peter Oskolkov

From: Peter Oskolkov <posk@google.com>
To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
    Peter Zijlstra, Darren Hart, Vincent Guittot
Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

As Paul Turner presented at LPC in 2013 ...

- pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
- video: https://www.youtube.com/watch?v=KXuZi9aeGTw

... Google has developed an M:N userspace threading subsystem backed by the
Google-private SwitchTo Linux kernel API (page 17 in the pdf referenced
above). This subsystem provides latency-sensitive services at Google with
fine-grained user-space control/scheduling over what is running when, and it
is used widely internally (under the names "schedulers" or "fibers").

This patchset is the first step to open-source this work. As explained
in the linked pdf and video, the SwitchTo API has three core operations: wait,
resume, and swap (= switch). So this patchset adds a FUTEX_SWAP operation
that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
on top of which user-space threading libraries can be built.

Another common use case for FUTEX_SWAP is RPC-style message passing between
tasks: task/thread T1 prepares a message, wakes T2 to work on it, and waits
for the results; when T2 is done, it wakes T1 and waits for more work to
arrive. Currently the simplest way to implement this is:

  a. T1: futex-wake T2, futex-wait
  b. T2: wakes, does what it has been woken to do
  c. T2: futex-wake T1, futex-wait

With FUTEX_SWAP, steps a and c above can each be reduced to one futex
operation that runs 5-10 times faster.

Patches in this patchset:

Patch 1: (this patch) introduce the FUTEX_SWAP futex operation that,
         internally, does wake + wait. The purpose of this patch is to
         work out the API.
Patch 2: a first rough attempt to make FUTEX_SWAP faster than what
         wake + wait can do.
Patch 3: a selftest that can also be used to benchmark FUTEX_SWAP vs
         FUTEX_WAKE + FUTEX_WAIT.

Tested: see patch 3 in this patchset.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/futex.h |  2 +
 kernel/futex.c             | 90 +++++++++++++++++++++++++++++++-------
 2 files changed, 76 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index a89eb0accd5e..c1d151d97dea 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -21,6 +21,7 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_SWAP		13
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -40,6 +41,7 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_SWAP_PRIVATE	(FUTEX_SWAP | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index 4616d4ad609d..f077168a4410 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1162,7 +1162,7 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
 *   tsk->futex_state =		} else {
 *	  FUTEX_STATE_DEAD;	   if (tsk->futex_state !=
 *					FUTEX_STATE_DEAD)
- *					return -EAGAIN;
+ *				      return -EAGAIN;
 *				   return -ESRCH; <--- FAIL
 *				}
@@ -1574,16 +1574,16 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 }
 
 /*
- * Wake up waiters matching bitset queued on this futex (uaddr).
+ * Prepare wake queue matching bitset queued on this futex (uaddr).
  */
 static int
-futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+prepare_wake_q(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset,
+	       struct wake_q_head *wake_q)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
-	DEFINE_WAKE_Q(wake_q);
 
 	if (!bitset)
 		return -EINVAL;
@@ -1611,14 +1611,28 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 			if (!(this->bitset & bitset))
 				continue;
 
-			mark_wake_futex(&wake_q, this);
+			mark_wake_futex(wake_q, this);
 			if (++ret >= nr_wake)
 				break;
 		}
 	}
 
 	spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * Wake up waiters matching bitset queued on this futex (uaddr).
+ */
+static int
+futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+{
+	int ret;
+	DEFINE_WAKE_Q(wake_q);
+
+	ret = prepare_wake_q(uaddr, flags, nr_wake, bitset, &wake_q);
 	wake_up_q(&wake_q);
+
 	return ret;
 }
 
@@ -2557,9 +2571,12 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
  * @hb:		the futex hash bucket, must be locked by the caller
  * @q:		the futex_q to queue up on
  * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
+ * @next:	if present, wake next and hint to the scheduler that we'd
+ *		prefer to execute it locally.
  */
 static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
-				struct hrtimer_sleeper *timeout)
+				struct hrtimer_sleeper *timeout,
+				struct task_struct *next)
 {
 	/*
 	 * The task state is guaranteed to be set before another task can
@@ -2584,10 +2601,26 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 		 * flagged for rescheduling. Only call schedule if there
 		 * is no timeout, or if it has yet to expire.
 		 */
-		if (!timeout || timeout->task)
+		if (!timeout || timeout->task) {
+			if (next) {
+				/*
+				 * wake_up_process() below will be replaced
+				 * in the next patch with
+				 * wake_up_process_prefer_current_cpu().
+				 */
+				wake_up_process(next);
+				put_task_struct(next);
+				next = NULL;
+			}
 			freezable_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);
+
+	if (next) {
+		wake_up_process(next);
+		put_task_struct(next);
+	}
 }
 
 /**
@@ -2663,7 +2696,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 }
 
 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
-		      ktime_t *abs_time, u32 bitset)
+		      ktime_t *abs_time, u32 bitset, struct task_struct *next)
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct restart_block *restart;
@@ -2687,7 +2720,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		goto out;
 
 	/* queue_me and wait for wakeup, timeout, or a signal. */
-	futex_wait_queue_me(hb, &q, to);
+	futex_wait_queue_me(hb, &q, to, next);
+	next = NULL;
 
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	ret = 0;
@@ -2720,6 +2754,10 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		ret = -ERESTART_RESTARTBLOCK;
 
 out:
+	if (next) {
+		wake_up_process(next);
+		put_task_struct(next);
+	}
 	if (to) {
 		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
@@ -2727,7 +2765,6 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	return ret;
 }
 
-
 static long futex_wait_restart(struct restart_block *restart)
 {
 	u32 __user *uaddr = restart->futex.uaddr;
@@ -2739,10 +2776,29 @@ static long futex_wait_restart(struct restart_block *restart)
 	}
 	restart->fn = do_no_restart_syscall;
 
-	return (long)futex_wait(uaddr, restart->futex.flags,
-				restart->futex.val, tp, restart->futex.bitset);
+	return (long)futex_wait(uaddr, restart->futex.flags, restart->futex.val,
+				tp, restart->futex.bitset, NULL);
 }
 
+static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
+		      ktime_t *abs_time, u32 __user *uaddr2)
+{
+	u32 bitset = FUTEX_BITSET_MATCH_ANY;
+	struct task_struct *next = NULL;
+	DEFINE_WAKE_Q(wake_q);
+	int ret;
+
+	ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
+	if (ret < 0)
+		return ret;
+	if (!wake_q_empty(&wake_q)) {
+		/* At most one wakee can be present. Pull it out. */
+		next = container_of(wake_q.first, struct task_struct, wake_q);
+		next->wake_q.next = NULL;
+	}
+
+	return futex_wait(uaddr, flags, val, abs_time, bitset, next);
+}
+
 /*
  * Userspace tried a 0 -> TID atomic transition of the futex value
@@ -3221,7 +3277,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	}
 
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
-	futex_wait_queue_me(hb, &q, to);
+	futex_wait_queue_me(hb, &q, to, NULL);
 
 	spin_lock(&hb->lock);
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
@@ -3746,7 +3802,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		/* fall through */
 	case FUTEX_WAIT_BITSET:
-		return futex_wait(uaddr, flags, val, timeout, val3);
+		return futex_wait(uaddr, flags, val, timeout, val3, NULL);
 	case FUTEX_WAKE:
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		/* fall through */
@@ -3770,6 +3826,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_SWAP:
+		return futex_swap(uaddr, flags, val, timeout, uaddr2);
 	}
 	return -ENOSYS;
 }
@@ -3786,7 +3844,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
 		      cmd == FUTEX_WAIT_BITSET ||
-		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
+		      cmd == FUTEX_WAIT_REQUEUE_PI || cmd == FUTEX_SWAP)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
 		if (get_timespec64(&ts, utime))
@@ -3795,7 +3853,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec64_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT || cmd == FUTEX_SWAP)
 			t = ktime_add_safe(ktime_get(), t);
 		tp = &t;
 	}
--
2.25.1
* Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-23 11:27 Peter Zijlstra

From: Peter Zijlstra
To: Peter Oskolkov
Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Darren Hart,
    Vincent Guittot, Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall,
    Aaron Lu

On Wed, Jul 22, 2020 at 04:45:36PM -0700, Peter Oskolkov wrote:
> This patchset is the first step to open-source this work. As explained
> in the linked pdf and video, SwitchTo API has three core operations: wait,
> resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
> that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
> on top of which user-space threading libraries can be built.

The PDF and video can go pound sand; you get to fully explain things
here.

What worries me is how FUTEX_SWAP would interact with the future
FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
there's very few WAIT/WAKE left.

Also, why would we commit to an ABI without ever having seen the rest?

On another note: wake_up_process_prefer_current_cpu() is a horrific
function name :/ That's half to a third of the line limit.
* Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-24  0:25 Peter Oskolkov

From: Peter Oskolkov
To: Peter Zijlstra
Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Darren Hart,
    Vincent Guittot, Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall,
    Aaron Lu, Waiman Long

On Thu, Jul 23, 2020 at 4:28 AM Peter Zijlstra <peterz@infradead.org> wrote:

Thanks a lot for your comments, Peter! My answers below.

> On Wed, Jul 22, 2020 at 04:45:36PM -0700, Peter Oskolkov wrote:
> > This patchset is the first step to open-source this work. As explained
> > in the linked pdf and video, SwitchTo API has three core operations: wait,
> > resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
> > that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
> > on top of which user-space threading libraries can be built.
>
> The PDF and video can go pound sand; you get to fully explain things
> here.

Will do. Should I expand the cover letter or the commit message? (I'll
probably split the first patch into two in the latter case.)

> What worries me is how FUTEX_SWAP would interact with the future
> FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
> there's very few WAIT/WAKE left.
[+cc Waiman Long]

I've looked through the latest FUTEX_LOCK patchset I could find
(https://lore.kernel.org/patchwork/cover/772643/ and related), and it seems
that the FUTEX_SWAP and FUTEX_LOCK/FUTEX_UNLOCK patchsets address the same
issue (slow wakeups) but for different use cases:

FUTEX_LOCK/FUTEX_UNLOCK uses spinning and lock stealing to
improve futex wake/wait performance in high contention situations;
FUTEX_SWAP is designed to be used for fast context switching with
_no_ contention by design: the waker that is going to sleep, and the wakee,
are using different futexes; the userspace will have a futex per thread/task,
and when needed the thread/task will either simply sleep on its futex,
or context switch (=FUTEX_SWAP) into a different thread/task.

I can also imagine that, instead of combining WAIT/WAKE for fast context
switching, a variant of FUTEX_SWAP could use LOCK/UNLOCK operations in the
future, when these are available; but again, I fully expect that a single
"FUTEX_LOCK the current task on futex A, FUTEX_UNLOCK futex B, context
switch into the wakee" futex op will be much faster than doing the same
thing in two syscalls, as FUTEX_LOCK/FUTEX_UNLOCK does not seem to be
concerned with fast waking of a sleeping task, but rather with minimizing
sleeping in the first place.

What will be faster: a FUTEX_SWAP that does
FUTEX_WAKE (futex A) + FUTEX_WAIT (current, futex B), or a FUTEX_SWAP
that does FUTEX_UNLOCK (futex A) + FUTEX_LOCK (current, futex B)?

As wake+wait will always put the waker to sleep, there will be a true
context switch on the same CPU on the fast path; on the other hand,
unlock+lock will potentially evade sleeping, so the wakee will often run
on a different CPU (with the waker spinning instead of sleeping?), thus
not benefitting from the cache locality that fast context switching on
the same CPU is meant to exploit...

I'll add some of the considerations above to the expanded cover letter
(or a commit message).
> > Also, why would we commit to an ABI without ever having seen the rest?

I'm not completely sure what you mean here. We do not envision any
expansion/changes to the ABI proposed here, only further performance
improvements. On these, we currently think that marking the wakee as the
preferred next task to run on the current CPU (by storing a
"struct task_struct *preferred_next_task" pointer either in a per-CPU
variable or in the current task_struct) and then having schedule()
determine whether to follow the hint or ignore it would be the simplest
way to speed up the context switch.

> On another note: wake_up_process_prefer_current_cpu() is a horrific
> function name :/ That's half to a third of the line limit.

I fully agree. I considered wake_up_on_current_cpu() first, but this name
does not indicate that the wakeup is a "strong wish" while "current cpu"
is a weak one... Do you have any suggestions? Maybe
wake_up_on_cpu(struct task_struct *next, int cpu_hint)? But this seems
too broad in scope, as we are interested here only in migrating the task
to the current CPU...

Thanks again for your comments!
* Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-24  3:00 Waiman Long

From: Waiman Long
To: Peter Oskolkov, Peter Zijlstra
Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Darren Hart,
    Vincent Guittot, Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall,
    Aaron Lu

On 7/23/20 8:25 PM, Peter Oskolkov wrote:
> On Thu, Jul 23, 2020 at 4:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Thanks a lot for your comments, Peter! My answers below.
>
>> On Wed, Jul 22, 2020 at 04:45:36PM -0700, Peter Oskolkov wrote:
>>> This patchset is the first step to open-source this work. As explained
>>> in the linked pdf and video, SwitchTo API has three core operations: wait,
>>> resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
>>> that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
>>> on top of which user-space threading libraries can be built.
>> The PDF and video can go pound sand; you get to fully explain things
>> here.
> Will do. Should I expand the cover letter or the commit message? (I'll
> probably split the first patch into two in the latter case).

You should put it mostly in the commit message, which will be part of the
git log history. The cover letter, on the other hand, is not part of the
git log.

>> What worries me is how FUTEX_SWAP would interact with the future
>> FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
>> there's very few WAIT/WAKE left.
> [+cc Waiman Long]
>
> I've looked through the latest FUTEX_LOCK patchset I could find (
> https://lore.kernel.org/patchwork/cover/772643/ and related), and it seems
> that FUTEX_SWAP and FUTEX_LOCK/FUTEX_UNLOCK patchsets
> address the same issue (slow wakeups) but for different use cases:
>
> FUTEX_LOCK/FUTEX_UNLOCK uses spinning and lock stealing to
> improve futex wake/wait performance in high contention situations;
> FUTEX_SWAP is designed to be used for fast context switching with
> _no_ contention by design: the waker that is going to sleep, and the wakee
> are using different futexes; the userspace will have a futex per thread/task,
> and when needed the thread/task will either simply sleep on its futex,
> or context switch (=FUTEX_SWAP) into a different thread/task.

I agree that it is a different use case. I just hope that you keep the
possible future extension to support FUTEX_LOCK/UNLOCK in mind so that
they won't conflict with each other.

Cheers,
Longman
* Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-24  3:22 Peter Oskolkov

From: Peter Oskolkov
To: Waiman Long
Cc: Peter Zijlstra, Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
    Darren Hart, Vincent Guittot, Peter Oskolkov, Andrei Vagin, Paul Turner,
    Ben Segall, Aaron Lu

On Thu, Jul 23, 2020 at 8:00 PM Waiman Long <longman@redhat.com> wrote:
>
> On 7/23/20 8:25 PM, Peter Oskolkov wrote:
> [...]
> You should put it mostly in the commit message which will be part of the
> git log history. The cover letter, on the other hand, is not part of the
> git log.

Ack. (Networking/David Miller usually includes the cover letter in the
git log, so this is context dependent, I guess.)

> >> What worries me is how FUTEX_SWAP would interact with the future
> >> FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
> >> there's very few WAIT/WAKE left.
> > [+cc Waiman Long]
> >
> > I've looked through the latest FUTEX_LOCK patchset I could find (
> > https://lore.kernel.org/patchwork/cover/772643/ and related), and it seems
> > that FUTEX_SWAP and FUTEX_LOCK/FUTEX_UNLOCK patchsets
> > address the same issue (slow wakeups) but for different use cases:
> >
> > FUTEX_LOCK/FUTEX_UNLOCK uses spinning and lock stealing to
> > improve futex wake/wait performance in high contention situations;
> > FUTEX_SWAP is designed to be used for fast context switching with
> > _no_ contention by design: the waker that is going to sleep, and the wakee
> > are using different futexes; the userspace will have a futex per thread/task,
> > and when needed the thread/task will either simply sleep on its futex,
> > or context switch (=FUTEX_SWAP) into a different thread/task.
>
> I agree that it is a different use case. I just hope that you keep the
> possible future extension to support FUTEX_LOCK/UNLOCK in mind so that
> they won't conflict with each other.

Ack. Will do. :)

Thanks,
Peter

> Cheers,
> Longman
* Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-27  9:51 peterz

From: peterz
To: Peter Oskolkov
Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Darren Hart,
    Vincent Guittot, Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall,
    Aaron Lu, Waiman Long

On Thu, Jul 23, 2020 at 05:25:05PM -0700, Peter Oskolkov wrote:
> On Thu, Jul 23, 2020 at 4:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > What worries me is how FUTEX_SWAP would interact with the future
> > FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
> > there's very few WAIT/WAKE left.
>
> [+cc Waiman Long]
>
> I've looked through the latest FUTEX_LOCK patchset I could find (
> https://lore.kernel.org/patchwork/cover/772643/ and related), and it seems
> that FUTEX_SWAP and FUTEX_LOCK/FUTEX_UNLOCK patchsets
> address the same issue (slow wakeups) but for different use cases:
>
> FUTEX_LOCK/FUTEX_UNLOCK uses spinning and lock stealing to
> improve futex wake/wait performance in high contention situations;

The spinning is best for light contention.

> FUTEX_SWAP is designed to be used for fast context switching with
> _no_ contention by design: the waker that is going to sleep, and the wakee
> are using different futexes; the userspace will have a futex per thread/task,
> and when needed the thread/task will either simply sleep on its futex,
> or context switch (=FUTEX_SWAP) into a different thread/task.

Right, but how can you tell what the next thing after UNLOCK is going to
be.. that's going to be tricky.
> I can also imagine that instead of combining WAIT/WAKE for
> fast context switching, a variant of FUTEX_SWAP can use LOCK/UNLOCK
> operations in the future, when these are available; but again, I fully
> expect that a single "FUTEX_LOCK the current task on futex A,
> FUTEX_UNLOCK futex B, context switch into the wakee" futex op will be
> much faster than doing the same thing in two syscalls, as
> FUTEX_LOCK/FUTEX_UNLOCK does not seem to be concerned with fast waking
> of a sleeping task, but more with minimizing sleeping in the first place.

Correct; however the reason I like LOCK/UNLOCK is that it exposes the
blocking relations to the kernel -- and that ties into yet another
unfinished patch-set :-/

  https://lkml.kernel.org/r/20181009092434.26221-1-juri.lelli@redhat.com

> What will be faster: FUTEX_SWAP that does
> FUTEX_WAKE (futex A) + FUTEX_WAIT (current, futex B),
> or FUTEX_SWAP that does
> FUTEX_UNLOCK (futex A) + FUTEX_LOCK (current, futex B)?

Well, perhaps both argue against having SWAP but instead having compound
futex ops. Something I think we're already starting to see with the new
futex API patches posted here:

  https://lkml.kernel.org/r/20200709175921.211387-1-andrealmeid@collabora.com

sys_futex_waitv() is effectively a whole bunch of WAIT ops combined.

> As wake+wait will always put the waker to sleep, it means that
> there will be a true context switch on the same CPU on the fast path;
> on the other hand, unlock+lock will potentially evade sleeping,
> so the wakee will often run on a different CPU (with the waker
> spinning instead of sleeping?), thus not benefitting from cache locality
> that fast context switching on the same CPU is meant to use...
>
> I'll add some of the considerations above to the expanded cover letter
> (or a commit message).

It's Monday morning, so perhaps I'm making a mess of things, but
something like the below, where our thread t2 issues synchronous work to
t1:

	t1			t2

				LOCK A
	LOCK B

1:	LOCK A

				...

				UNLOCK A
				LOCK B

	...

	UNLOCK B
	UNLOCK A
	LOCK B
	GOTO 1

				UNLOCK B
				LOCK A
				...

Is an abuse of mutexes, that is, it implements completions using a
(fair) mutex. A guards the work-queue, B is the 'completion'.

Then, if you teach it that a compound UNLOCK-A + LOCK-B, where
the lock owner of B is on the wait list of A, is a 'SWAP', should get you
the desired semantics, I think. You can do SWAP and you get to have an
exposed blocking relation.

Is this exactly what we want, I don't know. Because I'm not entirely
sure what problem we're solving. Why would you be setting things up like
that in the first place. IIRC you're using this to implement coroutines
in golang, and I'm not sure I have a firm enough grasp of all that to
make a cogent suggestion one way or the other.

> > Also, why would we commit to an ABI without ever having seen the rest?
>
> I'm not completely sure what you mean here. We do not envision any
> expansion/changes to the ABI proposed here,

Well, you do not, but how can we verify this without having a complete
picture. Also, IIRC, there's a bunch of scheduler patches that goes on
top to implement the fast switch.

Also, how does this compare to some of the glorious hacks found in GNU
Pth? You can implement M:N threading using those as well.
* Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
@ 2020-07-28  0:01 Peter Oskolkov

From: Peter Oskolkov
To: Peter Zijlstra
Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Darren Hart,
    Vincent Guittot, Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall,
    Aaron Lu, Waiman Long, Alexei Starovoitov, Daniel Borkmann

[+cc BPF maintainers]

I added a lot of details below... Maybe too much? Let me know...

On Mon, Jul 27, 2020 at 2:51 AM <peterz@infradead.org> wrote:
> On Thu, Jul 23, 2020 at 05:25:05PM -0700, Peter Oskolkov wrote:
> > On Thu, Jul 23, 2020 at 4:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > What worries me is how FUTEX_SWAP would interact with the future
> > > FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
> > > there's very few WAIT/WAKE left.

[...]

> > FUTEX_LOCK/FUTEX_UNLOCK uses spinning and lock stealing to
> > improve futex wake/wait performance in high contention situations;
>
> The spinning is best for light contention.

Right. But the goal here is to enable fast context switching, not to deal
with any kind of contention, light or heavy... See more on that below.

> > FUTEX_SWAP is designed to be used for fast context switching with
> > _no_ contention by design: the waker that is going to sleep, and the wakee
> > are using different futexes; the userspace will have a futex per thread/task,
> > and when needed the thread/task will either simply sleep on its futex,
> > or context switch (=FUTEX_SWAP) into a different thread/task.
>
> Right, but how can you tell what the next thing after UNLOCK is going to
> be.. that's going to be tricky.

Yes, but it is somewhat irrelevant here: if task A wants task B to run in
its place, and task B is "cooperating", let them do it... Again, see below
on what we would like to achieve here.
[...]

> Correct; however the reason I like LOCK/UNLOCK is that it exposes the
> blocking relations to the kernel -- and that ties into yet another
> unfinished patch-set :-/
>
>   https://lkml.kernel.org/r/20181009092434.26221-1-juri.lelli@redhat.com

Futexes were initially introduced, I believe, as barebone kernel primitives
for the userspace to build various synchronization mechanisms on top of,
explicitly keeping the kernel agnostic to blocking relations. While I
understand and fully support the desire to have more sophisticated
machinery in the kernel (like the LOCK/UNLOCK work, and the proxy execution
patchset referenced above), FUTEX_SWAP is more aligned with the initial
"keep it simple" approach: enable a fast context switch, and let the
userspace deal with complexity. Again, see more below.

> > What will be faster: FUTEX_SWAP that does
> > FUTEX_WAKE (futex A) + FUTEX_WAIT (current, futex B),
> > or FUTEX_SWAP that does
> > FUTEX_UNLOCK (futex A) + FUTEX_LOCK (current, futex B)?
>
> Well, perhaps both argue against having SWAP but instead having compound
> futex ops. Something I think we're already starting to see with the new
> futex API patches posted here:
>
>   https://lkml.kernel.org/r/20200709175921.211387-1-andrealmeid@collabora.com
>
> sys_futex_waitv() is effectively a whole bunch of WAIT ops combined.

I'm not sure how to build a fast context switch out of multiple related
wait ops...

> > As wake+wait will always put the waker to sleep, it means that
> > there will be a true context switch on the same CPU on the fast path;
> > on the other hand, unlock+lock will potentially evade sleeping,
> > so the wakee will often run on a different CPU (with the waker
> > spinning instead of sleeping?), thus not benefitting from cache locality
> > that fast context switching on the same CPU is meant to use...
> >
> > I'll add some of the considerations above to the expanded cover letter
> > (or a commit message).
> > It's Monday morning, so perhaps I'm making a mess of things, but > something like the below, where our thread t2 issues synchronous work to > t1: > > > t1 t2 > LOCK A > LOCK B > 1: LOCK A > > ... > > > UNLOCK A > LOCK B > ... > > UNLOCK B > UNLOCK A > LOCK B > GOTO 1 > UNLOCK B > LOCK A > ... > > Is an abuse of mutexes, that is, it implements completions using a > (fair) mutex. A guards the work-queue, B is the 'completion'. > > Then, if you teach it that a compound UNLOCK-A + LOCK-B, where > the lock owner of B is on the wait list of A is a 'SWAP', should get you > the desired semantics, I think. > > You can do SWAP and you get to have an exposed blocking relation. This will work for two threads. However, if a process has an arbitrary number of threads (tasks), I'm not sure how to construct something similar that would allow any task to swap into (= context switch into) any other task (and have the tasks dynamically come and go). With FUTEX_SWAP, each task has its own futex, and futexes are locked (= waited on) only by their owners, so no cross-locking as above is needed. > Is this exactly what we want, I don't know. Because I'm not entirely > sure what problem we're solving. Why would you be setting things up like > that in the first place. IIRC you're using this to implement coroutines > in golang, and I'm not sure I have a firm enough grasp of all that to > make a cogent suggestion one way or the other. OK, here is the main wall of text in this message... :) A simplified/idealized use case: imagine a multi-user service application (e.g.
a DBMS) that has to implement the following user CPU quota policy: - each user (these are DBMS users, not Linux users) can purchase certain amounts of expensive, but guaranteed, low-latency CPU quota (as a % of total CPUs available to the service), and a certain amount of cheap high-latency CPU quota; - quotas are enforced per second; - each user RPC/request to the service can specify whether this is a latency-critical request that should use the user's low-latency quota, and be immediately rejected if the quota for this second is exhausted; - requests can be high-latency-tolerant: should only use the high-latency quota; - a request can also be latency-tolerant: it should use the low-latency quota if available, or the high-latency quota if the low-latency quota is exhausted; - all "sold" (= allocated) low-latency quotas can go up to, but not exceed, 70% of all available CPUs (i.e. no over-subscription); - high-latency quotas are oversubscribed; - user isolation: misbehaving users should not affect the serving latency of users with available low-latency quotas; - soft deadlines/timeouts: each request/RPC can specify that it must be served within a certain deadline (let's say the minimum deadline is 10 milliseconds) or dropped if the deadline is exceeded; - each user request can potentially spawn several processing threads/tasks, and do child RPCs to remote services; these threads/tasks should also be covered by this quota/policy; - user requests should be served somewhat in-order: requests that use the same quota tiers that arrive earlier should be granted CPU before requests that arrive later ("arrival order scheduling"). There are many services at Google that implement a variant of the scheduling policy outlined above. In reality there are more priorities/quota tiers, there is service-internal maintenance work that can be either high or low priority, sometimes FIFO/LIFO/round robin scheduling is used in addition to arrival order scheduling, etc. 
(for example, LIFO scheduling is better at cache-locality in certain scenarios). User isolation within a process, as well as latency improvements, are the main benefits (on top of the actual ability to implement complex quota/scheduling policies). What is important is that these scheduling policies are driven by userspace schedulers built on top of these basic kernel primitives: - block: block the current thread/task (with a timeout); - resume: resume some previously blocked task (should commute with block, i.e. racing block/resume pairs should behave exactly as if the wake arrived after the block); - switch_to: block the current thread, resume some previously blocked task (behaves exactly as wake(remote), block(current), but optimized to do a fast context switch on the fast path); - block detection: when a task blocks in the kernel (on a network read, for example), the userspace scheduler is notified and schedules (resumes or swaps into) a pending task in the newly available CPU slot; - wake detection: when a task wakes from a previously blocking kernel operation (e.g. can now process some data on a network socket), the userspace scheduler is notified and can now schedule the task to run on a CPU when a CPU is available and the task can use it according to its scheduling policy. (Technically, block/wake detection is still experimental and not used widely: as we control the userspace, we can actually determine blocking/waking syscalls without kernel support). Internally we currently use kernel patches that are too "intrusive" to be included in a general-purpose Linux kernel, so we are exploring ways to upstream this functionality.
The easiest/least intrusive approach that we have come up with is this: - block/resume map perfectly to futex wait/wake; - switch_to thus maps to FUTEX_SWAP; - block and wake detection can be done either through tracing or by introducing new BPF attach points (when a task blocks or wakes, a BPF program is triggered that then communicates with the userspace); - the BPF attach points are per task, and the task needs to "opt in" (i.e. all other tasks suffer just an additional pointer comparison on block/wake); - the BPF programs triggered on block/wake should be able to perform futex ops (e.g. wake a designated userspace scheduling task) - this probably indicates that tracing is not enough, and a new BPF prog type is needed. [+cc BPF maintainers for the above]. > > > Also, why would we commit to an ABI without ever having seen the rest? > > > > I'm not completely sure what you mean here. We do not envision any > > expansion/changes to the ABI proposed here, > > Well, you do not, but how can we verify this without having a complete > picture. Also, IIRC, there's a bunch of scheduler patches that goes on > top to implement the fast switch. I hope my wall of text above clarifies things a bit. > Also, how does this compare to some of the glorious hacks found in > GNU Pth? You can implement M:N threading using those as well. It seems GNU Pth tries to emulate kernel scheduling in the userspace, while we would like to expose basic scheduling building blocks to the userspace and let it do anything it feels like... ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH for 5.9 2/3] futex/sched: add wake_up_process_prefer_current_cpu, use in FUTEX_SWAP 2020-07-22 23:45 [PATCH for 5.9 0/3] FUTEX_SWAP (tip/locking/core) Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation Peter Oskolkov @ 2020-07-22 23:45 ` Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 3/3] selftests/futex: add futex_swap selftest Peter Oskolkov 2 siblings, 0 replies; 10+ messages in thread From: Peter Oskolkov @ 2020-07-22 23:45 UTC (permalink / raw) To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu From: Peter Oskolkov <posk@google.com> As described in the previous patch in this patchset ("futex: introduce FUTEX_SWAP operation"), it is often beneficial to wake a task and run it on the same CPU where the current, about-to-sleep task is running. Internally at Google, the switchto_switch syscall not only migrates the wakee to the current CPU, but also moves the waker's load stats to the wakee, thus ensuring that the migration to the current CPU does not interfere with load balancing. switchto_switch also does the context switch into the wakee, bypassing schedule(). This patchset does not go that far yet; it simply migrates the wakee to the current CPU and calls schedule(). In follow-up patches I will try to fine-tune the behavior by adjusting load stats and schedule(): our internal switchto_switch is still about 2x faster than FUTEX_SWAP (see numbers below).
And now about performance: the futex_swap benchmark from the last patch in this patchset produces this typical output: $ ./futex_swap -i 100000 ------- running SWAP_WAKE_WAIT ----------- completed 100000 swap and back iterations in 820683263 ns: 4103 ns per swap PASS ------- running SWAP_SWAP ----------- completed 100000 swap and back iterations in 124034476 ns: 620 ns per swap PASS In the above, the first benchmark (SWAP_WAKE_WAIT) calls FUTEX_WAKE, then FUTEX_WAIT; the second benchmark (SWAP_SWAP) calls FUTEX_SWAP. If the benchmark is restricted to a single CPU: $ taskset -c 1 ./futex_swap -i 1000000 The numbers are very similar, as expected (with wake+wait being a bit slower than swap due to two syscalls vs one). Please also note that switchto_switch is about 2x faster than FUTEX_SWAP because it does a context switch to the wakee immediately, bypassing schedule(), so this is one of the options I'll explore in further patches (if/when this initial patchset is accepted). Tested: see the last patch in this patchset.
Signed-off-by: Peter Oskolkov <posk@google.com> --- include/linux/sched.h | 1 + kernel/futex.c | 9 ++++----- kernel/sched/core.c | 5 +++++ kernel/sched/fair.c | 3 +++ kernel/sched/sched.h | 1 + 5 files changed, 14 insertions(+), 5 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 3903a9500926..f2476dc3f449 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1716,6 +1716,7 @@ extern struct task_struct *find_get_task_by_vpid(pid_t nr); extern int wake_up_state(struct task_struct *tsk, unsigned int state); extern int wake_up_process(struct task_struct *tsk); +extern int wake_up_process_prefer_current_cpu(struct task_struct *tsk); extern void wake_up_new_task(struct task_struct *tsk); #ifdef CONFIG_SMP diff --git a/kernel/futex.c b/kernel/futex.c index f077168a4410..4286e9426a5f 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2603,12 +2603,11 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q, */ if (!timeout || timeout->task) { if (next) { - /* - * wake_up_process() below will be replaced - * in the next patch with - * wake_up_process_prefer_current_cpu(). 
- */ +#ifdef CONFIG_SMP + wake_up_process_prefer_current_cpu(next); +#else wake_up_process(next); +#endif put_task_struct(next); next = NULL; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ca5db40392d4..7526d3d42b0a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6277,6 +6277,11 @@ void sched_setnuma(struct task_struct *p, int nid) } #endif /* CONFIG_NUMA_BALANCING */ +int wake_up_process_prefer_current_cpu(struct task_struct *next) +{ + return try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU); +} + #ifdef CONFIG_HOTPLUG_CPU /* * Ensure that the idle task is using init_mm right before its CPU goes diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 658aa7a2ae6f..cceace0d6a54 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6664,6 +6664,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f int want_affine = 0; int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); + if ((wake_flags & WF_CURRENT_CPU) && cpumask_test_cpu(cpu, p->cpus_ptr)) + return cpu; + if (sd_flag & SD_BALANCE_WAKE) { record_wakee(p); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 877fb08eb1b0..103f0865a590 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1683,6 +1683,7 @@ static inline int task_on_rq_migrating(struct task_struct *p) #define WF_FORK 0x02 /* Child wakeup after fork */ #define WF_MIGRATED 0x04 /* Internal use, task got migrated */ #define WF_ON_CPU 0x08 /* Wakee is on_cpu */ +#define WF_CURRENT_CPU 0x10 /* Prefer to move wakee to the current CPU */ /* * To aid in avoiding the subversion of "niceness" due to uneven distribution -- 2.25.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH for 5.9 3/3] selftests/futex: add futex_swap selftest 2020-07-22 23:45 [PATCH for 5.9 0/3] FUTEX_SWAP (tip/locking/core) Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 2/3] futex/sched: add wake_up_process_prefer_current_cpu, use in FUTEX_SWAP Peter Oskolkov @ 2020-07-22 23:45 ` Peter Oskolkov 2 siblings, 0 replies; 10+ messages in thread From: Peter Oskolkov @ 2020-07-22 23:45 UTC (permalink / raw) To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar, Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu From: Peter Oskolkov <posk@google.com> This is the final patch in the FUTEX_SWAP patchset. It adds a test/benchmark to validate the behavior and compare the performance of the new FUTEX_SWAP futex operation. Detailed API design and behavior considerations are provided in the commit messages of the previous two patches. Signed-off-by: Peter Oskolkov <posk@google.com> --- .../selftests/futex/functional/.gitignore | 1 + .../selftests/futex/functional/Makefile | 1 + .../selftests/futex/functional/futex_swap.c | 209 ++++++++++++++++++ .../selftests/futex/include/futextest.h | 19 ++ 4 files changed, 230 insertions(+) create mode 100644 tools/testing/selftests/futex/functional/futex_swap.c diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore index 0efcd494daab..d661ef0946cc 100644 --- a/tools/testing/selftests/futex/functional/.gitignore +++ b/tools/testing/selftests/futex/functional/.gitignore @@ -2,6 +2,7 @@ futex_requeue_pi futex_requeue_pi_mismatched_ops futex_requeue_pi_signal_restart +futex_swap futex_wait_private_mapped_file futex_wait_timeout futex_wait_uninitialized_heap diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile index 23207829ec75..6992fac38b15 100644 ---
a/tools/testing/selftests/futex/functional/Makefile +++ b/tools/testing/selftests/futex/functional/Makefile @@ -13,6 +13,7 @@ TEST_GEN_FILES := \ futex_requeue_pi \ futex_requeue_pi_signal_restart \ futex_requeue_pi_mismatched_ops \ + futex_swap \ futex_wait_uninitialized_heap \ futex_wait_private_mapped_file diff --git a/tools/testing/selftests/futex/functional/futex_swap.c b/tools/testing/selftests/futex/functional/futex_swap.c new file mode 100644 index 000000000000..9034d04372d3 --- /dev/null +++ b/tools/testing/selftests/futex/functional/futex_swap.c @@ -0,0 +1,209 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include <errno.h> +#include <getopt.h> +#include <pthread.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h> +#include "atomic.h" +#include "futextest.h" + +/* The futex the main thread waits on. */ +futex_t futex_main = FUTEX_INITIALIZER; +/* The futex the other thread waits on. */ +futex_t futex_other = FUTEX_INITIALIZER; + +/* The number of iterations to run (> 1 => run benchmarks). */ +static int cfg_iterations = 1; + +/* If != 0, print diagnostic messages. */ +static int cfg_verbose; + +/* If == 0, do not use validation_counter. Useful for benchmarking. */ +static int cfg_validate = 1; + +/* How to swap threads. */ +#define SWAP_WAKE_WAIT 1 +#define SWAP_SWAP 2 + +/* Futex values. */ +#define FUTEX_WAITING 0 +#define FUTEX_WAKEUP 1 + +/* An atomic counter used to validate proper swapping.
*/ +static atomic_t validation_counter; + +void futex_swap_op(int mode, futex_t *futex_this, futex_t *futex_that) +{ + int ret; + + switch (mode) { + case SWAP_WAKE_WAIT: + futex_set(futex_this, FUTEX_WAITING); + futex_set(futex_that, FUTEX_WAKEUP); + futex_wake(futex_that, 1, FUTEX_PRIVATE_FLAG); + futex_wait(futex_this, FUTEX_WAITING, NULL, FUTEX_PRIVATE_FLAG); + if (*futex_this != FUTEX_WAKEUP) { + fprintf(stderr, "unexpected futex_this value on wakeup\n"); + exit(1); + } + break; + + case SWAP_SWAP: + futex_set(futex_this, FUTEX_WAITING); + futex_set(futex_that, FUTEX_WAKEUP); + ret = futex_swap(futex_this, FUTEX_WAITING, NULL, + futex_that, FUTEX_PRIVATE_FLAG); + if (ret < 0 && errno == ENOSYS) { + /* futex_swap not implemented */ + perror("futex_swap"); + exit(1); + } + if (*futex_this != FUTEX_WAKEUP) { + fprintf(stderr, "unexpected futex_this value on wakeup\n"); + exit(1); + } + break; + + default: + fprintf(stderr, "unknown mode in %s\n", __func__); + exit(1); + } +} + +void *other_thread(void *arg) +{ + int mode = *((int *)arg); + int counter; + + if (cfg_verbose) + printf("%s started\n", __func__); + + futex_wait(&futex_other, 0, NULL, FUTEX_PRIVATE_FLAG); + + for (counter = 0; counter < cfg_iterations; ++counter) { + if (cfg_validate) { + int prev = 2 * counter + 1; + + if (prev != atomic_cmpxchg(&validation_counter, prev, + prev + 1)) { + fprintf(stderr, "swap validation failed\n"); + exit(1); + } + } + futex_swap_op(mode, &futex_other, &futex_main); + } + + if (cfg_verbose) + printf("%s finished: %d iteration(s)\n", __func__, counter); + + return NULL; +} + +void run_test(int mode) +{ + struct timespec start, stop; + int ret, counter; + pthread_t thread; + uint64_t duration; + + futex_set(&futex_other, FUTEX_WAITING); + atomic_set(&validation_counter, 0); + ret = pthread_create(&thread, NULL, &other_thread, &mode); + if (ret) { + perror("pthread_create"); + exit(1); + } + + ret = clock_gettime(CLOCK_MONOTONIC, &start); + if (ret) { + 
perror("clock_gettime"); + exit(1); + } + + for (counter = 0; counter < cfg_iterations; ++counter) { + if (cfg_validate) { + int prev = 2 * counter; + + if (prev != atomic_cmpxchg(&validation_counter, prev, + prev + 1)) { + fprintf(stderr, "swap validation failed\n"); + exit(1); + } + } + futex_swap_op(mode, &futex_main, &futex_other); + } + if (cfg_validate && validation_counter.val != 2 * cfg_iterations) { + fprintf(stderr, "final swap validation failed\n"); + exit(1); + } + + ret = clock_gettime(CLOCK_MONOTONIC, &stop); + if (ret) { + perror("clock_gettime"); + exit(1); + } + + duration = (stop.tv_sec - start.tv_sec) * 1000000000LL + + stop.tv_nsec - start.tv_nsec; + if (cfg_verbose || cfg_iterations > 1) { + printf("completed %d swap and back iterations in %lu ns: %lu ns per swap\n", + cfg_iterations, duration, + duration / (cfg_iterations * 2)); + } + + /* The remote thread is blocked; send it the final wake. */ + futex_set(&futex_other, FUTEX_WAKEUP); + futex_wake(&futex_other, 1, FUTEX_PRIVATE_FLAG); + if (pthread_join(thread, NULL)) { + perror("pthread_join"); + exit(1); + } +} + +void usage(char *prog) +{ + printf("Usage: %s\n", prog); + printf(" -h Display this help message\n"); + printf(" -i N Use N iterations to benchmark\n"); + printf(" -n Do not validate swapping correctness\n"); + printf(" -v Print diagnostic messages\n"); +} + +int main(int argc, char *argv[]) +{ + int c; + + while ((c = getopt(argc, argv, "hi:nv")) != -1) { + switch (c) { + case 'h': + usage(basename(argv[0])); + exit(0); + case 'i': + cfg_iterations = atoi(optarg); + break; + case 'n': + cfg_validate = 0; + break; + case 'v': + cfg_verbose = 1; + break; + default: + usage(basename(argv[0])); + exit(1); + } + } + + printf("\n\n------- running SWAP_WAKE_WAIT -----------\n\n"); + run_test(SWAP_WAKE_WAIT); + printf("PASS\n"); + + printf("\n\n------- running SWAP_SWAP -----------\n\n"); + run_test(SWAP_SWAP); + printf("PASS\n"); + + return 0; +} diff --git 
a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h index ddbcfc9b7bac..4d6a0a18445a 100644 --- a/tools/testing/selftests/futex/include/futextest.h +++ b/tools/testing/selftests/futex/include/futextest.h @@ -38,6 +38,9 @@ typedef volatile u_int32_t futex_t; #ifndef FUTEX_CMP_REQUEUE_PI #define FUTEX_CMP_REQUEUE_PI 12 #endif +#ifndef FUTEX_SWAP +#define FUTEX_SWAP 13 +#endif #ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE #define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \ FUTEX_PRIVATE_FLAG) @@ -46,6 +49,9 @@ typedef volatile u_int32_t futex_t; #define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \ FUTEX_PRIVATE_FLAG) #endif +#ifndef FUTEX_SWAP_PRIVATE +#define FUTEX_SWAP_PRIVATE (FUTEX_SWAP | FUTEX_PRIVATE_FLAG) +#endif /** * futex() - SYS_futex syscall wrapper @@ -204,6 +210,19 @@ futex_cmp_requeue_pi(futex_t *uaddr, futex_t val, futex_t *uaddr2, int nr_wake, val, opflags); } +/** + * futex_swap() - block on uaddr and wake one task blocked on uaddr2. + * @uaddr: futex to block the current task on + * @timeout: relative timeout for the current task block + * @uaddr2: futex to wake tasks at (can be the same as uaddr) + */ +static inline int +futex_swap(futex_t *uaddr, futex_t val, struct timespec *timeout, + futex_t *uaddr2, int opflags) +{ + return futex(uaddr, FUTEX_SWAP, val, timeout, uaddr2, 0, opflags); +} + /** * futex_cmpxchg() - atomic compare and exchange * @uaddr: The address of the futex to be modified -- 2.25.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
end of thread, other threads:[~2020-07-28 0:01 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-07-22 23:45 [PATCH for 5.9 0/3] FUTEX_SWAP (tip/locking/core) Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation Peter Oskolkov 2020-07-23 11:27 ` Peter Zijlstra 2020-07-24 0:25 ` Peter Oskolkov 2020-07-24 3:00 ` Waiman Long 2020-07-24 3:22 ` Peter Oskolkov 2020-07-27 9:51 ` peterz 2020-07-28 0:01 ` Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 2/3] futex/sched: add wake_up_process_prefer_current_cpu, use in FUTEX_SWAP Peter Oskolkov 2020-07-22 23:45 ` [PATCH for 5.9 3/3] selftests/futex: add futex_swap selftest Peter Oskolkov