linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for 5.9 v2 0/4] FUTEX_SWAP (tip/locking/core)
@ 2020-08-03 22:15 Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation Peter Oskolkov
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Peter Oskolkov @ 2020-08-03 22:15 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot
  Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

From: Peter Oskolkov <posk@google.com>

This patchset introduces FUTEX_SWAP operation for fast
task context switching (aided by futexes). Detailed
use cases and reasoning are included in the commit message
of the first patch in the patchset.

v1->v2:
- split two #defines in futex.h into a small first patch
- detailed the use cases and reasoning in the commit message
  of the first patch
- renamed wake_up_process_prefer_current_cpu into wake_up_swap.


Peter Oskolkov (4):
  futex: introduce FUTEX_SWAP operation
  futex: implement FUTEX_SWAP as wake+wait.
  futex/sched: add wake_up_swap, use in FUTEX_SWAP
  selftests/futex: add futex_swap selftest

 include/linux/sched.h                         |   6 +
 include/uapi/linux/futex.h                    |   2 +
 kernel/futex.c                                |  86 +++++--
 kernel/sched/core.c                           |   5 +
 kernel/sched/fair.c                           |   3 +
 kernel/sched/sched.h                          |   1 +
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   1 +
 .../selftests/futex/functional/futex_swap.c   | 209 ++++++++++++++++++
 .../selftests/futex/include/futextest.h       |  19 ++
 10 files changed, 318 insertions(+), 15 deletions(-)
 create mode 100644 tools/testing/selftests/futex/functional/futex_swap.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation
  2020-08-03 22:15 [PATCH for 5.9 v2 0/4] FUTEX_SWAP (tip/locking/core) Peter Oskolkov
@ 2020-08-03 22:15 ` Peter Oskolkov
  2020-08-04 12:31   ` peterz
  2020-08-03 22:15 ` [PATCH for 5.9 v2 2/4] futex: implement FUTEX_SWAP as wake+wait Peter Oskolkov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Peter Oskolkov @ 2020-08-03 22:15 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot
  Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

From: Peter Oskolkov <posk@google.com>

As Paul Turner presented at LPC in 2003, Google has developed
an M:N userspace threading subsystem backed by Google-private
SwitchTo Linux Kernel API. This subsystem provides latency-sensitive
services at Google withfine-grained user-space control/scheduling
over what is running when, and this subsystem is used widely
internally (called schedulers or fibers).

A simplified/idealized use case: imagine a multi-user service application
(e.g. a DBMS) that has to implement the following user CPU quota
policy:
- each user (these are DBMS users, not Linux users) can purchase
  certain amounts of expensive, but guaranteed, low-latency
  CPU quota (as a % of total CPUs available to the service), and a certain
  amount of cheap high-latency CPU quota;
- quotas are enforced per second;
- each user RPC/request to the service can specify whether this is
  a latency-critical request that should use the user's low-latency quota,
  and be immediately rejected if the quota for this second is exhausted;
- requests can be high-latency-tolerant: should only use the high-latency
  quota;
- a request can also be latency-tolerant: it should use the low-latency
  quota if available, or the high-latency quota if the low-latency quota
  is exhausted;
- all "sold" (= allocated) low-latency quotas can go up to, but not exceed,
  70% of all available CPUs (i.e. no over-subscription);
- high-latency quotas are oversubscribed;
- user isolation: misbehaving users should not affect the serving
  latency of users with available low-latency quotas;
- soft deadlines/timeouts: each request/RPC can specify that it must
  be served within a certain deadline (let's say the minimum deadline
  is 10 milliseconds) or dropped if the deadline is exceeded;
- each user request can potentially spawn several processing threads/tasks,
  and do child RPCs to remote services; these threads/tasks should
  also be covered by this quota/policy;
- user requests should be served somewhat in-order: requests that use
  the same quota tiers that arrive earlier should be granted CPU before
  requests that arrive later ("arrival order scheduling").

There are many services at Google that implement a variant of the scheduling
policy outlined above. In reality there are more priorities/quota tiers,
there is service-internal maintenance work that can be either high
or low priority, sometimes FIFO/LIFO/round robin scheduling is used in
addition to arrival order scheduling, etc. (for example, LIFO scheduling
is better at cache-locality in certain scenarios). User isolation within
a process, as well as latency improvements are the main benefits (on top
of the actual ability to implement complex quota/scheduling policies).

What is important is that these scheduling policies are driven by
userspace schedulers built on top of these basic kernel primitives:
- block: block the current thread/task (with a timeout);
- resume: resume some previously blocked task (should commutate
  with block, i.e. racing block/resume pairs should behave
  exactly as if wake arrived after block);
- switch_to: block the current thread, resume some previously
  blocked task (behaves exactly as wake(remote), block(current), but
  optimized to do a fast context switch on the fast path);
- block detection: when a task blocks in the kernel (on a network
  read, for example), the userspace scheduler is notified and
  schedules (resumes or swaps into) a pending task in the newly available
  CPU slot;
- wake detection: when a task wakes from a previously blocking kernel
  operation (e.g. can now process some data on a network socket), the
  userspace scheduler is notified and can now schedule the task to
  run on a CPU when a CPU is available and the task can use it according
  to its scheduling policy.

(Technically, block/wake detection is still experimental and not
used widely: as we control the userspace, we can actually determine
blocking/waking syscalls without kernel support).

Internally we currently use kernel patches that are too "intrusive" to be
included in a general-purpose Linux kernel, so we are exploring ways to
upstream this functionality.

The easiest/least intrusive approach that we have come up with is this:

- block/resume map perfectly to futex wait/wake;
- switch_to thus maps to FUTEX_SWAP;
- block and wake detection can be done either through tracing
  or by introducing new BPF attach points (when a task blocks or wakes,
  a BPF program is triggered that then communicates with the userspace);
- the BPF attach points are per task, and the task needs to "opt in"
  (i.e. all other tasks suffer just an additional pointer comparison
  on block/wake);
- the BPF programs triggered on block/wake should be able to perform
  futex ops (e.g. wake a designated userspace scheduling task) - this
  probably indicates that tracing is not enough, and a new BPF prog type
  is needed.

In addition to the above, another common use case for FUTEX_SWAP is
message passing a-la RPC between tasks: task/thread T1 prepares a message,
wakes T2 to work on it, and waits for the results; when T2 is done, it
wakes T1 and waits for more work to arrive. Currently the simplest
way to implement this is

a. T1: futex-wake T2, futex-wait
b. T2: wakes, does what it has been woken to do
c. T2: futex-wake T1, futex-wait

With FUTEX_SWAP, steps a and c above can be reduced to one futex operation
that runs 5-10 times faster.

Patches in this patchset:

Patch 1: (this patch) add FUTEX_SWAP #defines, as well as the overall
         reasoning behind the patchset.
Patch 2: implement FUTEX_SWAP futex operation that, internally, does
         wake + wait.
Patch 3: the main speed-up of FUTEX_SWAP: migrate the wakee to the
         waker's CPU.
Patch 4: a selftest that can also be used to benchmark FUTEX_SWAP vs
         FUTEX_WAKE + FUTEX_WAIT.

Tested: see patch 4 in this patchset.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/futex.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index a89eb0accd5e..c1d151d97dea 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -21,6 +21,7 @@
 #define FUTEX_WAKE_BITSET	10
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
+#define FUTEX_SWAP		13
 
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
@@ -40,6 +41,7 @@
 					 FUTEX_PRIVATE_FLAG)
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
+#define FUTEX_SWAP_PRIVATE		(FUTEX_SWAP | FUTEX_PRIVATE_FLAG)
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH for 5.9 v2 2/4] futex: implement FUTEX_SWAP as wake+wait.
  2020-08-03 22:15 [PATCH for 5.9 v2 0/4] FUTEX_SWAP (tip/locking/core) Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation Peter Oskolkov
@ 2020-08-03 22:15 ` Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 3/4] futex/sched: add wake_up_swap, use in FUTEX_SWAP Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 4/4] selftests/futex: add futex_swap selftest Peter Oskolkov
  3 siblings, 0 replies; 7+ messages in thread
From: Peter Oskolkov @ 2020-08-03 22:15 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot
  Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

From: Peter Oskolkov <posk@google.com>

See the previous patch in the patchset, which introduced
FUTEX_SWAP op in futex.h, for a detailed description of
the use cases and future directions.

This patch implements FUTEX_SWAP as a simple wake+wait.
The next patch will improve this by migrating the wakee
to the waker's (= waiter's) CPU.

Tested: see patch 4 in this patchset.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 kernel/futex.c | 87 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 4616d4ad609d..a81d62a16e72 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1574,16 +1574,16 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
 }
 
 /*
- * Wake up waiters matching bitset queued on this futex (uaddr).
+ * Prepare wake queue matching bitset queued on this futex (uaddr).
  */
 static int
-futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+prepare_wake_q(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset,
+		struct wake_q_head *wake_q)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
-	DEFINE_WAKE_Q(wake_q);
 
 	if (!bitset)
 		return -EINVAL;
@@ -1611,13 +1611,26 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
 			if (!(this->bitset & bitset))
 				continue;
 
-			mark_wake_futex(&wake_q, this);
+			mark_wake_futex(wake_q, this);
 			if (++ret >= nr_wake)
 				break;
 		}
 	}
 
 	spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * Wake up waiters matching bitset queued on this futex (uaddr).
+ */
+static int
+futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+{
+	int ret;
+	DEFINE_WAKE_Q(wake_q);
+
+	ret = prepare_wake_q(uaddr, flags, nr_wake, bitset, &wake_q);
 	wake_up_q(&wake_q);
 	return ret;
 }
@@ -2557,9 +2570,12 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
  * @hb:		the futex hash bucket, must be locked by the caller
  * @q:		the futex_q to queue up on
  * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
+ * @next:	if present, wake next and hint to the scheduler that we'd
+ *		prefer to execute it locally.
  */
 static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
-				struct hrtimer_sleeper *timeout)
+				struct hrtimer_sleeper *timeout,
+				struct task_struct *next)
 {
 	/*
 	 * The task state is guaranteed to be set before another task can
@@ -2584,10 +2600,26 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 		 * flagged for rescheduling. Only call schedule if there
 		 * is no timeout, or if it has yet to expire.
 		 */
-		if (!timeout || timeout->task)
+		if (!timeout || timeout->task) {
+			if (next) {
+				/*
+				 * wake_up_process() below will be
+				 * replaced in the next patch with
+				 * wake_up_swap().
+				 */
+				wake_up_process(next);
+				put_task_struct(next);
+				next = NULL;
+			}
 			freezable_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);
+
+	if (next) {
+		wake_up_process(next);
+		put_task_struct(next);
+	}
 }
 
 /**
@@ -2663,7 +2695,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
 }
 
 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
-		      ktime_t *abs_time, u32 bitset)
+		      ktime_t *abs_time, u32 bitset, struct task_struct *next)
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct restart_block *restart;
@@ -2687,7 +2719,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 		goto out;
 
 	/* queue_me and wait for wakeup, timeout, or a signal. */
-	futex_wait_queue_me(hb, &q, to);
+	futex_wait_queue_me(hb, &q, to, next);
+	next = NULL;
 
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	ret = 0;
@@ -2720,6 +2753,10 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	ret = -ERESTART_RESTARTBLOCK;
 
 out:
+	if (next) {
+		wake_up_process(next);
+		put_task_struct(next);
+	}
 	if (to) {
 		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
@@ -2727,7 +2764,6 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 	return ret;
 }
 
-
 static long futex_wait_restart(struct restart_block *restart)
 {
 	u32 __user *uaddr = restart->futex.uaddr;
@@ -2739,10 +2775,29 @@ static long futex_wait_restart(struct restart_block *restart)
 	}
 	restart->fn = do_no_restart_syscall;
 
-	return (long)futex_wait(uaddr, restart->futex.flags,
-				restart->futex.val, tp, restart->futex.bitset);
+	return (long)futex_wait(uaddr, restart->futex.flags, restart->futex.val,
+				tp, restart->futex.bitset, NULL);
 }
 
+static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
+		      ktime_t *abs_time, u32 __user *uaddr2)
+{
+	u32 bitset = FUTEX_BITSET_MATCH_ANY;
+	struct task_struct *next = NULL;
+	DEFINE_WAKE_Q(wake_q);
+	int ret;
+
+	ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
+	if (ret < 0)
+		return ret;
+	if (!wake_q_empty(&wake_q)) {
+		/* At most one wakee can be present. Pull it out. */
+		next = container_of(wake_q.first, struct task_struct, wake_q);
+		next->wake_q.next = NULL;
+	}
+
+	return futex_wait(uaddr, flags, val, abs_time, bitset, next);
+}
 
 /*
  * Userspace tried a 0 -> TID atomic transition of the futex value
@@ -3221,7 +3276,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 	}
 
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
-	futex_wait_queue_me(hb, &q, to);
+	futex_wait_queue_me(hb, &q, to, NULL);
 
 	spin_lock(&hb->lock);
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
@@ -3746,7 +3801,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		/* fall through */
 	case FUTEX_WAIT_BITSET:
-		return futex_wait(uaddr, flags, val, timeout, val3);
+		return futex_wait(uaddr, flags, val, timeout, val3, NULL);
 	case FUTEX_WAKE:
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		/* fall through */
@@ -3770,6 +3825,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case FUTEX_SWAP:
+		return futex_swap(uaddr, flags, val, timeout, uaddr2);
 	}
 	return -ENOSYS;
 }
@@ -3786,7 +3843,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
 		      cmd == FUTEX_WAIT_BITSET ||
-		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
+		      cmd == FUTEX_WAIT_REQUEUE_PI || cmd == FUTEX_SWAP)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
 		if (get_timespec64(&ts, utime))
@@ -3795,7 +3852,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
 			return -EINVAL;
 
 		t = timespec64_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT || cmd == FUTEX_SWAP)
 			t = ktime_add_safe(ktime_get(), t);
 		tp = &t;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH for 5.9 v2 3/4] futex/sched: add wake_up_swap, use in FUTEX_SWAP
  2020-08-03 22:15 [PATCH for 5.9 v2 0/4] FUTEX_SWAP (tip/locking/core) Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 2/4] futex: implement FUTEX_SWAP as wake+wait Peter Oskolkov
@ 2020-08-03 22:15 ` Peter Oskolkov
  2020-08-03 22:15 ` [PATCH for 5.9 v2 4/4] selftests/futex: add futex_swap selftest Peter Oskolkov
  3 siblings, 0 replies; 7+ messages in thread
From: Peter Oskolkov @ 2020-08-03 22:15 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot
  Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

From: Peter Oskolkov <posk@google.com>

As described in the previous patch in this patchset, it is
often beneficial to wake a task and run it on the same CPU
where the current (=going to sleep) task it running.

Internally at Google, switchto_switch sycall not only
migrates the wakee to the current CPU, but also moves
the waker's load stats to the wakee, thus ensuring
that the migration to the current CPU does not interfere
with load balancing. switchto_switch also does the
context switch into the wakee, bypassing schedule().

This patchset does not go that far yet, it simply
migrates the wakee to the current CPU and calls schedule().

In follow-up patches I will try to fune-tune the behavior by adjusting
load stats and schedule(): our internal switchto_switch
is still about 2x faster than FUTEX_SWAP (see numbers below).

And now about performance: futex_swap benchmark
from the last patch in this patchset produces this typical
output:

$ ./futex_swap -i 100000

------- running SWAP_WAKE_WAIT -----------

completed 100000 swap and back iterations in 850686650 ns: 4253 ns per swap
PASS

------- running SWAP_SWAP -----------

completed 100000 swap and back iterations in 123800593 ns: 619 ns per swap
PASS

In the above, the first benchmark (SWAP_WAKE_WAIT) calls FUTEX_WAKE,
then FUTEX_WAIT; the second benchmark (SWAP_SWAP) calls FUTEX_SWAP.

If the benchmark is restricted to a single cpu:

$ taskset -c 1 ./futex_swap -i 1000000

The numbers are very similar, as expected (with wake+wait being
a bit slower than swap due to two vs one syscalls).

Please also note that switchto_switch is about 2x faster than
FUTEX_SWAP because it does a contex switch to the wakee immediately,
bypassing schedule(), so this is one of the options I'll
explore in further patches (if/when this initial patchset is
accepted).

Tested: see the last patch is this patchset.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/linux/sched.h | 6 ++++++
 kernel/futex.c        | 9 ++++-----
 kernel/sched/core.c   | 5 +++++
 kernel/sched/fair.c   | 3 +++
 kernel/sched/sched.h  | 1 +
 5 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 060e9214c8b5..857d68fd0f8a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1711,6 +1711,12 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk);
 
+/*
+ * Wake up tsk and try to swap it into the current tasks place, which
+ * initially means just trying to migrate it to the current CPU.
+ */
+extern int wake_up_swap(struct task_struct *tsk);
+
 #ifdef CONFIG_SMP
 extern void kick_process(struct task_struct *tsk);
 #else
diff --git a/kernel/futex.c b/kernel/futex.c
index a81d62a16e72..3c9a7402e69e 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2602,12 +2602,11 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 		 */
 		if (!timeout || timeout->task) {
 			if (next) {
-				/*
-				 * wake_up_process() below will be
-				 * replaced in the next patch with
-				 * wake_up_swap().
-				 */
+#ifdef CONFIG_SMP
+				wake_up_swap(next);
+#else
 				wake_up_process(next);
+#endif
 				put_task_struct(next);
 				next = NULL;
 			}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2142c6767682..9747bf1591ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2737,6 +2737,11 @@ int wake_up_process(struct task_struct *p)
 }
 EXPORT_SYMBOL(wake_up_process);
 
+int wake_up_swap(struct task_struct *tsk)
+{
+	return try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+}
+
 int wake_up_state(struct task_struct *p, unsigned int state)
 {
 	return try_to_wake_up(p, state, 0);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fa8dbcfa4d..ae244d85c15e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6668,6 +6668,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	if ((wake_flags & WF_CURRENT_CPU) && cpumask_test_cpu(cpu, p->cpus_ptr))
+		return cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 877fb08eb1b0..103f0865a590 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1683,6 +1683,7 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 #define WF_FORK			0x02		/* Child wakeup after fork */
 #define WF_MIGRATED		0x04		/* Internal use, task got migrated */
 #define WF_ON_CPU		0x08		/* Wakee is on_cpu */
+#define WF_CURRENT_CPU		0x10		/* Prefer to move wakee to the current CPU */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH for 5.9 v2 4/4] selftests/futex: add futex_swap selftest
  2020-08-03 22:15 [PATCH for 5.9 v2 0/4] FUTEX_SWAP (tip/locking/core) Peter Oskolkov
                   ` (2 preceding siblings ...)
  2020-08-03 22:15 ` [PATCH for 5.9 v2 3/4] futex/sched: add wake_up_swap, use in FUTEX_SWAP Peter Oskolkov
@ 2020-08-03 22:15 ` Peter Oskolkov
  3 siblings, 0 replies; 7+ messages in thread
From: Peter Oskolkov @ 2020-08-03 22:15 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Peter Zijlstra, Darren Hart, Vincent Guittot
  Cc: Peter Oskolkov, Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

From: Peter Oskolkov <posk@google.com>

This is the final patch in FUTEX_SWAP patchset. It adds a
test/benchmark to validate the behavior and compare the
performance of a new FUTEX_SWAP futex operation to
a pair of FUTEX_WAKE/FUTEX_WAIT calls.

Detailed API design and behavior considerations are provided
in the commit messages of the previous patches.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 .../selftests/futex/functional/.gitignore     |   1 +
 .../selftests/futex/functional/Makefile       |   1 +
 .../selftests/futex/functional/futex_swap.c   | 209 ++++++++++++++++++
 .../selftests/futex/include/futextest.h       |  19 ++
 4 files changed, 230 insertions(+)
 create mode 100644 tools/testing/selftests/futex/functional/futex_swap.c

diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index 0efcd494daab..d661ef0946cc 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore
@@ -2,6 +2,7 @@
 futex_requeue_pi
 futex_requeue_pi_mismatched_ops
 futex_requeue_pi_signal_restart
+futex_swap
 futex_wait_private_mapped_file
 futex_wait_timeout
 futex_wait_uninitialized_heap
diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index 23207829ec75..6992fac38b15 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -13,6 +13,7 @@ TEST_GEN_FILES := \
 	futex_requeue_pi \
 	futex_requeue_pi_signal_restart \
 	futex_requeue_pi_mismatched_ops \
+	futex_swap \
 	futex_wait_uninitialized_heap \
 	futex_wait_private_mapped_file
 
diff --git a/tools/testing/selftests/futex/functional/futex_swap.c b/tools/testing/selftests/futex/functional/futex_swap.c
new file mode 100644
index 000000000000..9034d04372d3
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_swap.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <errno.h>
+#include <getopt.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include "atomic.h"
+#include "futextest.h"
+
+/* The futex the main thread waits on. */
+futex_t futex_main = FUTEX_INITIALIZER;
+/* The futex the other thread wats on. */
+futex_t futex_other = FUTEX_INITIALIZER;
+
+/* The number of iterations to run (>1 => run benchmarks. */
+static int cfg_iterations = 1;
+
+/* If != 0, print diagnostic messages. */
+static int cfg_verbose;
+
+/* If == 0, do not use validation_counter. Useful for benchmarking. */
+static int cfg_validate = 1;
+
+/* How to swap threads. */
+#define SWAP_WAKE_WAIT 1
+#define SWAP_SWAP 2
+
+/* Futex values. */
+#define FUTEX_WAITING 0
+#define FUTEX_WAKEUP 1
+
+/* An atomic counter used to validate proper swapping. */
+static atomic_t validation_counter;
+
+void futex_swap_op(int mode, futex_t *futex_this, futex_t *futex_that)
+{
+	int ret;
+
+	switch (mode) {
+	case SWAP_WAKE_WAIT:
+		futex_set(futex_this, FUTEX_WAITING);
+		futex_set(futex_that, FUTEX_WAKEUP);
+		futex_wake(futex_that, 1, FUTEX_PRIVATE_FLAG);
+		futex_wait(futex_this, FUTEX_WAITING, NULL, FUTEX_PRIVATE_FLAG);
+		if (*futex_this != FUTEX_WAKEUP) {
+			fprintf(stderr, "unexpected futex_this value on wakeup\n");
+			exit(1);
+		}
+		break;
+
+	case SWAP_SWAP:
+		futex_set(futex_this, FUTEX_WAITING);
+		futex_set(futex_that, FUTEX_WAKEUP);
+		ret = futex_swap(futex_this, FUTEX_WAITING, NULL,
+				 futex_that, FUTEX_PRIVATE_FLAG);
+		if (ret < 0 && errno == ENOSYS) {
+			/* futex_swap not implemented */
+			perror("futex_swap");
+			exit(1);
+		}
+		if (*futex_this != FUTEX_WAKEUP) {
+			fprintf(stderr, "unexpected futex_this value on wakeup\n");
+			exit(1);
+		}
+		break;
+
+	default:
+		fprintf(stderr, "unknown mode in %s\n", __func__);
+		exit(1);
+	}
+}
+
+void *other_thread(void *arg)
+{
+	int mode = *((int *)arg);
+	int counter;
+
+	if (cfg_verbose)
+		printf("%s started\n", __func__);
+
+	futex_wait(&futex_other, 0, NULL, FUTEX_PRIVATE_FLAG);
+
+	for (counter = 0; counter < cfg_iterations; ++counter) {
+		if (cfg_validate) {
+			int prev = 2 * counter + 1;
+
+			if (prev != atomic_cmpxchg(&validation_counter, prev,
+						   prev + 1)) {
+				fprintf(stderr, "swap validation failed\n");
+				exit(1);
+			}
+		}
+		futex_swap_op(mode, &futex_other, &futex_main);
+	}
+
+	if (cfg_verbose)
+		printf("%s finished: %d iteration(s)\n", __func__, counter);
+
+	return NULL;
+}
+
+void run_test(int mode)
+{
+	struct timespec start, stop;
+	int ret, counter;
+	pthread_t thread;
+	uint64_t duration;
+
+	futex_set(&futex_other, FUTEX_WAITING);
+	atomic_set(&validation_counter, 0);
+	ret = pthread_create(&thread, NULL, &other_thread, &mode);
+	if (ret) {
+		perror("pthread_create");
+		exit(1);
+	}
+
+	ret = clock_gettime(CLOCK_MONOTONIC, &start);
+	if (ret) {
+		perror("clock_gettime");
+		exit(1);
+	}
+
+	for (counter = 0; counter < cfg_iterations; ++counter) {
+		if (cfg_validate) {
+			int prev = 2 * counter;
+
+			if (prev != atomic_cmpxchg(&validation_counter, prev,
+						   prev + 1)) {
+				fprintf(stderr, "swap validation failed\n");
+				exit(1);
+			}
+		}
+		futex_swap_op(mode, &futex_main, &futex_other);
+	}
+	if (cfg_validate && validation_counter.val != 2 * cfg_iterations) {
+		fprintf(stderr, "final swap validation failed\n");
+		exit(1);
+	}
+
+	ret = clock_gettime(CLOCK_MONOTONIC, &stop);
+	if (ret) {
+		perror("clock_gettime");
+		exit(1);
+	}
+
+	duration = (stop.tv_sec - start.tv_sec) * 1000000000LL +
+	stop.tv_nsec - start.tv_nsec;
+	if (cfg_verbose || cfg_iterations > 1) {
+		printf("completed %d swap and back iterations in %lu ns: %lu ns per swap\n",
+			cfg_iterations, duration,
+			duration / (cfg_iterations * 2));
+	}
+
+	/* The remote thread is blocked; send it the final wake. */
+	futex_set(&futex_other, FUTEX_WAKEUP);
+	futex_wake(&futex_other, 1, FUTEX_PRIVATE_FLAG);
+	if (pthread_join(thread, NULL)) {
+		perror("pthread_join");
+		exit(1);
+	}
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h    Display this help message\n");
+	printf("  -i N  Use N iterations to benchmark\n");
+	printf("  -n    Do not validate swapping correctness\n");
+	printf("  -v    Print diagnostic messages\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "hi:nv")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'i':
+			cfg_iterations = atoi(optarg);
+			break;
+		case 'n':
+			cfg_validate = 0;
+			break;
+		case 'v':
+			cfg_verbose = 1;
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	printf("\n\n------- running SWAP_WAKE_WAIT -----------\n\n");
+	run_test(SWAP_WAKE_WAIT);
+	printf("PASS\n");
+
+	printf("\n\n------- running SWAP_SWAP -----------\n\n");
+	run_test(SWAP_SWAP);
+	printf("PASS\n");
+
+	return 0;
+}
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac..4d6a0a18445a 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -38,6 +38,9 @@ typedef volatile u_int32_t futex_t;
 #ifndef FUTEX_CMP_REQUEUE_PI
 #define FUTEX_CMP_REQUEUE_PI		12
 #endif
+#ifndef FUTEX_SWAP
+#define FUTEX_SWAP			13
+#endif
 #ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
 #define FUTEX_WAIT_REQUEUE_PI_PRIVATE	(FUTEX_WAIT_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
@@ -46,6 +49,9 @@ typedef volatile u_int32_t futex_t;
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
 #endif
+#ifndef FUTEX_SWAP_PRIVATE
+#define FUTEX_SWAP_PRIVATE		(FUTEX_WAIT_WAKE | FUTEX_PRIVATE_FLAG)
+#endif
 
 /**
  * futex() - SYS_futex syscall wrapper
@@ -204,6 +210,19 @@ futex_cmp_requeue_pi(futex_t *uaddr, futex_t val, futex_t *uaddr2, int nr_wake,
 		     val, opflags);
 }
 
+/**
+ * futex_swap() - block on uaddr and wake one task blocked on uaddr2.
+ * @uaddr:	futex to block the current task on
+ * @timeout:	relative timeout for the current task block
+ * @uaddr2:	futex to wake tasks at (can be the same as uaddr)
+ */
+static inline int
+futex_swap(futex_t *uaddr, futex_t val, struct timespec *timeout,
+	   futex_t *uaddr2, int opflags)
+{
+	return futex(uaddr, FUTEX_SWAP, val, timeout, uaddr2, 0, opflags);
+}
+
 /**
  * futex_cmpxchg() - atomic compare and exchange
  * @uaddr:	The address of the futex to be modified
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation
  2020-08-03 22:15 ` [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation Peter Oskolkov
@ 2020-08-04 12:31   ` peterz
  2020-08-11 15:26     ` Rik van Riel
  0 siblings, 1 reply; 7+ messages in thread
From: peterz @ 2020-08-04 12:31 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Darren Hart, Vincent Guittot, Peter Oskolkov,
	Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

On Mon, Aug 03, 2020 at 03:15:07PM -0700, Peter Oskolkov wrote:
> A simplified/idealized use case: imagine a multi-user service application
> (e.g. a DBMS) that has to implement the following user CPU quota
> policy:

So the last posting made hackernews; and there a bunch expressed far
more interest in coroutines, which, if I'm not mistaken, can also be
implemented using all this.

Would that not make for a far simpler and more convincing use-case?

> - block detection: when a task blocks in the kernel (on a network
>   read, for example), the userspace scheduler is notified and
>   schedules (resumes or swaps into) a pending task in the newly available
>   CPU slot;
> - wake detection: when a task wakes from a previously blocking kernel
>   operation (e.g. can now process some data on a network socket), the
>   userspace scheduler is notified and can now schedule the task to
>   run on a CPU when a CPU is available and the task can use it according
>   to its scheduling policy.
> 
> (Technically, block/wake detection is still experimental and not
> used widely: as we control the userspace, we can actually determine
> blocking/waking syscalls without kernel support).
> 
> Internally we currently use kernel patches that are too "intrusive" to be
> included in a general-purpose Linux kernel, so we are exploring ways to
> upstream this functionality.
> 
> The easiest/least intrusive approach that we have come up with is this:
> 
> - block/resume map perfectly to futex wait/wake;
> - switch_to thus maps to FUTEX_SWAP;
> - block and wake detection can be done either through tracing
>   or by introducing new BPF attach points (when a task blocks or wakes,
>   a BPF program is triggered that then communicates with the userspace);
> - the BPF attach points are per task, and the task needs to "opt in"
>   (i.e. all other tasks suffer just an additional pointer comparison
>   on block/wake);
> - the BPF programs triggered on block/wake should be able to perform
>   futex ops (e.g. wake a designated userspace scheduling task) - this
>   probably indicates that tracing is not enough, and a new BPF prog type
>   is needed.

I really think we want to have block/resume detection sorted before this
goes anywhere, I also strongly feel BPF should not be used for
functional interfaces like that.

That is, I want to see a complete interface before I want to commit to
an ABI that we're stuck with.

I also want to see userspace that goes along with it; like with
sys_membarrier() / liburcu and sys_rseq() / librseq (which seems to be
heading for glibc).

Also, and this seems to be the crux of the whole endeavour, you want to
allow your 'fibers' to block. Which is what makes
{make,swap,get,set}context() unsuited for your needs and gives rise to
the whole block/resume issue above.

Also, I want words on the interaction between resume notification and
wake-up preemption. That is, how do you envision managing the
interaction between the two schedulers.


All in all, I don't think you're even close to having something
mergable.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation
  2020-08-04 12:31   ` peterz
@ 2020-08-11 15:26     ` Rik van Riel
  0 siblings, 0 replies; 7+ messages in thread
From: Rik van Riel @ 2020-08-11 15:26 UTC (permalink / raw)
  To: peterz, Peter Oskolkov
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	Ingo Molnar, Darren Hart, Vincent Guittot, Peter Oskolkov,
	Andrei Vagin, Paul Turner, Ben Segall, Aaron Lu

[-- Attachment #1: Type: text/plain, Size: 1562 bytes --]

On Tue, 2020-08-04 at 14:31 +0200, peterz@infradead.org wrote:
> On Mon, Aug 03, 2020 at 03:15:07PM -0700, Peter Oskolkov wrote:
> > A simplified/idealized use case: imagine a multi-user service
> > application
> > (e.g. a DBMS) that has to implement the following user CPU quota
> > policy:
> 
> So the last posting made hackernews; and there a bunch expressed far
> more interest in coroutines, which, if I'm not mistaken, can also be
> implemented using all this.
> 
> Would that not make for a far simpler and more convincing use-case?

Also just worker threads in general. Instead of waking up
an idle worker thread every time a new request comes in,
why not allow worker threads to indicate "I'm almost done",
and the main/receiving thread to enqueue the new request,
to be handled when the worker thread is done with the current
request?

> I really think we want to have block/resume detection sorted before
> this
> goes anywhere, I also strongly feel BPF should not be used for
> functional interfaces like that.
> 
> That is, I want to see a complete interface before I want to commit
> to
> an ABI that we're stuck with.

The work case above also needs to have that figured out,
for the (slow path, but still common) case of the worker
thread actually having gone to sleep while the work got
enqueued, and then needing to be woken up anyway.

Not sure that is a direction you are interested in, given
the description, but it could make coroutines and worker
threads more efficient :)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-08-11 15:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-03 22:15 [PATCH for 5.9 v2 0/4] FUTEX_SWAP (tip/locking/core) Peter Oskolkov
2020-08-03 22:15 ` [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation Peter Oskolkov
2020-08-04 12:31   ` peterz
2020-08-11 15:26     ` Rik van Riel
2020-08-03 22:15 ` [PATCH for 5.9 v2 2/4] futex: implement FUTEX_SWAP as wake+wait Peter Oskolkov
2020-08-03 22:15 ` [PATCH for 5.9 v2 3/4] futex/sched: add wake_up_swap, use in FUTEX_SWAP Peter Oskolkov
2020-08-03 22:15 ` [PATCH for 5.9 v2 4/4] selftests/futex: add futex_swap selftest Peter Oskolkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).