From: Peter Oskolkov <>
To: Linux Kernel Mailing List <>,
	Thomas Gleixner <>,
	Ingo Molnar <>,
	Peter Zijlstra <>,
	Darren Hart <>,
	Vincent Guittot <>
Cc: Peter Oskolkov <>,, "" <>,
	Ben Segall <>
Subject: [RFC PATCH 2/3] futex, sched: add wake_up_swap, use in FUTEX_SWAP
Date: Mon, 15 Jun 2020 10:29:58 -0700
Message-ID: <>

From 2e5a6e670f4d5f5ab94893fe0bff17150b56c142 Mon Sep 17 00:00:00 2001
From: Peter Oskolkov <>
Date: Sun, 14 Jun 2020 17:08:00 -0700
Subject: [RFC PATCH 2/3] futex, sched: add wake_up_swap, use in FUTEX_SWAP

This is an RFC!

As described in the previous patch in this patchset
("futex: introduce FUTEX_SWAP operation"), it is often
beneficial to wake a task and run it on the same CPU
where the current task, which is about to sleep, is running.

Internally at Google, the switchto_switch syscall not only
migrates the wakee to the current CPU, but also moves
the waker's load stats to the wakee, thus ensuring
that the migration to the current CPU does not interfere
with load balancing. switchto_switch also does the
context switch into the wakee, bypassing schedule().

This patchset does not go that far yet: it simply
migrates the wakee to the current CPU and calls schedule().
This can potentially interfere with load balancing but,
on the other hand, the code is concise and easy to
understand, which makes it a good starting point for discussion.
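
To illustrate the placement rule, here is a userspace model of the
check this patch adds to select_task_rq_fair(). It is only a sketch:
pick_wake_cpu(), cpu_allowed() and fallback_cpu are hypothetical names,
the affinity mask is a plain 64-bit bitmap rather than a kernel cpumask,
and only the WF_CURRENT_CPU value is taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag value taken from the patch; everything else here is illustrative. */
#define WF_CURRENT_CPU 0x10

/* Hypothetical stand-in for cpumask_test_cpu(): bit N set means CPU N is allowed. */
static bool cpu_allowed(uint64_t allowed_mask, int cpu)
{
	return cpu >= 0 && cpu < 64 && ((allowed_mask >> cpu) & 1);
}

/*
 * Model of the early return: if the waker asked for the current CPU and
 * the wakee is allowed to run there, pick it. Otherwise fall back to
 * whatever the regular placement logic would have chosen (passed in as
 * fallback_cpu, since that logic is not modeled here).
 */
static int pick_wake_cpu(int wake_flags, int current_cpu,
			 uint64_t allowed_mask, int fallback_cpu)
{
	if ((wake_flags & WF_CURRENT_CPU) && cpu_allowed(allowed_mask, current_cpu))
		return current_cpu;
	return fallback_cpu;
}
```

With CPUs 0-3 allowed (mask 0xF), a WF_CURRENT_CPU wakeup issued on
CPU 2 selects CPU 2, while the same wakeup issued on CPU 5 falls back
to the regular choice.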

If this approach is OK, then in follow-up patches
I will try to fine-tune the behavior by adjusting
load stats and schedule(): our internal switchto_switch
is still about 2x faster than FUTEX_SWAP (see numbers below).

And now about performance: the futex_swap benchmark
from the last patch in this patchset produces this typical
output:

$ ./futex_swap -i 100000

------- running SWAP_WAKE_WAIT -----------

completed 100000 swap and back iterations in 820683263 ns: 4103 ns per swap

------- running SWAP_SWAP -----------

completed 100000 swap and back iterations in 124034476 ns: 620 ns per swap

In the above, the first benchmark (SWAP_WAKE_WAIT) calls FUTEX_WAKE,
then FUTEX_WAIT; the second benchmark (SWAP_SWAP) calls FUTEX_SWAP.
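
The per-swap figures can be reproduced from the totals with the small
helper below. It is a sketch under one assumption that is inferred from
the numbers rather than stated by the benchmark: each "swap and back"
iteration counts as two swaps. ns_per_swap() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Assumption: one "swap and back" iteration is two swaps, which is what
 * makes the reported totals and per-swap figures agree.
 */
static uint64_t ns_per_swap(uint64_t total_ns, uint64_t iterations)
{
	return total_ns / (2 * iterations);	/* integer division truncates */
}
```

ns_per_swap(820683263, 100000) gives 4103 and
ns_per_swap(124034476, 100000) gives 620, matching the output above,
so SWAP_SWAP is roughly 6.6 times faster in this run.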

If the benchmark is restricted to a single cpu:

$ taskset -c 1 ./futex_swap -i 1000000

The numbers are very similar, as expected (with wake+wait being
a bit slower than swap because it takes two syscalls instead of one).
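
For reference, the two-syscall wake+wait sequence looks roughly like
this from userspace. This is a minimal sketch and not the benchmark
code: futex_op() and wake_then_wait() are hypothetical helper names,
and FUTEX_SWAP itself is not shown because it is only defined by this
patchset and is not available in the installed uapi headers.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw syscall: glibc does not export futex(). */
static long futex_op(uint32_t *uaddr, int op, uint32_t val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Hand off to a peer with two separate syscalls:
 *   1. FUTEX_WAKE: wake at most one task blocked on *wake_addr;
 *   2. FUTEX_WAIT: sleep on *wait_addr while it still holds 'expected'.
 * Returns the result of FUTEX_WAIT (-1 with errno set to EAGAIN if
 * *wait_addr no longer equals 'expected').
 */
static long wake_then_wait(uint32_t *wake_addr, uint32_t *wait_addr,
			   uint32_t expected)
{
	futex_op(wake_addr, FUTEX_WAKE, 1);
	return futex_op(wait_addr, FUTEX_WAIT, expected);
}
```

Between the two calls the waker is still runnable, so the scheduler
sees a wakeup followed by a separate sleep; a single swap operation
lets the kernel treat the pair as one hand-off.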

Please also note that switchto_switch is about 2x faster than
FUTEX_SWAP because it does a context switch to the wakee immediately,
bypassing schedule(), so this is one of the options I'll
explore in further patches (if/when this initial patchset is

Tested: see the last patch in this patchset.

Signed-off-by: Peter Oskolkov <>
---
 include/linux/sched.h | 1 +
 kernel/futex.c        | 7 +------
 kernel/sched/core.c   | 5 +++++
 kernel/sched/fair.c   | 3 +++
 kernel/sched/sched.h  | 3 ++-
 5 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..4b177ac6f9be 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1693,6 +1693,7 @@ extern struct task_struct *find_get_task_by_vpid(pid_t nr);
 extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
+extern int wake_up_process_prefer_current_cpu(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk);
 #ifdef CONFIG_SMP
diff --git a/kernel/futex.c b/kernel/futex.c
index f3833190886f..a426671e4bbb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2646,12 +2646,7 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 		if (!timeout || timeout->task) {
 			if (next) {
-				/*
-				 * wake_up_process() below will be replaced
-				 * in the next patch with
-				 * wake_up_process_prefer_current_cpu().
-				 */
-				wake_up_process(next);
+				wake_up_process_prefer_current_cpu(next);
 				next = NULL;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..f894b3e6c9ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6180,6 +6180,11 @@ void sched_setnuma(struct task_struct *p, int nid)
+int wake_up_process_prefer_current_cpu(struct task_struct *next)
+{
+	return try_to_wake_up(next, TASK_NORMAL, WF_CURRENT_CPU);
+}
  * Ensure that the idle task is using init_mm right before its CPU goes
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538ba5d94e99..80f927bb62eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6656,6 +6656,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+	if ((wake_flags & WF_CURRENT_CPU) && cpumask_test_cpu(cpu, p->cpus_ptr))
+		return cpu;
 	if (sd_flag & SD_BALANCE_WAKE) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..5dbe202a4103 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1688,7 +1688,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
 #define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
+#define WF_CURRENT_CPU		0x10		/* Prefer to move wakee to the current CPU */
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
