Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation

From: Peter Oskolkov <posk@posk.io>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Ingo Molnar <mingo@kernel.org>,
	Darren Hart <dvhart@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Peter Oskolkov <posk@google.com>,
	Andrei Vagin <avagin@google.com>, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>, Aaron Lu <aaron.lwe@gmail.com>,
	Waiman Long <longman@redhat.com>
Subject: Re: [PATCH for 5.9 1/3] futex: introduce FUTEX_SWAP operation
Date: Thu, 23 Jul 2020 17:25:05 -0700
Message-ID: <CAFTs51UJhC9TmXkzz8VbDNmkSEyZE29=dRdUi65TDpSYqoK5vw@mail.gmail.com>
In-Reply-To: <20200723112757.GN5523@worktop.programming.kicks-ass.net>

On Thu, Jul 23, 2020 at 4:28 AM Peter Zijlstra <peterz@infradead.org> wrote:

Thanks a lot for your comments, Peter! My answers below.

>
> On Wed, Jul 22, 2020 at 04:45:36PM -0700, Peter Oskolkov wrote:
> > This patchset is the first step to open-source this work. As explained
> > in the linked pdf and video, SwitchTo API has three core operations: wait,
> > resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
> > that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
> > on top of which user-space threading libraries can be built.
>
> The PDF and video can go pound sand; you get to fully explain things
> here.

Will do. Should I expand the cover letter or the commit message? (I'll probably
split the first patch into two in the latter case).

>
> What worries me is how FUTEX_SWAP would interact with the future
> FUTEX_LOCK / FUTEX_UNLOCK. When we implement pthread_mutex with those,
> there's very few WAIT/WAKE left.

[+cc Waiman Long]

I've looked through the latest FUTEX_LOCK patchset I could find
(https://lore.kernel.org/patchwork/cover/772643/ and related), and it
seems that the FUTEX_SWAP and FUTEX_LOCK/FUTEX_UNLOCK patchsets
address the same issue (slow wakeups) but for different use cases:

FUTEX_LOCK/FUTEX_UNLOCK uses spinning and lock stealing to
improve futex wake/wait performance in high contention situations;
FUTEX_SWAP is designed to be used for fast context switching with
_no_ contention by design: the waker, which is about to go to sleep, and
the wakee use different futexes. The userspace will have a futex per
thread/task, and when needed the thread/task will either simply sleep
on its futex or context switch (=FUTEX_SWAP) into a different thread/task.
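
To make the futex-per-task model concrete, here is a minimal userspace
sketch of how wait, resume and swap can be approximated today with
FUTEX_WAKE + FUTEX_WAIT, i.e. the two-syscall path that FUTEX_SWAP is
meant to collapse into one. The struct task layout and the task_wait(),
task_resume() and task_swap() helpers are hypothetical names used only
for illustration; they are not part of the patchset:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-task state: one futex word per thread/task. */
struct task {
    _Atomic uint32_t futex; /* 0 = parked, 1 = wakeup pending */
};

/* Thin wrapper; glibc does not export a futex() function. */
static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* "wait": sleep on our own futex until a wakeup is pending. */
static void task_wait(struct task *self)
{
    /* Consume a pending wakeup if there is one; otherwise sleep.
     * FUTEX_WAIT may also return on EAGAIN or EINTR, hence the loop. */
    while (atomic_exchange(&self->futex, 0) == 0)
        futex(&self->futex, FUTEX_WAIT_PRIVATE, 0);
}

/* "resume": mark the wakee runnable and wake it on its own futex. */
static void task_resume(struct task *next)
{
    atomic_store(&next->futex, 1);
    futex(&next->futex, FUTEX_WAKE_PRIVATE, 1);
}

/* "swap" emulated with two syscalls: wake the wakee, then park ourselves. */
static void task_swap(struct task *self, struct task *next)
{
    task_resume(next);
    task_wait(self);
}
```

Since each task only ever sleeps on its own futex word, there is no
contention on any single futex; the cost is in the two syscalls and in
where the scheduler decides to run the wakee.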

I can also imagine that instead of combining WAIT/WAKE for
fast context switching, a variant of FUTEX_SWAP could use LOCK/UNLOCK
operations in the future, when these are available; but again, I fully
expect that a single "FUTEX_LOCK the current task on futex A,
FUTEX_UNLOCK futex B, context switch into the wakee" futex op will be
much faster than doing the same thing in two syscalls, as
FUTEX_LOCK/FUTEX_UNLOCK does not seem to be concerned with fast waking
of a sleeping task, but more with minimizing sleeping in the first place.

What will be faster: FUTEX_SWAP that does
   FUTEX_WAKE (futex A) + FUTEX_WAIT (current, futex B),
or FUTEX_SWAP that does
   FUTEX_UNLOCK (futex A) + FUTEX_LOCK (current, futex B)?

As wake+wait will always put the waker to sleep, there will be
a true context switch on the same CPU on the fast path;
on the other hand, unlock+lock will potentially avoid sleeping,
so the wakee will often run on a different CPU (with the waker
spinning instead of sleeping?), thus not benefitting from the cache
locality that fast context switching on the same CPU is meant to use...

I'll add some of the considerations above to the expanded cover letter
(or a commit message).

>
> Also, why would we commit to an ABI without ever having seen the rest?

I'm not completely sure what you mean here. We do not envision any
expansion/changes to the ABI proposed here, only further performance
improvements. On these, we currently think that marking the wakee
as the preferred next task to run on the current CPU (by storing
"struct task_struct *preferred_next_task" either in a per-CPU pointer,
or in the current task_struct) and then having schedule() determine
whether to follow the hint or ignore it would be the simplest way to speed up
the context switch.
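
As a rough illustration of the "follow the hint or ignore it" idea,
here is a toy model in plain userspace C (not kernel code). The
toy_task and toy_cpu types, the one-shot handling of the hint, and the
rule "ignore the hint if the hinted task is not runnable" are all
assumptions made for this sketch; the actual policy inside schedule()
is still to be worked out:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for task_struct and per-CPU scheduler state. */
struct toy_task {
    int id;
    bool runnable;
};

struct toy_cpu {
    struct toy_task *preferred_next; /* hint stored by the waker */
    struct toy_task *queue[4];       /* simplistic run queue */
    size_t nr_queued;
};

/* Pick the next task: follow the hint if it is usable, else fall back. */
static struct toy_task *toy_pick_next(struct toy_cpu *cpu)
{
    struct toy_task *hint = cpu->preferred_next;

    cpu->preferred_next = NULL; /* the hint is one-shot in this model */
    if (hint && hint->runnable)
        return hint;

    /* Hint absent or unusable: take the first runnable queued task. */
    for (size_t i = 0; i < cpu->nr_queued; i++)
        if (cpu->queue[i]->runnable)
            return cpu->queue[i];

    return NULL; /* nothing to run */
}
```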

>
> On another note: wake_up_process_prefer_current_cpu() is a horrific
> function name :/ That's half to a third of the line limit.

I fully agree. I considered wake_up_on_current_cpu() first, but that name
does not convey that the wakeup itself is a "strong wish" while "current
cpu" is only a weak one... Do you have any suggestions? Maybe

wake_up_on_cpu(struct task_struct *next, int cpu_hint)?

But this seems too broad in scope, as we are interested here in only
migrating the task to the current CPU...

Thanks again for your comments!