Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls

From: Peter Oskolkov <posk@posk.io>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>, Peter Oskolkov <posk@google.com>,
	Andrei Vagin <avagin@google.com>, Jann Horn <jannh@google.com>,
	Thierry Delisle <tdelisle@uwaterloo.ca>
Subject: Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
Date: Mon, 29 Nov 2021 09:34:49 -0800	[thread overview]
Message-ID: <CAFTs51UzR=m6+vcjTCNOGwGu3ZwB5GMrg+cSQy2ecvCWxhZvEQ@mail.gmail.com> (raw)
In-Reply-To: <YaUCoe07Wl9Stlch@hirez.programming.kicks-ass.net>

On Mon, Nov 29, 2021 at 8:41 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:
>
> > wait_wake_only is not needed if you have both next_tid and server_tid,
> > as your patch has. In my version of the patch, next_tid is the same as
> > server_tid, so the flag is needed to indicate to the kernel that
> > next_tid is the wakee, not the server.
>
> Ah, okay.
>
> > re: (idle_)server_tid_ptr: it seems that you assume that blocked
> > workers keep their servers, while in my patch they "lose them" once
> > they block, and so there should be a global idle server pointer to
> > wake the server in my scheme (if there is an idle one). The main
> > difference is that in my approach a server has only a single, running,
> > worker assigned to it, while in your approach it can have a number of
> > blocked/idle workers to take care of as well.
>
> Correct; I've been thinking in analogues of the way we schedule CPUs.
> Each CPU has a ready/run queue along with the current task.
> fundamentally the RUNNABLE tasks need to go somewhere when all servers
> are busy. So at that point the previous server is as good a place as
> any.
>
> Now, I sympathise with a blocked task not having a relation; I often
> argue this same, since we have wakeup balancing etc. And I've not really
> thought about how to best do wakeup-balancing, also see below.
>
> > The main difference between our approaches, as I see it: in my
> > approach if a worker is running, its server is sleeping, period. If we
> > have N servers, and N running workers, there are no servers to wake
> > when a previously blocked worker finishes its blocking op. In your
> > approach, it seems that N servers have each a bunch of workers
> > pointing at them, and a single worker running. If a previously blocked
> > worker wakes up, it wakes the server it was assigned to previously,
>
> Right; it does that. It can check the ::state of it's current task,
> possibly set TF_PREEMPT or just go back to sleep.
>
> > and so now we have more than N physical tasks/threads running: N
> > workers and the woken server. This is not ideal: if the process is
> > affined to only N CPUs, that means a worker will be preempted to let
> > the woken server run, which is somewhat against the goal of letting
> > the workers run more or less uninterrupted. This is not deal breaking,
> > but maybe something to keep in mind.
>
> I suppose it's easy enough to make this behaviour configurable though;
> simply enqueue and not wake.... Hmm.. how would this worker know if the
> server was 'busy' or not? The whole 'current' thing is a user-space
> construct. I suppose that's what your pointer was for? Puts an actual
> idle server in there, if there is one. Let me ponder that a bit.

Yes, the idle_server_ptr was there to point to an idle server; this
naturally did wakeup balancing.

>
> However, do note this whole scheme fundamentally has some of that, the
> moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
> all tasks, they can consume however much time the syscall needs there.
>
> Also, timeout on sys_umcg_wait() gets you the exact same situation (or
> worse, multiple running workers).

It should not. Timed out workers should be added to the runnable list
and not become running unless a server chooses so. So sys_umcg_wait()
with a timeout should behave similarly to a normal sleep, in that the
server is woken upon the worker blocking, and upon the worker wakeup
the worker is added to the woken workers list and waits for a server
to run it. The only difference is that in a sleep the worker becomes
BLOCKED, while in sys_umcg_wait() the worker is RUNNABLE the whole
time.

Why then have sys_umcg_wait() with a timeout at all, instead of
calling nanosleep()? Because the worker in sys_umcg_wait() can be
context-switched into by another worker, or made running by a server;
if the worker is in nanosleep(), it just sleeps.

>
> > Another big concern I have is that you removed UMCG_TF_LOCKED. I
>
> OOh yes, I forgot to mention that. I couldn't figure out what it was
> supposed to do.
>
> > definitely needed it to guard workers during "sched work" in the
> > userspace in my approach. I'm not sure if the flag is absolutely
> > needed with your approach, but most likely it is - the kernel-side
> > scheduler does lock tasks and runqueues and disables interrupts and
> > migrations and other things so that the scheduling logic is not
> > hijacked by concurrent stuff. Why do you assume that the userspace
> > scheduling code does not need similar protections?
>
> I've not yet come across a case where this is needed. Migration for
> instance is possible when RUNNABLE, simply write ::server_tid before
> ::state. Userspace just needs to make sure who actually owns the task,
> but it can do that outside of this state.
>
> But like I said; I've not yet done the userspace part (and I lost most
> of today trying to install a new machine), so perhaps I'll run into it
> soon enough.

The most obvious scenario where I needed locking is when worker A
wants to context switch into worker B, while another worker C wants to
context switch into worker A, and worker A pagefaults. This involves:

worker A context: worker A context switches into worker B:

- worker B::server_tid = worker A::server_tid
- worker A::server_tid = none
- worker A::state = runnable
- worker B::state = running
- worker A::next_tid = worker B
- worker A calls sys_umcg_wait()

worker B context: before the above completes, worker C wants to
context switch into worker A, with similar steps.

"interrupt context": in the middle of the mess above, worker A pagefaults

Too many moving parts. UMCG_TF_LOCKED helped me make this mess
manageable. Maybe without pagefaults clever ordering of the operations
listed above could make things work, but pagefaults mess things badly,
so some kind of "preempt_disable()" for the userspace scheduling code
was needed, and UMCG_TF_LOCKED was the solution I had.

>
>