Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls

From: Peter Zijlstra <peterz@infradead.org>
To: Peter Oskolkov <posk@posk.io>
Cc: Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-api@vger.kernel.org, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>, Peter Oskolkov <posk@google.com>,
	Andrei Vagin <avagin@google.com>, Jann Horn <jannh@google.com>,
	Thierry Delisle <tdelisle@uwaterloo.ca>
Subject: Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
Date: Mon, 29 Nov 2021 17:41:05 +0100	[thread overview]
Message-ID: <YaUCoe07Wl9Stlch@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <CAFTs51XnN+N74i1XHvRUAUWd04-Fs9uV6ouXo=CQSQs8MaEM5A@mail.gmail.com>

On Sun, Nov 28, 2021 at 04:29:11PM -0800, Peter Oskolkov wrote:

> wait_wake_only is not needed if you have both next_tid and server_tid,
> as your patch has. In my version of the patch, next_tid is the same as
> server_tid, so the flag is needed to indicate to the kernel that
> next_tid is the wakee, not the server.

Ah, okay.

> re: (idle_)server_tid_ptr: it seems that you assume that blocked
> workers keep their servers, while in my patch they "lose them" once
> they block, and so there should be a global idle server pointer to
> wake the server in my scheme (if there is an idle one). The main
> difference is that in my approach a server has only a single, running,
> worker assigned to it, while in your approach it can have a number of
> blocked/idle workers to take care of as well.

Correct; I've been thinking in analogues of the way we schedule CPUs.
Each CPU has a ready/run queue along with the current task.
fundamentally the RUNNABLE tasks need to go somewhere when all servers
are busy. So at that point the previous server is as good a place as
any.

Now, I sympathise with a blocked task not having a relation; I often
argue this same, since we have wakeup balancing etc. And I've not really
thought about how to best do wakeup-balancing, also see below.

> The main difference between our approaches, as I see it: in my
> approach if a worker is running, its server is sleeping, period. If we
> have N servers, and N running workers, there are no servers to wake
> when a previously blocked worker finishes its blocking op. In your
> approach, it seems that N servers have each a bunch of workers
> pointing at them, and a single worker running. If a previously blocked
> worker wakes up, it wakes the server it was assigned to previously,

Right; it does that. It can check the ::state of it's current task,
possibly set TF_PREEMPT or just go back to sleep.

> and so now we have more than N physical tasks/threads running: N
> workers and the woken server. This is not ideal: if the process is
> affined to only N CPUs, that means a worker will be preempted to let
> the woken server run, which is somewhat against the goal of letting
> the workers run more or less uninterrupted. This is not deal breaking,
> but maybe something to keep in mind.

I suppose it's easy enough to make this behaviour configurable though;
simply enqueue and not wake.... Hmm.. how would this worker know if the
server was 'busy' or not? The whole 'current' thing is a user-space
construct. I suppose that's what your pointer was for? Puts an actual
idle server in there, if there is one. Let me ponder that a bit.

However, do note this whole scheme fundamentally has some of that, the
moment the syscall unblocks until sys_exit is 'unmanaged' runtime for
all tasks, they can consume however much time the syscall needs there.

Also, timeout on sys_umcg_wait() gets you the exact same situation (or
worse, multiple running workers).

> Another big concern I have is that you removed UMCG_TF_LOCKED. I

OOh yes, I forgot to mention that. I couldn't figure out what it was
supposed to do.

> definitely needed it to guard workers during "sched work" in the
> userspace in my approach. I'm not sure if the flag is absolutely
> needed with your approach, but most likely it is - the kernel-side
> scheduler does lock tasks and runqueues and disables interrupts and
> migrations and other things so that the scheduling logic is not
> hijacked by concurrent stuff. Why do you assume that the userspace
> scheduling code does not need similar protections?

I've not yet come across a case where this is needed. Migration for
instance is possible when RUNNABLE, simply write ::server_tid before
::state. Userspace just needs to make sure who actually owns the task,
but it can do that outside of this state.

But like I said; I've not yet done the userspace part (and I lost most
of today trying to install a new machine), so perhaps I'll run into it
soon enough.