From: Thierry Delisle <tdelisle@uwaterloo.ca>
To: <posk@posk.io>
Cc: <avagin@google.com>, <bsegall@google.com>, <jannh@google.com>,
	<jnewsome@torproject.org>, <joel@joelfernandes.org>,
	<linux-api@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<mingo@redhat.com>, <peterz@infradead.org>, <pjt@google.com>,
	<posk@google.com>, <tglx@linutronix.de>,
	Peter Buhr <pabuhr@uwaterloo.ca>,
	Martin Karsten <mkarsten@uwaterloo.ca>
Subject: Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
Date: Sun, 11 Jul 2021 14:29:39 -0400
Message-ID: <bb30216c-4339-2703-9d87-9326af86a7b0@uwaterloo.ca>
In-Reply-To: <20210708194638.128950-4-posk@google.com>

 > Let's move the discussion to the new thread.

I'm happy to start a new thread. I'm re-sending my last post because many
of my questions are still unanswered.

 > + * State transitions:
 > + *
 > + * RUNNING => IDLE:   the current RUNNING task becomes IDLE by calling
 > + *                    sys_umcg_wait();
 >
 > [...]
 >
 > +/**
 > + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
 > + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
 > + * @UMCG_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
 > + *                       (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY must be set.
 > + */
 > +enum umcg_wait_flag {
 > +    UMCG_WAIT_WAKE_ONLY = 1,
 > +    UMCG_WF_CURRENT_CPU = 2,
 > +};

What is the purpose of using sys_umcg_wait without next_tid or with
UMCG_WAIT_WAKE_ONLY? It looks like Java's park/unpark semantics to me,
that is, worker threads can use this for synchronization and mutual
exclusion. In this case, how do these compare to using
FUTEX_WAIT/FUTEX_WAKE?
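
For concreteness, here is the park/unpark pattern I have in mind, written
with futexes. The names and the token protocol are mine; I assume the UMCG
version would set @self->next_tid and pass UMCG_WAIT_WAKE_ONLY instead of
calling FUTEX_WAKE:

#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Hypothetical park/unpark built on futexes, for comparison only. */
static void park(atomic_int *token)
{
	/* Sleep until unpark() publishes the token. */
	while (atomic_exchange(token, 0) == 0)
		syscall(SYS_futex, token, FUTEX_WAIT, 0, NULL, NULL, 0);
}

static void unpark(atomic_int *token)
{
	/* Publish the token and wake at most one parked thread. */
	atomic_store(token, 1);
	syscall(SYS_futex, token, FUTEX_WAKE, 1, NULL, NULL, 0);
}

If workers are expected to use sys_umcg_wait like this, it would help to
know what the wake-only path buys over FUTEX_WAKE, beyond the
WF_CURRENT_CPU hint.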


 > +struct umcg_task {
 > [...]
 > +    /**
 > +     * @server_tid: the TID of the server UMCG task that should be
 > +     *              woken when this WORKER becomes BLOCKED. Can be zero.
 > +     *
 > +     *              If this is a UMCG server, @server_tid should
 > +     *              contain the TID of @self - it will be used to find
 > +     *              the task_struct to wake when pulled from
 > +     *              @idle_servers.
 > +     *
 > +     * Read-only for the kernel, read/write for the userspace.
 > +     */
 > +    uint32_t    server_tid;        /* r   */
 > [...]
 > +    /**
 > +     * @idle_servers_ptr: a single-linked list pointing to the list
 > +     *                    of idle servers. Can be NULL.
 > +     *
 > +     * Readable/writable by both the kernel and the userspace: the
 > +     * userspace adds items to the list, the kernel removes them.
 > +     *
 > +     * TODO: describe how the list works.
 > +     */
 > +    uint64_t    idle_servers_ptr;    /* r/w */
 > [...]
 > +} __attribute__((packed, aligned(8 * sizeof(__u64))));

From the comments and by elimination, I'm guessing that idle_servers_ptr
is somehow used by servers to block until some worker threads become idle.
However, I do not understand how userspace is expected to use it. I also
do not understand whether these link fields form a stack or a queue, and
where the head is.
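
To make my confusion concrete: if idle_servers_ptr points at a shared head
word and each server's struct umcg_task doubles as a link node, I would
expect userspace to push idle servers with a Treiber-style lock-free stack,
as in the sketch below. This is pure guesswork on my part, not something
the patch states:

#include <stdatomic.h>
#include <stdint.h>

struct umcg_task {			/* minimal stand-in for this sketch */
	_Atomic uint64_t idle_servers_ptr;
	/* ... other fields elided ... */
};

/* Guess: push this server onto a lock-free stack of idle servers. */
static void push_idle_server(_Atomic uint64_t *head, struct umcg_task *self)
{
	uint64_t old = atomic_load(head);

	do {
		/* Link to the current head before publishing ourselves. */
		atomic_store(&self->idle_servers_ptr, old);
	} while (!atomic_compare_exchange_weak(head, &old,
					       (uint64_t)(uintptr_t)self));
}

If the kernel pops from the same head with cmpxchg, that answers the
stack-vs-queue question, but the comment should say so explicitly.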


 > +/**
 > + * sys_umcg_ctl: (un)register a task as a UMCG task.
 > + * @flags:       ORed values from enum umcg_ctl_flag; see below;
 > + * @self:        a pointer to struct umcg_task that describes this
 > + *               task and governs the behavior of sys_umcg_wait if
 > + *               registering; must be NULL if unregistering.
 > + *
 > + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
 > + *         UMCG workers:
 > + *              - self->state must be UMCG_TASK_IDLE
 > + *              - @flags & UMCG_CTL_WORKER
 > + *
 > + *         If the conditions above are met, sys_umcg_ctl() immediately
 > + *         returns if the registered task is a RUNNING server or basic
 > + *         task; an IDLE worker will be added to idle_workers_ptr, and
 > + *         the worker put to sleep; an idle server from idle_servers_ptr
 > + *         will be woken, if any.

This approach to creating UMCG workers concerns me a little. My
understanding is that, in general, the number of servers controls the
amount of parallelism in the program. But in the case of creating new UMCG
workers, the new threads only respect the M:N threading model after
sys_umcg_ctl has blocked. What does this mean for applications that create
thousands of short-lived tasks? Are users expected to create pools of
reusable UMCG workers?
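
If so, I would expect such applications to end up with something like the
pool below, where each worker registers once and is then reused. The queue
plumbing is plain pthreads; the umcg_ctl() call in the comment is a
hypothetical wrapper around sys_umcg_ctl:

#include <pthread.h>
#include <stddef.h>

struct task {
	struct task *next;
	void (*fn)(void *);
	void *arg;
};

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static struct task *q_head;

static void *worker(void *ignored)
{
	(void)ignored;

	/* Register once, paying sys_umcg_ctl's blocking cost a single time:
	 * umcg_ctl(UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &self);
	 * (hypothetical wrapper; registration details elided) */

	for (;;) {
		pthread_mutex_lock(&q_lock);
		while (q_head == NULL)
			pthread_cond_wait(&q_cond, &q_lock);
		struct task *t = q_head;
		q_head = t->next;
		pthread_mutex_unlock(&q_lock);

		t->fn(t->arg);	/* run one short-lived task, then loop */
	}
	return NULL;
}

If that is the intended usage, it would be worth documenting; if not, the
per-thread registration cost needs an explanation.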


I would suggest adding at least one uint64_t field to struct umcg_task
that is left as-is by the kernel. This would allow implementers of
user-space schedulers to attach scheduler-specific data structures to
threads without needing some kind of table on the side.
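
Concretely, the addition and its use might look like this (the field name
user_data and everything around it are mine, for illustration):

#include <stdint.h>

struct umcg_task {
	uint32_t state;			/* existing fields elided ... */
	uint32_t server_tid;
	uint64_t user_data;		/* never read or written by the kernel */
} __attribute__((packed, aligned(8 * sizeof(uint64_t))));

struct sched_node {			/* scheduler-private per-thread data */
	struct umcg_task umcg;
	struct sched_node *next;	/* run-queue links, priorities, ... */
};

/* At registration the scheduler stores a back-pointer:
 *	node->umcg.user_data = (uint64_t)(uintptr_t)node;
 * and whenever the kernel hands back a struct umcg_task pointer, the
 * scheduler recovers its own node without any side table: */
static inline struct sched_node *node_of(struct umcg_task *ut)
{
	return (struct sched_node *)(uintptr_t)ut->user_data;
}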


