From: Tao Zhou <tao.zhou@linux.dev>
To: Peter Oskolkov <posk@posk.io>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Andy Lutomirski <luto@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-api@vger.kernel.org, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>, Peter Oskolkov <posk@google.com>,
	Andrei Vagin <avagin@google.com>, Jann Horn <jannh@google.com>,
	Thierry Delisle <tdelisle@uwaterloo.ca>,
	Tao Zhou <tao.zhou@linux.dev>
Subject: Re: [PATCH v0.7 5/5] sched/umcg: add Documentation/userspace-api/umcg.txt
Date: Mon, 18 Oct 2021 22:50:42 +0800
Message-ID: <YW2JwoDTDxGOzl+m@geo.homenetwork>
In-Reply-To: <20211012232522.714898-6-posk@google.com>

Hi Peter,

On Tue, Oct 12, 2021 at 04:25:22PM -0700, Peter Oskolkov wrote:
> Document User Managed Concurrency Groups syscalls, data structures,
> state transitions, etc.
> 
> This is a text version of umcg.rst.
> 
> Signed-off-by: Peter Oskolkov <posk@google.com>
> ---
>  Documentation/userspace-api/umcg.txt | 594 +++++++++++++++++++++++++++
>  1 file changed, 594 insertions(+)
>  create mode 100644 Documentation/userspace-api/umcg.txt
> 
> diff --git a/Documentation/userspace-api/umcg.txt b/Documentation/userspace-api/umcg.txt
> new file mode 100644
> index 000000000000..cabaa6f4aaad
> --- /dev/null
> +++ b/Documentation/userspace-api/umcg.txt
> @@ -0,0 +1,594 @@
> +UMCG USERSPACE API
> +
> +User Managed Concurrency Groups (UMCG) is an M:N threading
> +subsystem/toolkit that lets user space application developers implement
> +in-process user space schedulers.
> +
> +
> +CONTENTS
> +
> +    WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
> +    REQUIREMENTS
> +    UMCG KERNEL API
> +    SERVERS
> +    WORKERS
> +    UMCG TASK STATES
> +    STRUCT UMCG_TASK
> +    SYS_UMCG_CTL()
> +    SYS_UMCG_WAIT()
> +    STATE TRANSITIONS
> +    SERVER-ONLY USE CASES
> +
> +
> +WHY? HETEROGENEOUS IN-PROCESS WORKLOADS
> +
> +The Linux kernel's CFS scheduler is designed for the "common" use case, with
> +efficiency/throughput in mind. Work isolation and workloads of different
> +"urgency" are addressed by tools such as cgroups, CPU affinity, priorities,
> +etc., which are difficult or impossible to efficiently use in-process.
> +
> +For example, a single DBMS process may receive tens of thousands of requests
> +per second; some of these requests may have strong response latency
> +requirements as they serve live user requests (e.g. login authentication);
> +some of these requests may not care much about latency but must be served
> +within a certain time period (e.g. an hourly aggregate usage report); some
> +of these requests are to be served only on a best-effort basis and can be
> +NACKed under high load (e.g. an exploratory research/hypothesis testing
> +workload).
> +
> +Beyond different work item latency/throughput requirements as outlined
> +above, the DBMS may need to provide certain guarantees to different users;
> +for example, user A may "reserve" 1 CPU for their high-priority/low latency
                                                                   ^^^^^^^^^^^
                                                                   low-latency

> +requests, 2 CPUs for mid-level throughput workloads, and be allowed to send
> +as many best-effort requests as possible, which may or may not be served,
> +depending on the DBMS load. Besides, the best-effort work, started when the
> +load was low, may need to be delayed if suddenly a large amount of
> +higher-priority work arrives. With hundreds or thousands of users like
> +this, it is very difficult to guarantee the application's responsiveness
> +using standard Linux tools while maintaining high CPU utilization.
> +
> +Gaming is another use case: some in-process work must be completed before a
> +certain deadline dictated by the frame rendering schedule, while other work
> +items can be delayed; some work may need to be cancelled/discarded because
> +the deadline has passed; etc.
> +
> +User Managed Concurrency Groups is an M:N threading toolkit that allows
> +constructing user space schedulers designed to efficiently manage
> +heterogeneous in-process workloads described above while maintaining high
> +CPU utilization (95%+).
> +
> +
> +REQUIREMENTS
> +
> +One relatively established way to design high-efficiency, low-latency
> +systems is to split all work into small on-CPU work items, with
> +asynchronous I/O and continuations, all executed on a thread pool with the
> +number of threads not exceeding the number of available CPUs. Although this
> +approach works, it is quite difficult to develop and maintain such a
> +system, as, for example, small continuations are difficult to piece
> +together when debugging. Besides, such asynchronous callback-based systems
> +tend to be somewhat cache-inefficient, as continuations can get scheduled
> +on any CPU regardless of cache locality.
> +
> +M:N threading and cooperative user space scheduling enable controlled CPU
> +usage (minimal OS preemption), synchronous coding style, and better cache
> +locality.
> +
> +Specifically:
> +
> +* a variable/fluctuating number M of "application" threads should be
> +  "scheduled over" a relatively fixed number N of "kernel" threads, where
> +  N is less than or equal to the number of CPUs available;
> +* only those application threads that are attached to kernel threads are
> +  scheduled "on CPU";
> +* application threads should be able to cooperatively
> + yield to each other;

The above two lines can be joined into one:

   * application threads should be able to cooperatively yield to each other;

> +* when an application thread blocks in kernel (e.g. in I/O), this becomes
> +  a scheduling event ("block") that the userspace scheduler should be able
> +  to efficiently detect, and reassign a waiting application thread to the
> +  freed "kernel" thread;
> +* when a blocked application thread wakes (e.g. its I/O operation
> +  completes), this even ("wake") should also be detectable by the
                      ^^^^
                      event

> +  userspace scheduler, which should be able to either quickly dispatch the
> +  newly woken thread to an idle "kernel" thread or, if all "kernel"
> +  threads are busy, put it in the waiting queue;
> +* in addition to the above, it would be extremely useful for a separate
> +  in-process "watchdog" facility to be able to monitor the state of each
> +  of the M+N threads, and to intervene in case of runaway workloads
> +  (interrupt/preempt).
> +
> +
> +UMCG KERNEL API
> +
> +Based on the requirements above, the UMCG kernel API is built around the
> +following ideas:
> +
> +* UMCG server: a task/thread representing "kernel threads", or CPUs from
> +  the requirements above;
> +* UMCG worker: a task/thread representing "application threads", to be
> +  scheduled over servers;
> +* UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
> +  server or a worker) can be in;
> +* UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
> +  can be ORed with the task state to communicate additional information to
> +  the kernel;
> +* struct umcg_task: a per-task userspace set of data fields, usually
> +  residing in the TLS, that fully reflects the current task's UMCG state
> +  and controls the way the kernel manages the task;
> +* sys_umcg_ctl(): a syscall used to register the current task/thread as a
> +  server or a worker, or to unregister a UMCG task;
> +* sys_umcg_wait(): a syscall used to put the current task to sleep and/or
> +  wake another task, potentially context-switching between the two tasks
> +  on-CPU synchronously.
> +
> +
> +SERVERS
> +
> +When a task/thread is registered as a server, it is in RUNNING state and
> +behaves like any other normal task/thread. In addition, servers can
> +interact with other UMCG tasks via sys_umcg_wait():
> +
> +* servers can voluntarily suspend their execution (wait), becoming IDLE;
> +* servers can wake other IDLE servers;
> +* servers can context-switch between each other.
> +
> +Note that if a server blocks in the kernel not via sys_umcg_wait(), it
> +still retains its RUNNING state.
> +
> +
> +WORKERS
> +
> +A worker cannot be RUNNING without having a server associated with it, so
> +when a task is first registered as a worker, it enters the IDLE state.
> +
> +* a worker becomes RUNNING when a server calls sys_umcg_wait to
> +  context-switch into it; the server goes IDLE, and the worker becomes
> +  RUNNING in its place;
> +* when a running worker blocks in the kernel, it becomes BLOCKED, its
> +  associated server becomes RUNNING and the server's sys_umcg_wait() call
> +  from the bullet above returns; this transition is sometimes called
> +  "block detection";
> +* when the syscall a BLOCKED worker is blocked on completes, the worker
> +  becomes IDLE and is added to the list of idle workers; if there is an
> +  idle server waiting, the kernel wakes it; this transition is sometimes
> +  called "wake detection";
> +* running workers can voluntarily suspend their execution (wait),
     ^^^^^^^
     RUNNING

> +  becoming IDLE; their associated servers are woken;
> +* a RUNNING worker can context-switch with an IDLE worker; the server of
> +  the switched-out worker is transferred to the switched-in worker;
> +* any UMCG task can "wake" an IDLE worker via sys_umcg_wait(); unless
> +  this is a server running the worker as described in the first bullet in
> +  this list, the worker remains IDLE but is added to the idle workers list;
> +  this "wake" operation exists for completeness, to make sure
> +  wait/wake/context-switch operations are available for all UMCG tasks;
> +* the userspace can preempt a RUNNING worker by marking it
> +  RUNNING|PREEMPTED and sending a signal to it; the userspace should have
> +  installed a NOP signal handler for the signal; the kernel will then
> +  transition the worker into IDLE|PREEMPTED state and wake its associated
> +  server.
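
By the way, the preemption protocol in the last bullet took me a couple of
reads, so below is a minimal userspace sketch of my understanding. SIGUSR1 as
the preemption signal and the UMCG_TASK_STATE_MASK constant are my own
choices, not part of the API, and I ignore the timestamp bits for brevity:

    /* userspace side; illustrative sketch only */
    #include <signal.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/umcg.h>         /* from this patch set */

    static void nop_handler(int sig) { (void)sig; /* intentionally empty */ }

    static void setup_preemption_signal(void)
    {
            struct sigaction sa = { .sa_handler = nop_handler };

            sigaction(SIGUSR1, &sa, NULL);  /* the required NOP handler */
    }

    /* Ask the kernel to preempt worker @w_tid of process @tgid. */
    static bool preempt_worker(_Atomic uint64_t *w_state_ts, pid_t tgid,
                               pid_t w_tid)
    {
            uint64_t old = atomic_load(w_state_ts);

            /* Only RUNNING workers can be preempted. */
            if ((old & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING)
                    return false;

            /* W:RUNNING => W:RUNNING|PREEMPTED, one atomic step. */
            if (!atomic_compare_exchange_strong(w_state_ts, &old,
                                                old | UMCG_TF_PREEMPTED))
                    return false;

            /* The signal kicks the worker off the CPU; the kernel then
             * moves it to IDLE|PREEMPTED and wakes its server. */
            return syscall(SYS_tgkill, tgid, w_tid, SIGUSR1) == 0;
    }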
> +
> +
> +UMCG TASK STATES
> +
> +Important: all state transitions described below involve at least two
> +steps: the change of the state field in struct umcg_task, for example
> +RUNNING to IDLE, and the corresponding change in struct task_struct state,
> +for example a transition between the task running on CPU and being
> +descheduled and removed from the kernel runqueue. The key principle of UMCG
> +API design is that the party initiating the state transition modifies the
> +state variable.
> +
> +For example, a task going IDLE first changes its state from RUNNING to IDLE
> +in the userspace and then calls sys_umcg_wait(), which completes the
> +transition.
> +
> +Note on documentation: in include/uapi/linux/umcg.h, task states have the
> +form UMCG_TASK_RUNNING, UMCG_TASK_BLOCKED, etc. In this document these are
> +usually referred to simply as RUNNING and BLOCKED, unless it creates
> +ambiguity. Task state flags, e.g. UMCG_TF_PREEMPTED, are treated similarly.
> +
> +UMCG task states reflect the view from the userspace, rather than from the
> +kernel. There are three fundamental task states:
> +
> +* RUNNING: indicates that the task is schedulable by the kernel; applies
> +  to both servers and workers;
> +* IDLE: indicates that the task is not schedulable by the kernel (see
> +  umcg_idle_loop() in kernel/sched/umcg.c); applies to both servers and
> +  workers;
> +* BLOCKED: indicates that the worker is blocked in the kernel; does not
> +  apply to servers.
> +
> +In addition to the three states above, two state flags help with state
> +transitions:
> +
> +* LOCKED: the userspace is preparing the worker for a state transition
> +  and "locks" the worker until the worker is ready for the kernel to act
> +  on the state transition; used similarly to preempt_disable or
> +  irq_disable in the kernel; applies only to workers in RUNNING or IDLE
> +  state; RUNNING|LOCKED means "this worker is about to become RUNNING",
> +  while IDLE|LOCKED means "this worker is about to become IDLE or
> +  unregister";
> +* PREEMPTED: the userspace indicates it wants the worker to be preempted;
> +  there are no situations when both LOCKED and PREEMPTED flags are set at
> +  the same time.
> +
> +
> +STRUCT UMCG_TASK
> +
> +From include/uapi/linux/umcg.h:
> +
> +struct umcg_task {
> +      uint64_t        state_ts;               /* r/w */
> +      uint32_t        next_tid;               /* r   */
> +      uint32_t        flags;                  /* reserved */
> +      uint64_t        idle_workers_ptr;       /* r/w */
> +      uint64_t        idle_server_tid_ptr;    /* r   */
> +};
> +
> +Each UMCG task is identified by struct umcg_task, which is provided to the
> +kernel when the task is registered via sys_umcg_ctl().
> +
> +* uint64_t state_ts: the current state of the task this struct
> +  identifies, as described in the previous section, combined with a
> +  unique timestamp indicating when the last state change happened.
> +
> +  Readable/writable by both the kernel and the userspace.
> +
> +    bits  0 -  5: task state (RUNNING, IDLE, BLOCKED);
> +    bits  6 -  7: state flags (LOCKED, PREEMPTED);
> +    bits  8 - 12: reserved; must be zeroes;
> +    bits 13 - 17: for userspace use;
> +    bits 18 - 63: timestamp.
> +
> +   Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> +
> +   It is highly beneficial to tag each state change with a unique
> +   timestamp:
> +
> +   - timestamps will naturally provide instrumentation to measure
> +     scheduling delays, both in the kernel and in the userspace;
> +   - uniqueness of timestamps (modulo overflow) guarantees that state
> +     change races, especially ABA races, are easily detected and avoided.
> +
> +   Each timestamp represents the moment in time the state change happened,
> +   in nanoseconds, with the lower 4 bits and the upper 14 bits stripped.
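
(Side note: at 16 ns granularity, the 46-bit timestamp spans 2^46 * 16 ns =
2^50 ns, i.e. it wraps roughly every 13 days, so the "modulo overflow" caveat
above is mostly theoretical.)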
> +
> +   In this document 'umcg_task.state' is often used to talk about
> +   'umcg_task.state_ts' field, as timestamps do not carry semantic
> +   meaning at the moment.
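
For what it's worth, a userspace runtime will want small helpers to split
state_ts into its fields. A possible sketch, derived directly from the bit
layout above (the helper names and mask constants below are mine, not from
the uapi header):

    /* userspace side; illustrative sketch only */
    #include <stdint.h>

    #define UMCG_TS_STATE_MASK      0x3fULL  /* bits  0 -  5: task state  */
    #define UMCG_TS_FLAGS_MASK      0xc0ULL  /* bits  6 -  7: state flags */
    #define UMCG_TS_TIMESTAMP_BITS  46       /* bits 18 - 63: timestamp   */

    static inline uint64_t umcg_state(uint64_t state_ts)
    {
            return state_ts & UMCG_TS_STATE_MASK;
    }

    static inline uint64_t umcg_flags(uint64_t state_ts)
    {
            return state_ts & UMCG_TS_FLAGS_MASK;
    }

    static inline uint64_t umcg_timestamp(uint64_t state_ts)
    {
            return state_ts >> (64 - UMCG_TS_TIMESTAMP_BITS);
    }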
> +
> +   This is how umcg_task.state_ts is updated in the kernel:
> +
> +    /* kernel side */
> +    /**
> +     * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
> +     * @state_ts   - points to the state_ts member of struct umcg_task to update;
> +     * @expected   - the expected value of state_ts, including the timestamp;
> +     * @desired    - the desired value of state_ts, state part only;
> +     * @may_fault  - whether to use normal or _nofault cmpxchg.
> +     *
> +     * The function is basically cmpxchg(state_ts, expected, desired), with extra
> +     * code to set the timestamp in @desired.
> +     */
> +    static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> +                                    bool may_fault)
> +    {
> +            u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> +            u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> +
> +            /* Cut higher order bits. */
> +            next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
> +
> +            if (next_ts == curr_ts)
> +                    ++next_ts;
> +
> +            /* Remove an old timestamp, if any. */
> +            desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
> +
> +            /* Set the new timestamp. */
> +            desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
> +
> +            if (may_fault)
> +                    return cmpxchg_user_64(state_ts, expected, desired);
> +
> +            return cmpxchg_user_64_nofault(state_ts, expected, desired);
> +    }
> +
> +* uint32_t next_tid: contains the TID of the task to context-switch-into
> +  in sys_umcg_wait(); can be zero; writable by the userspace, readable by
> +  the kernel; if this is a RUNNING worker, this field contains the TID of
> +  the server that should be woken when this worker blocks; see
> +  sys_umcg_wait() for more details;
> +
> +* uint32_t flags: reserved; must be zero.
> +
> +* uint64_t idle_workers_ptr: this field forms a singly-linked list of
> +  idle workers: all RUNNING workers have this field set to point to the
> +  head of the list (a pointer variable in the userspace).
> +
> +  When a worker's blocking operation in the kernel completes, the kernel
> +  changes the worker's state from BLOCKED to IDLE and adds the worker to
> +  the top of the list of idle workers using this logic:
> +
> +    /* kernel side */
> +    /**
> +     * enqueue_idle_worker - push an idle worker onto idle_workers_ptr
> +     * list/stack.
> +     *
> +     * Returns true on success, false on a fatal failure.
> +     */
> +    static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> +    {
> +        u64 __user *node = &ut_worker->idle_workers_ptr;
> +        u64 __user *head_ptr;
> +        u64 first = (u64)node;
> +        u64 head;
> +
> +        if (get_user_nosleep(head, node) || !head)
> +                return false;
> +
> +        head_ptr = (u64 __user *)head;
> +
> +        if (put_user_nosleep(UMCG_IDLE_NODE_PENDING, node))
> +                return false;
> +
> +        if (xchg_user_64(head_ptr, &first))
> +                return false;
> +
> +        if (put_user_nosleep(first, node))
> +                return false;
> +
> +        return true;
> +    }
> +
> +  In the userspace the list is cleared atomically using this logic:
> +
> +    /* userspace side */
> +    /* Atomically detach the whole list by exchanging the head variable
> +     * with NULL; 'idle_workers' then points at the first node, if any. */
> +    uint64_t *idle_workers = (uint64_t *)atomic_exchange(head, NULL);
> +
> +  The userspace re-points workers' idle_workers_ptr to the list head
> +  variable before the worker is allowed to become RUNNING again.
> +
> +  When processing the idle workers list, the userspace should wait for
> +  workers marked as UMCG_IDLE_NODE_PENDING to have the flag cleared (see
> +  enqueue_idle_worker() above).
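
To make the PENDING handshake concrete, this is how I picture the consumer
side; 'head' is the userspace list-head variable the workers point at, and
mark_idle() is a hypothetical hook into the scheduler's run queue (I use the
GCC/Clang __atomic builtins to sidestep _Atomic qualifiers):

    /* userspace side; illustrative sketch only */
    #include <stddef.h>
    #include <stdint.h>
    #include <linux/umcg.h>         /* from this patch set */

    extern uint64_t head;           /* the list head variable */
    extern void mark_idle(struct umcg_task *w);  /* hypothetical hook */

    static void drain_idle_workers(void)
    {
            /* Detach the whole list by swapping the head with NULL. */
            uint64_t first = __atomic_exchange_n(&head, 0, __ATOMIC_ACQ_REL);

            while (first) {
                    struct umcg_task *w = (struct umcg_task *)(uintptr_t)
                            (first - offsetof(struct umcg_task,
                                              idle_workers_ptr));
                    uint64_t next;

                    /* The kernel may still be linking this node (see
                     * enqueue_idle_worker() above); wait for it to finish. */
                    do {
                            next = __atomic_load_n(&w->idle_workers_ptr,
                                                   __ATOMIC_ACQUIRE);
                    } while (next == UMCG_IDLE_NODE_PENDING);

                    mark_idle(w);

                    /* Re-point at the head before W may become RUNNING. */
                    w->idle_workers_ptr = (uint64_t)&head;
                    first = next;
            }
    }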
> +
> +* uint64_t idle_server_tid_ptr: points to a variable in the userspace
> +  that points to an idle server, i.e. a server in IDLE state waiting in
> +  sys_umcg_wait(); read-only; workers must have this field set; not used
> +  in servers.
> +
> +  When a worker's blocking operation in the kernel completes, the kernel
> +  changes the worker's state from BLOCKED to IDLE, adds the worker to the
> +  list of idle workers, and wakes the idle server if present; the kernel
> +  atomically exchanges (*idle_server_tid_ptr) with 0, thus waking the idle
> +  server, if present, only once. See State transitions below for more
> +  details.
> +
> +
> +SYS_UMCG_CTL()
> +
> +int sys_umcg_ctl(uint32_t flags, struct umcg_task *self) is used to
> +register or unregister the current task as a worker or server. Flags can be
> +one of the following:
> +
> +    UMCG_CTL_REGISTER: register a server;
> +    UMCG_CTL_REGISTER | UMCG_CTL_WORKER: register a worker;
> +    UMCG_CTL_UNREGISTER: unregister the current server or worker.
> +
> +When registering a task, self must point to struct umcg_task describing
> +this server or worker; the pointer must remain valid until the task is
> +unregistered.
> +
> +When registering a server, self->state must be RUNNING; all other fields in
> +self must be zeroes.
> +
> +When registering a worker, self->state must be RUNNING;
                                                  ^^^^^^^
                                                  IDLE

After looking through the document and the code, I feel that a newly
registered worker's state should be IDLE.

+
+A worker cannot be RUNNING without having a server associated with it, so
+when a task is first registered as a worker, it enters the IDLE state.
+



> +self->idle_server_tid_ptr and self->idle_workers_ptr must be valid pointers
> +as described in struct umcg_task; self->next_tid must be zero.
> +
> +When unregistering a task, self must be NULL.
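
A registration sketch tying the above together; __NR_umcg_ctl is assumed to
be wired up as in this patch set, the TLS placement is my choice, and I use
IDLE for the worker's initial state per my comment above:

    /* userspace side; illustrative sketch only */
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/umcg.h>         /* from this patch set */

    static __thread struct umcg_task umcg_self;  /* must outlive registration */

    static int register_server(void)
    {
            memset(&umcg_self, 0, sizeof(umcg_self));
            umcg_self.state_ts = UMCG_TASK_RUNNING;  /* servers start RUNNING */

            return syscall(__NR_umcg_ctl, UMCG_CTL_REGISTER, &umcg_self);
    }

    static int register_worker(uint64_t *idle_workers_head,
                               uint64_t *idle_server_tid)
    {
            memset(&umcg_self, 0, sizeof(umcg_self));
            umcg_self.state_ts = UMCG_TASK_IDLE;     /* see the comment above */
            umcg_self.idle_workers_ptr = (uint64_t)idle_workers_head;
            umcg_self.idle_server_tid_ptr = (uint64_t)idle_server_tid;

            return syscall(__NR_umcg_ctl,
                           UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &umcg_self);
    }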
> +
> +
> +SYS_UMCG_WAIT()
> +
> +int sys_umcg_wait(uint32_t flags, uint64_t abs_timeout) operates on
> +registered UMCG servers and workers: struct umcg_task *self provided to
> +sys_umcg_ctl() when registering the current task is consulted in addition
> +to flags and abs_timeout parameters.
> +
> +The function can be used to perform one of the three operations:
> +
> +* wait: if self->next_tid is zero, sys_umcg_wait() puts the current
> +  task to sleep;
> +* wake: if self->next_tid is not zero, and flags & UMCG_WAIT_WAKE_ONLY,
> +  the task identified by next_tid is woken;
> +* context switch: if self->next_tid is not zero, and !(flags &
> +  UMCG_WAIT_WAKE_ONLY), the current task is put to sleep and the next task
> +  is woken, synchronously switching between the tasks on the current CPU
> +  on the fast path.
> +
> +Flags can be zero or a combination of the following values:
> +
> +* UMCG_WAIT_WAKE_ONLY: wake the next task, don't put the current task to
> +  sleep;
> +* UMCG_WAIT_WF_CURRENT_CPU: wake the next task on the current CPU; this
> +  flag has an effect only if UMCG_WAIT_WAKE_ONLY is set: context switching
> +  is always attempted to happen on the current CPU.
> +
> +The section below provides more details on how servers and workers interact
> +via sys_umcg_wait(), during worker block/wake events, and during worker
> +preemption.
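
In call form, the three operations look roughly like this ('self' is the
struct registered via sys_umcg_ctl(); error handling is omitted, and the
state fields must already be set up as described under STATE TRANSITIONS
below):

    /* userspace side; illustrative sketch only */

    /* wait: sleep until someone marks us RUNNING and wakes us */
    self->next_tid = 0;
    syscall(__NR_umcg_wait, 0, /* abs_timeout */ 0);

    /* wake: wake the task with TID 'tid' without sleeping ourselves */
    self->next_tid = tid;
    syscall(__NR_umcg_wait, UMCG_WAIT_WAKE_ONLY, 0);

    /* context switch: sleep here, wake 'tid', same CPU on the fast path */
    self->next_tid = tid;
    syscall(__NR_umcg_wait, 0, 0);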
> +
> +
> +STATE TRANSITIONS
> +
> +As mentioned above, the key principle of UMCG state transitions is that the
> +party initiating the state transition modifies the state of affected tasks.
> +
> +Below, "TASK:STATE" indicates a task T, where T can be either W for worker
> +or S for server, in state S, where S can be one of the three states,
> +potentially ORed with a state flag. Each individual state transition is an
> +atomic operation (cmpxchg) unless indicated otherwise. Also note that the
> +order of state transitions is important and is part of the contract between
> +the userspace and the kernel. The kernel is free to kill the task (SIGKILL)
> +if the contract is broken.
> +
> +Some worker state transitions below include adding the LOCKED flag to the worker
> +state. This is done to indicate to the kernel that the worker is

                                                        ..worker is +in the+

> +transitioning state and should not participate in the block/wake detection
> +routines, which can happen due to interrupts/pagefaults/signals.
> +
> +IDLE|LOCKED means that a running worker is preparing to sleep, so
> +interrupts should not lead to server wakeup; RUNNING|LOCKED means that an
> +idle worker is going to be "scheduled to run", but may not yet have its
> +server set up properly.
> +
> +Key state transitions:
> +
> +* server to worker context switch ("schedule a worker to run"):
> +  S:RUNNING+W:IDLE => S:IDLE+W:RUNNING:
> +        in the userspace, in the context of the server S running:
> +            S:RUNNING => S:IDLE (mark self as idle)
> +            W:IDLE => W:RUNNING|LOCKED (mark the worker as running)
> +            W.next_tid := S.tid; S.next_tid := W.tid (link the server with
> +                the worker)
> +            W:RUNNING|LOCKED => W:RUNNING (unlock the worker)
> +            S: sys_umcg_wait() (make the syscall)
> +        the kernel context switches from the server to the worker; the
> +        server sleeps until it becomes RUNNING during one of the
> +        transitions below;
> +
> +* worker to server context switch (worker "yields"):
> +  S:IDLE+W:RUNNING => S:RUNNING+W:IDLE:
> +        in the userspace, in the context of the worker W running (note that
> +        a running worker has its next_tid set to point to its server):
> +            W:RUNNING => W:IDLE|LOCKED (mark self as idle)
> +            S:IDLE => S:RUNNING (mark the server as running)
> +            W: sys_umcg_wait() (make the syscall)
> +        the kernel removes the LOCKED flag from the worker's state and
> +        context switches from the worker to the server; the worker sleeps
> +        until it becomes RUNNING;
> +
> +* worker to worker context switch: W1:RUNNING+W2:IDLE =>
> +  W1:IDLE+W2:RUNNING:
> +        in the userspace, in the context of W1 running:
> +            W2:IDLE => W2:RUNNING|LOCKED (mark W2 as running)
> +            W1:RUNNING => W1:IDLE|LOCKED (mark self as idle)
> +            W2.next_tid := W1.next_tid; S.next_tid := W2.tid (transfer the
> +                server W1 => W2)
> +            W1.next_tid := W2.tid (indicate that W1 should context-switch
> +                into W2)
> +            W2:RUNNING|LOCKED => W2:RUNNING (unlock W2)
> +            W1: sys_umcg_wait() (make the syscall)
> +        same as above, the kernel removes the LOCKED flag from W1's
> +        state and context switches to next_tid;
> +
> +* worker wakeup: W:IDLE => W:RUNNING:
> +        in the userspace, a server S can wake a worker W without
> +        "running" it:
> +            S.next_tid := W.tid
> +            W.next_tid := 0
> +            W:IDLE => W:RUNNING
> +            sys_umcg_wait(UMCG_WAIT_WAKE_ONLY) (make the syscall)
> +        the kernel will wake the worker W; as the worker does not have a
> +        server assigned, "wake detection" will happen, the worker will be
> +        immediately marked as IDLE and added to idle workers list; an idle
> +        server, if any, will be woken (see 'wake detection' below);
> +
> +        Note: if needed, it is possible for a worker to wake another
> +        worker: the waker marks itself "IDLE|LOCKED", points its next_tid
> +        to the wakee, makes the syscall, restores its server in next_tid,
> +        marks itself as RUNNING.
> +
> +* block detection: worker blocks in the kernel: S:IDLE+W:RUNNING =>
> +  S:RUNNING+W:BLOCKED:
> +        when a worker blocks in the kernel in RUNNING state (not LOCKED),
> +        before descheduling the task from the CPU the kernel performs
> +        these operations:
> +            W:RUNNING => W:BLOCKED
> +            S := W.next_tid
> +            S:IDLE => S:RUNNING
> +            try_to_wake_up(S)
> +        if any of the first three operations above fail, the worker is
> +        killed via SIGKILL. Note that ttwu(S) is not required to succeed,
> +        as the server may still be transitioning to sleep in
> +        sys_umcg_wait(); before actually putting the server to sleep its
> +        UMCG state is checked and, if it is RUNNING, sys_umcg_wait()
> +        returns to the userspace;
> +        if the worker has its LOCKED flag set, block detection does not
> +        trigger, as the worker is assumed to be in the userspace
> +        scheduling code.
> +
> +* wake detection: worker wakes in the kernel: W:BLOCKED => W:IDLE:
> +        all workers' returns to the userspace are intercepted:
> +            start: (a label)
> +            if W:RUNNING & W.next_tid != 0: let the worker exit to the
> +                userspace, as this is a RUNNING worker with a server;
> +            W:* => W:IDLE (workers that were blocked, or woken without
> +                a server, are not allowed to return to the userspace);
> +            the worker is appended to W.idle_workers_ptr idle workers list;
> +            S := *W.idle_server_tid_ptr; if (S != 0) S:IDLE => S:RUNNING;
> +                ttwu(S)
> +            idle_loop(W): this is the same idle loop that sys_umcg_wait()
> +                uses: it breaks only when the worker becomes RUNNING; when
> +                the idle loop exits, it is assumed that the userspace has
> +                properly removed the worker from the idle workers list
> +                before marking it RUNNING;
> +            goto start; (repeat from the beginning).
> +
> +        the logic above is a bit more complicated in the presence of
> +        LOCKED or PREEMPTED flags, but the main invariants
> +        stay the same:
> +            only RUNNING workers with servers assigned are allowed to run
> +                in the userspace (unless LOCKED);
> +            newly IDLE workers are added to the idle workers list; any
> +                user-initiated state change assumes the userspace
> +                properly removed the worker from the list;
> +            as with block detection, any "breach of contract" by the
> +                userspace will result in the task termination via SIGKILL.
> +
> +* worker preemption: S:IDLE+W:RUNNING => S:RUNNING+W:IDLE|PREEMPTED:
> +        when the userspace wants to preempt a RUNNING worker, it changes
> +        its state, atomically, RUNNING => RUNNING|PREEMPTED and sends a
> +        signal to the worker via tgkill(); the signal handler, previously
> +        set up by the userspace, can be a NOP (note that only RUNNING
> +        workers can be preempted);
> +
> +        if the worker was still running on-CPU in the userspace at the
> +        moment the signal arrived, the "wake detection" code will be
> +        triggered; in addition to what was described above, it will
> +        check whether the worker is in RUNNING|PREEMPTED state:
> +            W:RUNNING|PREEMPTED => W:IDLE|PREEMPTED
> +            S := W.next_tid
> +            S:IDLE => S:RUNNING
> +            try_to_wake_up(S)
> +
> +        if the signal arrives after the worker blocks in the kernel,
> +        "block detection" happens as described above, with the
> +        following change:
> +            W:RUNNING|PREEMPTED => W:BLOCKED|PREEMPTED
> +            S := W.next_tid
> +            S:IDLE => S:RUNNING
> +            try_to_wake_up(S)
> +
> +        in any case, the worker's server is woken, with its attached
> +        worker (S.next_tid) either in BLOCKED|PREEMPTED or IDLE|PREEMPTED
> +        state.
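
To check my reading of the first transition above ("schedule a worker to
run"), here it is as userspace code. I gloss over the timestamp bits, so the
cmpxchg'es below would really have to compare against the full state_ts, and
UMCG_TF_LOCKED is my guess at the flag's uapi name:

    /* userspace side, server context; illustrative sketch only */
    static bool run_worker(struct umcg_task *s, pid_t s_tid,
                           struct umcg_task *w, pid_t w_tid)
    {
            uint64_t old = UMCG_TASK_RUNNING;

            /* S:RUNNING => S:IDLE (mark self as idle) */
            if (!__atomic_compare_exchange_n(&s->state_ts, &old,
                            UMCG_TASK_IDLE, false,
                            __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
                    return false;

            /* W:IDLE => W:RUNNING|LOCKED (mark the worker as running) */
            old = UMCG_TASK_IDLE;
            if (!__atomic_compare_exchange_n(&w->state_ts, &old,
                            UMCG_TASK_RUNNING | UMCG_TF_LOCKED, false,
                            __ATOMIC_ACQ_REL, __ATOMIC_RELAXED)) {
                    /* undo: back to S:RUNNING */
                    __atomic_store_n(&s->state_ts, UMCG_TASK_RUNNING,
                                     __ATOMIC_RELEASE);
                    return false;
            }

            /* Link the server with the worker. */
            w->next_tid = s_tid;
            s->next_tid = w_tid;

            /* W:RUNNING|LOCKED => W:RUNNING (unlock the worker) */
            __atomic_store_n(&w->state_ts, UMCG_TASK_RUNNING,
                             __ATOMIC_RELEASE);

            /* Context switch; returns once S is RUNNING again. */
            return syscall(__NR_umcg_wait, 0, 0) == 0;
    }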
> +
> +
> +SERVER-ONLY USE CASES
> +
> +Some workloads/applications may benefit from fast and synchronous on-CPU
> +user-initiated context switches without the need for full userspace
> +scheduling (block/wake detection). These applications can use "standalone"
> +UMCG servers to wait/wake/context-switch. At the moment only in-process
> +operations are allowed. In the future this restriction will be lifted,
> +and wait/wake/context-switch operations between servers in related
> +processes will be permitted (when it is safe to do so, e.g. if the
> +processes belong to the same user and/or cgroup).
> +
> +These "worker-less" operations involve trivial RUNNING <==> IDLE state
> +changes, not discussed here for brevity.
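
Presumably the server-only RUNNING <==> IDLE dance then reduces to something
like the following (same caveats about the timestamp bits as in my earlier
sketch; error handling elided):

    /* userspace side; server A context-switches to idle server B */
    static void server_switch_to(struct umcg_task *a, struct umcg_task *b,
                                 pid_t b_tid)
    {
            uint64_t old = UMCG_TASK_RUNNING;

            /* A:RUNNING => A:IDLE */
            __atomic_compare_exchange_n(&a->state_ts, &old, UMCG_TASK_IDLE,
                            false, __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);

            /* B:IDLE => B:RUNNING */
            old = UMCG_TASK_IDLE;
            __atomic_compare_exchange_n(&b->state_ts, &old, UMCG_TASK_RUNNING,
                            false, __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);

            a->next_tid = b_tid;
            syscall(__NR_umcg_wait, 0, 0);  /* returns once A is RUNNING again */
            a->next_tid = 0;
    }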
> --
> 2.25.1
> 


Thanks,
Tao
