From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter Oskolkov
Date: Mon, 12 Jul 2021 08:40:36 -0700
Subject: Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
To: Thierry Delisle
Cc: posk@posk.io, avagin@google.com, bsegall@google.com, jannh@google.com,
	jnewsome@torproject.org, joel@joelfernandes.org,
	linux-api@vger.kernel.org, linux-kernel@vger.kernel.org,
	mingo@redhat.com, peterz@infradead.org, pjt@google.com,
	tglx@linutronix.de, Peter Buhr, Martin Karsten
References: <20210708194638.128950-4-posk@google.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jul 11, 2021 at 11:29 AM Thierry Delisle wrote:
>
> > Let's move the discussion to the new thread.
>
> I'm happy to start a new thread. I'm re-responding to my last post
> because many of my questions are still unanswered.
>
> > + * State transitions:
> > + *
> > + * RUNNING => IDLE: the current RUNNING task becomes IDLE by calling
> > + *                  sys_umcg_wait();
> >
> > [...]
>
> > +/**
> > + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
> > + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
> > + * @UMCG_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
> > + *                       (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
> > + *                       must be set.
> > + */
> > +enum umcg_wait_flag {
> > +	UMCG_WAIT_WAKE_ONLY	= 1,
> > +	UMCG_WF_CURRENT_CPU	= 2,
> > +};
>
> What is the purpose of using sys_umcg_wait without next_tid or with
> UMCG_WAIT_WAKE_ONLY? It looks like Java's park/unpark semantics to me,
> that is, worker threads can use this for synchronization and mutual
> exclusion. In this case, how do these compare to using
> FUTEX_WAIT/FUTEX_WAKE?

sys_umcg_wait() without next_tid puts the task into UMCG_IDLE state; a
wake wakes it. These are standard sched operations. If they are emulated
via futexes, fast context switching would require something like
FUTEX_SWAP, which was NACKed last year.

>
> > +struct umcg_task {
> > [...]
> > +	/**
> > +	 * @server_tid: the TID of the server UMCG task that should be
> > +	 * woken when this WORKER becomes BLOCKED. Can be zero.
> > +	 *
> > +	 * If this is a UMCG server, @server_tid should
> > +	 * contain the TID of @self - it will be used to find
> > +	 * the task_struct to wake when pulled from
> > +	 * @idle_servers.
> > +	 *
> > +	 * Read-only for the kernel, read/write for the userspace.
> > +	 */
> > +	uint32_t	server_tid;		/* r   */
> > [...]
> > +	/**
> > +	 * @idle_servers_ptr: a single-linked list pointing to the list
> > +	 * of idle servers. Can be NULL.
> > +	 *
> > +	 * Readable/writable by both the kernel and the userspace: the
> > +	 * userspace adds items to the list, the kernel removes them.
> > +	 *
> > +	 * TODO: describe how the list works.
> > +	 */
> > +	uint64_t	idle_servers_ptr;	/* r/w */
> > [...]
> > +} __attribute__((packed, aligned(8 * sizeof(__u64))));
>
> From the comments and by elimination, I'm guessing that idle_servers_ptr
> is somehow used by servers to block until some worker threads become
> idle. However, I do not understand how the userspace is expected to use
> it. I also do not understand whether these link fields form a stack or a
> queue, and where the head is.

When a server has nothing to do (no work to run), it is put into IDLE
state and added to the list. The kernel wakes an IDLE server when a
blocked worker unblocks.

>
> > +/**
> > + * sys_umcg_ctl: (un)register a task as a UMCG task.
> > + * @flags:       ORed values from enum umcg_ctl_flag; see below;
> > + * @self:        a pointer to struct umcg_task that describes this
> > + *               task and governs the behavior of sys_umcg_wait if
> > + *               registering; must be NULL if unregistering.
> > + *
> > + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> > + *         UMCG workers:
> > + *              - self->state must be UMCG_TASK_IDLE
> > + *              - @flags & UMCG_CTL_WORKER
> > + *
> > + *         If the conditions above are met, sys_umcg_ctl() immediately
> > + *         returns if the registered task is a RUNNING server or basic
> > + *         task; an IDLE worker will be added to idle_workers_ptr, and
> > + *         the worker put to sleep; an idle server from
> > + *         idle_servers_ptr will be woken, if any.
>
> This approach to creating UMCG workers concerns me a little. My
> understanding is that in general, the number of servers controls the
> amount of parallelism in the program. But in the case of creating new
> UMCG workers, the new threads only respect the M:N threading model after
> sys_umcg_ctl has blocked. What does this mean for applications that
> create thousands of short-lived tasks? Are users expected to create
> pools of reusable UMCG workers?

Yes: task/thread creation is not as lightweight as just posting work
items onto a preexisting pool of workers.
>
> I would suggest adding at least one uint64_t field to the struct
> umcg_task that is left as-is by the kernel. This allows implementers of
> user-space schedulers to add scheduler-specific data structures to the
> threads without needing some kind of table on the side.

This is usually achieved by embedding the kernel struct into a larger
userspace/TLS struct. For example:

struct umcg_task_user {
	struct umcg_task	umcg_task;
	extra_user_data		d1;
	extra_user_ptr		p1;
	/* etc. */
} __aligned(...);