Date: Tue, 24 Mar 2020 15:30:46 -0400 (EDT)
From: Mathieu Desnoyers
To: Tejun Heo
Cc: Li Zefan, cgroups, linux-kernel, Peter Zijlstra, Ingo Molnar,
 Valentin Schneider, Thomas Gleixner
Message-ID: <195391080.10219.1585078246788.JavaMail.zimbra@efficios.com>
In-Reply-To: <20200324180139.GB162390@mtj.duckdns.org>
Subject: Re: [regression] cpuset: offlined CPUs removed from affinity masks

----- On Mar 24, 2020, at 2:01 PM, Tejun Heo tj@kernel.org wrote:

> On Thu, Mar 12, 2020 at 03:47:50PM -0400, Mathieu Desnoyers wrote:
>> The basic idea is to allow applications to pin to every possible cpu, but
>> not allow them to use this to consume a lot of cpu time on CPUs they are
>> not allowed to run on.
>>
>> Thoughts?
>
> One thing that we learned is that priority alone isn't enough to isolate
> cpu consumption, no matter how low the priority may be, if the workload is
> latency sensitive. The actual computation capacity of cpus gets saturated
> way before cpu time is saturated, and the latency impact from lowered mips
> becomes noticeable. So, depending on the workload, allowing threads to run
> at the lowest priority on disallowed cpus might not lead to the behavior
> users expect. But I have no idea what kind of usage models you have in
> mind for the new system call.

Let me take a step back and focus on the requirements for the moment. It
should help us navigate more easily through the various solutions available
to us.

The two goals are to enable use-cases such as migration of free memory
within a user-space memory allocator (typically single-process), and
issuing operations on each CPU's data from the consumer of a user-space
per-CPU ring buffer (multi-process over shared memory).

For the memory allocator use-case, one scenario which illustrates the
situation well is related to CPU hotplug: with per-CPU memory pools, what
should the application do when a CPU goes offline? Ideally, it should have
a manager thread able to detect that a CPU is offline, and to reclaim free
memory or move it into other CPUs' pools. However, considering that
user-space has no means to do this synchronously with respect to CPU
hotplug, the CPU may very well come back online and start using those data
structures once more, so we cannot presume mutual exclusion from an offline
CPU.

One way to achieve this is by allowing user-space to run rseq critical
sections targeting the per-CPU (user-space) data of any possible CPU.
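For concreteness, here is a minimal sketch of such a per-CPU pool
operation, modeled on the kernel's rseq selftests (percpu_list). The pool
layout and the pool_push() helper are illustrative names of my own;
rseq_cmpeqv_storev() and RSEQ_READ_ONCE() are the primitives shipped with
the selftests and librseq. The last argument is the crux: the operation
only succeeds while the caller is actually running on @cpu, which is
exactly what cannot be arranged today for an offline CPU.

#define _GNU_SOURCE
#include <sched.h>              /* CPU_SETSIZE */
#include <stdint.h>             /* intptr_t */
#include <rseq/rseq.h>          /* librseq, or the selftests' rseq.h */

struct free_node {
        struct free_node *next;
};

struct percpu_pool {
        struct free_node *head;
} __attribute__((aligned(128)));        /* avoid false sharing */

static struct percpu_pool pool[CPU_SETSIZE];

/*
 * Push a node onto @cpu's free list. The calling thread must have
 * registered rseq (rseq_register_current_thread()) and must be running
 * on @cpu, otherwise the cpu_id check inside rseq_cmpeqv_storev()
 * keeps failing and this loops forever.
 */
static void pool_push(int cpu, struct free_node *node)
{
        int ret;

        do {
                struct free_node *head = RSEQ_READ_ONCE(pool[cpu].head);

                node->next = head;
                ret = rseq_cmpeqv_storev((intptr_t *)&pool[cpu].head,
                                         (intptr_t)head, (intptr_t)node,
                                         cpu);
                /* ret > 0: head or cpu changed; ret < 0: abort. Retry. */
        } while (ret);
}

The pop side used for reclaim has the same structure (the selftests use
rseq_cmpnev_storeoffp_load() so that head->next is loaded inside the
critical section). Either way, a manager thread can only drain pool[cpu]
while running on @cpu, hence the need to pin onto possibly-offline CPUs.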
However, when considering allowing threads to pin themselves on any of the
possible CPUs, three concerns arise:

- CPU hotplug (offline CPUs),
- the sched_setaffinity affinity mask, which can be set either internally
  by the process or externally by a manager process,
- the cgroup cpuset allowed mask, which can likewise be set either
  internally or by a manager process.

For offline CPUs, the pin_on_cpu system call ensures that a task can run on
a "backup runqueue" when it pins itself onto an offline CPU. The current
algorithm is to choose the first online CPU's runqueue for this. As soon as
the offline CPU is brought back online, all tasks pinned to that CPU are
moved to their rightful runqueue.
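To make the intended usage concrete, here is a hypothetical sketch of how a
manager thread would invoke the proposed system call. pin_on_cpu is not a
merged ABI: the syscall number below is a placeholder, the cmd/flags/cpu
shape and the command names follow the RFC posting, and the whole interface
should be read as illustrative only.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>

/* Placeholders: no syscall number has been allocated upstream. */
#define __NR_pin_on_cpu         -1
#define PIN_ON_CPU_CMD_SET      (1 << 0)        /* illustrative */
#define PIN_ON_CPU_CMD_CLEAR    (1 << 1)        /* illustrative */

static int pin_on_cpu(int cmd, int flags, int cpu)
{
        return syscall(__NR_pin_on_cpu, cmd, flags, cpu);
}

/*
 * Manager thread: reclaim the free pool of a possibly-offline CPU.
 * If @cpu is offline, the kernel runs this thread on the backup
 * runqueue (currently the first online CPU) together with every other
 * thread pinned to @cpu, so rseq critical sections against pool[cpu]
 * keep their mutual-exclusion guarantee.
 */
static int reclaim_cpu_pool(int cpu)
{
        if (pin_on_cpu(PIN_ON_CPU_CMD_SET, 0, cpu))
                return -errno;

        /* ... drain pool[cpu] with rseq critical sections ... */

        if (pin_on_cpu(PIN_ON_CPU_CMD_CLEAR, 0, cpu))
                return -errno;
        return 0;
}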
For sched_setaffinity's affinity mask, I don't think it is such an issue,
because pinning onto specific CPUs does not provide more rights than what
could have been achieved by setting the affinity mask to a single CPU. The
main difference between sched_setaffinity to a single CPU and pin_on_cpu is
the behavior when the target CPU goes offline: sched_setaffinity then
allows the thread to move to any runqueue (which is really bad for rseq
purposes), whereas pin_on_cpu moves the thread to a runqueue which is
guaranteed to be the same for all threads wanting to be pinned on that CPU.

Then there is the issue of the cgroup cpuset: AFAIU, cgroup v1's
integration with CPU hotplug removes the offlined CPUs from the cgroup's
allowed mask, which basically breaks the memory allocator free memory
migration/reclaim use-case, because there is then no way to target an
offline CPU once the cgroup's allowed mask is applied.

For cgroup v2, AFAIU, it allows creation of groups which target specific
threads within a process, so some threads can have an allowed mask which
differs from the others'. In that kind of scenario, it is not possible to
have a manager thread allowed to pin itself onto each CPU which can be
accessed by the other threads in the same process.

Also, for the multi-process shared memory use-case (ring buffer), if the
various processes which interact with the same shared memory end up in
different cgroups allowed to run on different subsets of the possible CPUs,
it becomes impossible to have a consumer allowed to pin itself on all the
CPUs it needs.

Ideally, I would like to come up with an approach that is not fragile when
combined with cgroups or CPU hotplug.

One approach I have envisioned is to allow pin_on_cpu to target CPUs which
are not part of the cpuset's allowed mask, but to lower the priority of the
threads to the lowest possible priority while they are pinned there. That
approach would allow threads to pin themselves on basically any CPU in the
possible cpu mask. But, as you point out, maybe this is an issue in terms
of workload isolation.

I am welcoming ideas on how to solve this.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com