Subject: Re: [RFC 00/60] Coscheduling for Linux
From: Jan H. Schönherr
To: Frederic Weisbecker
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel@vger.kernel.org,
    Rik van Riel, Subhra Mazumdar
Date: Fri, 19 Oct 2018 13:40:03 +0200
References: <20180907214047.26914-1-jschoenh@amazon.de> <20181017020933.GC24723@lerouge>
In-Reply-To: <20181017020933.GC24723@lerouge>

On 17/10/2018 04.09, Frederic Weisbecker wrote:
> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>> C) How does it work?
>> --------------------
[...]
>> For each task-group, the user can select at which level it should be
>> scheduled. If you set "cpu.scheduled" to "1", coscheduling will typically
>> happen at core-level on systems with SMT. That is, if one SMT sibling
>> executes a task from this task group, the other sibling will do so, too.
>> If no task is available, the SMT sibling will be idle.
>> With "cpu.scheduled" set to "2" this is extended to the next level, which
>> is typically a whole socket on many systems. And so on. If you feel that
>> this does not provide enough flexibility, you can specify
>> "cosched_split_domains" on the kernel command line to create more
>> fine-grained scheduling domains for your system.
>
> Have you considered using cpuset to specify the set of CPUs inside which
> you want to coschedule task groups in? Perhaps that would be more flexible
> and intuitive to control than this cpu.scheduled value.

Yes, I did consider cpusets. Though, there are two dimensions to it: a) at
what fraction of the system tasks shall be coscheduled, and b) where these
tasks shall execute within the system.

cpusets would be the obvious answer to the "where". However, in their
current form they are too inflexible and carry too much overhead. Suppose
you want to coschedule two tasks on SMT siblings of a core. You would be
able to restrict the tasks to a specific core with a cpuset. But then the
group is bound to that core, and the load balancer cannot move the group of
two tasks to a different core. Now, it would be possible to "invent"
relocatable cpusets to address that issue ("I want affinity restricted to a
core, I don't care which"), but then the current way cpuset affinity is
enforced does not scale for use from within the balancer. (The upcoming load
balancing portion of the coscheduler currently uses a file similar to
cpu.scheduled to restrict affinity to a load-balancer-controlled subset of
the system.)

Using cpusets as the means to describe which parts of the system are to be
coscheduled *may* be possible. But if so, it is a long way out. The current
implementation uses scheduling domains for this, because (a) most
coscheduling use cases require an alignment to the topology, and (b) it
integrates really nicely with the load balancer.

AFAIK, there is already some interaction between cpusets and scheduling
domains.
But it is supposed to be rather static, and as soon as you have overlapping
cpusets, you end up with the default scheduling domains. If we were able to
make the scheduling domains more dynamic than they are today, we might be
able to couple that to cpusets (or some similar interface to *define*
scheduling domains).

Regards
Jan
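For readers following the thread, the "cpu.scheduled" interface quoted above
would be exercised roughly like this; a minimal sketch only, assuming a
cgroup-v1-style cpu controller mount as in the RFC, with the mount point and
group name ("vm1") being illustrative, not taken from the patches:

```shell
# Illustrative sketch of the RFC's interface; mount point and group
# name "vm1" are assumptions, not from the posted patches.

# Create a task group under the cpu controller.
mkdir /sys/fs/cgroup/cpu/vm1

# Coschedule at core level (cpu.scheduled = 1): while one SMT sibling
# runs a task from this group, the other sibling runs one too, or idles.
echo 1 > /sys/fs/cgroup/cpu/vm1/cpu.scheduled

# A value of "2" would extend coscheduling to the next topology level,
# typically a whole socket.

# Move the current shell into the group.
echo $$ > /sys/fs/cgroup/cpu/vm1/tasks
```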