Subject: Re: [RFC 00/60] Coscheduling for Linux
From: Jan H. Schönherr
To: Rik van Riel, Peter Zijlstra
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Paul Turner, Vincent Guittot, Morten Rasmussen, Tim Chen
Date: Mon, 24 Sep 2018 17:23:55 +0200
References: <20180907214047.26914-1-jschoenh@amazon.de> <20180914111251.GC24106@hirez.programming.kicks-ass.net> <1d86f497-9fef-0b19-50d6-d46ef1c0bffa@amazon.de> <1e3c2ab11320c1c2f320f9e24ac0d31625bd60e6.camel@surriel.com>
In-Reply-To: <1e3c2ab11320c1c2f320f9e24ac0d31625bd60e6.camel@surriel.com>

On 09/18/2018 04:40 PM, Rik van Riel wrote:
> On Fri, 2018-09-14 at 18:25 +0200, Jan H. Schönherr wrote:
>> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
>>> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>>>>
>>>> B) Why would I want this?
>>>> [one quoted use case from the original e-mail]
>
> What are the other use cases, and what kind of performance
> numbers do you have to show examples of workloads where
> coscheduling provides a performance benefit?

For further use cases (still an incomplete list) let me redirect you to the unabridged Section B of the original e-mail: https://lkml.org/lkml/2018/9/7/1521

If you want me to, I can go into more detail and make the list from that e-mail more complete. Note that many coscheduling use cases are not primarily about performance.

Sure, there are the resource contention use cases, which are barely about anything else. See, e.g., [1] for a survey with further pointers to the potential performance gains. Realizing those use cases would require either a user-space component driving this, or another kernel component performing a function similar to the current auto-grouping, with more complexity depending on the desired level of sophistication. This extra component is out of my scope. But I see a coscheduler like this as an enabler for practical applications of these kinds of use cases.

If you use coscheduling as part of a solution that closes a side channel, performance is a secondary aspect, and hopefully we don't lose much of it.

Then there's the large fraction of use cases where coscheduling is primarily about design flexibility, because it enables different (old and new) application designs that usually cannot be executed efficiently without coscheduling. For these use cases performance is important, but there is also a trade-off against the development costs of alternative solutions to consider. These are also the use cases where we can do measurements today, i.e., without some yet-to-be-written extra component.

For example, with coscheduling it is possible to use active waiting instead of passive waiting/spin-blocking on non-dedicated systems, because lock holder preemption is no longer an issue.
It also allows applications that were developed for dedicated scenarios to be used in non-dedicated settings without loss in performance -- like an (unmodified) operating system within a VM, or HPC code. Another example is cache optimization of parallel algorithms, where you don't have to resort to cache-oblivious algorithms for efficiency, but can stay with manually tuned or auto-tuned algorithms, even on non-dedicated systems. (You're even able to do the tuning itself on a system that has other load.)

Now, you asked about the performance numbers that *I* have. If a workload has issues with lock-holder preemption, I've seen anywhere from 5x to 20x improvement with coscheduling. (This includes parallel programs [2] and VMs with unmodified guests without PLE [3].) That is of course highly dependent on the workload. I currently don't have any numbers comparing coscheduling to other solutions for reducing/avoiding lock holder preemption that don't mix in some other aspect, like resource contention. These would have to be micro-benchmarked.

If you're happy to compare across some more moving variables, then more or less blind coscheduling of parallel applications with some automatic, workload-driven (but application-agnostic) width adjustment of coscheduled sets yielded an overall performance benefit of roughly 10% to 20% compared to approaches with passive waiting [2]. It was roughly on par with pure space-partitioning approaches (slight minus on performance, slight plus on flexibility/fairness).

I never went much into the resource contention use cases myself. Though, I did use coscheduling to extend the concept of "nice" to sockets by putting all niced programs into a coscheduled task group with appropriately reduced shares. This way, niced programs don't just get any and all idle CPU capacity -- taking away parts of the energy budget of more important tasks all the time -- which leads to important tasks running at turbo frequencies more often.
Depending on the parallelism of the niced workload and that of the normal workload, this translates to a performance improvement of the normal workload that corresponds roughly to the increase in frequency (for CPU-bound tasks) [4]. Depending on the processor, that can be anything from just a few percent to about a factor of 2.

Regards
Jan

References:

[1] S. Zhuravlev, J. C. Saez, S. Blagodurov, A. Fedorova, and M. Prieto, "Survey of scheduling techniques for addressing shared resources in multicore processors," ACM Computing Surveys, vol. 45, no. 1, pp. 4:1-4:28, Dec. 2012.

[2] J. H. Schönherr, B. Juurlink, and J. Richling, "TACO: A scheduling scheme for parallel applications on multicore architectures," Scientific Programming, vol. 22, no. 3, pp. 223-237, 2014.

[3] J. H. Schönherr, B. Lutz, and J. Richling, "Non-intrusive coscheduling for general purpose operating systems," in Proceedings of the International Conference on Multicore Software Engineering, Performance, and Tools (MSEPT '12), ser. Lecture Notes in Computer Science, vol. 7303. Berlin/Heidelberg, Germany: Springer, May 2012, pp. 66-77.

[4] J. H. Schönherr, J. Richling, M. Werner, and G. Mühl, "A scheduling approach for efficient utilization of hardware-driven frequency scaling," in Workshop Proceedings of the 23rd International Conference on Architecture of Computing Systems (ARCS 2010 Workshops), M. Beigl and F. J. Cazorla-Almeida, Eds. Berlin, Germany: VDE Verlag, Feb. 2010, pp. 367-376.