From: "Doug Smythies"
To: "'Giovanni Gherdovich'"
Subject: RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Date: Thu, 19 Sep 2019 07:42:29 -0700
Message-ID: <001a01d56ef8$7abb07c0$70311740$@net>
In-Reply-To: <1568730313.3329.1.camel@suse.cz>
References: <20190909024216.5942-1-ggherdovich@suse.cz>
	<20190909024216.5942-2-ggherdovich@suse.cz>
	<000e01d568b5$87de9be0$979bd3a0$@net>
	<000301d56a76$0022e630$0068b290$@net>
	<1568730313.3329.1.camel@suse.cz>
Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
> On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]
>> The problem with the test is its run-to-run variability, which was
>> from all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>>
>> While the challenges with this workflow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>>
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> in one fully loaded CPU at the highest CPU frequency.
>>
>> I do not know if such is practical, and I didn't re-visit the issue.
>
> You're absolutely right: pinning a serialized, fork-intensive workload
> such as gitsource gives you as good a performance as you can get,
> because it takes the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high
average fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always be practical, as you mention
> 2. it's an adversarial, worst-case sort of test for some scheduler
>    code paths

Agree.

>> The reference against which all other results are compared is the
>> forced CPU affinity test run, i.e.:
>>
>> taskset -c 3 test_script
>>
>> Mode       Governor       Degradation   Power       Bzy_MHz
>> Reference  perf 1 CPU     1.00          reference   3798
>> -          performance    1.2           6% worse    3618
>> passive    ondemand       2.3
>> active     powersave      2.6
>> passive    schedutil      2.7                       1600
>> passive    schedutil-4C   1.68                      2515
>>
>> where the degradation ratio is the time to execute / the reference
>> time for the same conditions. The test runs over a wide range of
>> processes per second, and the worst ratio has been selected for the
>> above table. I have yet to write up this experiment, but the graphs
>> that will eventually be used are at [4] and [5] (same data presented
>> two different ways).
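As an aside, for anyone wanting to reproduce the pinned reference case
without taskset: pinning is just a CPU affinity call before exec'ing the
workload. A minimal, untested sketch in C ("./test_script" stands in
for my actual script):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(3, &set);       /* CPU 3, as in "taskset -c 3" above */
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }
        /*
         * The affinity mask is inherited across fork() and execve(),
         * so every process the script spawns stays on CPU 3.
         */
        execlp("./test_script", "test_script", (char *)NULL);
        perror("execlp");
        return 1;
}

It should behave the same as the taskset command above, since taskset
itself is a wrapper around the same system call.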
> Your table is interesting; I'd say that the one to beat there (from
> the schedutil point of view) is intel_pstate(active)/performance. I'm
> slightly surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around,
> but it's also true that the latter lost some grip on iowait_boost in
> one of the recent dev cycles.

?? intel_pstate(passive)/ondemand is better than
intel_pstate(active)/powersave, not worse, over the entire range of PIDs
(forks) per second, and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were the same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA: 0.49
>> 8x-SKYLAKE-UMA: 0.55
>> 48x-HASWELL-NUMA: 0.69

> I think your 0.69 and my three values tell the same story: schedutil
> really needs to use the frequency-invariant formula, otherwise it's
> out of the race. Enabling scale invariance gives multiple tens of
> percentage points of advantage.

Agreed. This frequency-invariant addition is great. However, if
schedutil is "out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also? It performs just
as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm
> happy I can test on multiple machines and get a somewhat more varied
> picture.
>
> Also, didn't you mention you made several runs and selected the worst
> one for the final score? I was less adventurous and took the average
> of 5 runs for my gitsource executions :) that might contribute to a
> slightly higher final mark.

No, I did exactly the same as you for the gitsource "make test" method,
except that I do 6 runs, throw out the first one, and average the next
5. Yes, I said I picked the worst ratio, but that was for my version of
this test, with the disk I/O and its related non-repeatability
eliminated, only to provide something for readers that did not want to
go to my web site to look at the related graph [1]. I'll send you the
graph in a separate e-mail, in case you didn't go to the web site.

>>>> Compare it to the update formula of intel_pstate/powersave:
>>>
>>> freq_next = 1.25 * freq_max * Busy%
>>>
>>> where again freq_max is 1C turbo and Busy% is the percentage of time
>>> not spent idling (calculated with delta_MPERF / delta_TSC);
>>
>> Note that the delta_MPERF / delta_TSC method includes idle state 0
>> and the old method of utilization does not (at least not last time I
>> investigated, which was a while ago (and I can not find my notes)).
>
> I think that depends on whether or not TSC stops at idle. As I
> understand from the Intel Software Developer's Manual (SDM), a TSC
> that stops at idle is called "invariant TSC", and makes delta_MPERF /
> delta_TSC interesting. Otherwise the two counters behave exactly the
> same and the ratio is always 1, modulo the delays in actually reading
> the two values. But all I know comes from turbostat's man page and
> the SDM, so don't quote me on that :)

I was only talking about idle state 0 (polling), where TSC does not
stop.

By the way, I have now done some tests with this patch set and
multi-threaded stuff. Nothing to report, it all looks great.

[1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png

... Doug
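P.S. In case it helps anyone reproduce the Busy% (delta_MPERF /
delta_TSC) numbers from userspace, here is a rough, untested sketch. It
assumes x86, the msr module loaded, and read access to /dev/cpu/0/msr
(i.e. run as root); turbostat does the real version of this:

#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_TSC    0x10
#define MSR_IA32_MPERF  0xe7

/* The msr driver returns the MSR whose address is the file offset. */
static uint64_t rdmsr(int fd, off_t reg)
{
        uint64_t val = 0;

        if (pread(fd, &val, sizeof(val), reg) != sizeof(val))
                perror("pread");
        return val;
}

int main(void)
{
        uint64_t mperf0, tsc0, mperf1, tsc1;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0) {
                perror("open /dev/cpu/0/msr");
                return 1;
        }
        mperf0 = rdmsr(fd, MSR_IA32_MPERF);
        tsc0 = rdmsr(fd, MSR_IA32_TSC);
        sleep(1);               /* sampling interval */
        mperf1 = rdmsr(fd, MSR_IA32_MPERF);
        tsc1 = rdmsr(fd, MSR_IA32_TSC);

        /*
         * MPERF counts at the TSC rate, but only while the CPU is in
         * C0, so the ratio is the fraction of time not spent idling.
         */
        printf("Busy%% = %.2f\n",
               100.0 * (double)(mperf1 - mperf0) / (double)(tsc1 - tsc0));
        close(fd);
        return 0;
}

Note the subtlety from the discussion above: MPERF stops in idle states
deeper than C0, but keeps counting in idle state 0 (polling), so a CPU
spinning in the polling idle loop counts as busy by this measure.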