From: "Rafael J. Wysocki"
Date: Sun, 19 Dec 2021 15:14:01 +0100
Subject: Re: cpufreq: intel_pstate: map utilization into the pstate range
To: Julia Lawall
Cc: "Rafael J. Wysocki", Srinivas Pandruvada, Len Brown, Viresh Kumar,
    Linux PM, Linux Kernel Mailing List, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot

On Fri, Dec 17, 2021 at 8:32 PM Julia Lawall wrote:
>
>
>
> On Fri, 17 Dec 2021, Rafael J. Wysocki wrote:
>
> > On Mon, Dec 13, 2021 at 11:52 PM Julia Lawall wrote:
> > >
> > > With HWP, intel_cpufreq_adjust_perf takes the utilization, scales it
> > > between 0 and the capacity, and then maps everything below min_pstate to
> > > the lowest frequency.
> >
> > Well, it is not just intel_pstate with HWP.  This is how schedutil
> > works in general; see get_next_freq() in there.
> >
> > > On my Intel Xeon Gold 6130 and Intel Xeon Gold
> > > 5218, this means that more than the bottom quarter of utilizations are all
> > > mapped to the lowest frequency.  Running slowly doesn't necessarily save
> > > energy, because it takes more time.
> >
> > This is true, but the layout of the available range of performance
> > values is a property of the processor, not a driver issue.
> >
> > Moreover, the role of the driver is not to decide how to respond to
> > the given utilization value, that is the role of the governor.  The
> > driver is expected to do what it is asked for by the governor.
>
> OK, but what exactly is the goal of schedutil?

The short answer is: minimizing the cost (in terms of energy) of
allocating an adequate amount of CPU time for a given workload.
Of course, this requires a bit of explanation, so bear with me.

It starts with a question: given a steady workload (i.e. a workload that
uses approximately the same amount of CPU time to run in every sampling
interval), what is the most efficient frequency (or, more generally,
performance level measured in some abstract units) to run it at while
still ensuring that it gets as much CPU time as it needs (or wants)?

To answer this question, let's first assume that

(1) performance is a monotonically increasing (ideally, linear) function
    of frequency, and

(2) CPU idle states do not have enough impact on the energy usage to
    matter.

Both of these assumptions may not be realistic, but that's how it goes.

Now, consider the "raw" frequency-dependent utilization

    util(f) = util_max * (t_{total} - t_{idle}(f)) / t_{total}

where t_{total} is the total CPU time available in the given time frame,
t_{idle}(f) is the idle CPU time appearing in the workload when run at
frequency f in that time frame, and util_max is a convenience constant
allowing an integer data type to be used for representing util(f) with
sufficient approximation.

Notice that by assumption (1), util(f) is a monotonically decreasing
function, so if util(f_{max}) = util_max (where f_{max} is the maximum
frequency available from the hardware), which means that there is no
idle CPU time in the workload when run at the maximum available
frequency, there will be no idle CPU time in it when run at any
frequency below f_{max} either.  Hence, in that case the workload needs
to be run at f_{max}.

If util(f_{max}) < util_max, there is some idle CPU time in the workload
at f_{max} and it may be run at a lower frequency without sacrificing
performance.  Moreover, the cost should be minimal when running the
workload at the maximum frequency f_e for which t_{idle}(f_e) = 0.  IOW,
that is the point at which the workload still gets as much CPU time as
it needs, but the cost of running it is reduced as much as possible.

In practice, it is better to look for a frequency slightly greater than
f_e, to leave some performance margin in case the workload fluctuates
or similar, so we get

    C * util(f) / util_max = 1

where the constant C is slightly greater than 1.

This equation cannot be solved directly, because the util(f) graph is
not known, but util(f) can be computed (at least approximately) for a
given f, and the solution can be approximated by computing a series of
frequencies f_n given by

    f_{n+1} = C * f_n * util(f_n) / util_max

under certain additional assumptions regarding convergence etc.

This is almost what schedutil does, but it also uses the observation
that if the frequency-invariant utilization util_inv is known, then
approximately

    util(f) = util_inv * f_{max} / f

so finally

    f = C * f_{max} * util_inv / util_max

and util_inv is provided by PELT.

This has a few interesting properties that are vitally important:

(a) The current frequency need not be known in order to compute the
    next one (and it is hard to determine in general).

(b) The response is predictable by the CPU scheduler upfront, so it can
    make decisions based on it in advance.

(c) If util_inv is properly scaled to reflect differences between
    different types of CPUs in a hybrid system, the same formula can be
    used for each of them regardless of where the workload was running
    previously.

and they need to be maintained.
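To make that last step more concrete, below is a minimal user-space
sketch of it (this is not the kernel code; the function and variable
names and the example f_{max} value are made up, and C is taken as 1.25
here, which is roughly the headroom applied by get_next_freq() in
kernel/sched/cpufreq_schedutil.c):

/*
 * Illustration only: compute the next frequency from the
 * frequency-invariant utilization, as in
 *
 *     f = C * f_max * util_inv / util_max
 *
 * with C = 1.25 implemented as util_inv + util_inv / 4.
 */
#include <stdio.h>

#define UTIL_MAX	1024UL	/* util_max; plays the role of SCHED_CAPACITY_SCALE */

static unsigned long next_freq(unsigned long util_inv, unsigned long f_max)
{
	unsigned long util = util_inv + (util_inv >> 2);	/* C = 1.25 */

	if (util > UTIL_MAX)
		util = UTIL_MAX;

	return f_max * util / UTIL_MAX;
}

int main(void)
{
	unsigned long f_max = 3700000;	/* kHz, example value */
	unsigned long util_inv;

	for (util_inv = 0; util_inv <= UTIL_MAX; util_inv += 128)
		printf("util_inv = %4lu  ->  f = %lu kHz\n",
		       util_inv, next_freq(util_inv, f_max));

	return 0;
}

Note that the result is a straight line in util_inv (which is what makes
property (b) possible) and the current frequency never appears in the
computation (property (a)).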
> I would have expected that it was to give good performance while saving
> energy, but it's not doing either in many of these cases.

The performance improvement after making the change in question means
that something is missing.  The assumptions mentioned above (and there
are quite a few of them) may not hold, or the hardware may not behave
exactly as anticipated.

Generally, there are three directions worth investigating IMV:

1. The scale-invariance mechanism may cause util_inv to be
   underestimated.  It may be worth trying to use the max non-turbo
   performance instead of the 4-core-turbo performance level in it; see
   intel_set_max_freq_ratio() in smpboot.c.

2. The hardware's response to the "desired" HWP value may depend on
   some additional factors (e.g. the EPP value) that may need to be
   adjusted.

3. The workloads are not actually steady, and running them at higher
   frequencies causes the sections that really need more CPU time to
   complete faster.

At the same time, CPU idle states may actually have a measurable impact
on energy usage, which is why you may not see much difference in that
respect.

> Is it the intent of schedutil that the bottom quarter of utilizations
> should be mapped to the lowest frequency?

It is not the intent, but a consequence of the scaling algorithm used
by schedutil.  As you can see from the derivation of that algorithm
outlined above, if the utilization is mapped to a performance level
below min_perf, running the workload at min_perf or above it is not
expected (under all of the assumptions made) to improve performance, so
mapping all of the "low" utilization values to min_perf should not hurt
performance, as the CPU time required by the workload will still be
provided (and with a surplus, for that matter).

The reason why the hardware refuses to run below a certain minimum
performance level is that it knows that running below that level
doesn't really improve energy usage (or at least the improvement,
whatever it may be, is not worth the effort).  The CPU would run
slower, but it would still use (almost) as much energy as it uses at
the "hardware minimum" level, so it may as well run at the min level.
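To illustrate the last point, here is a hypothetical sketch along the
same lines (again, not the intel_pstate or schedutil code; the names
and the example performance levels are made up) showing how clamping
the computed target to the hardware's performance range collapses all
of the "low" utilization values to the minimum level:

/*
 * Illustration only: map utilization into a [min_perf, max_perf]
 * performance range with a 1.25 margin and clamp the result.
 */
#include <stdio.h>

#define UTIL_MAX	1024UL

static unsigned long map_util_to_perf(unsigned long util,
				      unsigned long min_perf,
				      unsigned long max_perf)
{
	unsigned long target = max_perf * (util + (util >> 2)) / UTIL_MAX;

	if (target < min_perf)
		target = min_perf;
	if (target > max_perf)
		target = max_perf;

	return target;
}

int main(void)
{
	/* Example values only: min_perf at roughly 27% of max_perf. */
	unsigned long min_perf = 10, max_perf = 37;
	unsigned long util;

	for (util = 0; util <= UTIL_MAX; util += 64)
		printf("util = %4lu  ->  pstate = %lu\n",
		       util, map_util_to_perf(util, min_perf, max_perf));

	return 0;
}

With these example numbers, every utilization below roughly 22% of the
scale ends up at min_perf, which is in the same ballpark as the "bottom
quarter" observed on the Xeon Gold systems above.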