* [PATCH 0/4 v2] x86,cpufreq: unify APERF/MPERF computation @ 2017-06-24 5:11 Len Brown 2017-06-24 5:11 ` [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" Len Brown 2017-07-25 22:32 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Doug Smythies 0 siblings, 2 replies; 15+ messages in thread From: Len Brown @ 2017-06-24 5:11 UTC (permalink / raw) To: rafael, tglx; +Cc: x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel Hi Rafael, Thomas, Please see the updated 2nd patch in this series -- in response to tglx review. Patch 1,3,4 are unchanged. thanks, -Len This patch series has 3 goals: 1. Make "cpu MHz" in /proc/cpuinfo supportable. 2. Make /sys/.../cpufreq/scaling_cur_freq meaningful and consistent on modern x86 systems. 3. Use 1. and 2. to remove scheduler and cpufreq overhead There are 3 main changes since this series was proposed about a year ago: This update responds to distro feedback to make /proc/cpuinfo "cpu MHz" constant. Originally, we had proposed making it return the same dynamic value as cpufreq sysfs. Some community members suggested that sysfs MHz values should be meaninful, even down to 10ms intervals. So this has been changed, versus the original proposal to not re-compute at intervals shorter than 100ms. (For those who really care about observing frequency, the recommendation remains to use turbostat(8) or equivalent utility, which can reliably measure concurrent intervals of arbitrary length) The intel_pstate sampling mechanism has changed. Originally this series removed an intel_pstate timer in HWP mode. Now it removes the analogous scheduler call-back. Most recently, in response to posting this patch on the list about 10-days ago, the patch to remove frequency calculation from inside intel_pstate was dropped, in order to maintain compatibility with tracing scripts. Also, the order of the last two patches has been exchanged. Please let me know if you see any issues with this series. thanks! Len Brown, Intel Open Source Technology Center The following changes since commit 3c2993b8c6143d8a5793746a54eba8f86f95240f: Linux 4.12-rc4 (2017-06-04 16:47:43 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86 for you to fetch changes up to 68516d288d3968fe22d6c8984a7bcbdcdbed351d: intel_pstate: skip scheduler hook when in "performance" mode. (2017-06-23 22:01:46 -0700) ---------------------------------------------------------------- Len Brown (4): x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF intel_pstate: delete scheduler hook in HWP mode intel_pstate: skip scheduler hook when in "performance" mode. arch/x86/kernel/cpu/Makefile | 1 + arch/x86/kernel/cpu/aperfmperf.c | 79 ++++++++++++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/proc.c | 10 +---- drivers/cpufreq/cpufreq.c | 12 +++++- drivers/cpufreq/intel_pstate.c | 18 +++------ include/linux/cpufreq.h | 2 + 6 files changed, 100 insertions(+), 22 deletions(-) create mode 100644 arch/x86/kernel/cpu/aperfmperf.c ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" 2017-06-24 5:11 [PATCH 0/4 v2] x86,cpufreq: unify APERF/MPERF computation Len Brown @ 2017-06-24 5:11 ` Len Brown 2017-06-24 5:11 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Len Brown ` (2 more replies) 2017-07-25 22:32 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Doug Smythies 1 sibling, 3 replies; 15+ messages in thread From: Len Brown @ 2017-06-24 5:11 UTC (permalink / raw) To: rafael, tglx Cc: x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel, Len Brown From: Len Brown <len.brown@intel.com> cpufreq_quick_get() allows cpufreq drivers to over-ride cpu_khz that is otherwise reported in x86 /proc/cpuinfo "cpu MHz". There are four problems with this scheme, any of them is sufficient justification to delete it. 1. Depending on which cpufreq driver is loaded, the behavior of this field is different. 2. Distros complain that they have to explain to users why and how this field changes. Distros have requested a constant. 3. The two major providers of this information, acpi_cpufreq and intel_pstate, both "get it wrong" in different ways. acpi_cpufreq lies to the user by telling them that they are running at whatever frequency was last requested by software. intel_pstate lies to the user by telling them that they are running at the average frequency computed over an undefined measurement. But an average computed over an undefined interval, is itself, undefined... 4. On modern processors, user space utilities, such as turbostat(1), are more accurate and more precise, while supporing concurrent measurement over arbitrary intervals. Users who have been consulting /proc/cpuinfo to track changing CPU frequency will be dissapointed that it no longer wiggles -- perhaps being unaware of the limitations of the information they have been consuming. Yes, they can change their scripts to look in sysfs cpufreq/scaling_cur_frequency. Here they will find the same data of dubious quality here removed from /proc/cpuinfo. The value in sysfs will be addressed in a subsequent patch to address issues 1-3, above. Issue 4 will remain -- users that really care about accurate frequency information should not be using either proc or sysfs kernel interfaces. They should be using using turbostat(8), or a similar purpose-built analysis tool. Signed-off-by: Len Brown <len.brown@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/cpu/proc.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c index 6df621a..218f798 100644 --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -2,7 +2,6 @@ #include <linux/timex.h> #include <linux/string.h> #include <linux/seq_file.h> -#include <linux/cpufreq.h> /* * Get CPU information for use by the procfs. @@ -76,14 +75,9 @@ static int show_cpuinfo(struct seq_file *m, void *v) if (c->microcode) seq_printf(m, "microcode\t: 0x%x\n", c->microcode); - if (cpu_has(c, X86_FEATURE_TSC)) { - unsigned int freq = cpufreq_quick_get(cpu); - - if (!freq) - freq = cpu_khz; + if (cpu_has(c, X86_FEATURE_TSC)) seq_printf(m, "cpu MHz\t\t: %u.%03u\n", - freq / 1000, (freq % 1000)); - } + cpu_khz / 1000, (cpu_khz % 1000)); /* Cache size */ if (c->x86_cache_size >= 0) -- 2.7.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF 2017-06-24 5:11 ` [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" Len Brown @ 2017-06-24 5:11 ` Len Brown 2017-06-24 8:56 ` Thomas Gleixner 2017-06-24 5:11 ` [PATCH 3/4] intel_pstate: delete scheduler hook in HWP mode Len Brown 2017-06-24 5:11 ` [PATCH 4/4] intel_pstate: skip scheduler hook when in "performance" mode Len Brown 2 siblings, 1 reply; 15+ messages in thread From: Len Brown @ 2017-06-24 5:11 UTC (permalink / raw) To: rafael, tglx Cc: x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel, Len Brown From: Len Brown <len.brown@intel.com> The goal of this change is to give users a uniform and meaningful result when they read /sys/...cpufreq/scaling_cur_freq on modern x86 hardware, as compared to what they get today. Modern x86 processors include the hardware needed to accurately calculate frequency over an interval -- APERF, MPERF, and the TSC. Here we provide an x86 routine to make this calculation on supported hardware, and use it in preference to any driver driver-specific cpufreq_driver.get() routine. MHz is computed like so: MHz = base_MHz * delta_APERF / delta_MPERF MHz is the average frequency of the busy processor over a measurement interval. The interval is defined to be the time between successive invocations of aperfmperf_khz_on_cpu(), which are expected to to happen on-demand when users read sysfs attribute cpufreq/scaling_cur_freq. As with previous methods of calculating MHz, idle time is excluded. base_MHz above is from TSC calibration global "cpu_khz". This x86 native method to calculate MHz returns a meaningful result no matter if P-states are controlled by hardware or firmware and/or if the Linux cpufreq sub-system is or is-not installed. When this routine is invoked more frequently, the measurement interval becomes shorter. However, the code limits re-computation to 10ms intervals so that average frequency remains meaningful. Discerning users are encouraged to take advantage of the turbostat(8) utility, which can gracefully handle concurrent measurement intervals of arbitrary length. Signed-off-by: Len Brown <len.brown@intel.com> --- arch/x86/kernel/cpu/Makefile | 1 + arch/x86/kernel/cpu/aperfmperf.c | 79 ++++++++++++++++++++++++++++++++++++++++ drivers/cpufreq/cpufreq.c | 12 +++++- include/linux/cpufreq.h | 2 + 4 files changed, 93 insertions(+), 1 deletion(-) create mode 100644 arch/x86/kernel/cpu/aperfmperf.c diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index 5200001..cdf8249 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -21,6 +21,7 @@ obj-y += common.o obj-y += rdrand.o obj-y += match.o obj-y += bugs.o +obj-$(CONFIG_CPU_FREQ) += aperfmperf.o obj-$(CONFIG_PROC_FS) += proc.o obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c new file mode 100644 index 0000000..d869c86 --- /dev/null +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -0,0 +1,79 @@ +/* + * x86 APERF/MPERF KHz calculation for + * /sys/.../cpufreq/scaling_cur_freq + * + * Copyright (C) 2017 Intel Corp. + * Author: Len Brown <len.brown@intel.com> + * + * This file is licensed under GPLv2. + */ + +#include <linux/jiffies.h> +#include <linux/math64.h> +#include <linux/percpu.h> +#include <linux/smp.h> + +struct aperfmperf_sample { + unsigned int khz; + unsigned long jiffies; + u64 aperf; + u64 mperf; +}; + +static DEFINE_PER_CPU(struct aperfmperf_sample, samples); + +/* + * aperfmperf_snapshot_khz() + * On the current CPU, snapshot APERF, MPERF, and jiffies + * unless we already did it within 10ms + * calculate kHz, save snapshot + */ +static void aperfmperf_snapshot_khz(void *dummy) +{ + u64 aperf, aperf_delta; + u64 mperf, mperf_delta; + struct aperfmperf_sample *s = this_cpu_ptr(&samples); + + /* Don't bother re-computing within 10 ms */ + if (time_before(jiffies, s->jiffies + HZ/100)) + return; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + aperf_delta = aperf - s->aperf; + mperf_delta = mperf - s->mperf; + + /* + * There is no architectural guarantee that MPERF + * increments faster than we can read it. + */ + if (mperf_delta == 0) + return; + + /* + * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then + * khz = (cpu_khz * aperf_delta) / mperf_delta + */ + if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta) + s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); + else /* khz = aperf_delta / (mperf_delta / cpu_khz) */ + s->khz = div64_u64(aperf_delta, + div64_u64(mperf_delta, cpu_khz)); + s->jiffies = jiffies; + s->aperf = aperf; + s->mperf = mperf; +} + +unsigned int arch_freq_get_on_cpu(int cpu) +{ + if (!cpu_khz) + return 0; + + if (!static_cpu_has(X86_FEATURE_APERFMPERF)) + return 0; + + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); + + return per_cpu(samples.khz, cpu); +} diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 26b643d..6e7424d 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -632,11 +632,21 @@ show_one(cpuinfo_transition_latency, cpuinfo.transition_latency); show_one(scaling_min_freq, min); show_one(scaling_max_freq, max); +__weak unsigned int arch_freq_get_on_cpu(int cpu) +{ + return 0; +} + static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf) { ssize_t ret; + unsigned int freq; - if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get) + freq = arch_freq_get_on_cpu(policy->cpu); + if (freq) + ret = sprintf(buf, "%u\n", freq); + else if (cpufreq_driver && cpufreq_driver->setpolicy && + cpufreq_driver->get) ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu)); else ret = sprintf(buf, "%u\n", policy->cur); diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index a5ce0bbe..905117b 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -883,6 +883,8 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy) } #endif +extern unsigned int arch_freq_get_on_cpu(int cpu); + /* the following are really really optional */ extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs; extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs; -- 2.7.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF 2017-06-24 5:11 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Len Brown @ 2017-06-24 8:56 ` Thomas Gleixner 2017-06-24 12:03 ` Rafael J. Wysocki 0 siblings, 1 reply; 15+ messages in thread From: Thomas Gleixner @ 2017-06-24 8:56 UTC (permalink / raw) To: Len Brown Cc: rafael, x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel, Len Brown On Fri, 23 Jun 2017, Len Brown wrote: > This x86 native method to calculate MHz returns a meaningful result > no matter if P-states are controlled by hardware or firmware > and/or if the Linux cpufreq sub-system is or is-not installed. > > When this routine is invoked more frequently, the measurement > interval becomes shorter. However, the code limits re-computation > to 10ms intervals so that average frequency remains meaningful. > > Discerning users are encouraged to take advantage of > the turbostat(8) utility, which can gracefully handle > concurrent measurement intervals of arbitrary length. > > Signed-off-by: Len Brown <len.brown@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Raphael, please take the whole lot through the cpufreq tree. Thanks, tglx ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF 2017-06-24 8:56 ` Thomas Gleixner @ 2017-06-24 12:03 ` Rafael J. Wysocki 0 siblings, 0 replies; 15+ messages in thread From: Rafael J. Wysocki @ 2017-06-24 12:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Len Brown, Rafael J. Wysocki, the arch/x86 maintainers, Srinivas Pandruvada, Peter Zijlstra, Linux PM, Linux Kernel Mailing List, Len Brown On Sat, Jun 24, 2017 at 10:56 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Fri, 23 Jun 2017, Len Brown wrote: >> This x86 native method to calculate MHz returns a meaningful result >> no matter if P-states are controlled by hardware or firmware >> and/or if the Linux cpufreq sub-system is or is-not installed. >> >> When this routine is invoked more frequently, the measurement >> interval becomes shorter. However, the code limits re-computation >> to 10ms intervals so that average frequency remains meaningful. >> >> Discerning users are encouraged to take advantage of >> the turbostat(8) utility, which can gracefully handle >> concurrent measurement intervals of arbitrary length. >> >> Signed-off-by: Len Brown <len.brown@intel.com> > > Reviewed-by: Thomas Gleixner <tglx@linutronix.de> > > Raphael, please take the whole lot through the cpufreq tree. I will, thanks! Rafael ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 3/4] intel_pstate: delete scheduler hook in HWP mode 2017-06-24 5:11 ` [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" Len Brown 2017-06-24 5:11 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Len Brown @ 2017-06-24 5:11 ` Len Brown 2017-06-24 5:11 ` [PATCH 4/4] intel_pstate: skip scheduler hook when in "performance" mode Len Brown 2 siblings, 0 replies; 15+ messages in thread From: Len Brown @ 2017-06-24 5:11 UTC (permalink / raw) To: rafael, tglx Cc: x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel, Len Brown From: Len Brown <len.brown@intel.com> The cpufreq/scaling_cur_freq sysfs attribute is now provided by shared x86 cpufreq code on modern x86 systems, including all systems supported by the intel_pstate driver. In HWP mode, maintaining that value was the sole purpose of the scheduler hook, intel_pstate_update_util_hwp(), so it can now be removed. Signed-off-by: Len Brown <len.brown@intel.com> --- drivers/cpufreq/intel_pstate.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index b7de5bd..4ec5668 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -1732,16 +1732,6 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate) fp_toint(cpu->iowait_boost * 100)); } -static void intel_pstate_update_util_hwp(struct update_util_data *data, - u64 time, unsigned int flags) -{ - struct cpudata *cpu = container_of(data, struct cpudata, update_util); - u64 delta_ns = time - cpu->sample.time; - - if ((s64)delta_ns >= INTEL_PSTATE_HWP_SAMPLING_INTERVAL) - intel_pstate_sample(cpu, time); -} - static void intel_pstate_update_util_pid(struct update_util_data *data, u64 time, unsigned int flags) { @@ -1933,6 +1923,9 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num) { struct cpudata *cpu = all_cpu_data[cpu_num]; + if (hwp_active) + return; + if (cpu->update_util_set) return; @@ -2557,7 +2550,6 @@ static int __init intel_pstate_init(void) } else { hwp_active++; intel_pstate.attr = hwp_cpufreq_attrs; - pstate_funcs.update_util = intel_pstate_update_util_hwp; goto hwp_cpu_matched; } } else { -- 2.7.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/4] intel_pstate: skip scheduler hook when in "performance" mode. 2017-06-24 5:11 ` [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" Len Brown 2017-06-24 5:11 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Len Brown 2017-06-24 5:11 ` [PATCH 3/4] intel_pstate: delete scheduler hook in HWP mode Len Brown @ 2017-06-24 5:11 ` Len Brown 2 siblings, 0 replies; 15+ messages in thread From: Len Brown @ 2017-06-24 5:11 UTC (permalink / raw) To: rafael, tglx Cc: x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel, Len Brown From: Len Brown <len.brown@intel.com> When the governor is set to "performance", intel_pstate does not need the scheduler hook for doing any calculations. Under these conditions, its only purpose is to continue to maintain cpufreq/scaling_cur_freq. The cpufreq/scaling_cur_freq sysfs attribute is now provided by shared x86 cpufreq code on modern x86 systems, including all systems supported by the intel_pstate driver. So in "performance" governor mode, the scheduler hook can be skipped. This applies to both in Software and Hardware P-state control modes. Suggested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com> --- drivers/cpufreq/intel_pstate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 4ec5668..4538182 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -2031,10 +2031,10 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy) */ intel_pstate_clear_update_util_hook(policy->cpu); intel_pstate_max_within_limits(cpu); + } else { + intel_pstate_set_update_util_hook(policy->cpu); } - intel_pstate_set_update_util_hook(policy->cpu); - if (hwp_active) intel_pstate_hwp_set(policy->cpu); -- 2.7.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* RE: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF 2017-06-24 5:11 [PATCH 0/4 v2] x86,cpufreq: unify APERF/MPERF computation Len Brown 2017-06-24 5:11 ` [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" Len Brown @ 2017-07-25 22:32 ` Doug Smythies 2017-07-26 17:23 ` Len Brown ` (2 more replies) 1 sibling, 3 replies; 15+ messages in thread From: Doug Smythies @ 2017-07-25 22:32 UTC (permalink / raw) To: 'Len Brown', rafael, tglx Cc: x86, srinivas.pandruvada, peterz, linux-pm, linux-kernel, 'Len Brown' Sorry to be late to the party on this one: On 2017.06.23 10:12 Len Brown wrote: > The goal of this change is to give users a uniform and meaningful > result when they read /sys/...cpufreq/scaling_cur_freq > on modern x86 hardware, as compared to what they get today. Myself, I like what I got then, and not what I get now. > Modern x86 processors include the hardware needed > to accurately calculate frequency over an interval -- > APERF, MPERF, and the TSC. > > Here we provide an x86 routine to make this calculation > on supported hardware, and use it in preference to any > driver driver-specific cpufreq_driver.get() routine. > > MHz is computed like so: > > MHz = base_MHz * delta_APERF / delta_MPERF Yes, thanks very much. > MHz is the average frequency of the busy processor > over a measurement interval. The interval is > defined to be the time between successive invocations > of aperfmperf_khz_on_cpu(), which are expected to to > happen on-demand when users read sysfs attribute > cpufreq/scaling_cur_freq. Yes but that can be hours apart, resulting in useless information. This threw me for a loop for several days. > As with previous methods of calculating MHz, > idle time is excluded. Which makes the response time to a correct answer asymmetric. i.e. removal of a load on a CPU will linger much much longer that adding a load on a CPU. > base_MHz above is from TSC calibration global "cpu_khz". Yes, thank you very much. > This x86 native method to calculate MHz returns a meaningful result > no matter if P-states are controlled by hardware or firmware > and/or if the Linux cpufreq sub-system is or is-not installed. > > When this routine is invoked more frequently, the measurement > interval becomes shorter. However, the code limits re-computation > to 10ms intervals so that average frequency remains meaningful. > > Discerning users are encouraged to take advantage of > the turbostat(8) utility, which can gracefully handle > concurrent measurement intervals of arbitrary length. Somehow, somewhere along the way, turbostat no longer seems to use base_MHz based on the actual TSC. It used to. > Signed-off-by: Len Brown <len.brown@intel.com> > --- > arch/x86/kernel/cpu/Makefile | 1 + > arch/x86/kernel/cpu/aperfmperf.c | 79 ++++++++++++++++++++++++++++++++++++++++ > drivers/cpufreq/cpufreq.c | 12 +++++- > include/linux/cpufreq.h | 2 + > 4 files changed, 93 insertions(+), 1 deletion(-) > create mode 100644 arch/x86/kernel/cpu/aperfmperf.c ... [deleted some] ... > + * aperfmperf_snapshot_khz() > + * On the current CPU, snapshot APERF, MPERF, and jiffies > + * unless we already did it within 10ms Well, it'll be 8 mSec on a 250 Hz kernel. There is no maximum time defined, so the interval can be anything, and therefore the result can be dominated by stale information. > + * calculate kHz, save snapshot > + */ > +static void aperfmperf_snapshot_khz(void *dummy) > +{ > + u64 aperf, aperf_delta; > + u64 mperf, mperf_delta; > + struct aperfmperf_sample *s = this_cpu_ptr(&samples); > + > + /* Don't bother re-computing within 10 ms */ > + if (time_before(jiffies, s->jiffies + HZ/100)) > + return; The above condition would be 8 mSec on a 250 Hertz kernel, wouldn't it? (I don't care, I'm just saying.) __________________________________ A long boring story is copied below, but it also includes my test data. Summary: . There no longer seems to be a way to check the CPU frequency without affecting the processor (i.e. forcing a wakeup), thereby potentially influencing the system under test. . Yes, the old way might have been a "lie", but in some situations it was much much less of a "lie", and took data that was already available (and at the very maximum 4 seconds old), and didn't force a wakeup, thus monitoring CPU frequency was a negligible perturbation to the system. . Now the data is as old as the time the command was run, which might be hours. For reference my test computer contains an i7-2600K processor, and TSC is 3411.1043 MHz. Minimum pstate 16. I did follow the e-mail thread [1] about changes to the "cpu MHz" line from /proc/cpuinfo, and expected it to have changed, and indeed, it only ever prints TSC now and never changes. Whereas with kernel 4.12 it printed the actual CPU frequency, albeit with the limitations stated in the e-mail thread, which I have always understood and accepted. O.K. so now it is useless as an actual CPU frequency inquiry tool. Now, there are two other methods (well three if one includes turbostat) for observing CPU frequency: The "sudo cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq" method, works the same as it did in the past (well, there is another active thread about issues with it), but requires root access. And the "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq" method, which works fine with kernel 4.12, but seems to give incorrect information with kernel 4.13-rc1, unless one inquires two or more times and discards the first inquiry. Test 1 data: Notes: CPU 7 only. It is 100% busy all the time. The CPU burn program prints a time stamp every N loops, as a way to do a sanity check on CPU frequency. Sanity checks were also done by acquiring trace data. Turbo is disabled, so the maximum CPU frequency is predicable and known, independent of what other cores are doing. The data is not from the first loop through this test. Data: /sys/devices/system/cpu/intel_pstate/max_perf_pct: 100 Actual CPU 7 frequency: 3411104 Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 3400000 Kernel 4.13-rc1: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 3400000 Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3400000 Kernel 4.13-rc1, 1st read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1765012* Kernel 4.13-rc1, 2nd read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3411286 /sys/devices/system/cpu/intel_pstate/max_perf_pct: 42 Actual CPU 7 frequency: 1605225 Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 1599768 Kernel 4.13-rc1: /sys/devices/system/cpu/cpu7/cpufreq/cpuinfo_cur_freq: 1599975 Kernel 4.12: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1599768 Kernel 4.13-rc1, 1st read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3309707* Kernel 4.13-rc1, 2nd read: /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1605311 * The value listed for the first read is both a function of the time difference between changing the maximum CPU frequency and the inquiry and how long since the last read the actual CPU frequency was changed. Data (for increase from 1.6 GHz to 3.4 GHz): First read quickly (manually): 1765012 0.25 seconds to first read: 1663176 0.5 seconds to first read: 1671658 1 seconds to first read: 1767889 2 seconds to first read: 1872128 3 seconds to first read: 1769770 4 seconds to first read: 1814673 5 seconds to first read: 2297147 10 seconds to first read: 2394407 20 seconds to first read: 2720619 30 seconds to first read: 2875374 2 minutes to first read: 3373563 5 minutes to first read: 3363630 10 minutes to first read: 3376521 Data (for decrease from 3.4 GHz to 1.6 GHz): 0.25 seconds to first read: 3381255 0.5 seconds to first read: 3323808 1 seconds to first read: 3247873 2 seconds to first read: 3090182 3 seconds to first read: 3104870 4 seconds to first read: 2837281 5 seconds to first read: 2962827 10 seconds to first read: 2510951 20 seconds to first read: 2763956 30 seconds to first read: 2116198 2 minutes to first read: 1876923 5 minutes to first read: 1715839 10 minutes to first read: 1634040 Note: the above table was done more or less manually. Test 2 data: Just take the load off of CPU 7 and then look at its frequency (any amount of time later, I have yet to find a time limit): Kernel 4.13-rc1, 1st read (1 minute after load removed): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3410964 Kernel 4.13-rc1, 2nd read (anytime after the 1st read): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1605326 Kernel 4.13-rc1, 1st read (24 minutes after load removed): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 3268873 Kernel 4.13-rc1, 2nd read (anytime after the 1st read): /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq: 1605233 [1] http://marc.info/?t=149766883400002&r=1&w=2 Note: now also tested with kernel 4.13-rc2. ... Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF 2017-07-25 22:32 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Doug Smythies @ 2017-07-26 17:23 ` Len Brown 2017-07-28 0:13 ` [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected Rafael J. Wysocki 2017-07-28 6:01 ` [PATCH] " Doug Smythies 2 siblings, 0 replies; 15+ messages in thread From: Len Brown @ 2017-07-26 17:23 UTC (permalink / raw) To: Doug Smythies Cc: Rafael J. Wysocki, Thomas Gleixner, X86 ML, Srinivas Pandruvada, Peter Zijlstra, Linux PM list, linux-kernel, Len Brown Hi Doug, Clearly you are a "discerning user", who understands the limitations of the kernel sysfs interface, both new and old, for communicating frequency. With the limitations of the (old and new) sysfs interfaces, why are you using it, rather than turbostat? >> As with previous methods of calculating MHz, >> idle time is excluded. > > Which makes the response time to a correct answer > asymmetric. i.e. removal of a load on a CPU will > linger much much longer that adding a load on a CPU. If the measurement interval is not defined, then a "correct answer" is also no defined. Users now have the capability to define the measurement interval. Before they didn't, they just could observe it was "short enough that it looks current". Others may want to measure frequency over a longer (or even known) interval, and the previous code made that impossible. Also, you may be interested to know that in HWP mode, intel_pstate used to have a periodic timer who's _only_ job was to wake up the CPU so that the driver could update the frequency statistic for sysfs. (now it is a scheduler callback) Sure, a "discerning user" may have noticed that they have "fresh" data in sysfs, but most users were better served by not having a timer fire to refresh data that they'll never consume... The new code never runs at all, unless the user asks it to. > Somehow, somewhere along the way, turbostat no longer seems > to use base_MHz based on the actual TSC. It used to. True, though not directly related to this thread... On current and future Intel hardware, base_mhz and TSC rate are not in the same clock domain. Only on very specific configurations are those clock rates now equal. >> + /* Don't bother re-computing within 10 ms */ >> + if (time_before(jiffies, s->jiffies + HZ/100)) >> + return; > > The above condition would be 8 mSec on a 250 Hertz kernel, > wouldn't it? > (I don't care, I'm just saying.) True. We could replace the "10ms" comment with "typically 10ms", or "recently". Note that the value here isn't precise, it is there just to prevent wasted overhead. The previous version of this patch was equally valid with a value 10x larger. > Summary: > > . There no longer seems to be a way to check the CPU frequency without affecting the processor (i.e. forcing a wakeup), > thereby potentially influencing the system under test. This has always been true, just that the wakeups used to happen inside the kernel -- whether you consumed the answer or not. > . Yes, the old way might have been a "lie", but in some situations it was much much less of a "lie", and took data that > was already available (and at the very maximum 4 seconds old), and didn't force a wakeup, thus monitoring CPU frequency > was a negligible perturbation to the system. Frequency data isn't "already available", it has to be measured. A measurement is not valid unless it is made over a known measurement interval. > . Now the data is as old as the time the command was run, which might be hours. True, under controlled conditions, the sysfs measurement interval could be days or months long. If a known interval is desired, than something need to provoke a read of the attribute at the start of the interval of interest. Yes, we could do this inside the kernel, but then that would add overhead to the system for the vast majority of users who never even read this attribute, and it would also take control of the interval away from the user. Making this interface more complex inside the kernel doesn't seem like a prudent path to go down when turbostat already exists and can already measure concurrent/overlapping intervals of arbitrary length in user-space. While I still haven't gleaned exactly what you are trying to measure, I'm very much interested to know if/why you can't measure it using the new sysfs attribute semantics, or better yet, using turbostat. thanks, Len Brown, Intel Open Source Technology Center ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected 2017-07-25 22:32 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Doug Smythies 2017-07-26 17:23 ` Len Brown @ 2017-07-28 0:13 ` Rafael J. Wysocki 2017-07-28 12:45 ` [PATCH v2] " Rafael J. Wysocki 2017-07-31 23:46 ` Doug Smythies 2017-07-28 6:01 ` [PATCH] " Doug Smythies 2 siblings, 2 replies; 15+ messages in thread From: Rafael J. Wysocki @ 2017-07-28 0:13 UTC (permalink / raw) To: x86, linux-pm Cc: Doug Smythies, 'Len Brown', rafael, tglx, srinivas.pandruvada, peterz, linux-kernel, 'Len Brown' From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute in sysfs only behaves as expected on x86 with APERF/MPERF registers available when it is read from at least twice in a row. The value returned by the first read may not be meaningful, because the computations in there use cached values from the previous aperfmperf_snapshot_khz() call which may be stale. However, the interface is expected to return meaningful values on every read, including the first one. To address this problem modify arch_freq_get_on_cpu() to call aperfmperf_snapshot_khz() twice, with a short delay between these calls, if the previous invocation of aperfmperf_snapshot_khz() was too far back in the past (specifically, more that 1s ago) and adjust aperfmperf_snapshot_khz() for that. Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- arch/x86/kernel/cpu/aperfmperf.c | 36 +++++++++++++++++++++++++++++------- 1 file changed, 29 insertions(+), 7 deletions(-) Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c =================================================================== --- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c +++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c @@ -8,20 +8,25 @@ * This file is licensed under GPLv2. */ -#include <linux/jiffies.h> +#include <linux/delay.h> +#include <linux/ktime.h> #include <linux/math64.h> #include <linux/percpu.h> #include <linux/smp.h> struct aperfmperf_sample { unsigned int khz; - unsigned long jiffies; + ktime_t time; u64 aperf; u64 mperf; }; static DEFINE_PER_CPU(struct aperfmperf_sample, samples); +#define APERFMPERF_CACHE_THRESHOLD_MS 10 +#define APERFMPERF_REFRESH_DELAY_MS 20 +#define APERFMPERF_STALE_THRESHOLD_MS 1000 + /* * aperfmperf_snapshot_khz() * On the current CPU, snapshot APERF, MPERF, and jiffies @@ -33,9 +38,11 @@ static void aperfmperf_snapshot_khz(void u64 aperf, aperf_delta; u64 mperf, mperf_delta; struct aperfmperf_sample *s = this_cpu_ptr(&samples); + ktime_t now = ktime_get(); + s64 time_delta = ktime_ms_delta(now, s->time); - /* Don't bother re-computing within 10 ms */ - if (time_before(jiffies, s->jiffies + HZ/100)) + /* Don't bother re-computing within the cache threshold time. */ + if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS) return; rdmsrl(MSR_IA32_APERF, aperf); @@ -51,6 +58,16 @@ static void aperfmperf_snapshot_khz(void if (mperf_delta == 0) return; + s->time = now; + s->aperf = aperf; + s->mperf = mperf; + + /* If the previous iteration was too long ago, discard it. */ + if (time_delta > APERFMPERF_STALE_THRESHOLD_MS) { + s->khz = 0; + return; + } + /* * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then * khz = (cpu_khz * aperf_delta) / mperf_delta @@ -60,13 +77,12 @@ static void aperfmperf_snapshot_khz(void else /* khz = aperf_delta / (mperf_delta / cpu_khz) */ s->khz = div64_u64(aperf_delta, div64_u64(mperf_delta, cpu_khz)); - s->jiffies = jiffies; - s->aperf = aperf; - s->mperf = mperf; } unsigned int arch_freq_get_on_cpu(int cpu) { + unsigned int khz; + if (!cpu_khz) return 0; @@ -74,6 +90,12 @@ unsigned int arch_freq_get_on_cpu(int cp return 0; smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); + khz = per_cpu(samples.khz, cpu); + if (khz) + return khz; + + msleep(APERFMPERF_REFRESH_DELAY_MS); + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); return per_cpu(samples.khz, cpu); } ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2] cpufreq: x86: Make scaling_cur_freq behave more as expected 2017-07-28 0:13 ` [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected Rafael J. Wysocki @ 2017-07-28 12:45 ` Rafael J. Wysocki 2017-07-31 23:46 ` Doug Smythies 1 sibling, 0 replies; 15+ messages in thread From: Rafael J. Wysocki @ 2017-07-28 12:45 UTC (permalink / raw) To: x86, linux-pm Cc: Doug Smythies, 'Len Brown', rafael, tglx, srinivas.pandruvada, peterz, linux-kernel, 'Len Brown' From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute in sysfs only behaves as expected on x86 with APERF/MPERF registers available when it is read from at least twice in a row. The value returned by the first read may not be meaningful, because the computations in there use cached values from the previous iteration of aperfmperf_snapshot_khz() which may be stale. To prevent that from happening, modify arch_freq_get_on_cpu() to call aperfmperf_snapshot_khz() twice, with a short delay between these calls, if the previous invocation of aperfmperf_snapshot_khz() was too far back in the past (specifically, more that 1s ago). Also, as pointed out by Doug Smythies, aperf_delta is limited now and the multiplication of it by cpu_khz won't overflow, so simplify the s->khz computations too. Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> --- -> v2: Simplify the khz computations as per the Doug's suggestion. --- arch/x86/kernel/cpu/aperfmperf.c | 40 +++++++++++++++++++++++++-------------- 1 file changed, 26 insertions(+), 14 deletions(-) Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c =================================================================== --- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c +++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c @@ -8,20 +8,25 @@ * This file is licensed under GPLv2. */ -#include <linux/jiffies.h> +#include <linux/delay.h> +#include <linux/ktime.h> #include <linux/math64.h> #include <linux/percpu.h> #include <linux/smp.h> struct aperfmperf_sample { unsigned int khz; - unsigned long jiffies; + ktime_t time; u64 aperf; u64 mperf; }; static DEFINE_PER_CPU(struct aperfmperf_sample, samples); +#define APERFMPERF_CACHE_THRESHOLD_MS 10 +#define APERFMPERF_REFRESH_DELAY_MS 20 +#define APERFMPERF_STALE_THRESHOLD_MS 1000 + /* * aperfmperf_snapshot_khz() * On the current CPU, snapshot APERF, MPERF, and jiffies @@ -33,9 +38,11 @@ static void aperfmperf_snapshot_khz(void u64 aperf, aperf_delta; u64 mperf, mperf_delta; struct aperfmperf_sample *s = this_cpu_ptr(&samples); + ktime_t now = ktime_get(); + s64 time_delta = ktime_ms_delta(now, s->time); - /* Don't bother re-computing within 10 ms */ - if (time_before(jiffies, s->jiffies + HZ/100)) + /* Don't bother re-computing within the cache threshold time. */ + if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS) return; rdmsrl(MSR_IA32_APERF, aperf); @@ -51,22 +58,21 @@ static void aperfmperf_snapshot_khz(void if (mperf_delta == 0) return; - /* - * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then - * khz = (cpu_khz * aperf_delta) / mperf_delta - */ - if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta) - s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); - else /* khz = aperf_delta / (mperf_delta / cpu_khz) */ - s->khz = div64_u64(aperf_delta, - div64_u64(mperf_delta, cpu_khz)); - s->jiffies = jiffies; + s->time = now; s->aperf = aperf; s->mperf = mperf; + + /* If the previous iteration was too long ago, discard it. */ + if (time_delta > APERFMPERF_STALE_THRESHOLD_MS) + s->khz = 0; + else + s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); } unsigned int arch_freq_get_on_cpu(int cpu) { + unsigned int khz; + if (!cpu_khz) return 0; @@ -74,6 +80,12 @@ unsigned int arch_freq_get_on_cpu(int cp return 0; smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); + khz = per_cpu(samples.khz, cpu); + if (khz) + return khz; + + msleep(APERFMPERF_REFRESH_DELAY_MS); + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); return per_cpu(samples.khz, cpu); } ^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [PATCH v2] cpufreq: x86: Make scaling_cur_freq behave more as expected 2017-07-28 0:13 ` [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected Rafael J. Wysocki 2017-07-28 12:45 ` [PATCH v2] " Rafael J. Wysocki @ 2017-07-31 23:46 ` Doug Smythies 2017-08-01 0:50 ` Rafael J. Wysocki 1 sibling, 1 reply; 15+ messages in thread From: Doug Smythies @ 2017-07-31 23:46 UTC (permalink / raw) To: 'Rafael J. Wysocki', x86, linux-pm Cc: 'Len Brown', rafael, tglx, srinivas.pandruvada, peterz, linux-kernel, 'Len Brown' On 2017.07.28 05:45 Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to > calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute > in sysfs only behaves as expected on x86 with APERF/MPERF registers > available when it is read from at least twice in a row. The value > returned by the first read may not be meaningful, because the > computations in there use cached values from the previous iteration > of aperfmperf_snapshot_khz() which may be stale. > > To prevent that from happening, modify arch_freq_get_on_cpu() to > call aperfmperf_snapshot_khz() twice, with a short delay between > these calls, if the previous invocation of aperfmperf_snapshot_khz() > was too far back in the past (specifically, more that 1s ago). ...[deleted the rest]... This patch seems to work fine and addresses my complaints from last week. Thanks. ... Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2] cpufreq: x86: Make scaling_cur_freq behave more as expected 2017-07-31 23:46 ` Doug Smythies @ 2017-08-01 0:50 ` Rafael J. Wysocki 0 siblings, 0 replies; 15+ messages in thread From: Rafael J. Wysocki @ 2017-08-01 0:50 UTC (permalink / raw) To: Doug Smythies Cc: x86, linux-pm, 'Len Brown', rafael, tglx, srinivas.pandruvada, peterz, linux-kernel, 'Len Brown' On Monday, July 31, 2017 04:46:42 PM Doug Smythies wrote: > On 2017.07.28 05:45 Rafael J. Wysocki wrote: > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to > > calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute > > in sysfs only behaves as expected on x86 with APERF/MPERF registers > > available when it is read from at least twice in a row. The value > > returned by the first read may not be meaningful, because the > > computations in there use cached values from the previous iteration > > of aperfmperf_snapshot_khz() which may be stale. > > > > To prevent that from happening, modify arch_freq_get_on_cpu() to > > call aperfmperf_snapshot_khz() twice, with a short delay between > > these calls, if the previous invocation of aperfmperf_snapshot_khz() > > was too far back in the past (specifically, more that 1s ago). > > ...[deleted the rest]... > > This patch seems to work fine and addresses my complaints from last week. > Thanks. Thanks for the confirmation! Rafael ^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected 2017-07-25 22:32 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Doug Smythies 2017-07-26 17:23 ` Len Brown 2017-07-28 0:13 ` [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected Rafael J. Wysocki @ 2017-07-28 6:01 ` Doug Smythies 2017-07-28 12:26 ` Rafael J. Wysocki 2 siblings, 1 reply; 15+ messages in thread From: Doug Smythies @ 2017-07-28 6:01 UTC (permalink / raw) To: 'Rafael J. Wysocki' Cc: 'Len Brown', rafael, tglx, srinivas.pandruvada, peterz, linux-kernel, 'Len Brown', x86, linux-pm On 2017.07.27 17:13 Rafael J. Wysocki wrote: > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to > calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute > in sysfs only behaves as expected on x86 with APERF/MPERF registers > available when it is read from at least twice in a row. > > The value returned by the first read may not be meaningful, because > the computations in there use cached values from the previous > aperfmperf_snapshot_khz() call which may be stale. However, the > interface is expected to return meaningful values on every read, > including the first one. > > To address this problem modify arch_freq_get_on_cpu() to call > aperfmperf_snapshot_khz() twice, with a short delay between > these calls, if the previous invocation of aperfmperf_snapshot_khz() > was too far back in the past (specifically, more that 1s ago) and > adjust aperfmperf_snapshot_khz() for that. > > Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" > Reported-by: Doug Smythies <dsmythies@telus.net> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > --- > arch/x86/kernel/cpu/aperfmperf.c | 36 +++++++++++++++++++++++++++++------- > 1 file changed, 29 insertions(+), 7 deletions(-) > > Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c ...[deleted the rest]... This proposed patch would be good. However, I can only try it maybe by Sunday. I think the maximum time span means that this code: /* * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then * khz = (cpu_khz * aperf_delta) / mperf_delta */ if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta) s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); else /* khz = aperf_delta / (mperf_delta / cpu_khz) */ s->khz = div64_u64(aperf_delta, div64_u64(mperf_delta, cpu_khz)); Could be reduced to this: s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); Because it could never overflow anymore. ... Doug ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected 2017-07-28 6:01 ` [PATCH] " Doug Smythies @ 2017-07-28 12:26 ` Rafael J. Wysocki 0 siblings, 0 replies; 15+ messages in thread From: Rafael J. Wysocki @ 2017-07-28 12:26 UTC (permalink / raw) To: Doug Smythies Cc: 'Len Brown', rafael, tglx, srinivas.pandruvada, peterz, linux-kernel, 'Len Brown', x86, linux-pm On Thursday, July 27, 2017 11:01:39 PM Doug Smythies wrote: > On 2017.07.27 17:13 Rafael J. Wysocki wrote: > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > After commit f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to > > calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute > > in sysfs only behaves as expected on x86 with APERF/MPERF registers > > available when it is read from at least twice in a row. > > > > The value returned by the first read may not be meaningful, because > > the computations in there use cached values from the previous > > aperfmperf_snapshot_khz() call which may be stale. However, the > > interface is expected to return meaningful values on every read, > > including the first one. > > > > To address this problem modify arch_freq_get_on_cpu() to call > > aperfmperf_snapshot_khz() twice, with a short delay between > > these calls, if the previous invocation of aperfmperf_snapshot_khz() > > was too far back in the past (specifically, more that 1s ago) and > > adjust aperfmperf_snapshot_khz() for that. > > > > Fixes: f8475cef9008 "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" > > Reported-by: Doug Smythies <dsmythies@telus.net> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > arch/x86/kernel/cpu/aperfmperf.c | 36 +++++++++++++++++++++++++++++------- > > 1 file changed, 29 insertions(+), 7 deletions(-) > > > > Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c > > ...[deleted the rest]... > > This proposed patch would be good. However, I can only try it maybe by Sunday. > I think the maximum time span means that this code: > > /* > * if (cpu_khz * aperf_delta) fits into ULLONG_MAX, then > * khz = (cpu_khz * aperf_delta) / mperf_delta > */ > if (div64_u64(ULLONG_MAX, cpu_khz) > aperf_delta) > s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); > else /* khz = aperf_delta / (mperf_delta / cpu_khz) */ > s->khz = div64_u64(aperf_delta, > div64_u64(mperf_delta, cpu_khz)); > > Could be reduced to this: > > s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); > > Because it could never overflow anymore. Right, that's a good point. I'll send a v2 with this change included shortly. Thanks, Rafael ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2017-08-01 0:58 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-06-24 5:11 [PATCH 0/4 v2] x86,cpufreq: unify APERF/MPERF computation Len Brown 2017-06-24 5:11 ` [PATCH 1/4] x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" Len Brown 2017-06-24 5:11 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Len Brown 2017-06-24 8:56 ` Thomas Gleixner 2017-06-24 12:03 ` Rafael J. Wysocki 2017-06-24 5:11 ` [PATCH 3/4] intel_pstate: delete scheduler hook in HWP mode Len Brown 2017-06-24 5:11 ` [PATCH 4/4] intel_pstate: skip scheduler hook when in "performance" mode Len Brown 2017-07-25 22:32 ` [PATCH 2/4 v2] x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF Doug Smythies 2017-07-26 17:23 ` Len Brown 2017-07-28 0:13 ` [PATCH] cpufreq: x86: Make scaling_cur_freq behave more as expected Rafael J. Wysocki 2017-07-28 12:45 ` [PATCH v2] " Rafael J. Wysocki 2017-07-31 23:46 ` Doug Smythies 2017-08-01 0:50 ` Rafael J. Wysocki 2017-07-28 6:01 ` [PATCH] " Doug Smythies 2017-07-28 12:26 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).