* [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost
@ 2018-05-16  4:49 Srinivas Pandruvada
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance Srinivas Pandruvada
                   ` (10 more replies)
  0 siblings, 11 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

This series tries to address some performance concerns, particularly with IO
workloads (reported by Mel Gorman), when HWP is used with the intel_pstate
powersave policy.

Background
HWP performance can be controlled by user space using the sysfs interface
for max/min frequency limits and energy performance preference (EPP)
settings. Based on workload characteristics these can be adjusted from
user space. These limits are not changed dynamically by the kernel based
on the workload.
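(In practice these are the per-policy cpufreq sysfs attributes, e.g.
scaling_min_freq/scaling_max_freq and energy_performance_preference, plus
the global intel_pstate min_perf_pct/max_perf_pct limits.)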

By default HWP uses an energy performance preference value of 0x80 on the
majority of platforms (the scale is 0-255; 0 is max performance and 255 is
min). This value offers the best performance/watt, and for the majority of
server workloads performance doesn't suffer. Users also always have the
option of using the performance policy of intel_pstate to get the best
performance. But users tend to run with the out-of-the-box configuration,
which is the powersave policy on most distros.

In some cases it is possible to dynamically adjust performance, for
example when a CPU is woken up due to IO completion or a thread migrates
to a new CPU. In these cases the HWP algorithm will take some time to
build up utilization and ramp up P-states, which may result in lower
performance for some IO workloads and for workloads which tend to
migrate. The idea of this patch series is to temporarily boost
performance dynamically in these cases. This is applicable only when the
user is using the powersave policy, not the performance policy.

Results on a Skylake server:

Benchmark                       Improvement %
----------------------------------------------------------------------
dbench                          50.36
thread IO bench (tiobench)      10.35
File IO                         9.81
sqlite                          15.76
X264 -104 cores                 9.75

Spec Power                      (Negligible impact 7382 Vs. 7378)
Idle Power                      No change observed
-----------------------------------------------------------------------

HWP brings in the best performance/watt at EPP=0x80. Since we are boosting
EPP here to 0, performance/watt drops by up to 10%. So there is a power
penalty to these changes.

Also, Mel Gorman provided test results on a prior patchset, which show
the benefits of this series.

Peter Zijlstra (1):
  x86,sched: Add support for frequency invariance

Srinivas Pandruvada (9):
  cpufreq: intel_pstate: Conditional frequency invariant accounting
  cpufreq: intel_pstate: Utility functions to boost HWP performance
    limits
  cpufreq: intel_pstate: Add update_util_hook for HWP
  cpufreq: intel_pstate: HWP boost performance on IO Wake
  cpufreq / sched: Add interface to get utilization values
  cpufreq: intel_pstate: HWP boost performance on busy task migrate
  cpufreq: intel_pstate: Dynamically update busy pct
  cpufreq: intel_pstate: New sysfs entry to control HWP boost
  cpufreq: intel_pstate: enable boost for SKX

 arch/x86/include/asm/topology.h |  29 +++++
 arch/x86/kernel/smpboot.c       | 196 +++++++++++++++++++++++++++++-
 drivers/cpufreq/intel_pstate.c  | 260 +++++++++++++++++++++++++++++++++++++++-
 include/linux/sched/cpufreq.h   |   2 +
 kernel/sched/core.c             |   1 +
 kernel/sched/cpufreq.c          |  23 ++++
 kernel/sched/sched.h            |   7 ++
 7 files changed, 513 insertions(+), 5 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  9:56   ` Peter Zijlstra
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting Srinivas Pandruvada
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel,
	Suravee Suthikulpanit, Rafael J. Wysocki

From: Peter Zijlstra <peterz@infradead.org>

Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.

For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*].

Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.

[*] this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities it's the best
approximation we can make.
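
To make the scaling concrete, here is a small stand-alone user-space
sketch of the same arithmetic arch_scale_freq_tick() below performs; the
base/turbo ratios and counter deltas are hypothetical example numbers,
not values read from any particular part:

/* Illustration only: APERF/MPERF scaling math with made-up numbers. */
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT    10
#define SCHED_CAPACITY_SCALE    (1ULL << SCHED_CAPACITY_SHIFT)

int main(void)
{
        /* Hypothetical part: base (P1) ratio 20 (2.0 GHz), 1C turbo ratio 37 */
        unsigned long long base_ratio = 20, turbo_ratio = 37;
        unsigned long long max_freq = turbo_ratio * SCHED_CAPACITY_SCALE / base_ratio;

        /* Hypothetical per-tick deltas: the CPU ran at ~3.0 GHz while unhalted */
        unsigned long long acnt = 3000000;      /* APERF delta */
        unsigned long long mcnt = 2000000;      /* MPERF delta */
        unsigned long long freq;

        acnt <<= 2 * SCHED_CAPACITY_SHIFT;
        mcnt *= max_freq;
        freq = acnt / mcnt;

        /* Prints ~830, i.e. 3.0 GHz / 3.7 GHz in SCHED_CAPACITY_SCALE units */
        printf("arch_cpu_freq = %llu (out of %llu)\n", freq, SCHED_CAPACITY_SCALE);
        return 0;
}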

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---

Changes on top of Peter's patch:
-Ported to the latest 4.17-rc4
-Added KNL/KNM related changes
-Account for Turbo boost disabled on a system in BIOS
-New interface to disable tick processing when we don't want it

 arch/x86/include/asm/topology.h |  29 ++++++
 arch/x86/kernel/smpboot.c       | 196 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 4 files changed, 232 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index c1d2a98..3fb5346 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -172,4 +172,33 @@ static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+#define arch_scale_freq_tick arch_scale_freq_tick
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+	if (static_cpu_has(X86_FEATURE_APERFMPERF))
+		return per_cpu(arch_cpu_freq, cpu);
+
+	return 1024 /* SCHED_CAPACITY_SCALE */;
+}
+
+extern void arch_scale_freq_tick(void);
+extern void x86_arch_scale_freq_tick_enable(void);
+extern void x86_arch_scale_freq_tick_disable(void);
+#else
+static inline void x86_arch_scale_freq_tick_enable(void)
+{
+}
+
+static inline void x86_arch_scale_freq_tick_disable(void)
+{
+}
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 0f1cbb0..9e2cb82 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -148,6 +148,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
 	*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
 }
 
+static void set_cpu_max_freq(void);
+
 /*
  * Report back to the Boot Processor during boot time or to the caller processor
  * during CPU online.
@@ -189,6 +191,8 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
+	set_cpu_max_freq();
+
 	/*
 	 * Get our bogomips.
 	 * Update loops_per_jiffy in cpu_data. Previous call to
@@ -1259,7 +1263,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	set_sched_topology(x86_topology);
 
 	set_cpu_sibling_map(0);
-
+	set_cpu_max_freq();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {
@@ -1676,3 +1680,193 @@ void native_play_dead(void)
 }
 
 #endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency.
+ *
+ * Since the frequency on x86 is controlled by micro-controller and our P-state
+ * setting is little more than a request/hint, we need to observe the effective
+ * frequency. We do this with APERF/MPERF.
+ *
+ * One complication is that the APERF/MPERF ratio can be >1, specifically
+ * APERF/MPERF gives the ratio relative to the max non-turbo P-state. Therefore
+ * we need to re-normalize the ratio.
+ *
+ * We do this by tracking the max APERF/MPERF ratio previously observed and
+ * scaling our MPERF delta with that. Every time our ratio goes over 1, we
+ * proportionally scale up our old max.
+ *
+ * The down-side to this runtime max search is that you have to trigger the
+ * actual max frequency before your scale is right. Therefore allow
+ * architectures to initialize the max ratio on CPU bringup.
+ */
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static DEFINE_PER_CPU(u64, arch_prev_max_freq) = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+static bool atom_set_cpu_max_freq(void)
+{
+	u64 ratio, turbo_ratio;
+	int err;
+
+	if (turbo_disabled()) {
+		this_cpu_write(arch_prev_max_freq, SCHED_CAPACITY_SCALE);
+		return true;
+	}
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, &ratio);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, &turbo_ratio);
+	if (err)
+		return false;
+
+	ratio = (ratio >> 16) & 0x7F;     /* max P state ratio */
+	turbo_ratio = turbo_ratio & 0x7F; /* 1C turbo ratio */
+
+	this_cpu_write(arch_prev_max_freq,
+			 div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio));
+	return true;
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id intel_turbo_ratio_adjust[] __initconst = {
+	ICPU(INTEL_FAM6_XEON_PHI_KNL),
+	ICPU(INTEL_FAM6_XEON_PHI_KNM),
+	{}
+};
+
+static bool core_set_cpu_max_freq(void)
+{
+	const struct x86_cpu_id *id;
+	u64 ratio, turbo_ratio;
+	int err;
+
+	if (turbo_disabled()) {
+		this_cpu_write(arch_prev_max_freq, SCHED_CAPACITY_SCALE);
+		return true;
+	}
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio);
+	if (err)
+		return false;
+
+	ratio = (ratio >> 8) & 0xFF;      /* base ratio */
+	id = x86_match_cpu(intel_turbo_ratio_adjust);
+	if (id)
+		turbo_ratio = (turbo_ratio >> 8) & 0xFF; /* 1C turbo ratio */
+	else
+		turbo_ratio = turbo_ratio & 0xFF; /* 1C turbo ratio */
+
+	this_cpu_write(arch_prev_max_freq,
+			 div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio));
+	return true;
+}
+
+static void intel_set_cpu_max_freq(void)
+{
+	if (atom_set_cpu_max_freq())
+		return;
+
+	if (core_set_cpu_max_freq())
+		return;
+}
+
+static void set_cpu_max_freq(void)
+{
+	u64 aperf, mperf;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_INTEL:
+		intel_set_cpu_max_freq();
+		break;
+	default:
+		break;
+	}
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static bool tick_disable;
+
+void arch_scale_freq_tick(void)
+{
+	u64 freq, max_freq = this_cpu_read(arch_prev_max_freq);
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	if (!mcnt)
+		return;
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	acnt <<= 2*SCHED_CAPACITY_SHIFT;
+	mcnt *= max_freq;
+
+	freq = div64_u64(acnt, mcnt);
+
+	if (unlikely(freq > SCHED_CAPACITY_SCALE)) {
+		max_freq *= freq;
+		max_freq >>= SCHED_CAPACITY_SHIFT;
+
+		this_cpu_write(arch_prev_max_freq, max_freq);
+
+		freq = SCHED_CAPACITY_SCALE;
+	}
+
+	this_cpu_write(arch_cpu_freq, freq);
+}
+
+void x86_arch_scale_freq_tick_enable(void)
+{
+	tick_disable = false;
+}
+
+void x86_arch_scale_freq_tick_disable(void)
+{
+	tick_disable = true;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 092f7c4..2bdef36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3076,6 +3076,7 @@ void scheduler_tick(void)
 	struct task_struct *curr = rq->curr;
 	struct rq_flags rf;
 
+	arch_scale_freq_tick();
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c2..71851af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1713,6 +1713,13 @@ static inline int hrtick_enabled(struct rq *rq)
 
 #endif /* CONFIG_SCHED_HRTICK */
 
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
 #ifndef arch_scale_freq_capacity
 static __always_inline
 unsigned long arch_scale_freq_capacity(int cpu)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  7:16   ` Peter Zijlstra
  2018-05-16 15:19   ` Juri Lelli
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits Srinivas Pandruvada
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

intel_pstate has two operating modes: active and passive. In "active"
mode, the built-in scaling governor is used and in "passive" mode,
the driver can be used with any governor like "schedutil". In "active"
mode the utilization values from schedutil are not used, and there is
a requirement from high performance computing use cases not to read
any APERF/MPERF MSRs. In that case there is no need to spend CPU cycles
on frequency invariant accounting by reading APERF/MPERF MSRs.
With this change frequency invariant accounting is only enabled in
"passive" mode.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
[Note: The tick will be enabled later in the series when hwp dynamic
boost is enabled]

 drivers/cpufreq/intel_pstate.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 17e566af..f686bbe 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2040,6 +2040,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 {
 	int ret;
 
+	x86_arch_scale_freq_tick_disable();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2052,6 +2054,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	if (driver == &intel_cpufreq)
+		x86_arch_scale_freq_tick_enable();
+
 	return 0;
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance Srinivas Pandruvada
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  7:22   ` Peter Zijlstra
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 04/10] cpufreq: intel_pstate: Add update_util_hook for HWP Srinivas Pandruvada
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

Set up the necessary infrastructure to be able to boost HWP performance
on a remote CPU. First initialize the data structure to be able to use
smp_call_function_single_async(). The boost up function simply sets HWP
min to the HWP max value and EPP to 0. The boost down function simply
restores the last cached HWP Request value.

To avoid reading the HWP Request MSR during dynamic update, the HWP
Request MSR value is cached in local memory. This caching is done
whenever the HWP Request MSR is modified, during driver init or in the
setpolicy() callback path.
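
As an illustration of what the boost up path does: IA32_HWP_REQUEST keeps
the minimum performance in bits 7:0, the maximum in bits 15:8 and EPP in
bits 31:24, so boosting amounts to the transformation below. This is a
stand-alone sketch with a made-up cached value, not the patch code:

/* Illustration only: boost up bit manipulation on a made-up cached
 * IA32_HWP_REQUEST value (min=0x08, max=0x2f, EPP=0x80).
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t cached = 0x80002f08ULL;
        uint8_t max = (uint8_t)(cached >> 8);
        uint64_t boosted;

        boosted = cached & ~0xff000000ULL;      /* EPP -> 0 (max performance bias) */
        boosted = (boosted & ~0xffULL) | max;   /* HWP min -> HWP max */

        printf("cached  = %#010llx\n", (unsigned long long)cached);
        printf("boosted = %#010llx\n", (unsigned long long)boosted); /* 0x00002f2f */
        return 0;
}

Boost down then simply writes the cached value back, restoring the
original min and EPP.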

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index f686bbe..dc7dfa9 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -221,6 +221,9 @@ struct global_params {
  *			preference/bias
  * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
  *			operation
+ * @hwp_req_cached:	Cached value of the last HWP request MSR
+ * @csd:		A structure used to issue SMP async call, which
+ *			defines callback and arguments
  *
  * This structure stores per CPU instance data for all CPUs.
  */
@@ -253,6 +256,8 @@ struct cpudata {
 	s16 epp_policy;
 	s16 epp_default;
 	s16 epp_saved;
+	u64 hwp_req_cached;
+	call_single_data_t csd;
 };
 
 static struct cpudata **all_cpu_data;
@@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
 		intel_pstate_set_epb(cpu, epp);
 	}
 skip_epp:
+	cpu_data->hwp_req_cached = value;
 	wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
 }
 
@@ -1381,6 +1387,39 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
 	intel_pstate_set_min_pstate(cpu);
 }
 
+
+static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
+{
+	u64 hwp_req;
+	u8 max;
+
+	max = (u8) (cpu->hwp_req_cached >> 8);
+
+	hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
+	hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
+
+	wrmsrl(MSR_HWP_REQUEST, hwp_req);
+}
+
+static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
+{
+	wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
+}
+
+static void intel_pstate_hwp_boost_up_local(void *arg)
+{
+	struct cpudata *cpu = arg;
+
+	intel_pstate_hwp_boost_up(cpu);
+}
+
+static void csd_init(struct cpudata *cpu)
+{
+	cpu->csd.flags = 0;
+	cpu->csd.func = intel_pstate_hwp_boost_up_local;
+	cpu->csd.info = cpu;
+}
+
 static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
@@ -1894,6 +1933,9 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
 
 	policy->fast_switch_possible = true;
 
+	if (hwp_active)
+		csd_init(cpu);
+
 	return 0;
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 04/10] cpufreq: intel_pstate: Add update_util_hook for HWP
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (2 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake Srinivas Pandruvada
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

When HWP dynamic boost is active, set the HWP specific update_util
hook. Also start and stop tick processing for frequency invariant
accounting based on the HWP dynamic boost setting.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index dc7dfa9..e200887 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -290,6 +290,7 @@ static struct pstate_funcs pstate_funcs __read_mostly;
 
 static int hwp_active __read_mostly;
 static bool per_cpu_limits __read_mostly;
+static bool hwp_boost __read_mostly;
 
 static struct cpufreq_driver *intel_pstate_driver __read_mostly;
 
@@ -1420,6 +1421,12 @@ static void csd_init(struct cpudata *cpu)
 	cpu->csd.info = cpu;
 }
 
+static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
+						u64 time, unsigned int flags)
+{
+
+}
+
 static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
@@ -1723,7 +1730,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
 	struct cpudata *cpu = all_cpu_data[cpu_num];
 
-	if (hwp_active)
+	if (hwp_active && !hwp_boost)
 		return;
 
 	if (cpu->update_util_set)
@@ -1731,8 +1738,12 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 
 	/* Prevent intel_pstate_update_util() from using stale data. */
 	cpu->sample.time = 0;
-	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-				     intel_pstate_update_util);
+	if (hwp_active)
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     intel_pstate_update_util_hwp);
+	else
+		cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
+					     intel_pstate_update_util);
 	cpu->update_util_set = true;
 }
 
@@ -1844,8 +1855,15 @@ static int intel_pstate_set_policy(struct cpufreq_policy *policy)
 		intel_pstate_set_update_util_hook(policy->cpu);
 	}
 
-	if (hwp_active)
+	if (hwp_active) {
+		if (hwp_boost) {
+			x86_arch_scale_freq_tick_enable();
+		} else {
+			intel_pstate_clear_update_util_hook(policy->cpu);
+			x86_arch_scale_freq_tick_disable();
+		}
 		intel_pstate_hwp_set(policy->cpu);
+	}
 
 	mutex_unlock(&intel_pstate_limits_lock);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (3 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 04/10] cpufreq: intel_pstate: Add update_util_hook for HWP Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  7:37   ` Peter Zijlstra
  2018-05-16  9:45   ` Rafael J. Wysocki
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values Srinivas Pandruvada
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

When a task is woken up from IO wait, boost HWP performance to max. This
helps IO workloads on servers with per core P-states. But changing limits
has the extra overhead of issuing a new HWP Request MSR write, which takes
1000+ cycles. So this change rate limits setting of the HWP Request MSR.
Also, the request can be for a remote CPU.
Rate control in setting HWP Requests (see the sketch below):
- If the current performance is already around P1, simply ignore the IO
  flag.
- Once boosted, wait until the hold time expires before removing the
  boost. If another IO flag is notified while the boost is on, it will
  prolong the boost.
- If the IO flags are notified multiple ticks apart, this may not be an
  IO bound task. Otherwise an idle system gets periodic boosts for a
  single IO wake.
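
The rules above roughly translate into the following decision flow. This
is a simplified stand-alone model of the intent, not the patch code: the
function names are made up, the boost is reduced to a flag, and a 4 ms
tick (HZ=250) is assumed:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define TICK_NS         (4 * 1000 * 1000ULL)    /* assumed HZ=250, 4 ms tick */
#define HOLD_NS         (3 * 1000 * 1000ULL)    /* 3 ms boost hold time */

static bool boosted;
static uint64_t last_io, last_update;

/* Called when the scheduler reports an IO-wait wakeup for this CPU. */
static void io_wake_event(uint64_t now, bool perf_already_high)
{
        /* Two IO wakes within two ticks => treat the task as IO bound. */
        bool io_bound = (now < last_io + 2 * TICK_NS);

        last_io = now;
        last_update = now;

        if (io_bound && !perf_already_high && !boosted) {
                boosted = true;         /* would write the boosted HWP request */
                printf("%llu ns: boost up\n", (unsigned long long)now);
        }
}

/* Called from the regular tick on this CPU. */
static void tick_event(uint64_t now)
{
        if (boosted && now > last_update + HOLD_NS) {
                boosted = false;        /* would restore the cached HWP request */
                printf("%llu ns: boost down\n", (unsigned long long)now);
        }
}

int main(void)
{
        io_wake_event(100000000, false);  /* first wake only arms detection */
        io_wake_event(102000000, false);  /* second wake within 2 ticks: boost */
        tick_event(104000000);            /* still within the 3 ms hold */
        tick_event(106000000);            /* hold time expired: boost removed */
        return 0;
}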

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 75 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index e200887..d418265 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -20,6 +20,7 @@
 #include <linux/tick.h>
 #include <linux/slab.h>
 #include <linux/sched/cpufreq.h>
+#include <linux/sched/topology.h>
 #include <linux/list.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
@@ -224,6 +225,8 @@ struct global_params {
  * @hwp_req_cached:	Cached value of the last HWP request MSR
  * @csd:		A structure used to issue SMP async call, which
  *			defines callback and arguments
+ * @hwp_boost_active:	HWP performance is boosted on this CPU
+ * @last_io_update:	Last time when IO wake flag was set
  *
  * This structure stores per CPU instance data for all CPUs.
  */
@@ -258,6 +261,8 @@ struct cpudata {
 	s16 epp_saved;
 	u64 hwp_req_cached;
 	call_single_data_t csd;
+	bool hwp_boost_active;
+	u64 last_io_update;
 };
 
 static struct cpudata **all_cpu_data;
@@ -1421,10 +1426,80 @@ static void csd_init(struct cpudata *cpu)
 	cpu->csd.info = cpu;
 }
 
+/*
+ * Long hold time will keep high perf limits for long time,
+ * which negatively impacts perf/watt for some workloads,
+ * like specpower. 3ms is based on experiments on some
+ * workloads.
+ */
+static int hwp_boost_hold_time_ms = 3;
+
+/* Default: This will roughly around P1 on SKX */
+#define BOOST_PSTATE_THRESHOLD	(SCHED_CAPACITY_SCALE / 2)
+static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
+
+static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
+{
+	/*
+	 * If the last performance is above threshold, then return false,
+	 * so that caller can ignore boosting.
+	 */
+	if (arch_scale_freq_capacity(cpu->cpu) > hwp_boost_pstate_threshold)
+		return false;
+
+	return true;
+}
+
 static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
 						u64 time, unsigned int flags)
 {
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+
+	if (flags & SCHED_CPUFREQ_IOWAIT) {
+		/*
+		 * Set iowait_boost flag and update time. Since IO WAIT flag
+		 * is set all the time, we can't just conclude that there is
+		 * some IO bound activity is scheduled on this CPU with just
+		 * one occurrence. If we receive at least two in two
+		 * consecutive ticks, then we start treating as IO. So
+		 * there will be one tick latency.
+		 */
+		if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
+		    intel_pstate_check_boost_threhold(cpu))
+			cpu->iowait_boost = true;
+
+		cpu->last_io_update = time;
+		cpu->last_update = time;
+	}
 
+	/*
+	 * If the boost is active, we will remove it after timeout on local
+	 * CPU only.
+	 */
+	if (cpu->hwp_boost_active) {
+		if (smp_processor_id() == cpu->cpu) {
+			bool expired;
+
+			expired = time_after64(time, cpu->last_update +
+					       (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
+			if (expired) {
+				intel_pstate_hwp_boost_down(cpu);
+				cpu->hwp_boost_active = false;
+				cpu->iowait_boost = false;
+			}
+		}
+		return;
+	}
+
+	cpu->last_update = time;
+
+	if (cpu->iowait_boost) {
+		cpu->hwp_boost_active = true;
+		if (smp_processor_id() == cpu->cpu)
+			intel_pstate_hwp_boost_up(cpu);
+		else
+			smp_call_function_single_async(cpu->cpu, &cpu->csd);
+	}
 }
 
 static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (4 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  6:40   ` Viresh Kumar
  2018-05-16  8:11   ` Peter Zijlstra
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate Srinivas Pandruvada
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

Add cpufreq_get_sched_util() to get the CFS, DL and max utilization
values for a CPU. This is required for getting utilization values in
cpufreq drivers outside of the kernel/sched directory.
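
As a usage sketch (the helper name example_cpu_busy_pct is made up here,
and this is not compile-tested outside a kernel tree; intel_pstate adds a
similar helper later in this series), a driver outside kernel/sched could
derive a busy percentage for a CPU like this:

#include <linux/kernel.h>
#include <linux/sched/cpufreq.h>

/* Sketch: busy percentage of @cpu from the scheduler's point of view. */
static int example_cpu_busy_pct(int cpu)
{
        unsigned long util_cfs, util_dl, max, util;

        cpufreq_get_sched_util(cpu, &util_cfs, &util_dl, &max);
        util = min(util_cfs + util_dl, max);

        return util * 100 / max;
}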

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 include/linux/sched/cpufreq.h |  2 ++
 kernel/sched/cpufreq.c        | 23 +++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 5966744..a366600 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,8 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
                        void (*func)(struct update_util_data *data, u64 time,
 				    unsigned int flags));
 void cpufreq_remove_update_util_hook(int cpu);
+void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
+			    unsigned long *util_dl, unsigned long *max);
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index 5e54cbc..36e2839 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
 	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
 }
 EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
+
+/**
+ * cpufreq_get_sched_util - Get utilization values.
+ * @cpu: The targeted CPU.
+ *
+ * Get the CFS, DL and max utilization.
+ * This function allows cpufreq driver outside the kernel/sched to access
+ * utilization value for a CPUs run queue.
+ */
+void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
+			    unsigned long *util_dl, unsigned long *max)
+{
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+	struct rq *rq = cpu_rq(cpu);
+
+	*max = arch_scale_cpu_capacity(NULL, cpu);
+	*util_cfs = cpu_util_cfs(rq);
+	*util_dl  = cpu_util_dl(rq);
+#else
+	*util_cfs = *util_dl = 1;
+#endif
+}
+EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (5 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  9:49   ` Rafael J. Wysocki
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dynamically update busy pct Srinivas Pandruvada
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

When a busy task migrates to a new CPU, boost HWP performance to max.
This helps workloads on servers with per core P-states, which saturate
all CPUs and hence migrate frequently. But changing limits has the extra
overhead of issuing a new HWP Request MSR write, which takes 1000+
cycles. So this change rate limits setting of the HWP Request MSR.
Rate control in setting HWP Requests:
- If the current performance is already around P1, simply ignore.
- Once boosted, wait until the hold time expires before removing the
  boost. If another flag is notified while the boost is on, it will
  prolong the boost.
- The migrated task needs to have utilization above the threshold
  utilization, so that it will trigger a P-state above the minimum.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d418265..ec455af 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -227,6 +227,7 @@ struct global_params {
  *			defines callback and arguments
  * @hwp_boost_active:	HWP performance is boosted on this CPU
  * @last_io_update:	Last time when IO wake flag was set
+ * @migrate_hint:	Set when scheduler indicates thread migration
  *
  * This structure stores per CPU instance data for all CPUs.
  */
@@ -263,6 +264,7 @@ struct cpudata {
 	call_single_data_t csd;
 	bool hwp_boost_active;
 	u64 last_io_update;
+	bool migrate_hint;
 };
 
 static struct cpudata **all_cpu_data;
@@ -1438,6 +1440,8 @@ static int hwp_boost_hold_time_ms = 3;
 #define BOOST_PSTATE_THRESHOLD	(SCHED_CAPACITY_SCALE / 2)
 static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
 
+static int hwp_boost_threshold_busy_pct;
+
 static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
 {
 	/*
@@ -1450,12 +1454,32 @@ static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
 	return true;
 }
 
+static inline int intel_pstate_get_sched_util(struct cpudata *cpu)
+{
+	unsigned long util_cfs, util_dl, max, util;
+
+	cpufreq_get_sched_util(cpu->cpu, &util_cfs, &util_dl, &max);
+	util = min(util_cfs + util_dl, max);
+	return util * 100 / max;
+}
+
 static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
 						u64 time, unsigned int flags)
 {
 	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
 
-	if (flags & SCHED_CPUFREQ_IOWAIT) {
+	if (flags & SCHED_CPUFREQ_MIGRATION) {
+		if (intel_pstate_check_boost_threhold(cpu))
+			cpu->migrate_hint = true;
+
+		cpu->last_update = time;
+		/*
+		 * The rq utilization data is not migrated yet to the new CPU
+		 * rq, so wait for call on local CPU to boost.
+		 */
+		if (smp_processor_id() != cpu->cpu)
+			return;
+	} else if (flags & SCHED_CPUFREQ_IOWAIT) {
 		/*
 		 * Set iowait_boost flag and update time. Since IO WAIT flag
 		 * is set all the time, we can't just conclude that there is
@@ -1499,6 +1523,17 @@ static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
 			intel_pstate_hwp_boost_up(cpu);
 		else
 			smp_call_function_single_async(cpu->cpu, &cpu->csd);
+		return;
+	}
+
+	/* Ignore if the migrated thread has low utilization */
+	if (cpu->migrate_hint && smp_processor_id() == cpu->cpu) {
+		int util = intel_pstate_get_sched_util(cpu);
+
+		if (util >= hwp_boost_threshold_busy_pct) {
+			cpu->hwp_boost_active = true;
+			intel_pstate_hwp_boost_up(cpu);
+		}
 	}
 }
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dynamically update busy pct
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (6 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  7:43   ` Peter Zijlstra
  2018-05-16  7:47   ` Peter Zijlstra
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 09/10] cpufreq: intel_pstate: New sysfs entry to control HWP boost Srinivas Pandruvada
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

Calculate hwp_boost_threshold_busy_pct (the task busy percent which is
worth boosting) and hwp_boost_pstate_threshold (don't boost if the CPU
is already running at some performance) based on the platform's min,
max and turbo frequencies.
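
As a concrete example with hypothetical frequencies (1.0 GHz minimum,
2.0 GHz base/P1, 3.7 GHz one-core turbo), the busy threshold becomes
1000 * 100 / 3700 + 20 = 47 percent and the P-state threshold becomes
2000 * 1024 / 3700 = 553 out of SCHED_CAPACITY_SCALE. A stand-alone
sketch of the same arithmetic:

/* Illustration only: the threshold math with made-up frequencies (kHz). */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024

int main(void)
{
        unsigned int min_freq = 1000000;        /* lowest P-state (1.0 GHz) */
        unsigned int p1_freq = 2000000;         /* base/P1        (2.0 GHz) */
        unsigned int turbo_freq = 3700000;      /* 1C turbo       (3.7 GHz) */

        unsigned int busy_pct = min_freq * 100 / turbo_freq + 20;
        unsigned int pstate_threshold = p1_freq * SCHED_CAPACITY_SCALE / turbo_freq;

        printf("hwp_boost_threshold_busy_pct = %u\n", busy_pct);         /* 47 */
        printf("hwp_boost_pstate_threshold   = %u\n", pstate_threshold); /* 553 */
        return 0;
}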

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index ec455af..c43edce 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1463,6 +1463,42 @@ static inline int intel_pstate_get_sched_util(struct cpudata *cpu)
 	return util * 100 / max;
 }
 
+
+static inline void intel_pstate_update_busy_threshold(struct cpudata *cpu)
+{
+	if (!hwp_boost_threshold_busy_pct) {
+		int min_freq, max_freq;
+
+		min_freq  = cpu->pstate.min_pstate * cpu->pstate.scaling;
+		update_turbo_state();
+		max_freq =  global.turbo_disabled || global.no_turbo ?
+				cpu->pstate.max_freq : cpu->pstate.turbo_freq;
+
+		/*
+		 * We are guaranteed to get at least the min P-state. If we
+		 * assume P-state is proportional to load (such that a 10%
+		 * load increase will result in a 10% P-state increase), we
+		 * will get at least the min P-state till we have at least
+		 * (min * 100/max) percent CPU load. So for any load less
+		 * than this we shouldn't do any boost. And since boosting
+		 * is not free, we add at least a 20% offset.
+		 */
+		hwp_boost_threshold_busy_pct = min_freq * 100 / max_freq;
+		hwp_boost_threshold_busy_pct += 20;
+		pr_debug("hwp_boost_threshold_busy_pct = %d\n",
+			 hwp_boost_threshold_busy_pct);
+	}
+
+	/* P1 percent out of total range of P-states */
+	if (cpu->pstate.max_freq != cpu->pstate.turbo_freq) {
+		hwp_boost_pstate_threshold =
+			cpu->pstate.max_freq * SCHED_CAPACITY_SCALE / cpu->pstate.turbo_freq;
+		pr_debug("hwp_boost_pstate_threshold = %d\n",
+			 hwp_boost_pstate_threshold);
+	}
+
+}
+
 static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
 						u64 time, unsigned int flags)
 {
@@ -2061,8 +2097,10 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
 
 	policy->fast_switch_possible = true;
 
-	if (hwp_active)
+	if (hwp_active) {
 		csd_init(cpu);
+		intel_pstate_update_busy_threshold(cpu);
+	}
 
 	return 0;
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 09/10] cpufreq: intel_pstate: New sysfs entry to control HWP boost
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (7 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dynamically update busy pct Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX Srinivas Pandruvada
  2018-05-16  6:49 ` [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Juri Lelli
  10 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

A new attribute is added to intel_pstate sysfs to enable/disable
HWP dynamic performance boost.
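With this, the behavior can be toggled at run time by writing 0 or 1 to
the new attribute in the intel_pstate sysfs directory (typically
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost).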

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index c43edce..65d11d2 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1034,6 +1034,30 @@ static ssize_t store_min_perf_pct(struct kobject *a, struct attribute *b,
 	return count;
 }
 
+static ssize_t show_hwp_dynamic_boost(struct kobject *kobj,
+				struct attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", hwp_boost);
+}
+
+static ssize_t store_hwp_dynamic_boost(struct kobject *a, struct attribute *b,
+				       const char *buf, size_t count)
+{
+	unsigned int input;
+	int ret;
+
+	ret = kstrtouint(buf, 10, &input);
+	if (ret)
+		return ret;
+
+	mutex_lock(&intel_pstate_driver_lock);
+	hwp_boost = !!input;
+	intel_pstate_update_policies();
+	mutex_unlock(&intel_pstate_driver_lock);
+
+	return count;
+}
+
 show_one(max_perf_pct, max_perf_pct);
 show_one(min_perf_pct, min_perf_pct);
 
@@ -1043,6 +1067,7 @@ define_one_global_rw(max_perf_pct);
 define_one_global_rw(min_perf_pct);
 define_one_global_ro(turbo_pct);
 define_one_global_ro(num_pstates);
+define_one_global_rw(hwp_dynamic_boost);
 
 static struct attribute *intel_pstate_attributes[] = {
 	&status.attr,
@@ -1083,6 +1108,11 @@ static void __init intel_pstate_sysfs_expose_params(void)
 	rc = sysfs_create_file(intel_pstate_kobject, &min_perf_pct.attr);
 	WARN_ON(rc);
 
+	if (hwp_active) {
+		rc = sysfs_create_file(intel_pstate_kobject,
+				       &hwp_dynamic_boost.attr);
+		WARN_ON(rc);
+	}
 }
 /************************** sysfs end ************************/
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (8 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 09/10] cpufreq: intel_pstate: New sysfs entry to control HWP boost Srinivas Pandruvada
@ 2018-05-16  4:49 ` Srinivas Pandruvada
  2018-05-16  7:49   ` Peter Zijlstra
  2018-05-16  6:49 ` [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Juri Lelli
  10 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16  4:49 UTC (permalink / raw)
  To: srinivas.pandruvada, tglx, mingo, peterz, bp, lenb, rjw, mgorman
  Cc: x86, linux-pm, viresh.kumar, juri.lelli, linux-kernel

Enable HWP boost on Skylake server platform.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
---
 drivers/cpufreq/intel_pstate.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 65d11d2..827c003 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1863,6 +1863,11 @@ static const struct x86_cpu_id intel_pstate_cpu_ee_disable_ids[] = {
 	{}
 };
 
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] __initconst = {
+	ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+	{}
+};
+
 static int intel_pstate_init_cpu(unsigned int cpunum)
 {
 	struct cpudata *cpu;
@@ -1893,6 +1898,10 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
 			intel_pstate_disable_ee(cpunum);
 
 		intel_pstate_hwp_enable(cpu);
+
+		id = x86_match_cpu(intel_pstate_hwp_boost_ids);
+		if (id)
+			hwp_boost = true;
 	}
 
 	intel_pstate_get_cpu_pstates(cpu);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values Srinivas Pandruvada
@ 2018-05-16  6:40   ` Viresh Kumar
  2018-05-16 22:25     ` Srinivas Pandruvada
  2018-05-16  8:11   ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Viresh Kumar @ 2018-05-16  6:40 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, peterz, bp, lenb, rjw, mgorman, x86, linux-pm,
	juri.lelli, linux-kernel

On 15-05-18, 21:49, Srinivas Pandruvada wrote:
> Added cpufreq_get_sched_util() to get the CFS, DL and max utilization
> values for a CPU. This is required for getting utilization values
> for cpufreq drivers outside of kernel/sched folder.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
>  include/linux/sched/cpufreq.h |  2 ++
>  kernel/sched/cpufreq.c        | 23 +++++++++++++++++++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index 5966744..a366600 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -20,6 +20,8 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
>                         void (*func)(struct update_util_data *data, u64 time,
>  				    unsigned int flags));
>  void cpufreq_remove_update_util_hook(int cpu);
> +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> +			    unsigned long *util_dl, unsigned long *max);
>  #endif /* CONFIG_CPU_FREQ */
>  
>  #endif /* _LINUX_SCHED_CPUFREQ_H */
> diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> index 5e54cbc..36e2839 100644
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
>  	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> +
> +/**
> + * cpufreq_get_sched_util - Get utilization values.
> + * @cpu: The targeted CPU.
> + *
> + * Get the CFS, DL and max utilization.
> + * This function allows cpufreq driver outside the kernel/sched to access
> + * utilization value for a CPUs run queue.
> + */
> +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> +			    unsigned long *util_dl, unsigned long *max)
> +{
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL

What will happen when schedutil is compiled in the kernel but ondemand
is the one getting used currently ?

> +	struct rq *rq = cpu_rq(cpu);
> +
> +	*max = arch_scale_cpu_capacity(NULL, cpu);
> +	*util_cfs = cpu_util_cfs(rq);
> +	*util_dl  = cpu_util_dl(rq);
> +#else
> +	*util_cfs = *util_dl = 1;
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
> -- 
> 2.9.5

-- 
viresh

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost
  2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
                   ` (9 preceding siblings ...)
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX Srinivas Pandruvada
@ 2018-05-16  6:49 ` Juri Lelli
  2018-05-16 15:43   ` Srinivas Pandruvada
  10 siblings, 1 reply; 58+ messages in thread
From: Juri Lelli @ 2018-05-16  6:49 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, peterz, bp, lenb, rjw, mgorman, x86, linux-pm,
	viresh.kumar, juri.lelli, linux-kernel

Hi Srinivas,

On 15/05/18 21:49, Srinivas Pandruvada wrote:

[...]

> 
> Peter Zijlstra (1):
>   x86,sched: Add support for frequency invariance

Cool! I was going to ask Peter about this patch. You beat me to it. :)

I'll have a look at the set. BTW, just noticed that you Cc-ed me using
my old address. Please use juri.lelli@redhat.com. ;)

Best,

- Juri

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting Srinivas Pandruvada
@ 2018-05-16  7:16   ` Peter Zijlstra
  2018-05-16  7:29     ` Peter Zijlstra
  2018-05-16 15:19   ` Juri Lelli
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:16 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil is not used and there is
> a requirement from high performance computing use cases, not to read
> any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> frequency invariant accounting by reading APERF/MPERF MSRs.
> With this change frequency invariant account is only enabled in
> "passive" mode.

WTH is active/passive? Is passive when we select performance governor?

Also; you have to explain why using APERF/MPERF is bad in that case. Why
do they care if we read those MSRs during the tick?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits Srinivas Pandruvada
@ 2018-05-16  7:22   ` Peter Zijlstra
  2018-05-16  9:15     ` Rafael J. Wysocki
  2018-05-16 15:41     ` Srinivas Pandruvada
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:22 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:04PM -0700, Srinivas Pandruvada wrote:
> Setup necessary infrastructure to be able to boost HWP performance on a
> remote CPU. First initialize data structure to be able to use
> smp_call_function_single_async(). The boost up function simply set HWP
> min to HWP max value and EPP to 0. The boost down function simply restores
> to last cached HWP Request value.
> 
> To avoid reading HWP Request MSR during dynamic update, the HWP Request
> MSR value is cached in the local memory. This caching is done whenever
> HWP request MSR is modified during driver init on in setpolicy() callback
> path.

The patch does two independent things; best to split that.


> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index f686bbe..dc7dfa9 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -221,6 +221,9 @@ struct global_params {
>   *			preference/bias
>   * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
>   *			operation
> + * @hwp_req_cached:	Cached value of the last HWP request MSR

That's simply not true given the code below.

> @@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
>  		intel_pstate_set_epb(cpu, epp);
>  	}
>  skip_epp:
> +	cpu_data->hwp_req_cached = value;
>  	wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
>  }
>  
> @@ -1381,6 +1387,39 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
>  	intel_pstate_set_min_pstate(cpu);
>  }
>  
> +
> +static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
> +{
> +	u64 hwp_req;
> +	u8 max;
> +
> +	max = (u8) (cpu->hwp_req_cached >> 8);
> +
> +	hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
> +	hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
> +
> +	wrmsrl(MSR_HWP_REQUEST, hwp_req);
> +}
> +
> +static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
> +{
> +	wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
> +}

This is not a traditional msr shadow; that would be updated on every
wrmsr. There is not a comment in sight explaining why this one is
different.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16  7:16   ` Peter Zijlstra
@ 2018-05-16  7:29     ` Peter Zijlstra
  2018-05-16  9:07       ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:29 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
> > intel_pstate has two operating modes: active and passive. In "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In "active"
> > mode the utilization values from schedutil is not used and there is
> > a requirement from high performance computing use cases, not to read
> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> > frequency invariant accounting by reading APERF/MPERF MSRs.
> > With this change frequency invariant account is only enabled in
> > "passive" mode.
> 
> WTH is active/passive? Is passive when we select performance governor?

Bah, I cannot read it seems. active is when we use the intel_pstate
governor and passive is when we use schedutil and only use intel_pstate
as a driver.

> Also; you have to explain why using APERF/MPERF is bad in that case. Why
> do they care if we read those MSRs during the tick?

That still stands.. this needs to be properly explained.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake Srinivas Pandruvada
@ 2018-05-16  7:37   ` Peter Zijlstra
  2018-05-16 17:55     ` Srinivas Pandruvada
  2018-05-16  9:45   ` Rafael J. Wysocki
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:37 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:06PM -0700, Srinivas Pandruvada wrote:
> When a task is woken up from IO wait, boost HWP prformance to max. This
> helps IO workloads on servers with per core P-states. But changing limits
> has extra over head of issuing new HWP Request MSR, which takes 1000+
> cycles. So this change limits setting HWP Request MSR. Also request can
> be for a remote CPU.
> Rate control in setting HWP Requests:
> - If the current performance is around P1, simply ignore IO flag.
> - Once set wait till hold time, till remove boost. While the boost
>   is on, another IO flags is notified, it will prolong boost.
> - If the IO flags are notified multiple ticks apart, this may not be
> IO bound task. Othewise idle system gets periodic boosts for one
> IO wake.
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 75 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 75 insertions(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index e200887..d418265 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -20,6 +20,7 @@
>  #include <linux/tick.h>
>  #include <linux/slab.h>
>  #include <linux/sched/cpufreq.h>
> +#include <linux/sched/topology.h>
>  #include <linux/list.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
> @@ -224,6 +225,8 @@ struct global_params {
>   * @hwp_req_cached:	Cached value of the last HWP request MSR
>   * @csd:		A structure used to issue SMP async call, which
>   *			defines callback and arguments
> + * @hwp_boost_active:	HWP performance is boosted on this CPU
> + * @last_io_update:	Last time when IO wake flag was set
>   *
>   * This structure stores per CPU instance data for all CPUs.
>   */
> @@ -258,6 +261,8 @@ struct cpudata {
>  	s16 epp_saved;
>  	u64 hwp_req_cached;
>  	call_single_data_t csd;
> +	bool hwp_boost_active;
> +	u64 last_io_update;

This structure has abysmal layout; you should look at that.
Also, mandatory complaint about using _Bool in composites.

>  };
>  
>  static struct cpudata **all_cpu_data;
> @@ -1421,10 +1426,80 @@ static void csd_init(struct cpudata *cpu)
>  	cpu->csd.info = cpu;
>  }
>  
> +/*
> + * Long hold time will keep high perf limits for long time,
> + * which negatively impacts perf/watt for some workloads,
> + * like specpower. 3ms is based on experiments on some
> + * workloads.
> + */
> +static int hwp_boost_hold_time_ms = 3;
> +
> +/* Default: This will roughly around P1 on SKX */
> +#define BOOST_PSTATE_THRESHOLD	(SCHED_CAPACITY_SCALE / 2)

Yuck.. why the need to hardcode this? Can't you simply read the P1 value
for the part at hand?

> +static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
> +
> +static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
> +{
> +	/*
> +	 * If the last performance is above threshold, then return false,
> +	 * so that caller can ignore boosting.
> +	 */
> +	if (arch_scale_freq_capacity(cpu->cpu) > hwp_boost_pstate_threshold)
> +		return false;
> +
> +	return true;
> +}
> +
>  static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
>  						u64 time, unsigned int flags)
>  {
> +	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
> +
> +	if (flags & SCHED_CPUFREQ_IOWAIT) {
> +		/*
> +		 * Set iowait_boost flag and update time. Since IO WAIT flag
> +		 * is set all the time, we can't just conclude that
> +		 * some IO bound activity is scheduled on this CPU with just
> +		 * one occurrence. If we receive at least two in two
> +		 * consecutive ticks, then we start treating it as IO. So
> +		 * there will be one tick latency.
> +		 */
> +		if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
> +		    intel_pstate_check_boost_threhold(cpu))
> +			cpu->iowait_boost = true;
> +
> +		cpu->last_io_update = time;
> +		cpu->last_update = time;
> +	}
>  
> +	/*
> +	 * If the boost is active, we will remove it after timeout on local
> +	 * CPU only.
> +	 */
> +	if (cpu->hwp_boost_active) {
> +		if (smp_processor_id() == cpu->cpu) {
> +			bool expired;
> +
> +			expired = time_after64(time, cpu->last_update +
> +					       (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
> +			if (expired) {
> +				intel_pstate_hwp_boost_down(cpu);
> +				cpu->hwp_boost_active = false;
> +				cpu->iowait_boost = false;
> +			}
> +		}
> +		return;
> +	}
> +
> +	cpu->last_update = time;
> +
> +	if (cpu->iowait_boost) {
> +		cpu->hwp_boost_active = true;
> +		if (smp_processor_id() == cpu->cpu)
> +			intel_pstate_hwp_boost_up(cpu);
> +		else
> +			smp_call_function_single_async(cpu->cpu, &cpu->csd);
> +	}
>  }

Hurmph, this looks like you're starting to duplicate the schedutil
iowait logic. Why didn't you copy the gradual boosting thing?
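
For readers following along: the "gradual boosting thing" is schedutil's
iowait boost, which ramps the boost up step-wise on successive IO wakes and
decays it again, rather than jumping straight to max. A minimal,
self-contained sketch of that pattern -- the constants and the exact decay
policy here are illustrative assumptions, not the actual governor code:

	#include <stdio.h>

	#define CAP_MIN		128	/* assumed starting boost, capacity units */
	#define CAP_MAX		1024	/* SCHED_CAPACITY_SCALE */

	static unsigned int iowait_boost;

	static void iowait_boost_update(int io_wake_seen)
	{
		if (io_wake_seen) {
			/* double the boost on each back-to-back IO wake */
			iowait_boost = iowait_boost ? iowait_boost * 2 : CAP_MIN;
			if (iowait_boost > CAP_MAX)
				iowait_boost = CAP_MAX;
		} else {
			/* decay when no IO wake was seen this interval */
			iowait_boost /= 2;
		}
	}

	int main(void)
	{
		int i;

		for (i = 0; i < 5; i++) {
			iowait_boost_update(1);
			printf("IO wake %d -> boost %u\n", i + 1, iowait_boost);
		}
		return 0;
	}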

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dyanmically update busy pct
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dyanmically update busy pct Srinivas Pandruvada
@ 2018-05-16  7:43   ` Peter Zijlstra
  2018-05-16  7:47   ` Peter Zijlstra
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:43 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:09PM -0700, Srinivas Pandruvada wrote:
> +static inline void intel_pstate_update_busy_threshold(struct cpudata *cpu)
> +{
> +	/* P1 percent out of total range of P-states */
> +	if (cpu->pstate.max_freq != cpu->pstate.turbo_freq) {
> +		hwp_boost_pstate_threshold =
> +			cpu->pstate.max_freq * SCHED_CAPACITY_SCALE / cpu->pstate.turbo_freq;
> +		pr_debug("hwp_boost_pstate_threshold = %d\n",
> +			 hwp_boost_pstate_threshold);
> +	}
> +
> +}
> +
>  static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
>  						u64 time, unsigned int flags)
>  {
> @@ -2061,8 +2097,10 @@ static int __intel_pstate_cpu_init(struct cpufreq_policy *policy)
>  
>  	policy->fast_switch_possible = true;
>  
> -	if (hwp_active)
> +	if (hwp_active) {
>  		csd_init(cpu);
> +		intel_pstate_update_busy_threshold(cpu);
> +	}
>  
>  	return 0;
>  }

This should go in patch #5 and then you can remove that SKX hack. Which
you left in, even though you now did it right.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dyanmically update busy pct
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dyanmically update busy pct Srinivas Pandruvada
  2018-05-16  7:43   ` Peter Zijlstra
@ 2018-05-16  7:47   ` Peter Zijlstra
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:47 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:09PM -0700, Srinivas Pandruvada wrote:

> +static inline void intel_pstate_update_busy_threshold(struct cpudata *cpu)
> +{
> +	if (!hwp_boost_threshold_busy_pct) {
> +		int min_freq, max_freq;
> +
> +		min_freq  = cpu->pstate.min_pstate * cpu->pstate.scaling;
> +		update_turbo_state();
> +		max_freq =  global.turbo_disabled || global.no_turbo ?
> +				cpu->pstate.max_freq : cpu->pstate.turbo_freq;
> +
> +		/*
> +		 * We are guranteed to get atleast min P-state.

		   If we assume
> +		 * P-state is proportional to load (such that 10% load
> +		 * increase will result in 10% P-state increase),

		   we will
> +		 * get at least min P-state till we have atleast
> +		 * (min * 100/max) percent cpu load.

turbo makes that story less clear, of course.

		   So any load less than
> +		 * than this this we shouldn't do any boost. Then boosting
> +		 * is not free, we will add atleast 20% offset.

This I don't get.. so you want to remain at min P longer?

> +		 */
> +		hwp_boost_threshold_busy_pct = min_freq * 100 / max_freq;
> +		hwp_boost_threshold_busy_pct += 20;
> +		pr_debug("hwp_boost_threshold_busy_pct = %d\n",
> +			 hwp_boost_threshold_busy_pct);
> +	}
> +}

And then this part should go in the previous patch.
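
To make the threshold formula concrete (the frequencies below are made-up
example numbers, not from any particular SKU):

	min_freq = min_pstate * scaling = 8 * 100000 =  800000 kHz
	max_freq = turbo_freq           =              3700000 kHz

	hwp_boost_threshold_busy_pct = 800000 * 100 / 3700000 + 20
	                             = 21 + 20
	                             = 41

i.e. with these numbers a CPU would have to be more than ~41% busy before
the driver considers boosting it.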

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX Srinivas Pandruvada
@ 2018-05-16  7:49   ` Peter Zijlstra
  2018-05-16 15:46     ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  7:49 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada wrote:
> Enable HWP boost on Skylake server platform.

Why not unconditionally enable it on everything HWP ?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values Srinivas Pandruvada
  2018-05-16  6:40   ` Viresh Kumar
@ 2018-05-16  8:11   ` Peter Zijlstra
  2018-05-16 22:40     ` Srinivas Pandruvada
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  8:11 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Tue, May 15, 2018 at 09:49:07PM -0700, Srinivas Pandruvada wrote:
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
>  	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), NULL);
>  }
>  EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> +
> +/**
> + * cpufreq_get_sched_util - Get utilization values.
> + * @cpu: The targeted CPU.
> + *
> + * Get the CFS, DL and max utilization.
> + * This function allows cpufreq driver outside the kernel/sched to access
> + * utilization value for a CPUs run queue.
> + */
> +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> +			    unsigned long *util_dl, unsigned long *max)
> +{
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +	struct rq *rq = cpu_rq(cpu);
> +
> +	*max = arch_scale_cpu_capacity(NULL, cpu);
> +	*util_cfs = cpu_util_cfs(rq);
> +	*util_dl  = cpu_util_dl(rq);
> +#else
> +	*util_cfs = *util_dl = 1;
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);

So I _really_ hate this... I'd much rather you make schedutil work with
the hwp passive stuff.

Also, afaict intel_pstate is bool, not tristate, so no need for an
EXPORT at all.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16  7:29     ` Peter Zijlstra
@ 2018-05-16  9:07       ` Rafael J. Wysocki
  2018-05-16 17:32         ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2018-05-16  9:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, May 16, 2018 at 9:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
>> On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada wrote:
>> > intel_pstate has two operating modes: active and passive. In "active"
>> > mode, the in-built scaling governor is used and in "passive" mode,
>> > the driver can be used with any governor like "schedutil". In "active"
>> > mode the utilization values from schedutil are not used and there is
>> > a requirement from high performance computing use cases, not to read
>> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
>> > frequency invariant accounting by reading APERF/MPERF MSRs.
>> > With this change frequency invariant accounting is only enabled in
>> > "passive" mode.
>>
>> WTH is active/passive? Is passive when we select performance governor?
>
> Bah, I cannot read it seems. active is when we use the intel_pstate
> governor and passive is when we use schedutil and only use intel_pstate
> as a driver.
>
>> Also; you have to explain why using APERF/MPERF is bad in that case. Why
>> do they care if we read those MSRs during the tick?
>
> That still stands.. this needs to be properly explained.

I guess this is from the intel_pstate perspective only.

The active mode is only used with HWP, so intel_pstate doesn't look at
the utilization (in any form) in the passive mode today.

Still, there are other reasons for PELT to be scale-invariant, so ...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits
  2018-05-16  7:22   ` Peter Zijlstra
@ 2018-05-16  9:15     ` Rafael J. Wysocki
  2018-05-16 10:43       ` Peter Zijlstra
  2018-05-16 15:41     ` Srinivas Pandruvada
  1 sibling, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2018-05-16  9:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, May 16, 2018 at 9:22 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 15, 2018 at 09:49:04PM -0700, Srinivas Pandruvada wrote:
>> Setup necessary infrastructure to be able to boost HWP performance on a
>> remote CPU. First initialize data structure to be able to use
>> smp_call_function_single_async(). The boost up function simply sets HWP
>> min to HWP max value and EPP to 0. The boost down function simply restores
>> to last cached HWP Request value.
>>
>> To avoid reading HWP Request MSR during dynamic update, the HWP Request
>> MSR value is cached in the local memory. This caching is done whenever
>> HWP request MSR is modified during driver init or in the setpolicy() callback
>> path.
>
> The patch does two independent things; best to split that.
>
>
>> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
>> index f686bbe..dc7dfa9 100644
>> --- a/drivers/cpufreq/intel_pstate.c
>> +++ b/drivers/cpufreq/intel_pstate.c
>> @@ -221,6 +221,9 @@ struct global_params {
>>   *                   preference/bias
>>   * @epp_saved:               Saved EPP/EPB during system suspend or CPU offline
>>   *                   operation
>> + * @hwp_req_cached:  Cached value of the last HWP request MSR
>
> That's simply not true given the code below.

It looks like the last "not boosted EPP" value.

>> @@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int cpu)
>>               intel_pstate_set_epb(cpu, epp);
>>       }
>>  skip_epp:
>> +     cpu_data->hwp_req_cached = value;
>>       wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
>>  }
>>
>> @@ -1381,6 +1387,39 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu)
>>       intel_pstate_set_min_pstate(cpu);
>>  }
>>
>> +
>> +static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
>> +{
>> +     u64 hwp_req;
>> +     u8 max;
>> +
>> +     max = (u8) (cpu->hwp_req_cached >> 8);
>> +
>> +     hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
>> +     hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
>> +
>> +     wrmsrl(MSR_HWP_REQUEST, hwp_req);
>> +}
>> +
>> +static inline void intel_pstate_hwp_boost_down(struct cpudata *cpu)
>> +{
>> +     wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
>> +}
>
> This is not a traditional msr shadow; that would be updated on every
> wrmsr. There is not a comment in sight explaining why this one is
> different.

So if the "cached" thing is the last "not boosted EPP", that's why it
is not updated here.
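
For reference, the IA32_HWP_REQUEST fields these helpers poke at are, per
the SDM: bits 7:0 minimum performance, 15:8 maximum performance, 23:16
desired performance, 31:24 EPP. A small self-contained sketch of what the
boost-up path effectively computes from the cached value (the example
request below is made up):

	#include <stdio.h>
	#include <stdint.h>

	/*
	 * IA32_HWP_REQUEST layout (low 32 bits):
	 *   bits  7:0   minimum performance
	 *   bits 15:8   maximum performance
	 *   bits 23:16  desired performance
	 *   bits 31:24  energy performance preference (EPP)
	 */
	static uint64_t hwp_boosted_request(uint64_t cached)
	{
		uint64_t max = (cached >> 8) & 0xff;
		uint64_t req = cached & ~0xff000000ULL;	/* EPP -> 0 (max perf bias) */

		return (req & ~0xffULL) | max;		/* min perf -> max perf */
	}

	int main(void)
	{
		/* hypothetical cached request: min=0x0a, max=0x28, EPP=0x80 */
		uint64_t cached = (0x80ULL << 24) | (0x28 << 8) | 0x0a;

		printf("cached  %#llx\nboosted %#llx\n",
		       (unsigned long long)cached,
		       (unsigned long long)hwp_boosted_request(cached));
		return 0;
	}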

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake Srinivas Pandruvada
  2018-05-16  7:37   ` Peter Zijlstra
@ 2018-05-16  9:45   ` Rafael J. Wysocki
  2018-05-16 19:28     ` Srinivas Pandruvada
  1 sibling, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2018-05-16  9:45 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov,
	Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, May 16, 2018 at 6:49 AM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> When a task is woken up from IO wait, boost HWP performance to max. This
> helps IO workloads on servers with per core P-states. But changing limits
> has the extra overhead of issuing a new HWP Request MSR write, which takes
> 1000+ cycles. So this change rate-limits setting the HWP Request MSR. Also,
> the request can be for a remote CPU.
> Rate control in setting HWP Requests:
> - If the current performance is around P1, simply ignore the IO flag.
> - Once set, wait till hold time to remove the boost. While the boost
>   is on, if another IO flag is notified, it will prolong the boost.
> - If the IO flags are notified multiple ticks apart, this may not be an
>   IO bound task. Otherwise an idle system gets periodic boosts for one
>   IO wake.
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 75 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 75 insertions(+)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index e200887..d418265 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -20,6 +20,7 @@
>  #include <linux/tick.h>
>  #include <linux/slab.h>
>  #include <linux/sched/cpufreq.h>
> +#include <linux/sched/topology.h>
>  #include <linux/list.h>
>  #include <linux/cpu.h>
>  #include <linux/cpufreq.h>
> @@ -224,6 +225,8 @@ struct global_params {
>   * @hwp_req_cached:    Cached value of the last HWP request MSR
>   * @csd:               A structure used to issue SMP async call, which
>   *                     defines callback and arguments
> + * @hwp_boost_active:  HWP performance is boosted on this CPU
> + * @last_io_update:    Last time when IO wake flag was set
>   *
>   * This structure stores per CPU instance data for all CPUs.
>   */
> @@ -258,6 +261,8 @@ struct cpudata {
>         s16 epp_saved;
>         u64 hwp_req_cached;
>         call_single_data_t csd;
> +       bool hwp_boost_active;
> +       u64 last_io_update;
>  };
>
>  static struct cpudata **all_cpu_data;
> @@ -1421,10 +1426,80 @@ static void csd_init(struct cpudata *cpu)
>         cpu->csd.info = cpu;
>  }
>
> +/*
> + * Long hold time will keep high perf limits for long time,
> + * which negatively impacts perf/watt for some workloads,
> + * like specpower. 3ms is based on experiments on some
> + * workloads.
> + */
> +static int hwp_boost_hold_time_ms = 3;
> +
> +/* Default: This will be roughly around P1 on SKX */
> +#define BOOST_PSTATE_THRESHOLD (SCHED_CAPACITY_SCALE / 2)
> +static int hwp_boost_pstate_threshold = BOOST_PSTATE_THRESHOLD;
> +
> +static inline bool intel_pstate_check_boost_threhold(struct cpudata *cpu)
> +{
> +       /*
> +        * If the last performance is above threshold, then return false,
> +        * so that caller can ignore boosting.
> +        */
> +       if (arch_scale_freq_capacity(cpu->cpu) > hwp_boost_pstate_threshold)
> +               return false;
> +
> +       return true;
> +}
> +
>  static inline void intel_pstate_update_util_hwp(struct update_util_data *data,
>                                                 u64 time, unsigned int flags)
>  {
> +       struct cpudata *cpu = container_of(data, struct cpudata, update_util);
> +
> +       if (flags & SCHED_CPUFREQ_IOWAIT) {
> +               /*
> +                * Set iowait_boost flag and update time. Since IO WAIT flag
> +                * is set all the time, we can't just conclude that
> +                * some IO bound activity is scheduled on this CPU with just
> +                * one occurrence. If we receive at least two in two
> +                * consecutive ticks, then we start treating it as IO. So
> +                * there will be one tick latency.
> +                */
> +               if (time_before64(time, cpu->last_io_update + 2 * TICK_NSEC) &&
> +                   intel_pstate_check_boost_threhold(cpu))
> +                       cpu->iowait_boost = true;
> +
> +               cpu->last_io_update = time;
> +               cpu->last_update = time;

This is a shared data structure and it gets updated without
synchronization, unless I'm missing something.

How much does the cross-CPU case matter?

> +       }
>
> +       /*
> +        * If the boost is active, we will remove it after timeout on local
> +        * CPU only.
> +        */
> +       if (cpu->hwp_boost_active) {
> +               if (smp_processor_id() == cpu->cpu) {
> +                       bool expired;
> +
> +                       expired = time_after64(time, cpu->last_update +
> +                                              (hwp_boost_hold_time_ms * NSEC_PER_MSEC));
> +                       if (expired) {
> +                               intel_pstate_hwp_boost_down(cpu);
> +                               cpu->hwp_boost_active = false;
> +                               cpu->iowait_boost = false;
> +                       }
> +               }
> +               return;
> +       }
> +
> +       cpu->last_update = time;
> +
> +       if (cpu->iowait_boost) {
> +               cpu->hwp_boost_active = true;
> +               if (smp_processor_id() == cpu->cpu)
> +                       intel_pstate_hwp_boost_up(cpu);
> +               else
> +                       smp_call_function_single_async(cpu->cpu, &cpu->csd);
> +       }
>  }
>
>  static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
> --
> 2.9.5
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate Srinivas Pandruvada
@ 2018-05-16  9:49   ` Rafael J. Wysocki
  2018-05-16 20:59     ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2018-05-16  9:49 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov,
	Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, May 16, 2018 at 6:49 AM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> When a busy task migrates to a new CPU, boost HWP performance to max. This
> helps workloads on servers with per core P-states, which saturate all
> CPUs and then migrate frequently. But changing limits has the extra
> overhead of issuing a new HWP Request MSR write, which takes 1000+
> cycles. So this change rate-limits setting the HWP Request MSR.
> Rate control in setting HWP Requests:
> - If the current performance is around P1, simply ignore.
> - Once set, wait till hold time to remove the boost. While the boost
>   is on, if another flag is notified, it will prolong the boost.
> - The migrating task needs to have some utilization above the threshold
>   utilization, which will trigger a P-state above minimum.
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 37 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index d418265..ec455af 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -227,6 +227,7 @@ struct global_params {
>   *                     defines callback and arguments
>   * @hwp_boost_active:  HWP performance is boosted on this CPU
>   * @last_io_update:    Last time when IO wake flag was set
> + * @migrate_hint:      Set when scheduler indicates thread migration
>   *
>   * This structure stores per CPU instance data for all CPUs.
>   */
> @@ -263,6 +264,7 @@ struct cpudata {
>         call_single_data_t csd;
>         bool hwp_boost_active;
>         u64 last_io_update;
> +       bool migrate_hint;

Why do you need this in the struct?

It looks like it only is used locally in intel_pstate_update_util_hwp().

>  };

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance Srinivas Pandruvada
@ 2018-05-16  9:56   ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16  9:56 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel, Suravee Suthikulpanit,
	Rafael J. Wysocki, Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Sudeep Holla


Thanks for posting this one; I meant to start a thread on this for a
while but never got around to doing so.

I left the 'important' parts of the patch for context but removed all
the arch fiddling to find the max freq, as that is not so important
here.

On Tue, May 15, 2018 at 09:49:02PM -0700, Srinivas Pandruvada wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Implement arch_scale_freq_capacity() for 'modern' x86. This function
> is used by the scheduler to correctly account usage in the face of
> DVFS.
> 
> For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
> running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
> appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
> false impression this CPU is almost at capacity, even though it can go
> faster [*].
> 
> Since modern x86 has hardware control over the actual frequency we run
> at (because amongst other things, Turbo-Mode), we cannot simply use
> the frequency as requested through cpufreq.
> 
> Instead we use the APERF/MPERF MSRs to compute the effective frequency
> over the recent past. Also, because reading MSRs is expensive, don't
> do so every time we need the value, but amortize the cost by doing it
> every tick.
> 
> [*] this assumes a linear frequency/performance relation; which
> everybody knows to be false, but given realities it's the best
> approximation we can make.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---

> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index c1d2a98..3fb5346 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -172,4 +172,33 @@ static inline void sched_clear_itmt_support(void)
>  }
>  #endif /* CONFIG_SCHED_MC_PRIO */
>  
> +#ifdef CONFIG_SMP
> +#include <asm/cpufeature.h>
> +
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
> +
> +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static inline long arch_scale_freq_capacity(int cpu)
> +{
> +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return per_cpu(arch_cpu_freq, cpu);
> +
> +	return 1024 /* SCHED_CAPACITY_SCALE */;
> +}
> +
> +extern void arch_scale_freq_tick(void);
> +extern void x86_arch_scale_freq_tick_enable(void);
> +extern void x86_arch_scale_freq_tick_disable(void);
> +#else
> +static inline void x86_arch_scale_freq_tick_enable(void)
> +{
> +}
> +
> +static inline void x86_arch_scale_freq_tick_disable(void)
> +{
> +}
> +#endif
> +
>  #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 0f1cbb0..9e2cb82 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c

> @@ -1676,3 +1680,193 @@ void native_play_dead(void)
>  }
>  
>  #endif
> +
> +/*
> + * APERF/MPERF frequency ratio computation.
> + *
> + * The scheduler wants to do frequency invariant accounting and needs a <1
> + * ratio to account for the 'current' frequency.
> + *
> + * Since the frequency on x86 is controlled by micro-controller and our P-state
> + * setting is little more than a request/hint, we need to observe the effective
> + * frequency. We do this with APERF/MPERF.
> + *
> + * One complication is that the APERF/MPERF ratio can be >1, specifically
> + * APERF/MPERF gives the ratio relative to the max non-turbo P-state. Therefore
> + * we need to re-normalize the ratio.
> + *
> + * We do this by tracking the max APERF/MPERF ratio previously observed and
> + * scaling our MPERF delta with that. Every time our ratio goes over 1, we
> + * proportionally scale up our old max.

One very important point however is that I wrote this patch in the
context of Vincent's new scale invariance proposal:

  https://lkml.kernel.org/r/1493389435-2525-1-git-send-email-vincent.guittot@linaro.org

The reason is that while this 'works' with the current scale invariance,
the way turbo is handled is not optimal for it.

At OSPM we briefly touched upon this subject, since ARM will also need
something like this for some of their chips, so this is of general
interest.

The problem with turbo of course is that our max frequency is variable;
but we really rather would like a unit value for scaling. Returning a >1
value results in weird things (think of running for 1.5ms in 1ms
wall-time for example).

This implementation simply finds the absolute max observed and scales
that as 1, with a result that when we're busy we'll always run at <1
because we cannot sustain turbo. This might result in the scheduler
thinking we're not fully busy, when in fact we are.

At OSPM it was suggested to instead track an average max or set 1 at the
sustainable freq and clip overshoot. The problem with that is that it is
actually hard to track an average max if you cannot tell what max even
is.

The problem with clipping of course is that we'll end up biasing the
frequencies higher than required -- which might be OK if the overshoot
is 'small' as it would typically be for an average max thing, but not
when we set 1 at the sustainable frequency.

I think the new scale invariance solves a bunch of these problems by
always saturating, irrespective of the actual frequency we run at.

Of course, IIRC it had other issues...
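
To make the max-tracking renormalization concrete (made-up deltas; the
arithmetic mirrors arch_scale_freq_tick() quoted below):

	arch_prev_max_freq starts at 1024 (SCHED_CAPACITY_SCALE)

	tick 1: acnt = 12, mcnt = 10          (running ~20% into turbo)
	        freq = (12 << 20) / (10 * 1024) = 1228 > 1024
	        -> arch_prev_max_freq = 1024 * 1228 >> 10 = 1228, report 1024

	tick 2: acnt = 10, mcnt = 10          (pegged at max non-turbo)
	        freq = (10 << 20) / (10 * 1228) = 853  (~0.83 of scale)

so once turbo has been observed, sustained max-non-turbo operation no
longer looks fully busy -- the effect described above.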

> + * The down-side to this runtime max search is that you have to trigger the
> + * actual max frequency before your scale is right. Therefore allow
> + * architectures to initialize the max ratio on CPU bringup.
> + */
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +static DEFINE_PER_CPU(u64, arch_prev_max_freq) = SCHED_CAPACITY_SCALE;
> +

> +DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static bool tick_disable;
> +
> +void arch_scale_freq_tick(void)
> +{
> +	u64 freq, max_freq = this_cpu_read(arch_prev_max_freq);
> +	u64 aperf, mperf;
> +	u64 acnt, mcnt;
> +
> +	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
> +		return;
> +
> +	rdmsrl(MSR_IA32_APERF, aperf);
> +	rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +	acnt = aperf - this_cpu_read(arch_prev_aperf);
> +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +	if (!mcnt)
> +		return;
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +
> +	acnt <<= 2*SCHED_CAPACITY_SHIFT;
> +	mcnt *= max_freq;
> +
> +	freq = div64_u64(acnt, mcnt);
> +
> +	if (unlikely(freq > SCHED_CAPACITY_SCALE)) {
> +		max_freq *= freq;
> +		max_freq >>= SCHED_CAPACITY_SHIFT;
> +
> +		this_cpu_write(arch_prev_max_freq, max_freq);
> +
> +		freq = SCHED_CAPACITY_SCALE;
> +	}
> +
> +	this_cpu_write(arch_cpu_freq, freq);
> +}

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 092f7c4..2bdef36 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3076,6 +3076,7 @@ void scheduler_tick(void)
>  	struct task_struct *curr = rq->curr;
>  	struct rq_flags rf;
>  
> +	arch_scale_freq_tick();
>  	sched_clock_tick();
>  
>  	rq_lock(rq, &rf);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits
  2018-05-16  9:15     ` Rafael J. Wysocki
@ 2018-05-16 10:43       ` Peter Zijlstra
  2018-05-16 15:39         ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16 10:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, May 16, 2018 at 11:15:09AM +0200, Rafael J. Wysocki wrote:

> So if the "cached" thing is the last "not boosted EPP", that's why it
> is not updated here.

Sure, I see what it does, just saying that naming and comments are wrong
vs the actual code.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16  4:49 ` [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting Srinivas Pandruvada
  2018-05-16  7:16   ` Peter Zijlstra
@ 2018-05-16 15:19   ` Juri Lelli
  2018-05-16 15:47     ` Peter Zijlstra
  2018-05-16 15:58     ` Srinivas Pandruvada
  1 sibling, 2 replies; 58+ messages in thread
From: Juri Lelli @ 2018-05-16 15:19 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, peterz, bp, lenb, rjw, mgorman, x86, linux-pm,
	viresh.kumar, linux-kernel

On 15/05/18 21:49, Srinivas Pandruvada wrote:
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil are not used and there is
> a requirement from high performance computing use cases, not to read
> any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> frequency invariant accounting by reading APERF/MPERF MSRs.
> With this change frequency invariant accounting is only enabled in
> "passive" mode.
> 
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> [Note: The tick will be enabled later in the series when hwp dynamic
> boost is enabled]
> 
>  drivers/cpufreq/intel_pstate.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 17e566af..f686bbe 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -2040,6 +2040,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
>  {
>  	int ret;
>  
> +	x86_arch_scale_freq_tick_disable();
> +
>  	memset(&global, 0, sizeof(global));
>  	global.max_perf_pct = 100;
>  
> @@ -2052,6 +2054,9 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
>  
>  	global.min_perf_pct = min_perf_pct_min();
>  
> +	if (driver == &intel_cpufreq)
> +		x86_arch_scale_freq_tick_enable();

This will unconditionally trigger the reading/calculation at each tick
even though information is not actually consumed (e.g., running
performance or any other governor), right? Do we want that?

Anyway, FWIW I started testing this on an E5-2609 v3 and I'm not seeing
hackbench regressions so far (running with schedutil governor).

Best,

- Juri

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits
  2018-05-16 10:43       ` Peter Zijlstra
@ 2018-05-16 15:39         ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 15:39 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Juri Lelli, Linux Kernel Mailing List

On Wed, 2018-05-16 at 12:43 +0200, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 11:15:09AM +0200, Rafael J. Wysocki wrote:
> 
> > So if the "cached" thing is the last "not boosted EPP", that's why
> > it
> > is not updated here.
> 
> Sure, I see what it does, just saying that naming and comments are
> wrong
> vs the actual code.
I will fix in the next revision.

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits
  2018-05-16  7:22   ` Peter Zijlstra
  2018-05-16  9:15     ` Rafael J. Wysocki
@ 2018-05-16 15:41     ` Srinivas Pandruvada
  1 sibling, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, 2018-05-16 at 09:22 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:04PM -0700, Srinivas Pandruvada wrote:
> > Setup necessary infrastructure to be able to boost HWP performance
> > on a
> > remote CPU. First initialize data structure to be able to use
> > smp_call_function_single_async(). The boost up function simply sets
> > HWP
> > min to HWP max value and EPP to 0. The boost down function simply
> > restores
> > to last cached HWP Request value.
> > 
> > To avoid reading HWP Request MSR during dynamic update, the HWP
> > Request
> > MSR value is cached in the local memory. This caching is done
> > whenever
> > HWP request MSR is modified during driver init or in the setpolicy()
> > callback
> > path.
> 
> The patch does two independent things; best to split that.
I had that split before, but thought no one was consuming the cached
value in that patch, so I combined them. If this is not a problem, it is
better to split the patch.

Thanks,
Srinivas

> 
> 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index f686bbe..dc7dfa9 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -221,6 +221,9 @@ struct global_params {
> >   *			preference/bias
> >   * @epp_saved:		Saved EPP/EPB during system suspend
> > or CPU offline
> >   *			operation
> > + * @hwp_req_cached:	Cached value of the last HWP request
> > MSR
> 
> That's simply not true given the code below.
> 
> > @@ -763,6 +768,7 @@ static void intel_pstate_hwp_set(unsigned int
> > cpu)
> >  		intel_pstate_set_epb(cpu, epp);
> >  	}
> >  skip_epp:
> > +	cpu_data->hwp_req_cached = value;
> >  	wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
> >  }
> >  
> > @@ -1381,6 +1387,39 @@ static void
> > intel_pstate_get_cpu_pstates(struct cpudata *cpu)
> >  	intel_pstate_set_min_pstate(cpu);
> >  }
> >  
> > +
> > +static inline void intel_pstate_hwp_boost_up(struct cpudata *cpu)
> > +{
> > +	u64 hwp_req;
> > +	u8 max;
> > +
> > +	max = (u8) (cpu->hwp_req_cached >> 8);
> > +
> > +	hwp_req = cpu->hwp_req_cached & ~GENMASK_ULL(31, 24);
> > +	hwp_req = (hwp_req & ~GENMASK_ULL(7, 0)) | max;
> > +
> > +	wrmsrl(MSR_HWP_REQUEST, hwp_req);
> > +}
> > +
> > +static inline void intel_pstate_hwp_boost_down(struct cpudata
> > *cpu)
> > +{
> > +	wrmsrl(MSR_HWP_REQUEST, cpu->hwp_req_cached);
> > +}
> 
> This is not a traditional msr shadow; that would be updated on every
> wrmsr. There is not a comment in sight explaining why this one is
> different.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost
  2018-05-16  6:49 ` [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Juri Lelli
@ 2018-05-16 15:43   ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 15:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, peterz, bp, lenb, rjw, mgorman, x86, linux-pm,
	viresh.kumar, juri.lelli, linux-kernel

Hi Juri,

On Wed, 2018-05-16 at 08:49 +0200, Juri Lelli wrote:
> Hi Srinivas,
> 
> On 15/05/18 21:49, Srinivas Pandruvada wrote:
> 
> [...]
> 
> > 
> > Peter Zijlstra (1):
> >   x86,sched: Add support for frequency invariance
> 
> Cool! I was going to ask Peter about this patch. You beat me to it.
> :)
> 
> I'll have a look at the set. BTW, just noticed that you Cc-ed me
> using
> my old address. Please use juri.lelli@redhat.com. ;)
I didn't realize that. From next revision I will change to this
address.

Thanks,
Srinivas


> 
> Best,
> 
> - Juri

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX
  2018-05-16  7:49   ` Peter Zijlstra
@ 2018-05-16 15:46     ` Srinivas Pandruvada
  2018-05-16 15:54       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, 2018-05-16 at 09:49 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada wrote:
> > Enable HWP boost on Skylake server platform.
> 
> Why not unconditionally enable it on everything HWP ?
Never noticed any performance difference on clients with HWP, nor has
anybody complained. Since clients use a single power domain, some other
CPU always gives the boost.

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16 15:19   ` Juri Lelli
@ 2018-05-16 15:47     ` Peter Zijlstra
  2018-05-16 16:31       ` Juri Lelli
  2018-05-16 15:58     ` Srinivas Pandruvada
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16 15:47 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Srinivas Pandruvada, tglx, mingo, bp, lenb, rjw, mgorman, x86,
	linux-pm, viresh.kumar, linux-kernel

On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:

> Anyway, FWIW I started testing this on an E5-2609 v3 and I'm not seeing
> hackbench regressions so far (running with schedutil governor).

https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors

Lists the E5 2609 v3 as not having turbo at all, which is basically a
best case scenario for this patch.

As I wrote earlier today; when turbo exists, like say the 2699, then
when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
things.
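
Spelling that out in capacity units: the E5-2699 v3's listed clocks are
2.3 GHz base and 3.6 GHz max turbo, so

	2.3 / 3.6 * 1024 ~= 654

i.e. a fully loaded core running at its base clock reports roughly
654/1024 of capacity under this scheme.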

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX
  2018-05-16 15:46     ` Srinivas Pandruvada
@ 2018-05-16 15:54       ` Peter Zijlstra
  2018-05-17  0:52         ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-16 15:54 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, May 16, 2018 at 08:46:24AM -0700, Srinivas Pandruvada wrote:
> On Wed, 2018-05-16 at 09:49 +0200, Peter Zijlstra wrote:
> > On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada wrote:
> > > Enable HWP boost on Skylake server platform.
> > 
> > Why not unconditionally enable it on everything HWP ?
> Never noticed any performance difference on clients with HWP, nor has
> anybody complained. Since clients use a single power domain, some other
> CPU always gives the boost.

Can't the hardware tell us about per-core stuff? Do we really have to
use tables for that?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16 15:19   ` Juri Lelli
  2018-05-16 15:47     ` Peter Zijlstra
@ 2018-05-16 15:58     ` Srinivas Pandruvada
  1 sibling, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 15:58 UTC (permalink / raw)
  To: Juri Lelli
  Cc: tglx, mingo, peterz, bp, lenb, rjw, mgorman, x86, linux-pm,
	viresh.kumar, linux-kernel

On Wed, 2018-05-16 at 17:19 +0200, Juri Lelli wrote:
> On 15/05/18 21:49, Srinivas Pandruvada wrote:
> > intel_pstate has two operating modes: active and passive. In
> > "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In
> > "active"
> > mode the utilization values from schedutil are not used and there is
> > a requirement from high performance computing use cases, not to
> > read
> > any APERF/MPERF MSRs. In this case no need to use CPU cycles for
> > frequency invariant accounting by reading APERF/MPERF MSRs.
> > With this change frequency invariant accounting is only enabled in
> > "passive" mode.
> > 
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
> > .com>
> > ---
> > [Note: The tick will be enabled later in the series when hwp
> > dynamic
> > boost is enabled]
> > 
> >  drivers/cpufreq/intel_pstate.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index 17e566af..f686bbe 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -2040,6 +2040,8 @@ static int
> > intel_pstate_register_driver(struct cpufreq_driver *driver)
> >  {
> >  	int ret;
> >  
> > +	x86_arch_scale_freq_tick_disable();
> > +
> >  	memset(&global, 0, sizeof(global));
> >  	global.max_perf_pct = 100;
> >  
> > @@ -2052,6 +2054,9 @@ static int
> > intel_pstate_register_driver(struct cpufreq_driver *driver)
> >  
> >  	global.min_perf_pct = min_perf_pct_min();
> >  
> > +	if (driver == &intel_cpufreq)
> > +		x86_arch_scale_freq_tick_enable();
> 
> This will unconditionally trigger the reading/calculation at each
> tick
> even though information is not actually consumed (e.g., running
> performance or any other governor), right? Do we want that?
Good point. I should call x86_arch_scale_freq_tick_disable() in
performance mode switch for active mode.

Thanks,
Srinivas

> 
> Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not
> seeing
> hackbench regressions so far (running with schedutil governor).
> 
> Best,
> 
> - Juri

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16 15:47     ` Peter Zijlstra
@ 2018-05-16 16:31       ` Juri Lelli
  2018-05-17 10:59         ` Juri Lelli
  0 siblings, 1 reply; 58+ messages in thread
From: Juri Lelli @ 2018-05-16 16:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srinivas Pandruvada, tglx, mingo, bp, lenb, rjw, mgorman, x86,
	linux-pm, viresh.kumar, linux-kernel

On 16/05/18 17:47, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> 
> > Anyway, FWIW I started testing this on an E5-2609 v3 and I'm not seeing
> > hackbench regressions so far (running with schedutil governor).
> 
> https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> 
> Lists the E5 2609 v3 as not having turbo at all, which is basically a
> best case scenario for this patch.
> 
> As I wrote earlier today; when turbo exists, like say the 2699, then
> when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> things.

Indeed. I was mostly trying to see if adding this to the tick might
introduce noticeable overhead.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16  9:07       ` Rafael J. Wysocki
@ 2018-05-16 17:32         ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 17:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Juri Lelli, Linux Kernel Mailing List

On Wed, 2018-05-16 at 11:07 +0200, Rafael J. Wysocki wrote:
> On Wed, May 16, 2018 at 9:29 AM, Peter Zijlstra <peterz@infradead.org
> > wrote:
> > On Wed, May 16, 2018 at 09:16:40AM +0200, Peter Zijlstra wrote:
> > > On Tue, May 15, 2018 at 09:49:03PM -0700, Srinivas Pandruvada
> > > wrote:
> > > > intel_pstate has two operating modes: active and passive. In
> > > > "active"
> > > > mode, the in-built scaling governor is used and in "passive"
> > > > mode,
> > > > the driver can be used with any governor like "schedutil". In
> > > > "active"
> > > > mode the utilization values from schedutil are not used and
> > > > there is
> > > > a requirement from high performance computing use cases, not to
> > > > read
> > > > any APERF/MPERF MSRs. In this case no need to use CPU cycles
> > > > for
> > > > frequency invariant accounting by reading APERF/MPERF MSRs.
> > > > With this change frequency invariant accounting is only enabled in
> > > > "passive" mode.
> > > 
> > > WTH is active/passive? Is passive when we select performance
> > > governor?
> > 
> > Bah, I cannot read it seems. active is when we use the intel_pstate
> > governor and passive is when we use schedutil and only use
> > intel_pstate
> > as a driver.
> > 
> > > Also; you have to explain why using APERF/MPERF is bad in that
> > > case. Why
> > > do they care if we read those MSRs during the tick?
> > 
> > That still stands.. this needs to be properly explained.
> 
> I guess this is from the intel_pstate perspective only.
> 
> The active mode is only used with HWP, so intel_pstate doesn't look
> at
> the utilization (in any form) in the passive mode today.
> 
> Still, there are other reasons for PELT to be scale-invariant, so ...
Not sure about the use case in active mode other than dynamic HWP boost
later in this series. If any, I can remove this patch.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake
  2018-05-16  7:37   ` Peter Zijlstra
@ 2018-05-16 17:55     ` Srinivas Pandruvada
  2018-05-17  8:19       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, 2018-05-16 at 09:37 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:06PM -0700, Srinivas Pandruvada wrote:

[...]

> > 
@@ -258,6 +261,8 @@ struct cpudata {
> >  	s16 epp_saved;
> >  	u64 hwp_req_cached;
> >  	call_single_data_t csd;
> > +	bool hwp_boost_active;
> > +	u64 last_io_update;
> 
> This structure has abysmal layout; you should look at that.
> Also, mandatory complaint about using _Bool in composites.
> 
I will take care of this in the next version.

[...]

> > +/* Default: This will be roughly around P1 on SKX */
> > +#define BOOST_PSTATE_THRESHOLD	(SCHED_CAPACITY_SCALE / 2)
> 
> Yuck.. why the need to hardcode this? Can't you simply read the P1
> value
> for the part at hand?
Done later in the series. So need to reorder.

[..]
> 
+	if (cpu->iowait_boost) {
> > +		cpu->hwp_boost_active = true;
> > +		if (smp_processor_id() == cpu->cpu)
> > +			intel_pstate_hwp_boost_up(cpu);
> > +		else
> > +			smp_call_function_single_async(cpu->cpu,
> > &cpu->csd);
> > +	}
> >  }
> 
> Hurmph, this looks like you're starting to duplicate the schedutil
> iowait logic. Why didn't you copy the gradual boosting thing?
I tried what we implemented in intel_pstate in legacy mode, which gave
the best performance for servers (ramp up fast, ramp down slowly).
This caused regressions on some workloads, as each time we change
HWP_REQUEST we spend 1200+ cycles issuing the MSR write. So we keep it at
max until the timeout. Even keeping it at P1 didn't help the power numbers.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake
  2018-05-16  9:45   ` Rafael J. Wysocki
@ 2018-05-16 19:28     ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 19:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov,
	Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, 2018-05-16 at 11:45 +0200, Rafael J. Wysocki wrote:

[...]
> 
> > +               if (time_before64(time, cpu->last_io_update + 2 *
> > TICK_NSEC) &&
> > +                   intel_pstate_check_boost_threhold(cpu))
> > +                       cpu->iowait_boost = true;
> > +
> > +               cpu->last_io_update = time;
> > +               cpu->last_update = time;
> 
> This is a shared data structure and it gets updated without
> synchronization, unless I'm missing something.
Good point.

> 
> How much does the cross-CPU case matter?
I was under the impression that the IOWAIT flag is set on local CPU calls
only, but I see the IOWAIT flag set for remote CPUs all the time. So we
will miss boosts if we don't take care of cross-CPU calls.

But I can lump them into the smp async call for all cross-CPU updates
to avoid the synchronization issue.

> 
> > +       }
> > 
> > +       /*
> > +        * If the boost is active, we will remove it after timeout
> > on local
> > +        * CPU only.
> > +        */
> > +       if (cpu->hwp_boost_active) {
> > +               if (smp_processor_id() == cpu->cpu) {
> > +                       bool expired;
> > +
> > +                       expired = time_after64(time, cpu-
> > >last_update +
> > +                                              (hwp_boost_hold_time
> > _ms * NSEC_PER_MSEC));
> > +                       if (expired) {
> > +                               intel_pstate_hwp_boost_down(cpu);
> > +                               cpu->hwp_boost_active = false;
> > +                               cpu->iowait_boost = false;
> > +                       }
> > +               }
> > +               return;
> > +       }
> > +
> > +       cpu->last_update = time;
> > +
> > +       if (cpu->iowait_boost) {
> > +               cpu->hwp_boost_active = true;
> > +               if (smp_processor_id() == cpu->cpu)
> > +                       intel_pstate_hwp_boost_up(cpu);
> > +               else
> > +                       smp_call_function_single_async(cpu->cpu,
> > &cpu->csd);
> > +       }
> >  }
> > 
> >  static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu)
> > --
> > 2.9.5
> > 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate
  2018-05-16  9:49   ` Rafael J. Wysocki
@ 2018-05-16 20:59     ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 20:59 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov,
	Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar, Juri Lelli,
	Linux Kernel Mailing List

On Wed, 2018-05-16 at 11:49 +0200, Rafael J. Wysocki wrote:
> On Wed, May 16, 2018 at 6:49 AM, Srinivas Pandruvada
> <srinivas.pandruvada@linux.intel.com> wrote:
> > When a busy task migrates to a new CPU, boost HWP performance to max.
> > This helps workloads on servers with per core P-states, which saturate
> > all CPUs and then migrate frequently. But changing limits has the
> > extra overhead of issuing a new HWP Request MSR write, which takes
> > 1000+ cycles. So this change rate-limits setting the HWP Request MSR.
> > Rate control in setting HWP Requests:
> > - If the current performance is around P1, simply ignore.
> > - Once set, wait till hold time to remove the boost. While the boost
> >   is on, if another flag is notified, it will prolong the boost.
> > - The migrating task needs to have some utilization above the threshold
> >   utilization, which will trigger a P-state above minimum.
> > 
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
> > .com>
> > ---
> >  drivers/cpufreq/intel_pstate.c | 37
> > ++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 36 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index d418265..ec455af 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -227,6 +227,7 @@ struct global_params {
> >   *                     defines callback and arguments
> >   * @hwp_boost_active:  HWP performance is boosted on this CPU
> >   * @last_io_update:    Last time when IO wake flag was set
> > + * @migrate_hint:      Set when scheduler indicates thread
> > migration
> >   *
> >   * This structure stores per CPU instance data for all CPUs.
> >   */
> > @@ -263,6 +264,7 @@ struct cpudata {
> >         call_single_data_t csd;
> >         bool hwp_boost_active;
> >         u64 last_io_update;
> > +       bool migrate_hint;
> 
> Why do you need this in the struct?
> 
> It looks like it only is used locally in
> intel_pstate_update_util_hwp().
It is required as we just set a flag in the cpudata of the new CPU to note
that a task has migrated onto it. We update performance on the next tick,
on the local CPU only, once the rq utilization has been updated, to check
whether it is worth boosting.

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values
  2018-05-16  6:40   ` Viresh Kumar
@ 2018-05-16 22:25     ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 22:25 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: tglx, mingo, peterz, bp, lenb, rjw, mgorman, x86, linux-pm,
	juri.lelli, linux-kernel

On Wed, 2018-05-16 at 12:10 +0530, Viresh Kumar wrote:
> On 15-05-18, 21:49, Srinivas Pandruvada wrote:
> > Added cpufreq_get_sched_util() to get the CFS, DL and max
> > utilization
> > values for a CPU. This is required for getting utilization values
> > for cpufreq drivers outside of kernel/sched folder.
> > 
> > Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel
> > .com>
> > ---
> >  include/linux/sched/cpufreq.h |  2 ++
> >  kernel/sched/cpufreq.c        | 23 +++++++++++++++++++++++
> >  2 files changed, 25 insertions(+)
> > 
> > diff --git a/include/linux/sched/cpufreq.h
> > b/include/linux/sched/cpufreq.h
> > index 5966744..a366600 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -20,6 +20,8 @@ void cpufreq_add_update_util_hook(int cpu, struct
> > update_util_data *data,
> >                         void (*func)(struct update_util_data *data,
> > u64 time,
> >  				    unsigned int flags));
> >  void cpufreq_remove_update_util_hook(int cpu);
> > +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> > +			    unsigned long *util_dl, unsigned long
> > *max);
> >  #endif /* CONFIG_CPU_FREQ */
> >  
> >  #endif /* _LINUX_SCHED_CPUFREQ_H */
> > diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> > index 5e54cbc..36e2839 100644
> > --- a/kernel/sched/cpufreq.c
> > +++ b/kernel/sched/cpufreq.c
> > @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
> >  	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu),
> > NULL);
> >  }
> >  EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> > +
> > +/**
> > + * cpufreq_get_sched_util - Get utilization values.
> > + * @cpu: The targeted CPU.
> > + *
> > + * Get the CFS, DL and max utilization.
> > + * This function allows cpufreq driver outside the kernel/sched to
> > access
> > + * utilization value for a CPUs run queue.
> > + */
> > +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> > +			    unsigned long *util_dl, unsigned long
> > *max)
> > +{
> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> 
> What will happen when schedutil is compiled in the kernel but
> ondemand
> is the one getting used currently ?
It should still work. The only reason I had to use the ifdef is compile
issues when CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not defined.
The reason for that is that cpu_util_cfs() and cpu_util_dl() are defined
under #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL.
The actual code inside cpu_util_cfs() and cpu_util_dl() uses
rq->cfs.avg.util_avg and rq->dl.running_bw, which are updated
irrespective of the cpufreq governor. It may be better to remove the
ifdefs around cpu_util_dl() and cpu_util_cfs().
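
Roughly what that suggestion amounts to (simplified bodies; the real
helpers in kernel/sched/sched.h also handle util_est, this is only to
show that they read per-rq fields maintained regardless of the
governor):

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
	return READ_ONCE(rq->cfs.avg.util_avg);
}

static inline unsigned long cpu_util_dl(struct rq *rq)
{
	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}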

Thanks,
Srinivas
> 
> > +	struct rq *rq = cpu_rq(cpu);
> > +
> > +	*max = arch_scale_cpu_capacity(NULL, cpu);
> > +	*util_cfs = cpu_util_cfs(rq);
> > +	*util_dl  = cpu_util_dl(rq);
> > +#else
> > +	*util_cfs = *util_dl = 1;
> > +#endif
> > +}
> > +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
> > -- 
> > 2.9.5
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values
  2018-05-16  8:11   ` Peter Zijlstra
@ 2018-05-16 22:40     ` Srinivas Pandruvada
  2018-05-17  7:50       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-16 22:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, 2018-05-16 at 10:11 +0200, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 09:49:07PM -0700, Srinivas Pandruvada wrote:
> > --- a/kernel/sched/cpufreq.c
> > +++ b/kernel/sched/cpufreq.c
> > @@ -60,3 +60,26 @@ void cpufreq_remove_update_util_hook(int cpu)
> >  	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu),
> > NULL);
> >  }
> >  EXPORT_SYMBOL_GPL(cpufreq_remove_update_util_hook);
> > +
> > +/**
> > + * cpufreq_get_sched_util - Get utilization values.
> > + * @cpu: The targeted CPU.
> > + *
> > + * Get the CFS, DL and max utilization.
> > + * This function allows cpufreq driver outside the kernel/sched to
> > access
> > + * utilization value for a CPUs run queue.
> > + */
> > +void cpufreq_get_sched_util(int cpu, unsigned long *util_cfs,
> > +			    unsigned long *util_dl, unsigned long
> > *max)
> > +{
> > +#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > +	struct rq *rq = cpu_rq(cpu);
> > +
> > +	*max = arch_scale_cpu_capacity(NULL, cpu);
> > +	*util_cfs = cpu_util_cfs(rq);
> > +	*util_dl  = cpu_util_dl(rq);
> > +#else
> > +	*util_cfs = *util_dl = 1;
> > +#endif
> > +}
> > +EXPORT_SYMBOL_GPL(cpufreq_get_sched_util);
> 
> So I _really_ hate this... I'd much rather you make schedutil work
> with
> the hwp passive stuff.
Are you not happy with the ifdefs, or with the utility function itself?
Can you explain more how this should be done?

Utilization values are not passed with the scheduler's update_util
callback, so intel_pstate needs access to rq->cfs.avg.util* and
rq->dl.running_bw.
The ifdefs can be removed if we remove the ifdefs around cpu_util_dl()
and cpu_util_cfs(), which do nothing special other than accessing
rq->cfs.avg.util* and rq->dl.running_bw. I think it is better to remove
them from these functions.

> 
> Also, afaict intel_pstate is bool, not tristate, so no need for an
> EXPORT at all.
Correct. I was just following the other interface functions for the
sched util hooks in this file, which use EXPORT.
But I will remove the EXPORT in the next revision.

Thanks,
Srinivas

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX
  2018-05-16 15:54       ` Peter Zijlstra
@ 2018-05-17  0:52         ` Srinivas Pandruvada
  0 siblings, 0 replies; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-17  0:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, 2018-05-16 at 17:54 +0200, Peter Zijlstra wrote:
> On Wed, May 16, 2018 at 08:46:24AM -0700, Srinivas Pandruvada wrote:
> > On Wed, 2018-05-16 at 09:49 +0200, Peter Zijlstra wrote:
> > > On Tue, May 15, 2018 at 09:49:11PM -0700, Srinivas Pandruvada
> > > wrote:
> > > > Enable HWP boost on Skylake server platform.
> > > 
> > > Why not unconditionally enable it on everything HWP ?
> > 
> > I never noticed any performance difference on clients with HWP, nor
> > did anybody complain. Since clients use a single power domain, some
> > other CPU always provides the boost.
> 
> Can't the hardware tell us about per-core stuff? Do we really have to
> use tables for that?
AFAIK not in a standard way across all platforms. But I will check on
this.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values
  2018-05-16 22:40     ` Srinivas Pandruvada
@ 2018-05-17  7:50       ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-17  7:50 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, May 16, 2018 at 03:40:39PM -0700, Srinivas Pandruvada wrote:
> On Wed, 2018-05-16 at 10:11 +0200, Peter Zijlstra wrote:

> > So I _really_ hate this... I'd much rather you make schedutil work
> > with the hwp passive stuff.

> Are you not happy with ifdefs are utility function itself? Can you
> explain more how this should be done?

The reason I hate it is because it exports data that should not be. We
created the schedutil governor as part of the scheduler such that we
could tightly couple it.

So if you want to use all that, get schedutil to work for you.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake
  2018-05-16 17:55     ` Srinivas Pandruvada
@ 2018-05-17  8:19       ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-17  8:19 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	juri.lelli, linux-kernel

On Wed, May 16, 2018 at 10:55:13AM -0700, Srinivas Pandruvada wrote:
> On Wed, 2018-05-16 at 09:37 +0200, Peter Zijlstra wrote:
> > On Tue, May 15, 2018 at 09:49:06PM -0700, Srinivas Pandruvada wrote:

> > Hurmph, this looks like you're starting to duplicate the schedutil
> > iowait logic. Why didn't you copy the gradual boosting thing?

> I tried what we implemented in intel_pstate in legacy mode, which gave
> the best performance for servers (faster ramp up and slow ramp down).
> This caused regressions on some workloads, because each time we call
> HWP_REQUEST we spend 1200+ cycles issuing the MSR. So I keep it at max
> for the timeout. Even keeping it at P1 didn't help the power numbers.

> > > - If the IO flags are notified multiple ticks apart, this may not be
> > >   an IO bound task. Otherwise an idle system gets periodic boosts for
> > >   one IO wake.

That is what made me think of the gradual boosting, because that is very
similar to the scenario that made us do it.

But as I said elsewhere, ideally we'd use schedutil exclusively and
reduce intel_pstate to a pure driver. The normal way to deal with slow
transitions is rate limiting. Another option, for the slow HWP MSR
driver, is to only issue the update when we reach max_freq. But if the
MSR ever gets cheaper we can issue it more often / with finer
granularity.
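
Something along these lines, purely illustrative (the names and the
interval are made up, not from this series):

static u64 last_hwp_req_value;
static u64 last_hwp_req_time;

static void hwp_request_rate_limited(u64 req, u64 now)
{
	const u64 min_interval_ns = 10 * NSEC_PER_MSEC;

	if (req == last_hwp_req_value)
		return;			/* nothing changed, skip the ~1200-cycle MSR write */

	if (now - last_hwp_req_time < min_interval_ns)
		return;			/* rate limit: too soon since the last write */

	wrmsrl(MSR_HWP_REQUEST, req);
	last_hwp_req_value = req;
	last_hwp_req_time = now;
}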

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-16 16:31       ` Juri Lelli
@ 2018-05-17 10:59         ` Juri Lelli
  2018-05-17 15:04           ` Juri Lelli
  0 siblings, 1 reply; 58+ messages in thread
From: Juri Lelli @ 2018-05-17 10:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srinivas Pandruvada, tglx, mingo, bp, lenb, rjw, mgorman, x86,
	linux-pm, viresh.kumar, linux-kernel

On 16/05/18 18:31, Juri Lelli wrote:
> On 16/05/18 17:47, Peter Zijlstra wrote:
> > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > 
> > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > > hackbench regressions so far (running with schedutil governor).
> > 
> > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > 
> > Lists the E5 2609 v3 as not having turbo at all, which is basically a
> > best case scenario for this patch.
> > 
> > As I wrote earlier today; when turbo exists, like say the 2699, then
> > when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> > things.
> 
> Indeed. I was mostly trying to see if adding this to the tick might
> introduce noticeable overhead.

Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following

# perf bench sched messaging --pipe --thread --group 2 --loop 20000

                      count       mean       std     min     50%       95%       99%     max
hostname kernel                                                                             
i5-5200U test_after    30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
         test_before   30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690

It might be interesting to see what happens when using a single CPU
only?

Also, I will look at how the util signals look when a single CPU is
busy..

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 10:59         ` Juri Lelli
@ 2018-05-17 15:04           ` Juri Lelli
  2018-05-17 15:41             ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Juri Lelli @ 2018-05-17 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srinivas Pandruvada, tglx, mingo, bp, lenb, rjw, mgorman, x86,
	linux-pm, viresh.kumar, linux-kernel

On 17/05/18 12:59, Juri Lelli wrote:
> On 16/05/18 18:31, Juri Lelli wrote:
> > On 16/05/18 17:47, Peter Zijlstra wrote:
> > > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > > 
> > > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm not seeing
> > > > hackbench regressions so far (running with schedutil governor).
> > > 
> > > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > > 
> > > Lists the E5 2609 v3 as not having turbo at all, which is basically a
> > > best case scenario for this patch.
> > > 
> > > As I wrote earlier today; when turbo exists, like say the 2699, then
> > > when we're busy we'll run at U=2.3/3.6 ~ .64, which might confuse
> > > things.
> > 
> > Indeed. I was mostly trying to see if adding this to the tick might
> > introduce noticeable overhead.
> 
> Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
> 
> # perf bench sched messaging --pipe --thread --group 2 --loop 20000
> 
>                       count       mean       std     min     50%       95%       99%     max
> hostname kernel                                                                             
> i5-5200U test_after    30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
>          test_before   30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690
> 
> It might be interesting to see what happens when using a single CPU
> only?
> 
> Also, I will look at how the util signals look when a single CPU is
> busy..

And this is showing where the problem is (as you were saying [1]):

https://gist.github.com/jlelli/f5438221186e5ed3660194e4f645fe93

Just look at the plots (and ignore setup).

First one (pid:4483) shows a single task busy running on a single CPU,
which seems to be able to sustain turbo for 5 sec. So task util reaches
~1024.

Second one (pid:4283) shows the same task, but running together with
3 other tasks (each one pinned to a different CPU). In this case util
saturates at ~943, which is due to the fact that max freq is still
considered to be the turbo one. :/

[1] https://marc.info/?l=linux-kernel&m=152646464017810&w=2

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 15:04           ` Juri Lelli
@ 2018-05-17 15:41             ` Srinivas Pandruvada
  2018-05-17 16:16               ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-17 15:41 UTC (permalink / raw)
  To: Juri Lelli, Peter Zijlstra
  Cc: tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm, viresh.kumar,
	linux-kernel

On Thu, 2018-05-17 at 17:04 +0200, Juri Lelli wrote:
> On 17/05/18 12:59, Juri Lelli wrote:
> > On 16/05/18 18:31, Juri Lelli wrote:
> > > On 16/05/18 17:47, Peter Zijlstra wrote:
> > > > On Wed, May 16, 2018 at 05:19:25PM +0200, Juri Lelli wrote:
> > > > 
> > > > > Anyway, FWIW I started testing this on a E5-2609 v3 and I'm
> > > > > not seeing
> > > > > hackbench regressions so far (running with schedutil
> > > > > governor).
> > > > 
> > > > https://en.wikipedia.org/wiki/Haswell_(microarchitecture)#Server_processors
> > > > 
> > > > Lists the E5 2609 v3 as not having turbo at all, which is
> > > > basically a
> > > > best case scenario for this patch.
> > > > 
> > > > As I wrote earlier today; when turbo exists, like say the 2699,
> > > > then
> > > > when we're busy we'll run at U=2.3/3.6 ~ .64, which might
> > > > confuse
> > > > things.
> > > 
> > > Indeed. I was mostly trying to see if adding this to the tick
> > > might
> > > introduce noticeable overhead.
> > 
> > Blindly testing on an i5-5200U (2.2/2.7 GHz) gave the following
> > 
> > # perf bench sched messaging --pipe --thread --group 2 --loop 20000
> > 
> >                       count       mean       std     min     50%       95%       99%     max
> > hostname kernel
> > i5-5200U test_after    30.0  13.843433  0.590605  12.369  13.810  14.85635  15.08205  15.127
> >          test_before   30.0  13.571167  0.999798  12.228  13.302  15.57805  16.40029  16.690
> > 
> > It might be interesting to see what happens when using a single CPU
> > only?
> > 
> > Also, I will look at how the util signals look when a single CPU is
> > busy..
> 
> And this is showing where the problem is (as you were saying [1]):
> 
> https://gist.github.com/jlelli/f5438221186e5ed3660194e4f645fe93
> 
> Just look at the plots (and ignore setup).
> 
> First one (pid:4483) shows a single task busy running on a single
> CPU,
> which seems to be able to sustain turbo for 5 sec. So task util
> reaches
> ~1024.
> 
> Second one (pid:4283) shows the same task, but running together with
> other 3 tasks (each one pinned to a different CPU). In this case util
> saturates at ~943, which is due to the fact that max freq is still
> considered to be the turbo one. :/


One more point to note. Even if we calculate some utilization based on
the frequency-invariant accounting and arrive at a P-state, we will not
be able to control any P-state in the turbo region (not even as a cap)
on several Intel processors using PERF_CTL MSRs.


> 
> [1] https://marc.info/?l=linux-kernel&m=152646464017810&w=2

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 15:41             ` Srinivas Pandruvada
@ 2018-05-17 16:16               ` Peter Zijlstra
  2018-05-17 16:42                 ` Srinivas Pandruvada
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-17 16:16 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Juri Lelli, tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm,
	viresh.kumar, linux-kernel

On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
> One more point to note. Even if we calculate some utilization based on
> the freq-invariant and arrive at a P-state, we will not be able to
> control any P-state in turbo region (not even as a cap) on several
> Intel processors using PERF_CTL MSRs.

Right, but don't we need to set the PERF_CTL to max P in order to access
the turbo bins? So we still need to compute a P state, but as soon as we
reach max P, we're done.

And it's not as if setting anything below max P is a firm setting either
anyway; it's hints all the way down.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 16:16               ` Peter Zijlstra
@ 2018-05-17 16:42                 ` Srinivas Pandruvada
  2018-05-17 16:56                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 58+ messages in thread
From: Srinivas Pandruvada @ 2018-05-17 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Juri Lelli, tglx, mingo, bp, lenb, rjw, mgorman, x86, linux-pm,
	viresh.kumar, linux-kernel

On Thu, 2018-05-17 at 18:16 +0200, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
> > One more point to note. Even if we calculate some utilization based
> > on
> > the freq-invariant and arrive at a P-state, we will not be able to
> > control any P-state in turbo region (not even as a cap) on several
> > Intel processors using PERF_CTL MSRs.
> 
> Right, but don't we need to set the PERF_CTL to max P in order to
> access
> the turbo bins?
Any PERF_CTL setting above what we call the "Turbo Activation Ratio"
(which can be less than the P1 read from the platform info MSR) will do
that on these systems (most clients since Ivy Bridge).
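
For illustration only (the field mask below is an assumption for this
sketch, not taken from the series):

static int read_turbo_activation_ratio(void)
{
	u64 tar;

	if (rdmsrl_safe(MSR_TURBO_ACTIVATION_RATIO, &tar))
		return -1;	/* MSR not available on this part */

	/* Any PERF_CTL request above this ratio hands control to turbo. */
	return tar & 0xff;
}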

>  So we still need to compute a P state, but as soon as we
> reach max P, we're done.
What will happen if we look at all core turbo as max and cap any
utilization above this to 1024?

> 
> And its not as if setting anything below max P is a firm setting
> either
> anyway, its hints all the way down.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 16:42                 ` Srinivas Pandruvada
@ 2018-05-17 16:56                   ` Rafael J. Wysocki
  2018-05-17 18:28                     ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Rafael J. Wysocki @ 2018-05-17 16:56 UTC (permalink / raw)
  To: Srinivas Pandruvada
  Cc: Peter Zijlstra, Juri Lelli, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar,
	Linux Kernel Mailing List

On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
> On Thu, 2018-05-17 at 18:16 +0200, Peter Zijlstra wrote:
>> On Thu, May 17, 2018 at 08:41:32AM -0700, Srinivas Pandruvada wrote:
>> > One more point to note. Even if we calculate some utilization based
>> > on
>> > the freq-invariant and arrive at a P-state, we will not be able to
>> > control any P-state in turbo region (not even as a cap) on several
>> > Intel processors using PERF_CTL MSRs.
>>
>> Right, but don't we need to set the PERF_CTL to max P in order to
>> access the turbo bins?
>
> Any PERF_CTL setting above what we call "Turbo Activation ratio" (which
> can be less than P1 read from platform info MSR) will do that on these
> systems (Most clients from Ivy bridge).
>
>>  So we still need to compute a P state, but as soon as we
>> reach max P, we're done.
>
> What will happen if we look at all core turbo as max and cap any
> utilization above this to 1024?

I was going to suggest that.

Otherwise, if we ever get (say) the max one-core turbo at any point
and the system only runs parallel workloads after that, it will appear
as underutilized.
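
A sketch of that clipping, using APERF/MPERF deltas (the names and the
exact scaling here are assumptions, not this series' code):

static unsigned long freq_scale_clipped(u64 aperf_delta, u64 mperf_delta,
					unsigned long base_freq_khz,
					unsigned long all_core_turbo_khz)
{
	/* Current frequency relative to the all-core (sustainable) turbo. */
	u64 cur_freq_khz = div64_u64(aperf_delta * base_freq_khz, mperf_delta);
	u64 scale = div64_u64(cur_freq_khz << SCHED_CAPACITY_SHIFT,
			      all_core_turbo_khz);

	/* One-core turbo excursions still read as fully utilized. */
	return min_t(unsigned long, scale, SCHED_CAPACITY_SCALE);
}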

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 16:56                   ` Rafael J. Wysocki
@ 2018-05-17 18:28                     ` Peter Zijlstra
  2018-05-18  7:36                       ` Rafael J. Wysocki
  2018-05-18 10:57                       ` Patrick Bellasi
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-17 18:28 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Srinivas Pandruvada, Juri Lelli, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Len Brown, Rafael J. Wysocki, Mel Gorman,
	the arch/x86 maintainers, Linux PM, Viresh Kumar,
	Linux Kernel Mailing List

On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada

> > What will happen if we look at all core turbo as max and cap any
> > utilization above this to 1024?
> 
> I was going to suggest that.

The basic premise behind all our frequency scaling is that there's a
linear relation between utilization and frequency, where u=1 gets us the
fastest.

Now, we all know this is fairly crude, but it is what we work with.
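
For reference, schedutil's utilization-to-frequency mapping is roughly
the following (a simplified sketch of get_next_freq(); the ~25% headroom
is the in-tree factor):

static unsigned long map_util_to_freq(unsigned long util, unsigned long max,
				      unsigned long max_freq)
{
	/* Linear in util, with ~25% headroom; the result is later clamped
	 * to the policy limits. */
	return (max_freq + (max_freq >> 2)) * util / max;
}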

OTOH, the whole premise of turbo is that you don't in fact know what the
fastest is, and in that respect setting u=1 at the guaranteed or
sustainable frequency makes sense.

The direct consequence of allowing clipping is that u=1 doesn't select
the highest frequency, but since we don't select anything anyway
(the p-code does that for us), all we really need is to have u=1 above
that turbo activation point you mentioned.

For parts where we have to directly select frequency this obviously
comes apart.

However, what happens when the sustainable freq drops below our initial
'max'? Imagine us dropping below the all-core turbo because of AVX. Then
we're back to running at u<1 at full tilt.

Or for mobile parts, the sustainable frequency could drop because of
severe thermal limits. Now I _think_ we have the possibility of getting
interrupts and reading the new guaranteed frequency, so we could
re-gauge.

So in theory I think it works, in practise we need to always be able to
find the actual max -- be it all-core turbo, AVX or thermal constrained
frequency. Can we do that in all cases?


I need to go back to see what the complaints against Vincent's proposal
were, because I really liked the fact that it did away with all this.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 18:28                     ` Peter Zijlstra
@ 2018-05-18  7:36                       ` Rafael J. Wysocki
  2018-05-18 10:57                       ` Patrick Bellasi
  1 sibling, 0 replies; 58+ messages in thread
From: Rafael J. Wysocki @ 2018-05-18  7:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Juri Lelli,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Linux Kernel Mailing List

On Thu, May 17, 2018 at 8:28 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
>> On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
>
>> > What will happen if we look at all core turbo as max and cap any
>> > utilization above this to 1024?
>>
>> I was going to suggest that.
>
> To the basic premise behind all our frequency scaling is that there's a
> linear relation between utilization and frequency, where u=1 gets us the
> fastest.
>
> Now, we all know this is fairly crude, but it is what we work with.
>
> OTOH, the whole premise of turbo is that you don't in fact know what the
> fastest is, and in that respect setting u=1 at the guaranteed or
> sustainable frequency makes sense.
>
> The direct concequence of allowing clipping is that u=1 doesn't select
> the highest frequency, but since we don't select anything anyway
> (p-code does that for us) all we really need is to have u=1 above that
> turbo activation point you mentioned.
>
> For parts where we have to directly select frequency this obviously
> comes apart.
>
> However; what happens when the sustainable freq drops below our initial
> 'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
> we're back to running at u<1 at full tilt.
>
> Or for mobile parts, the sustainable frequency could drop because of
> severe thermal limits. Now I _think_ we have the possibility for getting
> interrupts and reading the new guaranteed frequency, so we could
> re-guage.
>
> So in theory I think it works, in practise we need to always be able to
> find the actual max -- be it all-core turbo, AVX or thermal constrained
> frequency. Can we do that in all cases?

We should be, but unfortunately that's a dynamic thing.

For example, the AVX limit only kicks in when AVX instructions are executed.

> I need to go back to see what the complains against Vincent's proposal
> were, because I really liked the fact that it did away with all this.

That would be the best way to deal with this mess, I agree.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-17 18:28                     ` Peter Zijlstra
  2018-05-18  7:36                       ` Rafael J. Wysocki
@ 2018-05-18 10:57                       ` Patrick Bellasi
  2018-05-18 11:29                         ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Patrick Bellasi @ 2018-05-18 10:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Juri Lelli,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Linux Kernel Mailing List

On 17-May 20:28, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 06:56:37PM +0200, Rafael J. Wysocki wrote:
> > On Thu, May 17, 2018 at 6:42 PM, Srinivas Pandruvada
> 
> > > What will happen if we look at all core turbo as max and cap any
> > > utilization above this to 1024?
> > 
> > I was going to suggest that.
> 
> To the basic premise behind all our frequency scaling is that there's a
> linear relation between utilization and frequency, where u=1 gets us the
> fastest.
> 
> Now, we all know this is fairly crude, but it is what we work with.
> 
> OTOH, the whole premise of turbo is that you don't in fact know what the
> fastest is, and in that respect setting u=1 at the guaranteed or
> sustainable frequency makes sense.

Looking at it from the FAIR class standpoint, we can also argue that
although you know that the max possible utilization is 1024, you are not
always guaranteed to reach it because of RT and interrupt pressure.
Or in big.LITTLE systems, because of the arch scaling factor.

Is it not something quite similar to the problem of having
   "some not always available OPPs"
?

To track these "capacity limitations" we already have the two
different concepts of cpu_capacity_orig and cpu_capacity.

Are not "thermal constraints" and "some not always available OPPs"
just another form of "capacity limitations".
They are:
 - transient
   exactly like RT and Interrupt pressure
 - HW related
   which is the main different wrt RT and Interrupt pressure

But, apart from this last point (i.e. they have an HW related "nature"),
IMHO they seem quite similar concepts... which are already addressed,
although only within the FAIR class perhaps.

Thus, my simple (maybe dumb) questions are:
- why can't we just fold turbo boost frequency into the existing concepts?
- what are the limitations of such a "simple" approach?

IOW: utilization always measures wrt the maximum possible capacity
(i.e. max turbo mode) and then there is a way to know what is, on each
CPU and at every decision time, the actual "transient maximum" we can
expect to reach for a "reasonable" future time.

> The direct concequence of allowing clipping is that u=1 doesn't select
> the highest frequency, but since we don't select anything anyway
> (p-code does that for us) all we really need is to have u=1 above that
> turbo activation point you mentioned.

If clipping means that we can also have >1024 values which are just
clamped at read/get time, this could maybe have some side effects on the
math (signal propagation across TGs) and on type range control?

> For parts where we have to directly select frequency this obviously
> comes apart.

Moreover, utilization is not (and will not be) just for frequency
selection. We should also take the task placement perspective into
account.

On that side, I personally like the definition _I think_ we have now:

  utilization is the amount of maximum capacity used

where maximum is a constant defined at boot time and representing the
absolute max you can expect to get...
... apart from "transient capacity limitations".

Scaling the maximum depending on these transient conditions reads to me
like "changing the scale". Which I fear will make it more difficult, for
example, to compare in space (different CPUs) or time (different
scheduler events) what a utilization measure means.

For example, if you have a busy loop running on a CPU which is subject
to RT pressure, you will read a <100% utilization (let's say 60%).
Still, it's interesting to know that maybe we can move that task to an
IDLE CPU to run it faster.

Should not be the same for turbo boost?

If the same task is generating only 60% utilization because turbo boost
OPPs are not available, should it not still be useful to see that there
is, for example, another CPU (maybe on a different NUMA node) which is
IDLE and cold, where we can move the task to exploit the 100% capacity
provided by the topmost turbo boost mode?

> However; what happens when the sustainable freq drops below our initial
> 'max'? Imagine us dropping below the all-core-turbo because of AVX. Then
> we're back to running at u<1 at full tilt.
> 
> Or for mobile parts, the sustainable frequency could drop because of
> severe thermal limits. Now I _think_ we have the possibility for getting
> interrupts and reading the new guaranteed frequency, so we could
> re-guage.
> 
> So in theory I think it works, in practise we need to always be able to
> find the actual max -- be it all-core turbo, AVX or thermal constrained
> frequency. Can we do that in all cases?
> 
> 
> I need to go back to see what the complains against Vincent's proposal
> were, because I really liked the fact that it did away with all this.

AFAIR Vincent's proposal was mainly addressing a different issue: fast
ramp-up... I don't recall if there was any specific intent to cover
the issue of "transient maximum capacities".

And still, based on my (maybe bogus) reasoning above, I think we are
discussing here a slightly different problem which already has a
(maybe partial) solution.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-18 10:57                       ` Patrick Bellasi
@ 2018-05-18 11:29                         ` Peter Zijlstra
  2018-05-18 13:33                           ` Patrick Bellasi
  2018-05-18 14:09                           ` Valentin Schneider
  0 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2018-05-18 11:29 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Juri Lelli,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Linux Kernel Mailing List

On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> Thus, my simple (maybe dumb) questions are:
> - why can't we just fold turbo boost frequency into the existing concepts?
> - what are the limitations of such a "simple" approach?

Perhaps... but does this not further complicate the whole capacity vs
util thing we already have in say the misfit patches? And the
util_fits_capacity() thing from the EAS ones.

The thing is, we either need to dynamically scale the util or the
capacity or both. I think for Thermal there are patches out there that
drop the capacity.

But we'd then have to do the same for turbo/vector and all the other
stuff as well. Otherwise we risk things like running at low U with 0%
idle and not triggering the tipping point between eas and regular
balancing.

So either way around we need to know the 'true' max, either to fudge
util or to fudge capacity. And I'm not sure we can know in some of these
cases :/

And while Vincent's patches might have been inspired by another problem,
they do have the effect of always allowing util to go to 1, which is
nice for this.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-18 11:29                         ` Peter Zijlstra
@ 2018-05-18 13:33                           ` Patrick Bellasi
  2018-05-30 16:57                             ` Patrick Bellasi
  2018-05-18 14:09                           ` Valentin Schneider
  1 sibling, 1 reply; 58+ messages in thread
From: Patrick Bellasi @ 2018-05-18 13:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Juri Lelli,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Linux Kernel Mailing List

On 18-May 13:29, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> > Thus, my simple (maybe dumb) questions are:
> > - why can't we just fold turbo boost frequency into the existing concepts?
> > - what are the limitations of such a "simple" approach?
> 
> Perhaps... but does this not further complicate the whole capacity vs
> util thing we already have in say the misfit patches?

Not sure about that...

> And the  util_fits_capacity() thing from the EAS ones.

In this case instead, if we can track somehow (not saying we can)
what is the currently available "transient maximum capacity"...
then a util_fits_capacity() should just look at that.

If the transient capacity is already folded into cpu_capacity, as it
is now for RT and IRQ pressure, then likely we don't have to change
anything.

> The thing is, we either need to dynamically scale the util or the
> capacity or both. I think for Thermal there are patches out there that
> drop the capacity.

Not sure... but I would feel more comfortable with something which caps
the maximum capacity. Meaning, eventually you can fill up the maximum
possible capacity only "up to" a given value, because of thermal or
other reasons; most of the scheduler maybe doesn't even have to know why.

> But we'd then have to do the same for turbo/vector and all the other
> stuff as well. Otherwise we risk things like running at low U with 0%
> idle and not triggering the tipping point between eas and regular
> balancing.

Interacting with the tipping point and/or OPP changes is indeed an
interesting side of the problem I was not considering so far...

But again, could the tipping point not be defined as a delta with
respect to the "transient maximum capacity"?

> So either way around we need to know the 'true' max, either to fudge
> util or to fudge capacity.

Right, but what I see from a conceptual standpoint is something like:

     +--+--+   cpu_capacity_orig (CONSTANT at boot time)
     |  |  |
     |  |  |       HW generated constraints
     |  v  |
     +-----+   cpu_capacity_max (depending on thermal/turbo boost)
     |  |  |
     |  |  |       SW generated constraints
     |  v  |
     +-----+   cpu_capacity (depending on RT/IRQ pressure)
     |  |  |
     |  |  |       tipping point delta
     +--v--+
     |     |   Energy Aware mode available capacity
     +-----+

Where all the wakeup/load-balance heuristics are updated to properly
consider the cpu_capacity_max metric whenever they need to know the
maximum speed we can currently reach on a CPU.
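
A tiny sketch of how that hierarchy could be represented (the
cpu_capacity_max level and all the names here are hypothetical, not
existing kernel fields):

struct capacity_levels {
	unsigned long capacity_orig;	/* constant, set at boot */
	unsigned long capacity_max;	/* capped by thermal / turbo availability */
	unsigned long capacity;		/* further reduced by RT/IRQ pressure */
};

static void refresh_capacity(struct capacity_levels *c,
			     unsigned long hw_cap,
			     unsigned long rt_irq_pressure)
{
	c->capacity_max = min(c->capacity_orig, hw_cap);
	c->capacity = c->capacity_max > rt_irq_pressure ?
		      c->capacity_max - rt_irq_pressure : 0;
}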

> And I'm not sure we can know in some of these cases :/

Right, this scheme will eventually work only under the hypothesis that
"somehow" we can update cpu_capacity_max from HW events.

Not entirely sure that's possible and/or at which time granularity on
all different platforms.

> And while Vincent's patches might have been inspired by another problem,
> they do have the effect of always allowing util to go to 1, which is
> nice for this.

Sure, that's a nice point, but I still have the feeling that always
reaching u=1 can defeat other interesting properties of a task.
For example, comparing task requirements on different CPUs and/or at
different times plays a big role for energy-aware task placement
decisions.

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-18 11:29                         ` Peter Zijlstra
  2018-05-18 13:33                           ` Patrick Bellasi
@ 2018-05-18 14:09                           ` Valentin Schneider
  1 sibling, 0 replies; 58+ messages in thread
From: Valentin Schneider @ 2018-05-18 14:09 UTC (permalink / raw)
  To: Peter Zijlstra, Patrick Bellasi
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Juri Lelli,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Linux Kernel Mailing List

Hi,

On 18/05/18 12:29, Peter Zijlstra wrote:
> On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
>> Thus, my simple (maybe dumb) questions are:
>> - why can't we just fold turbo boost frequency into the existing concepts?
>> - what are the limitations of such a "simple" approach?
> 
> Perhaps... but does this not further complicate the whole capacity vs
> util thing we already have in say the misfit patches?

What do you mean by that ? Bear in mind, I'm a complete stranger to turbo
boost so I fail to see why, as Patrick puts it, it can't fit within the
existing concepts (i.e. max util is 1024 but is only reachable when turbo
boosted).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting
  2018-05-18 13:33                           ` Patrick Bellasi
@ 2018-05-30 16:57                             ` Patrick Bellasi
  0 siblings, 0 replies; 58+ messages in thread
From: Patrick Bellasi @ 2018-05-30 16:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Srinivas Pandruvada, Juri Lelli,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Len Brown,
	Rafael J. Wysocki, Mel Gorman, the arch/x86 maintainers,
	Linux PM, Viresh Kumar, Linux Kernel Mailing List

Hi Peter,
maybe you missed this previous response of mine:
   20180518133353.GO30654@e110439-lin
?

I would like to have your thoughts about the concept of "transient
maximum capacity" I was describing...

On 18-May 14:33, Patrick Bellasi wrote:
> On 18-May 13:29, Peter Zijlstra wrote:
> > On Fri, May 18, 2018 at 11:57:42AM +0100, Patrick Bellasi wrote:
> > > Thus, my simple (maybe dumb) questions are:
> > > - why can't we just fold turbo boost frequency into the existing concepts?
> > > - what are the limitations of such a "simple" approach?
> > 
> > Perhaps... but does this not further complicate the whole capacity vs
> > util thing we already have in say the misfit patches?
> 
> Not sure about that...
> 
> > And the  util_fits_capacity() thing from the EAS ones.
> 
> In this case instead, if we can track somehow (not saying we can)
> what is the currently available "transient maximum capacity"...
> then a util_fits_capacity() should just look at that.
> 
> If the transient capacity is already folded into cpu_capacity, as it
> is now for RT and IRQ pressure, then likely we don't have to change
> anything.
> 
> > The thing is, we either need to dynamically scale the util or the
> > capacity or both. I think for Thermal there are patches out there that
> > drop the capacity.
> 
> Not sure... but I would feel more comfortable by something which caps
> the maximum capacity. Meaning, eventually you can fill up the maximum
> possible capacity only "up to" a given value, because of thermal or other
> reasons most of the scheduler maybe doesn't even have to know why?
> 
> > But we'd then have to do the same for turbo/vector and all the other
> > stuff as well. Otherwise we risk things like running at low U with 0%
> > idle and not triggering the tipping point between eas and regular
> > balancing.
> 
> Interacting with the tipping point and/or OPP changes is indeed an
> interesting side of the problem I was not considering so far...
> 
> But again, the tipping point could not be defined as a delta
> with respect to the "transient maximum capacity" ?
> 
> > So either way around we need to know the 'true' max, either to fudge
> > util or to fudge capacity.
> 
> Right, but what I see from a concepts standpoint is something like:
> 
>      +--+--+   cpu_capacity_orig (CONSTANT at boot time)
>      |  |  |
>      |  |  |       HW generated constraints
>      |  v  |
>      +-----+   cpu_capacity_max (depending on thermal/turbo boost)
>      |  |  |
>      |  |  |       SW generated constraints
>      |  v  |
>      +-----+   cpu_capacity (depending on RT/IRQ pressure)
>      |  |  |
>      |  |  |       tipping point delta
>      +--v--+
>      |     |   Energy Aware mode available capacity
>      +-----+
> 
> Where all the wkp/lb heuristics are updated to properly consider the
> cpu_capacity_max metrics whenever it comes to know what is the max
> speed we can reach now on a CPU.
> 
> > And I'm not sure we can know in some of these cases :/
> 
> Right, this schema will eventually work only under the hypothesis that
> "somehow" we can update cpu_capacity_max from HW events.
> 
> Not entirely sure that's possible and/or at which time granularity on
> all different platforms.
> 
> > And while Vincent's patches might have been inspired by another problem,
> > they do have the effect of always allowing util to go to 1, which is
> > nice for this.
> 
> Sure, that's a nice point, but still I have the feeling that always
> reaching u=1 can defeat other interesting properties of a task,
> For example, comparing task requirements in different CPUs and/or at
> different times, which plays a big role for energy aware task
> placement decisions.
> 
> -- 
> #include <best/regards.h>
> 
> Patrick Bellasi

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 58+ messages in thread

Thread overview: 58+ messages
2018-05-16  4:49 [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance Srinivas Pandruvada
2018-05-16  9:56   ` Peter Zijlstra
2018-05-16  4:49 ` [RFC/RFT] [PATCH 02/10] cpufreq: intel_pstate: Conditional frequency invariant accounting Srinivas Pandruvada
2018-05-16  7:16   ` Peter Zijlstra
2018-05-16  7:29     ` Peter Zijlstra
2018-05-16  9:07       ` Rafael J. Wysocki
2018-05-16 17:32         ` Srinivas Pandruvada
2018-05-16 15:19   ` Juri Lelli
2018-05-16 15:47     ` Peter Zijlstra
2018-05-16 16:31       ` Juri Lelli
2018-05-17 10:59         ` Juri Lelli
2018-05-17 15:04           ` Juri Lelli
2018-05-17 15:41             ` Srinivas Pandruvada
2018-05-17 16:16               ` Peter Zijlstra
2018-05-17 16:42                 ` Srinivas Pandruvada
2018-05-17 16:56                   ` Rafael J. Wysocki
2018-05-17 18:28                     ` Peter Zijlstra
2018-05-18  7:36                       ` Rafael J. Wysocki
2018-05-18 10:57                       ` Patrick Bellasi
2018-05-18 11:29                         ` Peter Zijlstra
2018-05-18 13:33                           ` Patrick Bellasi
2018-05-30 16:57                             ` Patrick Bellasi
2018-05-18 14:09                           ` Valentin Schneider
2018-05-16 15:58     ` Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 03/10] cpufreq: intel_pstate: Utility functions to boost HWP performance limits Srinivas Pandruvada
2018-05-16  7:22   ` Peter Zijlstra
2018-05-16  9:15     ` Rafael J. Wysocki
2018-05-16 10:43       ` Peter Zijlstra
2018-05-16 15:39         ` Srinivas Pandruvada
2018-05-16 15:41     ` Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 04/10] cpufreq: intel_pstate: Add update_util_hook for HWP Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 05/10] cpufreq: intel_pstate: HWP boost performance on IO Wake Srinivas Pandruvada
2018-05-16  7:37   ` Peter Zijlstra
2018-05-16 17:55     ` Srinivas Pandruvada
2018-05-17  8:19       ` Peter Zijlstra
2018-05-16  9:45   ` Rafael J. Wysocki
2018-05-16 19:28     ` Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 06/10] cpufreq / sched: Add interface to get utilization values Srinivas Pandruvada
2018-05-16  6:40   ` Viresh Kumar
2018-05-16 22:25     ` Srinivas Pandruvada
2018-05-16  8:11   ` Peter Zijlstra
2018-05-16 22:40     ` Srinivas Pandruvada
2018-05-17  7:50       ` Peter Zijlstra
2018-05-16  4:49 ` [RFC/RFT] [PATCH 07/10] cpufreq: intel_pstate: HWP boost performance on busy task migrate Srinivas Pandruvada
2018-05-16  9:49   ` Rafael J. Wysocki
2018-05-16 20:59     ` Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 08/10] cpufreq: intel_pstate: Dyanmically update busy pct Srinivas Pandruvada
2018-05-16  7:43   ` Peter Zijlstra
2018-05-16  7:47   ` Peter Zijlstra
2018-05-16  4:49 ` [RFC/RFT] [PATCH 09/10] cpufreq: intel_pstate: New sysfs entry to control HWP boost Srinivas Pandruvada
2018-05-16  4:49 ` [RFC/RFT] [PATCH 10/10] cpufreq: intel_pstate: enable boost for SKX Srinivas Pandruvada
2018-05-16  7:49   ` Peter Zijlstra
2018-05-16 15:46     ` Srinivas Pandruvada
2018-05-16 15:54       ` Peter Zijlstra
2018-05-17  0:52         ` Srinivas Pandruvada
2018-05-16  6:49 ` [RFC/RFT] [PATCH 00/10] Intel_pstate: HWP Dynamic performance boost Juri Lelli
2018-05-16 15:43   ` Srinivas Pandruvada
