linux-kernel.vger.kernel.org archive mirror
* [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
@ 2016-03-09  5:45 Huang Rui
  2016-03-21  9:55 ` [tip:perf/urgent] " tip-bot for Huang Rui
  2016-06-16  1:13 ` [REDO PATCH v7] " Vince Weaver
  0 siblings, 2 replies; 15+ messages in thread
From: Huang Rui @ 2016-03-09  5:45 UTC (permalink / raw)
  To: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang
  Cc: linux-kernel, spg_linux_kernel, x86, Suravee Suthikulpanit,
	Aravind Gopalakrishnan, Borislav Petkov, Guenter Roeck,
	Fengguang Wu, Huang Rui

Introduce an AMD accumulated power reporting mechanism for the Family
15h, Model 60h processor that can be used to calculate the average
power consumed by a processor during a measurement interval. The
feature support is indicated by CPUID Fn8000_0007_EDX[12].

This feature will be implemented both in hwmon and perf. The current
design provides one event to report per package/processor power
consumption by counting each compute unit power value.

Here are the gory details of how the computation is done:

---------------------------------------------------------------------
* Tsample: compute unit power accumulator sample period
* Tref: the PTSC counter period (PTSC: performance timestamp counter)
* N: the ratio of compute unit power accumulator sample period to the
  PTSC period

* Jmax: max compute unit accumulated power which is indicated by
  MSR_C001007b[MaxCpuSwPwrAcc]

* Jx/Jy: compute unit accumulated power which is indicated by
  MSR_C001007a[CpuSwPwrAcc]

* Tx/Ty: the value of performance timestamp counter which is indicated
  by CU_PTSC MSR_C0010280[PTSC]
* PwrCPUave: CPU average power

i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].

ii. Read the full range of the cumulative energy value from the new
    MSR MaxCpuSwPwrAcc.
	Jmax = value returned.

iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.

iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.

v. Calculate the average power consumption for a compute unit over
time period (y-x). Unit of result is uWatt:

	if (Jy < Jx) // Rollover has occurred
		Jdelta = (Jy + Jmax) - Jx
	else
		Jdelta = Jy - Jx
	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)
----------------------------------------------------------------------
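
As a rough illustration of steps i. to v. above, the calculation can be
sketched in plain C. This is a userspace sketch under stated assumptions,
not the kernel code in this patch: the function name cu_avg_power_uw is
hypothetical, and the guard against a zero PTSC delta is an addition made
for the example only.

```c
#include <stdint.h>

/*
 * Average compute unit power in micro-Watts between sample x and sample y.
 *
 * jx, jy : CpuSwPwrAcc readings at time x and time y
 * jmax   : MaxCpuSwPwrAcc, the full range of the accumulator
 * tx, ty : PTSC readings at time x and time y
 * n      : CpuPwrSampleTimeRatio (ratio of Tsample to Tref)
 *
 * Hypothetical helper for illustration; the zero-delta guard is an
 * assumption and is not part of the description above.
 */
uint64_t cu_avg_power_uw(uint64_t jx, uint64_t jy, uint64_t jmax,
			 uint64_t tx, uint64_t ty, uint32_t n)
{
	uint64_t jdelta, tdelta;

	if (jy < jx)			/* rollover has occurred */
		jdelta = (jy + jmax) - jx;
	else
		jdelta = jy - jx;

	tdelta = ty - tx;
	if (tdelta == 0)		/* avoid dividing by zero */
		return 0;

	return (uint64_t)n * jdelta * 1000 / tdelta;
}
```

For example, with N = 4, an accumulator delta of 200 and a PTSC delta of
2000, the result is 4 * 200 * 1000 / 2000 = 400 uWatt, whether or not the
accumulator wrapped between the two samples.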

Simple example:

  root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
    CHK     include/config/kernel.release
    CHK     include/generated/uapi/linux/version.h
    CHK     include/generated/utsrelease.h
    CHK     include/generated/timeconst.h
    CHK     include/generated/bounds.h
    CHK     include/generated/asm-offsets.h
    CALL    scripts/checksyscalls.sh
    CHK     include/generated/compile.h
    SKIPPED include/generated/compile.h
    Building modules, stage 2.
  Kernel: arch/x86/boot/bzImage is ready  (#40)
    MODPOST 4225 modules

   Performance counter stats for 'system wide':

              183.44 mWatts power/power-pkg/

       341.837270111 seconds time elapsed

  root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10

   Performance counter stats for 'system wide':

                0.18 mWatts power/power-pkg/

        10.012551815 seconds time elapsed

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Cc: Guenter Roeck <linux@roeck-us.net>
---

Hi Boris,

I have redone this patch on top of tip/master. It depends on some
previous patches that you have already applied. If you need me to send
them again, please let me know.

Thanks,
Rui

---
 arch/x86/Kconfig            |   9 ++
 arch/x86/events/Makefile    |   1 +
 arch/x86/events/amd/power.c | 353 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/perf_event.h  |   4 +
 4 files changed, 367 insertions(+)
 create mode 100644 arch/x86/events/amd/power.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3e61672..52ef30d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1210,6 +1210,15 @@ config MICROCODE_OLD_INTERFACE
 	def_bool y
 	depends on MICROCODE
 
+config PERF_EVENTS_AMD_POWER
+	depends on PERF_EVENTS && CPU_SUP_AMD
+	tristate "AMD Processor Power Reporting Mechanism"
+	---help---
+	  Provide power reporting mechanism support for AMD processors.
+	  Currently, it leverages X86_FEATURE_ACC_POWER
+	  (CPUID Fn8000_0007_EDX[12]) interface to calculate the
+	  average power consumption on Family 15h processors.
+
 config X86_MSR
 	tristate "/dev/cpu/*/msr - Model-specific register support"
 	---help---
diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index fdfea15..f59618a 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,6 +1,7 @@
 obj-y			+= core.o
 
 obj-$(CONFIG_CPU_SUP_AMD)               += amd/core.o amd/uncore.o
+obj-$(CONFIG_PERF_EVENTS_AMD_POWER)	+= amd/power.o
 obj-$(CONFIG_X86_LOCAL_APIC)            += amd/ibs.o msr.o
 ifdef CONFIG_AMD_IOMMU
 obj-$(CONFIG_CPU_SUP_AMD)               += amd/iommu.o
diff --git a/arch/x86/events/amd/power.c b/arch/x86/events/amd/power.c
new file mode 100644
index 0000000..55a3529
--- /dev/null
+++ b/arch/x86/events/amd/power.c
@@ -0,0 +1,353 @@
+/*
+ * Performance events - AMD Processor Power Reporting Mechanism
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Huang Rui <ray.huang@amd.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "../perf_event.h"
+
+#define MSR_F15H_CU_PWR_ACCUMULATOR     0xc001007a
+#define MSR_F15H_CU_MAX_PWR_ACCUMULATOR 0xc001007b
+#define MSR_F15H_PTSC			0xc0010280
+
+/* Event code: LSB 8 bits, passed in attr->config any other bit is reserved. */
+#define AMD_POWER_EVENT_MASK		0xFFULL
+
+/*
+ * Accumulated power status counters.
+ */
+#define AMD_POWER_EVENTSEL_PKG		1
+
+/*
+ * The ratio of compute unit power accumulator sample period to the
+ * PTSC period.
+ */
+static unsigned int cpu_pwr_sample_ratio;
+
+/* Maximum accumulated power of a compute unit. */
+static u64 max_cu_acc_power;
+
+static struct pmu pmu_class;
+
+/*
+ * Accumulated power represents the sum of each compute unit's (CU) power
+ * consumption. On any core of each CU we read the total accumulated power from
+ * MSR_F15H_CU_PWR_ACCUMULATOR. cpu_mask represents CPU bit map of all cores
+ * which are picked to measure the power for the CUs they belong to.
+ */
+static cpumask_t cpu_mask;
+
+static void event_update(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	u64 prev_pwr_acc, new_pwr_acc, prev_ptsc, new_ptsc;
+	u64 delta, tdelta;
+
+	prev_pwr_acc = hwc->pwr_acc;
+	prev_ptsc = hwc->ptsc;
+	rdmsrl(MSR_F15H_CU_PWR_ACCUMULATOR, new_pwr_acc);
+	rdmsrl(MSR_F15H_PTSC, new_ptsc);
+
+	/*
+	 * Calculate the CU power consumption over a time period, the unit of
+	 * final value (delta) is micro-Watts. Then add it to the event count.
+	 */
+	if (new_pwr_acc < prev_pwr_acc) {
+		delta = max_cu_acc_power + new_pwr_acc;
+		delta -= prev_pwr_acc;
+	} else
+		delta = new_pwr_acc - prev_pwr_acc;
+
+	delta *= cpu_pwr_sample_ratio * 1000;
+	tdelta = new_ptsc - prev_ptsc;
+
+	do_div(delta, tdelta);
+	local64_add(delta, &event->count);
+}
+
+static void __pmu_event_start(struct perf_event *event)
+{
+	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+		return;
+
+	event->hw.state = 0;
+
+	rdmsrl(MSR_F15H_PTSC, event->hw.ptsc);
+	rdmsrl(MSR_F15H_CU_PWR_ACCUMULATOR, event->hw.pwr_acc);
+}
+
+static void pmu_event_start(struct perf_event *event, int mode)
+{
+	__pmu_event_start(event);
+}
+
+static void pmu_event_stop(struct perf_event *event, int mode)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	/* Mark event as deactivated and stopped. */
+	if (!(hwc->state & PERF_HES_STOPPED))
+		hwc->state |= PERF_HES_STOPPED;
+
+	/* Check if software counter update is necessary. */
+	if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+		/*
+		 * Drain the remaining delta count out of an event
+		 * that we are disabling:
+		 */
+		event_update(event);
+		hwc->state |= PERF_HES_UPTODATE;
+	}
+}
+
+static int pmu_event_add(struct perf_event *event, int mode)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+	if (mode & PERF_EF_START)
+		__pmu_event_start(event);
+
+	return 0;
+}
+
+static void pmu_event_del(struct perf_event *event, int flags)
+{
+	pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int pmu_event_init(struct perf_event *event)
+{
+	u64 cfg = event->attr.config & AMD_POWER_EVENT_MASK;
+
+	/* Only look at AMD power events. */
+	if (event->attr.type != pmu_class.type)
+		return -ENOENT;
+
+	/* Unsupported modes and filters. */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    /* no sampling */
+	    event->attr.sample_period)
+		return -EINVAL;
+
+	if (cfg != AMD_POWER_EVENTSEL_PKG)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void pmu_event_read(struct perf_event *event)
+{
+	event_update(event);
+}
+
+static ssize_t
+get_attr_cpumask(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, get_attr_cpumask, NULL);
+
+static struct attribute *pmu_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group pmu_attr_group = {
+	.attrs = pmu_attrs,
+};
+
+/*
+ * Currently it only supports to report the power of each
+ * processor/package.
+ */
+EVENT_ATTR_STR(power-pkg, power_pkg, "event=0x01");
+
+EVENT_ATTR_STR(power-pkg.unit, power_pkg_unit, "mWatts");
+
+/* Convert the count from micro-Watts to milli-Watts. */
+EVENT_ATTR_STR(power-pkg.scale, power_pkg_scale, "1.000000e-3");
+
+static struct attribute *events_attr[] = {
+	EVENT_PTR(power_pkg),
+	EVENT_PTR(power_pkg_unit),
+	EVENT_PTR(power_pkg_scale),
+	NULL,
+};
+
+static struct attribute_group pmu_events_group = {
+	.name	= "events",
+	.attrs	= events_attr,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+
+static struct attribute *formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group pmu_format_group = {
+	.name	= "format",
+	.attrs	= formats_attr,
+};
+
+static const struct attribute_group *attr_groups[] = {
+	&pmu_attr_group,
+	&pmu_format_group,
+	&pmu_events_group,
+	NULL,
+};
+
+static struct pmu pmu_class = {
+	.attr_groups	= attr_groups,
+	/* system-wide only */
+	.task_ctx_nr	= perf_invalid_context,
+	.event_init	= pmu_event_init,
+	.add		= pmu_event_add,
+	.del		= pmu_event_del,
+	.start		= pmu_event_start,
+	.stop		= pmu_event_stop,
+	.read		= pmu_event_read,
+};
+
+static void power_cpu_exit(int cpu)
+{
+	int target;
+
+	if (!cpumask_test_and_clear_cpu(cpu, &cpu_mask))
+		return;
+
+	/*
+	 * Find a new CPU on the same compute unit, if was set in cpumask
+	 * and still some CPUs on compute unit. Then migrate event and
+	 * context to new CPU.
+	 */
+	target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
+	if (target < nr_cpumask_bits) {
+		cpumask_set_cpu(target, &cpu_mask);
+		perf_pmu_migrate_context(&pmu_class, cpu, target);
+	}
+}
+
+static void power_cpu_init(int cpu)
+{
+	int target;
+
+	/*
+	 * 1) If any CPU is set at cpu_mask in the same compute unit, do
+	 * nothing.
+	 * 2) If no CPU is set at cpu_mask in the same compute unit,
+	 * set current STARTING CPU.
+	 *
+	 * Note: if there is a CPU aside of the new one already in the
+	 * sibling mask, then it is also in cpu_mask.
+	 */
+	target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
+	if (target >= nr_cpumask_bits)
+		cpumask_set_cpu(cpu, &cpu_mask);
+}
+
+static int
+power_cpu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_FAILED:
+	case CPU_STARTING:
+		power_cpu_init(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		power_cpu_exit(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block power_cpu_notifier_nb = {
+	.notifier_call = power_cpu_notifier,
+	.priority = CPU_PRI_PERF,
+};
+
+static const struct x86_cpu_id cpu_match[] = {
+	{ .vendor = X86_VENDOR_AMD, .family = 0x15 },
+	{},
+};
+
+static int __init amd_power_pmu_init(void)
+{
+	int cpu, target, ret;
+
+	if (!x86_match_cpu(cpu_match))
+		return 0;
+
+	if (!boot_cpu_has(X86_FEATURE_ACC_POWER))
+		return -ENODEV;
+
+	cpu_pwr_sample_ratio = cpuid_ecx(0x80000007);
+
+	if (rdmsrl_safe(MSR_F15H_CU_MAX_PWR_ACCUMULATOR, &max_cu_acc_power)) {
+		pr_err("Failed to read max compute unit power accumulator MSR\n");
+		return -ENODEV;
+	}
+
+	cpu_notifier_register_begin();
+
+	/* Choose one online core of each compute unit. */
+	for_each_online_cpu(cpu) {
+		target = cpumask_first(topology_sibling_cpumask(cpu));
+		if (!cpumask_test_cpu(target, &cpu_mask))
+			cpumask_set_cpu(target, &cpu_mask);
+	}
+
+	ret = perf_pmu_register(&pmu_class, "power", -1);
+	if (WARN_ON(ret)) {
+		pr_warn("AMD Power PMU registration failed\n");
+		goto out;
+	}
+
+	__register_cpu_notifier(&power_cpu_notifier_nb);
+
+	pr_info("AMD Power PMU detected\n");
+
+out:
+	cpu_notifier_register_done();
+
+	return ret;
+}
+module_init(amd_power_pmu_init);
+
+static void __exit amd_power_pmu_exit(void)
+{
+	cpu_notifier_register_begin();
+	__unregister_cpu_notifier(&power_cpu_notifier_nb);
+	cpu_notifier_register_done();
+
+	perf_pmu_unregister(&pmu_class);
+}
+module_exit(amd_power_pmu_exit);
+
+MODULE_AUTHOR("Huang Rui <ray.huang@amd.com>");
+MODULE_DESCRIPTION("AMD Processor Power Reporting Mechanism");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 79ec7bb..a2d0d5a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -128,6 +128,10 @@ struct hw_perf_event {
 		struct { /* itrace */
 			int			itrace_started;
 		};
+		struct { /* amd_power */
+			u64	pwr_acc;
+			u64	ptsc;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*
-- 
1.9.1


* [tip:perf/urgent] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-03-09  5:45 [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism Huang Rui
@ 2016-03-21  9:55 ` tip-bot for Huang Rui
  2016-06-16  1:13 ` [REDO PATCH v7] " Vince Weaver
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot for Huang Rui @ 2016-03-21  9:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: kan.liang, rric, hpa, ray.huang, dsahern, vincent.weaver, tglx,
	acme, peterz, bp, luto, bp, linux-kernel, namhyung, dvlasenk,
	mingo, eranian, jolsa, torvalds, brgerst, alexander.shishkin,
	acme

Commit-ID:  c7ab62bfbe0e27ef452d19d88b083f01e99f13a7
Gitweb:     http://git.kernel.org/tip/c7ab62bfbe0e27ef452d19d88b083f01e99f13a7
Author:     Huang Rui <ray.huang@amd.com>
AuthorDate: Wed, 9 Mar 2016 13:45:06 +0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Mar 2016 09:37:15 +0100

perf/x86/amd/power: Add AMD accumulated power reporting mechanism

Introduce an AMD accumulated power reporting mechanism for the Family
15h, Model 60h processor that can be used to calculate the average
power consumed by a processor during a measurement interval. The
feature support is indicated by CPUID Fn8000_0007_EDX[12].

This feature will be implemented both in hwmon and perf. The current
design provides one event to report per package/processor power
consumption by counting each compute unit power value.

Here are the gory details of how the computation is done:

* Tsample: compute unit power accumulator sample period
* Tref: the PTSC counter period (PTSC: performance timestamp counter)
* N: the ratio of compute unit power accumulator sample period to the
  PTSC period

* Jmax: max compute unit accumulated power which is indicated by
  MSR_C001007b[MaxCpuSwPwrAcc]

* Jx/Jy: compute unit accumulated power which is indicated by
  MSR_C001007a[CpuSwPwrAcc]

* Tx/Ty: the value of performance timestamp counter which is indicated
  by CU_PTSC MSR_C0010280[PTSC]
* PwrCPUave: CPU average power

i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007.
	N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]].

ii. Read the full range of the cumulative energy value from the new
    MSR MaxCpuSwPwrAcc.
	Jmax = value returned.

iii. At time x, software reads CpuSwPwrAcc and samples the PTSC.
	Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC.

iv. At time y, software reads CpuSwPwrAcc and samples the PTSC.
	Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC.

v. Calculate the average power consumption for a compute unit over
time period (y-x). Unit of result is uWatt:

	if (Jy < Jx) // Rollover has occurred
		Jdelta = (Jy + Jmax) - Jx
	else
		Jdelta = Jy - Jx
	PwrCPUave = N * Jdelta * 1000 / (Ty - Tx)

Simple example:

  root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4
    CHK     include/config/kernel.release
    CHK     include/generated/uapi/linux/version.h
    CHK     include/generated/utsrelease.h
    CHK     include/generated/timeconst.h
    CHK     include/generated/bounds.h
    CHK     include/generated/asm-offsets.h
    CALL    scripts/checksyscalls.sh
    CHK     include/generated/compile.h
    SKIPPED include/generated/compile.h
    Building modules, stage 2.
  Kernel: arch/x86/boot/bzImage is ready  (#40)
    MODPOST 4225 modules

   Performance counter stats for 'system wide':

              183.44 mWatts power/power-pkg/

       341.837270111 seconds time elapsed

  root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10

   Performance counter stats for 'system wide':

                0.18 mWatts power/power-pkg/

        10.012551815 seconds time elapsed

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: jacob.w.shin@gmail.com
Link: http://lkml.kernel.org/r/1457502306-2559-1-git-send-email-ray.huang@amd.com
[ Fixed the modular build. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig            |   9 ++
 arch/x86/events/Makefile    |   1 +
 arch/x86/events/amd/power.c | 353 ++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/events/core.c      |   4 +-
 include/linux/perf_event.h  |   4 +
 5 files changed, 369 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8f2e665..a313c0e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1206,6 +1206,15 @@ config MICROCODE_OLD_INTERFACE
 	def_bool y
 	depends on MICROCODE
 
+config PERF_EVENTS_AMD_POWER
+	depends on PERF_EVENTS && CPU_SUP_AMD
+	tristate "AMD Processor Power Reporting Mechanism"
+	---help---
+	  Provide power reporting mechanism support for AMD processors.
+	  Currently, it leverages X86_FEATURE_ACC_POWER
+	  (CPUID Fn8000_0007_EDX[12]) interface to calculate the
+	  average power consumption on Family 15h processors.
+
 config X86_MSR
 	tristate "/dev/cpu/*/msr - Model-specific register support"
 	---help---
diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index fdfea15..f59618a 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,6 +1,7 @@
 obj-y			+= core.o
 
 obj-$(CONFIG_CPU_SUP_AMD)               += amd/core.o amd/uncore.o
+obj-$(CONFIG_PERF_EVENTS_AMD_POWER)	+= amd/power.o
 obj-$(CONFIG_X86_LOCAL_APIC)            += amd/ibs.o msr.o
 ifdef CONFIG_AMD_IOMMU
 obj-$(CONFIG_CPU_SUP_AMD)               += amd/iommu.o
diff --git a/arch/x86/events/amd/power.c b/arch/x86/events/amd/power.c
new file mode 100644
index 0000000..55a3529
--- /dev/null
+++ b/arch/x86/events/amd/power.c
@@ -0,0 +1,353 @@
+/*
+ * Performance events - AMD Processor Power Reporting Mechanism
+ *
+ * Copyright (C) 2016 Advanced Micro Devices, Inc.
+ *
+ * Author: Huang Rui <ray.huang@amd.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "../perf_event.h"
+
+#define MSR_F15H_CU_PWR_ACCUMULATOR     0xc001007a
+#define MSR_F15H_CU_MAX_PWR_ACCUMULATOR 0xc001007b
+#define MSR_F15H_PTSC			0xc0010280
+
+/* Event code: LSB 8 bits, passed in attr->config any other bit is reserved. */
+#define AMD_POWER_EVENT_MASK		0xFFULL
+
+/*
+ * Accumulated power status counters.
+ */
+#define AMD_POWER_EVENTSEL_PKG		1
+
+/*
+ * The ratio of compute unit power accumulator sample period to the
+ * PTSC period.
+ */
+static unsigned int cpu_pwr_sample_ratio;
+
+/* Maximum accumulated power of a compute unit. */
+static u64 max_cu_acc_power;
+
+static struct pmu pmu_class;
+
+/*
+ * Accumulated power represents the sum of each compute unit's (CU) power
+ * consumption. On any core of each CU we read the total accumulated power from
+ * MSR_F15H_CU_PWR_ACCUMULATOR. cpu_mask represents CPU bit map of all cores
+ * which are picked to measure the power for the CUs they belong to.
+ */
+static cpumask_t cpu_mask;
+
+static void event_update(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	u64 prev_pwr_acc, new_pwr_acc, prev_ptsc, new_ptsc;
+	u64 delta, tdelta;
+
+	prev_pwr_acc = hwc->pwr_acc;
+	prev_ptsc = hwc->ptsc;
+	rdmsrl(MSR_F15H_CU_PWR_ACCUMULATOR, new_pwr_acc);
+	rdmsrl(MSR_F15H_PTSC, new_ptsc);
+
+	/*
+	 * Calculate the CU power consumption over a time period, the unit of
+	 * final value (delta) is micro-Watts. Then add it to the event count.
+	 */
+	if (new_pwr_acc < prev_pwr_acc) {
+		delta = max_cu_acc_power + new_pwr_acc;
+		delta -= prev_pwr_acc;
+	} else
+		delta = new_pwr_acc - prev_pwr_acc;
+
+	delta *= cpu_pwr_sample_ratio * 1000;
+	tdelta = new_ptsc - prev_ptsc;
+
+	do_div(delta, tdelta);
+	local64_add(delta, &event->count);
+}
+
+static void __pmu_event_start(struct perf_event *event)
+{
+	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+		return;
+
+	event->hw.state = 0;
+
+	rdmsrl(MSR_F15H_PTSC, event->hw.ptsc);
+	rdmsrl(MSR_F15H_CU_PWR_ACCUMULATOR, event->hw.pwr_acc);
+}
+
+static void pmu_event_start(struct perf_event *event, int mode)
+{
+	__pmu_event_start(event);
+}
+
+static void pmu_event_stop(struct perf_event *event, int mode)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	/* Mark event as deactivated and stopped. */
+	if (!(hwc->state & PERF_HES_STOPPED))
+		hwc->state |= PERF_HES_STOPPED;
+
+	/* Check if software counter update is necessary. */
+	if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+		/*
+		 * Drain the remaining delta count out of an event
+		 * that we are disabling:
+		 */
+		event_update(event);
+		hwc->state |= PERF_HES_UPTODATE;
+	}
+}
+
+static int pmu_event_add(struct perf_event *event, int mode)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+	if (mode & PERF_EF_START)
+		__pmu_event_start(event);
+
+	return 0;
+}
+
+static void pmu_event_del(struct perf_event *event, int flags)
+{
+	pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int pmu_event_init(struct perf_event *event)
+{
+	u64 cfg = event->attr.config & AMD_POWER_EVENT_MASK;
+
+	/* Only look at AMD power events. */
+	if (event->attr.type != pmu_class.type)
+		return -ENOENT;
+
+	/* Unsupported modes and filters. */
+	if (event->attr.exclude_user   ||
+	    event->attr.exclude_kernel ||
+	    event->attr.exclude_hv     ||
+	    event->attr.exclude_idle   ||
+	    event->attr.exclude_host   ||
+	    event->attr.exclude_guest  ||
+	    /* no sampling */
+	    event->attr.sample_period)
+		return -EINVAL;
+
+	if (cfg != AMD_POWER_EVENTSEL_PKG)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void pmu_event_read(struct perf_event *event)
+{
+	event_update(event);
+}
+
+static ssize_t
+get_attr_cpumask(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, get_attr_cpumask, NULL);
+
+static struct attribute *pmu_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL,
+};
+
+static struct attribute_group pmu_attr_group = {
+	.attrs = pmu_attrs,
+};
+
+/*
+ * Currently it only supports to report the power of each
+ * processor/package.
+ */
+EVENT_ATTR_STR(power-pkg, power_pkg, "event=0x01");
+
+EVENT_ATTR_STR(power-pkg.unit, power_pkg_unit, "mWatts");
+
+/* Convert the count from micro-Watts to milli-Watts. */
+EVENT_ATTR_STR(power-pkg.scale, power_pkg_scale, "1.000000e-3");
+
+static struct attribute *events_attr[] = {
+	EVENT_PTR(power_pkg),
+	EVENT_PTR(power_pkg_unit),
+	EVENT_PTR(power_pkg_scale),
+	NULL,
+};
+
+static struct attribute_group pmu_events_group = {
+	.name	= "events",
+	.attrs	= events_attr,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+
+static struct attribute *formats_attr[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+
+static struct attribute_group pmu_format_group = {
+	.name	= "format",
+	.attrs	= formats_attr,
+};
+
+static const struct attribute_group *attr_groups[] = {
+	&pmu_attr_group,
+	&pmu_format_group,
+	&pmu_events_group,
+	NULL,
+};
+
+static struct pmu pmu_class = {
+	.attr_groups	= attr_groups,
+	/* system-wide only */
+	.task_ctx_nr	= perf_invalid_context,
+	.event_init	= pmu_event_init,
+	.add		= pmu_event_add,
+	.del		= pmu_event_del,
+	.start		= pmu_event_start,
+	.stop		= pmu_event_stop,
+	.read		= pmu_event_read,
+};
+
+static void power_cpu_exit(int cpu)
+{
+	int target;
+
+	if (!cpumask_test_and_clear_cpu(cpu, &cpu_mask))
+		return;
+
+	/*
+	 * Find a new CPU on the same compute unit, if was set in cpumask
+	 * and still some CPUs on compute unit. Then migrate event and
+	 * context to new CPU.
+	 */
+	target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
+	if (target < nr_cpumask_bits) {
+		cpumask_set_cpu(target, &cpu_mask);
+		perf_pmu_migrate_context(&pmu_class, cpu, target);
+	}
+}
+
+static void power_cpu_init(int cpu)
+{
+	int target;
+
+	/*
+	 * 1) If any CPU is set at cpu_mask in the same compute unit, do
+	 * nothing.
+	 * 2) If no CPU is set at cpu_mask in the same compute unit,
+	 * set current STARTING CPU.
+	 *
+	 * Note: if there is a CPU aside of the new one already in the
+	 * sibling mask, then it is also in cpu_mask.
+	 */
+	target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
+	if (target >= nr_cpumask_bits)
+		cpumask_set_cpu(cpu, &cpu_mask);
+}
+
+static int
+power_cpu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (long)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_DOWN_FAILED:
+	case CPU_STARTING:
+		power_cpu_init(cpu);
+		break;
+	case CPU_DOWN_PREPARE:
+		power_cpu_exit(cpu);
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block power_cpu_notifier_nb = {
+	.notifier_call = power_cpu_notifier,
+	.priority = CPU_PRI_PERF,
+};
+
+static const struct x86_cpu_id cpu_match[] = {
+	{ .vendor = X86_VENDOR_AMD, .family = 0x15 },
+	{},
+};
+
+static int __init amd_power_pmu_init(void)
+{
+	int cpu, target, ret;
+
+	if (!x86_match_cpu(cpu_match))
+		return 0;
+
+	if (!boot_cpu_has(X86_FEATURE_ACC_POWER))
+		return -ENODEV;
+
+	cpu_pwr_sample_ratio = cpuid_ecx(0x80000007);
+
+	if (rdmsrl_safe(MSR_F15H_CU_MAX_PWR_ACCUMULATOR, &max_cu_acc_power)) {
+		pr_err("Failed to read max compute unit power accumulator MSR\n");
+		return -ENODEV;
+	}
+
+	cpu_notifier_register_begin();
+
+	/* Choose one online core of each compute unit. */
+	for_each_online_cpu(cpu) {
+		target = cpumask_first(topology_sibling_cpumask(cpu));
+		if (!cpumask_test_cpu(target, &cpu_mask))
+			cpumask_set_cpu(target, &cpu_mask);
+	}
+
+	ret = perf_pmu_register(&pmu_class, "power", -1);
+	if (WARN_ON(ret)) {
+		pr_warn("AMD Power PMU registration failed\n");
+		goto out;
+	}
+
+	__register_cpu_notifier(&power_cpu_notifier_nb);
+
+	pr_info("AMD Power PMU detected\n");
+
+out:
+	cpu_notifier_register_done();
+
+	return ret;
+}
+module_init(amd_power_pmu_init);
+
+static void __exit amd_power_pmu_exit(void)
+{
+	cpu_notifier_register_begin();
+	__unregister_cpu_notifier(&power_cpu_notifier_nb);
+	cpu_notifier_register_done();
+
+	perf_pmu_unregister(&pmu_class);
+}
+module_exit(amd_power_pmu_exit);
+
+MODULE_AUTHOR("Huang Rui <ray.huang@amd.com>");
+MODULE_DESCRIPTION("AMD Processor Power Reporting Mechanism");
+MODULE_LICENSE("GPL v2");
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5e830d0..002b2ea 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1602,8 +1602,7 @@ __init struct attribute **merge_attr(struct attribute **a, struct attribute **b)
 	return new;
 }
 
-ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
-			  char *page)
+ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr, char *page)
 {
 	struct perf_pmu_events_attr *pmu_attr = \
 		container_of(attr, struct perf_pmu_events_attr, attr);
@@ -1615,6 +1614,7 @@ ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
 
 	return x86_pmu.events_sysfs_show(page, config);
 }
+EXPORT_SYMBOL_GPL(events_sysfs_show);
 
 EVENT_ATTR(cpu-cycles,			CPU_CYCLES		);
 EVENT_ATTR(instructions,		INSTRUCTIONS		);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7bb315b..15588d4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -129,6 +129,10 @@ struct hw_perf_event {
 		struct { /* itrace */
 			int			itrace_started;
 		};
+		struct { /* amd_power */
+			u64	pwr_acc;
+			u64	ptsc;
+		};
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 		struct { /* breakpoint */
 			/*


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-03-09  5:45 [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism Huang Rui
  2016-03-21  9:55 ` [tip:perf/urgent] " tip-bot for Huang Rui
@ 2016-06-16  1:13 ` Vince Weaver
  2016-06-16  5:38   ` Huang Rui
  1 sibling, 1 reply; 15+ messages in thread
From: Vince Weaver @ 2016-06-16  1:13 UTC (permalink / raw)
  To: Huang Rui
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel,
	spg_linux_kernel, x86, Suravee Suthikulpanit,
	Aravind Gopalakrishnan, Borislav Petkov, Guenter Roeck,
	Fengguang Wu


three questions about this functionality:

1.  In theory this should also work on an amd fam16h model 30h
    processor too, correct?  The current code limits things to fam15h
    even though the fam16mod30h has all the proper cpuid flags.

    I've tested the functionality a bit and it seems to work but for
    some reason the ptsc seems to occasionally count backwards
    on my machine.  Any reason that would be?  (It doesn't seem to be
    an overflow, just reading the ptsc 5ms apart and the values are 
    slightly lower after than before).

2.  Unless I'm misunderstanding things, the code seems to be accumulating 
	Power. (see chunk below) Power is an instantaneous measurement, it 
	makes no sense to add values.  If you use 5W for 1ms and 10W for
	1ms, the average power across the 2ms interval is not 15W.

	You can add energy, but not power.

> +	delta *= cpu_pwr_sample_ratio * 1000;
> +	tdelta = new_ptsc - prev_ptsc;
> +
> +	do_div(delta, tdelta);
> +	local64_add(delta, &event->count);
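To make the 5W/10W example concrete, here is a minimal sketch in plain
userspace C (not driver code; struct power_slice and average_power() are
hypothetical names used only for illustration). It sums energy over the
slices and divides by the total time, so 5W for 1ms followed by 10W for
1ms comes out as 7.5W rather than 15W.

```c
#include <stddef.h>

/* One measurement slice: a constant power level held for a duration. */
struct power_slice {
	double watts;
	double seconds;
};

/*
 * Average power over consecutive slices: total energy (joules) divided
 * by total time. Returns 0.0 for an empty or zero-length series.
 */
double average_power(const struct power_slice *s, size_t n)
{
	double joules = 0.0, seconds = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		joules += s[i].watts * s[i].seconds;	/* energy adds up */
		seconds += s[i].seconds;
	}
	return seconds > 0.0 ? joules / seconds : 0.0;
}
```

Calling average_power() on the two slices { 5.0, 0.001 } and
{ 10.0, 0.001 } gives 0.015 J / 0.002 s = 7.5 W.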

3.  The actual results gathered seem ridiculously low.  341 seconds of
    calculation and only using 183 mWatts of power?

>    Performance counter stats for 'system wide':
> 
>               183.44 mWatts power/power-pkg/
> 
>        341.837270111 seconds time elapsed
> 
>   root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10

Vince


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16  1:13 ` [REDO PATCH v7] " Vince Weaver
@ 2016-06-16  5:38   ` Huang Rui
  2016-06-16  5:59     ` Huang Rui
                       ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Huang Rui @ 2016-06-16  5:38 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

Hi Vince,

Thanks for asking.

On Wed, Jun 15, 2016 at 09:13:59PM -0400, Vince Weaver wrote:
> 
> three questions about this functionality:
> 
> 1.  In theory this should also work on an amd fam16h model 30h
>     processor too, correct?  The current code limits things to fam15h
>     even though the fam16mod30h has all the proper cpuid flags.
> 

I was previously told this feature would be supported on fam15h models
60h, 70h and later. I just checked the fam16h model 30h BKDG and, yes,
it should also be supported. I didn't test that platform, though; if
you confirm it works on your side, we can enable it.

>     I've tested the functionality a bit and it seems to work but for
>     some reason the ptsc seems to occasionally count backwards
>     on my machine.  Any reason that would be?  (It doesn't seem to be
>     an overflow, just reading the ptsc 5ms apart and the values are 
>     slightly lower after than before).
> 

The PTSC runs at about 100 MHz, so it shouldn't be an overflow.

> 2.  Unless I'm misunderstanding things, the code seems to be accumulating 
> 	Power. (see chunk below) Power is an instantaneous measurement, it 
> 	makes no sense to add values.  If you use 5W for 1ms and 10W for
> 	1ms, the average power across the 2ms interval is not 15W.
> 
> 	You can add energy, but not power.
> 
> > +	delta *= cpu_pwr_sample_ratio * 1000;
> > +	tdelta = new_ptsc - prev_ptsc;
> > +
> > +	do_div(delta, tdelta);
> > +	local64_add(delta, &event->count);
> 

You're right. Nice catch! The average power is per compute unit. We
cannot simply add the power for each processor/package.

So here, the average power per package should be (delta1 + delta2 + ... + deltaN)/(tdelta_avg).
I will work out a fix. Thanks for pointing this out.

> 3.  The actual results gathered seem ridiculously low.  341 seconds of
>     calculation and only using 183 mWatts of power?
> 

The mWatts figure is processor power, not system power. The data below
was measured on fam15h model 60h, which is a low-power platform. Even
though the method has a minor mistake, the processor power should be
in the mWatts range.

> >    Performance counter stats for 'system wide':
> > 
> >               183.44 mWatts power/power-pkg/
> > 
> >        341.837270111 seconds time elapsed
> > 
> >   root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10
> 

Thanks,
Rui


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16  5:38   ` Huang Rui
@ 2016-06-16  5:59     ` Huang Rui
  2016-06-16 21:10       ` Vince Weaver
  2016-06-16 16:47     ` Borislav Petkov
  2016-06-16 20:44     ` Vince Weaver
  2 siblings, 1 reply; 15+ messages in thread
From: Huang Rui @ 2016-06-16  5:59 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

On Thu, Jun 16, 2016 at 01:38:13PM +0800, Huang Rui wrote:
> On Wed, Jun 15, 2016 at 09:13:59PM -0400, Vince Weaver wrote:
> > 
> > 2.  Unless I'm misunderstanding things, the code seems to be accumulating 
> > 	Power. (see chunk below) Power is an instantaneous measurement, it 
> > 	makes no sense to add values.  If you use 5W for 1ms and 10W for
> > 	1ms, the average power across the 2ms interval is not 15W.
> > 
> > 	You can add energy, but not power.
> > 
> > > +	delta *= cpu_pwr_sample_ratio * 1000;
> > > +	tdelta = new_ptsc - prev_ptsc;
> > > +
> > > +	do_div(delta, tdelta);
> > > +	local64_add(delta, &event->count);
> > 
> 
> You're right. Nice catch! The average power is per compute unit. We
> cannot simply add the power for each processor/package.
> 
> So here, the average power per package should be (delta1 + delta2 + ... + deltaN)/(tdelta_avg).
> I will work out a fix. Thanks for pointing this out.
> 

After considering this carefully, I think the original method is OK.

      AMD nomenclature for CMT systems:

        [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
                                     -> [Compute Unit Core 1] -> Linux CPU 1
                 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
                                     -> [Compute Unit Core 1] -> Linux CPU 3

deltaN is the power per compute unit. Currently one package has two
CUs. If, in the *same* interval, CU0's power is 10W and CU1's power is
15W, the package (CU0 + CU1) power should be 25W, right? Because the
interval is the same.

Q = Q1 + Q2.  P = Q/t = (Q1 + Q2)/t = Q1/t + Q2/t = P1 + P2.
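As a minimal sketch of that identity (plain userspace C, not the driver
code; package_power() is a hypothetical helper): when every compute unit
is measured over the same interval, dividing the summed energy by that
interval gives the same result as adding the per-CU average powers. With
10W on CU0 and 15W on CU1 the sum works out to 25W.

```c
#include <stddef.h>

/*
 * Package average power for one interval: sum the energy of every
 * compute unit, then divide by the shared interval length.
 * P = (Q1 + Q2 + ... + Qn) / t = P1 + P2 + ... + Pn
 */
double package_power(const double *cu_joules, size_t n_cu, double seconds)
{
	double total = 0.0;
	size_t i;

	if (seconds <= 0.0)
		return 0.0;	/* no meaningful average without an interval */
	for (i = 0; i < n_cu; i++)
		total += cu_joules[i];
	return total / seconds;
}
```

For example, two CUs that used 10 J and 15 J during the same 1 s
interval give package_power() == 25.0, i.e. 10W + 15W. The identity only
holds because t is common to all terms; it does not apply to samples
taken over different intervals.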

Is that clear?

Thanks,
Rui


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16  5:38   ` Huang Rui
  2016-06-16  5:59     ` Huang Rui
@ 2016-06-16 16:47     ` Borislav Petkov
  2016-06-17  9:58       ` Huang Rui
  2016-06-16 20:44     ` Vince Weaver
  2 siblings, 1 reply; 15+ messages in thread
From: Borislav Petkov @ 2016-06-16 16:47 UTC (permalink / raw)
  To: Huang Rui
  Cc: Vince Weaver, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Guenter Roeck,
	Fengguang Wu

On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote:
> I was told this feature would be supported on fam15h 60h, 70h and
> later processors before. Just checked the fam16h model 30h BKDG, yes,
> it should be also supported. But I didn't test that platform, if you
> confirm it works in your side. We can enable it.

You might want to ask around first whether F16M30's acc power machinery
is even usable? I.e., no errata and whatnot...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16  5:38   ` Huang Rui
  2016-06-16  5:59     ` Huang Rui
  2016-06-16 16:47     ` Borislav Petkov
@ 2016-06-16 20:44     ` Vince Weaver
  2016-06-17 10:03       ` Huang Rui
  2 siblings, 1 reply; 15+ messages in thread
From: Vince Weaver @ 2016-06-16 20:44 UTC (permalink / raw)
  To: Huang Rui
  Cc: Vince Weaver, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

On Thu, 16 Jun 2016, Huang Rui wrote:

> > 1.  In theory this should also work on an amd fam16h model 30h
> >     processor too, correct?  The current code limits things to fam15h
> >     even though the fam16mod30h has all the proper cpuid flags.
> > 
> 
> I was told this feature would be supported on fam15h 60h, 70h and
> later processors before. Just checked the fam16h model 30h BKDG, yes,
> it should be also supported. But I didn't test that platform, if you
> confirm it works in your side. We can enable it.

I can confirm I get power readings on my fam16hmod30h board once I apply a 
trivial patch to the driver.  I'll send the patch in a separate e-mail.

> PTSC's frequency is about 100Mhz, it shouldn't be overflow.

That's what I thought.  I'm trying to read the value using the /dev/msr 
interface from userspace and I get weird results.

i.e.:
	Jx: read 62d299b84
	PTSC MSR: read 72fe92
	
	sleep 5ms

	Jy: read 631b453b9
	PTSC MSR: read 46b25

this happens about half the time (PTSC going backwards).  Though 
admittedly the problem could somehow be in the MSR code I'm using.

> mWatts are for processor power not system power. Below data is
> calculated on fam15h model 60h which is low power platform. Even
> though the method has a minor mistake, the processor power should be
> in mWatts field.

I have an actual wall-mounted power meter hooked up to my system and the 
difference from idle to all-cores-busy is 20W, so I would think that 
the results we find with perf should be >1W at least.

Vince


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16  5:59     ` Huang Rui
@ 2016-06-16 21:10       ` Vince Weaver
  2016-06-16 21:16         ` Vince Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Vince Weaver @ 2016-06-16 21:10 UTC (permalink / raw)
  To: Huang Rui
  Cc: Vince Weaver, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

On Thu, 16 Jun 2016, Huang Rui wrote:

> On Thu, Jun 16, 2016 at 01:38:13PM +0800, Huang Rui wrote:
> 
> After considering carefully, the original method should be OK. 
> 
>       AMD nomenclature for CMT systems:
> 
>         [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0
>                                      -> [Compute Unit Core 1] -> Linux CPU 1
>                  -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2
>                                      -> [Compute Unit Core 1] -> Linux CPU 3
> 
> The deltaN is power per compute unit. Current one package has two CUs.
> In the *same* interval, CU0's power is 10W, CU1's power is 15W. The
> package (CU0 + CU1) power should be 25W, right? Because the interval
> is the same.
> 
> Q = Q1 + Q2.  P = Q/t = (Q1 + Q2)/t = Q1/t + Q2/t = P1 + P2.
> 
> Is that clear?

OK, I was misunderstanding.  I somehow thought there was a periodic timer 
that was accumulating power over time.
But no, the driver just assumes the PTSC does not overflow?  And that 
addition is just there to handle adding all the cores together?

If so, then I agree that the addition makes sense, sorry for confusing 
things.

Although I think it would be better if we reported Joules (like 
RAPL does) rather than average power, but too late to change that now.


Also, on my machine I get results that make no physical sense, such as:

sudo perf stat -a -e power/power-pkg/  /usr/games/primes 1 500000000 > /dev/null

 Performance counter stats for 'system wide':

      4,472,401.06 mWatts power/power-pkg/                                            

       6.956135769 seconds time elapsed

I somehow don't think the CPU is really burning 4kW of Power.

Vince


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16 21:10       ` Vince Weaver
@ 2016-06-16 21:16         ` Vince Weaver
  2016-06-16 21:36           ` Borislav Petkov
  0 siblings, 1 reply; 15+ messages in thread
From: Vince Weaver @ 2016-06-16 21:16 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Huang Rui, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

On Thu, 16 Jun 2016, Vince Weaver wrote:

One more followup, if I run the benchmark a bunch of times I get this:

      4,472,401.06 mWatts power/power-pkg/                                            
     50,886,303.28 mWatts power/power-pkg/                                            
     81,737,001.44 mWatts power/power-pkg/                                            
          6,525.89 mWatts power/power-pkg/                                            
          6,522.04 mWatts power/power-pkg/                                            
          6,505.68 mWatts power/power-pkg/                                            
      4,938,855.83 mWatts power/power-pkg/                                            
      4,614,620.11 mWatts power/power-pkg/                                            
     79,679,069.41 mWatts power/power-pkg/                                            
    152,794,060.83 mWatts power/power-pkg/                                            
      3,942,429.02 mWatts power/power-pkg/                                            
          6,506.73 mWatts power/power-pkg/                                            
     60,198,884.39 mWatts power/power-pkg/                                            

I'd believe the 6W report as a value for how much the CPU is using.  The 
others seem spurious.  I guess I should go check the Errata for this chip.

Vince


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16 21:16         ` Vince Weaver
@ 2016-06-16 21:36           ` Borislav Petkov
  0 siblings, 0 replies; 15+ messages in thread
From: Borislav Petkov @ 2016-06-16 21:36 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Huang Rui, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Guenter Roeck,
	Fengguang Wu

On Thu, Jun 16, 2016 at 05:16:04PM -0400, Vince Weaver wrote:
> I'd believe the 6W report as a value for how much the CPU is using.
> The others seem spurious. I guess I should go check the Errata for
> this chip.

Maybe this is the reason why it got enabled on F15 only :-)

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16 16:47     ` Borislav Petkov
@ 2016-06-17  9:58       ` Huang Rui
  2016-07-19 18:22         ` Vince Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Huang Rui @ 2016-06-17  9:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Vince Weaver, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Guenter Roeck,
	Fengguang Wu

On Thu, Jun 16, 2016 at 06:47:00PM +0200, Borislav Petkov wrote:
> On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote:
> > I was told this feature would be supported on fam15h 60h, 70h and
> > later processors before. Just checked the fam16h model 30h BKDG, yes,
> > it should be also supported. But I didn't test that platform, if you
> > confirm it works in your side. We can enable it.
> 
> You might want to ask around first whether F16M30's acc power machinery
> is even usable? I.e., no errata and whatnot...
> 

Yep, I already asked the designers and am waiting for their
feedback.

Thanks,
Rui


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-16 20:44     ` Vince Weaver
@ 2016-06-17 10:03       ` Huang Rui
  2016-06-17 15:54         ` Vince Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Huang Rui @ 2016-06-17 10:03 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

On Thu, Jun 16, 2016 at 04:44:20PM -0400, Vince Weaver wrote:
> On Thu, 16 Jun 2016, Huang Rui wrote:
> 
> > > 1.  In theory this should also work on an amd fam16h model 30h
> > >     processor too, correct?  The current code limits things to fam15h
> > >     even though the fam16mod30h has all the proper cpuid flags.
> > > 
> > 
> > I was told this feature would be supported on fam15h 60h, 70h and
> > later processors before. Just checked the fam16h model 30h BKDG, yes,
> > it should be also supported. But I didn't test that platform, if you
> > confirm it works in your side. We can enable it.
> 
> I can confirm I get power readings on my fam16hmod30h board once I apply a 
> trivial patch to the driver.  I'll send the patch in a separate e-mail.
> 

OK, thanks.

> > PTSC's frequency is about 100Mhz, it shouldn't be overflow.
> 
> That's what I thought.  I'm trying to read the value using the /dev/msr 
> interface from userspace and I get weird results.
> 
> i.e.:
> 	Jx: read 62d299b84
> 	PTSC MSR: read 72fe92
> 	
> 	sleep 5ms
> 
> 	Jy: read 631b453b9
> 	PTSC MSR: read 46b25
> 
> this happens about half the time (PTSC going backwards).  Though 
> admittedly the problem could somehow be in the MSR code I'm using.
> 

Can you try to read the MSR value two times with the same core
(rdmsrl_on_cpu)?

Thanks,
Rui


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-17 10:03       ` Huang Rui
@ 2016-06-17 15:54         ` Vince Weaver
  0 siblings, 0 replies; 15+ messages in thread
From: Vince Weaver @ 2016-06-17 15:54 UTC (permalink / raw)
  To: Huang Rui
  Cc: Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Andy Lutomirski, Robert Richter, Jacob Shin,
	Arnaldo Carvalho de Melo, Kan Liang, linux-kernel, x86,
	Suravee Suthikulpanit, Aravind Gopalakrishnan, Borislav Petkov,
	Guenter Roeck, Fengguang Wu

On Fri, 17 Jun 2016, Huang Rui wrote:

> Can you try to read the MSR value two times with the same core
> (rdmsrl_on_cpu)?

I'm reading from userspace using the /dev/cpu/0/msr device so it should 
always be reading from cpu0.

I guess I could code up a custom kernel module to debug this if necessary.

It does look like, for some reason, the 0xc0010280 MSR is only returning 
the lower 24 bits of the PTSC, rather than the 40 bits that 
cpuid 80000008:ecx seems to think it should have.
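If the counter really is truncated like that, a wrap-tolerant delta can
still be computed by masking the subtraction to the counter width. The
sketch below is plain userspace C under that assumption (counter_delta()
is a hypothetical helper, and the 24-bit width is only what I observe
here, not a documented value). It is only valid when the counter wraps
at most once between the two reads; at roughly 100 MHz a 24-bit counter
would wrap about every 168 ms.

```c
#include <stdint.h>

/*
 * Elapsed ticks between two reads of a free-running counter that is
 * only `width` bits wide (1..64). Correct as long as the counter
 * wrapped at most once between the reads.
 */
uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned int width)
{
	uint64_t mask = width >= 64 ? UINT64_MAX
				    : (UINT64_C(1) << width) - 1;

	/* Unsigned subtraction wraps modulo 2^64; the mask reduces it
	 * to modulo 2^width. */
	return (now - prev) & mask;
}
```

For example, counter_delta(0xfffff0, 0x10, 24) yields 0x20 even though
the second raw value is smaller than the first.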

Vince


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-06-17  9:58       ` Huang Rui
@ 2016-07-19 18:22         ` Vince Weaver
  2016-07-20  2:58           ` Huang Rui
  0 siblings, 1 reply; 15+ messages in thread
From: Vince Weaver @ 2016-07-19 18:22 UTC (permalink / raw)
  To: Huang Rui
  Cc: Borislav Petkov, Vince Weaver, Borislav Petkov, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Andy Lutomirski, Robert Richter,
	Jacob Shin, Arnaldo Carvalho de Melo, Kan Liang, linux-kernel,
	x86, Suravee Suthikulpanit, Aravind Gopalakrishnan,
	Guenter Roeck, Fengguang Wu

On Fri, 17 Jun 2016, Huang Rui wrote:

> On Thu, Jun 16, 2016 at 06:47:00PM +0200, Borislav Petkov wrote:
> > On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote:
> > > I was told this feature would be supported on fam15h 60h, 70h and
> > > later processors before. Just checked the fam16h model 30h BKDG, yes,
> > > it should be also supported. But I didn't test that platform, if you
> > > confirm it works in your side. We can enable it.
> > 
> > You might want to ask around first whether F16M30's acc power machinery
> > is even usable? I.e., no errata and whatnot...
> > 
> 
> Yep, I already asked the designers, and was waiting for their
> feedbacks.

Was there ever any feedback about any of the problems encountered with AMD 
APM support?

Thanks,

Vince


* Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
  2016-07-19 18:22         ` Vince Weaver
@ 2016-07-20  2:58           ` Huang Rui
  0 siblings, 0 replies; 15+ messages in thread
From: Huang Rui @ 2016-07-20  2:58 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Borislav Petkov, Borislav Petkov, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Andy Lutomirski, Robert Richter,
	Jacob Shin, Arnaldo Carvalho de Melo, Kan Liang, linux-kernel,
	x86, Suravee Suthikulpanit, Aravind Gopalakrishnan,
	Guenter Roeck, Fengguang Wu

On Tue, Jul 19, 2016 at 02:22:36PM -0400, Vince Weaver wrote:
> On Fri, 17 Jun 2016, Huang Rui wrote:
> 
> > On Thu, Jun 16, 2016 at 06:47:00PM +0200, Borislav Petkov wrote:
> > > On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote:
> > > > I was told this feature would be supported on fam15h 60h, 70h and
> > > > later processors before. Just checked the fam16h model 30h BKDG, yes,
> > > > it should be also supported. But I didn't test that platform, if you
> > > > confirm it works in your side. We can enable it.
> > > 
> > > You might want to ask around first whether F16M30's acc power machinery
> > > is even usable? I.e., no errata and whatnot...
> > > 
> > 
> > Yep, I already asked the designers, and was waiting for their
> > feedbacks.
> 
> Was there ever any feedback about any of the problems encountered with AMD 
> APM support?
> 

Vince, apologies for the late response. We are drafting the erratum for
this feature. I will let you know if it becomes public.

Thanks,
Rui

