linux-kernel.vger.kernel.org archive mirror
* [RESEND PATCH V3 0/8] TopDown metrics support for Icelake
@ 2019-08-26 14:47 kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS kan.liang
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Icelake has support for measuring the level 1 TopDown metrics
directly in hardware. This is implemented by an additional METRICS
register, and a new Fixed Counter 3 that measures pipeline SLOTS.

Four TopDown metric events are exposed as separate perf events, which
map to the internal METRICS register: topdown-retiring,
topdown-bad-spec, topdown-fe-bound and topdown-be-bound.
These events do not exist in hardware, but can be allocated by the
scheduler. The value of a TopDown metric event is calculated by
multiplying the METRICS (percentage) register with the SLOTS fixed
counter.
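As a rough sketch of that calculation (illustrative user-level C, not
part of the patches; the helper name is made up, and the 8-bit-per-metric
layout follows the SDM description):

```c
#include <stdint.h>

/*
 * Illustrative sketch (not kernel code): derive a per-metric event
 * value from the raw METRICS register and the SLOTS count. METRICS
 * holds one 8-bit fraction per metric; the four fractions sum to 0xff.
 */
static uint64_t metric_slots(uint64_t metrics, uint64_t slots, int metric_idx)
{
	/* metric_idx 0..3: retiring, bad-spec, fe-bound, be-bound */
	uint64_t frac = (metrics >> (metric_idx * 8)) & 0xff;

	/* slots-in-metric = (fraction / 0xff) * slots */
	return slots * frac / 0xff;
}
```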

New in Icelake
- Do not require generic counters. This allows TopDown to always be
  collected in addition to other events.
- TopDown can be measured per thread/process instead of only per core.

Limitations
- To get accurate results and avoid reading the METRICS register
  multiple times, the TopDown metric events and the SLOTS event have to
  be in the same group.
- The METRICS and SLOTS registers have to be cleared by SW after each
  read. This prevents loss of precision and works around a known side
  effect of the METRICS register.
- Sampling reads of the SLOTS and TopDown metric events are not
  supported.

Please refer to SDM Vol 3, Section 18.3.9.3 "Performance Metrics" for
the details of the TopDown metrics.

Changes since V2:
- Rebase on top of v5.3-rc1

Key changes since V1:
- Remove the variables for reg_idx and the enabled_events[] array.
  The reg_idx can be calculated from idx at runtime.
  Use the existing active_mask to replace enabled_events.
- Choose value 47 for the fixed index of BTS.
- Support the OVF_PERF_METRICS overflow bit in the PMI handler.
- Drop the caching mechanism and related variables.
  The new mechanism updates all active slots/metrics events for the
  first slots/metrics event in a group. For each group read, it
  still reads the slots/perf_metrics MSR only once.
- Disable the PMU for reads of topdown events to avoid the NMI issue.
- Move RDPMC support to a separate patch.
- Use event=0x00,umask=0x1X for the topdown metrics events.
- Drop the patch which adds the REMOVE transaction.
  x86_pmu_stop() can be detected by checking
  (event && !test_bit(event->hw.idx, cpuc->active_mask)),
  which is a good place to save the slots/metrics MSR value.

Andi Kleen (2):
  perf, tools, stat: Support new per thread TopDown metrics
  perf, tools: Add documentation for topdown metrics

Kan Liang (6):
  perf/x86/intel: Set correct mask for TOPDOWN.SLOTS
  perf/x86/intel: Basic support for metrics counters
  perf/x86/intel: Support hardware TopDown metrics
  perf/x86/intel: Support per thread RDPMC TopDown metrics
  perf/x86/intel: Export TopDown events for Icelake
  perf/x86/intel: Disable sampling read slots and topdown

 arch/x86/events/core.c                 |  35 ++-
 arch/x86/events/intel/core.c           | 362 ++++++++++++++++++++++++-
 arch/x86/events/perf_event.h           |  33 +++
 arch/x86/include/asm/msr-index.h       |   3 +
 arch/x86/include/asm/perf_event.h      |  54 +++-
 include/linux/perf_event.h             |   3 +
 tools/perf/Documentation/perf-stat.txt |   9 +-
 tools/perf/Documentation/topdown.txt   | 223 +++++++++++++++
 tools/perf/builtin-stat.c              |  24 ++
 tools/perf/util/stat-shadow.c          |  89 ++++++
 tools/perf/util/stat.c                 |   4 +
 tools/perf/util/stat.h                 |   8 +
 12 files changed, 827 insertions(+), 20 deletions(-)
 create mode 100644 tools/perf/Documentation/topdown.txt

-- 
2.17.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-28  7:48   ` Peter Zijlstra
  2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

TOPDOWN.SLOTS (0x0400) is not a generic event. It is only available on
fixed counter 3.

Don't extend its mask to generic counters.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 6 ++++--
 arch/x86/include/asm/perf_event.h | 5 +++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9e911a96972b..1b2c37ed49db 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5081,12 +5081,14 @@ __init int intel_pmu_init(void)
 
 	if (x86_pmu.event_constraints) {
 		/*
-		 * event on fixed counter2 (REF_CYCLES) only works on this
+		 * event on fixed counter2 (REF_CYCLES) and
+		 * fixed counter3 (TOPDOWN.SLOTS) only work on this
 		 * counter, so do not extend mask to generic counters
 		 */
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
 			if (c->cmask == FIXED_EVENT_FLAGS
-			    && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES) {
+			    && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES
+			    && c->idxmsk64 != INTEL_PMC_MSK_FIXED_SLOTS) {
 				c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
 			}
 			c->idxmsk64 &=
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e6e8d6..457d35a75ad3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -167,6 +167,11 @@ struct x86_pmu_capability {
 #define INTEL_PMC_IDX_FIXED_REF_CYCLES	(INTEL_PMC_IDX_FIXED + 2)
 #define INTEL_PMC_MSK_FIXED_REF_CYCLES	(1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
 
+/* TOPDOWN.SLOTS: */
+#define MSR_ARCH_PERFMON_FIXED_CTR3	0x30c
+#define INTEL_PMC_IDX_FIXED_SLOTS	(INTEL_PMC_IDX_FIXED + 3)
+#define INTEL_PMC_MSK_FIXED_SLOTS	(1ULL << INTEL_PMC_IDX_FIXED_SLOTS)
+
 /*
  * We model BTS tracing as another fixed-mode PMC.
  *
-- 
2.17.1



* [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-28  7:48   ` Peter Zijlstra
                     ` (3 more replies)
  2019-08-26 14:47 ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics kan.liang
                   ` (5 subsequent siblings)
  7 siblings, 4 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Metrics counters (hardware counters containing multiple metrics)
are modeled as separate registers for each TopDown metric event,
with an extra reg being used to coordinate access to the
underlying register in the scheduler.

Add the basic infrastructure to separate the scheduler register indexes
from the actual hardware register indexes. In most cases the MSR
address is already used correctly, but for code using indexes we need
to calculate the correct underlying register.

The TopDown metric events share fixed counter 3, which only needs to be
enabled/disabled once for all of them.

Move the BTS index to 47, because bit 48 in PERF_GLOBAL_STATUS is now
used to indicate the overflow status of the PERF_METRICS counters.

Naming:
The events which use the Metrics counters are called TopDown metric
events, or metric events, in the code.
The fixed counter 3 event is called the TopDown slots event, or slots
event.
"Topdown events" stands for the metric events plus the slots event in
the code.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c            | 16 ++++++++--
 arch/x86/events/intel/core.c      | 28 ++++++++++++------
 arch/x86/events/perf_event.h      | 14 +++++++++
 arch/x86/include/asm/msr-index.h  |  1 +
 arch/x86/include/asm/perf_event.h | 49 +++++++++++++++++++++++++++++--
 5 files changed, 95 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 81b005e4c7d9..54534ff00940 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1033,18 +1033,30 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 				struct cpu_hw_events *cpuc, int i)
 {
 	struct hw_perf_event *hwc = &event->hw;
+	int reg_idx;
 
 	hwc->idx = cpuc->assign[i];
 	hwc->last_cpu = smp_processor_id();
 	hwc->last_tag = ++cpuc->tags[i];
 
+	/*
+	 * Metrics counters use different indexes in the scheduler
+	 * versus the hardware.
+	 *
+	 * Map metrics to fixed counter 3 (which is the base count),
+	 * but the update event callback reads the extra metric register
+	 * and converts to the right metric.
+	 */
+	reg_idx = get_reg_idx(hwc->idx);
+
 	if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
 		hwc->config_base = 0;
 		hwc->event_base	= 0;
 	} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
 		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
-		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
-		hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
+				  (reg_idx - INTEL_PMC_IDX_FIXED);
+		hwc->event_base_rdpmc = (reg_idx - INTEL_PMC_IDX_FIXED) | 1<<30;
 	} else {
 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
 		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1b2c37ed49db..f4d6335a18e2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2131,7 +2131,7 @@ static inline void intel_pmu_ack_status(u64 ack)
 
 static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
 {
-	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
+	int idx = get_reg_idx(hwc->idx) - INTEL_PMC_IDX_FIXED;
 	u64 ctrl_val, mask;
 
 	mask = 0xfULL << (idx * 4);
@@ -2150,6 +2150,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	int reg_idx = get_reg_idx(hwc->idx);
 
 	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
 		intel_pmu_disable_bts();
@@ -2157,9 +2158,16 @@ static void intel_pmu_disable_event(struct perf_event *event)
 		return;
 	}
 
-	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
-	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
-	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
+	/*
+	 * When any other topdown events are still enabled,
+	 * cancel the disabling.
+	 */
+	if (has_other_topdown_event(cpuc->active_mask, hwc->idx))
+		return;
+
+	cpuc->intel_ctrl_guest_mask &= ~(1ull << reg_idx);
+	cpuc->intel_ctrl_host_mask &= ~(1ull << reg_idx);
+	cpuc->intel_cp_status &= ~(1ull << reg_idx);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
 		intel_pmu_disable_fixed(hwc);
@@ -2193,7 +2201,7 @@ static void intel_pmu_read_event(struct perf_event *event)
 static void intel_pmu_enable_fixed(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
-	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
+	int idx = get_reg_idx(hwc->idx) - INTEL_PMC_IDX_FIXED;
 	u64 ctrl_val, mask, bits = 0;
 
 	/*
@@ -2232,6 +2240,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	int reg_idx = get_reg_idx(hwc->idx);
 
 	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
 		if (!__this_cpu_read(cpu_hw_events.enabled))
@@ -2242,18 +2251,19 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	}
 
 	if (event->attr.exclude_host)
-		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
+		cpuc->intel_ctrl_guest_mask |= (1ull << reg_idx);
 	if (event->attr.exclude_guest)
-		cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
+		cpuc->intel_ctrl_host_mask |= (1ull << reg_idx);
 
 	if (unlikely(event_is_checkpointed(event)))
-		cpuc->intel_cp_status |= (1ull << hwc->idx);
+		cpuc->intel_cp_status |= (1ull << reg_idx);
 
 	if (unlikely(event->attr.precise_ip))
 		intel_pmu_pebs_enable(event);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
-		intel_pmu_enable_fixed(event);
+		if (!has_other_topdown_event(cpuc->active_mask, hwc->idx))
+			intel_pmu_enable_fixed(event);
 		return;
 	}
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8751008fc170..37f17f55ef2d 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -348,6 +348,20 @@ struct cpu_hw_events {
 #define FIXED_EVENT_CONSTRAINT(c, n)	\
 	EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
 
+/*
+ * Special metric counters do not actually exist, but get remapped
+ * to a combination of FxCtr3 + MSR_PERF_METRICS
+ *
+ * This allocates them to a dummy offset for the scheduler.
+ * This does not allow sharing of multiple users of the same
+ * metric without multiplexing, even though the hardware supports that
+ * in principle.
+ */
+
+#define METRIC_EVENT_CONSTRAINT(c, n)					\
+	EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_FIXED_METRIC_BASE+n)), \
+			 FIXED_EVENT_FLAGS)
+
 /*
  * Constraint on the Event code + UMask
  */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 6b4fc2788078..78f3a5ebc1e2 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -793,6 +793,7 @@
 #define MSR_CORE_PERF_FIXED_CTR0	0x00000309
 #define MSR_CORE_PERF_FIXED_CTR1	0x0000030a
 #define MSR_CORE_PERF_FIXED_CTR2	0x0000030b
+#define MSR_CORE_PERF_FIXED_CTR3	0x0000030c
 #define MSR_CORE_PERF_FIXED_CTR_CTRL	0x0000038d
 #define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e
 #define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 457d35a75ad3..88a506312a10 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -175,11 +175,56 @@ struct x86_pmu_capability {
 /*
  * We model BTS tracing as another fixed-mode PMC.
  *
- * We choose a value in the middle of the fixed event range, since lower
+ * We choose value 47 for the fixed index of BTS, since lower
  * values are used by actual fixed events and higher values are used
  * to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
  */
-#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 15)
+
+/*
+ * We model PERF_METRICS as more magic fixed-mode PMCs, one for each metric
+ *
+ * Internally they all map to Fixed Ctr 3 (SLOTS), and allocate PERF_METRICS
+ * as an extra_reg. PERF_METRICS has no own configuration, but we fill in
+ * the configuration of FxCtr3 to enforce that all the shared users of SLOTS
+ * have the same configuration.
+ */
+#define INTEL_PMC_IDX_FIXED_METRIC_BASE		(INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_TD_RETIRING		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 0)
+#define INTEL_PMC_IDX_TD_BAD_SPEC		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 1)
+#define INTEL_PMC_IDX_TD_FE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 2)
+#define INTEL_PMC_IDX_TD_BE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 3)
+#define INTEL_PMC_MSK_TOPDOWN			((0xfull << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
+						 INTEL_PMC_MSK_FIXED_SLOTS)
+
+
+static inline bool is_metric_idx(int idx)
+{
+	return (unsigned)(idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) < 4;
+}
+
+static inline bool is_topdown_idx(int idx)
+{
+	return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
+}
+
+static inline int get_reg_idx(int idx)
+{
+	return is_metric_idx(idx) ? INTEL_PMC_IDX_FIXED_SLOTS : idx;
+}
+
+#define INTEL_PMC_CLEAR_TOPDOWN_BIT(bit)	(~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
+/*
+ * Check if any topdown events are still enabled.
+ *
+ * For non topdown events, always return false.
+ */
+static inline bool has_other_topdown_event(unsigned long *active_mask,
+						 int idx)
+{
+	return is_topdown_idx(idx) &&
+	       (*(u64 *)active_mask & INTEL_PMC_CLEAR_TOPDOWN_BIT(idx));
+}
 
 #define GLOBAL_STATUS_COND_CHG				BIT_ULL(63)
 #define GLOBAL_STATUS_BUFFER_OVF			BIT_ULL(62)
-- 
2.17.1



* [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-28 15:02   ` Peter Zijlstra
                     ` (2 more replies)
  2019-08-26 14:47 ` [RESEND PATCH V3 4/8] perf/x86/intel: Support per thread RDPMC " kan.liang
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Intro
=====

Icelake has support for measuring the four top level TopDown metrics
directly in hardware. This is implemented by an additional "metrics"
register, and a new Fixed Counter 3 that measures pipeline "slots".

Events
======

We export four metric events as separate perf events, which map to the
internal "metrics" counter register. These events do not exist in
hardware, but can be allocated by the scheduler.

For the event mapping we use a special 0x00 event code, which is
reserved for fake events. The metric events start from umask 0x10.

When setting up such events, they point to the slots counter, and a
special callback, update_topdown_event(), reads the additional metrics
MSR to generate the metrics. The metric is then reported by multiplying
the metric (percentage) with slots.

This multiplication makes it easy to keep a running count, for example
when the slots counter overflows, and makes all the standard tools,
such as perf stat, work. They can compute deltas of the values without
needing to know about percentages. This also simplifies accumulating
the counts of child events, which otherwise would need to know how to
average percent values.

None of the four metric events supports sampling. Since they are
handled specially on event update, a flag, PERF_X86_EVENT_TOPDOWN, is
introduced to indicate this case.

The slots event supports both sampling and counting.
For counting, the flag is also applied.
For sampling, it is handled normally like other events.

Groups
======

To avoid reading the METRICS register multiple times, the metrics and
slots values can only be updated by the first slots/metrics event in a
group. All active slots and metrics events are updated in one pass.
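A minimal user-space sketch of the grouping and raw encodings described
above (illustrative only; attribute fields beyond type/config are
assumptions, and the final perf tool syntax may differ):

```c
#include <linux/perf_event.h>
#include <string.h>

/* Illustrative sketch: build raw attrs for a slots + metrics group.
 * Encodings follow the text above: 0x0400 = SLOTS, and
 * event=0x00,umask=0x10..0x13 -> configs 0x1000..0x1300 for the
 * four metric events. */
static struct perf_event_attr topdown_attr(unsigned long long config)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;
	attr.config = config;
	return attr;
}

/* The SLOTS event (0x0400) would be opened first as the group leader;
 * each metric event (e.g. 0x1000 for topdown-retiring) would then be
 * opened with group_fd set to the leader, per the grouping rule. */
```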

Reset
======

The PERF_METRICS register and fixed counter 3 have to be reset for each
read, because:
- The 8-bit metrics ratio values lose precision when the measurement
  period gets longer.
- PERF_METRICS may report a wrong value if its delta was less than
  1/255 of SLOTS (fixed counter 3).

Also, for counting, -max_period would be the initial value of SLOTS.
Such a huge initial value would definitely trigger the issue mentioned
above, so force the initial value to 0 for topdown and slots event
counting.
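To make the precision concern concrete (my own arithmetic, not from the
patch): one LSB of an 8-bit metric fraction corresponds to slots/255
slots, so the quantization error grows with the measurement period:

```c
#include <stdint.h>

/* Slots represented by one 1/255 step of the 8-bit metric fraction.
 * Resetting PERF_METRICS and fixed counter 3 after each read keeps
 * the accumulated SLOTS, and hence this error, small. */
static uint64_t metric_lsb_in_slots(uint64_t slots)
{
	return slots / 255;
}
```

At 10^9 accumulated slots, one LSB already spans roughly 3.9 million
slots.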

NMI
======

The METRICS register may overflow, in which case bit 48 of the STATUS
register is set. If so, update all active slots and metrics events.

update_topdown_event() has to read two registers separately. The values
may be modified by an NMI, so the PMU has to be disabled before calling
the function.

RDPMC
======

RDPMC is temporarily disabled. The following patch will enable it.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c           |  10 ++
 arch/x86/events/intel/core.c     | 230 ++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h     |  17 +++
 arch/x86/include/asm/msr-index.h |   2 +
 4 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 54534ff00940..1ae23db5c2d7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
 	if (idx == INTEL_PMC_IDX_FIXED_BTS)
 		return 0;
 
+	if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+		return x86_pmu.update_topdown_event(event);
 	/*
 	 * Careful: an NMI might modify the previous event value.
 	 *
@@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
 
 	max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
 
+	/* There are 4 TopDown metrics events. */
+	if (x86_pmu.intel_cap.perf_metrics)
+		max_count += 4;
+
 	/* current number of events already accepted */
 	n = cpuc->n_events;
 
@@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
 	if (idx == INTEL_PMC_IDX_FIXED_BTS)
 		return 0;
 
+	if (unlikely(is_topdown_count(event)) &&
+	    x86_pmu.set_topdown_event_period)
+		return x86_pmu.set_topdown_event_period(event);
+
 	/*
 	 * If we are way outside a reasonable range then just skip forward:
 	 */
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f4d6335a18e2..616313d7f3d7 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -247,6 +247,10 @@ static struct event_constraint intel_icl_event_constraints[] = {
 	FIXED_EVENT_CONSTRAINT(0x003c, 1),	/* CPU_CLK_UNHALTED.CORE */
 	FIXED_EVENT_CONSTRAINT(0x0300, 2),	/* CPU_CLK_UNHALTED.REF */
 	FIXED_EVENT_CONSTRAINT(0x0400, 3),	/* SLOTS */
+	METRIC_EVENT_CONSTRAINT(0x1000, 0),	/* Retiring metric */
+	METRIC_EVENT_CONSTRAINT(0x1100, 1),	/* Bad speculation metric */
+	METRIC_EVENT_CONSTRAINT(0x1200, 2),	/* FE bound metric */
+	METRIC_EVENT_CONSTRAINT(0x1300, 3),	/* BE bound metric */
 	INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
 	INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
 	INTEL_EVENT_CONSTRAINT(0x32, 0xf),	/* SW_PREFETCH_ACCESS.* */
@@ -267,6 +271,14 @@ static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
 	INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
 	INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
 	INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
+	/*
+	 * The original Fixed Ctr 3 are shared from different metrics
+	 * events. So use the extra reg to enforce the same
+	 * configuration on the original register, but do not actually
+	 * write to it.
+	 */
+	INTEL_UEVENT_EXTRA_REG(0x0400, 0, -1L, TOPDOWN),
+	INTEL_UEVENT_TOPDOWN_EXTRA_REG(0x1000),
 	EVENT_EXTRA_END
 };
 
@@ -2190,10 +2202,163 @@ static void intel_pmu_del_event(struct perf_event *event)
 		intel_pmu_pebs_del(event);
 }
 
+static inline bool is_metric_event(struct perf_event *event)
+{
+	return ((event->attr.config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
+		((event->attr.config & INTEL_ARCH_EVENT_MASK) >= 0x1000)  &&
+		((event->attr.config & INTEL_ARCH_EVENT_MASK) <= 0x1300);
+}
+
+static inline bool is_slots_event(struct perf_event *event)
+{
+	return (event->attr.config & INTEL_ARCH_EVENT_MASK) == 0x0400;
+}
+
+static inline bool is_topdown_event(struct perf_event *event)
+{
+	return is_metric_event(event) || is_slots_event(event);
+}
+
+static bool is_first_topdown_event_in_group(struct perf_event *event)
+{
+	struct perf_event *first = NULL;
+
+	if (is_topdown_event(event->group_leader))
+		first = event->group_leader;
+	else {
+		for_each_sibling_event(first, event->group_leader)
+			if (is_topdown_event(first))
+				break;
+	}
+
+	if (event == first)
+		return true;
+
+	return false;
+}
+
+static int icl_set_topdown_event_period(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	s64 left = local64_read(&hwc->period_left);
+
+	/*
+	 * Clear PERF_METRICS and Fixed counter 3 in initialization.
+	 * After that, both MSRs will be cleared for each read.
+	 * Don't need to clear them again.
+	 */
+	if (left == x86_pmu.max_period) {
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+		wrmsrl(MSR_PERF_METRICS, 0);
+		local64_set(&hwc->period_left, 0);
+	}
+
+	perf_event_update_userpage(event);
+
+	return 0;
+}
+
+static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
+{
+	u32 val;
+
+	/*
+	 * The metric is reported as an 8bit integer percentage
+	 * suming up to 0xff.
+	 * slots-in-metric = (Metric / 0xff) * slots
+	 */
+	val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
+	return  mul_u64_u32_div(slots, val, 0xff);
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+				       u64 slots, u64 metrics)
+{
+	int idx = event->hw.idx;
+	u64 delta;
+
+	if (is_metric_idx(idx))
+		delta = icl_get_metrics_event_value(metrics, slots, idx);
+	else
+		delta = slots;
+
+	local64_add(delta, &event->count);
+}
+
+/*
+ * Update all active Topdown events.
+ * PMU has to be disabled before calling this function.
+ */
+static u64 icl_update_topdown_event(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_event *other;
+	u64 slots, metrics;
+	int idx;
+
+	/*
+	 * Only need to update all events for the first
+	 * slots/metrics event in a group
+	 */
+	if (event && !is_first_topdown_event_in_group(event))
+		return 0;
+
+	/* read Fixed counter 3 */
+	rdpmcl((3 | 1<<30), slots);
+	if (!slots)
+		return 0;
+
+	/* read PERF_METRICS */
+	rdpmcl((1<<29), metrics);
+
+	for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+		if (!is_topdown_idx(idx))
+			continue;
+		other = cpuc->events[idx];
+		__icl_update_topdown_event(other, slots, metrics);
+	}
+
+	/*
+	 * Check and update this event, which may have been cleared
+	 * in active_mask e.g. x86_pmu_stop()
+	 */
+	if (event && !test_bit(event->hw.idx, cpuc->active_mask))
+		__icl_update_topdown_event(event, slots, metrics);
+
+	/*
+	 * To avoid the known issues as below, the PERF_METRICS and
+	 * Fixed counter 3 are reset for each read.
+	 * - The 8bit metrics ratio values lose precision when the
+	 *   measurement period gets longer.
+	 * - The PERF_METRICS may report wrong value if its delta was
+	 *   less than 1/255 of Fixed counter 3.
+	 */
+	wrmsrl(MSR_PERF_METRICS, 0);
+	wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+
+	return slots;
+}
+
+static void intel_pmu_read_topdown_event(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/* Only need to call update_topdown_event() once for group read. */
+	if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
+	    !is_first_topdown_event_in_group(event))
+		return;
+
+	perf_pmu_disable(event->pmu);
+	x86_pmu.update_topdown_event(event);
+	perf_pmu_enable(event->pmu);
+}
+
 static void intel_pmu_read_event(struct perf_event *event)
 {
 	if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
 		intel_pmu_auto_reload_read(event);
+	else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
+		intel_pmu_read_topdown_event(event);
 	else
 		x86_perf_event_update(event);
 }
@@ -2401,6 +2566,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 			intel_pt_interrupt();
 	}
 
+	/*
+	 * Intel Perf metrics
+	 */
+	if (__test_and_clear_bit(48, (unsigned long *)&status)) {
+		handled++;
+		if (x86_pmu.update_topdown_event)
+			x86_pmu.update_topdown_event(NULL);
+	}
+
 	/*
 	 * Checkpointed counters can lead to 'spurious' PMIs because the
 	 * rollback caused by the PMI will have cleared the overflow status
@@ -3312,6 +3486,42 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_RAW)
 		return 0;
 
+	/*
+	 * Config Topdown slots and metric events
+	 *
+	 * The slots event on Fixed Counter 3 can support sampling,
+	 * which will be handled normally in x86_perf_event_update().
+	 *
+	 * The metric events don't support sampling.
+	 *
+	 * For counting, topdown slots and metric events will be
+	 * handled specially for event update.
+	 * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
+	 */
+	if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
+		if (is_metric_event(event) && is_sampling_event(event))
+			return -EINVAL;
+
+		if (!is_sampling_event(event)) {
+			if (event->attr.config1 != 0)
+				return -EINVAL;
+			if (event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)
+				return -EINVAL;
+			/*
+			 * Put configuration (minus event) into config1 so that
+			 * the scheduler enforces through an extra_reg that
+			 * all instances of the metrics events have the same
+			 * configuration.
+			 */
+			event->attr.config1 = event->hw.config &
+					      X86_ALL_EVENT_FLAGS;
+			event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
+
+			if (is_metric_event(event))
+				event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
+		}
+	}
+
 	if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
 		return 0;
 
@@ -5040,6 +5250,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
 		x86_pmu.lbr_pt_coexist = true;
 		intel_pmu_pebs_data_source_skl(pmem);
+		x86_pmu.update_topdown_event = icl_update_topdown_event;
+		x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
 		pr_cont("Icelake events, ");
 		name = "icelake";
 		break;
@@ -5096,10 +5308,17 @@ __init int intel_pmu_init(void)
 		 * counter, so do not extend mask to generic counters
 		 */
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
-			if (c->cmask == FIXED_EVENT_FLAGS
-			    && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES
-			    && c->idxmsk64 != INTEL_PMC_MSK_FIXED_SLOTS) {
-				c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
+			if (c->cmask == FIXED_EVENT_FLAGS) {
+				/*
+				 * Don't extend topdown slots and metrics
+				 * events to generic counters.
+				 */
+				if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
+					c->weight = hweight64(c->idxmsk64);
+					continue;
+				}
+				if (c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES)
+					c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
 			}
 			c->idxmsk64 &=
 				~(~0ULL << (INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed));
@@ -5152,6 +5371,9 @@ __init int intel_pmu_init(void)
 	if (x86_pmu.counter_freezing)
 		x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
 
+	if (x86_pmu.intel_cap.perf_metrics)
+		x86_pmu.intel_ctrl |= 1ULL << 48;
+
 	return 0;
 }
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 37f17f55ef2d..7c59f08fadc0 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -40,6 +40,7 @@ enum extra_reg_type {
 	EXTRA_REG_LBR   = 2,	/* lbr_select */
 	EXTRA_REG_LDLAT = 3,	/* ld_lat_threshold */
 	EXTRA_REG_FE    = 4,    /* fe_* */
+	EXTRA_REG_TOPDOWN = 5,	/* Topdown slots/metrics */
 
 	EXTRA_REG_MAX		/* number of entries needed */
 };
@@ -76,6 +77,12 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
 #define PERF_X86_EVENT_EXCL_ACCT	0x0100 /* accounted EXCL event */
 #define PERF_X86_EVENT_AUTO_RELOAD	0x0200 /* use PEBS auto-reload */
 #define PERF_X86_EVENT_LARGE_PEBS	0x0400 /* use large PEBS */
+#define PERF_X86_EVENT_TOPDOWN		0x0800 /* Count Topdown slots/metrics events */
+
+static inline bool is_topdown_count(struct perf_event *event)
+{
+	return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
+}
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
@@ -509,6 +516,9 @@ struct extra_reg {
 			       0xffff, \
 			       LDLAT)
 
+#define INTEL_UEVENT_TOPDOWN_EXTRA_REG(event)	\
+	EVENT_EXTRA_REG(event, 0, 0xfcff, -1L, TOPDOWN)
+
 #define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
 
 union perf_capabilities {
@@ -524,6 +534,7 @@ union perf_capabilities {
 		 */
 		u64	full_width_write:1;
 		u64     pebs_baseline:1;
+		u64	perf_metrics:1;
 	};
 	u64	capabilities;
 };
@@ -686,6 +697,12 @@ struct x86_pmu {
 	 */
 	atomic_t	lbr_exclusive[x86_lbr_exclusive_max];
 
+	/*
+	 * Intel perf metrics
+	 */
+	u64		(*update_topdown_event)(struct perf_event *event);
+	int		(*set_topdown_event_period)(struct perf_event *event);
+
 	/*
 	 * AMD bits
 	 */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 78f3a5ebc1e2..460a419a7214 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -118,6 +118,8 @@
 #define MSR_TURBO_RATIO_LIMIT1		0x000001ae
 #define MSR_TURBO_RATIO_LIMIT2		0x000001af
 
+#define MSR_PERF_METRICS		0x00000329
+
 #define MSR_LBR_SELECT			0x000001c8
 #define MSR_LBR_TOS			0x000001c9
 #define MSR_LBR_NHM_FROM		0x00000680
-- 
2.17.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RESEND PATCH V3 4/8] perf/x86/intel: Support per thread RDPMC TopDown metrics
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
                   ` (2 preceding siblings ...)
  2019-08-26 14:47 ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 5/8] perf/x86/intel: Export TopDown events for Icelake kan.liang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

With Icelake CPUs, the TopDown metrics are directly available as fixed
counters and do not require generic counters, which makes it possible to
measure TopDown per thread/process instead of only per core.

The metrics and slots values have to be saved/restored during context
switching.
The saved values are also used as previous values to calculate the
delta.

The PERF_METRICS MSR value will be returned when RDPMC reads a metrics
event.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       |   5 +-
 arch/x86/events/intel/core.c | 102 ++++++++++++++++++++++++++++-------
 include/linux/perf_event.h   |   3 ++
 3 files changed, 91 insertions(+), 19 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1ae23db5c2d7..77573fa379dc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2150,7 +2150,10 @@ static int x86_pmu_event_idx(struct perf_event *event)
 	if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
 		return 0;
 
-	if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
+	/* Return PERF_METRICS MSR value for metrics event */
+	if (is_metric_idx(idx))
+		idx = 1 << 29;
+	else if (x86_pmu.num_counters_fixed && idx >= INTEL_PMC_IDX_FIXED) {
 		idx -= INTEL_PMC_IDX_FIXED;
 		idx |= 1 << 30;
 	}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 616313d7f3d7..9dc24cebb68c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2253,6 +2253,11 @@ static int icl_set_topdown_event_period(struct perf_event *event)
 		local64_set(&hwc->period_left, 0);
 	}
 
+	if ((hwc->saved_slots) && is_first_topdown_event_in_group(event)) {
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, hwc->saved_slots);
+		wrmsrl(MSR_PERF_METRICS, hwc->saved_metric);
+	}
+
 	perf_event_update_userpage(event);
 
 	return 0;
@@ -2271,7 +2276,7 @@ static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
 	return  mul_u64_u32_div(slots, val, 0xff);
 }
 
-static void __icl_update_topdown_event(struct perf_event *event,
+static u64 icl_get_topdown_value(struct perf_event *event,
 				       u64 slots, u64 metrics)
 {
 	int idx = event->hw.idx;
@@ -2282,7 +2287,50 @@ static void __icl_update_topdown_event(struct perf_event *event,
 	else
 		delta = slots;
 
-	local64_add(delta, &event->count);
+	return delta;
+}
+
+static void __icl_update_topdown_event(struct perf_event *event,
+				       u64 slots, u64 metrics,
+				       u64 last_slots, u64 last_metrics)
+{
+	u64 delta, last = 0;
+
+	delta = icl_get_topdown_value(event, slots, metrics);
+	if (last_slots)
+		last = icl_get_topdown_value(event, last_slots, last_metrics);
+
+	/*
+	 * The 8-bit integer percentage of a metric may not be accurate,
+	 * especially when the change is very small.
+	 * For example, if only a few bad_spec events happen, the percentage
+	 * may be reduced from 1% to 0%. If so, the bad_spec event value
+	 * will be 0, which is definitely less than the last value.
+	 * Avoid updating event->count in this case.
+	 */
+	if (delta > last) {
+		delta -= last;
+		local64_add(delta, &event->count);
+	}
+}
+
+static void update_saved_topdown_regs(struct perf_event *event,
+				      u64 slots, u64 metrics)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_event *other;
+	int idx;
+
+	event->hw.saved_slots = slots;
+	event->hw.saved_metric = metrics;
+
+	for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
+		if (!is_topdown_idx(idx))
+			continue;
+		other = cpuc->events[idx];
+		other->hw.saved_slots = slots;
+		other->hw.saved_metric = metrics;
+	}
 }
 
 /*
@@ -2294,6 +2342,7 @@ static u64 icl_update_topdown_event(struct perf_event *event)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_event *other;
 	u64 slots, metrics;
+	bool reset = true;
 	int idx;
 
 	/*
@@ -2315,26 +2364,46 @@ static u64 icl_update_topdown_event(struct perf_event *event)
 		if (!is_topdown_idx(idx))
 			continue;
 		other = cpuc->events[idx];
-		__icl_update_topdown_event(other, slots, metrics);
+		__icl_update_topdown_event(other, slots, metrics,
+					   event ? event->hw.saved_slots : 0,
+					   event ? event->hw.saved_metric : 0);
 	}
 
 	/*
 	 * Check and update this event, which may have been cleared
 	 * in active_mask e.g. x86_pmu_stop()
 	 */
-	if (event && !test_bit(event->hw.idx, cpuc->active_mask))
-		__icl_update_topdown_event(event, slots, metrics);
+	if (event && !test_bit(event->hw.idx, cpuc->active_mask)) {
+		__icl_update_topdown_event(event, slots, metrics,
+					   event->hw.saved_slots,
+					   event->hw.saved_metric);
 
-	/*
-	 * To avoid the known issues as below, the PERF_METRICS and
-	 * Fixed counter 3 are reset for each read.
-	 * - The 8bit metrics ratio values lose precision when the
-	 *   measurement period gets longer.
-	 * - The PERF_METRICS may report wrong value if its delta was
-	 *   less than 1/255 of Fixed counter 3.
-	 */
-	wrmsrl(MSR_PERF_METRICS, 0);
-	wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+		/*
+		 * In x86_pmu_stop(), the event is cleared in active_mask
+		 * first, then the delta is drained, which indicates a
+		 * context switch for counting.
+		 * Save the metric and slots values for the context switch.
+		 * There is no need to reset PERF_METRICS and Fixed counter 3,
+		 * because the values will be restored on the next schedule in.
+		 */
+		update_saved_topdown_regs(event, slots, metrics);
+		reset = false;
+	}
+
+	if (reset) {
+		/*
+		 * To avoid the known issues as below, the PERF_METRICS and
+		 * Fixed counter 3 are reset for each read.
+		 * - The 8bit metrics ratio values lose precision when the
+		 *   measurement period gets longer.
+		 * - The PERF_METRICS may report wrong value if its delta was
+		 *   less than 1/255 of Fixed counter 3.
+		 */
+		wrmsrl(MSR_PERF_METRICS, 0);
+		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
+		if (event)
+			update_saved_topdown_regs(event, 0, 0);
+	}
 
 	return slots;
 }
@@ -3516,9 +3585,6 @@ static int intel_pmu_hw_config(struct perf_event *event)
 			event->attr.config1 = event->hw.config &
 					      X86_ALL_EVENT_FLAGS;
 			event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
-
-			if (is_metric_event(event))
-				event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
 		}
 	}
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e8ad3c590a23..f1cd57a5fa55 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -133,6 +133,9 @@ struct hw_perf_event {
 
 			struct hw_perf_event_extra extra_reg;
 			struct hw_perf_event_extra branch_reg;
+
+			u64		saved_slots;
+			u64		saved_metric;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;
-- 
2.17.1



* [RESEND PATCH V3 5/8] perf/x86/intel: Export TopDown events for Icelake
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
                   ` (3 preceding siblings ...)
  2019-08-26 14:47 ` [RESEND PATCH V3 4/8] perf/x86/intel: Support per thread RDPMC " kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 6/8] perf/x86/intel: Disable sampling read slots and topdown kan.liang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Export the new TopDown metrics events, which map to the sub-metrics in
the METRICS register, plus an event for the new slots fixed counter.
This makes the new fixed counters in Icelake visible to the perf
user tools.

Originally-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9dc24cebb68c..7925cf8830eb 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -321,6 +321,12 @@ EVENT_ATTR_STR_HT(topdown-recovery-bubbles, td_recovery_bubbles,
 EVENT_ATTR_STR_HT(topdown-recovery-bubbles.scale, td_recovery_bubbles_scale,
 	"4", "2");
 
+EVENT_ATTR_STR(slots,			slots,		"event=0x00,umask=0x4");
+EVENT_ATTR_STR(topdown-retiring,	td_retiring,	"event=0x00,umask=0x10");
+EVENT_ATTR_STR(topdown-bad-spec,	td_bad_spec,	"event=0x00,umask=0x11");
+EVENT_ATTR_STR(topdown-fe-bound,	td_fe_bound,	"event=0x00,umask=0x12");
+EVENT_ATTR_STR(topdown-be-bound,	td_be_bound,	"event=0x00,umask=0x13");
+
 static struct attribute *snb_events_attrs[] = {
 	EVENT_PTR(td_slots_issued),
 	EVENT_PTR(td_slots_retired),
@@ -4544,6 +4550,15 @@ static struct attribute *icl_events_attrs[] = {
 	NULL,
 };
 
+static struct attribute *icl_td_events_attrs[] = {
+	EVENT_PTR(slots),
+	EVENT_PTR(td_retiring),
+	EVENT_PTR(td_bad_spec),
+	EVENT_PTR(td_fe_bound),
+	EVENT_PTR(td_be_bound),
+	NULL,
+};
+
 static struct attribute *icl_tsx_events_attrs[] = {
 	EVENT_PTR(tx_start),
 	EVENT_PTR(tx_abort),
@@ -5312,6 +5327,7 @@ __init int intel_pmu_init(void)
 			hsw_format_attr : nhm_format_attr;
 		extra_skl_attr = skl_format_attr;
 		mem_attr = icl_events_attrs;
+		td_attr = icl_td_events_attrs;
 		tsx_attr = icl_tsx_events_attrs;
 		x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
 		x86_pmu.lbr_pt_coexist = true;
-- 
2.17.1



* [RESEND PATCH V3 6/8] perf/x86/intel: Disable sampling read slots and topdown
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
                   ` (4 preceding siblings ...)
  2019-08-26 14:47 ` [RESEND PATCH V3 5/8] perf/x86/intel: Export TopDown events for Icelake kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 7/8] perf, tools, stat: Support new per thread TopDown metrics kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 8/8] perf, tools: Add documentation for topdown metrics kan.liang
  7 siblings, 0 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The slots event supports sampling. Users may do a sampling read of the
slots and metrics events together, e.g. perf record -e
'{slots, topdown-retiring}:S'. But reading a metrics event resets fixed
counter 3, which corrupts the sampling of the slots event.

Add specific validate_group() support to reject this case and error out
for Icelake.

An alternative fix would be to unconditionally disable SLOTS sampling,
but that is too coarse: users may want to sample only the slots event,
without any topdown metrics events.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       |  4 ++++
 arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
 arch/x86/events/perf_event.h |  2 ++
 3 files changed, 26 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 77573fa379dc..7fb98890d885 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2057,7 +2057,11 @@ static int validate_group(struct perf_event *event)
 
 	fake_cpuc->n_events = 0;
 	ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
+	if (ret)
+		goto out;
 
+	if (x86_pmu.validate_group)
+		ret = x86_pmu.validate_group(fake_cpuc, n);
 out:
 	free_fake_cpuc(fake_cpuc);
 	return ret;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7925cf8830eb..4e0d60fcd1eb 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4489,6 +4489,25 @@ static __init void intel_ht_bug(void)
 	x86_pmu.stop_scheduling = intel_stop_scheduling;
 }
 
+static int icl_validate_group(struct cpu_hw_events *cpuc, int n)
+{
+	bool has_sampling_slots = false, has_metrics = false;
+	struct perf_event *e;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		e = cpuc->event_list[i];
+		if (is_slots_event(e) && is_sampling_event(e))
+			has_sampling_slots = true;
+
+		if (is_metric_event(e))
+			has_metrics = true;
+	}
+	if (unlikely(has_sampling_slots && has_metrics))
+		return -EINVAL;
+	return 0;
+}
+
 EVENT_ATTR_STR(mem-loads,	mem_ld_hsw,	"event=0xcd,umask=0x1,ldlat=3");
 EVENT_ATTR_STR(mem-stores,	mem_st_hsw,	"event=0xd0,umask=0x82")
 
@@ -5334,6 +5353,7 @@ __init int intel_pmu_init(void)
 		intel_pmu_pebs_data_source_skl(pmem);
 		x86_pmu.update_topdown_event = icl_update_topdown_event;
 		x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
+		x86_pmu.validate_group = icl_validate_group;
 		pr_cont("Icelake events, ");
 		name = "icelake";
 		break;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 7c59f08fadc0..34c6ab5cafe8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -629,6 +629,8 @@ struct x86_pmu {
 	int		perfctr_second_write;
 	u64		(*limit_period)(struct perf_event *event, u64 l);
 
+	int		(*validate_group)(struct cpu_hw_events *cpuc, int n);
+
 	/* PMI handler bits */
 	unsigned int	late_ack		:1,
 			counter_freezing	:1;
-- 
2.17.1



* [RESEND PATCH V3 7/8] perf, tools, stat: Support new per thread TopDown metrics
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
                   ` (5 preceding siblings ...)
  2019-08-26 14:47 ` [RESEND PATCH V3 6/8] perf/x86/intel: Disable sampling read slots and topdown kan.liang
@ 2019-08-26 14:47 ` kan.liang
  2019-08-26 14:47 ` [RESEND PATCH V3 8/8] perf, tools: Add documentation for topdown metrics kan.liang
  7 siblings, 0 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Andi Kleen <ak@linux.intel.com>

Icelake has support for reporting per thread TopDown metrics.
These are reported differently from the previous TopDown support:
each metric is standalone, but scaled to pipeline "slots".
We don't need to do anything special for HyperThreading anymore.
Teach perf stat --topdown to handle these new metrics and
print them in the same way as the previous TopDown metrics.
The restriction of only being able to report information per core
is gone.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/perf-stat.txt |  9 ++-
 tools/perf/builtin-stat.c              | 24 +++++++
 tools/perf/util/stat-shadow.c          | 89 ++++++++++++++++++++++++++
 tools/perf/util/stat.c                 |  4 ++
 tools/perf/util/stat.h                 |  8 +++
 5 files changed, 132 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 930c51c01201..f90101637c76 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -279,8 +279,13 @@ if the workload is actually bound by the CPU and not by something else.
 For best results it is usually a good idea to use it with interval
 mode like -I 1000, as the bottleneck of workloads can change often.
 
-The top down metrics are collected per core instead of per
-CPU thread. Per core mode is automatically enabled
+This enables --metric-only, unless overridden with --no-metric-only.
+
+The following restrictions only apply to older Intel CPUs and Atom,
+on newer CPUs (IceLake and later) TopDown can be collected for any thread:
+
+The top down metrics are collected per core instead of per CPU thread.
+Per core mode is automatically enabled
 and -a (global monitoring) is needed, requiring root rights or
 perf.perf_event_paranoid=-1.
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index b55a534b4de0..69371eea51fc 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -122,6 +122,14 @@ static const char * topdown_attrs[] = {
 	NULL,
 };
 
+static const char *topdown_metric_attrs[] = {
+	"topdown-retiring",
+	"topdown-bad-spec",
+	"topdown-fe-bound",
+	"topdown-be-bound",
+	NULL,
+};
+
 static const char *smi_cost_attrs = {
 	"{"
 	"msr/aperf/,"
@@ -1323,6 +1331,21 @@ static int add_default_attributes(void)
 		char *str = NULL;
 		bool warn = false;
 
+		if (topdown_filter_events(topdown_metric_attrs, &str, 1) < 0) {
+			pr_err("Out of memory\n");
+			return -1;
+		}
+		if (topdown_metric_attrs[0] && str) {
+			if (!stat_config.interval) {
+				fprintf(stat_config.output,
+					"Topdown accuracy may decrease when measuring long periods.\n"
+					"Please print the result regularly, e.g. -I1000\n");
+			}
+			goto setup_metrics;
+		}
+
+		str = NULL;
+
 		if (stat_config.aggr_mode != AGGR_GLOBAL &&
 		    stat_config.aggr_mode != AGGR_CORE) {
 			pr_err("top down event configuration requires --per-core mode\n");
@@ -1344,6 +1367,7 @@ static int add_default_attributes(void)
 		if (topdown_attrs[0] && str) {
 			if (warn)
 				arch_topdown_group_warn();
+setup_metrics:
 			err = parse_events(evsel_list, str, &errinfo);
 			if (err) {
 				fprintf(stderr,
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 656065af4971..f3fe0686e8d5 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -240,6 +240,18 @@ void perf_stat__update_shadow_stats(struct perf_evsel *counter, u64 count,
 	else if (perf_stat_evsel__is(counter, TOPDOWN_RECOVERY_BUBBLES))
 		update_runtime_stat(st, STAT_TOPDOWN_RECOVERY_BUBBLES,
 				    ctx, cpu, count);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_RETIRING))
+		update_runtime_stat(st, STAT_TOPDOWN_RETIRING,
+				    ctx, cpu, count);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_BAD_SPEC))
+		update_runtime_stat(st, STAT_TOPDOWN_BAD_SPEC,
+				    ctx, cpu, count);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_FE_BOUND))
+		update_runtime_stat(st, STAT_TOPDOWN_FE_BOUND,
+				    ctx, cpu, count);
+	else if (perf_stat_evsel__is(counter, TOPDOWN_BE_BOUND))
+		update_runtime_stat(st, STAT_TOPDOWN_BE_BOUND,
+				    ctx, cpu, count);
 	else if (perf_evsel__match(counter, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
 		update_runtime_stat(st, STAT_STALLED_CYCLES_FRONT,
 				    ctx, cpu, count);
@@ -685,6 +697,47 @@ static double td_be_bound(int ctx, int cpu, struct runtime_stat *st)
 	return sanitize_val(1.0 - sum);
 }
 
+/*
+ * Kernel reports metrics multiplied with slots. To get back
+ * the ratios we need to recreate the sum.
+ */
+
+static double td_metric_ratio(int ctx, int cpu,
+			      enum stat_type type,
+			      struct runtime_stat *stat)
+{
+	double sum = runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) +
+		runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) +
+		runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) +
+		runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu);
+	double d = runtime_stat_avg(stat, type, ctx, cpu);
+
+	if (sum)
+		return d / sum;
+	return 0;
+}
+
+/*
+ * ... but only if most of the values are actually available.
+ * We allow two missing.
+ */
+
+static bool full_td(int ctx, int cpu,
+		    struct runtime_stat *stat)
+{
+	int c = 0;
+
+	if (runtime_stat_avg(stat, STAT_TOPDOWN_RETIRING, ctx, cpu) > 0)
+		c++;
+	if (runtime_stat_avg(stat, STAT_TOPDOWN_BE_BOUND, ctx, cpu) > 0)
+		c++;
+	if (runtime_stat_avg(stat, STAT_TOPDOWN_FE_BOUND, ctx, cpu) > 0)
+		c++;
+	if (runtime_stat_avg(stat, STAT_TOPDOWN_BAD_SPEC, ctx, cpu) > 0)
+		c++;
+	return c >= 2;
+}
+
 static void print_smi_cost(struct perf_stat_config *config,
 			   int cpu, struct perf_evsel *evsel,
 			   struct perf_stat_output_ctx *out,
@@ -989,6 +1042,42 @@ void perf_stat__print_shadow_stats(struct perf_stat_config *config,
 					be_bound * 100.);
 		else
 			print_metric(config, ctxp, NULL, NULL, name, 0);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_RETIRING) &&
+			full_td(ctx, cpu, st)) {
+		double retiring = td_metric_ratio(ctx, cpu,
+						  STAT_TOPDOWN_RETIRING, st);
+
+		if (retiring > 0.7)
+			color = PERF_COLOR_GREEN;
+		print_metric(config, ctxp, color, "%8.1f%%", "retiring",
+				retiring * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_FE_BOUND) &&
+			full_td(ctx, cpu, st)) {
+		double fe_bound = td_metric_ratio(ctx, cpu,
+						  STAT_TOPDOWN_FE_BOUND, st);
+
+		if (fe_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "frontend bound",
+				fe_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_BE_BOUND) &&
+			full_td(ctx, cpu, st)) {
+		double be_bound = td_metric_ratio(ctx, cpu,
+						  STAT_TOPDOWN_BE_BOUND, st);
+
+		if (be_bound > 0.2)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "backend bound",
+				be_bound * 100.);
+	} else if (perf_stat_evsel__is(evsel, TOPDOWN_BAD_SPEC) &&
+			full_td(ctx, cpu, st)) {
+		double bad_spec = td_metric_ratio(ctx, cpu,
+						  STAT_TOPDOWN_BAD_SPEC, st);
+
+		if (bad_spec > 0.1)
+			color = PERF_COLOR_RED;
+		print_metric(config, ctxp, color, "%8.1f%%", "bad speculation",
+				bad_spec * 100.);
 	} else if (evsel->metric_expr) {
 		generic_metric(config, evsel->metric_expr, evsel->metric_events, evsel->name,
 				evsel->metric_name, avg, cpu, out, st);
diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index db8a6cf336be..f376e129616d 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -88,6 +88,10 @@ static const char *id_str[PERF_STAT_EVSEL_ID__MAX] = {
 	ID(TOPDOWN_SLOTS_RETIRED, topdown-slots-retired),
 	ID(TOPDOWN_FETCH_BUBBLES, topdown-fetch-bubbles),
 	ID(TOPDOWN_RECOVERY_BUBBLES, topdown-recovery-bubbles),
+	ID(TOPDOWN_RETIRING, topdown-retiring),
+	ID(TOPDOWN_BAD_SPEC, topdown-bad-spec),
+	ID(TOPDOWN_FE_BOUND, topdown-fe-bound),
+	ID(TOPDOWN_BE_BOUND, topdown-be-bound),
 	ID(SMI_NUM, msr/smi/),
 	ID(APERF, msr/aperf/),
 };
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 7032dd1eeac2..80b1d35a89d1 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -29,6 +29,10 @@ enum perf_stat_evsel_id {
 	PERF_STAT_EVSEL_ID__TOPDOWN_SLOTS_RETIRED,
 	PERF_STAT_EVSEL_ID__TOPDOWN_FETCH_BUBBLES,
 	PERF_STAT_EVSEL_ID__TOPDOWN_RECOVERY_BUBBLES,
+	PERF_STAT_EVSEL_ID__TOPDOWN_RETIRING,
+	PERF_STAT_EVSEL_ID__TOPDOWN_BAD_SPEC,
+	PERF_STAT_EVSEL_ID__TOPDOWN_FE_BOUND,
+	PERF_STAT_EVSEL_ID__TOPDOWN_BE_BOUND,
 	PERF_STAT_EVSEL_ID__SMI_NUM,
 	PERF_STAT_EVSEL_ID__APERF,
 	PERF_STAT_EVSEL_ID__MAX,
@@ -82,6 +86,10 @@ enum stat_type {
 	STAT_TOPDOWN_SLOTS_RETIRED,
 	STAT_TOPDOWN_FETCH_BUBBLES,
 	STAT_TOPDOWN_RECOVERY_BUBBLES,
+	STAT_TOPDOWN_RETIRING,
+	STAT_TOPDOWN_BAD_SPEC,
+	STAT_TOPDOWN_FE_BOUND,
+	STAT_TOPDOWN_BE_BOUND,
 	STAT_SMI_NUM,
 	STAT_APERF,
 	STAT_MAX
-- 
2.17.1



* [RESEND PATCH V3 8/8] perf, tools: Add documentation for topdown metrics
  2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
                   ` (6 preceding siblings ...)
  2019-08-26 14:47 ` [RESEND PATCH V3 7/8] perf, tools, stat: Support new per thread TopDown metrics kan.liang
@ 2019-08-26 14:47 ` kan.liang
  7 siblings, 0 replies; 39+ messages in thread
From: kan.liang @ 2019-08-26 14:47 UTC (permalink / raw)
  To: peterz, acme, mingo, linux-kernel
  Cc: tglx, jolsa, eranian, alexander.shishkin, ak, Kan Liang

From: Andi Kleen <ak@linux.intel.com>

Add some documentation on how to use the topdown metrics in ring 3.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/Documentation/topdown.txt | 223 +++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100644 tools/perf/Documentation/topdown.txt

diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
new file mode 100644
index 000000000000..e82a74fa9243
--- /dev/null
+++ b/tools/perf/Documentation/topdown.txt
@@ -0,0 +1,223 @@
+Using TopDown metrics in user space
+-----------------------------------
+
+Intel CPUs (since Sandy Bridge and Silvermont) support a TopDown
+methodology to break down CPU pipeline execution into 4 bottlenecks:
+frontend bound, backend bound, bad speculation, retiring.
+
+For more details on Topdown see [1][5]
+
+Traditionally this was implemented by events in generic counters
+and specific formulas to compute the bottlenecks.
+
+perf stat --topdown implements this.
+
+% perf stat -a --topdown -I1000
+#           time             counts unit events
+     1.000373951      8,460,978,609      topdown-retiring          #     22.9% retiring
+     1.000373951      3,445,383,303      topdown-bad-spec          #      9.3% bad speculation
+     1.000373951     15,886,483,355      topdown-fe-bound          #     43.0% frontend bound
+     1.000373951      9,163,488,720      topdown-be-bound          #     24.8% backend bound
+     2.000782154      8,477,925,431      topdown-retiring          #     22.9% retiring
+     2.000782154      3,459,151,256      topdown-bad-spec          #      9.3% bad speculation
+     2.000782154     15,947,224,725      topdown-fe-bound          #     43.0% frontend bound
+     2.000782154      9,145,551,695      topdown-be-bound          #     24.7% backend bound
+     3.001155967      8,487,323,125      topdown-retiring          #     22.9% retiring
+     3.001155967      3,451,808,066      topdown-bad-spec          #      9.3% bad speculation
+     3.001155967     15,959,068,902      topdown-fe-bound          #     43.0% frontend bound
+     3.001155967      9,172,479,784      topdown-be-bound          #     24.7% backend bound
+...
+
+Full Top Down includes more levels that can break down the
+bottlenecks further. This is not directly implemented in perf,
+but available in other tools that can run on top of perf,
+such as toplev[2] or vtune[3].
+
+New Topdown features in Icelake
+===============================
+
+With Icelake CPUs the TopDown metrics are directly available as
+fixed counters and do not require generic counters. This allows
+to collect TopDown always in addition to other events.
+
+This also enables measuring TopDown per thread/process instead
+of only per core.
+
+Using TopDown through RDPMC in applications on Icelake
+======================================================
+
+For more fine grained measurements it can be useful to
+access the new counters directly from user space. This is more complicated,
+but drastically lowers overhead.
+
+On Icelake, there is a new fixed counter 3: SLOTS, which reports
+"pipeline SLOTS" (cycles multiplied by core issue width) and a
+metric register that reports slots ratios for the different bottleneck
+categories.
+
+The metrics counter is CPU model specific and will not be available
+on older CPUs.
+
+Example code
+============
+
+Library functions providing the functionality described below
+are also available in libjevents [4]
+
+The application opens a perf_event file descriptor
+and sets up fixed counter 3 (SLOTS) to start and
+allow user programs to read the performance counters.
+
+Fixed counter 3 is mapped to the pseudo event event=0x00, umask=0x04,
+so the perf_event_attr structure should be initialized with
+{ .config = 0x0400, .type = PERF_TYPE_RAW }
+
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Provide own perf_event_open stub because glibc doesn't */
+__attribute__((weak))
+int perf_event_open(struct perf_event_attr *attr, pid_t pid,
+		    int cpu, int group_fd, unsigned long flags)
+{
+	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
+}
+
+/* open slots counter file descriptor for current task */
+struct perf_event_attr slots = {
+	.type = PERF_TYPE_RAW,
+	.size = sizeof(struct perf_event_attr),
+	.config = 0x400,
+	.exclude_kernel = 1,
+};
+
+int fd = perf_event_open(&slots, 0, -1, -1, 0);
+if (fd < 0)
+	... error ...
+
+The RDPMC instruction (or _rdpmc compiler intrinsic) can now be used
+to read slots and the topdown metrics at different points of the program:
+
+#include <stdint.h>
+#include <x86intrin.h>
+
+#define RDPMC_FIXED	(1 << 30)	/* return fixed counters */
+#define RDPMC_METRIC	(1 << 29)	/* return metric counters */
+
+#define FIXED_COUNTER_SLOTS		3
+#define METRIC_COUNTER_TOPDOWN_L1	0
+
+static inline uint64_t read_slots(void)
+{
+	return _rdpmc(RDPMC_FIXED | FIXED_COUNTER_SLOTS);
+}
+
+static inline uint64_t read_metrics(void)
+{
+	return _rdpmc(RDPMC_METRIC | METRIC_COUNTER_TOPDOWN_L1);
+}
+
+Then the program can be instrumented to read these metrics at different
+points.
+
+It's not a good idea to do this with too short code regions,
+as the parallelism and overlap in the CPU program execution will
+cause too much measurement inaccuracy. For example instrumenting
+individual basic blocks is definitely too fine grained.
+
+Decoding metrics values
+=======================
+
+The value reported by read_metrics() contains four 8-bit fields,
+each representing a scaled ratio for one Level 1 bottleneck category.
+All four fields add up to 0xff (= 100%)
+
+The binary ratios in the metric value can be converted to float ratios:
+
+#define GET_METRIC(m, i) (((m) >> (i*8)) & 0xff)
+
+#define TOPDOWN_RETIRING(val)	((float)GET_METRIC(val, 0) / 0xff)
+#define TOPDOWN_BAD_SPEC(val)	((float)GET_METRIC(val, 1) / 0xff)
+#define TOPDOWN_FE_BOUND(val)	((float)GET_METRIC(val, 2) / 0xff)
+#define TOPDOWN_BE_BOUND(val)	((float)GET_METRIC(val, 3) / 0xff)
+
+and then converted to percent for printing.
+
+The ratios in the metric accumulate for the time when the counter
+is enabled. For measuring programs it is often useful to measure
+specific sections. For this it is necessary to compute deltas of the metrics.
+
+This can be done by scaling the metrics with the slots counter
+read at the same time.
+
+Then it's possible to take deltas of these slots counts
+measured at different points, and determine the metrics
+for that time period.
+
+	slots_a = read_slots();
+	metric_a = read_metrics();
+
+	... larger code region ...
+
+	slots_b = read_slots()
+	metric_b = read_metrics()
+
+	# compute scaled metrics for measurement a
+	retiring_slots_a = GET_METRIC(metric_a, 0) * slots_a
+	bad_spec_slots_a = GET_METRIC(metric_a, 1) * slots_a
+	fe_bound_slots_a = GET_METRIC(metric_a, 2) * slots_a
+	be_bound_slots_a = GET_METRIC(metric_a, 3) * slots_a
+
+	# compute delta scaled metrics between b and a
+	retiring_slots = GET_METRIC(metric_b, 0) * slots_b - retiring_slots_a
+	bad_spec_slots = GET_METRIC(metric_b, 1) * slots_b - bad_spec_slots_a
+	fe_bound_slots = GET_METRIC(metric_b, 2) * slots_b - fe_bound_slots_a
+	be_bound_slots = GET_METRIC(metric_b, 3) * slots_b - be_bound_slots_a
+
+Later the individual ratios for the measurement period can be recreated
+from these counts (dividing by 0xff undoes the 8-bit scaling applied by
+GET_METRIC above).
+
+	slots_delta = slots_b - slots_a
+	retiring_ratio = (float)retiring_slots / 0xff / slots_delta
+	bad_spec_ratio = (float)bad_spec_slots / 0xff / slots_delta
+	fe_bound_ratio = (float)fe_bound_slots / 0xff / slots_delta
+	be_bound_ratio = (float)be_bound_slots / 0xff / slots_delta
+
+	printf("Retiring %.2f%% Bad Speculation %.2f%% FE Bound %.2f%% BE Bound %.2f%%\n",
+		retiring_ratio * 100.,
+		bad_spec_ratio * 100.,
+		fe_bound_ratio * 100.,
+		be_bound_ratio * 100.);
+
+Resetting metrics counters
+==========================
+
+Since the individual metrics are only 8 bits wide, they lose precision
+over time: as the slots count grows, each fraction bit covers more and
+more cycles, so short regions can no longer be resolved. The counters
+therefore need to be reset regularly.
+
+When using the kernel perf API, the kernel resets the counters on every
+read. So as long as reads happen at reasonable intervals (every few
+seconds) the precision is good.
+
+When using perf stat it is recommended to always use the -I option,
+with an interval no longer than a few seconds:
+
+	perf stat -I 1000 --topdown ...
+
+For user programs using RDPMC directly the counter can
+be reset explicitly using ioctl:
+
+	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
+
+This "opens" a new measurement period.
+
+A program using RDPMC for TopDown should schedule such a reset
+regularly, as in every few seconds.
+
+[1] https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win
+[2] https://github.com/andikleen/pmu-tools/wiki/toplev-manual
+[3] https://software.intel.com/en-us/intel-vtune-amplifier-xe
+[4] https://github.com/andikleen/pmu-tools/tree/master/jevents
+[5] https://sites.google.com/site/analysismethods/yasin-pubs
-- 
2.17.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS
  2019-08-26 14:47 ` [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS kan.liang
@ 2019-08-28  7:48   ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  7:48 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:33AM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> TOPDOWN.SLOTS(0x0400) is not a generic event. It is only available on
> fixed counter3.
> 
> Don't extend its mask to generic counters.
> 
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>


> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 1392d5e6e8d6..457d35a75ad3 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -167,6 +167,11 @@ struct x86_pmu_capability {
>  #define INTEL_PMC_IDX_FIXED_REF_CYCLES	(INTEL_PMC_IDX_FIXED + 2)
>  #define INTEL_PMC_MSK_FIXED_REF_CYCLES	(1ULL << INTEL_PMC_IDX_FIXED_REF_CYCLES)
>  
> +/* TOPDOWN.SLOTS: */
> +#define MSR_ARCH_PERFMON_FIXED_CTR3	0x30c
> +#define INTEL_PMC_IDX_FIXED_SLOTS	(INTEL_PMC_IDX_FIXED + 3)
> +#define INTEL_PMC_MSK_FIXED_SLOTS	(1ULL << INTEL_PMC_IDX_FIXED_SLOTS)
> +
>  /*
>   * We model BTS tracing as another fixed-mode PMC.
>   *


This whole second hunk does not belong in this patch, probably the next
one.


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
@ 2019-08-28  7:48   ` Peter Zijlstra
  2019-08-28  7:52   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  7:48 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:34AM -0700, kan.liang@linux.intel.com wrote:
> Move BTS index to 47. Because bit 48 in the PERF_GLOBAL_STATUS is used
> to indicate the overflow status of PERF_METRICS counters now.

> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 457d35a75ad3..88a506312a10 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -175,11 +175,56 @@ struct x86_pmu_capability {
>  /*
>   * We model BTS tracing as another fixed-mode PMC.
>   *
> - * We choose a value in the middle of the fixed event range, since lower
> + * We choose value 47 for the fixed index of BTS, since lower
>   * values are used by actual fixed events and higher values are used
>   * to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
>   */
> -#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 16)
> +#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 15)

Can be its own patch.


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
  2019-08-28  7:48   ` Peter Zijlstra
@ 2019-08-28  7:52   ` Peter Zijlstra
  2019-08-28 13:59     ` Liang, Kan
  2019-08-28  8:44   ` Peter Zijlstra
  2019-08-28  8:52   ` Peter Zijlstra
  3 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  7:52 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:34AM -0700, kan.liang@linux.intel.com wrote:

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 81b005e4c7d9..54534ff00940 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1033,18 +1033,30 @@ static inline void x86_assign_hw_event(struct perf_event *event,
>  				struct cpu_hw_events *cpuc, int i)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
> +	int reg_idx;
>  
>  	hwc->idx = cpuc->assign[i];
>  	hwc->last_cpu = smp_processor_id();
>  	hwc->last_tag = ++cpuc->tags[i];
>  
> +	/*
> +	 * Metrics counters use different indexes in the scheduler
> +	 * versus the hardware.
> +	 *
> +	 * Map metrics to fixed counter 3 (which is the base count),
> +	 * but the update event callback reads the extra metric register
> +	 * and converts to the right metric.
> +	 */
> +	reg_idx = get_reg_idx(hwc->idx);
> +
>  	if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
>  		hwc->config_base = 0;
>  		hwc->event_base	= 0;
>  	} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
>  		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> -		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
> -		hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
> +		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
> +				  (reg_idx - INTEL_PMC_IDX_FIXED);
> +		hwc->event_base_rdpmc = (reg_idx - INTEL_PMC_IDX_FIXED) | 1<<30;
>  	} else {
>  		hwc->config_base = x86_pmu_config_addr(hwc->idx);
>  		hwc->event_base  = x86_pmu_event_addr(hwc->idx);

That reg_idx is a pointless unconditional branch; better to write it
like:

static inline void x86_assign_hw_event(struct perf_event *event,
				struct cpu_hw_events *cpuc, int i)
{
	struct hw_perf_event *hwc = &event->hw;
	int idx;

	idx = hwc->idx = cpuc->assign[i];
	hwc->last_cpu = smp_processor_id();
	hwc->last_tag = ++cpuc->tags[i];

	switch (hwc->idx) {
	case INTEL_PMC_IDX_FIXED_BTS:
		hwc->config_base = 0;
		hwc->event_base	= 0;
		break;

	case INTEL_PMC_IDX_FIXED_METRIC_BASE ... INTEL_PMC_IDX_FIXED_METRIC_BASE+3:
		/* All METRIC events are mapped onto the fixed SLOTS counter */
		idx = INTEL_PMC_IDX_FIXED_SLOTS;

	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_METRIC_BASE-1:
		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
				  (idx - INTEL_PMC_IDX_FIXED);
		hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
		break;

	default:
		hwc->config_base = x86_pmu_config_addr(hwc->idx);
		hwc->event_base = x86_pmu_event_addr(hwc->idx);
		hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
		break;
	}
}

On that; wth does this do to the RDPMC userspace support!? Does that even
work with these counters?


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
  2019-08-28  7:48   ` Peter Zijlstra
  2019-08-28  7:52   ` Peter Zijlstra
@ 2019-08-28  8:44   ` Peter Zijlstra
  2019-08-28  9:02     ` Peter Zijlstra
  2019-08-28  8:52   ` Peter Zijlstra
  3 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  8:44 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:34AM -0700, kan.liang@linux.intel.com wrote:

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 1b2c37ed49db..f4d6335a18e2 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2131,7 +2131,7 @@ static inline void intel_pmu_ack_status(u64 ack)
>  
>  static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
>  {
> -	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
> +	int idx = get_reg_idx(hwc->idx) - INTEL_PMC_IDX_FIXED;
>  	u64 ctrl_val, mask;
>  
>  	mask = 0xfULL << (idx * 4);
> @@ -2150,6 +2150,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	int reg_idx = get_reg_idx(hwc->idx);
>  
>  	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
>  		intel_pmu_disable_bts();

It is unfortunate we need that in both cases; and note the
inconsistent naming.

> @@ -2157,9 +2158,16 @@ static void intel_pmu_disable_event(struct perf_event *event)
>  		return;
>  	}
>  
> -	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
> -	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
> -	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
> +	/*
> +	 * When any other topdown events are still enabled,
> +	 * cancel the disabling.
> +	 */
> +	if (has_other_topdown_event(cpuc->active_mask, hwc->idx))
> +		return;

And this includes a 3rd instance of that check :/ Also, this really
wants to be in disable_fixed.

> +
> +	cpuc->intel_ctrl_guest_mask &= ~(1ull << reg_idx);
> +	cpuc->intel_ctrl_host_mask &= ~(1ull << reg_idx);
> +	cpuc->intel_cp_status &= ~(1ull << reg_idx);
>  
>  	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
>  		intel_pmu_disable_fixed(hwc);

Same for the enable thing.

Let me clean up this mess for you.


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
                     ` (2 preceding siblings ...)
  2019-08-28  8:44   ` Peter Zijlstra
@ 2019-08-28  8:52   ` Peter Zijlstra
  3 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  8:52 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:34AM -0700, kan.liang@linux.intel.com wrote:
> +#define INTEL_PMC_IDX_FIXED_METRIC_BASE		(INTEL_PMC_IDX_FIXED + 16)
> +#define INTEL_PMC_IDX_TD_RETIRING		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 0)
> +#define INTEL_PMC_IDX_TD_BAD_SPEC		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 1)
> +#define INTEL_PMC_IDX_TD_FE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 2)
> +#define INTEL_PMC_IDX_TD_BE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 3)
> +#define INTEL_PMC_MSK_TOPDOWN			((0xfull << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
> +						 INTEL_PMC_MSK_FIXED_SLOTS)

> +
> +#define INTEL_PMC_CLEAR_TOPDOWN_BIT(bit)	(~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)

That thing is a clear misnomer; it is OTHER_TOPDOWN_BITS(). Fixed that
too.


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-28  8:44   ` Peter Zijlstra
@ 2019-08-28  9:02     ` Peter Zijlstra
  2019-08-28  9:37       ` Peter Zijlstra
  2019-08-28 13:51       ` Liang, Kan
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  9:02 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Wed, Aug 28, 2019 at 10:44:16AM +0200, Peter Zijlstra wrote:

> Let me clean up this mess for you.

Here, how's that. Now we don't check is_metric_idx() _3_ times on the
enable/disable path and all the topdown crud is properly placed in the
fixed counter functions.

Please think; don't tinker.

---

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1033,22 +1033,34 @@ static inline void x86_assign_hw_event(s
 				struct cpu_hw_events *cpuc, int i)
 {
 	struct hw_perf_event *hwc = &event->hw;
+	int idx;
 
-	hwc->idx = cpuc->assign[i];
+	idx = hwc->idx = cpuc->assign[i];
 	hwc->last_cpu = smp_processor_id();
 	hwc->last_tag = ++cpuc->tags[i];
 
-	if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
+	switch (hwc->idx) {
+	case INTEL_PMC_IDX_FIXED_BTS:
 		hwc->config_base = 0;
 		hwc->event_base	= 0;
-	} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
+		break;
+
+	case INTEL_PMC_IDX_FIXED_METRIC_BASE ... INTEL_PMC_IDX_FIXED_METRIC_BASE+3:
+		/* All METRIC events are mapped onto the fixed SLOTS event */
+		idx = INTEL_PMC_IDX_FIXED_SLOTS;
+
+	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1:
 		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
-		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
-		hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
-	} else {
+		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
+				  (idx - INTEL_PMC_IDX_FIXED);
+		hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+		break;
+
+	default:
 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
 		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
 		hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
+		break;
 	}
 }
 
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2128,27 +2128,61 @@ static inline void intel_pmu_ack_status(
 	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
 }
 
-static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
+static inline bool event_is_checkpointed(struct perf_event *event)
+{
+	return unlikely(event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
+}
+
+static inline void intel_set_masks(struct perf_event *event, int idx)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	if (event->attr.exclude_host)
+		__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+	if (event->attr.exclude_guest)
+		__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
+	if (event_is_checkpointed(event))
+		__set_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
+}
+
+static inline void intel_clear_masks(struct perf_event *event, int idx)
 {
-	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	__clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+	__clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
+	__clear_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
+}
+
+static void intel_pmu_disable_fixed(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
 	u64 ctrl_val, mask;
+	int idx = hwc->idx;
 
-	mask = 0xfULL << (idx * 4);
+	if (is_topdown_idx(idx)) {
+		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+		/*
+		 * When there are other Top-Down events still active; don't
+		 * disable the SLOTS counter.
+		 */
+		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+			return;
+
+		idx = INTEL_PMC_IDX_FIXED_SLOTS;
+	}
 
+	intel_clear_masks(event, idx);
+
+	mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4);
 	rdmsrl(hwc->config_base, ctrl_val);
 	ctrl_val &= ~mask;
 	wrmsrl(hwc->config_base, ctrl_val);
 }
 
-static inline bool event_is_checkpointed(struct perf_event *event)
-{
-	return (event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
-}
-
 static void intel_pmu_disable_event(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
 		intel_pmu_disable_bts();
@@ -2156,18 +2190,19 @@ static void intel_pmu_disable_event(stru
 		return;
 	}
 
-	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
-	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
-	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
-
-	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
-		intel_pmu_disable_fixed(hwc);
-	else
+	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
+		intel_pmu_disable_fixed(event);
+	} else {
+		intel_clear_masks(event, hwc->idx);
 		x86_pmu_disable_event(event);
+	}
 
 	/*
 	 * Needs to be called after x86_pmu_disable_event,
 	 * so we don't trigger the event without PEBS bit set.
+	 *
+	 * Metric stuff doesn't do PEBS (I hope?); and so the early exit from
+	 * intel_pmu_disable_fixed() is OK.
 	 */
 	if (unlikely(event->attr.precise_ip))
 		intel_pmu_pebs_disable(event);
@@ -2192,8 +2227,22 @@ static void intel_pmu_read_event(struct
 static void intel_pmu_enable_fixed(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
-	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
 	u64 ctrl_val, mask, bits = 0;
+	int idx = hwc->idx;
+
+	if (is_topdown_idx(idx)) {
+		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+		/*
+		 * When there are other Top-Down events already active; don't
+		 * enable the SLOTS counter.
+		 */
+		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
+			return;
+
+		idx = INTEL_PMC_IDX_FIXED_SLOTS;
+	}
+
+	intel_set_masks(event, hwc->idx);
 
 	/*
 	 * Enable IRQ generation (0x8), if not PEBS,
@@ -2213,6 +2262,7 @@ static void intel_pmu_enable_fixed(struc
 	if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
 		bits |= 0x4;
 
+	idx -= INTEL_PMC_IDX_FIXED;
 	bits <<= (idx * 4);
 	mask = 0xfULL << (idx * 4);
 
@@ -2230,7 +2280,6 @@ static void intel_pmu_enable_fixed(struc
 static void intel_pmu_enable_event(struct perf_event *event)
 {
 	struct hw_perf_event *hwc = &event->hw;
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
 	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
 		if (!__this_cpu_read(cpu_hw_events.enabled))
@@ -2240,23 +2289,16 @@ static void intel_pmu_enable_event(struc
 		return;
 	}
 
-	if (event->attr.exclude_host)
-		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
-	if (event->attr.exclude_guest)
-		cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
-
-	if (unlikely(event_is_checkpointed(event)))
-		cpuc->intel_cp_status |= (1ull << hwc->idx);
 
 	if (unlikely(event->attr.precise_ip))
 		intel_pmu_pebs_enable(event);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_enable_fixed(event);
-		return;
+	} else {
+		intel_set_masks(event, hwc->idx);
+		__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
 	}
-
-	__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
 }
 
 static void intel_pmu_add_event(struct perf_event *event)
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -349,6 +349,20 @@ struct cpu_hw_events {
 	EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
 
 /*
+ * Special metric counters do not actually exist, but get remapped
+ * to a combination of FxCtr3 + MSR_PERF_METRICS
+ *
+ * This allocates them to a dummy offset for the scheduler.
+ * This does not allow sharing of multiple users of the same
+ * metric without multiplexing, even though the hardware supports that
+ * in principle.
+ */
+
+#define METRIC_EVENT_CONSTRAINT(c, n)					\
+	EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_FIXED_METRIC_BASE+n)), \
+			 FIXED_EVENT_FLAGS)
+
+/*
  * Constraint on the Event code + UMask
  */
 #define INTEL_UEVENT_CONSTRAINT(c, n)	\
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -795,6 +795,7 @@
 #define MSR_CORE_PERF_FIXED_CTR0	0x00000309
 #define MSR_CORE_PERF_FIXED_CTR1	0x0000030a
 #define MSR_CORE_PERF_FIXED_CTR2	0x0000030b
+#define MSR_CORE_PERF_FIXED_CTR3	0x0000030c
 #define MSR_CORE_PERF_FIXED_CTR_CTRL	0x0000038d
 #define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e
 #define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -175,11 +175,39 @@ struct x86_pmu_capability {
 /*
  * We model BTS tracing as another fixed-mode PMC.
  *
- * We choose a value in the middle of the fixed event range, since lower
+ * We choose value 47 for the fixed index of BTS, since lower
  * values are used by actual fixed events and higher values are used
  * to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
  */
-#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 15)
+
+/*
+ * We model PERF_METRICS as more magic fixed-mode PMCs, one for each metric
+ *
+ * Internally they all map to Fixed Ctr 3 (SLOTS), and allocate PERF_METRICS
+ * as an extra_reg. PERF_METRICS has no own configuration, but we fill in
+ * the configuration of FxCtr3 to enforce that all the shared users of SLOTS
+ * have the same configuration.
+ */
+#define INTEL_PMC_IDX_FIXED_METRIC_BASE		(INTEL_PMC_IDX_FIXED + 16)
+#define INTEL_PMC_IDX_TD_RETIRING		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 0)
+#define INTEL_PMC_IDX_TD_BAD_SPEC		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 1)
+#define INTEL_PMC_IDX_TD_FE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 2)
+#define INTEL_PMC_IDX_TD_BE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 3)
+#define INTEL_PMC_MSK_TOPDOWN			((0xfull << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
+						 INTEL_PMC_MSK_FIXED_SLOTS)
+
+static inline bool is_metric_idx(int idx)
+{
+	return (unsigned)(idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) < 4;
+}
+
+static inline bool is_topdown_idx(int idx)
+{
+	return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
+}
+
+#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit)	(~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
 
 #define GLOBAL_STATUS_COND_CHG				BIT_ULL(63)
 #define GLOBAL_STATUS_BUFFER_OVF			BIT_ULL(62)


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-28  9:02     ` Peter Zijlstra
@ 2019-08-28  9:37       ` Peter Zijlstra
  2019-08-28 13:51       ` Liang, Kan
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28  9:37 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Wed, Aug 28, 2019 at 11:02:17AM +0200, Peter Zijlstra wrote:
> @@ -2192,8 +2227,22 @@ static void intel_pmu_read_event(struct
>  static void intel_pmu_enable_fixed(struct perf_event *event)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
> -	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
>  	u64 ctrl_val, mask, bits = 0;
> +	int idx = hwc->idx;
> +
> +	if (is_topdown_idx(idx)) {
> +		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +		/*
> +		 * When there are other Top-Down events already active; don't
> +		 * enable the SLOTS counter.
> +		 */
> +		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
> +			return;
> +
> +		idx = INTEL_PMC_IDX_FIXED_SLOTS;
> +	}
> +
> +	intel_set_masks(event, hwc->idx);

That wants to be idx, not hwc->idx.


* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-28  9:02     ` Peter Zijlstra
  2019-08-28  9:37       ` Peter Zijlstra
@ 2019-08-28 13:51       ` Liang, Kan
  1 sibling, 0 replies; 39+ messages in thread
From: Liang, Kan @ 2019-08-28 13:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak



On 8/28/2019 5:02 AM, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 10:44:16AM +0200, Peter Zijlstra wrote:
> 
>> Let me clean up this mess for you.
> 
> Here, how's that. Now we don't check is_metric_idx() _3_ times on the
> enable/disable path and all the topdown crud is properly placed in the
> fixed counter functions.

Thank you very much for the cleanup. The new code is more efficient.
I will prepare V4 with this code and to address other comments.

Thanks,
Kan
> 
> Please think; don't tinker.
> 
> ---
> 
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1033,22 +1033,34 @@ static inline void x86_assign_hw_event(s
>   				struct cpu_hw_events *cpuc, int i)
>   {
>   	struct hw_perf_event *hwc = &event->hw;
> +	int idx;
>   
> -	hwc->idx = cpuc->assign[i];
> +	idx = hwc->idx = cpuc->assign[i];
>   	hwc->last_cpu = smp_processor_id();
>   	hwc->last_tag = ++cpuc->tags[i];
>   
> -	if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
> +	switch (hwc->idx) {
> +	case INTEL_PMC_IDX_FIXED_BTS:
>   		hwc->config_base = 0;
>   		hwc->event_base	= 0;
> -	} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
> +		break;
> +
> +	case INTEL_PMC_IDX_FIXED_METRIC_BASE ... INTEL_PMC_IDX_FIXED_METRIC_BASE+3:
> +		/* All METRIC events are mapped onto the fixed SLOTS event */
> +		idx = INTEL_PMC_IDX_FIXED_SLOTS;
> +
> +	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1:
>   		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> -		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
> -		hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
> -	} else {
> +		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
> +				  (idx - INTEL_PMC_IDX_FIXED);
> +		hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
> +		break;
> +
> +	default:
>   		hwc->config_base = x86_pmu_config_addr(hwc->idx);
>   		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
>   		hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
> +		break;
>   	}
>   }
>   
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2128,27 +2128,61 @@ static inline void intel_pmu_ack_status(
>   	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
>   }
>   
> -static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
> +static inline bool event_is_checkpointed(struct perf_event *event)
> +{
> +	return unlikely(event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
> +}
> +
> +static inline void intel_set_masks(struct perf_event *event, int idx)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> +	if (event->attr.exclude_host)
> +		__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
> +	if (event->attr.exclude_guest)
> +		__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
> +	if (event_is_checkpointed(event))
> +		__set_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
> +}
> +
> +static inline void intel_clear_masks(struct perf_event *event, int idx)
>   {
> -	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> +	__clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
> +	__clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
> +	__clear_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
> +}
> +
> +static void intel_pmu_disable_fixed(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
>   	u64 ctrl_val, mask;
> +	int idx = hwc->idx;
>   
> -	mask = 0xfULL << (idx * 4);
> +	if (is_topdown_idx(idx)) {
> +		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +		/*
> +		 * When there are other Top-Down events still active; don't
> +		 * disable the SLOTS counter.
> +		 */
> +		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
> +			return;
> +
> +		idx = INTEL_PMC_IDX_FIXED_SLOTS;
> +	}
>   
> +	intel_clear_masks(event, idx);
> +
> +	mask = 0xfULL << ((idx - INTEL_PMC_IDX_FIXED) * 4);
>   	rdmsrl(hwc->config_base, ctrl_val);
>   	ctrl_val &= ~mask;
>   	wrmsrl(hwc->config_base, ctrl_val);
>   }
>   
> -static inline bool event_is_checkpointed(struct perf_event *event)
> -{
> -	return (event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
> -}
> -
>   static void intel_pmu_disable_event(struct perf_event *event)
>   {
>   	struct hw_perf_event *hwc = &event->hw;
> -	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>   
>   	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
>   		intel_pmu_disable_bts();
> @@ -2156,18 +2190,19 @@ static void intel_pmu_disable_event(stru
>   		return;
>   	}
>   
> -	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
> -	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
> -	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
> -
> -	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
> -		intel_pmu_disable_fixed(hwc);
> -	else
> +	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
> +		intel_pmu_disable_fixed(event);
> +	} else {
> +		intel_clear_masks(event, hwc->idx);
>   		x86_pmu_disable_event(event);
> +	}
>   
>   	/*
>   	 * Needs to be called after x86_pmu_disable_event,
>   	 * so we don't trigger the event without PEBS bit set.
> +	 *
> +	 * Metric stuff doesn't do PEBS (I hope?); and so the early exit from
> +	 * intel_pmu_disable_fixed() is OK.
>   	 */
>   	if (unlikely(event->attr.precise_ip))
>   		intel_pmu_pebs_disable(event);
> @@ -2192,8 +2227,22 @@ static void intel_pmu_read_event(struct
>   static void intel_pmu_enable_fixed(struct perf_event *event)
>   {
>   	struct hw_perf_event *hwc = &event->hw;
> -	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
>   	u64 ctrl_val, mask, bits = 0;
> +	int idx = hwc->idx;
> +
> +	if (is_topdown_idx(idx)) {
> +		struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +		/*
> +		 * When there are other Top-Down events already active; don't
> +		 * enable the SLOTS counter.
> +		 */
> +		if (*(u64 *)cpuc->active_mask & INTEL_PMC_OTHER_TOPDOWN_BITS(idx))
> +			return;
> +
> +		idx = INTEL_PMC_IDX_FIXED_SLOTS;
> +	}
> +
> +	intel_set_masks(event, hwc->idx);
>   
>   	/*
>   	 * Enable IRQ generation (0x8), if not PEBS,
> @@ -2213,6 +2262,7 @@ static void intel_pmu_enable_fixed(struc
>   	if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
>   		bits |= 0x4;
>   
> +	idx -= INTEL_PMC_IDX_FIXED;
>   	bits <<= (idx * 4);
>   	mask = 0xfULL << (idx * 4);
>   
> @@ -2230,7 +2280,6 @@ static void intel_pmu_enable_fixed(struc
>   static void intel_pmu_enable_event(struct perf_event *event)
>   {
>   	struct hw_perf_event *hwc = &event->hw;
> -	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>   
>   	if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
>   		if (!__this_cpu_read(cpu_hw_events.enabled))
> @@ -2240,23 +2289,16 @@ static void intel_pmu_enable_event(struc
>   		return;
>   	}
>   
> -	if (event->attr.exclude_host)
> -		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
> -	if (event->attr.exclude_guest)
> -		cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
> -
> -	if (unlikely(event_is_checkpointed(event)))
> -		cpuc->intel_cp_status |= (1ull << hwc->idx);
>   
>   	if (unlikely(event->attr.precise_ip))
>   		intel_pmu_pebs_enable(event);
>   
>   	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
>   		intel_pmu_enable_fixed(event);
> -		return;
> +	} else {
> +		intel_set_masks(event, hwc->idx);
> +		__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
>   	}
> -
> -	__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
>   }
>   
>   static void intel_pmu_add_event(struct perf_event *event)
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -349,6 +349,20 @@ struct cpu_hw_events {
>   	EVENT_CONSTRAINT(c, (1ULL << (32+n)), FIXED_EVENT_FLAGS)
>   
>   /*
> + * Special metric counters do not actually exist, but get remapped
> + * to a combination of FxCtr3 + MSR_PERF_METRICS
> + *
> + * This allocates them to a dummy offset for the scheduler.
> + * This does not allow sharing of multiple users of the same
> + * metric without multiplexing, even though the hardware supports that
> + * in principle.
> + */
> +
> +#define METRIC_EVENT_CONSTRAINT(c, n)					\
> +	EVENT_CONSTRAINT(c, (1ULL << (INTEL_PMC_IDX_FIXED_METRIC_BASE+n)), \
> +			 FIXED_EVENT_FLAGS)
> +
> +/*
>    * Constraint on the Event code + UMask
>    */
>   #define INTEL_UEVENT_CONSTRAINT(c, n)	\
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -795,6 +795,7 @@
>   #define MSR_CORE_PERF_FIXED_CTR0	0x00000309
>   #define MSR_CORE_PERF_FIXED_CTR1	0x0000030a
>   #define MSR_CORE_PERF_FIXED_CTR2	0x0000030b
> +#define MSR_CORE_PERF_FIXED_CTR3	0x0000030c
>   #define MSR_CORE_PERF_FIXED_CTR_CTRL	0x0000038d
>   #define MSR_CORE_PERF_GLOBAL_STATUS	0x0000038e
>   #define MSR_CORE_PERF_GLOBAL_CTRL	0x0000038f
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -175,11 +175,39 @@ struct x86_pmu_capability {
>   /*
>    * We model BTS tracing as another fixed-mode PMC.
>    *
> - * We choose a value in the middle of the fixed event range, since lower
> + * We choose value 47 for the fixed index of BTS, since lower
>    * values are used by actual fixed events and higher values are used
>    * to indicate other overflow conditions in the PERF_GLOBAL_STATUS msr.
>    */
> -#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 16)
> +#define INTEL_PMC_IDX_FIXED_BTS				(INTEL_PMC_IDX_FIXED + 15)
> +
> +/*
> + * We model PERF_METRICS as more magic fixed-mode PMCs, one for each metric
> + *
> + * Internally they all map to Fixed Ctr 3 (SLOTS), and allocate PERF_METRICS
> + * as an extra_reg. PERF_METRICS has no configuration of its own, but we fill in
> + * the configuration of FxCtr3 to enforce that all the shared users of SLOTS
> + * have the same configuration.
> + */
> +#define INTEL_PMC_IDX_FIXED_METRIC_BASE		(INTEL_PMC_IDX_FIXED + 16)
> +#define INTEL_PMC_IDX_TD_RETIRING		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 0)
> +#define INTEL_PMC_IDX_TD_BAD_SPEC		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 1)
> +#define INTEL_PMC_IDX_TD_FE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 2)
> +#define INTEL_PMC_IDX_TD_BE_BOUND		(INTEL_PMC_IDX_FIXED_METRIC_BASE + 3)
> +#define INTEL_PMC_MSK_TOPDOWN			((0xfull << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
> +						 INTEL_PMC_MSK_FIXED_SLOTS)
> +
> +static inline bool is_metric_idx(int idx)
> +{
> +	return (unsigned)(idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) < 4;
> +}
> +
> +static inline bool is_topdown_idx(int idx)
> +{
> +	return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
> +}
> +
> +#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit)	(~(0x1ull << bit) & INTEL_PMC_MSK_TOPDOWN)
>   
>   #define GLOBAL_STATUS_COND_CHG				BIT_ULL(63)
>   #define GLOBAL_STATUS_BUFFER_OVF			BIT_ULL(62)
> 

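The index layout in that last hunk can be sanity-checked outside the kernel. The sketch below lifts the helpers from the quoted hunk and fills in the two base constants it does not show (INTEL_PMC_IDX_FIXED is 32 and SLOTS is fixed counter 3 in the kernel headers; treat those values as assumptions here):

```c
#include <assert.h>
#include <stdbool.h>

/* Base values assumed from the kernel headers (not shown in the hunk above). */
#define INTEL_PMC_IDX_FIXED			32
#define INTEL_PMC_IDX_FIXED_SLOTS		(INTEL_PMC_IDX_FIXED + 3)
#define INTEL_PMC_MSK_FIXED_SLOTS		(1ULL << INTEL_PMC_IDX_FIXED_SLOTS)

/* Definitions copied from the quoted hunk. */
#define INTEL_PMC_IDX_FIXED_METRIC_BASE		(INTEL_PMC_IDX_FIXED + 16)
#define INTEL_PMC_MSK_TOPDOWN			((0xfULL << INTEL_PMC_IDX_FIXED_METRIC_BASE) | \
						 INTEL_PMC_MSK_FIXED_SLOTS)
#define INTEL_PMC_OTHER_TOPDOWN_BITS(bit)	(~(0x1ULL << (bit)) & INTEL_PMC_MSK_TOPDOWN)

static bool is_metric_idx(int idx)
{
	/* True only for the four metric pseudo-indexes, 48..51 with the bases above. */
	return (unsigned)(idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) < 4;
}

static bool is_topdown_idx(int idx)
{
	return is_metric_idx(idx) || idx == INTEL_PMC_IDX_FIXED_SLOTS;
}
```

With these bases, INTEL_PMC_OTHER_TOPDOWN_BITS(SLOTS) is exactly the four metric bits, which is what the enable/disable paths above test against active_mask.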
^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters
  2019-08-28  7:52   ` Peter Zijlstra
@ 2019-08-28 13:59     ` Liang, Kan
  0 siblings, 0 replies; 39+ messages in thread
From: Liang, Kan @ 2019-08-28 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak



On 8/28/2019 3:52 AM, Peter Zijlstra wrote:
> On Mon, Aug 26, 2019 at 07:47:34AM -0700, kan.liang@linux.intel.com wrote:
> 
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 81b005e4c7d9..54534ff00940 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -1033,18 +1033,30 @@ static inline void x86_assign_hw_event(struct perf_event *event,
>>   				struct cpu_hw_events *cpuc, int i)
>>   {
>>   	struct hw_perf_event *hwc = &event->hw;
>> +	int reg_idx;
>>   
>>   	hwc->idx = cpuc->assign[i];
>>   	hwc->last_cpu = smp_processor_id();
>>   	hwc->last_tag = ++cpuc->tags[i];
>>   
>> +	/*
>> +	 * Metrics counters use different indexes in the scheduler
>> +	 * versus the hardware.
>> +	 *
>> +	 * Map metrics to fixed counter 3 (which is the base count),
>> +	 * but the update event callback reads the extra metric register
>> +	 * and converts to the right metric.
>> +	 */
>> +	reg_idx = get_reg_idx(hwc->idx);
>> +
>>   	if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
>>   		hwc->config_base = 0;
>>   		hwc->event_base	= 0;
>>   	} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
>>   		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
>> -		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
>> -		hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
>> +		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
>> +				  (reg_idx - INTEL_PMC_IDX_FIXED);
>> +		hwc->event_base_rdpmc = (reg_idx - INTEL_PMC_IDX_FIXED) | 1<<30;
>>   	} else {
>>   		hwc->config_base = x86_pmu_config_addr(hwc->idx);
>>   		hwc->event_base  = x86_pmu_event_addr(hwc->idx);
> 
> That reg_idx is a pointless unconditional branch; better to write it
> like:
> 
> static inline void x86_assign_hw_event(struct perf_event *event,
> 				struct cpu_hw_events *cpuc, int i)
> {
> 	struct hw_perf_event *hwc = &event->hw;
> 	int idx;
> 
> 	idx = hwc->idx = cpuc->assign[i];
> 	hwc->last_cpu = smp_processor_id();
> 	hwc->last_tag = ++cpuc->tags[i];
> 
> 	switch (hwc->idx) {
> 	case INTEL_PMC_IDX_FIXED_BTS:
> 		hwc->config_base = 0;
> 		hwc->event_base	= 0;
> 		break;
> 
> 	case INTEL_PMC_IDX_FIXED_METRIC_BASE ... INTEL_PMC_IDX_FIXED_METRIC_BASE+3:
> 		/* All METRIC events are mapped onto the fixed SLOTS counter */
> 		idx = INTEL_PMC_IDX_FIXED_SLOTS;
> 
> 	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_METRIC_BASE-1:
> 		hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> 		hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
> 				  (idx - INTEL_PMC_IDX_FIXED);
> 		hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
> 		break;
> 
> 	default:
> 		hwc->config_base = x86_pmu_config_addr(hwc->idx);
> 		hwc->event_base = x86_pmu_event_addr(hwc->idx);
> 		hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
> 		break;
> 	}
> }
> 
> On that; wth does this to the RDPMC userspace support!? Does that even
> work with these counters?
> 

The event_base_rdpmc is only for kernel usage now.
But it seems we can update x86_pmu_event_idx() to use it as well.

Thanks,
Kan




* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-26 14:47 ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics kan.liang
@ 2019-08-28 15:02   ` Peter Zijlstra
  2019-08-28 19:04     ` Andi Kleen
  2019-08-28 19:35     ` Liang, Kan
  2019-08-28 15:19   ` Peter Zijlstra
  2019-08-30 23:18   ` Stephane Eranian
  2 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28 15:02 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:35AM -0700, kan.liang@linux.intel.com wrote:

> Groups
> ======
> 
> To avoid reading the METRICS register multiple times, the metrics and
> slots value can only be updated by the first slots/metrics event in a
> group. All active slots and metrics events will be updated one time.

Can't we require SLOTS to be the group leader for any Metric group?

Is there ever a case where we want to add other events to a metric
group?

> Reset
> ======
> 
> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
> because:
> - The 8bit metrics ratio values lose precision when the measurement
>   period gets longer.

So it mustn't be too hot,

> - The PERF_METRICS may report wrong value if its delta was less than
>   1/255 of SLOTS (Fixed counter 3).

it also mustn't be too cold. But that leaves it unspecified what exactly
is the right range.

IOW, you want a Goldilocks number of SLOTS.

> Also, for counting, the -max_period is the initial value of the SLOTS.
> The huge initial value will definitely trigger the issue mentioned
> above. Force initial value as 0 for topdown and slots event counting.

But you just told us that 0 is wrong too (too cold).

I'm still confused by all this; when exactly does:

> NMI
> ======
> 
> The METRICS register may be overflow. The bit 48 of STATUS register
> will be set. If so, update all active slots and metrics events.

that happen? It would be useful to get that METRIC_OVF (can we please
start naming them; 62,55,48 is past silly) at the exact point
where PERF_METRICS is saturated.

If this is so; then we can use this to update/reset PERF_METRICS and
nothing else.

That is; leave the SLOTS programming alone; use -max_period as usual.

Then on METRIC_OVF, read PERF_METRICS and clear it, and update all the
metric events by adding slots_delta * frac / 256 -- where slots_delta is
the SLOTS count since the last METRIC_OVF.

On read; read PERF_METRICS -- BUT DO NOT RESET -- and compute an
intermediate delta and add that to whatever stable count we had.

Maybe something like:

	do {
		count1 = local64_read(&event->count);
		barrier();
		metrics = read_perf_metrics();
		barrier();
		count2 = local64_read(event->count);
	} while (count1 != count2);

	/* no METRIC_OVF happened and {count,metrics} is consistent */

	return count1 + (slots_delta * frac / 256);

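Peter's retry loop can be modelled in userspace. The sketch below only illustrates the consistency scheme: the struct fields stand in for local64_read() and the PERF_METRICS read, the names are illustrative, and it omits the barrier()s and real concurrency a kernel implementation needs:

```c
#include <assert.h>
#include <stdint.h>

/* Toy state; in the kernel this is event->count plus the METRICS MSR. */
struct topdown_state {
	uint64_t count;		/* stable count at the last METRIC_OVF */
	uint64_t slots_delta;	/* SLOTS counted since the last METRIC_OVF */
	uint64_t frac;		/* 8-bit metric fraction, 0..256 */
};

static uint64_t topdown_read(const struct topdown_state *s)
{
	uint64_t count1, count2, frac;

	do {
		count1 = s->count;	/* local64_read(&event->count) */
		frac = s->frac;		/* read_perf_metrics() */
		count2 = s->count;	/* re-read; differs if an NMI hit */
	} while (count1 != count2);

	/* no METRIC_OVF happened and {count,frac} is consistent */
	return count1 + (s->slots_delta * frac) / 256;
}
```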
> The update_topdown_event() has to read two registers separately. The
> values may be modify by a NMI. PMU has to be disabled before calling the
> function.

Then there is no mucking about with that odd counter/metrics msr pair
reset nonsense. Because that really stinks.


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-26 14:47 ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics kan.liang
  2019-08-28 15:02   ` Peter Zijlstra
@ 2019-08-28 15:19   ` Peter Zijlstra
  2019-08-28 16:11     ` [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64 Peter Zijlstra
                       ` (2 more replies)
  2019-08-30 23:18   ` Stephane Eranian
  2 siblings, 3 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28 15:19 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Mon, Aug 26, 2019 at 07:47:35AM -0700, kan.liang@linux.intel.com wrote:
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 54534ff00940..1ae23db5c2d7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
>  	if (idx == INTEL_PMC_IDX_FIXED_BTS)
>  		return 0;
>  
> +	if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> +		return x86_pmu.update_topdown_event(event);

If is_topdown_count() is true; it had better bloody well have that
function. But I really hate this.

>  	/*
>  	 * Careful: an NMI might modify the previous event value.
>  	 *
> @@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
>  
>  	max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
>  
> +	/* There are 4 TopDown metrics events. */
> +	if (x86_pmu.intel_cap.perf_metrics)
> +		max_count += 4;

I'm thinking this is wrong.. this unconditionally allows collecting 4
extra events on every part that has this metrics crud on.

>  	/* current number of events already accepted */
>  	n = cpuc->n_events;
>  
> @@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
>  	if (idx == INTEL_PMC_IDX_FIXED_BTS)
>  		return 0;
>  
> +	if (unlikely(is_topdown_count(event)) &&
> +	    x86_pmu.set_topdown_event_period)
> +		return x86_pmu.set_topdown_event_period(event);

Same as with the other method; and I similarly hates it.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index f4d6335a18e2..616313d7f3d7 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c

> +static int icl_set_topdown_event_period(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	s64 left = local64_read(&hwc->period_left);
> +
> +	/*
> +	 * Clear PERF_METRICS and Fixed counter 3 in initialization.
> +	 * After that, both MSRs will be cleared for each read.
> +	 * Don't need to clear them again.
> +	 */
> +	if (left == x86_pmu.max_period) {
> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> +		wrmsrl(MSR_PERF_METRICS, 0);
> +		local64_set(&hwc->period_left, 0);
> +	}

This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
never trigger the overflow there; this then seems to suggest the actual
counter value is irrelevant. Therefore you don't actually need this.

> +
> +	perf_event_update_userpage(event);
> +
> +	return 0;
> +}
> +
> +static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
> +{
> +	u32 val;
> +
> +	/*
> +	 * The metric is reported as an 8bit integer percentage

s/percentage/fraction/

percentage is 1/100, 8bit is 256.

> +	 * summing up to 0xff.
> +	 * slots-in-metric = (Metric / 0xff) * slots
> +	 */
> +	val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
> +	return  mul_u64_u32_div(slots, val, 0xff);

This really utterly blows.. I'd have wasted range to be able to do a
power-of-two fraction here. That is use 8 bits with a range [0,128].

But also; x86_64 seems to lack a sane implementation of that function,
and it currently compiles into utter crap (it can be 2 instructions).

> +}

> +/*
> + * Update all active Topdown events.
> + * PMU has to be disabled before calling this function.

Forgets to explain why...

> + */

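The decode step under review is simple enough to demonstrate standalone. This sketch mirrors the quoted icl_get_metrics_event_value() arithmetic, using a plain 64-bit division in place of mul_u64_u32_div() (so it would overflow for very large slots counts); the constant 48 for the metric base index is taken from the earlier hunk:

```c
#include <assert.h>
#include <stdint.h>

#define INTEL_PMC_IDX_FIXED_METRIC_BASE	48

/*
 * PERF_METRICS packs four 8-bit fractions (retiring, bad-spec,
 * fe-bound, be-bound), scaled so that the byte values sum to 0xff.
 * Decode one metric byte and scale it back to a slot count.
 */
static uint64_t metric_to_slots(uint64_t metric, uint64_t slots, int idx)
{
	uint32_t val;

	val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
	/* slots-in-metric = slots * (val / 0xff) */
	return (slots * val) / 0xff;
}
```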

* [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64
  2019-08-28 15:19   ` Peter Zijlstra
@ 2019-08-28 16:11     ` Peter Zijlstra
  2019-08-29  9:30       ` Peter Zijlstra
  2019-08-28 16:17     ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics Andi Kleen
  2019-08-29 13:31     ` Liang, Kan
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28 16:11 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin, ak, x86

On Wed, Aug 28, 2019 at 05:19:21PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 26, 2019 at 07:47:35AM -0700, kan.liang@linux.intel.com wrote:

> > +	return  mul_u64_u32_div(slots, val, 0xff);
> 
> But also; x86_64 seems to lack a sane implementation of that function,
> and it currently compiles into utter crap (it can be 2 instructions).

---
Subject: x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 28 17:39:46 CEST 2019

On x86_64 we can do a u64 * u64 -> u128 widening multiply followed by
a u128 / u64 -> u64 division to implement a sane version of
mul_u64_u32_div().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/div64.h |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -73,6 +73,19 @@ static inline u64 mul_u32_u32(u32 a, u32
 
 #else
 # include <asm-generic/div64.h>
+
+static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+{
+	u64 q;
+
+	asm ("mulq %2; divq %3" : "=a" (q)
+				: "a" (a), "rm" (mul), "rm" (div)
+				: "rdx");
+
+	return q;
+}
+#define mul_u64_u32_div	mul_u64_u32_div
+
 #endif /* CONFIG_X86_32 */
 
 #endif /* _ASM_X86_DIV64_H */


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 15:19   ` Peter Zijlstra
  2019-08-28 16:11     ` [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64 Peter Zijlstra
@ 2019-08-28 16:17     ` Andi Kleen
  2019-08-28 16:28       ` Peter Zijlstra
  2019-08-29 13:31     ` Liang, Kan
  2 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2019-08-28 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kan.liang, acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin

> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> never trigger the overflow there; this then seems to suggest the actual

The 48bit counter might overflow in a few hours.

-Andi


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 16:17     ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics Andi Kleen
@ 2019-08-28 16:28       ` Peter Zijlstra
  2019-08-29  3:11         ` Andi Kleen
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-28 16:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kan.liang, acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin

On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > never trigger the overflow there; this then seems to suggest the actual
> 
> The 48bit counter might overflow in a few hours.

Sure; the point is? Kan said it should not be too big; a full 48bit wrap
around must be too big or nothing is.


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 15:02   ` Peter Zijlstra
@ 2019-08-28 19:04     ` Andi Kleen
  2019-08-31  9:19       ` Peter Zijlstra
  2019-08-28 19:35     ` Liang, Kan
  1 sibling, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2019-08-28 19:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kan.liang, acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin

On Wed, Aug 28, 2019 at 05:02:38PM +0200, Peter Zijlstra wrote:
> > 
> > To avoid reading the METRICS register multiple times, the metrics and
> > slots value can only be updated by the first slots/metrics event in a
> > group. All active slots and metrics events will be updated one time.
> 
> Can't we require SLOTS to be the group leader for any Metric group?

Metrics are really useful to collect with any other sampling event
to give you more context in the program behavior.

> Is there ever a case where we want to add other events to a metric
> group?

Yes. Any normal leader sampling case. You just collect metrics too.

> it also mustn't be too cold. But that leaves it unspecified what exactly
> is the right range.
> 
> IOW, you want a Goldilocks number of SLOTS.

That's too strong. You probably don't want minutes, but seconds
worth of measurement should be ok.

> > NMI
> > ======
> > 
> > The METRICS register may be overflow. The bit 48 of STATUS register
> > will be set. If so, update all active slots and metrics events.
> 
> that happen? It would be useful to get that METRIC_OVF (can we please

This happens when the internal counters that feed the metrics
overflow.

> If this is so; then we can use this to update/reset PERF_METRICS and
> nothing else.

It has to be handled in the PMI.

> Then there is no mucking about with that odd counter/metrics msr pair
> reset nonsense. Becuase that really stinks.

You have to write them to reset the internal counters.

-Andi



* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 15:02   ` Peter Zijlstra
  2019-08-28 19:04     ` Andi Kleen
@ 2019-08-28 19:35     ` Liang, Kan
  1 sibling, 0 replies; 39+ messages in thread
From: Liang, Kan @ 2019-08-28 19:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak



On 8/28/2019 11:02 AM, Peter Zijlstra wrote:
>> Reset
>> ======
>>
>> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
>> because:
>> - The 8bit metrics ratio values lose precision when the measurement
>>    period gets longer.
> So it mustn't be too hot,
> 
>> - The PERF_METRICS may report wrong value if its delta was less than
>>    1/255 of SLOTS (Fixed counter 3).

The "delta" is actually for the internal counters. Sorry for the
confusion.

> it also mustn't be too cold. But that leaves it unspecified what exactly
> is the right range.
> 
> IOW, you want a Goldilocks number of SLOTS.
> 
>> Also, for counting, the -max_period is the initial value of the SLOTS.
>> The huge initial value will definitely trigger the issue mentioned
>> above. Force initial value as 0 for topdown and slots event counting.
> But you just told us that 0 is wrong too (too cold).

We set -max_period (0x8000 0000 0001) as the initial value of FIXCTR3 for
counting. But the internal counters probably start counting from 0.
PERF_METRICS is the fraction internal counters / FIXCTR3,
so PERF_METRICS will never work in that case.
We have to make FIXCTR3 count from 0 as well.

Thanks,
Kan


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 16:28       ` Peter Zijlstra
@ 2019-08-29  3:11         ` Andi Kleen
  2019-08-29  9:17           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2019-08-29  3:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kan.liang, acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin

On Wed, Aug 28, 2019 at 06:28:57PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > > never trigger the overflow there; this then seems to suggest the actual
> > 
> > The 48bit counter might overflow in a few hours.
> 
> Sure; the point is? Kan said it should not be too big; a full 48bit wrap
> around must be too big or nothing is.

We expect users to avoid making it too big, but we cannot rule it out.

Actually for the common case of perf stat over a long time you can make it
fairly big, because the percentages still work out over the complete run time.

The problem with letting it accumulate too much is mainly if you
want to use a continuously running metrics counter to time smaller
regions by taking deltas, for example using RDPMC. 

In this case the small regions would be too inaccurate after some time.

But to make the first case work we absolutely need to handle the overflow
case. Otherwise the metrics would just randomly stop at some
point.


-Andi





* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-29  3:11         ` Andi Kleen
@ 2019-08-29  9:17           ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-29  9:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kan.liang, acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin

On Wed, Aug 28, 2019 at 08:11:51PM -0700, Andi Kleen wrote:
> On Wed, Aug 28, 2019 at 06:28:57PM +0200, Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 09:17:54AM -0700, Andi Kleen wrote:
> > > > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > > > never trigger the overflow there; this then seems to suggest the actual
> > > 
> > > The 48bit counter might overflow in a few hours.
> > 
> > Sure; the point is? Kan said it should not be too big; a full 48bit wrap
> > around must be too big or nothing is.
> 
> We expect users to avoid making it too big, but we cannot rule it out.
> 
> Actually for the common perf stat for a long time case you can make it fairly
> big because the percentages still work out over the complete run time.
> 
> The problem with letting it accumulate too much is mainly if you
> want to use a continuously running metrics counter to time smaller
> regions by taking deltas, for example using RDPMC. 
> 
> In this case the small regions would be too inaccurate after some time.
> 
> But to make the first case work absolutely need to handle the overflow
> case. Otherwise the metrics would just randomly stop at some
> point.

But then you need -max_period, not 0. By using half the period, there is
no ambiguity on overflow.


* Re: [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64
  2019-08-28 16:11     ` [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64 Peter Zijlstra
@ 2019-08-29  9:30       ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-29  9:30 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin, ak, x86

On Wed, Aug 28, 2019 at 06:11:23PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 05:19:21PM +0200, Peter Zijlstra wrote:
> > On Mon, Aug 26, 2019 at 07:47:35AM -0700, kan.liang@linux.intel.com wrote:
> 
> > > +	return  mul_u64_u32_div(slots, val, 0xff);
> > 
> > But also; x86_64 seems to lack a sane implementation of that function,
> > and it currently compiles into utter crap (it can be 2 instructions).

This one actually builds defconfig :-)

---
Subject: x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed Aug 28 17:39:46 CEST 2019

On x86_64 we can do a u64 * u64 -> u128 widening multiply followed by
a u128 / u64 -> u64 division to implement a sane version of
mul_u64_u32_div().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/div64.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index 20a46150e0a8..9b8cb50768c2 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -73,6 +73,19 @@ static inline u64 mul_u32_u32(u32 a, u32 b)
 
 #else
 # include <asm-generic/div64.h>
+
+static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 div)
+{
+	u64 q;
+
+	asm ("mulq %2; divq %3" : "=a" (q)
+				: "a" (a), "rm" ((u64)mul), "rm" ((u64)div)
+				: "rdx");
+
+	return q;
+}
+#define mul_u64_u32_div	mul_u64_u32_div
+
 #endif /* CONFIG_X86_32 */
 
 #endif /* _ASM_X86_DIV64_H */

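For reference, the same arithmetic can be written without inline assembly on compilers that provide __uint128_t (GCC and Clang do on x86_64). This is an equivalent sketch of the operation, not the patch itself; like divq, it traps if the quotient does not fit in 64 bits:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen to 128 bits for the multiply, then divide back down.
 * On x86_64 this compiles to the same mulq/divq pair the inline
 * asm above spells out by hand.
 */
static uint64_t mul_u64_u32_div(uint64_t a, uint32_t mul, uint32_t div)
{
	return (uint64_t)(((__uint128_t)a * mul) / div);
}
```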

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 15:19   ` Peter Zijlstra
  2019-08-28 16:11     ` [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64 Peter Zijlstra
  2019-08-28 16:17     ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics Andi Kleen
@ 2019-08-29 13:31     ` Liang, Kan
  2019-08-29 13:52       ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Liang, Kan @ 2019-08-29 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak



On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
>> +static int icl_set_topdown_event_period(struct perf_event *event)
>> +{
>> +	struct hw_perf_event *hwc = &event->hw;
>> +	s64 left = local64_read(&hwc->period_left);
>> +
>> +	/*
>> +	 * Clear PERF_METRICS and Fixed counter 3 in initialization.
>> +	 * After that, both MSRs will be cleared for each read.
>> +	 * Don't need to clear them again.
>> +	 */
>> +	if (left == x86_pmu.max_period) {
>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
>> +		wrmsrl(MSR_PERF_METRICS, 0);
>> +		local64_set(&hwc->period_left, 0);
>> +	}
> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> never trigger the overflow there; this then seems to suggest the actual
> counter value is irrelevant. Therefore you don't actually need this.
> 

Could you please elaborate on why initialization to 0 never triggers an 
overflow?
To my understanding, initialization to 0 only means that it will take 
more time than initialization to -max_period (0x8000 0000 0001) to 
trigger an overflow.

Maybe 0 is too tricky. We can set the initial value to 1.
I think the bottom line is that we need a small initial value for 
FIXED_CTR3 here.
PERF_METRICS reports an 8bit integer fraction which is something like 
0xff * internal counters / FIXCTR3.
The internal counters only start counting from 0. (SW cannot set an 
arbitrary initial value for internal counters.)
If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always 
remain constant, e.g. 0.
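Kan's rounding argument is easy to see with numbers. The model below assumes the simplified formula from the paragraph above (0xff * internal counters / FIXCTR3); the real behaviour is internal to the PMU, so this is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: metric byte = 0xff * internal_counter / FIXCTR3. */
static uint64_t model_perf_metric(uint64_t internal, uint64_t fixctr3)
{
	return (0xffULL * internal) / fixctr3;
}
```

Seeding FIXCTR3 with -max_period (~2^47) dwarfs the internal counters, which start at 0, so every metric byte rounds to 0; with FIXCTR3 also counting from 0 the fractions come out as expected.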

Thanks,
Kan


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-29 13:31     ` Liang, Kan
@ 2019-08-29 13:52       ` Peter Zijlstra
  2019-08-29 16:56         ` Liang, Kan
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-29 13:52 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Thu, Aug 29, 2019 at 09:31:37AM -0400, Liang, Kan wrote:
> On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
> > > +static int icl_set_topdown_event_period(struct perf_event *event)
> > > +{
> > > +	struct hw_perf_event *hwc = &event->hw;
> > > +	s64 left = local64_read(&hwc->period_left);
> > > +
> > > +	/*
> > > +	 * Clear PERF_METRICS and Fixed counter 3 in initialization.
> > > +	 * After that, both MSRs will be cleared for each read.
> > > +	 * Don't need to clear them again.
> > > +	 */
> > > +	if (left == x86_pmu.max_period) {
> > > +		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> > > +		wrmsrl(MSR_PERF_METRICS, 0);
> > > +		local64_set(&hwc->period_left, 0);
> > > +	}
> > This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
> > never trigger the overflow there; this then seems to suggest the actual
> > counter value is irrelevant. Therefore you don't actually need this.
> > 
> 
> Could you please elaborate on why initialization to 0 never triggers an
> overflow?

Well, 'never' as in a 'long' time.

> To my understanding, initialization to 0 only means that it will take
> more time than initialization to -max_period (0x8000 0000 0001) to trigger
> an overflow.

Only twice as long. And why do we care about that?

The problem with it is though that you get the overflow at the end of
the whole period, instead of halfway through, so reconstruction is
trickier.

> Maybe 0 is too tricky. We can set the initial value to 1.

That's even worse. I'm still not understanding why we can't use the
normal code.

> I think the bottom line is that we need a small initial value for FIXED_CTR3
> here.

But why?!

> PERF_METRICS reports an 8bit integer fraction which is something like 0xff *
> internal counters / FIXCTR3.
> The internal counters only start counting from 0. (SW cannot set an
> arbitrary initial value for internal counters.)
> If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
> remain constant, e.g. 0.

What what? The PERF_METRICS contents depend on the FIXCTR3 value?!
That's bloody insane. /me goes find the SDM. The SDM is bloody useless
:-(.

Please give a complete and coherent description of all of this. I can't
very well review any of this until I know how the hardware works, now
can I.

In this write-up, include the exact condition for METRICS_OVF (the SDM
states: 'it indicates that PERF_METRIC counter has overflowed', which is
grammatically incorrect and makes no sense even with the missing article
injected).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-29 13:52       ` Peter Zijlstra
@ 2019-08-29 16:56         ` Liang, Kan
  2019-08-31  9:18           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Liang, Kan @ 2019-08-29 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak



On 8/29/2019 9:52 AM, Peter Zijlstra wrote:
> On Thu, Aug 29, 2019 at 09:31:37AM -0400, Liang, Kan wrote:
>> On 8/28/2019 11:19 AM, Peter Zijlstra wrote:
>>>> +static int icl_set_topdown_event_period(struct perf_event *event)
>>>> +{
>>>> +	struct hw_perf_event *hwc = &event->hw;
>>>> +	s64 left = local64_read(&hwc->period_left);
>>>> +
>>>> +	/*
>>>> +	 * Clear PERF_METRICS and Fixed counter 3 in initialization.
>>>> +	 * After that, both MSRs will be cleared for each read.
>>>> +	 * Don't need to clear them again.
>>>> +	 */
>>>> +	if (left == x86_pmu.max_period) {
>>>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
>>>> +		wrmsrl(MSR_PERF_METRICS, 0);
>>>> +		local64_set(&hwc->period_left, 0);
>>>> +	}
>>> This really doesn't make sense to me; if you set FIXED_CTR3 := 0, you'll
>>> never trigger the overflow there; this then seems to suggest the actual
>>> counter value is irrelevant. Therefore you don't actually need this.
>>>
>>
>> Could you please elaborate on why initialization to 0 never triggers an
>> overflow?
> 
> Well, 'never' as in a 'long' time.
> 
>> As I understand it, initialization to 0 only means that it will take
>> more time than initialization to -max_period (0x8000 0000 0001) to trigger
>> an overflow.
> 
> Only twice as long. And why do we care about that?
> 
> The problem with it is though that you get the overflow at the end of
> the whole period, instead of halfway through, so reconstruction is
> trickier.
> 
>> Maybe 0 is too tricky. We can set the initial value to 1.
> 
> That's even worse. I'm still not understanding why we can't use the
> normal code.
> 
>> I think the bottom line is that we need a small initial value for FIXED_CTR3
>> here.
> 
> But why?!
> 
>> PERF_METRICS reports an 8bit integer fraction which is something like 0xff *
>> internal counters / FIXCTR3.
>> The internal counters only start counting from 0. (SW cannot set an
>> arbitrary initial value for internal counters.)
>> If the initial value of FIXED_CTR3 is too big, PERF_METRICS could always
>> remain constant, e.g. 0.
> 
> What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!

Yes.

In the current implementation, the PERF_METRICS MSR is composed of four 
fields: backend bound, frontend bound, bad speculation and retiring.
Each field is populated using a formula of the following form, e.g.:
PERF_METRICS[RETIRING] = (0xFF * 
PERF_METRICS_RETIRING_INTERNAL_48bit_COUNTER)
/ FIXCTR3

METRICS_OVF indicates an overflow of any of the internal counters.

The internal counters always start counting from 0 and cannot be 
programmed by SW. But resetting PERF_METRICS implicitly resets the 
internal counters as well.
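The layout described here can be illustrated with a small user-space
sketch. This is a hypothetical model with made-up helper names, not the
kernel implementation; the field order (retiring, bad speculation, FE
bound, BE bound in the low four bytes) follows the constraint table in
the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the PERF_METRICS layout: four 8-bit fields,
 * each holding 0xFF * internal_counter / FIXCTR3.
 * idx 0..3: retiring, bad speculation, FE bound, BE bound.
 */
static uint64_t metric_field(uint64_t perf_metrics, int idx)
{
	return (perf_metrics >> (idx * 8)) & 0xff;
}

/*
 * Reconstruct the slots attributed to one metric, the way the driver
 * does: slots-in-metric = (field / 0xFF) * SLOTS.
 */
static uint64_t slots_in_metric(uint64_t perf_metrics, int idx,
				uint64_t slots)
{
	return slots * metric_field(perf_metrics, idx) / 0xff;
}
```

So a field value of 0xff attributes all SLOTS to that metric, and 0x80
attributes roughly half, with truncation from the integer division.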

Thanks,
Kan


> That's bloody insane. /me goes find the SDM. The SDM is bloody useless
> :-(.
> 
> Please give a complete and coherent description of all of this. I can't
> very well review any of this until I know how the hardware works, now
> can I.
> 
> In this write-up, include the exact condition for METRICS_OVF (the SDM
> states: 'it indicates that PERF_METRIC counter has overflowed', which is
> grammatically incorrect and makes no sense even with the missing article
> injected).
>





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-26 14:47 ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics kan.liang
  2019-08-28 15:02   ` Peter Zijlstra
  2019-08-28 15:19   ` Peter Zijlstra
@ 2019-08-30 23:18   ` Stephane Eranian
  2019-08-31  0:31     ` Andi Kleen
  2 siblings, 1 reply; 39+ messages in thread
From: Stephane Eranian @ 2019-08-30 23:18 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Ingo Molnar, LKML,
	Thomas Gleixner, Jiri Olsa, Alexander Shishkin, Andi Kleen

Hi,

On Mon, Aug 26, 2019 at 7:48 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Intro
> =====
>
> Icelake has support for measuring the four top level TopDown metrics
> directly in hardware. This is implemented by an additional "metrics"
> register, and a new Fixed Counter 3 that measures pipeline "slots".
>
> Events
> ======
>
> We export four metric events as separate perf events, which map to
> internal "metrics" counter register. Those events do not exist in
> hardware, but can be allocated by the scheduler.
>
There is another possible approach for supporting Topdown-style counters.
Instead of trying to abstract them as separate events to the user,
putting them back together in the kernel, and then using slots to scale
them as counts, we could just expose them as is, i.e., as structured
counter values. The kernel already handles structured counter configs and
exports the config fields via sysfs, and the perf tool picks them up and
can encode any event. We could take a similar approach for a counter
value: it could have fields, units and types, and perf stat would pick
them up in the same manner. It would greatly simplify the kernel
implementation. You would need to publish a pseudo-event code for each
group of metrics. Note that I am not advocating exposing the raw counter
value. That way you would maintain one event code -> one "counter" on hw.
The reset on read would also work. It would generate only one rdmsr
per read, without forcing any grouping or using slots under the hood to
scale. If PERF_METRICS gets extended, you can just add another
pseudo-event code or umask.

The PERF_METRICS events do not make real sense in isolation, and the
SLOTS scaling is hard to interpret. You can never profile on a
PERF_METRICS event, so keeping them grouped is okay.


> For the event mapping we use a special 0x00 event code, which is
> reserved for fake events. The metric events start from umask 0x10.
>
> When setting up such events they point to the slots counter, and a
> special callback, update_topdown_event(), reads the additional metrics
> msr to generate the metrics. Then the metric is reported by multiplying
> the metric (percentage) with slots.
>
> This multiplication makes it easy to keep a running count, for example
> when the slots counter overflows, and makes all the standard tools, such
> as perf stat, work. They can compute deltas of the values without needing
> to know about percentages. This also simplifies accumulating the counts
> of child events, which otherwise would need to know how to average
> percentage values.
>
> All four metric events don't support sampling. Since they will be
> handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is
> introduced to indicate this case.
>
> The slots event can support both sampling and counting.
> For counting, the flag is also applied.
> For sampling, it will be handled normally as other normal events.
>
> Groups
> ======
>
> To avoid reading the METRICS register multiple times, the metrics and
> slots value can only be updated by the first slots/metrics event in a
> group. All active slots and metrics events will be updated one time.
>
> Reset
> ======
>
> The PERF_METRICS and Fixed counter 3 have to be reset for each read,
> because:
> - The 8bit metrics ratio values lose precision when the measurement
>   period gets longer.
> - The PERF_METRICS may report a wrong value if its delta was less than
>   1/255 of SLOTS (Fixed counter 3).
>
> Also, for counting, -max_period is the initial value of SLOTS. This
> huge initial value will definitely trigger the issue mentioned above,
> so force the initial value to 0 for topdown and slots event counting.
>
> NMI
> ======
>
> The METRICS register may overflow. If it does, bit 48 of the STATUS
> register will be set, and all active slots and metrics events are updated.
>
> The update_topdown_event() has to read two registers separately. The
> values may be modified by an NMI. The PMU has to be disabled before
> calling the function.
>
> RDPMC
> ======
>
> RDPMC is temporarily disabled. The following patch will enable it.
>
> Originally-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  arch/x86/events/core.c           |  10 ++
>  arch/x86/events/intel/core.c     | 230 ++++++++++++++++++++++++++++++-
>  arch/x86/events/perf_event.h     |  17 +++
>  arch/x86/include/asm/msr-index.h |   2 +
>  4 files changed, 255 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 54534ff00940..1ae23db5c2d7 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -76,6 +76,8 @@ u64 x86_perf_event_update(struct perf_event *event)
>         if (idx == INTEL_PMC_IDX_FIXED_BTS)
>                 return 0;
>
> +       if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> +               return x86_pmu.update_topdown_event(event);
>         /*
>          * Careful: an NMI might modify the previous event value.
>          *
> @@ -1003,6 +1005,10 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
>
>         max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
>
> +       /* There are 4 TopDown metrics events. */
> +       if (x86_pmu.intel_cap.perf_metrics)
> +               max_count += 4;
> +
>         /* current number of events already accepted */
>         n = cpuc->n_events;
>
> @@ -1184,6 +1190,10 @@ int x86_perf_event_set_period(struct perf_event *event)
>         if (idx == INTEL_PMC_IDX_FIXED_BTS)
>                 return 0;
>
> +       if (unlikely(is_topdown_count(event)) &&
> +           x86_pmu.set_topdown_event_period)
> +               return x86_pmu.set_topdown_event_period(event);
> +
>         /*
>          * If we are way outside a reasonable range then just skip forward:
>          */
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index f4d6335a18e2..616313d7f3d7 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -247,6 +247,10 @@ static struct event_constraint intel_icl_event_constraints[] = {
>         FIXED_EVENT_CONSTRAINT(0x003c, 1),      /* CPU_CLK_UNHALTED.CORE */
>         FIXED_EVENT_CONSTRAINT(0x0300, 2),      /* CPU_CLK_UNHALTED.REF */
>         FIXED_EVENT_CONSTRAINT(0x0400, 3),      /* SLOTS */
> +       METRIC_EVENT_CONSTRAINT(0x1000, 0),     /* Retiring metric */
> +       METRIC_EVENT_CONSTRAINT(0x1100, 1),     /* Bad speculation metric */
> +       METRIC_EVENT_CONSTRAINT(0x1200, 2),     /* FE bound metric */
> +       METRIC_EVENT_CONSTRAINT(0x1300, 3),     /* BE bound metric */
>         INTEL_EVENT_CONSTRAINT_RANGE(0x03, 0x0a, 0xf),
>         INTEL_EVENT_CONSTRAINT_RANGE(0x1f, 0x28, 0xf),
>         INTEL_EVENT_CONSTRAINT(0x32, 0xf),      /* SW_PREFETCH_ACCESS.* */
> @@ -267,6 +271,14 @@ static struct extra_reg intel_icl_extra_regs[] __read_mostly = {
>         INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff9fffull, RSP_1),
>         INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
>         INTEL_UEVENT_EXTRA_REG(0x01c6, MSR_PEBS_FRONTEND, 0x7fff17, FE),
> +       /*
> +        * The original Fixed Ctr 3 is shared by the different metrics
> +        * events. Use the extra reg to enforce the same
> +        * configuration on the original register, but do not actually
> +        * write to it.
> +        */
> +       INTEL_UEVENT_EXTRA_REG(0x0400, 0, -1L, TOPDOWN),
> +       INTEL_UEVENT_TOPDOWN_EXTRA_REG(0x1000),
>         EVENT_EXTRA_END
>  };
>
> @@ -2190,10 +2202,163 @@ static void intel_pmu_del_event(struct perf_event *event)
>                 intel_pmu_pebs_del(event);
>  }
>
> +static inline bool is_metric_event(struct perf_event *event)
> +{
> +       return ((event->attr.config & ARCH_PERFMON_EVENTSEL_EVENT) == 0) &&
> +               ((event->attr.config & INTEL_ARCH_EVENT_MASK) >= 0x1000)  &&
> +               ((event->attr.config & INTEL_ARCH_EVENT_MASK) <= 0x1300);
> +}
> +
> +static inline bool is_slots_event(struct perf_event *event)
> +{
> +       return (event->attr.config & INTEL_ARCH_EVENT_MASK) == 0x0400;
> +}
> +
> +static inline bool is_topdown_event(struct perf_event *event)
> +{
> +       return is_metric_event(event) || is_slots_event(event);
> +}
> +
> +static bool is_first_topdown_event_in_group(struct perf_event *event)
> +{
> +       struct perf_event *first = NULL;
> +
> +       if (is_topdown_event(event->group_leader))
> +               first = event->group_leader;
> +       else {
> +               for_each_sibling_event(first, event->group_leader)
> +                       if (is_topdown_event(first))
> +                               break;
> +       }
> +
> +       if (event == first)
> +               return true;
> +
> +       return false;
> +}
> +
> +static int icl_set_topdown_event_period(struct perf_event *event)
> +{
> +       struct hw_perf_event *hwc = &event->hw;
> +       s64 left = local64_read(&hwc->period_left);
> +
> +       /*
> +        * Clear PERF_METRICS and Fixed counter 3 in initialization.
> +        * After that, both MSRs will be cleared for each read.
> +        * Don't need to clear them again.
> +        */
> +       if (left == x86_pmu.max_period) {
> +               wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> +               wrmsrl(MSR_PERF_METRICS, 0);
> +               local64_set(&hwc->period_left, 0);
> +       }
> +
> +       perf_event_update_userpage(event);
> +
> +       return 0;
> +}
> +
> +static u64 icl_get_metrics_event_value(u64 metric, u64 slots, int idx)
> +{
> +       u32 val;
> +
> +       /*
> +        * The metric is reported as an 8bit integer percentage
> +        * summing up to 0xff.
> +        * slots-in-metric = (Metric / 0xff) * slots
> +        */
> +       val = (metric >> ((idx - INTEL_PMC_IDX_FIXED_METRIC_BASE) * 8)) & 0xff;
> +       return  mul_u64_u32_div(slots, val, 0xff);
> +}
> +
> +static void __icl_update_topdown_event(struct perf_event *event,
> +                                      u64 slots, u64 metrics)
> +{
> +       int idx = event->hw.idx;
> +       u64 delta;
> +
> +       if (is_metric_idx(idx))
> +               delta = icl_get_metrics_event_value(metrics, slots, idx);
> +       else
> +               delta = slots;
> +
> +       local64_add(delta, &event->count);
> +}
> +
> +/*
> + * Update all active Topdown events.
> + * PMU has to be disabled before calling this function.
> + */
> +static u64 icl_update_topdown_event(struct perf_event *event)
> +{
> +       struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +       struct perf_event *other;
> +       u64 slots, metrics;
> +       int idx;
> +
> +       /*
> +        * Only need to update all events for the first
> +        * slots/metrics event in a group
> +        */
> +       if (event && !is_first_topdown_event_in_group(event))
> +               return 0;
> +
> +       /* read Fixed counter 3 */
> +       rdpmcl((3 | 1<<30), slots);
> +       if (!slots)
> +               return 0;
> +
> +       /* read PERF_METRICS */
> +       rdpmcl((1<<29), metrics);
> +
> +       for_each_set_bit(idx, cpuc->active_mask, INTEL_PMC_IDX_TD_BE_BOUND + 1) {
> +               if (!is_topdown_idx(idx))
> +                       continue;
> +               other = cpuc->events[idx];
> +               __icl_update_topdown_event(other, slots, metrics);
> +       }
> +
> +       /*
> +        * Check and update this event, which may have been cleared
> +        * in active_mask e.g. x86_pmu_stop()
> +        */
> +       if (event && !test_bit(event->hw.idx, cpuc->active_mask))
> +               __icl_update_topdown_event(event, slots, metrics);
> +
> +       /*
> +        * To avoid the known issues as below, the PERF_METRICS and
> +        * Fixed counter 3 are reset for each read.
> +        * - The 8bit metrics ratio values lose precision when the
> +        *   measurement period gets longer.
> +        * - The PERF_METRICS may report a wrong value if its delta was
> +        *   less than 1/255 of Fixed counter 3.
> +        */
> +       wrmsrl(MSR_PERF_METRICS, 0);
> +       wrmsrl(MSR_CORE_PERF_FIXED_CTR3, 0);
> +
> +       return slots;
> +}
> +
> +static void intel_pmu_read_topdown_event(struct perf_event *event)
> +{
> +       struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> +       /* Only need to call update_topdown_event() once for group read. */
> +       if ((cpuc->txn_flags & PERF_PMU_TXN_READ) &&
> +           !is_first_topdown_event_in_group(event))
> +               return;
> +
> +       perf_pmu_disable(event->pmu);
> +       x86_pmu.update_topdown_event(event);
> +       perf_pmu_enable(event->pmu);
> +}
> +
>  static void intel_pmu_read_event(struct perf_event *event)
>  {
>         if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>                 intel_pmu_auto_reload_read(event);
> +       else if (is_topdown_count(event) && x86_pmu.update_topdown_event)
> +               intel_pmu_read_topdown_event(event);
>         else
>                 x86_perf_event_update(event);
>  }
> @@ -2401,6 +2566,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>                         intel_pt_interrupt();
>         }
>
> +       /*
> +        * Intel Perf metrics
> +        */
> +       if (__test_and_clear_bit(48, (unsigned long *)&status)) {
> +               handled++;
> +               if (x86_pmu.update_topdown_event)
> +                       x86_pmu.update_topdown_event(NULL);
> +       }
> +
>         /*
>          * Checkpointed counters can lead to 'spurious' PMIs because the
>          * rollback caused by the PMI will have cleared the overflow status
> @@ -3312,6 +3486,42 @@ static int intel_pmu_hw_config(struct perf_event *event)
>         if (event->attr.type != PERF_TYPE_RAW)
>                 return 0;
>
> +       /*
> +        * Config Topdown slots and metric events
> +        *
> +        * The slots event on Fixed Counter 3 can support sampling,
> +        * which will be handled normally in x86_perf_event_update().
> +        *
> +        * The metric events don't support sampling.
> +        *
> +        * For counting, topdown slots and metric events will be
> +        * handled specially for event update.
> +        * A flag PERF_X86_EVENT_TOPDOWN is applied for the case.
> +        */
> +       if (x86_pmu.intel_cap.perf_metrics && is_topdown_event(event)) {
> +               if (is_metric_event(event) && is_sampling_event(event))
> +                       return -EINVAL;
> +
> +               if (!is_sampling_event(event)) {
> +                       if (event->attr.config1 != 0)
> +                               return -EINVAL;
> +                       if (event->attr.config & ARCH_PERFMON_EVENTSEL_ANY)
> +                               return -EINVAL;
> +                       /*
> +                        * Put configuration (minus event) into config1 so that
> +                        * the scheduler enforces through an extra_reg that
> +                        * all instances of the metrics events have the same
> +                        * configuration.
> +                        */
> +                       event->attr.config1 = event->hw.config &
> +                                             X86_ALL_EVENT_FLAGS;
> +                       event->hw.flags |= PERF_X86_EVENT_TOPDOWN;
> +
> +                       if (is_metric_event(event))
> +                               event->hw.flags &= ~PERF_X86_EVENT_RDPMC_ALLOWED;
> +               }
> +       }
> +
>         if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
>                 return 0;
>
> @@ -5040,6 +5250,8 @@ __init int intel_pmu_init(void)
>                 x86_pmu.rtm_abort_event = X86_CONFIG(.event=0xca, .umask=0x02);
>                 x86_pmu.lbr_pt_coexist = true;
>                 intel_pmu_pebs_data_source_skl(pmem);
> +               x86_pmu.update_topdown_event = icl_update_topdown_event;
> +               x86_pmu.set_topdown_event_period = icl_set_topdown_event_period;
>                 pr_cont("Icelake events, ");
>                 name = "icelake";
>                 break;
> @@ -5096,10 +5308,17 @@ __init int intel_pmu_init(void)
>                  * counter, so do not extend mask to generic counters
>                  */
>                 for_each_event_constraint(c, x86_pmu.event_constraints) {
> -                       if (c->cmask == FIXED_EVENT_FLAGS
> -                           && c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES
> -                           && c->idxmsk64 != INTEL_PMC_MSK_FIXED_SLOTS) {
> -                               c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
> +                       if (c->cmask == FIXED_EVENT_FLAGS) {
> +                               /*
> +                                * Don't extend topdown slots and metrics
> +                                * events to generic counters.
> +                                */
> +                               if (c->idxmsk64 & INTEL_PMC_MSK_TOPDOWN) {
> +                                       c->weight = hweight64(c->idxmsk64);
> +                                       continue;
> +                               }
> +                               if (c->idxmsk64 != INTEL_PMC_MSK_FIXED_REF_CYCLES)
> +                                       c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
>                         }
>                         c->idxmsk64 &=
>                                 ~(~0ULL << (INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed));
> @@ -5152,6 +5371,9 @@ __init int intel_pmu_init(void)
>         if (x86_pmu.counter_freezing)
>                 x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
>
> +       if (x86_pmu.intel_cap.perf_metrics)
> +               x86_pmu.intel_ctrl |= 1ULL << 48;
> +
>         return 0;
>  }
>
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 37f17f55ef2d..7c59f08fadc0 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -40,6 +40,7 @@ enum extra_reg_type {
>         EXTRA_REG_LBR   = 2,    /* lbr_select */
>         EXTRA_REG_LDLAT = 3,    /* ld_lat_threshold */
>         EXTRA_REG_FE    = 4,    /* fe_* */
> +       EXTRA_REG_TOPDOWN = 5,  /* Topdown slots/metrics */
>
>         EXTRA_REG_MAX           /* number of entries needed */
>  };
> @@ -76,6 +77,12 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
>  #define PERF_X86_EVENT_EXCL_ACCT       0x0100 /* accounted EXCL event */
>  #define PERF_X86_EVENT_AUTO_RELOAD     0x0200 /* use PEBS auto-reload */
>  #define PERF_X86_EVENT_LARGE_PEBS      0x0400 /* use large PEBS */
> +#define PERF_X86_EVENT_TOPDOWN         0x0800 /* Count Topdown slots/metrics events */
> +
> +static inline bool is_topdown_count(struct perf_event *event)
> +{
> +       return event->hw.flags & PERF_X86_EVENT_TOPDOWN;
> +}
>
>  struct amd_nb {
>         int nb_id;  /* NorthBridge id */
> @@ -509,6 +516,9 @@ struct extra_reg {
>                                0xffff, \
>                                LDLAT)
>
> +#define INTEL_UEVENT_TOPDOWN_EXTRA_REG(event)  \
> +       EVENT_EXTRA_REG(event, 0, 0xfcff, -1L, TOPDOWN)
> +
>  #define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
>
>  union perf_capabilities {
> @@ -524,6 +534,7 @@ union perf_capabilities {
>                  */
>                 u64     full_width_write:1;
>                 u64     pebs_baseline:1;
> +               u64     perf_metrics:1;
>         };
>         u64     capabilities;
>  };
> @@ -686,6 +697,12 @@ struct x86_pmu {
>          */
>         atomic_t        lbr_exclusive[x86_lbr_exclusive_max];
>
> +       /*
> +        * Intel perf metrics
> +        */
> +       u64             (*update_topdown_event)(struct perf_event *event);
> +       int             (*set_topdown_event_period)(struct perf_event *event);
> +
>         /*
>          * AMD bits
>          */
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 78f3a5ebc1e2..460a419a7214 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -118,6 +118,8 @@
>  #define MSR_TURBO_RATIO_LIMIT1         0x000001ae
>  #define MSR_TURBO_RATIO_LIMIT2         0x000001af
>
> +#define MSR_PERF_METRICS               0x00000329
> +
>  #define MSR_LBR_SELECT                 0x000001c8
>  #define MSR_LBR_TOS                    0x000001c9
>  #define MSR_LBR_NHM_FROM               0x00000680
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-30 23:18   ` Stephane Eranian
@ 2019-08-31  0:31     ` Andi Kleen
  2019-08-31  9:13       ` Stephane Eranian
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2019-08-31  0:31 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Liang, Kan, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Ingo Molnar, LKML, Thomas Gleixner, Jiri Olsa,
	Alexander Shishkin

> the same manner. It would greatly simplify the kernel implementation.

I tried that originally. It was actually more complicated.

You can't really do deltas on raw metrics, and a lot of the perf
infrastructure is built around deltas.

To do the regular reset and not lose precision over time internally
you have to keep expanded counters anyways. And if you do that
you can just expose them to user space too, and have everything
in user space just work without any changes (except for the final
output)

-Andi


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-31  0:31     ` Andi Kleen
@ 2019-08-31  9:13       ` Stephane Eranian
  2019-08-31  9:29         ` Peter Zijlstra
  2019-08-31 17:53         ` Andi Kleen
  0 siblings, 2 replies; 39+ messages in thread
From: Stephane Eranian @ 2019-08-31  9:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Liang, Kan, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Ingo Molnar, LKML, Thomas Gleixner, Jiri Olsa,
	Alexander Shishkin

Andi,

On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > the same manner. It would greatly simplify the kernel implementation.
>
> I tried that originally. It was actually more complicated.
>
> You can't really do deltas on raw metrics, and a lot of the perf
> infrastructure is built around deltas.
>
How is RAPL handled? No deltas there either. It uses the snapshot model.
At each interval, perf stat just reads the current count, and does not compute
a delta since previous read.
With PERF_METRICS, the delta is always since previous read. If you read
frequently enough you do not lose precision.

>
> To do the regular reset and not lose precision over time internally
> you have to keep expanded counters anyways. And if you do that
> you can just expose them to user space too, and have everything
> in user space just work without any changes (except for the final
> output)
>
> -Andi
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-29 16:56         ` Liang, Kan
@ 2019-08-31  9:18           ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-31  9:18 UTC (permalink / raw)
  To: Liang, Kan
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin, ak

On Thu, Aug 29, 2019 at 12:56:02PM -0400, Liang, Kan wrote:
> On 8/29/2019 9:52 AM, Peter Zijlstra wrote:

> > What what? The PERF_METRICS contents depends on the FIXCTR3 value ?!
> 
> Yes.
> 
> In the current implementation, the PERF_METRICS MSR is composed of four
> fields: backend bound, frontend bound, bad speculation and retiring.
> Each field is populated using a formula of the following form, e.g.:
> PERF_METRICS[RETIRING] = (0xFF *
> PERF_METRICS_RETIRING_INTERNAL_48bit_COUNTER)
> / FIXCTR3

So it really depends on the actual exposed FIXCTR3 _value_ to compute
the PERF_METRIC field? *mind boggles*, that's really unspeakable crap.
And this isn't documented anywhere afaict.

I was thinking they'd have an internal counter for the SLOTS value too,
so the PERF_METRIC fields would be independent; which would be like 'sane'.

Exposing the internal counters would've been _soooo_ much better, just
add 4 more fixed counters and call it a day.

> METRICS_OVF indicates an overflow of any of the internal counters.

OK, but I'm thinking that by that time the fraction in PERF_METRIC will
be too coarse and we're losing precision. Reconstruction will be
inaccurate.

> The internal counters always start counting from 0 and cannot be programmed
> by SW. But resetting PERF_METRICS implicitly resets the internal
> counters as well.

The only possible option given the choices.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-28 19:04     ` Andi Kleen
@ 2019-08-31  9:19       ` Peter Zijlstra
  2019-09-09 13:40         ` Liang, Kan
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-31  9:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: kan.liang, acme, mingo, linux-kernel, tglx, jolsa, eranian,
	alexander.shishkin

On Wed, Aug 28, 2019 at 12:04:45PM -0700, Andi Kleen wrote:

> > > NMI
> > > ======
> > > 
> > > The METRICS register may be overflow. The bit 48 of STATUS register
> > > will be set. If so, update all active slots and metrics events.
> > 
> > that happen? It would be useful to get that METRIC_OVF (can we please
> 
> This happens when the internal counters that feed the metrics
> overflow.
> 
> > If this is so; then we can use this to update/reset PERF_METRICS and
> > nothing else.
> 
> It has to be handled in the PMI.

That's what I wrote; Overflow is always NMI.

> > Then there is no mucking about with that odd counter/metrics msr pair
> > reset nonsense. Because that really stinks.
> 
> You have to write them to reset the internal counters.

But not for every read, only on METRIC_OVF.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-31  9:13       ` Stephane Eranian
@ 2019-08-31  9:29         ` Peter Zijlstra
  2019-08-31 17:53         ` Andi Kleen
  1 sibling, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2019-08-31  9:29 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, Liang, Kan, Arnaldo Carvalho de Melo, Ingo Molnar,
	LKML, Thomas Gleixner, Jiri Olsa, Alexander Shishkin

On Sat, Aug 31, 2019 at 02:13:05AM -0700, Stephane Eranian wrote:

> With PERF_METRICS, the delta is always since previous read. If you read
> frequently enough you do not lose precision.

You always lose precision; the whole fraction-of-255 scheme loses the
fractional part; consider:

255 * 1 / 8
31.87500000000000000000
255 * 7 / 8
223.12500000000000000000

31 * 8 / 255
.97254901960784313725
223 * 8 / 255
6.99607843137254901960

Now, we can make reconstruction use normal 'round-nearest' and then we'd
get 1 and 7 back, but there are always cases where you lose things.

Also, with 255 being an odd number, round-nearest is actually 'hard' :/
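The truncation computed above can be modelled with plain integer
arithmetic. This is a hypothetical encode/decode round trip under the
formula discussed in this thread, not actual hardware behaviour, and the
helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit fraction encode, as in the PERF_METRICS formula:
 * field = 0xFF * counted / slots (integer division, truncating). */
static uint64_t metrics_encode(uint64_t counted, uint64_t slots)
{
	return 0xff * counted / slots;
}

/* Reconstruction done by the reader:
 * slots-in-metric = field * slots / 0xFF (also truncating). */
static uint64_t metrics_decode(uint64_t field, uint64_t slots)
{
	return field * slots / 0xff;
}
```

With slots = 8, counting 1 slot encodes to 31 and decodes back to 0, and
7 slots come back as 6; round-nearest reconstruction would recover these
particular values, but some loss is unavoidable.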

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-31  9:13       ` Stephane Eranian
  2019-08-31  9:29         ` Peter Zijlstra
@ 2019-08-31 17:53         ` Andi Kleen
  1 sibling, 0 replies; 39+ messages in thread
From: Andi Kleen @ 2019-08-31 17:53 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Liang, Kan, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Ingo Molnar, LKML, Thomas Gleixner, Jiri Olsa,
	Alexander Shishkin

On Sat, Aug 31, 2019 at 02:13:05AM -0700, Stephane Eranian wrote:
> Andi,
> 
> On Fri, Aug 30, 2019 at 5:31 PM Andi Kleen <ak@linux.intel.com> wrote:
> >
> > > the same manner. It would greatly simplify the kernel implementation.
> >
> > I tried that originally. It was actually more complicated.
> >
> > You can't really do deltas on raw metrics, and a lot of the perf
> > infrastructure is built around deltas.
> >
> How is RAPL handled? No deltas there either. It uses the snapshot model.

RAPL doesn't support any context-switch or CPU context.
Also, it has no concept of "accumulate with clear on read".

-Andi


* Re: [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics
  2019-08-31  9:19       ` Peter Zijlstra
@ 2019-09-09 13:40         ` Liang, Kan
  0 siblings, 0 replies; 39+ messages in thread
From: Liang, Kan @ 2019-09-09 13:40 UTC (permalink / raw)
  To: Peter Zijlstra, Andi Kleen
  Cc: acme, mingo, linux-kernel, tglx, jolsa, eranian, alexander.shishkin



On 8/31/2019 5:19 AM, Peter Zijlstra wrote:
>>> Then there is no mucking about with that odd counter/metrics msr pair
>>> reset nonsense. Because that really stinks.
>> You have to write them to reset the internal counters.
> But not for every read, only on METRIC_OVF.

Precision is lost if the counters run much longer than the measured 
delta, so we have to reset the counters after every read.
For example, the worst case is:
the previous reading was rounded up and the current reading was 
rounded down. The error of METRICS is 1/256 - 1/SLOTS, which depends 
on SLOTS. The error for each TopDown event is (1/256 - 1/SLOTS) * SLOTS.
As SLOTS increases, the precision gets worse and worse.
Therefore, we have to reset the counters periodically.
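That worst-case bound can be written out explicitly (a hedged sketch; 
`worst_case_error` is an illustrative helper, not kernel code): 
(1/256 - 1/SLOTS) * SLOTS simplifies to SLOTS/256 - 1, so the error 
grows linearly with SLOTS unless the counters are reset.

```python
# Worst case from the mail above: previous METRICS reading rounded up,
# current reading rounded down. The per-event error is
# (1/256 - 1/SLOTS) * SLOTS = SLOTS/256 - 1, i.e. linear in SLOTS.

def worst_case_error(slots):
    """Worst-case absolute error of one TopDown event after `slots` slots."""
    return (1 / 256 - 1 / slots) * slots

for slots in (1_000, 1_000_000, 1_000_000_000):
    print(slots, worst_case_error(slots))  # grows ~linearly with slots
```

This is why resetting only on METRIC_OVF is not enough: the bound is 
tied to how long the counters have been accumulating, not to overflow.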

Thanks,
Kan




end of thread, other threads:[~2019-09-09 13:40 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26 14:47 [RESEND PATCH V3 0/8] TopDown metrics support for Icelake kan.liang
2019-08-26 14:47 ` [RESEND PATCH V3 1/8] perf/x86/intel: Set correct mask for TOPDOWN.SLOTS kan.liang
2019-08-28  7:48   ` Peter Zijlstra
2019-08-26 14:47 ` [RESEND PATCH V3 2/8] perf/x86/intel: Basic support for metrics counters kan.liang
2019-08-28  7:48   ` Peter Zijlstra
2019-08-28  7:52   ` Peter Zijlstra
2019-08-28 13:59     ` Liang, Kan
2019-08-28  8:44   ` Peter Zijlstra
2019-08-28  9:02     ` Peter Zijlstra
2019-08-28  9:37       ` Peter Zijlstra
2019-08-28 13:51       ` Liang, Kan
2019-08-28  8:52   ` Peter Zijlstra
2019-08-26 14:47 ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics kan.liang
2019-08-28 15:02   ` Peter Zijlstra
2019-08-28 19:04     ` Andi Kleen
2019-08-31  9:19       ` Peter Zijlstra
2019-09-09 13:40         ` Liang, Kan
2019-08-28 19:35     ` Liang, Kan
2019-08-28 15:19   ` Peter Zijlstra
2019-08-28 16:11     ` [PATCH] x86/math64: Provide a sane mul_u64_u32_div() implementation for x86_64 Peter Zijlstra
2019-08-29  9:30       ` Peter Zijlstra
2019-08-28 16:17     ` [RESEND PATCH V3 3/8] perf/x86/intel: Support hardware TopDown metrics Andi Kleen
2019-08-28 16:28       ` Peter Zijlstra
2019-08-29  3:11         ` Andi Kleen
2019-08-29  9:17           ` Peter Zijlstra
2019-08-29 13:31     ` Liang, Kan
2019-08-29 13:52       ` Peter Zijlstra
2019-08-29 16:56         ` Liang, Kan
2019-08-31  9:18           ` Peter Zijlstra
2019-08-30 23:18   ` Stephane Eranian
2019-08-31  0:31     ` Andi Kleen
2019-08-31  9:13       ` Stephane Eranian
2019-08-31  9:29         ` Peter Zijlstra
2019-08-31 17:53         ` Andi Kleen
2019-08-26 14:47 ` [RESEND PATCH V3 4/8] perf/x86/intel: Support per thread RDPMC " kan.liang
2019-08-26 14:47 ` [RESEND PATCH V3 5/8] perf/x86/intel: Export TopDown events for Icelake kan.liang
2019-08-26 14:47 ` [RESEND PATCH V3 6/8] perf/x86/intel: Disable sampling read slots and topdown kan.liang
2019-08-26 14:47 ` [RESEND PATCH V3 7/8] perf, tools, stat: Support new per thread TopDown metrics kan.liang
2019-08-26 14:47 ` [RESEND PATCH V3 8/8] perf, tools: Add documentation for topdown metrics kan.liang
