linux-kernel.vger.kernel.org archive mirror
* [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
@ 2017-06-09 17:28 kan.liang
  2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: kan.liang @ 2017-06-09 17:28 UTC (permalink / raw)
  To: peterz, mingo, eranian, linux-kernel
  Cc: alexander.shishkin, acme, jolsa, torvalds, tglx, vincent.weaver,
	ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The dominant motivation is to make it possible to switch the NMI
watchdog from cycles to ref_cycles on the x86 platform. The new NMI
watchdog would
 - Free the widely used, precious cycles fixed counter. For example,
   topdown code needs the cycles fixed counter in group counting.
   Otherwise, it has to fall back to not using groups, which can cause
   measurement inaccuracy due to multiplexing errors.
 - Avoid the NMI watchdog expiring too fast. Cycles can tick faster
   than the measured CPU frequency due to Turbo mode.
   Although we can enlarge the cycles period to work around the fast
   expiry, that is still limited by the Turbo ratio and may eventually
   fail.

CPU ref_cycles can only be counted by fixed counter 2. It uses a
pseudo-encoding which the GP counters do not recognize.

BUS_CYCLES (0x013c) is another event which is not affected by core
frequency changes. It has a constant ratio with CPU ref_cycles, so
BUS_CYCLES can be used as an alternative event for ref_cycles on a GP
counter.
A hook is implemented in x86_schedule_events. If fixed counter 2 is
occupied and a GP counter is assigned, BUS_CYCLES is used to replace
ref_cycles. A new flag PERF_X86_EVENT_REF_CYCLES_REP in
hw_perf_event is introduced to indicate the replacement.
To make the switch transparent to the perf tool, counting and sampling
are also specially handled (see the sketch after this list).
 - For counting, the result is multiplied by the constant ratio after
   reading it.
 - For sampling with a fixed period, the BUS_CYCLES period is the
   ref_cycles period divided by the constant ratio.
 - For sampling with a fixed frequency, the adaptive frequency
   algorithm figures it out from the ref_cycles event's last_period and
   event count, but the reload value has to be the remaining BUS_CYCLES
   period.
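
The following user-space style sketch (not part of the patch) only
illustrates the arithmetic above; "factor" stands for the constant
ratio, and the value 24 is purely hypothetical:

#include <stdio.h>

/*
 * Sketch only: how a BUS_CYCLES delta counted on a GP counter maps back
 * to ref_cycles, and how a requested ref_cycles period maps to the
 * BUS_CYCLES period that is actually programmed.
 */
static unsigned long long rep_delta_to_ref_cycles(unsigned long long delta,
						  unsigned int factor)
{
	return delta * factor;		/* counting: scale the raw delta up */
}

static unsigned long long ref_period_to_rep_period(unsigned long long period,
						   unsigned int factor)
{
	return period / factor;		/* fixed period: program a smaller period */
}

int main(void)
{
	unsigned int factor = 24;	/* hypothetical constant ratio */

	printf("ref_cycles count:  %llu\n", rep_delta_to_ref_cycles(1000, factor));
	printf("BUS_CYCLES period: %llu\n", ref_period_to_rep_period(2400000, factor));
	return 0;
}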

User-visible RDPMC issue:
The case where ref-cycles is read via the RDPMC instruction in user
space cannot be handled transparently.
ref_cycles_factor_for_rdpmc is therefore exposed to RDPMC users.
The few users who want to read ref-cycles via RDPMC have to correct the
result by multiplying it by ref_cycles_factor_for_rdpmc manually if a
GP counter is used, as illustrated below.
The solution only affects ref-cycles. It does not impact other events'
results.
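
For illustration only, an RDPMC consumer could apply the correction
roughly as follows (a sketch; the sysfs location is assumed to be the
cpu event_source device, and error handling is minimal):

#include <stdio.h>

/*
 * Sketch: scale a raw ref-cycles value obtained via RDPMC by the
 * exported factor. Only needed when the event landed on a GP counter.
 */
static unsigned long long correct_ref_cycles(unsigned long long raw)
{
	unsigned int factor = 1;
	FILE *f = fopen("/sys/bus/event_source/devices/cpu/"
			"ref_cycles_factor_for_rdpmc", "r");

	if (f) {
		if (fscanf(f, "%u", &factor) != 1)
			factor = 1;
		fclose(f);
	}
	return raw * factor;
}

int main(void)
{
	/* the raw value would normally come from the rdpmc instruction */
	printf("%llu\n", correct_ref_cycles(123456));
	return 0;
}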

The constant ratio is model specific.
For models from Nehalem up to (but not including) Skylake, the ratio is
taken from MSR_PLATFORM_INFO.
For Skylake and later models, it can be read from CPUID.15H.
For Knights Landing, Goldmont and later, the ratio is always 1.
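For example, on a hypothetical pre-Skylake part where
MSR_PLATFORM_INFO[15:8] reads 24, a BUS_CYCLES delta of 1,000 on a GP
counter is reported as 24,000 ref_cycles, and a requested ref_cycles
period of 2,400,000 is programmed as a BUS_CYCLES period of 100,000.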

The old Silvermont/Airmont, Core2 and Atom machines are not covered by
the patch. The behavior on those machines will not change.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---

Changes since V1:
 - Retry the whole scheduling thing for the 0x0300 replacement event
   if the 0x0300 event is unscheduled.
 - Correct the method to calculate event->counter, period_left, left
   and offset in different cases
 - Expose the factor to user space
 - Modify the changelog
 - There is no transparent way to handle ref-cycles replacement events for
   RDPMC. RDPMC users have to multiply the results by the factor if the
   ref-cycles replacement event is scheduled on a GP counter.
   Other events are not impacted.
   There are few RDPMC users who use ref cycles. The impact should be limited.
   I will also send patches to other user tools, such as pmu-tools and PAPI,
   to minimize the impact.


 arch/x86/events/core.c       |  92 ++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/core.c | 110 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h |   5 ++
 3 files changed, 203 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e6f5e4b..18f8d37 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -70,7 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	int shift = 64 - x86_pmu.cntval_bits;
 	u64 prev_raw_count, new_raw_count;
 	int idx = hwc->idx;
-	u64 delta;
+	u64 delta, adjust_delta;
 
 	if (idx == INTEL_PMC_IDX_FIXED_BTS)
 		return 0;
@@ -101,8 +101,47 @@ u64 x86_perf_event_update(struct perf_event *event)
 	delta = (new_raw_count << shift) - (prev_raw_count << shift);
 	delta >>= shift;
 
-	local64_add(delta, &event->count);
-	local64_sub(delta, &hwc->period_left);
+	/*
+	 * Correct the count number if applying ref_cycles replacement.
+	 * There is a constant ratio between ref_cycles (event A) and
+	 * ref_cycles replacement (event B). The delta result is from event B.
+	 * To get accurate event A count, the real delta result must be
+	 * multiplied with the constant ratio.
+	 *
+	 * It is handled differently for sampling and counting.
+	 * - Fixed period Sampling: The period is already adjusted in
+	 *   scheduling for event B. The period_left is the remaining period
+	 *   for event B. Don't need to modify period_left.
+	 *   It's enough to only adjust event->count here.
+	 *
+	 * - Fixed freq Sampling: The adaptive frequency algorithm needs
+	 *   the last_period and event counts for event A to estimate the next
+	 *   period for event A. So the period_left is the remaining period
+	 *   for event A. It needs to multiply the result with the ratio.
+	 *   However, the period_left is also used to reload the event counter
+	 *   for event B in x86_perf_event_set_period. It has to be adjusted to
+	 *   event B's remaining period there.
+	 *
+	 * - Counting: It's enough to only adjust event->count for perf stat.
+	 *
+	 * - RDPMC: User can also read the counter directly by rdpmc/mmap.
+	 *   Users have to handle the adjustment themselves.
+	 *   For kernel, it only needs to guarantee that the offset is correct.
+	 *   In x86's arch_perf_update_userpage, the offset will be corrected
+	 *   if event B is used.
+	 */
+	adjust_delta = delta;
+	if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
+		adjust_delta = delta * x86_pmu.ref_cycles_factor;
+
+		if (is_sampling_event(event) && event->attr.freq)
+			local64_sub(adjust_delta, &hwc->period_left);
+		else
+			local64_sub(delta, &hwc->period_left);
+	} else
+		local64_sub(delta, &hwc->period_left);
+
+	local64_add(adjust_delta, &event->count);
 
 	return new_raw_count;
 }
@@ -865,6 +904,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	if (x86_pmu.start_scheduling)
 		x86_pmu.start_scheduling(cpuc);
 
+retry:
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
 		cpuc->event_constraint[i] = NULL;
 		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
@@ -952,6 +992,25 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 			 */
 			if (x86_pmu.put_event_constraints)
 				x86_pmu.put_event_constraints(cpuc, e);
+
+			/*
+			 * 0x0300 is pseudo-encoding for REF_CPU_CYCLES.
+			 * It can only be scheduled on fixed counter 2.
+			 * If the fixed counter 2 is occupied, the event is
+			 * failed scheduling.
+			 *
+			 * If that's the case, an alternative event, which
+			 * can be counted in GP counters, could be used to
+			 * replace the pseudo-encoding REF_CPU_CYCLES event.
+			 *
+			 * Here we reset the event and retry the whole
+			 * scheduling thing if there is unsched 0x0300 event.
+			 */
+			if (unsched && x86_pmu.ref_cycles_rep &&
+			    ((e->hw.config & X86_RAW_EVENT_MASK) == 0x0300)) {
+				x86_pmu.ref_cycles_rep(e);
+				goto retry;
+			}
 		}
 	}
 
@@ -1136,6 +1195,14 @@ int x86_perf_event_set_period(struct perf_event *event)
 		hwc->last_period = period;
 		ret = 1;
 	}
+
+	/*
+	 * Adjust left for ref_cycles replacement.
+	 * Please refer the comments in x86_perf_event_update for details.
+	 */
+	if ((hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) &&
+	    event->attr.freq)
+		left /= (u64)x86_pmu.ref_cycles_factor;
 	/*
 	 * Quirk: certain CPUs dont like it if just 1 hw_event is left:
 	 */
@@ -1823,6 +1890,14 @@ static int __init init_hw_perf_events(void)
 			x86_pmu_attr_group.attrs = tmp;
 	}
 
+	if (x86_pmu.factor_attrs) {
+		struct attribute **tmp;
+
+		tmp = merge_attr(x86_pmu_attr_group.attrs, x86_pmu.factor_attrs);
+		if (!WARN_ON(!tmp))
+			x86_pmu_attr_group.attrs = tmp;
+	}
+
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.cntval_bits);
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_counters);
@@ -2300,6 +2375,17 @@ void arch_perf_update_userpage(struct perf_event *event,
 	}
 
 	cyc2ns_read_end(data);
+
+	/*
+	 * offset should be corrected if ref cycles replacement is used.
+	 * Please refer the comments in x86_perf_event_update for details.
+	 */
+	if (event->hw.flags & PERF_X86_EVENT_REF_CYCLES_REP) {
+		userpg->offset = __perf_event_count(event);
+		userpg->offset /= (u64)x86_pmu.ref_cycles_factor;
+		if (userpg->index)
+			userpg->offset -= local64_read(&event->hw.prev_count);
+	}
 }
 
 void
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index da9047e..4853795 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3063,6 +3063,55 @@ static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
 	return left;
 }
 
+static __init unsigned int glm_get_ref_cycles_factor(void)
+{
+	return 1;
+}
+
+static __init unsigned int nhm_get_ref_cycles_factor(void)
+{
+	u64 platform_info;
+
+	rdmsrl(MSR_PLATFORM_INFO, platform_info);
+	return (platform_info >> 8) & 0xff;
+}
+
+static __init unsigned int skl_get_ref_cycles_factor(void)
+{
+	unsigned int cpuid21_eax, cpuid21_ebx, cpuid21_ecx, unused;
+
+	cpuid(21, &cpuid21_eax, &cpuid21_ebx, &cpuid21_ecx, &unused);
+	if (!cpuid21_eax || !cpuid21_ebx)
+		return 0;
+
+	return cpuid21_ebx / cpuid21_eax;
+}
+
+/*
+ * BUS_CYCLES (0x013c) is another event which is not affected by core
+ * frequency changes. It has a constant ratio with the CPU ref_cycles.
+ * BUS_CYCLES could be used as an alternative event for ref_cycles on
+ * GP counter.
+ */
+void nhm_ref_cycles_rep(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->config = (hwc->config & ~X86_RAW_EVENT_MASK) |
+		      intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES];
+	hwc->flags |= PERF_X86_EVENT_REF_CYCLES_REP;
+
+	/* adjust the sample period for ref_cycles replacement event */
+	if (is_sampling_event(event) && hwc->sample_period != 1) {
+
+		hwc->sample_period /= (u64)x86_pmu.ref_cycles_factor;
+		if (hwc->sample_period < 2)
+			hwc->sample_period = 2;
+		hwc->last_period = hwc->sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+	}
+}
+
 PMU_FORMAT_ATTR(event,	"config:0-7"	);
 PMU_FORMAT_ATTR(umask,	"config:8-15"	);
 PMU_FORMAT_ATTR(edge,	"config:18"	);
@@ -3656,6 +3705,21 @@ static struct attribute *intel_pmu_attrs[] = {
 	NULL,
 };
 
+
+static ssize_t ref_cycles_factor_for_rdpmc_show(struct device *cdev,
+						struct device_attribute *attr,
+						char *buf)
+{
+	return sprintf(buf, "%u\n", x86_pmu.ref_cycles_factor);
+}
+
+DEVICE_ATTR_RO(ref_cycles_factor_for_rdpmc);
+
+static struct attribute *intel_ref_cycles_factor_attrs[] = {
+	&dev_attr_ref_cycles_factor_for_rdpmc.attr,
+	NULL,
+};
+
 __init int intel_pmu_init(void)
 {
 	union cpuid10_edx edx;
@@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void)
 
 		intel_pmu_pebs_data_source_nhm();
 		x86_add_quirk(intel_nehalem_quirk);
-
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Nehalem events, ");
 		break;
 
@@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.lbr_pt_coexist = true;
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.cpu_events = glm_events_attrs;
+		x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Goldmont events, ");
 		break;
 
@@ -3864,6 +3937,11 @@ __init int intel_pmu_init(void)
 			X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1);
 
 		intel_pmu_pebs_data_source_nhm();
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Westmere events, ");
 		break;
 
@@ -3899,6 +3977,11 @@ __init int intel_pmu_init(void)
 		/* UOPS_DISPATCHED.THREAD,c=1,i=1 to count stall cycles*/
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
 			X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1);
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 
 		pr_cont("SandyBridge events, ");
 		break;
@@ -3933,6 +4016,11 @@ __init int intel_pmu_init(void)
 		/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 
 		pr_cont("IvyBridge events, ");
 		break;
@@ -3962,6 +4050,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.cpu_events = hsw_events_attrs;
 		x86_pmu.lbr_double_abort = true;
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Haswell events, ");
 		break;
 
@@ -3998,6 +4091,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.cpu_events = hsw_events_attrs;
 		x86_pmu.limit_period = bdw_limit_period;
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Broadwell events, ");
 		break;
 
@@ -4016,6 +4114,11 @@ __init int intel_pmu_init(void)
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+		x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 
 		pr_cont("Knights Landing/Mill events, ");
 		break;
@@ -4051,6 +4154,11 @@ __init int intel_pmu_init(void)
 						  skl_format_attr);
 		WARN_ON(!x86_pmu.format_attrs);
 		x86_pmu.cpu_events = hsw_events_attrs;
+		x86_pmu.ref_cycles_factor = skl_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Skylake events, ");
 		break;
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53728ee..3470178 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -68,6 +68,7 @@ struct event_constraint {
 #define PERF_X86_EVENT_EXCL_ACCT	0x0200 /* accounted EXCL event */
 #define PERF_X86_EVENT_AUTO_RELOAD	0x0400 /* use PEBS auto-reload */
 #define PERF_X86_EVENT_FREERUNNING	0x0800 /* use freerunning PEBS */
+#define PERF_X86_EVENT_REF_CYCLES_REP	0x1000 /* use ref_cycles replacement */
 
 
 struct amd_nb {
@@ -550,6 +551,8 @@ struct x86_pmu {
 	int		perfctr_second_write;
 	bool		late_ack;
 	unsigned	(*limit_period)(struct perf_event *event, unsigned l);
+	unsigned int	ref_cycles_factor;
+	void		(*ref_cycles_rep)(struct perf_event *event);
 
 	/*
 	 * sysfs attrs
@@ -565,6 +568,8 @@ struct x86_pmu {
 	unsigned long	attr_freeze_on_smi;
 	struct attribute **attrs;
 
+	unsigned int	attr_ref_cycles_factor;
+	struct attribute **factor_attrs;
 	/*
 	 * CPU Hotplug hooks
 	 */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
@ 2017-06-09 17:28 ` kan.liang
       [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
  2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: kan.liang @ 2017-06-09 17:28 UTC (permalink / raw)
  To: peterz, mingo, eranian, linux-kernel
  Cc: alexander.shishkin, acme, jolsa, torvalds, tglx, vincent.weaver,
	ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The NMI watchdog uses either the fixed cycles or a generic cycles
counter. This causes a lot of conflicts with users of the PMU who want
to run a full group including the cycles fixed counter, for example the
--topdown support recently added to perf stat. The code needs to fall
back to not using groups, which can cause measurement inaccuracy due to
multiplexing errors.

This patch switches the NMI watchdog to use reference cycles on Intel
systems. This is actually more accurate than cycles, because cycles can
tick faster than the measured CPU Frequency due to Turbo mode.

The ref cycles always tick at their nominal frequency, or slower when
the system is idling. That means the NMI watchdog can never expire too
early, unlike with cycles.

The reference cycles tick roughly at the frequency of the TSC, so the
same period computation can be used.
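
For reference, the existing period computation is roughly of the
following shape (a simplified sketch, not the literal kernel code),
which is why it carries over unchanged:

/*
 * Sketch: the watchdog sample period is derived from the CPU/TSC
 * frequency and the watchdog threshold, so it is equally valid for
 * ref-cycles, which tick at roughly the same rate.
 */
unsigned long long nmi_sample_period(unsigned int cpu_khz,
				     int watchdog_thresh)
{
	return (unsigned long long)cpu_khz * 1000ULL * watchdog_thresh;
}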

For older platforms like Silvermont/Airmont, Core2 and Atom, don't do the
switch. Their NMI watchdog still uses the cycles event.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---

Changes since V1:
 - Don't use ref-cycles NMI watchdog in older platform.

 arch/x86/events/core.c | 10 ++++++++++
 include/linux/nmi.h    |  1 +
 kernel/watchdog_hld.c  |  7 +++++++
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18f8d37..e4c9f11 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2625,3 +2625,13 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 }
 EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
+
+#ifdef CONFIG_HARDLOCKUP_DETECTOR
+int hw_nmi_get_event(void)
+{
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
+	    (x86_pmu.ref_cycles_rep))
+		return PERF_COUNT_HW_REF_CPU_CYCLES;
+	return PERF_COUNT_HW_CPU_CYCLES;
+}
+#endif
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index aa3cd08..b2fa444 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -141,6 +141,7 @@ static inline bool trigger_single_cpu_backtrace(int cpu)
 
 #ifdef CONFIG_LOCKUP_DETECTOR
 u64 hw_nmi_get_sample_period(int watchdog_thresh);
+int hw_nmi_get_event(void);
 extern int nmi_watchdog_enabled;
 extern int soft_watchdog_enabled;
 extern int watchdog_user_enabled;
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 54a427d..f899766 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -70,6 +70,12 @@ void touch_nmi_watchdog(void)
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
 
+/* Can be overridden by architecture */
+__weak int hw_nmi_get_event(void)
+{
+	return PERF_COUNT_HW_CPU_CYCLES;
+}
+
 static struct perf_event_attr wd_hw_attr = {
 	.type		= PERF_TYPE_HARDWARE,
 	.config		= PERF_COUNT_HW_CPU_CYCLES,
@@ -165,6 +171,7 @@ int watchdog_nmi_enable(unsigned int cpu)
 
 	wd_attr = &wd_hw_attr;
 	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
+	wd_attr->config = hw_nmi_get_event();
 
 	/* Try to register using hardware perf events */
 	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
       [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
@ 2017-06-11 18:44     ` Andi Kleen
  2017-06-12 12:51     ` Liang, Kan
  1 sibling, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2017-06-11 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kan.liang, Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel,
	tglx, peterz, vincent.weaver, eranian, alexander.shishkin, mingo

On Fri, Jun 09, 2017 at 08:39:59PM -0700, Linus Torvalds wrote:
>    Not commenting on the patch itself - I'll leave that to others. But the
>    sign-off chain is buggered.
>    When you re-send these patches that got reverted earlier, you really
>    should add yourself to the sign-off list.. and if you were the original
>    author, you should have been there from the first...
>    Hmm?

I was the original author, but this got lost somewhere.

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
  2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
@ 2017-06-12 10:36 ` Jiri Olsa
  2017-06-12 13:07   ` Liang, Kan
  2017-06-23 17:21 ` Liang, Kan
  2017-06-23 21:46 ` Thomas Gleixner
  3 siblings, 1 reply; 8+ messages in thread
From: Jiri Olsa @ 2017-06-12 10:36 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, eranian, linux-kernel, alexander.shishkin, acme,
	torvalds, tglx, vincent.weaver, ak

On Fri, Jun 09, 2017 at 10:28:02AM -0700, kan.liang@intel.com wrote:

SNIP

> +	/*
> +	 * Correct the count number if applying ref_cycles replacement.
> +	 * There is a constant ratio between ref_cycles (event A) and
> +	 * ref_cycles replacement (event B). The delta result is from event B.
> +	 * To get accurate event A count, the real delta result must be
> +	 * multiplied with the constant ratio.
> +	 *
> +	 * It is handled differently for sampling and counting.
> +	 * - Fixed period Sampling: The period is already adjusted in
> +	 *   scheduling for event B. The period_left is the remaining period
> +	 *   for event B. Don't need to modify period_left.
> +	 *   It's enough to only adjust event->count here.
> +	 *
> +	 * - Fixed freq Sampling: The adaptive frequency algorithm needs
> +	 *   the last_period and event counts for event A to estimate the next
> +	 *   period for event A. So the period_left is the remaining period
> +	 *   for event A. It needs to multiply the result with the ratio.
> +	 *   However, the period_left is also used to reload the event counter
> +	 *   for event B in x86_perf_event_set_period. It has to be adjusted to
> +	 *   event B's remaining period there.
> +	 *
> +	 * - Counting: It's enough to only adjust event->count for perf stat.
> +	 *
> +	 * - RDPMC: User can also read the counter directly by rdpmc/mmap.
> +	 *   Users have to handle the adjustment themselves.
> +	 *   For kernel, it only needs to guarantee that the offset is correct.
> +	 *   In x86's arch_perf_update_userpage, the offset will be corrected
> +	 *   if event B is used.
> +	 */
> +	adjust_delta = delta;
> +	if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> +		adjust_delta = delta * x86_pmu.ref_cycles_factor;
> +
> +		if (is_sampling_event(event) && event->attr.freq)
> +			local64_sub(adjust_delta, &hwc->period_left);
> +		else
> +			local64_sub(delta, &hwc->period_left);

shouldn't you use adjust_delta in here also ^^^ ?

jirka

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
       [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
  2017-06-11 18:44     ` Andi Kleen
@ 2017-06-12 12:51     ` Liang, Kan
  1 sibling, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2017-06-12 12:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel, tglx, peterz,
	vincent.weaver, ak, eranian, alexander.shishkin, mingo



> When you re-send these patches that got reverted earlier, you really
> should add yourself to the sign-off list.. and if you were the original
> author, you should have been there from the first...
> Hmm?

Andi was the original author.
Sure, I will add my signature after him.

Thanks,
Kan

> On Jun 9, 2017 11:22 AM, <kan.liang@intel.com> wrote:
> From: Kan Liang <Kan.liang@intel.com>
> ...
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
@ 2017-06-12 13:07   ` Liang, Kan
  0 siblings, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2017-06-12 13:07 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: peterz, mingo, eranian, linux-kernel, alexander.shishkin, acme,
	torvalds, tglx, vincent.weaver, ak

> 
> On Fri, Jun 09, 2017 at 10:28:02AM -0700, kan.liang@intel.com wrote:
> 
> SNIP
> 
> > +	/*
> > +	 * Correct the count number if applying ref_cycles replacement.
> > +	 * There is a constant ratio between ref_cycles (event A) and
> > +	 * ref_cycles replacement (event B). The delta result is from event B.
> > +	 * To get accurate event A count, the real delta result must be
> > +	 * multiplied with the constant ratio.
> > +	 *
> > +	 * It is handled differently for sampling and counting.
> > +	 * - Fixed period Sampling: The period is already adjusted in
> > +	 *   scheduling for event B. The period_left is the remaining period
> > +	 *   for event B. Don't need to modify period_left.
> > +	 *   It's enough to only adjust event->count here.
> > +	 *
> > +	 * - Fixed freq Sampling: The adaptive frequency algorithm needs
> > +	 *   the last_period and event counts for event A to estimate the next
> > +	 *   period for event A. So the period_left is the remaining period
> > +	 *   for event A. It needs to multiply the result with the ratio.
> > +	 *   However, the period_left is also used to reload the event counter
> > +	 *   for event B in x86_perf_event_set_period. It has to be adjusted to
> > +	 *   event B's remaining period there.
> > +	 *
> > +	 * - Counting: It's enough to only adjust event->count for perf stat.
> > +	 *
> > +	 * - RDPMC: User can also read the counter directly by rdpmc/mmap.
> > +	 *   Users have to handle the adjustment themselves.
> > +	 *   For kernel, it only needs to guarantee that the offset is correct.
> > +	 *   In x86's arch_perf_update_userpage, the offset will be corrected
> > +	 *   if event B is used.
> > +	 */
> > +	adjust_delta = delta;
> > +	if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> > +		adjust_delta = delta * x86_pmu.ref_cycles_factor;
> > +
> > +		if (is_sampling_event(event) && event->attr.freq)
> > +			local64_sub(adjust_delta, &hwc->period_left);
> > +		else
> > +			local64_sub(delta, &hwc->period_left);
> 
> shouldn't you use adjust_delta in here also ^^^ ?

No, we shouldn't use adjust_delta in all cases. It is only needed for
fixed-frequency sampling.

For fixed-period sampling, we already adjust the period in scheduling,
so the period_left here is already adjusted. We should not do it twice.
For RDPMC, the user will handle the adjustment. The kernel should not
make any changes here.
For counting (perf stat), it doesn't matter. We only care about
event->count.
   
Thanks,
Kan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
  2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
  2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
@ 2017-06-23 17:21 ` Liang, Kan
  2017-06-23 21:46 ` Thomas Gleixner
  3 siblings, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2017-06-23 17:21 UTC (permalink / raw)
  To: peterz, mingo, eranian, linux-kernel
  Cc: alexander.shishkin, acme, jolsa, torvalds, tglx, vincent.weaver,
	ak, Don Zickus (dzickus@redhat.com)

Hi all,

Any comments for the series?

We have had some users (on both client and server platforms) report
spurious NMI watchdog timeout issues.
There is now a proposed workaround fix from tglx, which we are testing.
However, this patch series is believed to be the real fix for the issue,
so it would be better to get it merged into mainline.


Here is the workaround fix, if you are interested.
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/

Thanks,
Kan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
                   ` (2 preceding siblings ...)
  2017-06-23 17:21 ` Liang, Kan
@ 2017-06-23 21:46 ` Thomas Gleixner
  3 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2017-06-23 21:46 UTC (permalink / raw)
  To: Kan Liang
  Cc: peterz, mingo, eranian, linux-kernel, alexander.shishkin, acme,
	jolsa, torvalds, vincent.weaver, ak

On Fri, 9 Jun 2017, kan.liang@intel.com wrote:
> User visible RDPMC issue
> It is impossible to transparently handle the case, which reading
> ref-cycles by RDPMC instruction in user space.
> ref_cycles_factor_for_rdpmc is exposed for RDPMC user.
> For the few users who want to read ref-cycles by RDPMC, they have to
> correct the result by multipling ref_cycles_factor_for_rdpmc manually,
> if GP counter is used.

That's not how it works. You cannot willy-nilly define what existing users
have to do and thereby break existing setups and use cases.

It does not matter how many patches you send to pmu tools, PAPI and
whatever. It'll take ages until they permeate into distros, but until that
happens users are stranded.

The _existing_ user space interface is also used by power users directly
and we are not going to break that just because you think there are just a
few users and they should accommodate.

If you want to move the watchdog NMI to ref cycles, then this needs to
be on an opt-in basis via a kernel command line option and perhaps a
config option which is default n.

>  __init int intel_pmu_init(void)
>  {
>  	union cpuid10_edx edx;
> @@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void)
>  
>  		intel_pmu_pebs_data_source_nhm();
>  		x86_add_quirk(intel_nehalem_quirk);
> -
> +		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +		if (x86_pmu.ref_cycles_factor) {

Why would this be 0? 

> +			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;

What's the point of this function pointer? All implementations use the same
callback and I don't see any reason why this would change in the foreseeable
future. If there is a reason, you'd better document it.

The function can be called directly w/o indirection. And the conditionals
can look at ref_cycles_factor to figure out whether it can be done or not.

And with that you can spare the whole exercise of sprinkling the same
factor_attrs thingy all over the place. You can simply check
ref_cycles_factor in the attribute setup code and conditionally merge
intel_ref_cycles_factor_attrs.
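
A rough sketch of that suggested shape (hypothetical, ignoring file and
symbol-visibility details; the merge pattern is copied from the patch's
init_hw_perf_events hunk):

	/*
	 * Do the conditional merge once in the attribute setup path, keyed
	 * off ref_cycles_factor, instead of assigning x86_pmu.factor_attrs
	 * in every model switch case.
	 */
	if (x86_pmu.ref_cycles_factor) {
		struct attribute **tmp;

		tmp = merge_attr(x86_pmu_attr_group.attrs,
				 intel_ref_cycles_factor_attrs);
		if (!WARN_ON(!tmp))
			x86_pmu_attr_group.attrs = tmp;
	}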

> +			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +		}
>  		pr_cont("Nehalem events, ");
>  		break;
>  
> @@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void)
>  		x86_pmu.lbr_pt_coexist = true;
>  		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
>  		x86_pmu.cpu_events = glm_events_attrs;
> +		x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
> +		if (x86_pmu.ref_cycles_factor) {

Copy and paste "programming" ... The factor is hard-coded to '1', as you
also documented in the changelog. So why do you need an extra function to
retrieve 1 and a sanity check whether the function actually returned 1?

> +			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +		}
>  		pr_cont("Goldmont events, ");
>  		break;

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-06-23 21:47 UTC | newest]

Thread overview: 8+ messages
2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
     [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
2017-06-11 18:44     ` Andi Kleen
2017-06-12 12:51     ` Liang, Kan
2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
2017-06-12 13:07   ` Liang, Kan
2017-06-23 17:21 ` Liang, Kan
2017-06-23 21:46 ` Thomas Gleixner
