linux-kernel.vger.kernel.org archive mirror
* [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
@ 2017-06-09 17:28 kan.liang
  2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: kan.liang @ 2017-06-09 17:28 UTC (permalink / raw)
  To: peterz, mingo, eranian, linux-kernel
  Cc: alexander.shishkin, acme, jolsa, torvalds, tglx, vincent.weaver,
	ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The dominant motivation is to make it possible to switch the NMI
watchdog from cycles to ref_cycles on the x86 platform. The new NMI
watchdog would
 - Free the widely used, precious cycles fixed counter. For example,
   topdown code needs the cycles fixed counter in group counting.
   Otherwise, it has to fall back to not using groups, which can cause
   measurement inaccuracy due to multiplexing errors.
 - Avoid the NMI watchdog expiring too fast. Cycles can tick faster
   than the measured CPU frequency due to Turbo mode.
   Although we can enlarge the cycles period to work around the fast
   expiry, that is still limited by the Turbo ratio and may eventually
   fail.

CPU ref_cycles can only be counted by fixed counter 2. It uses a
pseudo-encoding which the GP counters do not recognize.

BUS_CYCLES (0x013c) is another event which is not affected by core
frequency changes. It has a constant ratio with CPU ref_cycles, so
BUS_CYCLES can be used as an alternative event for ref_cycles on a GP
counter.
A hook is implemented in x86_schedule_events. If fixed counter 2 is
occupied and a GP counter is assigned, BUS_CYCLES is used to replace
ref_cycles. A new flag PERF_X86_EVENT_REF_CYCLES_REP in
hw_perf_event is introduced to indicate the replacement.
To make the switch transparent to the perf tool, counting and sampling
are also specially handled (see the sketch after this list).
 - For counting, the result is multiplied by the constant ratio after
   reading it.
 - For sampling with a fixed period, the BUS_CYCLES period is the
   ref_cycles period divided by the constant ratio.
 - For sampling with a fixed frequency, the adaptive frequency
   algorithm figures it out from the ref_cycles event's last_period and
   event count, but the reload value has to be the remaining BUS_CYCLES
   period.
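
The following user-space style sketch (not part of the patch) only
illustrates the arithmetic above; "factor" stands for the constant
ratio, and the value 24 is purely hypothetical:

#include <stdio.h>

/*
 * Sketch only: how a BUS_CYCLES delta counted on a GP counter maps back
 * to ref_cycles, and how a requested ref_cycles period maps to the
 * BUS_CYCLES period that is actually programmed.
 */
static unsigned long long rep_delta_to_ref_cycles(unsigned long long delta,
						  unsigned int factor)
{
	return delta * factor;		/* counting: scale the raw delta up */
}

static unsigned long long ref_period_to_rep_period(unsigned long long period,
						   unsigned int factor)
{
	return period / factor;		/* fixed period: program a smaller period */
}

int main(void)
{
	unsigned int factor = 24;	/* hypothetical constant ratio */

	printf("ref_cycles count:  %llu\n", rep_delta_to_ref_cycles(1000, factor));
	printf("BUS_CYCLES period: %llu\n", ref_period_to_rep_period(2400000, factor));
	return 0;
}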

User-visible RDPMC issue:
The case where ref-cycles is read via the RDPMC instruction in user
space cannot be handled transparently.
ref_cycles_factor_for_rdpmc is therefore exposed to RDPMC users.
The few users who want to read ref-cycles via RDPMC have to correct the
result by multiplying it by ref_cycles_factor_for_rdpmc manually if a
GP counter is used, as illustrated below.
The solution only affects ref-cycles. It does not impact other events'
results.
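
For illustration only, an RDPMC consumer could apply the correction
roughly as follows (a sketch; the sysfs location is assumed to be the
cpu event_source device, and error handling is minimal):

#include <stdio.h>

/*
 * Sketch: scale a raw ref-cycles value obtained via RDPMC by the
 * exported factor. Only needed when the event landed on a GP counter.
 */
static unsigned long long correct_ref_cycles(unsigned long long raw)
{
	unsigned int factor = 1;
	FILE *f = fopen("/sys/bus/event_source/devices/cpu/"
			"ref_cycles_factor_for_rdpmc", "r");

	if (f) {
		if (fscanf(f, "%u", &factor) != 1)
			factor = 1;
		fclose(f);
	}
	return raw * factor;
}

int main(void)
{
	/* the raw value would normally come from the rdpmc instruction */
	printf("%llu\n", correct_ref_cycles(123456));
	return 0;
}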

The constant ratio is model specific.
For models from Nehalem up to (but not including) Skylake, the ratio is
taken from MSR_PLATFORM_INFO.
For Skylake and later models, it can be read from CPUID.15H.
For Knights Landing, Goldmont and later, the ratio is always 1.
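For example, on a hypothetical pre-Skylake part where
MSR_PLATFORM_INFO[15:8] reads 24, a BUS_CYCLES delta of 1,000 on a GP
counter is reported as 24,000 ref_cycles, and a requested ref_cycles
period of 2,400,000 is programmed as a BUS_CYCLES period of 100,000.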

The old Silvermont/Airmont, Core2 and Atom machines are not covered by
the patch. The behavior on those machines will not change.

Signed-off-by: Kan Liang <Kan.liang@intel.com>
---

Changes since V1:
 - Retry the whole scheduling thing for the 0x0300 replacement event
   if the 0x0300 event is unscheduled.
 - Correct the method to calculate event->counter, period_left, left
   and offset in different cases
 - Expose the factor to user space
 - Modify the changelog
 - There is no transparent way to handle ref-cycles replacement events for
   RDPMC. RDPMC users have to multiply the results by the factor if the
   ref-cycles replacement event is scheduled on a GP counter.
   Other events are not impacted.
   There are few RDPMC users who use ref cycles. The impact should be limited.
   I will also send patches to other user tools, such as pmu-tools and PAPI,
   to minimize the impact.


 arch/x86/events/core.c       |  92 ++++++++++++++++++++++++++++++++++--
 arch/x86/events/intel/core.c | 110 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h |   5 ++
 3 files changed, 203 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e6f5e4b..18f8d37 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -70,7 +70,7 @@ u64 x86_perf_event_update(struct perf_event *event)
 	int shift = 64 - x86_pmu.cntval_bits;
 	u64 prev_raw_count, new_raw_count;
 	int idx = hwc->idx;
-	u64 delta;
+	u64 delta, adjust_delta;
 
 	if (idx == INTEL_PMC_IDX_FIXED_BTS)
 		return 0;
@@ -101,8 +101,47 @@ u64 x86_perf_event_update(struct perf_event *event)
 	delta = (new_raw_count << shift) - (prev_raw_count << shift);
 	delta >>= shift;
 
-	local64_add(delta, &event->count);
-	local64_sub(delta, &hwc->period_left);
+	/*
+	 * Correct the count number if applying ref_cycles replacement.
+	 * There is a constant ratio between ref_cycles (event A) and
+	 * ref_cycles replacement (event B). The delta result is from event B.
+	 * To get accurate event A count, the real delta result must be
+	 * multiplied with the constant ratio.
+	 *
+	 * It is handled differently for sampling and counting.
+	 * - Fixed period Sampling: The period is already adjusted in
+	 *   scheduling for event B. The period_left is the remaining period
+	 *   for event B. Don't need to modify period_left.
+	 *   It's enough to only adjust event->count here.
+	 *
+	 * - Fixed freq Sampling: The adaptive frequency algorithm needs
+	 *   the last_period and event counts for event A to estimate the next
+	 *   period for event A. So the period_left is the remaining period
+	 *   for event A. It needs to multiply the result with the ratio.
+	 *   However, the period_left is also used to reload the event counter
+	 *   for event B in x86_perf_event_set_period. It has to be adjusted to
+	 *   event B's remaining period there.
+	 *
+	 * - Counting: It's enough to only adjust event->count for perf stat.
+	 *
+	 * - RDPMC: User can also read the counter directly by rdpmc/mmap.
+	 *   Users have to handle the adjustment themselves.
+	 *   For kernel, it only needs to guarantee that the offset is correct.
+	 *   In x86's arch_perf_update_userpage, the offset will be corrected
+	 *   if event B is used.
+	 */
+	adjust_delta = delta;
+	if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
+		adjust_delta = delta * x86_pmu.ref_cycles_factor;
+
+		if (is_sampling_event(event) && event->attr.freq)
+			local64_sub(adjust_delta, &hwc->period_left);
+		else
+			local64_sub(delta, &hwc->period_left);
+	} else
+		local64_sub(delta, &hwc->period_left);
+
+	local64_add(adjust_delta, &event->count);
 
 	return new_raw_count;
 }
@@ -865,6 +904,7 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 	if (x86_pmu.start_scheduling)
 		x86_pmu.start_scheduling(cpuc);
 
+retry:
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
 		cpuc->event_constraint[i] = NULL;
 		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
@@ -952,6 +992,25 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 			 */
 			if (x86_pmu.put_event_constraints)
 				x86_pmu.put_event_constraints(cpuc, e);
+
+			/*
+			 * 0x0300 is pseudo-encoding for REF_CPU_CYCLES.
+			 * It can only be scheduled on fixed counter 2.
+			 * If the fixed counter 2 is occupied, the event is
+			 * failed scheduling.
+			 *
+			 * If that's the case, an alternative event, which
+			 * can be counted in GP counters, could be used to
+			 * replace the pseudo-encoding REF_CPU_CYCLES event.
+			 *
+			 * Here we reset the event and retry the whole
+			 * scheduling thing if there is unsched 0x0300 event.
+			 */
+			if (unsched && x86_pmu.ref_cycles_rep &&
+			    ((e->hw.config & X86_RAW_EVENT_MASK) == 0x0300)) {
+				x86_pmu.ref_cycles_rep(e);
+				goto retry;
+			}
 		}
 	}
 
@@ -1136,6 +1195,14 @@ int x86_perf_event_set_period(struct perf_event *event)
 		hwc->last_period = period;
 		ret = 1;
 	}
+
+	/*
+	 * Adjust left for ref_cycles replacement.
+	 * Please refer the comments in x86_perf_event_update for details.
+	 */
+	if ((hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) &&
+	    event->attr.freq)
+		left /= (u64)x86_pmu.ref_cycles_factor;
 	/*
 	 * Quirk: certain CPUs dont like it if just 1 hw_event is left:
 	 */
@@ -1823,6 +1890,14 @@ static int __init init_hw_perf_events(void)
 			x86_pmu_attr_group.attrs = tmp;
 	}
 
+	if (x86_pmu.factor_attrs) {
+		struct attribute **tmp;
+
+		tmp = merge_attr(x86_pmu_attr_group.attrs, x86_pmu.factor_attrs);
+		if (!WARN_ON(!tmp))
+			x86_pmu_attr_group.attrs = tmp;
+	}
+
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.cntval_bits);
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_counters);
@@ -2300,6 +2375,17 @@ void arch_perf_update_userpage(struct perf_event *event,
 	}
 
 	cyc2ns_read_end(data);
+
+	/*
+	 * offset should be corrected if ref cycles replacement is used.
+	 * Please refer the comments in x86_perf_event_update for details.
+	 */
+	if (event->hw.flags & PERF_X86_EVENT_REF_CYCLES_REP) {
+		userpg->offset = __perf_event_count(event);
+		userpg->offset /= (u64)x86_pmu.ref_cycles_factor;
+		if (userpg->index)
+			userpg->offset -= local64_read(&event->hw.prev_count);
+	}
 }
 
 void
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index da9047e..4853795 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3063,6 +3063,55 @@ static unsigned bdw_limit_period(struct perf_event *event, unsigned left)
 	return left;
 }
 
+static __init unsigned int glm_get_ref_cycles_factor(void)
+{
+	return 1;
+}
+
+static __init unsigned int nhm_get_ref_cycles_factor(void)
+{
+	u64 platform_info;
+
+	rdmsrl(MSR_PLATFORM_INFO, platform_info);
+	return (platform_info >> 8) & 0xff;
+}
+
+static __init unsigned int skl_get_ref_cycles_factor(void)
+{
+	unsigned int cpuid21_eax, cpuid21_ebx, cpuid21_ecx, unused;
+
+	cpuid(21, &cpuid21_eax, &cpuid21_ebx, &cpuid21_ecx, &unused);
+	if (!cpuid21_eax || !cpuid21_ebx)
+		return 0;
+
+	return cpuid21_ebx / cpuid21_eax;
+}
+
+/*
+ * BUS_CYCLES (0x013c) is another event which is not affected by core
+ * frequency changes. It has a constant ratio with the CPU ref_cycles.
+ * BUS_CYCLES could be used as an alternative event for ref_cycles on
+ * GP counter.
+ */
+void nhm_ref_cycles_rep(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->config = (hwc->config & ~X86_RAW_EVENT_MASK) |
+		      intel_perfmon_event_map[PERF_COUNT_HW_BUS_CYCLES];
+	hwc->flags |= PERF_X86_EVENT_REF_CYCLES_REP;
+
+	/* adjust the sample period for ref_cycles replacement event */
+	if (is_sampling_event(event) && hwc->sample_period != 1) {
+
+		hwc->sample_period /= (u64)x86_pmu.ref_cycles_factor;
+		if (hwc->sample_period < 2)
+			hwc->sample_period = 2;
+		hwc->last_period = hwc->sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+	}
+}
+
 PMU_FORMAT_ATTR(event,	"config:0-7"	);
 PMU_FORMAT_ATTR(umask,	"config:8-15"	);
 PMU_FORMAT_ATTR(edge,	"config:18"	);
@@ -3656,6 +3705,21 @@ static struct attribute *intel_pmu_attrs[] = {
 	NULL,
 };
 
+
+static ssize_t ref_cycles_factor_for_rdpmc_show(struct device *cdev,
+						struct device_attribute *attr,
+						char *buf)
+{
+	return sprintf(buf, "%u\n", x86_pmu.ref_cycles_factor);
+}
+
+DEVICE_ATTR_RO(ref_cycles_factor_for_rdpmc);
+
+static struct attribute *intel_ref_cycles_factor_attrs[] = {
+	&dev_attr_ref_cycles_factor_for_rdpmc.attr,
+	NULL,
+};
+
 __init int intel_pmu_init(void)
 {
 	union cpuid10_edx edx;
@@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void)
 
 		intel_pmu_pebs_data_source_nhm();
 		x86_add_quirk(intel_nehalem_quirk);
-
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Nehalem events, ");
 		break;
 
@@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.lbr_pt_coexist = true;
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.cpu_events = glm_events_attrs;
+		x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Goldmont events, ");
 		break;
 
@@ -3864,6 +3937,11 @@ __init int intel_pmu_init(void)
 			X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1);
 
 		intel_pmu_pebs_data_source_nhm();
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Westmere events, ");
 		break;
 
@@ -3899,6 +3977,11 @@ __init int intel_pmu_init(void)
 		/* UOPS_DISPATCHED.THREAD,c=1,i=1 to count stall cycles*/
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
 			X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1);
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 
 		pr_cont("SandyBridge events, ");
 		break;
@@ -3933,6 +4016,11 @@ __init int intel_pmu_init(void)
 		/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 
 		pr_cont("IvyBridge events, ");
 		break;
@@ -3962,6 +4050,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.cpu_events = hsw_events_attrs;
 		x86_pmu.lbr_double_abort = true;
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Haswell events, ");
 		break;
 
@@ -3998,6 +4091,11 @@ __init int intel_pmu_init(void)
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.cpu_events = hsw_events_attrs;
 		x86_pmu.limit_period = bdw_limit_period;
+		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Broadwell events, ");
 		break;
 
@@ -4016,6 +4114,11 @@ __init int intel_pmu_init(void)
 		/* all extra regs are per-cpu when HT is on */
 		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
 		x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
+		x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 
 		pr_cont("Knights Landing/Mill events, ");
 		break;
@@ -4051,6 +4154,11 @@ __init int intel_pmu_init(void)
 						  skl_format_attr);
 		WARN_ON(!x86_pmu.format_attrs);
 		x86_pmu.cpu_events = hsw_events_attrs;
+		x86_pmu.ref_cycles_factor = skl_get_ref_cycles_factor();
+		if (x86_pmu.ref_cycles_factor) {
+			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
+			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
+		}
 		pr_cont("Skylake events, ");
 		break;
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53728ee..3470178 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -68,6 +68,7 @@ struct event_constraint {
 #define PERF_X86_EVENT_EXCL_ACCT	0x0200 /* accounted EXCL event */
 #define PERF_X86_EVENT_AUTO_RELOAD	0x0400 /* use PEBS auto-reload */
 #define PERF_X86_EVENT_FREERUNNING	0x0800 /* use freerunning PEBS */
+#define PERF_X86_EVENT_REF_CYCLES_REP	0x1000 /* use ref_cycles replacement */
 
 
 struct amd_nb {
@@ -550,6 +551,8 @@ struct x86_pmu {
 	int		perfctr_second_write;
 	bool		late_ack;
 	unsigned	(*limit_period)(struct perf_event *event, unsigned l);
+	unsigned int	ref_cycles_factor;
+	void		(*ref_cycles_rep)(struct perf_event *event);
 
 	/*
 	 * sysfs attrs
@@ -565,6 +568,8 @@ struct x86_pmu {
 	unsigned long	attr_freeze_on_smi;
 	struct attribute **attrs;
 
+	unsigned int	attr_ref_cycles_factor;
+	struct attribute **factor_attrs;
 	/*
 	 * CPU Hotplug hooks
 	 */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
@ 2017-06-09 17:28 ` kan.liang
       [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
  2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: kan.liang @ 2017-06-09 17:28 UTC (permalink / raw)
  To: peterz, mingo, eranian, linux-kernel
  Cc: alexander.shishkin, acme, jolsa, torvalds, tglx, vincent.weaver,
	ak, Kan Liang

From: Kan Liang <Kan.liang@intel.com>

The NMI watchdog uses either the fixed cycles or a generic cycles
counter. This causes a lot of conflicts with users of the PMU who want
to run a full group including the cycles fixed counter, for example the
--topdown support recently added to perf stat. The code needs to fall
back to not using groups, which can cause measurement inaccuracy due to
multiplexing errors.

This patch switches the NMI watchdog to use reference cycles on Intel
systems. This is actually more accurate than cycles, because cycles can
tick faster than the measured CPU Frequency due to Turbo mode.

The ref cycles always tick at their nominal frequency, or slower when
the system is idling. That means the NMI watchdog can never expire too
early, unlike with cycles.

The reference cycles tick roughly at the frequency of the TSC, so the
same period computation can be used.
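
For reference, the existing period computation is roughly of the
following shape (a simplified sketch, not the literal kernel code),
which is why it carries over unchanged:

/*
 * Sketch: the watchdog sample period is derived from the CPU/TSC
 * frequency and the watchdog threshold, so it is equally valid for
 * ref-cycles, which tick at roughly the same rate.
 */
unsigned long long nmi_sample_period(unsigned int cpu_khz,
				     int watchdog_thresh)
{
	return (unsigned long long)cpu_khz * 1000ULL * watchdog_thresh;
}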

For older platforms like Silvermont/Airmont, Core2 and Atom, don't do the
switch. Their NMI watchdog still uses the cycles event.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---

Changes since V1:
 - Don't use ref-cycles NMI watchdog in older platform.

 arch/x86/events/core.c | 10 ++++++++++
 include/linux/nmi.h    |  1 +
 kernel/watchdog_hld.c  |  7 +++++++
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18f8d37..e4c9f11 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2625,3 +2625,13 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->events_mask_len	= x86_pmu.events_mask_len;
 }
 EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
+
+#ifdef CONFIG_HARDLOCKUP_DETECTOR
+int hw_nmi_get_event(void)
+{
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
+	    (x86_pmu.ref_cycles_rep))
+		return PERF_COUNT_HW_REF_CPU_CYCLES;
+	return PERF_COUNT_HW_CPU_CYCLES;
+}
+#endif
diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index aa3cd08..b2fa444 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -141,6 +141,7 @@ static inline bool trigger_single_cpu_backtrace(int cpu)
 
 #ifdef CONFIG_LOCKUP_DETECTOR
 u64 hw_nmi_get_sample_period(int watchdog_thresh);
+int hw_nmi_get_event(void);
 extern int nmi_watchdog_enabled;
 extern int soft_watchdog_enabled;
 extern int watchdog_user_enabled;
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 54a427d..f899766 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -70,6 +70,12 @@ void touch_nmi_watchdog(void)
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
 
+/* Can be overridden by architecture */
+__weak int hw_nmi_get_event(void)
+{
+	return PERF_COUNT_HW_CPU_CYCLES;
+}
+
 static struct perf_event_attr wd_hw_attr = {
 	.type		= PERF_TYPE_HARDWARE,
 	.config		= PERF_COUNT_HW_CPU_CYCLES,
@@ -165,6 +171,7 @@ int watchdog_nmi_enable(unsigned int cpu)
 
 	wd_attr = &wd_hw_attr;
 	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
+	wd_attr->config = hw_nmi_get_event();
 
 	/* Try to register using hardware perf events */
 	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
       [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
@ 2017-06-11 18:44     ` Andi Kleen
  2017-06-12 12:51     ` Liang, Kan
  1 sibling, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2017-06-11 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kan.liang, Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel,
	tglx, peterz, vincent.weaver, eranian, alexander.shishkin, mingo

On Fri, Jun 09, 2017 at 08:39:59PM -0700, Linus Torvalds wrote:
>    Not commenting on the patch itself - I'll leave that to others. But the
>    sign-off chain is buggered.
>    When you re-send these patches that got reverted earlier, you really
>    should add yourself to the sign-off list.. and if you were the original
>    author, you should have been there from the first...
>    Hmm?

I was the original author, but this got lost somewhere.

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
  2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
@ 2017-06-12 10:36 ` Jiri Olsa
  2017-06-12 13:07   ` Liang, Kan
  2017-06-23 17:21 ` Liang, Kan
  2017-06-23 21:46 ` Thomas Gleixner
  3 siblings, 1 reply; 8+ messages in thread
From: Jiri Olsa @ 2017-06-12 10:36 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, eranian, linux-kernel, alexander.shishkin, acme,
	torvalds, tglx, vincent.weaver, ak

On Fri, Jun 09, 2017 at 10:28:02AM -0700, kan.liang@intel.com wrote:

SNIP

> +	/*
> +	 * Correct the count number if applying ref_cycles replacement.
> +	 * There is a constant ratio between ref_cycles (event A) and
> +	 * ref_cycles replacement (event B). The delta result is from event B.
> +	 * To get accurate event A count, the real delta result must be
> +	 * multiplied with the constant ratio.
> +	 *
> +	 * It is handled differently for sampling and counting.
> +	 * - Fixed period Sampling: The period is already adjusted in
> +	 *   scheduling for event B. The period_left is the remaining period
> +	 *   for event B. Don't need to modify period_left.
> +	 *   It's enough to only adjust event->count here.
> +	 *
> +	 * - Fixed freq Sampling: The adaptive frequency algorithm needs
> +	 *   the last_period and event counts for event A to estimate the next
> +	 *   period for event A. So the period_left is the remaining period
> +	 *   for event A. It needs to multiply the result with the ratio.
> +	 *   However, the period_left is also used to reload the event counter
> +	 *   for event B in x86_perf_event_set_period. It has to be adjusted to
> +	 *   event B's remaining period there.
> +	 *
> +	 * - Counting: It's enough to only adjust event->count for perf stat.
> +	 *
> +	 * - RDPMC: User can also read the counter directly by rdpmc/mmap.
> +	 *   Users have to handle the adjustment themselves.
> +	 *   For kernel, it only needs to guarantee that the offset is correct.
> +	 *   In x86's arch_perf_update_userpage, the offset will be corrected
> +	 *   if event B is used.
> +	 */
> +	adjust_delta = delta;
> +	if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> +		adjust_delta = delta * x86_pmu.ref_cycles_factor;
> +
> +		if (is_sampling_event(event) && event->attr.freq)
> +			local64_sub(adjust_delta, &hwc->period_left);
> +		else
> +			local64_sub(delta, &hwc->period_left);

shouldn't you use adjust_delta in here also ^^^ ?

jirka

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
       [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
  2017-06-11 18:44     ` Andi Kleen
@ 2017-06-12 12:51     ` Liang, Kan
  1 sibling, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2017-06-12 12:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel, tglx, peterz,
	vincent.weaver, ak, eranian, alexander.shishkin, mingo



> When you re-send these patches that got reverted earlier, you really
> should add yourself to the sign-off list.. and if you were the original
> author, you should have been there from the first...
> Hmm?

Andi was the original author.
Sure, I will add my signature after him.

Thanks,
Kan

> On Jun 9, 2017 11:22 AM, <kan.liang@intel.com> wrote:
> From: Kan Liang <Kan.liang@intel.com>
> ...
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
@ 2017-06-12 13:07   ` Liang, Kan
  0 siblings, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2017-06-12 13:07 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: peterz, mingo, eranian, linux-kernel, alexander.shishkin, acme,
	torvalds, tglx, vincent.weaver, ak

> 
> On Fri, Jun 09, 2017 at 10:28:02AM -0700, kan.liang@intel.com wrote:
> 
> SNIP
> 
> > +	/*
> > +	 * Correct the count number if applying ref_cycles replacement.
> > +	 * There is a constant ratio between ref_cycles (event A) and
> > +	 * ref_cycles replacement (event B). The delta result is from event B.
> > +	 * To get accurate event A count, the real delta result must be
> > +	 * multiplied with the constant ratio.
> > +	 *
> > +	 * It is handled differently for sampling and counting.
> > +	 * - Fixed period Sampling: The period is already adjusted in
> > +	 *   scheduling for event B. The period_left is the remaining period
> > +	 *   for event B. Don't need to modify period_left.
> > +	 *   It's enough to only adjust event->count here.
> > +	 *
> > +	 * - Fixed freq Sampling: The adaptive frequency algorithm needs
> > +	 *   the last_period and event counts for event A to estimate the next
> > +	 *   period for event A. So the period_left is the remaining period
> > +	 *   for event A. It needs to multiply the result with the ratio.
> > +	 *   However, the period_left is also used to reload the event counter
> > +	 *   for event B in x86_perf_event_set_period. It has to be adjusted to
> > +	 *   event B's remaining period there.
> > +	 *
> > +	 * - Counting: It's enough to only adjust event->count for perf stat.
> > +	 *
> > +	 * - RDPMC: User can also read the counter directly by rdpmc/mmap.
> > +	 *   Users have to handle the adjustment themselves.
> > +	 *   For kernel, it only needs to guarantee that the offset is correct.
> > +	 *   In x86's arch_perf_update_userpage, the offset will be corrected
> > +	 *   if event B is used.
> > +	 */
> > +	adjust_delta = delta;
> > +	if (hwc->flags & PERF_X86_EVENT_REF_CYCLES_REP) {
> > +		adjust_delta = delta * x86_pmu.ref_cycles_factor;
> > +
> > +		if (is_sampling_event(event) && event->attr.freq)
> > +			local64_sub(adjust_delta, &hwc->period_left);
> > +		else
> > +			local64_sub(delta, &hwc->period_left);
> 
> shouldn't you use adjust_delta in here also ^^^ ?

No, we shouldn't use adjust_delta in all cases. It is only needed for
fixed-frequency sampling.

For fixed-period sampling, we already adjust the period in scheduling,
so the period_left here is already adjusted. We should not do it twice.
For RDPMC, the user will handle the adjustment. The kernel should not
make any changes here.
For counting (perf stat), it doesn't matter. We only care about
event->count.
   
Thanks,
Kan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
  2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
  2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
@ 2017-06-23 17:21 ` Liang, Kan
  2017-06-23 21:46 ` Thomas Gleixner
  3 siblings, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2017-06-23 17:21 UTC (permalink / raw)
  To: peterz, mingo, eranian, linux-kernel
  Cc: alexander.shishkin, acme, jolsa, torvalds, tglx, vincent.weaver,
	ak, Don Zickus (dzickus@redhat.com)

Hi all,

Any comments for the series?

We have had some users (on both client and server platforms) report
spurious NMI watchdog timeout issues.
There is now a proposed workaround fix from tglx, which we are testing.
However, this patch series is believed to be the real fix for the issue,
so it would be better to get it merged into mainline.


Here is the workaround fix, if you are interested.
https://patchwork.kernel.org/patch/9803033/
https://patchwork.kernel.org/patch/9805903/

Thanks,
Kan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter
  2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
                   ` (2 preceding siblings ...)
  2017-06-23 17:21 ` Liang, Kan
@ 2017-06-23 21:46 ` Thomas Gleixner
  3 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2017-06-23 21:46 UTC (permalink / raw)
  To: Kan Liang
  Cc: peterz, mingo, eranian, linux-kernel, alexander.shishkin, acme,
	jolsa, torvalds, vincent.weaver, ak

On Fri, 9 Jun 2017, kan.liang@intel.com wrote:
> User visible RDPMC issue
> It is impossible to transparently handle the case, which reading
> ref-cycles by RDPMC instruction in user space.
> ref_cycles_factor_for_rdpmc is exposed for RDPMC user.
> For the few users who want to read ref-cycles by RDPMC, they have to
> correct the result by multipling ref_cycles_factor_for_rdpmc manually,
> if GP counter is used.

That's not how it works. You cannot willy-nilly define what existing users
have to do and thereby break existing setups and use cases.

It does not matter how many patches you send to pmu tools, PAPI and
whatever. It'll take ages until they permeate into distros, but until that
happens users are stranded.

The _existing_ user space interface is also used by power users directly
and we are not going to break that just because you think there are just a
few users and they should accommodate.

If you want to move the watchdog NMI to ref cycles, then this needs to
be on an opt-in basis via a kernel command line option and perhaps a
config option which is default n.

>  __init int intel_pmu_init(void)
>  {
>  	union cpuid10_edx edx;
> @@ -3775,7 +3839,11 @@ __init int intel_pmu_init(void)
>  
>  		intel_pmu_pebs_data_source_nhm();
>  		x86_add_quirk(intel_nehalem_quirk);
> -
> +		x86_pmu.ref_cycles_factor = nhm_get_ref_cycles_factor();
> +		if (x86_pmu.ref_cycles_factor) {

Why would this be 0? 

> +			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;

What's the point of this function pointer? All implementations use the same
callback and I don't see any reason why this would change in the foreseeable
future. If there is a reason, you'd better document it.

The function can be called directly w/o indirection. And the conditionals
can look at ref_cycles_factor to figure out whether it can be done or not.

And with that you can spare the whole exercise of sprinkling the same
factor_attrs thingy all over the place. You can simply check
ref_cycles_factor in the attribute setup code and conditionally merge
intel_ref_cycles_factor_attrs.
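
A rough sketch of that suggested shape (hypothetical, ignoring file and
symbol-visibility details; the merge pattern is copied from the patch's
init_hw_perf_events hunk):

	/*
	 * Do the conditional merge once in the attribute setup path, keyed
	 * off ref_cycles_factor, instead of assigning x86_pmu.factor_attrs
	 * in every model switch case.
	 */
	if (x86_pmu.ref_cycles_factor) {
		struct attribute **tmp;

		tmp = merge_attr(x86_pmu_attr_group.attrs,
				 intel_ref_cycles_factor_attrs);
		if (!WARN_ON(!tmp))
			x86_pmu_attr_group.attrs = tmp;
	}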

> +			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +		}
>  		pr_cont("Nehalem events, ");
>  		break;
>  
> @@ -3835,6 +3903,11 @@ __init int intel_pmu_init(void)
>  		x86_pmu.lbr_pt_coexist = true;
>  		x86_pmu.flags |= PMU_FL_HAS_RSP_1;
>  		x86_pmu.cpu_events = glm_events_attrs;
> +		x86_pmu.ref_cycles_factor = glm_get_ref_cycles_factor();
> +		if (x86_pmu.ref_cycles_factor) {

Copy and paste "programming" ... The factor is hard-coded to '1', as you
also documented in the changelog. So why do you need an extra function to
retrieve 1 and a sanity check whether the function actually returned 1?

> +			x86_pmu.ref_cycles_rep = nhm_ref_cycles_rep;
> +			x86_pmu.factor_attrs = intel_ref_cycles_factor_attrs;
> +		}
>  		pr_cont("Goldmont events, ");
>  		break;

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-06-23 21:47 UTC | newest]

Thread overview: 8+ messages
2017-06-09 17:28 [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter kan.liang
2017-06-09 17:28 ` [PATCH V2 2/2] perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86 kan.liang
     [not found]   ` <CA+55aFz7jZ0Yx+zxDZFxUOJcMtBtzfVphUdUrGwfv9SmiNwT6A@mail.gmail.com>
2017-06-11 18:44     ` Andi Kleen
2017-06-12 12:51     ` Liang, Kan
2017-06-12 10:36 ` [PATCH V2 1/2] perf/x86/intel: enable CPU ref_cycles for GP counter Jiri Olsa
2017-06-12 13:07   ` Liang, Kan
2017-06-23 17:21 ` Liang, Kan
2017-06-23 21:46 ` Thomas Gleixner
