linux-kernel.vger.kernel.org archive mirror
* [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
@ 2018-09-19 17:29 Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Dear Maintainers,

This new series, which fixes the lack of coordination between the
pseudo-locking measurement code and perf, addresses all feedback received
up to and including V4.

Changes since V4:
- Move the helper to obtain the performance counter index
  (x86_perf_rdpmc_index()) to arch/x86/events/core.c with an extern
  declaration in arch/x86/include/asm/perf_event.h. Please do note that
  this change introduces a checkpatch.pl warning that was discussed
  and found to be a false positive in:
  http://lkml.kernel.org/r/20180919091759.GZ24124@hirez.programming.kicks-ass.net
  The new warning is:
  > CHECK: extern prototypes should be avoided in .h files
  > #66: FILE: arch/x86/include/asm/perf_event.h:273:
  > +extern int x86_perf_rdpmc_index(struct perf_event *event);

- Change the x86_perf_rdpmc_index() comment slightly:
  "... could be checked for validity ..." -> "... should be checked for
  validity ...". No code changes.
- Add comments to the measurement routines to document that on measurement
  failure zeroes will be written to the trace buffer instead of leaving an
  empty trace buffer. No code changes.

Changes since V3:
- The kbuild test robot reported a build issue that at first seemed to be
  related to "make ARCH=i386" but was actually a result of the kernel
  configuration used not having CONFIG_TRACEPOINTS set.
  The consequence was that include/linux/perf_event.h was not being included
  via the tracing code (from include/trace/define_trace.h) and we thus
  encountered a build failure where struct perf_event_attr is unknown.
  Fix this by explicitly including perf_event.h.

- Below is verbatim from V3 submission (except for diffstat below) -

Changes since V2:
- Move the helper to obtain the performance counter index to
  include/linux/perf_event.h. The request was actually to move this helper
  to arch/x86/include/asm/perf_event.h - but doing so would be more
  involved since this header file does not know about struct perf_event
  that is used by this helper. There was no response to my request for
  further clarification so I proceeded to move the helper to
  include/linux/perf_event.h instead.
- Change the name of the helper that obtains the index to perf_rdpmc_index() -
  the original request was to name it x86_perf_rdpmc_index() but that name
  seems to be tied to the suggested header location. With the header location
  of include/linux/perf_event.h the name perf_rdpmc_index() fits better
  with the new destination. There was no response to my request for further
  clarification of the naming change so I proceeded with the change.
- Replace all local register variables used in the measurement routines
  with local variables using READ_ONCE().
- The removal of local register variables also enables us to replace the
  direct __wrmsr() with wrmsr().
- Merge the L2 and L3 measurement routines following Peter's suggested
  framework.
- Do not copy the text from SDM that refers to serializing instructions.
- Include another LFENCE call after loop reading pseudo-locked memory.

The above addresses all feedback received for V2. The one unanswered
question that remains following the review is why the memory reading
was done with asm: I did so to avoid any compiler optimizations while
constraining the code to exactly what needs to be measured. The asm
instruction reads a cache line into a register with a single
instruction, which to me seemed the most constrained way to measure
whether a cache line is in the cache.
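
For reference, the read loop in question (as it appears verbatim in the
patches below) performs one MOV per cache line and touches no other
memory:

	for (i = 0; i < size; i += line_size) {
		/* One load per cache line; only mem_r + i is accessed. */
		asm volatile("mov (%0,%1,1), %%eax\n\t"
			     :
			     : "r" (mem_r), "r" (i)
			     : "%eax", "memory");
	}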

This is the second attempt at fixing the lack of coordination between the
pseudo-locking measurement code and perf. Thank you very much for your
feedback on the first version. The entire solution, including the cover
letter, has been reworked based on your feedback; while submitted as a V2,
none of the patches from V1 remained.

Changes since V1:
- Use in-kernel interface to perf.
- Do not write directly to PMU registers.
- Do not introduce another PMU owner. perf maintains role as performing
  resource arbitration for PMU.
- User space is able to use perf and resctrl at the same time.
- event_base_rdpmc is accessed and used only within an interrupts
  disabled section.
- Internals of events are never accessed directly, inline function used.
- Due to "pinned" usage the scheduling of the event may have failed. The
  error state is checked in the recommended way and credible error
  handling is in place.
- Use X86_CONFIG.

This code is based on the x86/cache branch of tip.git

The success of Cache Pseudo-Locking, as measured by how many cache lines
from a physical memory region have been locked to cache, can be measured
via the use of hardware performance events - specifically, the number of
cache hits and misses when reading a memory region after it has been
pseudo-locked to cache. This measurement is triggered via the resctrl
debugfs interface.
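
For example, with a pseudo-locked region named "newlock" (as in the
examples later in this series), the L2 residency measurement would be
triggered with:

	# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure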

The current solution accesses performance counters and their configuration
registers directly without coordination with other performance event users
(perf). Two of the issues that exist with the current solution:
- By writing to the performance monitoring registers directly a new owner
  for these resources is introduced. The perf infrastructure already exists
  to perform resource arbitration and this in-kernel infrastructure should
  be used to do so.
- The current lack of coordination with perf will have consequences any time
  two users, for example perf and cache pseudo-locking, attempt to do any
  kind of measurement at the same time.

In this series the measurement of Cache Pseudo-Lock regions is moved to use
the in-kernel interface to perf. During the rework of the measurement
function the L2 and L3 cache measurements are separated so that the
additional code needed to decide which measurement to make does not itself
cause unrelated cache hits and misses.

Your feedback on this work will be greatly appreciated.

Reinette

Reinette Chatre (6):
  perf/core: Add sanity check to deal with pinned event failure
  perf/x86: Add helper to obtain performance counter index
  x86/intel_rdt: Remove local register variables
  x86/intel_rdt: Create required perf event attributes
  x86/intel_rdt: Use perf infrastructure for measurements
  x86/intel_rdt: Re-enable pseudo-lock measurements

 Documentation/x86/intel_rdt_ui.txt          |  22 +-
 arch/x86/events/core.c                      |  21 ++
 arch/x86/include/asm/perf_event.h           |   1 +
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
 kernel/events/core.c                        |   6 +
 5 files changed, 261 insertions(+), 161 deletions(-)

-- 
2.17.0



* [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
@ 2018-09-19 17:29 ` Reinette Chatre
  2018-09-28 20:48   ` [tip:perf/urgent] " tip-bot for Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 2/6] perf/x86: Add helper to obtain performance counter index Reinette Chatre
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and that a credible error is returned.

Link: http://lkml.kernel.org/r/20180807093615.GY2494@hirez.programming.kicks-ass.net
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 kernel/events/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2a62b96600ad..191c0c2e10de 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3940,6 +3940,12 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 		goto out;
 	}
 
+	/* If this is a pinned event it must be running on this CPU */
+	if (event->attr.pinned && event->oncpu != smp_processor_id()) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
-- 
2.17.0



* [PATCH V5 2/6] perf/x86: Add helper to obtain performance counter index
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
@ 2018-09-19 17:29 ` Reinette Chatre
  2018-09-29  7:00   ` [tip:x86/cache] " tip-bot for Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 3/6] x86/intel_rdt: Remove local register variables Reinette Chatre
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

perf_event_read_local() is the safest way to obtain measurements
associated with performance events. In some cases the overhead
introduced by perf_event_read_local() affects the measurements and the
use of rdpmcl() is needed. rdpmcl() requires the index
of the performance counter used, so a helper is introduced to determine
the index used by a provided performance event.

The index used by a performance event may change when interrupts are
enabled. A check is added to ensure that the index is only accessed
with interrupts disabled. Even with this check the counter must be used
with care to ensure it is queried and used within the same
interrupts-disabled section.
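
A sketch of the intended usage (names as used later in this series;
"event" stands for an already created pinned event, error handling
trimmed):

	u64 tmp, value;
	int pmcnum;

	local_irq_disable();
	/* Validate the event state with one local read ... */
	if (perf_event_read_local(event, &tmp, NULL, NULL))
		goto out;
	/* ... then query and use the index in the same section. */
	pmcnum = x86_perf_rdpmc_index(event);
	rdpmcl(pmcnum, value);
out:
	local_irq_enable();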

This change introduces a new checkpatch warning:
CHECK: extern prototypes should be avoided in .h files
+extern int x86_perf_rdpmc_index(struct perf_event *event);

This warning was discussed and designated as a false positive in
http://lkml.kernel.org/r/20180919091759.GZ24124@hirez.programming.kicks-ass.net

Link: http://lkml.kernel.org/r/20180917230758.GA3117@worktop.programming.kicks-ass.net
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/events/core.c            | 21 +++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dfb2f7c0d019..3550d800b030 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1033,6 +1033,27 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	}
 }
 
+/**
+ * x86_perf_rdpmc_index - Return PMC counter used for event
+ * @event: the perf_event to which the PMC counter was assigned
+ *
+ * The counter assigned to this performance event may change if interrupts
+ * are enabled. This counter should thus never be used while interrupts are
+ * enabled. Before this function is used to obtain the assigned counter the
+ * event should be checked for validity using, for example,
+ * perf_event_read_local(), within the same interrupt disabled section in
+ * which this counter is planned to be used.
+ *
+ * Return: The index of the performance monitoring counter assigned to
+ * @perf_event.
+ */
+int x86_perf_rdpmc_index(struct perf_event *event)
+{
+	lockdep_assert_irqs_disabled();
+
+	return event->hw.event_base_rdpmc;
+}
+
 static inline int match_prev_assignment(struct hw_perf_event *hwc,
 					struct cpu_hw_events *cpuc,
 					int i)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 12f54082f4c8..b2cf84c35a6d 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -270,6 +270,7 @@ struct perf_guest_switch_msr {
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
+extern int x86_perf_rdpmc_index(struct perf_event *event);
 #else
 static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
-- 
2.17.0



* [PATCH V5 3/6] x86/intel_rdt: Remove local register variables
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 2/6] perf/x86: Add helper to obtain performance counter index Reinette Chatre
@ 2018-09-19 17:29 ` Reinette Chatre
  2018-09-29  7:01   ` [tip:x86/cache] " tip-bot for Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 4/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Local register variables were used in an effort to improve the
accuracy of the measurement of cache residency of a pseudo-locked
region. This was done to ensure that only the cache residency of
the memory is measured and not the cache residency of the variables
used to perform the measurement.

While local register variables do accomplish the goal, they require
significant care since different architectures have different registers
available. Local register variables also cannot be used with valuable
developer tools like KASAN.

Significant testing has shown that similar accuracy in measurement
results can be obtained by replacing local register variables with
regular local variables.

Make use of local variables in the critical code but do so with
READ_ONCE() to prevent the compiler from merging or refetching reads.
Ensure these variables are initialized before the measurement starts,
and ensure it is only the local variables that are accessed during
the measurement.

With the local register variables removed and READ_ONCE() in use there
is no longer a motivation for the direct __wrmsr() call (which avoided
the additional tracing code that could clobber the local register
variables).
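
The measurement-critical section thus reduces to, in sketch form (taken
from the hunks below):

	local_irq_disable();
	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
	/* Plain local variables; READ_ONCE() prevents the compiler
	 * from merging or refetching these reads. */
	mem_r = READ_ONCE(plr->kmem);
	size = READ_ONCE(plr->size);
	line_size = READ_ONCE(plr->line_size);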

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 53 ++++-----------------
 1 file changed, 9 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 40f3903ae5d9..8ad83eb3fc89 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -886,31 +886,14 @@ static int measure_cycles_lat_fn(void *_plr)
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long i;
 	u64 start, end;
-#ifdef CONFIG_KASAN
-	/*
-	 * The registers used for local register variables are also used
-	 * when KASAN is active. When KASAN is active we use a regular
-	 * variable to ensure we always use a valid pointer to access memory.
-	 * The cost is that accessing this pointer, which could be in
-	 * cache, will be included in the measurement of memory read latency.
-	 */
 	void *mem_r;
-#else
-#ifdef CONFIG_X86_64
-	register void *mem_r asm("rbx");
-#else
-	register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */
 
 	local_irq_disable();
 	/*
-	 * The wrmsr call may be reordered with the assignment below it.
-	 * Call wrmsr as directly as possible to avoid tracing clobbering
-	 * local register variable used for memory pointer.
+	 * Disable hardware prefetchers.
 	 */
-	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	mem_r = plr->kmem;
+	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	mem_r = READ_ONCE(plr->kmem);
 	/*
 	 * Dummy execute of the time measurement to load the needed
 	 * instructions into the L1 instruction cache.
@@ -939,26 +922,10 @@ static int measure_cycles_perf_fn(void *_plr)
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long long l2_hits, l2_miss;
 	u64 l2_hit_bits, l2_miss_bits;
-	unsigned long i;
-#ifdef CONFIG_KASAN
-	/*
-	 * The registers used for local register variables are also used
-	 * when KASAN is active. When KASAN is active we use regular variables
-	 * at the cost of including cache access latency to these variables
-	 * in the measurements.
-	 */
 	unsigned int line_size;
 	unsigned int size;
+	unsigned long i;
 	void *mem_r;
-#else
-	register unsigned int line_size asm("esi");
-	register unsigned int size asm("edi");
-#ifdef CONFIG_X86_64
-	register void *mem_r asm("rbx");
-#else
-	register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */
 
 	/*
 	 * Non-architectural event for the Goldmont Microarchitecture
@@ -1011,11 +978,9 @@ static int measure_cycles_perf_fn(void *_plr)
 
 	local_irq_disable();
 	/*
-	 * Call wrmsr direcly to avoid the local register variables from
-	 * being overwritten due to reordering of their assignment with
-	 * the wrmsr calls.
+	 * Disable hardware prefetchers.
 	 */
-	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
 	/* Disable events and reset counters */
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
@@ -1028,6 +993,9 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
 	}
 	/* Set and enable the L2 counters */
+	mem_r = READ_ONCE(plr->kmem);
+	size = READ_ONCE(plr->size);
+	line_size = READ_ONCE(plr->line_size);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
 	if (l3_hit_bits > 0) {
@@ -1036,9 +1004,6 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
 				      l3_miss_bits);
 	}
-	mem_r = plr->kmem;
-	size = plr->size;
-	line_size = plr->line_size;
 	for (i = 0; i < size; i += line_size) {
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
-- 
2.17.0



* [PATCH V5 4/6] x86/intel_rdt: Create required perf event attributes
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (2 preceding siblings ...)
  2018-09-19 17:29 ` [PATCH V5 3/6] x86/intel_rdt: Remove local register variables Reinette Chatre
@ 2018-09-19 17:29 ` Reinette Chatre
  2018-09-29  7:01   ` [tip:x86/cache] " tip-bot for Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

A perf event has many attributes that are maintained in a separate
structure that should be provided when a new perf_event is created.

In preparation for the transition to perf_events the required attribute
structures are created for all the events that may be used in the
measurements. Most attributes for all the events are identical. The
actual configuration, which specifies what needs to be measured, is what
will be different between the events used. This configuration needs to
be done with X86_CONFIG, which cannot be used as part of the designated
initializers used here; the configuration will thus be set right before
use in a later patch.

Although they do look identical at this time the attribute structures
need to be maintained separately since a perf_event will maintain a
pointer to its unique attributes.

In support of patch testing the new structs are given the unused attribute
until they are used in later patches.
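
Setting the configuration right before use follows the pattern below
(the values shown are the Goldmont L2 events from a later patch in this
series):

	perf_miss_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x10);
	perf_hit_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x2);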

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 26 +++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 8ad83eb3fc89..33d7968f152a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -17,6 +17,7 @@
 #include <linux/debugfs.h>
 #include <linux/kthread.h>
 #include <linux/mman.h>
+#include <linux/perf_event.h>
 #include <linux/pm_qos.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
@@ -915,6 +916,31 @@ static int measure_cycles_lat_fn(void *_plr)
 	return 0;
 }
 
+/*
+ * Create a perf_event_attr for the hit and miss perf events that will
+ * be used during the performance measurement. A perf_event maintains
+ * a pointer to its perf_event_attr so a unique attribute structure is
+ * created for each perf_event.
+ *
+ * The actual configuration of the event is set right before use in order
+ * to use the X86_CONFIG macro.
+ */
+static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+	.type		= PERF_TYPE_RAW,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 0,
+	.exclude_user	= 1,
+};
+
+static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+	.type		= PERF_TYPE_RAW,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 0,
+	.exclude_user	= 1,
+};
+
 static int measure_cycles_perf_fn(void *_plr)
 {
 	unsigned long long l3_hits = 0, l3_miss = 0;
-- 
2.17.0



* [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (3 preceding siblings ...)
  2018-09-19 17:29 ` [PATCH V5 4/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
@ 2018-09-19 17:29 ` Reinette Chatre
  2018-09-20 14:11   ` Peter Zijlstra
  2018-09-20 19:02   ` [PATCH V6 " Reinette Chatre
  2018-09-19 17:29 ` [PATCH V5 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
  2018-09-20 14:11 ` [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Peter Zijlstra
  6 siblings, 2 replies; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents the additional cache
misses that would be incurred by the extra testing code used to decide if
an L2 and/or L3 measurement should be made.
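
In outline, each measurement now follows this sequence (a sketch only;
see measure_residency_fn() in the patch below for the full error
handling and the mirrored hit event):

	miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
						      NULL, NULL, NULL);
	if (IS_ERR(miss_event))
		goto out;
	local_irq_disable();
	/* Validate the event, then read its counter directly. */
	if (perf_event_read_local(miss_event, &tmp, NULL, NULL))
		goto out_err;
	miss_pmcnum = x86_perf_rdpmc_index(miss_event);
	rdpmcl(miss_pmcnum, miss_before);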

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt          |  22 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 305 ++++++++++++--------
 2 files changed, 204 insertions(+), 123 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
 2) Cache hit and miss measurements using model specific precision counters if
    available. Depending on the levels of cache on the system the pseudo_lock_l2
    and pseudo_lock_l3 tracepoints are available.
-   WARNING: triggering this  measurement uses from two (for just L2
-   measurements) to four (for L2 and L3 measurements) precision counters on
-   the system, if any other measurements are in progress the counters and
-   their corresponding event registers will be clobbered.
 
 When a pseudo-locked region is created a new debugfs directory is created for
 it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
 write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 -  writing "1" to the pseudo_lock_measure file will trigger the latency
+     measurement captured in the pseudo_lock_mem_latency tracepoint. See
+     example below.
+2 -  writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l2 tracepoint. See example below.
+3 -  writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
 
 Example of latency debugging interface:
 In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 33d7968f152a..c35b975e3b1e 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
 #include <asm/intel_rdt_sched.h>
 #include <asm/perf_event.h>
 
+#include "../../events/perf_event.h" /* For X86_CONFIG() */
 #include "intel_rdt.h"
 
 #define CREATE_TRACE_POINTS
@@ -107,16 +108,6 @@ static u64 get_prefetch_disable_bits(void)
 	return 0;
 }
 
-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
-	__wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
 /**
  * pseudo_lock_minor_get - Obtain available minor number
  * @minor: Pointer to where new minor number will be stored
@@ -925,7 +916,7 @@ static int measure_cycles_lat_fn(void *_plr)
  * The actual configuration of the event is set right before use in order
  * to use the X86_CONFIG macro.
  */
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -933,7 +924,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
 	.exclude_user	= 1,
 };
 
-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -941,139 +932,216 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
 	.exclude_user	= 1,
 };
 
-static int measure_cycles_perf_fn(void *_plr)
+struct residency_counts {
+	u64 miss_before, hits_before;
+	u64 miss_after,  hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+				struct perf_event_attr *hit_attr,
+				struct pseudo_lock_region *plr,
+				struct residency_counts *counts)
 {
-	unsigned long long l3_hits = 0, l3_miss = 0;
-	u64 l3_hit_bits = 0, l3_miss_bits = 0;
-	struct pseudo_lock_region *plr = _plr;
-	unsigned long long l2_hits, l2_miss;
-	u64 l2_hit_bits, l2_miss_bits;
+	u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+	struct perf_event *miss_event, *hit_event;
+	int hit_pmcnum, miss_pmcnum;
 	unsigned int line_size;
 	unsigned int size;
 	unsigned long i;
 	void *mem_r;
+	u64 tmp;
 
-	/*
-	 * Non-architectural event for the Goldmont Microarchitecture
-	 * from Intel x86 Architecture Software Developer Manual (SDM):
-	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
-	 * Umask values:
-	 *     L1_HIT   01H
-	 *     L2_HIT   02H
-	 *     L1_MISS  08H
-	 *     L2_MISS  10H
-	 *
-	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
-	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
-	 * this platform we use the following events instead:
-	 *  L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
-	 *       REFERENCES FFH
-	 *       MISS       3FH
-	 *  LONGEST_LAT_CACHE 2EH (Documented in SDM)
-	 *       REFERENCE 4FH
-	 *       MISS      41H
-	 */
+	miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+						      NULL, NULL, NULL);
+	if (IS_ERR(miss_event))
+		goto out;
 
+	hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+						     NULL, NULL, NULL);
+	if (IS_ERR(hit_event))
+		goto out_miss;
+
+	local_irq_disable();
 	/*
-	 * Start by setting flags for IA32_PERFEVTSELx:
-	 *     OS  (Operating system mode)  0x2
-	 *     INT (APIC interrupt enable)  0x10
-	 *     EN  (Enable counter)         0x40
-	 *
-	 * Then add the Umask value and event number to select performance
-	 * event.
+	 * Check any possible error state of events used by performing
+	 * one local read.
 	 */
-
-	switch (boot_cpu_data.x86_model) {
-	case INTEL_FAM6_ATOM_GOLDMONT:
-	case INTEL_FAM6_ATOM_GEMINI_LAKE:
-		l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
-		l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
-		break;
-	case INTEL_FAM6_BROADWELL_X:
-		/* On BDW the l2_hit_bits count references, not hits */
-		l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
-		l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
-		/* On BDW the l3_hit_bits count references, not hits */
-		l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
-		l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
-		break;
-	default:
-		goto out;
+	if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
+	}
+	if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
 	}
 
-	local_irq_disable();
 	/*
 	 * Disable hardware prefetchers.
 	 */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	/* Disable events and reset counters */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
-	}
-	/* Set and enable the L2 counters */
+
+	/* Initialize rest of local variables */
+	/*
+	 * Performance event has been validated right before this with
+	 * interrupts disabled - it is thus safe to read the counter index.
+	 */
+	miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+	hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+	line_size = READ_ONCE(plr->line_size);
 	mem_r = READ_ONCE(plr->kmem);
 	size = READ_ONCE(plr->size);
-	line_size = READ_ONCE(plr->line_size);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits);
-	}
+
+	/*
+	 * Read counter variables twice - first to load the instructions
+	 * used in L1 cache, second to capture accurate value that does not
+	 * include cache misses incurred because of instruction loads.
+	 */
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/*
+	 * From SDM: Performing back-to-back fast reads are not guaranteed
+	 * to be monotonic.
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/*
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
 	for (i = 0; i < size; i += line_size) {
+		/*
+		 * Add a barrier to prevent speculative execution of this
+		 * loop reading beyond the end of the buffer.
+		 */
+		rmb();
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
 			     : "r" (mem_r), "r" (i)
 			     : "%eax", "memory");
 	}
 	/*
-	 * Call wrmsr directly (no tracing) to not influence
-	 * the cache access counters as they are disabled.
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
 	 */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
-			      l2_hit_bits & ~(0x40ULL << 16));
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
-			      l2_miss_bits & ~(0x40ULL << 16));
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits & ~(0x40ULL << 16));
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits & ~(0x40ULL << 16));
-	}
-	l2_hits = native_read_pmc(0);
-	l2_miss = native_read_pmc(1);
-	if (l3_hit_bits > 0) {
-		l3_hits = native_read_pmc(2);
-		l3_miss = native_read_pmc(3);
-	}
+	rmb();
+	rdpmcl(hit_pmcnum, hits_after);
+	rdpmcl(miss_pmcnum, miss_after);
+	/*
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
+	/* Re-enable hardware prefetchers */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
 	local_irq_enable();
+out_hit:
+	perf_event_release_kernel(hit_event);
+out_miss:
+	perf_event_release_kernel(miss_event);
+out:
+	/*
+	 * All counts will be zero on failure.
+	 */
+	counts->miss_before = miss_before;
+	counts->hits_before = hits_before;
+	counts->miss_after  = miss_after;
+	counts->hits_after  = hits_after;
+	return 0;
+}
+
+static int measure_l2_residency(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	struct residency_counts counts = {0};
+
+	/*
+	 * Non-architectural event for the Goldmont Microarchitecture
+	 * from Intel x86 Architecture Software Developer Manual (SDM):
+	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
+	 * Umask values:
+	 *     L2_HIT   02H
+	 *     L2_MISS  10H
+	 */
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GEMINI_LAKE:
+		perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+						   .umask = 0x10);
+		perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+						  .umask = 0x2);
+		break;
+	default:
+		goto out;
+	}
+
+	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+	/*
+	 * If a failure prevented the measurements from succeeding
+	 * tracepoints will still be written and all counts will be zero.
+	 */
+	trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+			     counts.miss_after - counts.miss_before);
+out:
+	plr->thread_done = 1;
+	wake_up_interruptible(&plr->lock_thread_wq);
+	return 0;
+}
+
+static int measure_l3_residency(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	struct residency_counts counts = {0};
+
 	/*
-	 * On BDW we count references and misses, need to adjust. Sometimes
-	 * the "hits" counter is a bit more than the references, for
-	 * example, x references but x + 1 hits. To not report invalid
-	 * hit values in this case we treat that as misses eaqual to
-	 * references.
+	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
+	 * this platform the following events are used instead:
+	 * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+	 *       REFERENCE 4FH
+	 *       MISS      41H
 	 */
-	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-		l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
-	trace_pseudo_lock_l2(l2_hits, l2_miss);
-	if (l3_hit_bits > 0) {
-		if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-			l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
-		trace_pseudo_lock_l3(l3_hits, l3_miss);
+
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_BROADWELL_X:
+		/* On BDW the hit event counts references, not hits */
+		perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+						  .umask = 0x4f);
+		perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+						   .umask = 0x41);
+		break;
+	default:
+		goto out;
+	}
+
+	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+	/*
+	 * If a failure prevented the measurements from succeeding
+	 * tracepoints will still be written and all counts will be zero.
+	 */
+
+	counts.miss_after -= counts.miss_before;
+	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+		/*
+		 * On BDW references and misses are counted, need to adjust.
+		 * Sometimes the "hits" counter is a bit more than the
+		 * references, for example, x references but x + 1 hits.
+		 * To not report invalid hit values in this case we treat
+		 * that as misses equal to references.
+		 */
+		/* First compute the number of cache references measured */
+		counts.hits_after -= counts.hits_before;
+		/* Next convert references to cache hits */
+		counts.hits_after -= counts.miss_after > counts.hits_after ?
+					counts.hits_after : counts.miss_after;
+	} else {
+		counts.hits_after -= counts.hits_before;
 	}
 
+	trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
 out:
 	plr->thread_done = 1;
 	wake_up_interruptible(&plr->lock_thread_wq);
@@ -1112,13 +1180,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 		goto out;
 	}
 
+	plr->cpu = cpu;
+
 	if (sel == 1)
 		thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
 	else if (sel == 2)
-		thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+		thread = kthread_create_on_node(measure_l2_residency, plr,
+						cpu_to_node(cpu),
+						"pseudo_lock_measure/%u",
+						cpu);
+	else if (sel == 3)
+		thread = kthread_create_on_node(measure_l3_residency, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
-- 
2.17.0



* [PATCH V5 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (4 preceding siblings ...)
  2018-09-19 17:29 ` [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
@ 2018-09-19 17:29 ` Reinette Chatre
  2018-10-03 19:57   ` [tip:x86/cache] " tip-bot for Reinette Chatre
  2018-09-20 14:11 ` [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Peter Zijlstra
  6 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-19 17:29 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Commit 4a7a54a55e72 ("x86/intel_rdt: Disable PMU access") disabled
measurements of pseudo-locked regions because of incorrect usage
of the performance monitoring hardware.

Cache pseudo-locking measurements are now done correctly with the
in-kernel perf API and its use can be re-enabled at this time.

The adjustment to the in-kernel perf API also separated the L2 and L3
measurements into two measurements that can be triggered independently
from user space. The re-enabling of the measurements is thus not a
simple revert of the original disable: the additional selector value
now possible needs to be accommodated.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index c35b975e3b1e..edf30c94c226 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -1237,7 +1237,7 @@ static ssize_t pseudo_lock_measure_trigger(struct file *file,
 	buf[buf_size] = '\0';
 	ret = kstrtoint(buf, 10, &sel);
 	if (ret == 0) {
-		if (sel != 1)
+		if (sel != 1 && sel != 2 && sel != 3)
 			return -EINVAL;
 		ret = debugfs_file_get(file->f_path.dentry);
 		if (ret)
-- 
2.17.0



* Re: [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-19 17:29 ` [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
@ 2018-09-20 14:11   ` Peter Zijlstra
  2018-09-20 19:02   ` [PATCH V6 " Reinette Chatre
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2018-09-20 14:11 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, gavin.hindman,
	jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Wed, Sep 19, 2018 at 10:29:10AM -0700, Reinette Chatre wrote:
> +static int measure_l3_residency(void *_plr)
> +{

> +	counts.miss_after -= counts.miss_before;
> +	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
> +		/*
> +		 * On BDW references and misses are counted, need to adjust.
> +		 * Sometimes the "hits" counter is a bit more than the
> +		 * references, for example, x references but x + 1 hits.
> +		 * To not report invalid hit values in this case we treat
> +		 * that as misses equal to references.
> +		 */
> +		/* First compute the number of cache references measured */
> +		counts.hits_after -= counts.hits_before;

> +		/* Next convert references to cache hits */
> +		counts.hits_after -= counts.miss_after > counts.hits_after ?
> +					counts.hits_after : counts.miss_after;

Maybe:
		counts.hits_after -= min(counts.hits_after, counts.miss_after);



* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (5 preceding siblings ...)
  2018-09-19 17:29 ` [PATCH V5 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
@ 2018-09-20 14:11 ` Peter Zijlstra
  2018-09-21 16:49   ` Reinette Chatre
  6 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-09-20 14:11 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, gavin.hindman,
	jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
> Reinette Chatre (6):
>   perf/core: Add sanity check to deal with pinned event failure
>   perf/x86: Add helper to obtain performance counter index
>   x86/intel_rdt: Remove local register variables
>   x86/intel_rdt: Create required perf event attributes
>   x86/intel_rdt: Use perf infrastructure for measurements
>   x86/intel_rdt: Re-enable pseudo-lock measurements
> 
>  Documentation/x86/intel_rdt_ui.txt          |  22 +-
>  arch/x86/events/core.c                      |  21 ++
>  arch/x86/include/asm/perf_event.h           |   1 +
>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
>  kernel/events/core.c                        |   6 +
>  5 files changed, 261 insertions(+), 161 deletions(-)

Yeah, these look good, thanks!

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* [PATCH V6 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-19 17:29 ` [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
  2018-09-20 14:11   ` Peter Zijlstra
@ 2018-09-20 19:02   ` Reinette Chatre
  2018-09-29  7:02     ` [tip:x86/cache] " tip-bot for Reinette Chatre
  1 sibling, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-20 19:02 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents the additional cache
misses that would be incurred by the extra testing code used to decide if
an L2 and/or L3 measurement should be made.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
V6:
 - Replace an expanded minimum check with min(). Thanks to Peter for
   noticing this.

 Documentation/x86/intel_rdt_ui.txt          |  22 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 304 ++++++++++++--------
 2 files changed, 203 insertions(+), 123 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
 2) Cache hit and miss measurements using model specific precision counters if
    available. Depending on the levels of cache on the system the pseudo_lock_l2
    and pseudo_lock_l3 tracepoints are available.
-   WARNING: triggering this  measurement uses from two (for just L2
-   measurements) to four (for L2 and L3 measurements) precision counters on
-   the system, if any other measurements are in progress the counters and
-   their corresponding event registers will be clobbered.
 
 When a pseudo-locked region is created a new debugfs directory is created for
 it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
 write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 -  writing "1" to the pseudo_lock_measure file will trigger the latency
+     measurement captured in the pseudo_lock_mem_latency tracepoint. See
+     example below.
+2 -  writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l2 tracepoint. See example below.
+3 -  writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
 
 Example of latency debugging interface:
 In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 33d7968f152a..d68836139cf9 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
 #include <asm/intel_rdt_sched.h>
 #include <asm/perf_event.h>
 
+#include "../../events/perf_event.h" /* For X86_CONFIG() */
 #include "intel_rdt.h"
 
 #define CREATE_TRACE_POINTS
@@ -107,16 +108,6 @@ static u64 get_prefetch_disable_bits(void)
 	return 0;
 }
 
-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
-	__wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
 /**
  * pseudo_lock_minor_get - Obtain available minor number
  * @minor: Pointer to where new minor number will be stored
@@ -925,7 +916,7 @@ static int measure_cycles_lat_fn(void *_plr)
  * The actual configuration of the event is set right before use in order
  * to use the X86_CONFIG macro.
  */
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -933,7 +924,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
 	.exclude_user	= 1,
 };
 
-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -941,139 +932,215 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
 	.exclude_user	= 1,
 };
 
-static int measure_cycles_perf_fn(void *_plr)
+struct residency_counts {
+	u64 miss_before, hits_before;
+	u64 miss_after,  hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+				struct perf_event_attr *hit_attr,
+				struct pseudo_lock_region *plr,
+				struct residency_counts *counts)
 {
-	unsigned long long l3_hits = 0, l3_miss = 0;
-	u64 l3_hit_bits = 0, l3_miss_bits = 0;
-	struct pseudo_lock_region *plr = _plr;
-	unsigned long long l2_hits, l2_miss;
-	u64 l2_hit_bits, l2_miss_bits;
+	u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+	struct perf_event *miss_event, *hit_event;
+	int hit_pmcnum, miss_pmcnum;
 	unsigned int line_size;
 	unsigned int size;
 	unsigned long i;
 	void *mem_r;
+	u64 tmp;
 
-	/*
-	 * Non-architectural event for the Goldmont Microarchitecture
-	 * from Intel x86 Architecture Software Developer Manual (SDM):
-	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
-	 * Umask values:
-	 *     L1_HIT   01H
-	 *     L2_HIT   02H
-	 *     L1_MISS  08H
-	 *     L2_MISS  10H
-	 *
-	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
-	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
-	 * this platform we use the following events instead:
-	 *  L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
-	 *       REFERENCES FFH
-	 *       MISS       3FH
-	 *  LONGEST_LAT_CACHE 2EH (Documented in SDM)
-	 *       REFERENCE 4FH
-	 *       MISS      41H
-	 */
+	miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+						      NULL, NULL, NULL);
+	if (IS_ERR(miss_event))
+		goto out;
 
+	hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+						     NULL, NULL, NULL);
+	if (IS_ERR(hit_event))
+		goto out_miss;
+
+	local_irq_disable();
 	/*
-	 * Start by setting flags for IA32_PERFEVTSELx:
-	 *     OS  (Operating system mode)  0x2
-	 *     INT (APIC interrupt enable)  0x10
-	 *     EN  (Enable counter)         0x40
-	 *
-	 * Then add the Umask value and event number to select performance
-	 * event.
+	 * Check any possible error state of events used by performing
+	 * one local read.
 	 */
-
-	switch (boot_cpu_data.x86_model) {
-	case INTEL_FAM6_ATOM_GOLDMONT:
-	case INTEL_FAM6_ATOM_GEMINI_LAKE:
-		l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
-		l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
-		break;
-	case INTEL_FAM6_BROADWELL_X:
-		/* On BDW the l2_hit_bits count references, not hits */
-		l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
-		l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
-		/* On BDW the l3_hit_bits count references, not hits */
-		l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
-		l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
-		break;
-	default:
-		goto out;
+	if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
+	}
+	if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
 	}
 
-	local_irq_disable();
 	/*
 	 * Disable hardware prefetchers.
 	 */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	/* Disable events and reset counters */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
-	}
-	/* Set and enable the L2 counters */
+
+	/* Initialize rest of local variables */
+	/*
+	 * Performance event has been validated right before this with
+	 * interrupts disabled - it is thus safe to read the counter index.
+	 */
+	miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+	hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+	line_size = READ_ONCE(plr->line_size);
 	mem_r = READ_ONCE(plr->kmem);
 	size = READ_ONCE(plr->size);
-	line_size = READ_ONCE(plr->line_size);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits);
-	}
+
+	/*
+	 * Read counter variables twice - first to load the instructions
+	 * used in L1 cache, second to capture accurate value that does not
+	 * include cache misses incurred because of instruction loads.
+	 */
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/*
+	 * From SDM: Performing back-to-back fast reads are not guaranteed
+	 * to be monotonic.
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/*
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
 	for (i = 0; i < size; i += line_size) {
+		/*
+		 * Add a barrier to prevent speculative execution of this
+		 * loop reading beyond the end of the buffer.
+		 */
+		rmb();
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
 			     : "r" (mem_r), "r" (i)
 			     : "%eax", "memory");
 	}
 	/*
-	 * Call wrmsr directly (no tracing) to not influence
-	 * the cache access counters as they are disabled.
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
 	 */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
-			      l2_hit_bits & ~(0x40ULL << 16));
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
-			      l2_miss_bits & ~(0x40ULL << 16));
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits & ~(0x40ULL << 16));
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits & ~(0x40ULL << 16));
-	}
-	l2_hits = native_read_pmc(0);
-	l2_miss = native_read_pmc(1);
-	if (l3_hit_bits > 0) {
-		l3_hits = native_read_pmc(2);
-		l3_miss = native_read_pmc(3);
-	}
+	rmb();
+	rdpmcl(hit_pmcnum, hits_after);
+	rdpmcl(miss_pmcnum, miss_after);
+	/*
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
+	/* Re-enable hardware prefetchers */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
 	local_irq_enable();
+out_hit:
+	perf_event_release_kernel(hit_event);
+out_miss:
+	perf_event_release_kernel(miss_event);
+out:
+	/*
+	 * All counts will be zero on failure.
+	 */
+	counts->miss_before = miss_before;
+	counts->hits_before = hits_before;
+	counts->miss_after  = miss_after;
+	counts->hits_after  = hits_after;
+	return 0;
+}
+
+static int measure_l2_residency(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	struct residency_counts counts = {0};
+
+	/*
+	 * Non-architectural event for the Goldmont Microarchitecture
+	 * from Intel x86 Architecture Software Developer Manual (SDM):
+	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
+	 * Umask values:
+	 *     L2_HIT   02H
+	 *     L2_MISS  10H
+	 */
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GEMINI_LAKE:
+		perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+						   .umask = 0x10);
+		perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+						  .umask = 0x2);
+		break;
+	default:
+		goto out;
+	}
+
+	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+	/*
+	 * If a failure prevented the measurements from succeeding,
+	 * tracepoints will still be written and all counts will be zero.
+	 */
+	trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+			     counts.miss_after - counts.miss_before);
+out:
+	plr->thread_done = 1;
+	wake_up_interruptible(&plr->lock_thread_wq);
+	return 0;
+}
+
+static int measure_l3_residency(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	struct residency_counts counts = {0};
+
 	/*
-	 * On BDW we count references and misses, need to adjust. Sometimes
-	 * the "hits" counter is a bit more than the references, for
-	 * example, x references but x + 1 hits. To not report invalid
-	 * hit values in this case we treat that as misses eaqual to
-	 * references.
+	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
+	 * this platform the following events are used instead:
+	 * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+	 *       REFERENCE 4FH
+	 *       MISS      41H
 	 */
-	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-		l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
-	trace_pseudo_lock_l2(l2_hits, l2_miss);
-	if (l3_hit_bits > 0) {
-		if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-			l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
-		trace_pseudo_lock_l3(l3_hits, l3_miss);
+
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_BROADWELL_X:
+		/* On BDW the hit event counts references, not hits */
+		perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+						  .umask = 0x4f);
+		perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+						   .umask = 0x41);
+		break;
+	default:
+		goto out;
+	}
+
+	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+	/*
+	 * If a failure prevented the measurements from succeeding,
+	 * tracepoints will still be written and all counts will be zero.
+	 */
+
+	counts.miss_after -= counts.miss_before;
+	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+		/*
+		 * On BDW references and misses are counted, need to adjust.
+		 * Sometimes the "hits" counter is a bit more than the
+		 * references, for example, x references but x + 1 hits.
+		 * To not report invalid hit values in this case we treat
+		 * that as misses equal to references.
+		 */
+		/* First compute the number of cache references measured */
+		counts.hits_after -= counts.hits_before;
+		/* Next convert references to cache hits */
+		counts.hits_after -= min(counts.miss_after, counts.hits_after);
+	} else {
+		counts.hits_after -= counts.hits_before;
 	}
 
+	trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
 out:
 	plr->thread_done = 1;
 	wake_up_interruptible(&plr->lock_thread_wq);
@@ -1112,13 +1179,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 		goto out;
 	}
 
+	plr->cpu = cpu;
+
 	if (sel == 1)
 		thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
 	else if (sel == 2)
-		thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+		thread = kthread_create_on_node(measure_l2_residency, plr,
+						cpu_to_node(cpu),
+						"pseudo_lock_measure/%u",
+						cpu);
+	else if (sel == 3)
+		thread = kthread_create_on_node(measure_l3_residency, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
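
To make the Broadwell references-to-hits conversion above concrete, a
worked example with illustrative numbers (not measured data):

	/* Illustrative values only */
	counts.hits_before = 1000;  counts.hits_after = 1500;  /* references */
	counts.miss_before =  100;  counts.miss_after =  140;

	counts.miss_after -= counts.miss_before;        /*  40 misses     */
	counts.hits_after -= counts.hits_before;        /* 500 references */
	counts.hits_after -= min(counts.miss_after,
				 counts.hits_after);    /* 460 cache hits */

The tracepoint then reports 460 hits and 40 misses. Should misses ever
exceed references, the min() clamps the reported hits to zero instead
of letting the subtraction wrap.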
-- 
2.17.0
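
A condensed sketch of the double-read pattern in the measurement hunk
above: the first rdpmcl() pair only pulls the measurement instructions
into the L1 instruction cache; the pair after the LFENCE provides the
baseline that is actually used, free of instruction-load misses.

	rdpmcl(hit_pmcnum, hits_before);   /* warm up the I-cache  */
	rdpmcl(miss_pmcnum, miss_before);
	rmb();                             /* LFENCE: retire reads */
	rdpmcl(hit_pmcnum, hits_before);   /* baseline that counts */
	rdpmcl(miss_pmcnum, miss_before);
	rmb();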


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-20 14:11 ` [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Peter Zijlstra
@ 2018-09-21 16:49   ` Reinette Chatre
  2018-09-27 20:39     ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-21 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, gavin.hindman,
	jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Dear Maintainers,

On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
>> Reinette Chatre (6):
>>   perf/core: Add sanity check to deal with pinned event failure
>>   perf/x86: Add helper to obtain performance counter index
>>   x86/intel_rdt: Remove local register variables
>>   x86/intel_rdt: Create required perf event attributes
>>   x86/intel_rdt: Use perf infrastructure for measurements
>>   x86/intel_rdt: Re-enable pseudo-lock measurements
>>
>>  Documentation/x86/intel_rdt_ui.txt          |  22 +-
>>  arch/x86/events/core.c                      |  21 ++
>>  arch/x86/include/asm/perf_event.h           |   1 +
>>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
>>  kernel/events/core.c                        |   6 +
>>  5 files changed, 261 insertions(+), 161 deletions(-)
> 
> Yeah, these look good, thanks!
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 

Could you please consider this series for inclusion into v4.19?

- This series is needed to complete the initial cache pseudo-locking
enabling that was first introduced in v4.19. Without this series users
are able to create pseudo-locked regions but unable to accurately
measure their success.
- This is not adding a new feature of pseudo-locked region measurement
but fixing the existing measurement code that was disabled after being
found to be poorly integrated.
- This series consists of 6 patches. Patches 2 to 6 are either new
code in support of this fix or fixes to code that does not exist in
kernels before v4.19.
- Patch 1 is a fix to existing code. It is small, was suggested by the
maintainer, and has the maintainer's support. Even so, if this is felt
to be too risky, patches 2 to 6 could still be merged since they do not
actually depend on patch 1. In retrospect I should have submitted it
separately.

Your consideration would be greatly appreciated.

Thank you

Reinette

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-21 16:49   ` Reinette Chatre
@ 2018-09-27 20:39     ` Thomas Gleixner
  2018-09-28  6:58       ` Peter Zijlstra
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2018-09-27 20:39 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Zijlstra, fenghua.yu, tony.luck, mingo, acme,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Fri, 21 Sep 2018, Reinette Chatre wrote:

> Dear Maintainers,

Sorry for replying late.

> On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
> > On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
> >> Reinette Chatre (6):
> >>   perf/core: Add sanity check to deal with pinned event failure
> >>   perf/x86: Add helper to obtain performance counter index
> >>   x86/intel_rdt: Remove local register variables
> >>   x86/intel_rdt: Create required perf event attributes
> >>   x86/intel_rdt: Use perf infrastructure for measurements
> >>   x86/intel_rdt: Re-enable pseudo-lock measurements
> >>
> >>  Documentation/x86/intel_rdt_ui.txt          |  22 +-
> >>  arch/x86/events/core.c                      |  21 ++
> >>  arch/x86/include/asm/perf_event.h           |   1 +
> >>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
> >>  kernel/events/core.c                        |   6 +
> >>  5 files changed, 261 insertions(+), 161 deletions(-)
> > 
> > Yeah, these look good, thanks!
> > 
> > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > 
> 
> Could you please consider this series for inclusion into v4.19?

So in principle I'm having no objections as this really is mostly a RDT
only issue.

Peter, any objections against the Perf part of it, aside the core patch
which is an obvious fix?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-27 20:39     ` Thomas Gleixner
@ 2018-09-28  6:58       ` Peter Zijlstra
  2018-09-29 14:23         ` Reinette Chatre
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2018-09-28  6:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Reinette Chatre, fenghua.yu, tony.luck, mingo, acme,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Sep 27, 2018 at 10:39:01PM +0200, Thomas Gleixner wrote:
> On Fri, 21 Sep 2018, Reinette Chatre wrote:
> 
> > Dear Maintainers,
> 
> Sorry for replying late.
> 
> > On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
> > > On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
> > >> Reinette Chatre (6):
> > >>   perf/core: Add sanity check to deal with pinned event failure
> > >>   perf/x86: Add helper to obtain performance counter index
> > >>   x86/intel_rdt: Remove local register variables
> > >>   x86/intel_rdt: Create required perf event attributes
> > >>   x86/intel_rdt: Use perf infrastructure for measurements
> > >>   x86/intel_rdt: Re-enable pseudo-lock measurements
> > >>
> > >>  Documentation/x86/intel_rdt_ui.txt          |  22 +-
> > >>  arch/x86/events/core.c                      |  21 ++
> > >>  arch/x86/include/asm/perf_event.h           |   1 +
> > >>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
> > >>  kernel/events/core.c                        |   6 +
> > >>  5 files changed, 261 insertions(+), 161 deletions(-)
> > > 
> > > Yeah, these look good, thanks!
> > > 
> > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > 
> > 
> > Could you please consider this series for inclusion into v4.19?
> 
> So in principle I'm having no objections as this really is mostly a RDT
> only issue.
> 
> Peter, any objections against the Perf part of it, aside the core patch
> which is an obvious fix?

Nope, look up a few lines to observe my Ack ;-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:perf/urgent] perf/core: Add sanity check to deal with pinned event failure
  2018-09-19 17:29 ` [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
@ 2018-09-28 20:48   ` tip-bot for Reinette Chatre
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Reinette Chatre @ 2018-09-28 20:48 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, peterz, reinette.chatre, tglx, hpa, linux-kernel

Commit-ID:  befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7
Gitweb:     https://git.kernel.org/tip/befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7
Author:     Reinette Chatre <reinette.chatre@intel.com>
AuthorDate: Wed, 19 Sep 2018 10:29:06 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 28 Sep 2018 22:44:53 +0200

perf/core: Add sanity check to deal with pinned event failure

It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and that a credible error is returned.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com

---
 kernel/events/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c80549bf82c6..dcb093e7b377 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3935,6 +3935,12 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 		goto out;
 	}
 
+	/* If this is a pinned event it must be running on this CPU */
+	if (event->attr.pinned && event->oncpu != smp_processor_id()) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
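
A minimal sketch, assuming a caller like the pseudo-locking
measurement code, of what this check provides: a pinned event that
failed to schedule now yields an error instead of a stale count.

	u64 tmp;
	int err;

	/* Illustrative caller, not part of this patch */
	err = perf_event_read_local(event, &tmp, NULL, NULL);
	if (err)
		return err;  /* -EBUSY if the pinned event is not running here */
	/* Event verified active on this CPU; its counts can be trusted */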

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:x86/cache] perf/x86: Add helper to obtain performance counter index
  2018-09-19 17:29 ` [PATCH V5 2/6] perf/x86: Add helper to obtain performance counter index Reinette Chatre
@ 2018-09-29  7:00   ` tip-bot for Reinette Chatre
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Reinette Chatre @ 2018-09-29  7:00 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, mingo, peterz, tglx, reinette.chatre, linux-kernel

Commit-ID:  1182a49529edde899be4b4f0e1ab76e626976eb6
Gitweb:     https://git.kernel.org/tip/1182a49529edde899be4b4f0e1ab76e626976eb6
Author:     Reinette Chatre <reinette.chatre@intel.com>
AuthorDate: Wed, 19 Sep 2018 10:29:07 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 28 Sep 2018 22:48:26 +0200

perf/x86: Add helper to obtain performance counter index

perf_event_read_local() is the safest way to obtain measurements
associated with performance events. In some cases the overhead
introduced by perf_event_read_local() affects the measurements and the
use of rdpmcl() is needed. rdpmcl() requires the index
of the performance counter used so a helper is introduced to determine
the index used by a provided performance event.

The index used by a performance event may change when interrupts are
enabled. A check is added to ensure that the index is only accessed
with interrupts disabled. Even with this check, the counter index must
be used with care to ensure it is queried and used within the same
interrupts-disabled section.

This change introduces a new checkpatch warning:
CHECK: extern prototypes should be avoided in .h files
+extern int x86_perf_rdpmc_index(struct perf_event *event);

This warning was discussed and designated as a false positive in
http://lkml.kernel.org/r/20180919091759.GZ24124@hirez.programming.kicks-ass.net

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/b277ffa78a51254f5414f7b1bc1923826874566e.1537377064.git.reinette.chatre@intel.com

---
 arch/x86/events/core.c            | 21 +++++++++++++++++++++
 arch/x86/include/asm/perf_event.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index dfb2f7c0d019..3550d800b030 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1033,6 +1033,27 @@ static inline void x86_assign_hw_event(struct perf_event *event,
 	}
 }
 
+/**
+ * x86_perf_rdpmc_index - Return PMC counter used for event
+ * @event: the perf_event to which the PMC counter was assigned
+ *
+ * The counter assigned to this performance event may change if interrupts
+ * are enabled. This counter should thus never be used while interrupts are
+ * enabled. Before this function is used to obtain the assigned counter the
+ * event should be checked for validity using, for example,
+ * perf_event_read_local(), within the same interrupt disabled section in
+ * which this counter is planned to be used.
+ *
+ * Return: The index of the performance monitoring counter assigned to
+ * @event.
+ */
+int x86_perf_rdpmc_index(struct perf_event *event)
+{
+	lockdep_assert_irqs_disabled();
+
+	return event->hw.event_base_rdpmc;
+}
+
 static inline int match_prev_assignment(struct hw_perf_event *hwc,
 					struct cpu_hw_events *cpuc,
 					int i)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 12f54082f4c8..b2cf84c35a6d 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -270,6 +270,7 @@ struct perf_guest_switch_msr {
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 extern void perf_check_microcode(void);
+extern int x86_perf_rdpmc_index(struct perf_event *event);
 #else
 static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
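
A minimal sketch of the intended usage pattern (mirroring the
pseudo-locking code later in this series): validate the event, then
obtain the index and read the counter, all within a single
interrupts-disabled section.

	u64 tmp, count;
	int pmcnum;

	local_irq_disable();
	if (perf_event_read_local(event, &tmp, NULL, NULL)) {
		local_irq_enable();
		return -EIO;			/* illustrative error choice */
	}
	pmcnum = x86_perf_rdpmc_index(event);	/* safe: IRQs are off */
	rdpmcl(pmcnum, count);			/* low-overhead read  */
	local_irq_enable();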

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:x86/cache] x86/intel_rdt: Remove local register variables
  2018-09-19 17:29 ` [PATCH V5 3/6] x86/intel_rdt: Remove local register variables Reinette Chatre
@ 2018-09-29  7:01   ` tip-bot for Reinette Chatre
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Reinette Chatre @ 2018-09-29  7:01 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: mingo, tglx, hpa, linux-kernel, reinette.chatre, peterz

Commit-ID:  b5e4274ef7f00a03ce8d728701409a6e2c99146b
Gitweb:     https://git.kernel.org/tip/b5e4274ef7f00a03ce8d728701409a6e2c99146b
Author:     Reinette Chatre <reinette.chatre@intel.com>
AuthorDate: Wed, 19 Sep 2018 10:29:08 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 28 Sep 2018 22:48:26 +0200

x86/intel_rdt: Remove local register variables

Local register variables were used in an effort to improve the
accuracy of the measurement of cache residency of a pseudo-locked
region. This was done to ensure that only the cache residency of
the memory is measured and not the cache residency of the variables
used to perform the measurement.

While local register variables do accomplish the goal, they require
significant care since different architectures have different registers
available. Local register variables also cannot be used with valuable
developer tools like KASAN.

Significant testing has shown that similar accuracy in measurement
results can be obtained by replacing local register variables with
regular local variables.

Make use of local variables in the critical code but do so with
READ_ONCE() to prevent the compiler from merging or refetching reads.
Ensure these variables are initialized before the measurement starts,
and ensure it is only the local variables that are accessed during
the measurement.

With the removal of the local register variables and using READ_ONCE()
there is no longer a motivation for using a direct wrmsr call (that
avoids the additional tracing code that may clobber the local register
variables).

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/f430f57347414e0691765d92b144758ab93d8407.1537377064.git.reinette.chatre@intel.com

---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 53 +++++------------------------
 1 file changed, 9 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 40f3903ae5d9..8ad83eb3fc89 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -886,31 +886,14 @@ static int measure_cycles_lat_fn(void *_plr)
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long i;
 	u64 start, end;
-#ifdef CONFIG_KASAN
-	/*
-	 * The registers used for local register variables are also used
-	 * when KASAN is active. When KASAN is active we use a regular
-	 * variable to ensure we always use a valid pointer to access memory.
-	 * The cost is that accessing this pointer, which could be in
-	 * cache, will be included in the measurement of memory read latency.
-	 */
 	void *mem_r;
-#else
-#ifdef CONFIG_X86_64
-	register void *mem_r asm("rbx");
-#else
-	register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */
 
 	local_irq_disable();
 	/*
-	 * The wrmsr call may be reordered with the assignment below it.
-	 * Call wrmsr as directly as possible to avoid tracing clobbering
-	 * local register variable used for memory pointer.
+	 * Disable hardware prefetchers.
 	 */
-	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	mem_r = plr->kmem;
+	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	mem_r = READ_ONCE(plr->kmem);
 	/*
 	 * Dummy execute of the time measurement to load the needed
 	 * instructions into the L1 instruction cache.
@@ -939,26 +922,10 @@ static int measure_cycles_perf_fn(void *_plr)
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long long l2_hits, l2_miss;
 	u64 l2_hit_bits, l2_miss_bits;
-	unsigned long i;
-#ifdef CONFIG_KASAN
-	/*
-	 * The registers used for local register variables are also used
-	 * when KASAN is active. When KASAN is active we use regular variables
-	 * at the cost of including cache access latency to these variables
-	 * in the measurements.
-	 */
 	unsigned int line_size;
 	unsigned int size;
+	unsigned long i;
 	void *mem_r;
-#else
-	register unsigned int line_size asm("esi");
-	register unsigned int size asm("edi");
-#ifdef CONFIG_X86_64
-	register void *mem_r asm("rbx");
-#else
-	register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */
 
 	/*
 	 * Non-architectural event for the Goldmont Microarchitecture
@@ -1011,11 +978,9 @@ static int measure_cycles_perf_fn(void *_plr)
 
 	local_irq_disable();
 	/*
-	 * Call wrmsr direcly to avoid the local register variables from
-	 * being overwritten due to reordering of their assignment with
-	 * the wrmsr calls.
+	 * Disable hardware prefetchers.
 	 */
-	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
 	/* Disable events and reset counters */
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
@@ -1028,6 +993,9 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
 	}
 	/* Set and enable the L2 counters */
+	mem_r = READ_ONCE(plr->kmem);
+	size = READ_ONCE(plr->size);
+	line_size = READ_ONCE(plr->line_size);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
 	if (l3_hit_bits > 0) {
@@ -1036,9 +1004,6 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
 				      l3_miss_bits);
 	}
-	mem_r = plr->kmem;
-	size = plr->size;
-	line_size = plr->line_size;
 	for (i = 0; i < size; i += line_size) {
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
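
A minimal sketch of the property READ_ONCE() provides in this context:
each plr field is loaded exactly once into a local ahead of the
measured loop, so the compiler cannot refetch it mid-measurement and
add its own cache traffic to the result.

	void *mem_r;
	unsigned int size, line_size;
	unsigned long i;

	/* Illustrative only: each field loaded once, before the loop */
	mem_r     = READ_ONCE(plr->kmem);
	size      = READ_ONCE(plr->size);
	line_size = READ_ONCE(plr->line_size);
	for (i = 0; i < size; i += line_size)
		/* only locals and the locked buffer are touched here */
		asm volatile("mov (%0,%1,1), %%eax\n\t"
			     : : "r" (mem_r), "r" (i)
			     : "%eax", "memory");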

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:x86/cache] x86/intel_rdt: Create required perf event attributes
  2018-09-19 17:29 ` [PATCH V5 4/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
@ 2018-09-29  7:01   ` tip-bot for Reinette Chatre
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Reinette Chatre @ 2018-09-29  7:01 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, tglx, reinette.chatre, peterz, hpa, mingo

Commit-ID:  0a701c9dd5351cbaef1677a0c8d37950e158cd55
Gitweb:     https://git.kernel.org/tip/0a701c9dd5351cbaef1677a0c8d37950e158cd55
Author:     Reinette Chatre <reinette.chatre@intel.com>
AuthorDate: Wed, 19 Sep 2018 10:29:09 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 28 Sep 2018 22:48:27 +0200

x86/intel_rdt: Create required perf event attributes

A perf event has many attributes that are maintained in a separate
structure that should be provided when a new perf_event is created.

In preparation for the transition to perf_events the required attribute
structures are created for all the events that may be used in the
measurements. Most attributes for all the events are identical. The
actual configuration, which specifies what needs to be measured, is
what differs between the events used. This configuration needs to be
done with X86_CONFIG, which cannot be used as part of the designated
initializers used here; it will be introduced later.

Although they look identical at this time, the attribute structures
need to be maintained separately since each perf_event will maintain a
pointer to its unique attributes.

In support of patch testing the new structs are given the unused attribute
until their use in later patches.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/1822f6164e221a497648d108913d056ab675d5d0.1537377064.git.reinette.chatre@intel.com

---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 8ad83eb3fc89..33d7968f152a 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -17,6 +17,7 @@
 #include <linux/debugfs.h>
 #include <linux/kthread.h>
 #include <linux/mman.h>
+#include <linux/perf_event.h>
 #include <linux/pm_qos.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
@@ -915,6 +916,31 @@ static int measure_cycles_lat_fn(void *_plr)
 	return 0;
 }
 
+/*
+ * Create a perf_event_attr for the hit and miss perf events that will
+ * be used during the performance measurement. A perf_event maintains
+ * a pointer to its perf_event_attr so a unique attribute structure is
+ * created for each perf_event.
+ *
+ * The actual configuration of the event is set right before use in order
+ * to use the X86_CONFIG macro.
+ */
+static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+	.type		= PERF_TYPE_RAW,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 0,
+	.exclude_user	= 1,
+};
+
+static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+	.type		= PERF_TYPE_RAW,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 0,
+	.exclude_user	= 1,
+};
+
 static int measure_cycles_perf_fn(void *_plr)
 {
 	unsigned long long l3_hits = 0, l3_miss = 0;
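
A short sketch of how these attributes are completed right before use;
the Goldmont configuration values below come from a later patch in
this series:

	/* Fill in the model-specific configuration ... */
	perf_miss_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x10);
	perf_hit_attr.config  = X86_CONFIG(.event = 0xd1, .umask = 0x2);

	/* ... then create the pinned kernel counters on the chosen CPU */
	miss_event = perf_event_create_kernel_counter(&perf_miss_attr, cpu,
						      NULL, NULL, NULL);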

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [tip:x86/cache] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-20 19:02   ` [PATCH V6 " Reinette Chatre
@ 2018-09-29  7:02     ` tip-bot for Reinette Chatre
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Reinette Chatre @ 2018-09-29  7:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, reinette.chatre, hpa, peterz, tglx, mingo

Commit-ID:  dd45407c0b2445bc2aa0ecfea744d5af3a146577
Gitweb:     https://git.kernel.org/tip/dd45407c0b2445bc2aa0ecfea744d5af3a146577
Author:     Reinette Chatre <reinette.chatre@intel.com>
AuthorDate: Thu, 20 Sep 2018 12:02:11 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 28 Sep 2018 22:48:27 +0200

x86/intel_rdt: Use perf infrastructure for measurements

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid the additional cache hits. This is
done safely by first ensuring with the perf API that the counters have
been programmed correctly and only accessing the counters in an
interrupt disabled section where they are not able to be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents the additional
cache misses incurred by the extra testing code used to decide whether
an L2 and/or L3 measurement should be made.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: peterz@infradead.org
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/fc24e728b446404f42c78573c506e98cd0599873.1537468643.git.reinette.chatre@intel.com

---
 Documentation/x86/intel_rdt_ui.txt          |  22 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 304 +++++++++++++++++-----------
 2 files changed, 203 insertions(+), 123 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
 2) Cache hit and miss measurements using model specific precision counters if
    available. Depending on the levels of cache on the system the pseudo_lock_l2
    and pseudo_lock_l3 tracepoints are available.
-   WARNING: triggering this  measurement uses from two (for just L2
-   measurements) to four (for L2 and L3 measurements) precision counters on
-   the system, if any other measurements are in progress the counters and
-   their corresponding event registers will be clobbered.
 
 When a pseudo-locked region is created a new debugfs directory is created for
 it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
 write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 -  writing "1" to the pseudo_lock_measure file will trigger the latency
+     measurement captured in the pseudo_lock_mem_latency tracepoint. See
+     example below.
+2 -  writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l2 tracepoint. See example below.
+3 -  writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
 
 Example of latency debugging interface:
 In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 33d7968f152a..d68836139cf9 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -27,6 +27,7 @@
 #include <asm/intel_rdt_sched.h>
 #include <asm/perf_event.h>
 
+#include "../../events/perf_event.h" /* For X86_CONFIG() */
 #include "intel_rdt.h"
 
 #define CREATE_TRACE_POINTS
@@ -107,16 +108,6 @@ static u64 get_prefetch_disable_bits(void)
 	return 0;
 }
 
-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
-	__wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
 /**
  * pseudo_lock_minor_get - Obtain available minor number
  * @minor: Pointer to where new minor number will be stored
@@ -925,7 +916,7 @@ static int measure_cycles_lat_fn(void *_plr)
  * The actual configuration of the event is set right before use in order
  * to use the X86_CONFIG macro.
  */
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -933,7 +924,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
 	.exclude_user	= 1,
 };
 
-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -941,139 +932,215 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
 	.exclude_user	= 1,
 };
 
-static int measure_cycles_perf_fn(void *_plr)
+struct residency_counts {
+	u64 miss_before, hits_before;
+	u64 miss_after,  hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+				struct perf_event_attr *hit_attr,
+				struct pseudo_lock_region *plr,
+				struct residency_counts *counts)
 {
-	unsigned long long l3_hits = 0, l3_miss = 0;
-	u64 l3_hit_bits = 0, l3_miss_bits = 0;
-	struct pseudo_lock_region *plr = _plr;
-	unsigned long long l2_hits, l2_miss;
-	u64 l2_hit_bits, l2_miss_bits;
+	u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+	struct perf_event *miss_event, *hit_event;
+	int hit_pmcnum, miss_pmcnum;
 	unsigned int line_size;
 	unsigned int size;
 	unsigned long i;
 	void *mem_r;
+	u64 tmp;
 
-	/*
-	 * Non-architectural event for the Goldmont Microarchitecture
-	 * from Intel x86 Architecture Software Developer Manual (SDM):
-	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
-	 * Umask values:
-	 *     L1_HIT   01H
-	 *     L2_HIT   02H
-	 *     L1_MISS  08H
-	 *     L2_MISS  10H
-	 *
-	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
-	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
-	 * this platform we use the following events instead:
-	 *  L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
-	 *       REFERENCES FFH
-	 *       MISS       3FH
-	 *  LONGEST_LAT_CACHE 2EH (Documented in SDM)
-	 *       REFERENCE 4FH
-	 *       MISS      41H
-	 */
+	miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+						      NULL, NULL, NULL);
+	if (IS_ERR(miss_event))
+		goto out;
 
+	hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+						     NULL, NULL, NULL);
+	if (IS_ERR(hit_event))
+		goto out_miss;
+
+	local_irq_disable();
 	/*
-	 * Start by setting flags for IA32_PERFEVTSELx:
-	 *     OS  (Operating system mode)  0x2
-	 *     INT (APIC interrupt enable)  0x10
-	 *     EN  (Enable counter)         0x40
-	 *
-	 * Then add the Umask value and event number to select performance
-	 * event.
+	 * Check any possible error state of events used by performing
+	 * one local read.
 	 */
-
-	switch (boot_cpu_data.x86_model) {
-	case INTEL_FAM6_ATOM_GOLDMONT:
-	case INTEL_FAM6_ATOM_GEMINI_LAKE:
-		l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
-		l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
-		break;
-	case INTEL_FAM6_BROADWELL_X:
-		/* On BDW the l2_hit_bits count references, not hits */
-		l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
-		l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
-		/* On BDW the l3_hit_bits count references, not hits */
-		l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
-		l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
-		break;
-	default:
-		goto out;
+	if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
+	}
+	if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
 	}
 
-	local_irq_disable();
 	/*
 	 * Disable hardware prefetchers.
 	 */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	/* Disable events and reset counters */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
-	}
-	/* Set and enable the L2 counters */
+
+	/* Initialize rest of local variables */
+	/*
+	 * Performance event has been validated right before this with
+	 * interrupts disabled - it is thus safe to read the counter index.
+	 */
+	miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+	hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+	line_size = READ_ONCE(plr->line_size);
 	mem_r = READ_ONCE(plr->kmem);
 	size = READ_ONCE(plr->size);
-	line_size = READ_ONCE(plr->line_size);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits);
-	}
+
+	/*
+	 * Read counter variables twice - first to load the instructions
+	 * used in L1 cache, second to capture accurate value that does not
+	 * include cache misses incurred because of instruction loads.
+	 */
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/*
+	 * From SDM: Back-to-back fast reads are not guaranteed
+	 * to be monotonic.
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/*
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
 	for (i = 0; i < size; i += line_size) {
+		/*
+		 * Add a barrier to prevent speculative execution of this
+		 * loop reading beyond the end of the buffer.
+		 */
+		rmb();
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
 			     : "r" (mem_r), "r" (i)
 			     : "%eax", "memory");
 	}
 	/*
-	 * Call wrmsr directly (no tracing) to not influence
-	 * the cache access counters as they are disabled.
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
 	 */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
-			      l2_hit_bits & ~(0x40ULL << 16));
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
-			      l2_miss_bits & ~(0x40ULL << 16));
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits & ~(0x40ULL << 16));
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits & ~(0x40ULL << 16));
-	}
-	l2_hits = native_read_pmc(0);
-	l2_miss = native_read_pmc(1);
-	if (l3_hit_bits > 0) {
-		l3_hits = native_read_pmc(2);
-		l3_miss = native_read_pmc(3);
-	}
+	rmb();
+	rdpmcl(hit_pmcnum, hits_after);
+	rdpmcl(miss_pmcnum, miss_after);
+	/*
+	 * Use LFENCE to ensure all previous instructions are retired
+	 * before proceeding.
+	 */
+	rmb();
+	/* Re-enable hardware prefetchers */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
 	local_irq_enable();
+out_hit:
+	perf_event_release_kernel(hit_event);
+out_miss:
+	perf_event_release_kernel(miss_event);
+out:
+	/*
+	 * All counts will be zero on failure.
+	 */
+	counts->miss_before = miss_before;
+	counts->hits_before = hits_before;
+	counts->miss_after  = miss_after;
+	counts->hits_after  = hits_after;
+	return 0;
+}
+
+static int measure_l2_residency(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	struct residency_counts counts = {0};
+
+	/*
+	 * Non-architectural event for the Goldmont Microarchitecture
+	 * from Intel x86 Architecture Software Developer Manual (SDM):
+	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
+	 * Umask values:
+	 *     L2_HIT   02H
+	 *     L2_MISS  10H
+	 */
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GEMINI_LAKE:
+		perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+						   .umask = 0x10);
+		perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+						  .umask = 0x2);
+		break;
+	default:
+		goto out;
+	}
+
+	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+	/*
+	 * If a failure prevented the measurements from succeeding,
+	 * tracepoints will still be written and all counts will be zero.
+	 */
+	trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+			     counts.miss_after - counts.miss_before);
+out:
+	plr->thread_done = 1;
+	wake_up_interruptible(&plr->lock_thread_wq);
+	return 0;
+}
+
+static int measure_l3_residency(void *_plr)
+{
+	struct pseudo_lock_region *plr = _plr;
+	struct residency_counts counts = {0};
+
 	/*
-	 * On BDW we count references and misses, need to adjust. Sometimes
-	 * the "hits" counter is a bit more than the references, for
-	 * example, x references but x + 1 hits. To not report invalid
-	 * hit values in this case we treat that as misses eaqual to
-	 * references.
+	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
+	 * this platform the following events are used instead:
+	 * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+	 *       REFERENCE 4FH
+	 *       MISS      41H
 	 */
-	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-		l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
-	trace_pseudo_lock_l2(l2_hits, l2_miss);
-	if (l3_hit_bits > 0) {
-		if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-			l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
-		trace_pseudo_lock_l3(l3_hits, l3_miss);
+
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_BROADWELL_X:
+		/* On BDW the hit event counts references, not hits */
+		perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+						  .umask = 0x4f);
+		perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+						   .umask = 0x41);
+		break;
+	default:
+		goto out;
+	}
+
+	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+	/*
+	 * If a failure prevented the measurements from succeeding,
+	 * tracepoints will still be written and all counts will be zero.
+	 */
+
+	counts.miss_after -= counts.miss_before;
+	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+		/*
+		 * On BDW references and misses are counted, need to adjust.
+		 * Sometimes the "hits" counter is a bit more than the
+		 * references, for example, x references but x + 1 hits.
+		 * To not report invalid hit values in this case we treat
+		 * that as misses equal to references.
+		 */
+		/* First compute the number of cache references measured */
+		counts.hits_after -= counts.hits_before;
+		/* Next convert references to cache hits */
+		counts.hits_after -= min(counts.miss_after, counts.hits_after);
+	} else {
+		counts.hits_after -= counts.hits_before;
 	}
 
+	trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
 out:
 	plr->thread_done = 1;
 	wake_up_interruptible(&plr->lock_thread_wq);
@@ -1112,13 +1179,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 		goto out;
 	}
 
+	plr->cpu = cpu;
+
 	if (sel == 1)
 		thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
 	else if (sel == 2)
-		thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+		thread = kthread_create_on_node(measure_l2_residency, plr,
+						cpu_to_node(cpu),
+						"pseudo_lock_measure/%u",
+						cpu);
+	else if (sel == 3)
+		thread = kthread_create_on_node(measure_l3_residency, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
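
An illustrative session, assuming a pseudo-locked region named
"newlock" as in the documentation above (the tracepoint path is
assumed from the resctrl trace event group):

	# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
	# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
	# cat /sys/kernel/debug/tracing/trace

The pseudo_lock_l2 tracepoint then carries the hit and miss counts
computed by measure_l2_residency() above.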

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-28  6:58       ` Peter Zijlstra
@ 2018-09-29 14:23         ` Reinette Chatre
  2018-09-29 17:56           ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-09-29 14:23 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: fenghua.yu, tony.luck, mingo, acme, gavin.hindman, jithu.joseph,
	dave.hansen, hpa, x86, linux-kernel

On 9/27/2018 11:58 PM, Peter Zijlstra wrote:
> On Thu, Sep 27, 2018 at 10:39:01PM +0200, Thomas Gleixner wrote:
>> On Fri, 21 Sep 2018, Reinette Chatre wrote:
>>
>>> Dear Maintainers,
>>
>> Sorry for replying late.
>>
>>> On 9/20/2018 7:11 AM, Peter Zijlstra wrote:
>>>> On Wed, Sep 19, 2018 at 10:29:05AM -0700, Reinette Chatre wrote:
>>>>> Reinette Chatre (6):
>>>>>   perf/core: Add sanity check to deal with pinned event failure
>>>>>   perf/x86: Add helper to obtain performance counter index
>>>>>   x86/intel_rdt: Remove local register variables
>>>>>   x86/intel_rdt: Create required perf event attributes
>>>>>   x86/intel_rdt: Use perf infrastructure for measurements
>>>>>   x86/intel_rdt: Re-enable pseudo-lock measurements
>>>>>
>>>>>  Documentation/x86/intel_rdt_ui.txt          |  22 +-
>>>>>  arch/x86/events/core.c                      |  21 ++
>>>>>  arch/x86/include/asm/perf_event.h           |   1 +
>>>>>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 372 ++++++++++++--------
>>>>>  kernel/events/core.c                        |   6 +
>>>>>  5 files changed, 261 insertions(+), 161 deletions(-)
>>>>
>>>> Yeah, these look good, thanks!
>>>>
>>>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>
>>>
>>> Could you please consider this series for inclusion into v4.19?
>>
>> So in principle I'm having no objections as this really is mostly a RDT
>> only issue.
>>
>> Peter, any objections against the Perf part of it, aside the core patch
>> which is an obvious fix?
> 
> Nope, look up a few lines to observe my Ack ;-)
> 

I interpreted Thomas and Peter's responses to mean that there are no
objections for this to be included in v4.19 as a fix.

If I understand the tip branches correctly the core patch seems to be
headed to v4.19 while the rest (excluding the final patch
"x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.

Have you decided against including this into v4.19 or did I
misunderstand the responses and/or branches?

Thank you for helping me to sort this out

Reinette

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-29 14:23         ` Reinette Chatre
@ 2018-09-29 17:56           ` Thomas Gleixner
  2018-10-03 19:41             ` Reinette Chatre
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2018-09-29 17:56 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Zijlstra, fenghua.yu, tony.luck, mingo, acme,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Reinette,

On Sat, 29 Sep 2018, Reinette Chatre wrote:
> I interpreted Thomas and Peter's responses to mean that there are no
> objections for this to be included in v4.19 as a fix.
> 
> If I understand the tip branches correctly the core patch seems to be
> headed to v4.19 while the rest (excluding the final patch
> "x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.
> 
> Have you decided against including this into v4.19 or did I
> misunderstand the responses and/or branches?

I did not decide anything yet. It's not going into -rc6 as it's not yet
through next and the other standard testing.

I'm also looking at the other set of RDT fixes, which obviously want to go
as well. So not sure how to deal with all of that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-09-29 17:56           ` Thomas Gleixner
@ 2018-10-03 19:41             ` Reinette Chatre
  2018-10-03 19:45               ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Reinette Chatre @ 2018-10-03 19:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, fenghua.yu, tony.luck, mingo, acme,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Hi Thomas,

On 9/29/2018 10:56 AM, Thomas Gleixner wrote:
> Reinette,
> 
> On Sat, 29 Sep 2018, Reinette Chatre wrote:
>> I interpreted Thomas and Peter's responses to mean that there are no
>> objections for this to be included in v4.19 as a fix.
>>
>> If I understand the tip branches correctly the core patch seems to be
>> headed to v4.19 while the rest (excluding the final patch
>> "x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.
>>
>> Have you decided against including this into v4.19 or did I
>> misunderstand the responses and/or branches?
> 
> I did not decide anything yet. It's not going into -rc6 as it's not yet
> through next and the other standard testing.
> 
> I'm also looking at the other set of RDT fixes, which obviously want to go
> as well. So not sure how to deal with all of that.
> 

When you do deal with this series could you please also include the
final patch ("x86/intel_rdt: Re-enable pseudo-lock measurements")? I
noticed that patches 1/6 to 5/6 have been merged into x86/cache in
tip.git and then some other work on top of it. It is not clear to me why
6/6 was omitted and this fix does require it.

Thank you very much

Reinette

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf
  2018-10-03 19:41             ` Reinette Chatre
@ 2018-10-03 19:45               ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2018-10-03 19:45 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Peter Zijlstra, fenghua.yu, tony.luck, mingo, acme,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Wed, 3 Oct 2018, Reinette Chatre wrote:
> On 9/29/2018 10:56 AM, Thomas Gleixner wrote:
> > Reinette,
> > 
> > On Sat, 29 Sep 2018, Reinette Chatre wrote:
> >> I interpreted Thomas and Peter's responses to mean that there are no
> >> objections for this to be included in v4.19 as a fix.
> >>
> >> If I understand the tip branches correctly the core patch seems to be
> >> headed to v4.19 while the rest (excluding the final patch
> >> "x86/intel_rdt: Re-enable pseudo-lock measurements") are headed to v4.20.
> >>
> >> Have you decided against including this into v4.19 or did I
> >> misunderstand the responses and/or branches?
> > 
> > I did not decide anything yet. It's not going into -rc6 as it's not yet
> > through next and the other standard testing.
> > 
> > I'm also looking at the other set of RDT fixes, which obviously want to go
> > as well. So not sure how to deal with all of that.
> > 
> 
> When you do deal with this series could you please also include the
> final patch ("x86/intel_rdt: Re-enable pseudo-lock measurements")? I
> noticed that patches 1/6 to 5/6 have been merged into x86/cache in
> tip.git and then some other work on top of it. It is not clear to me why
> 6/6 was omitted and this fix does require it.

Hmm. That was not intentional. /me goes to rummage around.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:x86/cache] x86/intel_rdt: Re-enable pseudo-lock measurements
  2018-09-19 17:29 ` [PATCH V5 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
@ 2018-10-03 19:57   ` tip-bot for Reinette Chatre
  0 siblings, 0 replies; 23+ messages in thread
From: tip-bot for Reinette Chatre @ 2018-10-03 19:57 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, tglx, reinette.chatre

Commit-ID:  53ed74af05517cf5fe83d87c4bb7ab998adfd631
Gitweb:     https://git.kernel.org/tip/53ed74af05517cf5fe83d87c4bb7ab998adfd631
Author:     Reinette Chatre <reinette.chatre@intel.com>
AuthorDate: Wed, 19 Sep 2018 10:29:11 -0700
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 3 Oct 2018 21:53:35 +0200

x86/intel_rdt: Re-enable pseudo-lock measurements

Commit 4a7a54a55e72 ("x86/intel_rdt: Disable PMU access") disabled
measurements of pseudo-locked regions because of incorrect usage
of the performance monitoring hardware.

Cache pseudo-locking measurements are now done correctly with the
in-kernel perf API and its use can be re-enabled at this time.

The adjustment to the in-kernel perf API also separated the L2 and L3
measurements so that they can be triggered independently from user
space. The re-enabling of the measurements is thus not a simple revert
of the original disable, since the additional measurement selection now
possible has to be accommodated.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: peterz@infradead.org
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/bfb9fc31692e0c62d9ca39062e55eceb6a0635b5.1537377064.git.reinette.chatre@intel.com

---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index d68836139cf9..30e6c9f5a0ad 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -1236,7 +1236,7 @@ static ssize_t pseudo_lock_measure_trigger(struct file *file,
 	buf[buf_size] = '\0';
 	ret = kstrtoint(buf, 10, &sel);
 	if (ret == 0) {
-		if (sel != 1)
+		if (sel != 1 && sel != 2 && sel != 3)
 			return -EINVAL;
 		ret = debugfs_file_get(file->f_path.dentry);
 		if (ret)
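
With this change the three measurements map directly onto the accepted
values; for example, the L3 residency measurement added earlier in
this series would be triggered with (region name assumed):

	# echo 3 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure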

^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-10-03 19:58 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-19 17:29 [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
2018-09-19 17:29 ` [PATCH V5 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
2018-09-28 20:48   ` [tip:perf/urgent] " tip-bot for Reinette Chatre
2018-09-19 17:29 ` [PATCH V5 2/6] perf/x86: Add helper to obtain performance counter index Reinette Chatre
2018-09-29  7:00   ` [tip:x86/cache] " tip-bot for Reinette Chatre
2018-09-19 17:29 ` [PATCH V5 3/6] x86/intel_rdt: Remove local register variables Reinette Chatre
2018-09-29  7:01   ` [tip:x86/cache] " tip-bot for Reinette Chatre
2018-09-19 17:29 ` [PATCH V5 4/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
2018-09-29  7:01   ` [tip:x86/cache] " tip-bot for Reinette Chatre
2018-09-19 17:29 ` [PATCH V5 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
2018-09-20 14:11   ` Peter Zijlstra
2018-09-20 19:02   ` [PATCH V6 " Reinette Chatre
2018-09-29  7:02     ` [tip:x86/cache] " tip-bot for Reinette Chatre
2018-09-19 17:29 ` [PATCH V5 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
2018-10-03 19:57   ` [tip:x86/cache] " tip-bot for Reinette Chatre
2018-09-20 14:11 ` [PATCH V5 0/6] perf and x86/intel_rdt: Fix lack of coordination with perf Peter Zijlstra
2018-09-21 16:49   ` Reinette Chatre
2018-09-27 20:39     ` Thomas Gleixner
2018-09-28  6:58       ` Peter Zijlstra
2018-09-29 14:23         ` Reinette Chatre
2018-09-29 17:56           ` Thomas Gleixner
2018-10-03 19:41             ` Reinette Chatre
2018-10-03 19:45               ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).