linux-kernel.vger.kernel.org archive mirror
* [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf
@ 2018-08-16 20:16 Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Dear Maintainers,

This is the second attempt at fixing the lack of coordination between the
pseudo-locking measurement code and perf. Thank you very much for your
feedback on the first version. The entire solution, including the cover
letter, has been reworked based on your feedback. While this is submitted
as a V2, none of the patches from V1 remain.

Changes since V1:
- Use in-kernel interface to perf.
- Do not write directly to PMU registers.
- Do not introduce another PMU owner. perf maintains its role of performing
  resource arbitration for the PMU.
- User space is able to use perf and resctrl at the same time.
- event_base_rdpmc is accessed and used only within an interrupts-disabled
  section.
- Internals of events are never accessed directly; an inline function is used.
- Due to "pinned" usage the scheduling of the event may have failed. The
  error state is checked in the recommended way and credible error handling
  is in place.
- Use X86_CONFIG to build the event configurations.

This code is based on the x86/cache branch of tip.git

The success of Cache Pseudo-Locking, defined by how many cache lines from
a physical memory region have been locked to cache, can be measured via
the use of hardware performance events. Specifically, the number of cache
hits and misses observed when reading a memory region after it has been
pseudo-locked to cache is counted. This measurement is triggered via the
resctrl debugfs interface.

The current solution accesses performance counters and their configuration
registers directly without coordination with other performance event users
(perf).
Two of the issues with the current solution are:
- By writing to the performance monitoring registers directly a new owner
  for these resources is introduced. The perf infrastructure already exists
  to perform resource arbitration and this in-kernel infrastructure should
  be used to do so.
- The current lack of coordination with perf will have consequences any time
  two users, for example perf and cache pseudo-locking, attempt to do any
  kind of measurement at the same time.

In this series the measurement of Cache Pseudo-Lock regions is moved to use
the in-kernel interface to perf. During the rework of the measurement
function the L2 and L3 cache measurements are separated to avoid the
additional code needed to decide between them, code that itself causes
unrelated cache hits and misses.
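
At its core the reworked measurement flow looks roughly as follows
(simplified sketch only: declarations, error handling and the memory
barriers are omitted, see patch 5 for the actual code):

	miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
						      plr->cpu, NULL, NULL, NULL);
	hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
						     plr->cpu, NULL, NULL, NULL);

	local_irq_disable();
	/* One local read to detect a failed (pinned) event. */
	perf_event_read_local(miss_event, &tmp, NULL, NULL);
	perf_event_read_local(hit_event, &tmp, NULL, NULL);
	/* Safe to obtain the counter indexes with interrupts disabled. */
	hit_pmcnum = x86_perf_rdpmc_ctr_get(hit_event);
	miss_pmcnum = x86_perf_rdpmc_ctr_get(miss_event);

	rdpmcl(hit_pmcnum, hits_before);
	rdpmcl(miss_pmcnum, miss_before);
	/* ... read through the pseudo-locked memory region ... */
	rdpmcl(hit_pmcnum, hits_after);
	rdpmcl(miss_pmcnum, miss_after);
	local_irq_enable();

	perf_event_release_kernel(hit_event);
	perf_event_release_kernel(miss_event);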

Your feedback on this work will be greatly appreciated.

Reinette

Reinette Chatre (6):
  perf/core: Add sanity check to deal with pinned event failure
  x86/intel_rdt: Remove local register variables
  x86/intel_rdt: Create required perf event attributes
  x86/intel_rdt: Add helper to obtain performance counter index
  x86/intel_rdt: Use perf infrastructure for measurements
  x86/intel_rdt: Re-enable pseudo-lock measurements

 Documentation/x86/intel_rdt_ui.txt          |  22 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 419 ++++++++++++++------
 kernel/events/core.c                        |   6 +
 3 files changed, 310 insertions(+), 137 deletions(-)

-- 
2.17.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V2 1/6] perf/core: Add sanity check to deal with pinned event failure
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
@ 2018-08-16 20:16 ` Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 2/6] x86/intel_rdt: Remove local register variables Reinette Chatre
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and a credible error returned.

Link: http://lkml.kernel.org/r/20180807093615.GY2494@hirez.programming.kicks-ass.net
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 kernel/events/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80cca2b30c4f..005d0f2519fd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3940,6 +3940,12 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 		goto out;
 	}
 
+	/* If this is a pinned event it must be running on this CPU */
+	if (event->attr.pinned && event->oncpu != smp_processor_id()) {
+		ret = -EBUSY;
+		goto out;
+	}
+
 	/*
 	 * If the event is currently on this CPU, its either a per-task event,
 	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V2 2/6] x86/intel_rdt: Remove local register variables
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
@ 2018-08-16 20:16 ` Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 3/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Local register variables were used in an effort to improve the
accuracy of the measurement of cache residency of a pseudo-locked
region. This was done to ensure that only the cache residency of
the memory is measured and not the cache residency of the variables
used to perform the measurement.

While local register variables do accomplish the goal, they do require
significant care since different architectures have different registers
available. Local register variables also cannot be used with valuable
developer tools like KASAN.

Significant testing has shown that similar accuracy in measurement
results can be obtained by replacing local register variables with
regular local variables.

Make use of local variables in the critical code, ensure these variables
are initialized before the measurement starts, and ensure it is only
the local variables that are accessed during the measurement.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 39 +++------------------
 1 file changed, 4 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 40f3903ae5d9..f46c8afe7875 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -886,22 +886,7 @@ static int measure_cycles_lat_fn(void *_plr)
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long i;
 	u64 start, end;
-#ifdef CONFIG_KASAN
-	/*
-	 * The registers used for local register variables are also used
-	 * when KASAN is active. When KASAN is active we use a regular
-	 * variable to ensure we always use a valid pointer to access memory.
-	 * The cost is that accessing this pointer, which could be in
-	 * cache, will be included in the measurement of memory read latency.
-	 */
 	void *mem_r;
-#else
-#ifdef CONFIG_X86_64
-	register void *mem_r asm("rbx");
-#else
-	register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */
 
 	local_irq_disable();
 	/*
@@ -939,26 +924,10 @@ static int measure_cycles_perf_fn(void *_plr)
 	struct pseudo_lock_region *plr = _plr;
 	unsigned long long l2_hits, l2_miss;
 	u64 l2_hit_bits, l2_miss_bits;
-	unsigned long i;
-#ifdef CONFIG_KASAN
-	/*
-	 * The registers used for local register variables are also used
-	 * when KASAN is active. When KASAN is active we use regular variables
-	 * at the cost of including cache access latency to these variables
-	 * in the measurements.
-	 */
 	unsigned int line_size;
 	unsigned int size;
+	unsigned long i;
 	void *mem_r;
-#else
-	register unsigned int line_size asm("esi");
-	register unsigned int size asm("edi");
-#ifdef CONFIG_X86_64
-	register void *mem_r asm("rbx");
-#else
-	register void *mem_r asm("ebx");
-#endif /* CONFIG_X86_64 */
-#endif /* CONFIG_KASAN */
 
 	/*
 	 * Non-architectural event for the Goldmont Microarchitecture
@@ -1028,6 +997,9 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
 	}
 	/* Set and enable the L2 counters */
+	mem_r = plr->kmem;
+	size = plr->size;
+	line_size = plr->line_size;
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
 	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
 	if (l3_hit_bits > 0) {
@@ -1036,9 +1008,6 @@ static int measure_cycles_perf_fn(void *_plr)
 		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
 				      l3_miss_bits);
 	}
-	mem_r = plr->kmem;
-	size = plr->size;
-	line_size = plr->line_size;
 	for (i = 0; i < size; i += line_size) {
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V2 3/6] x86/intel_rdt: Create required perf event attributes
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 2/6] x86/intel_rdt: Remove local register variables Reinette Chatre
@ 2018-08-16 20:16 ` Reinette Chatre
  2018-08-16 20:16 ` [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index Reinette Chatre
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

A perf event has many attributes that are maintained in a separate
structure that should be provided when a new perf_event is created.

In preparation for the transition to perf_events the required attribute
structures are created for all the events that may be used in the
measurements. Most attributes for all the events are identical. The
actual configuration, which specifies what needs to be measured, is what
differs between the events used. This configuration needs to be done with
X86_CONFIG, which cannot be used as part of the designated initializers
used here; the configuration will thus be set in a later patch.
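
For example, a later patch in this series sets the Goldmont configuration
roughly as follows (sketch):

	perf_miss_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x10);
	perf_hit_attr.config = X86_CONFIG(.event = 0xd1, .umask = 0x2);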

Although they do look identical at this time, the attribute structures
need to be maintained separately since a perf_event will maintain a
pointer to its unique attributes.

In support of patch testing the new structs are given the unused attribute
until they are used in later patches.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 25 +++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index f46c8afe7875..ab93079e9e5b 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -917,6 +917,31 @@ static int measure_cycles_lat_fn(void *_plr)
 	return 0;
 }
 
+/*
+ * Create a perf_event_attr for the hit and miss perf events that will
+ * be used during the performance measurement. A perf_event maintains
+ * a pointer to its perf_event_attr so a unique attribute structure is
+ * created for each perf_event.
+ *
+ * The actual configuration of the event is set right before use in order
+ * to use the X86_CONFIG macro.
+ */
+static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+	.type		= PERF_TYPE_RAW,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 0,
+	.exclude_user	= 1,
+};
+
+static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+	.type		= PERF_TYPE_RAW,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 0,
+	.exclude_user	= 1,
+};
+
 static int measure_cycles_perf_fn(void *_plr)
 {
 	unsigned long long l3_hits = 0, l3_miss = 0;
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (2 preceding siblings ...)
  2018-08-16 20:16 ` [PATCH V2 3/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
@ 2018-08-16 20:16 ` Reinette Chatre
  2018-09-06 14:47   ` Peter Zijlstra
  2018-08-16 20:16 ` [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

In support of the most accurate measurements rdpmcl() is used instead of
the more elaborate perf_event_read_local(). rdpmcl() requires the index
of the performance counter used so a helper is introduced to determine
the index used by a provided performance event.

The index used by a performance event may change when interrupts are
enabled. A check is added to ensure that the index is only accessed
with interrupts disabled. Even with this check the counter index must be
used with care to ensure it is queried and used within the same
interrupts-disabled section.
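
The intended usage pattern is thus roughly as follows (sketch only,
mirroring the measurement code introduced later in this series;
declarations and error handling omitted):

	local_irq_disable();
	/* Validate the event before trusting its counter index. */
	if (perf_event_read_local(hit_event, &tmp, NULL, NULL))
		goto out;
	hit_pmcnum = x86_perf_rdpmc_ctr_get(hit_event);
	rdpmcl(hit_pmcnum, hits_before);
	/* ... the measurement itself ... */
	rdpmcl(hit_pmcnum, hits_after);
	local_irq_enable();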

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index ab93079e9e5b..20b76024701d 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -942,6 +942,27 @@ static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
 	.exclude_user	= 1,
 };
 
+/**
+ * x86_perf_rdpmc_ctr_get - Return PMC counter used for event
+ * @event: the perf_event to which the PMC counter was assigned
+ *
+ * The counter assigned to this performance event may change if interrupts
+ * are enabled. This counter should thus never be used while interrupts are
+ * enabled. Before this function is used to obtain the assigned counter the
+ * event could be checked for validity using, for example,
+ * perf_event_read_local(), within the same interrupt disabled section in
+ * which this counter is planned to be used.
+ *
+ * Return: The index of the performance monitoring counter assigned to
+ * @event, -1 if the event is not valid.
+ */
+static inline int x86_perf_rdpmc_ctr_get(struct perf_event *event)
+{
+	lockdep_assert_irqs_disabled();
+
+	return IS_ERR_OR_NULL(event) ? -1 : event->hw.event_base_rdpmc;
+}
+
 static int measure_cycles_perf_fn(void *_plr)
 {
 	unsigned long long l3_hits = 0, l3_miss = 0;
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (3 preceding siblings ...)
  2018-08-16 20:16 ` [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index Reinette Chatre
@ 2018-08-16 20:16 ` Reinette Chatre
  2018-09-06 14:15   ` Peter Zijlstra
  2018-09-06 14:38   ` Peter Zijlstra
  2018-08-16 20:16 ` [PATCH V2 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
  2018-09-04 16:50 ` [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
  6 siblings, 2 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

The success of a cache pseudo-locked region is measured using
performance monitoring events that are programmed directly at the time
the user requests a measurement.

Modifying the performance event registers directly is not appropriate
since it circumvents the in-kernel perf infrastructure that exists to
manage these resources and provide resource arbitration to the
performance monitoring hardware.

The cache pseudo-locking measurements are modified to use the in-kernel
perf infrastructure. Performance events are created and validated with
the appropriate perf API. The performance counters are still read as
directly as possible to avoid additional cache hits. This is done
safely by first ensuring with the perf API that the counters have been
programmed correctly and by only accessing the counters in an
interrupts-disabled section where they cannot be moved.

As part of the transition to the in-kernel perf infrastructure the L2
and L3 measurements are split into two separate measurements that can
be triggered independently. This separation prevents the additional cache
misses incurred by the extra code used to decide whether an L2 and/or L3
measurement should be made.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt          |  22 +-
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 342 ++++++++++++++------
 2 files changed, 258 insertions(+), 106 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..52b10945ff75 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,18 +520,24 @@ the pseudo-locked region:
 2) Cache hit and miss measurements using model specific precision counters if
    available. Depending on the levels of cache on the system the pseudo_lock_l2
    and pseudo_lock_l3 tracepoints are available.
-   WARNING: triggering this  measurement uses from two (for just L2
-   measurements) to four (for L2 and L3 measurements) precision counters on
-   the system, if any other measurements are in progress the counters and
-   their corresponding event registers will be clobbered.
 
 When a pseudo-locked region is created a new debugfs directory is created for
 it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
 write-only file, pseudo_lock_measure, is present in this directory. The
-measurement on the pseudo-locked region depends on the number, 1 or 2,
-written to this debugfs file. Since the measurements are recorded with the
-tracing infrastructure the relevant tracepoints need to be enabled before the
-measurement is triggered.
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+1 -  writing "1" to the pseudo_lock_measure file will trigger the latency
+     measurement captured in the pseudo_lock_mem_latency tracepoint. See
+     example below.
+2 -  writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l2 tracepoint. See example below.
+3 -  writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+     residency (cache hits and misses) measurement captured in the
+     pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
 
 Example of latency debugging interface:
 In this example a pseudo-locked region named "newlock" was created. Here is
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index 20b76024701d..aa27e8081bae 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -26,6 +26,7 @@
 #include <asm/intel_rdt_sched.h>
 #include <asm/perf_event.h>
 
+#include "../../events/perf_event.h" /* For X86_CONFIG() */
 #include "intel_rdt.h"
 
 #define CREATE_TRACE_POINTS
@@ -106,16 +107,6 @@ static u64 get_prefetch_disable_bits(void)
 	return 0;
 }
 
-/*
- * Helper to write 64bit value to MSR without tracing. Used when
- * use of the cache should be restricted and use of registers used
- * for local variables avoided.
- */
-static inline void pseudo_wrmsrl_notrace(unsigned int msr, u64 val)
-{
-	__wrmsr(msr, (u32)(val & 0xffffffffULL), (u32)(val >> 32));
-}
-
 /**
  * pseudo_lock_minor_get - Obtain available minor number
  * @minor: Pointer to where new minor number will be stored
@@ -926,7 +917,7 @@ static int measure_cycles_lat_fn(void *_plr)
  * The actual configuration of the event is set right before use in order
  * to use the X86_CONFIG macro.
  */
-static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
+static struct perf_event_attr perf_miss_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -934,7 +925,7 @@ static struct perf_event_attr __attribute__((unused)) perf_miss_attr = {
 	.exclude_user	= 1,
 };
 
-static struct perf_event_attr __attribute__((unused)) perf_hit_attr = {
+static struct perf_event_attr perf_hit_attr = {
 	.type		= PERF_TYPE_RAW,
 	.size		= sizeof(struct perf_event_attr),
 	.pinned		= 1,
@@ -963,141 +954,289 @@ static inline int x86_perf_rdpmc_ctr_get(struct perf_event *event)
 	return IS_ERR_OR_NULL(event) ? -1 : event->hw.event_base_rdpmc;
 }
 
-static int measure_cycles_perf_fn(void *_plr)
+static int measure_l2_residency_fn(void *_plr)
 {
-	unsigned long long l3_hits = 0, l3_miss = 0;
-	u64 l3_hit_bits = 0, l3_miss_bits = 0;
+	u64 l2_hits_before, l2_hits_after, l2_miss_before, l2_miss_after;
+	struct perf_event *l2_miss_event, *l2_hit_event;
 	struct pseudo_lock_region *plr = _plr;
-	unsigned long long l2_hits, l2_miss;
-	u64 l2_hit_bits, l2_miss_bits;
+	int l2_hit_pmcnum, l2_miss_pmcnum;
 	unsigned int line_size;
 	unsigned int size;
 	unsigned long i;
 	void *mem_r;
+	u64 tmp;
 
 	/*
 	 * Non-architectural event for the Goldmont Microarchitecture
 	 * from Intel x86 Architecture Software Developer Manual (SDM):
 	 * MEM_LOAD_UOPS_RETIRED D1H (event number)
 	 * Umask values:
-	 *     L1_HIT   01H
 	 *     L2_HIT   02H
-	 *     L1_MISS  08H
 	 *     L2_MISS  10H
+	 */
+	switch (boot_cpu_data.x86_model) {
+	case INTEL_FAM6_ATOM_GOLDMONT:
+	case INTEL_FAM6_ATOM_GEMINI_LAKE:
+		perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+						   .umask = 0x10);
+		perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+						  .umask = 0x2);
+		break;
+	default:
+		goto out;
+	}
+
+	l2_miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
+							 plr->cpu,
+							 NULL, NULL, NULL);
+	if (IS_ERR(l2_miss_event))
+		goto out;
+
+	l2_hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
+							plr->cpu,
+							NULL, NULL, NULL);
+	if (IS_ERR(l2_hit_event))
+		goto out_l2_miss;
+
+	local_irq_disable();
+	/*
+	 * Check any possible error state of events used by performing
+	 * one local read.
+	 */
+	if (perf_event_read_local(l2_miss_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_l2_hit;
+	}
+	if (perf_event_read_local(l2_hit_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_l2_hit;
+	}
+
+	/*
+	 * Disable hardware prefetchers.
 	 *
+	 * Call wrmsr direcly to avoid the local register variables from
+	 * being overwritten due to reordering of their assignment with
+	 * the wrmsr calls.
+	 */
+	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+
+	/* Initialize rest of local variables */
+	/*
+	 * Performance event has been validated right before this with
+	 * interrupts disabled - it is thus safe to read the counter index.
+	 */
+	l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
+	l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
+	line_size = plr->line_size;
+	mem_r = plr->kmem;
+	size = plr->size;
+
+	/*
+	 * Read counter variables twice - first to load the instructions
+	 * used in L1 cache, second to capture accurate value that does not
+	 * include cache misses incurred because of instruction loads.
+	 */
+	rdpmcl(l2_hit_pmcnum, l2_hits_before);
+	rdpmcl(l2_miss_pmcnum, l2_miss_before);
+	/*
+	 * From SDM: Performing back-to-back fast reads are not guaranteed
+	 * to be monotonic. To guarantee monotonicity on back-toback reads,
+	 * a serializing instruction must be placed between the two
+	 * RDPMC instructions
+	 */
+	rmb();
+	rdpmcl(l2_hit_pmcnum, l2_hits_before);
+	rdpmcl(l2_miss_pmcnum, l2_miss_before);
+	/*
+	 * rdpmc is not a serializing instruction. Add barrier to prevent
+	 * instructions that follow to begin executing before reading the
+	 * counter value.
+	 */
+	rmb();
+	for (i = 0; i < size; i += line_size) {
+		/*
+		 * Add a barrier to prevent speculative execution of this
+		 * loop reading beyond the end of the buffer.
+		 */
+		rmb();
+		asm volatile("mov (%0,%1,1), %%eax\n\t"
+			     :
+			     : "r" (mem_r), "r" (i)
+			     : "%eax", "memory");
+	}
+	rdpmcl(l2_hit_pmcnum, l2_hits_after);
+	rdpmcl(l2_miss_pmcnum, l2_miss_after);
+	/*
+	 * rdpmc is not a serializing instruction. Add barrier to ensure
+	 * events measured have completed and prevent instructions that
+	 * follow to begin executing before reading the counter value.
+	 */
+	rmb();
+	/* Re-enable hardware prefetchers */
+	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
+	local_irq_enable();
+	trace_pseudo_lock_l2(l2_hits_after - l2_hits_before,
+			     l2_miss_after - l2_miss_before);
+
+out_l2_hit:
+	perf_event_release_kernel(l2_hit_event);
+out_l2_miss:
+	perf_event_release_kernel(l2_miss_event);
+out:
+	plr->thread_done = 1;
+	wake_up_interruptible(&plr->lock_thread_wq);
+	return 0;
+}
+
+static int measure_l3_residency_fn(void *_plr)
+{
+	u64 l3_hits_before, l3_hits_after, l3_miss_before, l3_miss_after;
+	struct perf_event *l3_miss_event, *l3_hit_event;
+	struct pseudo_lock_region *plr = _plr;
+	int l3_hit_pmcnum, l3_miss_pmcnum;
+	unsigned int line_size;
+	unsigned int size;
+	unsigned long i;
+	void *mem_r;
+	u64 tmp;
+
+	/*
 	 * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
 	 * has two "no fix" errata associated with it: BDM35 and BDM100. On
-	 * this platform we use the following events instead:
-	 *  L2_RQSTS 24H (Documented in https://download.01.org/perfmon/BDW/)
-	 *       REFERENCES FFH
-	 *       MISS       3FH
-	 *  LONGEST_LAT_CACHE 2EH (Documented in SDM)
+	 * this platform the following events are used instead:
+	 * LONGEST_LAT_CACHE 2EH (Documented in SDM)
 	 *       REFERENCE 4FH
 	 *       MISS      41H
 	 */
 
-	/*
-	 * Start by setting flags for IA32_PERFEVTSELx:
-	 *     OS  (Operating system mode)  0x2
-	 *     INT (APIC interrupt enable)  0x10
-	 *     EN  (Enable counter)         0x40
-	 *
-	 * Then add the Umask value and event number to select performance
-	 * event.
-	 */
-
 	switch (boot_cpu_data.x86_model) {
-	case INTEL_FAM6_ATOM_GOLDMONT:
-	case INTEL_FAM6_ATOM_GEMINI_LAKE:
-		l2_hit_bits = (0x52ULL << 16) | (0x2 << 8) | 0xd1;
-		l2_miss_bits = (0x52ULL << 16) | (0x10 << 8) | 0xd1;
-		break;
 	case INTEL_FAM6_BROADWELL_X:
-		/* On BDW the l2_hit_bits count references, not hits */
-		l2_hit_bits = (0x52ULL << 16) | (0xff << 8) | 0x24;
-		l2_miss_bits = (0x52ULL << 16) | (0x3f << 8) | 0x24;
 		/* On BDW the l3_hit_bits count references, not hits */
-		l3_hit_bits = (0x52ULL << 16) | (0x4f << 8) | 0x2e;
-		l3_miss_bits = (0x52ULL << 16) | (0x41 << 8) | 0x2e;
+		perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+						  .umask = 0x4f);
+		perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+						   .umask = 0x41);
 		break;
 	default:
 		goto out;
 	}
 
+	l3_miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
+							 plr->cpu,
+							 NULL, NULL,
+							 NULL);
+	if (IS_ERR(l3_miss_event))
+		goto out;
+
+	l3_hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
+							plr->cpu,
+							NULL, NULL,
+							NULL);
+	if (IS_ERR(l3_hit_event))
+		goto out_l3_miss;
+
 	local_irq_disable();
 	/*
+	 * Check any possible error state of events used by performing
+	 * one local read.
+	 */
+	if (perf_event_read_local(l3_miss_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_l3_hit;
+	}
+	if (perf_event_read_local(l3_hit_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_l3_hit;
+	}
+
+	/*
+	 * Disable hardware prefetchers.
+	 *
 	 * Call wrmsr direcly to avoid the local register variables from
 	 * being overwritten due to reordering of their assignment with
 	 * the wrmsr calls.
 	 */
 	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
-	/* Disable events and reset counters */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0, 0x0);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 1, 0x0);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 2, 0x0);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_PERFCTR0 + 3, 0x0);
-	}
-	/* Set and enable the L2 counters */
+
+	/* Initialize rest of local variables */
+	/*
+	 * Performance event has been validated right before this with
+	 * interrupts disabled - it is thus safe to read the counter index.
+	 */
+	l3_hit_pmcnum = x86_perf_rdpmc_ctr_get(l3_hit_event);
+	l3_miss_pmcnum = x86_perf_rdpmc_ctr_get(l3_miss_event);
+	line_size = plr->line_size;
 	mem_r = plr->kmem;
 	size = plr->size;
-	line_size = plr->line_size;
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0, l2_hit_bits);
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1, l2_miss_bits);
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits);
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits);
-	}
+
+	/*
+	 * Read counter variables twice - first to load the instructions
+	 * used in L1 cache, second to capture accurate value that does not
+	 * include cache misses incurred because of instruction loads.
+	 */
+	rdpmcl(l3_hit_pmcnum, l3_hits_before);
+	rdpmcl(l3_miss_pmcnum, l3_miss_before);
+	/*
+	 * From SDM: Performing back-to-back fast reads are not guaranteed
+	 * to be monotonic. To guarantee monotonicity on back-toback reads,
+	 * a serializing instruction must be placed between the two
+	 * RDPMC instructions
+	 */
+	rmb();
+	rdpmcl(l3_hit_pmcnum, l3_hits_before);
+	rdpmcl(l3_miss_pmcnum, l3_miss_before);
+	/*
+	 * rdpmc is not a serializing instruction. Add barrier to prevent
+	 * instructions that follow to begin executing before reading the
+	 * counter value.
+	 */
+	rmb();
 	for (i = 0; i < size; i += line_size) {
+		/*
+		 * Add a barrier to prevent speculative execution of this
+		 * loop reading beyond the end of the buffer.
+		 */
+		rmb();
 		asm volatile("mov (%0,%1,1), %%eax\n\t"
 			     :
 			     : "r" (mem_r), "r" (i)
 			     : "%eax", "memory");
 	}
+	rdpmcl(l3_hit_pmcnum, l3_hits_after);
+	rdpmcl(l3_miss_pmcnum, l3_miss_after);
 	/*
-	 * Call wrmsr directly (no tracing) to not influence
-	 * the cache access counters as they are disabled.
+	 * rdpmc is not a serializing instruction. Add barrier to ensure
+	 * events measured have completed and prevent instructions that
+	 * follow to begin executing before reading the counter value.
 	 */
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0,
-			      l2_hit_bits & ~(0x40ULL << 16));
-	pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 1,
-			      l2_miss_bits & ~(0x40ULL << 16));
-	if (l3_hit_bits > 0) {
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 2,
-				      l3_hit_bits & ~(0x40ULL << 16));
-		pseudo_wrmsrl_notrace(MSR_ARCH_PERFMON_EVENTSEL0 + 3,
-				      l3_miss_bits & ~(0x40ULL << 16));
-	}
-	l2_hits = native_read_pmc(0);
-	l2_miss = native_read_pmc(1);
-	if (l3_hit_bits > 0) {
-		l3_hits = native_read_pmc(2);
-		l3_miss = native_read_pmc(3);
-	}
+	rmb();
+	/* Re-enable hardware prefetchers */
 	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
 	local_irq_enable();
-	/*
-	 * On BDW we count references and misses, need to adjust. Sometimes
-	 * the "hits" counter is a bit more than the references, for
-	 * example, x references but x + 1 hits. To not report invalid
-	 * hit values in this case we treat that as misses eaqual to
-	 * references.
-	 */
-	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-		l2_hits -= (l2_miss > l2_hits ? l2_hits : l2_miss);
-	trace_pseudo_lock_l2(l2_hits, l2_miss);
-	if (l3_hit_bits > 0) {
-		if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X)
-			l3_hits -= (l3_miss > l3_hits ? l3_hits : l3_miss);
-		trace_pseudo_lock_l3(l3_hits, l3_miss);
+	l3_miss_after -= l3_miss_before;
+	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+		/*
+		 * On BDW references and misses are counted, need to adjust.
+		 * Sometimes the "hits" counter is a bit more than the
+		 * references, for example, x references but x + 1 hits.
+		 * To not report invalid hit values in this case we treat
+		 * that as misses equal to references.
+		 */
+		/* First compute the number of cache references measured */
+		l3_hits_after -= l3_hits_before;
+		/* Next convert references to cache hits */
+		l3_hits_after -= l3_miss_after > l3_hits_after ?
+					l3_hits_after : l3_miss_after;
+	} else {
+		l3_hits_after -= l3_hits_before;
 	}
+	trace_pseudo_lock_l3(l3_hits_after, l3_miss_after);
 
+out_l3_hit:
+	perf_event_release_kernel(l3_hit_event);
+out_l3_miss:
+	perf_event_release_kernel(l3_miss_event);
 out:
 	plr->thread_done = 1;
 	wake_up_interruptible(&plr->lock_thread_wq);
@@ -1136,13 +1275,20 @@ static int pseudo_lock_measure_cycles(struct rdtgroup *rdtgrp, int sel)
 		goto out;
 	}
 
+	plr->cpu = cpu;
+
 	if (sel == 1)
 		thread = kthread_create_on_node(measure_cycles_lat_fn, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
 	else if (sel == 2)
-		thread = kthread_create_on_node(measure_cycles_perf_fn, plr,
+		thread = kthread_create_on_node(measure_l2_residency_fn, plr,
+						cpu_to_node(cpu),
+						"pseudo_lock_measure/%u",
+						cpu);
+	else if (sel == 3)
+		thread = kthread_create_on_node(measure_l3_residency_fn, plr,
 						cpu_to_node(cpu),
 						"pseudo_lock_measure/%u",
 						cpu);
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH V2 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (4 preceding siblings ...)
  2018-08-16 20:16 ` [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
@ 2018-08-16 20:16 ` Reinette Chatre
  2018-09-04 16:50 ` [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
  6 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-08-16 20:16 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Measurements of pseudo-locked regions were disabled because of their
incorrect usage of the performance monitoring hardware.

Cache pseudo-locking measurements are now done correctly with the
in-kernel perf API and the measurements can thus be re-enabled.

The transition to the in-kernel perf API also separated the L2 and L3
measurements, which can now be triggered independently from user space.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index aa27e8081bae..ab371eb819fd 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -1332,7 +1332,7 @@ static ssize_t pseudo_lock_measure_trigger(struct file *file,
 	buf[buf_size] = '\0';
 	ret = kstrtoint(buf, 10, &sel);
 	if (ret == 0) {
-		if (sel != 1)
+		if (sel != 1 && sel != 2 && sel != 3)
 			return -EINVAL;
 		ret = debugfs_file_get(file->f_path.dentry);
 		if (ret)
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf
  2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
                   ` (5 preceding siblings ...)
  2018-08-16 20:16 ` [PATCH V2 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
@ 2018-09-04 16:50 ` Reinette Chatre
  6 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-09-04 16:50 UTC (permalink / raw)
  To: tglx, fenghua.yu, tony.luck, peterz, mingo, acme, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Dear Maintainers,

This series is still needed to complete the initial cache pseudo-locking
implementation and still applies cleanly to both x86/cache of tip.git as
well as v4.19-rc2.

At this time users are unable to measure the success of their cache
pseudo-locked regions and either need to manually apply this series or
revert the commit that disables the current measurement mechanisms
"4a7a54a55e7237386cacc73f173e74329773ac93 x86/intel_rdt: Disable PMU
access".

Is there perhaps an opportunity to include this series into v4.19?
Patches 2-6 from this series fix code that does not exist in earlier
kernels. Patch 1 was suggested by Peter and is a fix to perf.

Your consideration would be greatly appreciated.

Thank you

Reinette

On 8/16/2018 1:16 PM, Reinette Chatre wrote:
> Dear Maintainers,
> 
> This is the second attempt at fixing the lack of coordination between the
> pseudo-locking measurement code and perf. Thank you very much for your
> feedback on the first version. The entire solution, including the cover
> letter, has been reworked based on your feedback. While this is submitted
> as a V2, none of the patches from V1 remain.
> 
> Changes since V1:
> - Use in-kernel interface to perf.
> - Do not write directly to PMU registers.
> - Do not introduce another PMU owner. perf maintains its role of performing
>   resource arbitration for the PMU.
> - User space is able to use perf and resctrl at the same time.
> - event_base_rdpmc is accessed and used only within an interrupts-disabled
>   section.
> - Internals of events are never accessed directly; an inline function is used.
> - Due to "pinned" usage the scheduling of the event may have failed. The
>   error state is checked in the recommended way and credible error handling
>   is in place.
> - Use X86_CONFIG to build the event configurations.
> 
> This code is based on the x86/cache branch of tip.git
> 
> The success of Cache Pseudo-Locking, defined by how many cache lines from
> a physical memory region have been locked to cache, can be measured via
> the use of hardware performance events. Specifically, the number of cache
> hits and misses observed when reading a memory region after it has been
> pseudo-locked to cache is counted. This measurement is triggered via the
> resctrl debugfs interface.
> 
> The current solution accesses performance counters and their configuration
> registers directly without coordination with other performance event users
> (perf).
> Two of the issues with the current solution are:
> - By writing to the performance monitoring registers directly a new owner
>   for these resources is introduced. The perf infrastructure already exists
>   to perform resource arbitration and this in-kernel infrastructure should
>   be used to do so.
> - The current lack of coordination with perf will have consequences any time
>   two users, for example perf and cache pseudo-locking, attempt to do any
>   kind of measurement at the same time.
> 
> In this series the measurement of Cache Pseudo-Lock regions is moved to use
> the in-kernel interface to perf. During the rework of the measurement
> function the L2 and L3 cache measurements are separated to avoid the
> additional code needed to decide between them, code that itself causes
> unrelated cache hits and misses.
> 
> Your feedback on this work will be greatly appreciated.
> 
> Reinette
> 
> Reinette Chatre (6):
>   perf/core: Add sanity check to deal with pinned event failure
>   x86/intel_rdt: Remove local register variables
>   x86/intel_rdt: Create required perf event attributes
>   x86/intel_rdt: Add helper to obtain performance counter index
>   x86/intel_rdt: Use perf infrastructure for measurements
>   x86/intel_rdt: Re-enable pseudo-lock measurements
> 
>  Documentation/x86/intel_rdt_ui.txt          |  22 +-
>  arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 419 ++++++++++++++------
>  kernel/events/core.c                        |   6 +
>  3 files changed, 310 insertions(+), 137 deletions(-)
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-08-16 20:16 ` [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
@ 2018-09-06 14:15   ` Peter Zijlstra
  2018-09-06 19:21     ` Reinette Chatre
  2018-09-06 14:38   ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-09-06 14:15 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Aug 16, 2018 at 01:16:08PM -0700, Reinette Chatre wrote:
> +	l2_miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
> +							 plr->cpu,
> +							 NULL, NULL, NULL);
> +	if (IS_ERR(l2_miss_event))
> +		goto out;
> +
> +	l2_hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
> +							plr->cpu,
> +							NULL, NULL, NULL);
> +	if (IS_ERR(l2_hit_event))
> +		goto out_l2_miss;
> +
> +	local_irq_disable();
> +	/*
> +	 * Check any possible error state of events used by performing
> +	 * one local read.
> +	 */
> +	if (perf_event_read_local(l2_miss_event, &tmp, NULL, NULL)) {
> +		local_irq_enable();
> +		goto out_l2_hit;
> +	}
> +	if (perf_event_read_local(l2_hit_event, &tmp, NULL, NULL)) {
> +		local_irq_enable();
> +		goto out_l2_hit;
> +	}
> +
> +	/*
> +	 * Disable hardware prefetchers.
>  	 *
> +	 * Call wrmsr direcly to avoid the local register variables from
> +	 * being overwritten due to reordering of their assignment with
> +	 * the wrmsr calls.
> +	 */
> +	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
> +
> +	/* Initialize rest of local variables */
> +	/*
> +	 * Performance event has been validated right before this with
> +	 * interrupts disabled - it is thus safe to read the counter index.
> +	 */
> +	l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
> +	l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
> +	line_size = plr->line_size;
> +	mem_r = plr->kmem;
> +	size = plr->size;
> +
> +	/*
> +	 * Read counter variables twice - first to load the instructions
> +	 * used in L1 cache, second to capture accurate value that does not
> +	 * include cache misses incurred because of instruction loads.
> +	 */
> +	rdpmcl(l2_hit_pmcnum, l2_hits_before);
> +	rdpmcl(l2_miss_pmcnum, l2_miss_before);
> +	/*
> +	 * From SDM: Performing back-to-back fast reads are not guaranteed
> +	 * to be monotonic. To guarantee monotonicity on back-toback reads,
> +	 * a serializing instruction must be placed between the two
> +	 * RDPMC instructions
> +	 */
> +	rmb();
> +	rdpmcl(l2_hit_pmcnum, l2_hits_before);
> +	rdpmcl(l2_miss_pmcnum, l2_miss_before);
> +	/*
> +	 * rdpmc is not a serializing instruction. Add barrier to prevent
> +	 * instructions that follow to begin executing before reading the
> +	 * counter value.
> +	 */
> +	rmb();
> +	for (i = 0; i < size; i += line_size) {
> +		/*
> +		 * Add a barrier to prevent speculative execution of this
> +		 * loop reading beyond the end of the buffer.
> +		 */
> +		rmb();
> +		asm volatile("mov (%0,%1,1), %%eax\n\t"
> +			     :
> +			     : "r" (mem_r), "r" (i)
> +			     : "%eax", "memory");
> +	}
> +	rdpmcl(l2_hit_pmcnum, l2_hits_after);
> +	rdpmcl(l2_miss_pmcnum, l2_miss_after);
> +	/*
> +	 * rdpmc is not a serializing instruction. Add barrier to ensure
> +	 * events measured have completed and prevent instructions that
> +	 * follow to begin executing before reading the counter value.
> +	 */
> +	rmb();
> +	/* Re-enable hardware prefetchers */
> +	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
> +	local_irq_enable();
> +	trace_pseudo_lock_l2(l2_hits_after - l2_hits_before,
> +			     l2_miss_after - l2_miss_before);
> +
> +out_l2_hit:
> +	perf_event_release_kernel(l2_hit_event);
> +out_l2_miss:
> +	perf_event_release_kernel(l2_miss_event);
> +out:
> +	plr->thread_done = 1;
> +	wake_up_interruptible(&plr->lock_thread_wq);
> +	return 0;
> +}
> +

The above looks a _LOT_ like the below. And while C does suck a little,
I'm sure there's something we can do about this.

> +	l3_miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
> +							 plr->cpu,
> +							 NULL, NULL,
> +							 NULL);
> +	if (IS_ERR(l3_miss_event))
> +		goto out;
> +
> +	l3_hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
> +							plr->cpu,
> +							NULL, NULL,
> +							NULL);
> +	if (IS_ERR(l3_hit_event))
> +		goto out_l3_miss;
> +
>  	local_irq_disable();
>  	/*
> +	 * Check any possible error state of events used by performing
> +	 * one local read.
> +	 */
> +	if (perf_event_read_local(l3_miss_event, &tmp, NULL, NULL)) {
> +		local_irq_enable();
> +		goto out_l3_hit;
> +	}
> +	if (perf_event_read_local(l3_hit_event, &tmp, NULL, NULL)) {
> +		local_irq_enable();
> +		goto out_l3_hit;
> +	}
> +
> +	/*
> +	 * Disable hardware prefetchers.
> +	 *
>  	 * Call wrmsr direcly to avoid the local register variables from
>  	 * being overwritten due to reordering of their assignment with
>  	 * the wrmsr calls.
>  	 */
>  	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
> +
> +	/* Initialize rest of local variables */
> +	/*
> +	 * Performance event has been validated right before this with
> +	 * interrupts disabled - it is thus safe to read the counter index.
> +	 */
> +	l3_hit_pmcnum = x86_perf_rdpmc_ctr_get(l3_hit_event);
> +	l3_miss_pmcnum = x86_perf_rdpmc_ctr_get(l3_miss_event);
> +	line_size = plr->line_size;
>  	mem_r = plr->kmem;
>  	size = plr->size;
> +
> +	/*
> +	 * Read counter variables twice - first to load the instructions
> +	 * used in L1 cache, second to capture accurate value that does not
> +	 * include cache misses incurred because of instruction loads.
> +	 */
> +	rdpmcl(l3_hit_pmcnum, l3_hits_before);
> +	rdpmcl(l3_miss_pmcnum, l3_miss_before);
> +	/*
> +	 * From SDM: Performing back-to-back fast reads are not guaranteed
> +	 * to be monotonic. To guarantee monotonicity on back-toback reads,
> +	 * a serializing instruction must be placed between the two
> +	 * RDPMC instructions
> +	 */
> +	rmb();
> +	rdpmcl(l3_hit_pmcnum, l3_hits_before);
> +	rdpmcl(l3_miss_pmcnum, l3_miss_before);
> +	/*
> +	 * rdpmc is not a serializing instruction. Add barrier to prevent
> +	 * instructions that follow to begin executing before reading the
> +	 * counter value.
> +	 */
> +	rmb();
>  	for (i = 0; i < size; i += line_size) {
> +		/*
> +		 * Add a barrier to prevent speculative execution of this
> +		 * loop reading beyond the end of the buffer.
> +		 */
> +		rmb();
>  		asm volatile("mov (%0,%1,1), %%eax\n\t"
>  			     :
>  			     : "r" (mem_r), "r" (i)
>  			     : "%eax", "memory");
>  	}
> +	rdpmcl(l3_hit_pmcnum, l3_hits_after);
> +	rdpmcl(l3_miss_pmcnum, l3_miss_after);
>  	/*
> +	 * rdpmc is not a serializing instruction. Add barrier to ensure
> +	 * events measured have completed and prevent instructions that
> +	 * follow to begin executing before reading the counter value.
>  	 */
> +	rmb();
> +	/* Re-enable hardware prefetchers */
>  	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
>  	local_irq_enable();
> +	l3_miss_after -= l3_miss_before;
> +	if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
> +		/*
> +		 * On BDW references and misses are counted, need to adjust.
> +		 * Sometimes the "hits" counter is a bit more than the
> +		 * references, for example, x references but x + 1 hits.
> +		 * To not report invalid hit values in this case we treat
> +		 * that as misses equal to references.
> +		 */
> +		/* First compute the number of cache references measured */
> +		l3_hits_after -= l3_hits_before;
> +		/* Next convert references to cache hits */
> +		l3_hits_after -= l3_miss_after > l3_hits_after ?
> +					l3_hits_after : l3_miss_after;
> +	} else {
> +		l3_hits_after -= l3_hits_before;
>  	}
> +	trace_pseudo_lock_l3(l3_hits_after, l3_miss_after);
>  
> +out_l3_hit:
> +	perf_event_release_kernel(l3_hit_event);
> +out_l3_miss:
> +	perf_event_release_kernel(l3_miss_event);
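
IOW, something like the below (hypothetical shape only, not compile
tested); the model specific attr setup and the trace_pseudo_lock_l{2,3}()
/ BDW fixup go in thin per-level wrappers while the measurement itself is
shared:

struct residency_counts {
	u64 miss_before, hits_before;
	u64 miss_after, hits_after;
};

static int measure_residency_fn(struct perf_event_attr *miss_attr,
				struct perf_event_attr *hit_attr,
				struct pseudo_lock_region *plr,
				struct residency_counts *counts)
{
	u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
	struct perf_event *miss_event, *hit_event;
	int hit_pmcnum, miss_pmcnum;
	unsigned int line_size, size;
	unsigned long i;
	void *mem_r;
	u64 tmp;

	miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
						      NULL, NULL, NULL);
	if (IS_ERR(miss_event))
		goto out;

	hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
						     NULL, NULL, NULL);
	if (IS_ERR(hit_event))
		goto out_miss;

	local_irq_disable();
	/* Check any possible error state of the events with one local read. */
	if (perf_event_read_local(miss_event, &tmp, NULL, NULL) ||
	    perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
		local_irq_enable();
		goto out_hit;
	}

	/* Disable hardware prefetchers. */
	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);

	miss_pmcnum = x86_perf_rdpmc_ctr_get(miss_event);
	hit_pmcnum = x86_perf_rdpmc_ctr_get(hit_event);
	line_size = plr->line_size;
	mem_r = plr->kmem;
	size = plr->size;

	/* Prime the counter reads, then read them for real. */
	rdpmcl(hit_pmcnum, hits_before);
	rdpmcl(miss_pmcnum, miss_before);
	rmb();
	rdpmcl(hit_pmcnum, hits_before);
	rdpmcl(miss_pmcnum, miss_before);
	rmb();
	for (i = 0; i < size; i += line_size) {
		rmb();
		asm volatile("mov (%0,%1,1), %%eax\n\t"
			     :
			     : "r" (mem_r), "r" (i)
			     : "%eax", "memory");
	}
	rdpmcl(hit_pmcnum, hits_after);
	rdpmcl(miss_pmcnum, miss_after);
	rmb();
	/* Re-enable hardware prefetchers */
	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
	local_irq_enable();

out_hit:
	perf_event_release_kernel(hit_event);
out_miss:
	perf_event_release_kernel(miss_event);
out:
	counts->miss_before = miss_before;
	counts->hits_before = hits_before;
	counts->miss_after = miss_after;
	counts->hits_after = hits_after;
	return 0;
}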

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-08-16 20:16 ` [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
  2018-09-06 14:15   ` Peter Zijlstra
@ 2018-09-06 14:38   ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2018-09-06 14:38 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Aug 16, 2018 at 01:16:08PM -0700, Reinette Chatre wrote:
> +	l2_miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
> +							 plr->cpu,
> +							 NULL, NULL, NULL);
> +	if (IS_ERR(l2_miss_event))
> +		goto out;
> +
> +	l2_hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
> +							plr->cpu,
> +							NULL, NULL, NULL);
> +	if (IS_ERR(l2_hit_event))
> +		goto out_l2_miss;
> +
> +	local_irq_disable();
> +	/*
> +	 * Check any possible error state of events used by performing
> +	 * one local read.
> +	 */
> +	if (perf_event_read_local(l2_miss_event, &tmp, NULL, NULL)) {
> +		local_irq_enable();
> +		goto out_l2_hit;
> +	}
> +	if (perf_event_read_local(l2_hit_event, &tmp, NULL, NULL)) {
> +		local_irq_enable();
> +		goto out_l2_hit;
> +	}
> +
> +	/*
> +	 * Disable hardware prefetchers.
>  	 *
> +	 * Call wrmsr direcly to avoid the local register variables from
> +	 * being overwritten due to reordering of their assignment with
> +	 * the wrmsr calls.
> +	 */
> +	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);

So what about virt?

> +
> +	/* Initialize rest of local variables */
> +	/*
> +	 * Performance event has been validated right before this with
> +	 * interrupts disabled - it is thus safe to read the counter index.
> +	 */
> +	l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
> +	l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
> +	line_size = plr->line_size;
> +	mem_r = plr->kmem;
> +	size = plr->size;

You probably want READ_ONCE() on that, the volatile cast in there
disallows the compiler from re-loading the values later.
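
I.e. something like (sketch):

	line_size = READ_ONCE(plr->line_size);
	mem_r = READ_ONCE(plr->kmem);
	size = READ_ONCE(plr->size);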

> +
> +	/*
> +	 * Read counter variables twice - first to load the instructions
> +	 * used in L1 cache, second to capture accurate value that does not
> +	 * include cache misses incurred because of instruction loads.
> +	 */
> +	rdpmcl(l2_hit_pmcnum, l2_hits_before);

And this again does do virt.

> +	rdpmcl(l2_miss_pmcnum, l2_miss_before);
> +	/*
> +	 * From SDM: Performing back-to-back fast reads are not guaranteed
> +	 * to be monotonic. To guarantee monotonicity on back-toback reads,
> +	 * a serializing instruction must be placed between the two
> +	 * RDPMC instructions
> +	 */
> +	rmb();

You're copying the horrid horrid (did I say truly horrid?) use of
'serializing' from the SDM. Please don't do that.

LFENCE is not a serializing instruction. But given the (new) definition
LFENCE does ensure all prior instructions are retired before it
proceeds.

> +	rdpmcl(l2_hit_pmcnum, l2_hits_before);
> +	rdpmcl(l2_miss_pmcnum, l2_miss_before);
> +	/*
> +	 * rdpmc is not a serializing instruction. Add barrier to prevent
> +	 * instructions that follow to begin executing before reading the
> +	 * counter value.
> +	 */
> +	rmb();
> +	for (i = 0; i < size; i += line_size) {
> +		/*
> +		 * Add a barrier to prevent speculative execution of this
> +		 * loop reading beyond the end of the buffer.
> +		 */
> +		rmb();
> +		asm volatile("mov (%0,%1,1), %%eax\n\t"
> +			     :
> +			     : "r" (mem_r), "r" (i)
> +			     : "%eax", "memory");

Why does that need to be asm?

> +	}

I think you want another LFENCE here, to ensure the RDPMCs don't overlap
with the last LOAD in the loop above.
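
That is, something like (sketch):

	/* Ensure the last load above retires before the RDPMCs below. */
	asm volatile("lfence" ::: "memory");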

> +	rdpmcl(l2_hit_pmcnum, l2_hits_after);
> +	rdpmcl(l2_miss_pmcnum, l2_miss_after);
> +	/*
> +	 * rdpmc is not a serializing instruction. Add barrier to ensure
> +	 * events measured have completed and prevent instructions that
> +	 * follow to begin executing before reading the counter value.
> +	 */
> +	rmb();
> +	/* Re-enable hardware prefetchers */
> +	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);


So what I do in userspace is:

	mmap_read_pinned(ctx); /* prime */

	for (many-times) {

		cnt = mmap_read_pinned(evt);
		barrier();
		cnt = mmap_read_pinned(evt) - cnt;
		update_stats(&empty, cnt);

		cnt = mmap_read_pinned(evt);
		barrier();
		/* the thing */
		barrier();
		cnt = mmap_read_pinned(evt) - cnt;
		update_stats(&stat, cnt);

	}

	sub_stats(&stat, &empty);

Maybe I should've used asm("lfence" ::: "memory") instead of barrier(),
but the results were good enough.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index
  2018-08-16 20:16 ` [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index Reinette Chatre
@ 2018-09-06 14:47   ` Peter Zijlstra
  2018-09-06 23:26     ` Reinette Chatre
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-09-06 14:47 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Aug 16, 2018 at 01:16:07PM -0700, Reinette Chatre wrote:

> +static inline int x86_perf_rdpmc_ctr_get(struct perf_event *event)
> +{
> +	lockdep_assert_irqs_disabled();
> +
> +	return IS_ERR_OR_NULL(event) ? -1 : event->hw.event_base_rdpmc;
> +}

That should be in arch/x86/include/asm/perf_event.h if anywhere. Also,
call the thing x86_perf_rdpmc_index(), that's consistent with the other
naming.

I don't think there's any point in testing for !event, this is an
interface that mandates you know wth you're doing anyway.
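
So roughly (sketch, untested):

	/* arch/x86/include/asm/perf_event.h */
	static inline int x86_perf_rdpmc_index(struct perf_event *event)
	{
		lockdep_assert_irqs_disabled();

		return event->hw.event_base_rdpmc;
	}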

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-06 14:15   ` Peter Zijlstra
@ 2018-09-06 19:21     ` Reinette Chatre
  2018-09-06 19:44       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2018-09-06 19:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Hi Peter,

On 9/6/2018 7:15 AM, Peter Zijlstra wrote:
> On Thu, Aug 16, 2018 at 01:16:08PM -0700, Reinette Chatre wrote:
>> +	l2_miss_event = perf_event_create_kernel_counter(&perf_miss_attr,
>> +							 plr->cpu,
>> +							 NULL, NULL, NULL);
>> +	if (IS_ERR(l2_miss_event))
>> +		goto out;
>> +
>> +	l2_hit_event = perf_event_create_kernel_counter(&perf_hit_attr,
>> +							plr->cpu,
>> +							NULL, NULL, NULL);
>> +	if (IS_ERR(l2_hit_event))
>> +		goto out_l2_miss;
>> +
>> +	local_irq_disable();
>> +	/*
>> +	 * Check any possible error state of events used by performing
>> +	 * one local read.
>> +	 */
>> +	if (perf_event_read_local(l2_miss_event, &tmp, NULL, NULL)) {
>> +		local_irq_enable();
>> +		goto out_l2_hit;
>> +	}
>> +	if (perf_event_read_local(l2_hit_event, &tmp, NULL, NULL)) {
>> +		local_irq_enable();
>> +		goto out_l2_hit;
>> +	}
>> +
>> +	/*
>> +	 * Disable hardware prefetchers.
>>  	 *
>> +	 * Call wrmsr directly to prevent the local register variables from
>> +	 * being overwritten due to reordering of their assignment with
>> +	 * the wrmsr calls.
>> +	 */
>> +	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
>> +
>> +	/* Initialize rest of local variables */
>> +	/*
>> +	 * Performance event has been validated right before this with
>> +	 * interrupts disabled - it is thus safe to read the counter index.
>> +	 */
>> +	l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
>> +	l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
>> +	line_size = plr->line_size;
>> +	mem_r = plr->kmem;
>> +	size = plr->size;
>> +
>> +	/*
>> +	 * Read counter variables twice - first to load the instructions
>> +	 * used in L1 cache, second to capture an accurate value that does not
>> +	 * include cache misses incurred because of instruction loads.
>> +	 */
>> +	rdpmcl(l2_hit_pmcnum, l2_hits_before);
>> +	rdpmcl(l2_miss_pmcnum, l2_miss_before);
>> +	/*
>> +	 * From SDM: Performing back-to-back fast reads are not guaranteed
>> +	 * to be monotonic. To guarantee monotonicity on back-to-back reads,
>> +	 * a serializing instruction must be placed between the two
>> +	 * RDPMC instructions.
>> +	 */
>> +	rmb();
>> +	rdpmcl(l2_hit_pmcnum, l2_hits_before);
>> +	rdpmcl(l2_miss_pmcnum, l2_miss_before);
>> +	/*
>> +	 * rdpmc is not a serializing instruction. Add barrier to prevent
>> +	 * instructions that follow to begin executing before reading the
>> +	 * counter value.
>> +	 */
>> +	rmb();
>> +	for (i = 0; i < size; i += line_size) {
>> +		/*
>> +		 * Add a barrier to prevent speculative execution of this
>> +		 * loop reading beyond the end of the buffer.
>> +		 */
>> +		rmb();
>> +		asm volatile("mov (%0,%1,1), %%eax\n\t"
>> +			     :
>> +			     : "r" (mem_r), "r" (i)
>> +			     : "%eax", "memory");
>> +	}
>> +	rdpmcl(l2_hit_pmcnum, l2_hits_after);
>> +	rdpmcl(l2_miss_pmcnum, l2_miss_after);
>> +	/*
>> +	 * rdpmc is not a serializing instruction. Add barrier to ensure
>> +	 * events measured have completed and prevent instructions that
>> +	 * follow to begin executing before reading the counter value.
>> +	 */
>> +	rmb();
>> +	/* Re-enable hardware prefetchers */
>> +	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
>> +	local_irq_enable();
>> +	trace_pseudo_lock_l2(l2_hits_after - l2_hits_before,
>> +			     l2_miss_after - l2_miss_before);
>> +
>> +out_l2_hit:
>> +	perf_event_release_kernel(l2_hit_event);
>> +out_l2_miss:
>> +	perf_event_release_kernel(l2_miss_event);
>> +out:
>> +	plr->thread_done = 1;
>> +	wake_up_interruptible(&plr->lock_thread_wq);
>> +	return 0;
>> +}
>> +
> 
> The above, looks a _LOT_ like the below. And while C does suck a little,
> I'm sure there's something we can do about this.

You are correct, the L2 and L3 cache measurements are very similar.
Indeed, the current implementation does have them together in one
function but I was not able to obtain the same accuracy in measurements
as presented in my previous emails that only did L2 measurements. When
combining the L2 and L3 measurements in one function it is required to
do something like:

          if (need_l2) {
                  rdpmcl(l2_hit_pmcnum, l2_hits_before);
                  rdpmcl(l2_miss_pmcnum, l2_miss_before);
          }
          if (need_l3) {
                  rdpmcl(l3_hit_pmcnum, l3_hits_before);
                  rdpmcl(l3_miss_pmcnum, l3_miss_before);
          }
          rmb();
          if (need_l2) {
                  rdpmcl(l2_hit_pmcnum, l2_hits_before);
                  rdpmcl(l2_miss_pmcnum, l2_miss_before);
          }
          if (need_l3) {
                  rdpmcl(l3_hit_pmcnum, l3_hits_before);
                  rdpmcl(l3_miss_pmcnum, l3_miss_before);
          }
          rmb();
          /* Test */
          if (need_l2) {
                  rdpmcl(l2_hit_pmcnum, l2_hits_after);
                  rdpmcl(l2_miss_pmcnum, l2_miss_after);
          }
          if (need_l3) {
                  rdpmcl(l3_hit_pmcnum, l3_hits_after);
                  rdpmcl(l3_miss_pmcnum, l3_miss_after);
          }

I have found that the additional branches required to support L2 and L3
measurements in one function introduce more cache misses in the
measurements than if I separate the measurements into two functions.

If you do have suggestions on how I can improve the implementation while
maintaining (or improving) the accuracy of the measurements I would
greatly appreciate it.

Reinette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-06 19:21     ` Reinette Chatre
@ 2018-09-06 19:44       ` Peter Zijlstra
  2018-09-06 20:05         ` Reinette Chatre
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-09-06 19:44 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Sep 06, 2018 at 12:21:59PM -0700, Reinette Chatre wrote:
> If you do have suggestions on how I can improve the implementation while
> maintaining (or improving) the accuracy of the measurements I would
> greatly appreciate it.

You can reduce the code duplication by using __always_inline functions
with const function arguments.
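
Something along these lines (a sketch with hypothetical names, not code
from the series): mark the read helper __always_inline and pass const
bool selectors, then every call site gets its own copy in which the
need_l2/need_l3 tests are constant-folded away, so no branches remain
inside the measured region.

	static __always_inline void read_pmcs(const bool need_l2, const bool need_l3,
					      int l2_hit, int l2_miss,
					      int l3_hit, int l3_miss,
					      u64 *l2_hits, u64 *l2_misses,
					      u64 *l3_hits, u64 *l3_misses)
	{
		if (need_l2) {
			rdpmcl(l2_hit, *l2_hits);
			rdpmcl(l2_miss, *l2_misses);
		}
		if (need_l3) {
			rdpmcl(l3_hit, *l3_hits);
			rdpmcl(l3_miss, *l3_misses);
		}
	}

	/* Each call site passes constants, so the dead branch disappears: */
	read_pmcs(true, false, l2_hit_pmcnum, l2_miss_pmcnum, -1, -1,
		  &l2_hits_after, &l2_miss_after, NULL, NULL);
	read_pmcs(false, true, -1, -1, l3_hit_pmcnum, l3_miss_pmcnum,
		  NULL, NULL, &l3_hits_after, &l3_miss_after);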

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-06 19:44       ` Peter Zijlstra
@ 2018-09-06 20:05         ` Reinette Chatre
  2018-09-06 20:29           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2018-09-06 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Hi Peter,

On 9/6/2018 12:44 PM, Peter Zijlstra wrote:
> On Thu, Sep 06, 2018 at 12:21:59PM -0700, Reinette Chatre wrote:
>> If you do have suggestions on how I can improve the implementation while
>> maintaining (or improving) the accuracy of the measurements I would
>> greatly appreciate it.
> 
> You can reduce the code duplication by using __always_inline functions
> with const function arguments.
> 

Could you please elaborate? I am unable to see how that would help in,
for example:

          if (need_l2) {
                   rdpmcl(l2_hit_pmcnum, l2_hits_after);
                   rdpmcl(l2_miss_pmcnum, l2_miss_after);
          }
          if (need_l3) {
                   rdpmcl(l3_hit_pmcnum, l3_hits_after);
                   rdpmcl(l3_miss_pmcnum, l3_miss_after);
          }

Two issues with the above are:
- the measurements captured in l2_hits_after and l2_miss_after would
include the hits/misses associated with the "if (need_l2)" test
- the measurements captured in l3_hits_after and l3_miss_after would
include the hits/misses associated with both the "if (need_l2)" and "if
(need_l3)" tests.

When I separate the above into the two functions it just becomes either:
                   rdpmcl(l2_hit_pmcnum, l2_hits_after);
                   rdpmcl(l2_miss_pmcnum, l2_miss_after);
or:
                   rdpmcl(l3_hit_pmcnum, l3_hits_after);
                   rdpmcl(l3_miss_pmcnum, l3_miss_after);

Reinette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-06 20:05         ` Reinette Chatre
@ 2018-09-06 20:29           ` Peter Zijlstra
  2018-09-06 20:37             ` Reinette Chatre
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-09-06 20:29 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Sep 06, 2018 at 01:05:05PM -0700, Reinette Chatre wrote:
> When I separate the above into the two functions it just becomes either:
>                    rdpmcl(l2_hit_pmcnum, l2_hits_after);
>                    rdpmcl(l2_miss_pmcnum, l2_miss_after);
> or:
>                    rdpmcl(l3_hit_pmcnum, l3_hits_after);
>                    rdpmcl(l3_miss_pmcnum, l3_miss_after);
> 

Right, which is the exact _same_ code, so you only need a single
function.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-06 20:29           ` Peter Zijlstra
@ 2018-09-06 20:37             ` Reinette Chatre
  2018-09-06 21:38               ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Reinette Chatre @ 2018-09-06 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On 9/6/2018 1:29 PM, Peter Zijlstra wrote:
> On Thu, Sep 06, 2018 at 01:05:05PM -0700, Reinette Chatre wrote:
>> When I separate the above into the two functions it just becomes either:
>>                    rdpmcl(l2_hit_pmcnum, l2_hits_after);
>>                    rdpmcl(l2_miss_pmcnum, l2_miss_after);
>> or:
>>                    rdpmcl(l3_hit_pmcnum, l3_hits_after);
>>                    rdpmcl(l3_miss_pmcnum, l3_miss_after);
>>
> 
> Right, which is the exact _same_ code, so you only need a single
> function.
> 

From my understanding it is not this code specifically that is causing
the cache misses but instead the code and variables used to decide
whether to run them or not. These would still be needed when I extract
the above into inline functions.

Reinette

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements
  2018-09-06 20:37             ` Reinette Chatre
@ 2018-09-06 21:38               ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2018-09-06 21:38 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Sep 06, 2018 at 01:37:14PM -0700, Reinette Chatre wrote:
> On 9/6/2018 1:29 PM, Peter Zijlstra wrote:
> > On Thu, Sep 06, 2018 at 01:05:05PM -0700, Reinette Chatre wrote:
> >> When I separate the above into the two functions it just becomes either:
> >>                    rdpmcl(l2_hit_pmcnum, l2_hits_after);
> >>                    rdpmcl(l2_miss_pmcnum, l2_miss_after);
> >> or:
> >>                    rdpmcl(l3_hit_pmcnum, l3_hits_after);
> >>                    rdpmcl(l3_miss_pmcnum, l3_miss_after);
> >>
> > 
> > Right, which is the exact _same_ code, so you only need a single
> > function.
> > 
> 
> From my understanding it is not this code specifically that is causing
> the cache misses but instead the code and variables used to decide
> whether to run them or not. These would still be needed when I extract
> the above into inline functions.

Oh, seriously, use your brain.. This is trivial stuff. Compare the two
functions l2/l3.

They are _identical_ except for some silly bits before/after and
some spurious differences because apparently you cannot copy/paste.
I thought there would be some differences in the loop, but not even
that. They really are identical.

The below should work I think.

---

struct residency_counts {
	u64 miss_before, hits_before;
	u64 miss_after,  hits_after;
};

static int measure_residency_fn(struct perf_event_attr *miss_attr,
				struct perf_event_attr *hit_attr,
				struct pseudo_lock_region *plr,
				struct residency_counts *counts)
 {
+	u64 hits_before, hits_after, miss_before, miss_after;
+	struct perf_event *miss_event, *hit_event;
+	int hit_pmcnum, miss_pmcnum;
 	unsigned int line_size;
 	unsigned int size;
 	unsigned long i;
 	void *mem_r;
+	u64 tmp;
 
+	miss_event = perf_event_create_kernel_counter(miss_attr,
+							 plr->cpu,
+							 NULL, NULL, NULL);
+	if (IS_ERR(miss_event))
+		goto out;
+
+	hit_event = perf_event_create_kernel_counter(hit_attr,
+							plr->cpu,
+							NULL, NULL, NULL);
+	if (IS_ERR(hit_event))
+		goto out_miss;
+
+	local_irq_disable();
+	/*
+	 * Check any possible error state of events used by performing
+	 * one local read.
+	 */
+	if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
+	}
+	if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+		local_irq_enable();
+		goto out_hit;
+	}
+
+	/*
+	 * Disable hardware prefetchers.
 	 *
+	 * Call wrmsr directly to prevent the local register variables from
+	 * being overwritten due to reordering of their assignment with
+	 * the wrmsr calls.
+	 */
+	__wrmsr(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits, 0x0);
+
+	/* Initialize rest of local variables */
+	/*
+	 * Performance event has been validated right before this with
+	 * interrupts disabled - it is thus safe to read the counter index.
+	 */
+	miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+	hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+	line_size = READ_ONCE(plr->line_size);
+	mem_r = READ_ONCE(plr->kmem);
+	size = READ_ONCE(plr->size);
+
+	/*
+	 * Read counter variables twice - first to load the instructions
+	 * used in L1 cache, second to capture an accurate value that does not
+	 * include cache misses incurred because of instruction loads.
+	 */
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/* From SDM: serialize to keep back-to-back rdpmc reads monotonic. */
+	rmb();
+	rdpmcl(hit_pmcnum, hits_before);
+	rdpmcl(miss_pmcnum, miss_before);
+	/* rdpmc is not serializing: keep it ordered before the loads below. */
+	rmb();
+	for (i = 0; i < size; i += line_size) {
+		/*
+		 * Add a barrier to prevent speculative execution of this
+		 * loop reading beyond the end of the buffer.
+		 */
+		rmb();
+		asm volatile("mov (%0,%1,1), %%eax\n\t"
+			     :
+			     : "r" (mem_r), "r" (i)
+			     : "%eax", "memory");
+	}
	rmb();
+	rdpmcl(hit_pmcnum, hits_after);
+	rdpmcl(miss_pmcnum, miss_after);
+	rmb();
+	/* Re-enable hardware prefetchers */
+	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
+	local_irq_enable();
+out_hit:
+	perf_event_release_kernel(hit_event);
+out_miss:
+	perf_event_release_kernel(miss_event);
+out:
	counts->miss_before = miss_before;
	counts->hits_before = hits_before;
	counts->miss_after  = miss_after;
	counts->hits_after  = hits_after;
+	return 0;
+}

measure_l2_residency()
{
	struct residency_counts counts;

+       switch (boot_cpu_data.x86_model) {
+       case INTEL_FAM6_ATOM_GOLDMONT:
+       case INTEL_FAM6_ATOM_GEMINI_LAKE:
+               perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+                                                  .umask = 0x10);
+               perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+                                                 .umask = 0x2);
+               break;
+       default:
+               goto out;
+       }

	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);

	trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
			     counts.miss_after - counts.miss_before);
out:
+	plr->thread_done = 1;
+	wake_up_interruptible(&plr->lock_thread_wq);
}

measure_l3_residency()
{
	struct residency_counts counts;

        switch (boot_cpu_data.x86_model) {
        case INTEL_FAM6_BROADWELL_X:
                /* On BDW the l3_hit_bits count references, not hits */
+               perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+                                                 .umask = 0x4f);
+               perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+                                                  .umask = 0x41);
                break;
        default:
                goto out;
        }

	measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);

+       counts.miss_after -= counts.miss_before;
+       if (boot_cpu_data.x86_model == INTEL_FAM6_BROADWELL_X) {
+               /*
+                * On BDW references and misses are counted, need to adjust.
+                * Sometimes the "hits" counter is a bit more than the
+                * references, for example, x references but x + 1 hits.
+                * To not report invalid hit values in this case we treat
+                * that as misses equal to references.
+                */
+               /* First compute the number of cache references measured */
+               counts.hits_after -= counts.hits_before;
+               /* Next convert references to cache hits */
+               counts.hits_after -= counts.miss_after > counts.hits_after ?
+                                       counts.hits_after : counts.miss_after;
+       } else {
+               counts.hits_after -= counts.hits_before;
        }

+       trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index
  2018-09-06 14:47   ` Peter Zijlstra
@ 2018-09-06 23:26     ` Reinette Chatre
  0 siblings, 0 replies; 18+ messages in thread
From: Reinette Chatre @ 2018-09-06 23:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, fenghua.yu, tony.luck, mingo, acme, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Hi Peter,

On 9/6/2018 7:47 AM, Peter Zijlstra wrote:
> On Thu, Aug 16, 2018 at 01:16:07PM -0700, Reinette Chatre wrote:
> 
>> +static inline int x86_perf_rdpmc_ctr_get(struct perf_event *event)
>> +{
>> +	lockdep_assert_irqs_disabled();
>> +
>> +	return IS_ERR_OR_NULL(event) ? -1 : event->hw.event_base_rdpmc;
>> +}
> 
> That should be in arch/x86/include/asm/perf_event.h if anywhere. Also,
> call the thing x86_perf_rdpmc_index(), that's consistent with the other
> naming.

Moving it to arch/x86/include/asm/perf_event.h is not trivial since this
file is not familiar with struct perf_event.

struct perf_event, struct hw_perf_event and its member event_base_rdpmc
are all defined in include/linux/perf_event.h - could this function
perhaps be moved there? If so, would perf_rdpmc_index() perhaps be a
better name to be consistent with the other naming?

> 
> I don't think there's any point in testing for !event, this is an
> interface that mandates you know wth you're doing anyway.
> 

I could add:
/* !CONFIG_PERF_EVENTS */
static inline int perf_rdpmc_index(struct perf_event *event)
{
        return -1;
}
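
The pair in include/linux/perf_event.h could then look roughly like
this (a sketch, assuming that placement turns out to be acceptable):

	#ifdef CONFIG_PERF_EVENTS
	static inline int perf_rdpmc_index(struct perf_event *event)
	{
		lockdep_assert_irqs_disabled();

		return event->hw.event_base_rdpmc;
	}
	#else
	static inline int perf_rdpmc_index(struct perf_event *event)
	{
		return -1;
	}
	#endif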

Reinette

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-09-06 23:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-16 20:16 [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
2018-08-16 20:16 ` [PATCH V2 1/6] perf/core: Add sanity check to deal with pinned event failure Reinette Chatre
2018-08-16 20:16 ` [PATCH V2 2/6] x86/intel_rdt: Remove local register variables Reinette Chatre
2018-08-16 20:16 ` [PATCH V2 3/6] x86/intel_rdt: Create required perf event attributes Reinette Chatre
2018-08-16 20:16 ` [PATCH V2 4/6] x86/intel_rdt: Add helper to obtain performance counter index Reinette Chatre
2018-09-06 14:47   ` Peter Zijlstra
2018-09-06 23:26     ` Reinette Chatre
2018-08-16 20:16 ` [PATCH V2 5/6] x86/intel_rdt: Use perf infrastructure for measurements Reinette Chatre
2018-09-06 14:15   ` Peter Zijlstra
2018-09-06 19:21     ` Reinette Chatre
2018-09-06 19:44       ` Peter Zijlstra
2018-09-06 20:05         ` Reinette Chatre
2018-09-06 20:29           ` Peter Zijlstra
2018-09-06 20:37             ` Reinette Chatre
2018-09-06 21:38               ` Peter Zijlstra
2018-09-06 14:38   ` Peter Zijlstra
2018-08-16 20:16 ` [PATCH V2 6/6] x86/intel_rdt: Re-enable pseudo-lock measurements Reinette Chatre
2018-09-04 16:50 ` [PATCH V2 0/6] perf/core and x86/intel_rdt: Fix lack of coordination with perf Reinette Chatre
