* [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
@ 2018-07-31 19:38 Reinette Chatre
  2018-07-31 19:38 ` [PATCH 1/2] perf/x86: Expose PMC hardware reservation Reinette Chatre
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Reinette Chatre @ 2018-07-31 19:38 UTC (permalink / raw)
  To: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

Dear Maintainers,

The success of Cache Pseudo-Locking can be measured via the use of
performance events. Specifically, the number of cache hits and misses
reading a memory region after it has been pseudo-locked to cache. This
measurement is triggered via the resctrl debugfs interface.

To ensure the most accurate results the performance counters and their
configuration registers are accessed directly. This is currently done
without coordination with other performance event users and will have
consequences any time two users, for example perf and cache
pseudo-locking, attempt to do any kind of measurement at the same time.

The performance counter reservation mechanism for individual counters is
available to cache pseudo-locking and could be used, but that would
require duplicating the currently private reservation mechanism for all
counters. Instead, in this work the reservation mechanism for all
counters on the system is exposed and subsequently used by the cache
pseudo-locking measurement code.
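
In outline, a user of these exposed functions would bracket its direct
use of the PMC hardware as follows (a minimal sketch of the intended
usage pattern; the actual measurement code is elided):

	if (!reserve_pmc_hardware())
		return -EBUSY;

	/* ... program and read the performance counters directly ... */

	release_pmc_hardware();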

Your feedback on this work will be greatly appreciated.

Reinette

Reinette Chatre (2):
  perf/x86: Expose PMC hardware reservation
  x86/intel_rdt: Coordinate performance monitoring with perf

 Documentation/x86/intel_rdt_ui.txt          | 4 ----
 arch/x86/events/core.c                      | 6 ++++--
 arch/x86/kernel/cpu/intel_rdt.h             | 2 ++
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 8 ++++++++
 4 files changed, 14 insertions(+), 6 deletions(-)

-- 
2.17.0



* [PATCH 1/2] perf/x86: Expose PMC hardware reservation
  2018-07-31 19:38 [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf Reinette Chatre
@ 2018-07-31 19:38 ` Reinette Chatre
  2018-07-31 19:38 ` [PATCH 2/2] x86/intel_rdt: Coordinate performance monitoring with perf Reinette Chatre
  2018-08-02 12:39 ` [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination " Peter Zijlstra
  2 siblings, 0 replies; 31+ messages in thread
From: Reinette Chatre @ 2018-07-31 19:38 UTC (permalink / raw)
  To: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

When multiple users need to use performance counters a reservation
mechanism is required to ensure coordination. This reservation mechanism
already exists: any user can currently reserve/release a single
performance counter and/or its matching event configuration register via
the exported symbols reserve_perfctr_nmi() and reserve_evntsel_nmi() (and
their matching release functions). These reservation functions take a
single performance counter or event configuration register as parameter
and are typically called in a loop, once for every counter on the
system.

The current users of these exported symbols are oprofile and the x86
events subsystem, each of which wraps these exported symbols in order to
reserve the entire PMC system - reserving each counter together with its
configuration register.
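
For illustration, such a wrapper amounts to roughly the following (a
simplified sketch of the existing reserve_pmc_hardware() logic, not the
exact in-tree code):

static bool reserve_pmc_hardware(void)
{
	int i;

	/* Reserve every counter ... */
	for (i = 0; i < x86_pmu.num_counters; i++) {
		if (!reserve_perfctr_nmi(x86_pmu_event_addr(i)))
			goto perfctr_fail;
	}

	/* ... and every matching event configuration register. */
	for (i = 0; i < x86_pmu.num_counters; i++) {
		if (!reserve_evntsel_nmi(x86_pmu_config_addr(i)))
			goto eventsel_fail;
	}

	return true;

eventsel_fail:
	for (i--; i >= 0; i--)
		release_evntsel_nmi(x86_pmu_config_addr(i));

	i = x86_pmu.num_counters;

perfctr_fail:
	for (i--; i >= 0; i--)
		release_perfctr_nmi(x86_pmu_event_addr(i));

	return false;
}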

A new user wanting to use the x86 PMC hardware can currently only do so
by duplicating this reservation, that is, by creating yet another wrapper
around the exported reserve_perfctr_nmi() and reserve_evntsel_nmi()
functions. This duplication is not desirable and thus the current
wrapping x86 PMC reservation routine itself (reserve_pmc_hardware()) and
its matching release function (release_pmc_hardware()) are exported for
users needing to coordinate use of the PMC hardware.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 arch/x86/events/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5f4829f10129..e883a0a11f53 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -144,7 +144,7 @@ static DEFINE_MUTEX(pmc_reserve_mutex);
 
 #ifdef CONFIG_X86_LOCAL_APIC
 
-static bool reserve_pmc_hardware(void)
+bool reserve_pmc_hardware(void)
 {
 	int i;
 
@@ -173,7 +173,7 @@ static bool reserve_pmc_hardware(void)
 	return false;
 }
 
-static void release_pmc_hardware(void)
+void release_pmc_hardware(void)
 {
 	int i;
 
@@ -189,6 +189,8 @@ static bool reserve_pmc_hardware(void) { return true; }
 static void release_pmc_hardware(void) {}
 
 #endif
+EXPORT_SYMBOL(reserve_pmc_hardware);
+EXPORT_SYMBOL(release_pmc_hardware);
 
 static bool check_hw_exists(void)
 {
-- 
2.17.0



* [PATCH 2/2] x86/intel_rdt: Coordinate performance monitoring with perf
  2018-07-31 19:38 [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf Reinette Chatre
  2018-07-31 19:38 ` [PATCH 1/2] perf/x86: Expose PMC hardware reservation Reinette Chatre
@ 2018-07-31 19:38 ` Reinette Chatre
  2018-08-02 12:39 ` [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination " Peter Zijlstra
  2 siblings, 0 replies; 31+ messages in thread
From: Reinette Chatre @ 2018-07-31 19:38 UTC (permalink / raw)
  To: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa
  Cc: gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel,
	Reinette Chatre

It is possible to measure cache pseudo-locking success using performance
monitoring counters. This measurement is triggered from user space via
the resctrl debugfs interface.

At this time the usage of the performance monitoring counters is not
coordinated with other users. If any other system measurement is in
progress, for example using perf, then the registers would be clobbered
between the multiple users.

Now that users have access to reserve_pmc_hardware() and
release_pmc_hardware(), these functions can be used to ensure that only
one user has access to the PMC hardware at a time. Internally perf
already uses them - the cache pseudo-locking debugging code needs to use
them also.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
 Documentation/x86/intel_rdt_ui.txt          | 4 ----
 arch/x86/kernel/cpu/intel_rdt.h             | 2 ++
 arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c | 8 ++++++++
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index f662d3c530e5..a98e05d3e233 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -520,10 +520,6 @@ the pseudo-locked region:
 2) Cache hit and miss measurements using model specific precision counters if
    available. Depending on the levels of cache on the system the pseudo_lock_l2
    and pseudo_lock_l3 tracepoints are available.
-   WARNING: triggering this  measurement uses from two (for just L2
-   measurements) to four (for L2 and L3 measurements) precision counters on
-   the system, if any other measurements are in progress the counters and
-   their corresponding event registers will be clobbered.
 
 When a pseudo-locked region is created a new debugfs directory is created for
 it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h
index 4e588f36228f..280dde8c8229 100644
--- a/arch/x86/kernel/cpu/intel_rdt.h
+++ b/arch/x86/kernel/cpu/intel_rdt.h
@@ -558,5 +558,7 @@ void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
 void __check_limbo(struct rdt_domain *d, bool force_free);
+extern bool reserve_pmc_hardware(void);
+extern void release_pmc_hardware(void);
 
 #endif /* _ASM_X86_INTEL_RDT_H */
diff --git a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
index f80c58f8adc3..164e9b8b070b 100644
--- a/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
+++ b/arch/x86/kernel/cpu/intel_rdt_pseudo_lock.c
@@ -1010,6 +1010,8 @@ static int measure_cycles_perf_fn(void *_plr)
 	}
 
 	local_irq_disable();
+	if (!reserve_pmc_hardware())
+		goto out_intr;
 	/*
 	 * Call wrmsr direcly to avoid the local register variables from
 	 * being overwritten due to reordering of their assignment with
@@ -1066,6 +1068,7 @@ static int measure_cycles_perf_fn(void *_plr)
 		l3_miss = native_read_pmc(3);
 	}
 	wrmsr(MSR_MISC_FEATURE_CONTROL, 0x0, 0x0);
+	release_pmc_hardware();
 	local_irq_enable();
 	/*
 	 * On BDW we count references and misses, need to adjust. Sometimes
@@ -1083,6 +1086,11 @@ static int measure_cycles_perf_fn(void *_plr)
 		trace_pseudo_lock_l3(l3_hits, l3_miss);
 	}
 
+	goto out;
+
+out_intr:
+	local_irq_enable();
+	pr_err_ratelimited("Failed to reserve performance monitoring regs\n");
 out:
 	plr->thread_done = 1;
 	wake_up_interruptible(&plr->lock_thread_wq);
-- 
2.17.0



* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-07-31 19:38 [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf Reinette Chatre
  2018-07-31 19:38 ` [PATCH 1/2] perf/x86: Expose PMC hardware reservation Reinette Chatre
  2018-07-31 19:38 ` [PATCH 2/2] x86/intel_rdt: Coordinate performance monitoring with perf Reinette Chatre
@ 2018-08-02 12:39 ` Peter Zijlstra
  2018-08-02 16:14   ` Reinette Chatre
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-02 12:39 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Tue, Jul 31, 2018 at 12:38:27PM -0700, Reinette Chatre wrote:
> Dear Maintainers,
> 
> The success of Cache Pseudo-Locking can be measured via the use of
> performance events. Specifically, the number of cache hits and misses
> reading a memory region after it has been pseudo-locked to cache. This
> measurement is triggered via the resctrl debugfs interface.
> 
> To ensure the most accurate results the performance counters and their
> configuration registers are accessed directly.

NAK on that.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 12:39 ` [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination " Peter Zijlstra
@ 2018-08-02 16:14   ` Reinette Chatre
  2018-08-02 16:18     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-02 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

Hi Peter,

On 8/2/2018 5:39 AM, Peter Zijlstra wrote:
> On Tue, Jul 31, 2018 at 12:38:27PM -0700, Reinette Chatre wrote:
>> Dear Maintainers,
>>
>> The success of Cache Pseudo-Locking can be measured via the use of
>> performance events. Specifically, the number of cache hits and misses
>> reading a memory region after it has been pseudo-locked to cache. This
>> measurement is triggered via the resctrl debugfs interface.
>>
>> To ensure the most accurate results the performance counters and their
>> configuration registers are accessed directly.
> 
> NAK on that.
> 

After data is locked to cache we need to measure the success of that.
There is no instruction we can use to query whether a memory address has
been cached, but we can use performance monitoring events, which are
especially valuable on the platforms where they are precise event capable.

To ensure that we are only measuring the presence of data that should be
locked to cache we need to tightly control how this measurement is done.

For example, on my test system I locked 256KB to the cache and with the
current implementation (tip.git on branch x86/cache) I am able to
accurately measure that this was successful as seen below (each cache
line within the 256KB is accessed while the performance monitoring
events are active):

pseudo_lock_mea-26090 [002] .... 61838.488027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26097 [002] .... 61843.689381: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26100 [002] .... 61848.751411: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26108 [002] .... 61853.820361: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26111 [002] .... 61858.880364: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26118 [002] .... 61863.937343: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26121 [002] .... 61869.008341: pseudo_lock_l2: hits=4096 miss=0

The current implementation does not coordinate with perf and this is
what I am trying to fix in this series.

I do respect your NAK but it is not clear to me how to proceed after
obtaining it. Could you please elaborate on what you would prefer as a
solution to ensure accurate measurement of cache-locked data that is
better integrated?

Thank you very much

Reinette




* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 16:14   ` Reinette Chatre
@ 2018-08-02 16:18     ` Peter Zijlstra
  2018-08-02 16:44       ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-02 16:18 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:

> The current implementation does not coordinate with perf and this is
> what I am trying to fix in this series.
> 
> I do respect your NAK but it is not clear to me how to proceed after
> obtaining it. Could you please elaborate on what you would prefer as a
> solution to ensure accurate measurement of cache-locked data that is
> better integrated?

We have an in-kernel interface to perf, use that if you want access to
the PMU. You will not directly stomp on PMU registers.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 16:18     ` Peter Zijlstra
@ 2018-08-02 16:44       ` Reinette Chatre
  2018-08-02 17:37         ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-02 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On 8/2/2018 9:18 AM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
> 
>> The current implementation does not coordinate with perf and this is
>> what I am trying to fix in this series.
>>
>> I do respect your NAK but it is not clear to me how to proceed after
>> obtaining it. Could you please elaborate on what you would prefer as a
>> solution to ensure accurate measurement of cache-locked data that is
>> better integrated?
> 
> We have an in-kernel interface to perf, use that if you want access to
> the PMU. You will not directly stomp on PMU registers.

I do not see how I can do so without incurring the cache hits and misses
from the data needed and instructions run by this interface. Could you
please share how I can do so and still obtain the accurate measurement
of cache residency of a specific memory region?

Reinette




* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 16:44       ` Reinette Chatre
@ 2018-08-02 17:37         ` Peter Zijlstra
  2018-08-02 18:18           ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-02 17:37 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, dave.hansen, hpa, x86, linux-kernel

On Thu, Aug 02, 2018 at 09:44:13AM -0700, Reinette Chatre wrote:
> On 8/2/2018 9:18 AM, Peter Zijlstra wrote:
> > On Thu, Aug 02, 2018 at 09:14:10AM -0700, Reinette Chatre wrote:
> > 
> >> The current implementation does not coordinate with perf and this is
> >> what I am trying to fix in this series.
> >>
> >> I do respect your NAK but it is not clear to me how to proceed after
> >> obtaining it. Could you please elaborate on what you would prefer as a
> >> solution to ensure accurate measurement of cache-locked data that is
> >> better integrated?
> > 
> > We have an in-kernel interface to perf, use that if you want access to
> > the PMU. You will not directly stomp on PMU registers.
> 
> I do not see how I can do so without incurring the cache hits and misses
> from the data needed and instructions run by this interface. Could you
> please share how I can do so and still obtain the accurate measurement
> of cache residency of a specific memory region?

That's the best you're going to get. You do _NOT_ get to use raw PMU.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 17:37         ` Peter Zijlstra
@ 2018-08-02 18:18           ` Dave Hansen
  2018-08-02 19:54             ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2018-08-02 18:18 UTC (permalink / raw)
  To: Peter Zijlstra, Reinette Chatre
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

On 08/02/2018 10:37 AM, Peter Zijlstra wrote:
>> I do not see how I can do so without incurring the cache hits and misses
>> from the data needed and instructions run by this interface. Could you
>> please share how I can do so and still obtain the accurate measurement
>> of cache residency of a specific memory region?
> That's the best you're going to get. You do _NOT_ get to use raw PMU.

Hi Peter,

This code is really fidgety.  It's really easily perturbed even by tiny
things like implicit cache accesses from the page walker filling the TLB
from the page tables.

Adding a bunch more code in the way is surely going to make it more
fragile and imprecise.

I totally understand not wanting to fill the tree with code hijacking
the raw PMU.  Is your reaction to this really around not wanting to
start down the slippery slope that ends up with lots of raw PMU "owners"?


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 18:18           ` Dave Hansen
@ 2018-08-02 19:54             ` Peter Zijlstra
  2018-08-02 20:06               ` Dave Hansen
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-02 19:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Reinette Chatre, tglx, mingo, fenghua.yu, tony.luck,
	vikas.shivappa, gavin.hindman, jithu.joseph, hpa, x86,
	linux-kernel

On Thu, Aug 02, 2018 at 11:18:01AM -0700, Dave Hansen wrote:
> On 08/02/2018 10:37 AM, Peter Zijlstra wrote:
> >> I do not see how I can do so without incurring the cache hits and misses
> >> from the data needed and instructions run by this interface. Could you
> >> please share how I can do so and still obtain the accurate measurement
> >> of cache residency of a specific memory region?
> > That's the best you're going to get. You do _NOT_ get to use raw PMU.
> 
> Hi Peter,
> 
> This code is really fidgety.  It's really easily perturbed even by tiny
> things like implicit cache accesses from the page walker filling the TLB
> from the page tables.
> 
> Adding a bunch more code in the way is surely going to make it more
> fragile and imprecise.
> 
> I totally understand not wanting to fill the tree with code hijacking
> the raw PMU.  Is your reaction to this really around not wanting to
> start down the slippery slope that ends up with lots of raw PMU "owners"?

That and the fact that having multiple owners directly contradicts what
perf set out to do: provide resource arbitration for the PMU.

Not being able to use both perf and this resctl thing at the same time
is utter crap. You will not get special dispensation.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 19:54             ` Peter Zijlstra
@ 2018-08-02 20:06               ` Dave Hansen
  2018-08-02 20:13                 ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2018-08-02 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Reinette Chatre, tglx, mingo, fenghua.yu, tony.luck,
	vikas.shivappa, gavin.hindman, jithu.joseph, hpa, x86,
	linux-kernel

On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>> I totally understand not wanting to fill the tree with code hijacking
>> the raw PMU.  Is your reaction to this really around not wanting to
>> start down the slippery slope that ends up with lots of raw PMU "owners"?
> That and the fact that having multiple owners directly contradicts what
> perf set out to do: provide resource arbitration for the PMU.
> 
> Not being able to use both perf and this resctl thing at the same time
> is utter crap. You will not get special dispensation.

Is there something we could do in the middle, like have perf itself be
in charge of doing all the programming, but the pseudo-locking could
still _read_ the counters?

I guess we'd still end up needing to have perf expose which counter got
assigned to which event, and make sure things like multiplexing are not
in play.

But, either way, I think we can live with this measurement being fragile
(like it already is).  If it only works when all the planets are
aligned, I think it's still usable enough.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 20:06               ` Dave Hansen
@ 2018-08-02 20:13                 ` Peter Zijlstra
  2018-08-02 20:43                   ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-02 20:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Reinette Chatre, tglx, mingo, fenghua.yu, tony.luck,
	vikas.shivappa, gavin.hindman, jithu.joseph, hpa, x86,
	linux-kernel

On Thu, Aug 02, 2018 at 01:06:19PM -0700, Dave Hansen wrote:
> On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
> >> I totally understand not wanting to fill the tree with code hijacking
> >> the raw PMU.  Is your reaction to this really around not wanting to
> >> start down the slippery slope that ends up with lots of raw PMU "owners"?
> > That and the fact that having multiple owners directly contradicts what
> > perf set out to do: provide resource arbitration for the PMU.
> > 
> > Not being able to use both perf and this resctl thing at the same time
> > is utter crap. You will not get special dispensation.
> 
> Is there something we could do in the middle, like have perf itself be
> in charge of doing all the programming, but the pseudo-locking could
> still _read_ the counters?

perf has all of that.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 20:13                 ` Peter Zijlstra
@ 2018-08-02 20:43                   ` Reinette Chatre
  2018-08-03 10:49                     ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-02 20:43 UTC (permalink / raw)
  To: Peter Zijlstra, Dave Hansen
  Cc: tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter,

On 8/2/2018 1:13 PM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 01:06:19PM -0700, Dave Hansen wrote:
>> On 08/02/2018 12:54 PM, Peter Zijlstra wrote:
>>>> I totally understand not wanting to fill the tree with code hijacking
>>>> the raw PMU.  Is your reaction to this really around not wanting to
>>>> start down the slippery slope that ends up with lots of raw PMU "owners"?
>>> That and the fact that having multiple owners directly contradicts what
>>> perf set out to do: provide resource arbitration for the PMU.
>>>
>>> Not being able to use both perf and this resctl thing at the same time
>>> is utter crap. You will not get special dispensation.

The goal of this work is to use the existing PMU hardware coordination
mechanism to ensure that perf and resctrl will not interfere with each
other. The resctrl debugging is short-lived - it reserves the PMU
hardware while interrupts are disabled and releases the PMU hardware
before re-enabling interrupts. If a perf session is running at that time
then this reservation will ensure it is not interfered with. If a perf
session is not running at the time then it will not even get the
opportunity to attempt to start one.

>> Is there something we could do in the middle, like have perf itself be
>> in charge of doing all the programming, but the pseudo-locking could
>> still _read_ the counters?
> 
> perf has all of that.

At the time the counters are read the damage has already been done by
all the extra instructions and data accessed during the enable and
disable of the events.

Reinette


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-02 20:43                   ` Reinette Chatre
@ 2018-08-03 10:49                     ` Peter Zijlstra
  2018-08-03 15:18                       ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-03 10:49 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

On Thu, Aug 02, 2018 at 01:43:42PM -0700, Reinette Chatre wrote:

> The goal of this work is to use the existing PMU hardware coordination
> mechanism to ensure that perf and resctrl will not interfere with each
> other.

I understand what it does.. but if I'd seen you frobbing at the PMU
earlier your patches would've never gotten merged.

It's simply not going to happen.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-03 10:49                     ` Peter Zijlstra
@ 2018-08-03 15:18                       ` Reinette Chatre
  2018-08-03 15:25                         ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-03 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter,

On 8/3/2018 3:49 AM, Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 01:43:42PM -0700, Reinette Chatre wrote:
> 
>> The goal of this work is to use the existing PMU hardware coordination
>> mechanism to ensure that perf and resctrl will not interfere with each
>> other.
> 
> I understand what it does.. but if I'd seen you frobbing at the PMU
> earlier your patches would've never gotten merged.
> 
> It's simply not going to happen.

Your two-fold guidance is clear to me: (1) do not use PMU registers
directly, (2) use perf API instead.

You state that you understand what we are trying to do and I hope that I
convinced you that we are not able to accomplish the same by following
your guidance.

Could you please guide us on how to obtain the accurate results we
obtain at this time while satisfying your requirements?

Earlier your response to a similar question was "That's the best you're
going to get" - unfortunately I cannot respectfully say that to our
customers, who have become used to working with accurate data for more
than a year. I really want to find a solution that is acceptable to
mainline instead of having to maintain a solution needed by customers
out of tree.

Reinette




* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-03 15:18                       ` Reinette Chatre
@ 2018-08-03 15:25                         ` Peter Zijlstra
  2018-08-03 18:37                           ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-03 15:25 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
> You state that you understand what we are trying to do and I hope that I
> convinced you that we are not able to accomplish the same by following
> your guidance.

No, I said I understood your pmc reserve patch and its implications.

I have no clue what you're trying to do with resctl, nor why you think
this is not feasible with perf. And if it really is not feasible, you'll
have to live without it.


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-03 15:25                         ` Peter Zijlstra
@ 2018-08-03 18:37                           ` Reinette Chatre
  2018-08-06 19:50                             ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-03 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter,

On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>> You state that you understand what we are trying to do and I hope that I
>> convinced you that we are not able to accomplish the same by following
>> your guidance.
> 
> No, I said I understood your pmc reserve patch and its implications.
> 
> I have no clue what you're trying to do with resctl, nor why you think
> this is not feasible with perf. And if it really is not feasible, you'll
> have to live without it.

I can surely provide the details on what we are doing with resctrl and
elaborate more on why this is not feasible with the full perf kernel API.

In summary:
Building on top of Cache Allocation Technology (CAT) we load a portion
of memory into a specified region of cache. After the region of cache
obtained its data it (the cache region) is configured (via CAT) to only
serve cache hits - this pre-loaded memory cannot be evicted from the
cache. We call this "cache pseudo-locking" - the memory has been
"pseudo-locked" to the cache.

To measure how successful the pseudo-locking of the memory is we can use
the precision capable performance events on our platforms: start the
cache hit and miss counters, read the pseudo-locked memory, stop the
counters. This measurement is done with interrupts and hardware
prefetchers disabled to ensure that _only_ access to the pseudo-locked
memory is measured.

Any additional code or data accessed, either by the counter management
or even by the loop reading the memory itself, contributes cache
hits/misses to the measurement that are not caused by the memory we are
actually trying to access.

Even within the current measurement code we had to take a lot of care
not to use, for example, pointers to obtain information about the memory
to be measured. That information had to be kept in local variables.
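
For instance, the loop reading the pseudo-locked memory is kept to a
bare minimum, roughly along these lines (a sketch; mem_r and size are
local copies of the region's address and size taken before the
measurement starts):

	for (i = 0; i < size; i += 64) {
		/*
		 * Touch one cache line per iteration; inline asm keeps
		 * the compiler from adding extra loads/stores around
		 * the access.
		 */
		asm volatile("mov (%0,%1,1), %%eax\n\t"
			     :
			     : "r" (mem_r), "r" (i)
			     : "%eax", "memory");
	}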

Consider building on top of the kernel perf event API
(perf_event_create_kernel_counter(), perf_event_enable(),
perf_event_disable(), ...). Just looking at perf_event_enable() -
ideally this would be as lean as possible, only enabling the event and
not itself contributing to the measurement. First, enabling the event is
not that lean since it executes more code after the event was actually
enabled. Also, the code relies on a mutex so we cannot use it with
interrupts disabled.

We have two types of customers of this feature: those who require very
low latency and those who require high determinism. In either case a
measured cache miss is of concern to them and our goal is to provide a
memory region for which the number of cache misses can be demonstrated.

Reinette


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-03 18:37                           ` Reinette Chatre
@ 2018-08-06 19:50                             ` Reinette Chatre
  2018-08-06 22:12                               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-06 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter,

On 8/3/2018 11:37 AM, Reinette Chatre wrote:
> On 8/3/2018 8:25 AM, Peter Zijlstra wrote:
>> On Fri, Aug 03, 2018 at 08:18:09AM -0700, Reinette Chatre wrote:
>>> You state that you understand what we are trying to do and I hope that I
>>> convinced you that we are not able to accomplish the same by following
>>> your guidance.
>>
>> No, I said I understood your pmc reserve patch and its implications.
>>
>> I have no clue what you're trying to do with resctl, nor why you think
>> this is not feasible with perf. And if it really is not feasible, you'll
>> have to live without it.

In my previous email I provided the details of the Cache Pseudo-Locking
feature implemented on top of resctrl. Please let me know if you would
like any more details about that. I can send you more materials.

In my previous message I also provided my thoughts on why I believe the
same is not feasible with perf, as commented below ...

> Consider building on top of the kernel perf event API
> (perf_event_create_kernel_counter(), perf_event_enable(),
> perf_event_disable(), ...). Just looking at perf_event_enable() -
> ideally this would be as lean as possible, only enabling the event and
> not itself contributing to the measurement. First, enabling the event is
> not that lean since it executes more code after the event was actually
> enabled. Also, the code relies on a mutex so we cannot use it with
> interrupts disabled.

I proceeded to modify the implemented debugfs measurements to build on
top of the perf APIs mentioned above. As anticipated the events could
not be enabled with interrupts disabled. I get a clear message in this
regard:

BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:748

I thus continued to use the API with interrupts enabled and did the following:

Two new event attributes:
static struct perf_event_attr l2_miss_attr = {
        .type           = PERF_TYPE_RAW,
        .config         = (0x10ULL << 8) | 0xd1,
        .size           = sizeof(struct perf_event_attr),
        .pinned         = 1,
        .disabled       = 1,
        .exclude_user   = 1
};

static struct perf_event_attr l2_hit_attr = {
        .type           = PERF_TYPE_RAW,
        .config         = (0x2ULL << 8) | 0xd1,
        .size           = sizeof(struct perf_event_attr),
        .pinned         = 1,
        .disabled       = 1,
        .exclude_user   = 1
};

Create the two new events using these attributes:
l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu,
NULL, NULL, NULL);
l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL,
NULL, NULL);

Take measurements:
perf_event_enable(l2_miss_event);
perf_event_enable(l2_hit_event);
local_irq_disable();
/* Disable hardware prefetchers */
/* Loop through pseudo-locked memory */
/* Enable hardware prefetchers */
local_irq_enable();
perf_event_disable(l2_hit_event);
perf_event_disable(l2_miss_event);

Read results:
l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running);
l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running);
/* Make results available in tracepoints */


With the above implementation and a 256KB pseudo-locked memory region I
obtain the following results:
pseudo_lock_mea-755   [002] ....   396.946953: pseudo_lock_l2: hits=4140 miss=5
pseudo_lock_mea-762   [002] ....   397.998864: pseudo_lock_l2: hits=4138 miss=8
pseudo_lock_mea-765   [002] ....   399.041868: pseudo_lock_l2: hits=4142 miss=5
pseudo_lock_mea-768   [002] ....   400.086871: pseudo_lock_l2: hits=4141 miss=7
pseudo_lock_mea-771   [002] ....   401.132921: pseudo_lock_l2: hits=4138 miss=10
pseudo_lock_mea-774   [002] ....   402.216700: pseudo_lock_l2: hits=4238 miss=46
pseudo_lock_mea-777   [002] ....   403.312148: pseudo_lock_l2: hits=4142 miss=5
pseudo_lock_mea-780   [002] ....   404.381674: pseudo_lock_l2: hits=4139 miss=8
pseudo_lock_mea-783   [002] ....   405.422820: pseudo_lock_l2: hits=4472 miss=79
pseudo_lock_mea-786   [002] ....   406.495065: pseudo_lock_l2: hits=4140 miss=8
pseudo_lock_mea-793   [002] ....   407.561383: pseudo_lock_l2: hits=4143 miss=4

The above results are not accurate since they do not reflect the success
of the pseudo-locked region. Expected results are as we can currently
obtain (copying results from previous email):
pseudo_lock_mea-26090 [002] .... 61838.488027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26097 [002] .... 61843.689381: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26100 [002] .... 61848.751411: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26108 [002] .... 61853.820361: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26111 [002] .... 61858.880364: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26118 [002] .... 61863.937343: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-26121 [002] .... 61869.008341: pseudo_lock_l2: hits=4096 miss=0

Could you please guide me on how you would prefer us to use perf in
order to obtain the same accurate results we can now?

Thank you very much

Reinette

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-06 19:50                             ` Reinette Chatre
@ 2018-08-06 22:12                               ` Peter Zijlstra
  2018-08-06 23:07                                 ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-06 22:12 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

On Mon, Aug 06, 2018 at 12:50:50PM -0700, Reinette Chatre wrote:
> In my previous email I provided the details of the Cache Pseudo-Locking
> feature implemented on top of resctrl. Please let me know if you would
> like any more details about that. I can send you more materials.

I've not yet had time to read..

> BUG: sleeping function called from invalid context at
> kernel/locking/mutex.c:748
> 
> I thus continued to use the API with interrupts enabled and did the following:
> 
> Two new event attributes:
> static struct perf_event_attr l2_miss_attr = {
>         .type           = PERF_TYPE_RAW,
>         .config         = (0x10ULL << 8) | 0xd1,

Please use something like:

		X86_CONFIG(.event=0xd1, .umask=0x10),

that's ever so much more readable.
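
For example (a sketch; this assumes X86_CONFIG from
arch/x86/events/perf_event.h is reachable from this code, and assigns
.config at runtime since the macro builds a compound literal):

	l2_miss_attr.config = X86_CONFIG(.event=0xd1, .umask=0x10);
	l2_hit_attr.config  = X86_CONFIG(.event=0xd1, .umask=0x02);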

>         .size           = sizeof(struct perf_event_attr),
>         .pinned         = 1,
>         .disabled       = 1,
>         .exclude_user   = 1
> };
> 
> static struct perf_event_attr l2_hit_attr = {
>         .type           = PERF_TYPE_RAW,
>         .config         = (0x2ULL << 8) | 0xd1,
>         .size           = sizeof(struct perf_event_attr),
>         .pinned         = 1,
>         .disabled       = 1,
>         .exclude_user   = 1
> };
> 
> Create the two new events using these attributes:
> l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu,
> NULL, NULL, NULL);
> l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL,
> NULL, NULL);
> 
> Take measurements:
> perf_event_enable(l2_miss_event);
> perf_event_enable(l2_hit_event);
> local_irq_disable();
> /* Disable hardware prefetchers */
> /* Loop through pseudo-locked memory */
> /* Enable hardware prefetchers */
> local_irq_enable();
> perf_event_disable(l2_hit_event);
> perf_event_disable(l2_miss_event);
> 
> Read results:
> l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running);
> l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running);
> /* Make results available in tracepoints */

switch to .disabled=0 and try this for measurement:

	local_irq_disable();
	perf_event_read_local(l2_miss_event, &miss_val1, NULL, NULL);
	perf_event_read_local(l2_hit_event, &hit_val1, NULL, NULL);
	/* do your thing */
	perf_event_read_local(l2_miss_event, &miss_val2, NULL, NULL);
	perf_event_read_local(l2_hit_event, &hit_val2, NULL, NULL);
	local_irq_enable();

You're running this on the CPU you created the event for, right?

> With the above implementation and a 256KB pseudo-locked memory region I
> obtain the following results:
> pseudo_lock_mea-755   [002] ....   396.946953: pseudo_lock_l2: hits=4140

> The above results are not accurate since they do not reflect the success
> of the pseudo-locked region. Expected results are as we can currently
> obtain (copying results from previous email):
> pseudo_lock_mea-26090 [002] .... 61838.488027: pseudo_lock_l2: hits=4096

Still fairly close.. only like 44 extra hits or 1% error.
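
(Working that out: 4140 observed hits vs. 4096 expected gives
4140 - 4096 = 44 extra hits; 44 / 4096 is roughly 1.1%.)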


* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-06 22:12                               ` Peter Zijlstra
@ 2018-08-06 23:07                                 ` Reinette Chatre
  2018-08-07  9:36                                   ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-06 23:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter,

On 8/6/2018 3:12 PM, Peter Zijlstra wrote:
> On Mon, Aug 06, 2018 at 12:50:50PM -0700, Reinette Chatre wrote:
>> In my previous email I provided the details of the Cache Pseudo-Locking
>> feature implemented on top of resctrl. Please let me know if you would
>> like any more details about that. I can send you more materials.
> 
> I've no yet had time to read..
> 
>> BUG: sleeping function called from invalid context at
>> kernel/locking/mutex.c:748
>>
>> I thus continued to use the API with interrupts enabled and did the following:
>>
>> Two new event attributes:
>> static struct perf_event_attr l2_miss_attr = {
>>         .type           = PERF_TYPE_RAW,
>>         .config         = (0x10ULL << 8) | 0xd1,
> 
> Please use something like:
> 
> 		X86_CONFIG(.event=0xd1, .umask=0x10),
> 
> that's ever so much more readable.
> 
>>         .size           = sizeof(struct perf_event_attr),
>>         .pinned         = 1,
>>         .disabled       = 1,
>>         .exclude_user   = 1
>> };
>>
>> static struct perf_event_attr l2_hit_attr = {
>>         .type           = PERF_TYPE_RAW,
>>         .config         = (0x2ULL << 8) | 0xd1,
>>         .size           = sizeof(struct perf_event_attr),
>>         .pinned         = 1,
>>         .disabled       = 1,
>>         .exclude_user   = 1
>> };
>>
>> Create the two new events using these attributes:
>> l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr, cpu,
>> NULL, NULL, NULL);
>> l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr, cpu, NULL,
>> NULL, NULL);
>>
>> Take measurements:
>> perf_event_enable(l2_miss_event);
>> perf_event_enable(l2_hit_event);
>> local_irq_disable();
>> /* Disable hardware prefetchers */
>> /* Loop through pseudo-locked memory */
>> /* Enable hardware prefetchers */
>> local_irq_enable();
>> perf_event_disable(l2_hit_event);
>> perf_event_disable(l2_miss_event);
>>
>> Read results:
>> l2_hits = perf_event_read_value(l2_hit_event, &enabled, &running);
>> l2_miss = perf_event_read_value(l2_miss_event, &enabled, &running);
>> /* Make results available in tracepoints */
> 
> switch to .disabled=0 and try this for measurement:
> 
> 	local_irq_disable();
> 	perf_event_read_local(l2_miss_event, &miss_val1, NULL, NULL);
> 	perf_event_read_local(l2_hit_event, &hit_val1, NULL, NULL);
> 	/* do your thing */
> 	perf_event_read_local(l2_miss_event, &miss_val2, NULL, NULL);
> 	perf_event_read_local(l2_hit_event, &hit_val2, NULL, NULL);
> 	local_irq_enable();

Thank you very much for taking a look and providing your guidance.

> 
> You're running this on the CPU you created the event for, right?

Yes.

I've modified your suggestion slightly in an attempt to gain accuracy.
Now it looks like:

local_irq_disable();
/* disable hw prefetchers */
/* init local vars to loop through pseudo-locked mem */
perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
/* loop through pseudo-locked mem */
perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL);
/* enable hw prefetchers */
local_irq_enable();

With the above I do not see the impact of an interference workload
anymore but the results are not yet accurate:

pseudo_lock_mea-538   [002] ....   113.296084: pseudo_lock_l2: hits=4103 miss=2
pseudo_lock_mea-541   [002] ....   114.349343: pseudo_lock_l2: hits=4102 miss=3
pseudo_lock_mea-544   [002] ....   115.410206: pseudo_lock_l2: hits=4101 miss=4
pseudo_lock_mea-551   [002] ....   116.473912: pseudo_lock_l2: hits=4102 miss=3
pseudo_lock_mea-554   [002] ....   117.532446: pseudo_lock_l2: hits=4100 miss=5
pseudo_lock_mea-557   [002] ....   118.591121: pseudo_lock_l2: hits=4103 miss=2
pseudo_lock_mea-560   [002] ....   119.642467: pseudo_lock_l2: hits=4102 miss=3
pseudo_lock_mea-563   [002] ....   120.698562: pseudo_lock_l2: hits=4102 miss=3
pseudo_lock_mea-566   [002] ....   121.769348: pseudo_lock_l2: hits=4105 miss=4

In an attempt to improve the accuracy of the above I modified it to the
following:

/* create the two events as before in "enabled" state */
l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc;
l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc;
local_irq_disable();
/* disable hw prefetchers */
/* init local vars to loop through pseudo-locked mem */
l2_hits_before = native_read_pmc(l2_hit_pmcnum);
l2_miss_before = native_read_pmc(l2_miss_pmcnum);
/* loop through pseudo-locked mem */
l2_hits_after = native_read_pmc(l2_hit_pmcnum);
l2_miss_after = native_read_pmc(l2_miss_pmcnum);
/* enable hw prefetchers */
local_irq_enable();

With the above I seem to get the same accuracy as before:
pseudo_lock_mea-557   [002] ....   155.402566: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-564   [002] ....   156.441299: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-567   [002] ....   157.478605: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-570   [002] ....   158.524054: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-573   [002] ....   159.561853: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-576   [002] ....   160.599758: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-579   [002] ....   161.645553: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-582   [002] ....   162.687593: pseudo_lock_l2: hits=4096 miss=0

Would a solution like this perhaps be acceptable to you?

I will continue to do more testing, searching for any caveats in this
solution.

>> With the above implementation and a 256KB pseudo-locked memory region I
>> obtain the following results:
>> pseudo_lock_mea-755   [002] ....   396.946953: pseudo_lock_l2: hits=4140
> 
>> The above results are not accurate since they do not reflect the success
>> of the pseudo-locked region. Expected results are as we can currently
>> obtain (copying results from previous email):
>> pseudo_lock_mea-26090 [002] .... 61838.488027: pseudo_lock_l2: hits=4096
> 
> Still fairly close.. only like 44 extra hits or 1% error.

While the results do seem close, reporting a cache miss on memory that
is set up to be locked in cache is significant.

Thank you very much for your patience

Reinette




* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-06 23:07                                 ` Reinette Chatre
@ 2018-08-07  9:36                                   ` Peter Zijlstra
       [not found]                                     ` <ace0bebb-91ab-5d40-e7d7-d72d48302fa8@intel.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-07  9:36 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

On Mon, Aug 06, 2018 at 04:07:09PM -0700, Reinette Chatre wrote:

> I've modified your suggestion slightly in an attempt to gain accuracy.
> Now it looks like:
> 
> local_irq_disable();
> /* disable hw prefetchers */
> /* init local vars to loop through pseudo-locked mem */
> perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
> perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
> /* loop through pseudo-locked mem */
> perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL);
> perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL);
> /* enable hw prefetchers */
> local_irq_enable();
> 
> With the above I do not see the impact of an interference workload
> anymore but the results are not yet accurate:
> 
> pseudo_lock_mea-538   [002] ....   113.296084: pseudo_lock_l2: hits=4103
> miss=2

an error of ~0.2% sounds pretty good to me ;-)

FWIW, how long is that IRQ disabled section? It looks like something
that could be taking a bit of time. We have these people that care about
IRQ latency.

> While the results do seem close, reporting a cache miss on memory that
> is set up to be locked in cache is significant.

If it is 'locked' then how does perf cause misses? I suppose I should
go read that earlier email.

> In an attempt to improve the accuracy of the above I modified it to the
> following:
> 
> /* create the two events as before in "enabled" state */
> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc;
> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc;
> local_irq_disable();
> /* disable hw prefetchers */
> /* init local vars to loop through pseudo-locked mem */
> l2_hits_before = native_read_pmc(l2_hit_pmcnum);
> l2_miss_before = native_read_pmc(l2_miss_pmcnum);
> /* loop through pseudo-locked mem */
> l2_hits_after = native_read_pmc(l2_hit_pmcnum);
> l2_miss_after = native_read_pmc(l2_miss_pmcnum);
> /* enable hw prefetchers */
> local_irq_enable();

So the above has a number of practical problems:

 - event_base_rdpmc is subject to change as long as interrupts are
   enabled.

 - I don't much fancy people accessing the guts of events like that;
   would not an inline function like:

   static inline u64 x86_perf_rdpmc(struct perf_event *event)
   {
	u64 val;

	lockdep_assert_irqs_disabled();

	rdpmcl(event->hw.event_base_rdpmc, val);
	return val;
   }

   Work for you?

 - native_read_pmc(); are you 100% sure this code only ever runs on
   native and not in some dodgy virt environment?

   (also, I suppose all hardware supporting this resctl locking muck
   actually has rdpmc ;-))

 - You used perf_event_attr::pinned=1; this means we could have failed
   to schedule the event. You need to check error state somewhere and
   have credible error handling.

   For checking error state, you could use perf_event_read_local() early
   after disabling IRQs; it needs a little extra test though:

	if (event->attr.pinned && event->oncpu != smp_processor_id())
		return -EBUSY;

   to correctly deal with pinned failure. Place it such that it becomes
   the last sanity check.


Also, while you disable IRQs, your fancy pants loop is still subject to
NMIs that can/will perturb your measurements; how do you deal with
those?


* RE: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
       [not found]                                     ` <ace0bebb-91ab-5d40-e7d7-d72d48302fa8@intel.com>
@ 2018-08-08  1:28                                       ` Luck, Tony
  2018-08-08  5:44                                         ` Reinette Chatre
  2018-08-08  7:51                                       ` Peter Zijlstra
  1 sibling, 1 reply; 31+ messages in thread
From: Luck, Tony @ 2018-08-08  1:28 UTC (permalink / raw)
  To: Chatre, Reinette, Peter Zijlstra
  Cc: Hansen, Dave, tglx, mingo, Yu, Fenghua, vikas.shivappa, Hindman,
	Gavin, Joseph, Jithu, hpa, x86, linux-kernel

Would it help to call the routines that read the "before" values of the
counters twice? The first time to preload the cache with anything needed
to execute the perf code path.

>> In an attempt to improve the accuracy of the above I modified it to the
>> following:
>>
>> /* create the two events as before in "enabled" state */
>> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc;
>> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc;
>> local_irq_disable();
>> /* disable hw prefetchers */
>> /* init local vars to loop through pseudo-locked mem
      * may take some misses in the perf code
      */
     l2_hits_before = native_read_pmc(l2_hit_pmcnum);
     l2_miss_before = native_read_pmc(l2_miss_pmcnum);
     /* Read counters again, hope no new misses here */
>> l2_hits_before = native_read_pmc(l2_hit_pmcnum);
>> l2_miss_before = native_read_pmc(l2_miss_pmcnum);
>> /* loop through pseudo-locked mem */
>> l2_hits_after = native_read_pmc(l2_hit_pmcnum);
>> l2_miss_after = native_read_pmc(l2_miss_pmcnum);
>> /* enable hw prefetchers */
>> local_irq_enable();



* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08  1:28                                       ` Luck, Tony
@ 2018-08-08  5:44                                         ` Reinette Chatre
  2018-08-08  7:41                                           ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-08  5:44 UTC (permalink / raw)
  To: Luck, Tony, Peter Zijlstra
  Cc: Hansen, Dave, tglx, mingo, Yu, Fenghua, vikas.shivappa, Hindman,
	Gavin, Joseph, Jithu, hpa, x86, linux-kernel

Hi Tony,

On 8/7/2018 6:28 PM, Luck, Tony wrote:
> Would it help to call the routines that read the "before" values of the
> counters twice? The first time to preload the cache with anything needed
> to execute the perf code path.

>>> In an attempt to improve the accuracy of the above I modified it to the
>>> following:
>>>
>>> /* create the two events as before in "enabled" state */
>>> l2_hit_pmcnum = l2_hit_event->hw.event_base_rdpmc;
>>> l2_miss_pmcnum = l2_miss_event->hw.event_base_rdpmc;
>>> local_irq_disable();
>>> /* disable hw prefetchers */
>>> /* init local vars to loop through pseudo-locked mem
>       * may take some misses in the perf code
>       */
>      l2_hits_before = native_read_pmc(l2_hit_pmcnum);
>      l2_miss_before = native_read_pmc(l2_miss_pmcnum);
>      /* Read counters again, hope no new misses here */
>>> l2_hits_before = native_read_pmc(l2_hit_pmcnum);
>>> l2_miss_before = native_read_pmc(l2_miss_pmcnum);
>>> /* loop through pseudo-locked mem */
>>> l2_hits_after = native_read_pmc(l2_hit_pmcnum);
>>> l2_miss_after = native_read_pmc(l2_miss_pmcnum);
>>> /* enable hw prefetchers */
>>> local_irq_enable();
>

The end of my previous email to Peter contains a solution that does
address all the feedback received up to this point while also being able
to obtain (what I thought to be ... more below) accurate results. The
code you comment on below is not this latest version, but your suggestion
is valuable and I do try it out below in two different ways of reading
the perf data.

So, instead of reading data with native_read_pmc() as in the code you
quoted, I first test reading the data twice using the originally
recommended perf_event_read_local(), and second reading the data twice
using rdpmcl() chosen instead of native_read_pmc().

First, reading data using perf_event_read_local() called twice.
When testing as follows:
/* create perf events */
/* disable irq */
/* disable hw prefetchers */
/* init local vars */
/* read before data twice as follows: */
perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
/* read through pseudo-locked memory */
perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL);
perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL);
/* re enable hw prefetchers */
/* enable irq */
/* write data to tracepoint */

With the above I am not able to obtain accurate data:
pseudo_lock_mea-351   [002] ....    61.859147: pseudo_lock_l2: hits=4109
miss=0
pseudo_lock_mea-354   [002] ....    63.045734: pseudo_lock_l2: hits=4103
miss=6
pseudo_lock_mea-357   [002] ....    64.104673: pseudo_lock_l2: hits=4106
miss=3
pseudo_lock_mea-360   [002] ....    65.174775: pseudo_lock_l2: hits=4105
miss=5
pseudo_lock_mea-367   [002] ....    66.232308: pseudo_lock_l2: hits=4104
miss=5
pseudo_lock_mea-370   [002] ....    67.291844: pseudo_lock_l2: hits=4103
miss=6
pseudo_lock_mea-373   [002] ....    68.348725: pseudo_lock_l2: hits=4105
miss=5
pseudo_lock_mea-376   [002] ....    69.409738: pseudo_lock_l2: hits=4105
miss=5
pseudo_lock_mea-379   [002] ....    70.466763: pseudo_lock_l2: hits=4105
miss=5


Second, reading data using rdpmcl() called twice.
This is the same solution as documented in my previous email, with the
two extra rdpmcl() calls added. An overview of the flow:

/* create perf events */
/* disable irq */
/* check perf event error state */
/* disable hw prefetchers */
/* init local vars */
/* read before data twice as follows: */
rdpmcl(l2_hit_pmcnum, l2_hits_before);
rdpmcl(l2_miss_pmcnum, l2_miss_before);
rdpmcl(l2_hit_pmcnum, l2_hits_before);
rdpmcl(l2_miss_pmcnum, l2_miss_before);
/* read through pseudo-locked memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);
/* re enable hw prefetchers */
/* enable irq */
/* write data to tracepoint */

Here, as expected, a simple test showed that the data was accurate
(hits=4096 miss=0), so I repeated the creation and measurement of
pseudo-locked regions at different sizes under different loads. Each
possible pseudo-lock region size was created and measured 100 times on
an idle system and 100 times on a system with a noisy neighbor - this
resulted in a total of 2800 pseudo-lock region creations, each followed
by a measurement of that region.

The results of these tests are the best I have yet seen. In this total
of 2800 measurements the number of cache hits was miscounted in only
eight measurements - each miscount was an undercount by one.
Specifically, a memory region consisting of 8192 cache lines was
measured as "hits=8191 miss=0", three memory regions with 12288 cache
lines were measured as "hits=12287 miss=0", two memory regions with
10240 cache lines were measured as "hits=10239 miss=0", and two memory
regions with 14336 cache lines were measured as "hits=14335 miss=0".
I do not think that having the number of cache hits reported as one
less than the number of read attempts is of big concern.
The miss data remained consistent and was reported as zero misses -
this is exactly the data we were trying to capture!

Thank you so much for your valuable suggestion. I do hope that we could
proceed with this way of measurement.

Reinette

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08  5:44                                         ` Reinette Chatre
@ 2018-08-08  7:41                                           ` Peter Zijlstra
  2018-08-08 15:55                                             ` Luck, Tony
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-08  7:41 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Luck, Tony, Hansen, Dave, tglx, mingo, Yu, Fenghua,
	vikas.shivappa, Hindman, Gavin, Joseph, Jithu, hpa, x86,
	linux-kernel

On Tue, Aug 07, 2018 at 10:44:44PM -0700, Reinette Chatre wrote:
> Hi Tony,
> 
> On 8/7/2018 6:28 PM, Luck, Tony wrote:
> > Would it help to call routines to read the "before" values of the counter
> > twice. The first time to preload the cache with anything needed to execute
> > the perf code path.

Indeed, that is the 'common' pattern for this.


> First, reading data using perf_event_read_local() called twice.
> When testing as follows:
> /* create perf events */
> /* disable irq */
> /* disable hw prefetchers */
> /* init local vars */
> /* read before data twice as follows: */
> perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
> perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
> perf_event_read_local(l2_hit_event, &l2_hits_before, NULL, NULL);
> perf_event_read_local(l2_miss_event, &l2_miss_before, NULL, NULL);
> /* read through pseudo-locked memory */
> perf_event_read_local(l2_hit_event, &l2_hits_after, NULL, NULL);
> perf_event_read_local(l2_miss_event, &l2_miss_after, NULL, NULL);
> /* re enable hw prefetchers */
> /* enable irq */
> /* write data to tracepoint */
> 
> With the above I am not able to obtain accurate data:
> pseudo_lock_mea-354   [002] ....    63.045734: pseudo_lock_l2: hits=4103
> miss=6

So _why_ doesn't this work? As said by Tony, that first call should
prime the caches, so the second and third calls should not generate any
misses.

They might cause extra hits though, but that should be a constant
amount that is also measurable with a no-op loop and can easily be
subtracted.
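
A minimal sketch of that calibration idea (assuming the rdpmcl()-based
flow discussed in this thread; names such as l2_hit_pmcnum and
nr_cache_lines are illustrative, not from the actual patch):

/*
 * Sketch: measure the constant overhead of the read path itself with a
 * no-op pass of the same trip count, then subtract it from the real
 * measurement.
 */
u64 cal_before, cal_after, overhead;
unsigned int i;

rdpmcl(l2_hit_pmcnum, cal_before);
for (i = 0; i < nr_cache_lines; i++)
        barrier();      /* same loop, but no pseudo-locked memory touched */
rdpmcl(l2_hit_pmcnum, cal_after);
overhead = cal_after - cal_before;

/* later: hits = (l2_hits_after - l2_hits_before) - overhead; */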

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
       [not found]                                     ` <ace0bebb-91ab-5d40-e7d7-d72d48302fa8@intel.com>
  2018-08-08  1:28                                       ` Luck, Tony
@ 2018-08-08  7:51                                       ` Peter Zijlstra
  2018-08-08 17:33                                         ` Reinette Chatre
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-08  7:51 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
> > FWIW, how long is that IRQ disabled section? It looks like something
> > that could be taking a bit of time. We have these people that care about
> > IRQ latency.
> 
> We work closely with customers needing low latency as well as customers
> needing deterministic behavior.
> 
> This measurement is triggered by the user as a validation mechanism of
> the pseudo-locked memory region after it has been created as part of
> system setup as well as during runtime if there are any concerns with
> the performance of an application that uses it.
> 
> This measurement would thus be triggered before the sensitive workloads
> start - during system setup, or if an issue is already present. In
> either case the measurement is triggered by the administrator via debugfs.

That does not in fact include the answer to the question. Also, it
assumes a competent operator (something I've found is not always true).

> >  - I don't much fancy people accessing the guts of events like that;
> >    would not an inline function like:
> > 
> >    static inline u64 x86_perf_rdpmc(struct perf_event *event)
> >    {
> > 	u64 val;
> > 
> > 	lockdep_assert_irqs_disabled();
> > 
> > 	rdpmcl(event->hw.event_base_rdpmc, val);
> > 	return val;
> >    }
> > 
> >    Work for you?
> 
> No. This does not provide accurate results. Implementing the above produces:
> pseudo_lock_mea-366   [002] ....    34.950740: pseudo_lock_l2: hits=4096
> miss=4

But it being an inline function should allow the compiler to optimize
and lift the event->hw.event_base_rdpmc load like you now do manually.
Also, like Tony already suggested, you can prime that load just fine by
doing an extra invocation.

(and note that the above function is _much_ simpler than
perf_event_read_local())

> >  - native_read_pmc(); are you 100% sure this code only ever runs on
> >    native and not in some dodgy virt environment?
> 
> My understanding is that a virtual environment would be a customer of a
> RDT allocation (cache or memory bandwidth). I do not see if/where this
> is restricted though - I'll move to rdpmcl() but the usage of a cache
> allocation feature like this from a virtual machine needs more
> investigation.

I can imagine that hypervisors that allow physical partitioning could
allow delegating the rdt crud to their guests when they 'own' a full
socket or whatever the domain is for this.

> Will do. I created the following helper function that can be used after
> interrupts are disabled:
> 
> static inline int perf_event_error_state(struct perf_event *event)
> {
>         int ret = 0;
>         u64 tmp;
> 
>         ret = perf_event_read_local(event, &tmp, NULL, NULL);
>         if (ret < 0)
>                 return ret;
> 
>         if (event->attr.pinned && event->oncpu != smp_processor_id())
>                 return -EBUSY;
> 
>         return ret;
> }

Nah, stick the test in perf_event_read_local(), that actually needs it.

> > Also, while you disable IRQs, your fancy pants loop is still subject to
> > NMIs that can/will perturb your measurements, how do you deal with
> > those?

> Customers interested in this feature are familiar with dealing with them
> (and also SMIs). The user space counterpart is able to detect such an
> occurrence.

You're very optimistic about your customers capabilities. And this might
be true for the current people you're talking to, but once this is
available and public, joe monkey will have access and he _will_ screw it
up.

> Please note that if an NMI arrives it would be handled with the
> currently active cache capacity bitmask so none of the pseudo-locked
> memory will be evicted since no capacity bitmask overlaps with the
> pseudo-locked region.

So exceptions change / have their own bitmask?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08  7:41                                           ` Peter Zijlstra
@ 2018-08-08 15:55                                             ` Luck, Tony
  2018-08-08 16:47                                               ` Peter Zijlstra
  0 siblings, 1 reply; 31+ messages in thread
From: Luck, Tony @ 2018-08-08 15:55 UTC (permalink / raw)
  To: Peter Zijlstra, Chatre, Reinette
  Cc: Hansen, Dave, tglx, mingo, Yu, Fenghua, vikas.shivappa, Hindman,
	Gavin, Joseph, Jithu, hpa, x86, linux-kernel

> So _why_ doesn't this work? As said by Tony, that first call should
> prime the caches, so the second and third calls should not generate any
> misses.

How much code/data is involved? If there is a lot, then you may be unlucky
with cache coloring and the later parts of the "prime the caches" code path
may evict some lines loaded in the early parts.

-Tony

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08 15:55                                             ` Luck, Tony
@ 2018-08-08 16:47                                               ` Peter Zijlstra
  2018-08-08 16:51                                                 ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2018-08-08 16:47 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Chatre, Reinette, Hansen, Dave, tglx, mingo, Yu, Fenghua,
	vikas.shivappa, Hindman, Gavin, Joseph, Jithu, hpa, x86,
	linux-kernel

On Wed, Aug 08, 2018 at 03:55:54PM +0000, Luck, Tony wrote:
> > So _why_ doesn't this work? As said by Tony, that first call should
> > prime the caches, so the second and third calls should not generate any
> > misses.
> 
> How much code/data is involved? If there is a lot, then you may be unlucky
> with cache coloring and the later parts of the "prime the caches" code path
> may evict some lines loaded in the early parts.

Well, Reinette used perf_event_read_local(), which is unfortunately quite
a bit of code. But the inline I proposed is a single load plus, depending
on rdpmcl() or native_read_pmc(), either a function call or just a single
inline asm rdpmc instruction.

That should certainly work I think.
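
For reference, a sketch of what such a read boils down to (roughly what
native_read_pmc() expands to; the helper name here is illustrative, not
the proposed inline itself):

/* One RDPMC: ECX selects the counter, result is returned in EDX:EAX. */
static inline u64 pmc_read(unsigned int counter)
{
        u32 lo, hi;

        asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
        return lo | ((u64)hi << 32);
}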

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08 16:47                                               ` Peter Zijlstra
@ 2018-08-08 16:51                                                 ` Reinette Chatre
  0 siblings, 0 replies; 31+ messages in thread
From: Reinette Chatre @ 2018-08-08 16:51 UTC (permalink / raw)
  To: Peter Zijlstra, Luck, Tony
  Cc: Hansen, Dave, tglx, mingo, Yu, Fenghua, vikas.shivappa, Hindman,
	Gavin, Joseph, Jithu, hpa, x86, linux-kernel

On 8/8/2018 9:47 AM, Peter Zijlstra wrote:
> On Wed, Aug 08, 2018 at 03:55:54PM +0000, Luck, Tony wrote:
>>> So _why_ doesn't this work? As said by Tony, that first call should
>>> prime the caches, so the second and third calls should not generate any
>>> misses.
>>
>> How much code/data is involved? If there is a lot, then you may be unlucky
>> with cache coloring and the later parts of the "prime the caches" code path
>> may evict some lines loaded in the early parts.
> 
> Well, Reinette used perf_event_read_local(), which is unfortunately quite
> a bit of code. But the inline I proposed is a single load plus, depending
> on rdpmcl() or native_read_pmc(), either a function call or just a single
> inline asm rdpmc instruction.
> 
> That should certainly work I think.

I am in the process of testing this variation.

Reinette



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08  7:51                                       ` Peter Zijlstra
@ 2018-08-08 17:33                                         ` Reinette Chatre
  2018-08-10 16:25                                           ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-08 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter and Tony,

On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>> FWIW, how long is that IRQ disabled section? It looks like something
>>> that could be taking a bit of time. We have these people that care about
>>> IRQ latency.
>>
>> We work closely with customers needing low latency as well as customers
>> needing deterministic behavior.
>>
>> This measurement is triggered by the user as a validation mechanism of
>> the pseudo-locked memory region after it has been created as part of
>> system setup as well as during runtime if there are any concerns with
>> the performance of an application that uses it.
>>
>> This measurement would thus be triggered before the sensitive workloads
>> start - during system setup, or if an issue is already present. In
>> either case the measurement is triggered by the administrator via debugfs.
> 
> That does not in fact include the answer to the question. Also, it
> assumes a competent operator (something I've found is not always true).

My apologies, I did not intend to avoid your question. The main goal of
this measurement is for an operator to test if a pseudo-locked region
has been set up correctly. It is thus targeted for system setup time,
before the IRQ sensitive workloads - some of which would use these
memory regions - start.

>>>  - I don't much fancy people accessing the guts of events like that;
>>>    would not an inline function like:
>>>
>>>    static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>    {
>>> 	u64 val;
>>>
>>> 	lockdep_assert_irqs_disabled();
>>>
>>> 	rdpmcl(event->hw.event_base_rdpmc, val);
>>> 	return val;
>>>    }
>>>
>>>    Work for you?
>>
>> No. This does not provide accurate results. Implementing the above produces:
>> pseudo_lock_mea-366   [002] ....    34.950740: pseudo_lock_l2: hits=4096
>> miss=4
> 
> But it being an inline function should allow the compiler to optimize
> and lift the event->hw.event_base_rdpmc load like you now do manually.
> Also, like Tony already suggested, you can prime that load just fine by
> doing an extra invocation.
> 
> (and note that the above function is _much_ simpler than
> perf_event_read_local())

Unfortunately I do not find this to be the case. When I implement
x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:

l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
l2_hits_after = x86_perf_rdpmc(l2_hit_event);
l2_miss_after = x86_perf_rdpmc(l2_miss_event);


Then the results are not accurate, nor are they consistently
inaccurate enough to be treated as a constant adjustment:

pseudo_lock_mea-409   [002] ....   194.322611: pseudo_lock_l2: hits=4100
miss=0
pseudo_lock_mea-412   [002] ....   195.520203: pseudo_lock_l2: hits=4096
miss=3
pseudo_lock_mea-415   [002] ....   196.571114: pseudo_lock_l2: hits=4097
miss=3
pseudo_lock_mea-422   [002] ....   197.629118: pseudo_lock_l2: hits=4097
miss=3
pseudo_lock_mea-425   [002] ....   198.687160: pseudo_lock_l2: hits=4096
miss=3
pseudo_lock_mea-428   [002] ....   199.744156: pseudo_lock_l2: hits=4096
miss=2
pseudo_lock_mea-431   [002] ....   200.801131: pseudo_lock_l2: hits=4097
miss=2
pseudo_lock_mea-434   [002] ....   201.858141: pseudo_lock_l2: hits=4097
miss=2
pseudo_lock_mea-437   [002] ....   202.917168: pseudo_lock_l2: hits=4096
miss=2

I was able to test Tony's theory: replacing the reading of the
"after" counts with a direct rdpmcl() improves the results. What I mean
is this:

l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);

I did not run my full tests with the above but a simple read of 256KB
pseudo-locked memory gives:
pseudo_lock_mea-492   [002] ....   372.001385: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-495   [002] ....   373.059748: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-498   [002] ....   374.117027: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-501   [002] ....   375.182864: pseudo_lock_l2: hits=4096
miss=0
pseudo_lock_mea-504   [002] ....   376.243958: pseudo_lock_l2: hits=4096
miss=0

We thus seem to be encountering the issue Tony predicted where the
memory being tested is evicting the earlier measurement code and data.

>>>  - native_read_pmc(); are you 100% sure this code only ever runs on
>>>    native and not in some dodgy virt environment?
>>
>> My understanding is that a virtual environment would be a customer of a
>> RDT allocation (cache or memory bandwidth). I do not see if/where this
>> is restricted though - I'll move to rdpmcl() but the usage of a cache
>> allocation feature like this from a virtual machine needs more
>> investigation.
> 
> I can imagine that hypervisors that allow physical partitioning could
> allow delegating the rdt crud to their guests when they 'own' a full
> socket or whatever the domain is for this.

I am using rdpmcl() now.

> 
>> Will do. I created the following helper function that can be used after
>> interrupts are disabled:
>>
>> static inline int perf_event_error_state(struct perf_event *event)
>> {
>>         int ret = 0;
>>         u64 tmp;
>>
>>         ret = perf_event_read_local(event, &tmp, NULL, NULL);
>>         if (ret < 0)
>>                 return ret;
>>
>>         if (event->attr.pinned && event->oncpu != smp_processor_id())
>>                 return -EBUSY;
>>
>>         return ret;
>> }
> 
> Nah, stick the test in perf_event_read_local(), that actually needs it.

I will keep this outside for now since it does not seem as though
perf_event_read_local() works well for this use case.


>>> Also, while you disable IRQs, your fancy pants loop is still subject to
>>> NMIs that can/will perturb your measurements, how do you deal with
>>> those?
> 
>> Customers interested in this feature are familiar with dealing with them
>> (and also SMIs). The user space counterpart is able to detect such an
>> occurrence.
> 
> You're very optimistic about your customers capabilities. And this might
> be true for the current people you're talking to, but once this is
> available and public, joe monkey will have access and he _will_ screw it
> up.

I think this ventures into an area that is an existing issue for all
workloads that fall into this category, not unique to this.

> 
>> Please note that if an NMI arrives it would be handled with the
>> currently active cache capacity bitmask so none of the pseudo-locked
>> memory will be evicted since no capacity bitmask overlaps with the
>> pseudo-locked region.
> 
> So exceptions change / have their own bitmask?

My understanding is that interrupts run with the currently active
capacity bitmask. The capacity bitmasks specify which portions of cache
a CPU and/or task may allocate into. No capacity bitmask would overlap
with the pseudo-locked region and thus no CPU or task would be able to
evict the data from the pseudo-locked region. Even if interrupts later
obtain their own class of service with its own capacity bitmasks -
these bitmasks would still not be allowed to overlap with the
pseudo-locked region.

Reinette



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-08 17:33                                         ` Reinette Chatre
@ 2018-08-10 16:25                                           ` Reinette Chatre
  2018-08-10 17:52                                             ` Reinette Chatre
  0 siblings, 1 reply; 31+ messages in thread
From: Reinette Chatre @ 2018-08-10 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Hi Peter,

On 8/8/2018 10:33 AM, Reinette Chatre wrote:
> On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
>> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>>>  - I don't much fancy people accessing the guts of events like that;
>>>>    would not an inline function like:
>>>>
>>>>    static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>>>    {
>>>> 	u64 val;
>>>>
>>>> 	lockdep_assert_irqs_disabled();
>>>>
>>>> 	rdpmcl(event->hw.event_base_rdpmc, val);
>>>> 	return val;
>>>>    }
>>>>
>>>>    Work for you?
>>>
>>> No. This does not provide accurate results. Implementing the above produces:
>>> pseudo_lock_mea-366   [002] ....    34.950740: pseudo_lock_l2: hits=4096
>>> miss=4
>>
>> But it being an inline function should allow the compiler to optimize
>> and lift the event->hw.event_base_rdpmc load like you now do manually.
>> Also, like Tony already suggested, you can prime that load just fine by
>> doing an extra invocation.
>>
>> (and note that the above function is _much_ simpler than
>> perf_event_read_local())
> 
> Unfortunately I do not find this to be the case. When I implement
> x86_perf_rdpmc() _exactly_ as you suggest above and do the measurement like:
> 
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> /* read memory */
> l2_hits_after = x86_perf_rdpmc(l2_hit_event);
> l2_miss_after = x86_perf_rdpmc(l2_miss_event);
> 
> 
> Then the results are not accurate, nor are they consistently
> inaccurate enough to be treated as a constant adjustment:
> 
> pseudo_lock_mea-409   [002] ....   194.322611: pseudo_lock_l2: hits=4100
> miss=0
> pseudo_lock_mea-412   [002] ....   195.520203: pseudo_lock_l2: hits=4096
> miss=3
> pseudo_lock_mea-415   [002] ....   196.571114: pseudo_lock_l2: hits=4097
> miss=3
> pseudo_lock_mea-422   [002] ....   197.629118: pseudo_lock_l2: hits=4097
> miss=3
> pseudo_lock_mea-425   [002] ....   198.687160: pseudo_lock_l2: hits=4096
> miss=3
> pseudo_lock_mea-428   [002] ....   199.744156: pseudo_lock_l2: hits=4096
> miss=2
> pseudo_lock_mea-431   [002] ....   200.801131: pseudo_lock_l2: hits=4097
> miss=2
> pseudo_lock_mea-434   [002] ....   201.858141: pseudo_lock_l2: hits=4097
> miss=2
> pseudo_lock_mea-437   [002] ....   202.917168: pseudo_lock_l2: hits=4096
> miss=2
> 
> I was able to test Tony's theory: replacing the reading of the
> "after" counts with a direct rdpmcl() improves the results. What I mean
> is this:
> 
> l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
> l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> l2_hits_before = x86_perf_rdpmc(l2_hit_event);
> l2_miss_before = x86_perf_rdpmc(l2_miss_event);
> /* read memory */
> rdpmcl(l2_hit_pmcnum, l2_hits_after);
> rdpmcl(l2_miss_pmcnum, l2_miss_after);
> 
> I did not run my full tests with the above but a simple read of 256KB
> pseudo-locked memory gives:
> pseudo_lock_mea-492   [002] ....   372.001385: pseudo_lock_l2: hits=4096
> miss=0
> pseudo_lock_mea-495   [002] ....   373.059748: pseudo_lock_l2: hits=4096
> miss=0
> pseudo_lock_mea-498   [002] ....   374.117027: pseudo_lock_l2: hits=4096
> miss=0
> pseudo_lock_mea-501   [002] ....   375.182864: pseudo_lock_l2: hits=4096
> miss=0
> pseudo_lock_mea-504   [002] ....   376.243958: pseudo_lock_l2: hits=4096
> miss=0
> 
> We thus seem to be encountering the issue Tony predicted where the
> memory being tested is evicting the earlier measurement code and data.

I thoroughly reviewed this email thread to ensure that all your feedback
is being addressed. At this time I believe the current solution does so
since it addresses all requirements I was able to capture:
- Use in-kernel interface to perf.
- Do not write directly to PMU registers.
- Do not introduce another PMU owner. perf retains its role of
  performing resource arbitration for the PMU.
- User space is able to use perf and resctrl at the same time.
- event_base_rdpmc is accessed and used only within an
  interrupts-disabled section.
- Internals of events are never accessed directly; an inline function
  is used.
- Due to "pinned" usage the scheduling of the event may have failed.
  The error state is checked in the recommended way, with credible
  error handling.
- Use X86_CONFIG().

The pseudocode of the current solution is presented below. With this
solution I am able to address our customer requirement to be able to
measure a pseudo-locked region accurately while also addressing your
requirements to use perf correctly.

Is this solution acceptable to you?

#include "../../events/perf_event.h" /* For X86_CONFIG() */

/*
 * The X86_CONFIG() macro cannot be used in
 * a designated initializer as below - the initialization of
 * the .config attribute is thus deferred to later in order
 * to use X86_CONFIG
 */

static struct perf_event_attr l2_miss_attr = {
        .type           = PERF_TYPE_RAW,
        .size           = sizeof(struct perf_event_attr),
        .pinned         = 1,
        .disabled       = 0,
        .exclude_user   = 1
};

static struct perf_event_attr l2_hit_attr = {
        .type           = PERF_TYPE_RAW,
        .size           = sizeof(struct perf_event_attr),
        .pinned         = 1,
        .disabled       = 0,
        .exclude_user   = 1
};

static inline int x86_perf_rdpmc_ctr_get(struct perf_event *event)
{
        lockdep_assert_irqs_disabled();

        return IS_ERR(event) ? 0 : event->hw.event_base_rdpmc;
}

static inline int x86_perf_event_error_state(struct perf_event *event)
{
        int ret = 0;
        u64 tmp;

        ret = perf_event_read_local(event, &tmp, NULL, NULL);
        if (ret < 0)
                return ret;

        if (event->attr.pinned && event->oncpu != smp_processor_id())
                return -EBUSY;

        return ret;
}

/*
 * Below is run by kernel thread on correct CPU as triggered
 * by user via debugfs
 */
static int measure_cycles_perf_fn(...)
{
        u64 l2_hits_before, l2_hits_after, l2_miss_before, l2_miss_after;
        struct perf_event *l2_miss_event, *l2_hit_event;
        int l2_hit_pmcnum, l2_miss_pmcnum;
        /* Other vars */

        l2_miss_attr.config = X86_CONFIG(.event=0xd1, .umask=0x10);
        l2_hit_attr.config = X86_CONFIG(.event=0xd1, .umask=0x2);
        l2_miss_event = perf_event_create_kernel_counter(&l2_miss_attr,
                                                         cpu,
                                                         NULL, NULL, NULL);
        if (IS_ERR(l2_miss_event))
                goto out;

        l2_hit_event = perf_event_create_kernel_counter(&l2_hit_attr,
                                                        cpu,
                                                        NULL, NULL, NULL);
        if (IS_ERR(l2_hit_event))
                goto out_l2_miss;

        local_irq_disable();
        if (x86_perf_event_error_state(l2_miss_event)) {
                local_irq_enable();
                goto out_l2_hit;
        }
        if (x86_perf_event_error_state(l2_hit_event)) {
                local_irq_enable();
                goto out_l2_hit;
        }
        /* Disable hardware prefetchers */
        /* Initialize local variables */
        l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
        l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
        rdpmcl(l2_hit_pmcnum, l2_hits_before);
        rdpmcl(l2_miss_pmcnum, l2_miss_before);
        /*
         * From SDM: back-to-back fast reads are not guaranteed to be
         * monotonic. To guarantee monotonicity on back-to-back reads,
         * a serializing instruction must be placed between the two
         * RDPMC instructions.
         */
        rmb();
        rdpmcl(l2_hit_pmcnum, l2_hits_before);
        rdpmcl(l2_miss_pmcnum, l2_miss_before);
        rmb();
        /* Loop through pseudo-locked memory */
        rdpmcl(l2_hit_pmcnum, l2_hits_after);
        rdpmcl(l2_miss_pmcnum, l2_miss_after);
        rmb();
        /* Re-enable hardware prefetchers */
        local_irq_enable();
        /* Write results to kernel tracepoints */
out_l2_hit:
        perf_event_release_kernel(l2_hit_event);
out_l2_miss:
        perf_event_release_kernel(l2_miss_event);
out:
        /* Cleanup */
}

Your feedback has been valuable and greatly appreciated.

Reinette


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf
  2018-08-10 16:25                                           ` Reinette Chatre
@ 2018-08-10 17:52                                             ` Reinette Chatre
  0 siblings, 0 replies; 31+ messages in thread
From: Reinette Chatre @ 2018-08-10 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, tglx, mingo, fenghua.yu, tony.luck, vikas.shivappa,
	gavin.hindman, jithu.joseph, hpa, x86, linux-kernel

Just one clarification ...

On 8/10/2018 9:25 AM, Reinette Chatre wrote:
> static inline int x86_perf_event_error_state(struct perf_event *event)
> {
>         int ret = 0;
>         u64 tmp;
> 
>         ret = perf_event_read_local(event, &tmp, NULL, NULL);
>         if (ret < 0)
>                 return ret;
> 
>         if (event->attr.pinned && event->oncpu != smp_processor_id())
>                 return -EBUSY;

I am preparing a patch series in which the above extra test will be
included as the last sanity check in perf_event_read_local(), as you
suggested. This inline function will thus go away and the only error
state check will be a call to perf_event_read_local().
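
Roughly, such a check would sit at the end of perf_event_read_local(),
something like the sketch below; the exact placement and wording in the
actual patch may differ:

        /*
         * Sketch: if a pinned event could not be scheduled it is not
         * running on this CPU, so fail the read rather than return a
         * stale count.
         */
        if (event->attr.pinned && event->oncpu != smp_processor_id()) {
                ret = -EBUSY;
                goto out;
        }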

Reinette

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-08-10 17:52 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-31 19:38 [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf Reinette Chatre
2018-07-31 19:38 ` [PATCH 1/2] perf/x86: Expose PMC hardware reservation Reinette Chatre
2018-07-31 19:38 ` [PATCH 2/2] x86/intel_rdt: Coordinate performance monitoring with perf Reinette Chatre
2018-08-02 12:39 ` [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination " Peter Zijlstra
2018-08-02 16:14   ` Reinette Chatre
2018-08-02 16:18     ` Peter Zijlstra
2018-08-02 16:44       ` Reinette Chatre
2018-08-02 17:37         ` Peter Zijlstra
2018-08-02 18:18           ` Dave Hansen
2018-08-02 19:54             ` Peter Zijlstra
2018-08-02 20:06               ` Dave Hansen
2018-08-02 20:13                 ` Peter Zijlstra
2018-08-02 20:43                   ` Reinette Chatre
2018-08-03 10:49                     ` Peter Zijlstra
2018-08-03 15:18                       ` Reinette Chatre
2018-08-03 15:25                         ` Peter Zijlstra
2018-08-03 18:37                           ` Reinette Chatre
2018-08-06 19:50                             ` Reinette Chatre
2018-08-06 22:12                               ` Peter Zijlstra
2018-08-06 23:07                                 ` Reinette Chatre
2018-08-07  9:36                                   ` Peter Zijlstra
     [not found]                                     ` <ace0bebb-91ab-5d40-e7d7-d72d48302fa8@intel.com>
2018-08-08  1:28                                       ` Luck, Tony
2018-08-08  5:44                                         ` Reinette Chatre
2018-08-08  7:41                                           ` Peter Zijlstra
2018-08-08 15:55                                             ` Luck, Tony
2018-08-08 16:47                                               ` Peter Zijlstra
2018-08-08 16:51                                                 ` Reinette Chatre
2018-08-08  7:51                                       ` Peter Zijlstra
2018-08-08 17:33                                         ` Reinette Chatre
2018-08-10 16:25                                           ` Reinette Chatre
2018-08-10 17:52                                             ` Reinette Chatre
