* [RFC][PATCH 0/8] perf/x86: event scheduling cleanups
@ 2019-03-14 13:01 Peter Zijlstra
  2019-03-14 13:01 ` [PATCH 1/8] perf/x86/intel: Fix memory corruption Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

These here patches are the result of chasing a bug in the TSX Force Abort
patches.

The first patch is simple and fixes the issue and should be marked /urgent and
backported.

The rest is the result of having a hard look at how all that event scheduling
stuff worked while trying to figure out wth was going wrong.

Stephane, since you were involved with writing much of that code, can you
please have a good look? I _think_ I got it, but... :-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-15 11:29   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
  2019-03-19  6:29   ` [PATCH 1/8] " Stephane Eranian
  2019-03-14 13:01 ` [RFC][PATCH 2/8] perf/x86/intel: Simplify intel_tfa_commit_scheduling() Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

Through:

  validate_event()
    x86_pmu.get_event_constraints(.idx=-1)
      tfa_get_event_constraints()
        dyn_constraint()

We use cpuc->constraint_list[-1], which is an obvious out-of-bounds
access.

In this case, simply skip the TFA constraint code; there is no event
constraint with just PMC3, therefore the code will never result in the
empty set.
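The failure mode can be illustrated with a minimal userspace sketch (simplified stand-ins for the kernel structures, not the actual perf code): validate_event() passes idx = -1, and an unguarded dyn_constraint() would index one slot before the per-CPU array. The guard added by the patch skips the dynamic path for any negative index.

```c
#include <stddef.h>

#define MAX_EVENTS 32

struct event_constraint { unsigned long long idxmsk64; int weight; };

struct cpu_hw {
	struct event_constraint constraint_list[MAX_EVENTS];
};

/* dyn_constraint() copies the static constraint into a per-CPU slot so
 * it can be modified; with idx < 0 this would be an out-of-bounds write. */
static struct event_constraint *
dyn_constraint(struct cpu_hw *cpuc, struct event_constraint *c, int idx)
{
	cpuc->constraint_list[idx] = *c;
	return &cpuc->constraint_list[idx];
}

/* Guarded variant mirroring the fix: only take the dynamic path for a
 * real scheduling slot (idx >= 0); validate_event()'s idx = -1 skips it. */
static struct event_constraint *
tfa_constraints(struct cpu_hw *cpuc, int idx,
		struct event_constraint *c, int allow_tfa)
{
	if (!allow_tfa && (c->idxmsk64 & (1ULL << 3)) && idx >= 0) {
		c = dyn_constraint(cpuc, c, idx);
		c->idxmsk64 &= ~(1ULL << 3);	/* mask out PMC3 */
		c->weight--;
	}
	return c;
}
```

With idx = -1 the static constraint is returned untouched; with a real slot index, PMC3 is masked out of a private copy.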

Reported-by: Tony Jones <tonyj@suse.com>
Reported-by: "DSouza, Nelson" <nelson.dsouza@intel.com>
Tested-by: Tony Jones <tonyj@suse.com>
Tested-by: "DSouza, Nelson" <nelson.dsouza@intel.com>
Cc: stable@kernel.org
Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/intel/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3410,7 +3410,7 @@ tfa_get_event_constraints(struct cpu_hw_
 	/*
 	 * Without TFA we must not use PMC3.
 	 */
-	if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
+	if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
 		c = dyn_constraint(cpuc, c, idx);
 		c->idxmsk64 &= ~(1ULL << 3);
 		c->weight--;




* [RFC][PATCH 2/8] perf/x86/intel: Simplify intel_tfa_commit_scheduling()
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
  2019-03-14 13:01 ` [PATCH 1/8] perf/x86/intel: Fix memory corruption Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-14 13:01 ` [RFC][PATCH 3/8] perf/x86: Simplify x86_pmu.get_constraints() interface Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

validate_group() calls x86_schedule_events(.assign=NULL) and therefore
will not call intel_tfa_commit_scheduling(). So there is no point in
checking cpuc->is_fake; we'll never get there.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/intel/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2015,7 +2015,7 @@ static void intel_tfa_commit_scheduling(
 	/*
 	 * We're going to use PMC3, make sure TFA is set before we touch it.
 	 */
-	if (cntr == 3 && !cpuc->is_fake)
+	if (cntr == 3)
 		intel_set_tfa(cpuc, true);
 }
 




* [RFC][PATCH 3/8] perf/x86: Simplify x86_pmu.get_constraints() interface
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
  2019-03-14 13:01 ` [PATCH 1/8] perf/x86/intel: Fix memory corruption Peter Zijlstra
  2019-03-14 13:01 ` [RFC][PATCH 2/8] perf/x86/intel: Simplify intel_tfa_commit_scheduling() Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-19 21:21   ` Stephane Eranian
  2019-03-14 13:01 ` [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

There is a special case for validate_event() where we'll call
x86_pmu.get_event_constraints(.idx=-1). Its purpose, up until
recently, seems to be to avoid taking a previous constraint from
cpuc->event_constraint[] in intel_get_event_constraints().

(I could not find any other get_event_constraints() implementation
using @idx.)

However, since that cpuc is freshly allocated, that array will in fact
be initialized with NULL pointers, achieving the very same effect.

Therefore remove this exception.
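The "freshly allocated" argument can be sketched in userspace (simplified structures, with calloc() standing in for the kernel's zeroing allocation): every slot of a zero-allocated event_constraint[] starts out NULL, so loading slot @idx yields the same "no previous constraint" result the old idx < 0 guard produced.

```c
#include <stdlib.h>
#include <stddef.h>

#define X86_PMC_IDX_MAX 64

struct event_constraint { int weight; };

struct cpu_hw {
	struct event_constraint *event_constraint[X86_PMC_IDX_MAX];
};

/* Models allocate_fake_cpuc(): the kernel uses a zeroing allocator, so
 * the constraint array comes back full of NULL pointers. */
static struct cpu_hw *allocate_fake_cpuc(void)
{
	return calloc(1, sizeof(struct cpu_hw));
}

/* The c1 load from intel_get_event_constraints(), without the guard. */
static struct event_constraint *
prev_constraint(struct cpu_hw *cpuc, int idx)
{
	return cpuc->event_constraint[idx];
}
```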

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c       |    2 +-
 arch/x86/events/intel/core.c |    8 +++-----
 2 files changed, 4 insertions(+), 6 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2031,7 +2031,7 @@ static int validate_event(struct perf_ev
 	if (IS_ERR(fake_cpuc))
 		return PTR_ERR(fake_cpuc);
 
-	c = x86_pmu.get_event_constraints(fake_cpuc, -1, event);
+	c = x86_pmu.get_event_constraints(fake_cpuc, 0, event);
 
 	if (!c || !c->weight)
 		ret = -EINVAL;
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2931,11 +2931,9 @@ static struct event_constraint *
 intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 			    struct perf_event *event)
 {
-	struct event_constraint *c1 = NULL;
-	struct event_constraint *c2;
+	struct event_constraint *c1, *c2;
 
-	if (idx >= 0) /* fake does < 0 */
-		c1 = cpuc->event_constraint[idx];
+	c1 = cpuc->event_constraint[idx];
 
 	/*
 	 * first time only
@@ -3410,7 +3408,7 @@ tfa_get_event_constraints(struct cpu_hw_
 	/*
 	 * Without TFA we must not use PMC3.
 	 */
-	if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
+	if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
 		c = dyn_constraint(cpuc, c, idx);
 		c->idxmsk64 &= ~(1ULL << 3);
 		c->weight--;




* [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
                   ` (2 preceding siblings ...)
  2019-03-14 13:01 ` [RFC][PATCH 3/8] perf/x86: Simplify x86_pmu.get_constraints() interface Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-19 20:48   ` Stephane Eranian
  2019-03-14 13:01 ` [RFC][PATCH 5/8] perf/x86/intel: Optimize intel_get_excl_constraints() Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

The flag PERF_X86_EVENT_COMMITTED is used to find uncommitted events
for which to call put_event_constraint() when scheduling fails.

These are the newly added events to the list and must, by definition,
form the tail of cpuc->event_list[]. By computing the list index of
the last successful schedule, iteration can start there and the flag
becomes redundant.

There are only 3 callers of x86_schedule_events(), notably:

 - x86_pmu_add()
 - x86_pmu_commit_txn()
 - validate_group()

For x86_pmu_add(), cpuc->n_events isn't updated until after
schedule_events() succeeds, therefore cpuc->n_events points to the
desired index.

For x86_pmu_commit_txn(), cpuc->n_events is updated, but we can
trivially compute the desired value with cpuc->n_txn -- the number of
events added in this transaction.

For validate_group(), we can make the rule for x86_pmu_add() work by
simply setting cpuc->n_events to 0 before calling schedule_events().
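The start-index rule for the three callers can be sketched as follows (a minimal model with illustrative names, not the kernel structs): the first event_list[] index that was not successfully scheduled is n_events, except inside an ADD transaction, where n_events already includes the n_txn new events.

```c
#define PERF_PMU_TXN_ADD 0x1

struct cpu_hw {
	int n_events;		/* committed events */
	int n_txn;		/* events added in the current transaction */
	unsigned int txn_flags;
};

/* First event_list[] index that was NOT successfully scheduled. */
static int first_unscheduled(const struct cpu_hw *cpuc)
{
	int i = cpuc->n_events;

	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
		i -= cpuc->n_txn;
	return i;
}
```

The three cases from the changelog: x86_pmu_add() has not yet updated n_events, x86_pmu_commit_txn() subtracts n_txn, and validate_group() forces n_events to 0 so every event counts as new.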

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c       |   21 ++++++---------------
 arch/x86/events/perf_event.h |    1 -
 2 files changed, 6 insertions(+), 16 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -925,19 +925,16 @@ int x86_schedule_events(struct cpu_hw_ev
 	if (!unsched && assign) {
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
-			e->hw.flags |= PERF_X86_EVENT_COMMITTED;
 			if (x86_pmu.commit_scheduling)
 				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
 		}
 	} else {
-		for (i = 0; i < n; i++) {
+		i = cpuc->n_events;
+		if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
+			i -= cpuc->n_txn;
+
+		for (; i < n; i++) {
 			e = cpuc->event_list[i];
-			/*
-			 * do not put_constraint() on comitted events,
-			 * because they are good to go
-			 */
-			if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
-				continue;
 
 			/*
 			 * release events that failed scheduling
@@ -1372,11 +1369,6 @@ static void x86_pmu_del(struct perf_even
 	int i;
 
 	/*
-	 * event is descheduled
-	 */
-	event->hw.flags &= ~PERF_X86_EVENT_COMMITTED;
-
-	/*
 	 * If we're called during a txn, we only need to undo x86_pmu.add.
 	 * The events never got scheduled and ->cancel_txn will truncate
 	 * the event_list.
@@ -2079,8 +2071,7 @@ static int validate_group(struct perf_ev
 	if (n < 0)
 		goto out;
 
-	fake_cpuc->n_events = n;
-
+	fake_cpuc->n_events = 0;
 	ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
 
 out:
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -61,7 +61,6 @@ struct event_constraint {
 #define PERF_X86_EVENT_PEBS_LDLAT	0x0001 /* ld+ldlat data address sampling */
 #define PERF_X86_EVENT_PEBS_ST		0x0002 /* st data address sampling */
 #define PERF_X86_EVENT_PEBS_ST_HSW	0x0004 /* haswell style datala, store */
-#define PERF_X86_EVENT_COMMITTED	0x0008 /* event passed commit_txn */
 #define PERF_X86_EVENT_PEBS_LD_HSW	0x0010 /* haswell style datala, load */
 #define PERF_X86_EVENT_PEBS_NA_HSW	0x0020 /* haswell style datala, unknown */
 #define PERF_X86_EVENT_EXCL		0x0040 /* HT exclusivity on counter */




* [RFC][PATCH 5/8] perf/x86/intel: Optimize intel_get_excl_constraints()
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
                   ` (3 preceding siblings ...)
  2019-03-14 13:01 ` [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-19 23:43   ` Stephane Eranian
  2019-03-14 13:01 ` [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

Avoid the POPCNT by noting we can decrement the weight for each
cleared bit.
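The equivalence the patch relies on can be checked with a small sketch (userspace model; hweight64_ref() stands in for the kernel's hweight64()): starting from the constraint's weight, which is the popcount of its index mask, and decrementing once per cleared bit yields the same value as recounting the mask afterwards.

```c
/* Reference popcount, modeling hweight64(). */
static int hweight64_ref(unsigned long long x)
{
	int n = 0;

	for (; x; x &= x - 1)	/* clear lowest set bit */
		n++;
	return n;
}

/* Clear every bit of @msk that is also set in @forbidden, tracking the
 * weight incrementally instead of recomputing it at the end. */
static int prune_and_count(unsigned long long *msk,
			   unsigned long long forbidden)
{
	int w = hweight64_ref(*msk);
	int i;

	for (i = 0; i < 64; i++) {
		if ((*msk & (1ULL << i)) && (forbidden & (1ULL << i))) {
			*msk &= ~(1ULL << i);
			w--;
		}
	}
	return w;
}
```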

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/intel/core.c |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2838,7 +2838,7 @@ intel_get_excl_constraints(struct cpu_hw
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
 	struct intel_excl_states *xlo;
 	int tid = cpuc->excl_thread_id;
-	int is_excl, i;
+	int is_excl, i, w;
 
 	/*
 	 * validating a group does not require
@@ -2894,36 +2894,40 @@ intel_get_excl_constraints(struct cpu_hw
 	 * SHARED   : sibling counter measuring non-exclusive event
 	 * UNUSED   : sibling counter unused
 	 */
+	w = c->weight;
 	for_each_set_bit(i, c->idxmsk, X86_PMC_IDX_MAX) {
 		/*
 		 * exclusive event in sibling counter
 		 * our corresponding counter cannot be used
 		 * regardless of our event
 		 */
-		if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
+		if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE) {
 			__clear_bit(i, c->idxmsk);
+			w--;
+			continue;
+		}
 		/*
 		 * if measuring an exclusive event, sibling
 		 * measuring non-exclusive, then counter cannot
 		 * be used
 		 */
-		if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED)
+		if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED) {
 			__clear_bit(i, c->idxmsk);
+			w--;
+			continue;
+		}
 	}
 
 	/*
-	 * recompute actual bit weight for scheduling algorithm
-	 */
-	c->weight = hweight64(c->idxmsk64);
-
-	/*
 	 * if we return an empty mask, then switch
 	 * back to static empty constraint to avoid
 	 * the cost of freeing later on
 	 */
-	if (c->weight == 0)
+	if (!w)
 		c = &emptyconstraint;
 
+	c->weight = w;
+
 	return c;
 }
 




* [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
                   ` (4 preceding siblings ...)
  2019-03-14 13:01 ` [RFC][PATCH 5/8] perf/x86/intel: Optimize intel_get_excl_constraints() Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-19 21:50   ` Stephane Eranian
  2019-03-14 13:01 ` [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events() Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

The current code unconditionally clears cpuc->event_constraint[i]
before calling get_event_constraints(.idx=i). The only site that cares
is intel_get_event_constraints() where the c1 load will always be
NULL.

However, always calling get_event_constraints() on all events is
wasteful; most times it will return the exact same result. Therefore
retain the logic in intel_get_event_constraints() and change the
generic code to only clear the constraint on put.
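The new lifecycle can be sketched with simplified stand-ins (not the kernel structs): a slot is set when a constraint is obtained, cleared only when the constraint is put, and x86_pmu_del() clears the tail slot vacated by list compaction.

```c
#include <stddef.h>

#define MAX_EVENTS 32

struct event_constraint { int weight; };

struct cpu_hw {
	struct event_constraint *event_constraint[MAX_EVENTS];
};

static void store_constraint(struct cpu_hw *cpuc, int i,
			     struct event_constraint *c)
{
	cpuc->event_constraint[i] = c;		/* set on get */
}

static void put_constraint(struct cpu_hw *cpuc, int i)
{
	cpuc->event_constraint[i] = NULL;	/* clear on put */
}

/* Models x86_pmu_del(): compact the list past @victim and clear the
 * vacated tail slot, given @n events currently in the list. */
static void del_event(struct cpu_hw *cpuc, int n, int victim)
{
	int i;

	for (i = victim + 1; i < n; i++)
		cpuc->event_constraint[i - 1] = cpuc->event_constraint[i];
	cpuc->event_constraint[n - 1] = NULL;
}
```

A retained non-NULL slot now always means a still-valid constraint, which is what the next patch builds on.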

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -858,7 +858,6 @@ int x86_schedule_events(struct cpu_hw_ev
 		x86_pmu.start_scheduling(cpuc);
 
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
-		cpuc->event_constraint[i] = NULL;
 		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
 		cpuc->event_constraint[i] = c;
 
@@ -941,6 +940,8 @@ int x86_schedule_events(struct cpu_hw_ev
 			 */
 			if (x86_pmu.put_event_constraints)
 				x86_pmu.put_event_constraints(cpuc, e);
+
+			cpuc->event_constraint[i] = NULL;
 		}
 	}
 
@@ -1404,6 +1405,7 @@ static void x86_pmu_del(struct perf_even
 		cpuc->event_list[i-1] = cpuc->event_list[i];
 		cpuc->event_constraint[i-1] = cpuc->event_constraint[i];
 	}
+	cpuc->event_constraint[i-1] = NULL;
 	--cpuc->n_events;
 
 	perf_event_update_userpage(event);




* [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events()
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
                   ` (5 preceding siblings ...)
  2019-03-14 13:01 ` [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-19 23:55   ` Stephane Eranian
  2019-03-14 13:01 ` [RFC][PATCH 8/8] perf/x86: Add sanity checks to x86_schedule_events() Peter Zijlstra
  2019-03-15  7:15 ` [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Stephane Eranian
  8 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz

Now that cpuc->event_constraint[] is retained, we can avoid calling
get_event_constraints() over and over again.
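The reuse rule the patch introduces can be sketched as a predicate (simplified types; the flag name mirrors the patch, the helper name is illustrative): a cached constraint is taken as-is unless it is absent, or it is dynamic while the HT workaround is active, in which case scheduling activity on the sibling thread may have invalidated it.

```c
#include <stdbool.h>
#include <stddef.h>

#define PERF_X86_EVENT_DYNAMIC 0x1

struct event_constraint { unsigned int flags; };

/* Decide whether x86_schedule_events() must call
 * get_event_constraints() again for this event. */
static bool must_recompute(const struct event_constraint *cached,
			   bool ht_workaround_active)
{
	if (!cached)
		return true;	/* new event: no constraint yet */
	return (cached->flags & PERF_X86_EVENT_DYNAMIC) &&
	       ht_workaround_active;
}
```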

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c       |   25 +++++++++++++++++++++----
 arch/x86/events/intel/core.c |    3 ++-
 2 files changed, 23 insertions(+), 5 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -844,6 +844,12 @@ int perf_assign_events(struct event_cons
 }
 EXPORT_SYMBOL_GPL(perf_assign_events);
 
+static inline bool is_ht_workaround_active(struct cpu_hw_events *cpuc)
+{
+	return is_ht_workaround_enabled() && !cpuc->is_fake &&
+	       READ_ONCE(cpuc->excl_cntrs->exclusive_present);
+}
+
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
 {
 	struct event_constraint *c;
@@ -858,8 +864,20 @@ int x86_schedule_events(struct cpu_hw_ev
 		x86_pmu.start_scheduling(cpuc);
 
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
-		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
-		cpuc->event_constraint[i] = c;
+		c = cpuc->event_constraint[i];
+
+		/*
+		 * Request constraints for new events; or for those events that
+		 * have a dynamic constraint due to the HT workaround -- for
+		 * those the constraint can change due to scheduling activity
+		 * on the other sibling.
+		 */
+		if (!c || ((c->flags & PERF_X86_EVENT_DYNAMIC) &&
+			   is_ht_workaround_active(cpuc))) {
+
+			c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
+			cpuc->event_constraint[i] = c;
+		}
 
 		wmin = min(wmin, c->weight);
 		wmax = max(wmax, c->weight);
@@ -903,8 +921,7 @@ int x86_schedule_events(struct cpu_hw_ev
 		 * N/2 counters can be used. This helps with events with
 		 * specific counter constraints.
 		 */
-		if (is_ht_workaround_enabled() && !cpuc->is_fake &&
-		    READ_ONCE(cpuc->excl_cntrs->exclusive_present))
+		if (is_ht_workaround_active(cpuc))
 			gpmax /= 2;
 
 		unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2945,7 +2945,8 @@ intel_get_event_constraints(struct cpu_h
 	 * - dynamic constraint: handled by intel_get_excl_constraints()
 	 */
 	c2 = __intel_get_event_constraints(cpuc, idx, event);
-	if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
+	if (c1) {
+	        WARN_ON_ONCE(!(c1->flags & PERF_X86_EVENT_DYNAMIC));
 		bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
 		c1->weight = c2->weight;
 		c2 = c1;




* [RFC][PATCH 8/8] perf/x86: Add sanity checks to x86_schedule_events()
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
                   ` (6 preceding siblings ...)
  2019-03-14 13:01 ` [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events() Peter Zijlstra
@ 2019-03-14 13:01 ` Peter Zijlstra
  2019-03-15  7:15 ` [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Stephane Eranian
  8 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-14 13:01 UTC (permalink / raw)
  To: mingo, eranian, jolsa; +Cc: linux-kernel, tonyj, nelson.dsouza, peterz


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -855,17 +855,30 @@ int x86_schedule_events(struct cpu_hw_ev
 	struct event_constraint *c;
 	unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	struct perf_event *e;
-	int i, wmin, wmax, unsched = 0;
+	int n0, i, wmin, wmax, unsched = 0;
 	struct hw_perf_event *hwc;
 
 	bitmap_zero(used_mask, X86_PMC_IDX_MAX);
 
+	/*
+	 * Compute the number of events already present; see x86_pmu_add(),
+	 * validate_group() and x86_pmu_commit_txn(). For the former two
+	 * cpuc->n_events hasn't been updated yet, while for the latter
+	 * cpuc->n_txn contains the number of events added in the current
+	 * transaction.
+	 */
+	n0 = cpuc->n_events;
+	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
+		n0 -= cpuc->n_txn;
+
 	if (x86_pmu.start_scheduling)
 		x86_pmu.start_scheduling(cpuc);
 
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
 		c = cpuc->event_constraint[i];
 
+		WARN_ON_ONCE((c && i >= n0) || (!c && i < n0));
+
 		/*
 		 * Request constraints for new events; or for those events that
 		 * have a dynamic constraint due to the HT workaround -- for
@@ -945,11 +958,7 @@ int x86_schedule_events(struct cpu_hw_ev
 				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
 		}
 	} else {
-		i = cpuc->n_events;
-		if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
-			i -= cpuc->n_txn;
-
-		for (; i < n; i++) {
+		for (i = n0; i < n; i++) {
 			e = cpuc->event_list[i];
 
 			/*




* Re: [RFC][PATCH 0/8] perf/x86: event scheduling cleanups
  2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
                   ` (7 preceding siblings ...)
  2019-03-14 13:01 ` [RFC][PATCH 8/8] perf/x86: Add sanity checks to x86_schedule_events() Peter Zijlstra
@ 2019-03-15  7:15 ` Stephane Eranian
  2019-03-15  7:15   ` Stephane Eranian
  8 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-15  7:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> These here patches are the result of chasing a bug in the TSX Force Abort
> patches.
>
> The first patch is simple and fixes the issue and should be marked /urgent and
> backported.
>
> The rest is the result of having a hard look at how all that event scheduling
> stuff worked while trying to figure out wth was going wrong.
>
> Stephane, since you were involved with writing much of that code, can you
> please have a good look? I _think_ I got it, but... :-)
>
Will do today.
But you did not send me PATCH 1/8 apparently.


* Re: [RFC][PATCH 0/8] perf/x86: event scheduling cleanups
  2019-03-15  7:15 ` [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Stephane Eranian
@ 2019-03-15  7:15   ` Stephane Eranian
  2019-03-15  8:01     ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-15  7:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Fri, Mar 15, 2019 at 12:15 AM Stephane Eranian <eranian@google.com> wrote:
>
> On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > These here patches are the result of chasing a bug in the TSX Force Abort
> > patches.
> >
> > The first patch is simple and fixes the issue and should be marked /urgent and
> > backported.
> >
> > The rest is the result of having a hard look at how all that event scheduling
> > stuff worked while trying to figure out wth was going wrong.
> >
> > Stephane, since you were involved with writing much of that code, can you
> > please have a good look? I _think_ I got it, but... :-)
> >
> Will do today.
> But you did not send me PATCH 1/8 apparently.

Correction : I got it. Mail client had just moved it somewhere else!


* Re: [RFC][PATCH 0/8] perf/x86: event scheduling cleanups
  2019-03-15  7:15   ` Stephane Eranian
@ 2019-03-15  8:01     ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-15  8:01 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Fri, Mar 15, 2019 at 12:15:55AM -0700, Stephane Eranian wrote:
> On Fri, Mar 15, 2019 at 12:15 AM Stephane Eranian <eranian@google.com> wrote:
> >
> > On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > These here patches are the result of chasing a bug in the TSX Force Abort
> > > patches.
> > >
> > > The first patch is simple and fixes the issue and should be marked /urgent and
> > > backported.
> > >
> > > The rest is the result of having a hard look at how all that event scheduling
> > > stuff worked while trying to figure out wth was going wrong.
> > >
> > > Stephane, since you were involved with writing much of that code, can you
> > > please have a good look? I _think_ I got it, but... :-)
> > >
> > Will do today.
> > But you did not send me PATCH 1/8 apparently.
> 
> Correction : I got it. Mail client had just moved it somewhere else!

Also please look at all 8, I'm reasonably sure 1/8 is correct, but it is
rather more ugly than I'd wish. So the remaining 7 clean it all up (and
effectively revert 1 again) but are much more invasive.


* [tip:perf/urgent] perf/x86/intel: Fix memory corruption
  2019-03-14 13:01 ` [PATCH 1/8] perf/x86/intel: Fix memory corruption Peter Zijlstra
@ 2019-03-15 11:29   ` tip-bot for Peter Zijlstra
  2019-03-19  6:29   ` [PATCH 1/8] " Stephane Eranian
  1 sibling, 0 replies; 48+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-03-15 11:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, linux-kernel, peterz, hpa, tglx, nelson.dsouza, tonyj

Commit-ID:  ede271b059463731cbd6dffe55ffd70d7dbe8392
Gitweb:     https://git.kernel.org/tip/ede271b059463731cbd6dffe55ffd70d7dbe8392
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 14 Mar 2019 14:01:14 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 15 Mar 2019 12:22:51 +0100

perf/x86/intel: Fix memory corruption

Through:

  validate_event()
    x86_pmu.get_event_constraints(.idx=-1)
      tfa_get_event_constraints()
        dyn_constraint()

cpuc->constraint_list[-1] is used, which is an obvious out-of-bound access.

In this case, simply skip the TFA constraint code, there is no event
constraint with just PMC3, therefore the code will never result in the
empty set.

Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
Reported-by: Tony Jones <tonyj@suse.com>
Reported-by: "DSouza, Nelson" <nelson.dsouza@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Jones <tonyj@suse.com>
Tested-by: "DSouza, Nelson" <nelson.dsouza@intel.com>
Cc: eranian@google.com
Cc: jolsa@redhat.com
Cc: stable@kernel.org
Link: https://lkml.kernel.org/r/20190314130705.441549378@infradead.org

---
 arch/x86/events/intel/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 35102ecdfc8d..92dfeb343a6a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3410,7 +3410,7 @@ tfa_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 	/*
 	 * Without TFA we must not use PMC3.
 	 */
-	if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
+	if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
 		c = dyn_constraint(cpuc, c, idx);
 		c->idxmsk64 &= ~(1ULL << 3);
 		c->weight--;


* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-14 13:01 ` [PATCH 1/8] perf/x86/intel: Fix memory corruption Peter Zijlstra
  2019-03-15 11:29   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
@ 2019-03-19  6:29   ` Stephane Eranian
  2019-03-19 11:05     ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19  6:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Through:
>
>   validate_event()
>     x86_pmu.get_event_constraints(.idx=-1)
>       tfa_get_event_constraints()
>         dyn_constraint()
>
> We use cpuc->constraint_list[-1], which is an obvious out-of-bound
> access.
>
> In this case, simply skip the TFA constraint code, there is no event
> constraint with just PMC3, therefore the code will never result in the
> empty set.
>
> Reported-by: Tony Jones <tonyj@suse.com>
> Reported-by: "DSouza, Nelson" <nelson.dsouza@intel.com>
> Tested-by: Tony Jones <tonyj@suse.com>
> Tested-by: "DSouza, Nelson" <nelson.dsouza@intel.com>
> Cc: stable@kernel.org
> Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/events/intel/core.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3410,7 +3410,7 @@ tfa_get_event_constraints(struct cpu_hw_
>         /*
>          * Without TFA we must not use PMC3.
>          */
> -       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
> +       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
>                 c = dyn_constraint(cpuc, c, idx);
>                 c->idxmsk64 &= ~(1ULL << 3);
>                 c->weight--;
>
>
I was not cc'd on the patch that added allow_tsx_force_abort, so I
will give some comments here.
If I understand the goal of the control parameter, it is to turn
on/off the TFA workaround and thus determine whether or not PMC3 is
available. I don't know why you would need to make this a runtime
tunable. That seems a bit dodgy. But given the code you have here
right now, we have to deal with it. A sysadmin could flip the control
at any time, including when PMC3 is already in use by some events. I
do not see the code that schedules out all the events on all CPUs
once PMC3 becomes unavailable. You cannot just rely on the next
context-switch or timer tick for multiplexing.


* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-19  6:29   ` [PATCH 1/8] " Stephane Eranian
@ 2019-03-19 11:05     ` Peter Zijlstra
  2019-03-19 17:52       ` Stephane Eranian
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-19 11:05 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Mon, Mar 18, 2019 at 11:29:25PM -0700, Stephane Eranian wrote:

> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -3410,7 +3410,7 @@ tfa_get_event_constraints(struct cpu_hw_
> >         /*
> >          * Without TFA we must not use PMC3.
> >          */
> > -       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
> > +       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
> >                 c = dyn_constraint(cpuc, c, idx);
> >                 c->idxmsk64 &= ~(1ULL << 3);
> >                 c->weight--;
> >
> >

> I was not cc'd on the patch that added  allow_tsx_force_abort, so I

Yeah, that never was public :-( I didn't particularly like that, but
that's the way it is.

> will give some comments here.

> If I understand the goal of the control parameter it is to turn on/off
> the TFA workaround and thus determine whether or not PMC3 is
> available. I don't know why you would need to make this a runtime
> tunable.

Not quite; the control on its own doesn't directly write the MSR. And
even when the work-around is allowed, we'll not set the MSR unless there
is also demand for PMC3.

It is a runtime tunable because boot parameters suck.

> That seems a bit dodgy. But given the code you have here right now, we
> have to deal with it. A sysadmin could flip the control at any time,
> including when PMC3 is already in used by some events. I do not see
> the code that schedules out all the events on all CPUs once PMC3
> becomes unavailable. You cannot just rely on the next context-switch
> or timer tick for multiplexing.

Yeah, meh. You're admin, you can 'fix' it. In practice I don't expect
most people to care about the knob, and the few people that do, should
be able to make it work.


* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-19 11:05     ` Peter Zijlstra
@ 2019-03-19 17:52       ` Stephane Eranian
  2019-03-19 18:20         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19 17:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 4:05 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 18, 2019 at 11:29:25PM -0700, Stephane Eranian wrote:
>
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -3410,7 +3410,7 @@ tfa_get_event_constraints(struct cpu_hw_
> > >         /*
> > >          * Without TFA we must not use PMC3.
> > >          */
> > > -       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
> > > +       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
> > >                 c = dyn_constraint(cpuc, c, idx);
> > >                 c->idxmsk64 &= ~(1ULL << 3);
> > >                 c->weight--;
> > >
> > >
>
> > I was not cc'd on the patch that added  allow_tsx_force_abort, so I
>
> Yeah, that never was public :-( I didn't particularly like that, but
> that's the way it is.
>
> > will give some comments here.
>
> > If I understand the goal of the control parameter it is to turn on/off
> > the TFA workaround and thus determine whether or not PMC3 is
> > available. I don't know why you would need to make this a runtime
> > tunable.
>
> Not quite; the control on its own doesn't directly write the MSR. And
> even when the work-around is allowed, we'll not set the MSR unless there
> is also demand for PMC3.
>
Trying to understand this better here. When the workaround is enabled
(tfa=0), you lose PMC3 and transactions operate normally. When it is
disabled (tfa=1), transactions are all aborted and PMC3 is available.
If you are saying that when there is a PMU event requesting PMC3, then
you need PMC3 avail, so you set the MSR so that tfa=1, forcing all
transactions to abort. But in that case, you are modifying the
execution of the workload when you are monitoring it, assuming it uses
TSX. You want lowest overhead and no modifications to how the workload
operates, otherwise how representative is the data you are collecting?
I understand that there is no impact on apps not using TSX, well,
except on context switch where you have to toggle that MSR. But for
workloads using TSX, there is potentially an impact.

> It is a runtime tunable because boot parameters suck.
>
> > That seems a bit dodgy. But given the code you have here right now, we
> > have to deal with it. A sysadmin could flip the control at any time,
> > including when PMC3 is already in use by some events. I do not see
> > the code that schedules out all the events on all CPUs once PMC3
> > becomes unavailable. You cannot just rely on the next context-switch
> > or timer tick for multiplexing.
>
> Yeah, meh. You're admin, you can 'fix' it. In practice I don't expect
> most people to care about the knob, and the few people that do, should
> be able to make it work.

I don't understand how this can work reliably. You have a knob to toggle
that MSR. Then, you have another one inside perf_events and then the sysadmin
has to make sure nobody (incl. NMI watchdog) is using the PMU when
this all happens.
How can this be a practical solution? Am I missing something here?


* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-19 17:52       ` Stephane Eranian
@ 2019-03-19 18:20         ` Peter Zijlstra
  2019-03-20 20:47           ` Stephane Eranian
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-19 18:20 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 10:52:01AM -0700, Stephane Eranian wrote:
> > Not quite; the control on its own doesn't directly write the MSR. And
> > even when the work-around is allowed, we'll not set the MSR unless there
> > is also demand for PMC3.
> >
> Trying to understand this better here. When the workaround is enabled
> (tfa=0), you lose PMC3 and transactions operate normally.

> When it is disabled (tfa=1), transactions are all aborted and PMC3 is
> available.

Right, but we don't expose tfa.

> If you are saying that when there is a PMU event requesting PMC3, then
> you need PMC3 avail, so you set the MSR so that tfa=1 forcing all
> transactions to abort.

Right, so when allow_tfa=1 (default), we only set tfa=1 when PMC3 is
requested.

This has the advantage that,

  TSX only workload -> works
  perf 4 counters -> works

Only when you need both of them, do you get 'trouble'.

> But in that case, you are modifying the execution of the workload when
> you are monitoring it, assuming it uses TSX.

We assume you are not in fact using TSX, not a lot of code does. If you
do use TSX a lot, and you don't want to interfere, you have to set
allow_tfa=0 and live with one counter less.

Any which way around you turn this stone, it sucks.

> You want lowest overhead and no modifications to how the workload
> operates, otherwise how representative is the data you are collecting?

Sure; but there are no good choices here. This 'fix' will break
something. We figured TSX+4-counter-perf was the least common scenario.

We know of people that rely on 4 counters being present; do you want to
explain to them why, after an update, their program suddenly doesn't
work anymore?

Or you want to default to tfa=1; but then you have to explain to those
people relying on TSX why their workload stopped working.

> I understand that there is no impact on apps not using TSX, well,
> except on context switch where you have to toggle that MSR.

There is no additional code in the context switch; only the perf event
scheduling code prods at the MSR.

> But for workloads using TSX, there is potentially an impact.

Yes, well, if you're a TSX _and_ perf user, you now have an extra knob
to play with; that's not something I can do anything about. We're forced
to make a choice here.

> > Yeah, meh. You're admin, you can 'fix' it. In practice I don't expect
> > most people to care about the knob, and the few people that do, should
> > be able to make it work.
> 
> I don't understand how this can work reliably.

> You have a knob to toggle that MSR.

No, we don't have this knob.

> Then, you have another one inside perf_events

Only this one knob, allow_tfa, exists.

> and then the sysadmin has to make sure nobody (incl. NMI watchdog) is
> using the PMU when this all happens.

You're very unlucky if the watchdog runs on PMC3, normally it runs on
Fixed1 or something. Esp early after boot. (Remember, we schedule fixed
counters first, and then general purpose counters, with a preference for
lower counters).

Anyway, you can trivially switch it off if you want.

> How can this be a practical solution? Am I missing something here?

It works just fine; it is unfortunate that we have this interaction but
that's not something we can do anything about. We're forced to deal with
this.

But if you're a TSX+perf user, have your boot scripts do:

  echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort

and you'll not use PMC3 and TSX will be 'awesome'. If you don't give a
crap about TSX (most people), just boot and be happy.

If you do care about TSX+perf and want to dynamically toggle for some
reason, you just have to be a little careful.


* Re: [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED
  2019-03-14 13:01 ` [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED Peter Zijlstra
@ 2019-03-19 20:48   ` Stephane Eranian
  2019-03-19 21:00     ` Peter Zijlstra
  2019-03-20 12:23     ` Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19 20:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The flag PERF_X86_EVENT_COMMITTED is used to find uncommitted events
> for which to call put_event_constraint() when scheduling fails.
>
> > These are the newly added events to the list, and must form, by
> > definition, the tail of cpuc->event_list[]. By computing the list
> > index of the last successful schedule, iteration can start there
> > and the flag is redundant.
>
> There are only 3 callers of x86_schedule_events(), notably:
>
>  - x86_pmu_add()
>  - x86_pmu_commit_txn()
>  - validate_group()
>
> For x86_pmu_add(), cpuc->n_events isn't updated until after
> schedule_events() succeeds, therefore cpuc->n_events points to the
> desired index.
>
Correct.

> For x86_pmu_commit_txn(), cpuc->n_events is updated, but we can
> trivially compute the desired value with cpuc->n_txn -- the number of
> events added in this transaction.
>
I suggest you put this explanation in the code so that it is easier to
understand.

> For validate_group(), we can make the rule for x86_pmu_add() work by
> simply setting cpuc->n_events to 0 before calling schedule_events().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Reviewed-by: Stephane Eranian <eranian@google.com>

> ---
>  arch/x86/events/core.c       |   21 ++++++---------------
>  arch/x86/events/perf_event.h |    1 -
>  2 files changed, 6 insertions(+), 16 deletions(-)
>
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -925,19 +925,16 @@ int x86_schedule_events(struct cpu_hw_ev
>         if (!unsched && assign) {
>                 for (i = 0; i < n; i++) {
>                         e = cpuc->event_list[i];
> -                       e->hw.flags |= PERF_X86_EVENT_COMMITTED;
>                         if (x86_pmu.commit_scheduling)
>                                 x86_pmu.commit_scheduling(cpuc, i, assign[i]);
>                 }
>         } else {
> -               for (i = 0; i < n; i++) {
> +               i = cpuc->n_events;
> +               if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
> +                       i -= cpuc->n_txn;
> +
> +               for (; i < n; i++) {
>                         e = cpuc->event_list[i];
> -                       /*
> -                        * do not put_constraint() on comitted events,
> -                        * because they are good to go
> -                        */
> -                       if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
> -                               continue;
>
>                         /*
>                          * release events that failed scheduling
> @@ -1372,11 +1369,6 @@ static void x86_pmu_del(struct perf_even
>         int i;
>
>         /*
> -        * event is descheduled
> -        */
> -       event->hw.flags &= ~PERF_X86_EVENT_COMMITTED;
> -
> -       /*
>          * If we're called during a txn, we only need to undo x86_pmu.add.
>          * The events never got scheduled and ->cancel_txn will truncate
>          * the event_list.
> @@ -2079,8 +2071,7 @@ static int validate_group(struct perf_ev
>         if (n < 0)
>                 goto out;
>
> -       fake_cpuc->n_events = n;
> -
> +       fake_cpuc->n_events = 0;
>         ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
>
>  out:
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -61,7 +61,6 @@ struct event_constraint {
>  #define PERF_X86_EVENT_PEBS_LDLAT      0x0001 /* ld+ldlat data address sampling */
>  #define PERF_X86_EVENT_PEBS_ST         0x0002 /* st data address sampling */
>  #define PERF_X86_EVENT_PEBS_ST_HSW     0x0004 /* haswell style datala, store */
> -#define PERF_X86_EVENT_COMMITTED       0x0008 /* event passed commit_txn */

I would put a placeholder saying that bit 3 is available, or renumber
the other masks below.

>  #define PERF_X86_EVENT_PEBS_LD_HSW     0x0010 /* haswell style datala, load */
>  #define PERF_X86_EVENT_PEBS_NA_HSW     0x0020 /* haswell style datala, unknown */
>  #define PERF_X86_EVENT_EXCL            0x0040 /* HT exclusivity on counter */
>
>


* Re: [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED
  2019-03-19 20:48   ` Stephane Eranian
@ 2019-03-19 21:00     ` Peter Zijlstra
  2019-03-20 13:14       ` Peter Zijlstra
  2019-03-20 12:23     ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-19 21:00 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 01:48:18PM -0700, Stephane Eranian wrote:
> On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > For x86_pmu_commit_txn(), cpuc->n_events is updated, but we can
> > trivially compute the desired value with cpuc->n_txn -- the number of
> > events added in this transaction.
> >
> I suggest you put this explanation in the code so that it is easier to
> understand.

Right, I actually attempted writing that comment a few times but every
time it became a mess. I'll try again, because you're quite right, this
is a bit magical.

> > -               for (i = 0; i < n; i++) {
> > +               i = cpuc->n_events;
> > +               if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
> > +                       i -= cpuc->n_txn;
> > +
> > +               for (; i < n; i++) {


* Re: [RFC][PATCH 3/8] perf/x86: Simplify x86_pmu.get_constraints() interface
  2019-03-14 13:01 ` [RFC][PATCH 3/8] perf/x86: Simplify x86_pmu.get_constraints() interface Peter Zijlstra
@ 2019-03-19 21:21   ` Stephane Eranian
  0 siblings, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19 21:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> There is a special case for validate_events() where we'll call
> x86_pmu.get_constraints(.idx=-1). Its purpose, until recently, seems
> to be to avoid taking a previous constraint from
> cpuc->event_constraint[] in intel_get_event_constraints().
>
> (I could not find any other get_event_constraints() implementation
> using @idx)
>
> However, since that cpuc is freshly allocated, that array will in fact
> be initialized with NULL pointers, achieving the very same effect.
>
> Therefore remove this exception.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Looks good to me.
Reviewed-by: Stephane Eranian <eranian@google.com>

> ---
>  arch/x86/events/core.c       |    2 +-
>  arch/x86/events/intel/core.c |    8 +++-----
>  2 files changed, 4 insertions(+), 6 deletions(-)
>
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2031,7 +2031,7 @@ static int validate_event(struct perf_ev
>         if (IS_ERR(fake_cpuc))
>                 return PTR_ERR(fake_cpuc);
>
> -       c = x86_pmu.get_event_constraints(fake_cpuc, -1, event);
> +       c = x86_pmu.get_event_constraints(fake_cpuc, 0, event);
>
>         if (!c || !c->weight)
>                 ret = -EINVAL;
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2931,11 +2931,9 @@ static struct event_constraint *
>  intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>                             struct perf_event *event)
>  {
> -       struct event_constraint *c1 = NULL;
> -       struct event_constraint *c2;
> +       struct event_constraint *c1, *c2;
>
> -       if (idx >= 0) /* fake does < 0 */
> -               c1 = cpuc->event_constraint[idx];
> +       c1 = cpuc->event_constraint[idx];
>
>         /*
>          * first time only
> @@ -3410,7 +3408,7 @@ tfa_get_event_constraints(struct cpu_hw_
>         /*
>          * Without TFA we must not use PMC3.
>          */
> -       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk) && idx >= 0) {
> +       if (!allow_tsx_force_abort && test_bit(3, c->idxmsk)) {
>                 c = dyn_constraint(cpuc, c, idx);
>                 c->idxmsk64 &= ~(1ULL << 3);
>                 c->weight--;
>
>


* Re: [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put
  2019-03-14 13:01 ` [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put Peter Zijlstra
@ 2019-03-19 21:50   ` Stephane Eranian
  2019-03-20 12:25     ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19 21:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> The current code unconditionally clears cpuc->event_constraint[i]
> before calling get_event_constraints(.idx=i). The only site that cares
> is intel_get_event_constraints() where the c1 load will always be
> NULL.
>
> However, always calling get_event_constraints() on all events is
> wasteful; most times it will return the exact same result. Therefore
> retain the logic in intel_get_event_constraints() and change the
> generic code to only clear the constraint on put.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I thought there was caching of the constraint in case it was static
(or unconstrained), to avoid walking the constraint table each time
between invocations on the same group_sched_in() call. But the way the
c1 vs. c2 logic is written I don't see it. In which case, this could
be another opportunity.

Looks good to me.

Reviewed-by: Stephane Eranian <eranian@google.com>

> ---
>  arch/x86/events/core.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -858,7 +858,6 @@ int x86_schedule_events(struct cpu_hw_ev
>                 x86_pmu.start_scheduling(cpuc);
>
>         for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> -               cpuc->event_constraint[i] = NULL;
>                 c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
>                 cpuc->event_constraint[i] = c;
>
> @@ -941,6 +940,8 @@ int x86_schedule_events(struct cpu_hw_ev
>                          */
>                         if (x86_pmu.put_event_constraints)
>                                 x86_pmu.put_event_constraints(cpuc, e);
> +
> +                       cpuc->event_constraint[i] = NULL;
>                 }
>         }
>
> @@ -1404,6 +1405,7 @@ static void x86_pmu_del(struct perf_even
>                 cpuc->event_list[i-1] = cpuc->event_list[i];
>                 cpuc->event_constraint[i-1] = cpuc->event_constraint[i];
>         }
> +       cpuc->event_constraint[i-1] = NULL;
>         --cpuc->n_events;
>
>         perf_event_update_userpage(event);
>
>


* Re: [RFC][PATCH 5/8] perf/x86/intel: Optimize intel_get_excl_constraints()
  2019-03-14 13:01 ` [RFC][PATCH 5/8] perf/x86/intel: Optimize intel_get_excl_constraints() Peter Zijlstra
@ 2019-03-19 23:43   ` Stephane Eranian
  0 siblings, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19 23:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Avoid the POPCNT by noting we can decrement the weight for each
> cleared bit.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Looks good to me.
Reviewed-by: Stephane Eranian <eranian@google.com>

> ---
>  arch/x86/events/intel/core.c |   22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2838,7 +2838,7 @@ intel_get_excl_constraints(struct cpu_hw
>         struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
>         struct intel_excl_states *xlo;
>         int tid = cpuc->excl_thread_id;
> -       int is_excl, i;
> +       int is_excl, i, w;
>
>         /*
>          * validating a group does not require
> @@ -2894,36 +2894,40 @@ intel_get_excl_constraints(struct cpu_hw
>          * SHARED   : sibling counter measuring non-exclusive event
>          * UNUSED   : sibling counter unused
>          */
> +       w = c->weight;
>         for_each_set_bit(i, c->idxmsk, X86_PMC_IDX_MAX) {
>                 /*
>                  * exclusive event in sibling counter
>                  * our corresponding counter cannot be used
>                  * regardless of our event
>                  */
> -               if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
> +               if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE) {
>                         __clear_bit(i, c->idxmsk);
> +                       w--;
> +                       continue;
> +               }
>                 /*
>                  * if measuring an exclusive event, sibling
>                  * measuring non-exclusive, then counter cannot
>                  * be used
>                  */
> -               if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED)
> +               if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED) {
>                         __clear_bit(i, c->idxmsk);
> +                       w--;
> +                       continue;
> +               }
>         }
>
>         /*
> -        * recompute actual bit weight for scheduling algorithm
> -        */
> -       c->weight = hweight64(c->idxmsk64);
> -
> -       /*
>          * if we return an empty mask, then switch
>          * back to static empty constraint to avoid
>          * the cost of freeing later on
>          */
> -       if (c->weight == 0)
> +       if (!w)
>                 c = &emptyconstraint;
>
> +       c->weight = w;
> +
>         return c;
>  }
>
>
>


* Re: [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events()
  2019-03-14 13:01 ` [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events() Peter Zijlstra
@ 2019-03-19 23:55   ` Stephane Eranian
  2019-03-20 13:11     ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-19 23:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Now that cpuc->event_constraint[] is retained, we can avoid calling
> get_event_constraints() over and over again.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/events/core.c       |   25 +++++++++++++++++++++----
>  arch/x86/events/intel/core.c |    3 ++-
>  2 files changed, 23 insertions(+), 5 deletions(-)
>
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -844,6 +844,12 @@ int perf_assign_events(struct event_cons
>  }
>  EXPORT_SYMBOL_GPL(perf_assign_events);
>
> +static inline bool is_ht_workaround_active(struct cpu_hw_events *cpuc)
> +{
> +       return is_ht_workaround_enabled() && !cpuc->is_fake &&
> +              READ_ONCE(cpuc->excl_cntrs->exclusive_present);
> +}
> +
>  int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>  {
>         struct event_constraint *c;
> @@ -858,8 +864,20 @@ int x86_schedule_events(struct cpu_hw_ev
>                 x86_pmu.start_scheduling(cpuc);
>
>         for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> -               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> -               cpuc->event_constraint[i] = c;
> +               c = cpuc->event_constraint[i];
> +
> +               /*
> +                * Request constraints for new events; or for those events that
> +                * have a dynamic constraint due to the HT workaround -- for
> +                * those the constraint can change due to scheduling activity
> +                * on the other sibling.
> +                */
> +               if (!c || ((c->flags & PERF_X86_EVENT_DYNAMIC) &&
> +                          is_ht_workaround_active(cpuc))) {
> +
> +                       c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> +                       cpuc->event_constraint[i] = c;
> +               }
On this one, I think there may be a problem with events with
shared_regs constraints. The constraint is dynamic as it depends on
other events which share the same MSR, yet it is not marked as
DYNAMIC. But this may be okay because these other events are all on
the same CPU and thus scheduled during the same ctx_sched_in(). Yet
with the swapping in intel_alt_er(), we need to double-check that we
cannot reuse a constraint which could be stale. I believe this is
okay, just double-check.

>
>                 wmin = min(wmin, c->weight);
>                 wmax = max(wmax, c->weight);
> @@ -903,8 +921,7 @@ int x86_schedule_events(struct cpu_hw_ev
>                  * N/2 counters can be used. This helps with events with
>                  * specific counter constraints.
>                  */
> -               if (is_ht_workaround_enabled() && !cpuc->is_fake &&
> -                   READ_ONCE(cpuc->excl_cntrs->exclusive_present))
> +               if (is_ht_workaround_active(cpuc))
>                         gpmax /= 2;
>
>                 unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2945,7 +2945,8 @@ intel_get_event_constraints(struct cpu_h
>          * - dynamic constraint: handled by intel_get_excl_constraints()
>          */
>         c2 = __intel_get_event_constraints(cpuc, idx, event);
> -       if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
> +       if (c1) {
> +               WARN_ON_ONCE(!(c1->flags & PERF_X86_EVENT_DYNAMIC));
>                 bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
>                 c1->weight = c2->weight;
>                 c2 = c1;
>
>


* Re: [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED
  2019-03-19 20:48   ` Stephane Eranian
  2019-03-19 21:00     ` Peter Zijlstra
@ 2019-03-20 12:23     ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-20 12:23 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

Subject: perf/x86: Remove PERF_X86_EVENT_COMMITTED
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Mar 14 12:58:52 CET 2019

The flag PERF_X86_EVENT_COMMITTED is used to find uncommitted events
for which to call put_event_constraint() when scheduling fails.

These are the newly added events to the list, and must form, by
definition, the tail of cpuc->event_list[]. By computing the list
index of the last successful schedule, iteration can start there
and the flag is redundant.

There are only 3 callers of x86_schedule_events(), notably:

 - x86_pmu_add()
 - x86_pmu_commit_txn()
 - validate_group()

For x86_pmu_add(), cpuc->n_events isn't updated until after
schedule_events() succeeds, therefore cpuc->n_events points to the
desired index.

For x86_pmu_commit_txn(), cpuc->n_events is updated, but we can
trivially compute the desired value with cpuc->n_txn -- the number of
events added in this transaction.

For validate_group(), we can make the rule for x86_pmu_add() work by
simply setting cpuc->n_events to 0 before calling schedule_events().

Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/core.c       |   28 +++++++++++++---------------
 arch/x86/events/perf_event.h |   19 +++++++++----------
 2 files changed, 22 insertions(+), 25 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -925,19 +925,23 @@ int x86_schedule_events(struct cpu_hw_ev
 	if (!unsched && assign) {
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
-			e->hw.flags |= PERF_X86_EVENT_COMMITTED;
 			if (x86_pmu.commit_scheduling)
 				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
 		}
 	} else {
-		for (i = 0; i < n; i++) {
+		/*
+		 * In a transaction cpuc->n_events is already updated, but we
+		 * can use cpuc->n_txn to know how many new events there are.
+		 *
+		 * Outside of a transaction, cpuc->n_events is not yet updated,
+		 * and indicates how many events are scheduled.
+		 */
+		i = cpuc->n_events;
+		if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
+			i -= cpuc->n_txn;
+
+		for (; i < n; i++) {
 			e = cpuc->event_list[i];
-			/*
-			 * do not put_constraint() on comitted events,
-			 * because they are good to go
-			 */
-			if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
-				continue;
 
 			/*
 			 * release events that failed scheduling
@@ -1372,11 +1376,6 @@ static void x86_pmu_del(struct perf_even
 	int i;
 
 	/*
-	 * event is descheduled
-	 */
-	event->hw.flags &= ~PERF_X86_EVENT_COMMITTED;
-
-	/*
 	 * If we're called during a txn, we only need to undo x86_pmu.add.
 	 * The events never got scheduled and ->cancel_txn will truncate
 	 * the event_list.
@@ -2079,8 +2078,7 @@ static int validate_group(struct perf_ev
 	if (n < 0)
 		goto out;
 
-	fake_cpuc->n_events = n;
-
+	fake_cpuc->n_events = 0;
 	ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
 
 out:
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -55,22 +55,21 @@ struct event_constraint {
 	int	overlap;
 	int	flags;
 };
+
 /*
  * struct hw_perf_event.flags flags
  */
 #define PERF_X86_EVENT_PEBS_LDLAT	0x0001 /* ld+ldlat data address sampling */
 #define PERF_X86_EVENT_PEBS_ST		0x0002 /* st data address sampling */
 #define PERF_X86_EVENT_PEBS_ST_HSW	0x0004 /* haswell style datala, store */
-#define PERF_X86_EVENT_COMMITTED	0x0008 /* event passed commit_txn */
-#define PERF_X86_EVENT_PEBS_LD_HSW	0x0010 /* haswell style datala, load */
-#define PERF_X86_EVENT_PEBS_NA_HSW	0x0020 /* haswell style datala, unknown */
-#define PERF_X86_EVENT_EXCL		0x0040 /* HT exclusivity on counter */
-#define PERF_X86_EVENT_DYNAMIC		0x0080 /* dynamic alloc'd constraint */
-#define PERF_X86_EVENT_RDPMC_ALLOWED	0x0100 /* grant rdpmc permission */
-#define PERF_X86_EVENT_EXCL_ACCT	0x0200 /* accounted EXCL event */
-#define PERF_X86_EVENT_AUTO_RELOAD	0x0400 /* use PEBS auto-reload */
-#define PERF_X86_EVENT_LARGE_PEBS	0x0800 /* use large PEBS */
-
+#define PERF_X86_EVENT_PEBS_LD_HSW	0x0008 /* haswell style datala, load */
+#define PERF_X86_EVENT_PEBS_NA_HSW	0x0010 /* haswell style datala, unknown */
+#define PERF_X86_EVENT_EXCL		0x0020 /* HT exclusivity on counter */
+#define PERF_X86_EVENT_DYNAMIC		0x0040 /* dynamic alloc'd constraint */
+#define PERF_X86_EVENT_RDPMC_ALLOWED	0x0080 /* grant rdpmc permission */
+#define PERF_X86_EVENT_EXCL_ACCT	0x0100 /* accounted EXCL event */
+#define PERF_X86_EVENT_AUTO_RELOAD	0x0200 /* use PEBS auto-reload */
+#define PERF_X86_EVENT_LARGE_PEBS	0x0400 /* use large PEBS */
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */


* Re: [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put
  2019-03-19 21:50   ` Stephane Eranian
@ 2019-03-20 12:25     ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-20 12:25 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 02:50:21PM -0700, Stephane Eranian wrote:
> On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > The current code unconditionally clears cpuc->event_constraint[i]
> > before calling get_event_constraints(.idx=i). The only site that cares
> > is intel_get_event_constraints() where the c1 load will always be
> > NULL.
> >
> > However, always calling get_event_constraints() on all events is
> > wasteful; most times it will return the exact same result. Therefore
> > retain the logic in intel_get_event_constraints() and change the
> > generic code to only clear the constraint on put.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> I thought there was caching of the constraint in case it was static
> (or unconstrained)

Yeah, I was under the same impression :-/

> to avoid walking the constraint table each time between invocations
> on the same group_sched_in() call. But the way the c1 vs. c2 logic
> is written I don't see it. In which case, this could be another opportunity.

Right, the next patch attempts this.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events()
  2019-03-19 23:55   ` Stephane Eranian
@ 2019-03-20 13:11     ` Peter Zijlstra
  2019-03-20 19:30       ` Stephane Eranian
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-20 13:11 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 04:55:16PM -0700, Stephane Eranian wrote:
> On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > @@ -858,8 +864,20 @@ int x86_schedule_events(struct cpu_hw_ev
> >                 x86_pmu.start_scheduling(cpuc);
> >
> >         for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> > -               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> > -               cpuc->event_constraint[i] = c;
> > +               c = cpuc->event_constraint[i];
> > +
> > +               /*
> > +                * Request constraints for new events; or for those events that
> > +                * have a dynamic constraint due to the HT workaround -- for
> > +                * those the constraint can change due to scheduling activity
> > +                * on the other sibling.
> > +                */
> > +               if (!c || ((c->flags & PERF_X86_EVENT_DYNAMIC) &&
> > +                          is_ht_workaround_active(cpuc))) {
> > +
> > +                       c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> > +                       cpuc->event_constraint[i] = c;
> > +               }

> On this one, I think there may be a problem with events with
> shared_regs constraints.

Hmm...

> Constraint is dynamic as it depends on other events which share the
> same MSR, yet it is not marked as DYNAMIC.

it returns &emptyconstraint or a table constraint, depending on register
state.

> But this may be okay because these other events are all on the same
> CPU and thus scheduled during the same ctx_sched_in(). Yet with the
> swapping in intel_alt_er(), we need to double-check that we cannot
> reuse a constraint which could be stale.

> I believe this is okay, just double-check.

I'm not sure I see a problem.

So if we're the first event on a shared register, we claim the register
and scheduling succeeds (barring other constraints).

If we're the second event on a shared register (and have conflicting
register state), we get the empty constraint. This _will_ cause
scheduling to fail. We'll not cache the state and punt it back to the
core code.

So no future scheduling pass will come to see a shared reg constraint
that could've changed.

Now, there is indeed the intel_alt_er() thing, which slightly
complicates this; then suppose we schedule an event on RSP0, another on
RSP1, then remove the RSP0 one. Even in that case, the remaining RSP1
event will not change its constraint, since intel_fixup_er() rewrites
the event to be a native RSP1 event.

So that too reduces to the prior case.


That said; I have simplified the above condition to:

@@ -858,8 +858,17 @@ int x86_schedule_events(struct cpu_hw_ev
 		x86_pmu.start_scheduling(cpuc);
 
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
-		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
-		cpuc->event_constraint[i] = c;
+		c = cpuc->event_constraint[i];
+
+		/*
+		 * Request constraints for new events; or for those events that
+		 * have a dynamic constraint -- for those the constraint can
+		 * change due to external factors (sibling state, allow_tfa).
+		 */
+		if (!c || (c->flags & PERF_X86_EVENT_DYNAMIC)) {
+			c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
+			cpuc->event_constraint[i] = c;
+		}
 
 		wmin = min(wmin, c->weight);
 		wmax = max(wmax, c->weight);

Because any dynamic event can change from one moment to the next.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED
  2019-03-19 21:00     ` Peter Zijlstra
@ 2019-03-20 13:14       ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-20 13:14 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 10:00:18PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 19, 2019 at 01:48:18PM -0700, Stephane Eranian wrote:
> > On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > For x86_pmu_commit_txn(), cpuc->n_events is updated, but we can
> > > trivially compute the desired value with cpuc->n_txn -- the number of
> > > events added in this transaction.
> > >
> > I suggest you put this explanation in the code so that it is easier to
> > understand.
> 
> Right, I actually attempted writing that comment a few times but every
> time it became a mess. I'll try again, because you're quite right, this
> is a bit magical.

Ha! I found the comment, it's in patch 8 :-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events()
  2019-03-20 13:11     ` Peter Zijlstra
@ 2019-03-20 19:30       ` Stephane Eranian
  0 siblings, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2019-03-20 19:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Wed, Mar 20, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 19, 2019 at 04:55:16PM -0700, Stephane Eranian wrote:
> > On Thu, Mar 14, 2019 at 6:11 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > > @@ -858,8 +864,20 @@ int x86_schedule_events(struct cpu_hw_ev
> > >                 x86_pmu.start_scheduling(cpuc);
> > >
> > >         for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> > > -               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> > > -               cpuc->event_constraint[i] = c;
> > > +               c = cpuc->event_constraint[i];
> > > +
> > > +               /*
> > > +                * Request constraints for new events; or for those events that
> > > +                * have a dynamic constraint due to the HT workaround -- for
> > > +                * those the constraint can change due to scheduling activity
> > > +                * on the other sibling.
> > > +                */
> > > +               if (!c || ((c->flags & PERF_X86_EVENT_DYNAMIC) &&
> > > +                          is_ht_workaround_active(cpuc))) {
> > > +
> > > +                       c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> > > +                       cpuc->event_constraint[i] = c;
> > > +               }
>
> > On this one, I think there may be a problem with events with
> > shared_regs constraints.
>
> Hmm...
>
> > Constraint is dynamic as it depends on other events which share the
> > same MSR, yet it is not marked as DYNAMIC.
>
> it returns &emptyconstraint or a table constraint, depending on register
> state.
>
> > But this may be okay because these other events are all on the same
> > CPU and thus scheduled during the same ctx_sched_in(). Yet with the
> > swapping in intel_alt_er(), we need to double-check that we cannot
> > reuse a constraint which could be stale.
>
> > I believe this is okay, just double-check.
>
> I'm not sure I see a problem.
>
> So if we're the first event on a shared register, we claim the register
> and scheduling succeeds (barring other constraints).
>
> If we're the second event on a shared register (and have conflicting
> register state), we get the empty constraint. This _will_ cause
> scheduling to fail. We'll not cache the state and punt it back to the
> core code.
>
> So no future scheduling pass will come to see a shared reg constraint
> that could've changed.
>
> Now, there is indeed the intel_alt_er() thing, which slightly
> complicates this; then suppose we schedule an event on RSP0, another on
> RSP1, then remove the RSP0 one. Even in that case, the remaining RSP1
> event will not change its constraint, since intel_fixup_er() rewrites
> the event to be a native RSP1 event.
>
> So that too reduces to the prior case.
>
I came to the same conclusion later yesterday. I think this is okay.

>
> That said; I have simplified the above condition to:
>
> @@ -858,8 +858,17 @@ int x86_schedule_events(struct cpu_hw_ev
>                 x86_pmu.start_scheduling(cpuc);
>
>         for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> -               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> -               cpuc->event_constraint[i] = c;
> +               c = cpuc->event_constraint[i];
> +
> +               /*
> +                * Request constraints for new events; or for those events that
> +                * have a dynamic constraint -- for those the constraint can
> +                * change due to external factors (sibling state, allow_tfa).
> +                */
> +               if (!c || (c->flags & PERF_X86_EVENT_DYNAMIC)) {
> +                       c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> +                       cpuc->event_constraint[i] = c;
> +               }
>
Right now DYNAMIC is only caused by the HT bug, but that could change
later on; the logic here would remain. If the HT workaround is
disabled, then no event is tagged with DYNAMIC.

>                 wmin = min(wmin, c->weight);
>                 wmax = max(wmax, c->weight);
>
> Because any dynamic event can change from one moment to the next.
I agree.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-19 18:20         ` Peter Zijlstra
@ 2019-03-20 20:47           ` Stephane Eranian
  2019-03-20 20:52             ` Stephane Eranian
  2019-03-20 22:22             ` Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Stephane Eranian @ 2019-03-20 20:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Tue, Mar 19, 2019 at 11:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 19, 2019 at 10:52:01AM -0700, Stephane Eranian wrote:
> > > Not quite; the control on its own doesn't directly write the MSR. And
> > > even when the work-around is allowed, we'll not set the MSR unless there
> > > is also demand for PMC3.
> > >
> > Trying to understand this better here. When the workaround is enabled
> > (tfa=0), you lose PMC3 and transactions operate normally.
>
> > When it is disabled (tfa=1), transactions are all aborted and PMC3 is
> > available.
>
> Right, but we don't expose tfa.
>
> > If you are saying that when there is a PMU event requesting PMC3, then
> > you need PMC3 avail, so you set the MSR so that tfa=1 forcing all
> > transactions to abort.
>
> Right, so when allow_tfa=1 (default), we only set tfa=1 when PMC3 is
> requested.
>
> This has the advantage that,
>
>   TSX only workload -> works
>   perf 4 counters -> works
>
> Only when you need both of them, do you get 'trouble'.
>
> > But in that case, you are modifying the execution of the workload when
> > you are monitoring it, assuming it uses TSX.
>
> We assume you are not in fact using TSX, not a lot of code does. If you
> do use TSX a lot, and you don't want to interfere, you have to set
> allow_tfa=0 and live with one counter less.
>
> Any which way around you turn this stone, it sucks.
>
> > You want lowest overhead and no modifications to how the workload
> > operates, otherwise how representative is the data you are collecting?
>
> Sure; but there are no good choices here. This 'fix' will break
> something. We figured TSX+4-counter-perf was the least common scenario.
>
> We know of people that rely on 4 counters being present; you want to
> explain to them how, when doing an update, their program suddenly
> doesn't work anymore?
>
> Or you want to default to tfa=1; but then you have to explain to those
> people relying on TSX why their workload stopped working.
>
> > I understand that there is no impact on apps not using TSX, well,
> > except on context switch where you have to toggle that MSR.
>
> There is no additional code in the context switch; only the perf event
> scheduling code prods at the MSR.
>
> > But for workloads using TSX, there is potentially an impact.
>
> Yes, well, if you're a TSX _and_ perf user, you now have an extra knob
> to play with; that's not something I can do anything about. We're forced
> to make a choice here.
>
> > Yeah, meh. You're admin, you can 'fix' it. In practice I don't expect
> > > most people to care about the knob, and the few people that do, should
> > > be able to make it work.
> >
> > I don't understand how this can work reliably.
>
> > You have a knob to toggle that MSR.
>
> No, we don't have this knob.
>
> > Then, you have another one inside perf_events
>
> Only this knob, allow_tfa, exists.
>
> > and then the sysadmin has to make sure nobody (incl. NMI watchdog) is
> > using the PMU when this all happens.
>
> You're very unlucky if the watchdog runs on PMC3, normally it runs on
> Fixed1 or something. Esp early after boot. (Remember, we schedule fixed
> counters first, and then general purpose counters, with a preference for
> lower counters).
>
> Anyway, you can trivially switch it off if you want.
>
> > How can this be a practical solution? Am I missing something here?
>
> It works just fine; it is unfortunate that we have this interaction but
> that's not something we can do anything about. We're forced to deal with
> this.
>
Right now, if I do:

echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort

Then I don't have a guarantee about when aborts will stop once I
return from the echo; the MSR is accessed only on PMU scheduling. I
would expect a sysadmin to want some guarantee if this is to be
switched on/off at runtime. If not, then having a boot-time option is
better in my opinion.

The other bit I noticed is that cpuc->tfa_shadow is used to avoid the
wrmsr(), but I don't see the code that makes sure the init value (0)
matches the value of the MSR. Is this MSR guaranteed to be zero on
reset? How about on kexec()?

> But if you're a TSX+perf user, have your boot scripts do:
>
>   echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort
>
> and you'll not use PMC3 and TSX will be 'awesome'. If you don't give a
> crap about TSX (most people), just boot and be happy.
>
> If you do care about TSX+perf and want to dynamically toggle for some
> reason, you just have to be a little careful.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-20 20:47           ` Stephane Eranian
@ 2019-03-20 20:52             ` Stephane Eranian
  2019-03-20 22:22             ` Peter Zijlstra
  1 sibling, 0 replies; 48+ messages in thread
From: Stephane Eranian @ 2019-03-20 20:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Wed, Mar 20, 2019 at 1:47 PM Stephane Eranian <eranian@google.com> wrote:
>
> On Tue, Mar 19, 2019 at 11:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Mar 19, 2019 at 10:52:01AM -0700, Stephane Eranian wrote:
> > > > Not quite; the control on its own doesn't directly write the MSR. And
> > > > even when the work-around is allowed, we'll not set the MSR unless there
> > > > is also demand for PMC3.
> > > >
> > > Trying to understand this better here. When the workaround is enabled
> > > (tfa=0), you lose PMC3 and transactions operate normally.
> >
> > > When it is disabled (tfa=1), transactions are all aborted and PMC3 is
> > > available.
> >
> > Right, but we don't expose tfa.
> >
> > > If you are saying that when there is a PMU event requesting PMC3, then
> > > you need PMC3 avail, so you set the MSR so that tfa=1 forcing all
> > > transactions to abort.
> >
> > Right, so when allow_tfa=1 (default), we only set tfa=1 when PMC3 is
> > requested.
> >
> > This has the advantage that,
> >
> >   TSX only workload -> works
> >   perf 4 counters -> works
> >
> > Only when you need both of them, do you get 'trouble'.
> >
> > > But in that case, you are modifying the execution of the workload when
> > > you are monitoring it, assuming it uses TSX.
> >
> > We assume you are not in fact using TSX, not a lot of code does. If you
> > do use TSX a lot, and you don't want to interfere, you have to set
> > allow_tfa=0 and live with one counter less.
> >
> > Any which way around you turn this stone, it sucks.
> >
> > > You want lowest overhead and no modifications to how the workload
> > > operates, otherwise how representative is the data you are collecting?
> >
> > Sure; but there are no good choices here. This 'fix' will break
> > something. We figured TSX+4-counter-perf was the least common scenario.
> >
> > We know of people that rely on 4 counters being present; you want to
> > explain to them how, when doing an update, their program suddenly
> > doesn't work anymore?
> >
> > Or you want to default to tfa=1; but then you have to explain to those
> > people relying on TSX why their workload stopped working.
> >
> > > I understand that there is no impact on apps not using TSX, well,
> > > except on context switch where you have to toggle that MSR.
> >
> > There is no additional code in the context switch; only the perf event
> > scheduling code prods at the MSR.
> >
> > > But for workloads using TSX, there is potentially an impact.
> >
> > Yes, well, if you're a TSX _and_ perf user, you now have an extra knob
> > to play with; that's not something I can do anything about. We're forced
> > to make a choice here.
> >
> > > > Yeah, meh. You're admin, you can 'fix' it. In practice I don't expect
> > > > most people to care about the knob, and the few people that do, should
> > > > be able to make it work.
> > >
> > > I don't understand how this can work reliably.
> >
> > > You have a knob to toggle that MSR.
> >
> > No, we don't have this knob.
> >
> > > Then, you have another one inside perf_events
> >
> > Only this knob, allow_tfa, exists.
> >
> > > and then the sysadmin has to make sure nobody (incl. NMI watchdog) is
> > > using the PMU when this all happens.
> >
> > You're very unlucky if the watchdog runs on PMC3, normally it runs on
> > Fixed1 or something. Esp early after boot. (Remember, we schedule fixed
> > counters first, and then general purpose counters, with a preference for
> > lower counters).
> >
> > Anyway, you can trivially switch it off if you want.
> >
> > > How can this be a practical solution? Am I missing something here?
> >
> > It works just fine; it is unfortunate that we have this interaction but
> > that's not something we can do anything about. We're forced to deal with
> > this.
> >
> Right now, if I do:
>
> echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort
>
> Then I don't have a guarantee about when aborts will stop once I
> return from the echo; the MSR is accessed only on PMU scheduling. I
> would expect a sysadmin to want some guarantee if this is to be
> switched on/off at runtime. If not, then having a boot-time option is
> better in my opinion.
>
> The other bit I noticed is that cpuc->tfa_shadow is used to avoid the
> wrmsr(), but I don't see the code that makes sure the init value (0)
> matches the value of the MSR. Is this MSR guaranteed to be zero on
> reset? How about on kexec()?
>

Furthermore, depending on what is measured on each CPU, i.e., whether
PMC3 is used or not, TFA may be set to true or false dynamically. So a
TSX workload may be impacted depending on which CPU it runs on. I
don't think users would like that approach.


> > But if you're a TSX+perf user, have your boot scripts do:
> >
> >   echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort
> >
> > and you'll not use PMC3 and TSX will be 'awesome'. If you don't give a
> > crap about TSX (most people), just boot and be happy.
> >
> > If you do care about TSX+perf and want to dynamically toggle for some
> > reason, you just have to be a little careful.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-20 20:47           ` Stephane Eranian
  2019-03-20 20:52             ` Stephane Eranian
@ 2019-03-20 22:22             ` Peter Zijlstra
  2019-03-21 12:38               ` Peter Zijlstra
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-20 22:22 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Wed, Mar 20, 2019 at 01:47:28PM -0700, Stephane Eranian wrote:

> Right now, if I do:
> 
> echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort
> 
> Then I don't have a guarantee about when aborts will stop once I
> return from the echo; the MSR is accessed only on PMU scheduling. I
> would expect a sysadmin to want some guarantee if this is to be
> switched on/off at runtime. If not, then having a boot-time option is
> better in my opinion.

Something like cycling the nmi watchdog or:

  perf stat -a -e cycles sleep 1

should be enough to force reschedule the events on every CPU.

Again, I'm not averse to 'fixing' this if it can be done with limited
LoC. But I don't really see this as critical.

> This other bit I noticed is that cpuc->tfa_shadow is used to avoid the
> wrmsr(), but I don't see the code that makes sure the init value (0)
> matches the value of the MSR.  Is this MSR guaranteed to be zero on
> reset? 

That was my understanding.

> How about on kexec()?

Good point, we might want to fix that.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-20 22:22             ` Peter Zijlstra
@ 2019-03-21 12:38               ` Peter Zijlstra
  2019-03-21 16:45                 ` Thomas Gleixner
  2019-04-03 10:40                 ` [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-21 12:38 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza, Thomas Gleixner

On Wed, Mar 20, 2019 at 11:22:20PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 20, 2019 at 01:47:28PM -0700, Stephane Eranian wrote:
> 
> > Right now, if I do:
> > 
> > echo 0 > /sys/bus/event_source/devices/cpu/allow_tsx_force_abort
> > 
> > Then I don't have a guarantee about when aborts will stop once I
> > return from the echo; the MSR is accessed only on PMU scheduling. I
> > would expect a sysadmin to want some guarantee if this is to be
> > switched on/off at runtime. If not, then having a boot-time option is
> > better in my opinion.
> 
> Something like cycling the nmi watchdog or:
> 
>   perf stat -a -e cycles sleep 1
> 
> should be enough to force reschedule the events on every CPU.
> 
> Again, I'm not averse to 'fixing' this if it can be done with limited
> LoC. But I don't really see this as critical.
> 
> > This other bit I noticed is that cpuc->tfa_shadow is used to avoid the
> > wrmsr(), but I don't see the code that makes sure the init value (0)
> > matches the value of the MSR.  Is this MSR guaranteed to be zero on
> > reset? 
> 
> That was my understanding.
> 
> > How about on kexec()?
> 
> Good point, we might want to fix that.

Something like the below perhaps?

---
Subject: perf/x86/intel: Initialize TFA MSR

Stephane reported that we don't initialize the TFA MSR, which could lead
to trouble if the RESET value is not 0 or on kexec.

Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/events/intel/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8baa441d8000..2d3caf2d1384 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
 
 	cpuc->lbr_sel = NULL;
 
+	if (x86_pmu.flags & PMU_FL_TFA) {
+		WARN_ON_ONCE(cpuc->tfa_shadow);
+		cpuc->tfa_shadow = ~0ULL;
+		intel_set_tfa(cpuc, false);
+	}
+
 	if (x86_pmu.version > 1)
 		flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 12:38               ` Peter Zijlstra
@ 2019-03-21 16:45                 ` Thomas Gleixner
  2019-03-21 17:10                   ` Peter Zijlstra
  2019-03-21 17:23                   ` Stephane Eranian
  2019-04-03 10:40                 ` [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 48+ messages in thread
From: Thomas Gleixner @ 2019-03-21 16:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> Subject: perf/x86/intel: Initialize TFA MSR
> 
> Stephane reported that we don't initialize the TFA MSR, which could lead
> to trouble if the RESET value is not 0 or on kexec.

That sentence doesn't parse.

  Stephane reported that the TFA MSR is not initialized by the kernel, but
  the TFA bit could be set by firmware or as a leftover from a kexec, which
  makes the state inconsistent.

> Reported-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/events/intel/core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 8baa441d8000..2d3caf2d1384 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
>  
>  	cpuc->lbr_sel = NULL;
>  
> +	if (x86_pmu.flags & PMU_FL_TFA) {
> +		WARN_ON_ONCE(cpuc->tfa_shadow);

Hmm. I wouldn't warn here as this is a legit state for kexec/kdump and I
don't think we can figure out whether this comes directly from the
firmware.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 16:45                 ` Thomas Gleixner
@ 2019-03-21 17:10                   ` Peter Zijlstra
  2019-03-21 17:17                     ` Thomas Gleixner
  2019-03-21 17:23                   ` Stephane Eranian
  1 sibling, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-21 17:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 21, 2019 at 05:45:41PM +0100, Thomas Gleixner wrote:
> On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > Subject: perf/x86/intel: Initialize TFA MSR
> > 
> > Stephane reported that we don't initialize the TFA MSR, which could lead
> > to trouble if the RESET value is not 0 or on kexec.
> 
> That sentence doesn't parse.
> 
>   Stephane reported that the TFA MSR is not initialized by the kernel, but
>   the TFA bit could be set by firmware or as a leftover from a kexec, which
>   makes the state inconsistent.
> 
> > Reported-by: Stephane Eranian <eranian@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  arch/x86/events/intel/core.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index 8baa441d8000..2d3caf2d1384 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
> >  
> >  	cpuc->lbr_sel = NULL;
> >  
> > +	if (x86_pmu.flags & PMU_FL_TFA) {
> > +		WARN_ON_ONCE(cpuc->tfa_shadow);
> 
> Hmm. I wouldn't warn here as this is a legit state for kexec/kdump and I
> don't think we can figure out whether this comes directly from the
> firmware.

Even on kexec, the cpuc will be freshly allocated and zerod I think. We
only inherit hardware state, not software state.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 17:10                   ` Peter Zijlstra
@ 2019-03-21 17:17                     ` Thomas Gleixner
  2019-03-21 18:20                       ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2019-03-21 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> On Thu, Mar 21, 2019 at 05:45:41PM +0100, Thomas Gleixner wrote:
> > On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > > Subject: perf/x86/intel: Initialize TFA MSR
> > > 
> > > Stephane reported that we don't initialize the TFA MSR, which could lead
> > > to trouble if the RESET value is not 0 or on kexec.
> > 
> > That sentence doesn't parse.
> > 
> >   Stephane reported that the TFA MSR is not initialized by the kernel, but
> >   the TFA bit could be set by firmware or as a leftover from a kexec, which
> >   makes the state inconsistent.
> > 
> > > Reported-by: Stephane Eranian <eranian@google.com>
> > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > ---
> > >  arch/x86/events/intel/core.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index 8baa441d8000..2d3caf2d1384 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
> > >  
> > >  	cpuc->lbr_sel = NULL;
> > >  
> > > +	if (x86_pmu.flags & PMU_FL_TFA) {
> > > +		WARN_ON_ONCE(cpuc->tfa_shadow);
> > 
> > Hmm. I wouldn't warn here as this is a legit state for kexec/kdump and I
> > don't think we can figure out whether this comes directly from the
> > firmware.
> 
> Even on kexec, the cpuc will be freshly allocated and zerod I think. We
> only inherit hardware state, not software state.

Ouch, misread the patch of course ... What about cpu hotplug? Does that
free/alloc or reuse?

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 16:45                 ` Thomas Gleixner
  2019-03-21 17:10                   ` Peter Zijlstra
@ 2019-03-21 17:23                   ` Stephane Eranian
  2019-03-21 17:51                     ` Thomas Gleixner
  1 sibling, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-21 17:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 21, 2019 at 9:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > Subject: perf/x86/intel: Initialize TFA MSR
> >
> > Stephane reported that we don't initialize the TFA MSR, which could lead
> > to trouble if the RESET value is not 0 or on kexec.
>
> That sentence doesn't parse.
>
>   Stephane reported that the TFA MSR is not initialized by the kernel, but
>   the TFA bit could set by firmware or as a leftover from a kexec, which
>   makes the state inconsistent.
>
Correct. This is what I meant.
The issue is what does the kernel guarantee when it boots?

I see:
static bool allow_tsx_force_abort = true;

Therefore you must ensure the MSR is set to reflect that state on boot.
So you have to force it to that value to be in sync which is what your
new patch is doing.

> > Reported-by: Stephane Eranian <eranian@google.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  arch/x86/events/intel/core.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index 8baa441d8000..2d3caf2d1384 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
> >
> >       cpuc->lbr_sel = NULL;
> >
> > +     if (x86_pmu.flags & PMU_FL_TFA) {
> > +             WARN_ON_ONCE(cpuc->tfa_shadow);
>
> Hmm. I wouldn't warn here as this is a legit state for kexec/kdump and I
> don't think we can figure out whether this comes directly from the
> firmware.
>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 17:23                   ` Stephane Eranian
@ 2019-03-21 17:51                     ` Thomas Gleixner
  2019-03-22 19:04                       ` Stephane Eranian
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2019-03-21 17:51 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, 21 Mar 2019, Stephane Eranian wrote:
> On Thu, Mar 21, 2019 at 9:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > > Subject: perf/x86/intel: Initialize TFA MSR
> > >
> > > Stephane reported that we don't initialize the TFA MSR, which could lead
> > > to trouble if the RESET value is not 0 or on kexec.
> >
> > That sentence doesn't parse.
> >
> >   Stephane reported that the TFA MSR is not initialized by the kernel, but
> >   the TFA bit could set by firmware or as a leftover from a kexec, which
> >   makes the state inconsistent.
> >
> Correct. This is what I meant.
> The issue is what does the kernel guarantee when it boots?
> 
> I see:
> static bool allow_tsx_force_abort = true;
> 
> Therefore you must ensure the MSR is set to reflect that state on boot.
> So you have to force it to that value to be in sync which is what your
> new patch is doing.

The initial state should be that the MSR TFA bit is 0. The software state
is a different beast.

allow_tsx_force_abort

  false			Do not set MSR TFA bit (Make TSX work with PMC3) and
  			exclude PMC3 from being used.

  true			Set the MSR TFA bit when PMC3 is used by perf, clear it
  			when PMC3 is no longer in use.

Now, if the firmware or a previous kexec kernel has the TFA bit set in the
MSR and PMC3 is not in use, then TSX always aborts pointlessly. It's not a
fatal issue, but it's inconsistent.

So independent of the state of allow_tsx_force_abort, the kernel has to
clear the MSR TFA bit when the CPUs are brought up.

The state of allow_tsx_force_abort is solely relevant after CPUs coming up
to decide whether PMC3 can be used by perf or not.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 17:17                     ` Thomas Gleixner
@ 2019-03-21 18:20                       ` Peter Zijlstra
  2019-03-21 19:42                         ` Tony Jones
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-21 18:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 21, 2019 at 06:17:01PM +0100, Thomas Gleixner wrote:
> On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > On Thu, Mar 21, 2019 at 05:45:41PM +0100, Thomas Gleixner wrote:
> > > On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > > > Subject: perf/x86/intel: Initialize TFA MSR
> > > > 
> > > > Stephane reported that we don't initialize the TFA MSR, which could lead
> > > > to trouble if the RESET value is not 0 or on kexec.
> > > 
> > > That sentence doesn't parse.
> > > 
> > >   Stephane reported that the TFA MSR is not initialized by the kernel, but
> > >   the TFA bit could set by firmware or as a leftover from a kexec, which
> > >   makes the state inconsistent.
> > > 
> > > > Reported-by: Stephane Eranian <eranian@google.com>
> > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > > > ---
> > > >  arch/x86/events/intel/core.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > > index 8baa441d8000..2d3caf2d1384 100644
> > > > --- a/arch/x86/events/intel/core.c
> > > > +++ b/arch/x86/events/intel/core.c
> > > > @@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
> > > >  
> > > >  	cpuc->lbr_sel = NULL;
> > > >  
> > > > +	if (x86_pmu.flags & PMU_FL_TFA) {
> > > > +		WARN_ON_ONCE(cpuc->tfa_shadow);
> > > 
> > > Hmm. I wouldn't warn here as this is a legit state for kexec/kdump and I
> > > don't think we can figure out whether this comes directly from the
> > > firmware.
> > 
> > Even on kexec, the cpuc will be freshly allocated and zerod I think. We
> > only inherit hardware state, not software state.
> 
> Ouch, misread the patch of course ... What about cpu hotplug? Does that
> free/alloc or reuse?

re-use I think, but since we kill all events before taking the CPU down,
it _should_ have cleared the MSR while doing that.

Might be best to have someone test this.. someone with ucode and a
machine etc.. :-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 18:20                       ` Peter Zijlstra
@ 2019-03-21 19:42                         ` Tony Jones
  2019-03-21 19:47                           ` DSouza, Nelson
  0 siblings, 1 reply; 48+ messages in thread
From: Tony Jones @ 2019-03-21 19:42 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, LKML, nelson.dsouza

On 3/21/19 11:20 AM, Peter Zijlstra wrote:

> Might be best to have someone test this.. someone with ucode and a
> machine etc.. :-)

If your own company can't oblige you here, Peter :-) I'm sure I can make some time to test it.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 19:42                         ` Tony Jones
@ 2019-03-21 19:47                           ` DSouza, Nelson
  2019-03-21 20:07                             ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: DSouza, Nelson @ 2019-03-21 19:47 UTC (permalink / raw)
  To: Tony Jones, Peter Zijlstra, Thomas Gleixner
  Cc: Stephane Eranian, Ingo Molnar, Jiri Olsa, LKML

Is the request to check whether the MSR gets reset to its default upon reboot of the machine?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 19:47                           ` DSouza, Nelson
@ 2019-03-21 20:07                             ` Peter Zijlstra
  2019-03-21 23:16                               ` DSouza, Nelson
  0 siblings, 1 reply; 48+ messages in thread
From: Peter Zijlstra @ 2019-03-21 20:07 UTC (permalink / raw)
  To: DSouza, Nelson
  Cc: Tony Jones, Thomas Gleixner, Stephane Eranian, Ingo Molnar,
	Jiri Olsa, LKML

On Thu, Mar 21, 2019 at 07:47:50PM +0000, DSouza, Nelson wrote:
> Is the request to check whether the msr gets reset to default upon reboot of the machine ?

basically:

 - apply patch
 - start workload with 4 counters (on all CPUs), such that tfa-msr=1
 - try the following:
   o cpu hotplug
   o kexec

to see if the new WARN will trigger -- it should not.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 20:07                             ` Peter Zijlstra
@ 2019-03-21 23:16                               ` DSouza, Nelson
  2019-03-22 22:14                                 ` DSouza, Nelson
  0 siblings, 1 reply; 48+ messages in thread
From: DSouza, Nelson @ 2019-03-21 23:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Jones, Thomas Gleixner, Stephane Eranian, Ingo Molnar,
	Jiri Olsa, LKML

[-- Attachment #1: Type: text/plain, Size: 1027 bytes --]

Attached cpu hotplug test output while perf is running in the background. No WARN messages seen.

When I run the kexec command, it boots to the BIOS. Haven't used kexec before. Still trying to figure that one out.


[-- Attachment #2: cpu_hotplug.txt --]
[-- Type: text/plain, Size: 921 bytes --]


root@jf1-otc-3AR3-33:~# cat /sys/devices/cpu/allow_tsx_force_abort
1
root@jf1-otc-3AR3-33:~# perf stat -e cs,'{cache-misses,cache-references,branches,branch-misses}' --no-merge -a sleep 60 &
[1] 3025
root@jf1-otc-3AR3-33:~# echo 0 > /sys/devices/system/cpu/cpu4/online
root@jf1-otc-3AR3-33:~# echo 0 > /sys/devices/system/cpu/cpu5/online
root@jf1-otc-3AR3-33:~# echo 0 > /sys/devices/system/cpu/cpu6/online
root@jf1-otc-3AR3-33:~# echo 0 > /sys/devices/system/cpu/cpu7/online
root@jf1-otc-3AR3-33:~#
 Performance counter stats for 'system wide':

             4,800      cs
     <not counted>      cache-misses
     <not counted>      cache-references
     <not counted>      branches
     <not counted>      branch-misses

      60.009098150 seconds time elapsed



[1]+  Done                    perf stat -e cs,'{cache-misses,cache-references,branches,branch-misses}' --no-merge -a sleep 60



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 17:51                     ` Thomas Gleixner
@ 2019-03-22 19:04                       ` Stephane Eranian
  2019-04-03  7:32                         ` Peter Zijlstra
  0 siblings, 1 reply; 48+ messages in thread
From: Stephane Eranian @ 2019-03-22 19:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Thu, Mar 21, 2019 at 10:51 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, 21 Mar 2019, Stephane Eranian wrote:
> > On Thu, Mar 21, 2019 at 9:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > > > Subject: perf/x86/intel: Initialize TFA MSR
> > > >
> > > > Stephane reported that we don't initialize the TFA MSR, which could lead
> > > > to trouble if the RESET value is not 0 or on kexec.
> > >
> > > That sentence doesn't parse.
> > >
> > >   Stephane reported that the TFA MSR is not initialized by the kernel, but
> > >   the TFA bit could set by firmware or as a leftover from a kexec, which
> > >   makes the state inconsistent.
> > >
> > Correct. This is what I meant.
> > The issue is what does the kernel guarantee when it boots?
> >
> > I see:
> > static bool allow_tsx_force_abort = true;
> >
> > Therefore you must ensure the MSR is set to reflect that state on boot.
> > So you have to force it to that value to be in sync which is what your
> > new patch is doing.
>
> The initial state should be that the MSR TFA bit is 0. The software state
> is a different beast.
>
> allow_tsx_force_abort
>
>   false                 Do not set MSR TFA bit (Make TSX work with PMC3) and
>                         exclude PMC3 from being used.
>
>   true                  Set the MSR TFA bit when PMC3 is used by perf, clear it
>                         when PMC3 is no longer in use.
>
I would expect this description to be included in the source code where
the allow_tsx_force_abort variable is defined, and somewhere in the
kernel Documentation, because it is not trivial to understand what the
control actually does and what guarantees you have when you toggle it.

> Now, if the firmware or the kexec has the TFA bit set in the MSR and PMC3
> is not in use then TSX always aborts pointlessly. It's not a fatal issue,
> but it's inconsistent.
>
> So independent of the state of allow_tsx_force_abort the kernel has to
> clear the MSR TFA bit when the CPUs are brought up.
>
> The state of allow_tsx_force_abort is solely relevant after CPUs coming up
> to decide whether PMC3 can be used by perf or not.
>
> Thanks,
>
>         tglx
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-21 23:16                               ` DSouza, Nelson
@ 2019-03-22 22:14                                 ` DSouza, Nelson
  0 siblings, 0 replies; 48+ messages in thread
From: DSouza, Nelson @ 2019-03-22 22:14 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: 'Tony Jones', 'Thomas Gleixner',
	'Stephane Eranian', 'Ingo Molnar',
	'Jiri Olsa', 'LKML'

kexec was borked in the 5.0+ kernel I was using.

Switched to 5.0.3 stable and kexec works as expected.
No warnings seen with a kexec boot.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/8] perf/x86/intel: Fix memory corruption
  2019-03-22 19:04                       ` Stephane Eranian
@ 2019-04-03  7:32                         ` Peter Zijlstra
  0 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2019-04-03  7:32 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Thomas Gleixner, Ingo Molnar, Jiri Olsa, LKML, tonyj, nelson.dsouza

On Fri, Mar 22, 2019 at 12:04:55PM -0700, Stephane Eranian wrote:
> On Thu, Mar 21, 2019 at 10:51 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Thu, 21 Mar 2019, Stephane Eranian wrote:
> > > On Thu, Mar 21, 2019 at 9:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > On Thu, 21 Mar 2019, Peter Zijlstra wrote:
> > > > > Subject: perf/x86/intel: Initialize TFA MSR
> > > > >
> > > > > Stephane reported that we don't initialize the TFA MSR, which could lead
> > > > > to trouble if the RESET value is not 0 or on kexec.
> > > >
> > > > That sentence doesn't parse.
> > > >
> > > >   Stephane reported that the TFA MSR is not initialized by the kernel, but
> > > >   the TFA bit could set by firmware or as a leftover from a kexec, which
> > > >   makes the state inconsistent.
> > > >
> > > Correct. This is what I meant.
> > > The issue is what does the kernel guarantee when it boots?
> > >
> > > I see:
> > > static bool allow_tsx_force_abort = true;
> > >
> > > Therefore you must ensure the MSR is set to reflect that state on boot.
> > > So you have to force it to that value to be in sync which is what your
> > > new patch is doing.
> >
> > The initial state should be that the MSR TFA bit is 0. The software state
> > is a different beast.
> >
> > allow_tsx_force_abort
> >
> >   false                 Do not set MSR TFA bit (Make TSX work with PMC3) and
> >                         exclude PMC3 from being used.
> >
> >   true                  Set the MSR TFA bit when PMC3 is used by perf, clear it
> >                         when PMC3 is no longer in use.
> >
> I would expect this description to be included in the source code where the
> allow_tsx_force_abort variable is defined

That part is easy I suppose.

> and somewhere in the kernel Documentation because it is not trivial to
> understand what the control actually does and the guarantees you have
> when you toggle it.

But here we seem to be sorely lacking, that is, there is no sysfs/perf
documentation at all.

In any case, feel free to send a patch :-)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR
  2019-03-21 12:38               ` Peter Zijlstra
  2019-03-21 16:45                 ` Thomas Gleixner
@ 2019-04-03 10:40                 ` tip-bot for Peter Zijlstra
  2019-04-03 11:30                   ` Thomas Gleixner
  1 sibling, 1 reply; 48+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-04-03 10:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, alexander.shishkin, peterz, jolsa, nelson.dsouza,
	acme, tglx, eranian, vincent.weaver, hpa, torvalds, mingo

Commit-ID:  d7262457e35dbe239659e62654e56f8ddb814bed
Gitweb:     https://git.kernel.org/tip/d7262457e35dbe239659e62654e56f8ddb814bed
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 21 Mar 2019 13:38:49 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 3 Apr 2019 11:40:32 +0200

perf/x86/intel: Initialize TFA MSR

Stephane reported that the TFA MSR is not initialized by the kernel,
but the TFA bit could be set by firmware or as a leftover from a kexec,
which makes the state inconsistent.

Reported-by: Stephane Eranian <eranian@google.com>
Tested-by: Nelson DSouza <nelson.dsouza@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: tonyj@suse.com
Link: https://lkml.kernel.org/r/20190321123849.GN6521@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1539647ea39d..f61dcbef20ff 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
 
 	cpuc->lbr_sel = NULL;
 
+	if (x86_pmu.flags & PMU_FL_TFA) {
+		WARN_ON_ONCE(cpuc->tfa_shadow);
+		cpuc->tfa_shadow = ~0ULL;
+		intel_set_tfa(cpuc, false);
+	}
+
 	if (x86_pmu.version > 1)
 		flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR
  2019-04-03 10:40                 ` [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR tip-bot for Peter Zijlstra
@ 2019-04-03 11:30                   ` Thomas Gleixner
  2019-04-03 12:23                     ` Vince Weaver
  0 siblings, 1 reply; 48+ messages in thread
From: Thomas Gleixner @ 2019-04-03 11:30 UTC (permalink / raw)
  To: nelson.dsouza, jolsa, linux-kernel, alexander.shishkin, peterz,
	vincent.weaver, eranian, hpa, mingo, torvalds, acme
  Cc: linux-tip-commits

On Wed, 3 Apr 2019, tip-bot for Peter Zijlstra wrote:

> Commit-ID:  d7262457e35dbe239659e62654e56f8ddb814bed
> Gitweb:     https://git.kernel.org/tip/d7262457e35dbe239659e62654e56f8ddb814bed
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Thu, 21 Mar 2019 13:38:49 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 3 Apr 2019 11:40:32 +0200
> 
> perf/x86/intel: Initialize TFA MSR
> 
> Stephane reported that the TFA MSR is not initialized by the kernel,
> but the TFA bit could be set by firmware or as a leftover from a kexec,
> which makes the state inconsistent.
> 
> Reported-by: Stephane Eranian <eranian@google.com>
> Tested-by: Nelson DSouza <nelson.dsouza@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Vince Weaver <vincent.weaver@maine.edu>
> Cc: tonyj@suse.com
> Link: https://lkml.kernel.org/r/20190321123849.GN6521@hirez.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

This lacks:

 1) Fixes tag

 2) Cc: stable ....

Sigh.

> ---
>  arch/x86/events/intel/core.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 1539647ea39d..f61dcbef20ff 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3575,6 +3575,12 @@ static void intel_pmu_cpu_starting(int cpu)
>  
>  	cpuc->lbr_sel = NULL;
>  
> +	if (x86_pmu.flags & PMU_FL_TFA) {
> +		WARN_ON_ONCE(cpuc->tfa_shadow);
> +		cpuc->tfa_shadow = ~0ULL;
> +		intel_set_tfa(cpuc, false);
> +	}
> +
>  	if (x86_pmu.version > 1)
>  		flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
>  
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR
  2019-04-03 11:30                   ` Thomas Gleixner
@ 2019-04-03 12:23                     ` Vince Weaver
  0 siblings, 0 replies; 48+ messages in thread
From: Vince Weaver @ 2019-04-03 12:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: nelson.dsouza, jolsa, linux-kernel, alexander.shishkin, peterz,
	vincent.weaver, eranian, hpa, mingo, torvalds, acme,
	linux-tip-commits

On Wed, 3 Apr 2019, Thomas Gleixner wrote:

> On Wed, 3 Apr 2019, tip-bot for Peter Zijlstra wrote:
> 
> > Commit-ID:  d7262457e35dbe239659e62654e56f8ddb814bed
> > Gitweb:     https://git.kernel.org/tip/d7262457e35dbe239659e62654e56f8ddb814bed
> > Author:     Peter Zijlstra <peterz@infradead.org>
> > AuthorDate: Thu, 21 Mar 2019 13:38:49 +0100
> > Committer:  Ingo Molnar <mingo@kernel.org>
> > CommitDate: Wed, 3 Apr 2019 11:40:32 +0200
> > 
> > perf/x86/intel: Initialize TFA MSR
> > 
> > Stephane reported that the TFA MSR is not initialized by the kernel,
> > but the TFA bit could be set by firmware or as a leftover from a kexec,
> > which makes the state inconsistent.
> > 
> > Reported-by: Stephane Eranian <eranian@google.com>
> > Tested-by: Nelson DSouza <nelson.dsouza@intel.com>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> > Cc: Jiri Olsa <jolsa@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Vince Weaver <vincent.weaver@maine.edu>
> > Cc: tonyj@suse.com
> > Link: https://lkml.kernel.org/r/20190321123849.GN6521@hirez.programming.kicks-ass.net
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> This lacks:
> 
>  1) Fixes tag
> 
>  2) Cc: stable ....
> 
> Sigh.

It would also be nice to know what a "TFA" bit is without having to go 
find a copy of the Intel documentation.

Vince

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2019-04-03 12:24 UTC | newest]

Thread overview: 48+ messages
2019-03-14 13:01 [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Peter Zijlstra
2019-03-14 13:01 ` [PATCH 1/8] perf/x86/intel: Fix memory corruption Peter Zijlstra
2019-03-15 11:29   ` [tip:perf/urgent] " tip-bot for Peter Zijlstra
2019-03-19  6:29   ` [PATCH 1/8] " Stephane Eranian
2019-03-19 11:05     ` Peter Zijlstra
2019-03-19 17:52       ` Stephane Eranian
2019-03-19 18:20         ` Peter Zijlstra
2019-03-20 20:47           ` Stephane Eranian
2019-03-20 20:52             ` Stephane Eranian
2019-03-20 22:22             ` Peter Zijlstra
2019-03-21 12:38               ` Peter Zijlstra
2019-03-21 16:45                 ` Thomas Gleixner
2019-03-21 17:10                   ` Peter Zijlstra
2019-03-21 17:17                     ` Thomas Gleixner
2019-03-21 18:20                       ` Peter Zijlstra
2019-03-21 19:42                         ` Tony Jones
2019-03-21 19:47                           ` DSouza, Nelson
2019-03-21 20:07                             ` Peter Zijlstra
2019-03-21 23:16                               ` DSouza, Nelson
2019-03-22 22:14                                 ` DSouza, Nelson
2019-03-21 17:23                   ` Stephane Eranian
2019-03-21 17:51                     ` Thomas Gleixner
2019-03-22 19:04                       ` Stephane Eranian
2019-04-03  7:32                         ` Peter Zijlstra
2019-04-03 10:40                 ` [tip:perf/urgent] perf/x86/intel: Initialize TFA MSR tip-bot for Peter Zijlstra
2019-04-03 11:30                   ` Thomas Gleixner
2019-04-03 12:23                     ` Vince Weaver
2019-03-14 13:01 ` [RFC][PATCH 2/8] perf/x86/intel: Simplify intel_tfa_commit_scheduling() Peter Zijlstra
2019-03-14 13:01 ` [RFC][PATCH 3/8] perf/x86: Simplify x86_pmu.get_constraints() interface Peter Zijlstra
2019-03-19 21:21   ` Stephane Eranian
2019-03-14 13:01 ` [RFC][PATCH 4/8] perf/x86: Remove PERF_X86_EVENT_COMMITTED Peter Zijlstra
2019-03-19 20:48   ` Stephane Eranian
2019-03-19 21:00     ` Peter Zijlstra
2019-03-20 13:14       ` Peter Zijlstra
2019-03-20 12:23     ` Peter Zijlstra
2019-03-14 13:01 ` [RFC][PATCH 5/8] perf/x86/intel: Optimize intel_get_excl_constraints() Peter Zijlstra
2019-03-19 23:43   ` Stephane Eranian
2019-03-14 13:01 ` [RFC][PATCH 6/8] perf/x86: Clear ->event_constraint[] on put Peter Zijlstra
2019-03-19 21:50   ` Stephane Eranian
2019-03-20 12:25     ` Peter Zijlstra
2019-03-14 13:01 ` [RFC][PATCH 7/8] perf/x86: Optimize x86_schedule_events() Peter Zijlstra
2019-03-19 23:55   ` Stephane Eranian
2019-03-20 13:11     ` Peter Zijlstra
2019-03-20 19:30       ` Stephane Eranian
2019-03-14 13:01 ` [RFC][PATCH 8/8] perf/x86: Add sanity checks to x86_schedule_events() Peter Zijlstra
2019-03-15  7:15 ` [RFC][PATCH 0/8] perf/x86: event scheduling cleanups Stephane Eranian
2019-03-15  7:15   ` Stephane Eranian
2019-03-15  8:01     ` Peter Zijlstra
