* [PATCH v2 00/11] Various x86 pmu scheduling patches
@ 2015-05-22 13:29 Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 01/11] perf,x86: Fix event/group validation Peter Zijlstra
                   ` (10 more replies)
  0 siblings, 11 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

The first two would be for perf/urgent; the rest are cleanups and can wait.

These patches have had some testing, and the new HT constraints code now
seems to actually work.

It only constrains the general purpose counters when there are exclusive
events present on either HT sibling; otherwise all GP counters are available
as usual.
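
For reference, the gating is the check added to the slow path of
x86_schedule_events() in patch 2; a condensed sketch of that hunk:

	if (i != n) {
		int gpmax = x86_pmu.num_counters;

		/* only halve the GP budget when the HT bug is active and a
		 * sibling actually has exclusive events scheduled */
		if (is_ht_workaround_enabled() && !cpuc->is_fake &&
		    READ_ONCE(cpuc->excl_cntrs->exclusive_present))
			gpmax /= 2;

		unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
					     wmax, gpmax, assign);
	}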



* [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:40   ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint Peter Zijlstra
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz
  Cc: vincent.weaver, eranian, jolsa, kan.liang, linux-kernel,
	Andrew Hunter, Maria Dimakopoulou

[-- Attachment #1: peterz-pmu-sched-1.patch --]
[-- Type: text/plain, Size: 10919 bytes --]

Commit 43b4578071c0 ("perf/x86: Reduce stack usage of
x86_schedule_events()") violated the rule that 'fake' scheduling, as
used for event/group validation, should not change the event state.

This went mostly unnoticed because repeated calls to
x86_pmu::get_event_constraints() would give the same result, and
x86_pmu::put_event_constraints() would mostly not do anything.

Commit e979121b1b15 ("perf/x86/intel: Implement cross-HT corruption
bug workaround") made the situation much worse by actually setting the
event->hw.constraint value to NULL, so when validation and actual
scheduling interact we get NULL ptr derefs.

Fix it by removing the constraint pointer from the event and moving it
back into an array, this time in cpuc instead of on the stack.

validate_group()
  x86_schedule_events()
    event->hw.constraint = c; # store

      <context switch>
        perf_task_event_sched_in()
          ...
            x86_schedule_events();
              event->hw.constraint = c2; # store

              ...

              put_event_constraints(event); # assume failure to schedule
                intel_put_event_constraints()
                  event->hw.constraint = NULL;

      <context switch end>

    c = event->hw.constraint; # read -> NULL

    if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref

This is possible in particular when the event in question is a
cpu-wide event and group leader, and validate_group() tries to add an
event to the group.
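
In short, the constraint chosen for the i'th collected event now lives in a
per-cpu array next to event_list[], instead of being reachable through the
(shared) event itself; a condensed sketch of the hunks below:

	struct cpu_hw_events {
		...
		struct perf_event	*event_list[X86_PMC_IDX_MAX];
		struct event_constraint	*event_constraint[X86_PMC_IDX_MAX];
		...
	};

	/* x86_schedule_events() */
	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
		cpuc->event_constraint[i] = NULL;
		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
		cpuc->event_constraint[i] = c;
		...
	}

Since validate_group() runs on a fake cpuc, its scheduling pass can no
longer clobber state that the real scheduling path reads through the event.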

Fixes: 43b4578071c0 ("perf/x86: Reduce stack usage of x86_schedule_events()")
Fixes: e979121b1b15 ("perf/x86/intel: Implement cross-HT corruption bug workaround")
Cc: Andrew Hunter <ahh@google.com>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event.c              |   33 ++++++++++++++------------
 arch/x86/kernel/cpu/perf_event.h              |    9 +++----
 arch/x86/kernel/cpu/perf_event_intel.c        |   15 +++--------
 arch/x86/kernel/cpu/perf_event_intel_ds.c     |    4 +--
 arch/x86/kernel/cpu/perf_event_intel_uncore.c |    7 ++---
 arch/x86/kernel/cpu/perf_event_intel_uncore.h |    1 
 6 files changed, 33 insertions(+), 36 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -620,7 +620,7 @@ struct sched_state {
 struct perf_sched {
 	int			max_weight;
 	int			max_events;
-	struct perf_event	**events;
+	struct event_constraint	**constraints;
 	struct sched_state	state;
 	int			saved_states;
 	struct sched_state	saved[SCHED_STATES_MAX];
@@ -629,7 +629,7 @@ struct perf_sched {
 /*
  * Initialize interator that runs through all events and counters.
  */
-static void perf_sched_init(struct perf_sched *sched, struct perf_event **events,
+static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
 			    int num, int wmin, int wmax)
 {
 	int idx;
@@ -637,10 +637,10 @@ static void perf_sched_init(struct perf_
 	memset(sched, 0, sizeof(*sched));
 	sched->max_events	= num;
 	sched->max_weight	= wmax;
-	sched->events		= events;
+	sched->constraints	= constraints;
 
 	for (idx = 0; idx < num; idx++) {
-		if (events[idx]->hw.constraint->weight == wmin)
+		if (constraints[idx]->weight == wmin)
 			break;
 	}
 
@@ -687,7 +687,7 @@ static bool __perf_sched_find_counter(st
 	if (sched->state.event >= sched->max_events)
 		return false;
 
-	c = sched->events[sched->state.event]->hw.constraint;
+	c = sched->constraints[sched->state.event];
 	/* Prefer fixed purpose counters */
 	if (c->idxmsk64 & (~0ULL << INTEL_PMC_IDX_FIXED)) {
 		idx = INTEL_PMC_IDX_FIXED;
@@ -745,7 +745,7 @@ static bool perf_sched_next_event(struct
 			if (sched->state.weight > sched->max_weight)
 				return false;
 		}
-		c = sched->events[sched->state.event]->hw.constraint;
+		c = sched->constraints[sched->state.event];
 	} while (c->weight != sched->state.weight);
 
 	sched->state.counter = 0;	/* start with first counter */
@@ -756,12 +756,12 @@ static bool perf_sched_next_event(struct
 /*
  * Assign a counter for each event.
  */
-int perf_assign_events(struct perf_event **events, int n,
+int perf_assign_events(struct event_constraint **constraints, int n,
 			int wmin, int wmax, int *assign)
 {
 	struct perf_sched sched;
 
-	perf_sched_init(&sched, events, n, wmin, wmax);
+	perf_sched_init(&sched, constraints, n, wmin, wmax);
 
 	do {
 		if (!perf_sched_find_counter(&sched))
@@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
 		x86_pmu.start_scheduling(cpuc);
 
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
-		hwc = &cpuc->event_list[i]->hw;
+		cpuc->event_constraint[i] = NULL;
 		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
-		hwc->constraint = c;
+		cpuc->event_constraint[i] = c;
 
 		wmin = min(wmin, c->weight);
 		wmax = max(wmax, c->weight);
@@ -801,7 +801,7 @@ int x86_schedule_events(struct cpu_hw_ev
 	 */
 	for (i = 0; i < n; i++) {
 		hwc = &cpuc->event_list[i]->hw;
-		c = hwc->constraint;
+		c = cpuc->event_constraint[i];
 
 		/* never assigned */
 		if (hwc->idx == -1)
@@ -821,9 +821,10 @@ int x86_schedule_events(struct cpu_hw_ev
 	}
 
 	/* slow path */
-	if (i != n)
-		unsched = perf_assign_events(cpuc->event_list, n, wmin,
+	if (i != n) {
+		unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
 					     wmax, assign);
+	}
 
 	/*
 	 * In case of success (unsched = 0), mark events as committed,
@@ -840,7 +841,7 @@ int x86_schedule_events(struct cpu_hw_ev
 			e = cpuc->event_list[i];
 			e->hw.flags |= PERF_X86_EVENT_COMMITTED;
 			if (x86_pmu.commit_scheduling)
-				x86_pmu.commit_scheduling(cpuc, e, assign[i]);
+				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
 		}
 	}
 
@@ -1292,8 +1293,10 @@ static void x86_pmu_del(struct perf_even
 		x86_pmu.put_event_constraints(cpuc, event);
 
 	/* Delete the array entry. */
-	while (++i < cpuc->n_events)
+	while (++i < cpuc->n_events) {
 		cpuc->event_list[i-1] = cpuc->event_list[i];
+		cpuc->event_constraint[i-1] = cpuc->event_constraint[i];
+	}
 	--cpuc->n_events;
 
 	perf_event_update_userpage(event);
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -172,7 +172,10 @@ struct cpu_hw_events {
 					     added in the current transaction */
 	int			assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
 	u64			tags[X86_PMC_IDX_MAX];
+
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
+	struct event_constraint	*event_constraint[X86_PMC_IDX_MAX];
+
 
 	unsigned int		group_flag;
 	int			is_fake;
@@ -519,9 +522,7 @@ struct x86_pmu {
 	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
 						 struct perf_event *event);
 
-	void		(*commit_scheduling)(struct cpu_hw_events *cpuc,
-					     struct perf_event *event,
-					     int cntr);
+	void		(*commit_scheduling)(struct cpu_hw_events *cpuc, int idx, int cntr);
 
 	void		(*start_scheduling)(struct cpu_hw_events *cpuc);
 
@@ -717,7 +718,7 @@ static inline void __x86_pmu_enable_even
 
 void x86_pmu_enable_all(int added);
 
-int perf_assign_events(struct perf_event **events, int n,
+int perf_assign_events(struct event_constraint **constraints, int n,
 			int wmin, int wmax, int *assign);
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign);
 
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2106,7 +2106,7 @@ static struct event_constraint *
 intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 			    struct perf_event *event)
 {
-	struct event_constraint *c1 = event->hw.constraint;
+	struct event_constraint *c1 = cpuc->event_constraint[idx];
 	struct event_constraint *c2;
 
 	/*
@@ -2188,8 +2188,6 @@ intel_put_shared_regs_event_constraints(
 static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
 					struct perf_event *event)
 {
-	struct event_constraint *c = event->hw.constraint;
-
 	intel_put_shared_regs_event_constraints(cpuc, event);
 
 	/*
@@ -2197,19 +2195,14 @@ static void intel_put_event_constraints(
 	 * all events are subject to and must call the
 	 * put_excl_constraints() routine
 	 */
-	if (c && cpuc->excl_cntrs)
+	if (cpuc->excl_cntrs)
 		intel_put_excl_constraints(cpuc, event);
-
-	/* cleanup dynamic constraint */
-	if (c && (c->flags & PERF_X86_EVENT_DYNAMIC))
-		event->hw.constraint = NULL;
 }
 
-static void intel_commit_scheduling(struct cpu_hw_events *cpuc,
-				    struct perf_event *event, int cntr)
+static void intel_commit_scheduling(struct cpu_hw_events *cpuc, int idx, int cntr)
 {
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
-	struct event_constraint *c = event->hw.constraint;
+	struct event_constraint *c = cpuc->event_constraint[idx];
 	struct intel_excl_states *xlo, *xl;
 	int tid = cpuc->excl_thread_id;
 	int o_tid = 1 - tid;
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -706,9 +706,9 @@ void intel_pmu_pebs_disable(struct perf_
 
 	cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
 
-	if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
+	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
 		cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
-	else if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_ST)
+	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled &= ~(1ULL << 63);
 
 	if (cpuc->enabled)
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -364,9 +364,8 @@ static int uncore_assign_events(struct i
 	bitmap_zero(used_mask, UNCORE_PMC_IDX_MAX);
 
 	for (i = 0, wmin = UNCORE_PMC_IDX_MAX, wmax = 0; i < n; i++) {
-		hwc = &box->event_list[i]->hw;
 		c = uncore_get_event_constraint(box, box->event_list[i]);
-		hwc->constraint = c;
+		box->event_constraint[i] = c;
 		wmin = min(wmin, c->weight);
 		wmax = max(wmax, c->weight);
 	}
@@ -374,7 +373,7 @@ static int uncore_assign_events(struct i
 	/* fastpath, try to reuse previous register */
 	for (i = 0; i < n; i++) {
 		hwc = &box->event_list[i]->hw;
-		c = hwc->constraint;
+		c = box->event_constraint[i];
 
 		/* never assigned */
 		if (hwc->idx == -1)
@@ -394,7 +393,7 @@ static int uncore_assign_events(struct i
 	}
 	/* slow path */
 	if (i != n)
-		ret = perf_assign_events(box->event_list, n,
+		ret = perf_assign_events(box->event_constraint, n,
 					 wmin, wmax, assign);
 
 	if (!assign || ret) {
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -97,6 +97,7 @@ struct intel_uncore_box {
 	atomic_t refcnt;
 	struct perf_event *events[UNCORE_PMC_IDX_MAX];
 	struct perf_event *event_list[UNCORE_PMC_IDX_MAX];
+	struct event_constraint *event_constraint[UNCORE_PMC_IDX_MAX];
 	unsigned long active_mask[BITS_TO_LONGS(UNCORE_PMC_IDX_MAX)];
 	u64 tags[UNCORE_PMC_IDX_MAX];
 	struct pci_dev *pci_dev;




* [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 01/11] perf,x86: Fix event/group validation Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:42   ` Peter Zijlstra
                     ` (2 more replies)
  2015-05-22 13:29 ` [PATCH v2 03/11] perf/x86: Correct local vs remote sibling state Peter Zijlstra
                   ` (8 subsequent siblings)
  10 siblings, 3 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, kan.liang, linux-kernel

[-- Attachment #1: peterz-pmu-sched-2.patch --]
[-- Type: text/plain, Size: 8030 bytes --]

The (SNB/IVB/HSW) HT bug only affects events that can be programmed
onto GP counters; therefore we should only limit the number of GP
counters that can be used per cpu -- iow we should not constrain the
FP counters.

Furthermore, we should only enforce such a limit when there are in fact
exclusive events being scheduled on either sibling.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event.c              |   36 +++++++++++++++++++++-----
 arch/x86/kernel/cpu/perf_event.h              |   11 +++++--
 arch/x86/kernel/cpu/perf_event_intel.c        |   30 +++++++--------------
 arch/x86/kernel/cpu/perf_event_intel_uncore.c |    2 -
 4 files changed, 49 insertions(+), 30 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -611,6 +611,7 @@ struct sched_state {
 	int	event;		/* event index */
 	int	counter;	/* counter index */
 	int	unassigned;	/* number of events to be assigned left */
+	int	nr_gp;		/* number of GP counters used */
 	unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 };
 
@@ -620,9 +621,10 @@ struct sched_state {
 struct perf_sched {
 	int			max_weight;
 	int			max_events;
+	int			max_gp;
+	int			saved_states;
 	struct event_constraint	**constraints;
 	struct sched_state	state;
-	int			saved_states;
 	struct sched_state	saved[SCHED_STATES_MAX];
 };
 
@@ -630,13 +632,14 @@ struct perf_sched {
  * Initialize interator that runs through all events and counters.
  */
 static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
-			    int num, int wmin, int wmax)
+			    int num, int wmin, int wmax, int gpmax)
 {
 	int idx;
 
 	memset(sched, 0, sizeof(*sched));
 	sched->max_events	= num;
 	sched->max_weight	= wmax;
+	sched->max_gp		= gpmax;
 	sched->constraints	= constraints;
 
 	for (idx = 0; idx < num; idx++) {
@@ -696,11 +699,16 @@ static bool __perf_sched_find_counter(st
 				goto done;
 		}
 	}
+
 	/* Grab the first unused counter starting with idx */
 	idx = sched->state.counter;
 	for_each_set_bit_from(idx, c->idxmsk, INTEL_PMC_IDX_FIXED) {
-		if (!__test_and_set_bit(idx, sched->state.used))
+		if (!__test_and_set_bit(idx, sched->state.used)) {
+			if (sched->state.nr_gp++ >= sched->max_gp)
+				return false;
+
 			goto done;
+		}
 	}
 
 	return false;
@@ -757,11 +765,11 @@ static bool perf_sched_next_event(struct
  * Assign a counter for each event.
  */
 int perf_assign_events(struct event_constraint **constraints, int n,
-			int wmin, int wmax, int *assign)
+			int wmin, int wmax, int gpmax, int *assign)
 {
 	struct perf_sched sched;
 
-	perf_sched_init(&sched, constraints, n, wmin, wmax);
+	perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
 
 	do {
 		if (!perf_sched_find_counter(&sched))
@@ -822,8 +830,24 @@ int x86_schedule_events(struct cpu_hw_ev
 
 	/* slow path */
 	if (i != n) {
+		int gpmax = x86_pmu.num_counters;
+
+		/*
+		 * Do not allow scheduling of more than half the available
+		 * generic counters.
+		 *
+		 * This helps avoid counter starvation of sibling thread by
+		 * ensuring at most half the counters cannot be in exclusive
+		 * mode. There is no designated counters for the limits. Any
+		 * N/2 counters can be used. This helps with events with
+		 * specific counter constraints.
+		 */
+		if (is_ht_workaround_enabled() && !cpuc->is_fake &&
+		    READ_ONCE(cpuc->excl_cntrs->exclusive_present))
+			gpmax /= 2;
+
 		unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
-					     wmax, assign);
+					     wmax, gpmax, assign);
 	}
 
 	/*
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -74,6 +74,7 @@ struct event_constraint {
 #define PERF_X86_EVENT_EXCL		0x0040 /* HT exclusivity on counter */
 #define PERF_X86_EVENT_DYNAMIC		0x0080 /* dynamic alloc'd constraint */
 #define PERF_X86_EVENT_RDPMC_ALLOWED	0x0100 /* grant rdpmc permission */
+#define PERF_X86_EVENT_EXCL_ACCT	0x0200 /* accounted EXCL event */
 
 
 struct amd_nb {
@@ -134,8 +135,6 @@ enum intel_excl_state_type {
 struct intel_excl_states {
 	enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
 	enum intel_excl_state_type state[X86_PMC_IDX_MAX];
-	int  num_alloc_cntrs;/* #counters allocated */
-	int  max_alloc_cntrs;/* max #counters allowed */
 	bool sched_started; /* true if scheduling has started */
 };
 
@@ -144,6 +143,11 @@ struct intel_excl_cntrs {
 
 	struct intel_excl_states states[2];
 
+	union {
+		u16	has_exclusive[2];
+		u32	exclusive_present;
+	};
+
 	int		refcnt;		/* per-core: #HT threads */
 	unsigned	core_id;	/* per-core: core id */
 };
@@ -176,6 +180,7 @@ struct cpu_hw_events {
 	struct perf_event	*event_list[X86_PMC_IDX_MAX]; /* in enabled order */
 	struct event_constraint	*event_constraint[X86_PMC_IDX_MAX];
 
+	int			n_excl; /* the number of exclusive events */
 
 	unsigned int		group_flag;
 	int			is_fake;
@@ -719,7 +724,7 @@ static inline void __x86_pmu_enable_even
 void x86_pmu_enable_all(int added);
 
 int perf_assign_events(struct event_constraint **constraints, int n,
-			int wmin, int wmax, int *assign);
+			int wmin, int wmax, int gpmax, int *assign);
 int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign);
 
 void x86_pmu_stop(struct perf_event *event, int flags);
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1923,7 +1923,6 @@ intel_start_scheduling(struct cpu_hw_eve
 	xl = &excl_cntrs->states[tid];
 
 	xl->sched_started = true;
-	xl->num_alloc_cntrs = 0;
 	/*
 	 * lock shared state until we are done scheduling
 	 * in stop_event_scheduling()
@@ -2000,6 +1999,11 @@ intel_get_excl_constraints(struct cpu_hw
 	 * across HT threads
 	 */
 	is_excl = c->flags & PERF_X86_EVENT_EXCL;
+	if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
+		event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
+		if (!cpuc->n_excl++)
+			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
+	}
 
 	/*
 	 * xl = state of current HT
@@ -2008,18 +2012,6 @@ intel_get_excl_constraints(struct cpu_hw
 	xl = &excl_cntrs->states[tid];
 	xlo = &excl_cntrs->states[o_tid];
 
-	/*
-	 * do not allow scheduling of more than max_alloc_cntrs
-	 * which is set to half the available generic counters.
-	 * this helps avoid counter starvation of sibling thread
-	 * by ensuring at most half the counters cannot be in
-	 * exclusive mode. There is not designated counters for the
-	 * limits. Any N/2 counters can be used. This helps with
-	 * events with specifix counter constraints
-	 */
-	if (xl->num_alloc_cntrs++ == xl->max_alloc_cntrs)
-		return &emptyconstraint;
-
 	cx = c;
 
 	/*
@@ -2150,6 +2142,11 @@ static void intel_put_excl_constraints(s
 
 	xl = &excl_cntrs->states[tid];
 	xlo = &excl_cntrs->states[o_tid];
+	if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
+		hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
+		if (!--cpuc->n_excl)
+			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 0);
+	}
 
 	/*
 	 * put_constraint may be called from x86_schedule_events()
@@ -2632,8 +2629,6 @@ static void intel_pmu_cpu_starting(int c
 		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
 
 	if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
-		int h = x86_pmu.num_counters >> 1;
-
 		for_each_cpu(i, topology_thread_cpumask(cpu)) {
 			struct intel_excl_cntrs *c;
 
@@ -2647,11 +2642,6 @@ static void intel_pmu_cpu_starting(int c
 		}
 		cpuc->excl_cntrs->core_id = core_id;
 		cpuc->excl_cntrs->refcnt++;
-		/*
-		 * set hard limit to half the number of generic counters
-		 */
-		cpuc->excl_cntrs->states[0].max_alloc_cntrs = h;
-		cpuc->excl_cntrs->states[1].max_alloc_cntrs = h;
 	}
 }
 
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -394,7 +394,7 @@ static int uncore_assign_events(struct i
 	/* slow path */
 	if (i != n)
 		ret = perf_assign_events(box->event_constraint, n,
-					 wmin, wmax, assign);
+					 wmin, wmax, n, assign);
 
 	if (!assign || ret) {
 		for (i = 0; i < n; i++)




* [PATCH v2 03/11] perf/x86: Correct local vs remote sibling state
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 01/11] perf,x86: Fix event/group validation Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-26 11:48   ` Stephane Eranian
  2015-05-22 13:29 ` [PATCH v2 04/11] perf/x86: Use lockdep Peter Zijlstra
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-3.patch --]
[-- Type: text/plain, Size: 6195 bytes --]

For some obscure reason the current code accounts the current SMT
thread's state on the remote thread and reads the remote's state on
the local SMT thread.

While internally consistent, and 'correct', it's pointless confusion we
can do without.

Flip them the right way around.
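
After the flip the convention is as follows; a condensed sketch of the hunks
below:

	int tid = cpuc->excl_thread_id;

	/* xl: our own HT thread's state -- the state we account into */
	struct intel_excl_states *xl  = &excl_cntrs->states[tid];

	/* xlo: the sibling HT thread's state -- the state we consult when
	 * deciding which counters our events may still use */
	struct intel_excl_states *xlo = &excl_cntrs->states[tid ^ 1];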

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   79 +++++++++++++--------------------
 1 file changed, 33 insertions(+), 46 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1903,9 +1903,8 @@ static void
 intel_start_scheduling(struct cpu_hw_events *cpuc)
 {
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
-	struct intel_excl_states *xl, *xlo;
+	struct intel_excl_states *xl;
 	int tid = cpuc->excl_thread_id;
-	int o_tid = 1 - tid; /* sibling thread */
 
 	/*
 	 * nothing needed if in group validation mode
@@ -1919,7 +1918,6 @@ intel_start_scheduling(struct cpu_hw_eve
 	if (!excl_cntrs)
 		return;
 
-	xlo = &excl_cntrs->states[o_tid];
 	xl = &excl_cntrs->states[tid];
 
 	xl->sched_started = true;
@@ -1932,18 +1930,17 @@ intel_start_scheduling(struct cpu_hw_eve
 	raw_spin_lock(&excl_cntrs->lock);
 
 	/*
-	 * save initial state of sibling thread
+	 * Save a copy of our state to work on.
 	 */
-	memcpy(xlo->init_state, xlo->state, sizeof(xlo->init_state));
+	memcpy(xl->init_state, xl->state, sizeof(xl->init_state));
 }
 
 static void
 intel_stop_scheduling(struct cpu_hw_events *cpuc)
 {
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
-	struct intel_excl_states *xl, *xlo;
+	struct intel_excl_states *xl;
 	int tid = cpuc->excl_thread_id;
-	int o_tid = 1 - tid; /* sibling thread */
 
 	/*
 	 * nothing needed if in group validation mode
@@ -1956,13 +1953,12 @@ intel_stop_scheduling(struct cpu_hw_even
 	if (!excl_cntrs)
 		return;
 
-	xlo = &excl_cntrs->states[o_tid];
 	xl = &excl_cntrs->states[tid];
 
 	/*
-	 * make new sibling thread state visible
+	 * Commit the working state.
 	 */
-	memcpy(xlo->state, xlo->init_state, sizeof(xlo->state));
+	memcpy(xl->state, xl->init_state, sizeof(xl->state));
 
 	xl->sched_started = false;
 	/*
@@ -1977,10 +1973,9 @@ intel_get_excl_constraints(struct cpu_hw
 {
 	struct event_constraint *cx;
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
-	struct intel_excl_states *xl, *xlo;
-	int is_excl, i;
+	struct intel_excl_states *xlo;
 	int tid = cpuc->excl_thread_id;
-	int o_tid = 1 - tid; /* alternate */
+	int is_excl, i;
 
 	/*
 	 * validating a group does not require
@@ -1994,23 +1989,6 @@ intel_get_excl_constraints(struct cpu_hw
 	 */
 	if (!excl_cntrs)
 		return c;
-	/*
-	 * event requires exclusive counter access
-	 * across HT threads
-	 */
-	is_excl = c->flags & PERF_X86_EVENT_EXCL;
-	if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
-		event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
-		if (!cpuc->n_excl++)
-			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
-	}
-
-	/*
-	 * xl = state of current HT
-	 * xlo = state of sibling HT
-	 */
-	xl = &excl_cntrs->states[tid];
-	xlo = &excl_cntrs->states[o_tid];
 
 	cx = c;
 
@@ -2054,6 +2032,22 @@ intel_get_excl_constraints(struct cpu_hw
 	 */
 
 	/*
+	 * state of sibling HT
+	 */
+	xlo = &excl_cntrs->states[tid ^ 1];
+
+	/*
+	 * event requires exclusive counter access
+	 * across HT threads
+	 */
+	is_excl = c->flags & PERF_X86_EVENT_EXCL;
+	if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
+		event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
+		if (!cpuc->n_excl++)
+			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
+	}
+
+	/*
 	 * Modify static constraint with current dynamic
 	 * state of thread
 	 *
@@ -2067,14 +2061,14 @@ intel_get_excl_constraints(struct cpu_hw
 		 * our corresponding counter cannot be used
 		 * regardless of our event
 		 */
-		if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
+		if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
 			__clear_bit(i, cx->idxmsk);
 		/*
 		 * if measuring an exclusive event, sibling
 		 * measuring non-exclusive, then counter cannot
 		 * be used
 		 */
-		if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
+		if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED)
 			__clear_bit(i, cx->idxmsk);
 	}
 
@@ -2124,10 +2118,9 @@ static void intel_put_excl_constraints(s
 {
 	struct hw_perf_event *hwc = &event->hw;
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
-	struct intel_excl_states *xlo, *xl;
-	unsigned long flags = 0; /* keep compiler happy */
 	int tid = cpuc->excl_thread_id;
-	int o_tid = 1 - tid;
+	struct intel_excl_states *xl;
+	unsigned long flags = 0; /* keep compiler happy */
 
 	/*
 	 * nothing needed if in group validation mode
@@ -2141,7 +2134,6 @@ static void intel_put_excl_constraints(s
 		return;
 
 	xl = &excl_cntrs->states[tid];
-	xlo = &excl_cntrs->states[o_tid];
 	if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
 		hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
 		if (!--cpuc->n_excl)
@@ -2161,7 +2153,7 @@ static void intel_put_excl_constraints(s
 	 * counter state as unused now
 	 */
 	if (hwc->idx >= 0)
-		xlo->state[hwc->idx] = INTEL_EXCL_UNUSED;
+		xl->state[hwc->idx] = INTEL_EXCL_UNUSED;
 
 	if (!xl->sched_started)
 		raw_spin_unlock_irqrestore(&excl_cntrs->lock, flags);
@@ -2200,16 +2192,12 @@ static void intel_commit_scheduling(stru
 {
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
 	struct event_constraint *c = cpuc->event_constraint[idx];
-	struct intel_excl_states *xlo, *xl;
+	struct intel_excl_states *xl;
 	int tid = cpuc->excl_thread_id;
-	int o_tid = 1 - tid;
-	int is_excl;
 
 	if (cpuc->is_fake || !c)
 		return;
 
-	is_excl = c->flags & PERF_X86_EVENT_EXCL;
-
 	if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
 		return;
 
@@ -2219,15 +2207,14 @@ static void intel_commit_scheduling(stru
 		return;
 
 	xl = &excl_cntrs->states[tid];
-	xlo = &excl_cntrs->states[o_tid];
 
 	WARN_ON_ONCE(!raw_spin_is_locked(&excl_cntrs->lock));
 
 	if (cntr >= 0) {
-		if (is_excl)
-			xlo->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
+		if (c->flags & PERF_X86_EVENT_EXCL)
+			xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
 		else
-			xlo->init_state[cntr] = INTEL_EXCL_SHARED;
+			xl->init_state[cntr] = INTEL_EXCL_SHARED;
 	}
 }
 




* [PATCH v2 04/11] perf/x86: Use lockdep
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (2 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 03/11] perf/x86: Correct local vs remote sibling state Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 05/11] perf/x86: Simplify dynamic constraint code somewhat Peter Zijlstra
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-3a.patch --]
[-- Type: text/plain, Size: 935 bytes --]

Lockdep is very good at finding incorrect IRQ state while locking and
is far better at telling us if we hold a lock than the _is_locked()
API. It also generates less code for the !DEBUG kernels.
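
A minimal illustration of the swap made below; note that
lockdep_assert_held() compiles away entirely on !CONFIG_LOCKDEP builds:

	/* before: only checks that *somebody* holds the lock */
	WARN_ON_ONCE(!raw_spin_is_locked(&excl_cntrs->lock));

	/* after: asserts that *we* hold it, and costs nothing without lockdep */
	lockdep_assert_held(&excl_cntrs->lock);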

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1926,7 +1926,6 @@ intel_start_scheduling(struct cpu_hw_eve
 	 * in stop_event_scheduling()
 	 * makes scheduling appear as a transaction
 	 */
-	WARN_ON_ONCE(!irqs_disabled());
 	raw_spin_lock(&excl_cntrs->lock);
 
 	/*
@@ -2198,7 +2197,7 @@ static void intel_commit_scheduling(stru
 
 	xl = &excl_cntrs->states[tid];
 
-	WARN_ON_ONCE(!raw_spin_is_locked(&excl_cntrs->lock));
+	lockdep_assert_held(&excl_cntrs->lock);
 
 	if (cntr >= 0) {
 		if (c->flags & PERF_X86_EVENT_EXCL)




* [PATCH v2 05/11] perf/x86: Simplify dynamic constraint code somewhat
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (3 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 04/11] perf/x86: Use lockdep Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 06/11] perf/x86: Make WARNs consistent Peter Zijlstra
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-4.patch --]
[-- Type: text/plain, Size: 2788 bytes --]

Instead of using cx after the dynamic allocation, put all uses of cx
inside the dynamic allocation block and use c outside of it.

Also use direct assignment to copy the structure; let the compiler
figure it out.
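
The copy change amounts to swapping an explicit memcpy() for a plain
structure assignment; both copy the whole object and the compiler is free to
emit the same code for either (sketch of the two forms from the hunk below):

	memcpy(cx, c, sizeof(*cx));	/* before */
	*cx = *c;			/* after */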

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1970,7 +1970,6 @@ static struct event_constraint *
 intel_get_excl_constraints(struct cpu_hw_events *cpuc, struct perf_event *event,
 			   int idx, struct event_constraint *c)
 {
-	struct event_constraint *cx;
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
 	struct intel_excl_states *xlo;
 	int tid = cpuc->excl_thread_id;
@@ -1989,8 +1988,6 @@ intel_get_excl_constraints(struct cpu_hw
 	if (!excl_cntrs)
 		return c;
 
-	cx = c;
-
 	/*
 	 * because we modify the constraint, we need
 	 * to make a copy. Static constraints come
@@ -2000,6 +1997,7 @@ intel_get_excl_constraints(struct cpu_hw
 	 * been cloned (marked dynamic)
 	 */
 	if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
+		struct event_constraint *cx;
 
 		/* sanity check */
 		if (idx < 0)
@@ -2014,13 +2012,14 @@ intel_get_excl_constraints(struct cpu_hw
 		 * initialize dynamic constraint
 		 * with static constraint
 		 */
-		memcpy(cx, c, sizeof(*cx));
+		*cx = *c;
 
 		/*
 		 * mark constraint as dynamic, so we
 		 * can free it later on
 		 */
 		cx->flags |= PERF_X86_EVENT_DYNAMIC;
+		c = cx;
 	}
 
 	/*
@@ -2049,37 +2048,37 @@ intel_get_excl_constraints(struct cpu_hw
 	 * SHARED   : sibling counter measuring non-exclusive event
 	 * UNUSED   : sibling counter unused
 	 */
-	for_each_set_bit(i, cx->idxmsk, X86_PMC_IDX_MAX) {
+	for_each_set_bit(i, c->idxmsk, X86_PMC_IDX_MAX) {
 		/*
 		 * exclusive event in sibling counter
 		 * our corresponding counter cannot be used
 		 * regardless of our event
 		 */
 		if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
-			__clear_bit(i, cx->idxmsk);
+			__clear_bit(i, c->idxmsk);
 		/*
 		 * if measuring an exclusive event, sibling
 		 * measuring non-exclusive, then counter cannot
 		 * be used
 		 */
 		if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED)
-			__clear_bit(i, cx->idxmsk);
+			__clear_bit(i, c->idxmsk);
 	}
 
 	/*
 	 * recompute actual bit weight for scheduling algorithm
 	 */
-	cx->weight = hweight64(cx->idxmsk64);
+	c->weight = hweight64(c->idxmsk64);
 
 	/*
 	 * if we return an empty mask, then switch
 	 * back to static empty constraint to avoid
 	 * the cost of freeing later on
 	 */
-	if (cx->weight == 0)
-		cx = &emptyconstraint;
+	if (c->weight == 0)
+		c = &emptyconstraint;
 
-	return cx;
+	return c;
 }
 
 static struct event_constraint *




* [PATCH v2 06/11] perf/x86: Make WARNs consistent
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (4 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 05/11] perf/x86: Simplify dynamic constraint code somewhat Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 07/11] perf/x86: Move intel_commit_scheduling() Peter Zijlstra
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-5.patch --]
[-- Type: text/plain, Size: 2282 bytes --]

The commit_scheduling hook is pointlessly different from the start and
stop scheduling hooks.

Furthermore, the constraint should never be NULL, so remove that test.
Even though we'll never get called when !is_ht_workaround_enabled()
(because we NULL the callbacks), put that test in anyway.

Collapse the (pointless) WARN_ON_ONCE and bail on !cpuc->excl_cntrs --
this is doubly pointless, because it's the same condition as
is_ht_workaround_enabled(), which was already pointless because the
whole method won't ever be called in that case.

Furthermore, make all the !excl_cntrs tests WARN_ON_ONCE; they're all
pointless because, as above, either the functions
({get,put}_excl_constraints) are already predicated on it existing or
the is_ht_workaround_enabled() check is the same test.
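
The resulting pattern relies on WARN_ON_ONCE() evaluating to its condition,
so the warn and the bail-out collapse into a single statement; sketch of the
two forms from the hunks below:

	/* before: warn, then test again */
	WARN_ON_ONCE(!excl_cntrs);
	if (!excl_cntrs)
		return;

	/* after: warn (once) and bail in one go */
	if (WARN_ON_ONCE(!excl_cntrs))
		return;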

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1915,7 +1915,7 @@ intel_start_scheduling(struct cpu_hw_eve
 	/*
 	 * no exclusion needed
 	 */
-	if (!excl_cntrs)
+	if (WARN_ON_ONCE(!excl_cntrs))
 		return;
 
 	xl = &excl_cntrs->states[tid];
@@ -1949,7 +1949,7 @@ intel_stop_scheduling(struct cpu_hw_even
 	/*
 	 * no exclusion needed
 	 */
-	if (!excl_cntrs)
+	if (WARN_ON_ONCE(!excl_cntrs))
 		return;
 
 	xl = &excl_cntrs->states[tid];
@@ -1985,7 +1985,7 @@ intel_get_excl_constraints(struct cpu_hw
 	/*
 	 * no exclusion needed
 	 */
-	if (!excl_cntrs)
+	if (WARN_ON_ONCE(!excl_cntrs))
 		return c;
 
 	/*
@@ -2121,9 +2121,7 @@ static void intel_put_excl_constraints(s
 	if (cpuc->is_fake)
 		return;
 
-	WARN_ON_ONCE(!excl_cntrs);
-
-	if (!excl_cntrs)
+	if (WARN_ON_ONCE(!excl_cntrs))
 		return;
 
 	xl = &excl_cntrs->states[tid];
@@ -2190,15 +2188,13 @@ static void intel_commit_scheduling(stru
 	struct intel_excl_states *xl;
 	int tid = cpuc->excl_thread_id;
 
-	if (cpuc->is_fake || !c)
+	if (cpuc->is_fake || !is_ht_workaround_enabled())
 		return;
 
-	if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
+	if (WARN_ON_ONCE(!excl_cntrs))
 		return;
 
-	WARN_ON_ONCE(!excl_cntrs);
-
-	if (!excl_cntrs)
+	if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
 		return;
 
 	xl = &excl_cntrs->states[tid];




* [PATCH v2 07/11] perf/x86: Move intel_commit_scheduling()
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (5 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 06/11] perf/x86: Make WARNs consistent Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 08/11] perf/x86: Remove pointless tests Peter Zijlstra
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-6.patch --]
[-- Type: text/plain, Size: 3424 bytes --]

Place intel_commit_scheduling() in the right place, which is in
between start and stop.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event.h       |    4 +-
 arch/x86/kernel/cpu/perf_event_intel.c |   60 ++++++++++++++++-----------------
 2 files changed, 32 insertions(+), 32 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -520,10 +520,10 @@ struct x86_pmu {
 	void		(*put_event_constraints)(struct cpu_hw_events *cpuc,
 						 struct perf_event *event);
 
-	void		(*commit_scheduling)(struct cpu_hw_events *cpuc, int idx, int cntr);
-
 	void		(*start_scheduling)(struct cpu_hw_events *cpuc);
 
+	void		(*commit_scheduling)(struct cpu_hw_events *cpuc, int idx, int cntr);
+
 	void		(*stop_scheduling)(struct cpu_hw_events *cpuc);
 
 	struct event_constraint *event_constraints;
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1934,6 +1934,34 @@ intel_start_scheduling(struct cpu_hw_eve
 	memcpy(xl->init_state, xl->state, sizeof(xl->init_state));
 }
 
+static void intel_commit_scheduling(struct cpu_hw_events *cpuc, int idx, int cntr)
+{
+	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
+	struct event_constraint *c = cpuc->event_constraint[idx];
+	struct intel_excl_states *xl;
+	int tid = cpuc->excl_thread_id;
+
+	if (cpuc->is_fake || !is_ht_workaround_enabled())
+		return;
+
+	if (WARN_ON_ONCE(!excl_cntrs))
+		return;
+
+	if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
+		return;
+
+	xl = &excl_cntrs->states[tid];
+
+	lockdep_assert_held(&excl_cntrs->lock);
+
+	if (cntr >= 0) {
+		if (c->flags & PERF_X86_EVENT_EXCL)
+			xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
+		else
+			xl->init_state[cntr] = INTEL_EXCL_SHARED;
+	}
+}
+
 static void
 intel_stop_scheduling(struct cpu_hw_events *cpuc)
 {
@@ -2174,34 +2202,6 @@ static void intel_put_event_constraints(
 		intel_put_excl_constraints(cpuc, event);
 }
 
-static void intel_commit_scheduling(struct cpu_hw_events *cpuc, int idx, int cntr)
-{
-	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
-	struct event_constraint *c = cpuc->event_constraint[idx];
-	struct intel_excl_states *xl;
-	int tid = cpuc->excl_thread_id;
-
-	if (cpuc->is_fake || !is_ht_workaround_enabled())
-		return;
-
-	if (WARN_ON_ONCE(!excl_cntrs))
-		return;
-
-	if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
-		return;
-
-	xl = &excl_cntrs->states[tid];
-
-	lockdep_assert_held(&excl_cntrs->lock);
-
-	if (cntr >= 0) {
-		if (c->flags & PERF_X86_EVENT_EXCL)
-			xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
-		else
-			xl->init_state[cntr] = INTEL_EXCL_SHARED;
-	}
-}
-
 static void intel_pebs_aliases_core2(struct perf_event *event)
 {
 	if ((event->hw.config & X86_RAW_EVENT_MASK) == 0x003c) {
@@ -2910,8 +2910,8 @@ static __init void intel_ht_bug(void)
 {
 	x86_pmu.flags |= PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED;
 
-	x86_pmu.commit_scheduling = intel_commit_scheduling;
 	x86_pmu.start_scheduling = intel_start_scheduling;
+	x86_pmu.commit_scheduling = intel_commit_scheduling;
 	x86_pmu.stop_scheduling = intel_stop_scheduling;
 }
 
@@ -3367,8 +3367,8 @@ static __init int fixup_ht_bug(void)
 
 	x86_pmu.flags &= ~(PMU_FL_EXCL_CNTRS | PMU_FL_EXCL_ENABLED);
 
-	x86_pmu.commit_scheduling = NULL;
 	x86_pmu.start_scheduling = NULL;
+	x86_pmu.commit_scheduling = NULL;
 	x86_pmu.stop_scheduling = NULL;
 
 	watchdog_nmi_enable_all();




* [PATCH v2 08/11] perf/x86: Remove pointless tests
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (6 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 07/11] perf/x86: Move intel_commit_scheduling() Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 09/11] perf/x86: Remove intel_excl_states::init_state Peter Zijlstra
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-7.patch --]
[-- Type: text/plain, Size: 1236 bytes --]

Both intel_commit_scheduling() and intel_get_excl_constraints() test
for cntr < 0.

The only way that can happen (aside from a bug) is through
validate_event(); however, that is already captured by the
cpuc->is_fake test.

So remove these tests and simplify the code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1955,12 +1955,10 @@ static void intel_commit_scheduling(stru
 
 	lockdep_assert_held(&excl_cntrs->lock);
 
-	if (cntr >= 0) {
-		if (c->flags & PERF_X86_EVENT_EXCL)
-			xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
-		else
-			xl->init_state[cntr] = INTEL_EXCL_SHARED;
-	}
+	if (c->flags & PERF_X86_EVENT_EXCL)
+		xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
+	else
+		xl->init_state[cntr] = INTEL_EXCL_SHARED;
 }
 
 static void
@@ -2028,10 +2026,6 @@ intel_get_excl_constraints(struct cpu_hw
 	if (!(c->flags & PERF_X86_EVENT_DYNAMIC)) {
 		struct event_constraint *cx;
 
-		/* sanity check */
-		if (idx < 0)
-			return &emptyconstraint;
-
 		/*
 		 * grab pre-allocated constraint entry
 		 */




* [PATCH v2 09/11] perf/x86: Remove intel_excl_states::init_state
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (7 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 08/11] perf/x86: Remove pointless tests Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 10/11] perf,x86: Simplify logic Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 11/11] perf/x86: Simplify put_exclusive_constraints Peter Zijlstra
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-8.patch --]
[-- Type: text/plain, Size: 3178 bytes --]

For some obscure reason intel_{start,stop}_scheduling() copy the HT
state to an intermediate array. This would make sense if we ever were
to make changes to it which we'd have to discard.

Except we don't. By the time we call intel_commit_scheduling() we're,
as the name implies, committed to them. We'll never back out.

A further hint that it's pointless is that stop_scheduling()
unconditionally publishes the state.

So the intermediate array is pointless; modify the state in place and
kill the extra array.

And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0.

Note: all is serialized by intel_excl_cntrs::lock.
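
With the intermediate array gone, commit simply writes the live state while
holding that lock; a condensed sketch of the intel_commit_scheduling() hunk:

	lockdep_assert_held(&excl_cntrs->lock);

	if (c->flags & PERF_X86_EVENT_EXCL)
		xl->state[cntr] = INTEL_EXCL_EXCLUSIVE;
	else
		xl->state[cntr] = INTEL_EXCL_SHARED;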

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event.c       |    1 -
 arch/x86/kernel/cpu/perf_event.h       |    1 -
 arch/x86/kernel/cpu/perf_event_intel.c |   22 ++--------------------
 3 files changed, 2 insertions(+), 22 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -861,7 +861,6 @@ int x86_schedule_events(struct cpu_hw_ev
 	}
 
 	if (!assign || unsched) {
-
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
 			/*
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -132,7 +132,6 @@ enum intel_excl_state_type {
 };
 
 struct intel_excl_states {
-	enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
 	enum intel_excl_state_type state[X86_PMC_IDX_MAX];
 	bool sched_started; /* true if scheduling has started */
 };
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1927,11 +1927,6 @@ intel_start_scheduling(struct cpu_hw_eve
 	 * makes scheduling appear as a transaction
 	 */
 	raw_spin_lock(&excl_cntrs->lock);
-
-	/*
-	 * Save a copy of our state to work on.
-	 */
-	memcpy(xl->init_state, xl->state, sizeof(xl->init_state));
 }
 
 static void intel_commit_scheduling(struct cpu_hw_events *cpuc, int idx, int cntr)
@@ -1955,9 +1950,9 @@ static void intel_commit_scheduling(stru
 	lockdep_assert_held(&excl_cntrs->lock);
 
 	if (c->flags & PERF_X86_EVENT_EXCL)
-		xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
+		xl->state[cntr] = INTEL_EXCL_EXCLUSIVE;
 	else
-		xl->init_state[cntr] = INTEL_EXCL_SHARED;
+		xl->state[cntr] = INTEL_EXCL_SHARED;
 }
 
 static void
@@ -1980,11 +1975,6 @@ intel_stop_scheduling(struct cpu_hw_even
 
 	xl = &excl_cntrs->states[tid];
 
-	/*
-	 * Commit the working state.
-	 */
-	memcpy(xl->state, xl->init_state, sizeof(xl->state));
-
 	xl->sched_started = false;
 	/*
 	 * release shared state lock (acquired in intel_start_scheduling())
@@ -2509,19 +2499,11 @@ struct intel_shared_regs *allocate_share
 static struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
 {
 	struct intel_excl_cntrs *c;
-	int i;
 
 	c = kzalloc_node(sizeof(struct intel_excl_cntrs),
 			 GFP_KERNEL, cpu_to_node(cpu));
 	if (c) {
 		raw_spin_lock_init(&c->lock);
-		for (i = 0; i < X86_PMC_IDX_MAX; i++) {
-			c->states[0].state[i] = INTEL_EXCL_UNUSED;
-			c->states[0].init_state[i] = INTEL_EXCL_UNUSED;
-
-			c->states[1].state[i] = INTEL_EXCL_UNUSED;
-			c->states[1].init_state[i] = INTEL_EXCL_UNUSED;
-		}
 		c->core_id = -1;
 	}
 	return c;




* [PATCH v2 10/11] perf,x86: Simplify logic
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (8 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 09/11] perf/x86: Remove intel_excl_states::init_state Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:29 ` [PATCH v2 11/11] perf/x86: Simplify put_exclusive_constraints Peter Zijlstra
  10 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-9.patch --]
[-- Type: text/plain, Size: 528 bytes --]

	!x && y == !(x || !y)

With x = unsched and y = assign, the old undo condition '!assign ||
unsched' is exactly the negation of the commit condition '!unsched &&
assign', so the two blocks can simply become if/else.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -858,9 +858,7 @@ int x86_schedule_events(struct cpu_hw_ev
 			if (x86_pmu.commit_scheduling)
 				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
 		}
-	}
-
-	if (!assign || unsched) {
+	} else {
 		for (i = 0; i < n; i++) {
 			e = cpuc->event_list[i];
 			/*




* [PATCH v2 11/11] perf/x86: Simplify put_exclusive_constraints
  2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
                   ` (9 preceding siblings ...)
  2015-05-22 13:29 ` [PATCH v2 10/11] perf,x86: Simplify logic Peter Zijlstra
@ 2015-05-22 13:29 ` Peter Zijlstra
  2015-05-22 13:38   ` Peter Zijlstra
  10 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:29 UTC (permalink / raw)
  To: mingo, peterz; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

[-- Attachment #1: peterz-pmu-sched-10.patch --]
[-- Type: text/plain, Size: 1992 bytes --]

Don't bother with taking locks if we're not actually going to do
anything. Also, drop the _irqsave(); this is only ever called from
IRQ-disabled context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/cpu/perf_event_intel.c |   29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2130,7 +2130,6 @@ static void intel_put_excl_constraints(s
 	struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
 	int tid = cpuc->excl_thread_id;
 	struct intel_excl_states *xl;
-	unsigned long flags = 0; /* keep compiler happy */
 
 	/*
 	 * nothing needed if in group validation mode
@@ -2141,7 +2140,6 @@ static void intel_put_excl_constraints(s
 	if (WARN_ON_ONCE(!excl_cntrs))
 		return;
 
-	xl = &excl_cntrs->states[tid];
 	if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
 		hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
 		if (!--cpuc->n_excl)
@@ -2149,22 +2147,25 @@ static void intel_put_excl_constraints(s
 	}
 
 	/*
-	 * put_constraint may be called from x86_schedule_events()
-	 * which already has the lock held so here make locking
-	 * conditional
+	 * If event was actually assigned, then mark the counter state as
+	 * unused now.
 	 */
-	if (!xl->sched_started)
-		raw_spin_lock_irqsave(&excl_cntrs->lock, flags);
+	if (hwc->idx >= 0) {
+		xl = &excl_cntrs->states[tid];
+
+		/*
+		 * put_constraint may be called from x86_schedule_events()
+		 * which already has the lock held so here make locking
+		 * conditional.
+		 */
+		if (!xl->sched_started)
+			raw_spin_lock(&excl_cntrs->lock);
 
-	/*
-	 * if event was actually assigned, then mark the
-	 * counter state as unused now
-	 */
-	if (hwc->idx >= 0)
 		xl->state[hwc->idx] = INTEL_EXCL_UNUSED;
 
-	if (!xl->sched_started)
-		raw_spin_unlock_irqrestore(&excl_cntrs->lock, flags);
+		if (!xl->sched_started)
+			raw_spin_unlock(&excl_cntrs->lock);
+	}
 }
 
 static void




* Re: [PATCH v2 11/11] perf/x86: Simplify put_exclusive_constraints
  2015-05-22 13:29 ` [PATCH v2 11/11] perf/x86: Simplify put_exclusive_constraints Peter Zijlstra
@ 2015-05-22 13:38   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:38 UTC (permalink / raw)
  To: mingo; +Cc: vincent.weaver, eranian, jolsa, linux-kernel

On Fri, May 22, 2015 at 03:29:16PM +0200, Peter Zijlstra wrote:
> @@ -2149,22 +2147,25 @@ static void intel_put_excl_constraints(s
>  	}
>  
>  	/*
> +	 * If event was actually assigned, then mark the counter state as
> +	 * unused now.
>  	 */
> +	if (hwc->idx >= 0) {
> +		xl = &excl_cntrs->states[tid];
> +
> +		/*
> +		 * put_constraint may be called from x86_schedule_events()
> +		 * which already has the lock held so here make locking
> +		 * conditional.
> +		 */
> +		if (!xl->sched_started)
> +			raw_spin_lock(&excl_cntrs->lock);
>  
>  		xl->state[hwc->idx] = INTEL_EXCL_UNUSED;
>  
> +		if (!xl->sched_started)
> +			raw_spin_unlock(&excl_cntrs->lock);
> +	}
>  }

Hmm, we could probably avoid the lock and use WRITE_ONCE() to clear this
state, but I'm too tired to really think about it.


* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-22 13:29 ` [PATCH v2 01/11] perf,x86: Fix event/group validation Peter Zijlstra
@ 2015-05-22 13:40   ` Peter Zijlstra
  2015-05-26  9:24     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:40 UTC (permalink / raw)
  To: mingo
  Cc: vincent.weaver, eranian, jolsa, kan.liang, linux-kernel,
	Andrew Hunter, Maria Dimakopoulou

On Fri, May 22, 2015 at 03:29:06PM +0200, Peter Zijlstra wrote:
> @@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
>  		x86_pmu.start_scheduling(cpuc);
>  
>  	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> +		cpuc->event_constraint[i] = NULL;

^^^ that is new, which is esp. important in light of the
intel_get_event_constraints() hunk below, which would happily continue
life with a garbage constraint.

>  		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> +		cpuc->event_constraint[i] = c;
>  
>  		wmin = min(wmin, c->weight);
>  		wmax = max(wmax, c->weight);


> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -2106,7 +2106,7 @@ static struct event_constraint *
>  intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>  			    struct perf_event *event)
>  {
> -	struct event_constraint *c1 = event->hw.constraint;
> +	struct event_constraint *c1 = cpuc->event_constraint[idx];
>  	struct event_constraint *c2;
>  
>  	/*


* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-22 13:29 ` [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint Peter Zijlstra
@ 2015-05-22 13:42   ` Peter Zijlstra
  2015-05-26  9:37   ` Stephane Eranian
  2015-05-26 23:33   ` Andi Kleen
  2 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-22 13:42 UTC (permalink / raw)
  To: mingo; +Cc: vincent.weaver, eranian, jolsa, kan.liang, linux-kernel

On Fri, May 22, 2015 at 03:29:07PM +0200, Peter Zijlstra wrote:
> @@ -696,11 +699,16 @@ static bool __perf_sched_find_counter(st
>  				goto done;
>  		}
>  	}
> +
>  	/* Grab the first unused counter starting with idx */
>  	idx = sched->state.counter;
>  	for_each_set_bit_from(idx, c->idxmsk, INTEL_PMC_IDX_FIXED) {
> +		if (!__test_and_set_bit(idx, sched->state.used)) {
> +			if (sched->state.nr_gp++ >= sched->max_gp)
> +				return false;

Note that the check is now placed inside the success path of the GP
allocation, rather than at the mere attempt at a GP placement as before.

> +
>  			goto done;
> +		}
>  	}
>  
>  	return false;




> @@ -2000,6 +1999,11 @@ intel_get_excl_constraints(struct cpu_hw
>  	 * across HT threads
>  	 */
>  	is_excl = c->flags & PERF_X86_EVENT_EXCL;
> +	if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
> +		event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
> +		if (!cpuc->n_excl++)
> +			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
> +	}
>  
>  	/*
>  	 * xl = state of current HT

And that is what keeps repeated get_event_constraints() calls from ever
increasing our n_excl count.
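
The has_exclusive[]/exclusive_present pair that backs this is a plain union
overlay; sketch of the struct intel_excl_cntrs addition:

	union {
		u16	has_exclusive[2];	/* one flag per HT sibling */
		u32	exclusive_present;	/* both flags read as one word */
	};

Each sibling sets its own has_exclusive[tid] when it gains its first
exclusive event and clears it when the last one goes away, so the
READ_ONCE(excl_cntrs->exclusive_present) test in x86_schedule_events() is
non-zero exactly when either sibling has exclusive events -- which is what
gates the gpmax /= 2 above.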


* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-22 13:40   ` Peter Zijlstra
@ 2015-05-26  9:24     ` Stephane Eranian
  2015-05-26 10:12       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26  9:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Fri, May 22, 2015 at 6:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 22, 2015 at 03:29:06PM +0200, Peter Zijlstra wrote:
>> @@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
>>               x86_pmu.start_scheduling(cpuc);
>>
>>       for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
>> +             cpuc->event_constraint[i] = NULL;
>
> ^^^ that is new, which is esp. important in light of the
> intel_get_event_constraints() hunk below, which would happily continue
> life with a garbage constraint.
>
You've moved the constraint list from event to cpuc. Yet, it is still an
array of pointers to constraints. So here you are saying that, in the case
where validate_group() is preempted and there is a context switch, there is
still a risk of overwriting the constraint? I don't see how, because
validate_group() is using a fake_cpuc. So yes, the cpuc->event_constraint[]
array is modified, but it is not the same as the actual cpuc used by
non-validate code. Or am I still missing something?

When using dynamic constraints, we already have constraint storage in cpuc
(to avoid calling kmalloc() in ctxsw context). Thus, I am wondering if it would
not be easier to always use cpuc for constraint storage (no more pointers).

>>               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
>> +             cpuc->event_constraint[i] = c;
>>
>>               wmin = min(wmin, c->weight);
>>               wmax = max(wmax, c->weight);
>
>
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -2106,7 +2106,7 @@ static struct event_constraint *
>>  intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>>                           struct perf_event *event)
>>  {
>> -     struct event_constraint *c1 = event->hw.constraint;
>> +     struct event_constraint *c1 = cpuc->event_constraint[idx];
>>       struct event_constraint *c2;
>>
>>       /*
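
As a rough sketch of the by-value storage idea above (hypothetical layout
with simplified types, not what the current code does), the constraint
objects would live directly in cpuc, so no pointer can go stale between
scheduling runs:

#define X86_PMC_IDX_MAX 64

struct event_constraint_sketch {
	unsigned long	idxmsk;		/* stand-in for the counter bitmask */
	int		weight;
	unsigned int	flags;
};

struct cpu_hw_events_sketch {
	/* constraints stored by value, indexed like event_list[] */
	struct event_constraint_sketch	constraint[X86_PMC_IDX_MAX];
};

get_event_constraints() would then fill constraint[i] in place on every
pass, which is more or less what the "no more pointers" remark asks for.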

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-22 13:29 ` [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint Peter Zijlstra
  2015-05-22 13:42   ` Peter Zijlstra
@ 2015-05-26  9:37   ` Stephane Eranian
  2015-05-26 10:15     ` Peter Zijlstra
  2015-05-26 23:33   ` Andi Kleen
  2 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26  9:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Fri, May 22, 2015 at 6:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> The (SNB/IVB/HSW) HT bug only affects events that can be programmed
> onto GP counters, therefore we should only limit the number of GP
> counters that can be used per cpu -- iow we should not constrain the
> FP counters.
>
> Furthermore, we should only enforce such a limit when there are in fact
> exclusive events being scheduled on either sibling.
>
> Reported-by: Vince Weaver <vincent.weaver@maine.edu>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/kernel/cpu/perf_event.c              |   36 +++++++++++++++++++++-----
>  arch/x86/kernel/cpu/perf_event.h              |   11 +++++--
>  arch/x86/kernel/cpu/perf_event_intel.c        |   30 +++++++--------------
>  arch/x86/kernel/cpu/perf_event_intel_uncore.c |    2 -
>  4 files changed, 49 insertions(+), 30 deletions(-)
>
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -611,6 +611,7 @@ struct sched_state {
>         int     event;          /* event index */
>         int     counter;        /* counter index */
>         int     unassigned;     /* number of events to be assigned left */
> +       int     nr_gp;          /* number of GP counters used */
>         unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
>  };
>
> @@ -620,9 +621,10 @@ struct sched_state {
>  struct perf_sched {
>         int                     max_weight;
>         int                     max_events;
> +       int                     max_gp;
> +       int                     saved_states;
>         struct event_constraint **constraints;
>         struct sched_state      state;
> -       int                     saved_states;
>         struct sched_state      saved[SCHED_STATES_MAX];
>  };
>
> @@ -630,13 +632,14 @@ struct perf_sched {
>   * Initialize interator that runs through all events and counters.
>   */
>  static void perf_sched_init(struct perf_sched *sched, struct event_constraint **constraints,
> -                           int num, int wmin, int wmax)
> +                           int num, int wmin, int wmax, int gpmax)
>  {
>         int idx;
>
>         memset(sched, 0, sizeof(*sched));
>         sched->max_events       = num;
>         sched->max_weight       = wmax;
> +       sched->max_gp           = gpmax;
>         sched->constraints      = constraints;
>
>         for (idx = 0; idx < num; idx++) {
> @@ -696,11 +699,16 @@ static bool __perf_sched_find_counter(st
>                                 goto done;
>                 }
>         }
> +
>         /* Grab the first unused counter starting with idx */
>         idx = sched->state.counter;
>         for_each_set_bit_from(idx, c->idxmsk, INTEL_PMC_IDX_FIXED) {
> -               if (!__test_and_set_bit(idx, sched->state.used))
> +               if (!__test_and_set_bit(idx, sched->state.used)) {
> +                       if (sched->state.nr_gp++ >= sched->max_gp)
> +                               return false;
> +
>                         goto done;
> +               }
>         }
>
>         return false;
> @@ -757,11 +765,11 @@ static bool perf_sched_next_event(struct
>   * Assign a counter for each event.
>   */
>  int perf_assign_events(struct event_constraint **constraints, int n,
> -                       int wmin, int wmax, int *assign)
> +                       int wmin, int wmax, int gpmax, int *assign)
>  {
>         struct perf_sched sched;
>
> -       perf_sched_init(&sched, constraints, n, wmin, wmax);
> +       perf_sched_init(&sched, constraints, n, wmin, wmax, gpmax);
>
>         do {
>                 if (!perf_sched_find_counter(&sched))
> @@ -822,8 +830,24 @@ int x86_schedule_events(struct cpu_hw_ev
>
>         /* slow path */
>         if (i != n) {
> +               int gpmax = x86_pmu.num_counters;
> +
> +               /*
> +                * Do not allow scheduling of more than half the available
> +                * generic counters.
> +                *
> +                * This helps avoid counter starvation of sibling thread by
> +                * ensuring at most half the counters cannot be in exclusive
> +                * mode. There is no designated counters for the limits. Any
> +                * N/2 counters can be used. This helps with events with
> +                * specific counter constraints.
> +                */
> +               if (is_ht_workaround_enabled() && !cpuc->is_fake &&
> +                   READ_ONCE(cpuc->excl_cntrs->exclusive_present))
> +                       gpmax /= 2;
> +
What I don't like about this part is that this is a hack to work around a bug
on some limited Intel CPUs and yet it is in the middle of generic x86 code.
I understand it will be inoperative on AMD PMU and is not used by Intel
uncore. On KNC or P6, you will not have is_ht_workaround_enabled().
Could this be made a x86_pmu callback? x86_pmu.counter_limit()?


>                 unsched = perf_assign_events(cpuc->event_constraint, n, wmin,
> -                                            wmax, assign);
> +                                            wmax, gpmax, assign);
>         }
>
>         /*
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -74,6 +74,7 @@ struct event_constraint {
>  #define PERF_X86_EVENT_EXCL            0x0040 /* HT exclusivity on counter */
>  #define PERF_X86_EVENT_DYNAMIC         0x0080 /* dynamic alloc'd constraint */
>  #define PERF_X86_EVENT_RDPMC_ALLOWED   0x0100 /* grant rdpmc permission */
> +#define PERF_X86_EVENT_EXCL_ACCT       0x0200 /* accounted EXCL event */
>
>
>  struct amd_nb {
> @@ -134,8 +135,6 @@ enum intel_excl_state_type {
>  struct intel_excl_states {
>         enum intel_excl_state_type init_state[X86_PMC_IDX_MAX];
>         enum intel_excl_state_type state[X86_PMC_IDX_MAX];
> -       int  num_alloc_cntrs;/* #counters allocated */
> -       int  max_alloc_cntrs;/* max #counters allowed */
>         bool sched_started; /* true if scheduling has started */
>  };
>
> @@ -144,6 +143,11 @@ struct intel_excl_cntrs {
>
>         struct intel_excl_states states[2];
>
> +       union {
> +               u16     has_exclusive[2];
> +               u32     exclusive_present;
> +       };
> +
>         int             refcnt;         /* per-core: #HT threads */
>         unsigned        core_id;        /* per-core: core id */
>  };
> @@ -176,6 +180,7 @@ struct cpu_hw_events {
>         struct perf_event       *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
>         struct event_constraint *event_constraint[X86_PMC_IDX_MAX];
>
> +       int                     n_excl; /* the number of exclusive events */
>
>         unsigned int            group_flag;
>         int                     is_fake;
> @@ -719,7 +724,7 @@ static inline void __x86_pmu_enable_even
>  void x86_pmu_enable_all(int added);
>
>  int perf_assign_events(struct event_constraint **constraints, int n,
> -                       int wmin, int wmax, int *assign);
> +                       int wmin, int wmax, int gpmax, int *assign);
>  int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign);
>
>  void x86_pmu_stop(struct perf_event *event, int flags);
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1923,7 +1923,6 @@ intel_start_scheduling(struct cpu_hw_eve
>         xl = &excl_cntrs->states[tid];
>
>         xl->sched_started = true;
> -       xl->num_alloc_cntrs = 0;
>         /*
>          * lock shared state until we are done scheduling
>          * in stop_event_scheduling()
> @@ -2000,6 +1999,11 @@ intel_get_excl_constraints(struct cpu_hw
>          * across HT threads
>          */
>         is_excl = c->flags & PERF_X86_EVENT_EXCL;
> +       if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
> +               event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
> +               if (!cpuc->n_excl++)
> +                       WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
> +       }
>
>         /*
>          * xl = state of current HT
> @@ -2008,18 +2012,6 @@ intel_get_excl_constraints(struct cpu_hw
>         xl = &excl_cntrs->states[tid];
>         xlo = &excl_cntrs->states[o_tid];
>
> -       /*
> -        * do not allow scheduling of more than max_alloc_cntrs
> -        * which is set to half the available generic counters.
> -        * this helps avoid counter starvation of sibling thread
> -        * by ensuring at most half the counters cannot be in
> -        * exclusive mode. There is not designated counters for the
> -        * limits. Any N/2 counters can be used. This helps with
> -        * events with specifix counter constraints
> -        */
> -       if (xl->num_alloc_cntrs++ == xl->max_alloc_cntrs)
> -               return &emptyconstraint;
> -
>         cx = c;
>
>         /*
> @@ -2150,6 +2142,11 @@ static void intel_put_excl_constraints(s
>
>         xl = &excl_cntrs->states[tid];
>         xlo = &excl_cntrs->states[o_tid];
> +       if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
> +               hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
> +               if (!--cpuc->n_excl)
> +                       WRITE_ONCE(excl_cntrs->has_exclusive[tid], 0);
> +       }
>
>         /*
>          * put_constraint may be called from x86_schedule_events()
> @@ -2632,8 +2629,6 @@ static void intel_pmu_cpu_starting(int c
>                 cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
>
>         if (x86_pmu.flags & PMU_FL_EXCL_CNTRS) {
> -               int h = x86_pmu.num_counters >> 1;
> -
>                 for_each_cpu(i, topology_thread_cpumask(cpu)) {
>                         struct intel_excl_cntrs *c;
>
> @@ -2647,11 +2642,6 @@ static void intel_pmu_cpu_starting(int c
>                 }
>                 cpuc->excl_cntrs->core_id = core_id;
>                 cpuc->excl_cntrs->refcnt++;
> -               /*
> -                * set hard limit to half the number of generic counters
> -                */
> -               cpuc->excl_cntrs->states[0].max_alloc_cntrs = h;
> -               cpuc->excl_cntrs->states[1].max_alloc_cntrs = h;
>         }
>  }
>
> --- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
> @@ -394,7 +394,7 @@ static int uncore_assign_events(struct i
>         /* slow path */
>         if (i != n)
>                 ret = perf_assign_events(box->event_constraint, n,
> -                                        wmin, wmax, assign);
> +                                        wmin, wmax, n, assign);
>
>         if (!assign || ret) {
>                 for (i = 0; i < n; i++)
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-26  9:24     ` Stephane Eranian
@ 2015-05-26 10:12       ` Peter Zijlstra
  2015-05-26 11:46         ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-26 10:12 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Tue, May 26, 2015 at 02:24:38AM -0700, Stephane Eranian wrote:
> On Fri, May 22, 2015 at 6:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 22, 2015 at 03:29:06PM +0200, Peter Zijlstra wrote:
> >> @@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
> >>               x86_pmu.start_scheduling(cpuc);
> >>
> >>       for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> >> +             cpuc->event_constraint[i] = NULL;
> >
> > ^^^ that is new, which is esp. important in light of the
> > intel_get_event_constraints() hunk below, which would happily continue
> > life with a garbage constraint.
> >
> You've moved the constraint list from event to cpuc. Yet, it is still
> an array of pointers
> to constraints. So here you are saying, that in the case validate_group() is
> preempted and there is a context switch, there is still a risk of
> overwriting the
> constraint? I don't see how because validate_group() is using a fake_cpuc.
> So yes, the cpuc->event_constraint[] array is modified but it is not the same
> as the actual cpuc used by non-validate code. Or am I still missing something?
> 
> When using dynamic constraints, we already have constraint storage in cpuc
> (to avoid calling kmalloc() in ctxsw context). Thus, I am wondering if it would
> not be easier to always use cpuc for constraint storage (no more pointers).

No; the problem here is repeated use of the cpuc (the real one). Say one
scheduling run installs a constraint pointer for event i. Then event i
gets removed and another installed in the same spot.

Then the next scheduling run will pick up the old pointer in
intel_get_event_constraints() as a base for the new one.
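
A small standalone model of that hazard (userspace C, simplified; not the
kernel code): if the slot is not cleared between runs, the second run
inherits the first event's constraint.

#include <stdio.h>

struct constraint { int weight; };

/* models intel_get_event_constraints(): an existing pointer is taken as
 * the base for the new constraint instead of starting from scratch */
static struct constraint *get_constraints(struct constraint *prev,
					  struct constraint *fresh)
{
	return prev ? prev : fresh;
}

int main(void)
{
	struct constraint c_old = { .weight = 1 };
	struct constraint c_new = { .weight = 4 };
	struct constraint *slot = NULL;	/* models cpuc->event_constraint[i] */

	slot = get_constraints(slot, &c_old);	/* first event lands in slot i */

	/* that event goes away and a different one lands in the same slot;
	 * without a "slot = NULL;" here the stale pointer wins: */
	slot = get_constraints(slot, &c_new);

	printf("weight = %d\n", slot->weight);	/* prints 1, not 4 */
	return 0;
}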



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-26  9:37   ` Stephane Eranian
@ 2015-05-26 10:15     ` Peter Zijlstra
  2015-05-26 11:47       ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-26 10:15 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

Please trim your email.

On Tue, May 26, 2015 at 02:37:52AM -0700, Stephane Eranian wrote:
> > @@ -822,8 +830,24 @@ int x86_schedule_events(struct cpu_hw_ev
> >
> >         /* slow path */
> >         if (i != n) {
> > +               int gpmax = x86_pmu.num_counters;
> > +
> > +               /*
> > +                * Do not allow scheduling of more than half the available
> > +                * generic counters.
> > +                *
> > +                * This helps avoid counter starvation of sibling thread by
> > +                * ensuring at most half the counters cannot be in exclusive
> > +                * mode. There is no designated counters for the limits. Any
> > +                * N/2 counters can be used. This helps with events with
> > +                * specific counter constraints.
> > +                */
> > +               if (is_ht_workaround_enabled() && !cpuc->is_fake &&
> > +                   READ_ONCE(cpuc->excl_cntrs->exclusive_present))
> > +                       gpmax /= 2;
> > +
> What I don't like about this part is that this is a hack to work around a bug
> on some limited Intel CPUs and yet it is in the middle of generic x86 code.
> I understand it will be inoperative on AMD PMU and is not used by Intel
> uncore. On KNC or P6, you will not have is_ht_workaround_enabled().
> Could this be made a x86_pmu callback? x86_pmu.counter_limit()?

It'll be slower though. You get an indirect function call in there.

But sure we can clean that up later if you like; there's other things
needing to be fixed here first.

I'm going to overhaul the whole get/put constraints stuff first.
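
For reference, a rough sketch of the callback shape suggested above (the
counter_limit name and the structs around it are hypothetical, not existing
kernel API); the cost being objected to is the extra indirect call:

struct cpuc_sketch {
	int	is_fake;
	int	exclusive_present;
};

struct x86_pmu_sketch {
	int	num_counters;
	/* NULL on PMUs that need no limit (AMD, P6, KNC, uncore) */
	int	(*counter_limit)(struct cpuc_sketch *cpuc);
};

static int gp_budget(struct x86_pmu_sketch *pmu, struct cpuc_sketch *cpuc)
{
	if (pmu->counter_limit)
		return pmu->counter_limit(cpuc);	/* indirect call */
	return pmu->num_counters;
}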

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-26 10:12       ` Peter Zijlstra
@ 2015-05-26 11:46         ` Stephane Eranian
  2015-05-26 12:16           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26 11:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Tue, May 26, 2015 at 3:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 26, 2015 at 02:24:38AM -0700, Stephane Eranian wrote:
>> On Fri, May 22, 2015 at 6:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 22, 2015 at 03:29:06PM +0200, Peter Zijlstra wrote:
>> >> @@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
>> >>               x86_pmu.start_scheduling(cpuc);
>> >>
>> >>       for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
>> >> +             cpuc->event_constraint[i] = NULL;
>> >
>> > ^^^ that is new, which is esp. important in light of the
>> > intel_get_event_constraints() hunk below, which would happily continue
>> > life with a garbage constraint.
>> >
>> You've moved the constraint list from event to cpuc. Yet, it is still
>> an array of pointers
>> to constraints. So here you are saying, that in the case validate_group() is
>> preempted and there is a context switch, there is still a risk of
>> overwriting the
>> constraint? I don't see how because validate_group() is using a fake_cpuc.
>> So yes, the cpuc->event_constraint[] array is modified but it is not the same
>> as the actual cpuc used by non-validate code. Or am I still missing something?
>>
>> When using dynamic constraints, we already have constraint storage in cpuc
>> (to avoid calling kmalloc() in ctxsw context). Thus, I am wondering if it would
>> not be easier to always use cpuc for constraint storage (no more pointers).
>
> No; the problem here is repeated use of the cpuc (the real one). Say one
> scheduling run installs a constraint pointer for event i. Then event i
> gets removed and another installed in the same spot.
>
> Then the next scheduling run will pick up the old pointer in
> intel_get_event_constraints() as a base for the new one.
>
But where is the code that says: skip reinstalling the constraint
in intel_get_event_constraints() because there is already a (stale)
one? I don't see where that is.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-26 10:15     ` Peter Zijlstra
@ 2015-05-26 11:47       ` Stephane Eranian
  2015-05-26 13:19         ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26 11:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Tue, May 26, 2015 at 3:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> Please trim your email.
>
> On Tue, May 26, 2015 at 02:37:52AM -0700, Stephane Eranian wrote:
>> > @@ -822,8 +830,24 @@ int x86_schedule_events(struct cpu_hw_ev
>> >
>> >         /* slow path */
>> >         if (i != n) {
>> > +               int gpmax = x86_pmu.num_counters;
>> > +
>> > +               /*
>> > +                * Do not allow scheduling of more than half the available
>> > +                * generic counters.
>> > +                *
>> > +                * This helps avoid counter starvation of sibling thread by
>> > +                * ensuring at most half the counters cannot be in exclusive
>> > +                * mode. There is no designated counters for the limits. Any
>> > +                * N/2 counters can be used. This helps with events with
>> > +                * specific counter constraints.
>> > +                */
>> > +               if (is_ht_workaround_enabled() && !cpuc->is_fake &&
>> > +                   READ_ONCE(cpuc->excl_cntrs->exclusive_present))
>> > +                       gpmax /= 2;
>> > +
>> What I don't like about this part is that this is a hack to work around a bug
>> on some limited Intel CPUs and yet it is in the middle of generic x86 code.
>> I understand it will be inoperative on AMD PMU and is not used by Intel
>> uncore. On KNC or P6, you will not have is_ht_workaround_enabled().
>> Could this be made a x86_pmu callback? x86_pmu.counter_limit()?
>
> It'll be slower though. You get an indirect function call in there.
>
> But sure we can clean that up later if you like; there's other things
> needing to be fixed here first.
>
> I'm going to overhaul the whole get/put constraints stuff first.

Ok, I think it would be good to balance the number of get/put. It would
avoid the confusion. Is that what you are thinking about?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 03/11] perf/x86: Correct local vs remote sibling state
  2015-05-22 13:29 ` [PATCH v2 03/11] perf/x86: Correct local vs remote sibling state Peter Zijlstra
@ 2015-05-26 11:48   ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26 11:48 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, LKML

On Fri, May 22, 2015 at 6:29 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> For some obscure reason the current code accounts the current SMT
> thread's state on the remote thread and reads the remote's state on
> the local SMT thread.
>
> While internally consistent, and 'correct', it's pointless confusion we
> can do without.
>
> Flip them the right way around.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/kernel/cpu/perf_event_intel.c |   79 +++++++++++++--------------------
>  1 file changed, 33 insertions(+), 46 deletions(-)
>
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1903,9 +1903,8 @@ static void
>  intel_start_scheduling(struct cpu_hw_events *cpuc)
>  {
>         struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> -       struct intel_excl_states *xl, *xlo;
> +       struct intel_excl_states *xl;
>         int tid = cpuc->excl_thread_id;
> -       int o_tid = 1 - tid; /* sibling thread */
>
>         /*
>          * nothing needed if in group validation mode
> @@ -1919,7 +1918,6 @@ intel_start_scheduling(struct cpu_hw_eve
>         if (!excl_cntrs)
>                 return;
>
> -       xlo = &excl_cntrs->states[o_tid];
>         xl = &excl_cntrs->states[tid];
>
>         xl->sched_started = true;
> @@ -1932,18 +1930,17 @@ intel_start_scheduling(struct cpu_hw_eve
>         raw_spin_lock(&excl_cntrs->lock);
>
>         /*
> -        * save initial state of sibling thread
> +        * Save a copy of our state to work on.
>          */
> -       memcpy(xlo->init_state, xlo->state, sizeof(xlo->init_state));
> +       memcpy(xl->init_state, xl->state, sizeof(xl->init_state));
>  }
>
>  static void
>  intel_stop_scheduling(struct cpu_hw_events *cpuc)
>  {
>         struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> -       struct intel_excl_states *xl, *xlo;
> +       struct intel_excl_states *xl;
>         int tid = cpuc->excl_thread_id;
> -       int o_tid = 1 - tid; /* sibling thread */
>
>         /*
>          * nothing needed if in group validation mode
> @@ -1956,13 +1953,12 @@ intel_stop_scheduling(struct cpu_hw_even
>         if (!excl_cntrs)
>                 return;
>
> -       xlo = &excl_cntrs->states[o_tid];
>         xl = &excl_cntrs->states[tid];
>
>         /*
> -        * make new sibling thread state visible
> +        * Commit the working state.
>          */
> -       memcpy(xlo->state, xlo->init_state, sizeof(xlo->state));
> +       memcpy(xl->state, xl->init_state, sizeof(xl->state));
>
>         xl->sched_started = false;
>         /*
> @@ -1977,10 +1973,9 @@ intel_get_excl_constraints(struct cpu_hw
>  {
>         struct event_constraint *cx;
>         struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> -       struct intel_excl_states *xl, *xlo;
> -       int is_excl, i;
> +       struct intel_excl_states *xlo;
>         int tid = cpuc->excl_thread_id;
> -       int o_tid = 1 - tid; /* alternate */
> +       int is_excl, i;
>
>         /*
>          * validating a group does not require
> @@ -1994,23 +1989,6 @@ intel_get_excl_constraints(struct cpu_hw
>          */
>         if (!excl_cntrs)
>                 return c;
> -       /*
> -        * event requires exclusive counter access
> -        * across HT threads
> -        */
> -       is_excl = c->flags & PERF_X86_EVENT_EXCL;
> -       if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
> -               event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
> -               if (!cpuc->n_excl++)
> -                       WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
> -       }
> -
> -       /*
> -        * xl = state of current HT
> -        * xlo = state of sibling HT
> -        */
> -       xl = &excl_cntrs->states[tid];
> -       xlo = &excl_cntrs->states[o_tid];
>
>         cx = c;
>
> @@ -2054,6 +2032,22 @@ intel_get_excl_constraints(struct cpu_hw
>          */
>
>         /*
> +        * state of sibling HT
> +        */
> +       xlo = &excl_cntrs->states[tid ^ 1];
> +
> +       /*
> +        * event requires exclusive counter access
> +        * across HT threads
> +        */
I think the comment needs to be changed to reflect what the
test is doing. I would say:
/*
 * account for exclusive counter usage. Needed to avoid
 * cross thread counter starvation problem with exclusive events.
 */
> +       is_excl = c->flags & PERF_X86_EVENT_EXCL;
> +       if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
> +               event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
> +               if (!cpuc->n_excl++)
> +                       WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
> +       }
> +
> +       /*
>          * Modify static constraint with current dynamic
>          * state of thread
>          *
> @@ -2067,14 +2061,14 @@ intel_get_excl_constraints(struct cpu_hw
>                  * our corresponding counter cannot be used
>                  * regardless of our event
>                  */
> -               if (xl->state[i] == INTEL_EXCL_EXCLUSIVE)
> +               if (xlo->state[i] == INTEL_EXCL_EXCLUSIVE)
>                         __clear_bit(i, cx->idxmsk);
>                 /*
>                  * if measuring an exclusive event, sibling
>                  * measuring non-exclusive, then counter cannot
>                  * be used
>                  */
> -               if (is_excl && xl->state[i] == INTEL_EXCL_SHARED)
> +               if (is_excl && xlo->state[i] == INTEL_EXCL_SHARED)
>                         __clear_bit(i, cx->idxmsk);
>         }
>
> @@ -2124,10 +2118,9 @@ static void intel_put_excl_constraints(s
>  {
>         struct hw_perf_event *hwc = &event->hw;
>         struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
> -       struct intel_excl_states *xlo, *xl;
> -       unsigned long flags = 0; /* keep compiler happy */
>         int tid = cpuc->excl_thread_id;
> -       int o_tid = 1 - tid;
> +       struct intel_excl_states *xl;
> +       unsigned long flags = 0; /* keep compiler happy */
>
>         /*
>          * nothing needed if in group validation mode
> @@ -2141,7 +2134,6 @@ static void intel_put_excl_constraints(s
>                 return;
>
>         xl = &excl_cntrs->states[tid];
> -       xlo = &excl_cntrs->states[o_tid];
>         if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
>                 hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
>                 if (!--cpuc->n_excl)
> @@ -2161,7 +2153,7 @@ static void intel_put_excl_constraints(s
>          * counter state as unused now
>          */
>         if (hwc->idx >= 0)
> -               xlo->state[hwc->idx] = INTEL_EXCL_UNUSED;
> +               xl->state[hwc->idx] = INTEL_EXCL_UNUSED;
>
>         if (!xl->sched_started)
>                 raw_spin_unlock_irqrestore(&excl_cntrs->lock, flags);
> @@ -2200,16 +2192,12 @@ static void intel_commit_scheduling(stru
>  {
>         struct intel_excl_cntrs *excl_cntrs = cpuc->excl_cntrs;
>         struct event_constraint *c = cpuc->event_constraint[idx];
> -       struct intel_excl_states *xlo, *xl;
> +       struct intel_excl_states *xl;
>         int tid = cpuc->excl_thread_id;
> -       int o_tid = 1 - tid;
> -       int is_excl;
>
>         if (cpuc->is_fake || !c)
>                 return;
>
> -       is_excl = c->flags & PERF_X86_EVENT_EXCL;
> -
>         if (!(c->flags & PERF_X86_EVENT_DYNAMIC))
>                 return;
>
> @@ -2219,15 +2207,14 @@ static void intel_commit_scheduling(stru
>                 return;
>
>         xl = &excl_cntrs->states[tid];
> -       xlo = &excl_cntrs->states[o_tid];
>
>         WARN_ON_ONCE(!raw_spin_is_locked(&excl_cntrs->lock));
>
>         if (cntr >= 0) {
> -               if (is_excl)
> -                       xlo->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
> +               if (c->flags & PERF_X86_EVENT_EXCL)
> +                       xl->init_state[cntr] = INTEL_EXCL_EXCLUSIVE;
>                 else
> -                       xlo->init_state[cntr] = INTEL_EXCL_SHARED;
> +                       xl->init_state[cntr] = INTEL_EXCL_SHARED;
>         }
>  }
>
>
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-26 11:46         ` Stephane Eranian
@ 2015-05-26 12:16           ` Peter Zijlstra
  2015-05-26 12:25             ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-26 12:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Tue, May 26, 2015 at 04:46:07AM -0700, Stephane Eranian wrote:
> On Tue, May 26, 2015 at 3:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, May 26, 2015 at 02:24:38AM -0700, Stephane Eranian wrote:
> >> On Fri, May 22, 2015 at 6:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Fri, May 22, 2015 at 03:29:06PM +0200, Peter Zijlstra wrote:
> >> >> @@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
> >> >>               x86_pmu.start_scheduling(cpuc);
> >> >>
> >> >>       for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> >> >> +             cpuc->event_constraint[i] = NULL;
> >> >

> But where is the code that says: skip reinstalling the constraint
> in intel_get_event_constraints() because there is already a (stale)
> one? I don't see where that is.

IIRC the problem was that the copy from c2 into c1:

	if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
		bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
		c1->weight = c2->weight;
		c2 = c1;
	}

is incomplete. For instance, flags is not copied, and some code down the
line might check that and get wrong flags.

I'm not entirely sure I saw misbehaviour, but I figured I'd better close
that hole and rule out that it is contributing to the failures.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-26 12:16           ` Peter Zijlstra
@ 2015-05-26 12:25             ` Stephane Eranian
  2015-05-26 13:22               ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26 12:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Tue, May 26, 2015 at 5:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 26, 2015 at 04:46:07AM -0700, Stephane Eranian wrote:
>> On Tue, May 26, 2015 at 3:12 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, May 26, 2015 at 02:24:38AM -0700, Stephane Eranian wrote:
>> >> On Fri, May 22, 2015 at 6:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > On Fri, May 22, 2015 at 03:29:06PM +0200, Peter Zijlstra wrote:
>> >> >> @@ -788,9 +788,9 @@ int x86_schedule_events(struct cpu_hw_ev
>> >> >>               x86_pmu.start_scheduling(cpuc);
>> >> >>
>> >> >>       for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
>> >> >> +             cpuc->event_constraint[i] = NULL;
>> >> >
>
>> But where is the code that says: skip reinstalling the constraint
>> in intel_get_event_constraints() because there is already a (stale)
>> one? I don't see where that is.
>
> IIRC the problem was that the copy from c2 into c1:
>
>         if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
>                 bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
>                 c1->weight = c2->weight;
>                 c2 = c1;
>         }
>
> is incomplete. For instance, flags is not copied, and some code down the
> line might check that and get wrong flags.
>
Ok, now I remember this code. It has to do with incremental scheduling.
Suppose E1, E2, E3 events where E1 is exclusive. The first call is
for scheduling E1. It gets to get_event_constraint() which "allocates" a
dynamic constraint. The second call  tries to schedule E1, E2. But the
second time for E1, you already have the dynamic constraint allocated, so
this code is reusing the constraint storage and just updates the bitmask
and weight.

Now, that the storage is not actually dynamic (kmalloc'd), but taken from a
fixed size array in cpuc, I believe we can simplify this and "re-allocate"
the constraint for each incremental call to intel_get_event_constraints().
Do you agree?


> I'm not entirely sure I saw misbehaviour, but I figured I'd better close
> that hole and rule out that it is contributing to the failures.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-26 11:47       ` Stephane Eranian
@ 2015-05-26 13:19         ` Peter Zijlstra
  2015-05-26 16:07           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-26 13:19 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Tue, May 26, 2015 at 04:47:11AM -0700, Stephane Eranian wrote:
> > I'm going to overhaul the whole get/put constraints stuff first.
> 
> Ok, I think it would be good to balance the number of get/put. It would
> avoid the confusion. Is that what you are thinking about?

Yes, and remove the few associated modifications to events.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-26 12:25             ` Stephane Eranian
@ 2015-05-26 13:22               ` Peter Zijlstra
  2015-05-26 13:44                 ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-26 13:22 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Tue, May 26, 2015 at 05:25:59AM -0700, Stephane Eranian wrote:
> > IIRC the problem was that the copy from c2 into c1:
> >
> >         if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
> >                 bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
> >                 c1->weight = c2->weight;
> >                 c2 = c1;
> >         }
> >
> > is incomplete. For instance, flags is not copied, and some code down the
> > line might check that and get wrong flags.
> >
> Ok, now I remember this code. It has to do with incremental scheduling.
> Suppose E1, E2, E3 events where E1 is exclusive. The first call is
> for scheduling E1. It gets to get_event_constraint() which "allocates" a
> dynamic constraint. The second call  tries to schedule E1, E2. But the
> second time for E1, you already have the dynamic constraint allocated, so
> this code is reusing the constraint storage and just updates the bitmask
> and weight.
> 
> Now, that the storage is not actually dynamic (kmalloc'd), but taken from a
> fixed size array in cpuc, I believe we can simplify this and "re-allocate"
> the constraint for each incremental call to intel_get_event_constraints().
> Do you agree?

That would probably work, the whole incremental thing seems superfluous
to me.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/11] perf,x86: Fix event/group validation
  2015-05-26 13:22               ` Peter Zijlstra
@ 2015-05-26 13:44                 ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2015-05-26 13:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML,
	Andrew Hunter, Maria Dimakopoulou

On Tue, May 26, 2015 at 6:22 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 26, 2015 at 05:25:59AM -0700, Stephane Eranian wrote:
>> > IIRC the problem was that the copy from c2 into c1:
>> >
>> >         if (c1 && (c1->flags & PERF_X86_EVENT_DYNAMIC)) {
>> >                 bitmap_copy(c1->idxmsk, c2->idxmsk, X86_PMC_IDX_MAX);
>> >                 c1->weight = c2->weight;
>> >                 c2 = c1;
>> >         }
>> >
>> > is incomplete. For instance, flags is not copied, and some code down the
>> > line might check that and get wrong flags.
>> >
>> Ok, now I remember this code. It has to do with incremental scheduling.
>> Suppose E1, E2, E3 events where E1 is exclusive. The first call is
>> for scheduling E1. It gets to get_event_constraint() which "allocates" a
>> dynamic constraint. The second call  tries to schedule E1, E2. But the
>> second time for E1, you already have the dynamic constraint allocated, so
>> this code is reusing the constraint storage and just updates the bitmask
>> and weight.
>>
>> Now, that the storage is not actually dynamic (kmalloc'd), but taken from a
>> fixed size array in cpuc, I believe we can simplify this and "re-allocate"
>> the constraint for each incremental call to intel_get_event_constraints().
>> Do you agree?
>
> That would probably work, the whole incremental thing seems superfluous
> to me.

Well, you could pass the whole list (4) of events in one call and let the
arch level sort it out. It could do one pass and stop at the first error to
get a fixed bound like today. But it would need to deal with groups as
well... so it is not so easy, and why repeat the group processing for each
arch?
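
A rough sketch of the single-call shape being floated here (names are
entirely hypothetical; the group handling that makes this hard is not
shown):

struct perf_event_sketch;
struct cpu_hw_events_sketch;

/*
 * Hypothetical: hand the arch the whole collected event list once and let
 * it do a single constraint/scheduling pass, instead of one call per
 * x86_pmu.add().  Returns 0 on success, or the index of the first event
 * that could not be scheduled.
 */
typedef int (*schedule_all_fn)(struct cpu_hw_events_sketch *cpuc,
			       struct perf_event_sketch **events, int n);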

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-26 13:19         ` Peter Zijlstra
@ 2015-05-26 16:07           ` Peter Zijlstra
  2015-05-27  9:01             ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-26 16:07 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Tue, May 26, 2015 at 03:19:50PM +0200, Peter Zijlstra wrote:
> On Tue, May 26, 2015 at 04:47:11AM -0700, Stephane Eranian wrote:
> > > I'm going to overhaul the whole get/put constraints stuff first.
> > 
> > Ok, I think it would be good to balance the number of get/put. It would
> > avoid the confusion. Is that what you are thinking about?
> 
> Yes, and remove the few associated modifications to events.

I have the below (in 4 patches); compile tested only so far.

---
 perf_event.c       |   51 +++++++++++++++++++++++++--------------------------
 perf_event.h       |    4 ++--
 perf_event_intel.c |   40 ++++------------------------------------
 3 files changed, 31 insertions(+), 64 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -810,9 +810,15 @@ int x86_schedule_events(struct cpu_hw_ev
 		x86_pmu.start_scheduling(cpuc);
 
 	for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
-		cpuc->event_constraint[i] = NULL;
-		c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
-		cpuc->event_constraint[i] = c;
+		/*
+		 * Only call get_event_constraints() once!
+		 */
+		c = cpuc->event_constraint[i];
+		if (!c) {
+			e = cpuc->event_list[i];
+			c = x86_pmu.get_event_constraints(cpuc, i, e);
+			cpuc->event_constraint[i] = c;
+		}
 
 		wmin = min(wmin, c->weight);
 		wmax = max(wmax, c->weight);
@@ -875,27 +881,23 @@ int x86_schedule_events(struct cpu_hw_ev
 	 * validate an event group (assign == NULL)
 	 */
 	if (!unsched && assign) {
-		for (i = 0; i < n; i++) {
-			e = cpuc->event_list[i];
-			e->hw.flags |= PERF_X86_EVENT_COMMITTED;
-			if (x86_pmu.commit_scheduling)
+		if (x86_pmu.commit_scheduling) {
+			for (i = 0; i < n; i++) {
 				x86_pmu.commit_scheduling(cpuc, i, assign[i]);
+			}
 		}
-	} else {
-		for (i = 0; i < n; i++) {
-			e = cpuc->event_list[i];
-			/*
-			 * do not put_constraint() on comitted events,
-			 * because they are good to go
-			 */
-			if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
-				continue;
+	} else if (x86_pmu.put_event_constraints) {
+		/* x86_pmu_add() will not yet have updated n_events */
+		i = cpuc->n_events;
+
+		/* x86_pmu_commit_txn() relies on n_txn */
+		if (cpuc->group_flag & PERF_EVENT_TXN)
+			i -= cpuc->n_txn;
 
-			/*
-			 * release events that failed scheduling
-			 */
-			if (x86_pmu.put_event_constraints)
-				x86_pmu.put_event_constraints(cpuc, e);
+		for (; i < n; i++) {
+			e = cpuc->event_list[i];
+			/* release events that failed scheduling */
+			x86_pmu.put_event_constraints(cpuc, e);
 		}
 	}
 
@@ -923,6 +925,7 @@ static int collect_events(struct cpu_hw_
 		if (n >= max_count)
 			return -EINVAL;
 		cpuc->event_list[n] = leader;
+		cpuc->event_constraint[n] = NULL;
 		n++;
 	}
 	if (!dogrp)
@@ -937,6 +940,7 @@ static int collect_events(struct cpu_hw_
 			return -EINVAL;
 
 		cpuc->event_list[n] = event;
+		cpuc->event_constraint[n] = NULL;
 		n++;
 	}
 	return n;
@@ -1295,11 +1299,6 @@ static void x86_pmu_del(struct perf_even
 	int i;
 
 	/*
-	 * event is descheduled
-	 */
-	event->hw.flags &= ~PERF_X86_EVENT_COMMITTED;
-
-	/*
 	 * If we're called during a txn, we don't need to do anything.
 	 * The events never got scheduled and ->cancel_txn will truncate
 	 * the event_list.
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -68,13 +68,13 @@ struct event_constraint {
 #define PERF_X86_EVENT_PEBS_LDLAT	0x0001 /* ld+ldlat data address sampling */
 #define PERF_X86_EVENT_PEBS_ST		0x0002 /* st data address sampling */
 #define PERF_X86_EVENT_PEBS_ST_HSW	0x0004 /* haswell style datala, store */
-#define PERF_X86_EVENT_COMMITTED	0x0008 /* event passed commit_txn */
+
 #define PERF_X86_EVENT_PEBS_LD_HSW	0x0010 /* haswell style datala, load */
 #define PERF_X86_EVENT_PEBS_NA_HSW	0x0020 /* haswell style datala, unknown */
 #define PERF_X86_EVENT_EXCL		0x0040 /* HT exclusivity on counter */
 #define PERF_X86_EVENT_DYNAMIC		0x0080 /* dynamic alloc'd constraint */
 #define PERF_X86_EVENT_RDPMC_ALLOWED	0x0100 /* grant rdpmc permission */
-#define PERF_X86_EVENT_EXCL_ACCT	0x0200 /* accounted EXCL event */
+
 #define PERF_X86_EVENT_AUTO_RELOAD	0x0400 /* use PEBS auto-reload */
 #define PERF_X86_EVENT_FREERUNNING	0x0800 /* use freerunning PEBS */
 
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1955,14 +1955,6 @@ __intel_shared_reg_get_constraints(struc
 	unsigned long flags;
 	int idx = reg->idx;
 
-	/*
-	 * reg->alloc can be set due to existing state, so for fake cpuc we
-	 * need to ignore this, otherwise we might fail to allocate proper fake
-	 * state for this extra reg constraint. Also see the comment below.
-	 */
-	if (reg->alloc && !cpuc->is_fake)
-		return NULL; /* call x86_get_event_constraint() */
-
 again:
 	era = &cpuc->shared_regs->regs[idx];
 	/*
@@ -1986,14 +1978,6 @@ __intel_shared_reg_get_constraints(struc
 		if (!cpuc->is_fake) {
 			if (idx != reg->idx)
 				intel_fixup_er(event, idx);
-
-			/*
-			 * x86_schedule_events() can call get_event_constraints()
-			 * multiple times on events in the case of incremental
-			 * scheduling(). reg->alloc ensures we only do the ER
-			 * allocation once.
-			 */
-			reg->alloc = 1;
 		}
 
 		/* lock in msr value */
@@ -2026,24 +2010,12 @@ __intel_shared_reg_put_constraints(struc
 {
 	struct er_account *era;
 
-	/*
-	 * Only put constraint if extra reg was actually allocated. Also takes
-	 * care of event which do not use an extra shared reg.
-	 *
-	 * Also, if this is a fake cpuc we shouldn't touch any event state
-	 * (reg->alloc) and we don't care about leaving inconsistent cpuc state
-	 * either since it'll be thrown out.
-	 */
-	if (!reg->alloc || cpuc->is_fake)
-		return;
+	WARN_ON_ONCE(cpuc->is_fake);
 
 	era = &cpuc->shared_regs->regs[reg->idx];
 
 	/* one fewer user */
 	atomic_dec(&era->ref);
-
-	/* allocate again next time */
-	reg->alloc = 0;
 }
 
 static struct event_constraint *
@@ -2261,8 +2233,7 @@ intel_get_excl_constraints(struct cpu_hw
 	 * across HT threads
 	 */
 	is_excl = c->flags & PERF_X86_EVENT_EXCL;
-	if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
-		event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
+	if (is_excl) {
 		if (!cpuc->n_excl++)
 			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
 	}
@@ -2350,11 +2321,8 @@ static void intel_put_excl_constraints(s
 	if (WARN_ON_ONCE(!excl_cntrs))
 		return;
 
-	if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
-		hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
-		if (!--cpuc->n_excl)
-			WRITE_ONCE(excl_cntrs->has_exclusive[tid], 0);
-	}
+	if ((hwc->flags & PERF_X86_EVENT_EXCL) && !--cpuc->n_excl)
+		WRITE_ONCE(excl_cntrs->has_exclusive[tid], 0);
 
 	/*
 	 * If event was actually assigned, then mark the counter state as

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-22 13:29 ` [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint Peter Zijlstra
  2015-05-22 13:42   ` Peter Zijlstra
  2015-05-26  9:37   ` Stephane Eranian
@ 2015-05-26 23:33   ` Andi Kleen
  2015-05-27  7:48     ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2015-05-26 23:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.weaver, eranian, jolsa, kan.liang, linux-kernel

Peter Zijlstra <peterz@infradead.org> writes:
> +		 */
> +		if (is_ht_workaround_enabled() && !cpuc->is_fake &&

Could this function also check if at least one leaking event is
scheduled somewhere? (e.g. from a global count)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-26 23:33   ` Andi Kleen
@ 2015-05-27  7:48     ` Peter Zijlstra
  2015-05-27 14:00       ` Andi Kleen
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-27  7:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mingo, vincent.weaver, eranian, jolsa, kan.liang, linux-kernel

On Tue, May 26, 2015 at 04:33:42PM -0700, Andi Kleen wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > +		 */
> > +		if (is_ht_workaround_enabled() && !cpuc->is_fake &&
> 
> Could this function also check if at least one leaking event is
> scheduled somewhere? (e.g. from a global count)

You truncated one line too many:

+               if (is_ht_workaround_enabled() && !cpuc->is_fake &&
+                   READ_ONCE(cpuc->excl_cntrs->exclusive_present))
+                       gpmax /= 2;

Guess what that READ_ONCE() does? It checks if there's one such leaky
event on the current core.
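
A standalone model of that trick (userspace C; the union mirrors the
has_exclusive/exclusive_present pair from the patch earlier in the thread):
setting either sibling's u16 makes the combined u32 non-zero, so a single
load answers the question for both HT threads.

#include <stdio.h>
#include <stdint.h>

union excl_present {
	uint16_t has_exclusive[2];	/* one slot per HT sibling */
	uint32_t exclusive_present;	/* both slots viewed at once */
};

int main(void)
{
	union excl_present p = { .exclusive_present = 0 };

	p.has_exclusive[1] = 1;		/* sibling owns an exclusive event */

	/* one load answers "is there an exclusive event on this core?" */
	printf("exclusive on core: %s\n", p.exclusive_present ? "yes" : "no");
	return 0;
}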

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-26 16:07           ` Peter Zijlstra
@ 2015-05-27  9:01             ` Stephane Eranian
  2015-05-27 10:11               ` Peter Zijlstra
  2015-05-27 10:13               ` Peter Zijlstra
  0 siblings, 2 replies; 36+ messages in thread
From: Stephane Eranian @ 2015-05-27  9:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Tue, May 26, 2015 at 9:07 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 26, 2015 at 03:19:50PM +0200, Peter Zijlstra wrote:
> > On Tue, May 26, 2015 at 04:47:11AM -0700, Stephane Eranian wrote:
> > > > I'm going to overhaul the whole get/put constraints stuff first.
> > >
> > > Ok, I think it would be good to balance the number of get/put. It would
> > > avoid the confusion. Is that what you are thinking about?
> >
> > Yes, and remove the few associated modifications to events.
>
> I have the below (in 4 patches); compile tested only so far.
>
> ---
>  perf_event.c       |   51 +++++++++++++++++++++++++--------------------------
>  perf_event.h       |    4 ++--
>  perf_event_intel.c |   40 ++++------------------------------------
>  3 files changed, 31 insertions(+), 64 deletions(-)
>
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -810,9 +810,15 @@ int x86_schedule_events(struct cpu_hw_ev
>                 x86_pmu.start_scheduling(cpuc);
>
>         for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
> -               cpuc->event_constraint[i] = NULL;
> -               c = x86_pmu.get_event_constraints(cpuc, i, cpuc->event_list[i]);
> -               cpuc->event_constraint[i] = c;
> +               /*
> +                * Only call get_event_constraints() once!
> +                */
> +               c = cpuc->event_constraint[i];
> +               if (!c) {
>
> +                       e = cpuc->event_list[i];
> +                       c = x86_pmu.get_event_constraints(cpuc, i, e);
> +                       cpuc->event_constraint[i] = c;
> +               }

But are you removing the incremental calls from the upper layer via
x86_pmu.add()?
If not, then you are saying the dynamic constraint you got for
offcore_response, LBR
or the HT workaround is still the best avail now.
>
>
>                 wmin = min(wmin, c->weight);
>                 wmax = max(wmax, c->weight);
> @@ -875,27 +881,23 @@ int x86_schedule_events(struct cpu_hw_ev
>          * validate an event group (assign == NULL)
>          */
>         if (!unsched && assign) {
> -               for (i = 0; i < n; i++) {
> -                       e = cpuc->event_list[i];
> -                       e->hw.flags |= PERF_X86_EVENT_COMMITTED;
> -                       if (x86_pmu.commit_scheduling)
> +               if (x86_pmu.commit_scheduling) {
> +                       for (i = 0; i < n; i++) {
>                                 x86_pmu.commit_scheduling(cpuc, i, assign[i]);
> +                       }
>                 }
> -       } else {
> -               for (i = 0; i < n; i++) {
> -                       e = cpuc->event_list[i];
> -                       /*
> -                        * do not put_constraint() on comitted events,
> -                        * because they are good to go
> -                        */
> -                       if ((e->hw.flags & PERF_X86_EVENT_COMMITTED))
> -                               continue;
> +       } else if (x86_pmu.put_event_constraints) {
> +               /* x86_pmu_add() will not yet have updated n_events */
> +               i = cpuc->n_events;
> +
> +               /* x86_pmu_commit_txn() relies on n_txn */
> +               if (cpuc->group_flag & PERF_EVENT_TXN)
> +                       i -= cpuc->n_txn;
>
Is that to say, you're doing only one pass over all active events?

>
> -                       /*
> -                        * release events that failed scheduling
> -                        */
> -                       if (x86_pmu.put_event_constraints)
> -                               x86_pmu.put_event_constraints(cpuc, e);
> +               for (; i < n; i++) {
> +                       e = cpuc->event_list[i];
> +                       /* release events that failed scheduling */
> +                       x86_pmu.put_event_constraints(cpuc, e);
>                 }
>         }
>
> @@ -923,6 +925,7 @@ static int collect_events(struct cpu_hw_
>                 if (n >= max_count)
>                         return -EINVAL;
>                 cpuc->event_list[n] = leader;
> +               cpuc->event_constraint[n] = NULL;
>                 n++;
>         }
>         if (!dogrp)
> @@ -937,6 +940,7 @@ static int collect_events(struct cpu_hw_
>                         return -EINVAL;
>
>                 cpuc->event_list[n] = event;
> +               cpuc->event_constraint[n] = NULL;
>                 n++;
>         }
>         return n;
> @@ -1295,11 +1299,6 @@ static void x86_pmu_del(struct perf_even
>         int i;
>
>         /*
> -        * event is descheduled
> -        */
> -       event->hw.flags &= ~PERF_X86_EVENT_COMMITTED;
> -
> -       /*
>          * If we're called during a txn, we don't need to do anything.
>          * The events never got scheduled and ->cancel_txn will truncate
>          * the event_list.
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -68,13 +68,13 @@ struct event_constraint {
>  #define PERF_X86_EVENT_PEBS_LDLAT      0x0001 /* ld+ldlat data address sampling */
>  #define PERF_X86_EVENT_PEBS_ST         0x0002 /* st data address sampling */
>  #define PERF_X86_EVENT_PEBS_ST_HSW     0x0004 /* haswell style datala, store */
> -#define PERF_X86_EVENT_COMMITTED       0x0008 /* event passed commit_txn */
> +
>  #define PERF_X86_EVENT_PEBS_LD_HSW     0x0010 /* haswell style datala, load */
>  #define PERF_X86_EVENT_PEBS_NA_HSW     0x0020 /* haswell style datala, unknown */
>  #define PERF_X86_EVENT_EXCL            0x0040 /* HT exclusivity on counter */
>  #define PERF_X86_EVENT_DYNAMIC         0x0080 /* dynamic alloc'd constraint */
>  #define PERF_X86_EVENT_RDPMC_ALLOWED   0x0100 /* grant rdpmc permission */
> -#define PERF_X86_EVENT_EXCL_ACCT       0x0200 /* accounted EXCL event */
> +
>  #define PERF_X86_EVENT_AUTO_RELOAD     0x0400 /* use PEBS auto-reload */
>  #define PERF_X86_EVENT_FREERUNNING     0x0800 /* use freerunning PEBS */
>
What's free running PEBS? ;->

>
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1955,14 +1955,6 @@ __intel_shared_reg_get_constraints(struc
>         unsigned long flags;
>         int idx = reg->idx;
>
> -       /*
> -        * reg->alloc can be set due to existing state, so for fake cpuc we
> -        * need to ignore this, otherwise we might fail to allocate proper fake
> -        * state for this extra reg constraint. Also see the comment below.
> -        */
> -       if (reg->alloc && !cpuc->is_fake)
> -               return NULL; /* call x86_get_event_constraint() */
> -
>  again:
>         era = &cpuc->shared_regs->regs[idx];
>         /*
> @@ -1986,14 +1978,6 @@ __intel_shared_reg_get_constraints(struc
>                 if (!cpuc->is_fake) {
>                         if (idx != reg->idx)
>                                 intel_fixup_er(event, idx);
> -
> -                       /*
> -                        * x86_schedule_events() can call get_event_constraints()
> -                        * multiple times on events in the case of incremental
> -                        * scheduling(). reg->alloc ensures we only do the ER
> -                        * allocation once.
> -                        */
> -                       reg->alloc = 1;
>                 }
>
>                 /* lock in msr value */
> @@ -2026,24 +2010,12 @@ __intel_shared_reg_put_constraints(struc
>  {
>         struct er_account *era;
>
> -       /*
> -        * Only put constraint if extra reg was actually allocated. Also takes
> -        * care of event which do not use an extra shared reg.
> -        *
> -        * Also, if this is a fake cpuc we shouldn't touch any event state
> -        * (reg->alloc) and we don't care about leaving inconsistent cpuc state
> -        * either since it'll be thrown out.
> -        */
> -       if (!reg->alloc || cpuc->is_fake)
> -               return;
> +       WARN_ON_ONCE(cpuc->is_fake);
>
>         era = &cpuc->shared_regs->regs[reg->idx];
>
>         /* one fewer user */
>         atomic_dec(&era->ref);
> -
> -       /* allocate again next time */
> -       reg->alloc = 0;
>  }
>
>  static struct event_constraint *
> @@ -2261,8 +2233,7 @@ intel_get_excl_constraints(struct cpu_hw
>          * across HT threads
>          */
>         is_excl = c->flags & PERF_X86_EVENT_EXCL;
> -       if (is_excl && !(event->hw.flags & PERF_X86_EVENT_EXCL_ACCT)) {
> -               event->hw.flags |= PERF_X86_EVENT_EXCL_ACCT;
> +       if (is_excl) {
>                 if (!cpuc->n_excl++)
>                         WRITE_ONCE(excl_cntrs->has_exclusive[tid], 1);
>         }
> @@ -2350,11 +2321,8 @@ static void intel_put_excl_constraints(s
>         if (WARN_ON_ONCE(!excl_cntrs))
>                 return;
>
> -       if (hwc->flags & PERF_X86_EVENT_EXCL_ACCT) {
> -               hwc->flags &= ~PERF_X86_EVENT_EXCL_ACCT;
> -               if (!--cpuc->n_excl)
> -                       WRITE_ONCE(excl_cntrs->has_exclusive[tid], 0);
> -       }
> +       if ((hwc->flags & PERF_X86_EVENT_EXCL) && !--cpuc->n_excl)
> +               WRITE_ONCE(excl_cntrs->has_exclusive[tid], 0);
>
>         /*
>          * If event was actually assigned, then mark the counter state as
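
The last two hunks above boil down to a plain refcount: n_excl counts the
exclusive events owned by this HT thread, and has_exclusive[tid] merely
mirrors "at least one".  A stand-alone sketch of that pattern, using
simplified stand-in types rather than the real cpu_hw_events and
intel_excl_cntrs structures, and leaving out the locking and WRITE_ONCE()
details:

#include <stdio.h>

#define EVENT_EXCL 0x0040                 /* mirrors PERF_X86_EVENT_EXCL */

/* simplified stand-ins for the per-thread scheduling state */
struct excl_state { int has_exclusive[2]; };
struct cpu_state  { int n_excl; int tid; struct excl_state *excl; };

/* roughly what the simplified get side above does */
static void get_excl(struct cpu_state *cpuc, unsigned int flags)
{
        if ((flags & EVENT_EXCL) && !cpuc->n_excl++)
                cpuc->excl->has_exclusive[cpuc->tid] = 1;
}

/* ...and the matching put side */
static void put_excl(struct cpu_state *cpuc, unsigned int flags)
{
        if ((flags & EVENT_EXCL) && !--cpuc->n_excl)
                cpuc->excl->has_exclusive[cpuc->tid] = 0;
}

int main(void)
{
        struct excl_state xs = { { 0, 0 } };
        struct cpu_state cpuc = { 0, 0, &xs };

        get_excl(&cpuc, EVENT_EXCL);          /* first EXCL event: flag goes up */
        get_excl(&cpuc, EVENT_EXCL);          /* second one: refcount only */
        put_excl(&cpuc, EVENT_EXCL);
        printf("has_exclusive=%d\n", xs.has_exclusive[0]);  /* still 1 */
        put_excl(&cpuc, EVENT_EXCL);
        printf("has_exclusive=%d\n", xs.has_exclusive[0]);  /* back to 0 */
        return 0;
}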

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-27  9:01             ` Stephane Eranian
@ 2015-05-27 10:11               ` Peter Zijlstra
  2015-05-27 11:39                 ` Stephane Eranian
  2015-05-27 10:13               ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-27 10:11 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Wed, May 27, 2015 at 02:01:04AM -0700, Stephane Eranian wrote:
> But are you removing the incremental calls from the upper layer via
> x86_pmu.add()?
> If not, then you are saying the dynamic constraint you got for
> offcore_response, LBR
> or the HT workaround is still the best avail now.

sigh, see I knew I was missing something :/

So then for c->flags & DYNAMIC we should put and get again, right?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-27  9:01             ` Stephane Eranian
  2015-05-27 10:11               ` Peter Zijlstra
@ 2015-05-27 10:13               ` Peter Zijlstra
  2015-05-27 11:44                 ` Stephane Eranian
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2015-05-27 10:13 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Wed, May 27, 2015 at 02:01:04AM -0700, Stephane Eranian wrote:
> >  #define PERF_X86_EVENT_AUTO_RELOAD     0x0400 /* use PEBS auto-reload */
> >  #define PERF_X86_EVENT_FREERUNNING     0x0800 /* use freerunning PEBS */
> >
> What's free running PEBS? ;->

The series here:

  http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-27 10:11               ` Peter Zijlstra
@ 2015-05-27 11:39                 ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2015-05-27 11:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Wed, May 27, 2015 at 3:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 27, 2015 at 02:01:04AM -0700, Stephane Eranian wrote:
>> But are you removing the incremental calls from the upper layer via
>> x86_pmu.add()?
>> If not, then you are saying the dynamic constraint you got for
>> offcore_response, LBR
>> or the HT workaround is still the best avail now.
>
> sigh, see I knew I was missing something :/
>
> So then for c->flags & DYNAMIC we should put and get again, right?

I think it would still work. The constraint, even a dynamic one, that you
got on the first call is "locked" to you. So if you incrementally add
events E1, E2, E3, which causes at most 3 calls to x86_schedule_events(),
you'd get the same constraint each time and it would still be valid. You
just need to make sure you release it if scheduling fails at event N and
that event holds a dynamic constraint.
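
A stand-alone sketch of that release-on-failure rule.  The helpers here are
made up for illustration: get_constraints(), put_constraints() and
assign_counters() stand in for x86_pmu.get_event_constraints(),
x86_pmu.put_event_constraints() and the counter assignment step, and the
failure point is artificial.

#include <stdio.h>

#define EVENT_DYNAMIC 0x0080            /* mirrors PERF_X86_EVENT_DYNAMIC */

struct constraint { unsigned int flags; int refcount; };
struct event { struct constraint *c; };

static struct constraint dyn_pool = { .flags = EVENT_DYNAMIC };

/* hand out the (shared, dynamic) constraint; stand-in for the get side */
static struct constraint *get_constraints(struct event *e)
{
        dyn_pool.refcount++;
        e->c = &dyn_pool;
        return e->c;
}

/* release it again; stand-in for the put side */
static void put_constraints(struct event *e)
{
        dyn_pool.refcount--;
        e->c = NULL;
}

/* pretend counter assignment: fail once more than 2 events are scheduled */
static int assign_counters(int nr) { return nr > 2 ? -1 : 0; }

/*
 * Incremental scheduling of events[0..n]: the new event picks up its
 * constraint; if the pass fails and the constraint is dynamic, put it
 * back so the next attempt re-derives it.
 */
static int schedule_incremental(struct event *events, int n)
{
        struct constraint *c = get_constraints(&events[n]);
        int ret = assign_counters(n + 1);

        if (ret && (c->flags & EVENT_DYNAMIC))
                put_constraints(&events[n]);    /* release on failure */
        return ret;
}

int main(void)
{
        struct event ev[3] = { { 0 }, { 0 }, { 0 } };

        for (int i = 0; i < 3; i++)
                printf("add E%d -> %s (dyn refcount %d)\n", i + 1,
                       schedule_incremental(ev, i) ? "fail" : "ok",
                       dyn_pool.refcount);
        return 0;
}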

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-27 10:13               ` Peter Zijlstra
@ 2015-05-27 11:44                 ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2015-05-27 11:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Vince Weaver, Jiri Olsa, Liang, Kan, LKML

On Wed, May 27, 2015 at 3:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 27, 2015 at 02:01:04AM -0700, Stephane Eranian wrote:
>> >  #define PERF_X86_EVENT_AUTO_RELOAD     0x0400 /* use PEBS auto-reload */
>> >  #define PERF_X86_EVENT_FREERUNNING     0x0800 /* use freerunning PEBS */
>> >
>> What's free running PEBS? ;->
>
> The series here:
>
>   http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com

Ok, I assume this goes with more patches that enable depth > 1 for the
PEBS buffer, otherwise I am not so sure about the added value. The PEBS
micro-code has to reprogram that MSR no matter what, doesn't it?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint
  2015-05-27  7:48     ` Peter Zijlstra
@ 2015-05-27 14:00       ` Andi Kleen
  0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2015-05-27 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, mingo, vincent.weaver, eranian, jolsa, kan.liang,
	linux-kernel

On Wed, May 27, 2015 at 09:48:09AM +0200, Peter Zijlstra wrote:
> On Tue, May 26, 2015 at 04:33:42PM -0700, Andi Kleen wrote:
> > Peter Zijlstra <peterz@infradead.org> writes:
> > > +		 */
> > > +		if (is_ht_workaround_enabled() && !cpuc->is_fake &&
> > 
> > Could this function also check if at least one leaking event is
> > scheduled somewhere? (e.g. from a global count)
> 
> You truncated one line too many:
> 
> +               if (is_ht_workaround_enabled() && !cpuc->is_fake &&
> +                   READ_ONCE(cpuc->excl_cntrs->exclusive_present))
> +                       gpmax /= 2;
> 
> Guess what that READ_ONCE() does? It checks if there's one such leaky
> event on the current core.

Great. Thanks.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.
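
For reference, a single READ_ONCE(cpuc->excl_cntrs->exclusive_present) can
cover both siblings if has_exclusive[] and exclusive_present alias each
other in a union.  That layout is not spelled out in the hunks quoted in
this thread, so treat it as an assumption; the sketch below only shows the
aliasing idea, with simplified types:

#include <stdint.h>
#include <stdio.h>

/*
 * Assumed layout (simplified): each HT sibling sets its has_exclusive[]
 * slot when it owns an exclusive event; exclusive_present aliases both
 * slots, so one load answers "does either sibling have one?".
 */
struct excl_cntrs {
        union {
                uint16_t has_exclusive[2];
                uint32_t exclusive_present;
        };
};

int main(void)
{
        struct excl_cntrs ec;
        int gpmax = 8;                  /* GP counters nominally available */

        ec.exclusive_present = 0;
        ec.has_exclusive[1] = 1;        /* sibling schedules a "leaky" event */

        if (ec.exclusive_present)       /* one read covers both siblings */
                gpmax /= 2;

        printf("gpmax=%d\n", gpmax);    /* 4: GP counters split across HTs */
        return 0;
}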

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2015-05-27 14:00 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-22 13:29 [PATCH v2 00/11] Various x86 pmu scheduling patches Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 01/11] perf,x86: Fix event/group validation Peter Zijlstra
2015-05-22 13:40   ` Peter Zijlstra
2015-05-26  9:24     ` Stephane Eranian
2015-05-26 10:12       ` Peter Zijlstra
2015-05-26 11:46         ` Stephane Eranian
2015-05-26 12:16           ` Peter Zijlstra
2015-05-26 12:25             ` Stephane Eranian
2015-05-26 13:22               ` Peter Zijlstra
2015-05-26 13:44                 ` Stephane Eranian
2015-05-22 13:29 ` [PATCH v2 02/11] perf/x86: Improve HT workaround GP counter constraint Peter Zijlstra
2015-05-22 13:42   ` Peter Zijlstra
2015-05-26  9:37   ` Stephane Eranian
2015-05-26 10:15     ` Peter Zijlstra
2015-05-26 11:47       ` Stephane Eranian
2015-05-26 13:19         ` Peter Zijlstra
2015-05-26 16:07           ` Peter Zijlstra
2015-05-27  9:01             ` Stephane Eranian
2015-05-27 10:11               ` Peter Zijlstra
2015-05-27 11:39                 ` Stephane Eranian
2015-05-27 10:13               ` Peter Zijlstra
2015-05-27 11:44                 ` Stephane Eranian
2015-05-26 23:33   ` Andi Kleen
2015-05-27  7:48     ` Peter Zijlstra
2015-05-27 14:00       ` Andi Kleen
2015-05-22 13:29 ` [PATCH v2 03/11] perf/x86: Correct local vs remote sibling state Peter Zijlstra
2015-05-26 11:48   ` Stephane Eranian
2015-05-22 13:29 ` [PATCH v2 04/11] perf/x86: Use lockdep Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 05/11] perf/x86: Simplify dynamic constraint code somewhat Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 06/11] perf/x86: Make WARNs consistent Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 07/11] perf/x86: Move intel_commit_scheduling() Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 08/11] perf/x86: Remove pointless tests Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 09/11] perf/x86: Remove intel_excl_states::init_state Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 10/11] perf,x86: Simplify logic Peter Zijlstra
2015-05-22 13:29 ` [PATCH v2 11/11] perf/x86: Simplify put_exclusive_constraints Peter Zijlstra
2015-05-22 13:38   ` Peter Zijlstra
