linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold
@ 2014-05-28  6:18 Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr Yan, Zheng
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

This patch series implements a large PEBS interrupt threshold. For some
limited cases, it can significantly reduce the sampling overhead. Please
read patch 6's commit message for more information.

Regards
Yan, Zheng


* [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  6:59   ` Peter Zijlstra
  2014-05-28  6:18 ` [RFC PATCH 2/7] perf, core: introduce pmu context switch callback Yan, Zheng
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

Currently, the built-in PMUs are not added to pmu_idr. This makes
it inconvenient to find a PMU by its type.
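
For illustration, a minimal sketch of the kind of lookup this enables once
the built-in PMUs live in pmu_idr (it mirrors what a later patch in this
series does; the x86 core PMU registers with type PERF_TYPE_RAW):

	struct pmu *pmu;

	rcu_read_lock();
	pmu = idr_find(&pmu_idr, PERF_TYPE_RAW);
	if (pmu) {
		/* use pmu while still inside the RCU read-side section */
	}
	rcu_read_unlock();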

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 kernel/events/core.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ed50b09..bafc416 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6496,14 +6496,15 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		goto skip_type;
 	pmu->name = name;
 
-	if (type < 0) {
-		type = idr_alloc(&pmu_idr, pmu, PERF_TYPE_MAX, 0, GFP_KERNEL);
-		if (type < 0) {
-			ret = type;
-			goto free_pdc;
-		}
+	ret = idr_alloc(&pmu_idr, pmu, type >= 0 ? type : PERF_TYPE_MAX,
+			0, GFP_KERNEL);
+	if (ret < 0)
+		goto free_pdc;
+	pmu->type = ret;
+	if (WARN_ON_ONCE(type >= 0 && pmu->type != type)) {
+		ret = -EINVAL;
+		goto free_idr;
 	}
-	pmu->type = type;
 
 	if (pmu_bus_running) {
 		ret = pmu_dev_alloc(pmu);
@@ -6575,8 +6576,7 @@ free_dev:
 	put_device(pmu->dev);
 
 free_idr:
-	if (pmu->type >= PERF_TYPE_MAX)
-		idr_remove(&pmu_idr, pmu->type);
+	idr_remove(&pmu_idr, pmu->type);
 
 free_pdc:
 	free_percpu(pmu->pmu_disable_count);
@@ -6598,8 +6598,7 @@ void perf_pmu_unregister(struct pmu *pmu)
 	synchronize_rcu();
 
 	free_percpu(pmu->pmu_disable_count);
-	if (pmu->type >= PERF_TYPE_MAX)
-		idr_remove(&pmu_idr, pmu->type);
+	idr_remove(&pmu_idr, pmu->type);
 	device_del(pmu->dev);
 	put_device(pmu->dev);
 	free_pmu_context(pmu);
-- 
1.9.0



* [RFC PATCH 2/7] perf, core: introduce pmu context switch callback
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 3/7] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

The callback is invoked when a process is scheduled in or out.
It provides a mechanism for later patches to save/restore the LBR
stack. For the schedule-in case, the callback is invoked at the
same place as the flush branch stack callback, so it can also
replace the flush branch stack callback. To avoid unnecessary
overhead, the callback is enabled only when there are events that
use the LBR stack.
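
As an illustration, a minimal sketch of how a PMU driver is expected to use
the new hooks (my_pmu_* and the per-cpu "users" counter are placeholders;
the actual x86 wiring is in the diff below and in the next patch):

	static DEFINE_PER_CPU(int, users);

	/* wired up as .sched_task in the driver's struct pmu */
	static void my_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
	{
		/* save or restore per-task hardware state, e.g. an LBR stack */
	}

	static void my_pmu_event_enable(struct perf_event *event)
	{
		/* first user on this CPU: turn the context switch callback on */
		if (this_cpu_inc_return(users) == 1)
			perf_sched_cb_enable(event->ctx->pmu);
	}

	static void my_pmu_event_disable(struct perf_event *event)
	{
		/* last user on this CPU: turn the context switch callback off */
		if (this_cpu_dec_return(users) == 0)
			perf_sched_cb_disable(event->ctx->pmu);
	}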

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c |  7 +++++
 arch/x86/kernel/cpu/perf_event.h |  2 ++
 include/linux/perf_event.h       |  8 ++++++
 kernel/events/core.c             | 60 +++++++++++++++++++++++++++++++++++++++-
 4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 89f3b7c..d4e1dd7 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1873,6 +1873,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	if (x86_pmu.sched_task)
+		x86_pmu.sched_task(ctx, sched_in);
+}
+
 static void x86_pmu_flush_branch_stack(void)
 {
 	if (x86_pmu.flush_branch_stack)
@@ -1906,6 +1912,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.sched_task		= x86_pmu_sched_task,
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..e70b352 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -429,6 +429,8 @@ struct x86_pmu {
 
 	void		(*check_microcode)(void);
 	void		(*flush_branch_stack)(void);
+	void		(*sched_task)(struct perf_event_context *ctx,
+				      bool sched_in);
 
 	/*
 	 * Intel Arch Perfmon v2+
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index af6dcf1..a4b29ad 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -252,6 +252,12 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * PMU callback for context-switches. optional
+	 */
+	void (*sched_task)		(struct perf_event_context *ctx,
+					 bool sched_in);
 };
 
 /**
@@ -545,6 +551,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_disable(struct pmu *pmu);
+extern void perf_sched_cb_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
 extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bafc416..1a1139d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -143,6 +143,7 @@ enum event_type_t {
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
 static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -152,6 +153,7 @@ static atomic_t nr_freq_events __read_mostly;
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
 static struct srcu_struct pmus_srcu;
+static struct idr pmu_idr;
 
 /*
  * perf event paranoia level:
@@ -2360,6 +2362,57 @@ unlock:
 	}
 }
 
+void perf_sched_cb_disable(struct pmu *pmu)
+{
+	__get_cpu_var(perf_sched_cb_usages)--;
+}
+
+void perf_sched_cb_enable(struct pmu *pmu)
+{
+	__get_cpu_var(perf_sched_cb_usages)++;
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+				struct task_struct *next,
+				bool sched_in)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	if (prev == next)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	pmu = idr_find(&pmu_idr, PERF_TYPE_RAW);
+
+	if (pmu && pmu->sched_task) {
+		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+		pmu = cpuctx->ctx.pmu;
+
+		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+		perf_pmu_disable(pmu);
+
+		pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+		perf_pmu_enable(pmu);
+
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
 #define for_each_task_context_nr(ctxn)					\
 	for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
 
@@ -2379,6 +2432,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
 {
 	int ctxn;
 
+	if (__get_cpu_var(perf_sched_cb_usages))
+		perf_pmu_sched_task(task, next, false);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_context_sched_out(task, ctxn, next);
 
@@ -2636,6 +2692,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	/* check for system-wide branch_stack events */
 	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
 		perf_branch_stack_sched_in(prev, task);
+
+	if (__get_cpu_var(perf_sched_cb_usages))
+		perf_pmu_sched_task(prev, task, true);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -6375,7 +6434,6 @@ static void free_pmu_context(struct pmu *pmu)
 out:
 	mutex_unlock(&pmus_lock);
 }
-static struct idr pmu_idr;
 
 static ssize_t
 type_show(struct device *dev, struct device_attribute *attr, char *page)
-- 
1.9.0



* [RFC PATCH 3/7] perf, x86: use context switch callback to flush LBR stack
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 2/7] perf, core: introduce pmu context switch callback Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 4/7] tools, perf: Allow the user to disable time stamps Yan, Zheng
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

The previous commit introduced a context switch callback whose
function overlaps with the flush branch stack callback. So we can
use the context switch callback to flush the LBR stack.

This patch adds code that uses the context switch callback to
flush the LBR stack when a task is scheduled in. The callback
is enabled only when there are events that use the LBR hardware.
This patch also removes all of the old flush branch stack code.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           |  7 ---
 arch/x86/kernel/cpu/perf_event.h           |  3 +-
 arch/x86/kernel/cpu/perf_event_intel.c     | 14 +-----
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 32 +++++++++++--
 include/linux/perf_event.h                 |  6 ---
 kernel/events/core.c                       | 77 ------------------------------
 6 files changed, 30 insertions(+), 109 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index d4e1dd7..11b6490 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1879,12 +1879,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 		x86_pmu.sched_task(ctx, sched_in);
 }
 
-static void x86_pmu_flush_branch_stack(void)
-{
-	if (x86_pmu.flush_branch_stack)
-		x86_pmu.flush_branch_stack();
-}
-
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -1911,7 +1905,6 @@ static struct pmu pmu = {
 	.commit_txn		= x86_pmu_commit_txn,
 
 	.event_idx		= x86_pmu_event_idx,
-	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 	.sched_task		= x86_pmu_sched_task,
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index e70b352..d8165f3 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -428,7 +428,6 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*flush_branch_stack)(void);
 	void		(*sched_task)(struct perf_event_context *ctx,
 				      bool sched_in);
 
@@ -685,6 +684,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_pmu_lbr_reset(void);
 
 void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..ef926ee 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2035,18 +2035,6 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
-static void intel_pmu_flush_branch_stack(void)
-{
-	/*
-	 * Intel LBR does not tag entries with the
-	 * PID of the current task, then we need to
-	 * flush it on ctxsw
-	 * For now, we simply reset it
-	 */
-	if (x86_pmu.lbr_nr)
-		intel_pmu_lbr_reset();
-}
-
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2098,7 +2086,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.flush_branch_stack	= intel_pmu_flush_branch_stack,
+	.sched_task		= intel_pmu_lbr_sched_task,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d82d155..c776931 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -181,7 +181,7 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
@@ -189,6 +189,23 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 		return;
 
 	/*
+	 * It is necessary to flush the stack on context switch. This happens
+	 * when the branch stack does not tag its entries with the pid of the
+	 * current task.
+	 */
+	if (sched_in) {
+		intel_pmu_lbr_reset();
+		cpuc->lbr_context = ctx;
+	}
+}
+
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!x86_pmu.lbr_nr)
+		return;
+	/*
 	 * Reset the LBR stack if we changed task context to
 	 * avoid data leaks.
 	 */
@@ -199,6 +216,8 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
+	if (cpuc->lbr_users == 1)
+		perf_sched_cb_enable(event->ctx->pmu);
 }
 
 void intel_pmu_lbr_disable(struct perf_event *event)
@@ -211,10 +230,13 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users) {
-		__intel_pmu_lbr_disable();
-		/* avoid stale pointer */
-		cpuc->lbr_context = NULL;
+	if (!cpuc->lbr_users) {
+		perf_sched_cb_disable(event->ctx->pmu);
+		if (cpuc->enabled) {
+			__intel_pmu_lbr_disable();
+			/* avoid stale pointer */
+			cpuc->lbr_context = NULL;
+		}
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a4b29ad..b721c6a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -249,11 +249,6 @@ struct pmu {
 	int (*event_idx)		(struct perf_event *event); /*optional */
 
 	/*
-	 * flush branch stack on context-switches (needed in cpu-wide mode)
-	 */
-	void (*flush_branch_stack)	(void);
-
-	/*
 	 * PMU callback for context-switches. optional
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
@@ -499,7 +494,6 @@ struct perf_event_context {
 	u64				generation;
 	int				pin_count;
 	int				nr_cgroups;	 /* cgroup evts */
-	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1a1139d..0cd08ff 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -142,7 +142,6 @@ enum event_type_t {
  */
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 static atomic_t nr_mmap_events __read_mostly;
@@ -1136,9 +1135,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
-	if (has_branch_stack(event))
-		ctx->nr_branch_stack++;
-
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1301,9 +1297,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
-	if (has_branch_stack(event))
-		ctx->nr_branch_stack--;
-
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2600,64 +2593,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
-				       struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
-	unsigned long flags;
-
-	/* no need to flush branch stack if not changing task */
-	if (prev == task)
-		return;
-
-	local_irq_save(flags);
-
-	rcu_read_lock();
-
-	list_for_each_entry_rcu(pmu, &pmus, entry) {
-		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
-		/*
-		 * check if the context has at least one
-		 * event using PERF_SAMPLE_BRANCH_STACK
-		 */
-		if (cpuctx->ctx.nr_branch_stack > 0
-		    && pmu->flush_branch_stack) {
-
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
-			perf_pmu_disable(pmu);
-
-			pmu->flush_branch_stack();
-
-			perf_pmu_enable(pmu);
-
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-		}
-	}
-
-	rcu_read_unlock();
-
-	local_irq_restore(flags);
-}
-
-/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2689,10 +2624,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
 
-	/* check for system-wide branch_stack events */
-	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
-		perf_branch_stack_sched_in(prev, task);
-
 	if (__get_cpu_var(perf_sched_cb_usages))
 		perf_pmu_sched_task(prev, task, true);
 }
@@ -3261,10 +3192,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
 	if (event->parent)
 		return;
 
-	if (has_branch_stack(event)) {
-		if (!(event->attach_state & PERF_ATTACH_TASK))
-			atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
-	}
 	if (is_cgroup_event(event))
 		atomic_dec(&per_cpu(perf_cgroup_events, cpu));
 }
@@ -6713,10 +6640,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
 	if (event->parent)
 		return;
 
-	if (has_branch_stack(event)) {
-		if (!(event->attach_state & PERF_ATTACH_TASK))
-			atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
-	}
 	if (is_cgroup_event(event))
 		atomic_inc(&per_cpu(perf_cgroup_events, cpu));
 }
-- 
1.9.0



* [RFC PATCH 4/7] tools, perf: Allow the user to disable time stamps
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
                   ` (2 preceding siblings ...)
  2014-05-28  6:18 ` [RFC PATCH 3/7] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible Yan, Zheng
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak

From: Andi Kleen <ak@linux.intel.com>

Time stamps are currently always implicitly enabled for record.
The old --time/-T option is a nop.

Allow the user to disable timestamps by using --no-time.

This can cause some minor misaccounting (by missing mmaps), but it
significantly lowers the size of perf.data.

The defaults are unchanged.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/builtin-record.c | 1 +
 tools/perf/util/evsel.c     | 9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index e4c85b8..1c8849f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -771,6 +771,7 @@ static const char * const record_usage[] = {
  */
 static struct record record = {
 	.opts = {
+		.sample_time	     = true,
 		.mmap_pages	     = UINT_MAX,
 		.user_freq	     = UINT_MAX,
 		.user_interval	     = ULLONG_MAX,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 5c28d82..d6e7865 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -632,9 +632,12 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
 	if (opts->period)
 		perf_evsel__set_sample_bit(evsel, PERIOD);
 
-	if (!perf_missing_features.sample_id_all &&
-	    (opts->sample_time || !opts->no_inherit ||
-	     target__has_cpu(&opts->target) || per_cpu))
+	/*
+	 * When the user explicitely disabled time don't force it here.
+	 */
+	if (opts->sample_time &&
+	    (!perf_missing_features.sample_id_all &&
+	    (!opts->no_inherit || target__has_cpu(&opts->target) || per_cpu)))
 		perf_evsel__set_sample_bit(evsel, TIME);
 
 	if (opts->raw_samples) {
-- 
1.9.0



* [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
                   ` (3 preceding siblings ...)
  2014-05-28  6:18 ` [RFC PATCH 4/7] tools, perf: Allow the user to disable time stamps Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  7:59   ` Peter Zijlstra
  2014-05-28  6:18 ` [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
  2014-05-28  6:18 ` [RFC PATCH 7/7] perf, x86: drain PEBS buffer during context switch Yan, Zheng
  6 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
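
For illustration, the reload value is just the negated sample period,
truncated to the counter width, and it is written into the DS reset field
once at enable time (a sketch of the arithmetic; the real code is in the
diff below):

	/* With a fixed period (attr.freq == 0) the reset value never changes. */
	u64 period = event->hw.sample_period;
	u64 reset  = (u64)(-period) & x86_pmu.cntval_mask; /* e.g. 2^48 - period on a 48-bit counter */

	ds->pebs_event_reset[event->hw.idx] = reset;

The hardware then reloads the counter from this field after each PEBS
record, so the PMI handler no longer has to do that wrmsrl() itself.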

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c          | 15 +++++++++------
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  7 +++++++
 include/linux/perf_event.h                |  1 +
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 11b6490..282bae4 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -988,13 +988,16 @@ int x86_perf_event_set_period(struct perf_event *event)
 
 	per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
 
-	/*
-	 * The hw event starts counting from this event offset,
-	 * mark it to be able to extra future deltas:
-	 */
-	local64_set(&hwc->prev_count, (u64)-left);
+	if (!hwc->autoreload ||
+	    local64_read(&hwc->prev_count) != (u64)-left) {
+		/*
+		 * The hw event starts counting from this event offset,
+		 * mark it to be able to extra future deltas:
+		 */
+		local64_set(&hwc->prev_count, (u64)-left);
 
-	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+		wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
+	}
 
 	/*
 	 * Due to erratum on certan cpu we need
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970c..1db4ce5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -714,6 +714,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	struct hw_perf_event *hwc = &event->hw;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+	hwc->autoreload = !event->attr.freq;
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 
@@ -721,6 +722,11 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
+
+	/* Use auto-reload if possible to save a MSR write in the PMI */
+	if (hwc->autoreload)
+		ds->pebs_event_reset[hwc->idx] =
+			(u64)-hwc->sample_period & x86_pmu.cntval_mask;
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -739,6 +745,7 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+	hwc->autoreload = false;
 }
 
 void intel_pmu_pebs_enable_all(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b721c6a..6059309 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -149,6 +149,7 @@ struct hw_perf_event {
 
 	u64				freq_time_stamp;
 	u64				freq_count_stamp;
+	bool				autoreload;
 #endif
 };
 
-- 
1.9.0



* [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
                   ` (4 preceding siblings ...)
  2014-05-28  6:18 ` [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  8:10   ` Peter Zijlstra
  2014-05-28  6:18 ` [RFC PATCH 7/7] perf, x86: drain PEBS buffer during context switch Yan, Zheng
  6 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

PEBS has always had the capability to log samples to its buffer without
an interrupt. Traditionally perf has not used this, but has always set the
PEBS threshold to one.

For frequently occurring events (like cycles, branches or loads/stores)
this in turn requires using a relatively high sampling period to avoid
overloading the system with PMI processing. This in turn increases the
sampling error.

For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it cannot
supply a call graph. It also requires setting a fixed period, as the
hardware does not support an adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.

So we can make PEBS work for some specific cases: basically, as long as
you can do without a call graph and can set a fixed period, you can use
this new PEBS mode.

The main benefit is the ability to support much lower sampling periods
(down to -c 1000) without excessive overhead.

One use case is, for example, to increase the resolution of the c2c tool.
Another is double-checking when you suspect that standard sampling has
too much sampling error.

Some numbers on the overhead, using a cycle soak workload, comparing
"perf record --no-time -e cycles:p -c <period>" (multi) to
"perf record -e cycles:p -c <period>" (plain):

period    plain  multi  delta
10003     15     5      10
20003     15.7   4      11.7
40003     8.7    2.5    6.2
80003     4.1    1.4    2.7
100003    3.6    1.2    2.4
800003    4.4    1.4    3
1000003   0.6    0.4    0.2
2000003   0.4    0.3    0.1
4000003   0.3    0.2    0.1
10000003  0.3    0.2    0.1

The interesting part is the delta between multi-pebs and normal pebs. Above
-c 1000003 it does not really matter because the basic overhead is so low.
With periods below 80003 it becomes interesting.

Note that in some other workloads (e.g. kernbench) the smaller sampling
periods cause much more overhead without multi-pebs; up to 80% overhead
(and throttling) has been observed with -c 10003. Multi-pebs generally
does not throttle.
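
For illustration, the core of the change is where the PEBS interrupt
threshold is programmed. A simplified sketch of the logic in the diff below
(PEBS_FREERUNNING_FLAGS is the whitelist of sample_type bits the hardware
can log without help from the PMI handler):

	if (x86_pmu.multi_pebs && hwc->autoreload &&
	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
		/*
		 * Interrupt only when the buffer is nearly full, leaving
		 * room for one more record per PEBS-capable counter so the
		 * buffer cannot overflow before the PMI is serviced.
		 */
		threshold = ds->pebs_absolute_maximum -
			    x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
	} else {
		/* classic behaviour: one record, then a PMI */
		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
	}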

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h          |  1 +
 arch/x86/kernel/cpu/perf_event_intel.c    |  4 ++
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 94 ++++++++++++++++++++++---------
 3 files changed, 71 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d8165f3..cb7cda8 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -450,6 +450,7 @@ struct x86_pmu {
 	struct event_constraint *pebs_constraints;
 	void		(*pebs_aliases)(struct perf_event *event);
 	int 		max_pebs_events;
+	bool		multi_pebs;
 
 	/*
 	 * Intel LBR
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index ef926ee..6c2a380 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2447,6 +2447,7 @@ __init int intel_pmu_init(void)
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
 			X86_CONFIG(.event=0xb1, .umask=0x01, .inv=1, .cmask=1);
 
+		x86_pmu.multi_pebs = true;
 		pr_cont("SandyBridge events, ");
 		break;
 	case 58: /* IvyBridge */
@@ -2475,6 +2476,7 @@ __init int intel_pmu_init(void)
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
 
+		x86_pmu.multi_pebs = true;
 		pr_cont("IvyBridge events, ");
 		break;
 
@@ -2502,6 +2504,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.cpu_events = hsw_events_attrs;
 		x86_pmu.lbr_double_abort = true;
+
+		x86_pmu.multi_pebs = true;
 		pr_cont("Haswell events, ");
 		break;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 1db4ce5..a0284a4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -11,7 +11,7 @@
 #define BTS_RECORD_SIZE		24
 
 #define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE	PAGE_SIZE
+#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
 #define PEBS_FIXUP_SIZE		PAGE_SIZE
 
 /*
@@ -251,7 +251,7 @@ static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
 	int node = cpu_to_node(cpu);
-	int max, thresh = 1; /* always use a single PEBS record */
+	int max;
 	void *buffer, *ibuffer;
 
 	if (!x86_pmu.pebs)
@@ -281,9 +281,6 @@ static int alloc_pebs_buffer(int cpu)
 	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
 		max * x86_pmu.pebs_record_size;
 
-	ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
-		thresh * x86_pmu.pebs_record_size;
-
 	return 0;
 }
 
@@ -708,14 +705,29 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 	return &emptyconstraint;
 }
 
+/*
+ * Flags PEBS can handle without an PMI.
+ *
+ * TID can only be handled by flushing at context switch.
+ */
+#define PEBS_FREERUNNING_FLAGS \
+	(PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
+	 PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
+	 PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
+	 PERF_SAMPLE_TRANSACTION)
+
 void intel_pmu_pebs_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	struct debug_store *ds = cpuc->ds;
+	u64 threshold;
+	bool first_pebs;
 
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 	hwc->autoreload = !event->attr.freq;
 
+	first_pebs = !(cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 
 	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
@@ -723,6 +735,20 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
 		cpuc->pebs_enabled |= 1ULL << 63;
 
+	/*
+	 * When the event is constrained enough we can use a larger
+	 * threshold and run the event with less frequent PMI.
+	 */
+	if (x86_pmu.multi_pebs && hwc->autoreload &&
+	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
+		threshold = ds->pebs_absolute_maximum -
+			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+	} else {
+		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+	}
+	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
+		ds->pebs_interrupt_threshold = threshold;
+
 	/* Use auto-reload if possible to save a MSR write in the PMI */
 	if (hwc->autoreload)
 		ds->pebs_event_reset[hwc->idx] =
@@ -867,7 +893,8 @@ static inline u64 intel_hsw_transaction(struct pebs_record_hsw *pebs)
 }
 
 static void __intel_pmu_pebs_event(struct perf_event *event,
-				   struct pt_regs *iregs, void *__pebs)
+				   struct pt_regs *iregs, void *__pebs,
+				   bool first_record)
 {
 	/*
 	 * We cast to the biggest pebs_record but are careful not to
@@ -880,7 +907,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	u64 sample_type;
 	int fll, fst;
 
-	if (!intel_pmu_save_and_restart(event))
+	if (first_record && !intel_pmu_save_and_restart(event))
 		return;
 
 	fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
@@ -956,8 +983,22 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	if (has_branch_stack(event))
 		data.br_stack = &cpuc->lbr_stack;
 
-	if (perf_event_overflow(event, &data, &regs))
-		x86_pmu_stop(event, 0);
+	if (first_record) {
+		if (perf_event_overflow(event, &data, &regs))
+			x86_pmu_stop(event, 0);
+	} else {
+		struct perf_output_handle handle;
+		struct perf_event_header header;
+
+		perf_prepare_sample(&header, &data, event, &regs);
+
+		if (perf_output_begin(&handle, event, header.size))
+			return;
+
+		perf_output_sample(&handle, &header, &data, event);
+
+		perf_output_end(&handle);
+	}
 }
 
 static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
@@ -998,17 +1039,18 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 	WARN_ONCE(n > 1, "bad leftover pebs %d\n", n);
 	at += n - 1;
 
-	__intel_pmu_pebs_event(event, iregs, at);
+	__intel_pmu_pebs_event(event, iregs, at, true);
 }
 
 static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
-	struct perf_event *event = NULL;
+	struct perf_event *event;
 	void *at, *top;
 	u64 status = 0;
 	int bit;
+	bool multi_pebs, first_record;
 
 	if (!x86_pmu.pebs_active)
 		return;
@@ -1021,13 +1063,11 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 	if (unlikely(at > top))
 		return;
 
-	/*
-	 * Should not happen, we program the threshold at 1 and do not
-	 * set a reset value.
-	 */
-	WARN_ONCE(top - at > x86_pmu.max_pebs_events * x86_pmu.pebs_record_size,
-		  "Unexpected number of pebs records %ld\n",
-		  (long)(top - at) / x86_pmu.pebs_record_size);
+	if (ds->pebs_interrupt_threshold >
+	    ds->pebs_buffer_base + x86_pmu.pebs_record_size)
+		multi_pebs = true;
+	else
+		multi_pebs = false;
 
 	for (; at < top; at += x86_pmu.pebs_record_size) {
 		struct pebs_record_nhm *p = at;
@@ -1042,17 +1082,15 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 
 			if (!event->attr.precise_ip)
 				continue;
-
-			if (__test_and_set_bit(bit, (unsigned long *)&status))
-				continue;
-
-			break;
+			if (!__test_and_set_bit(bit, (unsigned long *)&status)) {
+				first_record = true;
+			} else {
+				if (!multi_pebs)
+					continue;
+				first_record = false;
+			}
+			__intel_pmu_pebs_event(event, iregs, at, first_record);
 		}
-
-		if (!event || bit >= x86_pmu.max_pebs_events)
-			continue;
-
-		__intel_pmu_pebs_event(event, iregs, at);
 	}
 }
 
-- 
1.9.0



* [RFC PATCH 7/7] perf, x86: drain PEBS buffer during context switch
  2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
                   ` (5 preceding siblings ...)
  2014-05-28  6:18 ` [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
@ 2014-05-28  6:18 ` Yan, Zheng
  2014-05-28  8:12   ` Peter Zijlstra
  6 siblings, 1 reply; 33+ messages in thread
From: Yan, Zheng @ 2014-05-28  6:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, eranian, ak, Yan, Zheng

Flush the PEBS buffer during context switch if PEBS interrupt threshold
is larger than one. This allows perf to supply TID for events.
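
For illustration, the reason the flush has to happen on schedule out: the
pid/tid recorded with a PEBS sample are taken from the current task at the
time the DS buffer is drained, not at the time the hardware logged the
record. A sketch of the hook added below, with that reasoning spelled out:

	static void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
	{
		/*
		 * Records still sitting in the DS buffer would otherwise be
		 * attributed to whichever task is current when the buffer is
		 * eventually drained, so drain while the owning task is
		 * still running, i.e. on schedule out.
		 */
		if (!sched_in)
			intel_pmu_drain_pebs_buffer();
	}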

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  3 +++
 arch/x86/kernel/cpu/perf_event_intel.c     | 11 +++++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c  | 42 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  4 ---
 4 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index cb7cda8..eafea09 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -147,6 +147,7 @@ struct cpu_hw_events {
 	 */
 	struct debug_store	*ds;
 	u64			pebs_enabled;
+	bool			pebs_sched_cb_enabled;
 
 	/*
 	 * Intel LBR bits
@@ -683,6 +684,8 @@ void intel_pmu_pebs_enable_all(void);
 
 void intel_pmu_pebs_disable_all(void);
 
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_ds_init(void);
 
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6c2a380..dba03b3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2035,6 +2035,15 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_sched_task(struct perf_event_context *ctx,
+				 bool sched_in)
+{
+	if (x86_pmu.pebs_active)
+		intel_pmu_pebs_sched_task(ctx, sched_in);
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_sched_task(ctx, sched_in);
+}
+
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2086,7 +2095,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.sched_task		= intel_pmu_lbr_sched_task,
+	.sched_task		= intel_pmu_sched_task,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index a0284a4..ddb5683 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -705,6 +705,26 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 	return &emptyconstraint;
 }
 
+void intel_pmu_drain_pebs_buffer(void)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct debug_store *ds = cpuc->ds;
+	struct pt_regs regs;
+
+	if (!x86_pmu.pebs_active)
+		return;
+	if (ds->pebs_index <= ds->pebs_buffer_base)
+		return;
+
+	x86_pmu.drain_pebs(&regs);
+}
+
+void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	if (!sched_in)
+		intel_pmu_drain_pebs_buffer();
+}
+
 /*
  * Flags PEBS can handle without an PMI.
  *
@@ -743,8 +763,16 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
 		threshold = ds->pebs_absolute_maximum -
 			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
+		if (first_pebs) {
+			perf_sched_cb_enable(event->ctx->pmu);
+			cpuc->pebs_sched_cb_enabled = true;
+		}
 	} else {
 		threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
+		if (cpuc->pebs_sched_cb_enabled) {
+			perf_sched_cb_disable(event->ctx->pmu);
+			cpuc->pebs_sched_cb_enabled = false;
+		}
 	}
 	if (first_pebs || ds->pebs_interrupt_threshold > threshold)
 		ds->pebs_interrupt_threshold = threshold;
@@ -759,8 +787,19 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct hw_perf_event *hwc = &event->hw;
+	struct debug_store *ds = cpuc->ds;
+	bool multi_pebs = false;
+
+	if (ds->pebs_interrupt_threshold >
+	    ds->pebs_buffer_base + x86_pmu.pebs_record_size)
+		multi_pebs = true;
 
 	cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+	if (cpuc->pebs_sched_cb_enabled &&
+	    !(cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1))) {
+		perf_sched_cb_disable(event->ctx->pmu);
+		cpuc->pebs_sched_cb_enabled = false;
+	}
 
 	if (event->hw.constraint->flags & PERF_X86_EVENT_PEBS_LDLAT)
 		cpuc->pebs_enabled &= ~(1ULL << (hwc->idx + 32));
@@ -772,6 +811,9 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
 	hwc->autoreload = false;
+
+	if (multi_pebs)
+		intel_pmu_drain_pebs_buffer();
 }
 
 void intel_pmu_pebs_enable_all(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index c776931..e624407 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -184,10 +184,6 @@ void intel_pmu_lbr_reset(void)
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
-	if (!x86_pmu.lbr_nr)
-		return;
-
 	/*
 	 * It is necessary to flush the stack on context switch. This happens
 	 * when the branch stack does not tag its entries with the pid of the
-- 
1.9.0



* Re: [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr
  2014-05-28  6:18 ` [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr Yan, Zheng
@ 2014-05-28  6:59   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28  6:59 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak


On Wed, May 28, 2014 at 02:18:04PM +0800, Yan, Zheng wrote:
> Currently, the build-in PMUs are not added to pmu_idr. This makes
> a inconvenience for finding PMU by type.
> 
This was done because of the horrid mess that we have with the software
pmus, some of which alias the same id.

I've not thought about how this will or will not break things, but since
you didn't either, I'll leave you to sort it out.



* Re: [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible
  2014-05-28  6:18 ` [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible Yan, Zheng
@ 2014-05-28  7:59   ` Peter Zijlstra
  2014-05-28 14:46     ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28  7:59 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak


On Wed, May 28, 2014 at 02:18:08PM +0800, Yan, Zheng wrote:
> When a fixed period is specified, this patch make perf use the PEBS
> auto reload mechanism. This makes normal profiling faster, because
> it avoids one costly MSR write in the PMI handler.
> 

You can't do this unconditionally. Like I think you mention in the next
patch, we need the PMI to fix up lots of cases.



* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28  6:18 ` [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
@ 2014-05-28  8:10   ` Peter Zijlstra
  2014-05-28 12:54     ` Stephane Eranian
  2014-05-28 14:58     ` Andi Kleen
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28  8:10 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak


On Wed, May 28, 2014 at 02:18:09PM +0800, Yan, Zheng wrote:
> PEBS always had the capability to log samples to its buffers without
> an interrupt. Traditionally perf has not used this but always set the
> PEBS threshold to one.
> 
> For the common cases we still need to use the PMI because the PEBS
> hardware has various limitations. The biggest one is that it can not
> supply a callgraph. It also requires setting a fixed period, as the
> hardware does not support adaptive period. Another issue is that it
> cannot supply a time stamp and some other options. 

So the reason I've never done this is because Intel has never fully
explained the demuxing of pebs events.

In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
confirmed to me that that is a direct copy of the similarly named MSR at
the time of the PEBS assist.

This is a problem, since if multiple counters overflow, multiple bits
will be set and it's (afaict) ambiguous which event is for which counter.
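
For illustration, a simplified sketch of the per-record demux a drain loop
has to do (the status field in each PEBS record is that GLOBAL_STATUS copy):

	for (at = base; at < top; at += x86_pmu.pebs_record_size) {
		struct pebs_record_nhm *p = at;
		int bit;

		for_each_set_bit(bit, (unsigned long *)&p->status,
				 x86_pmu.max_pebs_events) {
			struct perf_event *event = cpuc->events[bit];
			/*
			 * With more than one bit set there is no way to tell
			 * which counter's assist produced this record; the
			 * other counter may merely have been overflowed but
			 * not yet assisted when the copy was taken.
			 */
		}
	}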

At one point it was said they'd fix this 0x90 offset to indicate which
counter triggered the event, but I've never heard back if this happened.

So until you can give an official Intel answer on how all this demuxing
is supposed to work and be correct this patch set isn't moving anywhere.

> To supply a TID it
> requires flushing on context switch. It can however supply the IP

On SNB+; previous to SNB it would need to have precise==1. I've seen no
such logic in the patches. Instead you seem to artificially limit it to
SNB+, for no reason apparent to me.



* Re: [RFC PATCH 7/7] perf, x86: drain PEBS buffer during context switch
  2014-05-28  6:18 ` [RFC PATCH 7/7] perf, x86: drain PEBS buffer during context switch Yan, Zheng
@ 2014-05-28  8:12   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28  8:12 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, ak


On Wed, May 28, 2014 at 02:18:10PM +0800, Yan, Zheng wrote:
> Flush the PEBS buffer during context switch if PEBS interrupt threshold
> is larger than one. This allows perf to supply TID for events.

I have the distinct feeling you're doing this patch series the wrong way
around. It seems to me patch 5 completely breaks PEBS, patch 6 maybe
patches it up a bit, but only after this patch things work again
(maybe).

That's wrong.



* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28  8:10   ` Peter Zijlstra
@ 2014-05-28 12:54     ` Stephane Eranian
  2014-05-28 15:02       ` Peter Zijlstra
  2014-05-28 14:58     ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2014-05-28 12:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yan, Zheng, LKML, mingo, ak

On Wed, May 28, 2014 at 10:10 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 28, 2014 at 02:18:09PM +0800, Yan, Zheng wrote:
>> PEBS always had the capability to log samples to its buffers without
>> an interrupt. Traditionally perf has not used this but always set the
>> PEBS threshold to one.
>>
>> For the common cases we still need to use the PMI because the PEBS
>> hardware has various limitations. The biggest one is that it can not
>> supply a callgraph. It also requires setting a fixed period, as the
>> hardware does not support adaptive period. Another issue is that it
>> cannot supply a time stamp and some other options.
>
> So the reason I've never done this is because Intel has never fully
> explained the demuxing of pebs events.
>
> In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
> confirmed to me that that is a direct copy of the similarly named MSR at
> the time of the PEBS assist.
>
> This is a problem, since if multiple counters overflow multiple bits
> will be set and its (afaict) ambiguous which event is for which counter.
>
I am not sure how having only one entry in the PEBS buffer solves this.
I think PEBS will create only one entry if multiple counters overflow
simultaneously. That OVFL_STATUS bitmask will have multiple bits
set. I understand the problem in perf_events because you need to
assign a sample to an event and not all events may record the same
info in the sampling buffer.

> At one point it was said they'd fix this 0x90 offset to indicate which
> counter triggered the event, but I've never heard back if this happened.
>
> So until you can give an official Intel answer on how all this demuxing
> is supposed to work and be correct this patch set isn't moving anywhere.
>
>> To supply a TID it
>> requires flushing on context switch. It can however supply the IP
>
> On SNB+, previous to SNB it would need to have precise==1. I've seen no
> such logic in. Instead you seem to artificially limit it to SNB+, for no
> apparent reason to me.


* Re: [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible
  2014-05-28  7:59   ` Peter Zijlstra
@ 2014-05-28 14:46     ` Andi Kleen
  2014-05-28 15:36       ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 14:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian

On Wed, May 28, 2014 at 09:59:37AM +0200, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 02:18:08PM +0800, Yan, Zheng wrote:
> > When a fixed period is specified, this patch make perf use the PEBS
> > auto reload mechanism. This makes normal profiling faster, because
> > it avoids one costly MSR write in the PMI handler.
> > 
> 
> You can't do this unconditionally. Like I think you mention in the next
> patch, we need the PMI to fix up lots of cases.

There are two cases:

(1) Fixed period specified with PEBS.
The PMI still runs with threshold=1, but we can use the PEBS
auto-reload mechanism to automatically rewrite the counter which is
faster.

(2) Free running mode with more constraints

This is case (1)

-andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28  8:10   ` Peter Zijlstra
  2014-05-28 12:54     ` Stephane Eranian
@ 2014-05-28 14:58     ` Andi Kleen
  2014-05-28 15:24       ` Stephane Eranian
  2014-05-28 15:35       ` Peter Zijlstra
  1 sibling, 2 replies; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 14:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian

> In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
> confirmed to me that that is a direct copy of the similarly named MSR at
> the time of the PEBS assist.

That is correct.

> 
> This is a problem, since if multiple counters overflow multiple bits
> will be set and its (afaict) ambiguous which event is for which counter.

When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
automatically. You may see multiple STATUS bits in the PEBS record
if two PEBS counters overflow at exactly the same time, but that is no different
from the threshold==1 case (and it is relatively rare).

So the threshold > 1 and the threshold == 1 case work identically.

> So until you can give an official Intel answer on how all this demuxing
> is supposed to work and be correct this patch set isn't moving anywhere.

FWIW a patch signed off by intel.com is an official Intel statement.

> 
> > To supply a TID it
> > requires flushing on context switch. It can however supply the IP
> 
> On SNB+, previous to SNB it would need to have precise==1. I've seen no
> such logic in. Instead you seem to artificially limit it to SNB+, for no
> apparent reason to me.

Only tested on Sandy Bridge+

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 12:54     ` Stephane Eranian
@ 2014-05-28 15:02       ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 15:02 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Yan, Zheng, LKML, mingo, ak


On Wed, May 28, 2014 at 02:54:25PM +0200, Stephane Eranian wrote:
> On Wed, May 28, 2014 at 10:10 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, May 28, 2014 at 02:18:09PM +0800, Yan, Zheng wrote:
> >> PEBS always had the capability to log samples to its buffers without
> >> an interrupt. Traditionally perf has not used this but always set the
> >> PEBS threshold to one.
> >>
> >> For the common cases we still need to use the PMI because the PEBS
> >> hardware has various limitations. The biggest one is that it can not
> >> supply a callgraph. It also requires setting a fixed period, as the
> >> hardware does not support adaptive period. Another issue is that it
> >> cannot supply a time stamp and some other options.
> >
> > So the reason I've never done this is because Intel has never fully
> > explained the demuxing of pebs events.
> >
> > In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
> > confirmed to me that that is a direct copy of the similarly named MSR at
> > the time of the PEBS assist.
> >
> > This is a problem, since if multiple counters overflow multiple bits
> > will be set and its (afaict) ambiguous which event is for which counter.
> >
> I am not sure how having only one entry in the PEBS buffer solves this.
> I think PEBS will create only one entry if multiple counters overflow
> simultaneously. 

For events that are not exactly simultaneous it narrows the window in which
we can have another event overflow and raise its bit, because it will
immediately raise the PMI and disable the PMU.

Remember, that status bit gets raised when the counter overflows, but
the PEBS assist, and therefore the hardware reset, can take a long while
to actually happen. So there are fairly large windows here where multiple
bits are set.

And if you have the auto-reload, does that clear the status bit again?
Supposing it does (it's the sane thing to do), you can actually have 3
bits set, one for an event that hasn't even had a PEBS assist yet.

So while the problem still exists for a single event, it's much worse if
you just let the thing run.

> That OVFL_STATUS bitmask will have multiple bits
> set. I understand the problem in perf_events because you need to
> assign a sample to an event and not all events may record the same
> info in the sampling buffer.

Right, so you raise the issue that a PEBS assist triggered by two events
on the exact same cycle will only create one record with two bits set. That's
worse, because it's indistinguishable from the case where there are two
separate but nearby events recorded, one of which also has two bits set
(because the second counter did overflow but no assist had triggered yet).

So here we have two distinct scenarios in which multiple bits are set and
no way to disambiguate.

All in all, it's a complete and utter trainwreck, and the worst part is
that I've raised this issue multiple times, starting some 5 years ago,
and nothing has happened afaik.



* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 14:58     ` Andi Kleen
@ 2014-05-28 15:24       ` Stephane Eranian
  2014-05-28 16:51         ` Andi Kleen
  2014-05-28 15:35       ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2014-05-28 15:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Peter Zijlstra, Yan, Zheng, LKML, mingo

On Wed, May 28, 2014 at 4:58 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
>> confirmed to me that that is a direct copy of the similarly named MSR at
>> the time of the PEBS assist.
>
> That is correct.
>
>>
>> This is a problem, since if multiple counters overflow multiple bits
>> will be set and its (afaict) ambiguous which event is for which counter.
>
> When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
> automatically. You may see multiple STATUS bits in the PEBS record
> if two PEBS counters overflow at exactly the same time, but that is no different
> from the threshold==1 case (and relatively rare)
>
> So the threshold > 1 and the threshold == 1 case work identically.
>
That's what I was trying to get to in my msg.
And I think today you will generate up to 4 SAMPLE records if all
PEBS events overflowed at the same time, and this is fine based on
what I see in intel_pmu_drain_pebs_nhm().

The only part I don't quite follow here is this:
                      if (__test_and_set_bit(bit, (unsigned long *)&status))
                                continue;

Which seems to indicate the code is making sure each counter is
processed only once. But it can only be processed once if you have
only one record. And if you have multiple, you want to be able to
handle the same counter multiple times, at least once per PEBS
record. So I am a bit confused about this test.

>> So until you can give an official Intel answer on how all this demuxing
>> is supposed to work and be correct this patch set isn't moving anywhere.
>
> FWIW a patch signed off by intel.com is an official Intel statement.
>
>>
>> > To supply a TID it
>> > requires flushing on context switch. It can however supply the IP
>>
>> On SNB+, previous to SNB it would need to have precise==1. I've seen no
>> such logic in. Instead you seem to artificially limit it to SNB+, for no
>> apparent reason to me.
>
> Only tested on Sandy Bridge+
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 14:58     ` Andi Kleen
  2014-05-28 15:24       ` Stephane Eranian
@ 2014-05-28 15:35       ` Peter Zijlstra
  2014-05-28 16:08         ` Andi Kleen
  1 sibling, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 15:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Yan, Zheng, linux-kernel, mingo, eranian


On Wed, May 28, 2014 at 07:58:31AM -0700, Andi Kleen wrote:
> > In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
> > confirmed to me that that is a direct copy of the similarly named MSR at
> > the time of the PEBS assist.
> 
> That is correct.
> 
> > 
> > This is a problem, since if multiple counters overflow multiple bits
> > will be set and its (afaict) ambiguous which event is for which counter.
> 
> When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
> automatically.

That's an ambiguous statement; did you mean to say a PEBS enabled
counter will not raise its bit in GLOBAL_STATUS on counter overflow?

Because if it does raise it, but then clears it again, it's raised for a
short while and might be observed.

Anyway, you've still not entirely explained the full life cycle of those
bits in excruciating detail, please do so now.

Also, try and get it included in the SDM.

> > So until you can give an official Intel answer on how all this demuxing
> > is supposed to work and be correct this patch set isn't moving anywhere.
> 
> FWIW a patch signed off by intel.com is an official Intel statement.

Is that the reason they have such vague and non-committal Changelogs? Be
detailed and explicit.

On that same note: can someone shoot whoever writes the new SDM bits?
They're nearly impossible to comprehend; it's like a patent lawyer was
involved, with the end result of having a text explicitly engineered to
not be understood :-(

> > > To supply a TID it
> > > requires flushing on context switch. It can however supply the IP
> > 
> > On SNB+, previous to SNB it would need to have precise==1. I've seen no
> > such logic in. Instead you seem to artificially limit it to SNB+, for no
> > apparent reason to me.
> 
> Only tested on Sandy Bridge+

That's just ... /me lacking words.. the worst reason possible for
arbitrary limits, esp. since you guys actually have the hardware to test
older chips.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible
  2014-05-28 14:46     ` Andi Kleen
@ 2014-05-28 15:36       ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 15:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Yan, Zheng, linux-kernel, mingo, eranian


On Wed, May 28, 2014 at 07:46:43AM -0700, Andi Kleen wrote:
> On Wed, May 28, 2014 at 09:59:37AM +0200, Peter Zijlstra wrote:
> > On Wed, May 28, 2014 at 02:18:08PM +0800, Yan, Zheng wrote:
> > > When a fixed period is specified, this patch make perf use the PEBS
> > > auto reload mechanism. This makes normal profiling faster, because
> > > it avoids one costly MSR write in the PMI handler.
> > > 
> > 
> > You can't do this unconditionally. Like I think you mention in the next
> > patch, we need the PMI to fix up lots of cases.
> 
> There are two cases:
> 
> (1) Fixed period specified with PEBS.
> The PMI still runs with threshold=1, but we can use the PEBS
> auto-reload mechanism to automatically rewrite the counter which is
> faster.
> 
> (2) Free running mode with more constraints
> 
> This is case (1)

Sounds like something that should've been made clear in the Changelog.
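
A minimal sketch of what case (1) above amounts to, assuming the reload
value goes into the DS area's pebs_event_reset[] slot (names indicative
only, not the actual patch):

static void pebs_setup_auto_reload(struct cpu_hw_events *cpuc,
				   struct perf_event *event, int idx)
{
	struct debug_store *ds = cpuc->ds;

	if (!event->attr.freq) {
		/* Fixed period: hardware reloads the counter after each assist. */
		ds->pebs_event_reset[idx] =
			(u64)(-event->hw.sample_period) & x86_pmu.cntval_mask;
	} else {
		/* Adaptive period: leave the reload at 0, the PMI rewrites the counter. */
		ds->pebs_event_reset[idx] = 0;
	}
}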


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 15:35       ` Peter Zijlstra
@ 2014-05-28 16:08         ` Andi Kleen
  2014-05-28 17:05           ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 16:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian

On Wed, May 28, 2014 at 05:35:48PM +0200, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 07:58:31AM -0700, Andi Kleen wrote:
> > > In particular, the 0x90 offset (IA32_PERF_GLOBAL_STATUS). Intel once
> > > confirmed to me that that is a direct copy of the similarly named MSR at
> > > the time of the PEBS assist.
> > 
> > That is correct.
> > 
> > > 
> > > This is a problem, since if multiple counters overflow multiple bits
> > > will be set and its (afaict) ambiguous which event is for which counter.
> > 
> > When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
> > automatically.
> 
> That's an ambiguous statement; did you mean to say a PEBS enabled
> counter will not raise its bit in GLOBAL_STATUS on counter overflow?

Let's revisit how PEBS works:

- The counter overflows and sets the GLOBAL_STATUS bit
- The PEBS assist is armed
- The counter triggers again
- The PEBS assist fires and delivers a PEBS record
- Finally it clears the GLOBAL_STATUS
- When the threshold is reached it raises a PMI

So the GLOBAL_STATUS bit is visible between the first overflow and the end 
of the PEBS record delivery.

> 
> Because if it does raise it, but then clears it again, its raised for a
> short while and might be observed.

That's correct. 

But it's no different than with threshold 1 and relatively rare.

-andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 15:24       ` Stephane Eranian
@ 2014-05-28 16:51         ` Andi Kleen
  2014-05-28 17:05           ` Stephane Eranian
  2014-05-28 17:09           ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 16:51 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Peter Zijlstra, Yan, Zheng, LKML, mingo

> The only part I don't quite follow here is this:
>                       if (__test_and_set_bit(bit, (unsigned long *)&status))
>                                 continue;
> 
> Which seems to indicate the code is making sure each counter is
> processed only once. But it can only be processed once, if you have
> only one record. And if you have multiple, you want to be able to
> handle the same counter multiple times, at least once perf PEBS
> record. So I am a bit confused about this test.

Each PEBS record is only for a single counter overflow. So it
always should only be a single perf event.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 16:08         ` Andi Kleen
@ 2014-05-28 17:05           ` Peter Zijlstra
  2014-05-28 17:25             ` Andi Kleen
  2014-05-28 19:28             ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 17:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Yan, Zheng, linux-kernel, mingo, eranian


On Wed, May 28, 2014 at 09:08:47AM -0700, Andi Kleen wrote:
> On Wed, May 28, 2014 at 05:35:48PM +0200, Peter Zijlstra wrote:
> > On Wed, May 28, 2014 at 07:58:31AM -0700, Andi Kleen wrote:

> > > When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
> > > automatically.
> > 
> > That's an ambiguous statement; did you mean to say a PEBS enabled
> > counter will not raise its bit in GLOBAL_STATUS on counter overflow?
> 
> Let's revisit how PEBS works:
> 
> - The counter overflows and sets the GLOBAL_STATUS bit
> - The PEBS assist is armed
> - The counter triggers again
> - The PEBS assist fires and delivers a PEBS record
> - Finally it clears the GLOBAL_STATUS
> - When the threshold is reached it raises an PMI
> 
> So the GLOBAL_STATUS bit is visible between the first overflow and the end 
> of the PEBS record delivery.

OK, so that's something entirely different from what you initially said,
but it is what I thought it did -- you said that it clears on overflow,
but it clears after recording.

Let's denote the above steps as A-F and then consider the following
scenario (hmm, while drawing the states I ended up with something
different than I thought):

	Counter0	Counter1

	 0A
	 0B

			 1A
			 1B

	 0C-E
	 0F

<PMI>
			 1C-E
			 1F


If we get the PMI (where denoted) we can actually reconstruct which
event triggered, by looking at which bit(s) flipped between the recorded
state and the current state (due to E coming before F)

Now, admittedly we don't actually do that, but the below patch
implements that (I think; it's not been near a compiler).

Hmm,.. .oO I think we might be able to do something similar with
multi-events, look at the next record's ->state, and use the current MSR
value when there's no next record.

Thoughts?

---
 arch/x86/kernel/cpu/perf_event.h          |  2 +-
 arch/x86/kernel/cpu/perf_event_intel.c    |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_ds.c | 21 +++++++--------------
 3 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bdd974b..a71f6bd05f19 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -445,7 +445,7 @@ struct x86_pmu {
 			pebs_active	:1,
 			pebs_broken	:1;
 	int		pebs_record_size;
-	void		(*drain_pebs)(struct pt_regs *regs);
+	void		(*drain_pebs)(struct pt_regs *regs, u64 status);
 	struct event_constraint *pebs_constraints;
 	void		(*pebs_aliases)(struct perf_event *event);
 	int 		max_pebs_events;
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa62af5..3371310f200e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1386,7 +1386,7 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 	 */
 	if (__test_and_clear_bit(62, (unsigned long *)&status)) {
 		handled++;
-		x86_pmu.drain_pebs(regs);
+		x86_pmu.drain_pebs(regs, status);
 	}
 
 	/*
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 980970cb744d..0ad62addff7f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -953,7 +953,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 		x86_pmu_stop(event, 0);
 }
 
-static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
+static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, u64 status)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
@@ -994,13 +994,12 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
 	__intel_pmu_pebs_event(event, iregs, at);
 }
 
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
+static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, u64 status)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct debug_store *ds = cpuc->ds;
 	struct perf_event *event = NULL;
 	void *at, *top;
-	u64 status = 0;
 	int bit;
 
 	if (!x86_pmu.pebs_active)
@@ -1024,28 +1023,22 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
 
 	for (; at < top; at += x86_pmu.pebs_record_size) {
 		struct pebs_record_nhm *p = at;
+		u64 rec_status = (p->status ^ status) & (cpuc->pebs_enabled & 0xFFFFFFFF);
 
-		for_each_set_bit(bit, (unsigned long *)&p->status,
+		for_each_set_bit(bit, (unsigned long *)&rec_status,
 				 x86_pmu.max_pebs_events) {
 			event = cpuc->events[bit];
 			if (!test_bit(bit, cpuc->active_mask))
 				continue;
 
-			WARN_ON_ONCE(!event);
-
-			if (!event->attr.precise_ip)
+			if (WARN_ON_ONCE(!event))
 				continue;
 
-			if (__test_and_set_bit(bit, (unsigned long *)&status))
+			if (!event->attr.precise_ip)
 				continue;
 
-			break;
+			__intel_pmu_pebs_event(event, iregs, at);
 		}
-
-		if (!event || bit >= x86_pmu.max_pebs_events)
-			continue;
-
-		__intel_pmu_pebs_event(event, iregs, at);
 	}
 }
 


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 16:51         ` Andi Kleen
@ 2014-05-28 17:05           ` Stephane Eranian
  2014-05-28 17:10             ` Peter Zijlstra
  2014-05-28 17:12             ` Andi Kleen
  2014-05-28 17:09           ` Peter Zijlstra
  1 sibling, 2 replies; 33+ messages in thread
From: Stephane Eranian @ 2014-05-28 17:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Peter Zijlstra, Yan, Zheng, LKML, mingo

On Wed, May 28, 2014 at 6:51 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> The only part I don't quite follow here is this:
>>                       if (__test_and_set_bit(bit, (unsigned long *)&status))
>>                                 continue;
>>
>> Which seems to indicate the code is making sure each counter is
>> processed only once. But it can only be processed once, if you have
>> only one record. And if you have multiple, you want to be able to
>> handle the same counter multiple times, at least once perf PEBS
>> record. So I am a bit confused about this test.
>
> Each PEBS record is only for a single counter overflow. So it
> always should only be a single perf event.
>
So, you're telling me this is a sanity check. That p->status can
only have one bit set. Somehow that's not how I recall it working.

The point is that a single PEBS record is enough for multiple
events when the overflows occur simultaneously because they
all get the same machine state which is correct.  A single entry
also saves space in the buffer.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 16:51         ` Andi Kleen
  2014-05-28 17:05           ` Stephane Eranian
@ 2014-05-28 17:09           ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 17:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stephane Eranian, Yan, Zheng, LKML, mingo


On Wed, May 28, 2014 at 09:51:44AM -0700, Andi Kleen wrote:
> > The only part I don't quite follow here is this:
> >                       if (__test_and_set_bit(bit, (unsigned long *)&status))
> >                                 continue;
> > 
> > Which seems to indicate the code is making sure each counter is
> > processed only once. But it can only be processed once, if you have
> > only one record. And if you have multiple, you want to be able to
> > handle the same counter multiple times, at least once perf PEBS
> > record. So I am a bit confused about this test.
> 
> Each PEBS record is only for a single counter overflow. So it
> always should only be a single perf event.

OK, so what Stephane said, that two counters having their assist on the
exact same cycle results in but a single record, is false? That would be
good.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:05           ` Stephane Eranian
@ 2014-05-28 17:10             ` Peter Zijlstra
  2014-05-28 17:12             ` Andi Kleen
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 17:10 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Andi Kleen, Yan, Zheng, LKML, mingo


On Wed, May 28, 2014 at 07:05:51PM +0200, Stephane Eranian wrote:
> On Wed, May 28, 2014 at 6:51 PM, Andi Kleen <ak@linux.intel.com> wrote:
> >> The only part I don't quite follow here is this:
> >>                       if (__test_and_set_bit(bit, (unsigned long *)&status))
> >>                                 continue;
> >>
> >> Which seems to indicate the code is making sure each counter is
> >> processed only once. But it can only be processed once, if you have
> >> only one record. And if you have multiple, you want to be able to
> >> handle the same counter multiple times, at least once perf PEBS
> >> record. So I am a bit confused about this test.
> >
> > Each PEBS record is only for a single counter overflow. So it
> > always should only be a single perf event.
> >
> So, you're telling me this is a sanity check. That p->status can
> only have one bit set. Somehow that's not how I recall it working.
> 
> The point is that a single PEBS record is enough for multiple
> events when the overflows occur simultaneously because they
> all get the same machine state which is correct.  A single entry
> also saves space in the buffer.

Confusion reigns.. now if only someone would write all this down in a
coherent manner and give it an official Intel stamp of approval ;-)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:05           ` Stephane Eranian
  2014-05-28 17:10             ` Peter Zijlstra
@ 2014-05-28 17:12             ` Andi Kleen
  2014-05-28 17:19               ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 17:12 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Peter Zijlstra, Yan, Zheng, LKML, mingo

> So, you're telling me this is a sanity check. That p->status can
> only have one bit set. Somehow that's not how I recall it working.

It can have multiple bits set. We don't know for sure which one
it is for, but we should only deliver it for one anyway.

> The point is that a single PEBS record is enough for multiple
> events when the overflows occur simultaneously because they
> all get the same machine state which is correct.  A single entry
> also saves space in the buffer.

The CPU will generate multiple PEBS records in this case.
So if we delivered it for all of them you would overcount by up to 4x.

[Again this is a very unlikely situation. Normally counters
are not in lock step]

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:12             ` Andi Kleen
@ 2014-05-28 17:19               ` Peter Zijlstra
  2014-05-28 17:45                 ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 17:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stephane Eranian, Yan, Zheng, LKML, mingo


On Wed, May 28, 2014 at 10:12:43AM -0700, Andi Kleen wrote:
> > So, you're telling me this is a sanity check. That p->status can
> > only have one bit set. Somehow that's not how I recall it working.
> 
> It can have multiple bits set. We don't know for sure for which
> it is, but we should only deliver it for one anyways.
> 
> > The point is that a single PEBS record is enough for multiple
> > events when the overflows occur simultaneously because they
> > all get the same machine state which is correct.  A single entry
> > also saves space in the buffer.
> 
> The CPU will generate multiple PEBS records in this case.
> So if we delivered it for all you would overcount by factor 4x
> 
> [Again this is a very unlikely situation. Normally counters
> are not in lock step]

Well, the thing is, the delay between overflow and the assist being
armed can be quite a few cycles, and then waiting for the next event to
trigger can again be quite a few cycles.

So the window is fairly large for another to slip in, esp if you have a
fast (say inst-ret) and a slow (say br-misp) event.

Also, it really blows this never got addressed, even though I've been
complaining about this for years.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:05           ` Peter Zijlstra
@ 2014-05-28 17:25             ` Andi Kleen
  2014-05-28 17:40               ` Stephane Eranian
  2014-05-28 19:28             ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 17:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian

On Wed, May 28, 2014 at 07:05:31PM +0200, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 09:08:47AM -0700, Andi Kleen wrote:
> > On Wed, May 28, 2014 at 05:35:48PM +0200, Peter Zijlstra wrote:
> > > On Wed, May 28, 2014 at 07:58:31AM -0700, Andi Kleen wrote:
> 
> > > > When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
> > > > automatically.
> > > 
> > > That's an ambiguous statement; did you mean to say a PEBS enabled
> > > counter will not raise its bit in GLOBAL_STATUS on counter overflow?
> > 
> > Let's revisit how PEBS works:
> > 
> > - The counter overflows and sets the GLOBAL_STATUS bit
> > - The PEBS assist is armed
> > - The counter triggers again
> > - The PEBS assist fires and delivers a PEBS record
> > - Finally it clears the GLOBAL_STATUS
> > - When the threshold is reached it raises an PMI
> > 
> > So the GLOBAL_STATUS bit is visible between the first overflow and the end 
> > of the PEBS record delivery.
> 
> OK, so that's something entirely different from what you initially said,
> but it is what I thought it did -- you said that it clears on overflow
> but it clears after recording.

Fair enough. I should have said PEBS assist.

> If we get the PMI (where denoted) we can actually reconstruct which
> event triggered, by looking at which bit(s) flipped between the recorded
> state and the current state (due to E coming before F)

Normally when the PMI PEBS handler runs the GLOBAL_STATUS is already cleared
(as the PEBS assist will execute concurrently during the NMI entry).
Looking at the status won't help you much; it only has the PEBS bit
set.

I don't think we need to do anything. It's a very unlikely situation
in normal operation, as the counter period is very long compared
to the race window.  When it happens very rarely we can ignore it.

It can happen mainly when you program two counters to count exactly
the same thing with the same threshold, but why would you do that?

I guess what would make sense is to add some debug counter somewhere
for this situation (more than one bit set).
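
A minimal sketch of such a counter, assuming a per-CPU variable and a helper
called from the drain path (both invented for illustration):

static DEFINE_PER_CPU(u64, pebs_ambiguous_records);

static void pebs_count_ambiguous(struct cpu_hw_events *cpuc,
				 struct pebs_record_nhm *p)
{
	u64 pebs_status = p->status & cpuc->pebs_enabled &
			  ((1ULL << x86_pmu.max_pebs_events) - 1);

	/* More than one candidate counter bit set: ambiguous record. */
	if (hweight64(pebs_status) > 1)
		__this_cpu_inc(pebs_ambiguous_records);
}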

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:25             ` Andi Kleen
@ 2014-05-28 17:40               ` Stephane Eranian
  2014-05-28 17:47                 ` Andi Kleen
  0 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2014-05-28 17:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Peter Zijlstra, Yan, Zheng, LKML, mingo

On Wed, May 28, 2014 at 7:25 PM, Andi Kleen <ak@linux.intel.com> wrote:
> On Wed, May 28, 2014 at 07:05:31PM +0200, Peter Zijlstra wrote:
>> On Wed, May 28, 2014 at 09:08:47AM -0700, Andi Kleen wrote:
>> > On Wed, May 28, 2014 at 05:35:48PM +0200, Peter Zijlstra wrote:
>> > > On Wed, May 28, 2014 at 07:58:31AM -0700, Andi Kleen wrote:
>>
>> > > > When a PEBS counter overflows it clears its respective GLOBAL_STATUS bit
>> > > > automatically.
>> > >
>> > > That's an ambiguous statement; did you mean to say a PEBS enabled
>> > > counter will not raise its bit in GLOBAL_STATUS on counter overflow?
>> >
>> > Let's revisit how PEBS works:
>> >
>> > - The counter overflows and sets the GLOBAL_STATUS bit
>> > - The PEBS assist is armed
>> > - The counter triggers again
>> > - The PEBS assist fires and delivers a PEBS record
>> > - Finally it clears the GLOBAL_STATUS
>> > - When the threshold is reached it raises an PMI
>> >
>> > So the GLOBAL_STATUS bit is visible between the first overflow and the end
>> > of the PEBS record delivery.
>>
>> OK, so that's something entirely different from what you initially said,
>> but it is what I thought it did -- you said that it clears on overflow
>> but it clears after recording.
>
> Fair enough. I should have said PEBS assist.
>
>> If we get the PMI (where denoted) we can actually reconstruct which
>> event triggered, by looking at which bit(s) flipped between the recorded
>> state and the current state (due to E coming before F)
>
> Normally when the PMI PEBS handler runs the GLOBAL_STATUS is already cleared
> (as the PEBS assist will execute concurrently during the NMI entry)
> Looking at the status won't help you much, it only has the PEBS bit
> set.
>
> I don't think we need to do anything. It's a very unlikely situation
> in normal operation, as the counter period is very long compared
> to the race window.  When it happens very rarely we can ignore it.
>
> It can happen mainly when you program two counters to count exactly
> the same thing with the same threshold, but why would you do that?
>
That's a plausible scenario if you consider two distinct sessions of a tool,
i.e., two users running perf top or perf record with some precise events.

> I guess what would make sense is to add some debug counter somewhere
> for this situation (more than one bit set)
>
> -Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:19               ` Peter Zijlstra
@ 2014-05-28 17:45                 ` Andi Kleen
  2014-05-28 17:49                   ` Peter Zijlstra
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 17:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Stephane Eranian, Yan, Zheng, LKML, mingo

> So the window is fairly large for another to slip in, esp if you have a
> fast (say inst-ret) and a slow (say br-misp) event.

Yes, but unless you have really small periods it's still very unlikely
that they synchronize, so it will only happen very rarely.  And the
rare cases won't affect the statistics very much.

Even with multi-pebs the min period is 1000, which is several
orders of magnitude larger than the typical window.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:40               ` Stephane Eranian
@ 2014-05-28 17:47                 ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2014-05-28 17:47 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: Peter Zijlstra, Yan, Zheng, LKML, mingo

> That's a plausible scenario if you consider two distinct sessions of a tool,
> i.e., two users running perf top or perf record some precise events.

Then they will start at different times (if they are not on different cores), 
so they're still not synchronized and unlikely to meet.

|----< 1000+ > -----|
A
         |------< 1000+> ------|
         B

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:45                 ` Andi Kleen
@ 2014-05-28 17:49                   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 17:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stephane Eranian, Yan, Zheng, LKML, mingo


On Wed, May 28, 2014 at 10:45:45AM -0700, Andi Kleen wrote:
> > So the window is fairly large for another to slip in, esp if you have a
> > fast (say inst-ret) and a slow (say br-misp) event.
> 
> yes, but unless you have really small periods it's still very unlikely
> that they synchronize, so it will only happen very rarely.  And the
> rare cases won't affect the statistics very much.

Sure, but it might be an information leak... Not sure how severe these
are, but again.. it blows to have to worry about this when it would've
been 'trivial' to actually fix.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold
  2014-05-28 17:05           ` Peter Zijlstra
  2014-05-28 17:25             ` Andi Kleen
@ 2014-05-28 19:28             ` Peter Zijlstra
  1 sibling, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 19:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Yan, Zheng, linux-kernel, mingo, eranian


On Wed, May 28, 2014 at 07:05:31PM +0200, Peter Zijlstra wrote:
> If we get the PMI (where denoted) we can actually reconstruct which
> event triggered, by looking at which bit(s) flipped between the recorded
> state and the current state (due to E coming before F)
> 

Just a bit flip (xor) isn't enough; we need to find the bit that flipped
off, so the condition gets slightly more complicated.
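
A minimal sketch of that condition, assuming we compare the status copy in
the PEBS record against a live read of GLOBAL_STATUS from the PMI
(illustrative only, not the follow-up patch):

static u64 pebs_bits_flipped_off(u64 record_status, u64 current_status,
				 u64 pebs_enabled)
{
	/* Set when the record was written, clear now: that assist completed. */
	return record_status & ~current_status & pebs_enabled;
}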


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2014-05-28 19:29 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-28  6:18 [RFC PATCH 0/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
2014-05-28  6:18 ` [RFC PATCH 1/7] perf, core: Add all PMUs to pmu_idr Yan, Zheng
2014-05-28  6:59   ` Peter Zijlstra
2014-05-28  6:18 ` [RFC PATCH 2/7] perf, core: introduce pmu context switch callback Yan, Zheng
2014-05-28  6:18 ` [RFC PATCH 3/7] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
2014-05-28  6:18 ` [RFC PATCH 4/7] tools, perf: Allow the user to disable time stamps Yan, Zheng
2014-05-28  6:18 ` [RFC PATCH 5/7] perf, x86: use the PEBS auto reload mechanism when possible Yan, Zheng
2014-05-28  7:59   ` Peter Zijlstra
2014-05-28 14:46     ` Andi Kleen
2014-05-28 15:36       ` Peter Zijlstra
2014-05-28  6:18 ` [RFC PATCH 6/7] perf, x86: large PEBS interrupt threshold Yan, Zheng
2014-05-28  8:10   ` Peter Zijlstra
2014-05-28 12:54     ` Stephane Eranian
2014-05-28 15:02       ` Peter Zijlstra
2014-05-28 14:58     ` Andi Kleen
2014-05-28 15:24       ` Stephane Eranian
2014-05-28 16:51         ` Andi Kleen
2014-05-28 17:05           ` Stephane Eranian
2014-05-28 17:10             ` Peter Zijlstra
2014-05-28 17:12             ` Andi Kleen
2014-05-28 17:19               ` Peter Zijlstra
2014-05-28 17:45                 ` Andi Kleen
2014-05-28 17:49                   ` Peter Zijlstra
2014-05-28 17:09           ` Peter Zijlstra
2014-05-28 15:35       ` Peter Zijlstra
2014-05-28 16:08         ` Andi Kleen
2014-05-28 17:05           ` Peter Zijlstra
2014-05-28 17:25             ` Andi Kleen
2014-05-28 17:40               ` Stephane Eranian
2014-05-28 17:47                 ` Andi Kleen
2014-05-28 19:28             ` Peter Zijlstra
2014-05-28  6:18 ` [RFC PATCH 7/7] perf, x86: drain PEBS buffer during context switch Yan, Zheng
2014-05-28  8:12   ` Peter Zijlstra
