* [PATCH 00/14] perf, x86: Haswell LBR call stack support
@ 2014-01-03  5:47 Yan, Zheng
  2014-01-03  5:47 ` [PATCH 01/14] perf, x86: Reduce lbr_sel_map size Yan, Zheng
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

For many profiling tasks we need the callgraph. For example we often
need to see the caller of a lock or the caller of a memcpy or other
library function to actually tune the program. Frame pointer unwinding
is efficient and works well. But frame pointers are off by default in
64-bit code (and with modern 32-bit gccs), so there are many binaries
around that do not use frame pointers. Profiling unchanged production
code is very useful in practice. On some CPUs the frame pointer also
has a high cost. Dwarf2 unwinding also does not always work and is
extremely slow (up to 20% overhead).

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility provides an alternative way to
get callgraphs. It has some limitations too, but it should work in most
cases and is significantly faster than dwarf unwinding. Frame pointer
unwinding is still the best default, but the LBR call stack is a good
alternative when nothing else works.
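
To illustrate the behaviour described above, here is a small user-space
toy model (illustration only, not the hardware or kernel interface; the
names are made up for the sketch): calls push an entry, returns pop the
most recent one, and when the real call stack is deeper than the LBR
only the most recent entries survive.

/*
 * Toy model of the LBR call stack: calls push an entry, returns pop
 * the most recent one, and the fixed-size ring means only the last
 * LBR_NR callers are kept when the call stack is deeper than that.
 */
#include <stdio.h>

#define LBR_NR 16			/* Haswell has 16 LBR entries */

static unsigned long lbr_from[LBR_NR];
static int tos = -1;			/* top of stack, like MSR_LBR_TOS */
static int depth;			/* number of valid entries */

static void lbr_on_call(unsigned long call_site)
{
	tos = (tos + 1) % LBR_NR;	/* ring buffer: oldest entry is lost */
	lbr_from[tos] = call_site;
	if (depth < LBR_NR)
		depth++;
}

static void lbr_on_return(void)
{
	if (depth) {			/* pop the most recent call site */
		tos = (tos + LBR_NR - 1) % LBR_NR;
		depth--;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 20; i++)	/* 20 nested calls: 4 oldest are lost */
		lbr_on_call(0x1000 + i);
	lbr_on_return();		/* one return pops the newest entry */

	for (i = 0; i < depth; i++)	/* remaining callers, newest first */
		printf("caller %2d: %#lx\n", i,
		       lbr_from[(tos + LBR_NR - i) % LBR_NR]);
	return 0;
}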

This patch series adds LBR call stack support. Users can enable/disable
this through a sysfs attribute file in the CPU PMU directory:
 echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack

When profiling bc(1) on Fedora 19:
 echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd

If this feature is enabled, perf report output looks like:
    50.36%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start

    33.66%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult
                     bc_divide
                     execute
                     run_code
                     yyparse
                     main
                     __libc_start_main
                     _start

     7.62%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                    |
                    |--99.89%-- 0x2000186a8
                     --0.11%-- [...]

     6.83%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub
                    |
                    |--99.94%-- bc_add
                    |          execute
                    |          run_code
                    |          yyparse
                    |          main
                    |          __libc_start_main
                    |          _start
                     --0.06%-- [...]

     0.46%       bc  libc-2.17.so       [.] __memset_sse2
                 |
                 --- __memset_sse2
                    |
                    |--54.13%-- bc_new_num
                    |          |
                    |          |--51.00%-- bc_divide
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |          |--30.46%-- _bc_do_sub
                    |          |          bc_add
                    |          |          execute
                    |          |          run_code
                    |          |          yyparse
                    |          |          main
                    |          |          __libc_start_main
                    |          |          _start
                    |          |
                    |           --18.55%-- _bc_do_add
                    |                     bc_add
                    |                     execute
                    |                     run_code
                    |                     yyparse
                    |                     main
                    |                     __libc_start_main
                    |                     _start
                    |
                     --45.87%-- bc_divide
                               execute
                               run_code
                               yyparse
                               main
                               __libc_start_main
                               _start

If this feature is disabled, perf report output looks like:
    50.49%       bc  bc                 [.] bc_divide
                 |
                 --- bc_divide

    33.57%       bc  bc                 [.] _one_mult
                 |
                 --- _one_mult

     7.61%       bc  bc                 [.] _bc_do_add
                 |
                 --- _bc_do_add
                     0x2000186a8

     6.88%       bc  bc                 [.] _bc_do_sub
                 |
                 --- _bc_do_sub

     0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
                 |
                 --- __memcpy_ssse3_back

The LBR call stack has the following known limitations:
 - Zero-length calls are not filtered out by hardware
 - Exception handling such as setjmp/longjmp will have calls/returns that
   do not match
 - Pushing a different return address onto the stack will have
   calls/returns that do not match
 - If the call stack is deeper than the LBR, only the most recent entries
   are captured

Changes since the previous version:
 - split the change into more patches
 - introduce a context switch callback and use it to flush the LBR
 - use the context switch callback to save/restore the LBR
 - dynamically allocate the memory area for storing the LBR stack, and
   always switch the memory area during context switch
 - disable this feature by default
 - more description in the change logs



* [PATCH 01/14] perf, x86: Reduce lbr_sel_map size
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
@ 2014-01-03  5:47 ` Yan, Zheng
  2014-02-05 15:15   ` Stephane Eranian
  2014-01-03  5:47 ` [PATCH 02/14] perf, core: introduce pmu context switch callback Yan, Zheng
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

The index of lbr_sel_map is the bit value of perf branch_sample_type.
PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
4096 bytes. By using the bit shift as the index, we can reduce the
lbr_sel_map size to 40 bytes.
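
For illustration, a stand-alone sketch of the size difference (not the
kernel code; it only assumes 4-byte int entries, which is where the
4096 and 40 byte figures above come from):

/*
 * Illustration of the size saving: indexing by the bit value needs
 * PERF_SAMPLE_BRANCH_MAX (1024) slots, indexing by the bit shift needs
 * only PERF_SAMPLE_BRANCH_MAX_SHIFT (10) slots.
 */
#include <stdio.h>

#define PERF_SAMPLE_BRANCH_MAX_SHIFT	10
#define PERF_SAMPLE_BRANCH_MAX		(1 << PERF_SAMPLE_BRANCH_MAX_SHIFT)

int main(void)
{
	/* the maps hold int entries, as in the kernel code */
	const int old_map[PERF_SAMPLE_BRANCH_MAX] = { 0 };	  /* bit value index */
	const int new_map[PERF_SAMPLE_BRANCH_MAX_SHIFT] = { 0 }; /* bit shift index */

	printf("indexed by bit value: %zu bytes\n", sizeof(old_map)); /* 4096 */
	printf("indexed by bit shift: %zu bytes\n", sizeof(new_map)); /* 40 */
	return 0;
}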

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  4 +++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 50 ++++++++++++++----------------
 include/uapi/linux/perf_event.h            | 42 +++++++++++++++++--------
 3 files changed, 56 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index fd00bb2..745f6fb 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -459,6 +459,10 @@ struct x86_pmu {
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
+enum {
+	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d82d155..1ae2ec5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
 #define LBR_FROM_FLAG_IN_TX    (1ULL << 62)
 #define LBR_FROM_FLAG_ABORT    (1ULL << 61)
 
-#define for_each_branch_sample_type(x) \
-	for ((x) = PERF_SAMPLE_BRANCH_USER; \
-	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
 /*
  * x86control flow change classification
  * x86control flow changes include branches, interrupts, traps, faults
@@ -400,14 +396,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 {
 	struct hw_perf_event_extra *reg;
 	u64 br_type = event->attr.branch_sample_type;
-	u64 mask = 0, m;
-	u64 v;
+	u64 mask = 0, v;
+	int i;
 
-	for_each_branch_sample_type(m) {
-		if (!(br_type & m))
+	for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+		if (!(br_type & (1ULL << i)))
 			continue;
 
-		v = x86_pmu.lbr_sel_map[m];
+		v = x86_pmu.lbr_sel_map[i];
 		if (v == LBR_NOT_SUPP)
 			return -EOPNOTSUPP;
 
@@ -662,33 +658,33 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 /*
  * Map interface branch filters onto LBR filters
  */
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
-	[PERF_SAMPLE_BRANCH_ANY]	= LBR_ANY,
-	[PERF_SAMPLE_BRANCH_USER]	= LBR_USER,
-	[PERF_SAMPLE_BRANCH_KERNEL]	= LBR_KERNEL,
-	[PERF_SAMPLE_BRANCH_HV]		= LBR_IGN,
-	[PERF_SAMPLE_BRANCH_ANY_RETURN]	= LBR_RETURN | LBR_REL_JMP
-					| LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_REL_JMP
+						| LBR_IND_JMP | LBR_FAR,
 	/*
 	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
 	 */
-	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
 	 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
 	/*
 	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
 	 */
-	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
 };
 
-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
-	[PERF_SAMPLE_BRANCH_ANY]	= LBR_ANY,
-	[PERF_SAMPLE_BRANCH_USER]	= LBR_USER,
-	[PERF_SAMPLE_BRANCH_KERNEL]	= LBR_KERNEL,
-	[PERF_SAMPLE_BRANCH_HV]		= LBR_IGN,
-	[PERF_SAMPLE_BRANCH_ANY_RETURN]	= LBR_RETURN | LBR_FAR,
-	[PERF_SAMPLE_BRANCH_ANY_CALL]	= LBR_REL_CALL | LBR_IND_CALL
-					| LBR_FAR,
-	[PERF_SAMPLE_BRANCH_IND_CALL]	= LBR_IND_CALL,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
 };
 
 /* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e1802d6..4d8c438 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -151,20 +151,36 @@ enum perf_event_sample_format {
  * The branch types can be combined, however BRANCH_ANY covers all types
  * of branches and therefore it supersedes all the other types.
  */
+enum perf_branch_sample_type_shift {
+	PERF_SAMPLE_BRANCH_USER_SHIFT		= 0, /* user branches */
+	PERF_SAMPLE_BRANCH_KERNEL_SHIFT		= 1, /* kernel branches */
+	PERF_SAMPLE_BRANCH_HV_SHIFT		= 2, /* hypervisor branches */
+
+	PERF_SAMPLE_BRANCH_ANY_SHIFT		= 3, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT	= 4, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT	= 5, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL_SHIFT	= 6, /* indirect calls */
+	PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT	= 7, /* transaction aborts */
+	PERF_SAMPLE_BRANCH_IN_TX_SHIFT		= 8, /* in transaction */
+	PERF_SAMPLE_BRANCH_NO_TX_SHIFT		= 9, /* not in transaction */
+
+	PERF_SAMPLE_BRANCH_MAX_SHIFT		/* non-ABI */
+};
+
 enum perf_branch_sample_type {
-	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user branches */
-	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel branches */
-	PERF_SAMPLE_BRANCH_HV		= 1U << 2, /* hypervisor branches */
-
-	PERF_SAMPLE_BRANCH_ANY		= 1U << 3, /* any branch types */
-	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 4, /* any call branch */
-	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 5, /* any return branch */
-	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 6, /* indirect calls */
-	PERF_SAMPLE_BRANCH_ABORT_TX	= 1U << 7, /* transaction aborts */
-	PERF_SAMPLE_BRANCH_IN_TX	= 1U << 8, /* in transaction */
-	PERF_SAMPLE_BRANCH_NO_TX	= 1U << 9, /* not in transaction */
-
-	PERF_SAMPLE_BRANCH_MAX		= 1U << 10, /* non-ABI */
+	PERF_SAMPLE_BRANCH_USER         = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+	PERF_SAMPLE_BRANCH_KERNEL       = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+	PERF_SAMPLE_BRANCH_HV           = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+	PERF_SAMPLE_BRANCH_ANY          = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+	PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+	PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+	PERF_SAMPLE_BRANCH_IND_CALL     = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+	PERF_SAMPLE_BRANCH_ABORT_TX     = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_IN_TX        = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_NO_TX        = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+
+	PERF_SAMPLE_BRANCH_MAX          = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
-- 
1.8.4.2



* [PATCH 02/14] perf, core: introduce pmu context switch callback
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
  2014-01-03  5:47 ` [PATCH 01/14] perf, x86: Reduce lbr_sel_map size Yan, Zheng
@ 2014-01-03  5:47 ` Yan, Zheng
  2014-02-05 16:01   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 03/14] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

The callback is invoked when a process is scheduled in or out. It
provides a mechanism for later patches to save/restore the LBR stack.
It can also replace the flush branch stack callback.

To avoid unnecessary overhead, the callback is enabled dynamically.
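
A minimal user-space model of the idea (illustration only; the names
and the single global counter stand in for the per-cpu bookkeeping in
the patch): the callback is registered once, but it is only invoked
while at least one active event has asked for it.

/*
 * Toy model of a dynamically enabled context switch callback.  The
 * global counter stands in for the per-cpu perf_sched_cb_usages
 * counter of the real patch.
 */
#include <stdbool.h>
#include <stdio.h>

struct pmu {
	void (*sched_task)(void *ctx, bool sched_in);
};

static int sched_cb_usages;

static void my_sched_task(void *ctx, bool sched_in)
{
	printf("%s\n", sched_in ? "sched in:  restore or flush PMU state"
				: "sched out: save PMU state");
}

static struct pmu my_pmu = { .sched_task = my_sched_task };

/* called from the context switch path only when the counter is non-zero */
static void perf_pmu_sched_task(bool sched_in)
{
	if (sched_cb_usages && my_pmu.sched_task)
		my_pmu.sched_task(NULL, sched_in);
}

int main(void)
{
	perf_pmu_sched_task(true);	/* counter is 0: callback is skipped */
	sched_cb_usages++;		/* first user: enable the callback */
	perf_pmu_sched_task(false);
	perf_pmu_sched_task(true);
	sched_cb_usages--;		/* last user gone: disable it again */
	perf_pmu_sched_task(true);	/* skipped again */
	return 0;
}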

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c |  7 +++++
 arch/x86/kernel/cpu/perf_event.h |  4 +++
 include/linux/perf_event.h       |  8 ++++++
 kernel/events/core.c             | 60 +++++++++++++++++++++++++++++++++++++++-
 4 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8e13293..6703d17 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1846,6 +1846,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
+static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+{
+	if (x86_pmu.sched_task)
+		x86_pmu.sched_task(ctx, sched_in);
+}
+
 static void x86_pmu_flush_branch_stack(void)
 {
 	if (x86_pmu.flush_branch_stack)
@@ -1879,6 +1885,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.sched_task		= x86_pmu_sched_task,
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 745f6fb..3fdb751 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -417,6 +417,8 @@ struct x86_pmu {
 
 	void		(*check_microcode)(void);
 	void		(*flush_branch_stack)(void);
+	void		(*sched_task)(struct perf_event_context *ctx,
+				      bool sched_in);
 
 	/*
 	 * Intel Arch Perfmon v2+
@@ -675,6 +677,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+
 void intel_pmu_lbr_reset(void);
 
 void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8f4a70f..6a3e603 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -251,6 +251,12 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * PMU callback for context-switches. optional
+	 */
+	void (*sched_task)		(struct perf_event_context *ctx,
+					 bool sched_in);
 };
 
 /**
@@ -546,6 +552,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
+extern void perf_sched_cb_disable(struct pmu *pmu);
+extern void perf_sched_cb_enable(struct pmu *pmu);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
 extern int perf_event_refresh(struct perf_event *event, int refresh);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 89d34f9..d110a23 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -141,6 +141,7 @@ enum event_type_t {
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
 static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -150,6 +151,7 @@ static atomic_t nr_freq_events __read_mostly;
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
 static struct srcu_struct pmus_srcu;
+static struct idr pmu_idr;
 
 /*
  * perf event paranoia level:
@@ -2327,6 +2329,57 @@ unlock:
 	}
 }
 
+void perf_sched_cb_disable(struct pmu *pmu)
+{
+	__get_cpu_var(perf_sched_cb_usages)--;
+}
+
+void perf_sched_cb_enable(struct pmu *pmu)
+{
+	__get_cpu_var(perf_sched_cb_usages)++;
+}
+
+/*
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when the context switch callback is enabled.
+ */
+static void perf_pmu_sched_task(struct task_struct *prev,
+				struct task_struct *next,
+				bool sched_in)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	if (prev == next)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	pmu = idr_find(&pmu_idr, PERF_TYPE_RAW);
+
+	if (pmu && pmu->sched_task) {
+		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+		pmu = cpuctx->ctx.pmu;
+
+		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+		perf_pmu_disable(pmu);
+
+		pmu->sched_task(cpuctx->task_ctx, sched_in);
+
+		perf_pmu_enable(pmu);
+
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
 #define for_each_task_context_nr(ctxn)					\
 	for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
 
@@ -2346,6 +2399,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
 {
 	int ctxn;
 
+	if (__get_cpu_var(perf_sched_cb_usages))
+		perf_pmu_sched_task(task, next, false);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_context_sched_out(task, ctxn, next);
 
@@ -2605,6 +2661,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	/* check for system-wide branch_stack events */
 	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
 		perf_branch_stack_sched_in(prev, task);
+
+	if (__get_cpu_var(perf_sched_cb_usages))
+		perf_pmu_sched_task(prev, task, true);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -6326,7 +6385,6 @@ static void free_pmu_context(struct pmu *pmu)
 out:
 	mutex_unlock(&pmus_lock);
 }
-static struct idr pmu_idr;
 
 static ssize_t
 type_show(struct device *dev, struct device_attribute *attr, char *page)
-- 
1.8.4.2



* [PATCH 03/14] perf, x86: use context switch callback to flush LBR stack
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
  2014-01-03  5:47 ` [PATCH 01/14] perf, x86: Reduce lbr_sel_map size Yan, Zheng
  2014-01-03  5:47 ` [PATCH 02/14] perf, core: introduce pmu context switch callback Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-05 16:34   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

Enable the pmu context switch callback when the LBR is used. Use the
callback to flush the LBR stack when a task is scheduled in. This
allows us to move the code that flushes the LBR stack from the perf
core to perf x86.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           |  7 ---
 arch/x86/kernel/cpu/perf_event.h           |  2 -
 arch/x86/kernel/cpu/perf_event_intel.c     | 14 +-----
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 32 ++++++++-----
 include/linux/perf_event.h                 |  5 ---
 kernel/events/core.c                       | 72 ------------------------------
 6 files changed, 21 insertions(+), 111 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6703d17..69e2095 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1852,12 +1852,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 		x86_pmu.sched_task(ctx, sched_in);
 }
 
-static void x86_pmu_flush_branch_stack(void)
-{
-	if (x86_pmu.flush_branch_stack)
-		x86_pmu.flush_branch_stack();
-}
-
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -1884,7 +1878,6 @@ static struct pmu pmu = {
 	.commit_txn		= x86_pmu_commit_txn,
 
 	.event_idx		= x86_pmu_event_idx,
-	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 	.sched_task		= x86_pmu_sched_task,
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3fdb751..80b8e83 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -150,7 +150,6 @@ struct cpu_hw_events {
 	 * Intel LBR bits
 	 */
 	int				lbr_users;
-	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
@@ -416,7 +415,6 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*flush_branch_stack)(void);
 	void		(*sched_task)(struct perf_event_context *ctx,
 				      bool sched_in);
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0fa4f24..4325bae 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2038,18 +2038,6 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
-static void intel_pmu_flush_branch_stack(void)
-{
-	/*
-	 * Intel LBR does not tag entries with the
-	 * PID of the current task, then we need to
-	 * flush it on ctxsw
-	 * For now, we simply reset it
-	 */
-	if (x86_pmu.lbr_nr)
-		intel_pmu_lbr_reset();
-}
-
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
 PMU_FORMAT_ATTR(ldlat, "config1:0-15");
@@ -2101,7 +2089,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.flush_branch_stack	= intel_pmu_flush_branch_stack,
+	.sched_task		= intel_pmu_lbr_sched_task,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 1ae2ec5..7ff2a99 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -177,24 +177,32 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
-void intel_pmu_lbr_enable(struct perf_event *event)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
 	if (!x86_pmu.lbr_nr)
 		return;
 
 	/*
-	 * Reset the LBR stack if we changed task context to
-	 * avoid data leaks.
+	 * It is necessary to flush the stack on context switch. This happens
+	 * when the branch stack does not tag its entries with the pid of the
+	 * current task.
 	 */
-	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
+	if (sched_in)
 		intel_pmu_lbr_reset();
-		cpuc->lbr_context = event->ctx;
-	}
+}
+
+void intel_pmu_lbr_enable(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (!x86_pmu.lbr_nr)
+		return;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
+	if (cpuc->lbr_users == 1)
+		perf_sched_cb_enable(event->ctx->pmu);
 }
 
 void intel_pmu_lbr_disable(struct perf_event *event)
@@ -207,10 +215,10 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users) {
-		__intel_pmu_lbr_disable();
-		/* avoid stale pointer */
-		cpuc->lbr_context = NULL;
+	if (!cpuc->lbr_users) {
+		perf_sched_cb_disable(event->ctx->pmu);
+		if (cpuc->enabled)
+			__intel_pmu_lbr_disable();
 	}
 }
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6a3e603..96cb88b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -248,11 +248,6 @@ struct pmu {
 	int (*event_idx)		(struct perf_event *event); /*optional */
 
 	/*
-	 * flush branch stack on context-switches (needed in cpu-wide mode)
-	 */
-	void (*flush_branch_stack)	(void);
-
-	/*
 	 * PMU callback for context-switches. optional
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d110a23..aba4d6d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -140,7 +140,6 @@ enum event_type_t {
  */
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 static atomic_t nr_mmap_events __read_mostly;
@@ -2566,65 +2565,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	perf_pmu_rotate_start(ctx->pmu);
 }
 
-/*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
-				       struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
-	unsigned long flags;
-
-	/* no need to flush branch stack if not changing task */
-	if (prev == task)
-		return;
-
-	local_irq_save(flags);
-
-	rcu_read_lock();
-
-	list_for_each_entry_rcu(pmu, &pmus, entry) {
-		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
-		/*
-		 * check if the context has at least one
-		 * event using PERF_SAMPLE_BRANCH_STACK
-		 */
-		if (cpuctx->ctx.nr_branch_stack > 0
-		    && pmu->flush_branch_stack) {
-
-			pmu = cpuctx->ctx.pmu;
-
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
-			perf_pmu_disable(pmu);
-
-			pmu->flush_branch_stack();
-
-			perf_pmu_enable(pmu);
-
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-		}
-	}
-
-	rcu_read_unlock();
-
-	local_irq_restore(flags);
-}
 
 /*
  * Called from scheduler to add the events of the current task
@@ -2658,10 +2598,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
 
-	/* check for system-wide branch_stack events */
-	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
-		perf_branch_stack_sched_in(prev, task);
-
 	if (__get_cpu_var(perf_sched_cb_usages))
 		perf_pmu_sched_task(prev, task, true);
 }
@@ -3226,10 +3162,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
 	if (event->parent)
 		return;
 
-	if (has_branch_stack(event)) {
-		if (!(event->attach_state & PERF_ATTACH_TASK))
-			atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
-	}
 	if (is_cgroup_event(event))
 		atomic_dec(&per_cpu(perf_cgroup_events, cpu));
 }
@@ -6655,10 +6587,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
 	if (event->parent)
 		return;
 
-	if (has_branch_stack(event)) {
-		if (!(event->attach_state & PERF_ATTACH_TASK))
-			atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
-	}
 	if (is_cgroup_event(event))
 		atomic_inc(&per_cpu(perf_cgroup_events, cpu));
 }
-- 
1.8.4.2



* [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (2 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 03/14] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-05 15:40   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 05/14] perf, core: allow pmu specific data for perf task context Yan, Zheng
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

When the call stack feature is enabled, the LBR stack will capture
unfiltered call data normally, but as return instructions are executed,
the last captured branch record is flushed from the on-chip registers
in a last-in first-out (LIFO) manner. Thus, branch information for leaf
functions is not captured, while the call stack information of the main
line execution path is preserved.
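
For illustration, a stand-alone restatement of the constraint the new
software filter enforces (the X86_BR_* values are taken from the patch;
the -EOPNOTSUPP check for a missing lbr_sel_map is omitted from this
sketch): call-stack mode may only be combined with the user/kernel
privilege modifiers, and the call/return branch types are implied
automatically.

/*
 * Sketch of the software filter rule for call-stack mode; only the
 * privilege-level filters may be requested, calls and returns are
 * implied.
 */
#include <errno.h>
#include <stdio.h>

#define X86_BR_USER		(1 << 0)
#define X86_BR_KERNEL		(1 << 1)
#define X86_BR_CALL		(1 << 2)
#define X86_BR_RET		(1 << 3)
#define X86_BR_IND_CALL		(1 << 11)
#define X86_BR_CALL_STACK	(1 << 15)

static int setup_callstack_filter(int *mask)
{
	if (*mask & ~(X86_BR_USER | X86_BR_KERNEL))
		return -EINVAL;		/* only privilege filters allowed */
	*mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
		 X86_BR_CALL_STACK;	/* calls and returns are implied */
	return 0;
}

int main(void)
{
	int ok = X86_BR_USER;			/* user-space call stacks */
	int bad = X86_BR_USER | X86_BR_RET;	/* extra branch-type filter */

	printf("user only  -> %d\n", setup_callstack_filter(&ok));	/* 0 */
	printf("user + ret -> %d\n", setup_callstack_filter(&bad));	/* -EINVAL */
	return 0;
}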

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  7 ++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 98 +++++++++++++++++++++++-------
 3 files changed, 82 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 80b8e83..3ef4b79 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -460,7 +460,10 @@ struct x86_pmu {
 };
 
 enum {
-	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+	PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+	PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
 };
 
 #define x86_add_quirk(func_)						\
@@ -697,6 +700,8 @@ void intel_pmu_lbr_init_atom(void);
 
 void intel_pmu_lbr_init_snb(void);
 
+void intel_pmu_lbr_init_hsw(void);
+
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 4325bae..84a1c09 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2494,7 +2494,7 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 
-		intel_pmu_lbr_init_snb();
+		intel_pmu_lbr_init_hsw();
 
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 7ff2a99..bdd8758 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
 #define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
 #define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
 #define LBR_FAR_BIT		8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT	9 /* enable call stack */
 
 #define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
 #define LBR_USER	(1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
 #define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
 #define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
 #define LBR_FAR		(1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK	(1 << LBR_CALL_STACK_BIT)
 
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
@@ -74,24 +76,25 @@ static enum {
  * x86control flow changes include branches, interrupts, traps, faults
  */
 enum {
-	X86_BR_NONE     = 0,      /* unknown */
-
-	X86_BR_USER     = 1 << 0, /* branch target is user */
-	X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
-
-	X86_BR_CALL     = 1 << 2, /* call */
-	X86_BR_RET      = 1 << 3, /* return */
-	X86_BR_SYSCALL  = 1 << 4, /* syscall */
-	X86_BR_SYSRET   = 1 << 5, /* syscall return */
-	X86_BR_INT      = 1 << 6, /* sw interrupt */
-	X86_BR_IRET     = 1 << 7, /* return from interrupt */
-	X86_BR_JCC      = 1 << 8, /* conditional */
-	X86_BR_JMP      = 1 << 9, /* jump */
-	X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
-	X86_BR_IND_CALL = 1 << 11,/* indirect calls */
-	X86_BR_ABORT    = 1 << 12,/* transaction abort */
-	X86_BR_IN_TX    = 1 << 13,/* in transaction */
-	X86_BR_NO_TX    = 1 << 14,/* not in transaction */
+	X86_BR_NONE		= 0,      /* unknown */
+
+	X86_BR_USER		= 1 << 0, /* branch target is user */
+	X86_BR_KERNEL		= 1 << 1, /* branch target is kernel */
+
+	X86_BR_CALL		= 1 << 2, /* call */
+	X86_BR_RET		= 1 << 3, /* return */
+	X86_BR_SYSCALL		= 1 << 4, /* syscall */
+	X86_BR_SYSRET		= 1 << 5, /* syscall return */
+	X86_BR_INT		= 1 << 6, /* sw interrupt */
+	X86_BR_IRET		= 1 << 7, /* return from interrupt */
+	X86_BR_JCC		= 1 << 8, /* conditional */
+	X86_BR_JMP		= 1 << 9, /* jump */
+	X86_BR_IRQ		= 1 << 10,/* hw interrupt or trap or fault */
+	X86_BR_IND_CALL		= 1 << 11,/* indirect calls */
+	X86_BR_ABORT		= 1 << 12,/* transaction abort */
+	X86_BR_IN_TX		= 1 << 13,/* in transaction */
+	X86_BR_NO_TX		= 1 << 14,/* not in transaction */
+	X86_BR_CALL_STACK	= 1 << 15,/* call stack */
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -135,7 +138,14 @@ static void __intel_pmu_lbr_enable(void)
 		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
-	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+	debugctl |= DEBUGCTLMSR_LBR;
+	/*
+	 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+	 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+	 * may cause superfluous increase/decrease of LBR_TOS.
+	 */
+	if (!cpuc->lbr_sel || !(cpuc->lbr_sel->config & LBR_CALL_STACK))
+		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
 	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
@@ -354,7 +364,7 @@ void intel_pmu_lbr_read(void)
  * - in case there is no HW filter
  * - in case the HW filter has errata or limitations
  */
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 {
 	u64 br_type = event->attr.branch_sample_type;
 	int mask = 0;
@@ -388,11 +398,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 	if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
 		mask |= X86_BR_NO_TX;
 
+	if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+		if (!x86_pmu.lbr_sel_map)
+			return -EOPNOTSUPP;
+		if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+			return -EINVAL;
+		mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+			X86_BR_CALL_STACK;
+	}
+
 	/*
 	 * stash actual user request into reg, it may
 	 * be used by fixup code for some CPU
 	 */
 	event->hw.branch_reg.reg = mask;
+	return 0;
 }
 
 /*
@@ -421,8 +441,11 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	reg = &event->hw.branch_reg;
 	reg->idx = EXTRA_REG_LBR;
 
-	/* LBR_SELECT operates in suppress mode so invert mask */
-	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+	/*
+	 * the first 8 bits (LBR_SEL_MASK) in LBR_SELECT operates
+	 * in suppress mode so invert mask
+	 */
+	reg->config = mask ^ x86_pmu.lbr_sel_mask;
 
 	return 0;
 }
@@ -440,7 +463,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
 	/*
 	 * setup SW LBR filter
 	 */
-	intel_pmu_setup_sw_lbr_filter(event);
+	ret = intel_pmu_setup_sw_lbr_filter(event);
+	if (ret)
+		return ret;
 
 	/*
 	 * setup HW LBR filter, if any
@@ -695,6 +720,19 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
 	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
 };
 
+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
+	[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_RETURN | LBR_CALL_STACK,
+};
+
 /* core */
 void intel_pmu_lbr_init_core(void)
 {
@@ -751,6 +789,20 @@ void intel_pmu_lbr_init_snb(void)
 	pr_cont("16-deep LBR, ");
 }
 
+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
 /* atom */
 void intel_pmu_lbr_init_atom(void)
 {
-- 
1.8.4.2



* [PATCH 05/14] perf, core: allow pmu specific data for perf task context
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (3 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-05 16:57   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 06/14] perf, core: always switch pmu specific data during context switch Yan, Zheng
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

Later patches will use pmu specific data to save the LBR stack.
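
A user-space sketch of the allocation scheme this patch introduces
(illustration only; the size used below is an arbitrary placeholder,
and calloc()/free() stand in for the kzalloc()/call_rcu() used in the
real code):

/*
 * Sketch: the generic context carries an opaque, PMU-sized blob for
 * per-task state such as a saved LBR stack.
 */
#include <stdio.h>
#include <stdlib.h>

struct pmu {
	size_t task_ctx_size;		/* 0 means "no per-task PMU data" */
};

struct perf_event_context {
	void *task_ctx_data;		/* pmu specific data, may be NULL */
};

static struct perf_event_context *alloc_perf_context(const struct pmu *pmu)
{
	struct perf_event_context *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return NULL;
	if (pmu->task_ctx_size) {
		ctx->task_ctx_data = calloc(1, pmu->task_ctx_size);
		if (!ctx->task_ctx_data) {
			free(ctx);
			return NULL;
		}
	}
	return ctx;
}

static void free_perf_context(struct perf_event_context *ctx)
{
	if (!ctx)
		return;
	free(ctx->task_ctx_data);	/* deferred via RCU in the patch */
	free(ctx);
}

int main(void)
{
	struct pmu x86 = { .task_ctx_size = 272 };	/* placeholder size */
	struct perf_event_context *ctx = alloc_perf_context(&x86);

	printf("per-task PMU data %s\n",
	       ctx && ctx->task_ctx_data ? "allocated" : "not allocated");
	free_perf_context(ctx);
	return 0;
}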

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 include/linux/perf_event.h |  5 +++++
 kernel/events/core.c       | 19 ++++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 96cb88b..147f9d3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -252,6 +252,10 @@ struct pmu {
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
 					 bool sched_in);
+	/*
+	 * PMU specific data size
+	 */
+	size_t				task_ctx_size;
 };
 
 /**
@@ -496,6 +500,7 @@ struct perf_event_context {
 	int				pin_count;
 	int				nr_cgroups;	 /* cgroup evts */
 	int				nr_branch_stack; /* branch_stack evt */
+	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index aba4d6d..b6650ab 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -883,6 +883,15 @@ static void get_ctx(struct perf_event_context *ctx)
 	WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
 }
 
+static void free_ctx(struct rcu_head *head)
+{
+	struct perf_event_context *ctx;
+
+	ctx = container_of(head, struct perf_event_context, rcu_head);
+	kfree(ctx->task_ctx_data);
+	kfree(ctx);
+}
+
 static void put_ctx(struct perf_event_context *ctx)
 {
 	if (atomic_dec_and_test(&ctx->refcount)) {
@@ -890,7 +899,7 @@ static void put_ctx(struct perf_event_context *ctx)
 			put_ctx(ctx->parent_ctx);
 		if (ctx->task)
 			put_task_struct(ctx->task);
-		kfree_rcu(ctx, rcu_head);
+		call_rcu(&ctx->rcu_head, free_ctx);
 	}
 }
 
@@ -3020,6 +3029,14 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
 	if (!ctx)
 		return NULL;
 
+	if (task && pmu->task_ctx_size > 0) {
+		ctx->task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
+		if (!ctx->task_ctx_data) {
+			kfree(ctx);
+			return NULL;
+		}
+	}
+
 	__perf_event_init_context(ctx);
 	if (task) {
 		ctx->task = task;
-- 
1.8.4.2



* [PATCH 06/14] perf, core: always switch pmu specific data during context switch
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (4 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 05/14] perf, core: allow pmu specific data for perf task context Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-05 17:19   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 07/14] perf: track number of events that use LBR callstack Yan, Zheng
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

If two tasks were both forked from the same parent task, the events in
their perf task contexts can be the same. Perf core optimizes the
context switch out in this case.

The previous patch introduces pmu specific data. The data is task
specific, so we should switch the data even when the context switch is
optimized out.
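
A minimal illustration of the fix (user-space sketch, not the kernel
code): when the optimized switch only exchanges which task owns which
context, the pmu specific data must follow the task, so the two
pointers are exchanged as well.

/*
 * Sketch of the swap; the kernel patch does the same exchange with
 * xchg() while the two contexts swap owners.
 */
#include <stdio.h>

struct perf_event_context {
	void *task_ctx_data;
};

static void swap_task_ctx_data(struct perf_event_context *ctx,
			       struct perf_event_context *next_ctx)
{
	void *tmp = ctx->task_ctx_data;

	ctx->task_ctx_data = next_ctx->task_ctx_data;
	next_ctx->task_ctx_data = tmp;
}

int main(void)
{
	int prev_lbr = 1, next_lbr = 2;		/* stand-ins for saved LBR data */
	struct perf_event_context ctx = { .task_ctx_data = &prev_lbr };
	struct perf_event_context next_ctx = { .task_ctx_data = &next_lbr };

	swap_task_ctx_data(&ctx, &next_ctx);	/* data follows the task */
	printf("ctx now holds %d, next_ctx holds %d\n",
	       *(int *)ctx.task_ctx_data, *(int *)next_ctx.task_ctx_data);
	return 0;
}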

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 kernel/events/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b6650ab..d6d8dea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2319,6 +2319,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 			next->perf_event_ctxp[ctxn] = ctx;
 			ctx->task = next;
 			next_ctx->task = task;
+			ctx->task_ctx_data = xchg(&next_ctx->task_ctx_data,
+						  ctx->task_ctx_data);
 			do_switch = 0;
 
 			perf_event_sync_stat(ctx, next_ctx);
-- 
1.8.4.2



* [PATCH 07/14] perf: track number of events that use LBR callstack
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (5 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 06/14] perf, core: always switch pmu specific data during context switch Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-06 14:55   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 08/14] perf, x86: allocate space for storing LBR stack Yan, Zheng
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

A later patch will use it to decide if the LBR stack should be
saved/restored.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index bdd8758..2137a9f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -201,15 +201,27 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 		intel_pmu_lbr_reset();
 }
 
+static inline bool branch_user_callstack(unsigned br_sel)
+{
+	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
+}
+
 void intel_pmu_lbr_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	cpuc = &__get_cpu_var(cpu_hw_events);
+	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
+	if (branch_user_callstack(cpuc->br_sel))
+		task_ctx->lbr_callstack_users++;
+
 	cpuc->lbr_users++;
 	if (cpuc->lbr_users == 1)
 		perf_sched_cb_enable(event->ctx->pmu);
@@ -217,11 +229,18 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 
 void intel_pmu_lbr_disable(struct perf_event *event)
 {
-	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	struct cpu_hw_events *cpuc;
+	struct x86_perf_task_context *task_ctx;
 
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	cpuc = &__get_cpu_var(cpu_hw_events);
+	task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
+
+	if (branch_user_callstack(cpuc->br_sel))
+		task_ctx->lbr_callstack_users--;
+
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-- 
1.8.4.2



* [PATCH 08/14] perf, x86: allocate space for storing LBR stack
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (6 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 07/14] perf: track number of events that use LBR callstack Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-05 17:26   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. We can use the pmu specific data to
store the LBR stack when a task is scheduled out. This patch adds the
code that allocates the pmu specific data.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 1 +
 arch/x86/kernel/cpu/perf_event.h | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 69e2095..2e43f1b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1879,6 +1879,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
+	.task_ctx_size          = sizeof(struct x86_perf_task_context),
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3ef4b79..3ed9629 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -459,6 +459,13 @@ struct x86_pmu {
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
+struct x86_perf_task_context {
+	u64 lbr_from[MAX_LBR_ENTRIES];
+	u64 lbr_to[MAX_LBR_ENTRIES];
+	int lbr_callstack_users;
+	int lbr_stack_state;
+};
+
 enum {
 	PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
 	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
-- 
1.8.4.2



* [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (7 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 08/14] perf, x86: allocate space for storing LBR stack Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-05 17:45   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 10/14] perf, core: simplify need branch stack check Yan, Zheng
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is saving/restoring
the LBR stack to/from the task's perf event context.

The LBR stack is saved/restored only when there are events that use
the LBR call stack. If no event uses the LBR call stack, the LBR stack
is reset when a task is scheduled in.
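
For illustration, a stand-alone sketch of the TOS-relative walk that
the save/restore uses (not the kernel code; plain arrays stand in for
the LBR MSRs and the task context). lbr_nr is a power of two, so
masking with lbr_nr - 1 wraps the ring index:

/*
 * Sketch of the save/restore walk, starting at the top of stack (TOS)
 * and going backwards through the ring.
 */
#include <stdio.h>

#define LBR_NR 16

static unsigned long lbr_from[LBR_NR];	/* stands in for the FROM MSRs */
static unsigned long saved[LBR_NR];	/* stands in for task_ctx->lbr_from[] */

static void lbr_save(unsigned int tos)
{
	unsigned int mask = LBR_NR - 1, idx;
	int i;

	for (i = 0; i < LBR_NR; i++) {
		idx = (tos - i) & mask;		/* newest entry first */
		saved[i] = lbr_from[idx];
	}
}

static void lbr_restore(unsigned int tos)
{
	unsigned int mask = LBR_NR - 1, idx;
	int i;

	for (i = 0; i < LBR_NR; i++) {
		idx = (tos - i) & mask;		/* put entries back in place */
		lbr_from[idx] = saved[i];
	}
}

int main(void)
{
	unsigned int tos = 5;			/* arbitrary top of stack */
	int i;

	for (i = 0; i < LBR_NR; i++)
		lbr_from[i] = 0x1000 + i;
	lbr_save(tos);
	lbr_restore(tos);			/* round trip keeps the ring intact */
	printf("entry at TOS: %#lx\n", lbr_from[tos]);
	return 0;
}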

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 80 ++++++++++++++++++++++++------
 1 file changed, 66 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 2137a9f..51e1842 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -187,18 +187,82 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
+static inline u64 intel_pmu_lbr_tos(void)
+{
+	u64 tos;
+	rdmsrl(x86_pmu.lbr_tos, tos);
+	return tos;
+}
+
+enum {
+	LBR_UNINIT,
+	LBR_NONE,
+	LBR_VALID,
+};
+
+static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
+	u64 tos = intel_pmu_lbr_tos();
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_NONE;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
+	u64 tos = intel_pmu_lbr_tos();
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_state = LBR_VALID;
+}
+
+
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 {
+	struct cpu_hw_events *cpuc;
+	struct x86_perf_task_context *task_ctx;
+
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	cpuc = &__get_cpu_var(cpu_hw_events);
+	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+
+
 	/*
 	 * It is necessary to flush the stack on context switch. This happens
 	 * when the branch stack does not tag its entries with the pid of the
 	 * current task.
 	 */
-	if (sched_in)
-		intel_pmu_lbr_reset();
+	if (sched_in) {
+		if (!task_ctx ||
+		    !task_ctx->lbr_callstack_users ||
+		    task_ctx->lbr_stack_state != LBR_VALID)
+			intel_pmu_lbr_reset();
+		else
+			__intel_pmu_lbr_restore(task_ctx);
+	} else if (task_ctx) {
+		if (task_ctx->lbr_callstack_users &&
+		    task_ctx->lbr_stack_state != LBR_UNINIT)
+			__intel_pmu_lbr_save(task_ctx);
+		else
+			task_ctx->lbr_stack_state = LBR_NONE;
+	}
 }
 
 static inline bool branch_user_callstack(unsigned br_sel)
@@ -267,18 +331,6 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
-/*
- * TOS = most recently recorded branch
- */
-static inline u64 intel_pmu_lbr_tos(void)
-{
-	u64 tos;
-
-	rdmsrl(x86_pmu.lbr_tos, tos);
-
-	return tos;
-}
-
 static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 {
 	unsigned long mask = x86_pmu.lbr_nr - 1;
-- 
1.8.4.2



* [PATCH 10/14] perf, core: simplify need branch stack check
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (8 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-06 15:35   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 11/14] perf, core: Pass perf_sample_data to perf_callchain() Yan, Zheng
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

event->attr.branch_sample_type is non-zero whether the branch stack
is enabled explicitly or implicitly. So we can use it to replace
intel_pmu_needs_lbr_smpl(). This avoids duplicating the code that
implicitly enables the LBR.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
 include/linux/perf_event.h             |  5 +++++
 kernel/events/core.c                   | 11 +++++++----
 3 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 84a1c09..722171c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1030,20 +1030,6 @@ static __initconst const u64 slm_hw_cache_event_ids
  },
 };
 
-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
-{
-	/* user explicitly requested branch sampling */
-	if (has_branch_stack(event))
-		return true;
-
-	/* implicit branch sampling to correct PEBS skid */
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
-	    x86_pmu.intel_cap.pebs_format < 2)
-		return true;
-
-	return false;
-}
-
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1208,7 +1194,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	 * must disable before any actual event
 	 * because any event may be combined with LBR
 	 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
 		intel_pmu_lbr_disable(event);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1269,7 +1255,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	 * must enabled before any actual event
 	 * because any event may be combined with LBR
 	 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
 		intel_pmu_lbr_enable(event);
 
 	if (event->attr.exclude_host)
@@ -1741,7 +1727,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (event->attr.precise_ip && x86_pmu.pebs_aliases)
 		x86_pmu.pebs_aliases(event);
 
-	if (intel_pmu_needs_lbr_smpl(event)) {
+	if (needs_branch_stack(event)) {
 		ret = intel_pmu_setup_lbr_filter(event);
 		if (ret)
 			return ret;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 147f9d3..0d88eb8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -766,6 +766,11 @@ static inline bool has_branch_stack(struct perf_event *event)
 	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline bool needs_branch_stack(struct perf_event *event)
+{
+	return event->attr.branch_sample_type != 0;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d6d8dea..7dd4d58 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1138,7 +1138,7 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
-	if (has_branch_stack(event))
+	if (needs_branch_stack(event))
 		ctx->nr_branch_stack++;
 
 	list_add_rcu(&event->event_entry, &ctx->event_list);
@@ -1303,7 +1303,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
-	if (has_branch_stack(event))
+	if (needs_branch_stack(event))
 		ctx->nr_branch_stack--;
 
 	ctx->nr_events--;
@@ -3202,7 +3202,7 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_freq_events);
 	if (is_cgroup_event(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
-	if (has_branch_stack(event))
+	if (needs_branch_stack(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
 
 	unaccount_event_cpu(event, event->cpu);
@@ -6627,7 +6627,7 @@ static void account_event(struct perf_event *event)
 		if (atomic_inc_return(&nr_freq_events) == 1)
 			tick_nohz_full_kick_all();
 	}
-	if (has_branch_stack(event))
+	if (needs_branch_stack(event))
 		static_key_slow_inc(&perf_sched_events.key);
 	if (is_cgroup_event(event))
 		static_key_slow_inc(&perf_sched_events.key);
@@ -6735,6 +6735,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
 		goto err_ns;
 
+	if (!has_branch_stack(event))
+		event->attr.branch_sample_type = 0;
+
 	pmu = perf_init_event(event);
 	if (!pmu)
 		goto err_ns;
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/14] perf, core: Pass perf_sample_data to perf_callchain()
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (9 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 10/14] perf, core: simplify need branch stack check Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-01-03  5:48 ` [PATCH 12/14] perf, x86: use LBR call stack to get user callchain Yan, Zheng
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
call will be collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR registers.
The LBR call stack facility can help perf get call chains of programs
without frame pointers.

This patch modifies various architectures' perf_callchain() to accept
perf sample data. Later patch will add code that use the sample data to
get call chains.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/arm/kernel/perf_event.c     | 4 ++--
 arch/powerpc/perf/callchain.c    | 4 ++--
 arch/sparc/kernel/perf_event.c   | 4 ++--
 arch/x86/kernel/cpu/perf_event.c | 4 ++--
 include/linux/perf_event.h       | 3 ++-
 kernel/events/callchain.c        | 8 +++++---
 kernel/events/core.c             | 2 +-
 kernel/events/internal.h         | 3 ++-
 8 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 789d846..276b13b 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -562,8 +562,8 @@ user_backtrace(struct frame_tail __user *tail,
 	return buftail.fp - 1;
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	struct frame_tail __user *tail;
 
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 74d1e78..b379ebc 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -482,8 +482,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
 	}
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	if (current_is_64bit())
 		perf_callchain_user_64(entry, regs);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index b5c38fa..cba0306 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1785,8 +1785,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
 	} while (entry->nr < PERF_MAX_STACK_DEPTH);
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	perf_callchain_store(entry, regs->tpc);
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2e43f1b..49128e6 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2009,8 +2009,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 }
 #endif
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	struct stack_frame frame;
 	const void __user *fp;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0d88eb8..c442276 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -709,7 +709,8 @@ extern void perf_event_fork(struct task_struct *tsk);
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
-extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs);
+extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs,
+				struct perf_sample_data *data);
 extern void perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs);
 
 static inline void perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 97b67df..19d497c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -30,7 +30,8 @@ __weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
 }
 
 __weak void perf_callchain_user(struct perf_callchain_entry *entry,
-				struct pt_regs *regs)
+				struct pt_regs *regs,
+				struct perf_sample_data *data)
 {
 }
 
@@ -157,7 +158,8 @@ put_callchain_entry(int rctx)
 }
 
 struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+	       struct perf_sample_data *data)
 {
 	int rctx;
 	struct perf_callchain_entry *entry;
@@ -198,7 +200,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 				goto exit_put;
 
 			perf_callchain_store(entry, PERF_CONTEXT_USER);
-			perf_callchain_user(entry, regs);
+			perf_callchain_user(entry, regs, data);
 		}
 	}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7dd4d58..8429cc0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4665,7 +4665,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 
-		data->callchain = perf_callchain(event, regs);
+		data->callchain = perf_callchain(event, regs, data);
 
 		if (data->callchain)
 			size += data->callchain->nr;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 569b2187..cd18b64 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -147,7 +147,8 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
 
 /* Callchain handling */
 extern struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs);
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+	       struct perf_sample_data *data);
 extern int get_callchain_buffers(void);
 extern void put_callchain_buffers(void);
 
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 12/14] perf, x86: use LBR call stack to get user callchain
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (10 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 11/14] perf, core: Pass perf_sample_data to perf_callchain() Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-06 15:46   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 13/14] perf, x86: enable LBR callstack when recording callchain Yan, Zheng
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
call will be collected as normal, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR registers.
The LBR call stack facility can help perf get call chains of programs
without frame pointers.

This patch makes x86's perf_callchain_user() fall back to the LBR call
stack when there is no frame pointer in the user program.
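
A simple way to exercise the fallback is to profile a binary that was
built without frame pointers, e.g. with "gcc -O2 -fomit-frame-pointer".
A hypothetical test program (illustration only, not part of the patch):

  #include <stdio.h>

  /* noinline keeps real call/return activity for the LBR to record */
  static __attribute__((noinline)) double work(double x)
  {
          double s = 0.0;
          int i;

          for (i = 1; i < 100000; i++)
                  s += x / i;
          return s;
  }

  int main(void)
  {
          double s = 0.0;
          int i;

          for (i = 0; i < 10000; i++)
                  s += work(i);
          printf("%f\n", s);
          return 0;
  }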

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           | 33 ++++++++++++++++++++++++++----
 arch/x86/kernel/cpu/perf_event_intel.c     | 11 +++++++++-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  2 ++
 include/linux/perf_event.h                 |  1 +
 4 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 49128e6..1509340 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1965,12 +1965,28 @@ static unsigned long get_segment_base(unsigned int segment)
 	return get_desc_base(desc + idx);
 }
 
+static inline void
+perf_callchain_lbr_callstack(struct perf_callchain_entry *entry,
+			     struct perf_sample_data *data)
+{
+	struct perf_branch_stack *br_stack = data->br_stack;
+
+	if (br_stack && br_stack->user_callstack) {
+		int i = 0;
+		while (i < br_stack->nr && entry->nr < PERF_MAX_STACK_DEPTH) {
+			perf_callchain_store(entry, br_stack->entries[i].from);
+			i++;
+		}
+	}
+}
+
 #ifdef CONFIG_COMPAT
 
 #include <asm/compat.h>
 
 static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+		      struct pt_regs *regs, struct perf_sample_data *data)
 {
 	/* 32-bit process in 64-bit kernel. */
 	unsigned long ss_base, cs_base;
@@ -1999,11 +2015,16 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
+
+	if (fp == compat_ptr(regs->bp))
+		perf_callchain_lbr_callstack(entry, data);
+
 	return 1;
 }
 #else
 static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+		      struct pt_regs *regs, struct perf_sample_data *data)
 {
     return 0;
 }
@@ -2033,12 +2054,12 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
 	if (!current->mm)
 		return;
 
-	if (perf_callchain_user32(regs, entry))
+	if (perf_callchain_user32(entry, regs, data))
 		return;
 
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
-		frame.next_frame	     = NULL;
+		frame.next_frame = NULL;
 		frame.return_address = 0;
 
 		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
@@ -2051,6 +2072,10 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
 		perf_callchain_store(entry, frame.return_address);
 		fp = frame.next_frame;
 	}
+
+	/* try LBR callstack if there is no frame pointer */
+	if (fp == (void __user *)regs->bp)
+		perf_callchain_lbr_callstack(entry, data);
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 722171c..8b7465c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1030,6 +1030,14 @@ static __initconst const u64 slm_hw_cache_event_ids
  },
 };
 
+static inline bool intel_pmu_needs_lbr_callstack(struct perf_event *event)
+{
+	if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+	    (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK))
+		return true;
+	return false;
+}
+
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1398,7 +1406,8 @@ again:
 
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
-		if (has_branch_stack(event))
+		if (has_branch_stack(event) ||
+		    (event->ctx->task && intel_pmu_needs_lbr_callstack(event)))
 			data.br_stack = &cpuc->lbr_stack;
 
 		if (perf_event_overflow(event, &data, regs))
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 51e1842..08e3ba1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -718,6 +718,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 	int i, j, type;
 	bool compress = false;
 
+	cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
 	/* if sampling all branches, then nothing to filter */
 	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
 		return;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c442276..d2f0488 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -74,6 +74,7 @@ struct perf_raw_record {
  * recent branch.
  */
 struct perf_branch_stack {
+	bool				user_callstack;
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
 };
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 13/14] perf, x86: enable LBR callstack when recording callchain
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (11 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 12/14] perf, x86: use LBR call stack to get user callchain Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-06 15:50   ` Stephane Eranian
  2014-01-03  5:48 ` [PATCH 14/14] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
  2014-01-21 13:17 ` [PATCH 00/14] perf, x86: Haswell LBR call stack support Stephane Eranian
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

Try enabling the LBR call stack facility if the user requests recording
the user space callchain. Also add a CPU PMU attribute to enable/disable
this feature. The feature is disabled by default because it may contend
for the LBR with other events that explicitly require a branch stack.
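
In user space terms, the kind of event that can transparently use the
LBR call stack (once the lbr_callstack attribute is enabled) is a
per-task, callchain-sampling event that does not itself ask for branch
stack sampling. A minimal sketch (hypothetical example code, not part
of this patch):

  #include <string.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <asm/unistd.h>
  #include <linux/perf_event.h>

  static int open_callchain_event(pid_t pid)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;
          attr.sample_period = 100000;
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
          attr.exclude_kernel = 1;        /* user space callchain only */
          /* attr.branch_sample_type deliberately left at 0 */

          /* per-task (pid), any CPU */
          return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
  }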

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 99 ++++++++++++++++++++++++++++------------
 arch/x86/kernel/cpu/perf_event.h |  7 +++
 2 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1509340..3ea184a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -399,37 +399,49 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
+	}
+	/*
+	 * check that PEBS LBR correction does not conflict with
+	 * whatever the user is asking with attr->branch_sample_type
+	 */
+	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+		u64 *br_type = &event->attr.branch_sample_type;
+
+		if (has_branch_stack(event)) {
+			if (!precise_br_compat(event))
+				return -EOPNOTSUPP;
+
+			/* branch_sample_type is compatible */
+
+		} else {
+			/*
+			 * user did not specify  branch_sample_type
+			 *
+			 * For PEBS fixups, we capture all
+			 * the branches at the priv level of the
+			 * event.
+			 */
+			*br_type = PERF_SAMPLE_BRANCH_ANY;
+
+			if (!event->attr.exclude_user)
+				*br_type |= PERF_SAMPLE_BRANCH_USER;
+
+			if (!event->attr.exclude_kernel)
+				*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
+		}
+	} else if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+		   !has_branch_stack(event) &&
+		   x86_pmu.attr_lbr_callstack &&
+		   !event->attr.exclude_user &&
+		   (event->attach_state & PERF_ATTACH_TASK)) {
 		/*
-		 * check that PEBS LBR correction does not conflict with
-		 * whatever the user is asking with attr->branch_sample_type
+		 * user did not specify branch_sample_type,
+		 * try using the LBR call stack facility to
+		 * record call chains of user program.
 		 */
-		if (event->attr.precise_ip > 1 &&
-		    x86_pmu.intel_cap.pebs_format < 2) {
-			u64 *br_type = &event->attr.branch_sample_type;
-
-			if (has_branch_stack(event)) {
-				if (!precise_br_compat(event))
-					return -EOPNOTSUPP;
-
-				/* branch_sample_type is compatible */
-
-			} else {
-				/*
-				 * user did not specify  branch_sample_type
-				 *
-				 * For PEBS fixups, we capture all
-				 * the branches at the priv level of the
-				 * event.
-				 */
-				*br_type = PERF_SAMPLE_BRANCH_ANY;
-
-				if (!event->attr.exclude_user)
-					*br_type |= PERF_SAMPLE_BRANCH_USER;
-
-				if (!event->attr.exclude_kernel)
-					*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
-			}
-		}
+		event->attr.branch_sample_type =
+			PERF_SAMPLE_BRANCH_USER |
+			PERF_SAMPLE_BRANCH_CALL_STACK;
 	}
 
 	/*
@@ -1828,10 +1840,39 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
 	return count;
 }
 
+static ssize_t get_attr_lbr_callstack(struct device *cdev,
+				      struct device_attribute *attr, char *buf)
+{
+	return snprintf(buf, 40, "%d\n", x86_pmu.attr_lbr_callstack);
+}
+
+static ssize_t set_attr_lbr_callstack(struct device *cdev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	unsigned long val;
+	ssize_t ret;
+
+	ret = kstrtoul(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (!!val != !!x86_pmu.attr_lbr_callstack) {
+		if (val && !x86_pmu_has_lbr_callstack())
+			return -EOPNOTSUPP;
+		x86_pmu.attr_lbr_callstack = !!val;
+	}
+	return count;
+}
+
 static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+static DEVICE_ATTR(lbr_callstack, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH,
+		   get_attr_lbr_callstack, set_attr_lbr_callstack);
+
 
 static struct attribute *x86_pmu_attrs[] = {
 	&dev_attr_rdpmc.attr,
+	&dev_attr_lbr_callstack.attr,
 	NULL,
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3ed9629..b45258c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -400,6 +400,7 @@ struct x86_pmu {
 	 * sysfs attrs
 	 */
 	int		attr_rdpmc;
+	int		attr_lbr_callstack;
 	struct attribute **format_attrs;
 	struct attribute **event_attrs;
 
@@ -504,6 +505,12 @@ static struct perf_pmu_events_attr event_attr_##v = {			\
 
 extern struct x86_pmu x86_pmu __read_mostly;
 
+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+	return  x86_pmu.lbr_sel_map &&
+		x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
 DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 int x86_perf_event_set_period(struct perf_event *event);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 14/14] perf, x86: Discard zero length call entries in LBR call stack
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (12 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 13/14] perf, x86: enable LBR callstack when recording callchain Yan, Zheng
@ 2014-01-03  5:48 ` Yan, Zheng
  2014-02-06 15:57   ` Stephane Eranian
  2014-01-21 13:17 ` [PATCH 00/14] perf, x86: Haswell LBR call stack support Stephane Eranian
  14 siblings, 1 reply; 37+ messages in thread
From: Yan, Zheng @ 2014-01-03  5:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: a.p.zijlstra, mingo, acme, eranian, andi, Yan, Zheng

"Zero length call" uses the attribute of the call instruction to push
the address of the next instruction onto the stack and then pops that
address off into a register. This is accomplished without any matching
return instruction. It confuses the hardware and makes the recorded
call stack incorrect. Try to fix the call stack by discarding zero
length call entries.
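
For reference, a typical zero length call is the call/pop idiom that is
sometimes used to read the instruction pointer in position independent
32-bit code. A hypothetical illustration:

  /*
   * "call 1f" is encoded as e8 00 00 00 00 (call near rel with a zero
   * immediate): it pushes the address of the next instruction, which
   * the pop immediately consumes, so no matching ret ever executes.
   */
  static unsigned long current_ip(void)
  {
          unsigned long ip;

          asm volatile("call 1f\n"
                       "1: pop %0" : "=r" (ip));
          return ip;
  }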

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 08e3ba1..57bdd34 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
 	X86_BR_ABORT		= 1 << 12,/* transaction abort */
 	X86_BR_IN_TX		= 1 << 13,/* in transaction */
 	X86_BR_NO_TX		= 1 << 14,/* not in transaction */
-	X86_BR_CALL_STACK	= 1 << 15,/* call stack */
+	X86_BR_ZERO_CALL	= 1 << 15,/* zero length call */
+	X86_BR_CALL_STACK	= 1 << 16,/* call stack */
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
 	 X86_BR_JMP	 |\
 	 X86_BR_IRQ	 |\
 	 X86_BR_ABORT	 |\
-	 X86_BR_IND_CALL)
+	 X86_BR_IND_CALL |\
+	 X86_BR_ZERO_CALL)
 
 #define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
 
 #define X86_BR_ANY_CALL		 \
 	(X86_BR_CALL		|\
 	 X86_BR_IND_CALL	|\
+	 X86_BR_ZERO_CALL	|\
 	 X86_BR_SYSCALL		|\
 	 X86_BR_IRQ		|\
 	 X86_BR_INT)
@@ -652,6 +655,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
 		ret = X86_BR_INT;
 		break;
 	case 0xe8: /* call near rel */
+		insn_get_immediate(&insn);
+		if (insn.immediate1.value == 0) {
+			/* zero length call */
+			ret = X86_BR_ZERO_CALL;
+			break;
+		}
 	case 0x9a: /* call far absolute */
 		ret = X86_BR_CALL;
 		break;
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/14] perf, x86: Haswell LBR call stack support
  2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (13 preceding siblings ...)
  2014-01-03  5:48 ` [PATCH 14/14] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
@ 2014-01-21 13:17 ` Stephane Eranian
  2014-01-22  1:35   ` Yan, Zheng
  14 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-01-21 13:17 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

Hi,

Is there a git tree from which I could pull those 14 patches?

On Fri, Jan 3, 2014 at 6:47 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> For many profiling tasks we need the callgraph. For example we often
> need to see the caller of a lock or the caller of a memcpy or other
> library function to actually tune the program. Frame pointer unwinding
> is efficient and works well. But frame pointers are off by default on
> 64bit code (and on modern 32bit gccs), so there are many binaries around
> that do not use frame pointers. Profiling unchanged production code is
> very useful in practice. On some CPUs frame pointer also has a high
> cost. Dwarf2 unwinding also does not always work and is extremely slow
> (upto 20% overhead).
>
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function
> call will be collected as normal, but as return instructions are
> executed the last captured branch record is popped from the on-chip LBR
> registers. The LBR call stack facility provides an alternative to get
> callgraph. It has some limitations too, but should work in most cases
> and is significantly faster than dwarf. Frame pointer unwinding is still
> the best default, but LBR call stack is a good alternative when nothing
> else works.
>
> This patch series adds LBR call stack support. User can enabled/disable
> this through an sysfs attribute file in the CPU PMU directory:
>  echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack
>
> When profiling bc(1) on Fedora 19:
>  echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd
>
> If this feature is enabled, perf report output looks like:
>     50.36%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
>
>     33.66%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>                      bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
>
>      7.62%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                     |
>                     |--99.89%-- 0x2000186a8
>                      --0.11%-- [...]
>
>      6.83%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>                     |
>                     |--99.94%-- bc_add
>                     |          execute
>                     |          run_code
>                     |          yyparse
>                     |          main
>                     |          __libc_start_main
>                     |          _start
>                      --0.06%-- [...]
>
>      0.46%       bc  libc-2.17.so       [.] __memset_sse2
>                  |
>                  --- __memset_sse2
>                     |
>                     |--54.13%-- bc_new_num
>                     |          |
>                     |          |--51.00%-- bc_divide
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |          |--30.46%-- _bc_do_sub
>                     |          |          bc_add
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |           --18.55%-- _bc_do_add
>                     |                     bc_add
>                     |                     execute
>                     |                     run_code
>                     |                     yyparse
>                     |                     main
>                     |                     __libc_start_main
>                     |                     _start
>                     |
>                      --45.87%-- bc_divide
>                                execute
>                                run_code
>                                yyparse
>                                main
>                                __libc_start_main
>                                _start
>
> If this feature is disabled, perf report output looks like:
>     50.49%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>
>     33.57%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>
>      7.61%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                      0x2000186a8
>
>      6.88%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>
>      0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
>                  |
>                  --- __memcpy_ssse3_back
>
> The LBR call stack has following known limitations
>  - Zero length calls are not filtered out by hardware
>  - Exception handling such as setjmp/longjmp will have calls/returns not
>    match
>  - Pushing different return address onto the stack will have calls/returns
>    not match
>  - If callstack is deeper than the LBR, only the last entries are captured
>
> Change since previous version
>  - split change into more patches
>  - introduce context switch callback and use it to flush LBR
>  - use the context switch callback to save/restore LBR
>  - dynamic allocate memory area for storing LBR stack, always switch the
>    memory area during context switch
>  - disable this feature by default
>  - more description in change logs
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/14] perf, x86: Haswell LBR call stack support
  2014-01-21 13:17 ` [PATCH 00/14] perf, x86: Haswell LBR call stack support Stephane Eranian
@ 2014-01-22  1:35   ` Yan, Zheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-01-22  1:35 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On 01/21/2014 09:17 PM, Stephane Eranian wrote:
> Hi,
> 
> Is there a git tree from which I could pull those 14 patches?

https://github.com/ukernel/linux.git perf-lbr-callstack

Regards
Yan, Zheng

> 
> On Fri, Jan 3, 2014 at 6:47 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> For many profiling tasks we need the callgraph. For example we often
>> need to see the caller of a lock or the caller of a memcpy or other
>> library function to actually tune the program. Frame pointer unwinding
>> is efficient and works well. But frame pointers are off by default on
>> 64bit code (and on modern 32bit gccs), so there are many binaries around
>> that do not use frame pointers. Profiling unchanged production code is
>> very useful in practice. On some CPUs frame pointer also has a high
>> cost. Dwarf2 unwinding also does not always work and is extremely slow
>> (upto 20% overhead).
>>
>> Haswell has a new feature that utilizes the existing Last Branch Record
>> facility to record call chains. When the feature is enabled, function
>> call will be collected as normal, but as return instructions are
>> executed the last captured branch record is popped from the on-chip LBR
>> registers. The LBR call stack facility provides an alternative to get
>> callgraph. It has some limitations too, but should work in most cases
>> and is significantly faster than dwarf. Frame pointer unwinding is still
>> the best default, but LBR call stack is a good alternative when nothing
>> else works.
>>
>> This patch series adds LBR call stack support. User can enabled/disable
>> this through an sysfs attribute file in the CPU PMU directory:
>>  echo 1 > /sys/bus/event_source/devices/cpu/lbr_callstack
>>
>> When profiling bc(1) on Fedora 19:
>>  echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd
>>
>> If this feature is enabled, perf report output looks like:
>>     50.36%       bc  bc                 [.] bc_divide
>>                  |
>>                  --- bc_divide
>>                      execute
>>                      run_code
>>                      yyparse
>>                      main
>>                      __libc_start_main
>>                      _start
>>
>>     33.66%       bc  bc                 [.] _one_mult
>>                  |
>>                  --- _one_mult
>>                      bc_divide
>>                      execute
>>                      run_code
>>                      yyparse
>>                      main
>>                      __libc_start_main
>>                      _start
>>
>>      7.62%       bc  bc                 [.] _bc_do_add
>>                  |
>>                  --- _bc_do_add
>>                     |
>>                     |--99.89%-- 0x2000186a8
>>                      --0.11%-- [...]
>>
>>      6.83%       bc  bc                 [.] _bc_do_sub
>>                  |
>>                  --- _bc_do_sub
>>                     |
>>                     |--99.94%-- bc_add
>>                     |          execute
>>                     |          run_code
>>                     |          yyparse
>>                     |          main
>>                     |          __libc_start_main
>>                     |          _start
>>                      --0.06%-- [...]
>>
>>      0.46%       bc  libc-2.17.so       [.] __memset_sse2
>>                  |
>>                  --- __memset_sse2
>>                     |
>>                     |--54.13%-- bc_new_num
>>                     |          |
>>                     |          |--51.00%-- bc_divide
>>                     |          |          execute
>>                     |          |          run_code
>>                     |          |          yyparse
>>                     |          |          main
>>                     |          |          __libc_start_main
>>                     |          |          _start
>>                     |          |
>>                     |          |--30.46%-- _bc_do_sub
>>                     |          |          bc_add
>>                     |          |          execute
>>                     |          |          run_code
>>                     |          |          yyparse
>>                     |          |          main
>>                     |          |          __libc_start_main
>>                     |          |          _start
>>                     |          |
>>                     |           --18.55%-- _bc_do_add
>>                     |                     bc_add
>>                     |                     execute
>>                     |                     run_code
>>                     |                     yyparse
>>                     |                     main
>>                     |                     __libc_start_main
>>                     |                     _start
>>                     |
>>                      --45.87%-- bc_divide
>>                                execute
>>                                run_code
>>                                yyparse
>>                                main
>>                                __libc_start_main
>>                                _start
>>
>> If this feature is disabled, perf report output looks like:
>>     50.49%       bc  bc                 [.] bc_divide
>>                  |
>>                  --- bc_divide
>>
>>     33.57%       bc  bc                 [.] _one_mult
>>                  |
>>                  --- _one_mult
>>
>>      7.61%       bc  bc                 [.] _bc_do_add
>>                  |
>>                  --- _bc_do_add
>>                      0x2000186a8
>>
>>      6.88%       bc  bc                 [.] _bc_do_sub
>>                  |
>>                  --- _bc_do_sub
>>
>>      0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
>>                  |
>>                  --- __memcpy_ssse3_back
>>
>> The LBR call stack has following known limitations
>>  - Zero length calls are not filtered out by hardware
>>  - Exception handling such as setjmp/longjmp will have calls/returns not
>>    match
>>  - Pushing different return address onto the stack will have calls/returns
>>    not match
>>  - If callstack is deeper than the LBR, only the last entries are captured
>>
>> Change since previous version
>>  - split change into more patches
>>  - introduce context switch callback and use it to flush LBR
>>  - use the context switch callback to save/restore LBR
>>  - dynamic allocate memory area for storing LBR stack, always switch the
>>    memory area during context switch
>>  - disable this feature by default
>>  - more description in change logs
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/14] perf, x86: Reduce lbr_sel_map size
  2014-01-03  5:47 ` [PATCH 01/14] perf, x86: Reduce lbr_sel_map size Yan, Zheng
@ 2014-02-05 15:15   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 15:15 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:47 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> The index of lbr_sel_map is bit value of perf branch_sample_type.
> PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
> 4096 bytes. By using bit shift as index, we can reduce lbr_sel_map
> size to 40 bytes.
>

Reviewed-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.h           |  4 +++
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 50 ++++++++++++++----------------
>  include/uapi/linux/perf_event.h            | 42 +++++++++++++++++--------
>  3 files changed, 56 insertions(+), 40 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index fd00bb2..745f6fb 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -459,6 +459,10 @@ struct x86_pmu {
>         struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
>  };
>
> +enum {
> +       PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
> +};
> +
>  #define x86_add_quirk(func_)                                           \
>  do {                                                                   \
>         static struct x86_pmu_quirk __quirk __initdata = {              \
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index d82d155..1ae2ec5 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -69,10 +69,6 @@ static enum {
>  #define LBR_FROM_FLAG_IN_TX    (1ULL << 62)
>  #define LBR_FROM_FLAG_ABORT    (1ULL << 61)
>
> -#define for_each_branch_sample_type(x) \
> -       for ((x) = PERF_SAMPLE_BRANCH_USER; \
> -            (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
> -
>  /*
>   * x86control flow change classification
>   * x86control flow changes include branches, interrupts, traps, faults
> @@ -400,14 +396,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
>  {
>         struct hw_perf_event_extra *reg;
>         u64 br_type = event->attr.branch_sample_type;
> -       u64 mask = 0, m;
> -       u64 v;
> +       u64 mask = 0, v;
> +       int i;
>
> -       for_each_branch_sample_type(m) {
> -               if (!(br_type & m))
> +       for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
> +               if (!(br_type & (1ULL << i)))
>                         continue;
>
> -               v = x86_pmu.lbr_sel_map[m];
> +               v = x86_pmu.lbr_sel_map[i];
>                 if (v == LBR_NOT_SUPP)
>                         return -EOPNOTSUPP;
>
> @@ -662,33 +658,33 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>  /*
>   * Map interface branch filters onto LBR filters
>   */
> -static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
> -       [PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
> -       [PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
> -       [PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
> -       [PERF_SAMPLE_BRANCH_HV]         = LBR_IGN,
> -       [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
> -                                       | LBR_IND_JMP | LBR_FAR,
> +static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
> +       [PERF_SAMPLE_BRANCH_ANY_SHIFT]          = LBR_ANY,
> +       [PERF_SAMPLE_BRANCH_USER_SHIFT]         = LBR_USER,
> +       [PERF_SAMPLE_BRANCH_KERNEL_SHIFT]       = LBR_KERNEL,
> +       [PERF_SAMPLE_BRANCH_HV_SHIFT]           = LBR_IGN,
> +       [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]   = LBR_RETURN | LBR_REL_JMP
> +                                               | LBR_IND_JMP | LBR_FAR,
>         /*
>          * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
>          */
> -       [PERF_SAMPLE_BRANCH_ANY_CALL] =
> +       [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
>          LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
>         /*
>          * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
>          */
> -       [PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
> +       [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
>  };
>
> -static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
> -       [PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
> -       [PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
> -       [PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
> -       [PERF_SAMPLE_BRANCH_HV]         = LBR_IGN,
> -       [PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
> -       [PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
> -                                       | LBR_FAR,
> -       [PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
> +static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
> +       [PERF_SAMPLE_BRANCH_ANY_SHIFT]          = LBR_ANY,
> +       [PERF_SAMPLE_BRANCH_USER_SHIFT]         = LBR_USER,
> +       [PERF_SAMPLE_BRANCH_KERNEL_SHIFT]       = LBR_KERNEL,
> +       [PERF_SAMPLE_BRANCH_HV_SHIFT]           = LBR_IGN,
> +       [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]   = LBR_RETURN | LBR_FAR,
> +       [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]     = LBR_REL_CALL | LBR_IND_CALL
> +                                               | LBR_FAR,
> +       [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]     = LBR_IND_CALL,
>  };
>
>  /* core */
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index e1802d6..4d8c438 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -151,20 +151,36 @@ enum perf_event_sample_format {
>   * The branch types can be combined, however BRANCH_ANY covers all types
>   * of branches and therefore it supersedes all the other types.
>   */
> +enum perf_branch_sample_type_shift {
> +       PERF_SAMPLE_BRANCH_USER_SHIFT           = 0, /* user branches */
> +       PERF_SAMPLE_BRANCH_KERNEL_SHIFT         = 1, /* kernel branches */
> +       PERF_SAMPLE_BRANCH_HV_SHIFT             = 2, /* hypervisor branches */
> +
> +       PERF_SAMPLE_BRANCH_ANY_SHIFT            = 3, /* any branch types */
> +       PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT       = 4, /* any call branch */
> +       PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT     = 5, /* any return branch */
> +       PERF_SAMPLE_BRANCH_IND_CALL_SHIFT       = 6, /* indirect calls */
> +       PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT       = 7, /* transaction aborts */
> +       PERF_SAMPLE_BRANCH_IN_TX_SHIFT          = 8, /* in transaction */
> +       PERF_SAMPLE_BRANCH_NO_TX_SHIFT          = 9, /* not in transaction */
> +
> +       PERF_SAMPLE_BRANCH_MAX_SHIFT            /* non-ABI */
> +};
> +
>  enum perf_branch_sample_type {
> -       PERF_SAMPLE_BRANCH_USER         = 1U << 0, /* user branches */
> -       PERF_SAMPLE_BRANCH_KERNEL       = 1U << 1, /* kernel branches */
> -       PERF_SAMPLE_BRANCH_HV           = 1U << 2, /* hypervisor branches */
> -
> -       PERF_SAMPLE_BRANCH_ANY          = 1U << 3, /* any branch types */
> -       PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << 4, /* any call branch */
> -       PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 5, /* any return branch */
> -       PERF_SAMPLE_BRANCH_IND_CALL     = 1U << 6, /* indirect calls */
> -       PERF_SAMPLE_BRANCH_ABORT_TX     = 1U << 7, /* transaction aborts */
> -       PERF_SAMPLE_BRANCH_IN_TX        = 1U << 8, /* in transaction */
> -       PERF_SAMPLE_BRANCH_NO_TX        = 1U << 9, /* not in transaction */
> -
> -       PERF_SAMPLE_BRANCH_MAX          = 1U << 10, /* non-ABI */
> +       PERF_SAMPLE_BRANCH_USER         = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
> +       PERF_SAMPLE_BRANCH_KERNEL       = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
> +       PERF_SAMPLE_BRANCH_HV           = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
> +
> +       PERF_SAMPLE_BRANCH_ANY          = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
> +       PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
> +       PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
> +       PERF_SAMPLE_BRANCH_IND_CALL     = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
> +       PERF_SAMPLE_BRANCH_ABORT_TX     = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
> +       PERF_SAMPLE_BRANCH_IN_TX        = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
> +       PERF_SAMPLE_BRANCH_NO_TX        = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
> +
> +       PERF_SAMPLE_BRANCH_MAX          = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
>  };
>
>  #define PERF_SAMPLE_BRANCH_PLM_ALL \
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support
  2014-01-03  5:48 ` [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
@ 2014-02-05 15:40   ` Stephane Eranian
  2014-02-06  1:52     ` Yan, Zheng
  0 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 15:40 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> When the call stack feature is enabled, the LBR stack will capture
> unfiltered call data normally, but as return instructions are executed,
> the last captured branch record is flushed from the on-chip registers
> in a last-in first-out (LIFO) manner. Thus, branch information relative
> to leaf functions will not be captured, while preserving the call stack
> information of the main line execution path.
>
This is a generic description of the LBR call stack feature. It does not
describe what the patch actually does, which is to implement the basic
internal infrastructure for CALL_STACK mode using the LBR call stack.

> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.h           |  7 ++-
>  arch/x86/kernel/cpu/perf_event_intel.c     |  2 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 98 +++++++++++++++++++++++-------
>  3 files changed, 82 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 80b8e83..3ef4b79 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -460,7 +460,10 @@ struct x86_pmu {
>  };
>
>  enum {
> -       PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
> +       PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
> +       PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
> +
> +       PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
>  };
>
>  #define x86_add_quirk(func_)                                           \
> @@ -697,6 +700,8 @@ void intel_pmu_lbr_init_atom(void);
>
>  void intel_pmu_lbr_init_snb(void);
>
> +void intel_pmu_lbr_init_hsw(void);
> +
>  int intel_pmu_setup_lbr_filter(struct perf_event *event);
>
>  int p4_pmu_init(void);
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 4325bae..84a1c09 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -2494,7 +2494,7 @@ __init int intel_pmu_init(void)
>                 memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
>                 memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
>
> -               intel_pmu_lbr_init_snb();
> +               intel_pmu_lbr_init_hsw();
>
>                 x86_pmu.event_constraints = intel_hsw_event_constraints;
>                 x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 7ff2a99..bdd8758 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -39,6 +39,7 @@ static enum {
>  #define LBR_IND_JMP_BIT                6 /* do not capture indirect jumps */
>  #define LBR_REL_JMP_BIT                7 /* do not capture relative jumps */
>  #define LBR_FAR_BIT            8 /* do not capture far branches */
> +#define LBR_CALL_STACK_BIT     9 /* enable call stack */
>
>  #define LBR_KERNEL     (1 << LBR_KERNEL_BIT)
>  #define LBR_USER       (1 << LBR_USER_BIT)
> @@ -49,6 +50,7 @@ static enum {
>  #define LBR_REL_JMP    (1 << LBR_REL_JMP_BIT)
>  #define LBR_IND_JMP    (1 << LBR_IND_JMP_BIT)
>  #define LBR_FAR                (1 << LBR_FAR_BIT)
> +#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)
>
>  #define LBR_PLM (LBR_KERNEL | LBR_USER)
>
> @@ -74,24 +76,25 @@ static enum {
>   * x86control flow changes include branches, interrupts, traps, faults
>   */
>  enum {
> -       X86_BR_NONE     = 0,      /* unknown */
> -
> -       X86_BR_USER     = 1 << 0, /* branch target is user */
> -       X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
> -
> -       X86_BR_CALL     = 1 << 2, /* call */
> -       X86_BR_RET      = 1 << 3, /* return */
> -       X86_BR_SYSCALL  = 1 << 4, /* syscall */
> -       X86_BR_SYSRET   = 1 << 5, /* syscall return */
> -       X86_BR_INT      = 1 << 6, /* sw interrupt */
> -       X86_BR_IRET     = 1 << 7, /* return from interrupt */
> -       X86_BR_JCC      = 1 << 8, /* conditional */
> -       X86_BR_JMP      = 1 << 9, /* jump */
> -       X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
> -       X86_BR_IND_CALL = 1 << 11,/* indirect calls */
> -       X86_BR_ABORT    = 1 << 12,/* transaction abort */
> -       X86_BR_IN_TX    = 1 << 13,/* in transaction */
> -       X86_BR_NO_TX    = 1 << 14,/* not in transaction */
> +       X86_BR_NONE             = 0,      /* unknown */
> +
> +       X86_BR_USER             = 1 << 0, /* branch target is user */
> +       X86_BR_KERNEL           = 1 << 1, /* branch target is kernel */
> +
> +       X86_BR_CALL             = 1 << 2, /* call */
> +       X86_BR_RET              = 1 << 3, /* return */
> +       X86_BR_SYSCALL          = 1 << 4, /* syscall */
> +       X86_BR_SYSRET           = 1 << 5, /* syscall return */
> +       X86_BR_INT              = 1 << 6, /* sw interrupt */
> +       X86_BR_IRET             = 1 << 7, /* return from interrupt */
> +       X86_BR_JCC              = 1 << 8, /* conditional */
> +       X86_BR_JMP              = 1 << 9, /* jump */
> +       X86_BR_IRQ              = 1 << 10,/* hw interrupt or trap or fault */
> +       X86_BR_IND_CALL         = 1 << 11,/* indirect calls */
> +       X86_BR_ABORT            = 1 << 12,/* transaction abort */
> +       X86_BR_IN_TX            = 1 << 13,/* in transaction */
> +       X86_BR_NO_TX            = 1 << 14,/* not in transaction */
> +       X86_BR_CALL_STACK       = 1 << 15,/* call stack */
>  };
>
>  #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
> @@ -135,7 +138,14 @@ static void __intel_pmu_lbr_enable(void)
>                 wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
>
>         rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
> -       debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
> +       debugctl |= DEBUGCTLMSR_LBR;
> +       /*
> +        * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
> +        * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
> +        * may cause superfluous increase/decrease of LBR_TOS.
> +        */
Is that a bug or a feature?

That prevents any use of the call-stack mode in the kernel, because by
the time you get to the perf_events code the stack will have been
overwritten. You can get by if you are only interested in user level
execution: the LBR priv level filtering will effectively cause a freeze,
though with some skid. I assume you are limiting this feature to the
user priv level by enforcing that users pass the PERF_SAMPLE_BRANCH_USER
flag.
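
Something along these lines in intel_pmu_setup_sw_lbr_filter() would
make that explicit (just a sketch):

        /* LBR call stack mode is only usable for user level branches */
        if ((br_type & PERF_SAMPLE_BRANCH_CALL_STACK) &&
            !(br_type & PERF_SAMPLE_BRANCH_USER))
                return -EINVAL;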


> +       if (!cpuc->lbr_sel || !(cpuc->lbr_sel->config & LBR_CALL_STACK))
> +               debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
>         wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>  }
>
> @@ -354,7 +364,7 @@ void intel_pmu_lbr_read(void)
>   * - in case there is no HW filter
>   * - in case the HW filter has errata or limitations
>   */
> -static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
> +static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>  {
>         u64 br_type = event->attr.branch_sample_type;
>         int mask = 0;
> @@ -388,11 +398,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>         if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
>                 mask |= X86_BR_NO_TX;
>
> +       if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
> +               if (!x86_pmu.lbr_sel_map)
> +                       return -EOPNOTSUPP;

I am not sure checking lbr_sel_map here is enough. You need to
check if the CALL_STACK entry is populated, meaning the HW feature
exists.
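
E.g. something like this (sketch only):

        /* the hardware must actually support call-stack mode */
        if (!x86_pmu.lbr_sel_map ||
            !x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT])
                return -EOPNOTSUPP;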

> +               if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
> +                       return -EINVAL;
> +               mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
> +                       X86_BR_CALL_STACK;

Why have BR_RET here?

> +       }
> +
>         /*
>          * stash actual user request into reg, it may
>          * be used by fixup code for some CPU
>          */
>         event->hw.branch_reg.reg = mask;
> +       return 0;
>  }
>
>  /*
> @@ -421,8 +441,11 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
>         reg = &event->hw.branch_reg;
>         reg->idx = EXTRA_REG_LBR;
>
> -       /* LBR_SELECT operates in suppress mode so invert mask */
> -       reg->config = ~mask & x86_pmu.lbr_sel_mask;
> +       /*
> +        * the first 8 bits (LBR_SEL_MASK) in LBR_SELECT operates
> +        * in suppress mode so invert mask
> +        */
> +       reg->config = mask ^ x86_pmu.lbr_sel_mask;
>
>         return 0;
>  }
> @@ -440,7 +463,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
>         /*
>          * setup SW LBR filter
>          */
> -       intel_pmu_setup_sw_lbr_filter(event);
> +       ret = intel_pmu_setup_sw_lbr_filter(event);
> +       if (ret)
> +               return ret;
>
>         /*
>          * setup HW LBR filter, if any
> @@ -695,6 +720,19 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
>         [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]     = LBR_IND_CALL,
>  };
>
> +static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
> +       [PERF_SAMPLE_BRANCH_ANY_SHIFT]          = LBR_ANY,
> +       [PERF_SAMPLE_BRANCH_USER_SHIFT]         = LBR_USER,
> +       [PERF_SAMPLE_BRANCH_KERNEL_SHIFT]       = LBR_KERNEL,
> +       [PERF_SAMPLE_BRANCH_HV_SHIFT]           = LBR_IGN,
> +       [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]   = LBR_RETURN | LBR_FAR,
> +       [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]     = LBR_REL_CALL | LBR_IND_CALL
> +                                               | LBR_FAR,
> +       [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]     = LBR_IND_CALL,
> +       [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]   = LBR_REL_CALL | LBR_IND_CALL
> +                                               | LBR_RETURN | LBR_CALL_STACK,
> +};
> +
>  /* core */
>  void intel_pmu_lbr_init_core(void)
>  {
> @@ -751,6 +789,20 @@ void intel_pmu_lbr_init_snb(void)
>         pr_cont("16-deep LBR, ");
>  }
>
> +/* haswell */
> +void intel_pmu_lbr_init_hsw(void)
> +{
> +       x86_pmu.lbr_nr   = 16;
> +       x86_pmu.lbr_tos  = MSR_LBR_TOS;
> +       x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
> +       x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
> +
> +       x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
> +       x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
> +
> +       pr_cont("16-deep LBR, ");
> +}
> +
>  /* atom */
>  void intel_pmu_lbr_init_atom(void)
>  {
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/14] perf, core: introduce pmu context switch callback
  2014-01-03  5:47 ` [PATCH 02/14] perf, core: introduce pmu context switch callback Yan, Zheng
@ 2014-02-05 16:01   ` Stephane Eranian
  2014-02-06  1:38     ` Yan, Zheng
  0 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 16:01 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:47 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> The callback is invoked when process is scheduled in or out. It
> provides mechanism for later patches to save/store the LBR stack.
> It can also replace the flush branch stack callback.
>
I think you need to say this callback may be invoked on context switches
with per-thread events attached. As far as I understand, the callback cannot
be invoked for system-wide events.

> To avoid unnecessary overhead, the callback is enabled dynamically
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c |  7 +++++
>  arch/x86/kernel/cpu/perf_event.h |  4 +++
>  include/linux/perf_event.h       |  8 ++++++
>  kernel/events/core.c             | 60 +++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 78 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 8e13293..6703d17 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1846,6 +1846,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
>         NULL,
>  };
>
> +static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
> +{
> +       if (x86_pmu.sched_task)
> +               x86_pmu.sched_task(ctx, sched_in);
> +}
> +
>  static void x86_pmu_flush_branch_stack(void)
>  {
>         if (x86_pmu.flush_branch_stack)
> @@ -1879,6 +1885,7 @@ static struct pmu pmu = {
>
>         .event_idx              = x86_pmu_event_idx,
>         .flush_branch_stack     = x86_pmu_flush_branch_stack,
> +       .sched_task             = x86_pmu_sched_task,
>  };
>
>  void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 745f6fb..3fdb751 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -417,6 +417,8 @@ struct x86_pmu {
>
>         void            (*check_microcode)(void);
>         void            (*flush_branch_stack)(void);
> +       void            (*sched_task)(struct perf_event_context *ctx,
> +                                     bool sched_in);
>
>         /*
>          * Intel Arch Perfmon v2+
> @@ -675,6 +677,8 @@ void intel_pmu_pebs_disable_all(void);
>
>  void intel_ds_init(void);
>
> +void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
> +
There is no mention of this function anywhere else. Should not be here.

>  void intel_pmu_lbr_reset(void);
>
>  void intel_pmu_lbr_enable(struct perf_event *event);
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8f4a70f..6a3e603 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -251,6 +251,12 @@ struct pmu {
>          * flush branch stack on context-switches (needed in cpu-wide mode)
>          */
>         void (*flush_branch_stack)      (void);
> +
> +       /*
> +        * PMU callback for context-switches. optional
> +        */
> +       void (*sched_task)              (struct perf_event_context *ctx,
> +                                        bool sched_in);
>  };
>
>  /**
> @@ -546,6 +552,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
>  extern void perf_event_print_debug(void);
>  extern void perf_pmu_disable(struct pmu *pmu);
>  extern void perf_pmu_enable(struct pmu *pmu);
> +extern void perf_sched_cb_disable(struct pmu *pmu);
> +extern void perf_sched_cb_enable(struct pmu *pmu);
>  extern int perf_event_task_disable(void);
>  extern int perf_event_task_enable(void);
>  extern int perf_event_refresh(struct perf_event *event, int refresh);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 89d34f9..d110a23 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -141,6 +141,7 @@ enum event_type_t {
>  struct static_key_deferred perf_sched_events __read_mostly;
>  static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
>  static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
> +static DEFINE_PER_CPU(int, perf_sched_cb_usages);
>
>  static atomic_t nr_mmap_events __read_mostly;
>  static atomic_t nr_comm_events __read_mostly;
> @@ -150,6 +151,7 @@ static atomic_t nr_freq_events __read_mostly;
>  static LIST_HEAD(pmus);
>  static DEFINE_MUTEX(pmus_lock);
>  static struct srcu_struct pmus_srcu;
> +static struct idr pmu_idr;
>
>  /*
>   * perf event paranoia level:
> @@ -2327,6 +2329,57 @@ unlock:
>         }
>  }
>
> +void perf_sched_cb_disable(struct pmu *pmu)
> +{
> +       __get_cpu_var(perf_sched_cb_usages)--;
> +}
> +
> +void perf_sched_cb_enable(struct pmu *pmu)
> +{
> +       __get_cpu_var(perf_sched_cb_usages)++;
> +}
> +
I think you want to use jump_labels instead of this to make
the callback optional. This is already used all over the place
in the generic code.
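
A rough sketch of that suggestion (not something posted in this series;
the key name is made up here, and it ignores the per-cpu nature of the
counter above):

    /* illustrative only: a global jump label gating the callback */
    static struct static_key perf_sched_cb_key = STATIC_KEY_INIT_FALSE;

    void perf_sched_cb_enable(struct pmu *pmu)
    {
            static_key_slow_inc(&perf_sched_cb_key);
    }

    void perf_sched_cb_disable(struct pmu *pmu)
    {
            static_key_slow_dec(&perf_sched_cb_key);
    }

    /* call site in __perf_event_task_sched_out(): */
    if (static_key_false(&perf_sched_cb_key))
            perf_pmu_sched_task(task, next, false);

That way the scheduler fast path only pays for a patched branch while no
event needs the callback.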

> +/*
> + * This function provides the context switch callback to the lower code
> + * layer. It is invoked ONLY when the context switch callback is enabled.
> + */
> +static void perf_pmu_sched_task(struct task_struct *prev,
> +                               struct task_struct *next,
> +                               bool sched_in)
> +{
> +       struct perf_cpu_context *cpuctx;
> +       struct pmu *pmu;
> +       unsigned long flags;
> +
> +       if (prev == next)
> +               return;
> +
> +       local_irq_save(flags);
> +
> +       rcu_read_lock();
> +
> +       pmu = idr_find(&pmu_idr, PERF_TYPE_RAW);
> +
> +       if (pmu && pmu->sched_task) {
> +               cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> +               pmu = cpuctx->ctx.pmu;
> +
> +               perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +               perf_pmu_disable(pmu);
> +
> +               pmu->sched_task(cpuctx->task_ctx, sched_in);
> +
> +               perf_pmu_enable(pmu);
> +
> +               perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +       }
> +
> +       rcu_read_unlock();
> +
> +       local_irq_restore(flags);
> +}
> +
>  #define for_each_task_context_nr(ctxn)                                 \
>         for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
>
> @@ -2346,6 +2399,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>  {
>         int ctxn;
>
> +       if (__get_cpu_var(perf_sched_cb_usages))
> +               perf_pmu_sched_task(task, next, false);
> +
>         for_each_task_context_nr(ctxn)
>                 perf_event_context_sched_out(task, ctxn, next);
>
> @@ -2605,6 +2661,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
>         /* check for system-wide branch_stack events */
>         if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
>                 perf_branch_stack_sched_in(prev, task);
> +
> +       if (__get_cpu_var(perf_sched_cb_usages))
> +               perf_pmu_sched_task(prev, task, true);
>  }
>
>  static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
> @@ -6326,7 +6385,6 @@ static void free_pmu_context(struct pmu *pmu)
>  out:
>         mutex_unlock(&pmus_lock);
>  }
> -static struct idr pmu_idr;
>
>  static ssize_t
>  type_show(struct device *dev, struct device_attribute *attr, char *page)
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 03/14] perf, x86: use context switch callback to flush LBR stack
  2014-01-03  5:48 ` [PATCH 03/14] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
@ 2014-02-05 16:34   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 16:34 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> Enable the pmu context switch callback when LBR is used. Use the
> callback to flush LBR stack when task is scheduled in. This allows
> us to move code that flushes LBR stack from perf core to perf x86.
>
You forgot to remove perf_event_context.nr_branch_stack which
does not appear to be needed anymore.

> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c           |  7 ---
>  arch/x86/kernel/cpu/perf_event.h           |  2 -
>  arch/x86/kernel/cpu/perf_event_intel.c     | 14 +-----
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 32 ++++++++-----
>  include/linux/perf_event.h                 |  5 ---
>  kernel/events/core.c                       | 72 ------------------------------
>  6 files changed, 21 insertions(+), 111 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 6703d17..69e2095 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1852,12 +1852,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
>                 x86_pmu.sched_task(ctx, sched_in);
>  }
>
> -static void x86_pmu_flush_branch_stack(void)
> -{
> -       if (x86_pmu.flush_branch_stack)
> -               x86_pmu.flush_branch_stack();
> -}
> -
>  void perf_check_microcode(void)
>  {
>         if (x86_pmu.check_microcode)
> @@ -1884,7 +1878,6 @@ static struct pmu pmu = {
>         .commit_txn             = x86_pmu_commit_txn,
>
>         .event_idx              = x86_pmu_event_idx,
> -       .flush_branch_stack     = x86_pmu_flush_branch_stack,
>         .sched_task             = x86_pmu_sched_task,
>  };
>
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 3fdb751..80b8e83 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -150,7 +150,6 @@ struct cpu_hw_events {
>          * Intel LBR bits
>          */
>         int                             lbr_users;
> -       void                            *lbr_context;
>         struct perf_branch_stack        lbr_stack;
>         struct perf_branch_entry        lbr_entries[MAX_LBR_ENTRIES];
>         struct er_account               *lbr_sel;
> @@ -416,7 +415,6 @@ struct x86_pmu {
>         void            (*cpu_dead)(int cpu);
>
>         void            (*check_microcode)(void);
> -       void            (*flush_branch_stack)(void);
>         void            (*sched_task)(struct perf_event_context *ctx,
>                                       bool sched_in);
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 0fa4f24..4325bae 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -2038,18 +2038,6 @@ static void intel_pmu_cpu_dying(int cpu)
>         fini_debug_store_on_cpu(cpu);
>  }
>
> -static void intel_pmu_flush_branch_stack(void)
> -{
> -       /*
> -        * Intel LBR does not tag entries with the
> -        * PID of the current task, then we need to
> -        * flush it on ctxsw
> -        * For now, we simply reset it
> -        */
> -       if (x86_pmu.lbr_nr)
> -               intel_pmu_lbr_reset();
> -}
> -
>  PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
>
>  PMU_FORMAT_ATTR(ldlat, "config1:0-15");
> @@ -2101,7 +2089,7 @@ static __initconst const struct x86_pmu intel_pmu = {
>         .cpu_starting           = intel_pmu_cpu_starting,
>         .cpu_dying              = intel_pmu_cpu_dying,
>         .guest_get_msrs         = intel_guest_get_msrs,
> -       .flush_branch_stack     = intel_pmu_flush_branch_stack,
> +       .sched_task             = intel_pmu_lbr_sched_task,
>  };
>
>  static __init void intel_clovertown_quirk(void)
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 1ae2ec5..7ff2a99 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -177,24 +177,32 @@ void intel_pmu_lbr_reset(void)
>                 intel_pmu_lbr_reset_64();
>  }
>
> -void intel_pmu_lbr_enable(struct perf_event *event)
> +void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>  {
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -
>         if (!x86_pmu.lbr_nr)
>                 return;
>
>         /*
> -        * Reset the LBR stack if we changed task context to
> -        * avoid data leaks.
> +        * It is necessary to flush the stack on context switch. This happens
> +        * when the branch stack does not tag its entries with the pid of the
> +        * current task.
>          */
> -       if (event->ctx->task && cpuc->lbr_context != event->ctx) {
> +       if (sched_in)
>                 intel_pmu_lbr_reset();
> -               cpuc->lbr_context = event->ctx;
> -       }
> +}
> +
> +void intel_pmu_lbr_enable(struct perf_event *event)
> +{
> +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +
> +       if (!x86_pmu.lbr_nr)
> +               return;
> +
>         cpuc->br_sel = event->hw.branch_reg.reg;
>
>         cpuc->lbr_users++;
> +       if (cpuc->lbr_users == 1)
> +               perf_sched_cb_enable(event->ctx->pmu);
>  }
>
>  void intel_pmu_lbr_disable(struct perf_event *event)
> @@ -207,10 +215,10 @@ void intel_pmu_lbr_disable(struct perf_event *event)
>         cpuc->lbr_users--;
>         WARN_ON_ONCE(cpuc->lbr_users < 0);
>
> -       if (cpuc->enabled && !cpuc->lbr_users) {
> -               __intel_pmu_lbr_disable();
> -               /* avoid stale pointer */
> -               cpuc->lbr_context = NULL;
> +       if (!cpuc->lbr_users) {
> +               perf_sched_cb_disable(event->ctx->pmu);
> +               if (cpuc->enabled)
> +                       __intel_pmu_lbr_disable();
>         }
>  }
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 6a3e603..96cb88b 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -248,11 +248,6 @@ struct pmu {
>         int (*event_idx)                (struct perf_event *event); /*optional */
>
>         /*
> -        * flush branch stack on context-switches (needed in cpu-wide mode)
> -        */
> -       void (*flush_branch_stack)      (void);
> -
> -       /*
>          * PMU callback for context-switches. optional
>          */
>         void (*sched_task)              (struct perf_event_context *ctx,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d110a23..aba4d6d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -140,7 +140,6 @@ enum event_type_t {
>   */
>  struct static_key_deferred perf_sched_events __read_mostly;
>  static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> -static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
>  static DEFINE_PER_CPU(int, perf_sched_cb_usages);
>
>  static atomic_t nr_mmap_events __read_mostly;
> @@ -2566,65 +2565,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
>         perf_pmu_rotate_start(ctx->pmu);
>  }
>
> -/*
> - * When sampling the branck stack in system-wide, it may be necessary
> - * to flush the stack on context switch. This happens when the branch
> - * stack does not tag its entries with the pid of the current task.
> - * Otherwise it becomes impossible to associate a branch entry with a
> - * task. This ambiguity is more likely to appear when the branch stack
> - * supports priv level filtering and the user sets it to monitor only
> - * at the user level (which could be a useful measurement in system-wide
> - * mode). In that case, the risk is high of having a branch stack with
> - * branch from multiple tasks. Flushing may mean dropping the existing
> - * entries or stashing them somewhere in the PMU specific code layer.
> - *
> - * This function provides the context switch callback to the lower code
> - * layer. It is invoked ONLY when there is at least one system-wide context
> - * with at least one active event using taken branch sampling.
> - */
> -static void perf_branch_stack_sched_in(struct task_struct *prev,
> -                                      struct task_struct *task)
> -{
> -       struct perf_cpu_context *cpuctx;
> -       struct pmu *pmu;
> -       unsigned long flags;
> -
> -       /* no need to flush branch stack if not changing task */
> -       if (prev == task)
> -               return;
> -
> -       local_irq_save(flags);
> -
> -       rcu_read_lock();
> -
> -       list_for_each_entry_rcu(pmu, &pmus, entry) {
> -               cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> -
> -               /*
> -                * check if the context has at least one
> -                * event using PERF_SAMPLE_BRANCH_STACK
> -                */
> -               if (cpuctx->ctx.nr_branch_stack > 0
> -                   && pmu->flush_branch_stack) {
> -
> -                       pmu = cpuctx->ctx.pmu;
> -
> -                       perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> -
> -                       perf_pmu_disable(pmu);
> -
> -                       pmu->flush_branch_stack();
> -
> -                       perf_pmu_enable(pmu);
> -
> -                       perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> -               }
> -       }
> -
> -       rcu_read_unlock();
> -
> -       local_irq_restore(flags);
> -}
>
>  /*
>   * Called from scheduler to add the events of the current task
> @@ -2658,10 +2598,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
>         if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
>                 perf_cgroup_sched_in(prev, task);
>
> -       /* check for system-wide branch_stack events */
> -       if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
> -               perf_branch_stack_sched_in(prev, task);
> -
>         if (__get_cpu_var(perf_sched_cb_usages))
>                 perf_pmu_sched_task(prev, task, true);
>  }
> @@ -3226,10 +3162,6 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
>         if (event->parent)
>                 return;
>
> -       if (has_branch_stack(event)) {
> -               if (!(event->attach_state & PERF_ATTACH_TASK))
> -                       atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
> -       }
>         if (is_cgroup_event(event))
>                 atomic_dec(&per_cpu(perf_cgroup_events, cpu));
>  }
> @@ -6655,10 +6587,6 @@ static void account_event_cpu(struct perf_event *event, int cpu)
>         if (event->parent)
>                 return;
>
> -       if (has_branch_stack(event)) {
> -               if (!(event->attach_state & PERF_ATTACH_TASK))
> -                       atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
> -       }
>         if (is_cgroup_event(event))
>                 atomic_inc(&per_cpu(perf_cgroup_events, cpu));
>  }
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 05/14] perf, core: allow pmu specific data for perf task context
  2014-01-03  5:48 ` [PATCH 05/14] perf, core: allow pmu specific data for perf task context Yan, Zheng
@ 2014-02-05 16:57   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 16:57 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> Later patches will use pmu specific data to save the LBR stack.
>
I think the changelog could be more descriptive here.
Explain what you add.

Reviewed-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  include/linux/perf_event.h |  5 +++++
>  kernel/events/core.c       | 19 ++++++++++++++++++-
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 96cb88b..147f9d3 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -252,6 +252,10 @@ struct pmu {
>          */
>         void (*sched_task)              (struct perf_event_context *ctx,
>                                          bool sched_in);
> +       /*
> +        * PMU specific data size
> +        */
> +       size_t                          task_ctx_size;
>  };
>
>  /**
> @@ -496,6 +500,7 @@ struct perf_event_context {
>         int                             pin_count;
>         int                             nr_cgroups;      /* cgroup evts */
>         int                             nr_branch_stack; /* branch_stack evt */
> +       void                            *task_ctx_data; /* pmu specific data */
>         struct rcu_head                 rcu_head;
>  };
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index aba4d6d..b6650ab 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -883,6 +883,15 @@ static void get_ctx(struct perf_event_context *ctx)
>         WARN_ON(!atomic_inc_not_zero(&ctx->refcount));
>  }
>
> +static void free_ctx(struct rcu_head *head)
> +{
> +       struct perf_event_context *ctx;
> +
> +       ctx = container_of(head, struct perf_event_context, rcu_head);
> +       kfree(ctx->task_ctx_data);
> +       kfree(ctx);
> +}
> +
>  static void put_ctx(struct perf_event_context *ctx)
>  {
>         if (atomic_dec_and_test(&ctx->refcount)) {
> @@ -890,7 +899,7 @@ static void put_ctx(struct perf_event_context *ctx)
>                         put_ctx(ctx->parent_ctx);
>                 if (ctx->task)
>                         put_task_struct(ctx->task);
> -               kfree_rcu(ctx, rcu_head);
> +               call_rcu(&ctx->rcu_head, free_ctx);
>         }
>  }
>
> @@ -3020,6 +3029,14 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
>         if (!ctx)
>                 return NULL;
>
> +       if (task && pmu->task_ctx_size > 0) {
> +               ctx->task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
> +               if (!ctx->task_ctx_data) {
> +                       kfree(ctx);
> +                       return NULL;
> +               }
> +       }
> +
>         __perf_event_init_context(ctx);
>         if (task) {
>                 ctx->task = task;
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/14] perf, core: always switch pmu specific data during context switch
  2014-01-03  5:48 ` [PATCH 06/14] perf, core: always switch pmu specific data during context switch Yan, Zheng
@ 2014-02-05 17:19   ` Stephane Eranian
  2014-02-05 17:55     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 17:19 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> If two tasks were both forked from the same parent task, events in their perf
> task contexts can be the same. Perf core optimizes the context switch out in this
> case.
>
> The previous patch introduces pmu specific data. The data is task specific, so we
> should switch the data even when the context switch is optimized out.
>
Reviewed-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  kernel/events/core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b6650ab..d6d8dea 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2319,6 +2319,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
>                         next->perf_event_ctxp[ctxn] = ctx;
>                         ctx->task = next;
>                         next_ctx->task = task;
> +                       ctx->task_ctx_data = xchg(&next_ctx->task_ctx_data,
> +                                                 ctx->task_ctx_data);
>                         do_switch = 0;
>
>                         perf_event_sync_stat(ctx, next_ctx);
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/14] perf, x86: allocate space for storing LBR stack
  2014-01-03  5:48 ` [PATCH 08/14] perf, x86: allocate space for storing LBR stack Yan, Zheng
@ 2014-02-05 17:26   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 17:26 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> When the LBR call stack is enabled, it is necessary to save/restore
> the LBR stack on context switch. We can use pmu specific data to
> store LBR stack when task is scheduled out. This patch adds code
> that allocates the pmu specific data.
>
Reviewed-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c | 1 +
>  arch/x86/kernel/cpu/perf_event.h | 7 +++++++
>  2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 69e2095..2e43f1b 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1879,6 +1879,7 @@ static struct pmu pmu = {
>
>         .event_idx              = x86_pmu_event_idx,
>         .sched_task             = x86_pmu_sched_task,
> +       .task_ctx_size          = sizeof(struct x86_perf_task_context),
>  };
>
>  void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 3ef4b79..3ed9629 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -459,6 +459,13 @@ struct x86_pmu {
>         struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
>  };
>
> +struct x86_perf_task_context {
> +       u64 lbr_from[MAX_LBR_ENTRIES];
> +       u64 lbr_to[MAX_LBR_ENTRIES];
> +       int lbr_callstack_users;
> +       int lbr_stack_state;
> +};
> +
>  enum {
>         PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
>         PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch
  2014-01-03  5:48 ` [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
@ 2014-02-05 17:45   ` Stephane Eranian
  2014-02-06 15:09     ` Stephane Eranian
  0 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 17:45 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> When the LBR call stack is enabled, it is necessary to save/restore
> the LBR stack on context switch. The solution is saving/restoring
> the LBR stack to/from task's perf event context.
>
> The LBR stack is saved/restored only when there are events that use
> the LBR call stack. If no event uses LBR call stack, the LBR stack
> is reset when task is scheduled in.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 80 ++++++++++++++++++++++++------
>  1 file changed, 66 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 2137a9f..51e1842 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -187,18 +187,82 @@ void intel_pmu_lbr_reset(void)
>                 intel_pmu_lbr_reset_64();
>  }
>
> +/*
> + * TOS = most recently recorded branch
> + */
> +static inline u64 intel_pmu_lbr_tos(void)
> +{
> +       u64 tos;
> +       rdmsrl(x86_pmu.lbr_tos, tos);
> +       return tos;
> +}
> +
> +enum {
> +       LBR_UNINIT,
> +       LBR_NONE,
> +       LBR_VALID,
> +};
> +
I don't see where the x86_perf_task_context struct gets initialized with
your task_ctx_data/task_ctx_size mechanism. You are relying on 0
as a valid default value. But if later more fields are needed and they need
non-zero init values, it will be easy to forget.....

So I think you need to provide a callback from alloc_perf_context().
Should have mentioned that in Patch 05/14.
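
For illustration only (not a hunk from this series; the hook name
task_ctx_alloc and the x86 helper are invented here), such a constructor
could look roughly like:

    /* generic side, in alloc_perf_context(): let the pmu build the
     * object instead of kzalloc'ing task_ctx_size bytes */
    if (task && pmu->task_ctx_alloc) {
            ctx->task_ctx_data = pmu->task_ctx_alloc();
            if (!ctx->task_ctx_data) {
                    kfree(ctx);
                    return NULL;
            }
    }

    /* x86 side: non-zero defaults would live in one place */
    static void *x86_pmu_task_ctx_alloc(void)
    {
            struct x86_perf_task_context *task_ctx;

            task_ctx = kzalloc(sizeof(*task_ctx), GFP_KERNEL);
            if (task_ctx)
                    task_ctx->lbr_stack_state = LBR_UNINIT;
            return task_ctx;
    }

LBR_UNINIT happens to be 0 today, but with a constructor that stays
explicit if the defaults ever stop being all-zero.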

> +static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> +{
> +       int i;
> +       unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
> +       u64 tos = intel_pmu_lbr_tos();
> +
> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +               lbr_idx = (tos - i) & mask;
> +               wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> +               wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
> +       }
> +       task_ctx->lbr_stack_state = LBR_NONE;
> +}
> +
> +static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
> +{
> +       int i;
> +       unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
> +       u64 tos = intel_pmu_lbr_tos();
> +
> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +               lbr_idx = (tos - i) & mask;
> +               rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> +               rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
> +       }
> +       task_ctx->lbr_stack_state = LBR_VALID;
> +}
> +
> +
>  void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>  {
> +       struct cpu_hw_events *cpuc;
> +       struct x86_perf_task_context *task_ctx;
> +
>         if (!x86_pmu.lbr_nr)
>                 return;
>
> +       cpuc = &__get_cpu_var(cpu_hw_events);
> +       task_ctx = ctx ? ctx->task_ctx_data : NULL;
> +
> +
>         /*
>          * It is necessary to flush the stack on context switch. This happens
>          * when the branch stack does not tag its entries with the pid of the
>          * current task.
>          */
> -       if (sched_in)
> -               intel_pmu_lbr_reset();
> +       if (sched_in) {
> +               if (!task_ctx ||
> +                   !task_ctx->lbr_callstack_users ||
> +                   task_ctx->lbr_stack_state != LBR_VALID)
> +                       intel_pmu_lbr_reset();
> +               else
> +                       __intel_pmu_lbr_restore(task_ctx);
> +       } else if (task_ctx) {
> +               if (task_ctx->lbr_callstack_users &&
> +                   task_ctx->lbr_stack_state != LBR_UNINIT)
> +                       __intel_pmu_lbr_save(task_ctx);
> +               else
> +                       task_ctx->lbr_stack_state = LBR_NONE;
> +       }
>  }
>
There ought to be a better way of structuring this if/else. It is
ugly.
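
One way to flatten it while keeping the same behaviour (just a sketch,
not what was posted):

    /* sketch: same logic as the hunk above, restructured */
    if (sched_in) {
            if (task_ctx && task_ctx->lbr_callstack_users &&
                task_ctx->lbr_stack_state == LBR_VALID)
                    __intel_pmu_lbr_restore(task_ctx);
            else
                    intel_pmu_lbr_reset();
            return;
    }

    /* sched out */
    if (!task_ctx)
            return;

    if (task_ctx->lbr_callstack_users &&
        task_ctx->lbr_stack_state != LBR_UNINIT)
            __intel_pmu_lbr_save(task_ctx);
    else
            task_ctx->lbr_stack_state = LBR_NONE;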

>  static inline bool branch_user_callstack(unsigned br_sel)
> @@ -267,18 +331,6 @@ void intel_pmu_lbr_disable_all(void)
>                 __intel_pmu_lbr_disable();
>  }
>
> -/*
> - * TOS = most recently recorded branch
> - */
> -static inline u64 intel_pmu_lbr_tos(void)
> -{
> -       u64 tos;
> -
> -       rdmsrl(x86_pmu.lbr_tos, tos);
> -
> -       return tos;
> -}
> -
>  static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
>  {
>         unsigned long mask = x86_pmu.lbr_nr - 1;
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/14] perf, core: always switch pmu specific data during context switch
  2014-02-05 17:19   ` Stephane Eranian
@ 2014-02-05 17:55     ` Peter Zijlstra
  2014-02-05 18:35       ` Stephane Eranian
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2014-02-05 17:55 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Yan, Zheng, LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Wed, Feb 05, 2014 at 06:19:27PM +0100, Stephane Eranian wrote:
> On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> > If two tasks were both forked from the same parent task, events in their perf
> > task contexts can be the same. Perf core optimizes the context switch out in this
> > case.
> >
> > The previous patch introduces pmu specific data. The data is task specific, so we
> > should switch the data even when the context switch is optimized out.
> >
> Reviewed-by: Stephane Eranian <eranian@google.com>

You should look again.. that xchg() is an atomic op and a total waste of
time since the assignment back onto ctx->task_ctx_data is non-atomic.

Complete fail there.
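
What that boils down to (sketch only): both contexts are held locked at
this point, so the plain swap() helper from <linux/kernel.h> is enough:

    /* sketch: ordinary exchange, no atomic instruction needed here */
    swap(ctx->task_ctx_data, next_ctx->task_ctx_data);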

> > Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> > ---
> >  kernel/events/core.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index b6650ab..d6d8dea 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2319,6 +2319,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
> >                         next->perf_event_ctxp[ctxn] = ctx;
> >                         ctx->task = next;
> >                         next_ctx->task = task;
> > +                       ctx->task_ctx_data = xchg(&next_ctx->task_ctx_data,
> > +                                                 ctx->task_ctx_data);
> >                         do_switch = 0;
> >
> >                         perf_event_sync_stat(ctx, next_ctx);
> > --
> > 1.8.4.2
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/14] perf, core: always switch pmu specific data during context switch
  2014-02-05 17:55     ` Peter Zijlstra
@ 2014-02-05 18:35       ` Stephane Eranian
  2014-02-06  2:08         ` Yan, Zheng
  0 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-02-05 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yan, Zheng, LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Wed, Feb 5, 2014 at 6:55 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Feb 05, 2014 at 06:19:27PM +0100, Stephane Eranian wrote:
>> On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> > If two tasks were both forked from the same parent task, events in their perf
>> > task contexts can be the same. Perf core optimizes the context switch out in this
>> > case.
>> >
>> > The previous patch introduces pmu specific data. The data is task specific, so we
>> > should switch the data even when the context switch is optimized out.
>> >
>> Reviewed-by: Stephane Eranian <eranian@google.com>
>
> You should look again.. that xchg() is an atomic op and a total waste of
> time since the assignment back onto ctx->task_ctx_data is non-atomic.
>
> Complete fail there.
>
I admit, it was not clear to me why the xchg().

>> > Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> > ---
>> >  kernel/events/core.c | 2 ++
>> >  1 file changed, 2 insertions(+)
>> >
>> > diff --git a/kernel/events/core.c b/kernel/events/core.c
>> > index b6650ab..d6d8dea 100644
>> > --- a/kernel/events/core.c
>> > +++ b/kernel/events/core.c
>> > @@ -2319,6 +2319,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
>> >                         next->perf_event_ctxp[ctxn] = ctx;
>> >                         ctx->task = next;
>> >                         next_ctx->task = task;
>> > +                       ctx->task_ctx_data = xchg(&next_ctx->task_ctx_data,
>> > +                                                 ctx->task_ctx_data);
>> >                         do_switch = 0;
>> >
>> >                         perf_event_sync_stat(ctx, next_ctx);
>> > --
>> > 1.8.4.2
>> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/14] perf, core: introduce pmu context switch callback
  2014-02-05 16:01   ` Stephane Eranian
@ 2014-02-06  1:38     ` Yan, Zheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-02-06  1:38 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On 02/06/2014 12:01 AM, Stephane Eranian wrote:
> On Fri, Jan 3, 2014 at 6:47 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> The callback is invoked when a process is scheduled in or out. It
>> provides a mechanism for later patches to save/restore the LBR stack.
>> It can also replace the flush branch stack callback.
>>
> I think you need to say this callback may be invoked on context switches
> with per-thread events attached. As far as I understand, the callback cannot
> be invoked for system-wide events.

It's also invoked when there are only system-wide events (the flush branch stack case).

Regards
Yan, Zheng


> 
>> To avoid unnecessary overhead, the callback is enabled dynamically
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  arch/x86/kernel/cpu/perf_event.c |  7 +++++
>>  arch/x86/kernel/cpu/perf_event.h |  4 +++
>>  include/linux/perf_event.h       |  8 ++++++
>>  kernel/events/core.c             | 60 +++++++++++++++++++++++++++++++++++++++-
>>  4 files changed, 78 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> index 8e13293..6703d17 100644
>> --- a/arch/x86/kernel/cpu/perf_event.c
>> +++ b/arch/x86/kernel/cpu/perf_event.c
>> @@ -1846,6 +1846,12 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
>>         NULL,
>>  };
>>
>> +static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
>> +{
>> +       if (x86_pmu.sched_task)
>> +               x86_pmu.sched_task(ctx, sched_in);
>> +}
>> +
>>  static void x86_pmu_flush_branch_stack(void)
>>  {
>>         if (x86_pmu.flush_branch_stack)
>> @@ -1879,6 +1885,7 @@ static struct pmu pmu = {
>>
>>         .event_idx              = x86_pmu_event_idx,
>>         .flush_branch_stack     = x86_pmu_flush_branch_stack,
>> +       .sched_task             = x86_pmu_sched_task,
>>  };
>>
>>  void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
>> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
>> index 745f6fb..3fdb751 100644
>> --- a/arch/x86/kernel/cpu/perf_event.h
>> +++ b/arch/x86/kernel/cpu/perf_event.h
>> @@ -417,6 +417,8 @@ struct x86_pmu {
>>
>>         void            (*check_microcode)(void);
>>         void            (*flush_branch_stack)(void);
>> +       void            (*sched_task)(struct perf_event_context *ctx,
>> +                                     bool sched_in);
>>
>>         /*
>>          * Intel Arch Perfmon v2+
>> @@ -675,6 +677,8 @@ void intel_pmu_pebs_disable_all(void);
>>
>>  void intel_ds_init(void);
>>
>> +void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
>> +
> There is no mention of this function anywhere else. Should not be here.
> 
>>  void intel_pmu_lbr_reset(void);
>>
>>  void intel_pmu_lbr_enable(struct perf_event *event);
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 8f4a70f..6a3e603 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -251,6 +251,12 @@ struct pmu {
>>          * flush branch stack on context-switches (needed in cpu-wide mode)
>>          */
>>         void (*flush_branch_stack)      (void);
>> +
>> +       /*
>> +        * PMU callback for context-switches. optional
>> +        */
>> +       void (*sched_task)              (struct perf_event_context *ctx,
>> +                                        bool sched_in);
>>  };
>>
>>  /**
>> @@ -546,6 +552,8 @@ extern void perf_event_delayed_put(struct task_struct *task);
>>  extern void perf_event_print_debug(void);
>>  extern void perf_pmu_disable(struct pmu *pmu);
>>  extern void perf_pmu_enable(struct pmu *pmu);
>> +extern void perf_sched_cb_disable(struct pmu *pmu);
>> +extern void perf_sched_cb_enable(struct pmu *pmu);
>>  extern int perf_event_task_disable(void);
>>  extern int perf_event_task_enable(void);
>>  extern int perf_event_refresh(struct perf_event *event, int refresh);
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 89d34f9..d110a23 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -141,6 +141,7 @@ enum event_type_t {
>>  struct static_key_deferred perf_sched_events __read_mostly;
>>  static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
>>  static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
>> +static DEFINE_PER_CPU(int, perf_sched_cb_usages);
>>
>>  static atomic_t nr_mmap_events __read_mostly;
>>  static atomic_t nr_comm_events __read_mostly;
>> @@ -150,6 +151,7 @@ static atomic_t nr_freq_events __read_mostly;
>>  static LIST_HEAD(pmus);
>>  static DEFINE_MUTEX(pmus_lock);
>>  static struct srcu_struct pmus_srcu;
>> +static struct idr pmu_idr;
>>
>>  /*
>>   * perf event paranoia level:
>> @@ -2327,6 +2329,57 @@ unlock:
>>         }
>>  }
>>
>> +void perf_sched_cb_disable(struct pmu *pmu)
>> +{
>> +       __get_cpu_var(perf_sched_cb_usages)--;
>> +}
>> +
>> +void perf_sched_cb_enable(struct pmu *pmu)
>> +{
>> +       __get_cpu_var(perf_sched_cb_usages)++;
>> +}
>> +
> I think you want to use jump_labels instead of this to make
> the callback optional. This is already used all over the place
> in the generic code.
> 
>> +/*
>> + * This function provides the context switch callback to the lower code
>> + * layer. It is invoked ONLY when the context switch callback is enabled.
>> + */
>> +static void perf_pmu_sched_task(struct task_struct *prev,
>> +                               struct task_struct *next,
>> +                               bool sched_in)
>> +{
>> +       struct perf_cpu_context *cpuctx;
>> +       struct pmu *pmu;
>> +       unsigned long flags;
>> +
>> +       if (prev == next)
>> +               return;
>> +
>> +       local_irq_save(flags);
>> +
>> +       rcu_read_lock();
>> +
>> +       pmu = idr_find(&pmu_idr, PERF_TYPE_RAW);
>> +
>> +       if (pmu && pmu->sched_task) {
>> +               cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
>> +               pmu = cpuctx->ctx.pmu;
>> +
>> +               perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +               perf_pmu_disable(pmu);
>> +
>> +               pmu->sched_task(cpuctx->task_ctx, sched_in);
>> +
>> +               perf_pmu_enable(pmu);
>> +
>> +               perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +       }
>> +
>> +       rcu_read_unlock();
>> +
>> +       local_irq_restore(flags);
>> +}
>> +
>>  #define for_each_task_context_nr(ctxn)                                 \
>>         for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
>>
>> @@ -2346,6 +2399,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
>>  {
>>         int ctxn;
>>
>> +       if (__get_cpu_var(perf_sched_cb_usages))
>> +               perf_pmu_sched_task(task, next, false);
>> +
>>         for_each_task_context_nr(ctxn)
>>                 perf_event_context_sched_out(task, ctxn, next);
>>
>> @@ -2605,6 +2661,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
>>         /* check for system-wide branch_stack events */
>>         if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
>>                 perf_branch_stack_sched_in(prev, task);
>> +
>> +       if (__get_cpu_var(perf_sched_cb_usages))
>> +               perf_pmu_sched_task(prev, task, true);
>>  }
>>
>>  static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
>> @@ -6326,7 +6385,6 @@ static void free_pmu_context(struct pmu *pmu)
>>  out:
>>         mutex_unlock(&pmus_lock);
>>  }
>> -static struct idr pmu_idr;
>>
>>  static ssize_t
>>  type_show(struct device *dev, struct device_attribute *attr, char *page)
>> --
>> 1.8.4.2
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support
  2014-02-05 15:40   ` Stephane Eranian
@ 2014-02-06  1:52     ` Yan, Zheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-02-06  1:52 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On 02/05/2014 11:40 PM, Stephane Eranian wrote:
> On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> When the call stack feature is enabled, the LBR stack will capture
>> unfiltered call data normally, but as return instructions are executed,
>> the last captured branch record is flushed from the on-chip registers
>> in a last-in first-out (LIFO) manner. Thus, branch information relative
>> to leaf functions will not be captured, while preserving the call stack
>> information of the main line execution path.
>>
> This is a generic description of the LBR call stack feature. It does not
> describe what the patch actually does, which is to implement the basic
> internal infrastructure for CALL_STACK mode using the LBR call stack.
> 
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  arch/x86/kernel/cpu/perf_event.h           |  7 ++-
>>  arch/x86/kernel/cpu/perf_event_intel.c     |  2 +-
>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 98 +++++++++++++++++++++++-------
>>  3 files changed, 82 insertions(+), 25 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
>> index 80b8e83..3ef4b79 100644
>> --- a/arch/x86/kernel/cpu/perf_event.h
>> +++ b/arch/x86/kernel/cpu/perf_event.h
>> @@ -460,7 +460,10 @@ struct x86_pmu {
>>  };
>>
>>  enum {
>> -       PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
>> +       PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
>> +       PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
>> +
>> +       PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
>>  };
>>
>>  #define x86_add_quirk(func_)                                           \
>> @@ -697,6 +700,8 @@ void intel_pmu_lbr_init_atom(void);
>>
>>  void intel_pmu_lbr_init_snb(void);
>>
>> +void intel_pmu_lbr_init_hsw(void);
>> +
>>  int intel_pmu_setup_lbr_filter(struct perf_event *event);
>>
>>  int p4_pmu_init(void);
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
>> index 4325bae..84a1c09 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -2494,7 +2494,7 @@ __init int intel_pmu_init(void)
>>                 memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
>>                 memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
>>
>> -               intel_pmu_lbr_init_snb();
>> +               intel_pmu_lbr_init_hsw();
>>
>>                 x86_pmu.event_constraints = intel_hsw_event_constraints;
>>                 x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> index 7ff2a99..bdd8758 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> @@ -39,6 +39,7 @@ static enum {
>>  #define LBR_IND_JMP_BIT                6 /* do not capture indirect jumps */
>>  #define LBR_REL_JMP_BIT                7 /* do not capture relative jumps */
>>  #define LBR_FAR_BIT            8 /* do not capture far branches */
>> +#define LBR_CALL_STACK_BIT     9 /* enable call stack */
>>
>>  #define LBR_KERNEL     (1 << LBR_KERNEL_BIT)
>>  #define LBR_USER       (1 << LBR_USER_BIT)
>> @@ -49,6 +50,7 @@ static enum {
>>  #define LBR_REL_JMP    (1 << LBR_REL_JMP_BIT)
>>  #define LBR_IND_JMP    (1 << LBR_IND_JMP_BIT)
>>  #define LBR_FAR                (1 << LBR_FAR_BIT)
>> +#define LBR_CALL_STACK (1 << LBR_CALL_STACK_BIT)
>>
>>  #define LBR_PLM (LBR_KERNEL | LBR_USER)
>>
>> @@ -74,24 +76,25 @@ static enum {
>>   * x86control flow changes include branches, interrupts, traps, faults
>>   */
>>  enum {
>> -       X86_BR_NONE     = 0,      /* unknown */
>> -
>> -       X86_BR_USER     = 1 << 0, /* branch target is user */
>> -       X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
>> -
>> -       X86_BR_CALL     = 1 << 2, /* call */
>> -       X86_BR_RET      = 1 << 3, /* return */
>> -       X86_BR_SYSCALL  = 1 << 4, /* syscall */
>> -       X86_BR_SYSRET   = 1 << 5, /* syscall return */
>> -       X86_BR_INT      = 1 << 6, /* sw interrupt */
>> -       X86_BR_IRET     = 1 << 7, /* return from interrupt */
>> -       X86_BR_JCC      = 1 << 8, /* conditional */
>> -       X86_BR_JMP      = 1 << 9, /* jump */
>> -       X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
>> -       X86_BR_IND_CALL = 1 << 11,/* indirect calls */
>> -       X86_BR_ABORT    = 1 << 12,/* transaction abort */
>> -       X86_BR_IN_TX    = 1 << 13,/* in transaction */
>> -       X86_BR_NO_TX    = 1 << 14,/* not in transaction */
>> +       X86_BR_NONE             = 0,      /* unknown */
>> +
>> +       X86_BR_USER             = 1 << 0, /* branch target is user */
>> +       X86_BR_KERNEL           = 1 << 1, /* branch target is kernel */
>> +
>> +       X86_BR_CALL             = 1 << 2, /* call */
>> +       X86_BR_RET              = 1 << 3, /* return */
>> +       X86_BR_SYSCALL          = 1 << 4, /* syscall */
>> +       X86_BR_SYSRET           = 1 << 5, /* syscall return */
>> +       X86_BR_INT              = 1 << 6, /* sw interrupt */
>> +       X86_BR_IRET             = 1 << 7, /* return from interrupt */
>> +       X86_BR_JCC              = 1 << 8, /* conditional */
>> +       X86_BR_JMP              = 1 << 9, /* jump */
>> +       X86_BR_IRQ              = 1 << 10,/* hw interrupt or trap or fault */
>> +       X86_BR_IND_CALL         = 1 << 11,/* indirect calls */
>> +       X86_BR_ABORT            = 1 << 12,/* transaction abort */
>> +       X86_BR_IN_TX            = 1 << 13,/* in transaction */
>> +       X86_BR_NO_TX            = 1 << 14,/* not in transaction */
>> +       X86_BR_CALL_STACK       = 1 << 15,/* call stack */
>>  };
>>
>>  #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
>> @@ -135,7 +138,14 @@ static void __intel_pmu_lbr_enable(void)
>>                 wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
>>
>>         rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>> -       debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
>> +       debugctl |= DEBUGCTLMSR_LBR;
>> +       /*
>> +        * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
>> +        * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
>> +        * may cause superfluous increase/decrease of LBR_TOS.
>> +        */
> Is that a bug or a feature?

It's a hardware bug of Haswell.

> 
> That prevents any use of the call-stack mode in the kernel because by the
> time you get to perf_events code, the stack will have been overwritten. You
> can get by if you are only interested in user level execution, since the LBR
> priv level filtering will cause a freeze, though with some skid. I assume you are
> limiting this feature to user priv level by enforcing that users pass the
> PERF_SAMPLE_BRANCH_USER flag.

Yes, this feature is limited to the user priv level.
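
For illustration (not a hunk from this series), the kind of check being
discussed would sit in intel_pmu_setup_sw_lbr_filter() next to the other
PERF_SAMPLE_BRANCH_CALL_STACK validation, along these lines:

    /* illustrative check: call-stack mode only for user-level sampling */
    if ((br_type & PERF_SAMPLE_BRANCH_CALL_STACK) &&
        !(br_type & PERF_SAMPLE_BRANCH_USER))
            return -EINVAL;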

> 
> 
>> +       if (!cpuc->lbr_sel || !(cpuc->lbr_sel->config & LBR_CALL_STACK))
>> +               debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
>>         wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
>>  }
>>
>> @@ -354,7 +364,7 @@ void intel_pmu_lbr_read(void)
>>   * - in case there is no HW filter
>>   * - in case the HW filter has errata or limitations
>>   */
>> -static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>> +static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>>  {
>>         u64 br_type = event->attr.branch_sample_type;
>>         int mask = 0;
>> @@ -388,11 +398,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
>>         if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
>>                 mask |= X86_BR_NO_TX;
>>
>> +       if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
>> +               if (!x86_pmu.lbr_sel_map)
>> +                       return -EOPNOTSUPP;
> 
> I am not sure checking lbr_sel_map here is enough. You need to
> check if the CALL_STACK entry is populated, meaning the HW feature
> exists.
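
As a sketch of that extra check (not from the series), the map entry
itself can be tested, so that CPUs with an lbr_sel_map but no call-stack
support are rejected too:

    /* sketch: reject if the hw map has no call-stack entry */
    if (!x86_pmu.lbr_sel_map ||
        !x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT])
            return -EOPNOTSUPP;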
> 
>> +               if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
>> +                       return -EINVAL;
>> +               mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
>> +                       X86_BR_CALL_STACK;
> 
> Why have BR_RET here?

The doc says NEAR_REL_CALL, NEAR_IND_CALL and NEAR_RET must be cleared when the LBR call stack is enabled.

Regards
Yan, Zheng

> 
>> +       }
>> +
>>         /*
>>          * stash actual user request into reg, it may
>>          * be used by fixup code for some CPU
>>          */
>>         event->hw.branch_reg.reg = mask;
>> +       return 0;
>>  }
>>
>>  /*
>> @@ -421,8 +441,11 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
>>         reg = &event->hw.branch_reg;
>>         reg->idx = EXTRA_REG_LBR;
>>
>> -       /* LBR_SELECT operates in suppress mode so invert mask */
>> -       reg->config = ~mask & x86_pmu.lbr_sel_mask;
>> +       /*
>> +        * the first 8 bits (LBR_SEL_MASK) in LBR_SELECT operates
>> +        * in suppress mode so invert mask
>> +        */
>> +       reg->config = mask ^ x86_pmu.lbr_sel_mask;
>>
>>         return 0;
>>  }
>> @@ -440,7 +463,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
>>         /*
>>          * setup SW LBR filter
>>          */
>> -       intel_pmu_setup_sw_lbr_filter(event);
>> +       ret = intel_pmu_setup_sw_lbr_filter(event);
>> +       if (ret)
>> +               return ret;
>>
>>         /*
>>          * setup HW LBR filter, if any
>> @@ -695,6 +720,19 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
>>         [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]     = LBR_IND_CALL,
>>  };
>>
>> +static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
>> +       [PERF_SAMPLE_BRANCH_ANY_SHIFT]          = LBR_ANY,
>> +       [PERF_SAMPLE_BRANCH_USER_SHIFT]         = LBR_USER,
>> +       [PERF_SAMPLE_BRANCH_KERNEL_SHIFT]       = LBR_KERNEL,
>> +       [PERF_SAMPLE_BRANCH_HV_SHIFT]           = LBR_IGN,
>> +       [PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]   = LBR_RETURN | LBR_FAR,
>> +       [PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]     = LBR_REL_CALL | LBR_IND_CALL
>> +                                               | LBR_FAR,
>> +       [PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]     = LBR_IND_CALL,
>> +       [PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]   = LBR_REL_CALL | LBR_IND_CALL
>> +                                               | LBR_RETURN | LBR_CALL_STACK,
>> +};
>> +
>>  /* core */
>>  void intel_pmu_lbr_init_core(void)
>>  {
>> @@ -751,6 +789,20 @@ void intel_pmu_lbr_init_snb(void)
>>         pr_cont("16-deep LBR, ");
>>  }
>>
>> +/* haswell */
>> +void intel_pmu_lbr_init_hsw(void)
>> +{
>> +       x86_pmu.lbr_nr   = 16;
>> +       x86_pmu.lbr_tos  = MSR_LBR_TOS;
>> +       x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
>> +       x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
>> +
>> +       x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
>> +       x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
>> +
>> +       pr_cont("16-deep LBR, ");
>> +}
>> +
>>  /* atom */
>>  void intel_pmu_lbr_init_atom(void)
>>  {
>> --
>> 1.8.4.2
>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 06/14] perf, core: always switch pmu specific data during context switch
  2014-02-05 18:35       ` Stephane Eranian
@ 2014-02-06  2:08         ` Yan, Zheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-02-06  2:08 UTC (permalink / raw)
  To: Stephane Eranian, Peter Zijlstra
  Cc: LKML, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On 02/06/2014 02:35 AM, Stephane Eranian wrote:
> On Wed, Feb 5, 2014 at 6:55 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Wed, Feb 05, 2014 at 06:19:27PM +0100, Stephane Eranian wrote:
>>> On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>>>> If two tasks were both forked from the same parent task, events in their perf
>>>> task contexts can be the same. Perf core optimizes the context switch out in this
>>>> case.
>>>>
>>>> The previous patch introduces pmu specific data. The data is task specific, so we
>>>> should switch the data even when the context switch is optimized out.
>>>>
>>> Reviewed-by: Stephane Eranian <eranian@google.com>
>>
>> You should look again.. that xchg() is an atomic op and a total waste of
>> time since the assignment back onto ctx->task_ctx_data is non-atomic.
>>
>> Complete fail there.
>>
> I admit, it was not clear to me why the xchg().

Sorry. I forget why I used xchg(); maybe it saved a few lines of code.

Regards
Yan, Zheng

> 
>>>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>>>> ---
>>>>  kernel/events/core.c | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>> index b6650ab..d6d8dea 100644
>>>> --- a/kernel/events/core.c
>>>> +++ b/kernel/events/core.c
>>>> @@ -2319,6 +2319,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
>>>>                         next->perf_event_ctxp[ctxn] = ctx;
>>>>                         ctx->task = next;
>>>>                         next_ctx->task = task;
>>>> +                       ctx->task_ctx_data = xchg(&next_ctx->task_ctx_data,
>>>> +                                                 ctx->task_ctx_data);
>>>>                         do_switch = 0;
>>>>
>>>>                         perf_event_sync_stat(ctx, next_ctx);
>>>> --
>>>> 1.8.4.2
>>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/14] perf: track number of events that use LBR callstack
  2014-01-03  5:48 ` [PATCH 07/14] perf: track number of events that use LBR callstack Yan, Zheng
@ 2014-02-06 14:55   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-06 14:55 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> A later patch will use it to decide if the LBR stack should be saved/restored.
>
You should describe better what this patch does. At the very least, the
description should repeat the information from the patch title.

> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index bdd8758..2137a9f 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -201,15 +201,27 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>                 intel_pmu_lbr_reset();
>  }
>
> +static inline bool branch_user_callstack(unsigned br_sel)
> +{
> +       return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
> +}
> +
>  void intel_pmu_lbr_enable(struct perf_event *event)
>  {
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +       struct x86_perf_task_context *task_ctx;
>
>         if (!x86_pmu.lbr_nr)
>                 return;
>
> +       cpuc = &__get_cpu_var(cpu_hw_events);
> +       task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
> +
>         cpuc->br_sel = event->hw.branch_reg.reg;
>
> +       if (branch_user_callstack(cpuc->br_sel))
> +               task_ctx->lbr_callstack_users++;
> +
>         cpuc->lbr_users++;
>         if (cpuc->lbr_users == 1)
>                 perf_sched_cb_enable(event->ctx->pmu);
> @@ -217,11 +229,18 @@ void intel_pmu_lbr_enable(struct perf_event *event)
>
>  void intel_pmu_lbr_disable(struct perf_event *event)
>  {
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +       struct cpu_hw_events *cpuc;
> +       struct x86_perf_task_context *task_ctx;
>
>         if (!x86_pmu.lbr_nr)
>                 return;
>
> +       cpuc = &__get_cpu_var(cpu_hw_events);
> +       task_ctx = event->ctx ? event->ctx->task_ctx_data : NULL;
> +
> +       if (branch_user_callstack(cpuc->br_sel))
> +               task_ctx->lbr_callstack_users--;
> +
>         cpuc->lbr_users--;
>         WARN_ON_ONCE(cpuc->lbr_users < 0);
>
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch
  2014-02-05 17:45   ` Stephane Eranian
@ 2014-02-06 15:09     ` Stephane Eranian
  2014-02-10  8:45       ` Yan, Zheng
  0 siblings, 1 reply; 37+ messages in thread
From: Stephane Eranian @ 2014-02-06 15:09 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Wed, Feb 5, 2014 at 6:45 PM, Stephane Eranian <eranian@google.com> wrote:
> On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>> When the LBR call stack is enabled, it is necessary to save/restore
>> the LBR stack on context switch. The solution is saving/restoring
>> the LBR stack to/from task's perf event context.
>>
>> The LBR stack is saved/restored only when there are events that use
>> the LBR call stack. If no event uses LBR call stack, the LBR stack
>> is reset when task is scheduled in.
>>
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>> ---
>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 80 ++++++++++++++++++++++++------
>>  1 file changed, 66 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> index 2137a9f..51e1842 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> @@ -187,18 +187,82 @@ void intel_pmu_lbr_reset(void)
>>                 intel_pmu_lbr_reset_64();
>>  }
>>
>> +/*
>> + * TOS = most recently recorded branch
>> + */
>> +static inline u64 intel_pmu_lbr_tos(void)
>> +{
>> +       u64 tos;
>> +       rdmsrl(x86_pmu.lbr_tos, tos);
>> +       return tos;
>> +}
>> +
>> +enum {
>> +       LBR_UNINIT,
>> +       LBR_NONE,
>> +       LBR_VALID,
>> +};
>> +
> I don't see where the x86_perf_task_context struct gets initialized with
> your task_ctx_data/task_ctx_size mechanism. You are relying on 0
> as a valid default value. But if later more fields are needed and they need
> non-zero init values, it will be easy to forget.....
>
> So I think you need to provide a callback from alloc_perf_context().
> Should have mentioned that in Patch 05/14.
>
>> +static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
>> +{
>> +       int i;
>> +       unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
>> +       u64 tos = intel_pmu_lbr_tos();
>> +
>> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
>> +               lbr_idx = (tos - i) & mask;
>> +               wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
>> +               wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
>> +       }
>> +       task_ctx->lbr_stack_state = LBR_NONE;
>> +}
>> +
>> +static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
>> +{
>> +       int i;
>> +       unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
>> +       u64 tos = intel_pmu_lbr_tos();
>> +
>> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
>> +               lbr_idx = (tos - i) & mask;
>> +               rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
>> +               rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
>> +       }
>> +       task_ctx->lbr_stack_state = LBR_VALID;
>> +}
>> +
>> +
>>  void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>>  {
>> +       struct cpu_hw_events *cpuc;
>> +       struct x86_perf_task_context *task_ctx;
>> +
>>         if (!x86_pmu.lbr_nr)
>>                 return;
>>
>> +       cpuc = &__get_cpu_var(cpu_hw_events);
>> +       task_ctx = ctx ? ctx->task_ctx_data : NULL;
>> +
>> +
>>         /*
>>          * It is necessary to flush the stack on context switch. This happens
>>          * when the branch stack does not tag its entries with the pid of the
>>          * current task.
>>          */
>> -       if (sched_in)
>> -               intel_pmu_lbr_reset();
>> +       if (sched_in) {
>> +               if (!task_ctx ||
>> +                   !task_ctx->lbr_callstack_users ||
>> +                   task_ctx->lbr_stack_state != LBR_VALID)
>> +                       intel_pmu_lbr_reset();
>> +               else
>> +                       __intel_pmu_lbr_restore(task_ctx);
>> +       } else if (task_ctx) {
>> +               if (task_ctx->lbr_callstack_users &&
>> +                   task_ctx->lbr_stack_state != LBR_UNINIT)
>> +                       __intel_pmu_lbr_save(task_ctx);
>> +               else
>> +                       task_ctx->lbr_stack_state = LBR_NONE;
>> +       }
>>  }
>>
> There ought to be a better way of structuring this if/else. It is
> ugly.
>
Second thought on this. I am not sure I understand why the
test has to be so complex including on the save() side.

if (sched_in) {
        if (task_ctx && lbr_callstack_users)
                restore();
        else
                reset();
} else { /* sched_out */
        if (task_ctx && lbr_callstack_users)
                save();
}
If you have lbr_callstack_users, then you need to save/restore.
Looks like you are trying to prevent from double sched-in or
double sched-out. Can this happen?

In other words, I am not sure I understand the need for the
lbr_state here.


>>  static inline bool branch_user_callstack(unsigned br_sel)
>> @@ -267,18 +331,6 @@ void intel_pmu_lbr_disable_all(void)
>>                 __intel_pmu_lbr_disable();
>>  }
>>
>> -/*
>> - * TOS = most recently recorded branch
>> - */
>> -static inline u64 intel_pmu_lbr_tos(void)
>> -{
>> -       u64 tos;
>> -
>> -       rdmsrl(x86_pmu.lbr_tos, tos);
>> -
>> -       return tos;
>> -}
>> -
>>  static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
>>  {
>>         unsigned long mask = x86_pmu.lbr_nr - 1;
>> --
>> 1.8.4.2
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 10/14] perf, core: simplify need branch stack check
  2014-01-03  5:48 ` [PATCH 10/14] perf, core: simplify need branch stack check Yan, Zheng
@ 2014-02-06 15:35   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-06 15:35 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> event->attr.branch_sample_type is non-zero whether the branch stack
> is enabled explicitly or implicitly, so we can use it to replace
> intel_pmu_needs_lbr_smpl(). This avoids duplicating the code
> that implicitly enables the LBR.
>
This patch does more than what you describe here.
Explain the simplifications.
Explain the difference between has_branch_stack() and needs_branch_stack().

Given the way you've implemented LBR_CALLSTACK (not exposed to users),
you are reusing attr->branch_sample_type to stash your CALLSTACK mode.
So you end up in a situation where branch_sample_type != 0 but
(attr->sample_type & PERF_SAMPLE_BRANCH_STACK) == 0. Yet you need
to detect whether the branch stack is used, so you need to check
branch_sample_type.
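
A concrete sketch of the situation described above (attribute values chosen
purely for illustration; the helpers are the ones defined in the quoted patch
below):

        /*
         * An event that only uses the LBR implicitly, e.g. the call stack
         * mode, has branch_sample_type set but does not have
         * PERF_SAMPLE_BRANCH_STACK in sample_type:
         */
        struct perf_event_attr attr = {};

        attr.sample_type        = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
        attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                  PERF_SAMPLE_BRANCH_CALL_STACK;

        /*
         * For such an event has_branch_stack() is false but
         * needs_branch_stack() is true, so only the new helper notices
         * that the LBR must be programmed.
         */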


> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel.c | 20 +++-----------------
>  include/linux/perf_event.h             |  5 +++++
>  kernel/events/core.c                   | 11 +++++++----
>  3 files changed, 15 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 84a1c09..722171c 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1030,20 +1030,6 @@ static __initconst const u64 slm_hw_cache_event_ids
>   },
>  };
>
> -static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
> -{
> -       /* user explicitly requested branch sampling */
> -       if (has_branch_stack(event))
> -               return true;
> -
> -       /* implicit branch sampling to correct PEBS skid */
> -       if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
> -           x86_pmu.intel_cap.pebs_format < 2)
> -               return true;
> -
> -       return false;
> -}
> -
>  static void intel_pmu_disable_all(void)
>  {
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> @@ -1208,7 +1194,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
>          * must disable before any actual event
>          * because any event may be combined with LBR
>          */
> -       if (intel_pmu_needs_lbr_smpl(event))
> +       if (needs_branch_stack(event))
>                 intel_pmu_lbr_disable(event);
>
>         if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
> @@ -1269,7 +1255,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
>          * must enabled before any actual event
>          * because any event may be combined with LBR
>          */
> -       if (intel_pmu_needs_lbr_smpl(event))
> +       if (needs_branch_stack(event))
>                 intel_pmu_lbr_enable(event);
>
>         if (event->attr.exclude_host)
> @@ -1741,7 +1727,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
>         if (event->attr.precise_ip && x86_pmu.pebs_aliases)
>                 x86_pmu.pebs_aliases(event);
>
> -       if (intel_pmu_needs_lbr_smpl(event)) {
> +       if (needs_branch_stack(event)) {
>                 ret = intel_pmu_setup_lbr_filter(event);
>                 if (ret)
>                         return ret;
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 147f9d3..0d88eb8 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -766,6 +766,11 @@ static inline bool has_branch_stack(struct perf_event *event)
>         return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
>  }
>
> +static inline bool needs_branch_stack(struct perf_event *event)
> +{
> +       return event->attr.branch_sample_type != 0;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>                              struct perf_event *event, unsigned int size);
>  extern void perf_output_end(struct perf_output_handle *handle);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index d6d8dea..7dd4d58 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1138,7 +1138,7 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>         if (is_cgroup_event(event))
>                 ctx->nr_cgroups++;
>
> -       if (has_branch_stack(event))
> +       if (needs_branch_stack(event))
>                 ctx->nr_branch_stack++;
>
>         list_add_rcu(&event->event_entry, &ctx->event_list);
> @@ -1303,7 +1303,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>                         cpuctx->cgrp = NULL;
>         }
>
> -       if (has_branch_stack(event))
> +       if (needs_branch_stack(event))
>                 ctx->nr_branch_stack--;
>
>         ctx->nr_events--;
> @@ -3202,7 +3202,7 @@ static void unaccount_event(struct perf_event *event)
>                 atomic_dec(&nr_freq_events);
>         if (is_cgroup_event(event))
>                 static_key_slow_dec_deferred(&perf_sched_events);
> -       if (has_branch_stack(event))
> +       if (needs_branch_stack(event))
>                 static_key_slow_dec_deferred(&perf_sched_events);
>
>         unaccount_event_cpu(event, event->cpu);
> @@ -6627,7 +6627,7 @@ static void account_event(struct perf_event *event)
>                 if (atomic_inc_return(&nr_freq_events) == 1)
>                         tick_nohz_full_kick_all();
>         }
> -       if (has_branch_stack(event))
> +       if (needs_branch_stack(event))
>                 static_key_slow_inc(&perf_sched_events.key);
>         if (is_cgroup_event(event))
>                 static_key_slow_inc(&perf_sched_events.key);
> @@ -6735,6 +6735,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>         if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
>                 goto err_ns;
>
> +       if (!has_branch_stack(event))
> +               event->attr.branch_sample_type = 0;
> +
>         pmu = perf_init_event(event);
>         if (!pmu)
>                 goto err_ns;
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 12/14] perf, x86: use LBR call stack to get user callchain
  2014-01-03  5:48 ` [PATCH 12/14] perf, x86: use LBR call stack to get user callchain Yan, Zheng
@ 2014-02-06 15:46   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-06 15:46 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function
> call will be collected as normal, but as return instructions are executed
> the last captured branch record is popped from the on-chip LBR registers.
> The LBR call stack facility can help perf get call chains of a program
> without frame pointers.
>
> This patch makes x86's perf_callchain_user() fall back to the LBR call stack
> when there is no frame pointer in the user program.
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c           | 33 ++++++++++++++++++++++++++----
>  arch/x86/kernel/cpu/perf_event_intel.c     | 11 +++++++++-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  2 ++
>  include/linux/perf_event.h                 |  1 +
>  4 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 49128e6..1509340 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1965,12 +1965,28 @@ static unsigned long get_segment_base(unsigned int segment)
>         return get_desc_base(desc + idx);
>  }
>
> +static inline void
> +perf_callchain_lbr_callstack(struct perf_callchain_entry *entry,
> +                            struct perf_sample_data *data)
> +{
> +       struct perf_branch_stack *br_stack = data->br_stack;
> +
> +       if (br_stack && br_stack->user_callstack) {
> +               int i = 0;
> +               while (i < br_stack->nr && entry->nr < PERF_MAX_STACK_DEPTH) {
> +                       perf_callchain_store(entry, br_stack->entries[i].from);
> +                       i++;
> +               }
> +       }
> +}
> +
>  #ifdef CONFIG_COMPAT
>
>  #include <asm/compat.h>
>
>  static inline int
> -perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
> +perf_callchain_user32(struct perf_callchain_entry *entry,
> +                     struct pt_regs *regs, struct perf_sample_data *data)
>  {
>         /* 32-bit process in 64-bit kernel. */
>         unsigned long ss_base, cs_base;
> @@ -1999,11 +2015,16 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
>                 perf_callchain_store(entry, cs_base + frame.return_address);
>                 fp = compat_ptr(ss_base + frame.next_frame);
>         }
> +
> +       if (fp == compat_ptr(regs->bp))
> +               perf_callchain_lbr_callstack(entry, data);
> +
>         return 1;
>  }
>  #else
>  static inline int
> -perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
> +perf_callchain_user32(struct perf_callchain_entry *entry,
> +                     struct pt_regs *regs, struct perf_sample_data *data)
>  {
>      return 0;
>  }
> @@ -2033,12 +2054,12 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
>         if (!current->mm)
>                 return;
>
> -       if (perf_callchain_user32(regs, entry))
> +       if (perf_callchain_user32(entry, regs, data))
>                 return;
>
>         while (entry->nr < PERF_MAX_STACK_DEPTH) {
>                 unsigned long bytes;
> -               frame.next_frame             = NULL;
> +               frame.next_frame = NULL;
>                 frame.return_address = 0;
>
>                 bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
> @@ -2051,6 +2072,10 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
>                 perf_callchain_store(entry, frame.return_address);
>                 fp = frame.next_frame;
>         }
> +
> +       /* try LBR callstack if there is no frame pointer */
> +       if (fp == (void __user *)regs->bp)
> +               perf_callchain_lbr_callstack(entry, data);
>  }
>
>  /*
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 722171c..8b7465c 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1030,6 +1030,14 @@ static __initconst const u64 slm_hw_cache_event_ids
>   },
>  };
>
> +static inline bool intel_pmu_needs_lbr_callstack(struct perf_event *event)
> +{
> +       if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
> +           (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK))
> +               return true;
> +       return false;
> +}
> +
>  static void intel_pmu_disable_all(void)
>  {
>         struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> @@ -1398,7 +1406,8 @@ again:
>
>                 perf_sample_data_init(&data, 0, event->hw.last_period);
>
> -               if (has_branch_stack(event))
> +               if (has_branch_stack(event) ||
> +                   (event->ctx->task && intel_pmu_needs_lbr_callstack(event)))

Isn't event->ctx->task redundant here? I thought you were already allowing
LBR_CALLSTACK only for per-process events. That should be checked during
setup; there is no need to do it for each interrupt.

Also, it would be nicer to have:

        if (needs_lbr_stack(event))
                data.br_stack = &cpuc->lbr_stack;

and to hide the two tests, has_branch_stack() and has_lbr_callstack(),
inside that needs_lbr_stack() inline.

That would be better for the eyes....
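
A sketch of the helper being suggested; the name needs_lbr_stack() is taken
from the comment above, and has_lbr_callstack() is assumed to correspond to
the intel_pmu_needs_lbr_callstack() check from the patch:

        static inline bool needs_lbr_stack(struct perf_event *event)
        {
                return has_branch_stack(event) ||
                       intel_pmu_needs_lbr_callstack(event);
        }

The interrupt handler then reduces to:

        if (needs_lbr_stack(event))
                data.br_stack = &cpuc->lbr_stack;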


>
>                 if (perf_event_overflow(event, &data, regs))
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 51e1842..08e3ba1 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -718,6 +718,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
>         int i, j, type;
>         bool compress = false;
>
> +       cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
> +
>         /* if sampling all branches, then nothing to filter */
>         if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
>                 return;
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index c442276..d2f0488 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -74,6 +74,7 @@ struct perf_raw_record {
>   * recent branch.
>   */
>  struct perf_branch_stack {
> +       bool                            user_callstack;
>         __u64                           nr;
>         struct perf_branch_entry        entries[0];
>  };
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 13/14] perf, x86: enable LBR callstack when recording callchain
  2014-01-03  5:48 ` [PATCH 13/14] perf, x86: enable LBR callstack when recording callchain Yan, Zheng
@ 2014-02-06 15:50   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-06 15:50 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> Try enabling the LBR callstack facility if the user requests recording of
> the user-space callchain. Also add a cpu pmu attribute to enable/disable
> this feature. This feature is disabled by default because it may
> contend for the LBR with other events that explicitly require the branch
> stack.
>
We have discussed this patch before.
I think you need to describe clearly what's going on based on priv levels.

excl_user=0, excl_kernel=0 : user callstack via LBR, kernel via framepointer?
excl_user=0, excl_kernel=1 : user callstack via LBR
excl_user=1, excl_kernel=0 : kernel via framepointer?
excl_user=1, excl_kernel=1 : nothing

This does not come out clear from the code below.

> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c | 99 ++++++++++++++++++++++++++++------------
>  arch/x86/kernel/cpu/perf_event.h |  7 +++
>  2 files changed, 77 insertions(+), 29 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 1509340..3ea184a 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -399,37 +399,49 @@ int x86_pmu_hw_config(struct perf_event *event)
>
>                 if (event->attr.precise_ip > precise)
>                         return -EOPNOTSUPP;
> +       }
> +       /*
> +        * check that PEBS LBR correction does not conflict with
> +        * whatever the user is asking with attr->branch_sample_type
> +        */
> +       if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
> +               u64 *br_type = &event->attr.branch_sample_type;
> +
> +               if (has_branch_stack(event)) {
> +                       if (!precise_br_compat(event))
> +                               return -EOPNOTSUPP;
> +
> +                       /* branch_sample_type is compatible */
> +
> +               } else {
> +                       /*
> +                        * user did not specify  branch_sample_type
> +                        *
> +                        * For PEBS fixups, we capture all
> +                        * the branches at the priv level of the
> +                        * event.
> +                        */
> +                       *br_type = PERF_SAMPLE_BRANCH_ANY;
> +
> +                       if (!event->attr.exclude_user)
> +                               *br_type |= PERF_SAMPLE_BRANCH_USER;
> +
> +                       if (!event->attr.exclude_kernel)
> +                               *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
> +               }
> +       } else if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
> +                  !has_branch_stack(event) &&
> +                  x86_pmu.attr_lbr_callstack &&
> +                  !event->attr.exclude_user &&
> +                  (event->attach_state & PERF_ATTACH_TASK)) {
>                 /*
> -                * check that PEBS LBR correction does not conflict with
> -                * whatever the user is asking with attr->branch_sample_type
> +                * user did not specify branch_sample_type,
> +                * try using the LBR call stack facility to
> +                * record call chains of user program.
>                  */
> -               if (event->attr.precise_ip > 1 &&
> -                   x86_pmu.intel_cap.pebs_format < 2) {
> -                       u64 *br_type = &event->attr.branch_sample_type;
> -
> -                       if (has_branch_stack(event)) {
> -                               if (!precise_br_compat(event))
> -                                       return -EOPNOTSUPP;
> -
> -                               /* branch_sample_type is compatible */
> -
> -                       } else {
> -                               /*
> -                                * user did not specify  branch_sample_type
> -                                *
> -                                * For PEBS fixups, we capture all
> -                                * the branches at the priv level of the
> -                                * event.
> -                                */
> -                               *br_type = PERF_SAMPLE_BRANCH_ANY;
> -
> -                               if (!event->attr.exclude_user)
> -                                       *br_type |= PERF_SAMPLE_BRANCH_USER;
> -
> -                               if (!event->attr.exclude_kernel)
> -                                       *br_type |= PERF_SAMPLE_BRANCH_KERNEL;
> -                       }
> -               }
> +               event->attr.branch_sample_type =
> +                       PERF_SAMPLE_BRANCH_USER |
> +                       PERF_SAMPLE_BRANCH_CALL_STACK;
>         }
>
>         /*
> @@ -1828,10 +1840,39 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
>         return count;
>  }
>
> +static ssize_t get_attr_lbr_callstack(struct device *cdev,
> +                                     struct device_attribute *attr, char *buf)
> +{
> +       return snprintf(buf, 40, "%d\n", x86_pmu.attr_lbr_callstack);
> +}
> +
> +static ssize_t set_attr_lbr_callstack(struct device *cdev,
> +                                     struct device_attribute *attr,
> +                                     const char *buf, size_t count)
> +{
> +       unsigned long val;
> +       ssize_t ret;
> +
> +       ret = kstrtoul(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +
> +       if (!!val != !!x86_pmu.attr_lbr_callstack) {
> +               if (val && !x86_pmu_has_lbr_callstack())
> +                       return -EOPNOTSUPP;
> +               x86_pmu.attr_lbr_callstack = !!val;
> +       }
> +       return count;
> +}
> +
>  static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
> +static DEVICE_ATTR(lbr_callstack, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH,
> +                  get_attr_lbr_callstack, set_attr_lbr_callstack);
> +
>
>  static struct attribute *x86_pmu_attrs[] = {
>         &dev_attr_rdpmc.attr,
> +       &dev_attr_lbr_callstack.attr,
>         NULL,
>  };
>
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 3ed9629..b45258c 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -400,6 +400,7 @@ struct x86_pmu {
>          * sysfs attrs
>          */
>         int             attr_rdpmc;
> +       int             attr_lbr_callstack;
>         struct attribute **format_attrs;
>         struct attribute **event_attrs;
>
> @@ -504,6 +505,12 @@ static struct perf_pmu_events_attr event_attr_##v = {                      \
>
>  extern struct x86_pmu x86_pmu __read_mostly;
>
> +static inline bool x86_pmu_has_lbr_callstack(void)
> +{
> +       return  x86_pmu.lbr_sel_map &&
> +               x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
> +}
> +
>  DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
>
>  int x86_perf_event_set_period(struct perf_event *event);
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 14/14] perf, x86: Discard zero length call entries in LBR call stack
  2014-01-03  5:48 ` [PATCH 14/14] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
@ 2014-02-06 15:57   ` Stephane Eranian
  0 siblings, 0 replies; 37+ messages in thread
From: Stephane Eranian @ 2014-02-06 15:57 UTC (permalink / raw)
  To: Yan, Zheng
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> "Zero length call" uses the attribute of the call instruction to push
> the immediate instruction pointer on to the stack and then pops off
> that address into a register. This is accomplished without any matching
> return instruction. It confuses the hardware and make the recorded call
> stack incorrect. Try fixing the call stack by discarding zero length
> call entries.
>
So on this one you're saying that, since there is no matching return,
LBR_CALLSTACK will not pop the call in this case, and therefore the call
remains in the LBR register buffer. The kernel can look for those entries
and remove them, because it can detect them from the call encoding.
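
For reference, a hedged illustration of the idiom being filtered out; this is
the classic position-independent-code trick for reading the instruction
pointer, not code from the patch:

        /*
         * "call 1f" assembles to opcode 0xe8 with a zero displacement,
         * i.e. a call to the very next instruction.  It pushes the
         * return address, which is immediately popped off again, so no
         * matching ret ever executes; that is exactly the case the new
         * X86_BR_ZERO_CALL type discards.
         */
        static unsigned long read_pc(void)
        {
                unsigned long pc;

                asm volatile("call 1f\n\t"
                             "1: pop %0" : "=r" (pc));
                return pc;
        }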

Reviewed-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 08e3ba1..57bdd34 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -94,7 +94,8 @@ enum {
>         X86_BR_ABORT            = 1 << 12,/* transaction abort */
>         X86_BR_IN_TX            = 1 << 13,/* in transaction */
>         X86_BR_NO_TX            = 1 << 14,/* not in transaction */
> -       X86_BR_CALL_STACK       = 1 << 15,/* call stack */
> +       X86_BR_ZERO_CALL        = 1 << 15,/* zero length call */
> +       X86_BR_CALL_STACK       = 1 << 16,/* call stack */
>  };
>
>  #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
> @@ -111,13 +112,15 @@ enum {
>          X86_BR_JMP      |\
>          X86_BR_IRQ      |\
>          X86_BR_ABORT    |\
> -        X86_BR_IND_CALL)
> +        X86_BR_IND_CALL |\
> +        X86_BR_ZERO_CALL)
>
>  #define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
>
>  #define X86_BR_ANY_CALL                 \
>         (X86_BR_CALL            |\
>          X86_BR_IND_CALL        |\
> +        X86_BR_ZERO_CALL       |\
>          X86_BR_SYSCALL         |\
>          X86_BR_IRQ             |\
>          X86_BR_INT)
> @@ -652,6 +655,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
>                 ret = X86_BR_INT;
>                 break;
>         case 0xe8: /* call near rel */
> +               insn_get_immediate(&insn);
> +               if (insn.immediate1.value == 0) {
> +                       /* zero length call */
> +                       ret = X86_BR_ZERO_CALL;
> +                       break;
> +               }
>         case 0x9a: /* call far absolute */
>                 ret = X86_BR_CALL;
>                 break;
> --
> 1.8.4.2
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch
  2014-02-06 15:09     ` Stephane Eranian
@ 2014-02-10  8:45       ` Yan, Zheng
  0 siblings, 0 replies; 37+ messages in thread
From: Yan, Zheng @ 2014-02-10  8:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen

On 02/06/2014 11:09 PM, Stephane Eranian wrote:
> On Wed, Feb 5, 2014 at 6:45 PM, Stephane Eranian <eranian@google.com> wrote:
>> On Fri, Jan 3, 2014 at 6:48 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
>>> When the LBR call stack is enabled, it is necessary to save/restore
>>> the LBR stack on context switch. The solution is saving/restoring
>>> the LBR stack to/from task's perf event context.
>>>
>>> The LBR stack is saved/restored only when there are events that use
>>> the LBR call stack. If no event uses LBR call stack, the LBR stack
>>> is reset when task is scheduled in.
>>>
>>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>>> ---
>>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 80 ++++++++++++++++++++++++------
>>>  1 file changed, 66 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>>> index 2137a9f..51e1842 100644
>>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>>> @@ -187,18 +187,82 @@ void intel_pmu_lbr_reset(void)
>>>                 intel_pmu_lbr_reset_64();
>>>  }
>>>
>>> +/*
>>> + * TOS = most recently recorded branch
>>> + */
>>> +static inline u64 intel_pmu_lbr_tos(void)
>>> +{
>>> +       u64 tos;
>>> +       rdmsrl(x86_pmu.lbr_tos, tos);
>>> +       return tos;
>>> +}
>>> +
>>> +enum {
>>> +       LBR_UNINIT,
>>> +       LBR_NONE,
>>> +       LBR_VALID,
>>> +};
>>> +
>> I don't see where the x86_perf_task_context struct gets initialized with
>> your task_ctx_data/task_ctx_size mechanism. You are relying on 0
>> as a valid default value. But if later more fields are needed and they need
>> non-zero init values, it will be easy to forget.....
>>
>> So I think you need to provide a callback from alloc_perf_context().
>> Should have mentioned that in Patch 05/14.
>>
>>> +static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
>>> +{
>>> +       int i;
>>> +       unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
>>> +       u64 tos = intel_pmu_lbr_tos();
>>> +
>>> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
>>> +               lbr_idx = (tos - i) & mask;
>>> +               wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
>>> +               wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
>>> +       }
>>> +       task_ctx->lbr_stack_state = LBR_NONE;
>>> +}
>>> +
>>> +static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
>>> +{
>>> +       int i;
>>> +       unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
>>> +       u64 tos = intel_pmu_lbr_tos();
>>> +
>>> +       for (i = 0; i < x86_pmu.lbr_nr; i++) {
>>> +               lbr_idx = (tos - i) & mask;
>>> +               rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
>>> +               rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
>>> +       }
>>> +       task_ctx->lbr_stack_state = LBR_VALID;
>>> +}
>>> +
>>> +
>>>  void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
>>>  {
>>> +       struct cpu_hw_events *cpuc;
>>> +       struct x86_perf_task_context *task_ctx;
>>> +
>>>         if (!x86_pmu.lbr_nr)
>>>                 return;
>>>
>>> +       cpuc = &__get_cpu_var(cpu_hw_events);
>>> +       task_ctx = ctx ? ctx->task_ctx_data : NULL;
>>> +
>>> +
>>>         /*
>>>          * It is necessary to flush the stack on context switch. This happens
>>>          * when the branch stack does not tag its entries with the pid of the
>>>          * current task.
>>>          */
>>> -       if (sched_in)
>>> -               intel_pmu_lbr_reset();
>>> +       if (sched_in) {
>>> +               if (!task_ctx ||
>>> +                   !task_ctx->lbr_callstack_users ||
>>> +                   task_ctx->lbr_stack_state != LBR_VALID)
>>> +                       intel_pmu_lbr_reset();
>>> +               else
>>> +                       __intel_pmu_lbr_restore(task_ctx);
>>> +       } else if (task_ctx) {
>>> +               if (task_ctx->lbr_callstack_users &&
>>> +                   task_ctx->lbr_stack_state != LBR_UNINIT)
>>> +                       __intel_pmu_lbr_save(task_ctx);
>>> +               else
>>> +                       task_ctx->lbr_stack_state = LBR_NONE;
>>> +       }
>>>  }
>>>
>> There ought to be a better way of structuring this if/else. It is
>> ugly.
>>
> Second thought on this. I am not sure I understand why the
> test has to be so complex including on the save() side.
> 
> if (sched_in) {
>         if (task_ctx && lbr_callstack_users)
>                 restore();
>         else
>                 reset();
> } else { /* sched_out */
>         if (task_ctx && lbr_callstack_users)
>                 save();
> }

I think you are right about the save side, but the lbr_state is still needed
on the restore side, because the perf context may hold an invalid LBR state when
the task is being scheduled in (the task is newly created, or the callstack
feature was not enabled when the task was scheduled out).

Regards
Yan, Zheng
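
To make the direction agreed on above concrete, a rough sketch (an assumption
about how the final code might look), with the save side simplified and the
lbr_stack_state check kept only on the restore side:

        void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
        {
                struct x86_perf_task_context *task_ctx;

                if (!x86_pmu.lbr_nr)
                        return;

                task_ctx = ctx ? ctx->task_ctx_data : NULL;

                if (sched_in) {
                        /* restore only when the saved state is known to be valid */
                        if (task_ctx && task_ctx->lbr_callstack_users &&
                            task_ctx->lbr_stack_state == LBR_VALID)
                                __intel_pmu_lbr_restore(task_ctx);
                        else
                                intel_pmu_lbr_reset();
                } else if (task_ctx && task_ctx->lbr_callstack_users) {
                        /* sched out: save unconditionally, as suggested */
                        __intel_pmu_lbr_save(task_ctx);
                }
        }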

> If you have lbr_callstack_users, then you need to save/restore.
> Looks like you are trying to prevent from double sched-in or
> double sched-out. Can this happen?
> 
> In other words, I am not sure I understand the need for the
> lbr_state here.
> 
> 
>>>  static inline bool branch_user_callstack(unsigned br_sel)
>>> @@ -267,18 +331,6 @@ void intel_pmu_lbr_disable_all(void)
>>>                 __intel_pmu_lbr_disable();
>>>  }
>>>
>>> -/*
>>> - * TOS = most recently recorded branch
>>> - */
>>> -static inline u64 intel_pmu_lbr_tos(void)
>>> -{
>>> -       u64 tos;
>>> -
>>> -       rdmsrl(x86_pmu.lbr_tos, tos);
>>> -
>>> -       return tos;
>>> -}
>>> -
>>>  static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
>>>  {
>>>         unsigned long mask = x86_pmu.lbr_nr - 1;
>>> --
>>> 1.8.4.2
>>>


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-02-10  8:45 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-03  5:47 [PATCH 00/14] perf, x86: Haswell LBR call stack support Yan, Zheng
2014-01-03  5:47 ` [PATCH 01/14] perf, x86: Reduce lbr_sel_map size Yan, Zheng
2014-02-05 15:15   ` Stephane Eranian
2014-01-03  5:47 ` [PATCH 02/14] perf, core: introduce pmu context switch callback Yan, Zheng
2014-02-05 16:01   ` Stephane Eranian
2014-02-06  1:38     ` Yan, Zheng
2014-01-03  5:48 ` [PATCH 03/14] perf, x86: use context switch callback to flush LBR stack Yan, Zheng
2014-02-05 16:34   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 04/14] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
2014-02-05 15:40   ` Stephane Eranian
2014-02-06  1:52     ` Yan, Zheng
2014-01-03  5:48 ` [PATCH 05/14] perf, core: allow pmu specific data for perf task context Yan, Zheng
2014-02-05 16:57   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 06/14] perf, core: always switch pmu specific data during context switch Yan, Zheng
2014-02-05 17:19   ` Stephane Eranian
2014-02-05 17:55     ` Peter Zijlstra
2014-02-05 18:35       ` Stephane Eranian
2014-02-06  2:08         ` Yan, Zheng
2014-01-03  5:48 ` [PATCH 07/14] perf: track number of events that use LBR callstack Yan, Zheng
2014-02-06 14:55   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 08/14] perf, x86: allocate space for storing LBR stack Yan, Zheng
2014-02-05 17:26   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 09/14] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
2014-02-05 17:45   ` Stephane Eranian
2014-02-06 15:09     ` Stephane Eranian
2014-02-10  8:45       ` Yan, Zheng
2014-01-03  5:48 ` [PATCH 10/14] perf, core: simplify need branch stack check Yan, Zheng
2014-02-06 15:35   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 11/14] perf, core: Pass perf_sample_data to perf_callchain() Yan, Zheng
2014-01-03  5:48 ` [PATCH 12/14] perf, x86: use LBR call stack to get user callchain Yan, Zheng
2014-02-06 15:46   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 13/14] perf, x86: enable LBR callstack when recording callchain Yan, Zheng
2014-02-06 15:50   ` Stephane Eranian
2014-01-03  5:48 ` [PATCH 14/14] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
2014-02-06 15:57   ` Stephane Eranian
2014-01-21 13:17 ` [PATCH 00/14] perf, x86: Haswell LBR call stack support Stephane Eranian
2014-01-22  1:35   ` Yan, Zheng
