* [PATCH v2 0/7] perf, x86: Haswell LBR call stack support
@ 2013-07-01  7:23 Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 1/7] perf, x86: Reduce lbr_sel_map size Yan, Zheng
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Haswell has a new feature that utilizes the existing Last Branch Record
facility to record call chains. When the feature is enabled, function
calls are collected as usual, but as return instructions are executed
the last captured branch record is popped from the on-chip LBR
registers. The LBR call stack facility can help perf get call chains
of programs built without frame pointers. When the perf tool requests
PERF_SAMPLE_CALLCHAIN + PERF_SAMPLE_BRANCH_USER, this feature is
dynamically enabled by default. It can be enabled/disabled through an
attribute file in the cpu pmu sysfs directory.
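
For illustration, here is a minimal user-space sketch (example code,
not part of the series) of the kind of request that triggers the
dynamic enabling: a per-task event sampling user call chains, with no
explicit branch_sample_type.

/* Example only: open a per-task cycles event that samples user call
 * chains.  With these patches applied on Haswell, the kernel can
 * transparently use the LBR call stack to complete the callchain when
 * the program lacks frame pointers. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

static int open_callchain_event(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
	attr.exclude_kernel = 1;	/* user-space callchains only */

	/* monitor task 'pid' on any cpu, no group leader, no flags */
	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}

int main(void)
{
	int fd = open_callchain_event(0);	/* 0 == current task */

	if (fd < 0)
		perror("perf_event_open");
	return fd < 0;
}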

The LBR call stack has the following known limitations:
 1. Zero length calls are not filtered out by hardware
 2. Exception handling such as setjmp/longjmp causes calls and returns
    not to match (see the example below)
 3. Pushing a different return address onto the stack causes calls and
    returns not to match
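
To illustrate limitation 2 (example code only): longjmp() unwinds
several frames at once, so the returns of the skipped functions never
execute and the hardware call stack is left out of sync.

/* The calls to middle() and leaf() are recorded by the LBR call
 * stack, but their matching returns never execute because longjmp()
 * bypasses them. */
#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void leaf(void)   { longjmp(env, 1); }	/* never returns normally */
static void middle(void) { leaf(); }		/* its return is skipped   */

int main(void)
{
	if (setjmp(env) == 0)
		middle();
	printf("back in main without the skipped returns\n");
	return 0;
}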

These patches are based upon tip/perf/core

Regards
Yan, Zheng
---
Changes since v1
 - more comments for why FREEZE_LBRS_ON_PMI does not work
 - don't save/restore the LBR stack during context switches if the LBR
   is used by CPU events
 - change mode of /sys/devices/cpu/lbr_callstack to 0644


* [PATCH v2 1/7] perf, x86: Reduce lbr_sel_map size
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 2/7] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

The index of lbr_sel_map is currently the bit value of the perf
branch_sample_type flag. We can reduce the size of lbr_sel_map by
using the bit shift as the index instead.
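
A stand-alone sketch of the size difference (the shift values mirror
the patch, but the map contents here are placeholder values only):

/* Indexing the map by the flag value needs 1 << 10 entries;
 * indexing it by the bit shift needs just 10. */
#include <stdio.h>

#define BRANCH_IND_CALL_SHIFT	6
#define BRANCH_IND_CALL		(1U << BRANCH_IND_CALL_SHIFT)
#define BRANCH_MAX_SHIFT	10

/* old scheme: index is the flag value itself */
static const int old_map[1U << BRANCH_MAX_SHIFT] = {
	[BRANCH_IND_CALL]	= 0x10,	/* placeholder LBR filter bits */
};

/* new scheme: index is the bit shift */
static const int new_map[BRANCH_MAX_SHIFT] = {
	[BRANCH_IND_CALL_SHIFT]	= 0x10,	/* same placeholder value */
};

int main(void)
{
	printf("old: %zu entries, new: %zu entries\n",
	       sizeof(old_map) / sizeof(old_map[0]),
	       sizeof(new_map) / sizeof(new_map[0]));
	return 0;
}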

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  4 +++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 50 ++++++++++++++----------------
 include/uapi/linux/perf_event.h            | 42 +++++++++++++++++--------
 3 files changed, 56 insertions(+), 40 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 97e557b..e33d3b7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -453,6 +453,10 @@ struct x86_pmu {
 	struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
 };
 
+enum {
+	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d5be06a..1e74cc4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -69,10 +69,6 @@ static enum {
 #define LBR_FROM_FLAG_IN_TX    (1ULL << 62)
 #define LBR_FROM_FLAG_ABORT    (1ULL << 61)
 
-#define for_each_branch_sample_type(x) \
-	for ((x) = PERF_SAMPLE_BRANCH_USER; \
-	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
-
 /*
  * x86control flow change classification
  * x86control flow changes include branches, interrupts, traps, faults
@@ -387,14 +383,14 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 {
 	struct hw_perf_event_extra *reg;
 	u64 br_type = event->attr.branch_sample_type;
-	u64 mask = 0, m;
-	u64 v;
+	u64 mask = 0, v;
+	int i;
 
-	for_each_branch_sample_type(m) {
-		if (!(br_type & m))
+	for (i = 0; i < PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE; i++) {
+		if (!(br_type & (1ULL << i)))
 			continue;
 
-		v = x86_pmu.lbr_sel_map[m];
+		v = x86_pmu.lbr_sel_map[i];
 		if (v == LBR_NOT_SUPP)
 			return -EOPNOTSUPP;
 
@@ -649,33 +645,33 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 /*
  * Map interface branch filters onto LBR filters
  */
-static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
-	[PERF_SAMPLE_BRANCH_ANY]	= LBR_ANY,
-	[PERF_SAMPLE_BRANCH_USER]	= LBR_USER,
-	[PERF_SAMPLE_BRANCH_KERNEL]	= LBR_KERNEL,
-	[PERF_SAMPLE_BRANCH_HV]		= LBR_IGN,
-	[PERF_SAMPLE_BRANCH_ANY_RETURN]	= LBR_RETURN | LBR_REL_JMP
-					| LBR_IND_JMP | LBR_FAR,
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_REL_JMP
+						| LBR_IND_JMP | LBR_FAR,
 	/*
 	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
 	 */
-	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT] =
 	 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
 	/*
 	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
 	 */
-	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT] = LBR_IND_CALL | LBR_IND_JMP,
 };
 
-static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
-	[PERF_SAMPLE_BRANCH_ANY]	= LBR_ANY,
-	[PERF_SAMPLE_BRANCH_USER]	= LBR_USER,
-	[PERF_SAMPLE_BRANCH_KERNEL]	= LBR_KERNEL,
-	[PERF_SAMPLE_BRANCH_HV]		= LBR_IGN,
-	[PERF_SAMPLE_BRANCH_ANY_RETURN]	= LBR_RETURN | LBR_FAR,
-	[PERF_SAMPLE_BRANCH_ANY_CALL]	= LBR_REL_CALL | LBR_IND_CALL
-					| LBR_FAR,
-	[PERF_SAMPLE_BRANCH_IND_CALL]	= LBR_IND_CALL,
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
 };
 
 /* core */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0b1df41..2ec219e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -148,20 +148,36 @@ enum perf_event_sample_format {
  * The branch types can be combined, however BRANCH_ANY covers all types
  * of branches and therefore it supersedes all the other types.
  */
+enum perf_branch_sample_type_shift {
+	PERF_SAMPLE_BRANCH_USER_SHIFT		= 0, /* user branches */
+	PERF_SAMPLE_BRANCH_KERNEL_SHIFT		= 1, /* kernel branches */
+	PERF_SAMPLE_BRANCH_HV_SHIFT		= 2, /* hypervisor branches */
+
+	PERF_SAMPLE_BRANCH_ANY_SHIFT		= 3, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT	= 4, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT	= 5, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL_SHIFT	= 6, /* indirect calls */
+	PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT	= 7, /* transaction aborts */
+	PERF_SAMPLE_BRANCH_IN_TX_SHIFT		= 8, /* in transaction */
+	PERF_SAMPLE_BRANCH_NO_TX_SHIFT		= 9, /* not in transaction */
+
+	PERF_SAMPLE_BRANCH_MAX_SHIFT		/* non-ABI */
+};
+
 enum perf_branch_sample_type {
-	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user branches */
-	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel branches */
-	PERF_SAMPLE_BRANCH_HV		= 1U << 2, /* hypervisor branches */
-
-	PERF_SAMPLE_BRANCH_ANY		= 1U << 3, /* any branch types */
-	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 4, /* any call branch */
-	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 5, /* any return branch */
-	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 6, /* indirect calls */
-	PERF_SAMPLE_BRANCH_ABORT_TX	= 1U << 7, /* transaction aborts */
-	PERF_SAMPLE_BRANCH_IN_TX	= 1U << 8, /* in transaction */
-	PERF_SAMPLE_BRANCH_NO_TX	= 1U << 9, /* not in transaction */
-
-	PERF_SAMPLE_BRANCH_MAX		= 1U << 10, /* non-ABI */
+	PERF_SAMPLE_BRANCH_USER         = 1U << PERF_SAMPLE_BRANCH_USER_SHIFT,
+	PERF_SAMPLE_BRANCH_KERNEL       = 1U << PERF_SAMPLE_BRANCH_KERNEL_SHIFT,
+	PERF_SAMPLE_BRANCH_HV           = 1U << PERF_SAMPLE_BRANCH_HV_SHIFT,
+
+	PERF_SAMPLE_BRANCH_ANY          = 1U << PERF_SAMPLE_BRANCH_ANY_SHIFT,
+	PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT,
+	PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT,
+	PERF_SAMPLE_BRANCH_IND_CALL     = 1U << PERF_SAMPLE_BRANCH_IND_CALL_SHIFT,
+	PERF_SAMPLE_BRANCH_ABORT_TX     = 1U << PERF_SAMPLE_BRANCH_ABORT_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_IN_TX        = 1U << PERF_SAMPLE_BRANCH_IN_TX_SHIFT,
+	PERF_SAMPLE_BRANCH_NO_TX        = 1U << PERF_SAMPLE_BRANCH_NO_TX_SHIFT,
+
+	PERF_SAMPLE_BRANCH_MAX          = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
-- 
1.8.1.4



* [PATCH v2 2/7] perf, x86: Basic Haswell LBR call stack support
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 1/7] perf, x86: Reduce lbr_sel_map size Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context Yan, Zheng
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

The new HSW call stack feature provides a facility such that
unfiltered call data is collected as usual, but as return
instructions are executed the last captured branch record is
popped from the LBR stack. Thus, branch information for leaf
functions that have already returned is not captured, while the
call stack of the mainline execution path is preserved.
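
A tiny illustration (example code only) of what this means for a
sample taken inside work(): helper() has already returned, so its
record has been popped and only the live path main -> outer -> work
remains in the LBR.

/* By the time work() executes, helper() has returned, so the LBR
 * call stack holds main -> outer -> work, i.e. exactly the live
 * call path at the sample point. */
static int helper(int x) { return x * 2; }	/* returns before the sample */
static int work(int x)   { return x + 1; }	/* imagine a PMI lands here  */
static int outer(int x)  { return work(helper(x)); }

int main(void)
{
	return outer(20) == 41 ? 0 : 1;
}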

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.h           |  7 ++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 93 +++++++++++++++++++++++-------
 3 files changed, 78 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index e33d3b7..3f6172c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -454,7 +454,10 @@ struct x86_pmu {
 };
 
 enum {
-	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+	PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT = PERF_SAMPLE_BRANCH_MAX_SHIFT,
+	PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE,
+
+	PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
 };
 
 #define x86_add_quirk(func_)						\
@@ -687,6 +690,8 @@ void intel_pmu_lbr_init_atom(void);
 
 void intel_pmu_lbr_init_snb(void);
 
+void intel_pmu_lbr_init_hsw(void);
+
 int intel_pmu_setup_lbr_filter(struct perf_event *event);
 
 int p4_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index fbc9210..e163717 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2274,7 +2274,7 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids, sizeof(hw_cache_event_ids));
 		memcpy(hw_cache_extra_regs, snb_hw_cache_extra_regs, sizeof(hw_cache_extra_regs));
 
-		intel_pmu_lbr_init_snb();
+		intel_pmu_lbr_init_hsw();
 
 		x86_pmu.event_constraints = intel_hsw_event_constraints;
 		x86_pmu.pebs_constraints = intel_hsw_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 1e74cc4..11911db 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -39,6 +39,7 @@ static enum {
 #define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
 #define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
 #define LBR_FAR_BIT		8 /* do not capture far branches */
+#define LBR_CALL_STACK_BIT	9 /* enable call stack */
 
 #define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
 #define LBR_USER	(1 << LBR_USER_BIT)
@@ -49,6 +50,7 @@ static enum {
 #define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
 #define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
 #define LBR_FAR		(1 << LBR_FAR_BIT)
+#define LBR_CALL_STACK	(1 << LBR_CALL_STACK_BIT)
 
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
@@ -74,24 +76,25 @@ static enum {
  * x86control flow changes include branches, interrupts, traps, faults
  */
 enum {
-	X86_BR_NONE     = 0,      /* unknown */
-
-	X86_BR_USER     = 1 << 0, /* branch target is user */
-	X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
-
-	X86_BR_CALL     = 1 << 2, /* call */
-	X86_BR_RET      = 1 << 3, /* return */
-	X86_BR_SYSCALL  = 1 << 4, /* syscall */
-	X86_BR_SYSRET   = 1 << 5, /* syscall return */
-	X86_BR_INT      = 1 << 6, /* sw interrupt */
-	X86_BR_IRET     = 1 << 7, /* return from interrupt */
-	X86_BR_JCC      = 1 << 8, /* conditional */
-	X86_BR_JMP      = 1 << 9, /* jump */
-	X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
-	X86_BR_IND_CALL = 1 << 11,/* indirect calls */
-	X86_BR_ABORT    = 1 << 12,/* transaction abort */
-	X86_BR_IN_TX    = 1 << 13,/* in transaction */
-	X86_BR_NO_TX    = 1 << 14,/* not in transaction */
+	X86_BR_NONE		= 0,      /* unknown */
+
+	X86_BR_USER		= 1 << 0, /* branch target is user */
+	X86_BR_KERNEL		= 1 << 1, /* branch target is kernel */
+
+	X86_BR_CALL		= 1 << 2, /* call */
+	X86_BR_RET		= 1 << 3, /* return */
+	X86_BR_SYSCALL		= 1 << 4, /* syscall */
+	X86_BR_SYSRET		= 1 << 5, /* syscall return */
+	X86_BR_INT		= 1 << 6, /* sw interrupt */
+	X86_BR_IRET		= 1 << 7, /* return from interrupt */
+	X86_BR_JCC		= 1 << 8, /* conditional */
+	X86_BR_JMP		= 1 << 9, /* jump */
+	X86_BR_IRQ		= 1 << 10,/* hw interrupt or trap or fault */
+	X86_BR_IND_CALL		= 1 << 11,/* indirect calls */
+	X86_BR_ABORT		= 1 << 12,/* transaction abort */
+	X86_BR_IN_TX		= 1 << 13,/* in transaction */
+	X86_BR_NO_TX		= 1 << 14,/* not in transaction */
+	X86_BR_CALL_STACK	= 1 << 15,/* call stack */
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -135,7 +138,14 @@ static void __intel_pmu_lbr_enable(void)
 		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
-	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
+	debugctl |= DEBUGCTLMSR_LBR;
+	/*
+	 * LBR callstack does not work well with FREEZE_LBRS_ON_PMI.
+	 * If FREEZE_LBRS_ON_PMI is set, PMI near call/return instructions
+	 * may cause superfluous increase/decrease of LBR_TOS.
+	 */
+	if (!cpuc->lbr_sel || !(cpuc->lbr_sel->config & LBR_CALL_STACK))
+		debugctl |= DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
 	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 }
 
@@ -333,7 +343,7 @@ void intel_pmu_lbr_read(void)
  * - in case there is no HW filter
  * - in case the HW filter has errata or limitations
  */
-static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 {
 	u64 br_type = event->attr.branch_sample_type;
 	int mask = 0;
@@ -367,11 +377,21 @@ static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 	if (br_type & PERF_SAMPLE_BRANCH_NO_TX)
 		mask |= X86_BR_NO_TX;
 
+	if (br_type & PERF_SAMPLE_BRANCH_CALL_STACK) {
+		if (!x86_pmu.lbr_sel_map)
+			return -EOPNOTSUPP;
+		if (mask & ~(X86_BR_USER | X86_BR_KERNEL))
+			return -EINVAL;
+		mask |= X86_BR_CALL | X86_BR_IND_CALL | X86_BR_RET |
+			X86_BR_CALL_STACK;
+	}
+
 	/*
 	 * stash actual user request into reg, it may
 	 * be used by fixup code for some CPU
 	 */
 	event->hw.branch_reg.reg = mask;
+	return 0;
 }
 
 /*
@@ -401,7 +421,7 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	reg->idx = EXTRA_REG_LBR;
 
 	/* LBR_SELECT operates in suppress mode so invert mask */
-	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+	reg->config = mask ^ x86_pmu.lbr_sel_mask;
 
 	return 0;
 }
@@ -419,7 +439,9 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
 	/*
 	 * setup SW LBR filter
 	 */
-	intel_pmu_setup_sw_lbr_filter(event);
+	ret = intel_pmu_setup_sw_lbr_filter(event);
+	if (ret)
+		return ret;
 
 	/*
 	 * setup HW LBR filter, if any
@@ -674,6 +696,19 @@ static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
 	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
 };
 
+static const int hsw_lbr_sel_map[PERF_SAMPLE_BRANCH_SELECT_MAP_SIZE] = {
+	[PERF_SAMPLE_BRANCH_ANY_SHIFT]		= LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER_SHIFT]		= LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL_SHIFT]	= LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_HV_SHIFT]		= LBR_IGN,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL_SHIFT]	= LBR_IND_CALL,
+	[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL
+						| LBR_RETURN | LBR_CALL_STACK,
+};
+
 /* core */
 void intel_pmu_lbr_init_core(void)
 {
@@ -730,6 +765,20 @@ void intel_pmu_lbr_init_snb(void)
 	pr_cont("16-deep LBR, ");
 }
 
+/* haswell */
+void intel_pmu_lbr_init_hsw(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
 /* atom */
 void intel_pmu_lbr_init_atom(void)
 {
-- 
1.8.1.4



* [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 1/7] perf, x86: Reduce lbr_sel_map size Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 2/7] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  2013-07-04 12:41   ` Peter Zijlstra
  2013-07-01  7:23 ` [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

The x86 special perf event context is named x86_perf_event_context.
We can enlarge it later to store PMU-specific data.
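
A stand-alone sketch of the pattern (stand-in structures, not the
kernel definitions): the generic context is embedded as the first
member, so the pointer handed back to the core can later be cast to
the containing x86 structure.

#include <stdio.h>
#include <stdlib.h>

struct perf_event_context { int nr_events; };

struct x86_perf_event_context {
	struct perf_event_context ctx;	/* must remain the first member */
	int lbr_callstack_users;	/* room for PMU-specific data   */
};

static struct perf_event_context *event_context_alloc(void)
{
	struct x86_perf_event_context *x86_ctx = calloc(1, sizeof(*x86_ctx));

	return x86_ctx ? &x86_ctx->ctx : NULL;
}

int main(void)
{
	struct perf_event_context *ctx = event_context_alloc();
	struct x86_perf_event_context *x86_ctx;

	if (!ctx)
		return 1;
	/* because ctx is the first member, the cast recovers the container */
	x86_ctx = (void *)ctx;
	printf("lbr_callstack_users = %d\n", x86_ctx->lbr_callstack_users);
	free(x86_ctx);
	return 0;
}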

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 12 ++++++++++++
 arch/x86/kernel/cpu/perf_event.h |  4 ++++
 include/linux/perf_event.h       |  5 +++++
 kernel/events/core.c             | 28 ++++++++++++++++++----------
 4 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9e581c5..94dbe71 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1792,6 +1792,17 @@ static int x86_pmu_event_idx(struct perf_event *event)
 	return idx + 1;
 }
 
+static void *x86_pmu_event_context_alloc(struct perf_event_context *parent_ctx)
+{
+	struct perf_event_context *ctx;
+
+	ctx = kzalloc(sizeof(struct x86_perf_event_context), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	return ctx;
+}
+
 static ssize_t get_attr_rdpmc(struct device *cdev,
 			      struct device_attribute *attr,
 			      char *buf)
@@ -1879,6 +1890,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.event_context_alloc	= x86_pmu_event_context_alloc,
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3f6172c..047ddc6 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -460,6 +460,10 @@ enum {
 	PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
 };
 
+struct x86_perf_event_context {
+	struct perf_event_context ctx;
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 50b3efd..f6d1d59 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -274,6 +274,11 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Allocate PMU special perf event context
+	 */
+	void *(*event_context_alloc)	(struct perf_event_context *parent_ctx);
 };
 
 /**
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1db3af9..3aececc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2961,13 +2961,20 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 }
 
 static struct perf_event_context *
-alloc_perf_context(struct pmu *pmu, struct task_struct *task)
+alloc_perf_context(struct pmu *pmu, struct task_struct *task,
+		   struct perf_event_context *parent_ctx)
 {
 	struct perf_event_context *ctx;
 
-	ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
-	if (!ctx)
-		return NULL;
+	if (pmu->event_context_alloc) {
+		ctx = pmu->event_context_alloc(parent_ctx);
+		if (IS_ERR(ctx))
+			return ctx;
+	} else {
+		ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
+		if (!ctx)
+			return ERR_PTR(-ENOMEM);
+	}
 
 	__perf_event_init_context(ctx);
 	if (task) {
@@ -3053,10 +3060,11 @@ retry:
 		++ctx->pin_count;
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	} else {
-		ctx = alloc_perf_context(pmu, task);
-		err = -ENOMEM;
-		if (!ctx)
+		ctx = alloc_perf_context(pmu, task, NULL);
+		if (IS_ERR(ctx)) {
+			err = PTR_ERR(ctx);
 			goto errout;
+		}
 
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
@@ -7465,9 +7473,9 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		 * child.
 		 */
 
-		child_ctx = alloc_perf_context(event->pmu, child);
-		if (!child_ctx)
-			return -ENOMEM;
+		child_ctx = alloc_perf_context(event->pmu, child, parent_ctx);
+		if (IS_ERR(child_ctx))
+			return PTR_ERR(child_ctx);
 
 		child->perf_event_ctxp[ctxn] = child_ctx;
 	}
-- 
1.8.1.4



* [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (2 preceding siblings ...)
  2013-07-01  7:23 ` [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  2013-07-04  9:57   ` Peter Zijlstra
                     ` (2 more replies)
  2013-07-01  7:23 ` [PATCH v2 5/7] perf, core: Pass perf_sample_data to perf_callchain() Yan, Zheng
                   ` (2 subsequent siblings)
  6 siblings, 3 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

When the LBR call stack is enabled, it is necessary to save/restore
the LBR stack on context switch. The solution is to save/restore the
LBR stack to/from the task's perf event context.

If there are cpu-wide events that use the LBR, the LBR stack is not
saved/restored on context switches; it is flushed instead.
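
A condensed model of the policy (stand-in types; see the real
intel_pmu_lbr_sched() below for the authoritative logic):

#include <stdbool.h>

struct task_lbr_state { bool stack_saved; };

enum lbr_action { LBR_RESET, LBR_SAVE, LBR_RESTORE, LBR_INVALIDATE, LBR_NOP };

static enum lbr_action lbr_sched_action(struct task_lbr_state *task,
					bool sched_in, bool force_flush)
{
	if (force_flush || !task) {
		/* cpu-wide LBR users: flush on sched-in, drop any saved copy */
		if (sched_in)
			return LBR_RESET;
		return task ? LBR_INVALIDATE : LBR_NOP;
	}
	if (sched_in)	/* restore the saved stack, or reset if none */
		return task->stack_saved ? LBR_RESTORE : LBR_RESET;
	return LBR_SAVE;	/* sched-out: stash the stack in the task ctx */
}

int main(void)
{
	struct task_lbr_state t = { .stack_saved = true };

	return lbr_sched_action(&t, true, false) == LBR_RESTORE ? 0 : 1;
}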

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           |  18 +++--
 arch/x86/kernel/cpu/perf_event.h           |  14 +++-
 arch/x86/kernel/cpu/perf_event_intel.c     |  13 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 111 ++++++++++++++++++++++++++---
 include/linux/perf_event.h                 |   7 +-
 kernel/events/core.c                       |  74 +++++++++++--------
 6 files changed, 180 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 94dbe71..2a79382 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1792,6 +1792,13 @@ static int x86_pmu_event_idx(struct perf_event *event)
 	return idx + 1;
 }
 
+static void x86_pmu_branch_stack_sched(struct perf_event_context *ctx,
+					bool sched_in, bool force_flush)
+{
+	if (x86_pmu.branch_stack_sched)
+		x86_pmu.branch_stack_sched(ctx, sched_in, force_flush);
+}
+
 static void *x86_pmu_event_context_alloc(struct perf_event_context *parent_ctx)
 {
 	struct perf_event_context *ctx;
@@ -1800,6 +1807,9 @@ static void *x86_pmu_event_context_alloc(struct perf_event_context *parent_ctx)
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
+	if (parent_ctx)
+		intel_pmu_lbr_init_context(ctx, parent_ctx);
+
 	return ctx;
 }
 
@@ -1857,12 +1867,6 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
-static void x86_pmu_flush_branch_stack(void)
-{
-	if (x86_pmu.flush_branch_stack)
-		x86_pmu.flush_branch_stack();
-}
-
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -1889,7 +1893,7 @@ static struct pmu pmu = {
 	.commit_txn		= x86_pmu_commit_txn,
 
 	.event_idx		= x86_pmu_event_idx,
-	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.branch_stack_sched     = x86_pmu_branch_stack_sched,
 	.event_context_alloc	= x86_pmu_event_context_alloc,
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 047ddc6..9dff9fc 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -411,7 +411,6 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*flush_branch_stack)(void);
 
 	/*
 	 * Intel Arch Perfmon v2+
@@ -440,6 +439,8 @@ struct x86_pmu {
 	int		lbr_nr;			   /* hardware stack size */
 	u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
 	const int	*lbr_sel_map;		   /* lbr_select mappings */
+	void		(*branch_stack_sched)(struct perf_event_context *ctx,
+					      bool sched_in, bool force_flush);
 
 	/*
 	 * Extra registers for events
@@ -462,6 +463,12 @@ enum {
 
 struct x86_perf_event_context {
 	struct perf_event_context ctx;
+
+	u64 lbr_from[MAX_LBR_ENTRIES];
+	u64 lbr_to[MAX_LBR_ENTRIES];
+	u64 lbr_stack_gen;
+	int lbr_callstack_users;
+	bool lbr_stack_saved;
 };
 
 #define x86_add_quirk(func_)						\
@@ -674,8 +681,13 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_init_context(struct perf_event_context *child_ctx,
+				struct perf_event_context *parent_ctx);
 void intel_pmu_lbr_reset(void);
 
+void intel_pmu_lbr_sched(struct perf_event_context *ctx,
+			 bool sched_in, bool force_flush);
+
 void intel_pmu_lbr_enable(struct perf_event *event);
 
 void intel_pmu_lbr_disable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e163717..96c30bb 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1849,16 +1849,11 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
-static void intel_pmu_flush_branch_stack(void)
+static void intel_pmu_branch_stack_sched(struct perf_event_context *ctx,
+					 bool sched_in, bool force_flush)
 {
-	/*
-	 * Intel LBR does not tag entries with the
-	 * PID of the current task, then we need to
-	 * flush it on ctxsw
-	 * For now, we simply reset it
-	 */
 	if (x86_pmu.lbr_nr)
-		intel_pmu_lbr_reset();
+		intel_pmu_lbr_sched(ctx, sched_in, force_flush);
 }
 
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
@@ -1912,7 +1907,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.flush_branch_stack	= intel_pmu_flush_branch_stack,
+	.branch_stack_sched	= intel_pmu_branch_stack_sched,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 11911db..013a9b9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -185,6 +185,13 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_32();
 	else
 		intel_pmu_lbr_reset_64();
+
+	wrmsrl(x86_pmu.lbr_tos, 0);
+}
+
+static inline bool branch_user_callstack(unsigned br_sel)
+{
+	return (br_sel & X86_BR_USER) && (br_sel & X86_BR_CALL_STACK);
 }
 
 void intel_pmu_lbr_enable(struct perf_event *event)
@@ -194,17 +201,23 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
-	/*
-	 * Reset the LBR stack if we changed task context to
-	 * avoid data leaks.
-	 */
-	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
-		intel_pmu_lbr_reset();
-		cpuc->lbr_context = event->ctx;
-	}
 	cpuc->br_sel = event->hw.branch_reg.reg;
-
 	cpuc->lbr_users++;
+
+	if (event->ctx->task &&
+	    branch_user_callstack(event->hw.branch_reg.reg)) {
+		struct x86_perf_event_context *task_ctx = (void *)event->ctx;
+		/*
+		 * Reset the LBR stack if the call stack is not
+		 * continuously enabled
+		 */
+		if (task_ctx->lbr_callstack_users == 0 &&
+		    task_ctx->lbr_stack_gen + 1 < event->ctx->sched_gen)
+			intel_pmu_lbr_reset();
+
+		task_ctx->lbr_callstack_users++;
+		task_ctx->lbr_stack_gen = event->ctx->sched_gen;
+	}
 }
 
 void intel_pmu_lbr_disable(struct perf_event *event)
@@ -214,6 +227,13 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
+	if (event->ctx->task &&
+	    branch_user_callstack(event->hw.branch_reg.reg)) {
+		struct x86_perf_event_context *task_ctx = (void *)event->ctx;
+
+		task_ctx->lbr_callstack_users--;
+	}
+
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
@@ -338,6 +358,79 @@ void intel_pmu_lbr_read(void)
 	intel_pmu_lbr_filter(cpuc);
 }
 
+static void __intel_pmu_lbr_restore(struct x86_perf_event_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
+	u64 tos = intel_pmu_lbr_tos();
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		wrmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_saved = false;
+}
+
+static void __intel_pmu_lbr_save(struct x86_perf_event_context *task_ctx)
+{
+	int i;
+	unsigned lbr_idx, mask = x86_pmu.lbr_nr - 1;
+	u64 tos = intel_pmu_lbr_tos();
+
+	for (i = 0; i < x86_pmu.lbr_nr; i++) {
+		lbr_idx = (tos - i) & mask;
+		rdmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
+		rdmsrl(x86_pmu.lbr_to + lbr_idx, task_ctx->lbr_to[i]);
+	}
+	task_ctx->lbr_stack_gen = task_ctx->ctx.sched_gen;
+	task_ctx->lbr_stack_saved = true;
+}
+
+void intel_pmu_lbr_init_context(struct perf_event_context *child_ctx,
+				struct perf_event_context *parent_ctx)
+{
+	struct x86_perf_event_context *task_ctx, *parent_task_ctx;
+
+	if (!x86_pmu.lbr_nr)
+		return;
+
+	task_ctx = (struct x86_perf_event_context *)child_ctx;
+	parent_task_ctx = (struct x86_perf_event_context *)parent_ctx;
+
+	if (parent_task_ctx->lbr_callstack_users)
+		__intel_pmu_lbr_save(task_ctx);
+	else
+		task_ctx->lbr_stack_saved = false;
+}
+
+void intel_pmu_lbr_sched(struct perf_event_context *ctx,
+			 bool sched_in, bool force_flush)
+{
+	struct x86_perf_event_context *task_ctx;
+
+	if (!x86_pmu.lbr_nr)
+		return;
+
+	task_ctx = (struct x86_perf_event_context *)ctx;
+	if (force_flush || !task_ctx) {
+		if (sched_in)
+			intel_pmu_lbr_reset();
+		else if (task_ctx)
+			task_ctx->lbr_stack_saved = false;
+		return;
+	}
+
+	if (sched_in) {
+		if (!task_ctx->lbr_stack_saved)
+			intel_pmu_lbr_reset();
+		else
+			__intel_pmu_lbr_restore(task_ctx);
+	} else {
+		__intel_pmu_lbr_save(task_ctx);
+	}
+}
+
 /*
  * SW filter is used:
  * - in case there is no HW filter
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f6d1d59..e9af629 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,9 +271,11 @@ struct pmu {
 	int (*event_idx)		(struct perf_event *event); /*optional */
 
 	/*
-	 * flush branch stack on context-switches (needed in cpu-wide mode)
+	 * On context-switches, save/restore the LBR stack or
+	 * flush the LBR stack (needed in cpu-wide mode)
 	 */
-	void (*flush_branch_stack)	(void);
+	void (*branch_stack_sched)	(struct perf_event_context *ctx,
+					 bool sched_in, bool force_flush);
 
 	/*
 	 * Allocate PMU special perf event context
@@ -495,6 +497,7 @@ struct perf_event_context {
 	struct perf_event_context	*parent_ctx;
 	u64				parent_gen;
 	u64				generation;
+	u64				sched_gen;
 	int				pin_count;
 	int				nr_cgroups;	 /* cgroup evts */
 	int				nr_branch_stack; /* branch_stack evt */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3aececc..02ed472 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -140,7 +140,7 @@ enum event_type_t {
  */
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(int, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -278,6 +278,9 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
 			     enum event_type_t event_type,
 			     struct task_struct *task);
+static void perf_branch_stack_sched(struct task_struct *task1,
+				    struct task_struct *task2,
+				    bool sched_in);
 
 static void update_context_time(struct perf_event_context *ctx);
 static u64 perf_event_time(struct perf_event *event);
@@ -1271,8 +1274,11 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
-	if (has_branch_stack(event))
+	if (has_branch_stack(event)) {
+		if (ctx->is_active)
+			__get_cpu_var(perf_branch_stack_events)--;
 		ctx->nr_branch_stack--;
+	}
 
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
@@ -1796,8 +1802,10 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 				struct task_struct *task)
 {
 	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
-	if (ctx)
+	if (ctx) {
+		ctx->sched_gen++;
 		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
+	}
 	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
 	if (ctx)
 		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
@@ -2102,6 +2110,9 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 	if (likely(!ctx->nr_events))
 		return;
 
+	if (!ctx->is_active && is_active)
+		__get_cpu_var(perf_branch_stack_events) -= ctx->nr_branch_stack;
+
 	update_context_time(ctx);
 	update_cgrp_time_from_cpuctx(cpuctx);
 	if (!ctx->nr_active)
@@ -2291,6 +2302,10 @@ void __perf_event_task_sched_out(struct task_struct *task,
 {
 	int ctxn;
 
+	/* check for branch_stack events running on this cpu */
+	if (__get_cpu_var(perf_branch_stack_events))
+		perf_branch_stack_sched(task, next, false);
+
 	for_each_task_context_nr(ctxn)
 		perf_event_context_sched_out(task, ctxn, next);
 
@@ -2398,6 +2413,9 @@ ctx_sched_in(struct perf_event_context *ctx,
 	if (likely(!ctx->nr_events))
 		return;
 
+	if (ctx->is_active && !is_active)
+		__get_cpu_var(perf_branch_stack_events) += ctx->nr_branch_stack;
+
 	now = perf_clock();
 	ctx->timestamp = now;
 	perf_cgroup_set_timestamp(task, ctx);
@@ -2471,15 +2489,17 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
  * layer. It is invoked ONLY when there is at least one system-wide context
  * with at least one active event using taken branch sampling.
  */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
-				       struct task_struct *task)
+static void perf_branch_stack_sched(struct task_struct *task1,
+				    struct task_struct *task2,
+				    bool sched_in)
 {
 	struct perf_cpu_context *cpuctx;
+	struct perf_event_context *task_ctx;
 	struct pmu *pmu;
 	unsigned long flags;
 
 	/* no need to flush branch stack if not changing task */
-	if (prev == task)
+	if (task1 == task2)
 		return;
 
 	local_irq_save(flags);
@@ -2488,25 +2508,31 @@ static void perf_branch_stack_sched_in(struct task_struct *prev,
 
 	list_for_each_entry_rcu(pmu, &pmus, entry) {
 		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+		task_ctx = cpuctx->task_ctx;
 
 		/*
-		 * check if the context has at least one
-		 * event using PERF_SAMPLE_BRANCH_STACK
+		 * force flush the branch stack if there are cpu-wide events
+		 * using PERF_SAMPLE_BRANCH_STACK
+		 *
+		 * save/restore the branch stack if the task context has
+		 * at least one event using PERF_SAMPLE_BRANCH_STACK
 		 */
-		if (cpuctx->ctx.nr_branch_stack > 0
-		    && pmu->flush_branch_stack) {
-
+		bool force_flush = cpuctx->ctx.nr_branch_stack > 0;
+		if (pmu->branch_stack_sched &&
+		    (force_flush ||
+		     (task_ctx && task_ctx->nr_branch_stack > 0))) {
 			pmu = cpuctx->ctx.pmu;
 
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+			perf_ctx_lock(cpuctx, task_ctx);
 
 			perf_pmu_disable(pmu);
 
-			pmu->flush_branch_stack();
+			pmu->branch_stack_sched(task_ctx,
+						sched_in, force_flush);
 
 			perf_pmu_enable(pmu);
 
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+			perf_ctx_unlock(cpuctx, task_ctx);
 		}
 	}
 
@@ -2547,9 +2573,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
 
-	/* check for system-wide branch_stack events */
-	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
-		perf_branch_stack_sched_in(prev, task);
+	/* check for branch_stack events running on this cpu */
+	if (__get_cpu_var(perf_branch_stack_events))
+		perf_branch_stack_sched(prev, task, true);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -3134,14 +3160,8 @@ static void free_event(struct perf_event *event)
 			static_key_slow_dec_deferred(&perf_sched_events);
 		}
 
-		if (has_branch_stack(event)) {
+		if (has_branch_stack(event))
 			static_key_slow_dec_deferred(&perf_sched_events);
-			/* is system-wide event */
-			if (!(event->attach_state & PERF_ATTACH_TASK)) {
-				atomic_dec(&per_cpu(perf_branch_stack_events,
-						    event->cpu));
-			}
-		}
 	}
 
 	if (event->rb) {
@@ -6562,12 +6582,8 @@ done:
 				return ERR_PTR(err);
 			}
 		}
-		if (has_branch_stack(event)) {
+		if (has_branch_stack(event))
 			static_key_slow_inc(&perf_sched_events.key);
-			if (!(event->attach_state & PERF_ATTACH_TASK))
-				atomic_inc(&per_cpu(perf_branch_stack_events,
-						    event->cpu));
-		}
 	}
 
 	return event;
-- 
1.8.1.4



* [PATCH v2 5/7] perf, core: Pass perf_sample_data to perf_callchain()
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (3 preceding siblings ...)
  2013-07-01  7:23 ` [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 6/7] perf, x86: Use LBR call stack to get user callchain Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 7/7] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
  6 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

New Intel CPUs can record call chains using the existing Last Branch
Record facility. perf_callchain_user() can make use of the call chains
recorded by hardware when there is no frame pointer.
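
A sketch of why the sample data needs to be passed down (stand-in
types; the actual x86 fallback is added in the next patch): the user
callchain walker can append the hardware-recorded branch sources when
it has nothing else to walk.

#define MAX_DEPTH 64

struct perf_callchain_entry { unsigned long nr; unsigned long ip[MAX_DEPTH]; };
struct perf_branch_entry    { unsigned long from, to; };
struct perf_branch_stack    { unsigned long nr; struct perf_branch_entry entries[16]; };
struct perf_sample_data     { struct perf_branch_stack *br_stack; };

static void callchain_store(struct perf_callchain_entry *e, unsigned long ip)
{
	if (e->nr < MAX_DEPTH)
		e->ip[e->nr++] = ip;
}

/* append LBR 'from' addresses when the frame-pointer walk found nothing */
static void callchain_from_lbr(struct perf_callchain_entry *entry,
			       struct perf_sample_data *data)
{
	unsigned long i;

	if (!data || !data->br_stack)
		return;
	for (i = 0; i < data->br_stack->nr; i++)
		callchain_store(entry, data->br_stack->entries[i].from);
}

int main(void)
{
	struct perf_callchain_entry entry = { 0 };
	struct perf_branch_stack lbr = { .nr = 2,
		.entries = { { 0x400100, 0x400200 }, { 0x400300, 0x400400 } } };
	struct perf_sample_data data = { .br_stack = &lbr };

	callchain_from_lbr(&entry, &data);
	return entry.nr == 2 ? 0 : 1;
}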

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/arm/kernel/perf_event.c     | 4 ++--
 arch/powerpc/perf/callchain.c    | 4 ++--
 arch/sparc/kernel/perf_event.c   | 4 ++--
 arch/x86/kernel/cpu/perf_event.c | 4 ++--
 include/linux/perf_event.h       | 3 ++-
 kernel/events/callchain.c        | 8 +++++---
 kernel/events/core.c             | 2 +-
 kernel/events/internal.h         | 3 ++-
 8 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 8c3094d..3f84d3c 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -559,8 +559,8 @@ user_backtrace(struct frame_tail __user *tail,
 	return buftail.fp - 1;
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	struct frame_tail __user *tail;
 
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 74d1e78..b379ebc 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -482,8 +482,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
 	}
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	if (current_is_64bit())
 		perf_callchain_user_64(entry, regs);
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index b5c38fa..cba0306 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1785,8 +1785,8 @@ static void perf_callchain_user_32(struct perf_callchain_entry *entry,
 	} while (entry->nr < PERF_MAX_STACK_DEPTH);
 }
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	perf_callchain_store(entry, regs->tpc);
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2a79382..be7253f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2023,8 +2023,8 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 }
 #endif
 
-void
-perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
+void perf_callchain_user(struct perf_callchain_entry *entry,
+			 struct pt_regs *regs, struct perf_sample_data *data)
 {
 	struct stack_frame frame;
 	const void __user *fp;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e9af629..8bbe01f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -702,7 +702,8 @@ extern void perf_event_fork(struct task_struct *tsk);
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
-extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs);
+extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs,
+				struct perf_sample_data *data);
 extern void perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs);
 
 static inline void perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index c772061..bd7138a 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -30,7 +30,8 @@ __weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
 }
 
 __weak void perf_callchain_user(struct perf_callchain_entry *entry,
-				struct pt_regs *regs)
+				struct pt_regs *regs,
+				struct perf_sample_data *data)
 {
 }
 
@@ -154,7 +155,8 @@ put_callchain_entry(int rctx)
 }
 
 struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+	       struct perf_sample_data *data)
 {
 	int rctx;
 	struct perf_callchain_entry *entry;
@@ -195,7 +197,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 				goto exit_put;
 
 			perf_callchain_store(entry, PERF_CONTEXT_USER);
-			perf_callchain_user(entry, regs);
+			perf_callchain_user(entry, regs, data);
 		}
 	}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 02ed472..db3677d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4550,7 +4550,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
 
-		data->callchain = perf_callchain(event, regs);
+		data->callchain = perf_callchain(event, regs, data);
 
 		if (data->callchain)
 			size += data->callchain->nr;
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index ca65997..0e939e6 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -130,7 +130,8 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
 
 /* Callchain handling */
 extern struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs);
+perf_callchain(struct perf_event *event, struct pt_regs *regs,
+	       struct perf_sample_data *data);
 extern int get_callchain_buffers(void);
 extern void put_callchain_buffers(void);
 
-- 
1.8.1.4



* [PATCH v2 6/7] perf, x86: Use LBR call stack to get user callchain
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (4 preceding siblings ...)
  2013-07-01  7:23 ` [PATCH v2 5/7] perf, core: Pass perf_sample_data to perf_callchain() Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  2013-07-01  7:23 ` [PATCH v2 7/7] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
  6 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

Try enabling the LBR call stack feature if the event requests
recording the callchain. Try utilizing the LBR call stack to get the
user callchain when there is no frame pointer.

This patch also adds a cpu pmu sysfs attribute to enable/disable this
feature.
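
For completeness, the attribute can be flipped from user space as in
this example (the path matches the file added by this series; example
code only, the usual echo into sysfs works just as well):

/* Toggle the LBR call stack attribute.  Pass "0" or "1"; requires
 * write permission on the sysfs file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = "/sys/devices/cpu/lbr_callstack";
	const char *val = (argc > 1) ? argv[1] : "1";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror("write");
	close(fd);
	return 0;
}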

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c           | 128 +++++++++++++++++++++--------
 arch/x86/kernel/cpu/perf_event.h           |   7 ++
 arch/x86/kernel/cpu/perf_event_intel.c     |  20 ++---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   3 +
 include/linux/perf_event.h                 |   6 ++
 kernel/events/core.c                       |  11 ++-
 6 files changed, 126 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index be7253f..6061169 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -399,37 +399,49 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
+	}
+	/*
+	 * check that PEBS LBR correction does not conflict with
+	 * whatever the user is asking with attr->branch_sample_type
+	 */
+	if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format < 2) {
+		u64 *br_type = &event->attr.branch_sample_type;
+
+		if (has_branch_stack(event)) {
+			if (!precise_br_compat(event))
+				return -EOPNOTSUPP;
+
+			/* branch_sample_type is compatible */
+
+		} else {
+			/*
+			 * user did not specify  branch_sample_type
+			 *
+			 * For PEBS fixups, we capture all
+			 * the branches at the priv level of the
+			 * event.
+			 */
+			*br_type = PERF_SAMPLE_BRANCH_ANY;
+
+			if (!event->attr.exclude_user)
+				*br_type |= PERF_SAMPLE_BRANCH_USER;
+
+			if (!event->attr.exclude_kernel)
+				*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
+		}
+	} else if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+		   !has_branch_stack(event) &&
+		   x86_pmu.attr_lbr_callstack &&
+		   !event->attr.exclude_user &&
+		   (event->attach_state & PERF_ATTACH_TASK)) {
 		/*
-		 * check that PEBS LBR correction does not conflict with
-		 * whatever the user is asking with attr->branch_sample_type
+		 * user did not specify branch_sample_type,
+		 * try using the LBR call stack facility to
+		 * record call chains of user program.
 		 */
-		if (event->attr.precise_ip > 1 &&
-		    x86_pmu.intel_cap.pebs_format < 2) {
-			u64 *br_type = &event->attr.branch_sample_type;
-
-			if (has_branch_stack(event)) {
-				if (!precise_br_compat(event))
-					return -EOPNOTSUPP;
-
-				/* branch_sample_type is compatible */
-
-			} else {
-				/*
-				 * user did not specify  branch_sample_type
-				 *
-				 * For PEBS fixups, we capture all
-				 * the branches at the priv level of the
-				 * event.
-				 */
-				*br_type = PERF_SAMPLE_BRANCH_ANY;
-
-				if (!event->attr.exclude_user)
-					*br_type |= PERF_SAMPLE_BRANCH_USER;
-
-				if (!event->attr.exclude_kernel)
-					*br_type |= PERF_SAMPLE_BRANCH_KERNEL;
-			}
-		}
+		event->attr.branch_sample_type =
+			PERF_SAMPLE_BRANCH_USER |
+			PERF_SAMPLE_BRANCH_CALL_STACK;
 	}
 
 	/*
@@ -1849,10 +1861,34 @@ static ssize_t set_attr_rdpmc(struct device *cdev,
 	return count;
 }
 
+static ssize_t get_attr_lbr_callstack(struct device *cdev,
+				      struct device_attribute *attr, char *buf)
+{
+	return snprintf(buf, 40, "%d\n", x86_pmu.attr_lbr_callstack);
+}
+
+static ssize_t set_attr_lbr_callstack(struct device *cdev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
+{
+	unsigned long val = simple_strtoul(buf, NULL, 0);
+
+	if (x86_pmu.attr_lbr_callstack != !!val) {
+		if (val && !x86_pmu_has_lbr_callstack())
+			return -EOPNOTSUPP;
+		x86_pmu.attr_lbr_callstack = !!val;
+	}
+	return count;
+}
+
 static DEVICE_ATTR(rdpmc, S_IRUSR | S_IWUSR, get_attr_rdpmc, set_attr_rdpmc);
+static DEVICE_ATTR(lbr_callstack, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH,
+		   get_attr_lbr_callstack, set_attr_lbr_callstack);
+
 
 static struct attribute *x86_pmu_attrs[] = {
 	&dev_attr_rdpmc.attr,
+	&dev_attr_lbr_callstack.attr,
 	NULL,
 };
 
@@ -1979,12 +2015,29 @@ static unsigned long get_segment_base(unsigned int segment)
 	return get_desc_base(desc + idx);
 }
 
+static inline void
+perf_callchain_lbr_callstack(struct perf_callchain_entry *entry,
+			     struct perf_sample_data *data)
+{
+	struct perf_branch_stack *br_stack = data->br_stack;
+
+	if (br_stack && br_stack->user_callstack &&
+	    x86_pmu.attr_lbr_callstack) {
+		int i = 0;
+		while (i < br_stack->nr && entry->nr < PERF_MAX_STACK_DEPTH) {
+			perf_callchain_store(entry, br_stack->entries[i].from);
+			i++;
+		}
+	}
+}
+
 #ifdef CONFIG_COMPAT
 
 #include <asm/compat.h>
 
 static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+		      struct pt_regs *regs, struct perf_sample_data *data)
 {
 	/* 32-bit process in 64-bit kernel. */
 	unsigned long ss_base, cs_base;
@@ -2013,11 +2066,16 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
+
+	if (fp == compat_ptr(regs->bp))
+		perf_callchain_lbr_callstack(entry, data);
+
 	return 1;
 }
 #else
 static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
+perf_callchain_user32(struct perf_callchain_entry *entry,
+		      struct pt_regs *regs, struct perf_sample_data *data)
 {
     return 0;
 }
@@ -2047,12 +2105,12 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
 	if (!current->mm)
 		return;
 
-	if (perf_callchain_user32(regs, entry))
+	if (perf_callchain_user32(entry, regs, data))
 		return;
 
 	while (entry->nr < PERF_MAX_STACK_DEPTH) {
 		unsigned long bytes;
-		frame.next_frame	     = NULL;
+		frame.next_frame = NULL;
 		frame.return_address = 0;
 
 		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
@@ -2065,6 +2123,10 @@ void perf_callchain_user(struct perf_callchain_entry *entry,
 		perf_callchain_store(entry, frame.return_address);
 		fp = frame.next_frame;
 	}
+
+	/* try LBR callstack if there is no frame pointer */
+	if (fp == (void __user *)regs->bp)
+		perf_callchain_lbr_callstack(entry, data);
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 9dff9fc..412bf34 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -396,6 +396,7 @@ struct x86_pmu {
 	 * sysfs attrs
 	 */
 	int		attr_rdpmc;
+	int		attr_lbr_callstack;
 	struct attribute **format_attrs;
 	struct attribute **event_attrs;
 
@@ -502,6 +503,12 @@ static struct perf_pmu_events_attr event_attr_##v = {			\
 
 extern struct x86_pmu x86_pmu __read_mostly;
 
+static inline bool x86_pmu_has_lbr_callstack(void)
+{
+	return  x86_pmu.lbr_sel_map &&
+		x86_pmu.lbr_sel_map[PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT] > 0;
+}
+
 DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
 
 int x86_perf_event_set_period(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 96c30bb..5470265 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -882,15 +882,10 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
-static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+static inline bool intel_pmu_needs_lbr_callstack(struct perf_event *event)
 {
-	/* user explicitly requested branch sampling */
-	if (has_branch_stack(event))
-		return true;
-
-	/* implicit branch sampling to correct PEBS skid */
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1 &&
-	    x86_pmu.intel_cap.pebs_format < 2)
+	if ((event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) &&
+	    (event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_CALL_STACK))
 		return true;
 
 	return false;
@@ -1054,7 +1049,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	 * must disable before any actual event
 	 * because any event may be combined with LBR
 	 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
 		intel_pmu_lbr_disable(event);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
@@ -1115,7 +1110,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	 * must enabled before any actual event
 	 * because any event may be combined with LBR
 	 */
-	if (intel_pmu_needs_lbr_smpl(event))
+	if (needs_branch_stack(event))
 		intel_pmu_lbr_enable(event);
 
 	if (event->attr.exclude_host)
@@ -1237,7 +1232,8 @@ again:
 
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
-		if (has_branch_stack(event))
+		if (has_branch_stack(event) ||
+		    (event->ctx->task && intel_pmu_needs_lbr_callstack(event)))
 			data.br_stack = &cpuc->lbr_stack;
 
 		if (perf_event_overflow(event, &data, regs))
@@ -1566,7 +1562,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (event->attr.precise_ip && x86_pmu.pebs_aliases)
 		x86_pmu.pebs_aliases(event);
 
-	if (intel_pmu_needs_lbr_smpl(event)) {
+	if (needs_branch_stack(event)) {
 		ret = intel_pmu_setup_lbr_filter(event);
 		if (ret)
 			return ret;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 013a9b9..3146fb5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -716,6 +716,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
 	int i, j, type;
 	bool compress = false;
 
+	cpuc->lbr_stack.user_callstack = branch_user_callstack(br_sel);
+
 	/* if sampling all branches, then nothing to filter */
 	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
 		return;
@@ -868,6 +870,7 @@ void intel_pmu_lbr_init_hsw(void)
 
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = hsw_lbr_sel_map;
+	x86_pmu.attr_lbr_callstack = 1;
 
 	pr_cont("16-deep LBR, ");
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8bbe01f..5adaf10 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -97,6 +97,7 @@ struct perf_branch_entry {
  * recent branch.
  */
 struct perf_branch_stack {
+	unsigned			user_callstack:1;
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
 };
@@ -760,6 +761,11 @@ static inline bool has_branch_stack(struct perf_event *event)
 	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
 }
 
+static inline bool needs_branch_stack(struct perf_event *event)
+{
+	return event->attr.branch_sample_type != 0;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index db3677d..60451f6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1117,7 +1117,7 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
-	if (has_branch_stack(event))
+	if (needs_branch_stack(event))
 		ctx->nr_branch_stack++;
 
 	list_add_rcu(&event->event_entry, &ctx->event_list);
@@ -1274,7 +1274,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
-	if (has_branch_stack(event)) {
+	if (needs_branch_stack(event)) {
 		if (ctx->is_active)
 			__get_cpu_var(perf_branch_stack_events)--;
 		ctx->nr_branch_stack--;
@@ -3160,7 +3160,7 @@ static void free_event(struct perf_event *event)
 			static_key_slow_dec_deferred(&perf_sched_events);
 		}
 
-		if (has_branch_stack(event))
+		if (needs_branch_stack(event))
 			static_key_slow_dec_deferred(&perf_sched_events);
 	}
 
@@ -6550,6 +6550,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
 		goto done;
 
+	if (!has_branch_stack(event))
+		event->attr.branch_sample_type = 0;
+
 	pmu = perf_init_event(event);
 
 done:
@@ -6582,7 +6585,7 @@ done:
 				return ERR_PTR(err);
 			}
 		}
-		if (has_branch_stack(event))
+		if (needs_branch_stack(event))
 			static_key_slow_inc(&perf_sched_events.key);
 	}
 
-- 
1.8.1.4
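
For completeness, here is a minimal user-space sketch of an attr combination that satisfies the intel_pmu_needs_lbr_callstack() test above. It is only an illustration: the exact user-visible ABI (in particular PERF_SAMPLE_BRANCH_CALL_STACK, and whether PERF_SAMPLE_BRANCH_STACK must also be requested) is defined by the earlier patches in this series, so the field choices below are assumptions and it only builds against headers from a kernel with the series applied.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	/* callchain plus branch-stack sampling ... */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN |
			   PERF_SAMPLE_BRANCH_STACK;
	/* ... restricted to user-level branches, in call-stack mode */
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				  PERF_SAMPLE_BRANCH_CALL_STACK;
	attr.exclude_kernel = 1;

	/* monitor the current task on any CPU */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	close(fd);
	return 0;
}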


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 7/7] perf, x86: Discard zero length call entries in LBR call stack
  2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
                   ` (5 preceding siblings ...)
  2013-07-01  7:23 ` [PATCH v2 6/7] perf, x86: Use LBR call stack to get user callchain Yan, Zheng
@ 2013-07-01  7:23 ` Yan, Zheng
  6 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-01  7:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, a.p.zijlstra, eranian, andi, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

"Zero length call" uses the attribute of the call instruction to push
the immediate instruction pointer on to the stack and then pops off
that address into a register. This is accomplished without any matching
return instruction. It confuses the hardware and make the recorded call
stack incorrect. Try fixing the call stack by discarding zero length
call entries.
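
For illustration, a zero length call in user space is the classic "get the current IP" idiom; a minimal sketch using GCC inline asm:

#include <stdio.h>

static unsigned long current_ip(void)
{
	unsigned long ip;

	/*
	 * "call 1f" is a near relative call with zero displacement
	 * (opcode e8 00 00 00 00): it pushes the address of the next
	 * instruction (label 1) and the "pop" immediately consumes it.
	 * There is no matching ret, which is exactly what confuses the
	 * LBR call stack.
	 */
	asm volatile("call 1f\n\t"
		     "1: pop %0"
		     : "=r" (ip));
	return ip;
}

int main(void)
{
	printf("ip is around %#lx\n", current_ip());
	return 0;
}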

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 3146fb5..b1617d8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -94,7 +94,8 @@ enum {
 	X86_BR_ABORT		= 1 << 12,/* transaction abort */
 	X86_BR_IN_TX		= 1 << 13,/* in transaction */
 	X86_BR_NO_TX		= 1 << 14,/* not in transaction */
-	X86_BR_CALL_STACK	= 1 << 15,/* call stack */
+	X86_BR_ZERO_CALL	= 1 << 15,/* zero length call */
+	X86_BR_CALL_STACK	= 1 << 16,/* call stack */
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -111,13 +112,15 @@ enum {
 	 X86_BR_JMP	 |\
 	 X86_BR_IRQ	 |\
 	 X86_BR_ABORT	 |\
-	 X86_BR_IND_CALL)
+	 X86_BR_IND_CALL |\
+	 X86_BR_ZERO_CALL)
 
 #define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
 
 #define X86_BR_ANY_CALL		 \
 	(X86_BR_CALL		|\
 	 X86_BR_IND_CALL	|\
+	 X86_BR_ZERO_CALL	|\
 	 X86_BR_SYSCALL		|\
 	 X86_BR_IRQ		|\
 	 X86_BR_INT)
@@ -650,6 +653,12 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
 		ret = X86_BR_INT;
 		break;
 	case 0xe8: /* call near rel */
+		insn_get_immediate(&insn);
+		if (insn.immediate1.value == 0) {
+			/* zero length call */
+			ret = X86_BR_ZERO_CALL;
+			break;
+		}
 	case 0x9a: /* call far absolute */
 		ret = X86_BR_CALL;
 		break;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-01  7:23 ` [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
@ 2013-07-04  9:57   ` Peter Zijlstra
  2013-07-04 11:39     ` Yan, Zheng
  2013-07-04 13:44     ` Andi Kleen
  2013-07-04 12:44   ` Peter Zijlstra
  2013-07-04 12:45   ` Peter Zijlstra
  2 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-04  9:57 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -185,6 +185,13 @@ void intel_pmu_lbr_reset(void)
>  		intel_pmu_lbr_reset_32();
>  	else
>  		intel_pmu_lbr_reset_64();
> +
> +	wrmsrl(x86_pmu.lbr_tos, 0);
> +}

I double checked; my SDM Jun 2013, Vol 3C 35-93 very explicitly states that
MSR_LASTBRANCH_TOS is a read-only MSR. And afaicr all previous times I checked
this it did say this too.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-04  9:57   ` Peter Zijlstra
@ 2013-07-04 11:39     ` Yan, Zheng
  2013-07-04 13:44     ` Andi Kleen
  1 sibling, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-04 11:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, andi

On 07/04/2013 05:57 PM, Peter Zijlstra wrote:
> On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> @@ -185,6 +185,13 @@ void intel_pmu_lbr_reset(void)
>>  		intel_pmu_lbr_reset_32();
>>  	else
>>  		intel_pmu_lbr_reset_64();
>> +
>> +	wrmsrl(x86_pmu.lbr_tos, 0);
>> +}
> 
> I double checked; my SDM Jun 2013, Vol 3C 35-93 very explicitly states that
> MSR_LASTBRANCH_TOS is a read-only MSR. And afaicr all previous times I checked
> this it did say this too.
> 

Thank you for pointing this out. I will update the patch.

Yan, Zheng

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context
  2013-07-01  7:23 ` [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context Yan, Zheng
@ 2013-07-04 12:41   ` Peter Zijlstra
  2013-07-05  3:19     ` Yan, Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-04 12:41 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Mon, Jul 01, 2013 at 03:23:03PM +0800, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> 
> The x86 special perf event context is named x86_perf_event_context,
> We can enlarge it later to store PMU special data.

This changelog is completely inadequate. It fails to state what and why
we do things.

I hate doing this; but I can't see another way around it either. That
said:

> @@ -274,6 +274,11 @@ struct pmu {
>  	 * flush branch stack on context-switches (needed in cpu-wide mode)
>  	 */
>  	void (*flush_branch_stack)	(void);
> +
> +	/*
> +	 * Allocate PMU special perf event context
> +	 */
> +	void *(*event_context_alloc)	(struct perf_event_context *parent_ctx);
>  };

It should be *optional*, also wtf is that parent_ctx thing for?

> +++ b/kernel/events/core.c
> @@ -2961,13 +2961,20 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
>  }
>  
>  static struct perf_event_context *
> -alloc_perf_context(struct pmu *pmu, struct task_struct *task)
> +alloc_perf_context(struct pmu *pmu, struct task_struct *task,
> +		   struct perf_event_context *parent_ctx)
>  {
>  	struct perf_event_context *ctx;
>  
> -	ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
> -	if (!ctx)
> -		return NULL;
> +	if (pmu->event_context_alloc) {
> +		ctx = pmu->event_context_alloc(parent_ctx);
> +		if (IS_ERR(ctx))
> +			return ctx;
> +	} else {
> +		ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
> +		if (!ctx)
> +			return ERR_PTR(-ENOMEM);
> +	}
>  
>  	__perf_event_init_context(ctx);
>  	if (task) {

I'm not at all sure we want to do it like this; why not simply query the
size? Something like:

  alloc_perf_context(struct pmu *pmu, struct task_struct *task)
  {
    size_t ctx_size = sizeof(struct perf_event_context);

    if (pmu->task_context_size)
      ctx_size = pmu->task_context_size();

    ctx = kzalloc(ctx_size, GFP_KERNEL);
    if (!ctx)
      return ERR_PTR(-ENOMEM);

    ...

  }



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-01  7:23 ` [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
  2013-07-04  9:57   ` Peter Zijlstra
@ 2013-07-04 12:44   ` Peter Zijlstra
  2013-07-04 12:45   ` Peter Zijlstra
  2 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-04 12:44 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> 
> When the LBR call stack is enabled, it is necessary to save/restore
> the LBR stack on context switch. The solution is saving/restoring
> the LBR stack to/from task's perf event context.
> 
> If there are cpu-wide events that use LBR, do not save/restore the
> LBR stack on context switches, flush it instead.

Again inadequate changelog; it fails to explain wtf this is:

> +	if (event->ctx->task &&
> +	    branch_user_callstack(event->hw.branch_reg.reg)) {
> +		struct x86_perf_event_context *task_ctx = (void *)event->ctx;
> +		/*
> +		 * Reset the LBR stack if the call stack is not
> +		 * continuous enabled
> +		 */
> +		if (task_ctx->lbr_callstack_users == 0 &&
> +		    task_ctx->lbr_stack_gen + 1 < event->ctx->sched_gen)
> +			intel_pmu_lbr_reset();
> +
> +		task_ctx->lbr_callstack_users++;
> +		task_ctx->lbr_stack_gen = event->ctx->sched_gen;
> +	}
>  }

And what this parent_ctx nonsense is about:

> +void intel_pmu_lbr_init_context(struct perf_event_context *child_ctx,
> +				struct perf_event_context *parent_ctx)
> +{
> +	struct x86_perf_event_context *task_ctx, *parent_task_ctx;
> +
> +	if (!x86_pmu.lbr_nr)
> +		return;
> +
> +	task_ctx = (struct x86_perf_event_context *)child_ctx;
> +	parent_task_ctx = (struct x86_perf_event_context *)parent_ctx;
> +
> +	if (parent_task_ctx->lbr_callstack_users)
> +		__intel_pmu_lbr_save(task_ctx);
> +	else
> +		task_ctx->lbr_stack_saved = false;
> +}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-01  7:23 ` [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
  2013-07-04  9:57   ` Peter Zijlstra
  2013-07-04 12:44   ` Peter Zijlstra
@ 2013-07-04 12:45   ` Peter Zijlstra
  2013-07-05  5:36     ` Yan, Zheng
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-04 12:45 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:

> @@ -2488,25 +2508,31 @@ static void perf_branch_stack_sched_in(struct task_struct *prev,
>  
>  	list_for_each_entry_rcu(pmu, &pmus, entry) {
>  		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> +		task_ctx = cpuctx->task_ctx;
>  
>  		/*
> -		 * check if the context has at least one
> -		 * event using PERF_SAMPLE_BRANCH_STACK
> +		 * force flush the branch stack if there are cpu-wide events
> +		 * using PERF_SAMPLE_BRANCH_STACK
> +		 *
> +		 * save/restore the branch stack if the task context has
> +		 * at least one event using PERF_SAMPLE_BRANCH_STACK
>  		 */
> -		if (cpuctx->ctx.nr_branch_stack > 0
> -		    && pmu->flush_branch_stack) {
> -
> +		bool force_flush = cpuctx->ctx.nr_branch_stack > 0;
> +		if (pmu->branch_stack_sched &&
> +		    (force_flush ||
> +		     (task_ctx && task_ctx->nr_branch_stack > 0))) {
>  			pmu = cpuctx->ctx.pmu;
>  
> -			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +			perf_ctx_lock(cpuctx, task_ctx);
>  
>  			perf_pmu_disable(pmu);
>  
> -			pmu->flush_branch_stack();
> +			pmu->branch_stack_sched(task_ctx,
> +						sched_in, force_flush);
>  
>  			perf_pmu_enable(pmu);
>  
> -			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +			perf_ctx_unlock(cpuctx, task_ctx);
>  		}
>  	}
>  

I never really liked this; and yes, I know I wrote part of that. Is there
any way we can get rid of this and do it 'properly' through the events
that get scheduled?

After all; the LBR usage is through the events, so scheduling the events
should also manage the LBR state.

What is missing for that to work?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-04  9:57   ` Peter Zijlstra
  2013-07-04 11:39     ` Yan, Zheng
@ 2013-07-04 13:44     ` Andi Kleen
  2013-07-04 14:00       ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Andi Kleen @ 2013-07-04 13:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Yan, Zheng, linux-kernel, mingo, eranian, andi

On Thu, Jul 04, 2013 at 11:57:35AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
> > +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> > @@ -185,6 +185,13 @@ void intel_pmu_lbr_reset(void)
> >  		intel_pmu_lbr_reset_32();
> >  	else
> >  		intel_pmu_lbr_reset_64();
> > +
> > +	wrmsrl(x86_pmu.lbr_tos, 0);
> > +}
> 
> I double checked; my SDM Jun 2013, Vol 3C 35-93 very explicitly states that
> MSR_LASTBRANCH_TOS is a read-only MSR. And afaicr all previous times I checked
> this it did say this too.

Evidently it's not read-only on Haswell at least.

And if we don't restore the TOS it can be completely wrong, and
the stack state would become corrupted.

The LBR stack is quite sensitive to any corruption of the state,
that is different from other LBR uses.

I suppose the wrmsr could be made checking to catch any potential
failure. But normally Intel CPUs are quite consistent in things
like that.
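
Something along these lines, for example (untested sketch; it assumes wrmsrl_safe() semantics, i.e. a non-zero return when the write faults, and the fallback path is hypothetical):

static void intel_pmu_lbr_restore_tos(u64 tos)
{
	if (wrmsrl_safe(x86_pmu.lbr_tos, tos)) {
		/* hypothetical fallback for parts where TOS is not writable */
		pr_warn_once("perf: LBR TOS not writable, resetting LBR\n");
		intel_pmu_lbr_reset();
	}
}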

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-04 13:44     ` Andi Kleen
@ 2013-07-04 14:00       ` Peter Zijlstra
  2013-07-10 17:57         ` Andi Kleen
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-04 14:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Yan, Zheng, linux-kernel, mingo, eranian

On Thu, Jul 04, 2013 at 03:44:57PM +0200, Andi Kleen wrote:
> Evidently it's not read-only on Haswell at least.

It would be ever so good if you could at least test run such patches against
semi-current chips, not only the very latest.

> And if we don't restore the TOS it can be completely wrong, and
> the stack state would corrupt.

How so? You could simply write into idx := (tos + i) % nr_lbr ?
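
I.e., roughly (untested sketch; the saved lbr_from[]/lbr_to[] arrays in the task context are an assumption based on patch 4):

static void __lbr_restore_rotated(struct x86_perf_event_context *task_ctx)
{
	u64 tos;
	int i;

	rdmsrl(x86_pmu.lbr_tos, tos);	/* leave TOS alone, only read it */
	for (i = 0; i < x86_pmu.lbr_nr; i++) {
		int idx = (tos + i) % x86_pmu.lbr_nr;

		wrmsrl(x86_pmu.lbr_from + idx, task_ctx->lbr_from[i]);
		wrmsrl(x86_pmu.lbr_to + idx, task_ctx->lbr_to[i]);
	}
}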

> The LBR stack is quite sensitive to any corruption of the state,
> that is different from other LBR uses.
> 
> I suppose the wrmsr could be made checking to catch any potential
> failure. But normally Intel CPUs are quite consistent in things
> like that.

Please verify with the appropriate hardware teams and update the SDM
accordingly. This code is used with all Intel CPUs that have LBR support, which
is pretty much all of them except Yonah.

Using a checking wrmsr isn't really a nice option; especially as this is in a
somewhat time-critical path.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context
  2013-07-04 12:41   ` Peter Zijlstra
@ 2013-07-05  3:19     ` Yan, Zheng
  2013-07-05 12:45       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Yan, Zheng @ 2013-07-05  3:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, andi

On 07/04/2013 08:41 PM, Peter Zijlstra wrote:
> On Mon, Jul 01, 2013 at 03:23:03PM +0800, Yan, Zheng wrote:
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> The x86 special perf event context is named x86_perf_event_context,
>> We can enlarge it later to store PMU special data.
> 
> This changelog is completely inadequate. It fails to state what and why
> we do things.
> 
> I hate doing this; but I can't see another way around it either. That
> said:
> 
>> @@ -274,6 +274,11 @@ struct pmu {
>>  	 * flush branch stack on context-switches (needed in cpu-wide mode)
>>  	 */
>>  	void (*flush_branch_stack)	(void);
>> +
>> +	/*
>> +	 * Allocate PMU special perf event context
>> +	 */
>> +	void *(*event_context_alloc)	(struct perf_event_context *parent_ctx);
>>  };
> 
> It should be *optional*, also wtf is that parent_ctx thing for?

parent_ctx is for the fork() case, used for checking if the callstack feature
is enabled for the parent task. If yes, clone the parent task's LBR stack.
For the simple program below:

--- 
void func0()
{
  ...
}

int main(int argc, char *argv[])
{
        if (fork() > 0) {
                int foo;
                wait(&foo);
        }
        while (1) func0();
}

If we clone the LBR stack on fork(), the output of perf report looks like:
----
0.35%     test  test               [.] func0
          |
          --- func0
              main
              _start

If we do not clone the LBR stack on fork(), the output of perf report looks like:
----

     0.17%     test  test               [.] func0
               |
               --- func0
                   main

> 
>> +++ b/kernel/events/core.c
>> @@ -2961,13 +2961,20 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
>>  }
>>  
>>  static struct perf_event_context *
>> -alloc_perf_context(struct pmu *pmu, struct task_struct *task)
>> +alloc_perf_context(struct pmu *pmu, struct task_struct *task,
>> +		   struct perf_event_context *parent_ctx)
>>  {
>>  	struct perf_event_context *ctx;
>>  
>> -	ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
>> -	if (!ctx)
>> -		return NULL;
>> +	if (pmu->event_context_alloc) {
>> +		ctx = pmu->event_context_alloc(parent_ctx);
>> +		if (IS_ERR(ctx))
>> +			return ctx;
>> +	} else {
>> +		ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
>> +		if (!ctx)
>> +			return ERR_PTR(-ENOMEM);
>> +	}
>>  
>>  	__perf_event_init_context(ctx);
>>  	if (task) {
> 
> I'm not at all sure we want to do it like this; why not simply query the
> size. Something like:

Because we need to initialize the x86 PMU specific data (not just zero it).
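
One possible middle ground (sketch only, untested; the task_ctx_size field and the task_ctx_init() hook are names made up here, not taken from any posted patch) would be a plain size for the allocation plus a separate, optional init hook for the PMU-specific part:

static struct perf_event_context *
alloc_perf_context(struct pmu *pmu, struct task_struct *task,
		   struct perf_event_context *parent_ctx)
{
	struct perf_event_context *ctx;
	size_t ctx_size = sizeof(struct perf_event_context);

	if (pmu->task_ctx_size)
		ctx_size = pmu->task_ctx_size;

	ctx = kzalloc(ctx_size, GFP_KERNEL);
	if (!ctx)
		return ERR_PTR(-ENOMEM);

	__perf_event_init_context(ctx);
	if (pmu->task_ctx_init)
		pmu->task_ctx_init(ctx, parent_ctx); /* e.g. clone parent's LBR stack */

	/* the remaining task/clock setup of the real function is unchanged */
	return ctx;
}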

Regards
Yan, Zheng 


> 
>   alloc_perf_context(struct pmu *pmu, struct task_struct *task)
>   {
>     size_t ctx_size = sizeof(struct perf_event_context);
> 
>     if (pmu->task_context_size)
>       size = pmu->task_context_size();
> 
>     ctx = kzalloc(size, GFP_KERNEL);
>     if (!ctx)
>       return ERR_PTR(-ENOMEM);
> 
>     ...
> 
>   }
> 
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-04 12:45   ` Peter Zijlstra
@ 2013-07-05  5:36     ` Yan, Zheng
  2013-07-05  8:15       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Yan, Zheng @ 2013-07-05  5:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, andi

On 07/04/2013 08:45 PM, Peter Zijlstra wrote:
> On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
> 
>> @@ -2488,25 +2508,31 @@ static void perf_branch_stack_sched_in(struct task_struct *prev,
>>  
>>  	list_for_each_entry_rcu(pmu, &pmus, entry) {
>>  		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
>> +		task_ctx = cpuctx->task_ctx;
>>  
>>  		/*
>> -		 * check if the context has at least one
>> -		 * event using PERF_SAMPLE_BRANCH_STACK
>> +		 * force flush the branch stack if there are cpu-wide events
>> +		 * using PERF_SAMPLE_BRANCH_STACK
>> +		 *
>> +		 * save/restore the branch stack if the task context has
>> +		 * at least one event using PERF_SAMPLE_BRANCH_STACK
>>  		 */
>> -		if (cpuctx->ctx.nr_branch_stack > 0
>> -		    && pmu->flush_branch_stack) {
>> -
>> +		bool force_flush = cpuctx->ctx.nr_branch_stack > 0;
>> +		if (pmu->branch_stack_sched &&
>> +		    (force_flush ||
>> +		     (task_ctx && task_ctx->nr_branch_stack > 0))) {
>>  			pmu = cpuctx->ctx.pmu;
>>  
>> -			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +			perf_ctx_lock(cpuctx, task_ctx);
>>  
>>  			perf_pmu_disable(pmu);
>>  
>> -			pmu->flush_branch_stack();
>> +			pmu->branch_stack_sched(task_ctx,
>> +						sched_in, force_flush);
>>  
>>  			perf_pmu_enable(pmu);
>>  
>> -			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +			perf_ctx_unlock(cpuctx, task_ctx);
>>  		}
>>  	}
>>  
> 
> I never really like this; and yes I know I wrote part of that. Is there
> any way we can get rid of this and to it 'properly' through the events
> that get scheduled?
> 
> After all; the LBR usage is through the events, so scheduling the events
> should also manage the LBR state.
> 
> What is missing for that to work?
> 

The LBR is a shared resource that can be used by multiple events at the same
time. Strictly speaking, the LBR is associated with a task, not with an event.
One example: there are 5 events using the LBR stack feature, but only 4
counters, so these events need to be scheduled. Saving/restoring the LBR on a
per-event basis is clearly wrong.

Regards
Yan, Zheng



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-05  5:36     ` Yan, Zheng
@ 2013-07-05  8:15       ` Peter Zijlstra
  2013-07-05  8:51         ` Yan, Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-05  8:15 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Fri, Jul 05, 2013 at 01:36:24PM +0800, Yan, Zheng wrote:
> On 07/04/2013 08:45 PM, Peter Zijlstra wrote:
> > On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
> > 
> >> @@ -2488,25 +2508,31 @@ static void perf_branch_stack_sched_in(struct task_struct *prev,
> >>  
> >>  	list_for_each_entry_rcu(pmu, &pmus, entry) {
> >>  		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> >> +		task_ctx = cpuctx->task_ctx;
> >>  
> >>  		/*
> >> -		 * check if the context has at least one
> >> -		 * event using PERF_SAMPLE_BRANCH_STACK
> >> +		 * force flush the branch stack if there are cpu-wide events
> >> +		 * using PERF_SAMPLE_BRANCH_STACK
> >> +		 *
> >> +		 * save/restore the branch stack if the task context has
> >> +		 * at least one event using PERF_SAMPLE_BRANCH_STACK
> >>  		 */
> >> -		if (cpuctx->ctx.nr_branch_stack > 0
> >> -		    && pmu->flush_branch_stack) {
> >> -
> >> +		bool force_flush = cpuctx->ctx.nr_branch_stack > 0;
> >> +		if (pmu->branch_stack_sched &&
> >> +		    (force_flush ||
> >> +		     (task_ctx && task_ctx->nr_branch_stack > 0))) {
> >>  			pmu = cpuctx->ctx.pmu;
> >>  
> >> -			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> >> +			perf_ctx_lock(cpuctx, task_ctx);
> >>  
> >>  			perf_pmu_disable(pmu);
> >>  
> >> -			pmu->flush_branch_stack();
> >> +			pmu->branch_stack_sched(task_ctx,
> >> +						sched_in, force_flush);
> >>  
> >>  			perf_pmu_enable(pmu);
> >>  
> >> -			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> >> +			perf_ctx_unlock(cpuctx, task_ctx);
> >>  		}
> >>  	}
> >>  
> > 
> > I never really like this; and yes I know I wrote part of that. Is there
> > any way we can get rid of this and to it 'properly' through the events
> > that get scheduled?
> > 
> > After all; the LBR usage is through the events, so scheduling the events
> > should also manage the LBR state.
> > 
> > What is missing for that to work?
> > 
> 
> the LBR is shared resource, can be used by multiple events at the same time.

Yeah so? There's tons of shared resources in the PMU already.

> Strictly speaking,LBR is associated with task, not event.

Wrong! It _is_ associated with events. Events are all there is. An event can be
associated with tasks, but that's completely irrelevant.

> One example is
> there are 5 events using the LBR stack feature, but there are only 4 counters.
> So these events need schedule. Saving/restoring LBR on the basis of event is
> clearly wrong.

Different scheduling and you're wrong. Look at perf_rotate_context(), we'd
disable everything at perf_pmu_disable() and enable the entire thing at
perf_pmu_enable(), on both sides we'd have the LBR running.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-05  8:15       ` Peter Zijlstra
@ 2013-07-05  8:51         ` Yan, Zheng
  2013-07-05 12:31           ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Yan, Zheng @ 2013-07-05  8:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, andi

On 07/05/2013 04:15 PM, Peter Zijlstra wrote:
> On Fri, Jul 05, 2013 at 01:36:24PM +0800, Yan, Zheng wrote:
>> On 07/04/2013 08:45 PM, Peter Zijlstra wrote:
>>> On Mon, Jul 01, 2013 at 03:23:04PM +0800, Yan, Zheng wrote:
>>>
>>>> @@ -2488,25 +2508,31 @@ static void perf_branch_stack_sched_in(struct task_struct *prev,
>>>>  
>>>>  	list_for_each_entry_rcu(pmu, &pmus, entry) {
>>>>  		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
>>>> +		task_ctx = cpuctx->task_ctx;
>>>>  
>>>>  		/*
>>>> -		 * check if the context has at least one
>>>> -		 * event using PERF_SAMPLE_BRANCH_STACK
>>>> +		 * force flush the branch stack if there are cpu-wide events
>>>> +		 * using PERF_SAMPLE_BRANCH_STACK
>>>> +		 *
>>>> +		 * save/restore the branch stack if the task context has
>>>> +		 * at least one event using PERF_SAMPLE_BRANCH_STACK
>>>>  		 */
>>>> -		if (cpuctx->ctx.nr_branch_stack > 0
>>>> -		    && pmu->flush_branch_stack) {
>>>> -
>>>> +		bool force_flush = cpuctx->ctx.nr_branch_stack > 0;
>>>> +		if (pmu->branch_stack_sched &&
>>>> +		    (force_flush ||
>>>> +		     (task_ctx && task_ctx->nr_branch_stack > 0))) {
>>>>  			pmu = cpuctx->ctx.pmu;
>>>>  
>>>> -			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>> +			perf_ctx_lock(cpuctx, task_ctx);
>>>>  
>>>>  			perf_pmu_disable(pmu);
>>>>  
>>>> -			pmu->flush_branch_stack();
>>>> +			pmu->branch_stack_sched(task_ctx,
>>>> +						sched_in, force_flush);
>>>>  
>>>>  			perf_pmu_enable(pmu);
>>>>  
>>>> -			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>>> +			perf_ctx_unlock(cpuctx, task_ctx);
>>>>  		}
>>>>  	}
>>>>  
>>>
>>> I never really like this; and yes I know I wrote part of that. Is there
>>> any way we can get rid of this and to it 'properly' through the events
>>> that get scheduled?
>>>
>>> After all; the LBR usage is through the events, so scheduling the events
>>> should also manage the LBR state.
>>>
>>> What is missing for that to work?
>>>
>>
>> the LBR is shared resource, can be used by multiple events at the same time.
> 
> Yeah so? There's tons of shared resources in the PMU already.

We should restore the LBR callstack only when a task is scheduled in. Restoring the LBR
callstack at any other time will make the LBR callstack and the program's actual callchain
mismatch. This property makes the LBR different from the counters.

> 
>> Strictly speaking,LBR is associated with task, not event.
> 
> Wrong!, it _is_ associated with events. Events is all there is. Event can be
> associated with tasks, but that's completely irrelevant.
> 
>> One example is
>> there are 5 events using the LBR stack feature, but there are only 4 counters.
>> So these events need schedule. Saving/restoring LBR on the basis of event is
>> clearly wrong.
> 
> Different scheduling and you're wrong. Look at perf_rotate_context(), we'd
> disable everything at perf_pmu_disable() and enable the entire thing at
> perf_pmu_enable(), on both sides we'd have the LBR running.
> 

Yes, on both sides we'd have the LBR running, but there is no need to save/restore
the LBR stack in this case. We should save the LBR stack only when a task is scheduled
out, and restore it when the task is scheduled in. So I think it's more natural to
manage the LBR state when switching the perf task context.
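
In other words, roughly (untested sketch; it reuses __intel_pmu_lbr_save() from patch 4 and assumes a matching __intel_pmu_lbr_restore() helper):

static void intel_pmu_lbr_sched_task(struct x86_perf_event_context *task_ctx,
				     bool sched_in)
{
	if (!x86_pmu.lbr_nr || !task_ctx || !task_ctx->lbr_callstack_users)
		return;

	if (sched_in)
		__intel_pmu_lbr_restore(task_ctx);	/* task comes back on CPU */
	else
		__intel_pmu_lbr_save(task_ctx);		/* task leaves the CPU */
}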

Regards
Yan, Zheng 





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-05  8:51         ` Yan, Zheng
@ 2013-07-05 12:31           ` Peter Zijlstra
  2013-08-08  6:18             ` Yan, Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-05 12:31 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Fri, Jul 05, 2013 at 04:51:33PM +0800, Yan, Zheng wrote:
> >> the LBR is shared resource, can be used by multiple events at the same time.
> > 
> > Yeah so? There's tons of shared resources in the PMU already.
> 
> we should restore the LBR callstack only when task schedule in. restoring the LBR
> callstack at any other time will make the LBR callstack and actual callchain of program
> mismatch. this property make the LBR different from counters.

But it doesn't change the fact that the LBR is controlled through
events.

> yes,on both sides we'd have the LBR running. but there is no need to save/restore
> the LBR stack in this case. we should save the LBR stack only when task schedule out,
> and restore the LBR stack when task schedule in. So I think it's more natural to
> manage the LBR state when switching perf task context.

And I never said we shouldn't, I just said we should push it down into the PMU
driver and not have a hook out into the generic code. The generic code should
ideally not know anything about LBR, it should only care about events.

Something like the below... although I'm still not entirely happy with that
either.

Completely untested, never even seen compiler.

---
 arch/x86/kernel/cpu/perf_event.c           |  5 ++
 arch/x86/kernel/cpu/perf_event.h           |  8 ++-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 24 ++++++--
 include/linux/perf_event.h                 | 11 +++-
 kernel/events/core.c                       | 92 +++---------------------------
 5 files changed, 47 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9e581c5..6516ce0 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -519,6 +519,11 @@ static void x86_pmu_disable(struct pmu *pmu)
 	if (!cpuc->enabled)
 		return;
 
+	if (cpuc->current != current) {
+		cpuc->current = current;
+		cpuc->ctxs_seq++;
+	}
+
 	cpuc->n_added = 0;
 	cpuc->enabled = 0;
 	barrier();
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 97e557b..e1ee365 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -141,6 +141,12 @@ struct cpu_hw_events {
 	int			is_fake;
 
 	/*
+	 * Context switch tracking
+	 */
+	void			*current;
+	u64			ctxs_seq;
+
+	/*
 	 * Intel DebugStore bits
 	 */
 	struct debug_store	*ds;
@@ -150,11 +156,11 @@ struct cpu_hw_events {
 	 * Intel LBR bits
 	 */
 	int				lbr_users;
-	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
 	u64				br_sel;
+	u64				lbr_flush_seq;
 
 	/*
 	 * Intel host/guest exclude bits
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d5be06a..aa34fa3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -189,15 +189,20 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 		return;
 
 	/*
-	 * Reset the LBR stack if we changed task context to
-	 * avoid data leaks.
+	 * If we're a task event and observe a context switch; flush the LBR
+	 * since we don't want to leak LBR entries from the previous task into
+	 * this one.
 	 */
-	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
+	if (event->ctx->task && cpuc->ctxs_seq != cpuc->lbr_flush_seq) {
 		intel_pmu_lbr_reset();
-		cpuc->lbr_context = event->ctx;
+		cpuc->lbr_flush_seq = cpuc->ctxs_seq;
 	}
+
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
+	if (!cpuc->lbr_users)
+		event->ctx->pmu->flags |= PERF_PF_CTXS;
+
 	cpuc->lbr_users++;
 }
 
@@ -209,6 +214,9 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 		return;
 
 	cpuc->lbr_users--;
+	if (!cpuc->lbr_users)
+		event->ctx->pmu->flags &= ~PERF_PF_CTXS;
+
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
 	if (cpuc->enabled && !cpuc->lbr_users) {
@@ -222,8 +230,14 @@ void intel_pmu_lbr_enable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 
-	if (cpuc->lbr_users)
+	if (cpuc->lbr_users) {
+		if (cpuc->lbr_flush_seq != cpuc->ctxs_seq) {
+			intel_pmu_lbr_reset();
+			cpuc->lbr_flush_seq = cpuc->ctxs_seq;
+		}
+
 		__intel_pmu_lbr_enable();
+	}
 }
 
 void intel_pmu_lbr_disable_all(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8873f82..837f6e3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -189,6 +189,11 @@ struct perf_event;
  */
 #define PERF_EVENT_TXN 0x1
 
+/*
+ * pmu::flags
+ */
+#define PERF_PF_CTXS	0x01 /* require pmu_disable/enable on context_sched_in */
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -200,10 +205,11 @@ struct pmu {
 	const char			*name;
 	int				type;
 
-	int * __percpu			pmu_disable_count;
-	struct perf_cpu_context * __percpu pmu_cpu_context;
+	unsigned int			flags;
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
+	int * __percpu			pmu_disable_count;
+	struct perf_cpu_context * __percpu pmu_cpu_context;
 
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
@@ -492,7 +498,6 @@ struct perf_event_context {
 	u64				generation;
 	int				pin_count;
 	int				nr_cgroups;	 /* cgroup evts */
-	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1db3af9..d49b4ea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -140,7 +140,6 @@ enum event_type_t {
  */
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -1114,9 +1113,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
-	if (has_branch_stack(event))
-		ctx->nr_branch_stack++;
-
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1271,9 +1267,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
-	if (has_branch_stack(event))
-		ctx->nr_branch_stack--;
-
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2428,8 +2421,13 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	struct perf_cpu_context *cpuctx;
 
 	cpuctx = __get_cpu_context(ctx);
-	if (cpuctx->task_ctx == ctx)
+	if (cpuctx->task_ctx == ctx) {
+		if (ctx->pmu->flags & PERF_PF_CTXS) {
+			perf_pmu_disable(ctx->pmu);
+			perf_pmu_enable(ctx->pmu);
+		}
 		return;
+	}
 
 	perf_ctx_lock(cpuctx, ctx);
 	perf_pmu_disable(ctx->pmu);
@@ -2456,66 +2454,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
-				       struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
-	unsigned long flags;
-
-	/* no need to flush branch stack if not changing task */
-	if (prev == task)
-		return;
-
-	local_irq_save(flags);
-
-	rcu_read_lock();
-
-	list_for_each_entry_rcu(pmu, &pmus, entry) {
-		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
-		/*
-		 * check if the context has at least one
-		 * event using PERF_SAMPLE_BRANCH_STACK
-		 */
-		if (cpuctx->ctx.nr_branch_stack > 0
-		    && pmu->flush_branch_stack) {
-
-			pmu = cpuctx->ctx.pmu;
-
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
-			perf_pmu_disable(pmu);
-
-			pmu->flush_branch_stack();
-
-			perf_pmu_enable(pmu);
-
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-		}
-	}
-
-	rcu_read_unlock();
-
-	local_irq_restore(flags);
-}
-
-/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2546,10 +2484,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	 */
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
-
-	/* check for system-wide branch_stack events */
-	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
-		perf_branch_stack_sched_in(prev, task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -3126,14 +3060,8 @@ static void free_event(struct perf_event *event)
 			static_key_slow_dec_deferred(&perf_sched_events);
 		}
 
-		if (has_branch_stack(event)) {
+		if (has_branch_stack(event))
 			static_key_slow_dec_deferred(&perf_sched_events);
-			/* is system-wide event */
-			if (!(event->attach_state & PERF_ATTACH_TASK)) {
-				atomic_dec(&per_cpu(perf_branch_stack_events,
-						    event->cpu));
-			}
-		}
 	}
 
 	if (event->rb) {
@@ -6554,12 +6482,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 				return ERR_PTR(err);
 			}
 		}
-		if (has_branch_stack(event)) {
+		if (has_branch_stack(event))
 			static_key_slow_inc(&perf_sched_events.key);
-			if (!(event->attach_state & PERF_ATTACH_TASK))
-				atomic_inc(&per_cpu(perf_branch_stack_events,
-						    event->cpu));
-		}
 	}
 
 	return event;


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context
  2013-07-05  3:19     ` Yan, Zheng
@ 2013-07-05 12:45       ` Peter Zijlstra
  2013-07-08  8:51         ` Yan, Zheng
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2013-07-05 12:45 UTC (permalink / raw)
  To: Yan, Zheng; +Cc: linux-kernel, mingo, eranian, andi

On Fri, Jul 05, 2013 at 11:19:31AM +0800, Yan, Zheng wrote:
> On 07/04/2013 08:41 PM, Peter Zijlstra wrote:
> > It should be *optional*, also wtf is that parent_ctx thing for?
> 
> parent_ctx is for the fork() case, used for checking if the callstack feature
> is enabled for the parent task. If yes, clone  parent task's LBR stack.
> For the simple program below:

So there's a problem with all this; contexts aren't strictly per task, we
play games with them in perf_event_context_sched_out(). We'd have to
disable that context-switch optimization for the LBR stack to work, and
that's massively expensive.
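
(For reference, the optimization in question is roughly the context_equiv() fast path in perf_event_context_sched_out(); paraphrased, not the literal kernel code:)

	if (context_equiv(ctx, next_ctx)) {
		/*
		 * prev and next simply exchange their contexts; nothing is
		 * reprogrammed, so per-context LBR state would silently end
		 * up following the wrong task.
		 */
		task->perf_event_ctxp[ctxn] = next_ctx;
		next->perf_event_ctxp[ctxn] = ctx;
		ctx->task = next;
		next_ctx->task = task;
		do_switch = 0;
	}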

If you actually did that, you again fail for not having mentioned this
in your changelog.

I'm starting to not want to read patches from you; going through them is
just too much effort, I might as well write the stuff myself :/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context
  2013-07-05 12:45       ` Peter Zijlstra
@ 2013-07-08  8:51         ` Yan, Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-07-08  8:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, andi

On 07/05/2013 08:45 PM, Peter Zijlstra wrote:
> On Fri, Jul 05, 2013 at 11:19:31AM +0800, Yan, Zheng wrote:
>> On 07/04/2013 08:41 PM, Peter Zijlstra wrote:
>>> It should be *optional*, also wtf is that parent_ctx thing for?
>>
>> parent_ctx is for the fork() case, used for checking if the callstack feature
>> is enabled for the parent task. If yes, clone  parent task's LBR stack.
>> For the simple program below:
> 
> So there's a problem with all this; contexts aren't strict per task, we
> play games with them in perf_event_context_sched_out(). We'd have to
> disable that context switch optimization for LBR stack to work and
> that's massively expensive.
> 

Sorry, I didn't realize that optimization existed. I will describe my changes in more
detail in the future.

Yan, Zheng

> If you actually did that, you again fail for not having mentioned this
> in your changelog.
> 
> I'm starting to not want to read patches from you; going through them is
> just too much effort, I might as well write the stuff myself :/
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-04 14:00       ` Peter Zijlstra
@ 2013-07-10 17:57         ` Andi Kleen
  0 siblings, 0 replies; 25+ messages in thread
From: Andi Kleen @ 2013-07-10 17:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andi Kleen, Yan, Zheng, linux-kernel, mingo, eranian

On Thu, Jul 04, 2013 at 04:00:57PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 04, 2013 at 03:44:57PM +0200, Andi Kleen wrote:
> > Evidently it's not read-only on Haswell at least.
> 
> It would be ever so good if you could at least test run such patches against
> semi-current chips, not only the very latest.

So we tested some systems (old and new Atom, Westmere, Nehalem, *Bridge)
and they all have writable TOS. Also, I double-checked the SDM,
and it actually documents these MSRs as R/W in Chapter 35.

So relying on this is ok.

-Andi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch
  2013-07-05 12:31           ` Peter Zijlstra
@ 2013-08-08  6:18             ` Yan, Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2013-08-08  6:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, mingo, eranian, andi

On 07/05/2013 08:31 PM, Peter Zijlstra wrote:
> On Fri, Jul 05, 2013 at 04:51:33PM +0800, Yan, Zheng wrote:
>>>> the LBR is shared resource, can be used by multiple events at the same time.
>>>
>>> Yeah so? There's tons of shared resources in the PMU already.
>>
>> we should restore the LBR callstack only when task schedule in. restoring the LBR
>> callstack at any other time will make the LBR callstack and actual callchain of program
>> mismatch. this property make the LBR different from counters.
> 
> But it doesn't change the fact that the LBR is controlled through
> events.
> 
>> yes,on both sides we'd have the LBR running. but there is no need to save/restore
>> the LBR stack in this case. we should save the LBR stack only when task schedule out,
>> and restore the LBR stack when task schedule in. So I think it's more natural to
>> manage the LBR state when switching perf task context.
> 
> And I never said we shouldn't, I just said we should push it down into the PMU
> driver and not have a hook out into the generic code. The generic code should
> ideally not know anything about LBR, it should only care about events.
> 
> Something like the below... although I'm still not entirely happy with that
> either.

Sorry for the delay.

How about the patch below? It introduces a pmu sched_ctx() callback and uses that callback
to flush the LBR stack. The sched_ctx() callback can also be used to save/restore the LBR stack.

Thanks.
Yan, Zheng

---
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8355c84..e5cb20d 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1846,10 +1846,10 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
-static void x86_pmu_flush_branch_stack(void)
+static void x86_pmu_sched_ctx(struct perf_event_context *ctx, bool sched_in)
 {
-	if (x86_pmu.flush_branch_stack)
-		x86_pmu.flush_branch_stack();
+	if (x86_pmu.sched_ctx)
+		x86_pmu.sched_ctx(ctx, sched_in);
 }
 
 void perf_check_microcode(void)
@@ -1878,7 +1878,7 @@ static struct pmu pmu = {
 	.commit_txn		= x86_pmu_commit_txn,
 
 	.event_idx		= x86_pmu_event_idx,
-	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.sched_ctx		= x86_pmu_sched_ctx,
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 97e557b..1320376 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -150,6 +150,7 @@ struct cpu_hw_events {
 	 * Intel LBR bits
 	 */
 	int				lbr_users;
+	int				lbr_sys_users;
 	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
@@ -411,7 +412,8 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*flush_branch_stack)(void);
+	void		(*sched_ctx)(struct perf_event_context *ctx,
+				     bool sched_in);
 
 	/*
 	 * Intel Arch Perfmon v2+
@@ -663,6 +665,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_ds_init(void);
 
+void intel_pmu_lbr_sched_ctx(struct perf_event_context *ctx, bool sched_in);
+
 void intel_pmu_lbr_reset(void);
 
 void intel_pmu_lbr_enable(struct perf_event *event);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index fbc9210..c8f0318 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1849,16 +1849,15 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
-static void intel_pmu_flush_branch_stack(void)
+static void intel_pmu_sched_ctx(struct perf_event_context *ctx, bool sched_in)
 {
 	/*
 	 * Intel LBR does not tag entries with the
 	 * PID of the current task, then we need to
 	 * flush it on ctxsw
-	 * For now, we simply reset it
 	 */
 	if (x86_pmu.lbr_nr)
-		intel_pmu_lbr_reset();
+		intel_pmu_lbr_sched_ctx(ctx, sched_in);
 }
 
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
@@ -1912,7 +1911,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
-	.flush_branch_stack	= intel_pmu_flush_branch_stack,
+	.sched_ctx		= intel_pmu_sched_ctx,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d5be06a..99b00a8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -181,6 +181,12 @@ void intel_pmu_lbr_reset(void)
 		intel_pmu_lbr_reset_64();
 }
 
+void intel_pmu_lbr_sched_ctx(struct perf_event_context *ctx, bool sched_in)
+{
+	if (sched_in)
+		intel_pmu_lbr_reset();
+}
+
 void intel_pmu_lbr_enable(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -199,6 +205,11 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
+	if (!(event->attach_state & PERF_ATTACH_TASK)) {
+		cpuc->lbr_sys_users++;
+		if (cpuc->lbr_sys_users == 1)
+			event->ctx->pmu->flags |= PERF_PF_CTXS;
+	}
 }
 
 void intel_pmu_lbr_disable(struct perf_event *event)
@@ -209,6 +220,12 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 		return;
 
 	cpuc->lbr_users--;
+	if (!(event->attach_state & PERF_ATTACH_TASK)) {
+		cpuc->lbr_sys_users--;
+		if (cpuc->lbr_sys_users == 0)
+			event->ctx->pmu->flags &= ~PERF_PF_CTXS;
+	}
+
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
 	if (cpuc->enabled && !cpuc->lbr_users) {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c43f6ea..afdfc5a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -189,6 +189,12 @@ struct perf_event;
  */
 #define PERF_EVENT_TXN 0x1
 
+/*
+ * pmu::flags
+ */
+#define PERF_PF_CTXS	0x01 /* call pmu->sched_ctx on context-switches */
+
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -199,11 +205,12 @@ struct pmu {
 	const struct attribute_group	**attr_groups;
 	const char			*name;
 	int				type;
+	unsigned int			flags;
+	int				task_ctx_nr;
+	int				hrtimer_interval_ms;
 
 	int * __percpu			pmu_disable_count;
 	struct perf_cpu_context * __percpu pmu_cpu_context;
-	int				task_ctx_nr;
-	int				hrtimer_interval_ms;
 
 	/*
 	 * Fully disable/enable this PMU, can be used to protect from the PMI
@@ -271,9 +278,10 @@ struct pmu {
 	int (*event_idx)		(struct perf_event *event); /*optional */
 
 	/*
-	 * flush branch stack on context-switches (needed in cpu-wide mode)
+	 * PMU callback for context-switches. optional
 	 */
-	void (*flush_branch_stack)	(void);
+	void (*sched_ctx)		(struct perf_event_context *ctx,
+					 bool sched_in); /*optional */
 };
 
 /**
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1274114..8678e73 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -140,7 +140,6 @@ enum event_type_t {
  */
 struct static_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -2130,6 +2129,10 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 		return;
 
 	perf_pmu_disable(ctx->pmu);
+
+	if (ctx->pmu->flags & PERF_PF_CTXS)
+		ctx->pmu->sched_ctx(ctx, false);
+
 	if ((is_active & EVENT_PINNED) && (event_type & EVENT_PINNED)) {
 		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
 			group_sched_out(event, cpuctx, ctx);
@@ -2269,6 +2272,12 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 		raw_spin_lock(&ctx->lock);
 		raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
 		if (context_equiv(ctx, next_ctx)) {
+			if (ctx->pmu->flags & PERF_PF_CTXS) {
+				perf_pmu_disable(ctx->pmu);
+				ctx->pmu->sched_ctx(ctx, false);
+				ctx->pmu->sched_ctx(next_ctx, true);
+				perf_pmu_enable(ctx->pmu);
+			}
 			/*
 			 * XXX do we need a memory barrier of sorts
 			 * wrt to rcu_dereference() of perf_event_ctxp
@@ -2467,6 +2476,9 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 
 	perf_event_sched_in(cpuctx, cpuctx->task_ctx, task);
 
+	if (ctx->pmu->flags & PERF_PF_CTXS)
+		ctx->pmu->sched_ctx(ctx, true);
+
 	perf_pmu_enable(ctx->pmu);
 	perf_ctx_unlock(cpuctx, ctx);
 
@@ -2478,66 +2490,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
- * When sampling the branck stack in system-wide, it may be necessary
- * to flush the stack on context switch. This happens when the branch
- * stack does not tag its entries with the pid of the current task.
- * Otherwise it becomes impossible to associate a branch entry with a
- * task. This ambiguity is more likely to appear when the branch stack
- * supports priv level filtering and the user sets it to monitor only
- * at the user level (which could be a useful measurement in system-wide
- * mode). In that case, the risk is high of having a branch stack with
- * branch from multiple tasks. Flushing may mean dropping the existing
- * entries or stashing them somewhere in the PMU specific code layer.
- *
- * This function provides the context switch callback to the lower code
- * layer. It is invoked ONLY when there is at least one system-wide context
- * with at least one active event using taken branch sampling.
- */
-static void perf_branch_stack_sched_in(struct task_struct *prev,
-				       struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
-	unsigned long flags;
-
-	/* no need to flush branch stack if not changing task */
-	if (prev == task)
-		return;
-
-	local_irq_save(flags);
-
-	rcu_read_lock();
-
-	list_for_each_entry_rcu(pmu, &pmus, entry) {
-		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-
-		/*
-		 * check if the context has at least one
-		 * event using PERF_SAMPLE_BRANCH_STACK
-		 */
-		if (cpuctx->ctx.nr_branch_stack > 0
-		    && pmu->flush_branch_stack) {
-
-			pmu = cpuctx->ctx.pmu;
-
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-
-			perf_pmu_disable(pmu);
-
-			pmu->flush_branch_stack();
-
-			perf_pmu_enable(pmu);
-
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-		}
-	}
-
-	rcu_read_unlock();
-
-	local_irq_restore(flags);
-}
-
-/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2568,10 +2520,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	 */
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
-
-	/* check for system-wide branch_stack events */
-	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
-		perf_branch_stack_sched_in(prev, task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -3148,14 +3096,8 @@ static void free_event(struct perf_event *event)
 			static_key_slow_dec_deferred(&perf_sched_events);
 		}
 
-		if (has_branch_stack(event)) {
+		if (has_branch_stack(event))
 			static_key_slow_dec_deferred(&perf_sched_events);
-			/* is system-wide event */
-			if (!(event->attach_state & PERF_ATTACH_TASK)) {
-				atomic_dec(&per_cpu(perf_branch_stack_events,
-						    event->cpu));
-			}
-		}
 	}
 
 	if (event->rb) {
@@ -6574,12 +6516,8 @@ done:
 				return ERR_PTR(err);
 			}
 		}
-		if (has_branch_stack(event)) {
+		if (has_branch_stack(event))
 			static_key_slow_inc(&perf_sched_events.key);
-			if (!(event->attach_state & PERF_ATTACH_TASK))
-				atomic_inc(&per_cpu(perf_branch_stack_events,
-						    event->cpu));
-		}
 	}
 
 	return event;

> 
> Completely untested, never even seen compiler.
> 
> ---
>  arch/x86/kernel/cpu/perf_event.c           |  5 ++
>  arch/x86/kernel/cpu/perf_event.h           |  8 ++-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c | 24 ++++++--
>  include/linux/perf_event.h                 | 11 +++-
>  kernel/events/core.c                       | 92 +++---------------------------
>  5 files changed, 47 insertions(+), 93 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 9e581c5..6516ce0 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -519,6 +519,11 @@ static void x86_pmu_disable(struct pmu *pmu)
>  	if (!cpuc->enabled)
>  		return;
>  
> +	if (cpuc->current != current) {
> +		cpuc->current = current;
> +		cpuc->ctxs_seq++;
> +	}
> +
>  	cpuc->n_added = 0;
>  	cpuc->enabled = 0;
>  	barrier();
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 97e557b..e1ee365 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -141,6 +141,12 @@ struct cpu_hw_events {
>  	int			is_fake;
>  
>  	/*
> +	 * Context switch tracking
> +	 */
> +	void			*current;
> +	u64			ctxs_seq;
> +
> +	/*
>  	 * Intel DebugStore bits
>  	 */
>  	struct debug_store	*ds;
> @@ -150,11 +156,11 @@ struct cpu_hw_events {
>  	 * Intel LBR bits
>  	 */
>  	int				lbr_users;
> -	void				*lbr_context;
>  	struct perf_branch_stack	lbr_stack;
>  	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
>  	struct er_account		*lbr_sel;
>  	u64				br_sel;
> +	u64				lbr_flush_seq;
>  
>  	/*
>  	 * Intel host/guest exclude bits
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index d5be06a..aa34fa3 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -189,15 +189,20 @@ void intel_pmu_lbr_enable(struct perf_event *event)
>  		return;
>  
>  	/*
> -	 * Reset the LBR stack if we changed task context to
> -	 * avoid data leaks.
> +	 * If we're a task event and observe a context switch; flush the LBR
> +	 * since we don't want to leak LBR entries from the previous task into
> +	 * this one.
>  	 */
> -	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
> +	if (event->ctx->task && cpuc->ctxs_seq != cpuc->lbr_flush_seq) {
>  		intel_pmu_lbr_reset();
> -		cpuc->lbr_context = event->ctx;
> +		cpuc->lbr_flush_seq = cpuc->ctxs_seq;
>  	}
> +
>  	cpuc->br_sel = event->hw.branch_reg.reg;
>  
> +	if (!cpuc->lbr_users)
> +		event->ctx->pmu->flags |= PERF_PF_CTXS;
> +
>  	cpuc->lbr_users++;
>  }
>  
> @@ -209,6 +214,9 @@ void intel_pmu_lbr_disable(struct perf_event *event)
>  		return;
>  
>  	cpuc->lbr_users--;
> +	if (!cpuc->lbr_users)
> +		event->ctx->pmu->flags &= ~PERF_PF_CTXS;
> +
>  	WARN_ON_ONCE(cpuc->lbr_users < 0);
>  
>  	if (cpuc->enabled && !cpuc->lbr_users) {
> @@ -222,8 +230,14 @@ void intel_pmu_lbr_enable_all(void)
>  {
>  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>  
> -	if (cpuc->lbr_users)
> +	if (cpuc->lbr_users) {
> +		if (cpuc->lbr_flush_seq != cpuc->ctxs_seq) {
> +			intel_pmu_lbr_reset();
> +			cpuc->lbr_flush_seq = cpuc->ctxs_seq;
> +		}
> +
>  		__intel_pmu_lbr_enable();
> +	}
>  }
>  
>  void intel_pmu_lbr_disable_all(void)
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8873f82..837f6e3 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -189,6 +189,11 @@ struct perf_event;
>   */
>  #define PERF_EVENT_TXN 0x1
>  
> +/*
> + * pmu::flags
> + */
> +#define PERF_PF_CTXS	0x01 /* require pmu_disable/enable on context_sched_in */
> +
>  /**
>   * struct pmu - generic performance monitoring unit
>   */
> @@ -200,10 +205,11 @@ struct pmu {
>  	const char			*name;
>  	int				type;
>  
> -	int * __percpu			pmu_disable_count;
> -	struct perf_cpu_context * __percpu pmu_cpu_context;
> +	unsigned int			flags;
>  	int				task_ctx_nr;
>  	int				hrtimer_interval_ms;
> +	int * __percpu			pmu_disable_count;
> +	struct perf_cpu_context * __percpu pmu_cpu_context;
>  
>  	/*
>  	 * Fully disable/enable this PMU, can be used to protect from the PMI
> @@ -492,7 +498,6 @@ struct perf_event_context {
>  	u64				generation;
>  	int				pin_count;
>  	int				nr_cgroups;	 /* cgroup evts */
> -	int				nr_branch_stack; /* branch_stack evt */
>  	struct rcu_head			rcu_head;
>  };
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1db3af9..d49b4ea 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -140,7 +140,6 @@ enum event_type_t {
>   */
>  struct static_key_deferred perf_sched_events __read_mostly;
>  static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> -static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
>  
>  static atomic_t nr_mmap_events __read_mostly;
>  static atomic_t nr_comm_events __read_mostly;
> @@ -1114,9 +1113,6 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>  	if (is_cgroup_event(event))
>  		ctx->nr_cgroups++;
>  
> -	if (has_branch_stack(event))
> -		ctx->nr_branch_stack++;
> -
>  	list_add_rcu(&event->event_entry, &ctx->event_list);
>  	if (!ctx->nr_events)
>  		perf_pmu_rotate_start(ctx->pmu);
> @@ -1271,9 +1267,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  			cpuctx->cgrp = NULL;
>  	}
>  
> -	if (has_branch_stack(event))
> -		ctx->nr_branch_stack--;
> -
>  	ctx->nr_events--;
>  	if (event->attr.inherit_stat)
>  		ctx->nr_stat--;
> @@ -2428,8 +2421,13 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
>  	struct perf_cpu_context *cpuctx;
>  
>  	cpuctx = __get_cpu_context(ctx);
> -	if (cpuctx->task_ctx == ctx)
> +	if (cpuctx->task_ctx == ctx) {
> +		if (ctx->pmu->flags & PERF_PF_CTXS) {
> +			perf_pmu_disable(ctx->pmu);
> +			perf_pmu_enable(ctx->pmu);
> +		}
>  		return;
> +	}
>  
>  	perf_ctx_lock(cpuctx, ctx);
>  	perf_pmu_disable(ctx->pmu);
> @@ -2456,66 +2454,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
>  }
>  
>  /*
> - * When sampling the branck stack in system-wide, it may be necessary
> - * to flush the stack on context switch. This happens when the branch
> - * stack does not tag its entries with the pid of the current task.
> - * Otherwise it becomes impossible to associate a branch entry with a
> - * task. This ambiguity is more likely to appear when the branch stack
> - * supports priv level filtering and the user sets it to monitor only
> - * at the user level (which could be a useful measurement in system-wide
> - * mode). In that case, the risk is high of having a branch stack with
> - * branch from multiple tasks. Flushing may mean dropping the existing
> - * entries or stashing them somewhere in the PMU specific code layer.
> - *
> - * This function provides the context switch callback to the lower code
> - * layer. It is invoked ONLY when there is at least one system-wide context
> - * with at least one active event using taken branch sampling.
> - */
> -static void perf_branch_stack_sched_in(struct task_struct *prev,
> -				       struct task_struct *task)
> -{
> -	struct perf_cpu_context *cpuctx;
> -	struct pmu *pmu;
> -	unsigned long flags;
> -
> -	/* no need to flush branch stack if not changing task */
> -	if (prev == task)
> -		return;
> -
> -	local_irq_save(flags);
> -
> -	rcu_read_lock();
> -
> -	list_for_each_entry_rcu(pmu, &pmus, entry) {
> -		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
> -
> -		/*
> -		 * check if the context has at least one
> -		 * event using PERF_SAMPLE_BRANCH_STACK
> -		 */
> -		if (cpuctx->ctx.nr_branch_stack > 0
> -		    && pmu->flush_branch_stack) {
> -
> -			pmu = cpuctx->ctx.pmu;
> -
> -			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> -
> -			perf_pmu_disable(pmu);
> -
> -			pmu->flush_branch_stack();
> -
> -			perf_pmu_enable(pmu);
> -
> -			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> -		}
> -	}
> -
> -	rcu_read_unlock();
> -
> -	local_irq_restore(flags);
> -}
> -
> -/*
>   * Called from scheduler to add the events of the current task
>   * with interrupts disabled.
>   *
> @@ -2546,10 +2484,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
>  	 */
>  	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
>  		perf_cgroup_sched_in(prev, task);
> -
> -	/* check for system-wide branch_stack events */
> -	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
> -		perf_branch_stack_sched_in(prev, task);
>  }
>  
>  static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
> @@ -3126,14 +3060,8 @@ static void free_event(struct perf_event *event)
>  			static_key_slow_dec_deferred(&perf_sched_events);
>  		}
>  
> -		if (has_branch_stack(event)) {
> +		if (has_branch_stack(event))
>  			static_key_slow_dec_deferred(&perf_sched_events);
> -			/* is system-wide event */
> -			if (!(event->attach_state & PERF_ATTACH_TASK)) {
> -				atomic_dec(&per_cpu(perf_branch_stack_events,
> -						    event->cpu));
> -			}
> -		}
>  	}
>  
>  	if (event->rb) {
> @@ -6554,12 +6482,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  				return ERR_PTR(err);
>  			}
>  		}
> -		if (has_branch_stack(event)) {
> +		if (has_branch_stack(event))
>  			static_key_slow_inc(&perf_sched_events.key);
> -			if (!(event->attach_state & PERF_ATTACH_TASK))
> -				atomic_inc(&per_cpu(perf_branch_stack_events,
> -						    event->cpu));
> -		}
>  	}
>  
>  	return event;
> 
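For reference, a minimal user-space C model of the sequence-counter scheme in the patch above; cpuc, ctxs_seq and lbr_flush_seq follow the names in that patch, while the surrounding scaffolding (stdio, main, the fake task pointers) is purely illustrative and not kernel code:

/*
 * Model of the lazy LBR flush: x86_pmu_disable() bumps ctxs_seq whenever
 * it observes a new task, and the LBR code flushes the next time it is
 * enabled with a stale lbr_flush_seq.  Illustrative sketch only.
 */
#include <stdint.h>
#include <stdio.h>

struct cpu_hw_events {
	const void	*current_task;	/* last task seen by pmu_disable() */
	uint64_t	ctxs_seq;	/* bumped on every observed task switch */
	uint64_t	lbr_flush_seq;	/* ctxs_seq value at the last flush */
};

static void lbr_reset(void)
{
	puts("LBR reset");
}

/* Stands in for the hunk added to x86_pmu_disable(). */
static void pmu_observe_task(struct cpu_hw_events *cpuc, const void *task)
{
	if (cpuc->current_task != task) {
		cpuc->current_task = task;
		cpuc->ctxs_seq++;
	}
}

/* Stands in for the check added to intel_pmu_lbr_enable{,_all}(). */
static void lbr_enable(struct cpu_hw_events *cpuc)
{
	if (cpuc->lbr_flush_seq != cpuc->ctxs_seq) {
		lbr_reset();
		cpuc->lbr_flush_seq = cpuc->ctxs_seq;
	}
}

int main(void)
{
	struct cpu_hw_events cpuc = { 0 };
	int task_a, task_b;

	pmu_observe_task(&cpuc, &task_a);
	lbr_enable(&cpuc);	/* flushes: first observed task */
	lbr_enable(&cpuc);	/* no flush: same task, same ctxs_seq */
	pmu_observe_task(&cpuc, &task_b);
	lbr_enable(&cpuc);	/* flushes: task switch bumped ctxs_seq */
	return 0;
}

The point of the scheme is that the LBR is not reset on every context switch, only lazily the next time it is (re-)enabled for a task other than the one it was last flushed for.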


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH V2 3/7] perf, x86: Introduce x86 special perf event context
  2012-10-24  5:59 [PATCH V2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
@ 2012-10-24  5:59 ` Yan, Zheng
  0 siblings, 0 replies; 25+ messages in thread
From: Yan, Zheng @ 2012-10-24  5:59 UTC (permalink / raw)
  To: linux-kernel, a.p.zijlstra; +Cc: eranian, ak, Yan, Zheng

From: "Yan, Zheng" <zheng.z.yan@intel.com>

The x86-specific perf event context is named x86_perf_event_context.
We can enlarge it later to store PMU-specific data.
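As an illustration only (not part of this patch), here is a minimal user-space sketch of the embedding pattern: the arch allocates its larger context with the generic perf_event_context as the first member, so the core keeps handing around a plain perf_event_context pointer while arch code can recover its private fields. The lbr_from field is a made-up placeholder for whatever PMU data gets stored later:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct perf_event_context {
	int nr_events;				/* generic core fields only */
};

struct x86_perf_event_context {
	struct perf_event_context ctx;		/* must stay the first member */
	unsigned long long lbr_from[16];	/* placeholder for PMU data */
};

/* What a pmu->event_context_alloc() callback boils down to. */
static struct perf_event_context *x86_ctx_alloc(void)
{
	struct x86_perf_event_context *x86_ctx = calloc(1, sizeof(*x86_ctx));

	return x86_ctx ? &x86_ctx->ctx : NULL;
}

int main(void)
{
	struct perf_event_context *ctx = x86_ctx_alloc();
	struct x86_perf_event_context *x86_ctx;

	if (!ctx)
		return 1;

	/* Core code only ever sees ctx; arch code recovers its container. */
	x86_ctx = container_of(ctx, struct x86_perf_event_context, ctx);
	x86_ctx->lbr_from[0] = 0x1234;
	printf("nr_events=%d lbr_from[0]=%#llx\n",
	       ctx->nr_events, x86_ctx->lbr_from[0]);

	free(x86_ctx);
	return 0;
}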

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 arch/x86/kernel/cpu/perf_event.c | 12 ++++++++++++
 arch/x86/kernel/cpu/perf_event.h |  4 ++++
 include/linux/perf_event.h       |  5 +++++
 kernel/events/core.c             | 28 ++++++++++++++++++----------
 4 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 08e61a6..3361114 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1606,6 +1606,17 @@ static int x86_pmu_event_idx(struct perf_event *event)
 	return idx + 1;
 }
 
+static void *x86_pmu_event_context_alloc(struct perf_event_context *parent_ctx)
+{
+	struct perf_event_context *ctx;
+
+	ctx = kzalloc(sizeof(struct x86_perf_event_context), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	return ctx;
+}
+
 static ssize_t get_attr_rdpmc(struct device *cdev,
 			      struct device_attribute *attr,
 			      char *buf)
@@ -1695,6 +1706,7 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.flush_branch_stack	= x86_pmu_flush_branch_stack,
+	.event_context_alloc	= x86_pmu_event_context_alloc,
 };
 
 void arch_perf_update_userpage(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a555a2c..55ca981 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -419,6 +419,10 @@ enum {
 	PERF_SAMPLE_BRANCH_CALL_STACK = 1U << PERF_SAMPLE_BRANCH_CALL_STACK_SHIFT,
 };
 
+struct x86_perf_event_context {
+	struct perf_event_context ctx;
+};
+
 #define x86_add_quirk(func_)						\
 do {									\
 	static struct x86_pmu_quirk __quirk __initdata = {		\
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7e6a4b6..2868fcf 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -264,6 +264,11 @@ struct pmu {
 	 * flush branch stack on context-switches (needed in cpu-wide mode)
 	 */
 	void (*flush_branch_stack)	(void);
+
+	/*
+	 * Allocate PMU special perf event context
+	 */
+	void *(*event_context_alloc)	(struct perf_event_context *parent_ctx);
 };
 
 /**
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 534810d..c886018 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2721,13 +2721,20 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 }
 
 static struct perf_event_context *
-alloc_perf_context(struct pmu *pmu, struct task_struct *task)
+alloc_perf_context(struct pmu *pmu, struct task_struct *task,
+		   struct perf_event_context *parent_ctx)
 {
 	struct perf_event_context *ctx;
 
-	ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
-	if (!ctx)
-		return NULL;
+	if (pmu->event_context_alloc) {
+		ctx = pmu->event_context_alloc(parent_ctx);
+		if (IS_ERR(ctx))
+			return ctx;
+	} else {
+		ctx = kzalloc(sizeof(struct perf_event_context), GFP_KERNEL);
+		if (!ctx)
+			return ERR_PTR(-ENOMEM);
+	}
 
 	__perf_event_init_context(ctx);
 	if (task) {
@@ -2813,10 +2820,11 @@ retry:
 		++ctx->pin_count;
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	} else {
-		ctx = alloc_perf_context(pmu, task);
-		err = -ENOMEM;
-		if (!ctx)
+		ctx = alloc_perf_context(pmu, task, NULL);
+		if (IS_ERR(ctx)) {
+			err = PTR_ERR(ctx);
 			goto errout;
+		}
 
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
@@ -7132,9 +7140,9 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		 * child.
 		 */
 
-		child_ctx = alloc_perf_context(event->pmu, child);
-		if (!child_ctx)
-			return -ENOMEM;
+		child_ctx = alloc_perf_context(event->pmu, child, parent_ctx);
+		if (IS_ERR(child_ctx))
+			return PTR_ERR(child_ctx);
 
 		child->perf_event_ctxp[ctxn] = child_ctx;
 	}
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2013-08-08  6:18 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-01  7:23 [PATCH v2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 1/7] perf, x86: Reduce lbr_sel_map size Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 2/7] perf, x86: Basic Haswell LBR call stack support Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 3/7] perf, x86: Introduce x86 special perf event context Yan, Zheng
2013-07-04 12:41   ` Peter Zijlstra
2013-07-05  3:19     ` Yan, Zheng
2013-07-05 12:45       ` Peter Zijlstra
2013-07-08  8:51         ` Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 4/7] perf, x86: Save/restore LBR stack during context switch Yan, Zheng
2013-07-04  9:57   ` Peter Zijlstra
2013-07-04 11:39     ` Yan, Zheng
2013-07-04 13:44     ` Andi Kleen
2013-07-04 14:00       ` Peter Zijlstra
2013-07-10 17:57         ` Andi Kleen
2013-07-04 12:44   ` Peter Zijlstra
2013-07-04 12:45   ` Peter Zijlstra
2013-07-05  5:36     ` Yan, Zheng
2013-07-05  8:15       ` Peter Zijlstra
2013-07-05  8:51         ` Yan, Zheng
2013-07-05 12:31           ` Peter Zijlstra
2013-08-08  6:18             ` Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 5/7] perf, core: Pass perf_sample_data to perf_callchain() Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 6/7] perf, x86: Use LBR call stack to get user callchain Yan, Zheng
2013-07-01  7:23 ` [PATCH v2 7/7] perf, x86: Discard zero length call entries in LBR call stack Yan, Zheng
  -- strict thread matches above, loose matches on Subject: below --
2012-10-24  5:59 [PATCH V2 0/7] perf, x86: Haswell LBR call stack support Yan, Zheng
2012-10-24  5:59 ` [PATCH V2 3/7] perf, x86: Introduce x86 special perf event context Yan, Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).