* [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
@ 2011-10-14 12:37 Stephane Eranian
  2011-10-14 12:37 ` [PATCH 01/12] perf_events: add generic taken branch sampling support (v2) Stephane Eranian
                   ` (13 more replies)
  0 siblings, 14 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

This patchset adds an important and useful new feature to
perf_events: branch stack sampling. In other words, the
ability to capture taken branches into each sample.

Statistical sampling of taken branches should not be confused
with branch tracing: not all branches are necessarily captured.

Sampling taken branches is useful for basic block profiling,
statistical call graphs, and function call counts. Many of those
measurements can help drive a compiler optimizer.

The branch stack is a software abstraction which sits on top
of the PMU hardware. As such, it is not available on all
processors. For now, the patchset provides the generic interface
and the Intel X86 implementation, which leverages the Last
Branch Record (LBR) feature (from Core2 to SandyBridge).

Branch stack sampling is supported for both per-thread and
system-wide modes.

It is possible to filter the type and privilege level of branches
to sample. The target of the branch is used to determine
the privilege level.

For each branch, the source and destination are captured. On
some hardware platforms, it may be possible to also extract
the target prediction and, in that case, it is also exposed
to end users.

The branch stack can record a variable number of taken
branches per sample. Those branches are always consecutive
in time. The number of branches captured depends on the
filtering and the underlying hardware. On Intel Nehalem
and later, up to 16 consecutive branches can be captured
per sample.

Branch sampling is always coupled with an event. It can
be any PMU event but it can't be a SW or tracepoint event.

Branch sampling is requested by setting a new sample_type
flag: PERF_SAMPLE_BRANCH_STACK.

To support branch filtering, we introduce a new field
to the perf_event_attr struct: branch_sample_type. We chose
NOT to overload the config1/config2 fields because those
are related to the event encoding. The branch stack is a
separate feature which is combined with the event.

The branch_sample_type is a bitmask of possible filters.
The following filters are defined (more can be added):
- PERF_SAMPLE_BRANCH_ANY       : any control flow change
- PERF_SAMPLE_BRANCH_USER      : capture branches when the target is at the user level
- PERF_SAMPLE_BRANCH_KERNEL    : capture branches when the target is at the kernel level
- PERF_SAMPLE_BRANCH_ANY_CALL  : capture call branches (incl. syscalls)
- PERF_SAMPLE_BRANCH_ANY_RETURN: capture return branches (incl. syscall returns)
- PERF_SAMPLE_BRANCH_IND_CALL  : capture indirect calls

It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.

When the privilege level is not specified, the branch stack
inherits that of the associated event.
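
As an illustration, here is a minimal user-space sketch (not part
of the patchset; it assumes a linux/perf_event.h updated by this
series, and the event/period values are arbitrary) showing how a
tool could request branch stack sampling:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_branch_sampling_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size          = sizeof(attr);
	attr.type          = PERF_TYPE_HARDWARE;
	attr.config        = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	/* capture user-level indirect calls only */
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_IND_CALL
				| PERF_SAMPLE_BRANCH_USER;

	/* measure the calling thread, on any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}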

Some processors may not offer hardware branch filtering, e.g., Intel
Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
X86 implementation in this patchset also provides a SW branch filter
which works on a best effort basis. It can compensate for the lack
of LBR filtering. But first and foremost, it helps work around LBR
filtering errata. The goal is to only capture the type of branches
requested by the user.

It is possible to combine branch stack sampling with PEBS on Intel
X86 processors. Depending on the precise_ip setting, there are
certain filtering restrictions. When precise_ip=1, there are no
filtering restrictions. When precise_ip > 1, only the
ANY|USER|KERNEL filter can be used. This comes from the fact that
the kernel uses the LBR to compensate for the PEBS off-by-1 skid
on the instruction pointer.

To demonstrate how the perf_event branch stack sampling interface
works, the patchset also modifies perf record to capture taken
branches. Similarly perf report is enhanced to display a histogram
of taken branches.

I would like to thank Roberto Vitillo @ LBL for his work on the perf
tool for this.

Enough talking, let's take a simple example. Our trivial test program
goes like this:

#define N 300000000UL	/* iteration count; value assumed for illustration */

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

$ perf record -b any branchy
$ perf report -b
# Events: 23K cycles
#
# Overhead  Source Symbol     Target Symbol
# ........  ................  ................

    18.13%  [.] f1            [.] main                          
    18.10%  [.] main          [.] main                          
    18.01%  [.] main          [.] f1                            
    15.69%  [.] f1            [.] f1                            
     9.11%  [.] f3            [.] f1                            
     6.78%  [.] f1            [.] f3                            
     6.74%  [.] f1            [.] f2                            
     6.71%  [.] f2            [.] f1                            

Of the total number of branches captured, 18.13% were from f1() -> main().

Let's make this clearer by filtering the user call branches only:

$ perf record -b any_call -e cycles:u branchy
$ perf report
# Events: 19K cycles
#
# Overhead  Source Symbol              Target Symbol
# ........  .........................  .........................
#
    52.50%  [.] main                   [.] f1                   
    23.99%  [.] f1                     [.] f3                   
    23.48%  [.] f1                     [.] f2                   
     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
     0.01%  [k] _start                 [k] __libc_start_main    

Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
The rest is split roughly 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
that f1() dispatches based on odd vs. even values of n, and n increases monotonically.


In version 2, we rebased the patchset to tip/master (commit 5734857) and
incorporated the feedback from v1 concerning the anonymous bitfield
struct for perf_branch_entry and the handling of i386 ABI binaries
on 64-bit hosts in the instruction decoder for the LBR SW filter.

Signed-off-by: Stephane Eranian <eranian@google.com>


Roberto Agostino Vitillo (2):
  perf: add support for sampling taken branch to perf record
  perf: add support for taken branch sampling to perf report

Stephane Eranian (10):
  perf_events: add generic taken branch sampling support
  perf_events: add Intel LBR MSR definitions
  perf_events: add Intel X86 LBR sharing logic
  perf_events: sync branch stack sampling with X86 precise_sampling
  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
  perf_events: add LBR software filter support for Intel X86
  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
  perf_events: add hook to flush branch_stack on context switch
  perf: add code to support PERF_SAMPLE_BRANCH_STACK

 arch/alpha/kernel/perf_event.c             |    4 +
 arch/arm/kernel/perf_event.c               |    4 +
 arch/mips/kernel/perf_event.c              |    4 +
 arch/powerpc/kernel/perf_event.c           |    4 +
 arch/sh/kernel/perf_event.c                |    4 +
 arch/sparc/kernel/perf_event.c             |    4 +
 arch/x86/include/asm/msr-index.h           |    7 +
 arch/x86/kernel/cpu/perf_event.c           |   62 +++-
 arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
 arch/x86/kernel/cpu/perf_event_intel.c     |  126 +++++--
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   21 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  529 ++++++++++++++++++++++++++--
 include/linux/perf_event.h                 |   74 ++++-
 kernel/events/core.c                       |  167 +++++++++
 kernel/events/hw_breakpoint.c              |    6 +
 tools/perf/Documentation/perf-record.txt   |   18 +
 tools/perf/Documentation/perf-report.txt   |    7 +
 tools/perf/builtin-record.c                |   75 ++++
 tools/perf/builtin-report.c                |   93 +++++-
 tools/perf/perf.h                          |   17 +
 tools/perf/util/annotate.c                 |    2 +-
 tools/perf/util/event.h                    |    1 +
 tools/perf/util/evsel.c                    |   10 +
 tools/perf/util/hist.c                     |   97 ++++--
 tools/perf/util/hist.h                     |    6 +
 tools/perf/util/session.c                  |   72 ++++
 tools/perf/util/session.h                  |    5 +
 tools/perf/util/sort.c                     |  348 ++++++++++++++-----
 tools/perf/util/sort.h                     |    5 +
 tools/perf/util/symbol.h                   |   13 +
 30 files changed, 1584 insertions(+), 204 deletions(-)

-- 
1.7.4.1



* [PATCH 01/12] perf_events: add generic taken branch sampling support (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-12-05 21:06   ` Peter Zijlstra
  2011-12-05 22:14   ` Peter Zijlstra
  2011-10-14 12:37 ` [PATCH 02/12] perf_events: add Intel LBR MSR definitions (v2) Stephane Eranian
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis: for instance, basic block profiling, call
counts, and statistical call graphs.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel X86, it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_BRANCH_USER
  only branches whose targets are at the user level

- PERF_SAMPLE_BRANCH_KERNEL
  only branches whose targets are at the kernel level

- PERF_SAMPLE_BRANCH_ANY
  any type of branches (subject to priv level filters)

- PERF_SAMPLE_BRANCH_ANY_CALL
  any call branches (may incl. syscalls on some arch)

- PERF_SAMPLE_BRANCH_ANY_RETURN
  any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_BRANCH_IND_CALL
  indirect call branches

Obviously, filters may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event is used. It
is possible to collect branches at a priv level different from that
of the associated event.

The number of taken branch records present in each sample may vary
based on the HW, the type of sampled branches, and the executed code.
Therefore each sample records the number of taken branch entries it
contains.
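
To make the record layout concrete, here is a hedged user-space
sketch of how a consumer could walk the PERF_SAMPLE_BRANCH_STACK
portion of a PERF_RECORD_SAMPLE. The field order follows what
perf_output_sample() emits in this patch; the struct layout is
copied from the patch, and the helper is made up for illustration:

#include <stdio.h>
#include <string.h>
#include <linux/types.h>

struct perf_branch_entry {
	__u64	from;
	__u64	to;
	__u64	mispred:1,	/* target mispredicted */
		predicted:1,	/* target predicted */
		reserved:62;
};

/* 'p' points at the branch stack portion of the sample payload */
static void walk_branch_stack(const void *p)
{
	const struct perf_branch_entry *ent;
	__u64 nr, i;

	memcpy(&nr, p, sizeof(nr));	/* number of entries, may be 0 */
	ent = (const void *)((const char *)p + sizeof(nr));

	for (i = 0; i < nr; i++)
		printf("from=%#llx to=%#llx %s\n",
		       (unsigned long long)ent[i].from,
		       (unsigned long long)ent[i].to,
		       ent[i].mispred ? "mispredicted" : "predicted/unknown");
}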

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 +++++---
 include/linux/perf_event.h                 |   66 ++++++++++++++++++++++++++--
 kernel/events/core.c                       |   58 ++++++++++++++++++++++++
 3 files changed, 133 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 3fab3de..b07e051 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -144,9 +144,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		cpuc->lbr_entries[i].from  = msr_lastbranch.from;
-		cpuc->lbr_entries[i].to    = msr_lastbranch.to;
-		cpuc->lbr_entries[i].flags = 0;
+		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
+		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
+		cpuc->lbr_entries[i].mispred	= 0;
+		cpuc->lbr_entries[i].predicted	= 0;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
@@ -167,19 +169,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		unsigned long lbr_idx = (tos - i) & mask;
-		u64 from, to, flags = 0;
+		u64 from, to, mis = 0, pred = 0;
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
 		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
 
 		if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
-			flags = !!(from & LBR_FROM_FLAG_MISPRED);
+			mis = !!(from & LBR_FROM_FLAG_MISPRED);
+			pred = !mis;
 			from = (u64)((((s64)from) << 1) >> 1);
 		}
 
-		cpuc->lbr_entries[i].from  = from;
-		cpuc->lbr_entries[i].to    = to;
-		cpuc->lbr_entries[i].flags = flags;
+		cpuc->lbr_entries[i].from	= from;
+		cpuc->lbr_entries[i].to		= to;
+		cpuc->lbr_entries[i].mispred	= mis;
+		cpuc->lbr_entries[i].predicted	= pred;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1e9ebe5..d8f0278 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -128,11 +128,38 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_BRANCH_STACK		= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
+ * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
+ *
+ * If the user does not pass priv level information via branch_sample_type,
+ * the kernel uses the event's priv level. Branch and event priv levels do
+ * not have to match. Branch priv level is checked for permissions.
+ *
+ * The branch types can be combined, however BRANCH_ANY covers all types
+ * of branches and therefore it supersedes all the other types.
+ */
+enum perf_branch_sample_type {
+	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user level branches */
+	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel level branches */
+
+	PERF_SAMPLE_BRANCH_ANY		= 1U << 2, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 3, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 4, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 5, /* indirect calls */
+
+	PERF_SAMPLE_BRANCH_MAX		= 1U << 6, /* non-ABI */
+};
+
+#define PERF_SAMPLE_BRANCH_PLM_ALL \
+	(PERF_SAMPLE_BRANCH_USER|\
+	 PERF_SAMPLE_BRANCH_KERNEL)
+
+/*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
  *
@@ -239,6 +266,7 @@ struct perf_event_attr {
 		__u64		bp_len;
 		__u64		config2; /* extension of config1 */
 	};
+	__u64	branch_sample_type; /* enum perf_branch_sample_type */
 };
 
 /*
@@ -455,6 +483,8 @@ enum perf_event_type {
 	 *
 	 *	{ u32			size;
 	 *	  char                  data[size];}&& PERF_SAMPLE_RAW
+	 *
+	 *	{ u64 nr; { u64 from, to, flags } lbr[nr]; } && PERF_SAMPLE_BRANCH_STACK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -527,12 +557,31 @@ struct perf_raw_record {
 	void				*data;
 };
 
+/*
+ * single taken branch record layout:
+ *
+ *      from: source instruction (may not always be a branch insn)
+ *        to: branch target
+ *   mispred: branch target was mispredicted
+ * predicted: branch target was predicted
+ *
+ * support for mispred, predicted is optional. In case it
+ * is not supported, mispred = predicted = 0.
+ */
 struct perf_branch_entry {
-	__u64				from;
-	__u64				to;
-	__u64				flags;
+	__u64	from;
+	__u64	to;
+	__u64	mispred:1,  /* target mispredicted */
+		predicted:1,/* target predicted */
+		reserved:62;
 };
 
+/*
+ * branch stack layout:
+ *  nr: number of taken branches stored in entries[]
+ *
+ * Note that nr can vary from sample to sample
+ */
 struct perf_branch_stack {
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
@@ -563,7 +612,9 @@ struct hw_perf_event {
 			unsigned long	event_base;
 			int		idx;
 			int		last_cpu;
+
 			struct hw_perf_event_extra extra_reg;
+			struct hw_perf_event_extra branch_reg;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;
@@ -991,12 +1042,14 @@ struct perf_sample_data {
 	u64				period;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_branch_stack	*br_stack;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
 {
 	data->addr = addr;
 	data->raw  = NULL;
+	data->br_stack = NULL;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
@@ -1135,6 +1188,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
 # define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
+static inline bool has_branch_stack(struct perf_event *event)
+{
+	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d1a1bee..a4c3826 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3987,6 +3987,24 @@ void perf_output_sample(struct perf_output_handle *handle,
 			}
 		}
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		if (data->br_stack) {
+			size_t size;
+
+			size = data->br_stack->nr
+			     * sizeof(struct perf_branch_entry);
+
+			perf_output_put(handle, data->br_stack->nr);
+			perf_output_copy(handle, data->br_stack->entries, size);
+		} else {
+			/*
+			 * we always store at least the value of nr
+			 */
+			u64 nr = 0;
+			perf_output_put(handle, nr);
+		}
+	}
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
@@ -4029,6 +4047,15 @@ void perf_prepare_sample(struct perf_event_header *header,
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		int size = sizeof(u64); /* nr */
+		if (data->br_stack) {
+			size += data->br_stack->nr
+			      * sizeof(struct perf_branch_entry);
+		}
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -5979,6 +6006,37 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
 		return -EINVAL;
 
+	if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 mask = attr->branch_sample_type;
+
+		/* only using defined bits */
+		if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
+			return -EINVAL;
+
+		/* at least one branch bit must be set */
+		if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
+			return -EINVAL;
+
+		/* kernel level capture */
+		if ((mask & PERF_SAMPLE_BRANCH_KERNEL)
+		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		/* propagate priv level, when not set for branch */
+		if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
+
+			/* exclude_kernel checked on syscall entry */
+			if (!attr->exclude_kernel)
+				mask |= PERF_SAMPLE_BRANCH_KERNEL;
+
+			if (!attr->exclude_user)
+				mask |= PERF_SAMPLE_BRANCH_USER;
+			/*
+			 * adjust user setting (for HW filter setup)
+			 */
+			attr->branch_sample_type = mask;
+		}
+	}
 out:
 	return ret;
 
-- 
1.7.1



* [PATCH 02/12] perf_events: add Intel LBR MSR definitions (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
  2011-10-14 12:37 ` [PATCH 01/12] perf_events: add generic taken branch sampling support (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic (v2) Stephane Eranian
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architectural LBR MSRs:
LBR_SELECT and LBR_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/msr-index.h           |    7 +++++++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   18 +++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d52609a..e8f5fbf 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -56,6 +56,13 @@
 #define MSR_OFFCORE_RSP_0		0x000001a6
 #define MSR_OFFCORE_RSP_1		0x000001a7
 
+#define MSR_LBR_SELECT			0x000001c8
+#define MSR_LBR_TOS			0x000001c9
+#define MSR_LBR_NHM_FROM		0x00000680
+#define MSR_LBR_NHM_TO			0x000006c0
+#define MSR_LBR_CORE_FROM		0x00000040
+#define MSR_LBR_CORE_TO			0x00000060
+
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index b07e051..e9ac6e9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -205,23 +205,23 @@ void intel_pmu_lbr_read(void)
 void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
 
 void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x680;
-	x86_pmu.lbr_to     = 0x6c0;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
 }
 
 void intel_pmu_lbr_init_atom(void)
 {
 	x86_pmu.lbr_nr	   = 8;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
-- 
1.7.1



* [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
  2011-10-14 12:37 ` [PATCH 01/12] perf_events: add generic taken branch sampling support (v2) Stephane Eranian
  2011-10-14 12:37 ` [PATCH 02/12] perf_events: add Intel LBR MSR definitions (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling (v2) Stephane Eranian
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

The Intel LBR on some recent processors is capable
of filtering branches by type. The filter is configurable
via the LBR_SELECT MSR register.

There are limitations on how this register can be used.

On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads
when HT is on. It is private to each core when HT is off.

On SandyBridge, the LBR_SELECT register is private to each thread
when HT is on. It is private to each core when HT is off.

The kernel must manage the sharing of LBR_SELECT. It allows
multiple users on the same logical CPU to use LBR_SELECT as
long as they program it with the same value. Across sibling
CPUs (HT threads), the same restriction applies on NHM/WSM.
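
In pseudo-form, the sharing rule amounts to the following (a
simplified sketch of the idea; the er_account field names follow
the existing shared-regs code, not the exact logic of this patch):

/*
 * 'era' is the er_account shared between HW threads for
 * EXTRA_REG_LBR, 'config' is the LBR_SELECT value this
 * event wants to program
 */
static bool lbr_select_compatible(struct er_account *era, u64 config)
{
	/* register free: first user can claim it */
	if (!atomic_read(&era->ref))
		return true;

	/* in use: OK only if all users program the same value */
	return era->config == config;
}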

This patch implements this sharing logic by leveraging the
mechanism put in place for managing the offcore_response
shared MSR.

We modify __intel_shared_reg_get_constraints() to cause
x86_get_event_constraint() to be called, because the LBR may
be associated with events that are counter-constrained.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |    4 ++
 arch/x86/kernel/cpu/perf_event.h       |    4 ++
 arch/x86/kernel/cpu/perf_event_intel.c |   70 ++++++++++++++++++++------------
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6408910..cfef90e 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -428,6 +428,10 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	/* mark unused */
 	event->hw.extra_reg.idx = EXTRA_REG_NONE;
 
+	/* mark not used */
+	event->hw.extra_reg.idx = EXTRA_REG_NONE;
+	event->hw.branch_reg.idx = EXTRA_REG_NONE;
+
 	return x86_pmu.hw_config(event);
 }
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index b9698d4..8a5c21f 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -33,6 +33,7 @@ enum extra_reg_type {
 
 	EXTRA_REG_RSP_0 = 0,	/* offcore_response_0 */
 	EXTRA_REG_RSP_1 = 1,	/* offcore_response_1 */
+	EXTRA_REG_LBR   = 2,	/* lbr_select */
 
 	EXTRA_REG_MAX		/* number of entries needed */
 };
@@ -129,6 +130,7 @@ struct cpu_hw_events {
 	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
+	struct er_account		*lbr_sel;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -296,6 +298,8 @@ struct x86_pmu {
 	 */
 	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
 	int		lbr_nr;			   /* hardware stack size */
+	u64		lbr_sel_mask;   	   /* valid bits in LBR_SELECT */
+	const int	*lbr_sel_map;   	   /* lbr_select mappings */
 
 	/*
 	 * Extra registers for events
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e09ca20..1303732 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1126,17 +1126,17 @@ static bool intel_try_alt_er(struct perf_event *event, int orig_idx)
  */
 static struct event_constraint *
 __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
-				   struct perf_event *event)
+				   struct perf_event *event,
+				   struct hw_perf_event_extra *reg)
 {
 	struct event_constraint *c = &emptyconstraint;
-	struct hw_perf_event_extra *reg = &event->hw.extra_reg;
 	struct er_account *era;
 	unsigned long flags;
 	int orig_idx = reg->idx;
 
 	/* already allocated shared msr */
 	if (reg->alloc)
-		return &unconstrained;
+		return NULL; /* call x86_get_event_constraint() */
 
 again:
 	era = &cpuc->shared_regs->regs[reg->idx];
@@ -1159,14 +1159,10 @@ __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
 		reg->alloc = 1;
 
 		/*
-		 * All events using extra_reg are unconstrained.
-		 * Avoids calling x86_get_event_constraints()
-		 *
-		 * Must revisit if extra_reg controlling events
-		 * ever have constraints. Worst case we go through
-		 * the regular event constraint table.
+		 * need to call x86_get_event_constraint()
+		 * to check if associated event has constraints
 		 */
-		c = &unconstrained;
+		c = NULL;
 	} else if (intel_try_alt_er(event, orig_idx)) {
 		raw_spin_unlock(&era->lock);
 		goto again;
@@ -1203,11 +1199,23 @@ static struct event_constraint *
 intel_shared_regs_constraints(struct cpu_hw_events *cpuc,
 			      struct perf_event *event)
 {
-	struct event_constraint *c = NULL;
-
-	if (event->hw.extra_reg.idx != EXTRA_REG_NONE)
-		c = __intel_shared_reg_get_constraints(cpuc, event);
-
+	struct event_constraint *c = NULL, *d;
+	struct hw_perf_event_extra *xreg, *breg;
+
+	xreg = &event->hw.extra_reg;
+	if (xreg->idx != EXTRA_REG_NONE) {
+		c = __intel_shared_reg_get_constraints(cpuc, event, xreg);
+		if (c == &emptyconstraint)
+			return c;
+	}
+	breg = &event->hw.branch_reg;
+	if (breg->idx != EXTRA_REG_NONE) {
+		d = __intel_shared_reg_get_constraints(cpuc, event, breg);
+		if (d == &emptyconstraint) {
+			__intel_shared_reg_put_constraints(cpuc, xreg);
+			c = d;
+		}
+	}
 	return c;
 }
 
@@ -1255,6 +1263,10 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
 	reg = &event->hw.extra_reg;
 	if (reg->idx != EXTRA_REG_NONE)
 		__intel_shared_reg_put_constraints(cpuc, reg);
+
+	reg = &event->hw.branch_reg;
+	if (reg->idx != EXTRA_REG_NONE)
+		__intel_shared_reg_put_constraints(cpuc, reg);
 }
 
 static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
@@ -1434,7 +1446,7 @@ static int intel_pmu_cpu_prepare(int cpu)
 {
 	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
 
-	if (!x86_pmu.extra_regs)
+	if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
 		return NOTIFY_OK;
 
 	cpuc->shared_regs = allocate_shared_regs(cpu);
@@ -1456,22 +1468,28 @@ static void intel_pmu_cpu_starting(int cpu)
 	 */
 	intel_pmu_lbr_reset();
 
-	if (!cpuc->shared_regs || (x86_pmu.er_flags & ERF_NO_HT_SHARING))
+	cpuc->lbr_sel = NULL;
+
+	if (!cpuc->shared_regs)
 		return;
 
-	for_each_cpu(i, topology_thread_cpumask(cpu)) {
-		struct intel_shared_regs *pc;
+	if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+		for_each_cpu(i, topology_thread_cpumask(cpu)) {
+			struct intel_shared_regs *pc;
 
-		pc = per_cpu(cpu_hw_events, i).shared_regs;
-		if (pc && pc->core_id == core_id) {
-			cpuc->kfree_on_online = cpuc->shared_regs;
-			cpuc->shared_regs = pc;
-			break;
+			pc = per_cpu(cpu_hw_events, i).shared_regs;
+			if (pc && pc->core_id == core_id) {
+				cpuc->kfree_on_online = cpuc->shared_regs;
+				cpuc->shared_regs = pc;
+				break;
+			}
 		}
+		cpuc->shared_regs->core_id = core_id;
+		cpuc->shared_regs->refcnt++;
 	}
 
-	cpuc->shared_regs->core_id = core_id;
-	cpuc->shared_regs->refcnt++;
+	if (x86_pmu.lbr_sel_map)
+		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
 }
 
 static void intel_pmu_cpu_dying(int cpu)
-- 
1.7.1



* [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (2 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2) Stephane Eranian
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

If precise sampling is enabled on Intel X86, then perf_event uses PEBS.
To correct for the off-by-one error of PEBS, perf_event uses the LBR
when precise_ip > 1.

On Intel X86, PERF_SAMPLE_BRANCH_STACK is implemented using the LBR,
therefore both features must be coordinated, as they may not
configure the LBR the same way.

For PEBS, the LBR needs to capture all branches at all priv levels.
This patch sets this up.

The configuration of PERF_SAMPLE_BRANCH_STACK may not be compatible
with PEBS, in which case an error is returned.
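
For example (a hedged user-space view of the constraint; attr is a
struct perf_event_attr): with precise_ip > 1, the only
branch_sample_type accepted is ANY at all priv levels, because the
LBR is already reserved for the PEBS instruction pointer fixup:

/* accepted: matches what the PEBS fixup programs into the LBR */
attr.precise_ip         = 2;
attr.sample_type       |= PERF_SAMPLE_BRANCH_STACK;
attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY
			| PERF_SAMPLE_BRANCH_USER
			| PERF_SAMPLE_BRANCH_KERNEL;

/* rejected with -EOPNOTSUPP: calls-only filtering would conflict */
attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY_CALL;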

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index cfef90e..e2efa90 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -358,6 +358,7 @@ int x86_setup_perfctr(struct perf_event *event)
 int x86_pmu_hw_config(struct perf_event *event)
 {
 	if (event->attr.precise_ip) {
+		u64 *br_type, br_sel;
 		int precise = 0;
 
 		/* Support for constant skid */
@@ -371,6 +372,27 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
+		/*
+		 * check that PEBS LBR correction does not conflict with
+		 * whatever the user is asking with attr->branch_sample_type
+		 */
+		if (event->attr.precise_ip > 1) {
+
+			br_type = &event->attr.branch_sample_type;
+
+			if (has_branch_stack(event)) {
+				br_sel = *br_type & PERF_SAMPLE_BRANCH_ANY;
+				if (br_sel != PERF_SAMPLE_BRANCH_ANY)
+					return -EOPNOTSUPP;
+			} else {
+				/*
+				 * For PEBS fixups, we capture all
+				 * the branches at all priv levels
+				 */
+				*br_type = PERF_SAMPLE_BRANCH_ANY
+					 | PERF_SAMPLE_BRANCH_PLM_ALL;
+			}
+		}
 	}
 
 	/*
-- 
1.7.1



* [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (3 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-12-05 22:35   ` Peter Zijlstra
  2011-10-14 12:37 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v2) Stephane Eranian
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
filters to the actual Intel X86 LBR filters, whenever they exist.

The patch also adds a restriction on Intel Atom, whereby only
stepping 10 (PineView) and more recent models are supported. Older
models do not have a functional LBR (it does not freeze on PMU
interrupt).
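
A worked example of the mapping (values taken from the
nhm_lbr_sel_map table below; the suppress-mode inversion is applied
by the filter setup code added later in this series):

/*
 * hypothetical request on Nehalem/Westmere:
 *
 *   branch_sample_type = PERF_SAMPLE_BRANCH_ANY_CALL
 *                      | PERF_SAMPLE_BRANCH_USER;
 *
 * capture mask accumulated from nhm_lbr_sel_map (the erratum
 * forces REL_JMP and IND_JMP in for ANY_CALL):
 *
 *   LBR_USER | LBR_REL_CALL | LBR_IND_CALL
 *            | LBR_REL_JMP  | LBR_IND_JMP | LBR_FAR = 0x1da
 *
 * LBR_SELECT operates in suppress mode, so the value programmed
 * into the MSR is the complement within the valid bits:
 *
 *   ~0x1da & LBR_SEL_MASK = 0x025
 *   (suppress ring0 branches, conditionals, near returns)
 */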

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |    2 +
 arch/x86/kernel/cpu/perf_event_intel.c     |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  110 +++++++++++++++++++++++++++-
 3 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8a5c21f..750c7af 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -482,6 +482,8 @@ void intel_pmu_lbr_init_nhm(void);
 
 void intel_pmu_lbr_init_atom(void);
 
+void intel_pmu_lbr_init_snb(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 1303732..6f313c0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1715,7 +1715,7 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
-		intel_pmu_lbr_init_nhm();
+		intel_pmu_lbr_init_snb();
 
 		x86_pmu.event_constraints = intel_snb_event_constraints;
 		x86_pmu.pebs_constraints = intel_snb_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e9ac6e9..2e56ed3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -14,6 +14,47 @@ enum {
 };
 
 /*
+ * Intel LBR_SELECT bits
+ * Intel Vol3a, April 2011, Section 16.7 Table 16-10
+ *
+ * Hardware branch filter (not available on all CPUs)
+ */
+#define LBR_KERNEL_BIT		0 /* do not capture at ring0 */
+#define LBR_USER_BIT		1 /* do not capture at ring > 0 */
+#define LBR_JCC_BIT		2 /* do not capture conditional branches */
+#define LBR_REL_CALL_BIT	3 /* do not capture relative calls */
+#define LBR_IND_CALL_BIT	4 /* do not capture indirect calls */
+#define LBR_RETURN_BIT		5 /* do not capture near returns */
+#define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
+#define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
+#define LBR_FAR_BIT		8 /* do not capture far branches */
+
+#define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
+#define LBR_USER	(1 << LBR_USER_BIT)
+#define LBR_JCC		(1 << LBR_JCC_BIT)
+#define LBR_REL_CALL	(1 << LBR_REL_CALL_BIT)
+#define LBR_IND_CALL	(1 << LBR_IND_CALL_BIT)
+#define LBR_RETURN	(1 << LBR_RETURN_BIT)
+#define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
+#define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
+#define LBR_FAR		(1 << LBR_FAR_BIT)
+
+#define LBR_PLM (LBR_KERNEL | LBR_USER)
+
+#define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+
+#define LBR_ANY		 \
+	(LBR_JCC	|\
+	 LBR_REL_CALL	|\
+	 LBR_IND_CALL	|\
+	 LBR_RETURN	|\
+	 LBR_REL_JMP	|\
+	 LBR_IND_JMP	|\
+	 LBR_FAR)
+
+#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -153,8 +194,6 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 	cpuc->lbr_stack.nr = i;
 }
 
-#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
-
 /*
  * Due to lack of segmentation in Linux the effective address (offset)
  * is the same as the linear address, allowing us to merge the LIP and EIP
@@ -202,26 +241,93 @@ void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_64(cpuc);
 }
 
+/*
+ * Map interface branch filters onto LBR filters
+ */
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX]=
+{
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] =
+		LBR_RETURN | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
+	 */
+	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+		LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
+	 */
+	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+};
+
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX]=
+{
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
+};
+
+/* core */
 void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("4-deep LBR, ");
 }
 
+/* nehalem/westmere */
 void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
+/* sandy bridge */
+void intel_pmu_lbr_init_snb(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
 }
 
+/* atom */
 void intel_pmu_lbr_init_atom(void)
 {
+	/*
+	 * only models starting at stepping 10 seem
+	 * to have an operational LBR which can freeze
+	 * on PMU interrupt
+	 */
+	if (boot_cpu_data.x86_mask < 10) {
+		pr_cont("LBR disabled due to erratum");
+		return;
+	}
+
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("8-deep LBR, ");
 }
-- 
1.7.1



* [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (4 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

This patch implements PERF_SAMPLE_BRANCH support for Intel
X86 processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.

The patch adds the hooks in the PMU irq handler to save the LBR
on counter overflow for both regular and PEBS modes.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c     |   35 +++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   10 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   73 +++++++++++++++++++++++++++-
 include/linux/perf_event.h                 |    3 +
 4 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 6f313c0..901217d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -730,6 +730,19 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+	/* user explicitly requested branch sampling */
+	if (has_branch_stack(event))
+		return true;
+
+	/* implicit branch sampling to correct PEBS skid */
+	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
+		return true;
+
+	return false;
+}
+
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -884,6 +897,13 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
 
+	/*
+	 * must be disabled before the actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_disable(event);
+
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_disable_fixed(hwc);
 		return;
@@ -938,6 +958,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
 		intel_pmu_enable_bts(hwc->config);
 		return;
 	}
+	/*
+	 * must be enabled before the actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_enable(event);
 
 	if (event->attr.exclude_host)
 		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
@@ -1060,6 +1086,9 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 
 		data.period = event->hw.last_period;
 
+		if (has_branch_stack(event))
+			data.br_stack = &cpuc->lbr_stack;
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
@@ -1308,6 +1337,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		event->hw.config = alt_config;
 	}
 
+	if (intel_pmu_needs_lbr_smpl(event)) {
+		ret = intel_pmu_setup_lbr_filter(event);
+		if (ret)
+			return ret;
+	}
+
 	if (event->attr.type != PERF_TYPE_RAW)
 		return 0;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index c0d238f..d0197ba 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -440,9 +440,6 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 	WARN_ON_ONCE(cpuc->enabled);
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_enable(event);
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -455,9 +452,6 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_disable(event);
 }
 
 void intel_pmu_pebs_enable_all(void)
@@ -569,6 +563,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * both formats and we don't use the other fields in this
 	 * routine.
 	 */
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct pebs_record_core *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
@@ -599,6 +594,9 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	else
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 
+	if (has_branch_stack(event))
+		data.br_stack = &cpuc->lbr_stack;
+
 	if (perf_event_overflow(event, &data, &regs))
 		x86_pmu_stop(event, 0);
 }
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 2e56ed3..f4d3fce 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -42,6 +42,7 @@ enum {
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
 #define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+#define LBR_NOT_SUPP	-1    /* LBR filter not supported */
 
 #define LBR_ANY		 \
 	(LBR_JCC	|\
@@ -54,6 +55,10 @@ enum {
 
 #define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
 
+#define for_each_branch_sample_type(x) \
+	for ((x) = PERF_SAMPLE_BRANCH_USER; \
+	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
+
 /*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
@@ -62,6 +67,10 @@ enum {
 static void __intel_pmu_lbr_enable(void)
 {
 	u64 debugctl;
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->lbr_sel)
+		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
@@ -119,7 +128,6 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	 * Reset the LBR stack if we changed task context to
 	 * avoid data leaks.
 	 */
-
 	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
@@ -138,8 +146,11 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users)
+	if (cpuc->enabled && !cpuc->lbr_users) {
 		__intel_pmu_lbr_disable();
+		/* avoid stale pointer */
+		cpuc->lbr_context = NULL;
+	}
 }
 
 void intel_pmu_lbr_enable_all(void)
@@ -158,6 +169,9 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
 static inline u64 intel_pmu_lbr_tos(void)
 {
 	u64 tos;
@@ -242,6 +256,61 @@ void intel_pmu_lbr_read(void)
 }
 
 /*
+ * setup the HW LBR filter
+ * Used only when available, may not be enough to disambiguate
+ * all branches, may need the help of the SW filter
+ */
+static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
+{
+	struct hw_perf_event_extra *reg;
+	u64 br_type = event->attr.branch_sample_type;
+	u64 mask = 0, m;
+	u64 v;
+
+	for_each_branch_sample_type(m) {
+		if (!(br_type & m))
+			continue;
+
+		v = x86_pmu.lbr_sel_map[m];
+		if (v == LBR_NOT_SUPP)
+			return -EOPNOTSUPP;
+		mask |= v;
+
+		if (m == PERF_SAMPLE_BRANCH_ANY)
+			break;
+	}
+	reg = &event->hw.branch_reg;
+	reg->idx = EXTRA_REG_LBR;
+
+	/* LBR_SELECT operates in suppress mode so invert mask */
+	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+
+	return 0;
+}
+
+static int intel_pmu_setup_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+
+	/*
+	 * no LBR on this PMU
+	 */
+	if (!x86_pmu.lbr_nr)
+		return -EOPNOTSUPP;
+
+	/*
+	 * if no LBR HW filter, users can only
+	 * capture all branches
+	 */
+	if (!x86_pmu.lbr_sel_map) {
+		if (br_type != PERF_SAMPLE_BRANCH_ALL)
+			return -EOPNOTSUPP;
+		return 0;
+	}
+	return intel_pmu_setup_hw_lbr_filter(event);
+}
+
+/*
  * Map interface branch filters onto LBR filters
  */
 static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX]=
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d8f0278..c4fbe84 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -159,6 +159,9 @@ enum perf_branch_sample_type {
 	(PERF_SAMPLE_BRANCH_USER|\
 	 PERF_SAMPLE_BRANCH_KERNEL)
 
+#define PERF_SAMPLE_BRANCH_ALL \
+	(PERF_SAMPLE_BRANCH_PLM_ALL|PERF_SAMPLE_BRANCH_ANY)
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
-- 
1.7.1



* [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (5 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-12-05 22:29   ` Peter Zijlstra
  2011-10-14 12:37 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v2) Stephane Eranian
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

This patch adds an internal software filter to complement
the (optional) LBR hardware filter.

The software filter is necessary:
- as a substitute when there is no HW LBR filter (e.g., Atom, Core)
- to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
- to provide finer grain filtering (e.g., all processors)

Sometimes, the LBR HW filter cannot distinguish between two types
of branches. For instance, to capture syscalls as CALLs, it is
necessary to enable the LBR_FAR filter, which will also capture JMP
instructions. Thus, a second pass is necessary to filter those out;
this is what the SW filter does.

The SW filter is built on top of the internal x86 instruction
decoder. It is a best-effort filter, especially for user-level code:
it is subject to the availability of the program's text pages.

The SW filter is enabled on all Intel X86 processors. It is bypassed
when the user is capturing all branches at all priv levels.
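
Conceptually, the second pass boils down to the following (a
simplified sketch of the idea, not the exact code in this patch):

/*
 * after intel_pmu_lbr_read(), drop captured entries whose decoded
 * type does not match what the user asked for; br_sel is the
 * X86_BR_* mask stashed at setup time, branch_type() decodes the
 * instruction at 'from'
 */
static void sw_filter_branches(struct cpu_hw_events *cpuc)
{
	u64 i = 0, j;

	while (i < cpuc->lbr_stack.nr) {
		int type = branch_type(cpuc->lbr_entries[i].from,
				       cpuc->lbr_entries[i].to);

		/* keep only fully classified, requested branches */
		if (type != X86_BR_NONE && (type & cpuc->br_sel) == type) {
			i++;
			continue;
		}

		/* discard entry i: compact the remaining entries */
		for (j = i; j < cpuc->lbr_stack.nr - 1; j++)
			cpuc->lbr_entries[j] = cpuc->lbr_entries[j + 1];
		cpuc->lbr_stack.nr--;
	}
}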

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |   12 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   12 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  322 +++++++++++++++++++++++++++-
 3 files changed, 326 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 750c7af..48ed504 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -131,6 +131,7 @@ struct cpu_hw_events {
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
+	u64				br_sel;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -402,6 +403,17 @@ extern struct event_constraint emptyconstraint;
 
 extern struct event_constraint unconstrained;
 
+static inline bool kernel_ip(unsigned long ip)
+{
+#ifdef CONFIG_X86_32
+	return ip > PAGE_OFFSET;
+#else
+	return (long)ip < 0;
+#endif
+}
+
+int intel_pmu_setup_lbr_filter(struct perf_event *event);
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index d0197ba..8c17380 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -3,6 +3,7 @@
 #include <linux/slab.h>
 
 #include <asm/perf_event.h>
+#include <asm/insn.h>
 
 #include "perf_event.h"
 
@@ -470,17 +471,6 @@ void intel_pmu_pebs_disable_all(void)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
 }
 
-#include <asm/insn.h>
-
-static inline bool kernel_ip(unsigned long ip)
-{
-#ifdef CONFIG_X86_32
-	return ip > PAGE_OFFSET;
-#else
-	return (long)ip < 0;
-#endif
-}
-
 static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index f4d3fce..3eb47c1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -3,6 +3,7 @@
 
 #include <asm/perf_event.h>
 #include <asm/msr.h>
+#include <asm/insn.h>
 
 #include "perf_event.h"
 
@@ -60,6 +61,53 @@ enum {
 	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
 
 /*
+ * X86 control flow change classification
+ * X86 control flow changes include branches, interrupts, traps, faults
+ */
+enum {
+	X86_BR_NONE     = 0,       /* unknown */
+
+	X86_BR_USER     = 1 << 0,  /* branch target is user */
+	X86_BR_KERNEL   = 1 << 1,  /* branch target is kernel */
+
+	X86_BR_CALL     = 1 << 2,  /* call */
+	X86_BR_RET      = 1 << 3,  /* return */
+	X86_BR_SYSCALL  = 1 << 4,  /* syscall */
+	X86_BR_SYSRET   = 1 << 5,  /* syscall return */
+	X86_BR_INT      = 1 << 6,  /* sw interrupt */
+	X86_BR_IRET     = 1 << 7,  /* return from interrupt */
+	X86_BR_JCC      = 1 << 8,  /* conditional */
+	X86_BR_JMP      = 1 << 9,  /* jump */
+	X86_BR_IRQ      = 1 << 10, /* hw interrupt or trap or fault */
+	X86_BR_IND_CALL = 1 << 11, /* indirect calls */
+};
+
+#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
+
+#define X86_BR_ANY		\
+	(X86_BR_CALL	|\
+	 X86_BR_RET	|\
+	 X86_BR_SYSCALL	|\
+	 X86_BR_SYSRET	|\
+	 X86_BR_INT	|\
+	 X86_BR_IRET	|\
+	 X86_BR_JCC	|\
+	 X86_BR_JMP	|\
+	 X86_BR_IRQ	|\
+	 X86_BR_IND_CALL)
+
+#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
+
+#define X86_BR_ANY_CALL		\
+	(X86_BR_CALL		|\
+	 X86_BR_IND_CALL	|\
+	 X86_BR_SYSCALL		|\
+	 X86_BR_IRQ		|\
+	 X86_BR_INT)
+
+static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -132,6 +180,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
 	}
+	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
 }
@@ -253,6 +302,45 @@ void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_32(cpuc);
 	else
 		intel_pmu_lbr_read_64(cpuc);
+
+	intel_pmu_lbr_filter(cpuc);
+}
+
+/*
+ * SW filter is used:
+ * - in case there is no HW filter
+ * - in case the HW filter has errata or limitations
+ */
+static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+	int mask = 0;
+
+	if (br_type & PERF_SAMPLE_BRANCH_USER)
+		mask |= X86_BR_USER;
+
+	if (br_type & PERF_SAMPLE_BRANCH_KERNEL)
+		mask |= X86_BR_KERNEL;
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY) {
+		mask |= X86_BR_ANY;
+		goto done;
+	}
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+		mask |= X86_BR_ANY_CALL;
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+		mask |= X86_BR_RET | X86_BR_IRET | X86_BR_SYSRET;
+
+	if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
+		mask |= X86_BR_IND_CALL;
+done:
+	/*
+	 * stash actual user request into reg, it may
+	 * be used by fixup code for some CPU
+	 */
+	event->hw.branch_reg.reg = mask;
 }
 
 /*
@@ -288,9 +376,9 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	return 0;
 }
 
-static int intel_pmu_setup_lbr_filter(struct perf_event *event)
+int intel_pmu_setup_lbr_filter(struct perf_event *event)
 {
-	u64 br_type = event->attr.branch_sample_type;
+	int ret = 0;
 
 	/*
 	 * no LBR on this PMU
@@ -299,15 +387,210 @@ static int intel_pmu_setup_lbr_filter(struct perf_event *event)
 		return -EOPNOTSUPP;
 
 	/*
-	 * if no LBR HW filter, users can only
-	 * capture all branches
+	 * setup SW LBR filter
 	 */
-	if (!x86_pmu.lbr_sel_map) {
-		if (br_type != PERF_SAMPLE_BRANCH_ALL)
-			return -EOPNOTSUPP;
-		return 0;
+	intel_pmu_setup_sw_lbr_filter(event);
+
+	/*
+	 * setup HW LBR filter, if any
+	 */
+	if (x86_pmu.lbr_sel_map)
+		ret = intel_pmu_setup_hw_lbr_filter(event);
+
+	return ret;
+}
+
+/*
+ * return the type of control flow change at address "from";
+ * the instruction there is not necessarily a branch (in the case of
+ * an interrupt, trap or fault it can be any instruction).
+ *
+ * The branch type returned also includes the priv level of the
+ * target of the control flow change (X86_BR_USER, X86_BR_KERNEL).
+ *
+ * If a branch type is unknown OR the instruction cannot be
+ * decoded (e.g., text page not present), then X86_BR_NONE is
+ * returned.
+ */
+static int branch_type(unsigned long from, unsigned long to)
+{
+	struct insn insn;
+	void *addr;
+	int bytes, size = MAX_INSN_SIZE;
+	int ret = X86_BR_NONE;
+	int ext, to_plm, from_plm;
+	u8 buf[MAX_INSN_SIZE];
+	int is64 = 0;
+
+	to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
+	from_plm = kernel_ip(from) ? X86_BR_KERNEL : X86_BR_USER;
+
+	/*
+	 * may be zero if the LBR did not fill up after a reset by the
+	 * time we get a PMU interrupt
+	 */
+	if (from == 0 || to == 0)
+		return X86_BR_NONE;
+
+	if (from_plm == X86_BR_USER) {
+		/*
+		 * can happen if measuring at the user level only
+		 * and we interrupt in a kernel thread, e.g., idle.
+		 */
+		if (!current->mm)
+			return X86_BR_NONE;
+
+		/* may fail if text not present */
+		bytes = copy_from_user_nmi(buf, (void __user *)from, size);
+		if (bytes != size)
+			return X86_BR_NONE;
+
+		addr = buf;
+	} else
+		addr = (void *)from;
+
+	/*
+	 * decoder needs to know the ABI especially
+	 * on 64-bit systems running 32-bit apps
+	 */
+#ifdef CONFIG_X86_64
+	is64 = kernel_ip((unsigned long)addr) || !test_thread_flag(TIF_IA32);
+#endif
+	insn_init(&insn, addr, is64);
+	insn_get_opcode(&insn);
+
+	switch (insn.opcode.bytes[0]) {
+	case 0xf:
+		switch(insn.opcode.bytes[1]) {
+		case 0x05: /* syscall */
+		case 0x34: /* sysenter */
+			ret = X86_BR_SYSCALL;
+			break;
+		case 0x07: /* sysret */
+		case 0x35: /* sysexit */
+			ret = X86_BR_SYSRET;
+			break;
+		case 0x80 ... 0x8f: /* conditional */
+			ret = X86_BR_JCC;
+			break;
+		default:
+			ret = X86_BR_NONE;
+		}
+		break;
+	case 0x70 ... 0x7f: /* conditional */
+		ret = X86_BR_JCC;
+		break;
+	case 0xc2: /* near ret */
+	case 0xc3: /* near ret */
+	case 0xca: /* far ret */
+	case 0xcb: /* far ret */
+		ret = X86_BR_RET;
+		break;
+	case 0xcf: /* iret */
+		ret = X86_BR_IRET;
+		break;
+	case 0xcc ... 0xce: /* int */
+		ret = X86_BR_INT;
+		break;
+	case 0xe8: /* call near rel */
+	case 0x9a: /* call far absolute */
+		ret = X86_BR_CALL;
+		break;
+	case 0xe0 ... 0xe3: /* loop jmp */
+		ret = X86_BR_JCC;
+		break;
+	case 0xe9 ... 0xeb: /* jmp */
+		ret = X86_BR_JMP;
+		break;
+	case 0xff: /* call near absolute, call far absolute ind */
+		insn_get_modrm(&insn);
+		ext = (insn.modrm.bytes[0] >> 3) & 0x7;
+		switch (ext) {
+		case 2: /* near ind call */
+		case 3: /* far ind call */
+			ret = X86_BR_IND_CALL;
+			break;
+		case 4:
+		case 5:
+			ret = X86_BR_JMP;
+			break;
+		}
+		break;
+	default:
+		ret = X86_BR_NONE;
 	}
-	return intel_pmu_setup_hw_lbr_filter(event);
+	/*
+	 * interrupts, traps, faults (and thus ring transitions) may
+	 * occur on any instruction. Thus, to classify them correctly,
+	 * we need to first look at the from and to priv levels. If they
+	 * are different and to is in the kernel, then it indicates
+	 * a ring transition. If the from instruction is not a ring
+	 * transition instr (syscall, sysenter, int), then it means
+	 * it was an irq, trap or fault.
+	 *
+	 * we have no way of detecting kernel to kernel faults.
+	 */
+	if (from_plm == X86_BR_USER && to_plm == X86_BR_KERNEL &&
+	    ret != X86_BR_SYSCALL && ret != X86_BR_INT)
+		ret = X86_BR_IRQ;
+
+	/*
+	 * branch priv level determined by target as
+	 * is done by HW when LBR_SELECT is implemented
+	 */
+	if (ret != X86_BR_NONE)
+		ret |= to_plm;
+
+	return ret;
+}
+
+/*
+ * implement actual branch filter based on user demand.
+ * Hardware may not exactly satisfy that request, thus
+ * we need to inspect opcodes. Mismatched branches are
+ * discarded. Therefore, the number of branches returned
+ * in PERF_SAMPLE_BRANCH_STACK sample may vary.
+ */
+static void
+intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
+{
+	u64 from, to;
+	int br_sel = cpuc->br_sel;
+	int i, j, type;
+	bool compress = false;
+
+	/* if sampling all branches, then nothing to filter */
+	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+		return;
+
+	for (i = 0; i < cpuc->lbr_stack.nr; i++) {
+		from = cpuc->lbr_entries[i].from;
+		to = cpuc->lbr_entries[i].to;
+
+		type = branch_type(from, to);
+
+		/* if type does not correspond, then discard */
+		if (type == X86_BR_NONE || (br_sel & type) != type) {
+			cpuc->lbr_entries[i].from = 0;
+			compress = true;
+		}
+	}
+
+	if (!compress)
+		return;
+
+	/* remove all entries with from=0 */
+	for (i = 0; i < cpuc->lbr_stack.nr; ) {
+		if (!cpuc->lbr_entries[i].from) {
+			j = i;
+			while (++j < cpuc->lbr_stack.nr)
+				cpuc->lbr_entries[j-1] = cpuc->lbr_entries[j];
+			cpuc->lbr_stack.nr--;
+			if (!cpuc->lbr_entries[i].from)
+				continue;
+		}
+		i++;
+	}
 }
 
 /*
@@ -349,6 +632,10 @@ void intel_pmu_lbr_init_core(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("4-deep LBR, ");
 }
 
@@ -363,6 +650,13 @@ void intel_pmu_lbr_init_nhm(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - workaround LBR_SEL errata (see above)
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -377,6 +671,12 @@ void intel_pmu_lbr_init_snb(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -398,5 +698,9 @@ void intel_pmu_lbr_init_atom(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("8-deep LBR, ");
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (6 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2) Stephane Eranian
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

PERF_SAMPLE_BRANCH_* is disabled for:
- SW events (sw counters, tracepoints)
- HW breakpoints
- all architectures except Intel x86
- AMD64 processors
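
For reference, all of these checks go through the has_branch_stack()
helper introduced earlier in the series. It is expected to reduce to a
simple sample_type test along these lines (sketch, not part of this patch):

	static inline bool has_branch_stack(struct perf_event *event)
	{
		return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
	}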

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/alpha/kernel/perf_event.c       |    4 ++++
 arch/arm/kernel/perf_event.c         |    4 ++++
 arch/mips/kernel/perf_event.c        |    4 ++++
 arch/powerpc/kernel/perf_event.c     |    4 ++++
 arch/sh/kernel/perf_event.c          |    4 ++++
 arch/sparc/kernel/perf_event.c       |    4 ++++
 arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
 kernel/events/core.c                 |   24 ++++++++++++++++++++++++
 kernel/events/hw_breakpoint.c        |    6 ++++++
 9 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 8143cd7..0dae252 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 53c9c26..bcb0dd1 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -544,6 +544,10 @@ static int armpmu_event_init(struct perf_event *event)
 {
 	int err = 0;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/mips/kernel/perf_event.c b/arch/mips/kernel/perf_event.c
index 0aee944..425c35a 100644
--- a/arch/mips/kernel/perf_event.c
+++ b/arch/mips/kernel/perf_event.c
@@ -370,6 +370,10 @@ static int mipspmu_event_init(struct perf_event *event)
 {
 	int err = 0;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
index 10a140f..5701051 100644
--- a/arch/powerpc/kernel/perf_event.c
+++ b/arch/powerpc/kernel/perf_event.c
@@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
 	if (!ppmu)
 		return -ENOENT;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_HARDWARE:
 		ev = event->attr.config;
diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
index 2ee21a4..7cc9066 100644
--- a/arch/sh/kernel/perf_event.c
+++ b/arch/sh/kernel/perf_event.c
@@ -309,6 +309,10 @@ static int sh_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HW_CACHE:
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 614da62..8e16a4a 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
 	if (atomic_read(&nmi_active) < 0)
 		return -ENODEV;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (attr->type) {
 	case PERF_TYPE_HARDWARE:
 		if (attr->config >= sparc_pmu->max_events)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index aeefd45..9ef5749 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -138,6 +138,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
 	if (ret)
 		return ret;
 
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	if (event->attr.exclude_host && event->attr.exclude_guest)
 		/*
 		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a4c3826..6d30498 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5105,6 +5105,12 @@ static int perf_swevent_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_SOFTWARE)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event_id) {
 	case PERF_COUNT_SW_CPU_CLOCK:
 	case PERF_COUNT_SW_TASK_CLOCK:
@@ -5208,6 +5214,12 @@ static int perf_tp_event_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for tracepoint events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	err = perf_trace_init(event);
 	if (err)
 		return err;
@@ -5431,6 +5443,12 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
@@ -5503,6 +5521,12 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index b7971d6..e7fb781 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
 	if (bp->attr.type != PERF_TYPE_BREAKPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for breakpoint events
+	 */
+	if (has_branch_stack(bp))
+		return -EOPNOTSUPP;
+
 	err = register_perf_hw_breakpoint(bp);
 	if (err)
 		return err;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (7 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-12-05 21:10   ` Peter Zijlstra
  2011-12-05 21:37   ` Peter Zijlstra
  2011-10-14 12:37 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v2) Stephane Eranian
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.

We need a hook on context switch to either flush the branch stack
or save it. This patch adds a new hook in struct pmu which is called
during context switches. The hook is called only when necessary,
that is, when a system-wide context has at least one event which
uses PERF_SAMPLE_BRANCH_STACK. The hook is never called for per-thread
contexts.

In this version, the Intel X86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
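
For clarity, the resulting call path on a context switch looks roughly
as follows (a sketch using names from this series; the branch stack leg
is only taken when at least one system-wide branch-stack event is active):

	context_switch()
	  -> perf_event_task_sched_in(prev, task)
	       -> __perf_event_task_sched_in(prev, task)
	            -> perf_branch_stack_sched_in(prev, task)
	                 -> pmu->flush_branch_stack()
	                      /* Intel: intel_pmu_lbr_reset() */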

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |   29 +++++++----
 arch/x86/kernel/cpu/perf_event.h       |    1 +
 arch/x86/kernel/cpu/perf_event_intel.c |   13 +++++
 include/linux/perf_event.h             |    7 ++-
 kernel/events/core.c                   |   85 ++++++++++++++++++++++++++++++++
 5 files changed, 123 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index e2efa90..b44aba8 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1436,21 +1436,28 @@ static int x86_pmu_event_init(struct perf_event *event)
 	return err;
 }
 
+static void x86_pmu_flush_branch_stack(void)
+{
+	if (x86_pmu.flush_branch_stack)
+		x86_pmu.flush_branch_stack();
+}
+
 static struct pmu pmu = {
-	.pmu_enable	= x86_pmu_enable,
-	.pmu_disable	= x86_pmu_disable,
+	.pmu_enable		= x86_pmu_enable,
+	.pmu_disable		= x86_pmu_disable,
 
-	.event_init	= x86_pmu_event_init,
+	.event_init		= x86_pmu_event_init,
 
-	.add		= x86_pmu_add,
-	.del		= x86_pmu_del,
-	.start		= x86_pmu_start,
-	.stop		= x86_pmu_stop,
-	.read		= x86_pmu_read,
+	.add			= x86_pmu_add,
+	.del			= x86_pmu_del,
+	.start			= x86_pmu_start,
+	.stop			= x86_pmu_stop,
+	.read			= x86_pmu_read,
 
-	.start_txn	= x86_pmu_start_txn,
-	.cancel_txn	= x86_pmu_cancel_txn,
-	.commit_txn	= x86_pmu_commit_txn,
+	.start_txn		= x86_pmu_start_txn,
+	.cancel_txn		= x86_pmu_cancel_txn,
+	.commit_txn		= x86_pmu_commit_txn,
+	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 };
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 48ed504..5ba6a7b 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -278,6 +278,7 @@ struct x86_pmu {
 	void		(*cpu_starting)(int cpu);
 	void		(*cpu_dying)(int cpu);
 	void		(*cpu_dead)(int cpu);
+	void		(*flush_branch_stack)(void);
 
 	/*
 	 * Intel Arch Perfmon v2+
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 901217d..9cc8a17 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1542,6 +1542,18 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_flush_branch_stack(void)
+{
+	/*
+	 * Intel LBR does not tag entries with the
+	 * PID of the current task, so we need to
+	 * flush it on ctxsw.
+	 * For now, we simply reset it.
+	 */
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_reset();
+}
+
 static __initconst const struct x86_pmu intel_pmu = {
 	.name			= "Intel",
 	.handle_irq		= intel_pmu_handle_irq,
@@ -1569,6 +1581,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
+	.flush_branch_stack	= intel_pmu_flush_branch_stack,
 };
 
 static void intel_clovertown_quirks(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c4fbe84..83ddb52 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -733,6 +733,9 @@ struct pmu {
 	 * for each successful ->add() during the transaction.
 	 */
 	void (*cancel_txn)		(struct pmu *pmu); /* optional */
+	/*
+	 * flush branch stack on context-switches (needed in cpu-wide mode)
+	 */
+	void (*flush_branch_stack)	(void);
 };
 
 /**
@@ -961,7 +964,8 @@ struct perf_event_context {
 	u64				parent_gen;
 	u64				generation;
 	int				pin_count;
-	int				nr_cgroups; /* cgroup events present */
+	int				nr_cgroups;	 /* cgroup evts */
+	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 };
 
@@ -1026,6 +1030,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
+
 struct perf_sample_data {
 	u64				type;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d30498..6f22f46 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -130,6 +130,7 @@ enum event_type_t {
  */
 struct jump_label_key perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
+static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -878,6 +879,9 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack++;
+
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1017,6 +1021,9 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack--;
+
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2186,6 +2193,66 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
+ * When sampling the branch stack in system-wide mode, it may be necessary
+ * to flush the stack on context switch. This happens when the branch
+ * stack does not tag its entries with the pid of the current task.
+ * Otherwise it becomes impossible to associate a branch entry with a
+ * task. This ambiguity is more likely to appear when the branch stack
+ * supports priv level filtering and the user sets it to monitor only
+ * at the user level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch stack with
+ * branches from multiple tasks. Flushing may mean dropping the existing
+ * entries or stashing them somewhere in the PMU specific code layer.
+ *
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when there is at least one system-wide context
+ * with at least one active event using taken branch sampling.
+ */
+static void perf_branch_stack_sched_in(struct task_struct *prev,
+				       struct task_struct *task)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	/* no need to flush branch stack if not changing task */
+	if (prev == task)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pmu, &pmus, entry) {
+		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+		/*
+		 * check if the context has at least one
+		 * event using PERF_SAMPLE_BRANCH_STACK
+		 */
+		if (cpuctx->ctx.nr_branch_stack > 0
+		    && pmu->flush_branch_stack) {
+
+			pmu = cpuctx->ctx.pmu;
+
+			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+			perf_pmu_disable(pmu);
+
+			pmu->flush_branch_stack();
+
+			perf_pmu_enable(pmu);
+
+			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		}
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
+/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2216,6 +2283,10 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	 */
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
+
+	/* check for system-wide branch_stack events */
+	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
+		perf_branch_stack_sched_in(prev, task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -2955,6 +3026,14 @@ static void free_event(struct perf_event *event)
 			atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
 			jump_label_dec(&perf_sched_events);
 		}
+
+		if (has_branch_stack(event)) {
+			jump_label_dec(&perf_sched_events);
+			/* is system-wide event */
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_dec(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	if (event->rb) {
@@ -5961,6 +6040,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 				return ERR_PTR(err);
 			}
 		}
+		if (has_branch_stack(event)) {
+			jump_label_inc(&perf_sched_events);
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_inc(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	return event;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (8 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 11/12] perf: add support for sampling taken branch to perf record (v2) Stephane Eranian
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds:
- ability to parse samples with PERF_SAMPLE_BRANCH_STACK
- sort on branches
- build histograms on branches
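
With PERF_SAMPLE_BRANCH_STACK set, the kernel appends the branch stack
to each sample record. Schematically, the layout the parser expects is
(field names follow the structs added below):

	u64 nr;
	struct {
		u64 from, to;
		u64 mispred:1, predicted:1, reserved:62;
	} entries[nr];	/* variable length */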

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/perf.h          |   17 ++
 tools/perf/util/annotate.c |    2 +-
 tools/perf/util/event.h    |    1 +
 tools/perf/util/evsel.c    |   10 ++
 tools/perf/util/hist.c     |   92 +++++++++---
 tools/perf/util/hist.h     |    6 +
 tools/perf/util/session.c  |   72 +++++++++
 tools/perf/util/session.h  |    5 +
 tools/perf/util/sort.c     |  348 +++++++++++++++++++++++++++++++++-----------
 tools/perf/util/sort.h     |    5 +
 tools/perf/util/symbol.h   |   13 ++
 11 files changed, 462 insertions(+), 109 deletions(-)

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 08b0b5e..a3177e7 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -180,6 +180,23 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+struct branch_flags {
+	u64 mispred:1;
+	u64 predicted:1;
+	u64 reserved:62;
+};
+
+struct branch_entry {
+	u64				from;
+	u64				to;
+	struct branch_flags flags;
+};
+
+struct branch_stack {
+	u64				nr;
+	struct branch_entry	entries[0];
+};
+
 extern bool perf_host, perf_guest;
 extern const char perf_version_string[];
 
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index bc8f477..f071d29 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -64,7 +64,7 @@ int symbol__inc_addr_samples(struct symbol *sym, struct map *map,
 
 	pr_debug3("%s: addr=%#" PRIx64 "\n", __func__, map->unmap_ip(map, addr));
 
-	if (addr >= sym->end)
+	if (addr >= sym->end || addr < sym->start)
 		return 0;
 
 	offset = addr - sym->start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 357a85b..026b1f6 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -80,6 +80,7 @@ struct perf_sample {
 	u32 raw_size;
 	void *raw_data;
 	struct ip_callchain *callchain;
+	struct branch_stack *branch_stack;
 };
 
 #define BUILD_ID_SIZE 20
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index b46f6e4..73550ec 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -473,5 +473,15 @@ int perf_event__parse_sample(const union perf_event *event, u64 type,
 		data->raw_data = (void *) pdata;
 	}
 
+	if (type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 sz;
+
+		data->branch_stack = (struct branch_stack *)array;
+		array++; /* nr */
+
+		sz = data->branch_stack->nr * sizeof(struct branch_entry);
+		sz /= sizeof(uint64_t);
+		array += sz;
+	}
 	return 0;
 }
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 50c8fec..163650b 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -49,9 +49,10 @@ static void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 {
 	u16 len;
 
-	if (h->ms.sym)
-		hists__new_col_len(hists, HISTC_SYMBOL, h->ms.sym->namelen);
-	else {
+	if (h->ms.sym) {
+		int symlen = max((int)h->ms.sym->namelen + 4,
+				 BITS_PER_LONG / 4 + 6);
+		hists__new_col_len(hists, HISTC_SYMBOL, symlen);
+	} else {
 		const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
 
 		if (hists__col_len(hists, HISTC_DSO) < unresolved_col_width &&
@@ -164,26 +165,14 @@ static u8 symbol__parent_filter(const struct symbol *parent)
 	return 0;
 }
 
-struct hist_entry *__hists__add_entry(struct hists *hists,
+static struct hist_entry *add_hist_entry(struct hists *hists,
+				      struct hist_entry *entry,
 				      struct addr_location *al,
-				      struct symbol *sym_parent, u64 period)
+				      u64 period)
 {
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hist_entry *he;
-	struct hist_entry entry = {
-		.thread	= al->thread,
-		.ms = {
-			.map	= al->map,
-			.sym	= al->sym,
-		},
-		.cpu	= al->cpu,
-		.ip	= al->addr,
-		.level	= al->level,
-		.period	= period,
-		.parent = sym_parent,
-		.filtered = symbol__parent_filter(sym_parent),
-	};
 	int cmp;
 
 	pthread_mutex_lock(&hists->lock);
@@ -194,7 +183,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 		parent = *p;
 		he = rb_entry(parent, struct hist_entry, rb_node_in);
 
-		cmp = hist_entry__cmp(&entry, he);
+		cmp = hist_entry__cmp(entry, he);
 
 		if (!cmp) {
 			he->period += period;
@@ -208,7 +197,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 			p = &(*p)->rb_right;
 	}
 
-	he = hist_entry__new(&entry);
+	he = hist_entry__new(entry);
 	if (!he)
 		goto out_unlock;
 
@@ -221,6 +210,69 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 	return he;
 }
 
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info *bi,
+					     u64 period)
+{
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= bi->to.map,
+			.sym	= bi->to.sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= bi->to.addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+		.branch_info = bi,
+	};
+	struct hist_entry *he;
+
+	he = add_hist_entry(self, &entry, al, period);
+	if (!he)
+		return NULL;
+
+	/*
+	 * in branch mode, we do not display al->sym, al->addr
+	 * but instead what is in branch_info. The addresses and
+	 * symbols there may need wider columns, so make sure they
+	 * are taken into account.
+	 *
+	 * hists__calc_col_len() tracks the max column width, so
+	 * we need to call it for both the from and to addresses
+	 */
+	entry.ip     = bi->from.addr;
+	entry.ms.map = bi->from.map;
+	entry.ms.sym = bi->from.sym;
+	hists__calc_col_len(self, &entry);
+
+	return he;
+}
+
+struct hist_entry *__hists__add_entry(struct hists *self,
+				      struct addr_location *al,
+				      struct symbol *sym_parent, u64 period)
+{
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= al->map,
+			.sym	= al->sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= al->addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+	};
+
+	return add_hist_entry(self, &entry, al, period);
+}
+
 int64_t
 hist_entry__cmp(struct hist_entry *left, struct hist_entry *right)
 {
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 7ea1e56..395b2e7 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -40,6 +40,7 @@ enum hist_column {
 	HISTC_COMM,
 	HISTC_PARENT,
 	HISTC_CPU,
+	HISTC_MISPREDICT,
 	HISTC_NR_COLS, /* Last entry */
 };
 
@@ -62,6 +63,11 @@ void hists__init(struct hists *hists);
 struct hist_entry *__hists__add_entry(struct hists *self,
 				      struct addr_location *al,
 				      struct symbol *parent, u64 period);
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info *bi,
+					     u64 period);
 extern int64_t hist_entry__cmp(struct hist_entry *, struct hist_entry *);
 extern int64_t hist_entry__collapse(struct hist_entry *, struct hist_entry *);
 int hist_entry__fprintf(struct hist_entry *he, size_t size, struct hists *hists,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 20e011c..7942c20 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -236,6 +236,64 @@ static bool symbol__match_parent_regex(struct symbol *sym)
 	return 0;
 }
 
+
+static const u8 cpumodes[] = {
+	PERF_RECORD_MISC_USER,
+	PERF_RECORD_MISC_KERNEL,
+	PERF_RECORD_MISC_GUEST_USER,
+	PERF_RECORD_MISC_GUEST_KERNEL
+};
+#define NCPUMODES (sizeof(cpumodes)/sizeof(u8))
+
+static void ip__resolve_ams(struct perf_session *self, struct thread *thread,
+			    struct addr_map_symbol *ams,
+			    u64 ip)
+{
+	struct addr_location al;
+	size_t i;
+	u8 m;
+
+	memset(&al, 0, sizeof(al));
+
+	for (i = 0; i < NCPUMODES; i++) {
+		m = cpumodes[i];
+		/*
+		 * we cannot use the header.misc hint to determine whether a
+		 * branch stack address is user, kernel, guest, hypervisor.
+		 * Branches may straddle the kernel/user/hypervisor boundaries.
+		 * Thus, we have to try all cpumodes consecutively until
+		 * we find a match, or else the symbol is unknown
+		 */
+		thread__find_addr_location(thread, self, m, MAP__FUNCTION,
+				thread->pid, ip, &al, NULL);
+		if (al.sym)
+			goto found;
+	}
+found:
+	ams->addr = ip;
+	ams->sym = al.sym;
+	ams->map = al.map;
+}
+
+struct branch_info *perf_session__resolve_bstack(struct perf_session *self,
+						 struct thread *thr,
+						 struct branch_stack *bs)
+{
+	struct branch_info *bi;
+	unsigned int i;
+
+	bi = calloc(bs->nr, sizeof(struct branch_info));
+	if (!bi)
+		return NULL;
+
+	for (i = 0; i < bs->nr; i++) {
+		ip__resolve_ams(self, thr, &bi[i].to, bs->entries[i].to);
+		ip__resolve_ams(self, thr, &bi[i].from, bs->entries[i].from);
+		bi[i].flags = bs->entries[i].flags;
+	}
+	return bi;
+}
+
 int perf_session__resolve_callchain(struct perf_session *self,
 				    struct thread *thread,
 				    struct ip_callchain *chain,
@@ -679,6 +737,17 @@ static void callchain__printf(struct perf_sample *sample)
 		       i, sample->callchain->ips[i]);
 }
 
+static void branch_stack__printf(struct perf_sample *sample)
+{
+	uint64_t i;
+
+	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
+
+	for (i = 0; i < sample->branch_stack->nr; i++)
+		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 "\n",
+			i, sample->branch_stack->entries[i].from,
+			sample->branch_stack->entries[i].to);
+}
+
 static void perf_session__print_tstamp(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_sample *sample)
@@ -726,6 +795,9 @@ static void dump_sample(struct perf_session *session, union perf_event *event,
 
 	if (session->sample_type & PERF_SAMPLE_CALLCHAIN)
 		callchain__printf(sample);
+
+	if (session->sample_type & PERF_SAMPLE_BRANCH_STACK)
+		branch_stack__printf(sample);
 }
 
 static int perf_session_deliver_event(struct perf_session *session,
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 514b06d..44a957b 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -100,6 +100,11 @@ int __perf_session__process_events(struct perf_session *self,
 int perf_session__process_events(struct perf_session *self,
 				 struct perf_event_ops *event_ops);
 
+
+struct branch_info *perf_session__resolve_bstack(struct perf_session *self,
+						 struct thread *thread,
+						 struct branch_stack *bs);
+
 int perf_session__resolve_callchain(struct perf_session *self,
 				    struct thread *thread,
 				    struct ip_callchain *chain,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 1ee8f1e..f6b31f8 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -8,6 +8,7 @@ const char	default_sort_order[] = "comm,dso,symbol";
 const char	*sort_order = default_sort_order;
 int		sort__need_collapse = 0;
 int		sort__has_parent = 0;
+bool		sort__branch_mode = 0;
 
 enum sort_type	sort__first_dimension;
 
@@ -94,6 +95,26 @@ static int hist_entry__comm_snprintf(struct hist_entry *self, char *bf,
 	return repsep_snprintf(bf, size, "%*s", width, self->thread->comm);
 }
 
+static int64_t _sort__dso_cmp(struct map *map_l, struct map *map_r)
+{
+	struct dso *dso_l = map_l ? map_l->dso : NULL;
+	struct dso *dso_r = map_r ? map_r->dso : NULL;
+	const char *dso_name_l, *dso_name_r;
+
+	if (!dso_l || !dso_r)
+		return cmp_null(dso_l, dso_r);
+
+	if (verbose) {
+		dso_name_l = dso_l->long_name;
+		dso_name_r = dso_r->long_name;
+	} else {
+		dso_name_l = dso_l->short_name;
+		dso_name_r = dso_r->short_name;
+	}
+
+	return strcmp(dso_name_l, dso_name_r);
+}
+
 struct sort_entry sort_comm = {
 	.se_header	= "Command",
 	.se_cmp		= sort__comm_cmp,
@@ -107,36 +128,72 @@ struct sort_entry sort_comm = {
 static int64_t
 sort__dso_cmp(struct hist_entry *left, struct hist_entry *right)
 {
-	struct dso *dso_l = left->ms.map ? left->ms.map->dso : NULL;
-	struct dso *dso_r = right->ms.map ? right->ms.map->dso : NULL;
-	const char *dso_name_l, *dso_name_r;
+	return _sort__dso_cmp(left->ms.map, right->ms.map);
+}
 
-	if (!dso_l || !dso_r)
-		return cmp_null(dso_l, dso_r);
 
-	if (verbose) {
-		dso_name_l = dso_l->long_name;
-		dso_name_r = dso_r->long_name;
-	} else {
-		dso_name_l = dso_l->short_name;
-		dso_name_r = dso_r->short_name;
+static int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r,
+			      u64 ip_l, u64 ip_r)
+{
+	if (!sym_l || !sym_r)
+		return cmp_null(sym_l, sym_r);
+
+	if (sym_l == sym_r)
+		return 0;
+
+	if (sym_l)
+		ip_l = sym_l->start;
+	if (sym_r)
+		ip_r = sym_r->start;
+
+	return (int64_t)(ip_r - ip_l);
+}
+
+static int _hist_entry__dso_snprintf(struct map *map, char *bf,
+				     size_t size, unsigned int width)
+{
+	if (map && map->dso) {
+		const char *dso_name = !verbose ? map->dso->short_name :
+			map->dso->long_name;
+		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
 	}
 
-	return strcmp(dso_name_l, dso_name_r);
+	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
 }
 
 static int hist_entry__dso_snprintf(struct hist_entry *self, char *bf,
 				    size_t size, unsigned int width)
 {
-	if (self->ms.map && self->ms.map->dso) {
-		const char *dso_name = !verbose ? self->ms.map->dso->short_name :
-						  self->ms.map->dso->long_name;
-		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
+	return _hist_entry__dso_snprintf(self->ms.map, bf, size, width);
+}
+
+static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
+				     u64 ip, char level, char *bf, size_t size,
+				     unsigned int width __used)
+{
+	size_t ret = 0;
+
+	if (verbose) {
+		char o = map ? dso__symtab_origin(map->dso) : '!';
+		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
+				       BITS_PER_LONG / 4, ip, o);
 	}
 
-	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
+	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+	if (sym)
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s", width - ret,
+				       sym->name);
+	else {
+		size_t len = BITS_PER_LONG / 4;
+		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+				       len, ip);
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s", width - ret, "");
+	}
+
+	return ret;
 }
 
+
 struct sort_entry sort_dso = {
 	.se_header	= "Shared Object",
 	.se_cmp		= sort__dso_cmp,
@@ -144,8 +201,14 @@ struct sort_entry sort_dso = {
 	.se_width_idx	= HISTC_DSO,
 };
 
-/* --sort symbol */
+static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	return _hist_entry__sym_snprintf(self->ms.map, self->ms.sym, self->ip,
+					 self->level, bf, size, width);
+}
 
+/* --sort symbol */
 static int64_t
 sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 {
@@ -154,38 +217,10 @@ sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 	if (!left->ms.sym && !right->ms.sym)
 		return right->level - left->level;
 
-	if (!left->ms.sym || !right->ms.sym)
-		return cmp_null(left->ms.sym, right->ms.sym);
-
-	if (left->ms.sym == right->ms.sym)
-		return 0;
-
 	ip_l = left->ms.sym->start;
 	ip_r = right->ms.sym->start;
 
-	return (int64_t)(ip_r - ip_l);
-}
-
-static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
-				    size_t size, unsigned int width __used)
-{
-	size_t ret = 0;
-
-	if (verbose) {
-		char o = self->ms.map ? dso__symtab_origin(self->ms.map->dso) : '!';
-		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
-				       BITS_PER_LONG / 4, self->ip, o);
-	}
-
-	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", self->level);
-	if (self->ms.sym)
-		ret += repsep_snprintf(bf + ret, size - ret, "%s",
-				       self->ms.sym->name);
-	else
-		ret += repsep_snprintf(bf + ret, size - ret, "%-#*llx",
-				       BITS_PER_LONG / 4, self->ip);
-
-	return ret;
+	return _sort__sym_cmp(left->ms.sym, right->ms.sym, ip_l, ip_r);
 }
 
 struct sort_entry sort_sym = {
@@ -244,6 +279,124 @@ struct sort_entry sort_cpu = {
 	.se_width_idx	= HISTC_CPU,
 };
 
+static int64_t
+sort__dso_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->branch_info->from.map,
+			      right->branch_info->from.map);
+}
+
+static int hist_entry__dso_from_snprintf(struct hist_entry *self, char *bf,
+					 size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->branch_info->from.map,
+					 bf, size, width);
+}
+
+struct sort_entry sort_dso_from = {
+	.se_header	= "Source Shared Object",
+	.se_cmp		= sort__dso_from_cmp,
+	.se_snprintf	= hist_entry__dso_from_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+static int64_t
+sort__dso_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->branch_info->to.map,
+			      right->branch_info->to.map);
+}
+
+static int hist_entry__dso_to_snprintf(struct hist_entry *self, char *bf,
+				       size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->branch_info->to.map,
+					 bf, size, width);
+}
+
+static int64_t
+sort__sym_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *from_l = &left->branch_info->from;
+	struct addr_map_symbol *from_r = &right->branch_info->from;
+
+	if (!from_l->sym && !from_r->sym)
+		return right->level - left->level;
+
+	return _sort__sym_cmp(from_l->sym, from_r->sym, from_l->addr,
+			      from_r->addr);
+}
+
+static int64_t
+sort__sym_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *to_l = &left->branch_info->to;
+	struct addr_map_symbol *to_r = &right->branch_info->to;
+
+	if (!to_l->sym && !to_r->sym)
+		return right->level - left->level;
+
+	return _sort__sym_cmp(to_l->sym, to_r->sym, to_l->addr, to_r->addr);
+}
+
+static int hist_entry__sym_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *from = &self->branch_info->from;
+	return _hist_entry__sym_snprintf(from->map, from->sym, from->addr,
+					 self->level, bf, size, width);
+
+}
+
+static int hist_entry__sym_to_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *to = &self->branch_info->to;
+	return _hist_entry__sym_snprintf(to->map, to->sym, to->addr,
+					 self->level, bf, size, width);
+
+}
+
+struct sort_entry sort_dso_to = {
+	.se_header	= "Target Shared Object",
+	.se_cmp		= sort__dso_to_cmp,
+	.se_snprintf	= hist_entry__dso_to_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+struct sort_entry sort_sym_from = {
+	.se_header	= "Source Symbol",
+	.se_cmp		= sort__sym_from_cmp,
+	.se_snprintf	= hist_entry__sym_from_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+struct sort_entry sort_sym_to = {
+	.se_header	= "Target Symbol",
+	.se_cmp		= sort__sym_to_cmp,
+	.se_snprintf	= hist_entry__sym_to_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+static int64_t
+sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	const unsigned char mp = left->branch_info->flags.mispred !=
+					right->branch_info->flags.mispred;
+	const unsigned char p = left->branch_info->flags.predicted !=
+					right->branch_info->flags.predicted;
+
+	return mp || p;
+}
+
+static int hist_entry__mispredict_snprintf(struct hist_entry *self, char *bf,
+					   size_t size, unsigned int width)
+{
+	const char *out = "N/A";
+
+	if (self->branch_info->flags.predicted)
+		out = "N";
+	else if (self->branch_info->flags.mispred)
+		out = "Y";
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+struct sort_entry sort_mispredict = {
+	.se_header	= "Branch Mispredicted",
+	.se_cmp		= sort__mispredict_cmp,
+	.se_snprintf	= hist_entry__mispredict_snprintf,
+	.se_width_idx	= HISTC_MISPREDICT,
+};
+
 struct sort_dimension {
 	const char		*name;
 	struct sort_entry	*entry;
@@ -251,14 +404,59 @@ struct sort_dimension {
 };
 
 static struct sort_dimension sort_dimensions[] = {
-	{ .name = "pid",	.entry = &sort_thread,	},
-	{ .name = "comm",	.entry = &sort_comm,	},
-	{ .name = "dso",	.entry = &sort_dso,	},
-	{ .name = "symbol",	.entry = &sort_sym,	},
-	{ .name = "parent",	.entry = &sort_parent,	},
-	{ .name = "cpu",	.entry = &sort_cpu,	},
+	{ .name = "pid",	.entry = &sort_thread,			},
+	{ .name = "comm",	.entry = &sort_comm,			},
+	{ .name = "dso",	.entry = &sort_dso,			},
+	{ .name = "dso_from",	.entry = &sort_dso_from, .taken = true	},
+	{ .name = "dso_to",	.entry = &sort_dso_to,	 .taken = true	},
+	{ .name = "symbol",	.entry = &sort_sym,			},
+	{ .name = "symbol_from",.entry = &sort_sym_from, .taken = true	},
+	{ .name = "symbol_to",	.entry = &sort_sym_to,	 .taken = true	},
+	{ .name = "parent",	.entry = &sort_parent,			},
+	{ .name = "cpu",	.entry = &sort_cpu,			},
+	{ .name = "mispredict",	.entry = &sort_mispredict,		},
 };
 
+static int _sort_dimension__add(struct sort_dimension *sd)
+{
+	if (sd->entry->se_collapse)
+		sort__need_collapse = 1;
+
+	if (sd->entry == &sort_parent) {
+		int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
+		if (ret) {
+			char err[BUFSIZ];
+
+			regerror(ret, &parent_regex, err, sizeof(err));
+			pr_err("Invalid regex: %s\n%s", parent_pattern, err);
+			return -EINVAL;
+		}
+		sort__has_parent = 1;
+	}
+
+	if (list_empty(&hist_entry__sort_list)) {
+		if (!strcmp(sd->name, "pid"))
+			sort__first_dimension = SORT_PID;
+		else if (!strcmp(sd->name, "comm"))
+			sort__first_dimension = SORT_COMM;
+		else if (!strcmp(sd->name, "dso"))
+			sort__first_dimension = SORT_DSO;
+		else if (!strcmp(sd->name, "symbol"))
+			sort__first_dimension = SORT_SYM;
+		else if (!strcmp(sd->name, "parent"))
+			sort__first_dimension = SORT_PARENT;
+		else if (!strcmp(sd->name, "cpu"))
+			sort__first_dimension = SORT_CPU;
+		else if (!strcmp(sd->name, "mispredict"))
+			sort__first_dimension = SORT_MISPREDICTED;
+	}
+
+	list_add_tail(&sd->entry->list, &hist_entry__sort_list);
+	sd->taken = 1;
+
+	return 0;
+}
+
 int sort_dimension__add(const char *tok)
 {
 	unsigned int i;
@@ -269,48 +467,22 @@ int sort_dimension__add(const char *tok)
 		if (strncasecmp(tok, sd->name, strlen(tok)))
 			continue;
 
-		if (sd->entry == &sort_parent) {
-			int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
-			if (ret) {
-				char err[BUFSIZ];
-
-				regerror(ret, &parent_regex, err, sizeof(err));
-				pr_err("Invalid regex: %s\n%s", parent_pattern, err);
-				return -EINVAL;
-			}
-			sort__has_parent = 1;
-		}
-
 		if (sd->taken)
 			return 0;
 
-		if (sd->entry->se_collapse)
-			sort__need_collapse = 1;
-
-		if (list_empty(&hist_entry__sort_list)) {
-			if (!strcmp(sd->name, "pid"))
-				sort__first_dimension = SORT_PID;
-			else if (!strcmp(sd->name, "comm"))
-				sort__first_dimension = SORT_COMM;
-			else if (!strcmp(sd->name, "dso"))
-				sort__first_dimension = SORT_DSO;
-			else if (!strcmp(sd->name, "symbol"))
-				sort__first_dimension = SORT_SYM;
-			else if (!strcmp(sd->name, "parent"))
-				sort__first_dimension = SORT_PARENT;
-			else if (!strcmp(sd->name, "cpu"))
-				sort__first_dimension = SORT_CPU;
-		}
-
-		list_add_tail(&sd->entry->list, &hist_entry__sort_list);
-		sd->taken = 1;
 
-		return 0;
+		if(sort__branch_mode && (sd->entry == &sort_dso ||
+		if (sort__branch_mode && (sd->entry == &sort_dso ||
+					  sd->entry == &sort_sym)) {
+			int err = _sort_dimension__add(sd + 1);
+			return err ?: _sort_dimension__add(sd + 2);
+		} else if (sd->entry == &sort_mispredict && !sort__branch_mode)
+			break;
+		else
+			return _sort_dimension__add(sd);
-
 	return -ESRCH;
 }
-
 void setup_sorting(const char * const usagestr[], const struct option *opts)
 {
 	char *tmp, *tok, *str = strdup(sort_order);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 03851e3..69fc954 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -31,11 +31,14 @@ extern const char *parent_pattern;
 extern const char default_sort_order[];
 extern int sort__need_collapse;
 extern int sort__has_parent;
+extern bool sort__branch_mode;
 extern char *field_sep;
 extern struct sort_entry sort_comm;
 extern struct sort_entry sort_dso;
 extern struct sort_entry sort_sym;
 extern struct sort_entry sort_parent;
+extern struct sort_entry sort_dso_from;
+extern struct sort_entry sort_dso_to;
+extern struct sort_entry sort_sym_from;
+extern struct sort_entry sort_sym_to;
 extern enum sort_type sort__first_dimension;
 
 /**
@@ -71,6 +74,7 @@ struct hist_entry {
 		struct hist_entry *pair;
 		struct rb_root	  sorted_chain;
 	};
+	struct branch_info	*branch_info;
 	struct callchain_root	callchain[0];
 };
 
@@ -81,6 +85,7 @@ enum sort_type {
 	SORT_SYM,
 	SORT_PARENT,
 	SORT_CPU,
+	SORT_MISPREDICTED,
 };
 
 /*
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 29f8d74..a5c84d1 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -5,6 +5,7 @@
 #include <stdbool.h>
 #include <stdint.h>
 #include "map.h"
+#include "../perf.h"
 #include <linux/list.h>
 #include <linux/rbtree.h>
 #include <stdio.h>
@@ -118,6 +119,18 @@ struct map_symbol {
 	bool	      has_children;
 };
 
+struct addr_map_symbol {
+	struct map    *map;
+	struct symbol *sym;
+	u64	      addr;
+};
+
+struct branch_info {
+	struct addr_map_symbol from;
+	struct addr_map_symbol to;
+	struct branch_flags flags;
+};
+
 struct addr_location {
 	struct thread *thread;
 	struct map    *map;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 11/12] perf: add support for sampling taken branch to perf record (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (9 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-10-14 12:37 ` [PATCH 12/12] perf: add support for taken branch sampling to perf report (v2) Stephane Eranian
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds a new option to enable taken branch stack
sampling, i.e., leverage the PERF_SAMPLE_BRANCH_STACK feature
of perf_events.

There is a new option to activate this mode: -b.
It is possible to pass a set of filters to select the type of
branches to sample.

The following filters are available:
- any      : any type of branches
- any_call : any function call or system call
- any_ret  : any function return or system call return
- ind_call : any indirect branch
- u        : only when the branch target is at the user level
- k        : only when the branch target is in the kernel

Filters can be combined by passing a comma separated list
to the option:

$ perf record -b any_call,u -e cycles:u branchy
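
Internally, -b ORs PERF_SAMPLE_BRANCH_STACK into attr->sample_type and
stores the parsed filter mask in the new attr->branch_sample_type field.
For reference, a minimal sketch of the equivalent raw perf_event_open()
setup (error handling omitted; assumes the definitions added earlier in
this series):

	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY_CALL |
				  PERF_SAMPLE_BRANCH_USER;

	/* no glibc wrapper: go through syscall(2) directly */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);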

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-record.txt |   18 +++++++
 tools/perf/builtin-record.c              |   75 ++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 5a520f8..ddc1999 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -148,6 +148,24 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
 corresponding events, i.e., they always refer to events defined earlier on the command
 line.
 
+-b::
+--branch-stack::
+Enable taken branch stack sampling. Each sample captures a series of consecutive
+taken branches. The number of branches captured with each sample depends on the
+underlying hardware, the types of branches of interest and the executed code.
+It is possible to select the types of branches to sample by enabling filters.
+The following filters are defined: any (any type of branch), any_call (any
+function call or system call), any_ret (any function return or system call
+return), ind_call (any indirect branch), u (only when the branch target is at
+the user level), k (only when the branch target is in the kernel). At least one
+of any, any_call, any_ret, ind_call must be provided. The privilege levels may
+be omitted, in which case the privilege levels of the associated event are
+applied to the branch filter. When sampling on multiple events, branch stack
+sampling is enabled for all the sampling events. The sampled branch type is the
+same for all events. The privilege levels are adjusted based on those of the
+associated event unless specified explicitly with this option. Note that taken
+branch sampling may not be available on all processors. The various filters
+must be specified as a comma separated list: -b any_ret,u,k
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f82480f..c2f9cdd 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -57,6 +57,7 @@ static pid_t			child_pid			=     -1;
 static bool			no_inherit			=  false;
 static enum write_mode_t	write_mode			= WRITE_FORCE;
 static bool			call_graph			=  false;
+static u64			branch_stack			=     0;
 static bool			inherit_stat			=  false;
 static bool			no_samples			=  false;
 static bool			sample_address			=  false;
@@ -217,6 +218,11 @@ static void config_attr(struct perf_evsel *evsel, struct perf_evlist *evlist)
 	if (system_wide)
 		attr->sample_type	|= PERF_SAMPLE_CPU;
 
+	if (branch_stack) {
+		attr->sample_type	|= PERF_SAMPLE_BRANCH_STACK;
+		attr->branch_sample_type = branch_stack;
+	}
+
 	if (sample_id_all_avail &&
 	    (sample_time || system_wide || !no_inherit || cpu_list))
 		attr->sample_type	|= PERF_SAMPLE_TIME;
@@ -745,6 +751,72 @@ static int __cmd_record(int argc, const char **argv)
 	return err;
 }
 
+#define BRANCH_OPT(n, m) \
+	{ .name = n, .mode = (m) }
+
+#define BRANCH_END { .name = NULL }
+
+struct branch_mode {
+	const char *name;
+	int mode;
+};
+
+static const struct branch_mode branch_modes[] = {
+	BRANCH_OPT("u", PERF_SAMPLE_BRANCH_USER),
+	BRANCH_OPT("k", PERF_SAMPLE_BRANCH_KERNEL),
+	BRANCH_OPT("any", PERF_SAMPLE_BRANCH_ANY),
+	BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
+	BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
+	BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+	BRANCH_END
+};
+
+static int
+parse_branch_stack(const struct option *opt, const char *str, int unset __used)
+{
+#define ONLY_PLM (PERF_SAMPLE_BRANCH_USER|PERF_SAMPLE_BRANCH_KERNEL)
+	uint64_t *mode = (uint64_t *)opt->value;
+	const struct branch_mode *br;
+	char *s, *os, *p;
+	int ret = -1;
+
+	*mode = 0;
+
+	/* because str is read-only */
+	s = os = strdup(str);
+	if (!s)
+		return -1;
+
+	for (;;) {
+		p = strchr(s, ',');
+		if (p)
+			*p = '\0';
+
+		for (br = branch_modes; br->name; br++) {
+			if (!strcasecmp(s, br->name))
+				break;
+		}
+		if (!br->name)
+			goto error;
+
+		*mode |= br->mode;
+
+		if (!p)
+			break;
+
+		s = p + 1;
+	}
+	ret = 0;
+
+	if ((*mode & ~ONLY_PLM) == 0) {
+		error("need at least one branch type with -b\n");
+		ret = -1;
+	}
+error:
+	free(os);
+	return ret;
+}
+
 static const char * const record_usage[] = {
 	"perf record [<options>] [<command>]",
 	"perf record [<options>] -- <command> [<options>]",
@@ -805,6 +877,9 @@ const struct option record_options[] = {
 	OPT_CALLBACK('G', "cgroup", &evsel_list, "name",
 		     "monitor event in cgroup name only",
 		     parse_cgroups),
+	OPT_CALLBACK('b', "branch-stack", &branch_stack, "branch mode mask",
+		     "branch stack sampling modes",
+		     parse_branch_stack),
 	OPT_END()
 };
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 12/12] perf: add support for taken branch sampling to perf report (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (10 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 11/12] perf: add support for sampling taken branch to perf record (v2) Stephane Eranian
@ 2011-10-14 12:37 ` Stephane Eranian
  2011-12-04 20:11 ` [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
  2011-12-05 22:39 ` Peter Zijlstra
  13 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-10-14 12:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds support for taken branch sampling, i.e., the
PERF_SAMPLE_BRANCH_STACK feature, to perf report. In other
words, it can display histograms based on taken branches rather
than executed instruction addresses.

The new option is called -b and it takes no argument. To
generate meaningful output, the perf.data file must have been
obtained using perf record -b xxx ... where xxx is a branch
filter option.

The output shows source and target symbols and modules, sorted
by 'who branches where' most often. The percentages reported in
the first column refer to the total number of branches captured,
not to the usual number of samples.

Here is a quick example, where branchy is a simple test program
which looks as follows:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

Here is the output captured on Nehalem when we are only
interested in user-level function calls.

$ perf record -b any_call,u -e cycles:u branchy

$ perf report -b --sort=symbol
    52.34%  [.] main                   [.] f1
    24.04%  [.] f1                     [.] f3
    23.60%  [.] f1                     [.] f2
     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
     0.01%  [k] __printf               [k] _IO_vfprintf_internal
     0.01%  [k] main                   [k] __printf

About half (52%) of the call branches captured are from main() -> f1().
The other half (24%+23%) is split into two roughly equal shares between
f1() -> f2() and f1() -> f3(). The output is as expected given the code.

It should be noted that using -b in perf record does not eliminate
information from the perf.data file. Consequently, a typical profile
can also be obtained with perf report by simply not using its -b option.

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-report.txt |    7 ++
 tools/perf/builtin-report.c              |   93 +++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 212f24d..3163be5 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -152,6 +152,13 @@ OPTIONS
 	information which may be very large and thus may clutter the display.
 	It currently includes: cpu and numa topology of the host system.
 
+-b::
+--branch-stack::
+	Use the addresses of sampled taken branches instead of the instruction
+	address to build the histograms. To generate meaningful output, the
+	perf.data file must have been obtained using perf record -b xxx where
+	xxx is a branch filter option.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-annotate[1]
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 4d7c834..f52f65c 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -55,6 +55,46 @@ static symbol_filter_t	annotate_init;
 static const char	*cpu_list;
 static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 
+static int perf_session__add_branch_hist_entry(struct perf_session *session,
+					struct addr_location *al,
+					struct perf_sample *sample,
+					struct perf_evsel *evsel) {
+	struct symbol *parent = NULL;
+	int err = 0;
+	unsigned i;
+	struct hist_entry *he;
+	struct branch_info *bi;
+
+	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
+		err = perf_session__resolve_callchain(session, al->thread,
+						      sample->callchain, &parent);
+		if (err)
+			return err;
+	}
+
+	bi = perf_session__resolve_bstack(session, al->thread,
+					  sample->branch_stack);
+	if (!bi)
+		return -ENOMEM;
+
+	for (i = 0; i < sample->branch_stack->nr; i++) {
+		if (hide_unresolved && !(bi[i].from.sym && bi[i].to.sym))
+			continue;
+		/*
+		 * The report shows the percentage of total branches captured
+		 * and not events sampled. Thus we use a pseudo period of 1.
+		 */
+		he = __hists__add_branch_entry(&evsel->hists, al, parent,
+					       &bi[i], 1);
+		if (he) {
+			evsel->hists.stats.total_period += 1;
+			hists__inc_nr_events(&evsel->hists, PERF_RECORD_SAMPLE);
+		} else
+			return -ENOMEM;
+	}
+	return err;
+}
+
 static int perf_session__add_hist_entry(struct perf_session *session,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -120,20 +160,28 @@ static int process_sample_event(union perf_event *event,
 		return -1;
 	}
 
-	if (al.filtered || (hide_unresolved && al.sym == NULL))
-		return 0;
-
 	if (cpu_list && !test_bit(sample->cpu, cpu_bitmap))
 		return 0;
 
-	if (al.map != NULL)
-		al.map->dso->hit = 1;
+	if (sort__branch_mode) {
+		if (perf_session__add_branch_hist_entry(session, &al, sample,
+						        evsel)) {
+			pr_debug("problem adding lbr entry, skipping event\n");
+			return -1;
+		}
+	} else {
+		if (al.filtered || (hide_unresolved && al.sym == NULL))
+			return 0;
 
-	if (perf_session__add_hist_entry(session, &al, sample, evsel)) {
-		pr_debug("problem incrementing symbol period, skipping event\n");
-		return -1;
-	}
+		if (al.map != NULL)
+			al.map->dso->hit = 1;
 
+		if (perf_session__add_hist_entry(session, &al, sample, evsel)) {
+			pr_debug("problem incrementing symbol period, skipping"
+					" event\n");
+			return -1;
+		}
+	}
 	return 0;
 }
 
@@ -183,6 +231,15 @@ static int perf_session__setup_sample_type(struct perf_session *self)
 			}
 	}
 
+	if (sort__branch_mode) {
+		if (!(self->sample_type & PERF_SAMPLE_BRANCH_STACK)) {
+			fprintf(stderr, "selected -b but no branch data."
+					" Did you call perf record without"
+					" -b?\n");
+			return -1;
+		}
+	}
+
 	return 0;
 }
 
@@ -499,6 +556,8 @@ static const struct option options[] = {
 		   "Specify disassembler style (e.g. -M intel for intel syntax)"),
 	OPT_BOOLEAN(0, "show-total-period", &symbol_conf.show_total_period,
 		    "Show a column with the sum of periods"),
+	OPT_BOOLEAN('b', "branch-stack", &sort__branch_mode,
+		    "use branch records for histogram filling"),
 	OPT_END()
 };
 
@@ -514,6 +573,22 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 	if (inverted_callchain)
 		callchain_param.order = ORDER_CALLER;
 
+	if (sort__branch_mode) {
+		if (use_browser)
+			fprintf(stderr, "Warning: TUI interface not supported"
+					" in branch mode\n");
+		if (symbol_conf.dso_list_str != NULL)
+			fprintf(stderr, "Warning: dso filtering not supported"
+					" in branch mode\n");
+		if (symbol_conf.sym_list_str != NULL)
+			fprintf(stderr, "Warning: symbol filtering not supported"
+					" in branch mode\n");
+
+		use_browser = 0;
+		symbol_conf.dso_list_str = NULL;
+		symbol_conf.sym_list_str = NULL;
+	}
+
 	if (strcmp(input_name, "-") != 0)
 		setup_browser(true);
 	else
-- 
1.7.1



* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (11 preceding siblings ...)
  2011-10-14 12:37 ` [PATCH 12/12] perf: add support for taken branch sampling to perf report (v2) Stephane Eranian
@ 2011-12-04 20:11 ` Stephane Eranian
  2011-12-05 15:27   ` Peter Zijlstra
  2011-12-05 22:39 ` Peter Zijlstra
  13 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-12-04 20:11 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo,
	will.deacon, paulus, benh, rth, ralf, davem, lethal

Any update on this patchset?

On Fri, Oct 14, 2011 at 5:37 AM, Stephane Eranian <eranian@google.com> wrote:
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
>
> Statistical sampling of taken branch should not be confused
> for branch tracing. Not all branches are necessarily captured
>
> Sampling taken branches is important for basic block profiling,
> statistical call graph, function call counts. Many of those
> measurements can help drive a compiler optimizer.
>
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
>
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
>
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
>
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
>
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
>
> Branch sampling is always coupled with an event. It can
> be any PMU event but it can't be a SW or tracepoint event.
>
> Branch sampling is requested by setting a new sample_type
> flag called: PERF_SAMPLE_BRANCH_STACK.
>
> To support branch filtering, we introduce a new field
> to the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1, config2 field because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
>
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_ANY_CALL: capture call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : capture return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: capture indirect calls
>
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
>
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
>
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filtering restrictions. When precise_sampling=1, then
> there are no filtering restrictions. When precise_sampling > 1,
> then only ANY|USER|KERNEL filter can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
>
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly perf report is enhanced to display a histogram
> of taken branches.
>
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
>
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
>
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>  if (n & 1UL)
>    f2();
>  else
>    f3();
> }
> int main(void)
> {
>  unsigned long i;
>
>  for (i=0; i < N; i++)
>   f1(i);
>  return 0;
> }
>
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
>
>    18.13%  [.] f1            [.] main
>    18.10%  [.] main          [.] main
>    18.01%  [.] main          [.] f1
>    15.69%  [.] f1            [.] f1
>     9.11%  [.] f3            [.] f1
>     6.78%  [.] f1            [.] f3
>     6.74%  [.] f1            [.] f2
>     6.71%  [.] f2            [.] f1
>
> Of the total number of branches captured, 18.13% were from f1() -> main().
>
> Let's make this clearer by filtering the user call branches only:
>
> $ perf record -b any_call -e cycles:u branchy
> $ perf report
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>    52.50%  [.] main                   [.] f1
>    23.99%  [.] f1                     [.] f3
>    23.48%  [.] f1                     [.] f2
>     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>     0.01%  [k] _start                 [k] __libc_start_main
>
> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
> that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
>
>
> In version 2, we update the patch to tip/master (commit 5734857) and
> we've incorporated the feedback from v1 concerning the anonymous bitfield
> struct for branch_stack_entry and the handling of i386 ABI binaries
> on 64-bit hosts in the instruction decoder for the LBR SW filter.
>
> Signed-off-by: Stephane Eranian <eranian@google.com>
>
>
> Roberto Agostino Vitillo (2):
>  perf: add support for sampling taken branch to perf record
>  perf: add support for taken branch sampling to perf report
>
> Stephane Eranian (10):
>  perf_events: add generic taken branch sampling support
>  perf_events: add Intel LBR MSR definitions
>  perf_events: add Intel X86 LBR sharing logic
>  perf_events: sync branch stack sampling with X86 precise_sampling
>  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
>  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
>  perf_events: add LBR software filter support for Intel X86
>  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
>  perf_events: add hook to flush branch_stack on context switch
>  perf: add code to support PERF_SAMPLE_BRANCH_STACK
>
>  arch/alpha/kernel/perf_event.c             |    4 +
>  arch/arm/kernel/perf_event.c               |    4 +
>  arch/mips/kernel/perf_event.c              |    4 +
>  arch/powerpc/kernel/perf_event.c           |    4 +
>  arch/sh/kernel/perf_event.c                |    4 +
>  arch/sparc/kernel/perf_event.c             |    4 +
>  arch/x86/include/asm/msr-index.h           |    7 +
>  arch/x86/kernel/cpu/perf_event.c           |   62 +++-
>  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
>  arch/x86/kernel/cpu/perf_event_intel.c     |  126 +++++--
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   21 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  529 ++++++++++++++++++++++++++--
>  include/linux/perf_event.h                 |   74 ++++-
>  kernel/events/core.c                       |  167 +++++++++
>  kernel/events/hw_breakpoint.c              |    6 +
>  tools/perf/Documentation/perf-record.txt   |   18 +
>  tools/perf/Documentation/perf-report.txt   |    7 +
>  tools/perf/builtin-record.c                |   75 ++++
>  tools/perf/builtin-report.c                |   93 +++++-
>  tools/perf/perf.h                          |   17 +
>  tools/perf/util/annotate.c                 |    2 +-
>  tools/perf/util/event.h                    |    1 +
>  tools/perf/util/evsel.c                    |   10 +
>  tools/perf/util/hist.c                     |   97 ++++--
>  tools/perf/util/hist.h                     |    6 +
>  tools/perf/util/session.c                  |   72 ++++
>  tools/perf/util/session.h                  |    5 +
>  tools/perf/util/sort.c                     |  348 ++++++++++++++-----
>  tools/perf/util/sort.h                     |    5 +
>  tools/perf/util/symbol.h                   |   13 +
>  30 files changed, 1584 insertions(+), 204 deletions(-)
>
> --
> 1.7.4.1
>


* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-12-04 20:11 ` [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
@ 2011-12-05 15:27   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 15:27 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Sun, 2011-12-04 at 12:11 -0800, Stephane Eranian wrote:
> Any update on this patchset?

Completely slipped through the cracks in my brain :-) Lemme go have a
look.




* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support (v2)
  2011-10-14 12:37 ` [PATCH 01/12] perf_events: add generic taken branch sampling support (v2) Stephane Eranian
@ 2011-12-05 21:06   ` Peter Zijlstra
  2011-12-06 19:42     ` Stephane Eranian
  2011-12-05 22:14   ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 21:06 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> @@ -455,6 +483,8 @@ enum perf_event_type {
>          *
>          *      { u32                   size;
>          *        char                  data[size];}&& PERF_SAMPLE_RAW
> +        *
> +        *      { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK 

		{ u64	nr;
		  { u64 from, to, flags } brstack[nr]; } && PERF_SAMPLE_BRANCH_STACK

Perhaps? It looks like you lost a line somewhere; even your curly braces
are unmatched in a way that suggests a missing line.
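
For reference, a consumer could then walk that part of the sample
roughly like this (a sketch against the layout suggested above; the
struct name is illustrative, not part of any ABI):

#include <stdio.h>
#include <linux/types.h>

struct branch_entry {			/* mirrors { u64 from, to, flags } */
	__u64	from;
	__u64	to;
	__u64	flags;
};

/* 'p' points at the PERF_SAMPLE_BRANCH_STACK portion of a sample */
static const __u64 *walk_branch_stack(const __u64 *p)
{
	__u64 nr = *p++;
	const struct branch_entry *br = (const void *)p;
	__u64 i;

	for (i = 0; i < nr; i++)
		printf("%#llx -> %#llx flags=%#llx\n",
		       (unsigned long long)br[i].from,
		       (unsigned long long)br[i].to,
		       (unsigned long long)br[i].flags);

	return p + 3 * nr;		/* first word past the branch stack */
}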


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-10-14 12:37 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2) Stephane Eranian
@ 2011-12-05 21:10   ` Peter Zijlstra
  2011-12-05 21:37   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 21:10 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> + * When sampling the branch stack in system-wide, it may be necessary
> + * to flush the stack on context switch. This happens when the branch
> + * stack does not tag its entries with the pid of the current task.
> + * Otherwise it becomes impossible to associate a branch entry with a
> + * task. This ambiguity is more likely to appear when the branch stack
> + * supports priv level filtering and the user sets it to monitor only
> + * at the user level (which could be a useful measurement in system-wide
> + * mode). In that case, the risk is high of having a branch stack with
> + * branch from multiple tasks. Flushing may mean dropping the existing
> + * entries or stashing them somewhere in the PMU specific code layer. 

It doesn't need to tag stuff with a PID to solve that problem; making the
TOS a full 64-bit wide counter would work equally well: we'd simply record
the TOS value at context-switch time and discard everything prior to the
last switch-in.

But yeah, we need to flush this stuff under the current scheme.


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-10-14 12:37 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2) Stephane Eranian
  2011-12-05 21:10   ` Peter Zijlstra
@ 2011-12-05 21:37   ` Peter Zijlstra
  2011-12-07 18:25     ` Stephane Eranian
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 21:37 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> +               /*
> +                * check if the context has at least one
> +                * event using PERF_SAMPLE_BRANCH_STACK
> +                */
> +               if (cpuctx->ctx.nr_branch_stack > 0
> +                   && pmu->flush_branch_stack) {
> +
> +                       pmu = cpuctx->ctx.pmu;
> +
> +                       perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +                       perf_pmu_disable(pmu);
> +
> +                       pmu->flush_branch_stack();
> +
> +                       perf_pmu_enable(pmu);
> +
> +                       perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +               }
> +       } 

(the whitespace looks funny)

So all PMUs not supporting this branch stuff will fail to create a
has_branch_stack() event, right? Thus all ctx with !0 nr_branch_stack
support it. Doesn't this make the test for pmu->flush_branch_stack
redundant?




* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support (v2)
  2011-10-14 12:37 ` [PATCH 01/12] perf_events: add generic taken branch sampling support (v2) Stephane Eranian
  2011-12-05 21:06   ` Peter Zijlstra
@ 2011-12-05 22:14   ` Peter Zijlstra
  2011-12-06 19:27     ` Stephane Eranian
  1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 22:14 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> +#define PERF_SAMPLE_BRANCH_PLM_ALL \
> +       (PERF_SAMPLE_BRANCH_USER|\
> +        PERF_SAMPLE_BRANCH_KERNEL) 

This PLM thing keeps popping up all over, I'm sure it stands for
something, but for now it just hurts my eyes.


> +               /* at least one branch bit must be set */
> +               if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
> +                       return -EINVAL;
> 
Why? We can create counters with exclude_user && exclude_kernel as well.
I mean, they're useless, but it's perfectly valid.


* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 (v2)
  2011-10-14 12:37 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
@ 2011-12-05 22:29   ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 22:29 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> 
> This patch adds an internal software filter to complement
> the (optional) LBR hardware filter.
> 
> The software filter is necessary:
> - as a substitute when there is no HW LBR filter (e.g., Atom, Core)
> - to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
> - to provide finer grain filtering (e.g., all processors)
> 
> Sometimes, the LBR HW filter cannot distinguish between two types
> of branches. For instance, to capture syscalls as CALLs, it is necessary
> to enable the LBR_FAR filter, which will also capture JMP instructions.
> Thus, a second pass is necessary to filter those out; this is what the
> SW filter can do.
> 
> The SW filter is built on top of the internal x86 disassembler. It
> is a best-effort filter, especially for user-level code: it is subject
> to the availability of the text page of the program.
> 
> The SW filter is enabled on all Intel X86 processors. It is bypassed
> when the user is capturing all branches at all priv levels.

This patch is very seriously whitespace challenged. 


* Re: [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2)
  2011-10-14 12:37 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2) Stephane Eranian
@ 2011-12-05 22:35   ` Peter Zijlstra
  2011-12-07  4:22     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 22:35 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>  void intel_pmu_lbr_init_atom(void)
>  {
> +       /*
> +        * only models starting at stepping 10 seem
> +        * to have an operational LBR which can freeze
> +        * on PMU interrupt
> +        */
> +       if (boot_cpu_data.x86_mask < 10) {
> +               pr_cont("LBR disabled due to erratum");
> +               return;
> +       } 

Shouldn't that be a separate patch?


* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
                   ` (12 preceding siblings ...)
  2011-12-04 20:11 ` [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
@ 2011-12-05 22:39 ` Peter Zijlstra
  2011-12-06  9:49   ` Will Deacon
  13 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-05 22:39 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
> 
Other than the few comments given it all looks good. My main worry is
the Intel only aspect, I'd really love for there to be another platform
that could implement at least part of this.




* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-12-05 22:39 ` Peter Zijlstra
@ 2011-12-06  9:49   ` Will Deacon
  2011-12-06 11:03     ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Will Deacon @ 2011-12-06  9:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo, paulus, benh, rth, ralf, davem,
	lethal

On Mon, Dec 05, 2011 at 10:39:26PM +0000, Peter Zijlstra wrote:
> On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> > This patchset adds an important and useful new feature to
> > perf_events: branch stack sampling. In other words, the
> > ability to capture taken branches into each sample.
> > 
> Other than the few comments given it all looks good. My main worry is
> the Intel only aspect, I'd really love for there to be another platform
> that could implement at least part of this.

I discussed this with Stephane in Prague and, although it would be lovely to
have this on ARM, we simply don't have the hardware to do it. So the nature
of the series does seem to be x86-centric unless there's a way to do a
watered-down version in software.

Will


* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-12-06  9:49   ` Will Deacon
@ 2011-12-06 11:03     ` Peter Zijlstra
  2011-12-06 19:14       ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-06 11:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: Stephane Eranian, linux-kernel, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo, paulus, benh, rth, ralf, davem,
	lethal

On Tue, 2011-12-06 at 09:49 +0000, Will Deacon wrote:
> On Mon, Dec 05, 2011 at 10:39:26PM +0000, Peter Zijlstra wrote:
> > On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> > > This patchset adds an important and useful new feature to
> > > perf_events: branch stack sampling. In other words, the
> > > ability to capture taken branches into each sample.
> > > 
> > Other than the few comments given it all looks good. My main worry is
> > the Intel only aspect, I'd really love for there to be another platform
> > that could implement at least part of this.
> 
> I discussed this with Stephane in Prague and, although it would be lovely to
> have this on ARM, we simply don't have the hardware to do it. So the nature
> of the series does seem to be x86-centric unless there's a way to do a
> watered-down version in software.

The only way to do this in software would be like
CONFIG_PROFILE_ALL_BRANCHES and that's horrid (and kernel only).

There's more than Intel & ARM, of course, but it looks like PPC doesn't
have this either, and I suspect MIPS and SPARC don't either, which doesn't
leave us with much else.

So I guess we should just go ahead and merge this and hope more hardware
grows this feature in a compatible enough manner.


* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-12-06 11:03     ` Peter Zijlstra
@ 2011-12-06 19:14       ` Stephane Eranian
  2011-12-06 19:20         ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-12-06 19:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-kernel, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo, paulus, benh, rth, ralf, davem,
	lethal

On Tue, Dec 6, 2011 at 3:03 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2011-12-06 at 09:49 +0000, Will Deacon wrote:
>> On Mon, Dec 05, 2011 at 10:39:26PM +0000, Peter Zijlstra wrote:
>> > On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>> > > This patchset adds an important and useful new feature to
>> > > perf_events: branch stack sampling. In other words, the
>> > > ability to capture taken branches into each sample.
>> > >
>> > Other than the few comments given it all looks good. My main worry is
>> > the Intel only aspect, I'd really love for there to be another platform
>> > that could implement at least part of this.
>>
>> I discussed this with Stephane in Prague and, although it would be lovely to
>> have this on ARM, we simply don't have the hardware to do it. So the nature
>> of the series does seem to be x86-centric unless there's a way to do a
>> watered-down version in software.
>
> The only way to do this in software would be like
> CONFIG_PROFILE_ALL_BRANCHES and that's horrid (and kernel only).
>
> There's more than Intel & ARM, of course, but it looks like PPC doesn't
> have this either, and I suspect MIPS and SPARC don't either, which doesn't
> leave us with much else.
>
There is a hardware branch buffer on all Itanium processors.
You can find the description of the McKinley (Itanium2)
branch buffer implementation in section 10.3.9 of:
http://download.intel.com/design/Itanium2/manuals/25111003.pdf

> So I guess we should just go ahead and merge this and hope more hardware
> grows this feature in a compatible enough manner.


* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-12-06 19:14       ` Stephane Eranian
@ 2011-12-06 19:20         ` Peter Zijlstra
  2011-12-06 19:22           ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-06 19:20 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Will Deacon, linux-kernel, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo, paulus, benh, rth, ralf, davem,
	lethal

On Tue, 2011-12-06 at 11:14 -0800, Stephane Eranian wrote:

> >> > Other than the few comments given it all looks good. My main worry is
> >> > the Intel only aspect, I'd really love for there to be another platform
> >> > that could implement at least part of this.

> There is a hardware branch buffer on all Itanium processors.
> You can find the description of the McKinley (Itanium2)
> branch buffer implementation in section 10.3.9 of:
> http://download.intel.com/design/Itanium2/manuals/25111003.pdf


Yeah, I knew Itanic has it, but one, it's sinking (and doesn't have perf
support), and two, it's still Intel ;-)




* Re: [PATCH 00/12] perf_events: add support for sampling taken branches (v2)
  2011-12-06 19:20         ` Peter Zijlstra
@ 2011-12-06 19:22           ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-12-06 19:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Will Deacon, linux-kernel, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo, paulus, benh, rth, ralf, davem,
	lethal

On Tue, Dec 6, 2011 at 11:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2011-12-06 at 11:14 -0800, Stephane Eranian wrote:
>
>> >> > Other than the few comments given it all looks good. My main worry is
>> >> > the Intel only aspect, I'd really love for there to be another platform
>> >> > that could implement at least part of this.
>
>> There is a hardware branch buffer on all Itanium processors.
>> You can find the description of the McKinley (Itanium2)
>> branch buffer implementation in section 10.3.9 of:
>> http://download.intel.com/design/Itanium2/manuals/25111003.pdf
>
>
> Yeah, I knew Itanic has it, but one, it's sinking (and doesn't have perf
> support), and two, it's still Intel ;-)
>
It's just an example of a different implementation of a branch buffer.
That's why I mentioned it. The question is: could the branch_stack
abstraction map to it? I think it could.

>


* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support (v2)
  2011-12-05 22:14   ` Peter Zijlstra
@ 2011-12-06 19:27     ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-12-06 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Mon, Dec 5, 2011 at 2:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>> +#define PERF_SAMPLE_BRANCH_PLM_ALL \
>> +       (PERF_SAMPLE_BRANCH_USER|\
>> +        PERF_SAMPLE_BRANCH_KERNEL)
>
> This PLM thing keeps popping up all over, I'm sure it stands for
> something, but for now it just hurts my eyes.
>
>
>> +               /* at least one branch bit must be set */
>> +               if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
>> +                       return -EINVAL;
>>
> Why? We can create counters with exclude_user && exclude_kernel as well.
> I mean, they're useless, but it's perfectly valid.


I am fine with that change. I can drop this check.


* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support (v2)
  2011-12-05 21:06   ` Peter Zijlstra
@ 2011-12-06 19:42     ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-12-06 19:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Mon, Dec 5, 2011 at 1:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>> @@ -455,6 +483,8 @@ enum perf_event_type {
>>          *
>>          *      { u32                   size;
>>          *        char                  data[size];}&& PERF_SAMPLE_RAW
>> +        *
>> +        *      { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
>
>                { u64   nr;
>                  { u64 from, to, flags } brstack[nr]; } && PERF_SAMPLE_BRANCH_STACK
>
> Perhaps? It looks like you lost a line somewhere; even your curly braces
> are unmatched in a way that suggests a missing line.

Yes, that's the right way of describing the layout. Will fix that.


* Re: [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2)
  2011-12-05 22:35   ` Peter Zijlstra
@ 2011-12-07  4:22     ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2011-12-07  4:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Mon, Dec 5, 2011 at 2:35 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>>  void intel_pmu_lbr_init_atom(void)
>>  {
>> +       /*
>> +        * only models starting at stepping 10 seem
>> +        * to have an operational LBR which can freeze
>> +        * on PMU interrupt
>> +        */
>> +       if (boot_cpu_data.x86_mask < 10) {
>> +               pr_cont("LBR disabled due to erratum");
>> +               return;
>> +       }
>
> Shouldn't that be a separate patch?

I'll make it into a separate patch.


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-12-05 21:37   ` Peter Zijlstra
@ 2011-12-07 18:25     ` Stephane Eranian
  2011-12-08 10:49       ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-12-07 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Mon, Dec 5, 2011 at 1:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>> +               /*
>> +                * check if the context has at least one
>> +                * event using PERF_SAMPLE_BRANCH_STACK
>> +                */
>> +               if (cpuctx->ctx.nr_branch_stack > 0
>> +                   && pmu->flush_branch_stack) {
>> +
>> +                       pmu = cpuctx->ctx.pmu;
>> +
>> +                       perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +                       perf_pmu_disable(pmu);
>> +
>> +                       pmu->flush_branch_stack();
>> +
>> +                       perf_pmu_enable(pmu);
>> +
>> +                       perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +               }
>> +       }
>
> (the whitespace looks funny)
>
> So all PMUs not supporting this branch stuff will fail to create a
> has_branch_stack() event, right? Thus all ctx with !0 nr_branch_stack
> support it. Doesn't this make the test for pmu->flush_branch_stack
> redundant?
>
>
No, nr_branch_stack counts the number of active events with
branch_stack. It's like ctx->nr_cgroups. Processors which
do not support branch_stack will always have this field set to 0.
And a processor supporting branch_stack does not mean we always
need to call flush_branch_stack(); i.e., we use a lazy approach.
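
Concretely, the bookkeeping follows the same pattern as nr_cgroups;
roughly (a sketch, not the exact patch code):

	/* in list_add_event() */
	if (has_branch_stack(event))
		ctx->nr_branch_stack++;

	/* in list_del_event() */
	if (has_branch_stack(event))
		ctx->nr_branch_stack--;

That way the context switch path only pays for the flush when at
least one such event exists in the context.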


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-12-07 18:25     ` Stephane Eranian
@ 2011-12-08 10:49       ` Peter Zijlstra
  2011-12-08 18:04         ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-08 10:49 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Wed, 2011-12-07 at 10:25 -0800, Stephane Eranian wrote:
> On Mon, Dec 5, 2011 at 1:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
> >> +               /*
> >> +                * check if the context has at least one
> >> +                * event using PERF_SAMPLE_BRANCH_STACK
> >> +                */
> >> +               if (cpuctx->ctx.nr_branch_stack > 0
> >> +                   && pmu->flush_branch_stack) {
> >> +
> >> +                       pmu = cpuctx->ctx.pmu;
> >> +
> >> +                       perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> >> +
> >> +                       perf_pmu_disable(pmu);
> >> +
> >> +                       pmu->flush_branch_stack();
> >> +
> >> +                       perf_pmu_enable(pmu);
> >> +
> >> +                       perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> >> +               }
> >> +       }
> >
> > (the whitespace looks funny)
> >
> > So all PMUs not supporting this branch stuff will fail to create a
> > has_branch_stack() event, right? Thus all ctx with !0 nr_branch_stack
> > support it. Doesn't this make the test for pmu->flush_branch_stack
> > redundant?
> >
> >
> No, nr_branch_stack counts the number of active events with
> branch_stack. It's like ctx->nr_cgroups. Processors which
> do not support branch_stack will always have this field set to 0.
> And a processor supporting branch_stack does not mean we always
> need to call flush_branch_stack(); i.e., we use a lazy approach.

What you're saying is we can support branch stack and not need
flush_branch_stack()? Say, in the case where the x86 LBR TOS field
were a full u64 counter; then we could sample the TOS on context
switch and filter on that, obviating the hard reset we do now.

And the advantage of testing for the operation as opposed to putting in
a dummy function (like we do for most other optional methods) is
avoiding all that ctx_lock and pmu_disable muck.

Fair enough.


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-12-08 10:49       ` Peter Zijlstra
@ 2011-12-08 18:04         ` Stephane Eranian
  2011-12-08 18:13           ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-12-08 18:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Thu, Dec 8, 2011 at 2:49 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2011-12-07 at 10:25 -0800, Stephane Eranian wrote:
>> On Mon, Dec 5, 2011 at 1:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, 2011-10-14 at 14:37 +0200, Stephane Eranian wrote:
>> >> +               /*
>> >> +                * check if the context has at least one
>> >> +                * event using PERF_SAMPLE_BRANCH_STACK
>> >> +                */
>> >> +               if (cpuctx->ctx.nr_branch_stack > 0
>> >> +                   && pmu->flush_branch_stack) {
>> >> +
>> >> +                       pmu = cpuctx->ctx.pmu;
>> >> +
>> >> +                       perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> >> +
>> >> +                       perf_pmu_disable(pmu);
>> >> +
>> >> +                       pmu->flush_branch_stack();
>> >> +
>> >> +                       perf_pmu_enable(pmu);
>> >> +
>> >> +                       perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> >> +               }
>> >> +       }
>> >
>> > (the whitespace looks funny)
>> >
>> > So all PMUs not supporting this branch stuff will fail to create a
>> > has_branch_stack() event, right? Thus all ctx with !0 nr_branch_stack
>> > support it. Doesn't this make the test for pmu->flush_branch_stack
>> > redundant?
>> >
>> >
>> No, nr_branch_stack counts the number of active events with
>> branch_stack. It's like ctx->nr_cgroups. Processors which
>> do not support branch_stack will always have this field set to 0.
>> And a processor supporting branch_stack does not mean we always
>> need to call flush_branch_stack(); i.e., we use a lazy approach.
>
> What you're saying is we can support branch stack and not need
> flush_branch_stack()? Say, in the case where the x86 LBR TOS field
> were a full u64 counter; then we could sample the TOS on context
> switch and filter on that, obviating the hard reset we do now.
>
The whole motivation behind flush_branch_stack is explained in the
changelog of the patch. In summary, we need to flush the LBR (regardless
of TOS) because in system-wide mode we need to be able to associate the
content of the LBR with a specific task. Given that the HW does not
capture the PID in the LBR buffer, the kernel has to intervene. Why don't
we have this problem already? Because today we capture at all priv levels.
But with this patchset, it becomes possible to filter taken branches based
on priv levels. Thus, if you only sample at the user level and run in
system-wide mode, it is more likely you could end up with branches
belonging to two different tasks in the LBR buffer. And you'd have no way
of determining this just by looking at the content of the buffer. So
instead, we need to flush the LBR on context switch to associate a PID
with them.

Because this is an expensive operation, we want to do this only when we
sample on LBR. That's what the ctx->nr_branch_stack is about. We could
refine that some more by checking for system-wide events with only
user priv level on the branch stack. But I did not do that yet.

Does this make more sense now?

> And the advantage of testing for the operation as opposed to putting in
> a dummy function (like we do for most other optional methods) is
> avoiding all that ctx_lock and pmu_disable muck.
>
> Fair enough.


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-12-08 18:04         ` Stephane Eranian
@ 2011-12-08 18:13           ` Peter Zijlstra
  2011-12-08 22:06             ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-08 18:13 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Thu, 2011-12-08 at 10:04 -0800, Stephane Eranian wrote:
> The whole motivation behind flush_branch_stack is explained in the
> changelog of the patch. In summary, we need to flush the LBR (regardless
> of TOS) because in system-wide mode we need to be able to associate the
> content of the LBR with a specific task. Given that the HW does not
> capture the PID in the LBR buffer, the kernel has to intervene.

That's not regardless of the TOS. If the TOS was a full u64 you wouldn't
need the TID (which would be good, since the hardware has no such
concept).

> Why don't we have this problem already?
> Because today we capture at all priv levels. But with this patchset, it becomes
> possible to filter taken branches based on priv levels. Thus, if you only sample
> at the user level and run in system-wide mode, it is more likely you could end
> up with branches belonging to two different tasks in the LBR buffer. And you'd
> have no way of determining this just by looking at the content of the buffer.
> So instead, we need to flush the LBR on context switch to associate a PID
> with them.

Yeah, I get that.

> Because this is an expensive operation, we want to do this only when we
> sample on LBR. That's what the ctx->nr_branch_stack is about. We could
> refine that some more by checking for system-wide events with only
> user priv level on the branch stack. But I did not do that yet.
> 
> Does this make more sense now? 

It already did. The only thing I wanted to do was get rid of that method
check. Initially I overlooked the fact that it's optional, even if you
support the branch stack. My reply from today argued for it, since
installing a dummy method would still have the needless ctx_lock &&
pmu_disable overhead.




* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-12-08 18:13           ` Peter Zijlstra
@ 2011-12-08 22:06             ` Stephane Eranian
  2011-12-09  9:00               ` Peter Zijlstra
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2011-12-08 22:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Thu, Dec 8, 2011 at 10:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2011-12-08 at 10:04 -0800, Stephane Eranian wrote:
>> The whole motivation behind flush_branch_stack is explained in the
>> changelog of the patch. In summary, we need to flush the LBR (regardless
>> of TOS) because in system-wide mode we need to be able to associate the
>> content of the LBR with a specific task. Given that the HW does not
>> capture the PID in the LBR buffer, the kernel has to intervene.
>
> That's not regardless of the TOS. If the TOS was a full u64 you wouldn't
> need the TID (which would be good, since the hardware has no such
> concept).
>
Maybe I missed the trick, but I don't quite see how a 64-bit TOS would
solve the TID problem. It's not about the wraparound issue, i.e., not
like the sampling buffer indexes. Could you describe the trick again?

>> Why don't we have this problem already?
>> Because today we capture at all priv levels. But with this patchset, it becomes
>> possible to filter taken branches based on priv levels. Thus, if you only sample
>> at the user level and run in system-wide mode, it is more likely you could end
>> up with branches belonging to two different tasks in the LBR buffer. And you'd
>> have no way of determining this just by looking at the content of the buffer.
>> So instead, we need to flush the LBR on context switch to associate a PID
>> with them.
>
> Yeah, I get that.
>
>> Because this is an expensive operation, we want to do this only when we
>> sample on LBR. That's what the ctx->nr_branch_stack is about. We could
>> refine that some more by checking for system-wide events with only
>> user priv level on the branch stack. But I did not do that yet.
>>
>> Does this make more sense now?
>
> It already did. The only thing I wanted to do was get rid of that method
> check. Initially I overlooked the fact that it's optional, even if you
> support the branch stack. My reply from today argued for it, since
> installing a dummy method would still have the needless ctx_lock &&
> pmu_disable overhead.
>
>


* Re: [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2)
  2011-12-08 22:06             ` Stephane Eranian
@ 2011-12-09  9:00               ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2011-12-09  9:00 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter,
	ravitillo, will.deacon, paulus, benh, rth, ralf, davem, lethal

On Thu, 2011-12-08 at 14:06 -0800, Stephane Eranian wrote:
> > That's not regardless of the TOS. If the TOS was a full u64 you wouldn't
> > need the TID (which would be good, since the hardware has no such
> > concept).
> >
> Maybe I missed the trick, but I don't quite see how a 64-bit TOS would
> solve the TID problem. It's not about the wraparound issue, i.e., not
> like the sampling buffer indexes. Could you describe the trick again?

LBR 0
.
.
.        <-- TOS % n
.
LBR n-1


So the LBR is an array of n entries which is written to in a cyclic
fashion. The Top-Of-Stack or TOS indicates the last written entry and we
can read n entries backwards from there.

Something like:

  tos = rdmsr(lbr_tos);
  for (i = 0; i < n; i++) {
	idx = (tos - i) % n;

	from = rdmsr(lbr_from + idx);
	to   = rdmsr(lbr_to   + idx);
  }

Now the hardware keeps (TOS % n) by limiting the bits in the counter
(n = 2^m etc.). If it didn't do that, we could sample the TOS on ctxsw
and modify the read to:

  tos = rdmsr(lbr_tos)
  for (i = 0; i < n && (tos - i) > ctxsw_tos; i++) {
	idx = (tos - i) % n;

	...
  }

This would ensure we never read back past the context-switch. But we
need the extra bits for this to work, since with the current limited
(TOS % n) bits we get into trouble as soon as the ctxsw was more than n
branches ago (which is very likely).

[ With 16 bits we'd get into trouble when the ctxsw was 65536 branches
ago, which is still quite possible; at 32 bits we'd need 4G branches,
which is rather unlikely; with 64 bits the sun will have died first.. ]

Also note we can apply this extra condition only when the event is a
task event, so that the cpu events always consume all n entries.
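
Something like this, on top of the loop above (a sketch; the
PERF_ATTACH_TASK test is just one way to tell task events apart):

  stop = (event->attach_state & PERF_ATTACH_TASK) ?
	 ctxsw_tos : tos - n;

  for (i = 0; i < n && (tos - i) > stop; i++) {
	idx = (tos - i) % n;

	...
  }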




end of thread, other threads:[~2011-12-09  9:01 UTC | newest]

Thread overview: 36+ messages
2011-10-14 12:37 [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 01/12] perf_events: add generic taken branch sampling support (v2) Stephane Eranian
2011-12-05 21:06   ` Peter Zijlstra
2011-12-06 19:42     ` Stephane Eranian
2011-12-05 22:14   ` Peter Zijlstra
2011-12-06 19:27     ` Stephane Eranian
2011-10-14 12:37 ` [PATCH 02/12] perf_events: add Intel LBR MSR definitions (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v2) Stephane Eranian
2011-12-05 22:35   ` Peter Zijlstra
2011-12-07  4:22     ` Stephane Eranian
2011-10-14 12:37 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
2011-12-05 22:29   ` Peter Zijlstra
2011-10-14 12:37 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch (v2) Stephane Eranian
2011-12-05 21:10   ` Peter Zijlstra
2011-12-05 21:37   ` Peter Zijlstra
2011-12-07 18:25     ` Stephane Eranian
2011-12-08 10:49       ` Peter Zijlstra
2011-12-08 18:04         ` Stephane Eranian
2011-12-08 18:13           ` Peter Zijlstra
2011-12-08 22:06             ` Stephane Eranian
2011-12-09  9:00               ` Peter Zijlstra
2011-10-14 12:37 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 11/12] perf: add support for sampling taken branch to perf record (v2) Stephane Eranian
2011-10-14 12:37 ` [PATCH 12/12] perf: add support for taken branch sampling to perf report (v2) Stephane Eranian
2011-12-04 20:11 ` [PATCH 00/12] perf_events: add support for sampling taken branches (v2) Stephane Eranian
2011-12-05 15:27   ` Peter Zijlstra
2011-12-05 22:39 ` Peter Zijlstra
2011-12-06  9:49   ` Will Deacon
2011-12-06 11:03     ` Peter Zijlstra
2011-12-06 19:14       ` Stephane Eranian
2011-12-06 19:20         ` Peter Zijlstra
2011-12-06 19:22           ` Stephane Eranian
