* [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
@ 2012-01-09 16:49 Stephane Eranian
  2012-01-09 16:49 ` [PATCH 01/13] perf_events: add generic taken branch sampling support (v3) Stephane Eranian
                   ` (14 more replies)
  0 siblings, 15 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

This patchset adds an important and useful new feature to
perf_events: branch stack sampling. In other words, the
ability to capture taken branches into each sample.

Statistical sampling of taken branches should not be confused
with branch tracing. Not all branches are necessarily captured.

Sampling taken branches is important for basic block profiling,
statistical call graphs, and function call counts. Many of those
measurements can help drive a compiler optimizer.

The branch stack is a software abstraction which sits on top
of the PMU hardware. As such, it is not available on all
processors. For now, the patch provides the generic interface
and the Intel X86 implementation where it leverages the Last
Branch Record (LBR) feature (from Core2 to SandyBridge).

Branch stack sampling is supported for both per-thread and
system-wide modes.

It is possible to filter the type and privilege level of branches
to sample. The target of the branch is used to determine
the privilege level.

For each branch, the source and destination are captured. On
some hardware platforms, it may be possible to also extract
the target prediction and, in that case, it is also exposed
to end users.

The branch stack can record a variable number of taken
branches per sample. Those branches are always consecutive
in time. The number of branches captured depends on the
filtering and the underlying hardware. On Intel Nehalem
and later, up to 16 consecutive branches can be captured
per sample.

Branch sampling is always coupled with an event. It can
be any PMU event but it can't be a SW or tracepoint event.

Branch sampling is requested by setting a new sample_type
flag called: PERF_SAMPLE_BRANCH_STACK.

To support branch filtering, we introduce a new field
to the perf_event_attr struct: branch_sample_type. We chose
NOT to overload the config1, config2 fields because those
are related to the event encoding. Branch stack is a
separate feature which is combined with the event.

The branch_sample_type is a bitmask of possible filters.
The following filters are defined (more can be added):
- PERF_SAMPLE_BRANCH_ANY       : any control flow change
- PERF_SAMPLE_BRANCH_USER      : capture branches when target is at user level
- PERF_SAMPLE_BRANCH_KERNEL    : capture branches when target is at kernel level
- PERF_SAMPLE_BRANCH_ANY_CALL  : capture call branches (incl. syscalls)
- PERF_SAMPLE_BRANCH_ANY_RETURN: capture return branches (incl. syscall returns)
- PERF_SAMPLE_BRANCH_IND_CALL  : capture indirect calls

It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.

When the privilege level is not specified, the branch stack
inherits that of the associated event.
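
As an illustration of how the filter bits combine, here is a minimal
user-space sketch. The EX_* values are local copies of the bits proposed
in patch 01 of this series (where the return filter is spelled
PERF_SAMPLE_BRANCH_ANY_RETURN), and ex_build_branch_mask() is a
hypothetical helper, not part of the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the filter bits proposed in patch 01; not a system header. */
enum {
	EX_BRANCH_USER       = 1U << 0,
	EX_BRANCH_KERNEL     = 1U << 1,
	EX_BRANCH_ANY        = 1U << 2,
	EX_BRANCH_ANY_CALL   = 1U << 3,
	EX_BRANCH_ANY_RETURN = 1U << 4,
	EX_BRANCH_IND_CALL   = 1U << 5,
};

#define EX_BRANCH_PLM_ALL (EX_BRANCH_USER | EX_BRANCH_KERNEL)

/*
 * Hypothetical helper: combine branch-type bits with privilege-level bits.
 * Passing priv == 0 leaves the privilege level unspecified, in which case
 * the branch stack inherits the privilege level of the associated event.
 */
static uint64_t ex_build_branch_mask(uint64_t types, uint64_t priv)
{
	/* types must carry no privilege bits */
	assert((types & EX_BRANCH_PLM_ALL) == 0);
	/* priv must carry only privilege bits */
	assert((priv & ~(uint64_t)EX_BRANCH_PLM_ALL) == 0);
	return types | priv;
}
```

For the IND_CALL|USER|KERNEL combination mentioned above,
ex_build_branch_mask(EX_BRANCH_IND_CALL, EX_BRANCH_PLM_ALL) yields 0x23.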

Some processors may not offer hardware branch filtering, e.g., Intel
Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
X86 implementation in this patchset also provides a SW branch filter
which works on a best effort basis. It can compensate for the lack
of LBR filtering. But first and foremost, it helps work around LBR
filtering errata. The goal is to only capture the type of branches
requested by the user.

It is possible to combine branch stack sampling with PEBS on Intel
X86 processors. Depending on the precise_sampling mode, there are
certain filtering restrictions. When precise_sampling=1, there
are no filtering restrictions. When precise_sampling > 1, only
the ANY|USER|KERNEL filter can be used. This comes from
the fact that the kernel uses LBR to compensate for the PEBS
off-by-1 skid on the instruction pointer.

To demonstrate how the perf_event branch stack sampling interface
works, the patchset also modifies perf record to capture taken
branches. Similarly perf report is enhanced to display a histogram
of taken branches.

I would like to thank Roberto Vitillo @ LBL for his work on the perf
tool for this.

Enough talking, let's take a simple example. Our trivial test program
goes like this:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

$ perf record -b any branchy
$ perf report -b
# Events: 23K cycles
#
# Overhead  Source Symbol     Target Symbol
# ........  ................  ................

    18.13%  [.] f1            [.] main                          
    18.10%  [.] main          [.] main                          
    18.01%  [.] main          [.] f1                            
    15.69%  [.] f1            [.] f1                            
     9.11%  [.] f3            [.] f1                            
     6.78%  [.] f1            [.] f3                            
     6.74%  [.] f1            [.] f2                            
     6.71%  [.] f2            [.] f1                            

Of the total number of branches captured, 18.13% were from f1() -> main().

Let's make this clearer by filtering the user call branches only:

$ perf record -b any_call -e cycles:u branchy
$ perf report -b
# Events: 19K cycles
#
# Overhead  Source Symbol              Target Symbol
# ........  .........................  .........................
#
    52.50%  [.] main                   [.] f1                   
    23.99%  [.] f1                     [.] f3                   
    23.48%  [.] f1                     [.] f2                   
     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
     0.01%  [k] _start                 [k] __libc_start_main    

Now it is more obvious: 52% of all the captured branches were calls from main() -> f1().
The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
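
The expected split can be checked with a small counting version of the
test program. This is only a sketch: the struct and function names are
made up for the illustration, and it counts calls directly instead of
sampling them.

```c
#include <assert.h>

/* Exact call counts for the three call edges of the test program. */
struct call_counts {
	unsigned long main_to_f1;
	unsigned long f1_to_f2;
	unsigned long f1_to_f3;
};

/* Replays the loop in main() and the odd/even dispatch done by f1(). */
static struct call_counts count_calls(unsigned long n)
{
	struct call_counts c = { 0, 0, 0 };
	unsigned long i;

	for (i = 0; i < n; i++) {
		c.main_to_f1++;
		if (i & 1UL)
			c.f1_to_f2++;
		else
			c.f1_to_f3++;
	}
	return c;
}

/* Share of one edge, in percent of all counted calls. */
static double share(unsigned long part, const struct call_counts *c)
{
	unsigned long total = c->main_to_f1 + c->f1_to_f2 + c->f1_to_f3;

	assert(total > 0);
	return 100.0 * (double)part / (double)total;
}
```

With an even n, the exact shares are 50% for main() -> f1() and 25% each
for f1() -> f2() and f1() -> f3(). The sampled figures above (52.50%,
23.48%, 23.99%) are statistical estimates of those values.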


Here is a kernel example, where we want to sample indirect calls:
$ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10 
$ perf report -b
#
# Overhead  Source Symbol               Target Symbol
# ........  ..........................  ..........................
#
    36.36%  [k] __delay                 [k] delay_tsc             
     9.09%  [k] ktime_get               [k] read_tsc              
     9.09%  [k] getnstimeofday          [k] read_tsc              
     9.09%  [k] notifier_call_chain     [k] tick_notify           
     4.55%  [k] cpuidle_idle_call       [k] intel_idle            
     4.55%  [k] cpuidle_idle_call       [k] menu_reflect          
     2.27%  [k] handle_irq              [k] handle_edge_irq       
     2.27%  [k] ack_apic_edge           [k] native_apic_mem_write 
     2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt     
     2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn     
     2.27%  [k] enqueue_task            [k] enqueue_task_rt       
     2.27%  [k] try_to_wake_up          [k] select_task_rq_rt     
     2.27%  [k] do_timer                [k] read_tsc              

Due to HW limitations, branch filtering may be approximate on
Core and Atom processors. It is more accurate on Nehalem and
Westmere, and best on Sandy Bridge.

In version 2, we've updated the patch to tip/master (commit 5734857) and
we've incorporated the feedback from v1 concerning the anonymous bitfield
struct for branch_stack_entry and the handling of i386 ABI binaries
on a 64-bit host in the instruction decoder for the LBR SW filter.

In version 3, we've updated to 3.2.0-tip. The Atom revision
check has been put into its own patch. We fixed a browser
issue with perf report. We fixed all the style issues as well.

Signed-off-by: Stephane Eranian <eranian@google.com>
---

Roberto Agostino Vitillo (3):
  perf: add code to support PERF_SAMPLE_BRANCH_STACK
  perf: add support for sampling taken branch to perf record
  perf: add support for taken branch sampling to perf report

Stephane Eranian (10):
  perf_events: add generic taken branch sampling support (v3)
  perf_events: add Intel LBR MSR definitions
  perf_events: add Intel X86 LBR sharing logic
  perf_events: sync branch stack sampling with X86 precise_sampling
  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
  perf_events: disable LBR support for older Intel Atom processors
  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
  perf_events: add LBR software filter support for Intel X86
  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
  perf_events: add hook to flush branch_stack on context switch

 arch/alpha/kernel/perf_event.c             |    4 +
 arch/arm/kernel/perf_event.c               |    4 +
 arch/mips/kernel/perf_event_mipsxx.c       |    4 +
 arch/powerpc/kernel/perf_event.c           |    4 +
 arch/sh/kernel/perf_event.c                |    4 +
 arch/sparc/kernel/perf_event.c             |    4 +
 arch/x86/include/asm/msr-index.h           |    7 +
 arch/x86/kernel/cpu/perf_event.c           |   47 +++-
 arch/x86/kernel/cpu/perf_event.h           |   19 +
 arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
 arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  525 ++++++++++++++++++++++++++--
 include/linux/perf_event.h                 |   78 ++++-
 kernel/events/core.c                       |  167 +++++++++
 kernel/events/hw_breakpoint.c              |    6 +
 tools/perf/Documentation/perf-record.txt   |   18 +
 tools/perf/Documentation/perf-report.txt   |    7 +
 tools/perf/builtin-record.c                |   69 ++++
 tools/perf/builtin-report.c                |   95 +++++-
 tools/perf/perf.h                          |   18 +
 tools/perf/util/annotate.c                 |    2 +-
 tools/perf/util/event.h                    |    1 +
 tools/perf/util/evsel.c                    |   14 +
 tools/perf/util/hist.c                     |   93 ++++-
 tools/perf/util/hist.h                     |    7 +
 tools/perf/util/session.c                  |   72 ++++
 tools/perf/util/session.h                  |    4 +
 tools/perf/util/sort.c                     |  361 ++++++++++++++-----
 tools/perf/util/sort.h                     |    5 +
 tools/perf/util/symbol.h                   |   13 +
 31 files changed, 1601 insertions(+), 196 deletions(-)


* [PATCH 01/13] perf_events: add generic taken branch sampling support (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  4:46   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 02/13] perf_events: add Intel LBR MSR definitions (v3) Stephane Eranian
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis, for instance basic block profiling, call
counts, and statistical call graphs.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel X86, it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_BRANCH_USER
  only branches whose targets are at the user level

- PERF_SAMPLE_BRANCH_KERNEL
  only branches whose targets are at the kernel level

- PERF_SAMPLE_BRANCH_ANY
  any type of branches (subject to priv levels filters)

- PERF_SAMPLE_BRANCH_ANY_CALL
  any call branches (may incl. syscall on some arch)

- PERF_SAMPLE_BRANCH_ANY_RETURN
  any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_BRANCH_IND_CALL
  indirect call branches

Filters may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event is used. It
is possible to collect branches at a priv level different from that
of the associated event.
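
The mask checks and the priv level fallback can be sketched in user
space as follows. This mirrors the logic of the perf_copy_attr() hunk
in this patch, minus the permission check for kernel level capture.
The EX_* names and ex_resolve_branch_mask() are hypothetical stand-ins
used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the bits defined by this patch; not a system header. */
enum {
	EX_BRANCH_USER   = 1U << 0,
	EX_BRANCH_KERNEL = 1U << 1,
	EX_BRANCH_ANY    = 1U << 2,
	EX_BRANCH_MAX    = 1U << 6,
};

#define EX_BRANCH_PLM_ALL (EX_BRANCH_USER | EX_BRANCH_KERNEL)

/*
 * Returns 0 and stores the resolved mask in *out, or -1 if the mask
 * would be rejected. exclude_user and exclude_kernel stand for the
 * attribute bits of the associated event.
 */
static int ex_resolve_branch_mask(uint64_t mask, bool exclude_user,
				  bool exclude_kernel, uint64_t *out)
{
	assert(out != NULL);

	/* only defined bits may be used */
	if (mask & ~(uint64_t)(EX_BRANCH_MAX - 1))
		return -1;

	/* at least one branch type bit must be set */
	if (!(mask & ~(uint64_t)EX_BRANCH_PLM_ALL))
		return -1;

	/* no priv level given: inherit it from the event */
	if (!(mask & EX_BRANCH_PLM_ALL)) {
		if (!exclude_kernel)
			mask |= EX_BRANCH_KERNEL;
		if (!exclude_user)
			mask |= EX_BRANCH_USER;
	}
	*out = mask;
	return 0;
}
```

For a user-only event (exclude_kernel set) with a mask of just
EX_BRANCH_ANY, the resolved mask is EX_BRANCH_ANY | EX_BRANCH_USER.
A mask that already carries a priv level bit is left unchanged.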

The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, and the executed code. Therefore
each sample records the number of taken branches it contains.
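
A consumer therefore reads the count first and then that many records.
Below is a minimal sketch of such a reader, assuming the branch stack
portion of a sample is laid out as a u64 count followed by that many
{from, to, flags} triples, as described in this patch. The struct and
function names are hypothetical, and the flags word is copied as is
without decoding the mispred/predicted bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One taken branch record: three 64-bit words. */
struct ex_branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t flags;
};

/*
 * Copies the branch records found in buf into out[].
 * Returns the number of records, or -1 if the buffer is too short
 * for the count it announces or out[] cannot hold them all.
 */
static long ex_parse_branch_stack(const unsigned char *buf, size_t len,
				  struct ex_branch_entry *out, size_t max)
{
	uint64_t nr;
	size_t i;

	assert(buf != NULL && out != NULL);

	if (len < sizeof(nr))
		return -1;
	memcpy(&nr, buf, sizeof(nr));		/* nr is always present */

	if (nr > (len - sizeof(nr)) / sizeof(*out))
		return -1;			/* truncated sample */
	if (nr > max)
		return -1;			/* caller's array too small */

	for (i = 0; i < nr; i++)
		memcpy(&out[i], buf + sizeof(nr) + i * sizeof(*out),
		       sizeof(*out));
	return (long)nr;
}
```

A sample taken with no branch data still carries the count, so the
reader returns 0 in that case rather than failing.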

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 +++++---
 include/linux/perf_event.h                 |   66 ++++++++++++++++++++++++++--
 kernel/events/core.c                       |   58 ++++++++++++++++++++++++
 3 files changed, 133 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 3fab3de..c3f8100 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -144,9 +144,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		cpuc->lbr_entries[i].from  = msr_lastbranch.from;
-		cpuc->lbr_entries[i].to    = msr_lastbranch.to;
-		cpuc->lbr_entries[i].flags = 0;
+		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
+		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
+		cpuc->lbr_entries[i].mispred	= 0;
+		cpuc->lbr_entries[i].predicted	= 0;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
@@ -167,19 +169,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		unsigned long lbr_idx = (tos - i) & mask;
-		u64 from, to, flags = 0;
+		u64 from, to, mis = 0, pred = 0;
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
 		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
 
 		if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
-			flags = !!(from & LBR_FROM_FLAG_MISPRED);
+			mis = !!(from & LBR_FROM_FLAG_MISPRED);
+			pred = !mis;
 			from = (u64)((((s64)from) << 1) >> 1);
 		}
 
-		cpuc->lbr_entries[i].from  = from;
-		cpuc->lbr_entries[i].to    = to;
-		cpuc->lbr_entries[i].flags = flags;
+		cpuc->lbr_entries[i].from	= from;
+		cpuc->lbr_entries[i].to		= to;
+		cpuc->lbr_entries[i].mispred	= mis;
+		cpuc->lbr_entries[i].predicted	= pred;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0b91db2..17751b1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -129,11 +129,38 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_BRANCH_STACK		= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
+ * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
+ *
+ * If the user does not pass priv level information via branch_sample_type,
+ * the kernel uses the event's priv level. Branch and event priv levels do
+ * not have to match. Branch priv level is checked for permissions.
+ *
+ * The branch types can be combined, however BRANCH_ANY covers all types
+ * of branches and therefore it supersedes all the other types.
+ */
+enum perf_branch_sample_type {
+	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user level branches */
+	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel level branches */
+
+	PERF_SAMPLE_BRANCH_ANY		= 1U << 2, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 3, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 4, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 5, /* indirect calls */
+
+	PERF_SAMPLE_BRANCH_MAX		= 1U << 6,/* non-ABI */
+};
+
+#define PERF_SAMPLE_BRANCH_PLM_ALL \
+	(PERF_SAMPLE_BRANCH_USER|\
+	 PERF_SAMPLE_BRANCH_KERNEL)
+
+/*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
  *
@@ -240,6 +267,7 @@ struct perf_event_attr {
 		__u64		bp_len;
 		__u64		config2; /* extension of config1 */
 	};
+	__u64	branch_sample_type; /* enum branch_sample_type */
 };
 
 /*
@@ -458,6 +486,8 @@ enum perf_event_type {
 	 *
 	 *	{ u32			size;
 	 *	  char                  data[size];}&& PERF_SAMPLE_RAW
+	 *
+	 *	{ u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -530,12 +560,31 @@ struct perf_raw_record {
 	void				*data;
 };
 
+/*
+ * single taken branch record layout:
+ *
+ *      from: source instruction (may not always be a branch insn)
+ *        to: branch target
+ *   mispred: branch target was mispredicted
+ * predicted: branch target was predicted
+ *
+ * support for mispred, predicted is optional. In case it
+ * is not supported mispred = predicted = 0.
+ */
 struct perf_branch_entry {
-	__u64				from;
-	__u64				to;
-	__u64				flags;
+	__u64	from;
+	__u64	to;
+	__u64	mispred:1,  /* target mispredicted */
+		predicted:1,/* target predicted */
+		reserved:62;
 };
 
+/*
+ * branch stack layout:
+ *  nr: number of taken branches stored in entries[]
+ *
+ * Note that nr can vary from sample to sample
+ */
 struct perf_branch_stack {
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
@@ -566,7 +615,9 @@ struct hw_perf_event {
 			unsigned long	event_base;
 			int		idx;
 			int		last_cpu;
+
 			struct hw_perf_event_extra extra_reg;
+			struct hw_perf_event_extra branch_reg;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;
@@ -1003,12 +1054,14 @@ struct perf_sample_data {
 	u64				period;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_branch_stack	*br_stack;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
 {
 	data->addr = addr;
 	data->raw  = NULL;
+	data->br_stack = NULL;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
@@ -1147,6 +1200,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
 # define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
+static inline bool has_branch_stack(struct perf_event *event)
+{
+	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 91fb68a..ed39225 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3877,6 +3877,24 @@ void perf_output_sample(struct perf_output_handle *handle,
 			}
 		}
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		if (data->br_stack) {
+			size_t size;
+
+			size = data->br_stack->nr
+			     * sizeof(struct perf_branch_entry);
+
+			perf_output_put(handle, data->br_stack->nr);
+			perf_output_copy(handle, data->br_stack->entries, size);
+		} else {
+			/*
+			 * we always store at least the value of nr
+			 */
+			u64 nr = 0;
+			perf_output_put(handle, nr);
+		}
+	}
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
@@ -3919,6 +3937,15 @@ void perf_prepare_sample(struct perf_event_header *header,
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		int size = sizeof(u64); /* nr */
+		if (data->br_stack) {
+			size += data->br_stack->nr
+			      * sizeof(struct perf_branch_entry);
+		}
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -5898,6 +5925,37 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
 		return -EINVAL;
 
+	if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 mask = attr->branch_sample_type;
+
+		/* only using defined bits */
+		if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
+			return -EINVAL;
+
+		/* at least one branch bit must be set */
+		if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
+			return -EINVAL;
+
+		/* kernel level capture */
+		if ((mask & PERF_SAMPLE_BRANCH_KERNEL)
+		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		/* propagate priv level, when not set for branch */
+		if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
+
+			/* exclude_kernel checked on syscall entry */
+			if (!attr->exclude_kernel)
+				mask |= PERF_SAMPLE_BRANCH_KERNEL;
+
+			if (!attr->exclude_user)
+				mask |= PERF_SAMPLE_BRANCH_USER;
+			/*
+			 * adjust user setting (for HW filter setup)
+			 */
+			attr->branch_sample_type = mask;
+		}
+	}
 out:
 	return ret;
 
-- 
1.7.1



* [PATCH 02/13] perf_events: add Intel LBR MSR definitions (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
  2012-01-09 16:49 ` [PATCH 01/13] perf_events: add generic taken branch sampling support (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  5:03   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 03/13] perf_events: add Intel X86 LBR sharing logic (v3) Stephane Eranian
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architected LBR MSRs:
LBR_SELECT, LBR_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/msr-index.h           |    7 +++++++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   18 +++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a6962d9..ccb8059 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -56,6 +56,13 @@
 #define MSR_OFFCORE_RSP_0		0x000001a6
 #define MSR_OFFCORE_RSP_1		0x000001a7
 
+#define MSR_LBR_SELECT			0x000001c8
+#define MSR_LBR_TOS			0x000001c9
+#define MSR_LBR_NHM_FROM		0x00000680
+#define MSR_LBR_NHM_TO			0x000006c0
+#define MSR_LBR_CORE_FROM		0x00000040
+#define MSR_LBR_CORE_TO			0x00000060
+
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index c3f8100..e14431f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -205,23 +205,23 @@ void intel_pmu_lbr_read(void)
 void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
 
 void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x680;
-	x86_pmu.lbr_to     = 0x6c0;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
 }
 
 void intel_pmu_lbr_init_atom(void)
 {
 	x86_pmu.lbr_nr	   = 8;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
-- 
1.7.1



* [PATCH 03/13] perf_events: add Intel X86 LBR sharing logic (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
  2012-01-09 16:49 ` [PATCH 01/13] perf_events: add generic taken branch sampling support (v3) Stephane Eranian
  2012-01-09 16:49 ` [PATCH 02/13] perf_events: add Intel LBR MSR definitions (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-09 16:49 ` [PATCH 04/13] perf_events: sync branch stack sampling with X86 precise_sampling (v3) Stephane Eranian
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

The Intel LBR on some recent processors is capable
of filtering branches by type. The filter is configurable
via the LBR_SELECT MSR register.

There are limitations on how this register can be used.

On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads
when HT is on. It is private to each core when HT is off.

On SandyBridge, the LBR_SELECT register is private to each thread
when HT is on. It is private to each core when HT is off.

The kernel must manage the sharing of LBR_SELECT. It allows
multiple users on the same logical CPU to use LBR_SELECT as
long as they program it with the same value. Across sibling
CPUs (HT threads), the same restriction applies on NHM/WSM.

This patch implements this sharing logic by leveraging the
mechanism put in place for managing the offcore_response
shared MSR.
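
The sharing rule itself (several users may hold the register as long
as they all want the same value) can be sketched as below. This is a
simplified, single-threaded illustration with hypothetical names. The
kernel code works on struct er_account and takes era->lock around the
check, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a shared register account. */
struct ex_shared_reg {
	uint64_t	config;	/* value currently programmed */
	unsigned int	ref;	/* number of current users */
};

/*
 * Try to acquire the register for the given value. This succeeds if
 * the register is free or already programmed with the same value.
 */
static bool ex_shared_reg_get(struct ex_shared_reg *r, uint64_t config)
{
	assert(r != NULL);

	if (r->ref == 0 || r->config == config) {
		r->config = config;
		r->ref++;
		return true;
	}
	return false;	/* conflicting value, caller must back off */
}

/* Drop one reference taken by ex_shared_reg_get(). */
static void ex_shared_reg_put(struct ex_shared_reg *r)
{
	assert(r != NULL && r->ref > 0);
	r->ref--;
}
```

Two events asking for the same filter value both succeed. A third one
asking for a different value is refused until the first two have
released the register.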

We modify __intel_shared_reg_get_constraints() to cause
x86_get_event_constraint() to be called because LBR may
be associated with events that may be counter constrained.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |    4 ++
 arch/x86/kernel/cpu/perf_event.h       |    4 ++
 arch/x86/kernel/cpu/perf_event_intel.c |   70 ++++++++++++++++++++------------
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f8bddb5..3779313 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -426,6 +426,10 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	/* mark unused */
 	event->hw.extra_reg.idx = EXTRA_REG_NONE;
 
+	/* mark not used */
+	event->hw.extra_reg.idx = EXTRA_REG_NONE;
+	event->hw.branch_reg.idx = EXTRA_REG_NONE;
+
 	return x86_pmu.hw_config(event);
 }
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 513d617..4535ada 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -33,6 +33,7 @@ enum extra_reg_type {
 
 	EXTRA_REG_RSP_0 = 0,	/* offcore_response_0 */
 	EXTRA_REG_RSP_1 = 1,	/* offcore_response_1 */
+	EXTRA_REG_LBR   = 2,	/* lbr_select */
 
 	EXTRA_REG_MAX		/* number of entries needed */
 };
@@ -130,6 +131,7 @@ struct cpu_hw_events {
 	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
+	struct er_account		*lbr_sel;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -340,6 +342,8 @@ struct x86_pmu {
 	 */
 	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
 	int		lbr_nr;			   /* hardware stack size */
+	u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
+	const int	*lbr_sel_map;		   /* lbr_select mappings */
 
 	/*
 	 * Extra registers for events
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3bd37bd..97f7bb5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1123,17 +1123,17 @@ static bool intel_try_alt_er(struct perf_event *event, int orig_idx)
  */
 static struct event_constraint *
 __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
-				   struct perf_event *event)
+				   struct perf_event *event,
+				   struct hw_perf_event_extra *reg)
 {
 	struct event_constraint *c = &emptyconstraint;
-	struct hw_perf_event_extra *reg = &event->hw.extra_reg;
 	struct er_account *era;
 	unsigned long flags;
 	int orig_idx = reg->idx;
 
 	/* already allocated shared msr */
 	if (reg->alloc)
-		return &unconstrained;
+		return NULL; /* call x86_get_event_constraint() */
 
 again:
 	era = &cpuc->shared_regs->regs[reg->idx];
@@ -1156,14 +1156,10 @@ __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
 		reg->alloc = 1;
 
 		/*
-		 * All events using extra_reg are unconstrained.
-		 * Avoids calling x86_get_event_constraints()
-		 *
-		 * Must revisit if extra_reg controlling events
-		 * ever have constraints. Worst case we go through
-		 * the regular event constraint table.
+		 * need to call x86_get_event_constraint()
+		 * to check if associated event has constraints
 		 */
-		c = &unconstrained;
+		c = NULL;
 	} else if (intel_try_alt_er(event, orig_idx)) {
 		raw_spin_unlock_irqrestore(&era->lock, flags);
 		goto again;
@@ -1200,11 +1196,23 @@ static struct event_constraint *
 intel_shared_regs_constraints(struct cpu_hw_events *cpuc,
 			      struct perf_event *event)
 {
-	struct event_constraint *c = NULL;
-
-	if (event->hw.extra_reg.idx != EXTRA_REG_NONE)
-		c = __intel_shared_reg_get_constraints(cpuc, event);
-
+	struct event_constraint *c = NULL, *d;
+	struct hw_perf_event_extra *xreg, *breg;
+
+	xreg = &event->hw.extra_reg;
+	if (xreg->idx != EXTRA_REG_NONE) {
+		c = __intel_shared_reg_get_constraints(cpuc, event, xreg);
+		if (c == &emptyconstraint)
+			return c;
+	}
+	breg = &event->hw.branch_reg;
+	if (breg->idx != EXTRA_REG_NONE) {
+		d = __intel_shared_reg_get_constraints(cpuc, event, breg);
+		if (d == &emptyconstraint) {
+			__intel_shared_reg_put_constraints(cpuc, xreg);
+			c = d;
+		}
+	}
 	return c;
 }
 
@@ -1252,6 +1260,10 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
 	reg = &event->hw.extra_reg;
 	if (reg->idx != EXTRA_REG_NONE)
 		__intel_shared_reg_put_constraints(cpuc, reg);
+
+	reg = &event->hw.branch_reg;
+	if (reg->idx != EXTRA_REG_NONE)
+		__intel_shared_reg_put_constraints(cpuc, reg);
 }
 
 static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
@@ -1431,7 +1443,7 @@ static int intel_pmu_cpu_prepare(int cpu)
 {
 	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
 
-	if (!x86_pmu.extra_regs)
+	if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
 		return NOTIFY_OK;
 
 	cpuc->shared_regs = allocate_shared_regs(cpu);
@@ -1453,22 +1465,28 @@ static void intel_pmu_cpu_starting(int cpu)
 	 */
 	intel_pmu_lbr_reset();
 
-	if (!cpuc->shared_regs || (x86_pmu.er_flags & ERF_NO_HT_SHARING))
+	cpuc->lbr_sel = NULL;
+
+	if (!cpuc->shared_regs)
 		return;
 
-	for_each_cpu(i, topology_thread_cpumask(cpu)) {
-		struct intel_shared_regs *pc;
+	if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+		for_each_cpu(i, topology_thread_cpumask(cpu)) {
+			struct intel_shared_regs *pc;
 
-		pc = per_cpu(cpu_hw_events, i).shared_regs;
-		if (pc && pc->core_id == core_id) {
-			cpuc->kfree_on_online = cpuc->shared_regs;
-			cpuc->shared_regs = pc;
-			break;
+			pc = per_cpu(cpu_hw_events, i).shared_regs;
+			if (pc && pc->core_id == core_id) {
+				cpuc->kfree_on_online = cpuc->shared_regs;
+				cpuc->shared_regs = pc;
+				break;
+			}
 		}
+		cpuc->shared_regs->core_id = core_id;
+		cpuc->shared_regs->refcnt++;
 	}
 
-	cpuc->shared_regs->core_id = core_id;
-	cpuc->shared_regs->refcnt++;
+	if (x86_pmu.lbr_sel_map)
+		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
 }
 
 static void intel_pmu_cpu_dying(int cpu)
-- 
1.7.1



* [PATCH 04/13] perf_events: sync branch stack sampling with X86 precise_sampling (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (2 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 03/13] perf_events: add Intel X86 LBR sharing logic (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  5:26   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 05/13] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v3) Stephane Eranian
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

If precise sampling is enabled on Intel X86, then perf_events uses PEBS.
To correct for the off-by-one error of PEBS, perf_events uses LBR when
precise_ip > 1.

On Intel X86, PERF_SAMPLE_BRANCH_STACK is implemented using LBR.
Therefore, both features must be coordinated, as they may not
configure LBR the same way.

For PEBS, LBR needs to capture all branches at all priv levels.
This patch sets this up.

If the user's PERF_SAMPLE_BRANCH_STACK configuration is not compatible
with what PEBS needs, an error (-EOPNOTSUPP) is returned.
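
As a rough illustration of the rule described above, here is a minimal
user-space sketch of the compatibility check. The SKETCH_* constants and
the function name are hypothetical stand-ins, not the kernel's
definitions; only the decision logic mirrors the hunk below.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real flags live in include/linux/perf_event.h. */
#define SKETCH_BRANCH_USER	(1ULL << 0)
#define SKETCH_BRANCH_KERNEL	(1ULL << 1)
#define SKETCH_BRANCH_ANY	(1ULL << 2)
#define SKETCH_BRANCH_PLM_ALL	(SKETCH_BRANCH_USER | SKETCH_BRANCH_KERNEL)

/*
 * Returns 0 on success (possibly rewriting *br_type) or -EOPNOTSUPP
 * when the user's branch filter conflicts with the PEBS fixup.
 */
static int sketch_sync_pebs_lbr(int precise_ip, bool has_branch_stack,
				uint64_t *br_type)
{
	/* LBR is only needed for the PEBS fixup when precise_ip > 1 */
	if (precise_ip <= 1)
		return 0;

	if (has_branch_stack) {
		/* the user's filter must already request all branch types */
		if ((*br_type & SKETCH_BRANCH_ANY) != SKETCH_BRANCH_ANY)
			return -EOPNOTSUPP;
		return 0;
	}

	/* no explicit request: capture everything at all priv levels */
	*br_type = SKETCH_BRANCH_ANY | SKETCH_BRANCH_PLM_ALL;
	return 0;
}
```

With precise_ip = 2 and no explicit branch stack request, the sketch
rewrites the filter to "any branch, all privilege levels". A request
that omits the "any" flag is rejected.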

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 3779313..710ec93 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -356,6 +356,7 @@ int x86_setup_perfctr(struct perf_event *event)
 int x86_pmu_hw_config(struct perf_event *event)
 {
 	if (event->attr.precise_ip) {
+		u64 *br_type, br_sel;
 		int precise = 0;
 
 		/* Support for constant skid */
@@ -369,6 +370,27 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
+		/*
+		 * check that PEBS LBR correction does not conflict with
+		 * whatever the user is asking with attr->branch_sample_type
+		 */
+		if (event->attr.precise_ip > 1) {
+
+			br_type = &event->attr.branch_sample_type;
+
+			if (has_branch_stack(event)) {
+				br_sel = *br_type & PERF_SAMPLE_BRANCH_ANY;
+				if (br_sel != PERF_SAMPLE_BRANCH_ANY)
+					return -EOPNOTSUPP;
+			} else {
+				/*
+				 * For PEBS fixups, we capture all
+				 * the branches at all priv levels
+				 */
+				*br_type = PERF_SAMPLE_BRANCH_ANY
+					 | PERF_SAMPLE_BRANCH_PLM_ALL;
+			}
+		}
 	}
 
 	/*
-- 
1.7.1



* [PATCH 05/13] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (3 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 04/13] perf_events: sync branch stack sampling with X86 precise_sampling (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  5:41   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 06/13] perf_events: disable LBR support for older Intel Atom processors (v3) Stephane Eranian
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
filters to the actual Intel X86 LBR filters, whenever they exist.
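
For illustration, the SandyBridge mapping can be sketched in user space
as a plain lookup. The LBR_* bit positions are copied from this patch;
the sketch_filter enum and the function name are hypothetical, and a
switch is used here instead of the table the patch installs.

```c
#include <stdint.h>

/* LBR_SELECT bits, as listed in this patch */
#define LBR_KERNEL	(1u << 0)
#define LBR_USER	(1u << 1)
#define LBR_JCC		(1u << 2)
#define LBR_REL_CALL	(1u << 3)
#define LBR_IND_CALL	(1u << 4)
#define LBR_RETURN	(1u << 5)
#define LBR_IND_JMP	(1u << 6)
#define LBR_REL_JMP	(1u << 7)
#define LBR_FAR		(1u << 8)

#define LBR_ANY	(LBR_JCC | LBR_REL_CALL | LBR_IND_CALL | LBR_RETURN | \
		 LBR_REL_JMP | LBR_IND_JMP | LBR_FAR)

/* Hypothetical ids for the generic filters (not the real flag values) */
enum sketch_filter {
	SK_USER,
	SK_KERNEL,
	SK_ANY,
	SK_ANY_CALL,
	SK_ANY_RETURN,
	SK_IND_CALL,
};

/* Return the LBR filter bits for one generic filter (SNB flavour) */
static uint32_t sketch_snb_lbr_bits(enum sketch_filter f)
{
	switch (f) {
	case SK_USER:		return LBR_USER;
	case SK_KERNEL:		return LBR_KERNEL;
	case SK_ANY:		return LBR_ANY;
	case SK_ANY_CALL:	return LBR_REL_CALL | LBR_IND_CALL | LBR_FAR;
	case SK_ANY_RETURN:	return LBR_RETURN | LBR_FAR;
	case SK_IND_CALL:	return LBR_IND_CALL;
	}
	return 0;
}
```

Note that LBR_ANY together with both privilege bits covers exactly the
nine valid bits of LBR_SELECT (0x1ff).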

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |    2 +
 arch/x86/kernel/cpu/perf_event_intel.c     |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   99 +++++++++++++++++++++++++++-
 3 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 4535ada..776fb5a 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -535,6 +535,8 @@ void intel_pmu_lbr_init_nhm(void);
 
 void intel_pmu_lbr_init_atom(void);
 
+void intel_pmu_lbr_init_snb(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 97f7bb5..b0db016 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1757,7 +1757,7 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
-		intel_pmu_lbr_init_nhm();
+		intel_pmu_lbr_init_snb();
 
 		x86_pmu.event_constraints = intel_snb_event_constraints;
 		x86_pmu.pebs_constraints = intel_snb_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e14431f..8a1eb6c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -14,6 +14,47 @@ enum {
 };
 
 /*
+ * Intel LBR_SELECT bits
+ * Intel Vol3a, April 2011, Section 16.7 Table 16-10
+ *
+ * Hardware branch filter (not available on all CPUs)
+ */
+#define LBR_KERNEL_BIT		0 /* do not capture at ring0 */
+#define LBR_USER_BIT		1 /* do not capture at ring > 0 */
+#define LBR_JCC_BIT		2 /* do not capture conditional branches */
+#define LBR_REL_CALL_BIT	3 /* do not capture relative calls */
+#define LBR_IND_CALL_BIT	4 /* do not capture indirect calls */
+#define LBR_RETURN_BIT		5 /* do not capture near returns */
+#define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
+#define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
+#define LBR_FAR_BIT		8 /* do not capture far branches */
+
+#define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
+#define LBR_USER	(1 << LBR_USER_BIT)
+#define LBR_JCC		(1 << LBR_JCC_BIT)
+#define LBR_REL_CALL	(1 << LBR_REL_CALL_BIT)
+#define LBR_IND_CALL	(1 << LBR_IND_CALL_BIT)
+#define LBR_RETURN	(1 << LBR_RETURN_BIT)
+#define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
+#define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
+#define LBR_FAR		(1 << LBR_FAR_BIT)
+
+#define LBR_PLM (LBR_KERNEL | LBR_USER)
+
+#define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+
+#define LBR_ANY		 \
+	(LBR_JCC	|\
+	 LBR_REL_CALL	|\
+	 LBR_IND_CALL	|\
+	 LBR_RETURN	|\
+	 LBR_REL_JMP	|\
+	 LBR_IND_JMP	|\
+	 LBR_FAR)
+
+#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -153,8 +194,6 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 	cpuc->lbr_stack.nr = i;
 }
 
-#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
-
 /*
  * Due to lack of segmentation in Linux the effective address (offset)
  * is the same as the linear address, allowing us to merge the LIP and EIP
@@ -202,26 +241,82 @@ void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_64(cpuc);
 }
 
+/*
+ * Map interface branch filters onto LBR filters
+ */
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
+					| LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
+	 */
+	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+	 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
+	 */
+	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+};
+
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
+					| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
+};
+
+/* core */
 void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("4-deep LBR, ");
 }
 
+/* nehalem/westmere */
 void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
 }
 
+/* sandy bridge */
+void intel_pmu_lbr_init_snb(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
+/* atom */
 void intel_pmu_lbr_init_atom(void)
 {
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("8-deep LBR, ");
 }
-- 
1.7.1



* [PATCH 06/13] perf_events: disable LBR support for older Intel Atom processors (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (4 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 05/13] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  5:43   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 07/13] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v3) Stephane Eranian
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

The patch adds a restriction for Intel Atom LBR support. Only
stepping 10 (PineView) and more recent steppings are supported. Older
steppings do not have a functional LBR: their LBR does not freeze on PMU
interrupt, which makes it unusable in the context of perf_events.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8a1eb6c..e2b7094 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -313,6 +313,16 @@ void intel_pmu_lbr_init_snb(void)
 /* atom */
 void intel_pmu_lbr_init_atom(void)
 {
+	/*
+	 * only models starting at stepping 10 seem
+	 * to have an operational LBR which can freeze
+	 * on PMU interrupt
+	 */
+	if (boot_cpu_data.x86_mask < 10) {
+		pr_cont("LBR disabled due to erratum");
+		return;
+	}
+
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
-- 
1.7.1



* [PATCH 07/13] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (5 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 06/13] perf_events: disable LBR support for older Intel Atom processors (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  6:14   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 08/13] perf_events: add LBR software filter support " Stephane Eranian
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

This patch implements PERF_SAMPLE_BRANCH support for Intel
X86 processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.

The patch adds the hooks in the PMU irq handler to save the LBR
on counter overflow for both regular and PEBS modes.
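
The hardware filter setup in this patch inverts the requested mask
because LBR_SELECT operates in suppress mode: a set bit means "do not
capture". A minimal sketch of that step, with a hypothetical function
name and the 0x1ff valid-bit mask taken from the series:

```c
#include <stdint.h>

#define SKETCH_LBR_SEL_MASK	0x1ffu	/* valid bits in LBR_SELECT */

/*
 * capture_mask has one bit set per branch class or privilege level
 * the user wants recorded. The value written to LBR_SELECT is the
 * complement, restricted to the valid bits.
 */
static uint32_t sketch_lbr_select_config(uint32_t capture_mask)
{
	return ~capture_mask & SKETCH_LBR_SEL_MASK;
}
```

For example, asking for user-level indirect calls (bits 1 and 4, i.e.
0x12) yields a register value of 0x1ed, and asking for everything
(0x1ff) yields 0, meaning nothing is suppressed.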

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c     |   35 +++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   10 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   73 +++++++++++++++++++++++++++-
 include/linux/perf_event.h                 |    3 +
 4 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index b0db016..7cc1e2d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -727,6 +727,19 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+	/* user explicitly requested branch sampling */
+	if (has_branch_stack(event))
+		return true;
+
+	/* implicit branch sampling to correct PEBS skid */
+	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
+		return true;
+
+	return false;
+}
+
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -881,6 +894,13 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
 
+	/*
+	 * must disable before any actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_disable(event);
+
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_disable_fixed(hwc);
 		return;
@@ -935,6 +955,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
 		intel_pmu_enable_bts(hwc->config);
 		return;
 	}
+	/*
+	 * must be enabled before any actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_enable(event);
 
 	if (event->attr.exclude_host)
 		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
@@ -1057,6 +1083,9 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 
 		data.period = event->hw.last_period;
 
+		if (has_branch_stack(event))
+			data.br_stack = &cpuc->lbr_stack;
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
@@ -1305,6 +1334,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		event->hw.config = alt_config;
 	}
 
+	if (intel_pmu_needs_lbr_smpl(event)) {
+		ret = intel_pmu_setup_lbr_filter(event);
+		if (ret)
+			return ret;
+	}
+
 	if (event->attr.type != PERF_TYPE_RAW)
 		return 0;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 73da6b6..04c71ea 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -440,9 +440,6 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 	WARN_ON_ONCE(cpuc->enabled);
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_enable(event);
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -455,9 +452,6 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_disable(event);
 }
 
 void intel_pmu_pebs_enable_all(void)
@@ -573,6 +567,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * both formats and we don't use the other fields in this
 	 * routine.
 	 */
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct pebs_record_core *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
@@ -603,6 +598,9 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	else
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 
+	if (has_branch_stack(event))
+		data.br_stack = &cpuc->lbr_stack;
+
 	if (perf_event_overflow(event, &data, &regs))
 		x86_pmu_stop(event, 0);
 }
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e2b7094..84712b1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -42,6 +42,7 @@ enum {
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
 #define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+#define LBR_NOT_SUPP	-1    /* LBR filter not supported */
 
 #define LBR_ANY		 \
 	(LBR_JCC	|\
@@ -54,6 +55,10 @@ enum {
 
 #define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
 
+#define for_each_branch_sample_type(x) \
+	for ((x) = PERF_SAMPLE_BRANCH_USER; \
+	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
+
 /*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
@@ -62,6 +67,10 @@ enum {
 static void __intel_pmu_lbr_enable(void)
 {
 	u64 debugctl;
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->lbr_sel)
+		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
@@ -119,7 +128,6 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	 * Reset the LBR stack if we changed task context to
 	 * avoid data leaks.
 	 */
-
 	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
@@ -138,8 +146,11 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users)
+	if (cpuc->enabled && !cpuc->lbr_users) {
 		__intel_pmu_lbr_disable();
+		/* avoid stale pointer */
+		cpuc->lbr_context = NULL;
+	}
 }
 
 void intel_pmu_lbr_enable_all(void)
@@ -158,6 +169,9 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
 static inline u64 intel_pmu_lbr_tos(void)
 {
 	u64 tos;
@@ -242,6 +256,61 @@ void intel_pmu_lbr_read(void)
 }
 
 /*
+ * setup the HW LBR filter
+ * Used only when available, may not be enough to disambiguate
+ * all branches, may need the help of the SW filter
+ */
+static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
+{
+	struct hw_perf_event_extra *reg;
+	u64 br_type = event->attr.branch_sample_type;
+	u64 mask = 0, m;
+	u64 v;
+
+	for_each_branch_sample_type(m) {
+		if (!(br_type & m))
+			continue;
+
+		v = x86_pmu.lbr_sel_map[m];
+		if (v == LBR_NOT_SUPP)
+			return -EOPNOTSUPP;
+		mask |= v;
+
+		if (m == PERF_SAMPLE_BRANCH_ANY)
+			break;
+	}
+	reg = &event->hw.branch_reg;
+	reg->idx = EXTRA_REG_LBR;
+
+	/* LBR_SELECT operates in suppress mode so invert mask */
+	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+
+	return 0;
+}
+
+static int intel_pmu_setup_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+
+	/*
+	 * no LBR on this PMU
+	 */
+	if (!x86_pmu.lbr_nr)
+		return -EOPNOTSUPP;
+
+	/*
+	 * if no LBR HW filter, users can only
+	 * capture all branches
+	 */
+	if (!x86_pmu.lbr_sel_map) {
+		if (br_type != PERF_SAMPLE_BRANCH_ALL)
+			return -EOPNOTSUPP;
+		return 0;
+	}
+	return intel_pmu_setup_hw_lbr_filter(event);
+}
+
+/*
  * Map interface branch filters onto LBR filters
  */
 static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 17751b1..84bd6a6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -160,6 +160,9 @@ enum perf_branch_sample_type {
 	(PERF_SAMPLE_BRANCH_USER|\
 	 PERF_SAMPLE_BRANCH_KERNEL)
 
+#define PERF_SAMPLE_BRANCH_ALL \
+	(PERF_SAMPLE_BRANCH_PLM_ALL|PERF_SAMPLE_BRANCH_ANY)
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
-- 
1.7.1



* [PATCH 08/13] perf_events: add LBR software filter support for Intel X86 (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (6 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 07/13] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-09 16:49 ` [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3) Stephane Eranian
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

This patch adds an internal software filter to complement
the (optional) LBR hardware filter.

The software filter is necessary:
- as a substitute when there is no HW LBR filter (e.g., Atom, Core)
- to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
- to provide finer grain filtering (e.g., all processors)

Sometimes, the LBR HW filter cannot distinguish between two types
of branches. For instance, to capture syscalls as calls, it is necessary
to enable the LBR_FAR filter, which will also capture far JMP
instructions. Thus, a second pass is necessary to filter those out.
This is what the SW filter does.

The SW filter is built on top of the internal x86 disassembler. It
is a best effort filter especially for user level code. It is subject
to the availability of the text page of the program.

The SW filter is enabled on all Intel X86 processors. It is bypassed
when the user is capturing all branches at all priv levels.
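
The discard step of the SW filter can be sketched in user space as
follows. This is a single-pass compaction written for illustration, not
the loop used in the patch below; the struct and function names are
hypothetical. As in the patch, an entry whose "from" address has been
zeroed is treated as discarded.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one branch record */
struct sketch_branch_entry {
	uint64_t from;
	uint64_t to;
};

/*
 * Remove every entry whose "from" is 0, keep the remaining entries
 * in their original order, and return the new entry count.
 */
static size_t sketch_compact(struct sketch_branch_entry *e, size_t nr)
{
	size_t i, out = 0;

	for (i = 0; i < nr; i++) {
		if (!e[i].from)
			continue;	/* discarded by the type check */
		e[out++] = e[i];
	}
	return out;
}
```

Because mismatched entries are dropped, the number of branches reported
in a sample can be smaller than the LBR depth.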

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |   12 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   12 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  322 +++++++++++++++++++++++++++-
 3 files changed, 326 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 776fb5a..d038cd1 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -132,6 +132,7 @@ struct cpu_hw_events {
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
+	u64				br_sel;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -455,6 +456,17 @@ extern struct event_constraint emptyconstraint;
 
 extern struct event_constraint unconstrained;
 
+static inline bool kernel_ip(unsigned long ip)
+{
+#ifdef CONFIG_X86_32
+	return ip > PAGE_OFFSET;
+#else
+	return (long)ip < 0;
+#endif
+}
+
+int intel_pmu_setup_lbr_filter(struct perf_event *event);
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 04c71ea..db0aa19 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -3,6 +3,7 @@
 #include <linux/slab.h>
 
 #include <asm/perf_event.h>
+#include <asm/insn.h>
 
 #include "perf_event.h"
 
@@ -470,17 +471,6 @@ void intel_pmu_pebs_disable_all(void)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
 }
 
-#include <asm/insn.h>
-
-static inline bool kernel_ip(unsigned long ip)
-{
-#ifdef CONFIG_X86_32
-	return ip > PAGE_OFFSET;
-#else
-	return (long)ip < 0;
-#endif
-}
-
 static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 84712b1..99e4011 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -3,6 +3,7 @@
 
 #include <asm/perf_event.h>
 #include <asm/msr.h>
+#include <asm/insn.h>
 
 #include "perf_event.h"
 
@@ -60,6 +61,53 @@ enum {
 	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
 
 /*
+ * X86 control flow change classification
+ * X86 control flow changes include branches, interrupts, traps, faults
+ */
+enum {
+	X86_BR_NONE     = 0,      /* unknown */
+
+	X86_BR_USER     = 1 << 0, /* branch target is user */
+	X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
+
+	X86_BR_CALL     = 1 << 2, /* call */
+	X86_BR_RET      = 1 << 3, /* return */
+	X86_BR_SYSCALL  = 1 << 4, /* syscall */
+	X86_BR_SYSRET   = 1 << 5, /* syscall return */
+	X86_BR_INT      = 1 << 6, /* sw interrupt */
+	X86_BR_IRET     = 1 << 7, /* return from interrupt */
+	X86_BR_JCC      = 1 << 8, /* conditional */
+	X86_BR_JMP      = 1 << 9, /* jump */
+	X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
+	X86_BR_IND_CALL = 1 << 11,/* indirect calls */
+};
+
+#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
+
+#define X86_BR_ANY       \
+	(X86_BR_CALL    |\
+	 X86_BR_RET     |\
+	 X86_BR_SYSCALL |\
+	 X86_BR_SYSRET  |\
+	 X86_BR_INT     |\
+	 X86_BR_IRET    |\
+	 X86_BR_JCC     |\
+	 X86_BR_JMP	 |\
+	 X86_BR_IRQ	 |\
+	 X86_BR_IND_CALL)
+
+#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
+
+#define X86_BR_ANY_CALL		 \
+	(X86_BR_CALL		|\
+	 X86_BR_IND_CALL	|\
+	 X86_BR_SYSCALL		|\
+	 X86_BR_IRQ		|\
+	 X86_BR_INT)
+
+static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -132,6 +180,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
 	}
+	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
 }
@@ -253,6 +302,45 @@ void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_32(cpuc);
 	else
 		intel_pmu_lbr_read_64(cpuc);
+
+	intel_pmu_lbr_filter(cpuc);
+}
+
+/*
+ * SW filter is used:
+ * - in case there is no HW filter
+ * - in case the HW filter has errata or limitations
+ */
+static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+	int mask = 0;
+
+	if (br_type & PERF_SAMPLE_BRANCH_USER)
+		mask |= X86_BR_USER;
+
+	if (br_type & PERF_SAMPLE_BRANCH_KERNEL)
+		mask |= X86_BR_KERNEL;
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY) {
+		mask |= X86_BR_ANY;
+		goto done;
+	}
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+		mask |= X86_BR_ANY_CALL;
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+		mask |= X86_BR_RET | X86_BR_IRET | X86_BR_SYSRET;
+
+	if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
+		mask |= X86_BR_IND_CALL;
+done:
+	/*
+	 * stash actual user request into reg, it may
+	 * be used by fixup code for some CPU
+	 */
+	event->hw.branch_reg.reg = mask;
 }
 
 /*
@@ -288,9 +376,9 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 	return 0;
 }
 
-static int intel_pmu_setup_lbr_filter(struct perf_event *event)
+int intel_pmu_setup_lbr_filter(struct perf_event *event)
 {
-	u64 br_type = event->attr.branch_sample_type;
+	int ret = 0;
 
 	/*
 	 * no LBR on this PMU
@@ -299,15 +387,210 @@ static int intel_pmu_setup_lbr_filter(struct perf_event *event)
 		return -EOPNOTSUPP;
 
 	/*
-	 * if no LBR HW filter, users can only
-	 * capture all branches
+	 * setup SW LBR filter
 	 */
-	if (!x86_pmu.lbr_sel_map) {
-		if (br_type != PERF_SAMPLE_BRANCH_ALL)
-			return -EOPNOTSUPP;
-		return 0;
+	intel_pmu_setup_sw_lbr_filter(event);
+
+	/*
+	 * setup HW LBR filter, if any
+	 */
+	if (x86_pmu.lbr_sel_map)
+		ret = intel_pmu_setup_hw_lbr_filter(event);
+
+	return ret;
+}
+
+/*
+ * return the type of control flow change at address "from"
+ * instruction is not necessarily a branch (in case of interrupt).
+ *
+ * The branch type returned also includes the priv level of the
+ * target of the control flow change (X86_BR_USER, X86_BR_KERNEL).
+ *
+ * If a branch type is unknown OR the instruction cannot be
+ * decoded (e.g., text page not present), then X86_BR_NONE is
+ * returned.
+ */
+static int branch_type(unsigned long from, unsigned long to)
+{
+	struct insn insn;
+	void *addr;
+	int bytes, size = MAX_INSN_SIZE;
+	int ret = X86_BR_NONE;
+	int ext, to_plm, from_plm;
+	u8 buf[MAX_INSN_SIZE];
+	int is64 = 0;
+
+	to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
+	from_plm = kernel_ip(from) ? X86_BR_KERNEL : X86_BR_USER;
+
+	/*
+	 * may be zero if lbr did not fill up after a reset by the time
+	 * we get a PMU interrupt
+	 */
+	if (from == 0 || to == 0)
+		return X86_BR_NONE;
+
+	if (from_plm == X86_BR_USER) {
+		/*
+		 * can happen if measuring at the user level only
+		 * and we interrupt in a kernel thread, e.g., idle.
+		 */
+		if (!current->mm)
+			return X86_BR_NONE;
+
+		/* may fail if text not present */
+		bytes = copy_from_user_nmi(buf, (void __user *)from, size);
+		if (bytes != size)
+			return X86_BR_NONE;
+
+		addr = buf;
+	} else
+		addr = (void *)from;
+
+	/*
+	 * decoder needs to know the ABI especially
+	 * on 64-bit systems running 32-bit apps
+	 */
+#ifdef CONFIG_X86_64
+	is64 = kernel_ip((unsigned long)addr) || !test_thread_flag(TIF_IA32);
+#endif
+	insn_init(&insn, addr, is64);
+	insn_get_opcode(&insn);
+
+	switch (insn.opcode.bytes[0]) {
+	case 0xf:
+		switch (insn.opcode.bytes[1]) {
+		case 0x05: /* syscall */
+		case 0x34: /* sysenter */
+			ret = X86_BR_SYSCALL;
+			break;
+		case 0x07: /* sysret */
+		case 0x35: /* sysexit */
+			ret = X86_BR_SYSRET;
+			break;
+		case 0x80 ... 0x8f: /* conditional */
+			ret = X86_BR_JCC;
+			break;
+		default:
+			ret = X86_BR_NONE;
+		}
+		break;
+	case 0x70 ... 0x7f: /* conditional */
+		ret = X86_BR_JCC;
+		break;
+	case 0xc2: /* near ret */
+	case 0xc3: /* near ret */
+	case 0xca: /* far ret */
+	case 0xcb: /* far ret */
+		ret = X86_BR_RET;
+		break;
+	case 0xcf: /* iret */
+		ret = X86_BR_IRET;
+		break;
+	case 0xcc ... 0xce: /* int */
+		ret = X86_BR_INT;
+		break;
+	case 0xe8: /* call near rel */
+	case 0x9a: /* call far absolute */
+		ret = X86_BR_CALL;
+		break;
+	case 0xe0 ... 0xe3: /* loop jmp */
+		ret = X86_BR_JCC;
+		break;
+	case 0xe9 ... 0xeb: /* jmp */
+		ret = X86_BR_JMP;
+		break;
+	case 0xff: /* call near absolute, call far absolute ind */
+		insn_get_modrm(&insn);
+		ext = (insn.modrm.bytes[0] >> 3) & 0x7;
+		switch (ext) {
+		case 2: /* near ind call */
+		case 3: /* far ind call */
+			ret = X86_BR_IND_CALL;
+			break;
+		case 4:
+		case 5:
+			ret = X86_BR_JMP;
+			break;
+		}
+		break;
+	default:
+		ret = X86_BR_NONE;
+	}
+	/*
+	 * interrupts, traps, faults (and thus ring transition) may
+	 * occur on any instructions. Thus, to classify them correctly,
+	 * we need to first look at the from and to priv levels. If they
+	 * are different and to is in the kernel, then it indicates
+	 * a ring transition. If the from instruction is not a ring
+	 * transition instr (syscall, sysenter, int), then it means
+	 * it was an irq, trap or fault.
+	 *
+	 * we have no way of detecting kernel to kernel faults.
+	 */
+	if (from_plm == X86_BR_USER && to_plm == X86_BR_KERNEL
+	    && ret != X86_BR_SYSCALL && ret != X86_BR_INT)
+		ret = X86_BR_IRQ;
+
+	/*
+	 * branch priv level determined by target as
+	 * is done by HW when LBR_SELECT is implemented
+	 */
+	if (ret != X86_BR_NONE)
+		ret |= to_plm;
+
+	return ret;
+}
+
+/*
+ * implement actual branch filter based on user demand.
+ * Hardware may not exactly satisfy that request, thus
+ * we need to inspect opcodes. Mismatched branches are
+ * discarded. Therefore, the number of branches returned
+ * in PERF_SAMPLE_BRANCH_STACK sample may vary.
+ */
+static void
+intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
+{
+	u64 from, to;
+	int br_sel = cpuc->br_sel;
+	int i, j, type;
+	bool compress = false;
+
+	/* if sampling all branches, then nothing to filter */
+	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+		return;
+
+	for (i = 0; i < cpuc->lbr_stack.nr; i++) {
+
+		from = cpuc->lbr_entries[i].from;
+		to = cpuc->lbr_entries[i].to;
+
+		type = branch_type(from, to);
+
+		/* if type does not correspond, then discard */
+		if (type == X86_BR_NONE || (br_sel & type) != type) {
+			cpuc->lbr_entries[i].from = 0;
+			compress = true;
+		}
+	}
+
+	if (!compress)
+		return;
+
+	/* remove all entries with from=0 */
+	for (i = 0; i < cpuc->lbr_stack.nr; ) {
+		if (!cpuc->lbr_entries[i].from) {
+			j = i;
+			while (++j < cpuc->lbr_stack.nr)
+				cpuc->lbr_entries[j-1] = cpuc->lbr_entries[j];
+			cpuc->lbr_stack.nr--;
+			if (!cpuc->lbr_entries[i].from)
+				continue;
+		}
+		i++;
 	}
-	return intel_pmu_setup_hw_lbr_filter(event);
 }
 
 /*
@@ -348,6 +631,10 @@ void intel_pmu_lbr_init_core(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("4-deep LBR, ");
 }
 
@@ -362,6 +649,13 @@ void intel_pmu_lbr_init_nhm(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - work around LBR_SEL errata (see above)
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR, which means far
+	 *   jmps need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -376,6 +670,12 @@ void intel_pmu_lbr_init_snb(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR, which means far
+	 *   jmps need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -397,5 +697,9 @@ void intel_pmu_lbr_init_atom(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("8-deep LBR, ");
 }
-- 
1.7.1



* [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (7 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 08/13] perf_events: add LBR software filter support " Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-27  7:15   ` Anshuman Khandual
  2012-01-09 16:49 ` [PATCH 10/13] perf_events: add hook to flush branch_stack on context switch (v3) Stephane Eranian
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

PERF_SAMPLE_BRANCH_* is disabled for:
- SW events (sw counters, tracepoints)
- HW breakpoints
- all architectures other than Intel X86
- AMD64 processors
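
The guard added to each of these init paths follows a single pattern, which
can be sketched as a stand-alone C fragment. This is an illustration only:
demo_event, demo_has_branch_stack(), demo_sw_event_init() and the value of
DEMO_SAMPLE_BRANCH_STACK are hypothetical stand-ins, not the kernel
definitions, and the shape of the helper is an assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the PERF_SAMPLE_BRANCH_STACK sample bit. */
#define DEMO_SAMPLE_BRANCH_STACK (1ULL << 11)

/* Hypothetical, heavily reduced stand-in for struct perf_event. */
struct demo_event {
	uint64_t sample_type;
};

/* Assumed shape of the has_branch_stack() helper used by the patch. */
static int demo_has_branch_stack(const struct demo_event *event)
{
	return (event->sample_type & DEMO_SAMPLE_BRANCH_STACK) != 0;
}

/* A PMU without branch stack support rejects the event up front. */
static int demo_sw_event_init(const struct demo_event *event)
{
	if (demo_has_branch_stack(event))
		return -EOPNOTSUPP;

	return 0;
}
```

In this sketch, the check runs before any other setup, so an unsupported
request fails early with -EOPNOTSUPP and leaves nothing to undo.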

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/alpha/kernel/perf_event.c       |    4 ++++
 arch/arm/kernel/perf_event.c         |    4 ++++
 arch/mips/kernel/perf_event_mipsxx.c |    4 ++++
 arch/powerpc/kernel/perf_event.c     |    4 ++++
 arch/sh/kernel/perf_event.c          |    4 ++++
 arch/sparc/kernel/perf_event.c       |    4 ++++
 arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
 kernel/events/core.c                 |   24 ++++++++++++++++++++++++
 kernel/events/hw_breakpoint.c        |    6 ++++++
 9 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 8143cd7..0dae252 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 88b0941..42262ff 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -540,6 +540,10 @@ static int armpmu_event_init(struct perf_event *event)
 	int err = 0;
 	atomic_t *active_events = &armpmu->active_events;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	if (armpmu->map_event(event) == -ENOENT)
 		return -ENOENT;
 
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index 315fc0b..7070f8c 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -606,6 +606,10 @@ static int mipspmu_event_init(struct perf_event *event)
 {
 	int err = 0;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
index d614ab5..4e0b265 100644
--- a/arch/powerpc/kernel/perf_event.c
+++ b/arch/powerpc/kernel/perf_event.c
@@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
 	if (!ppmu)
 		return -ENOENT;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_HARDWARE:
 		ev = event->attr.config;
diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
index 10b14e3..068b8a2 100644
--- a/arch/sh/kernel/perf_event.c
+++ b/arch/sh/kernel/perf_event.c
@@ -310,6 +310,10 @@ static int sh_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HW_CACHE:
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 614da62..8e16a4a 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
 	if (atomic_read(&nmi_active) < 0)
 		return -ENODEV;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (attr->type) {
 	case PERF_TYPE_HARDWARE:
 		if (attr->config >= sparc_pmu->max_events)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 0397b23..0d8da03 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -138,6 +138,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
 	if (ret)
 		return ret;
 
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	if (event->attr.exclude_host && event->attr.exclude_guest)
 		/*
 		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ed39225..36d1a63 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5000,6 +5000,12 @@ static int perf_swevent_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_SOFTWARE)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event_id) {
 	case PERF_COUNT_SW_CPU_CLOCK:
 	case PERF_COUNT_SW_TASK_CLOCK:
@@ -5110,6 +5116,12 @@ static int perf_tp_event_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for tracepoint events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	err = perf_trace_init(event);
 	if (err)
 		return err;
@@ -5335,6 +5347,12 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
@@ -5409,6 +5427,12 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index b0309f7..cee5423 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
 	if (bp->attr.type != PERF_TYPE_BREAKPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for breakpoint events
+	 */
+	if (has_branch_stack(bp))
+		return -EOPNOTSUPP;
+
 	err = register_perf_hw_breakpoint(bp);
 	if (err)
 		return err;
-- 
1.7.1



* [PATCH 10/13] perf_events: add hook to flush branch_stack on context switch (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (8 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-09 16:49 ` [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3) Stephane Eranian
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.

We need a hook on context switch to either flush the branch stack
or save it. This patch adds a new hook in struct pmu which is called
during context switches. The hook is called only when necessary,
that is, when a system-wide context has at least one event which
uses PERF_SAMPLE_BRANCH_STACK. The hook is never called for per-thread
context.
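
The "only when necessary" rule can be sketched with a per-CPU counter that
gates the hook. This is a simplified model, not the kernel code: the demo_*
names, the single global counter (standing in for a per-CPU variable) and the
integer task ids are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU count of system-wide events. */
static int demo_cpu_branch_stack_events;

/* Counts how often the (hypothetical) PMU flush hook ran. */
static int demo_flush_calls;

static void demo_flush_branch_stack(void)
{
	demo_flush_calls++;
}

/* Only system-wide (not task-attached) branch stack events are counted. */
static void demo_event_alloc(bool has_branch_stack, bool attached_to_task)
{
	if (has_branch_stack && !attached_to_task)
		demo_cpu_branch_stack_events++;
}

/* Context-switch path: skip the hook unless it can matter. */
static void demo_sched_in(int prev_task, int next_task)
{
	if (!demo_cpu_branch_stack_events)
		return;		/* no system-wide branch stack event */

	if (prev_task == next_task)
		return;		/* same task, entries stay attributable */

	demo_flush_branch_stack();
}
```

With only a per-thread event allocated, demo_sched_in() never reaches the
flush. Once a system-wide event exists, every switch between two different
tasks triggers it.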

In this version, the Intel X86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
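
The flush-then-filter behaviour can be sketched as follows. This is a
simplified model under stated assumptions: struct demo_lbr, DEMO_LBR_NR and
both functions are hypothetical, and the filter here compacts in a single
pass rather than using the shifting loop of the software filter in patch 08.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_LBR_NR 4	/* hypothetical stack depth */

struct demo_lbr_entry {
	uint64_t from;
	uint64_t to;
};

struct demo_lbr {
	size_t nr;
	struct demo_lbr_entry entries[DEMO_LBR_NR];
};

/* Flush: every slot reads back as from == 0, to == 0. */
static void demo_lbr_reset(struct demo_lbr *lbr)
{
	memset(lbr->entries, 0, sizeof(lbr->entries));
	lbr->nr = DEMO_LBR_NR;
}

/* Drop zeroed entries, keeping the remaining ones in their order. */
static void demo_lbr_filter(struct demo_lbr *lbr)
{
	size_t i, kept = 0;

	for (i = 0; i < lbr->nr; i++) {
		if (!lbr->entries[i].from)
			continue;	/* flushed slot, discard */
		lbr->entries[kept++] = lbr->entries[i];
	}
	lbr->nr = kept;
}
```

After a reset, only the branches recorded since the context switch survive
the filter, so a sample may carry fewer entries than the hardware depth.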

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |   21 +++++---
 arch/x86/kernel/cpu/perf_event.h       |    1 +
 arch/x86/kernel/cpu/perf_event_intel.c |   13 +++++
 include/linux/perf_event.h             |    9 +++-
 kernel/events/core.c                   |   85 ++++++++++++++++++++++++++++++++
 5 files changed, 121 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 710ec93..01ce138 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1633,25 +1633,32 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
+static void x86_pmu_flush_branch_stack(void)
+{
+	if (x86_pmu.flush_branch_stack)
+		x86_pmu.flush_branch_stack();
+}
+
 static struct pmu pmu = {
-	.pmu_enable	= x86_pmu_enable,
-	.pmu_disable	= x86_pmu_disable,
+	.pmu_enable		= x86_pmu_enable,
+	.pmu_disable		= x86_pmu_disable,
 
 	.attr_groups	= x86_pmu_attr_groups,
 
 	.event_init	= x86_pmu_event_init,
 
-	.add		= x86_pmu_add,
-	.del		= x86_pmu_del,
-	.start		= x86_pmu_start,
-	.stop		= x86_pmu_stop,
-	.read		= x86_pmu_read,
+	.add			= x86_pmu_add,
+	.del			= x86_pmu_del,
+	.start			= x86_pmu_start,
+	.stop			= x86_pmu_stop,
+	.read			= x86_pmu_read,
 
 	.start_txn	= x86_pmu_start_txn,
 	.cancel_txn	= x86_pmu_cancel_txn,
 	.commit_txn	= x86_pmu_commit_txn,
 
 	.event_idx	= x86_pmu_event_idx,
+	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 };
 
 void perf_update_user_clock(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index d038cd1..fce4f5d 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -322,6 +322,7 @@ struct x86_pmu {
 	void		(*cpu_starting)(int cpu);
 	void		(*cpu_dying)(int cpu);
 	void		(*cpu_dead)(int cpu);
+	void		(*flush_branch_stack)(void);
 
 	/*
 	 * Intel Arch Perfmon v2+
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 7cc1e2d..6627089 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1539,6 +1539,18 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_flush_branch_stack(void)
+{
+	/*
+	 * Intel LBR does not tag entries with the
+	 * PID of the current task, so we need to
+	 * flush it on ctxsw.
+	 * For now, we simply reset it
+	 */
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_reset();
+}
+
 static __initconst const struct x86_pmu intel_pmu = {
 	.name			= "Intel",
 	.handle_irq		= intel_pmu_handle_irq,
@@ -1566,6 +1578,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
+	.flush_branch_stack	= intel_pmu_flush_branch_stack,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 84bd6a6..5cda9b9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -743,6 +743,11 @@ struct pmu {
 	 * if no implementation is provided it will default to: event->hw.idx + 1.
 	 */
 	int (*event_idx)		(struct perf_event *event); /*optional */
+
+	/*
+	 * flush branch stack on context-switches (needed in cpu-wide mode)
+	 */
+	void (*flush_branch_stack)	(void);
 };
 
 /**
@@ -973,7 +978,8 @@ struct perf_event_context {
 	u64				parent_gen;
 	u64				generation;
 	int				pin_count;
-	int				nr_cgroups; /* cgroup events present */
+	int				nr_cgroups;	 /* cgroup evts */
+	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 };
 
@@ -1038,6 +1044,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
+
 struct perf_sample_data {
 	u64				type;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 36d1a63..29018fe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -130,6 +130,7 @@ enum event_type_t {
  */
 struct jump_label_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
+static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -881,6 +882,9 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack++;
+
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1020,6 +1024,9 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack--;
+
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2195,6 +2202,66 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
+ * When sampling the branch stack in system-wide mode, it may be
+ * necessary to flush the stack on context switch. This happens when the
+ * branch stack does not tag its entries with the pid of the current task.
+ * Otherwise it becomes impossible to associate a branch entry with a
+ * task. This ambiguity is more likely to appear when the branch stack
+ * supports priv level filtering and the user sets it to monitor only
+ * at the user level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch stack with
+ * branches from multiple tasks. Flushing may mean dropping the existing
+ * entries or stashing them somewhere in the PMU specific code layer.
+ *
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when there is at least one system-wide context
+ * with at least one active event using taken branch sampling.
+ */
+static void perf_branch_stack_sched_in(struct task_struct *prev,
+				       struct task_struct *task)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	/* no need to flush branch stack if not changing task */
+	if (prev == task)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pmu, &pmus, entry) {
+		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+		/*
+		 * check if the context has at least one
+		 * event using PERF_SAMPLE_BRANCH_STACK
+		 */
+		if (cpuctx->ctx.nr_branch_stack > 0
+		    && pmu->flush_branch_stack) {
+
+			pmu = cpuctx->ctx.pmu;
+
+			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+			perf_pmu_disable(pmu);
+
+			pmu->flush_branch_stack();
+
+			perf_pmu_enable(pmu);
+
+			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		}
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
+/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2225,6 +2292,10 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	 */
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
+
+	/* check for system-wide branch_stack events */
+	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
+		perf_branch_stack_sched_in(prev, task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -2761,6 +2832,14 @@ static void free_event(struct perf_event *event)
 			atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
 			jump_label_dec_deferred(&perf_sched_events);
 		}
+
+		if (has_branch_stack(event)) {
+			jump_label_dec_deferred(&perf_sched_events);
+			/* is system-wide event */
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_dec(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	if (event->rb) {
@@ -5880,6 +5959,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 				return ERR_PTR(err);
 			}
 		}
+		if (has_branch_stack(event)) {
+			jump_label_inc(&perf_sched_events.key);
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_inc(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	return event;
-- 
1.7.1



* [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (9 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 10/13] perf_events: add hook to flush branch_stack on context switch (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-10  1:25   ` Arun Sharma
  2012-01-09 16:49 ` [PATCH 12/13] perf: add support for sampling taken branch to perf record (v3) Stephane Eranian
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds:
- ability to parse samples with PERF_SAMPLE_BRANCH_STACK
- sort on branches
- build histograms on branches
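
The sample layout that the parser has to walk can be sketched as below: one
u64 count followed by that many fixed-size entries. This is a minimal sketch
mirroring the evsel.c hunk in this patch, with hypothetical demo_* names and
with the flags bitfield flattened to a plain 64-bit word for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of one entry: from, to, flags. */
struct demo_branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t flags;
};

/* Hypothetical view of a parsed branch stack inside a sample. */
struct demo_branch_view {
	uint64_t nr;		/* number of entries */
	const uint64_t *raw;	/* first u64 of the first entry */
};

/*
 * Parse the branch stack starting at 'array' and return a pointer to
 * the first u64 after it, so the caller can continue with the next field.
 */
static const uint64_t *
demo_parse_branch_stack(const uint64_t *array, struct demo_branch_view *out)
{
	size_t words = sizeof(struct demo_branch_entry) / sizeof(uint64_t);

	out->nr = *array++;	/* nr */
	out->raw = array;

	return array + out->nr * words;
}

/* Copy out entry 'i'; the caller must ensure i < view->nr. */
static struct demo_branch_entry
demo_branch_at(const struct demo_branch_view *view, uint64_t i)
{
	const uint64_t *p = view->raw + i * 3;
	struct demo_branch_entry e = { p[0], p[1], p[2] };

	return e;
}
```

Because the number of entries varies per sample, the parser cannot skip a
fixed size and has to read the count first.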

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/perf.h          |   17 ++
 tools/perf/util/annotate.c |    2 +-
 tools/perf/util/event.h    |    1 +
 tools/perf/util/evsel.c    |   10 ++
 tools/perf/util/hist.c     |   93 +++++++++---
 tools/perf/util/hist.h     |    7 +
 tools/perf/util/session.c  |   72 +++++++++
 tools/perf/util/session.h  |    4 +
 tools/perf/util/sort.c     |  361 +++++++++++++++++++++++++++++++++-----------
 tools/perf/util/sort.h     |    5 +
 tools/perf/util/symbol.h   |   13 ++
 11 files changed, 474 insertions(+), 111 deletions(-)

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 64f8bee..513617c 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -180,6 +180,23 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+struct branch_flags {
+	u64 mispred:1;
+	u64 predicted:1;
+	u64 reserved:62;
+};
+
+struct branch_entry {
+	u64				from;
+	u64				to;
+	struct branch_flags flags;
+};
+
+struct branch_stack {
+	u64				nr;
+	struct branch_entry	entries[0];
+};
+
 extern bool perf_host, perf_guest;
 extern const char perf_version_string[];
 
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 011ed26..8248d80 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -64,7 +64,7 @@ int symbol__inc_addr_samples(struct symbol *sym, struct map *map,
 
 	pr_debug3("%s: addr=%#" PRIx64 "\n", __func__, map->unmap_ip(map, addr));
 
-	if (addr >= sym->end)
+	if (addr >= sym->end || addr < sym->start)
 		return 0;
 
 	offset = addr - sym->start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index cbdeaad..1b19728 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -81,6 +81,7 @@ struct perf_sample {
 	u32 raw_size;
 	void *raw_data;
 	struct ip_callchain *callchain;
+	struct branch_stack *branch_stack;
 };
 
 #define BUILD_ID_SIZE 20
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 667f3b7..472fc8c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -575,6 +575,16 @@ int perf_event__parse_sample(const union perf_event *event, u64 type,
 		data->raw_data = (void *) pdata;
 	}
 
+	if (type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 sz;
+
+		data->branch_stack = (struct branch_stack *)array;
+		array++; /* nr */
+
+		sz = data->branch_stack->nr * sizeof(struct branch_entry);
+		sz /= sizeof(uint64_t);
+		array += sz;
+	}
 	return 0;
 }
 
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6f505d1..66f9936 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -54,9 +54,11 @@ static void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 {
 	u16 len;
 
-	if (h->ms.sym)
-		hists__new_col_len(hists, HISTC_SYMBOL, h->ms.sym->namelen);
-	else {
+	if (h->ms.sym) {
+		int n = (int)h->ms.sym->namelen + 4;
+		int symlen = max(n, BITS_PER_LONG / 4 + 6);
+		hists__new_col_len(hists, HISTC_SYMBOL, symlen);
+	} else {
 		const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
 
 		if (hists__col_len(hists, HISTC_DSO) < unresolved_col_width &&
@@ -195,26 +197,14 @@ static u8 symbol__parent_filter(const struct symbol *parent)
 	return 0;
 }
 
-struct hist_entry *__hists__add_entry(struct hists *hists,
+static struct hist_entry *add_hist_entry(struct hists *hists,
+				      struct hist_entry *entry,
 				      struct addr_location *al,
-				      struct symbol *sym_parent, u64 period)
+				      u64 period)
 {
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hist_entry *he;
-	struct hist_entry entry = {
-		.thread	= al->thread,
-		.ms = {
-			.map	= al->map,
-			.sym	= al->sym,
-		},
-		.cpu	= al->cpu,
-		.ip	= al->addr,
-		.level	= al->level,
-		.period	= period,
-		.parent = sym_parent,
-		.filtered = symbol__parent_filter(sym_parent),
-	};
 	int cmp;
 
 	pthread_mutex_lock(&hists->lock);
@@ -225,7 +215,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 		parent = *p;
 		he = rb_entry(parent, struct hist_entry, rb_node_in);
 
-		cmp = hist_entry__cmp(&entry, he);
+		cmp = hist_entry__cmp(entry, he);
 
 		if (!cmp) {
 			he->period += period;
@@ -239,7 +229,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 			p = &(*p)->rb_right;
 	}
 
-	he = hist_entry__new(&entry);
+	he = hist_entry__new(entry);
 	if (!he)
 		goto out_unlock;
 
@@ -252,6 +242,69 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 	return he;
 }
 
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info *bi,
+					     u64 period){
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= bi->to.map,
+			.sym	= bi->to.sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= bi->to.addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+		.branch_info = bi,
+	};
+	struct hist_entry *he;
+
+	he = add_hist_entry(self, &entry, al, period);
+	if (!he)
+		return NULL;
+
+	/*
+	 * in branch mode, we do not display al->sym, al->addr
+	 * but instead what is in branch_info. The addresses and
+	 * symbols there may need wider columns, so make sure they
+	 * are taken into account.
+	 *
+	 * hists__calc_col_len() tracks the max column width, so
+	 * we need to call it for both the from and to addresses
+	 */
+	entry.ip     = bi->from.addr;
+	entry.ms.map = bi->from.map;
+	entry.ms.sym = bi->from.sym;
+	hists__calc_col_len(self, &entry);
+
+	return he;
+}
+
+struct hist_entry *__hists__add_entry(struct hists *self,
+				      struct addr_location *al,
+				      struct symbol *sym_parent, u64 period)
+{
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= al->map,
+			.sym	= al->sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= al->addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+	};
+
+	return add_hist_entry(self, &entry, al, period);
+}
+
 int64_t
 hist_entry__cmp(struct hist_entry *left, struct hist_entry *right)
 {
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index f55f0a8..f277e7b 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -41,6 +41,7 @@ enum hist_column {
 	HISTC_COMM,
 	HISTC_PARENT,
 	HISTC_CPU,
+	HISTC_MISPREDICT,
 	HISTC_NR_COLS, /* Last entry */
 };
 
@@ -72,6 +73,12 @@ int hist_entry__snprintf(struct hist_entry *self, char *bf, size_t size,
 			 struct hists *hists);
 void hist_entry__free(struct hist_entry *);
 
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info* bi,
+					     u64 period);
+
 void hists__output_resort(struct hists *self);
 void hists__output_resort_threaded(struct hists *hists);
 void hists__collapse_resort(struct hists *self);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index b5ca255..6643224 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -229,6 +229,63 @@ static bool symbol__match_parent_regex(struct symbol *sym)
 	return 0;
 }
 
+static const u8 cpumodes[] = {
+	PERF_RECORD_MISC_USER,
+	PERF_RECORD_MISC_KERNEL,
+	PERF_RECORD_MISC_GUEST_USER,
+	PERF_RECORD_MISC_GUEST_KERNEL
+};
+#define NCPUMODES (sizeof(cpumodes)/sizeof(u8))
+
+static void ip__resolve_ams(struct machine *self, struct thread *thread,
+			    struct addr_map_symbol *ams,
+			    u64 ip)
+{
+	struct addr_location al;
+	size_t i;
+	u8 m;
+
+	memset(&al, 0, sizeof(al));
+
+	for (i = 0; i < NCPUMODES; i++) {
+		m = cpumodes[i];
+		/*
+		 * we cannot use the header.misc hint to determine whether a
+		 * branch stack address is user, kernel, guest, hypervisor.
+		 * Branches may straddle the kernel/user/hypervisor boundaries.
+		 * Thus, we have to try each mode consecutively until we find
+		 * a match or else, the symbol is unknown
+		 */
+		thread__find_addr_location(thread, self, m, MAP__FUNCTION,
+				ip, &al, NULL);
+		if (al.sym)
+			goto found;
+	}
+found:
+	ams->addr = ip;
+	ams->sym = al.sym;
+	ams->map = al.map;
+}
+
+struct branch_info *perf_session__resolve_bstack(struct machine *self,
+						 struct thread *thr,
+						 struct branch_stack *bs)
+{
+	struct branch_info *bi;
+	unsigned int i;
+
+	bi = calloc(bs->nr, sizeof(struct branch_info));
+	if (!bi)
+		return NULL;
+
+	for (i = 0; i < bs->nr; i++) {
+		ip__resolve_ams(self, thr, &bi[i].to, bs->entries[i].to);
+		ip__resolve_ams(self, thr, &bi[i].from, bs->entries[i].from);
+		bi[i].flags = bs->entries[i].flags;
+	}
+	return bi;
+}
+
 int machine__resolve_callchain(struct machine *self, struct perf_evsel *evsel,
 			       struct thread *thread,
 			       struct ip_callchain *chain,
@@ -697,6 +754,18 @@ static void callchain__printf(struct perf_sample *sample)
 		       i, sample->callchain->ips[i]);
 }
 
+static void branch_stack__printf(struct perf_sample *sample)
+{
+	uint64_t i;
+
+	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
+
+	for (i = 0; i < sample->branch_stack->nr; i++)
+		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 "\n",
+			i, sample->branch_stack->entries[i].from,
+			sample->branch_stack->entries[i].to);
+}
+
 static void perf_session__print_tstamp(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_sample *sample)
@@ -744,6 +813,9 @@ static void dump_sample(struct perf_session *session, union perf_event *event,
 
 	if (session->sample_type & PERF_SAMPLE_CALLCHAIN)
 		callchain__printf(sample);
+
+	if (session->sample_type & PERF_SAMPLE_BRANCH_STACK)
+		branch_stack__printf(sample);
 }
 
 static struct machine *
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 37bc383..f407338 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -73,6 +73,10 @@ int perf_session__resolve_callchain(struct perf_session *self, struct perf_evsel
 				    struct ip_callchain *chain,
 				    struct symbol **parent);
 
+struct branch_info *perf_session__resolve_bstack(struct machine *self,
+						 struct thread *thread,
+						 struct branch_stack *bs);
+
 bool perf_session__has_traces(struct perf_session *self, const char *msg);
 
 void mem_bswap_64(void *src, int byte_size);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 16da30d..4c7fe4e 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -8,6 +8,7 @@ const char	default_sort_order[] = "comm,dso,symbol";
 const char	*sort_order = default_sort_order;
 int		sort__need_collapse = 0;
 int		sort__has_parent = 0;
+bool		sort__branch_mode;
 
 enum sort_type	sort__first_dimension;
 
@@ -94,6 +95,26 @@ static int hist_entry__comm_snprintf(struct hist_entry *self, char *bf,
 	return repsep_snprintf(bf, size, "%*s", width, self->thread->comm);
 }
 
+static int64_t _sort__dso_cmp(struct map *map_l, struct map *map_r)
+{
+	struct dso *dso_l = map_l ? map_l->dso : NULL;
+	struct dso *dso_r = map_r ? map_r->dso : NULL;
+	const char *dso_name_l, *dso_name_r;
+
+	if (!dso_l || !dso_r)
+		return cmp_null(dso_l, dso_r);
+
+	if (verbose) {
+		dso_name_l = dso_l->long_name;
+		dso_name_r = dso_r->long_name;
+	} else {
+		dso_name_l = dso_l->short_name;
+		dso_name_r = dso_r->short_name;
+	}
+
+	return strcmp(dso_name_l, dso_name_r);
+}
+
 struct sort_entry sort_comm = {
 	.se_header	= "Command",
 	.se_cmp		= sort__comm_cmp,
@@ -107,36 +128,74 @@ struct sort_entry sort_comm = {
 static int64_t
 sort__dso_cmp(struct hist_entry *left, struct hist_entry *right)
 {
-	struct dso *dso_l = left->ms.map ? left->ms.map->dso : NULL;
-	struct dso *dso_r = right->ms.map ? right->ms.map->dso : NULL;
-	const char *dso_name_l, *dso_name_r;
+	return _sort__dso_cmp(left->ms.map, right->ms.map);
+}
 
-	if (!dso_l || !dso_r)
-		return cmp_null(dso_l, dso_r);
 
-	if (verbose) {
-		dso_name_l = dso_l->long_name;
-		dso_name_r = dso_r->long_name;
-	} else {
-		dso_name_l = dso_l->short_name;
-		dso_name_r = dso_r->short_name;
+static int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r,
+			      u64 ip_l, u64 ip_r)
+{
+	if (!sym_l || !sym_r)
+		return cmp_null(sym_l, sym_r);
+
+	if (sym_l == sym_r)
+		return 0;
+
+	if (sym_l)
+		ip_l = sym_l->start;
+	if (sym_r)
+		ip_r = sym_r->start;
+
+	return (int64_t)(ip_r - ip_l);
+}
+
+static int _hist_entry__dso_snprintf(struct map *map, char *bf,
+				     size_t size, unsigned int width)
+{
+	if (map && map->dso) {
+		const char *dso_name = !verbose ? map->dso->short_name :
+			map->dso->long_name;
+		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
 	}
 
-	return strcmp(dso_name_l, dso_name_r);
+	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
 }
 
 static int hist_entry__dso_snprintf(struct hist_entry *self, char *bf,
 				    size_t size, unsigned int width)
 {
-	if (self->ms.map && self->ms.map->dso) {
-		const char *dso_name = !verbose ? self->ms.map->dso->short_name :
-						  self->ms.map->dso->long_name;
-		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
+	return _hist_entry__dso_snprintf(self->ms.map, bf, size, width);
+}
+
+static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
+				     u64 ip, char level, char *bf, size_t size,
+				     unsigned int width __used)
+{
+	size_t ret = 0;
+
+	if (verbose) {
+		char o = map ? dso__symtab_origin(map->dso) : '!';
+		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
+				       BITS_PER_LONG / 4, ip, o);
 	}
 
-	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
+	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+	if (sym)
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+				       width - ret,
+				       sym->name);
+	else {
+		size_t len = BITS_PER_LONG / 4;
+		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+				       len, ip);
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+				       width - ret, "");
+	}
+
+	return ret;
 }
 
+
 struct sort_entry sort_dso = {
 	.se_header	= "Shared Object",
 	.se_cmp		= sort__dso_cmp,
@@ -144,8 +203,14 @@ struct sort_entry sort_dso = {
 	.se_width_idx	= HISTC_DSO,
 };
 
-/* --sort symbol */
+static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	return _hist_entry__sym_snprintf(self->ms.map, self->ms.sym, self->ip,
+					 self->level, bf, size, width);
+}
 
+/* --sort symbol */
 static int64_t
 sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 {
@@ -154,40 +219,10 @@ sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 	if (!left->ms.sym && !right->ms.sym)
 		return right->level - left->level;
 
-	if (!left->ms.sym || !right->ms.sym)
-		return cmp_null(left->ms.sym, right->ms.sym);
-
-	if (left->ms.sym == right->ms.sym)
-		return 0;
-
 	ip_l = left->ms.sym->start;
 	ip_r = right->ms.sym->start;
 
-	return (int64_t)(ip_r - ip_l);
-}
-
-static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
-				    size_t size, unsigned int width __used)
-{
-	size_t ret = 0;
-
-	if (verbose) {
-		char o = self->ms.map ? dso__symtab_origin(self->ms.map->dso) : '!';
-		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
-				       BITS_PER_LONG / 4, self->ip, o);
-	}
-
-	if (!sort_dso.elide)
-		ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", self->level);
-
-	if (self->ms.sym)
-		ret += repsep_snprintf(bf + ret, size - ret, "%s",
-				       self->ms.sym->name);
-	else
-		ret += repsep_snprintf(bf + ret, size - ret, "%-#*llx",
-				       BITS_PER_LONG / 4, self->ip);
-
-	return ret;
+	return _sort__sym_cmp(left->ms.sym, right->ms.sym, ip_l, ip_r);
 }
 
 struct sort_entry sort_sym = {
@@ -246,6 +281,134 @@ struct sort_entry sort_cpu = {
 	.se_width_idx	= HISTC_CPU,
 };
 
+static int64_t
+sort__dso_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->branch_info->from.map,
+			      right->branch_info->from.map);
+}
+
+static int hist_entry__dso_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->branch_info->from.map,
+					 bf, size, width);
+}
+
+struct sort_entry sort_dso_from = {
+	.se_header	= "Source Shared Object",
+	.se_cmp		= sort__dso_from_cmp,
+	.se_snprintf	= hist_entry__dso_from_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+static int64_t
+sort__dso_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->branch_info->to.map,
+		              right->branch_info->to.map);
+}
+
+static int hist_entry__dso_to_snprintf(struct hist_entry *self, char *bf,
+				       size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->branch_info->to.map,
+					 bf, size, width);
+}
+
+static int64_t
+sort__sym_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *from_l = &left->branch_info->from;
+	struct addr_map_symbol *from_r = &right->branch_info->from;
+
+	if (!from_l->sym && !from_r->sym)
+		return right->level - left->level;
+
+	return _sort__sym_cmp(from_l->sym, from_r->sym, from_l->addr,
+			     from_r->addr);
+}
+
+static int64_t
+sort__sym_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *to_l = &left->branch_info->to;
+	struct addr_map_symbol *to_r = &right->branch_info->to;
+
+	if (!to_l->sym && !to_r->sym)
+		return right->level - left->level;
+
+	return _sort__sym_cmp(to_l->sym, to_r->sym, to_l->addr, to_r->addr);
+}
+
+static int hist_entry__sym_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *from = &self->branch_info->from;
+	return _hist_entry__sym_snprintf(from->map, from->sym, from->addr,
+					 self->level, bf, size, width);
+
+}
+
+static int hist_entry__sym_to_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *to = &self->branch_info->to;
+	return _hist_entry__sym_snprintf(to->map, to->sym, to->addr,
+					 self->level, bf, size, width);
+
+}
+
+struct sort_entry sort_dso_to = {
+	.se_header	= "Target Shared Object",
+	.se_cmp		= sort__dso_to_cmp,
+	.se_snprintf	= hist_entry__dso_to_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+struct sort_entry sort_sym_from = {
+	.se_header	= "Source Symbol",
+	.se_cmp		= sort__sym_from_cmp,
+	.se_snprintf	= hist_entry__sym_from_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+struct sort_entry sort_sym_to = {
+	.se_header	= "Target Symbol",
+	.se_cmp		= sort__sym_to_cmp,
+	.se_snprintf	= hist_entry__sym_to_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+static int64_t
+sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right){
+	const unsigned char mp = left->branch_info->flags.mispred !=
+					right->branch_info->flags.mispred;
+	const unsigned char p = left->branch_info->flags.predicted !=
+					right->branch_info->flags.predicted;
+
+	return mp || p;
+}
+
+static int hist_entry__mispredict_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width){
+	const char *out = "N/A";
+
+	if (self->branch_info->flags.predicted)
+		out = "N";
+	else if (self->branch_info->flags.mispred)
+		out = "Y";
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+struct sort_entry sort_mispredict = {
+	.se_header	= "Branch Mispredicted",
+	.se_cmp		= sort__mispredict_cmp,
+	.se_snprintf	= hist_entry__mispredict_snprintf,
+	.se_width_idx	= HISTC_MISPREDICT,
+};
+
 struct sort_dimension {
 	const char		*name;
 	struct sort_entry	*entry;
@@ -253,14 +416,59 @@ struct sort_dimension {
 };
 
 static struct sort_dimension sort_dimensions[] = {
-	{ .name = "pid",	.entry = &sort_thread,	},
-	{ .name = "comm",	.entry = &sort_comm,	},
-	{ .name = "dso",	.entry = &sort_dso,	},
-	{ .name = "symbol",	.entry = &sort_sym,	},
-	{ .name = "parent",	.entry = &sort_parent,	},
-	{ .name = "cpu",	.entry = &sort_cpu,	},
+	{ .name = "pid",	.entry = &sort_thread,			},
+	{ .name = "comm",	.entry = &sort_comm,			},
+	{ .name = "dso",	.entry = &sort_dso,			},
+	{ .name = "dso_from",	.entry = &sort_dso_from,.taken = true	},
+	{ .name = "dso_to",	.entry = &sort_dso_to,	.taken = true	},
+	{ .name = "symbol",	.entry = &sort_sym,			},
+	{ .name = "symbol_from",.entry = &sort_sym_from,.taken = true	},
+	{ .name = "symbol_to",	.entry = &sort_sym_to,	.taken = true	},
+	{ .name = "parent",	.entry = &sort_parent,			},
+	{ .name = "cpu",	.entry = &sort_cpu,			},
+	{ .name = "mispredict", .entry = &sort_mispredict, },
 };
 
+static int _sort_dimension__add(struct sort_dimension *sd)
+{
+	if (sd->entry->se_collapse)
+		sort__need_collapse = 1;
+
+	if (sd->entry == &sort_parent) {
+		int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
+		if (ret) {
+			char err[BUFSIZ];
+
+			regerror(ret, &parent_regex, err, sizeof(err));
+			pr_err("Invalid regex: %s\n%s", parent_pattern, err);
+			return -EINVAL;
+		}
+		sort__has_parent = 1;
+	}
+
+	if (list_empty(&hist_entry__sort_list)) {
+		if (!strcmp(sd->name, "pid"))
+			sort__first_dimension = SORT_PID;
+		else if (!strcmp(sd->name, "comm"))
+			sort__first_dimension = SORT_COMM;
+		else if (!strcmp(sd->name, "dso"))
+			sort__first_dimension = SORT_DSO;
+		else if (!strcmp(sd->name, "symbol"))
+			sort__first_dimension = SORT_SYM;
+		else if (!strcmp(sd->name, "parent"))
+			sort__first_dimension = SORT_PARENT;
+		else if (!strcmp(sd->name, "cpu"))
+			sort__first_dimension = SORT_CPU;
+		else if (!strcmp(sd->name, "mispredict"))
+			sort__first_dimension = SORT_MISPREDICTED;
+	}
+
+	list_add_tail(&sd->entry->list, &hist_entry__sort_list);
+	sd->taken = 1;
+
+	return 0;
+}
+
 int sort_dimension__add(const char *tok)
 {
 	unsigned int i;
@@ -271,48 +479,21 @@ int sort_dimension__add(const char *tok)
 		if (strncasecmp(tok, sd->name, strlen(tok)))
 			continue;
 
-		if (sd->entry == &sort_parent) {
-			int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
-			if (ret) {
-				char err[BUFSIZ];
-
-				regerror(ret, &parent_regex, err, sizeof(err));
-				pr_err("Invalid regex: %s\n%s", parent_pattern, err);
-				return -EINVAL;
-			}
-			sort__has_parent = 1;
-		}
-
 		if (sd->taken)
 			return 0;
 
-		if (sd->entry->se_collapse)
-			sort__need_collapse = 1;
-
-		if (list_empty(&hist_entry__sort_list)) {
-			if (!strcmp(sd->name, "pid"))
-				sort__first_dimension = SORT_PID;
-			else if (!strcmp(sd->name, "comm"))
-				sort__first_dimension = SORT_COMM;
-			else if (!strcmp(sd->name, "dso"))
-				sort__first_dimension = SORT_DSO;
-			else if (!strcmp(sd->name, "symbol"))
-				sort__first_dimension = SORT_SYM;
-			else if (!strcmp(sd->name, "parent"))
-				sort__first_dimension = SORT_PARENT;
-			else if (!strcmp(sd->name, "cpu"))
-				sort__first_dimension = SORT_CPU;
-		}
-
-		list_add_tail(&sd->entry->list, &hist_entry__sort_list);
-		sd->taken = 1;
 
-		return 0;
+		if (sort__branch_mode && (sd->entry == &sort_dso ||
+					sd->entry == &sort_sym)){
+			int err = _sort_dimension__add(sd + 1);
+			return err ?: _sort_dimension__add(sd + 2);
+		} else if (sd->entry == &sort_mispredict && !sort__branch_mode)
+			break;
+		else
+			return _sort_dimension__add(sd);
 	}
-
 	return -ESRCH;
 }
-
 void setup_sorting(const char * const usagestr[], const struct option *opts)
 {
 	char *tmp, *tok, *str = strdup(sort_order);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 3f67ae3..effcae1 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -31,11 +31,14 @@ extern const char *parent_pattern;
 extern const char default_sort_order[];
 extern int sort__need_collapse;
 extern int sort__has_parent;
+extern bool sort__branch_mode;
 extern char *field_sep;
 extern struct sort_entry sort_comm;
 extern struct sort_entry sort_dso;
 extern struct sort_entry sort_sym;
 extern struct sort_entry sort_parent;
+extern struct sort_entry sort_lbr_dso;
+extern struct sort_entry sort_lbr_sym;
 extern enum sort_type sort__first_dimension;
 
 /**
@@ -72,6 +75,7 @@ struct hist_entry {
 		struct hist_entry *pair;
 		struct rb_root	  sorted_chain;
 	};
+	struct branch_info	*branch_info;
 	struct callchain_root	callchain[0];
 };
 
@@ -82,6 +86,7 @@ enum sort_type {
 	SORT_SYM,
 	SORT_PARENT,
 	SORT_CPU,
+	SORT_MISPREDICTED,
 };
 
 /*
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 123c2e1..6297e88 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -5,6 +5,7 @@
 #include <stdbool.h>
 #include <stdint.h>
 #include "map.h"
+#include "../perf.h"
 #include <linux/list.h>
 #include <linux/rbtree.h>
 #include <stdio.h>
@@ -119,6 +120,18 @@ struct map_symbol {
 	bool	      has_children;
 };
 
+struct addr_map_symbol {
+	struct map    *map;
+	struct symbol *sym;
+	u64	      addr;
+};
+
+struct branch_info {
+	struct addr_map_symbol from;
+	struct addr_map_symbol to;
+	struct branch_flags flags;
+};
+
 struct addr_location {
 	struct thread *thread;
 	struct map    *map;
-- 
1.7.1



* [PATCH 12/13] perf: add support for sampling taken branch to perf record (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (10 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-09 16:49 ` [PATCH 13/13] perf: add support for taken branch sampling to perf report (v3) Stephane Eranian
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds a new option to enable taken branch stack
sampling, i.e., leverage the PERF_SAMPLE_BRANCH_STACK feature
of perf_events.

There is a new option to activate this mode: -b.
It is possible to pass a set of filters to select the type of
branches to sample.

The following filters are available:
- any : any type of branches
- any_call : any function call or system call
- any_ret : any function return or system call return
- ind_call : any indirect call
- u : only when the branch target is at the user level
- k : only when the branch target is in the kernel

Filters can be combined by passing a comma-separated list
to the option:

$ perf record -b any_call,u -e cycles:u branchy

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-record.txt |   18 ++++++++
 tools/perf/builtin-record.c              |   69 ++++++++++++++++++++++++++++++
 tools/perf/perf.h                        |    1 +
 tools/perf/util/evsel.c                  |    4 ++
 4 files changed, 92 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 2937f7e..69068d0 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -148,6 +148,24 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
 corresponding events, i.e., they always refer to events defined earlier on the command
 line.
 
+-b::
+--branch-stack::
+Enable taken branch stack sampling. Each sample captures a series of consecutive
+taken branches. The number of branches captured with each sample depends on the
+underlying hardware, the type of branches of interest and the executed code.
+It is possible to filter the types of branches by enabling filters. The
+following filters are defined: any (any type of branches), any_call (any function
+call or system call), any_ret (any function return or system call return), ind_call
+(any indirect call), u (only when the branch target is at the user level), k (only when
+the branch target is in the kernel). At least one of any, any_call, any_ret, ind_call
+must be provided. The privilege levels may be omitted, in which case the privilege
+levels of the associated event are applied to the branch filter. When sampling on multiple
+events, branch stack sampling is enabled for all the sampling events. The sampled branch
+type is the same for all events. The privilege levels are adjusted based on those of
+the associated event unless specified explicitly with this option. Note that taken
+branch sampling may not be available on all processors. The various filters must
+be specified as a comma-separated list: -b any_ret,u,k
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0abfb18..df79d23 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -636,6 +636,72 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
 	return err;
 }
 
+#define BRANCH_OPT(n, m) \
+	{ .name = n, .mode = (m) }
+
+#define BRANCH_END { .name = NULL }
+
+struct branch_mode {
+	const char *name;
+	int mode;
+};
+
+static const struct branch_mode branch_modes[]={
+	BRANCH_OPT("u", PERF_SAMPLE_BRANCH_USER),
+	BRANCH_OPT("k", PERF_SAMPLE_BRANCH_KERNEL),
+	BRANCH_OPT("any", PERF_SAMPLE_BRANCH_ANY),
+	BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
+	BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
+	BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+	BRANCH_END
+};
+
+static int
+parse_branch_stack(const struct option *opt, const char *str, int unset __used)
+{
+#define ONLY_PLM (PERF_SAMPLE_BRANCH_USER|PERF_SAMPLE_BRANCH_KERNEL)
+	uint64_t *mode = (uint64_t *)opt->value;
+	const struct branch_mode *br;
+	char *s, *os, *p;
+	int ret = -1;
+
+	*mode = 0;
+
+	/* because str is read-only */
+	s = os = strdup(str);
+	if (!s)
+		return -1;
+
+	for (;;) {
+		p = strchr(s, ',');
+		if (p)
+			*p = '\0';
+
+		for (br = branch_modes; br->name; br++) {
+			if (!strcasecmp(s, br->name))
+				break;
+		}
+		if (!br->name)
+			goto error;
+
+		*mode |= br->mode;
+
+		if (!p)
+			break;
+
+		s = p + 1;
+	}
+	ret = 0;
+
+	if ((*mode & ~ONLY_PLM) == 0) {
+		error("need at least one branch type with -b\n");
+		ret = -1;
+	}
+error:
+	free(os);
+	return ret;
+}
+
 static const char * const record_usage[] = {
 	"perf record [<options>] [<command>]",
 	"perf record [<options>] -- <command> [<options>]",
@@ -727,6 +793,9 @@ const struct option record_options[] = {
 	OPT_CALLBACK('G', "cgroup", &record.evlist, "name",
 		     "monitor event in cgroup name only",
 		     parse_cgroups),
+	OPT_CALLBACK('b', "branch-stack", &record.opts.branch_stack, "branch mode mask",
+		     "branch stack sampling modes",
+		     parse_branch_stack),
 	OPT_END()
 };
 
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 513617c..dec5b4c 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -221,6 +221,7 @@ struct perf_record_opts {
 	unsigned int freq;
 	unsigned int mmap_pages;
 	unsigned int user_freq;
+	u64	     branch_stack;
 	u64	     default_interval;
 	u64	     user_interval;
 	const char   *cpu_list;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 472fc8c..a65a53c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -126,6 +126,10 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
 		attr->watermark = 0;
 		attr->wakeup_events = 1;
 	}
+	if (opts->branch_stack) {
+		attr->sample_type	|= PERF_SAMPLE_BRANCH_STACK;
+		attr->branch_sample_type = opts->branch_stack;
+	}
 
 	attr->mmap = track;
 	attr->comm = track;
-- 
1.7.1



* [PATCH 13/13] perf: add support for taken branch sampling to perf report (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (11 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 12/13] perf: add support for sampling taken branch to perf record (v3) Stephane Eranian
@ 2012-01-09 16:49 ` Stephane Eranian
  2012-01-23 10:14 ` [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
  2012-01-27 12:09 ` Peter Zijlstra
  14 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-09 16:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds support for taken branch sampling, i.e., the
PERF_SAMPLE_BRANCH_STACK feature, to perf report. In other
words, it displays histograms based on taken branches rather
than executed instruction addresses.

The new option is called -b and it takes no argument. To
generate meaningful output, the perf.data must have been
obtained using perf record -b xxx ... where xxx is a branch
filter option.

The output shows symbols, modules, sorted by 'who branches
where' the most often. The percentages reported in the first
column refer to the total number of branches captured and
not the usual number of samples.

Here is a quick example.
Here branchy is a simple test program which looks as follows:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

Here is the output captured on Nehalem, if we are
only interested in user level function calls.

$ perf record -b any_call,u -e cycles:u branchy

$ perf report -b --sort=symbol
    52.34%  [.] main                   [.] f1
    24.04%  [.] f1                     [.] f3
    23.60%  [.] f1                     [.] f2
     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
     0.01%  [k] __printf               [k] _IO_vfprintf_internal
     0.01%  [k] main                   [k] __printf

About half (52%) of the call branches captured are from main() -> f1().
The second half (24%+23%) is split in two equal shares between
f1() -> f2() and f1() -> f3(). The output is as expected given the code.

Note that using -b in perf record does not eliminate any
information from the perf.data file. Consequently, a typical profile
can still be obtained with perf report by simply not using its -b option.

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-report.txt |    7 ++
 tools/perf/builtin-report.c              |   95 +++++++++++++++++++++++++++---
 2 files changed, 94 insertions(+), 8 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 9b430e9..19b9092 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -153,6 +153,13 @@ OPTIONS
 	information which may be very large and thus may clutter the display.
 	It currently includes: cpu and numa topology of the host system.
 
+-b::
+--branch-stack::
+	Use the addresses of sampled taken branches instead of the instruction
+	address to build the histograms. To generate meaningful output, the
+	perf.data file must have been obtained using perf record -b xxx where
+	xxx is a branch filter option.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-annotate[1]
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 25d34d4..fb8194b 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -53,6 +53,50 @@ struct perf_report {
 	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 };
 
+static int perf_session__add_branch_hist_entry(struct perf_tool *tool,
+					struct addr_location *al,
+					struct perf_sample *sample,
+					struct perf_evsel *evsel,
+				      struct machine *machine)
+{
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
+	struct symbol *parent = NULL;
+	int err = 0;
+	unsigned i;
+	struct hist_entry *he;
+	struct branch_info *bi;
+
+	if ((sort__has_parent || symbol_conf.use_callchain)
+	    && sample->callchain) {
+		err = machine__resolve_callchain(machine, evsel, al->thread,
+						 sample->callchain, &parent);
+		if (err)
+			return err;
+	}
+
+	bi = perf_session__resolve_bstack(machine, al->thread,
+					  sample->branch_stack);
+	if (!bi)
+		return -ENOMEM;
+
+	for (i = 0; i < sample->branch_stack->nr; i++) {
+		if (rep->hide_unresolved && !(bi[i].from.sym && bi[i].to.sym))
+			continue;
+		/*
+		 * The report shows the percentage of total branches captured
+		 * and not events sampled. Thus we use a pseudo period of 1.
+		 */
+		he = __hists__add_branch_entry(&evsel->hists, al, parent,
+					       &bi[i], 1);
+		if (he) {
+			evsel->hists.stats.total_period += 1;
+			hists__inc_nr_events(&evsel->hists, PERF_RECORD_SAMPLE);
+		} else
+			return -ENOMEM;
+	}
+	return err;
+}
+
 static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
 				      struct addr_location *al,
 				      struct perf_sample *sample,
@@ -126,14 +170,21 @@ static int process_sample_event(struct perf_tool *tool,
 	if (rep->cpu_list && !test_bit(sample->cpu, rep->cpu_bitmap))
 		return 0;
 
-	if (al.map != NULL)
-		al.map->dso->hit = 1;
+	if (sort__branch_mode) {
+		if (perf_session__add_branch_hist_entry(tool, &al, sample,
+						    evsel, machine)) {
+			pr_debug("problem adding lbr entry, skipping event\n");
+			return -1;
+		}
+	} else {
+		if (al.map != NULL)
+			al.map->dso->hit = 1;
 
-	if (perf_evsel__add_hist_entry(evsel, &al, sample, machine)) {
-		pr_debug("problem incrementing symbol period, skipping event\n");
-		return -1;
+		if (perf_evsel__add_hist_entry(evsel, &al, sample, machine)) {
+			pr_debug("problem incrementing symbol period, skipping event\n");
+			return -1;
+		}
 	}
-
 	return 0;
 }
 
@@ -188,6 +239,15 @@ static int perf_report__setup_sample_type(struct perf_report *rep)
 			}
 	}
 
+	if (sort__branch_mode) {
+		if (!(self->sample_type & PERF_SAMPLE_BRANCH_STACK)) {
+			fprintf(stderr, "selected -b but no branch data."
+					" Did you call perf record without"
+					" -b?\n");
+			return -1;
+		}
+	}
+
 	return 0;
 }
 
@@ -517,6 +577,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 		   "Specify disassembler style (e.g. -M intel for intel syntax)"),
 	OPT_BOOLEAN(0, "show-total-period", &symbol_conf.show_total_period,
 		    "Show a column with the sum of periods"),
+	OPT_BOOLEAN('b', "branch-stack", &sort__branch_mode,
+		    "use branch records for histogram filling"),
 	OPT_END()
 	};
 
@@ -537,10 +599,27 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 			report.input_name = "perf.data";
 	}
 
-	if (strcmp(report.input_name, "-") != 0)
+	if (sort__branch_mode) {
+		if (use_browser)
+			fprintf(stderr, "Warning: TUI interface not supported"
+					" in branch mode\n");
+		if (symbol_conf.dso_list_str != NULL)
+			fprintf(stderr, "Warning: dso filtering not supported"
+					" in branch mode\n");
+		if (symbol_conf.sym_list_str != NULL)
+			fprintf(stderr, "Warning: symbol filtering not"
+					" supported in branch mode\n");
+
+		report.use_stdio = true;
 		setup_browser(true);
-	else
 		use_browser = 0;
+		symbol_conf.dso_list_str = NULL;
+		symbol_conf.sym_list_str = NULL;
+	} else if (strcmp(report.input_name, "-") != 0) {
+		setup_browser(true);
+	} else {
+		use_browser = 0;
+	}
 
 	/*
 	 * Only in the newt browser we are doing integrated annotation,
-- 
1.7.1



* Re: [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3)
  2012-01-09 16:49 ` [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3) Stephane Eranian
@ 2012-01-10  1:25   ` Arun Sharma
  2012-01-10 15:43     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Arun Sharma @ 2012-01-10  1:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, ravitillo, vweaver1

On 1/9/12 8:49 AM, Stephane Eranian wrote:
> From: Roberto Agostino Vitillo<ravitillo@lbl.gov>
>
> This patch adds:
> - ability to parse samples with PERF_SAMPLE_BRANCH_STACK
> - sort on branches
> - build histograms on branches
>
[..]
>   static struct sort_dimension sort_dimensions[] = {
> -	{ .name = "pid",	.entry =&sort_thread,	},
> -	{ .name = "comm",	.entry =&sort_comm,	},
> -	{ .name = "dso",	.entry =&sort_dso,	},
> -	{ .name = "symbol",	.entry =&sort_sym,	},
> -	{ .name = "parent",	.entry =&sort_parent,	},
> -	{ .name = "cpu",	.entry =&sort_cpu,	},
> +	{ .name = "pid",	.entry =&sort_thread,			},
> +	{ .name = "comm",	.entry =&sort_comm,			},
> +	{ .name = "dso",	.entry =&sort_dso,			},
> +	{ .name = "dso_from",	.entry =&sort_dso_from,.taken = true	},
> +	{ .name = "dso_to",	.entry =&sort_dso_to,	.taken = true	},
> +	{ .name = "symbol",	.entry =&sort_sym,			},
> +	{ .name = "symbol_from",.entry =&sort_sym_from,.taken = true	},
> +	{ .name = "symbol_to",	.entry =&sort_sym_to,	.taken = true	},
> +	{ .name = "parent",	.entry =&sort_parent,			},
> +	{ .name = "cpu",	.entry =&sort_cpu,			},
> +	{ .name = "mispredict", .entry =&sort_mispredict, },
>   };

The new sort dimensions don't seem to show up in perf report -h. Could 
you please update the help text?

Also:

# perf script -h

     -f, --fields <str>    comma separated output fields prepend with 
'type:'. Valid types: hw,sw,trace,raw. Fields: 
comm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr

You probably want to add a field here, so I could:

perf record -b any_call,u -e cycles:u
perf script -f event,branch_stack

and examine raw (symbolized) samples like I can with

perf record -g
perf script -f event,ip,sym

  -Arun



* Re: [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3)
  2012-01-10  1:25   ` Arun Sharma
@ 2012-01-10 15:43     ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-10 15:43 UTC (permalink / raw)
  To: Arun Sharma
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, ravitillo, vweaver1

On Tue, Jan 10, 2012 at 2:25 AM, Arun Sharma <asharma@fb.com> wrote:
> On 1/9/12 8:49 AM, Stephane Eranian wrote:
>>
>> From: Roberto Agostino Vitillo<ravitillo@lbl.gov>
>>
>> This patch adds:
>> - ability to parse samples with PERF_SAMPLE_BRANCH_STACK
>> - sort on branches
>> - build histograms on branches
>>
> [..]
>>
>>  static struct sort_dimension sort_dimensions[] = {
>> -       { .name = "pid",        .entry =&sort_thread,   },
>> -       { .name = "comm",       .entry =&sort_comm,     },
>> -       { .name = "dso",        .entry =&sort_dso,      },
>> -       { .name = "symbol",     .entry =&sort_sym,      },
>> -       { .name = "parent",     .entry =&sort_parent,   },
>> -       { .name = "cpu",        .entry =&sort_cpu,      },
>> +       { .name = "pid",        .entry =&sort_thread,                   },
>> +       { .name = "comm",       .entry =&sort_comm,                     },
>> +       { .name = "dso",        .entry =&sort_dso,                      },
>> +       { .name = "dso_from",   .entry =&sort_dso_from,.taken = true    },
>> +       { .name = "dso_to",     .entry =&sort_dso_to,   .taken = true   },
>> +       { .name = "symbol",     .entry =&sort_sym,                      },
>> +       { .name = "symbol_from",.entry =&sort_sym_from,.taken = true    },
>> +       { .name = "symbol_to",  .entry =&sort_sym_to,   .taken = true   },
>> +       { .name = "parent",     .entry =&sort_parent,                   },
>> +       { .name = "cpu",        .entry =&sort_cpu,                      },
>>
>> +       { .name = "mispredict", .entry =&sort_mispredict, },
>>  };
>
>
> The new sort dimensions don't seem to show up in perf report -h. Could you
> please update the help text?
>
I can do this, with the understanding that those dimensions are only
available when you use branch sampling.

> Also:
>
> # perf script -h
>
>    -f, --fields <str>    comma separated output fields prepend with 'type:'.
> Valid types: hw,sw,trace,raw. Fields:
> comm,tid,pid,time,cpu,event,trace,ip,sym,dso,addr
>
> You probably want to add a field here, so I could:
>
> perf record -b any_call,u -e cycles:u
> perf script -f event,branch_stack
>
> and examine raw (symbolized) samples like I can with
>
> perf record -g
> perf script -f event,ip,sym
>
Ok, I'll look into this. Looks like a useful command for automating processing.

>  -Arun
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (12 preceding siblings ...)
  2012-01-09 16:49 ` [PATCH 13/13] perf: add support for taken branch sampling to perf report (v3) Stephane Eranian
@ 2012-01-23 10:14 ` Stephane Eranian
  2012-01-23 12:25   ` Peter Zijlstra
  2012-01-27 12:09 ` Peter Zijlstra
  14 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-23 10:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1

Any comments on this patch set?


On Mon, Jan 9, 2012 at 5:49 PM, Stephane Eranian <eranian@google.com> wrote:
>
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
>
> Statistical sampling of taken branches should not be confused
> with branch tracing. Not all branches are necessarily captured.
>
> Sampling taken branches is important for basic block profiling,
> statistical call graph, function call counts. Many of those
> measurements can help drive a compiler optimizer.
>
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
>
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
>
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
>
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
>
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
>
> Branch sampling is always coupled with an event. It can
> be any PMU event but it can't be a SW or tracepoint event.
>
> Branch sampling is requested by setting a new sample_type
> flag called: PERF_SAMPLE_BRANCH_STACK.
>
> To support branch filtering, we introduce a new field
> to the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1, config2 field because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
>
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : capture branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : capture branches when target is at kernel level
> - PERF_SAMPLE_BRANCH_ANY_CALL: capture call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : capture return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: capture indirect calls
>
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
>
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
>
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filtering restrictions. When precise_sampling=1, then
> there are no filtering restrictions. When precise_sampling > 1,
> then only ANY|USER|KERNEL filter can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
>
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly perf report is enhanced to display a histogram
> of taken branches.
>
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
>
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
>
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>  if (n & 1UL)
>    f2();
>  else
>    f3();
> }
> int main(void)
> {
>  unsigned long i;
>
>  for (i=0; i < N; i++)
>   f1(i);
>  return 0;
> }
>
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
>
>    18.13%  [.] f1            [.] main
>    18.10%  [.] main          [.] main
>    18.01%  [.] main          [.] f1
>    15.69%  [.] f1            [.] f1
>     9.11%  [.] f3            [.] f1
>     6.78%  [.] f1            [.] f3
>     6.74%  [.] f1            [.] f2
>     6.71%  [.] f2            [.] f1
>
> Of the total number of branches captured, 18.13% were from f1() -> main().
>
> Let's make this clearer by filtering the user call branches only:
>
> $ perf record -b any_call -e cycles:u branchy
> $ perf report -b
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>    52.50%  [.] main                   [.] f1
>    23.99%  [.] f1                     [.] f3
>    23.48%  [.] f1                     [.] f2
>     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>     0.01%  [k] _start                 [k] __libc_start_main
>
> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3() which is expected given
> that f1() dispatches based on odd vs. even values of n which is constantly increasing.
>
>
> Here is a kernel example, where we want to sample indirect calls:
> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
> $ perf report -b
> #
> # Overhead  Source Symbol               Target Symbol
> # ........  ..........................  ..........................
> #
>    36.36%  [k] __delay                 [k] delay_tsc
>     9.09%  [k] ktime_get               [k] read_tsc
>     9.09%  [k] getnstimeofday          [k] read_tsc
>     9.09%  [k] notifier_call_chain     [k] tick_notify
>     4.55%  [k] cpuidle_idle_call       [k] intel_idle
>     4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>     2.27%  [k] handle_irq              [k] handle_edge_irq
>     2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>     2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>     2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>     2.27%  [k] enqueue_task            [k] enqueue_task_rt
>     2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>     2.27%  [k] do_timer                [k] read_tsc
>
> Due to HW limitations, branch filtering may be approximate on
> Core, Atom processors. It is more accurate on Nehalem, Westmere
> and best on Sandy Bridge.
>
> In version 2, we've updated the patch to tip/master (commit 5734857) and
> we've incorporated the feedback from v1 concerning the anonymous bitfield
> struct for branch_stack_entry and the handling of i386 ABI binaries
> on 64-bit hosts in the instr decoder for the LBR SW filter.
>
> In version 3, we've updated to 3.2.0-tip. The Atom revision
> check has been put into its own patch. We fixed a browser
> issue with perf report. We fixed all the style issues as well.
>
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>
> Roberto Agostino Vitillo (3):
>  perf: add code to support PERF_SAMPLE_BRANCH_STACK
>  perf: add support for sampling taken branch to perf record
>  perf: add support for taken branch sampling to perf report
>
> Stephane Eranian (10):
>  perf_events: add generic taken branch sampling support (v3)
>  perf_events: add Intel LBR MSR definitions
>  perf_events: add Intel X86 LBR sharing logic
>  perf_events: sync branch stack sampling with X86 precise_sampling
>  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
>  perf_events: disable LBR support for older Intel Atom processors
>  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
>  perf_events: add LBR software filter support for Intel X86
>  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
>  perf_events: add hook to flush branch_stack on context switch
>
>  arch/alpha/kernel/perf_event.c             |    4 +
>  arch/arm/kernel/perf_event.c               |    4 +
>  arch/mips/kernel/perf_event_mipsxx.c       |    4 +
>  arch/powerpc/kernel/perf_event.c           |    4 +
>  arch/sh/kernel/perf_event.c                |    4 +
>  arch/sparc/kernel/perf_event.c             |    4 +
>  arch/x86/include/asm/msr-index.h           |    7 +
>  arch/x86/kernel/cpu/perf_event.c           |   47 +++-
>  arch/x86/kernel/cpu/perf_event.h           |   19 +
>  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
>  arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  525 ++++++++++++++++++++++++++--
>  include/linux/perf_event.h                 |   78 ++++-
>  kernel/events/core.c                       |  167 +++++++++
>  kernel/events/hw_breakpoint.c              |    6 +
>  tools/perf/Documentation/perf-record.txt   |   18 +
>  tools/perf/Documentation/perf-report.txt   |    7 +
>  tools/perf/builtin-record.c                |   69 ++++
>  tools/perf/builtin-report.c                |   95 +++++-
>  tools/perf/perf.h                          |   18 +
>  tools/perf/util/annotate.c                 |    2 +-
>  tools/perf/util/event.h                    |    1 +
>  tools/perf/util/evsel.c                    |   14 +
>  tools/perf/util/hist.c                     |   93 ++++-
>  tools/perf/util/hist.h                     |    7 +
>  tools/perf/util/session.c                  |   72 ++++
>  tools/perf/util/session.h                  |    4 +
>  tools/perf/util/sort.c                     |  361 ++++++++++++++-----
>  tools/perf/util/sort.h                     |    5 +
>  tools/perf/util/symbol.h                   |   13 +
>  31 files changed, 1601 insertions(+), 196 deletions(-)
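
For reference, the privilege-level rule quoted above ("When the privilege
level is not specified, the branch stack inherits that of the associated
event") can be sketched in C as follows. This is only an illustration, not
the code from the patchset: the DEMO_* constants mirror the filter bits
from patch 01, and demo_effective_branch_filter() with its exclude_user and
exclude_kernel parameters is a hypothetical helper standing in for the
event's own privilege settings.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of the branch_sample_type bits defined in patch 01. */
#define DEMO_BRANCH_USER	(1U << 0)
#define DEMO_BRANCH_KERNEL	(1U << 1)
#define DEMO_BRANCH_ANY		(1U << 2)
#define DEMO_BRANCH_ANY_CALL	(1U << 3)
#define DEMO_BRANCH_ANY_RETURN	(1U << 4)
#define DEMO_BRANCH_IND_CALL	(1U << 5)
#define DEMO_BRANCH_MAX		(1U << 6)

#define DEMO_BRANCH_PLM_ALL	(DEMO_BRANCH_USER | DEMO_BRANCH_KERNEL)

/*
 * Hypothetical helper: return the filter mask that would be applied.
 * If the caller gave no priv level bits, inherit them from the event,
 * described here by its exclude_user / exclude_kernel settings.
 */
static uint64_t demo_effective_branch_filter(uint64_t branch_sample_type,
					     int exclude_user,
					     int exclude_kernel)
{
	/* This sketch only knows the six filters listed above. */
	assert((branch_sample_type & ~(uint64_t)(DEMO_BRANCH_MAX - 1)) == 0);

	/* Explicit priv level bits win; nothing is inherited. */
	if (branch_sample_type & DEMO_BRANCH_PLM_ALL)
		return branch_sample_type;

	if (!exclude_user)
		branch_sample_type |= DEMO_BRANCH_USER;
	if (!exclude_kernel)
		branch_sample_type |= DEMO_BRANCH_KERNEL;

	return branch_sample_type;
}
```

With these assumptions, "-b any_call -e cycles:u" corresponds to
demo_effective_branch_filter(DEMO_BRANCH_ANY_CALL, 0, 1), which yields
ANY_CALL|USER, whereas an explicit IND_CALL|KERNEL is left untouched even
for a user-only event.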

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-23 10:14 ` [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
@ 2012-01-23 12:25   ` Peter Zijlstra
  2012-01-23 15:07     ` Stephane Eranian
  2012-01-23 17:14     ` Stephane Eranian
  0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2012-01-23 12:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, robert.richter, ming.m.lin, andi,
	asharma, ravitillo, vweaver1

On Mon, 2012-01-23 at 11:14 +0100, Stephane Eranian wrote:
> Any comments on this patch set?
> 
Queued it, thanks!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-23 12:25   ` Peter Zijlstra
@ 2012-01-23 15:07     ` Stephane Eranian
  2012-01-23 15:47       ` Andi Kleen
  2012-01-23 17:14     ` Stephane Eranian
  1 sibling, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-23 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, robert.richter, ming.m.lin, andi,
	asharma, ravitillo, vweaver1

On Mon, Jan 23, 2012 at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, 2012-01-23 at 11:14 +0100, Stephane Eranian wrote:
> > Any comments on this patch set?
> >
> Queued it, thanks!


Great, thanks.

Now, we still need to make progress on:
  - fixing throttling
  - PEBS-LL
  - IBS
  - event scheduling
  - LWP

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-23 15:07     ` Stephane Eranian
@ 2012-01-23 15:47       ` Andi Kleen
  0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2012-01-23 15:47 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, linux-kernel, mingo, acme, robert.richter,
	ming.m.lin, andi, asharma, ravitillo, vweaver1

> Now, we still need to make progress on:
>   - fixing throttling
>   - PEBS-LL

    Raw PEBS really too

>   - IBS
>   - event scheduling
>   - LWP

     - Back-to-back events

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-23 12:25   ` Peter Zijlstra
  2012-01-23 15:07     ` Stephane Eranian
@ 2012-01-23 17:14     ` Stephane Eranian
  2012-01-24 15:39       ` Stephane Eranian
  1 sibling, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-23 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, robert.richter, ming.m.lin, andi,
	asharma, ravitillo, vweaver1

On Mon, Jan 23, 2012 at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2012-01-23 at 11:14 +0100, Stephane Eranian wrote:
>> Any comments on this patch set?
>>
> Queued it, thanks!

One thing that needs to happen with this branch sampling patch is that
Arnaldo needs to apply the patch I developed with dsahern@ that changes
the magic number to something that can determine the endianness. I don't
know what happened to this patch. It's been over 6 months since I posted
it and it's still not in the perf source tree.

The issue here is that with branch stack sampling, the size of
perf_event_attr has now changed (an extra u64). As such, if this new perf
tries to read an old perf.data file, it is going to think the file uses a
different endianness; it will swap the attr_size and still find it
different from its own version of the struct. That leads to an
"incompatible file format" error.

We need to separate endianness detection from attr_size. If attr_size in
perf.data is smaller than sizeof(struct perf_event_attr), then just zero
out the extra (unused) fields. With that in place, this new perf will be
able to read older perf.data files.
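
A minimal sketch of the size handling described above, in C. It is not the
perf tool's code: struct demo_attr and demo_read_attr() are hypothetical
stand-ins for struct perf_event_attr and the perf.data header reader, and
rejecting a file attr that is larger than the reader's struct is a
simplifying assumption. Endianness is assumed to have been settled
beforehand, from the magic number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct perf_event_attr; only the shape matters. */
struct demo_attr {
	uint32_t type;
	uint32_t size;			/* struct size as seen by the producer */
	uint64_t config;
	uint64_t sample_type;
	uint64_t branch_sample_type;	/* field added by the newer revision */
};

/*
 * Copy an attr of on_disk_size bytes from buf into *dst.
 * Returns 0 on success, or -1 if the file's attr is larger than ours,
 * i.e. the file comes from a newer tool than this reader (assumption:
 * this sketch simply refuses such files).
 */
static int demo_read_attr(struct demo_attr *dst, const void *buf,
			  size_t on_disk_size)
{
	assert(dst != NULL && buf != NULL);

	if (on_disk_size > sizeof(*dst))
		return -1;

	/* Fields the older file does not carry read back as zero. */
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, buf, on_disk_size);
	dst->size = (uint32_t)sizeof(*dst);

	return 0;
}
```

Reading a 24-byte attr written before branch_sample_type existed then
leaves branch_sample_type at 0, which means no branch filter was requested.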

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-23 17:14     ` Stephane Eranian
@ 2012-01-24 15:39       ` Stephane Eranian
  2012-01-24 16:08         ` David Ahern
  0 siblings, 1 reply; 36+ messages in thread
From: Stephane Eranian @ 2012-01-24 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, robert.richter, ming.m.lin, andi,
	asharma, ravitillo, vweaver1

Hi,

The branch stack sampling patch exposes a flaw in the sampling
buffer format as currently exported by the kernel.

In the current format, sample records (RECORD_SAMPLE) are NOT
self-describing. That means that by looking at the fixed size header, it
is not possible to determine which event caused the sample to be recorded
and what's in the body of the variable length sample.

Such introspection is only possible once we know the event unique id
(PERF_SAMPLE_ID). But to get the event ID, we need to parse the
sample. But, given that a sample has a variable length, there is no
predefined position for that ID in the sample. You have a chicken
and egg problem here. There is no room left in the fixed size header
to fit this in.

This works today with perf because perf applies the SAME sample_type
to ALL events, i.e., all events have the same body layout. This is a
limitation of the tool. The kernel API clearly allows more flexibility,
but it is hindered by the problem I described above.

With branch sampling, this becomes more problematic because if you
are sampling on many events, it may not be necessary nor useful to capture
the branch sample stack for each event. With existing HW, the branch
stack uses at most 264 bytes in a sample. You'd be consuming the buffer
space much faster for nothing.

We need to solve this problem yet maintain backward compatibility with
older versions of the tools.

The kernel sampling buffer format needs to evolve to have a fixed size
header that is more self-describing. The header somehow needs to contain
the event ID or the sample_type for type=RECORD_SAMPLE. I would prefer the
former because if we ever need more than 64 bits for sample_type, we would
have the same problem again. Having the event ID requires that it be
generated systematically. That is not the case today.

That new buffer format could be requested, as a flag, when the event is created.
That would ensure backward compatibility.

An alternative would be to find a way to encode the event ID at a known position
somehow in the body of a RECORD_SAMPLE. But I don't see how that would be
possible (given there is already the sample_id_all stuff).

Any comments?
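
To make the chicken-and-egg problem concrete, here is a small C sketch. It
assumes the sample body layout as I understand it at the time of this
thread (ip, pid/tid, time, addr, then id, each occupying 8 bytes when its
sample_type bit is set). The DEMO_* constants are local mirrors of the
corresponding PERF_SAMPLE_* bits and demo_sample_id() is a hypothetical
helper, not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the sample_type bits that matter for locating the ID. */
#define DEMO_SAMPLE_IP		(1ULL << 0)
#define DEMO_SAMPLE_TID		(1ULL << 1)
#define DEMO_SAMPLE_TIME	(1ULL << 2)
#define DEMO_SAMPLE_ADDR	(1ULL << 3)
#define DEMO_SAMPLE_ID		(1ULL << 6)

/*
 * Extract the event ID from the body of a RECORD_SAMPLE.
 * The position of the ID depends on which earlier fields are present,
 * so the caller must already know the sample_type of the event.
 * Returns 0 on success, -1 if the sample carries no ID at all.
 */
static int demo_sample_id(const uint64_t *body, uint64_t sample_type,
			  uint64_t *id)
{
	size_t idx = 0;

	assert(body != NULL && id != NULL);

	if (!(sample_type & DEMO_SAMPLE_ID))
		return -1;

	if (sample_type & DEMO_SAMPLE_IP)
		idx++;
	if (sample_type & DEMO_SAMPLE_TID)
		idx++;		/* u32 pid + u32 tid share one u64 slot */
	if (sample_type & DEMO_SAMPLE_TIME)
		idx++;
	if (sample_type & DEMO_SAMPLE_ADDR)
		idx++;

	*id = body[idx];
	return 0;
}
```

The same bytes yield a different "ID" depending on the sample_type passed
in, which is exactly why the ID cannot be used to look up the sample_type
when events in one buffer use different layouts.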

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-24 15:39       ` Stephane Eranian
@ 2012-01-24 16:08         ` David Ahern
  2012-01-24 17:42           ` Stephane Eranian
  2012-01-26 16:21           ` Stephane Eranian
  0 siblings, 2 replies; 36+ messages in thread
From: David Ahern @ 2012-01-24 16:08 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, linux-kernel, mingo, acme, robert.richter,
	ming.m.lin, andi, asharma, ravitillo, vweaver1


On 01/24/2012 08:39 AM, Stephane Eranian wrote:
> Hi,
> 
> The branch stack sampling patch exposes a flaw in the sampling
> buffer format as currently exported by the kernel.
> 
> In the current format, sample records (RECORD_SAMPLE) are NOT
> self-describing. That means that by looking at the fixed size header, it
> is not possible to determine which event caused the sample to be recorded
> and what's in the body of the variable length sample.
> 
> Such introspection is only possible once we know the event unique id
> (PERF_SAMPLE_ID). But to get the event ID, we need to parse the
> sample. But, given that a sample has a variable length, there is no
> predefined position for that ID in the sample. You have a chicken
> and egg problem here. There is no room left in the fixed size header
> to fit this in.

I brought this up last Fall as well. As I recall, the response was to move
each sample_type based stream into its own data file and then merge the
data files while processing.

David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-24 16:08         ` David Ahern
@ 2012-01-24 17:42           ` Stephane Eranian
  2012-01-26 16:21           ` Stephane Eranian
  1 sibling, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-24 17:42 UTC (permalink / raw)
  To: David Ahern
  Cc: Peter Zijlstra, linux-kernel, mingo, acme, robert.richter,
	ming.m.lin, andi, asharma, ravitillo, vweaver1

On Tue, Jan 24, 2012 at 5:08 PM, David Ahern <dsahern@gmail.com> wrote:
>
> On 01/24/2012 08:39 AM, Stephane Eranian wrote:
>> Hi,
>>
>> The branch stack sampling patch exposes a flaw in the sampling
>> buffer format as currently exported by the kernel.
>>
>> In the current format, sample records (RECORD_SAMPLE) are NOT
>> self-describing. That means that by looking at the fixed size header, it
>> is not possible to determine which event caused the sample to be recorded
>> and what's in the body of the variable length sample.
>>
>> Such introspection is only possible once we know the event unique id
>> (PERF_SAMPLE_ID). But to get the event ID, we need to parse the
>> sample. But, given that a sample has a variable length, there is no
>> predefined position for that ID in the sample. You have a chicken
>> and egg problem here. There is no room left in the fixed size header
>> to fit this in.
>
> I brought this up last Fall as well. As I recall the response is to move
> each sample_type based stream into its own data file and then merge the
> data files while processing.

By data stream, you mean an actual sampling buffer. Sampling on 10
events means 10 sampling buffers. If you reach the rlimit on it, you
need to decrease the buffer size, and therefore you increase overhead.

Moreover, I don't think perf record is well equipped to handle this.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-24 16:08         ` David Ahern
  2012-01-24 17:42           ` Stephane Eranian
@ 2012-01-26 16:21           ` Stephane Eranian
  1 sibling, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-26 16:21 UTC (permalink / raw)
  To: David Ahern
  Cc: Peter Zijlstra, linux-kernel, mingo, acme, robert.richter,
	ming.m.lin, andi, asharma, ravitillo, vweaver1

On Tue, Jan 24, 2012 at 5:08 PM, David Ahern <dsahern@gmail.com> wrote:
>
> On 01/24/2012 08:39 AM, Stephane Eranian wrote:
>> Hi,
>>
>> The branch stack sampling patch exposes a flaw in the sampling
>> buffer format as currently exported by the kernel.
>>
>> In the current format, sample records (RECORD_SAMPLE) are NOT
>> self-describing. That means that by looking at the fixed size header, it
>> is not possible to determine which event caused the sample to be recorded
>> and what's in the body of the variable length sample.
>>
>> Such introspection is only possible once we know the event unique id
>> (PERF_SAMPLE_ID). But to get the event ID, we need to parse the
>> sample. But, given that a sample has a variable length, there is no
>> predefined position for that ID in the sample. You have a chicken
>> and egg problem here. There is no room left in the fixed size header
>> to fit this in.
>
> I brought this up last Fall as well. As I recall the response is to move
> each sample_type based stream into its own data file and then merge the
> data files while processing.
>
But then, if you merge the files as is, it does not buy you anything.
You need to add or overwrite some information in the headers, unless you
rewrite the tool to handle multiple input files.

> David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 01/13] perf_events: add generic taken branch sampling support (v3)
  2012-01-09 16:49 ` [PATCH 01/13] perf_events: add generic taken branch sampling support (v3) Stephane Eranian
@ 2012-01-27  4:46   ` Anshuman Khandual
  2012-01-27  9:57     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  4:46 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> This patch adds the ability to sample taken branches to the
> perf_event interface.
> 
> The ability to capture taken branches is very useful for all
> sorts of analysis. For instance, basic block profiling, call
> counts, statistical call graph.
> 
> This new capability requires hardware assist and as such may
> not be available on all HW platforms. On Intel X86, it is
> implemented on top of the Last Branch Record (LBR) facility.
> 
> To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
> bit must be set in attr->sample_type.
> 
> Sampled taken branches may be filtered by type and/or priv
> levels.
> 
> The patch adds a new field, called branch_sample_type, to the
> perf_event_attr structure. It contains a bitmask of filters
> to apply to the sampled taken branches.
> 
> Filters may be implemented in HW. If the HW filter does not exist
> or is not good enough, some arch may also implement a SW filter.
> 
> The following generic filters are currently defined:
> - PERF_SAMPLE_BRANCH_USER
>   only branches whose targets are at the user level
> 
> - PERF_SAMPLE_BRANCH_KERNEL
>   only branches whose targets are at the kernel level
> 
> - PERF_SAMPLE_BRANCH_ANY
>   any type of branches (subject to priv levels filters)
> 
> - PERF_SAMPLE_BRANCH_ANY_CALL
>   any call branches (may incl. syscall on some arch)
> 
> - PERF_SAMPLE_BRANCH_ANY_RETURN
>   any return branches (may incl. syscall returns on some arch)
> 
> - PERF_SAMPLE_BRANCH_IND_CALL
>   indirect call branches
> 
> Filters may be combined. The priv level bits are optional.
> If not provided, the priv level of the associated event is used. It
> is possible to collect branches at a priv level different from the
> associated event.
> 
> The number of taken branch records present in each sample may vary based
> on HW, the type of sampled branches, the executed code. Therefore
> each sample contains the number of taken branches it contains.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 +++++---
>  include/linux/perf_event.h                 |   66 ++++++++++++++++++++++++++--
>  kernel/events/core.c                       |   58 ++++++++++++++++++++++++
>  3 files changed, 133 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 3fab3de..c3f8100 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -144,9 +144,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
> 
>  		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
> 
> -		cpuc->lbr_entries[i].from  = msr_lastbranch.from;
> -		cpuc->lbr_entries[i].to    = msr_lastbranch.to;
> -		cpuc->lbr_entries[i].flags = 0;
> +		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
> +		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
> +		cpuc->lbr_entries[i].mispred	= 0;
> +		cpuc->lbr_entries[i].predicted	= 0;
> +		cpuc->lbr_entries[i].reserved	= 0;
>  	}
>  	cpuc->lbr_stack.nr = i;
>  }
> @@ -167,19 +169,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
> 
>  	for (i = 0; i < x86_pmu.lbr_nr; i++) {
>  		unsigned long lbr_idx = (tos - i) & mask;
> -		u64 from, to, flags = 0;
> +		u64 from, to, mis = 0, pred = 0;
> 
>  		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
>  		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
> 
>  		if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
> -			flags = !!(from & LBR_FROM_FLAG_MISPRED);
> +			mis = !!(from & LBR_FROM_FLAG_MISPRED);
> +			pred = !mis;
>  			from = (u64)((((s64)from) << 1) >> 1);
>  		}
> 
> -		cpuc->lbr_entries[i].from  = from;
> -		cpuc->lbr_entries[i].to    = to;
> -		cpuc->lbr_entries[i].flags = flags;
> +		cpuc->lbr_entries[i].from	= from;
> +		cpuc->lbr_entries[i].to		= to;
> +		cpuc->lbr_entries[i].mispred	= mis;
> +		cpuc->lbr_entries[i].predicted	= pred;
> +		cpuc->lbr_entries[i].reserved	= 0;
>  	}
>  	cpuc->lbr_stack.nr = i;
>  }
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0b91db2..17751b1 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -129,11 +129,38 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_PERIOD			= 1U << 8,
>  	PERF_SAMPLE_STREAM_ID			= 1U << 9,
>  	PERF_SAMPLE_RAW				= 1U << 10,
> +	PERF_SAMPLE_BRANCH_STACK		= 1U << 11,
> 
> -	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
> +	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
>  };
> 
>  /*
> + * values to program into branch_sample_type when PERF_SAMPLE_BRANCH_STACK is set
> + *
> + * If the user does not pass priv level information via branch_sample_type,
> + * the kernel uses the event's priv level. Branch and event priv levels do
> + * not have to match. Branch priv level is checked for permissions.
> + *
> + * The branch types can be combined, however BRANCH_ANY covers all types
> + * of branches and therefore it supersedes all the other types.
> + */
> +enum perf_branch_sample_type {
> +	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user level branches */
> +	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel level branches */
> +
> +	PERF_SAMPLE_BRANCH_ANY		= 1U << 2, /* any branch types */
> +	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 3, /* any call branch */
> +	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 4, /* any return branch */
> +	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 5, /* indirect calls */
> +
> +	PERF_SAMPLE_BRANCH_MAX		= 1U << 6,/* non-ABI */
> +};
> +
> +#define PERF_SAMPLE_BRANCH_PLM_ALL \
> +	(PERF_SAMPLE_BRANCH_USER|\
> +	 PERF_SAMPLE_BRANCH_KERNEL)
> +
> +/*
>   * The format of the data returned by read() on a perf event fd,
>   * as specified by attr.read_format:
>   *
> @@ -240,6 +267,7 @@ struct perf_event_attr {
>  		__u64		bp_len;
>  		__u64		config2; /* extension of config1 */
>  	};
> +	__u64	branch_sample_type; /* enum perf_branch_sample_type */
>  };
> 
>  /*
> @@ -458,6 +486,8 @@ enum perf_event_type {
>  	 *
>  	 *	{ u32			size;
>  	 *	  char                  data[size];}&& PERF_SAMPLE_RAW
> +	 *
> +	 *	{ u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
>  	 * };
>  	 */
>  	PERF_RECORD_SAMPLE			= 9,
> @@ -530,12 +560,31 @@ struct perf_raw_record {
>  	void				*data;
>  };
> 
> +/*
> + * single taken branch record layout:
> + *
> + *      from: source instruction (may not always be a branch insn)
> + *        to: branch target
> + *   mispred: branch target was mispredicted
> + * predicted: branch target was predicted
> + *
> + * support for mispred, predicted is optional. In case it
> + * is not supported mispred = predicted = 0.
> + */
So the user level perf tools would check for ((mispred == 0) && (predicted == 0))
in a sample and report that it's not supported by the HW PMU? The point here is
that if it's not supported we should say "No HW support" rather than displaying
mispred = 0 and predicted = 0, as this could be misleading.
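
One way a tool could make that distinction is sketched below in C. It is
only an illustration of the suggestion, not code from the patch:
struct demo_branch_entry mirrors the bitfield layout proposed in this hunk,
and demo_pred_str() and its output strings are hypothetical. Bitfields on a
64-bit type rely on a compiler extension that GCC and clang both accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of the proposed struct perf_branch_entry layout. */
struct demo_branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t mispred:1,	/* target mispredicted */
		 predicted:1,	/* target predicted */
		 reserved:62;
};

/*
 * Hypothetical helper: describe the prediction state of one entry.
 * mispred = predicted = 0 means the PMU gave no prediction information,
 * so the entry is reported as "N/A" rather than as two zero flags.
 */
static const char *demo_pred_str(const struct demo_branch_entry *e)
{
	assert(e != NULL);

	if (e->mispred)
		return "mispredicted";
	if (e->predicted)
		return "predicted";
	return "N/A";
}
```

This works because the 64-bit LBR read path in this patch sets
pred = !mis whenever the flags are available, so both bits are zero only
when the hardware does not report prediction.
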
>  struct perf_branch_entry {
> -	__u64				from;
> -	__u64				to;
> -	__u64				flags;
> +	__u64	from;
> +	__u64	to;
> +	__u64	mispred:1,  /* target mispredicted */
> +		predicted:1,/* target predicted */
> +		reserved:62;
>  };
> 
> +/*
> + * branch stack layout:
> + *  nr: number of taken branches stored in entries[]
> + *
> + * Note that nr can vary from sample to sample
> + */
>  struct perf_branch_stack {
>  	__u64				nr;
>  	struct perf_branch_entry	entries[0];
> @@ -566,7 +615,9 @@ struct hw_perf_event {
>  			unsigned long	event_base;
>  			int		idx;
>  			int		last_cpu;
> +
>  			struct hw_perf_event_extra extra_reg;
> +			struct hw_perf_event_extra branch_reg;
>  		};
>  		struct { /* software */
>  			struct hrtimer	hrtimer;
> @@ -1003,12 +1054,14 @@ struct perf_sample_data {
>  	u64				period;
>  	struct perf_callchain_entry	*callchain;
>  	struct perf_raw_record		*raw;
> +	struct perf_branch_stack	*br_stack;
>  };
> 
>  static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
>  {
>  	data->addr = addr;
>  	data->raw  = NULL;
> +	data->br_stack = NULL;
>  }
> 
>  extern void perf_output_sample(struct perf_output_handle *handle,
> @@ -1147,6 +1200,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
>  # define perf_instruction_pointer(regs)	instruction_pointer(regs)
>  #endif
> 
> +static inline bool has_branch_stack(struct perf_event *event)
> +{
> +	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>  			     struct perf_event *event, unsigned int size);
>  extern void perf_output_end(struct perf_output_handle *handle);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 91fb68a..ed39225 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3877,6 +3877,24 @@ void perf_output_sample(struct perf_output_handle *handle,
>  			}
>  		}
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> +		if (data->br_stack) {
> +			size_t size;
> +
> +			size = data->br_stack->nr
> +			     * sizeof(struct perf_branch_entry);
> +
> +			perf_output_put(handle, data->br_stack->nr);
> +			perf_output_copy(handle, data->br_stack->entries, size);
> +		} else {
> +			/*
> +			 * we always store at least the value of nr
> +			 */
> +			u64 nr = 0;
> +			perf_output_put(handle, nr);
> +		}
> +	}
>  }
> 
>  void perf_prepare_sample(struct perf_event_header *header,
> @@ -3919,6 +3937,15 @@ void perf_prepare_sample(struct perf_event_header *header,
>  		WARN_ON_ONCE(size & (sizeof(u64)-1));
>  		header->size += size;
>  	}
> +
> +	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> +		int size = sizeof(u64); /* nr */
> +		if (data->br_stack) {
> +			size += data->br_stack->nr
> +			      * sizeof(struct perf_branch_entry);
> +		}
> +		header->size += size;
> +	}
>  }
> 
>  static void perf_event_output(struct perf_event *event,
> @@ -5898,6 +5925,37 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>  	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
>  		return -EINVAL;
> 
> +	if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
> +		u64 mask = attr->branch_sample_type;
> +
> +		/* only using defined bits */
> +		if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
> +			return -EINVAL;
> +
> +		/* at least one branch bit must be set */
> +		if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
> +			return -EINVAL;
> +
> +		/* kernel level capture */
> +		if ((mask & PERF_SAMPLE_BRANCH_KERNEL)
> +		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
> +			return -EACCES;
> +
> +		/* propagate priv level, when not set for branch */
> +		if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
> +
> +			/* exclude_kernel checked on syscall entry */
> +			if (!attr->exclude_kernel)
> +				mask |= PERF_SAMPLE_BRANCH_KERNEL;
> +
> +			if (!attr->exclude_user)
> +				mask |= PERF_SAMPLE_BRANCH_USER;
Why are we not taking care of attr->exclude_hv? Should we not define
PERF_SAMPLE_BRANCH_HV for hypervisor-level branches?
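A sketch of how the propagation could look with a third privilege bit. All
names and bit values here (BR_USER, BR_KERNEL, BR_HV, struct excl_bits,
propagate_plm()) are hypothetical and only illustrate the idea; BR_HV in
particular is not part of this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define BR_USER		(1ULL << 0)
#define BR_KERNEL	(1ULL << 1)
#define BR_HV		(1ULL << 2)	/* proposed, not in this patch */
#define BR_PLM_ALL	(BR_USER | BR_KERNEL | BR_HV)

/* Stand-in for the exclude_* bits of struct perf_event_attr. */
struct excl_bits {
	unsigned int exclude_user:1,
		     exclude_kernel:1,
		     exclude_hv:1;
};

/*
 * If the caller set no privilege-level bit for branches, inherit the
 * levels from the event's exclude_* bits; otherwise leave mask untouched.
 */
static uint64_t propagate_plm(uint64_t mask, const struct excl_bits *a)
{
	assert(a != NULL);

	if (mask & BR_PLM_ALL)
		return mask;	/* levels were chosen explicitly */

	if (!a->exclude_user)
		mask |= BR_USER;
	if (!a->exclude_kernel)
		mask |= BR_KERNEL;
	if (!a->exclude_hv)
		mask |= BR_HV;

	return mask;
}
```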
> +			/*
> +			 * adjust user setting (for HW filter setup)
> +			 */
> +			attr->branch_sample_type = mask;
> +		}
> +	}
>  out:
>  	return ret;
> 


-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 02/13] perf_events: add Intel LBR MSR definitions (v3)
  2012-01-09 16:49 ` [PATCH 02/13] perf_events: add Intel LBR MSR definitions (v3) Stephane Eranian
@ 2012-01-27  5:03   ` Anshuman Khandual
  0 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  5:03 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> This patch adds the LBR definitions for NHM/WSM/SNB and Core.
> It also adds the definitions for the architected LBR MSR:
> LBR_SELECT, LBR_TOS.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/msr-index.h           |    7 +++++++
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   18 +++++++++---------
>  2 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index a6962d9..ccb8059 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -56,6 +56,13 @@
>  #define MSR_OFFCORE_RSP_0		0x000001a6
>  #define MSR_OFFCORE_RSP_1		0x000001a7
> 
> +#define MSR_LBR_SELECT			0x000001c8
> +#define MSR_LBR_TOS			0x000001c9
> +#define MSR_LBR_NHM_FROM		0x00000680
> +#define MSR_LBR_NHM_TO			0x000006c0
> +#define MSR_LBR_CORE_FROM		0x00000040
> +#define MSR_LBR_CORE_TO			0x00000060
> +
>  #define MSR_IA32_PEBS_ENABLE		0x000003f1
>  #define MSR_IA32_DS_AREA		0x00000600
>  #define MSR_IA32_PERF_CAPABILITIES	0x00000345
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index c3f8100..e14431f 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -205,23 +205,23 @@ void intel_pmu_lbr_read(void)
>  void intel_pmu_lbr_init_core(void)
>  {
>  	x86_pmu.lbr_nr     = 4;
> -	x86_pmu.lbr_tos    = 0x01c9;
> -	x86_pmu.lbr_from   = 0x40;
> -	x86_pmu.lbr_to     = 0x60;
> +	x86_pmu.lbr_tos    = MSR_LBR_TOS;
> +	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
> +	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
>  }
> 
>  void intel_pmu_lbr_init_nhm(void)
>  {
>  	x86_pmu.lbr_nr     = 16;
> -	x86_pmu.lbr_tos    = 0x01c9;
> -	x86_pmu.lbr_from   = 0x680;
> -	x86_pmu.lbr_to     = 0x6c0;
> +	x86_pmu.lbr_tos    = MSR_LBR_TOS;
> +	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
> +	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
>  }
> 
>  void intel_pmu_lbr_init_atom(void)
>  {
>  	x86_pmu.lbr_nr	   = 8;
> -	x86_pmu.lbr_tos    = 0x01c9;
> -	x86_pmu.lbr_from   = 0x40;
> -	x86_pmu.lbr_to     = 0x60;
> +	x86_pmu.lbr_tos    = MSR_LBR_TOS;
> +	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
> +	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
>  }


-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 04/13] perf_events: sync branch stack sampling with X86 precise_sampling (v3)
  2012-01-09 16:49 ` [PATCH 04/13] perf_events: sync branch stack sampling with X86 precise_sampling (v3) Stephane Eranian
@ 2012-01-27  5:26   ` Anshuman Khandual
  0 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  5:26 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> If precise sampling is enabled on Intel X86, then perf_event uses PEBS.
> To correct for the off-by-one error of PEBS, perf_event uses LBR when
> precise_sample > 1.
> 
> On Intel X86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
> therefore both features must be coordinated as they may not
> configure LBR the same way.
> 
> For PEBS, LBR needs to capture all branches at all priv levels.
> This patch sets this up.
> 
> The configuration of PERF_SAMPLE_BRANCH_STACK may not be compatible
> in which case an error must be returned.
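The coordination described above can be sketched in isolation as follows. This
is a simplified stand-alone model, not the kernel code: the flag values and the
helper name pebs_check_branch_type() are hypothetical, and has_br_stack stands
in for has_branch_stack(event):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define BR_USER		(1ULL << 0)
#define BR_KERNEL	(1ULL << 1)
#define BR_PLM_ALL	(BR_USER | BR_KERNEL)
#define BR_ANY		(1ULL << 2)

/*
 * Model of the check: with precise_ip > 1 the LBR is needed for the PEBS
 * fixup, so a user-requested branch filter must be "any branch"; if the
 * user asked for no branch stack, configure the LBR for everything.
 */
static int pebs_check_branch_type(int precise_ip, bool has_br_stack,
				  uint64_t *br_type)
{
	assert(br_type != NULL);

	if (precise_ip <= 1)
		return 0;	/* LBR not needed for the fixup */

	if (has_br_stack) {
		if ((*br_type & BR_ANY) != BR_ANY)
			return -EOPNOTSUPP;	/* filter conflicts with fixup */
		return 0;
	}

	/* no user request: capture all branches at all priv levels */
	*br_type = BR_ANY | BR_PLM_ALL;
	return 0;
}
```

Note that, like the hunk below, this only tests the ANY bit and leaves the
privilege-level bits as the user set them.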
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  arch/x86/kernel/cpu/perf_event.c |   22 ++++++++++++++++++++++
>  1 files changed, 22 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 3779313..710ec93 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -356,6 +356,7 @@ int x86_setup_perfctr(struct perf_event *event)
>  int x86_pmu_hw_config(struct perf_event *event)
>  {
>  	if (event->attr.precise_ip) {
> +		u64 *br_type, br_sel;
>  		int precise = 0;
> 
>  		/* Support for constant skid */
> @@ -369,6 +370,27 @@ int x86_pmu_hw_config(struct perf_event *event)
> 
>  		if (event->attr.precise_ip > precise)
>  			return -EOPNOTSUPP;
> +		/*
> +		 * check that PEBS LBR correction does not conflict with
> +		 * whatever the user is asking with attr->branch_sample_type
> +		 */
> +		if (event->attr.precise_ip > 1) {
> +
> +			br_type = &event->attr.branch_sample_type;
> +
> +			if (has_branch_stack(event)) {
> +				br_sel = *br_type & PERF_SAMPLE_BRANCH_ANY;
> +				if (br_sel != PERF_SAMPLE_BRANCH_ANY)
> +					return -EOPNOTSUPP;
> +			} else {
> +				/*
> +				 * For PEBS fixups, we capture all
> +				 * the branches at all priv levels
> +				 */
> +				*br_type = PERF_SAMPLE_BRANCH_ANY
> +					 | PERF_SAMPLE_BRANCH_PLM_ALL;
> +			}
> +		}
>  	}
> 
>  	/*


-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 05/13] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v3)
  2012-01-09 16:49 ` [PATCH 05/13] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v3) Stephane Eranian
@ 2012-01-27  5:41   ` Anshuman Khandual
  0 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  5:41 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
> filters to the actual Intel X86 LBR filters, whenever they exist.
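As an illustration of how such a mapping table might be consumed, here is a
stand-alone sketch. The LBR_* values are copied from the hunk below; the
generic filter bit positions (BR_*_SHIFT) and the helper lbr_select_value() are
hypothetical. The final XOR is an assumption based on the "do not capture"
comments on the LBR_SELECT bits; the code that actually programs the register
is in a later patch of the series and is not shown here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LBR_SELECT bits from this patch; a set bit means "do not capture". */
#define LBR_KERNEL	(1u << 0)
#define LBR_USER	(1u << 1)
#define LBR_JCC		(1u << 2)
#define LBR_REL_CALL	(1u << 3)
#define LBR_IND_CALL	(1u << 4)
#define LBR_RETURN	(1u << 5)
#define LBR_IND_JMP	(1u << 6)
#define LBR_REL_JMP	(1u << 7)
#define LBR_FAR		(1u << 8)
#define LBR_SEL_MASK	0x1ffu
#define LBR_ANY		(LBR_JCC | LBR_REL_CALL | LBR_IND_CALL | LBR_RETURN | \
			 LBR_REL_JMP | LBR_IND_JMP | LBR_FAR)

/* Hypothetical bit positions of the generic filters. */
enum {
	BR_USER_SHIFT, BR_KERNEL_SHIFT, BR_ANY_SHIFT, BR_ANY_CALL_SHIFT,
	BR_ANY_RETURN_SHIFT, BR_IND_CALL_SHIFT, BR_MAX_SHIFT
};

/* Sandy Bridge mapping, same values as snb_lbr_sel_map in the patch. */
static const uint32_t snb_map[BR_MAX_SHIFT] = {
	[BR_USER_SHIFT]		= LBR_USER,
	[BR_KERNEL_SHIFT]	= LBR_KERNEL,
	[BR_ANY_SHIFT]		= LBR_ANY,
	[BR_ANY_CALL_SHIFT]	= LBR_REL_CALL | LBR_IND_CALL | LBR_FAR,
	[BR_ANY_RETURN_SHIFT]	= LBR_RETURN | LBR_FAR,
	[BR_IND_CALL_SHIFT]	= LBR_IND_CALL,
};

/* OR together what the user wants, then invert into suppress bits. */
static uint32_t lbr_select_value(uint64_t br_type, const uint32_t *map)
{
	uint32_t wanted = 0;

	assert(map != NULL);

	for (size_t i = 0; i < BR_MAX_SHIFT; i++)
		if (br_type & (1ULL << i))
			wanted |= map[i];

	return wanted ^ LBR_SEL_MASK;
}
```

For example, "any branch, user level only" collects LBR_USER | LBR_ANY and,
after inversion, leaves only LBR_KERNEL set, i.e. ring 0 is suppressed.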
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  arch/x86/kernel/cpu/perf_event.h           |    2 +
>  arch/x86/kernel/cpu/perf_event_intel.c     |    2 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   99 +++++++++++++++++++++++++++-
>  3 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 4535ada..776fb5a 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -535,6 +535,8 @@ void intel_pmu_lbr_init_nhm(void);
> 
>  void intel_pmu_lbr_init_atom(void);
> 
> +void intel_pmu_lbr_init_snb(void);
> +
>  int p4_pmu_init(void);
> 
>  int p6_pmu_init(void);
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index 97f7bb5..b0db016 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1757,7 +1757,7 @@ __init int intel_pmu_init(void)
>  		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
>  		       sizeof(hw_cache_event_ids));
> 
> -		intel_pmu_lbr_init_nhm();
> +		intel_pmu_lbr_init_snb();
> 
>  		x86_pmu.event_constraints = intel_snb_event_constraints;
>  		x86_pmu.pebs_constraints = intel_snb_pebs_event_constraints;
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index e14431f..8a1eb6c 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -14,6 +14,47 @@ enum {
>  };
> 
>  /*
> + * Intel LBR_SELECT bits
> + * Intel Vol3a, April 2011, Section 16.7 Table 16-10
> + *
> + * Hardware branch filter (not available on all CPUs)
> + */
> +#define LBR_KERNEL_BIT		0 /* do not capture at ring0 */
> +#define LBR_USER_BIT		1 /* do not capture at ring > 0 */
> +#define LBR_JCC_BIT		2 /* do not capture conditional branches */
> +#define LBR_REL_CALL_BIT	3 /* do not capture relative calls */
> +#define LBR_IND_CALL_BIT	4 /* do not capture indirect calls */
> +#define LBR_RETURN_BIT		5 /* do not capture near returns */
> +#define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
> +#define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
> +#define LBR_FAR_BIT		8 /* do not capture far branches */
> +
> +#define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
> +#define LBR_USER	(1 << LBR_USER_BIT)
> +#define LBR_JCC		(1 << LBR_JCC_BIT)
> +#define LBR_REL_CALL	(1 << LBR_REL_CALL_BIT)
> +#define LBR_IND_CALL	(1 << LBR_IND_CALL_BIT)
> +#define LBR_RETURN	(1 << LBR_RETURN_BIT)
> +#define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
> +#define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
> +#define LBR_FAR		(1 << LBR_FAR_BIT)
> +
> +#define LBR_PLM (LBR_KERNEL | LBR_USER)
> +
> +#define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
> +
> +#define LBR_ANY		 \
> +	(LBR_JCC	|\
> +	 LBR_REL_CALL	|\
> +	 LBR_IND_CALL	|\
> +	 LBR_RETURN	|\
> +	 LBR_REL_JMP	|\
> +	 LBR_IND_JMP	|\
> +	 LBR_FAR)
> +
> +#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
> +
> +/*
>   * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
>   * otherwise it becomes near impossible to get a reliable stack.
>   */
> @@ -153,8 +194,6 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
>  	cpuc->lbr_stack.nr = i;
>  }
> 
> -#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
> -
>  /*
>   * Due to lack of segmentation in Linux the effective address (offset)
>   * is the same as the linear address, allowing us to merge the LIP and EIP
> @@ -202,26 +241,82 @@ void intel_pmu_lbr_read(void)
>  		intel_pmu_lbr_read_64(cpuc);
>  }
> 
> +/*
> + * Map interface branch filters onto LBR filters
> + */
> +static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
> +	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
> +	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
> +	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
> +	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
> +					| LBR_IND_JMP | LBR_FAR,
> +	/*
> +	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
> +	 */
> +	[PERF_SAMPLE_BRANCH_ANY_CALL] =
> +	 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
> +	/*
> +	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
> +	 */
> +	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
> +};
> +
> +static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
> +	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
> +	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
> +	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
> +	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
> +	[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
> +					| LBR_FAR,
> +	[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
> +};
> +
> +/* core */
>  void intel_pmu_lbr_init_core(void)
>  {
>  	x86_pmu.lbr_nr     = 4;
>  	x86_pmu.lbr_tos    = MSR_LBR_TOS;
>  	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
>  	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
> +
> +	pr_cont("4-deep LBR, ");
>  }
> 
> +/* nehalem/westmere */
>  void intel_pmu_lbr_init_nhm(void)
>  {
>  	x86_pmu.lbr_nr     = 16;
>  	x86_pmu.lbr_tos    = MSR_LBR_TOS;
>  	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
>  	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
> +
> +	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
> +	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
> +
> +	pr_cont("16-deep LBR, ");
>  }
> 
> +/* sandy bridge */
> +void intel_pmu_lbr_init_snb(void)
> +{
> +	x86_pmu.lbr_nr	 = 16;
> +	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
> +	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
> +	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
> +
> +	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
> +	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
> +
> +	pr_cont("16-deep LBR, ");
> +}
> +
> +/* atom */
>  void intel_pmu_lbr_init_atom(void)
>  {
>  	x86_pmu.lbr_nr	   = 8;
>  	x86_pmu.lbr_tos    = MSR_LBR_TOS;
>  	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
>  	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
> +
> +	pr_cont("8-deep LBR, ");
>  }


-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 06/13] perf_events: disable LBR support for older Intel Atom processors (v3)
  2012-01-09 16:49 ` [PATCH 06/13] perf_events: disable LBR support for older Intel Atom processors (v3) Stephane Eranian
@ 2012-01-27  5:43   ` Anshuman Khandual
  0 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  5:43 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> The patch adds a restriction for Intel Atom LBR support. Only
> steppings 10 (PineView) and more recent are supported. Older models,
> do not have a functional LBR. Their LBR does not freeze on PMU interrupt
> which makes LBR unusable in the context of perf_events.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> index 8a1eb6c..e2b7094 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -313,6 +313,16 @@ void intel_pmu_lbr_init_snb(void)
>  /* atom */
>  void intel_pmu_lbr_init_atom(void)
>  {
> +	/*
> +	 * only models starting at stepping 10 seems
> +	 * to have an operational LBR which can freeze
> +	 * on PMU interrupt
> +	 */
> +	if (boot_cpu_data.x86_mask < 10) {
> +		pr_cont("LBR disabled due to erratum");
> +		return;
> +	}
> +
>  	x86_pmu.lbr_nr	   = 8;
>  	x86_pmu.lbr_tos    = MSR_LBR_TOS;
>  	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;


-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 07/13] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v3)
  2012-01-09 16:49 ` [PATCH 07/13] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v3) Stephane Eranian
@ 2012-01-27  6:14   ` Anshuman Khandual
  0 siblings, 0 replies; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  6:14 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> This patch implements PERF_SAMPLE_BRANCH support for Intel
> X86 processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.
> 
> The patch adds the hooks in the PMU irq handler to save the LBR
> on counter overflow for both regular and PEBS modes.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  arch/x86/kernel/cpu/perf_event_intel.c     |   35 +++++++++++++
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   10 ++--
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   73 +++++++++++++++++++++++++++-
>  include/linux/perf_event.h                 |    3 +
>  4 files changed, 113 insertions(+), 8 deletions(-)
> 
This patch fails to compile on its own. The function
'intel_pmu_setup_lbr_filter' is defined in
arch/x86/kernel/cpu/perf_event_intel_lbr.c and used in
arch/x86/kernel/cpu/perf_event_intel.c, but its prototype is not added to
the header arch/x86/kernel/cpu/perf_event.h. [PATCH 8/13] rectifies the
problem, but [PATCH 7/13] still does not build independently.

arch/x86/kernel/cpu/perf_event_intel_lbr.c:291: warning: ‘intel_pmu_setup_lbr_filter’ defined but not used
  CC      arch/x86/kernel/cpu/perf_event_intel_ds.o
  CC      arch/x86/kernel/cpu/perf_event_intel.o
arch/x86/kernel/cpu/perf_event_intel.c: In function ‘intel_pmu_hw_config’:
arch/x86/kernel/cpu/perf_event_intel.c:1342: error: implicit declaration of function ‘intel_pmu_setup_lbr_filter’
make[3]: *** [arch/x86/kernel/cpu/perf_event_intel.o] Error 1
make[2]: *** [arch/x86/kernel/cpu] Error 2
make[1]: *** [arch/x86/kernel] Error 2
make: *** [arch/x86] Error 2

-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3)
  2012-01-09 16:49 ` [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3) Stephane Eranian
@ 2012-01-27  7:15   ` Anshuman Khandual
  2012-01-27  9:56     ` Stephane Eranian
  0 siblings, 1 reply; 36+ messages in thread
From: Anshuman Khandual @ 2012-01-27  7:15 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
> PERF_SAMPLE_BRANCH_* is disabled for:
> - SW events (sw counters, tracepoints)
> - HW breakpoints
> - ALL but Intel X86 architecture
> - AMD64 processors
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  arch/alpha/kernel/perf_event.c       |    4 ++++
>  arch/arm/kernel/perf_event.c         |    4 ++++
>  arch/mips/kernel/perf_event_mipsxx.c |    4 ++++
>  arch/powerpc/kernel/perf_event.c     |    4 ++++
>  arch/sh/kernel/perf_event.c          |    4 ++++
>  arch/sparc/kernel/perf_event.c       |    4 ++++
>  arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
>  kernel/events/core.c                 |   24 ++++++++++++++++++++++++
>  kernel/events/hw_breakpoint.c        |    6 ++++++
>  9 files changed, 57 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
> index 8143cd7..0dae252 100644
> --- a/arch/alpha/kernel/perf_event.c
> +++ b/arch/alpha/kernel/perf_event.c
> @@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
>  {
>  	int err;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_RAW:
>  	case PERF_TYPE_HARDWARE:
> diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
> index 88b0941..42262ff 100644
> --- a/arch/arm/kernel/perf_event.c
> +++ b/arch/arm/kernel/perf_event.c
> @@ -540,6 +540,10 @@ static int armpmu_event_init(struct perf_event *event)
>  	int err = 0;
>  	atomic_t *active_events = &armpmu->active_events;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_smpl(event))
I guess this should be 'has_branch_stack' instead of 'has_branch_smpl'.
'has_branch_smpl' is not defined anywhere but is called here.
> +		return -EOPNOTSUPP;
> +
>  	if (armpmu->map_event(event) == -ENOENT)
>  		return -ENOENT;
> 
> diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
> index 315fc0b..7070f8c 100644
> --- a/arch/mips/kernel/perf_event_mipsxx.c
> +++ b/arch/mips/kernel/perf_event_mipsxx.c
> @@ -606,6 +606,10 @@ static int mipspmu_event_init(struct perf_event *event)
>  {
>  	int err = 0;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_RAW:
>  	case PERF_TYPE_HARDWARE:
> diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
> index d614ab5..4e0b265 100644
> --- a/arch/powerpc/kernel/perf_event.c
> +++ b/arch/powerpc/kernel/perf_event.c
> @@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
>  	if (!ppmu)
>  		return -ENOENT;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_HARDWARE:
>  		ev = event->attr.config;
> diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
> index 10b14e3..068b8a2 100644
> --- a/arch/sh/kernel/perf_event.c
> +++ b/arch/sh/kernel/perf_event.c
> @@ -310,6 +310,10 @@ static int sh_pmu_event_init(struct perf_event *event)
>  {
>  	int err;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_RAW:
>  	case PERF_TYPE_HW_CACHE:
> diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
> index 614da62..8e16a4a 100644
> --- a/arch/sparc/kernel/perf_event.c
> +++ b/arch/sparc/kernel/perf_event.c
> @@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
>  	if (atomic_read(&nmi_active) < 0)
>  		return -ENODEV;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (attr->type) {
>  	case PERF_TYPE_HARDWARE:
>  		if (attr->config >= sparc_pmu->max_events)
> diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
> index 0397b23..0d8da03 100644
> --- a/arch/x86/kernel/cpu/perf_event_amd.c
> +++ b/arch/x86/kernel/cpu/perf_event_amd.c
> @@ -138,6 +138,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
>  	if (ret)
>  		return ret;
> 
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	if (event->attr.exclude_host && event->attr.exclude_guest)
>  		/*
>  		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ed39225..36d1a63 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5000,6 +5000,12 @@ static int perf_swevent_init(struct perf_event *event)
>  	if (event->attr.type != PERF_TYPE_SOFTWARE)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for software events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event_id) {
>  	case PERF_COUNT_SW_CPU_CLOCK:
>  	case PERF_COUNT_SW_TASK_CLOCK:
> @@ -5110,6 +5116,12 @@ static int perf_tp_event_init(struct perf_event *event)
>  	if (event->attr.type != PERF_TYPE_TRACEPOINT)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for tracepoint events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	err = perf_trace_init(event);
>  	if (err)
>  		return err;
> @@ -5335,6 +5347,12 @@ static int cpu_clock_event_init(struct perf_event *event)
>  	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for software events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	perf_swevent_init_hrtimer(event);
> 
>  	return 0;
> @@ -5409,6 +5427,12 @@ static int task_clock_event_init(struct perf_event *event)
>  	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for software events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	perf_swevent_init_hrtimer(event);
> 
>  	return 0;
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index b0309f7..cee5423 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
>  	if (bp->attr.type != PERF_TYPE_BREAKPOINT)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for breakpoint events
> +	 */
> +	if (has_branch_stack(bp))
> +		return -EOPNOTSUPP;
> +
>  	err = register_perf_hw_breakpoint(bp);
>  	if (err)
>  		return err;


-- 
Linux Technology Centre
IBM Systems and Technology Group
Bangalore India



* Re: [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3)
  2012-01-27  7:15   ` Anshuman Khandual
@ 2012-01-27  9:56     ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-27  9:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Fri, Jan 27, 2012 at 8:15 AM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
>> PERF_SAMPLE_BRANCH_* is disabled for:
>> - SW events (sw counters, tracepoints)
>> - HW breakpoints
>> - ALL but Intel X86 architecture
>> - AMD64 processors
>>
>> Signed-off-by: Stephane Eranian <eranian@google.com>
>> ---
>>  arch/alpha/kernel/perf_event.c       |    4 ++++
>>  arch/arm/kernel/perf_event.c         |    4 ++++
>>  arch/mips/kernel/perf_event_mipsxx.c |    4 ++++
>>  arch/powerpc/kernel/perf_event.c     |    4 ++++
>>  arch/sh/kernel/perf_event.c          |    4 ++++
>>  arch/sparc/kernel/perf_event.c       |    4 ++++
>>  arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
>>  kernel/events/core.c                 |   24 ++++++++++++++++++++++++
>>  kernel/events/hw_breakpoint.c        |    6 ++++++
>>  9 files changed, 57 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
>> index 8143cd7..0dae252 100644
>> --- a/arch/alpha/kernel/perf_event.c
>> +++ b/arch/alpha/kernel/perf_event.c
>> @@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
>>  {
>>       int err;
>>
>> +     /* does not support taken branch sampling */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       switch (event->attr.type) {
>>       case PERF_TYPE_RAW:
>>       case PERF_TYPE_HARDWARE:
>> diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
>> index 88b0941..42262ff 100644
>> --- a/arch/arm/kernel/perf_event.c
>> +++ b/arch/arm/kernel/perf_event.c
>> @@ -540,6 +540,10 @@ static int armpmu_event_init(struct perf_event *event)
>>       int err = 0;
>>       atomic_t *active_events = &armpmu->active_events;
>>
>> +     /* does not support taken branch sampling */
>> +     if (has_branch_smpl(event))
> I guess this would be 'has_branch_stack' instead of 'has_branch_smpl'.
> 'has_branch_smpl' has not been defined any where but getting called here.

Good catch. Will post a patch to fix that.

>> +             return -EOPNOTSUPP;
>> +
>>       if (armpmu->map_event(event) == -ENOENT)
>>               return -ENOENT;
>>
>> diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
>> index 315fc0b..7070f8c 100644
>> --- a/arch/mips/kernel/perf_event_mipsxx.c
>> +++ b/arch/mips/kernel/perf_event_mipsxx.c
>> @@ -606,6 +606,10 @@ static int mipspmu_event_init(struct perf_event *event)
>>  {
>>       int err = 0;
>>
>> +     /* does not support taken branch sampling */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       switch (event->attr.type) {
>>       case PERF_TYPE_RAW:
>>       case PERF_TYPE_HARDWARE:
>> diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
>> index d614ab5..4e0b265 100644
>> --- a/arch/powerpc/kernel/perf_event.c
>> +++ b/arch/powerpc/kernel/perf_event.c
>> @@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
>>       if (!ppmu)
>>               return -ENOENT;
>>
>> +     /* does not support taken branch sampling */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       switch (event->attr.type) {
>>       case PERF_TYPE_HARDWARE:
>>               ev = event->attr.config;
>> diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
>> index 10b14e3..068b8a2 100644
>> --- a/arch/sh/kernel/perf_event.c
>> +++ b/arch/sh/kernel/perf_event.c
>> @@ -310,6 +310,10 @@ static int sh_pmu_event_init(struct perf_event *event)
>>  {
>>       int err;
>>
>> +     /* does not support taken branch sampling */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       switch (event->attr.type) {
>>       case PERF_TYPE_RAW:
>>       case PERF_TYPE_HW_CACHE:
>> diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
>> index 614da62..8e16a4a 100644
>> --- a/arch/sparc/kernel/perf_event.c
>> +++ b/arch/sparc/kernel/perf_event.c
>> @@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
>>       if (atomic_read(&nmi_active) < 0)
>>               return -ENODEV;
>>
>> +     /* does not support taken branch sampling */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       switch (attr->type) {
>>       case PERF_TYPE_HARDWARE:
>>               if (attr->config >= sparc_pmu->max_events)
>> diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
>> index 0397b23..0d8da03 100644
>> --- a/arch/x86/kernel/cpu/perf_event_amd.c
>> +++ b/arch/x86/kernel/cpu/perf_event_amd.c
>> @@ -138,6 +138,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
>>       if (ret)
>>               return ret;
>>
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       if (event->attr.exclude_host && event->attr.exclude_guest)
>>               /*
>>                * When HO == GO == 1 the hardware treats that as GO == HO == 0
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index ed39225..36d1a63 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -5000,6 +5000,12 @@ static int perf_swevent_init(struct perf_event *event)
>>       if (event->attr.type != PERF_TYPE_SOFTWARE)
>>               return -ENOENT;
>>
>> +     /*
>> +      * no branch sampling for software events
>> +      */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       switch (event_id) {
>>       case PERF_COUNT_SW_CPU_CLOCK:
>>       case PERF_COUNT_SW_TASK_CLOCK:
>> @@ -5110,6 +5116,12 @@ static int perf_tp_event_init(struct perf_event *event)
>>       if (event->attr.type != PERF_TYPE_TRACEPOINT)
>>               return -ENOENT;
>>
>> +     /*
>> +      * no branch sampling for tracepoint events
>> +      */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       err = perf_trace_init(event);
>>       if (err)
>>               return err;
>> @@ -5335,6 +5347,12 @@ static int cpu_clock_event_init(struct perf_event *event)
>>       if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
>>               return -ENOENT;
>>
>> +     /*
>> +      * no branch sampling for software events
>> +      */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       perf_swevent_init_hrtimer(event);
>>
>>       return 0;
>> @@ -5409,6 +5427,12 @@ static int task_clock_event_init(struct perf_event *event)
>>       if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
>>               return -ENOENT;
>>
>> +     /*
>> +      * no branch sampling for software events
>> +      */
>> +     if (has_branch_stack(event))
>> +             return -EOPNOTSUPP;
>> +
>>       perf_swevent_init_hrtimer(event);
>>
>>       return 0;
>> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
>> index b0309f7..cee5423 100644
>> --- a/kernel/events/hw_breakpoint.c
>> +++ b/kernel/events/hw_breakpoint.c
>> @@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
>>       if (bp->attr.type != PERF_TYPE_BREAKPOINT)
>>               return -ENOENT;
>>
>> +     /*
>> +      * no branch sampling for breakpoint events
>> +      */
>> +     if (has_branch_stack(bp))
>> +             return -EOPNOTSUPP;
>> +
>>       err = register_perf_hw_breakpoint(bp);
>>       if (err)
>>               return err;
>
>
> --
> Linux Technology Centre
> IBM Systems and Technology Group
> Bangalore India
>

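The hunks quoted above all apply the same guard: an init routine that cannot deliver a branch stack rejects the event with -EOPNOTSUPP before doing anything else. A minimal user-space model of that pattern is sketched here. It is not kernel code: `struct mock_event`, `mock_has_branch_stack()` and `mock_event_init()` are hypothetical stand-ins, and `MOCK_SAMPLE_BRANCH_STACK` only mirrors the bit value used for PERF_SAMPLE_BRANCH_STACK in patch 01.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PERF_SAMPLE_BRANCH_STACK (1U << 11 in the quoted patch). */
#define MOCK_SAMPLE_BRANCH_STACK (1U << 11)

/* Hypothetical, trimmed-down event: only the field the guard looks at. */
struct mock_event {
	uint64_t sample_type;
};

/* Same test as has_branch_stack() in the patch, applied to the mock type. */
static bool mock_has_branch_stack(const struct mock_event *event)
{
	return (event->sample_type & MOCK_SAMPLE_BRANCH_STACK) != 0;
}

/*
 * Model of a PMU init routine without branch sampling support:
 * refuse the event up front instead of silently ignoring the request.
 */
static int mock_event_init(const struct mock_event *event)
{
	assert(event != NULL);

	if (mock_has_branch_stack(event))
		return -EOPNOTSUPP;

	return 0;
}
```

With this shape, a caller that asks for a branch stack on an unsupported PMU gets a distinct error rather than samples that quietly lack branch records.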

* Re: [PATCH 01/13] perf_events: add generic taken branch sampling support (v3)
  2012-01-27  4:46   ` Anshuman Khandual
@ 2012-01-27  9:57     ` Stephane Eranian
  0 siblings, 0 replies; 36+ messages in thread
From: Stephane Eranian @ 2012-01-27  9:57 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1

On Fri, Jan 27, 2012 at 5:46 AM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> On Monday 09 January 2012 10:19 PM, Stephane Eranian wrote:
>> This patch adds the ability to sample taken branches to the
>> perf_event interface.
>>
>> The ability to capture taken branches is very useful for all
>> sorts of analysis. For instance, basic block profiling, call
>> counts, statistical call graph.
>>
>> This new capability requires hardware assist and as such may
>> not be available on all HW platforms. On Intel X86, it is
>> implemented on top of the Last Branch Record (LBR) facility.
>>
>> To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
>> bit must be set in attr->sample_type.
>>
>> Sampled taken branches may be filtered by type and/or priv
>> levels.
>>
>> The patch adds a new field, called branch_sample_type, to the
>> perf_event_attr structure. It contains a bitmask of filters
>> to apply to the sampled taken branches.
>>
>> Filters may be implemented in HW. If the HW filter does not exist
>> or is not good enough, some arch may also implement a SW filter.
>>
>> The following generic filters are currently defined:
>> - PERF_SAMPLE_BRANCH_USER
>>   only branches whose targets are at the user level
>>
>> - PERF_SAMPLE_BRANCH_KERNEL
>>   only branches whose targets are at the kernel level
>>
>> - PERF_SAMPLE_BRANCH_ANY
>>   any type of branches (subject to priv level filters)
>>
>> - PERF_SAMPLE_BRANCH_ANY_CALL
>>   any call branches (may incl. syscall on some arch)
>>
>> - PERF_SAMPLE_BRANCH_ANY_RETURN
>>   any return branches (may incl. syscall returns on some arch)
>>
>> - PERF_SAMPLE_BRANCH_IND_CALL
>>   indirect call branches
>>
>> Filters may be combined. The priv level bits are optional.
>> If not provided, the priv level of the associated event is used. It
>> is possible to collect branches at a priv level different from that
>> of the associated event.
>>
>> The number of taken branch records present in each sample may vary based
>> on the HW, the type of sampled branches, and the executed code. Therefore
>> each sample records the number of taken branches it contains.
>>
>> Signed-off-by: Stephane Eranian <eranian@google.com>
> Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 +++++---
>>  include/linux/perf_event.h                 |   66 ++++++++++++++++++++++++++--
>>  kernel/events/core.c                       |   58 ++++++++++++++++++++++++
>>  3 files changed, 133 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> index 3fab3de..c3f8100 100644
>> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
>> @@ -144,9 +144,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
>>
>>               rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
>>
>> -             cpuc->lbr_entries[i].from  = msr_lastbranch.from;
>> -             cpuc->lbr_entries[i].to    = msr_lastbranch.to;
>> -             cpuc->lbr_entries[i].flags = 0;
>> +             cpuc->lbr_entries[i].from       = msr_lastbranch.from;
>> +             cpuc->lbr_entries[i].to         = msr_lastbranch.to;
>> +             cpuc->lbr_entries[i].mispred    = 0;
>> +             cpuc->lbr_entries[i].predicted  = 0;
>> +             cpuc->lbr_entries[i].reserved   = 0;
>>       }
>>       cpuc->lbr_stack.nr = i;
>>  }
>> @@ -167,19 +169,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
>>
>>       for (i = 0; i < x86_pmu.lbr_nr; i++) {
>>               unsigned long lbr_idx = (tos - i) & mask;
>> -             u64 from, to, flags = 0;
>> +             u64 from, to, mis = 0, pred = 0;
>>
>>               rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
>>               rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
>>
>>               if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
>> -                     flags = !!(from & LBR_FROM_FLAG_MISPRED);
>> +                     mis = !!(from & LBR_FROM_FLAG_MISPRED);
>> +                     pred = !mis;
>>                       from = (u64)((((s64)from) << 1) >> 1);
>>               }
>>
>> -             cpuc->lbr_entries[i].from  = from;
>> -             cpuc->lbr_entries[i].to    = to;
>> -             cpuc->lbr_entries[i].flags = flags;
>> +             cpuc->lbr_entries[i].from       = from;
>> +             cpuc->lbr_entries[i].to         = to;
>> +             cpuc->lbr_entries[i].mispred    = mis;
>> +             cpuc->lbr_entries[i].predicted  = pred;
>> +             cpuc->lbr_entries[i].reserved   = 0;
>>       }
>>       cpuc->lbr_stack.nr = i;
>>  }
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 0b91db2..17751b1 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -129,11 +129,38 @@ enum perf_event_sample_format {
>>       PERF_SAMPLE_PERIOD                      = 1U << 8,
>>       PERF_SAMPLE_STREAM_ID                   = 1U << 9,
>>       PERF_SAMPLE_RAW                         = 1U << 10,
>> +     PERF_SAMPLE_BRANCH_STACK                = 1U << 11,
>>
>> -     PERF_SAMPLE_MAX = 1U << 11,             /* non-ABI */
>> +     PERF_SAMPLE_MAX = 1U << 12,             /* non-ABI */
>>  };
>>
>>  /*
>> + * values to program into branch_sample_type when PERF_SAMPLE_BRANCH_STACK is set
>> + *
>> + * If the user does not pass priv level information via branch_sample_type,
>> + * the kernel uses the event's priv level. Branch and event priv levels do
>> + * not have to match. Branch priv level is checked for permissions.
>> + *
>> + * The branch types can be combined, however BRANCH_ANY covers all types
>> + * of branches and therefore it supersedes all the other types.
>> + */
>> +enum perf_branch_sample_type {
>> +     PERF_SAMPLE_BRANCH_USER         = 1U << 0, /* user level branches */
>> +     PERF_SAMPLE_BRANCH_KERNEL       = 1U << 1, /* kernel level branches */
>> +
>> +     PERF_SAMPLE_BRANCH_ANY          = 1U << 2, /* any branch types */
>> +     PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << 3, /* any call branch */
>> +     PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 4, /* any return branch */
>> +     PERF_SAMPLE_BRANCH_IND_CALL     = 1U << 5, /* indirect calls */
>> +
>> +     PERF_SAMPLE_BRANCH_MAX          = 1U << 6,/* non-ABI */
>> +};
>> +
>> +#define PERF_SAMPLE_BRANCH_PLM_ALL \
>> +     (PERF_SAMPLE_BRANCH_USER|\
>> +      PERF_SAMPLE_BRANCH_KERNEL)
>> +
>> +/*
>>   * The format of the data returned by read() on a perf event fd,
>>   * as specified by attr.read_format:
>>   *
>> @@ -240,6 +267,7 @@ struct perf_event_attr {
>>               __u64           bp_len;
>>               __u64           config2; /* extension of config1 */
>>       };
>> +     __u64   branch_sample_type; /* enum perf_branch_sample_type */
>>  };
>>
>>  /*
>> @@ -458,6 +486,8 @@ enum perf_event_type {
>>        *
>>        *      { u32                   size;
>>        *        char                  data[size];}&& PERF_SAMPLE_RAW
>> +      *
>> +      *      { u64 nr; { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
>>        * };
>>        */
>>       PERF_RECORD_SAMPLE                      = 9,
>> @@ -530,12 +560,31 @@ struct perf_raw_record {
>>       void                            *data;
>>  };
>>
>> +/*
>> + * single taken branch record layout:
>> + *
>> + *      from: source instruction (may not always be a branch insn)
>> + *        to: branch target
>> + *   mispred: branch target was mispredicted
>> + * predicted: branch target was predicted
>> + *
>> + * support for mispred, predicted is optional. In case it
>> + * is not supported mispred = predicted = 0.
>> + */
> So the user-level perf tools would check for ((mispred == 0) && (predicted == 0))
> in a sample and report that it is not supported by the HW PMU? The point here is
> that if it is not supported, we should say "No HW support" rather than displaying
> mispred = 0 and predicted = 0, as this could be misleading.
>>  struct perf_branch_entry {
>> -     __u64                           from;
>> -     __u64                           to;
>> -     __u64                           flags;
>> +     __u64   from;
>> +     __u64   to;
>> +     __u64   mispred:1,  /* target mispredicted */
>> +             predicted:1,/* target predicted */
>> +             reserved:62;
>>  };
>>
>> +/*
>> + * branch stack layout:
>> + *  nr: number of taken branches stored in entries[]
>> + *
>> + * Note that nr can vary from sample to sample
>> + */
>>  struct perf_branch_stack {
>>       __u64                           nr;
>>       struct perf_branch_entry        entries[0];
>> @@ -566,7 +615,9 @@ struct hw_perf_event {
>>                       unsigned long   event_base;
>>                       int             idx;
>>                       int             last_cpu;
>> +
>>                       struct hw_perf_event_extra extra_reg;
>> +                     struct hw_perf_event_extra branch_reg;
>>               };
>>               struct { /* software */
>>                       struct hrtimer  hrtimer;
>> @@ -1003,12 +1054,14 @@ struct perf_sample_data {
>>       u64                             period;
>>       struct perf_callchain_entry     *callchain;
>>       struct perf_raw_record          *raw;
>> +     struct perf_branch_stack        *br_stack;
>>  };
>>
>>  static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
>>  {
>>       data->addr = addr;
>>       data->raw  = NULL;
>> +     data->br_stack = NULL;
>>  }
>>
>>  extern void perf_output_sample(struct perf_output_handle *handle,
>> @@ -1147,6 +1200,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
>>  # define perf_instruction_pointer(regs)      instruction_pointer(regs)
>>  #endif
>>
>> +static inline bool has_branch_stack(struct perf_event *event)
>> +{
>> +     return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
>> +}
>> +
>>  extern int perf_output_begin(struct perf_output_handle *handle,
>>                            struct perf_event *event, unsigned int size);
>>  extern void perf_output_end(struct perf_output_handle *handle);
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 91fb68a..ed39225 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -3877,6 +3877,24 @@ void perf_output_sample(struct perf_output_handle *handle,
>>                       }
>>               }
>>       }
>> +
>> +     if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> +             if (data->br_stack) {
>> +                     size_t size;
>> +
>> +                     size = data->br_stack->nr
>> +                          * sizeof(struct perf_branch_entry);
>> +
>> +                     perf_output_put(handle, data->br_stack->nr);
>> +                     perf_output_copy(handle, data->br_stack->entries, size);
>> +             } else {
>> +                     /*
>> +                      * we always store at least the value of nr
>> +                      */
>> +                     u64 nr = 0;
>> +                     perf_output_put(handle, nr);
>> +             }
>> +     }
>>  }
>>
>>  void perf_prepare_sample(struct perf_event_header *header,
>> @@ -3919,6 +3937,15 @@ void perf_prepare_sample(struct perf_event_header *header,
>>               WARN_ON_ONCE(size & (sizeof(u64)-1));
>>               header->size += size;
>>       }
>> +
>> +     if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> +             int size = sizeof(u64); /* nr */
>> +             if (data->br_stack) {
>> +                     size += data->br_stack->nr
>> +                           * sizeof(struct perf_branch_entry);
>> +             }
>> +             header->size += size;
>> +     }
>>  }
>>
>>  static void perf_event_output(struct perf_event *event,
>> @@ -5898,6 +5925,37 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
>>       if (attr->read_format & ~(PERF_FORMAT_MAX-1))
>>               return -EINVAL;
>>
>> +     if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> +             u64 mask = attr->branch_sample_type;
>> +
>> +             /* only using defined bits */
>> +             if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
>> +                     return -EINVAL;
>> +
>> +             /* at least one branch bit must be set */
>> +             if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
>> +                     return -EINVAL;
>> +
>> +             /* kernel level capture */
>> +             if ((mask & PERF_SAMPLE_BRANCH_KERNEL)
>> +                 && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
>> +                     return -EACCES;
>> +
>> +             /* propagate priv level, when not set for branch */
>> +             if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
>> +
>> +                     /* exclude_kernel checked on syscall entry */
>> +                     if (!attr->exclude_kernel)
>> +                             mask |= PERF_SAMPLE_BRANCH_KERNEL;
>> +
>> +                     if (!attr->exclude_user)
>> +                             mask |= PERF_SAMPLE_BRANCH_USER;
> Why are we not taking care of attr->exclude_hv? Should we not define
> PERF_SAMPLE_BRANCH_HV for hypervisor-level branches?

Yes, we can add this, though I don't have any system to test it.
I will post a patch to add this priv level.

>> +                     /*
>> +                      * adjust user setting (for HW filter setup)
>> +                      */
>> +                     attr->branch_sample_type = mask;
>> +             }
>> +     }
>>  out:
>>       return ret;
>>
>
>
> --
> Linux Technology Centre
> IBM Systems and Technology Group
> Bangalore India
>

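On the question of how a tool should treat mispred = predicted = 0: the record layout in the patch (a u64 nr followed by nr entries of from/to/flags) is enough to sketch a user-space decoder. The following is only an illustration of one possible approach. `struct branch_entry`, `prediction_state()` and `decode_branch_stack()` are hypothetical names, and the bit positions assume the bitfield order shown in struct perf_branch_entry (mispred in bit 0, predicted in bit 1), which holds for the usual little-endian x86 ABI but is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat view of one record: three u64 words per branch. */
struct branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t flags;	/* assumed: bit 0 = mispred, bit 1 = predicted */
};

enum prediction { PRED_UNSUPPORTED, PRED_HIT, PRED_MISS };

/* Both bits clear means the PMU did not report prediction data. */
static enum prediction prediction_state(const struct branch_entry *e)
{
	int mispred   = (int)(e->flags & 1);
	int predicted = (int)((e->flags >> 1) & 1);

	if (!mispred && !predicted)
		return PRED_UNSUPPORTED;
	return mispred ? PRED_MISS : PRED_HIT;
}

/*
 * Copy up to max entries out of a raw sample payload that starts with nr.
 * Returns the number of entries copied, or -1 if the buffer is too short
 * for the nr it announces.
 */
static long decode_branch_stack(const void *buf, size_t len,
				struct branch_entry *out, size_t max)
{
	uint64_t nr;
	size_t n;

	assert(buf != NULL && out != NULL);
	if (len < sizeof(nr))
		return -1;
	memcpy(&nr, buf, sizeof(nr));
	if (nr > (len - sizeof(nr)) / sizeof(*out))
		return -1;

	n = nr < max ? (size_t)nr : max;
	memcpy(out, (const char *)buf + sizeof(nr), n * sizeof(*out));
	return (long)n;
}
```

A report tool built this way could print "No HW support" for PRED_UNSUPPORTED instead of showing two zero columns, which is the behaviour asked for in the review comment.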

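The branch_sample_type checks in perf_copy_attr() and the hypervisor level raised in the review can be sketched outside the kernel. What follows is an assumption-laden illustration, not the promised follow-up patch: `BR_HV` is a hypothetical bit (this series defines no PERF_SAMPLE_BRANCH_HV), its value, `struct mock_attr` and the `fixup_branch_mask()` helper are invented here, and the perf_paranoid_kernel()/capable() test is reduced to a boolean argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the PERF_SAMPLE_BRANCH_* bits in the patch. */
#define BR_USER     (1U << 0)
#define BR_KERNEL   (1U << 1)
#define BR_ANY      (1U << 2)
#define BR_ANY_CALL (1U << 3)
#define BR_ANY_RET  (1U << 4)
#define BR_IND_CALL (1U << 5)
#define BR_HV       (1U << 6)	/* hypothetical, not part of this series */
#define BR_MAX      (1U << 7)
#define BR_PLM_ALL  (BR_USER | BR_KERNEL | BR_HV)

/* Hypothetical subset of perf_event_attr used by the checks. */
struct mock_attr {
	uint64_t branch_sample_type;
	bool exclude_user, exclude_kernel, exclude_hv;
};

static int fixup_branch_mask(struct mock_attr *attr, bool may_profile_kernel)
{
	uint64_t mask;

	assert(attr != NULL);
	mask = attr->branch_sample_type;

	/* Widen before inverting so the upper 32 bits are checked too. */
	if (mask & ~(uint64_t)(BR_MAX - 1))
		return -EINVAL;		/* undefined bits */
	if (!(mask & ~(uint64_t)BR_PLM_ALL))
		return -EINVAL;		/* no branch type selected */
	if ((mask & BR_KERNEL) && !may_profile_kernel)
		return -EACCES;

	/* No priv level given: inherit the event's, here including hv. */
	if (!(mask & BR_PLM_ALL)) {
		if (!attr->exclude_kernel)
			mask |= BR_KERNEL;
		if (!attr->exclude_user)
			mask |= BR_USER;
		if (!attr->exclude_hv)
			mask |= BR_HV;
		attr->branch_sample_type = mask;
	}
	return 0;
}
```

As in the patch, the privilege bits are only filled in when the caller supplied none, so an explicit request such as user-only branches on a kernel-level event is left untouched.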
* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
                   ` (13 preceding siblings ...)
  2012-01-23 10:14 ` [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
@ 2012-01-27 12:09 ` Peter Zijlstra
  2012-01-27 18:20   ` Arun Sharma
  14 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2012-01-27 12:09 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, robert.richter, ming.m.lin, andi,
	asharma, ravitillo, vweaver1

Arnaldo,

On Mon, 2012-01-09 at 17:49 +0100, Stephane Eranian wrote:
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
> 
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
> 
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>   if (n & 1UL)
>     f2();
>   else
>     f3();
> }
> int main(void)
> {
>   unsigned long i;
> 
>   for (i=0; i < N; i++)
>    f1(i);
>   return 0;
> }
> 
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
> 
>     18.13%  [.] f1            [.] main                          
>     18.10%  [.] main          [.] main                          
>     18.01%  [.] main          [.] f1                            
>     15.69%  [.] f1            [.] f1                            
>      9.11%  [.] f3            [.] f1                            
>      6.78%  [.] f1            [.] f3                            
>      6.74%  [.] f1            [.] f2                            
>      6.71%  [.] f2            [.] f1                            
> 
> Of the total number of branches captured, 18.13% were from f1() -> main().
> 
> Let's make this clearer by filtering the user call branches only:
> 
> $ perf record -b any_call -e cycles:u branchy
> $ perf report -b
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>     52.50%  [.] main                   [.] f1                   
>     23.99%  [.] f1                     [.] f3                   
>     23.48%  [.] f1                     [.] f2                   
>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>      0.01%  [k] _start                 [k] __libc_start_main    
> 
> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
> that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
> 
> 
> Here is a kernel example, where we want to sample indirect calls:
> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10 
> $ perf report -b
> #
> # Overhead  Source Symbol               Target Symbol
> # ........  ..........................  ..........................
> #
>     36.36%  [k] __delay                 [k] delay_tsc             
>      9.09%  [k] ktime_get               [k] read_tsc              
>      9.09%  [k] getnstimeofday          [k] read_tsc              
>      9.09%  [k] notifier_call_chain     [k] tick_notify           
>      4.55%  [k] cpuidle_idle_call       [k] intel_idle            
>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect          
>      2.27%  [k] handle_irq              [k] handle_edge_irq       
>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write 
>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt     
>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn     
>      2.27%  [k] enqueue_task            [k] enqueue_task_rt       
>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt     
>      2.27%  [k] do_timer                [k] read_tsc              
> 
> Due to HW limitations, branch filtering may be approximate on
> Core, Atom processors. It is more accurate on Nehalem, Westmere
> and best on Sandy Bridge. 

Can I have your ACK on this userspace stuff (patches 11-13)?

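For readers of the report output quoted above: each Overhead figure is the share of all captured branch records that had that source/target pair. A small sketch of that bookkeeping follows. It is not the perf report implementation; `struct tally_table`, `tally()` and `overhead()` are hypothetical, and symbols are plain strings rather than resolved addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAIRS 64

struct pair {
	const char *from;	/* source symbol */
	const char *to;		/* target symbol */
	unsigned long count;	/* branches seen for this pair */
};

struct tally_table {
	struct pair pairs[MAX_PAIRS];
	size_t nr_pairs;
	unsigned long total;	/* all branch records seen */
};

/* Count one captured branch; returns 0 on success, -1 if the table is full. */
static int tally(struct tally_table *t, const char *from, const char *to)
{
	size_t i;

	assert(t != NULL && from != NULL && to != NULL);
	for (i = 0; i < t->nr_pairs; i++) {
		if (!strcmp(t->pairs[i].from, from) &&
		    !strcmp(t->pairs[i].to, to))
			break;
	}
	if (i == t->nr_pairs) {
		if (t->nr_pairs == MAX_PAIRS)
			return -1;
		t->pairs[i].from = from;
		t->pairs[i].to = to;
		t->pairs[i].count = 0;
		t->nr_pairs++;
	}
	t->pairs[i].count++;
	t->total++;
	return 0;
}

/* Percentage of all captured branches that went from -> to. */
static double overhead(const struct tally_table *t, const char *from,
		       const char *to)
{
	size_t i;

	assert(t != NULL && from != NULL && to != NULL);
	for (i = 0; i < t->nr_pairs; i++) {
		if (!strcmp(t->pairs[i].from, from) &&
		    !strcmp(t->pairs[i].to, to))
			return t->total ?
			       100.0 * t->pairs[i].count / t->total : 0.0;
	}
	return 0.0;
}
```

Feeding every entry of every sample's branch stack through tally() and then printing overhead() per pair gives a table of the same shape as the one in the cover letter.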

* Re: [PATCH 00/13] perf_events: add support for sampling taken branches (v3)
  2012-01-27 12:09 ` Peter Zijlstra
@ 2012-01-27 18:20   ` Arun Sharma
  0 siblings, 0 replies; 36+ messages in thread
From: Arun Sharma @ 2012-01-27 18:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, mingo, acme, robert.richter,
	ming.m.lin, andi, ravitillo, vweaver1

On 1/27/12 4:09 AM, Peter Zijlstra wrote:
>> Here is a kernel example, where we want to sample indirect calls:
>> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
>> $ perf report -b
>> #
>> # Overhead  Source Symbol               Target Symbol
>> # ........  ..........................  ..........................
>> #
>>      36.36%  [k] __delay                 [k] delay_tsc
>>       9.09%  [k] ktime_get               [k] read_tsc
>>       9.09%  [k] getnstimeofday          [k] read_tsc
>>       9.09%  [k] notifier_call_chain     [k] tick_notify
>>       4.55%  [k] cpuidle_idle_call       [k] intel_idle
>>       4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>>       2.27%  [k] handle_irq              [k] handle_edge_irq
>>       2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>>       2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>>       2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>>       2.27%  [k] enqueue_task            [k] enqueue_task_rt
>>       2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>>       2.27%  [k] do_timer                [k] read_tsc
>>
>> Due to HW limitations, branch filtering may be approximate on
>> Core, Atom processors. It is more accurate on Nehalem, Westmere
>> and best on Sandy Bridge.
>
> Can I have your ACK on this userspace stuff (patches 11-13)?

While the current "Source -> Target" based UI works well for many cases,
it'd be nice for "-g -b any_call" to produce callgraph-like output, so
that userspace programs compiled without frame pointers still get a
limited callgraph.

  -Arun



Thread overview: 36+ messages
2012-01-09 16:49 [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
2012-01-09 16:49 ` [PATCH 01/13] perf_events: add generic taken branch sampling support (v3) Stephane Eranian
2012-01-27  4:46   ` Anshuman Khandual
2012-01-27  9:57     ` Stephane Eranian
2012-01-09 16:49 ` [PATCH 02/13] perf_events: add Intel LBR MSR definitions (v3) Stephane Eranian
2012-01-27  5:03   ` Anshuman Khandual
2012-01-09 16:49 ` [PATCH 03/13] perf_events: add Intel X86 LBR sharing logic (v3) Stephane Eranian
2012-01-09 16:49 ` [PATCH 04/13] perf_events: sync branch stack sampling with X86 precise_sampling (v3) Stephane Eranian
2012-01-27  5:26   ` Anshuman Khandual
2012-01-09 16:49 ` [PATCH 05/13] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters (v3) Stephane Eranian
2012-01-27  5:41   ` Anshuman Khandual
2012-01-09 16:49 ` [PATCH 06/13] perf_events: disable LBR support for older Intel Atom processors (v3) Stephane Eranian
2012-01-27  5:43   ` Anshuman Khandual
2012-01-09 16:49 ` [PATCH 07/13] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 (v3) Stephane Eranian
2012-01-27  6:14   ` Anshuman Khandual
2012-01-09 16:49 ` [PATCH 08/13] perf_events: add LBR software filter support " Stephane Eranian
2012-01-09 16:49 ` [PATCH 09/13] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported (v3) Stephane Eranian
2012-01-27  7:15   ` Anshuman Khandual
2012-01-27  9:56     ` Stephane Eranian
2012-01-09 16:49 ` [PATCH 10/13] perf_events: add hook to flush branch_stack on context switch (v3) Stephane Eranian
2012-01-09 16:49 ` [PATCH 11/13] perf: add code to support PERF_SAMPLE_BRANCH_STACK (v3) Stephane Eranian
2012-01-10  1:25   ` Arun Sharma
2012-01-10 15:43     ` Stephane Eranian
2012-01-09 16:49 ` [PATCH 12/13] perf: add support for sampling taken branch to perf record (v3) Stephane Eranian
2012-01-09 16:49 ` [PATCH 13/13] perf: add support for taken branch sampling to perf report (v3) Stephane Eranian
2012-01-23 10:14 ` [PATCH 00/13] perf_events: add support for sampling taken branches (v3) Stephane Eranian
2012-01-23 12:25   ` Peter Zijlstra
2012-01-23 15:07     ` Stephane Eranian
2012-01-23 15:47       ` Andi Kleen
2012-01-23 17:14     ` Stephane Eranian
2012-01-24 15:39       ` Stephane Eranian
2012-01-24 16:08         ` David Ahern
2012-01-24 17:42           ` Stephane Eranian
2012-01-26 16:21           ` Stephane Eranian
2012-01-27 12:09 ` Peter Zijlstra
2012-01-27 18:20   ` Arun Sharma
