* [PATCH v4 00/18] perf: add support for sampling taken branches
@ 2012-01-27 20:56 Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 01/18] perf: add generic taken branch sampling support Stephane Eranian
                   ` (19 more replies)
  0 siblings, 20 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patchset adds an important and useful new feature to
perf_events: branch stack sampling, i.e., the ability to
capture taken branches in each sample.

Statistical sampling of taken branches should not be confused
with branch tracing. Not all branches are necessarily captured.

Sampling taken branches is important for basic block profiling,
statistical call graphs, and function call counts. Many of those
measurements can help drive a compiler optimizer.

The branch stack is a software abstraction which sits on top
of the PMU hardware. As such, it is not available on all
processors. For now, the patch provides the generic interface
and the Intel X86 implementation where it leverages the Last
Branch Record (LBR) feature (from Core2 to SandyBridge).

Branch stack sampling is supported for both per-thread and
system-wide modes.

It is possible to filter the type and privilege level of branches
to sample. The target of the branch is used to determine
the privilege level.

For each branch, the source and destination are captured. On
some hardware platforms, it may be possible to also extract
the target prediction and, in that case, it is also exposed
to end users.

The branch stack can record a variable number of taken
branches per sample. Those branches are always consecutive
in time. The number of branches captured depends on the
filtering and the underlying hardware. On Intel Nehalem
and later, up to 16 consecutive branches can be captured
per sample.

Branch sampling is always coupled with an event. It can
be any PMU event but it can't be a SW or tracepoint event.

Branch sampling is requested by setting a new sample_type
flag called: PERF_SAMPLE_BRANCH_STACK.

To support branch filtering, we introduce a new field
in the perf_event_attr struct: branch_sample_type. We chose
NOT to overload the config1 and config2 fields because those
are related to the event encoding. Branch stack is a
separate feature which is combined with the event.

The branch_sample_type is a bitmask of possible filters.
The following filters are defined (more can be added):
- PERF_SAMPLE_BRANCH_ANY       : any control flow change
- PERF_SAMPLE_BRANCH_USER      : branches when target is at user level
- PERF_SAMPLE_BRANCH_KERNEL    : branches when target is at kernel level
- PERF_SAMPLE_BRANCH_HV        : branches when target is at hypervisor level
- PERF_SAMPLE_BRANCH_ANY_CALL  : call branches (incl. syscalls)
- PERF_SAMPLE_BRANCH_ANY_RETURN: return branches (incl. syscall returns)
- PERF_SAMPLE_BRANCH_IND_CALL  : indirect calls

It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.

When the privilege level is not specified, the branch stack
inherits that of the associated event.
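
As a rough illustration (not part of the patchset), the attribute setup
described above could look like the sketch below. It assumes a
<linux/perf_event.h> that already carries the definitions from this
series. The helper name and the sampling period are made up for the
example, and the perf_event_open() call itself is left out.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Fill *attr so that each cycles sample carries a branch stack limited
 * to indirect calls whose target is at user or kernel level.
 * The helper name and the sampling period are illustrative only.
 */
void init_branch_sampling_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size          = sizeof(*attr);
	attr->type          = PERF_TYPE_HARDWARE;	/* must be a PMU event */
	attr->config        = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000;			/* arbitrary choice */

	/* request the branch stack in every sample */
	attr->sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;

	/* combined filter: IND_CALL|USER|KERNEL */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_IND_CALL |
				   PERF_SAMPLE_BRANCH_USER |
				   PERF_SAMPLE_BRANCH_KERNEL;
}
```

Leaving out the USER/KERNEL bits in branch_sample_type would make the
branch stack inherit the privilege levels of the event instead.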

Some processors may not offer hardware branch filtering, e.g., Intel
Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
X86 implementation in this patchset also provides a SW branch filter
which works on a best effort basis. It can compensate for the lack
of LBR filtering. But first and foremost, it helps work around LBR
filtering errata. The goal is to only capture the type of branches
requested by the user.

It is possible to combine branch stack sampling with PEBS on Intel
X86 processors. Depending on the precise_sampling mode, there are
certain filtering restrictions. When precise_sampling=1,
there are no filtering restrictions. When precise_sampling > 1,
only the ANY|USER|KERNEL filter can be used. This comes from
the fact that the kernel uses LBR to compensate for the PEBS
off-by-1 skid on the instruction pointer.

To demonstrate how the perf_event branch stack sampling interface
works, the patchset also modifies perf record to capture taken
branches. Similarly perf report is enhanced to display a histogram
of taken branches.

I would like to thank Roberto Vitillo @ LBL for his work on the perf
tool for this.

Enough talking, let's take a simple example. Our trivial test program
goes like this:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

$ perf record -b any branchy
$ perf report -b
# Events: 23K cycles
#
# Overhead  Source Symbol     Target Symbol
# ........  ................  ................

    18.13%  [.] f1            [.] main                          
    18.10%  [.] main          [.] main                          
    18.01%  [.] main          [.] f1                            
    15.69%  [.] f1            [.] f1                            
     9.11%  [.] f3            [.] f1                            
     6.78%  [.] f1            [.] f3                            
     6.74%  [.] f1            [.] f2                            
     6.71%  [.] f2            [.] f1                            

Of the total number of branches captured, 18.13% were from f1() -> main().

Let's make this clearer by filtering the user call branches only:

$ perf record -b any_call -e cycles:u branchy
$ perf report -b
# Events: 19K cycles
#
# Overhead  Source Symbol              Target Symbol
# ........  .........................  .........................
#
    52.50%  [.] main                   [.] f1                   
    23.99%  [.] f1                     [.] f3                   
    23.48%  [.] f1                     [.] f2                   
     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
     0.01%  [k] _start                 [k] __libc_start_main    

Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
that f1() dispatches based on odd vs. even values of n, which is constantly increasing.


Here is a kernel example, where we want to sample indirect calls:
$ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10 
$ perf report -b
#
# Overhead  Source Symbol               Target Symbol
# ........  ..........................  ..........................
#
    36.36%  [k] __delay                 [k] delay_tsc             
     9.09%  [k] ktime_get               [k] read_tsc              
     9.09%  [k] getnstimeofday          [k] read_tsc              
     9.09%  [k] notifier_call_chain     [k] tick_notify           
     4.55%  [k] cpuidle_idle_call       [k] intel_idle            
     4.55%  [k] cpuidle_idle_call       [k] menu_reflect          
     2.27%  [k] handle_irq              [k] handle_edge_irq       
     2.27%  [k] ack_apic_edge           [k] native_apic_mem_write 
     2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt     
     2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn     
     2.27%  [k] enqueue_task            [k] enqueue_task_rt       
     2.27%  [k] try_to_wake_up          [k] select_task_rq_rt     
     2.27%  [k] do_timer                [k] read_tsc              

Due to HW limitations, branch filtering may be approximate on
Core and Atom processors. It is more accurate on Nehalem and
Westmere, and best on Sandy Bridge.

In version 2, we've updated the patch to tip/master (commit 5734857) and
we've incorporated the feedback from v1 concerning the anonymous bitfield
struct for branch_stack_entry and the handling of i386 ABI binaries
on 64-bit hosts in the instr decoder for the LBR SW filter.

In version 3, we've updated to 3.2.0-tip. The Atom revision
check has been put into its own patch. We fixed a browser
issue with perf report. We fixed all the style issues as well.

In version 4, we've modified the branch stack API to add a missing
priv level: hypervisor. There is a new PERF_SAMPLE_BRANCH_HV. It
is not used on Intel X86. Thanks to khandual@linux.vnet.ibm.com
for pointing this out. We also fix a compilation error on ARM.

In version 4, we also extend the patch to include the changes necessary
to the perf tool to support reading perf.data files which were produced
from older perf_event ABI revisions. This patch set extends the ABI
with a new field in struct perf_event_attr. That struct is saved as
is in the perf.data file. Therefore, older perf.data files contain
a smaller perf_event_attr struct, yet perf must process them transparently.
That's not the case today: perf dies with 'incompatible file format'.

The patch solves this problem and, at the same time, decouples endianness
detection from the size of perf_event_attr. Endianness is now detected via
the signature (the first 8 bytes of the file). We introduce a new signature
(PERFILE2). Its byte layout in the file depends on the endianness of the
host where the file is written. Therefore, we can dynamically detect
the endianness by simply reading the first 8 bytes. The size of the
perf_event_attr struct can then be interpreted according to the endianness.
The size field no longer doubles as the endianness marker, so that ambiguity
is gone. We can now distinguish an older ABI by the size and not confuse it
with an endianness mismatch.
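
To make the detection scheme concrete, here is a minimal standalone
sketch (not the code from the patch). It assumes the writer stores the
signature as a native 64-bit integer; the constant below is simply the
ASCII string "PERFILE2" read as a little-endian integer, and all names
are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* ASCII "PERFILE2" interpreted as a little-endian 64-bit integer. */
#define MAGIC2 0x32454c4946524550ULL

enum file_endian { ENDIAN_SAME, ENDIAN_SWAPPED, ENDIAN_UNKNOWN };

/* Reverse the byte order of a 64-bit value. */
uint64_t swap_u64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8) |
	    ((v >> 8) & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) |
	    ((v >> 16) & 0x0000ffff0000ffffULL);
	return (v << 32) | (v >> 32);
}

/*
 * Classify the first 8 bytes of a file. A native-order load that
 * yields MAGIC2 means writer and reader agree on endianness; the
 * byte-reversed value means every multi-byte field must be swapped.
 */
enum file_endian classify_signature(const unsigned char sig[8])
{
	uint64_t v;

	memcpy(&v, sig, sizeof(v));	/* native-order load */
	if (v == MAGIC2)
		return ENDIAN_SAME;
	if (v == swap_u64(MAGIC2))
		return ENDIAN_SWAPPED;
	return ENDIAN_UNKNOWN;
}
```

Once the endianness is known, the attr size read from the header can be
swapped if needed and then compared against the known ABI sizes.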

Signed-off-by: Stephane Eranian <eranian@google.com>


Roberto Agostino Vitillo (3):
  perf: add code to support PERF_SAMPLE_BRANCH_STACK
  perf: add support for sampling taken branch to perf record
  perf: add support for taken branch sampling to perf report

Stephane Eranian (15):
  perf: add generic taken branch sampling support
  perf: add Intel LBR MSR definitions
  perf: add Intel X86 LBR sharing logic
  perf: sync branch stack sampling with X86 precise_sampling
  perf: add LBR mappings for PERF_SAMPLE_BRANCH filters
  perf: disable LBR support for older Intel Atom processors
  perf: implement PERF_SAMPLE_BRANCH for Intel X86
  perf: add LBR software filter support for Intel X86
  perf: disable PERF_SAMPLE_BRANCH_* when not supported
  perf: add hook to flush branch_stack on context switch
  perf: fix endianness detection in perf.data
  perf: add ABI reference sizes
  perf: enable reading of perf.data files from different ABI rev
  perf: fix bug print_event_desc()
  perf: make perf able to read file from older ABIs

 arch/alpha/kernel/perf_event.c             |    4 +
 arch/arm/kernel/perf_event.c               |    4 +
 arch/mips/kernel/perf_event_mipsxx.c       |    4 +
 arch/powerpc/kernel/perf_event.c           |    4 +
 arch/sh/kernel/perf_event.c                |    4 +
 arch/sparc/kernel/perf_event.c             |    4 +
 arch/x86/include/asm/msr-index.h           |    7 +
 arch/x86/kernel/cpu/perf_event.c           |   47 ++-
 arch/x86/kernel/cpu/perf_event.h           |   19 +
 arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
 arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  532 ++++++++++++++++++++++++++--
 include/linux/perf_event.h                 |   82 ++++-
 kernel/events/core.c                       |  177 +++++++++
 kernel/events/hw_breakpoint.c              |    6 +
 tools/perf/Documentation/perf-record.txt   |   25 ++
 tools/perf/Documentation/perf-report.txt   |    7 +
 tools/perf/builtin-record.c                |   74 ++++
 tools/perf/builtin-report.c                |   98 +++++-
 tools/perf/perf.h                          |   18 +
 tools/perf/util/annotate.c                 |    2 +-
 tools/perf/util/event.h                    |    1 +
 tools/perf/util/evsel.c                    |   14 +
 tools/perf/util/header.c                   |  231 +++++++++++--
 tools/perf/util/hist.c                     |   93 ++++-
 tools/perf/util/hist.h                     |    7 +
 tools/perf/util/session.c                  |   72 ++++
 tools/perf/util/session.h                  |    4 +
 tools/perf/util/sort.c                     |  362 ++++++++++++++-----
 tools/perf/util/sort.h                     |    5 +
 tools/perf/util/symbol.h                   |   13 +
 32 files changed, 1835 insertions(+), 230 deletions(-)



* [PATCH v4 01/18] perf: add generic taken branch sampling support
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 02/18] perf: add Intel LBR MSR definitions Stephane Eranian
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis, for instance basic block profiling, call
counts, and statistical call graphs.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel X86, it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branch sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_BRANCH_USER
  only branches whose targets are at the user level

- PERF_SAMPLE_BRANCH_KERNEL
  only branches whose targets are at the kernel level

- PERF_SAMPLE_BRANCH_HV
  only branches whose targets are at the hypervisor level

- PERF_SAMPLE_BRANCH_ANY
  any type of branches (subject to priv levels filters)

- PERF_SAMPLE_BRANCH_ANY_CALL
  any call branches (may incl. syscall on some arch)

- PERF_SAMPLE_BRANCH_ANY_RETURN
  any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_BRANCH_IND_CALL
  indirect call branches

Filters may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event is used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).
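
The mask rules above can be modeled in a few lines of standalone C.
This is only a sketch: the BR_* constants are local copies of the bits
in enum perf_branch_sample_type, struct event_plm is a hypothetical
stand-in for the exclude_* bits of the event, and the permission check
is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the filter bits from enum perf_branch_sample_type. */
enum {
	BR_USER     = 1U << 0,
	BR_KERNEL   = 1U << 1,
	BR_HV       = 1U << 2,
	BR_ANY      = 1U << 3,
	BR_ANY_CALL = 1U << 4,
	BR_ANY_RET  = 1U << 5,
	BR_IND_CALL = 1U << 6,
	BR_MAX      = 1U << 7,
};
#define BR_PLM_ALL (BR_USER | BR_KERNEL | BR_HV)

/* Hypothetical stand-in for the exclude_* bits of the event. */
struct event_plm {
	bool exclude_user, exclude_kernel, exclude_hv;
};

/*
 * Returns 0 and stores the effective mask in *out, or -1 if the mask
 * uses undefined bits or names no branch type. When no priv level bit
 * is set, the priv levels of the event are inherited.
 */
int resolve_branch_mask(uint64_t mask, const struct event_plm *ev,
			uint64_t *out)
{
	/* only defined bits may be used */
	if (mask & ~(uint64_t)(BR_MAX - 1))
		return -1;

	/* at least one branch type bit must be set */
	if (!(mask & ~(uint64_t)BR_PLM_ALL))
		return -1;

	/* no priv level given: take it from the event */
	if (!(mask & BR_PLM_ALL)) {
		if (!ev->exclude_user)
			mask |= BR_USER;
		if (!ev->exclude_kernel)
			mask |= BR_KERNEL;
		if (!ev->exclude_hv)
			mask |= BR_HV;
	}
	*out = mask;
	return 0;
}
```

For a user-only event, a mask of just ANY_CALL resolves to
ANY_CALL|USER, while IND_CALL|KERNEL is kept as is because a priv
level was given explicitly.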

The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, and the executed code. Therefore
each sample records the number of taken branches it contains.
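
A consumer therefore has to read the count first and then that many
records. The sketch below shows one way to walk such a payload from a
raw buffer. It assumes the layout is a u64 count followed by
{from, to, flags} triples of u64; the struct and function names are
illustrative, and the flags word is kept opaque because bitfield
placement is ABI dependent.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One taken branch record; names are illustrative. */
struct br_entry {
	uint64_t from;	/* source address */
	uint64_t to;	/* target address */
	uint64_t flags;	/* word holding the mispred/predicted bits */
};

/*
 * Return the number of records stored at buf, or -1 if the buffer
 * is too short to hold the count or the records it announces.
 */
int64_t branch_stack_count(const unsigned char *buf, size_t len)
{
	uint64_t nr;

	if (len < sizeof(nr))
		return -1;
	memcpy(&nr, buf, sizeof(nr));
	if (nr > (len - sizeof(nr)) / sizeof(struct br_entry))
		return -1;
	return (int64_t)nr;
}

/* Copy record idx into *out; returns 0 on success, -1 if out of range. */
int branch_stack_get(const unsigned char *buf, size_t len, uint64_t idx,
		     struct br_entry *out)
{
	int64_t nr = branch_stack_count(buf, len);

	if (nr < 0 || idx >= (uint64_t)nr)
		return -1;
	memcpy(out, buf + sizeof(uint64_t) + idx * sizeof(*out), sizeof(*out));
	return 0;
}
```

A sample with no captured branches still carries the count, so
branch_stack_count() returns 0 rather than an error in that case.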

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 +++++---
 include/linux/perf_event.h                 |   68 ++++++++++++++++++++++++++--
 kernel/events/core.c                       |   68 ++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 3fab3de..c3f8100 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -144,9 +144,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		cpuc->lbr_entries[i].from  = msr_lastbranch.from;
-		cpuc->lbr_entries[i].to    = msr_lastbranch.to;
-		cpuc->lbr_entries[i].flags = 0;
+		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
+		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
+		cpuc->lbr_entries[i].mispred	= 0;
+		cpuc->lbr_entries[i].predicted	= 0;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
@@ -167,19 +169,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		unsigned long lbr_idx = (tos - i) & mask;
-		u64 from, to, flags = 0;
+		u64 from, to, mis = 0, pred = 0;
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
 		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
 
 		if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
-			flags = !!(from & LBR_FROM_FLAG_MISPRED);
+			mis = !!(from & LBR_FROM_FLAG_MISPRED);
+			pred = !mis;
 			from = (u64)((((s64)from) << 1) >> 1);
 		}
 
-		cpuc->lbr_entries[i].from  = from;
-		cpuc->lbr_entries[i].to    = to;
-		cpuc->lbr_entries[i].flags = flags;
+		cpuc->lbr_entries[i].from	= from;
+		cpuc->lbr_entries[i].to		= to;
+		cpuc->lbr_entries[i].mispred	= mis;
+		cpuc->lbr_entries[i].predicted	= pred;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0b91db2..9ef3002 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -129,11 +129,40 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_BRANCH_STACK		= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
+ * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
+ *
+ * If the user does not pass priv level information via branch_sample_type,
+ * the kernel uses the event's priv level. Branch and event priv levels do
+ * not have to match. Branch priv level is checked for permissions.
+ *
+ * The branch types can be combined, however BRANCH_ANY covers all types
+ * of branches and therefore it supersedes all the other types.
+ */
+enum perf_branch_sample_type {
+	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user branches */
+	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel branches */
+	PERF_SAMPLE_BRANCH_HV		= 1U << 2, /* hypervisor branches */
+
+	PERF_SAMPLE_BRANCH_ANY		= 1U << 3, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 4, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 5, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 6, /* indirect calls */
+
+	PERF_SAMPLE_BRANCH_MAX		= 1U << 7, /* non-ABI */
+};
+
+#define PERF_SAMPLE_BRANCH_PLM_ALL \
+	(PERF_SAMPLE_BRANCH_USER|\
+	 PERF_SAMPLE_BRANCH_KERNEL|\
+	 PERF_SAMPLE_BRANCH_HV)
+
+/*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
  *
@@ -240,6 +269,7 @@ struct perf_event_attr {
 		__u64		bp_len;
 		__u64		config2; /* extension of config1 */
 	};
+	__u64	branch_sample_type; /* enum branch_sample_type */
 };
 
 /*
@@ -458,6 +488,8 @@ enum perf_event_type {
 	 *
 	 *	{ u32			size;
 	 *	  char                  data[size];}&& PERF_SAMPLE_RAW
+	 *
+	 *	{ u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -530,12 +562,31 @@ struct perf_raw_record {
 	void				*data;
 };
 
+/*
+ * single taken branch record layout:
+ *
+ *      from: source instruction (may not always be a branch insn)
+ *        to: branch target
+ *   mispred: branch target was mispredicted
+ * predicted: branch target was predicted
+ *
+ * support for mispred, predicted is optional. In case it
+ * is not supported mispred = predicted = 0.
+ */
 struct perf_branch_entry {
-	__u64				from;
-	__u64				to;
-	__u64				flags;
+	__u64	from;
+	__u64	to;
+	__u64	mispred:1,  /* target mispredicted */
+		predicted:1,/* target predicted */
+		reserved:62;
 };
 
+/*
+ * branch stack layout:
+ *  nr: number of taken branches stored in entries[]
+ *
+ * Note that nr can vary from sample to sample
+ */
 struct perf_branch_stack {
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
@@ -566,7 +617,9 @@ struct hw_perf_event {
 			unsigned long	event_base;
 			int		idx;
 			int		last_cpu;
+
 			struct hw_perf_event_extra extra_reg;
+			struct hw_perf_event_extra branch_reg;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;
@@ -1003,12 +1056,14 @@ struct perf_sample_data {
 	u64				period;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_branch_stack	*br_stack;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
 {
 	data->addr = addr;
 	data->raw  = NULL;
+	data->br_stack = NULL;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
@@ -1147,6 +1202,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
 # define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
+static inline bool has_branch_stack(struct perf_event *event)
+{
+	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index de859fb..c4520a2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -118,6 +118,13 @@ static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
 		       PERF_FLAG_FD_OUTPUT  |\
 		       PERF_FLAG_PID_CGROUP)
 
+/*
+ * branch priv levels that need permission checks
+ */
+#define PERF_SAMPLE_BRANCH_PERM_PLM \
+	(PERF_SAMPLE_BRANCH_KERNEL |\
+	 PERF_SAMPLE_BRANCH_HV)
+
 enum event_type_t {
 	EVENT_FLEXIBLE = 0x1,
 	EVENT_PINNED = 0x2,
@@ -3877,6 +3884,24 @@ void perf_output_sample(struct perf_output_handle *handle,
 			}
 		}
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		if (data->br_stack) {
+			size_t size;
+
+			size = data->br_stack->nr
+			     * sizeof(struct perf_branch_entry);
+
+			perf_output_put(handle, data->br_stack->nr);
+			perf_output_copy(handle, data->br_stack->entries, size);
+		} else {
+			/*
+			 * we always store at least the value of nr
+			 */
+			u64 nr = 0;
+			perf_output_put(handle, nr);
+		}
+	}
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
@@ -3919,6 +3944,15 @@ void perf_prepare_sample(struct perf_event_header *header,
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		int size = sizeof(u64); /* nr */
+		if (data->br_stack) {
+			size += data->br_stack->nr
+			      * sizeof(struct perf_branch_entry);
+		}
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -5898,6 +5932,40 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
 		return -EINVAL;
 
+	if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 mask = attr->branch_sample_type;
+
+		/* only using defined bits */
+		if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
+			return -EINVAL;
+
+		/* at least one branch bit must be set */
+		if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
+			return -EINVAL;
+
+		/* kernel level capture: check permissions */
+		if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
+		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		/* propagate priv level, when not set for branch */
+		if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
+
+			/* exclude_kernel checked on syscall entry */
+			if (!attr->exclude_kernel)
+				mask |= PERF_SAMPLE_BRANCH_KERNEL;
+
+			if (!attr->exclude_user)
+				mask |= PERF_SAMPLE_BRANCH_USER;
+
+			if (!attr->exclude_hv)
+				mask |= PERF_SAMPLE_BRANCH_HV;
+			/*
+			 * adjust user setting (for HW filter setup)
+			 */
+			attr->branch_sample_type = mask;
+		}
+	}
 out:
 	return ret;
 
-- 
1.7.1



* [PATCH v4 02/18] perf: add Intel LBR MSR definitions
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 01/18] perf: add generic taken branch sampling support Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 03/18] perf: add Intel X86 LBR sharing logic Stephane Eranian
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architected LBR MSRs:
LBR_SELECT, LBR_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/msr-index.h           |    7 +++++++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   18 +++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a6962d9..ccb8059 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -56,6 +56,13 @@
 #define MSR_OFFCORE_RSP_0		0x000001a6
 #define MSR_OFFCORE_RSP_1		0x000001a7
 
+#define MSR_LBR_SELECT			0x000001c8
+#define MSR_LBR_TOS			0x000001c9
+#define MSR_LBR_NHM_FROM		0x00000680
+#define MSR_LBR_NHM_TO			0x000006c0
+#define MSR_LBR_CORE_FROM		0x00000040
+#define MSR_LBR_CORE_TO			0x00000060
+
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index c3f8100..e14431f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -205,23 +205,23 @@ void intel_pmu_lbr_read(void)
 void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
 
 void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x680;
-	x86_pmu.lbr_to     = 0x6c0;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
 }
 
 void intel_pmu_lbr_init_atom(void)
 {
 	x86_pmu.lbr_nr	   = 8;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
-- 
1.7.1



* [PATCH v4 03/18] perf: add Intel X86 LBR sharing logic
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 01/18] perf: add generic taken branch sampling support Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 02/18] perf: add Intel LBR MSR definitions Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 04/18] perf: sync branch stack sampling with X86 precise_sampling Stephane Eranian
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

The Intel LBR on some recent processors is capable
of filtering branches by type. The filter is configurable
via the LBR_SELECT MSR.

There are limitations on how this register can be used.

On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads
when HT is on. It is private to each core when HT is off.

On SandyBridge, the LBR_SELECT register is private to each thread
when HT is on. It is private to each core when HT is off.

The kernel must manage the sharing of LBR_SELECT. It allows
multiple users on the same logical CPU to use LBR_SELECT as
long as they program it with the same value. Across sibling
CPUs (HT threads), the same restriction applies on NHM/WSM.

This patch implements this sharing logic by leveraging the
mechanism put in place for managing the offcore_response
shared MSR.
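
The "same value or nobody else" rule can be summarized with a small
standalone model. This is a sketch only: struct shared_reg and both
helpers are hypothetical names, and the locking that the kernel code
needs around these operations is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, lock-free model of one shared register slot. */
struct shared_reg {
	uint64_t config;	/* value currently programmed */
	int      ref;		/* number of active users */
};

/*
 * Try to claim the register with the given value. This succeeds if
 * the register is unused or already programmed with the same value.
 */
bool shared_reg_get(struct shared_reg *r, uint64_t config)
{
	if (r->ref == 0 || r->config == config) {
		r->config = config;
		r->ref++;
		return true;
	}
	return false;	/* conflicting value: caller cannot be scheduled */
}

/* Drop one reference; the slot becomes free when ref reaches 0. */
void shared_reg_put(struct shared_reg *r)
{
	if (r->ref > 0)
		r->ref--;
}
```

On NHM/WSM with HT on, both sibling threads would point at the same
slot, which is what extends the restriction across the two threads.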

We modify __intel_shared_reg_get_constraints() to cause
x86_get_event_constraint() to be called because LBR may
be associated with events that may be counter constrained.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |    4 ++
 arch/x86/kernel/cpu/perf_event.h       |    4 ++
 arch/x86/kernel/cpu/perf_event_intel.c |   70 ++++++++++++++++++++------------
 3 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f8bddb5..3779313 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -426,6 +426,10 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	/* mark unused */
 	event->hw.extra_reg.idx = EXTRA_REG_NONE;
 
+	/* mark not used */
+	event->hw.extra_reg.idx = EXTRA_REG_NONE;
+	event->hw.branch_reg.idx = EXTRA_REG_NONE;
+
 	return x86_pmu.hw_config(event);
 }
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 513d617..4535ada 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -33,6 +33,7 @@ enum extra_reg_type {
 
 	EXTRA_REG_RSP_0 = 0,	/* offcore_response_0 */
 	EXTRA_REG_RSP_1 = 1,	/* offcore_response_1 */
+	EXTRA_REG_LBR   = 2,	/* lbr_select */
 
 	EXTRA_REG_MAX		/* number of entries needed */
 };
@@ -130,6 +131,7 @@ struct cpu_hw_events {
 	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
+	struct er_account		*lbr_sel;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -340,6 +342,8 @@ struct x86_pmu {
 	 */
 	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
 	int		lbr_nr;			   /* hardware stack size */
+	u64		lbr_sel_mask;		   /* LBR_SELECT valid bits */
+	const int	*lbr_sel_map;		   /* lbr_select mappings */
 
 	/*
 	 * Extra registers for events
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 3bd37bd..97f7bb5 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1123,17 +1123,17 @@ static bool intel_try_alt_er(struct perf_event *event, int orig_idx)
  */
 static struct event_constraint *
 __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
-				   struct perf_event *event)
+				   struct perf_event *event,
+				   struct hw_perf_event_extra *reg)
 {
 	struct event_constraint *c = &emptyconstraint;
-	struct hw_perf_event_extra *reg = &event->hw.extra_reg;
 	struct er_account *era;
 	unsigned long flags;
 	int orig_idx = reg->idx;
 
 	/* already allocated shared msr */
 	if (reg->alloc)
-		return &unconstrained;
+		return NULL; /* call x86_get_event_constraint() */
 
 again:
 	era = &cpuc->shared_regs->regs[reg->idx];
@@ -1156,14 +1156,10 @@ __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
 		reg->alloc = 1;
 
 		/*
-		 * All events using extra_reg are unconstrained.
-		 * Avoids calling x86_get_event_constraints()
-		 *
-		 * Must revisit if extra_reg controlling events
-		 * ever have constraints. Worst case we go through
-		 * the regular event constraint table.
+		 * need to call x86_get_event_constraint()
+		 * to check if associated event has constraints
 		 */
-		c = &unconstrained;
+		c = NULL;
 	} else if (intel_try_alt_er(event, orig_idx)) {
 		raw_spin_unlock_irqrestore(&era->lock, flags);
 		goto again;
@@ -1200,11 +1196,23 @@ static struct event_constraint *
 intel_shared_regs_constraints(struct cpu_hw_events *cpuc,
 			      struct perf_event *event)
 {
-	struct event_constraint *c = NULL;
-
-	if (event->hw.extra_reg.idx != EXTRA_REG_NONE)
-		c = __intel_shared_reg_get_constraints(cpuc, event);
-
+	struct event_constraint *c = NULL, *d;
+	struct hw_perf_event_extra *xreg, *breg;
+
+	xreg = &event->hw.extra_reg;
+	if (xreg->idx != EXTRA_REG_NONE) {
+		c = __intel_shared_reg_get_constraints(cpuc, event, xreg);
+		if (c == &emptyconstraint)
+			return c;
+	}
+	breg = &event->hw.branch_reg;
+	if (breg->idx != EXTRA_REG_NONE) {
+		d = __intel_shared_reg_get_constraints(cpuc, event, breg);
+		if (d == &emptyconstraint) {
+			__intel_shared_reg_put_constraints(cpuc, xreg);
+			c = d;
+		}
+	}
 	return c;
 }
 
@@ -1252,6 +1260,10 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
 	reg = &event->hw.extra_reg;
 	if (reg->idx != EXTRA_REG_NONE)
 		__intel_shared_reg_put_constraints(cpuc, reg);
+
+	reg = &event->hw.branch_reg;
+	if (reg->idx != EXTRA_REG_NONE)
+		__intel_shared_reg_put_constraints(cpuc, reg);
 }
 
 static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
@@ -1431,7 +1443,7 @@ static int intel_pmu_cpu_prepare(int cpu)
 {
 	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
 
-	if (!x86_pmu.extra_regs)
+	if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
 		return NOTIFY_OK;
 
 	cpuc->shared_regs = allocate_shared_regs(cpu);
@@ -1453,22 +1465,28 @@ static void intel_pmu_cpu_starting(int cpu)
 	 */
 	intel_pmu_lbr_reset();
 
-	if (!cpuc->shared_regs || (x86_pmu.er_flags & ERF_NO_HT_SHARING))
+	cpuc->lbr_sel = NULL;
+
+	if (!cpuc->shared_regs)
 		return;
 
-	for_each_cpu(i, topology_thread_cpumask(cpu)) {
-		struct intel_shared_regs *pc;
+	if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+		for_each_cpu(i, topology_thread_cpumask(cpu)) {
+			struct intel_shared_regs *pc;
 
-		pc = per_cpu(cpu_hw_events, i).shared_regs;
-		if (pc && pc->core_id == core_id) {
-			cpuc->kfree_on_online = cpuc->shared_regs;
-			cpuc->shared_regs = pc;
-			break;
+			pc = per_cpu(cpu_hw_events, i).shared_regs;
+			if (pc && pc->core_id == core_id) {
+				cpuc->kfree_on_online = cpuc->shared_regs;
+				cpuc->shared_regs = pc;
+				break;
+			}
 		}
+		cpuc->shared_regs->core_id = core_id;
+		cpuc->shared_regs->refcnt++;
 	}
 
-	cpuc->shared_regs->core_id = core_id;
-	cpuc->shared_regs->refcnt++;
+	if (x86_pmu.lbr_sel_map)
+		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
 }
 
 static void intel_pmu_cpu_dying(int cpu)
-- 
1.7.1



* [PATCH v4 04/18] perf: sync branch stack sampling with X86 precise_sampling
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (2 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 03/18] perf: add Intel X86 LBR sharing logic Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 05/18] perf: add LBR mappings for PERF_SAMPLE_BRANCH filters Stephane Eranian
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

If precise sampling is enabled on Intel X86, then perf_event uses PEBS.
To correct for the off-by-one error of PEBS, perf_event uses LBR when
precise_ip > 1.

On Intel X86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
therefore both features must be coordinated as they may not
configure LBR the same way.

For PEBS, LBR needs to capture all branches at all priv levels.
This patch sets this up.

The configuration of PERF_SAMPLE_BRANCH_STACK may not be compatible
with this requirement, in which case an error must be returned.
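
As a rough illustration of the rule described above, the following
standalone sketch mirrors the check outside the kernel. The names
sketch_check_pebs_lbr and SK_BRANCH_* are hypothetical stand-ins for
this example only, and the bit values are illustrative rather than
taken from the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the PERF_SAMPLE_BRANCH_* bits (assumed values). */
enum {
	SK_BRANCH_USER   = 1u << 0,
	SK_BRANCH_KERNEL = 1u << 1,
	SK_BRANCH_HV     = 1u << 2,
	SK_BRANCH_ANY    = 1u << 3,
};

#define SK_BRANCH_PLM_ALL \
	(SK_BRANCH_USER | SK_BRANCH_KERNEL | SK_BRANCH_HV)

/*
 * Hypothetical helper: returns 0 when the request is compatible with the
 * PEBS fixup (possibly updating *br_type), or -EOPNOTSUPP on conflict.
 * has_branch_stack is non-zero when the user asked for branch sampling.
 */
static int sketch_check_pebs_lbr(unsigned int precise_ip,
				 int has_branch_stack, uint64_t *br_type)
{
	assert(br_type != NULL);

	/* the LBR-based fixup is only used for precise_ip > 1 */
	if (precise_ip <= 1)
		return 0;

	if (has_branch_stack) {
		/* the user's filter must still capture every branch type */
		if ((*br_type & SK_BRANCH_ANY) != SK_BRANCH_ANY)
			return -EOPNOTSUPP;
		return 0;
	}

	/* no user request: capture all branches at all priv levels */
	*br_type = SK_BRANCH_ANY | SK_BRANCH_PLM_ALL;
	return 0;
}
```

In this sketch, a request limited to a subset of branch types is rejected
once precise_ip > 1, while an event without branch sampling silently gets
the widest filter.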

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 3779313..710ec93 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -356,6 +356,7 @@ int x86_setup_perfctr(struct perf_event *event)
 int x86_pmu_hw_config(struct perf_event *event)
 {
 	if (event->attr.precise_ip) {
+		u64 *br_type, br_sel;
 		int precise = 0;
 
 		/* Support for constant skid */
@@ -369,6 +370,27 @@ int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
+		/*
+		 * check that PEBS LBR correction does not conflict with
+		 * whatever the user is asking with attr->branch_sample_type
+		 */
+		if (event->attr.precise_ip > 1) {
+
+			br_type = &event->attr.branch_sample_type;
+
+			if (has_branch_stack(event)) {
+				br_sel = *br_type & PERF_SAMPLE_BRANCH_ANY;
+				if (br_sel != PERF_SAMPLE_BRANCH_ANY)
+					return -EOPNOTSUPP;
+			} else {
+				/*
+				 * For PEBS fixups, we capture all
+				 * the branches at all priv levels
+				 */
+				*br_type = PERF_SAMPLE_BRANCH_ANY
+					 | PERF_SAMPLE_BRANCH_PLM_ALL;
+			}
+		}
 	}
 
 	/*
-- 
1.7.1



* [PATCH v4 05/18] perf: add LBR mappings for PERF_SAMPLE_BRANCH filters
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (3 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 04/18] perf: sync branch stack sampling with X86 precise_sampling Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 06/18] perf: disable LBR support for older Intel Atom processors Stephane Eranian
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
filters to the actual Intel X86 LBR filters, whenever they exist.
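
One detail worth keeping in mind when reading the mapping tables: the
LBR_SELECT bits are documented in this patch as "do not capture" bits,
so the value written to the register is the complement of the wanted
branch classes. The sketch below shows that inversion in isolation. The
bit positions are copied from the defines in this patch;
sketch_lbr_select is a hypothetical helper introduced only for this
example.

```c
#include <assert.h>
#include <stdint.h>

/* LBR_SELECT bits as defined in this patch: a set bit suppresses capture. */
#define LBR_KERNEL	(1 << 0)
#define LBR_USER	(1 << 1)
#define LBR_JCC		(1 << 2)
#define LBR_REL_CALL	(1 << 3)
#define LBR_IND_CALL	(1 << 4)
#define LBR_RETURN	(1 << 5)
#define LBR_IND_JMP	(1 << 6)
#define LBR_REL_JMP	(1 << 7)
#define LBR_FAR		(1 << 8)

#define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */

/*
 * Hypothetical helper: 'wanted' is the OR of the LBR_* classes the user
 * wants to keep. The result is the suppress-mode register value.
 */
static uint64_t sketch_lbr_select(uint64_t wanted)
{
	/* only bits that exist in LBR_SELECT may be requested */
	assert((wanted & ~(uint64_t)LBR_SEL_MASK) == 0);

	/* invert: every class not wanted gets its suppress bit set */
	return ~wanted & LBR_SEL_MASK;
}
```

For example, asking for user-level indirect calls (LBR_USER | LBR_IND_CALL)
leaves only those two bits clear in the resulting value.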

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |    2 +
 arch/x86/kernel/cpu/perf_event_intel.c     |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   99 +++++++++++++++++++++++++++-
 3 files changed, 100 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 4535ada..776fb5a 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -535,6 +535,8 @@ void intel_pmu_lbr_init_nhm(void);
 
 void intel_pmu_lbr_init_atom(void);
 
+void intel_pmu_lbr_init_snb(void);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 97f7bb5..b0db016 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1757,7 +1757,7 @@ __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
-		intel_pmu_lbr_init_nhm();
+		intel_pmu_lbr_init_snb();
 
 		x86_pmu.event_constraints = intel_snb_event_constraints;
 		x86_pmu.pebs_constraints = intel_snb_pebs_event_constraints;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e14431f..8a1eb6c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -14,6 +14,47 @@ enum {
 };
 
 /*
+ * Intel LBR_SELECT bits
+ * Intel Vol3a, April 2011, Section 16.7 Table 16-10
+ *
+ * Hardware branch filter (not available on all CPUs)
+ */
+#define LBR_KERNEL_BIT		0 /* do not capture at ring0 */
+#define LBR_USER_BIT		1 /* do not capture at ring > 0 */
+#define LBR_JCC_BIT		2 /* do not capture conditional branches */
+#define LBR_REL_CALL_BIT	3 /* do not capture relative calls */
+#define LBR_IND_CALL_BIT	4 /* do not capture indirect calls */
+#define LBR_RETURN_BIT		5 /* do not capture near returns */
+#define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
+#define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
+#define LBR_FAR_BIT		8 /* do not capture far branches */
+
+#define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
+#define LBR_USER	(1 << LBR_USER_BIT)
+#define LBR_JCC		(1 << LBR_JCC_BIT)
+#define LBR_REL_CALL	(1 << LBR_REL_CALL_BIT)
+#define LBR_IND_CALL	(1 << LBR_IND_CALL_BIT)
+#define LBR_RETURN	(1 << LBR_RETURN_BIT)
+#define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
+#define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
+#define LBR_FAR		(1 << LBR_FAR_BIT)
+
+#define LBR_PLM (LBR_KERNEL | LBR_USER)
+
+#define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+
+#define LBR_ANY		 \
+	(LBR_JCC	|\
+	 LBR_REL_CALL	|\
+	 LBR_IND_CALL	|\
+	 LBR_RETURN	|\
+	 LBR_REL_JMP	|\
+	 LBR_IND_JMP	|\
+	 LBR_FAR)
+
+#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -153,8 +194,6 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 	cpuc->lbr_stack.nr = i;
 }
 
-#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
-
 /*
  * Due to lack of segmentation in Linux the effective address (offset)
  * is the same as the linear address, allowing us to merge the LIP and EIP
@@ -202,26 +241,82 @@ void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_64(cpuc);
 }
 
+/*
+ * Map interface branch filters onto LBR filters
+ */
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_REL_JMP
+					| LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
+	 */
+	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+	 LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
+	 */
+	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+};
+
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL
+					| LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
+};
+
+/* core */
 void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("4-deep LBR, ");
 }
 
+/* nehalem/westmere */
 void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
 }
 
+/* sandy bridge */
+void intel_pmu_lbr_init_snb(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
+/* atom */
 void intel_pmu_lbr_init_atom(void)
 {
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("8-deep LBR, ");
 }
-- 
1.7.1



* [PATCH v4 06/18] perf: disable LBR support for older Intel Atom processors
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (4 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 05/18] perf: add LBR mappings for PERF_SAMPLE_BRANCH filters Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 07/18] perf: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

The patch adds a restriction for Intel Atom LBR support. Only
steppings 10 (PineView) and more recent are supported. Older models
do not have a functional LBR. Their LBR does not freeze on PMU interrupt,
which makes LBR unusable in the context of perf_events.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 8a1eb6c..e2b7094 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -313,6 +313,16 @@ void intel_pmu_lbr_init_snb(void)
 /* atom */
 void intel_pmu_lbr_init_atom(void)
 {
+	/*
+	 * only models starting at stepping 10 seem
+	 * to have an operational LBR which can freeze
+	 * on PMU interrupt
+	 */
+	if (boot_cpu_data.x86_mask < 10) {
+		pr_cont("LBR disabled due to erratum");
+		return;
+	}
+
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
-- 
1.7.1



* [PATCH v4 07/18] perf: implement PERF_SAMPLE_BRANCH for Intel X86
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (5 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 06/18] perf: disable LBR support for older Intel Atom processors Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 08/18] perf: add LBR software filter support " Stephane Eranian
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch implements PERF_SAMPLE_BRANCH support for Intel
X86 processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.

The patch adds the hooks in the PMU irq handler to save the LBR
on counter overflow for both regular and PEBS modes.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |    2 +
 arch/x86/kernel/cpu/perf_event_intel.c     |   35 ++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   10 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   79 +++++++++++++++++++++++++++-
 include/linux/perf_event.h                 |    3 +
 5 files changed, 121 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 776fb5a..adbe80a 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -537,6 +537,8 @@ void intel_pmu_lbr_init_atom(void);
 
 void intel_pmu_lbr_init_snb(void);
 
+int intel_pmu_setup_lbr_filter(struct perf_event *event);
+
 int p4_pmu_init(void);
 
 int p6_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index b0db016..7cc1e2d 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -727,6 +727,19 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+	/* user explicitly requested branch sampling */
+	if (has_branch_stack(event))
+		return true;
+
+	/* implicit branch sampling to correct PEBS skid */
+	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
+		return true;
+
+	return false;
+}
+
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -881,6 +894,13 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
 
+	/*
+	 * must disable before any actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_disable(event);
+
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_disable_fixed(hwc);
 		return;
@@ -935,6 +955,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
 		intel_pmu_enable_bts(hwc->config);
 		return;
 	}
+	/*
+	 * must enable before any actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_enable(event);
 
 	if (event->attr.exclude_host)
 		cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
@@ -1057,6 +1083,9 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
 
 		data.period = event->hw.last_period;
 
+		if (has_branch_stack(event))
+			data.br_stack = &cpuc->lbr_stack;
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
@@ -1305,6 +1334,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		event->hw.config = alt_config;
 	}
 
+	if (intel_pmu_needs_lbr_smpl(event)) {
+		ret = intel_pmu_setup_lbr_filter(event);
+		if (ret)
+			return ret;
+	}
+
 	if (event->attr.type != PERF_TYPE_RAW)
 		return 0;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 73da6b6..04c71ea 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -440,9 +440,6 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 	WARN_ON_ONCE(cpuc->enabled);
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_enable(event);
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -455,9 +452,6 @@ void intel_pmu_pebs_disable(struct perf_event *event)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_disable(event);
 }
 
 void intel_pmu_pebs_enable_all(void)
@@ -573,6 +567,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * both formats and we don't use the other fields in this
 	 * routine.
 	 */
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct pebs_record_core *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
@@ -603,6 +598,9 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	else
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 
+	if (has_branch_stack(event))
+		data.br_stack = &cpuc->lbr_stack;
+
 	if (perf_event_overflow(event, &data, &regs))
 		x86_pmu_stop(event, 0);
 }
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e2b7094..fa2f198 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -42,6 +42,7 @@ enum {
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
 #define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+#define LBR_NOT_SUPP	-1    /* LBR filter not supported */
 
 #define LBR_ANY		 \
 	(LBR_JCC	|\
@@ -54,6 +55,10 @@ enum {
 
 #define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
 
+#define for_each_branch_sample_type(x) \
+	for ((x) = PERF_SAMPLE_BRANCH_USER; \
+	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
+
 /*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
@@ -62,6 +67,10 @@ enum {
 static void __intel_pmu_lbr_enable(void)
 {
 	u64 debugctl;
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->lbr_sel)
+		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
@@ -119,7 +128,6 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 	 * Reset the LBR stack if we changed task context to
 	 * avoid data leaks.
 	 */
-
 	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
@@ -138,8 +146,11 @@ void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users)
+	if (cpuc->enabled && !cpuc->lbr_users) {
 		__intel_pmu_lbr_disable();
+		/* avoid stale pointer */
+		cpuc->lbr_context = NULL;
+	}
 }
 
 void intel_pmu_lbr_enable_all(void)
@@ -158,6 +169,9 @@ void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
 static inline u64 intel_pmu_lbr_tos(void)
 {
 	u64 tos;
@@ -242,6 +256,67 @@ void intel_pmu_lbr_read(void)
 }
 
 /*
+ * setup the HW LBR filter
+ * Used only when available, may not be enough to disambiguate
+ * all branches, may need the help of the SW filter
+ */
+static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
+{
+	struct hw_perf_event_extra *reg;
+	u64 br_type = event->attr.branch_sample_type;
+	u64 mask = 0, m;
+	u64 v;
+
+	for_each_branch_sample_type(m) {
+		if (!(br_type & m))
+			continue;
+
+		v = x86_pmu.lbr_sel_map[m];
+		if (v == LBR_NOT_SUPP)
+			return -EOPNOTSUPP;
+		mask |= v;
+
+		if (m == PERF_SAMPLE_BRANCH_ANY)
+			break;
+	}
+	reg = &event->hw.branch_reg;
+	reg->idx = EXTRA_REG_LBR;
+
+	/* LBR_SELECT operates in suppress mode so invert mask */
+	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+
+	return 0;
+}
+
+int intel_pmu_setup_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+
+	/*
+	 * no LBR on this PMU
+	 */
+	if (!x86_pmu.lbr_nr)
+		return -EOPNOTSUPP;
+
+	/*
+	 * if no LBR HW filter, users can only
+	 * capture all branches
+	 */
+	if (!x86_pmu.lbr_sel_map) {
+		if (br_type != PERF_SAMPLE_BRANCH_ALL)
+			return -EOPNOTSUPP;
+		return 0;
+	}
+	/*
+	 * no LBR hypervisor support on Intel X86
+	 */
+	if (br_type & PERF_SAMPLE_BRANCH_HV)
+		return -EOPNOTSUPP;
+
+	return intel_pmu_setup_hw_lbr_filter(event);
+}
+
+/*
  * Map interface branch filters onto LBR filters
  */
 static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX] = {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9ef3002..3ac057d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -162,6 +162,9 @@ enum perf_branch_sample_type {
 	 PERF_SAMPLE_BRANCH_KERNEL|\
 	 PERF_SAMPLE_BRANCH_HV)
 
+#define PERF_SAMPLE_BRANCH_ALL \
+	(PERF_SAMPLE_BRANCH_PLM_ALL|PERF_SAMPLE_BRANCH_ANY)
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
-- 
1.7.1



* [PATCH v4 08/18] perf: add LBR software filter support for Intel X86
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (6 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 07/18] perf: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 09/18] perf: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch adds an internal software filter to complement
the (optional) LBR hardware filter.

The software filter is necessary:
- as a substitute when there is no HW LBR filter (e.g., Atom, Core)
- to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
- to provide finer grain filtering (e.g., all processors)

Sometimes, the LBR HW filter cannot distinguish between two types
of branches. For instance, to capture syscalls as CALLs, it is necessary
to enable the LBR_FAR filter, which will also capture JMP instructions.
Thus, a second pass is necessary to filter those out. This is what the
SW filter does.

The SW filter is built on top of the internal x86 disassembler. It
is a best-effort filter, especially for user-level code, because it is
subject to the availability of the text page of the program.

The SW filter is enabled on all Intel X86 processors. It is bypassed
when the user is capturing all branches at all priv levels.
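
To make the second pass concrete, here is a minimal standalone sketch of
such a filter, assuming each captured entry has already been classified
into a bitmask of branch-type and privilege bits. The names sk_branch,
SK_BR_* and sketch_sw_filter are hypothetical and exist only in this
example. It is not the filter code from this patch, only one plausible
shape for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed classification bits, loosely modeled on the X86_BR_* enum. */
enum {
	SK_BR_NONE   = 0,      /* unknown or not decodable */
	SK_BR_USER   = 1 << 0, /* branch target is user */
	SK_BR_KERNEL = 1 << 1, /* branch target is kernel */
	SK_BR_CALL   = 1 << 2, /* call */
	SK_BR_JMP    = 1 << 9, /* jump */
};

struct sk_branch {
	uint64_t from;
	uint64_t to;
	int type;	/* OR of SK_BR_* bits for this entry */
};

/*
 * Keep only the entries whose type bits are all allowed by 'mask',
 * compacting the array in place. Returns the number of entries kept.
 */
static size_t sketch_sw_filter(struct sk_branch *br, size_t nr, int mask)
{
	size_t i, kept = 0;

	assert(br != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* drop unknown entries and anything outside the request */
		if (br[i].type == SK_BR_NONE ||
		    (br[i].type & mask) != br[i].type)
			continue;
		br[kept++] = br[i];
	}
	return kept;
}
```

With a mask of SK_BR_USER | SK_BR_CALL, a jump that the hardware filter let
through is dropped in this pass, while a user-level call is kept.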

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h           |   10 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   12 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  325 +++++++++++++++++++++++++++-
 3 files changed, 326 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index adbe80a..a5281a4 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -132,6 +132,7 @@ struct cpu_hw_events {
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
+	u64				br_sel;
 
 	/*
 	 * Intel host/guest exclude bits
@@ -455,6 +456,15 @@ extern struct event_constraint emptyconstraint;
 
 extern struct event_constraint unconstrained;
 
+static inline bool kernel_ip(unsigned long ip)
+{
+#ifdef CONFIG_X86_32
+	return ip > PAGE_OFFSET;
+#else
+	return (long)ip < 0;
+#endif
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 
 int amd_pmu_init(void);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 04c71ea..db0aa19 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -3,6 +3,7 @@
 #include <linux/slab.h>
 
 #include <asm/perf_event.h>
+#include <asm/insn.h>
 
 #include "perf_event.h"
 
@@ -470,17 +471,6 @@ void intel_pmu_pebs_disable_all(void)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
 }
 
-#include <asm/insn.h>
-
-static inline bool kernel_ip(unsigned long ip)
-{
-#ifdef CONFIG_X86_32
-	return ip > PAGE_OFFSET;
-#else
-	return (long)ip < 0;
-#endif
-}
-
 static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index fa2f198..6b09bef 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -3,6 +3,7 @@
 
 #include <asm/perf_event.h>
 #include <asm/msr.h>
+#include <asm/insn.h>
 
 #include "perf_event.h"
 
@@ -60,6 +61,53 @@ enum {
 	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
 
 /*
+ * X86 control flow change classification
+ * X86 control flow changes include branches, interrupts, traps, faults
+ */
+enum {
+	X86_BR_NONE     = 0,      /* unknown */
+
+	X86_BR_USER     = 1 << 0, /* branch target is user */
+	X86_BR_KERNEL   = 1 << 1, /* branch target is kernel */
+
+	X86_BR_CALL     = 1 << 2, /* call */
+	X86_BR_RET      = 1 << 3, /* return */
+	X86_BR_SYSCALL  = 1 << 4, /* syscall */
+	X86_BR_SYSRET   = 1 << 5, /* syscall return */
+	X86_BR_INT      = 1 << 6, /* sw interrupt */
+	X86_BR_IRET     = 1 << 7, /* return from interrupt */
+	X86_BR_JCC      = 1 << 8, /* conditional */
+	X86_BR_JMP      = 1 << 9, /* jump */
+	X86_BR_IRQ      = 1 << 10,/* hw interrupt or trap or fault */
+	X86_BR_IND_CALL = 1 << 11,/* indirect calls */
+};
+
+#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
+
+#define X86_BR_ANY       \
+	(X86_BR_CALL    |\
+	 X86_BR_RET     |\
+	 X86_BR_SYSCALL |\
+	 X86_BR_SYSRET  |\
+	 X86_BR_INT     |\
+	 X86_BR_IRET    |\
+	 X86_BR_JCC     |\
+	 X86_BR_JMP	 |\
+	 X86_BR_IRQ	 |\
+	 X86_BR_IND_CALL)
+
+#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
+
+#define X86_BR_ANY_CALL		 \
+	(X86_BR_CALL		|\
+	 X86_BR_IND_CALL	|\
+	 X86_BR_SYSCALL		|\
+	 X86_BR_IRQ		|\
+	 X86_BR_INT)
+
+static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -132,6 +180,7 @@ void intel_pmu_lbr_enable(struct perf_event *event)
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
 	}
+	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
 }
@@ -253,6 +302,45 @@ void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_32(cpuc);
 	else
 		intel_pmu_lbr_read_64(cpuc);
+
+	intel_pmu_lbr_filter(cpuc);
+}
+
+/*
+ * SW filter is used:
+ * - in case there is no HW filter
+ * - in case the HW filter has errata or limitations
+ */
+static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+	int mask = 0;
+
+	if (br_type & PERF_SAMPLE_BRANCH_USER)
+		mask |= X86_BR_USER;
+
+	if (br_type & PERF_SAMPLE_BRANCH_KERNEL)
+		mask |= X86_BR_KERNEL;
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY) {
+		mask |= X86_BR_ANY;
+		goto done;
+	}
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+		mask |= X86_BR_ANY_CALL;
+
+	if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+		mask |= X86_BR_RET | X86_BR_IRET | X86_BR_SYSRET;
+
+	if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
+		mask |= X86_BR_IND_CALL;
+done:
+	/*
+	 * stash actual user request into reg, it may
+	 * be used by fixup code for some CPU
+	 */
+	event->hw.branch_reg.reg = mask;
 }
 
 /*
@@ -291,6 +379,7 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 int intel_pmu_setup_lbr_filter(struct perf_event *event)
 {
 	u64 br_type = event->attr.branch_sample_type;
+	int ret = 0;
 
 	/*
 	 * no LBR on this PMU
@@ -299,21 +388,216 @@ int intel_pmu_setup_lbr_filter(struct perf_event *event)
 		return -EOPNOTSUPP;
 
 	/*
-	 * if no LBR HW filter, users can only
-	 * capture all branches
+	 * no support for hypervisor priv level on Intel X86 LBR
 	 */
-	if (!x86_pmu.lbr_sel_map) {
-		if (br_type != PERF_SAMPLE_BRANCH_ALL)
-			return -EOPNOTSUPP;
-		return 0;
+	if (br_type & PERF_SAMPLE_BRANCH_HV)
+		return -EOPNOTSUPP;
+
+	/*
+	 * setup SW LBR filter
+	 */
+	intel_pmu_setup_sw_lbr_filter(event);
+
+	/*
+	 * setup HW LBR filter, if any
+	 */
+	if (x86_pmu.lbr_sel_map)
+		ret = intel_pmu_setup_hw_lbr_filter(event);
+
+	return ret;
+}
+
+/*
+ * return the type of control flow change at address "from"
+ * instruction is not necessarily a branch (in case of interrupt).
+ *
+ * The branch type returned also includes the priv level of the
+ * target of the control flow change (X86_BR_USER, X86_BR_KERNEL).
+ *
+ * If a branch type is unknown OR the instruction cannot be
+ * decoded (e.g., text page not present), then X86_BR_NONE is
+ * returned.
+ */
+static int branch_type(unsigned long from, unsigned long to)
+{
+	struct insn insn;
+	void *addr;
+	int bytes, size = MAX_INSN_SIZE;
+	int ret = X86_BR_NONE;
+	int ext, to_plm, from_plm;
+	u8 buf[MAX_INSN_SIZE];
+	int is64 = 0;
+
+	to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
+	from_plm = kernel_ip(from) ? X86_BR_KERNEL : X86_BR_USER;
+
+	/*
+	 * may be zero if lbr did not fill up after a reset by the time
+	 * we get a PMU interrupt
+	 */
+	if (from == 0 || to == 0)
+		return X86_BR_NONE;
+
+	if (from_plm == X86_BR_USER) {
+		/*
+		 * can happen if measuring at the user level only
+		 * and we interrupt in a kernel thread, e.g., idle.
+		 */
+		if (!current->mm)
+			return X86_BR_NONE;
+
+		/* may fail if text not present */
+		bytes = copy_from_user_nmi(buf, (void __user *)from, size);
+		if (bytes != size)
+			return X86_BR_NONE;
+
+		addr = buf;
+	} else
+		addr = (void *)from;
+
+	/*
+	 * decoder needs to know the ABI especially
+	 * on 64-bit systems running 32-bit apps
+	 */
+#ifdef CONFIG_X86_64
+	is64 = kernel_ip((unsigned long)addr) || !test_thread_flag(TIF_IA32);
+#endif
+	insn_init(&insn, addr, is64);
+	insn_get_opcode(&insn);
+
+	switch (insn.opcode.bytes[0]) {
+	case 0xf:
+		switch (insn.opcode.bytes[1]) {
+		case 0x05: /* syscall */
+		case 0x34: /* sysenter */
+			ret = X86_BR_SYSCALL;
+			break;
+		case 0x07: /* sysret */
+		case 0x35: /* sysexit */
+			ret = X86_BR_SYSRET;
+			break;
+		case 0x80 ... 0x8f: /* conditional */
+			ret = X86_BR_JCC;
+			break;
+		default:
+			ret = X86_BR_NONE;
+		}
+		break;
+	case 0x70 ... 0x7f: /* conditional */
+		ret = X86_BR_JCC;
+		break;
+	case 0xc2: /* near ret */
+	case 0xc3: /* near ret */
+	case 0xca: /* far ret */
+	case 0xcb: /* far ret */
+		ret = X86_BR_RET;
+		break;
+	case 0xcf: /* iret */
+		ret = X86_BR_IRET;
+		break;
+	case 0xcc ... 0xce: /* int */
+		ret = X86_BR_INT;
+		break;
+	case 0xe8: /* call near rel */
+	case 0x9a: /* call far absolute */
+		ret = X86_BR_CALL;
+		break;
+	case 0xe0 ... 0xe3: /* loop jmp */
+		ret = X86_BR_JCC;
+		break;
+	case 0xe9 ... 0xeb: /* jmp */
+		ret = X86_BR_JMP;
+		break;
+	case 0xff: /* call near absolute, call far absolute ind */
+		insn_get_modrm(&insn);
+		ext = (insn.modrm.bytes[0] >> 3) & 0x7;
+		switch (ext) {
+		case 2: /* near ind call */
+		case 3: /* far ind call */
+			ret = X86_BR_IND_CALL;
+			break;
+		case 4: /* near ind jmp */
+		case 5: /* far ind jmp */
+			ret = X86_BR_JMP;
+			break;
+		}
+		break;
+	default:
+		ret = X86_BR_NONE;
 	}
 	/*
-	 * no LBR hypervisor support on Intel X86
+	 * interrupts, traps, faults (and thus ring transition) may
+	 * occur on any instruction. Thus, to classify them correctly,
+	 * we need to first look at the from and to priv levels. If they
+	 * are different and to is in the kernel, then it indicates
+	 * a ring transition. If the from instruction is not a ring
+	 * transition instr (syscall, sysenter, int), then it means
+	 * it was an irq, trap or fault.
+	 *
+	 * we have no way of detecting kernel to kernel faults.
 	 */
-	if (br_type & PERF_SAMPLE_BRANCH_HV)
-		return -EOPNOTSUPP;
+	if (from_plm == X86_BR_USER && to_plm == X86_BR_KERNEL
+	    && ret != X86_BR_SYSCALL && ret != X86_BR_INT)
+		ret = X86_BR_IRQ;
+
+	/*
+	 * branch priv level determined by target as
+	 * is done by HW when LBR_SELECT is implemented
+	 */
+	if (ret != X86_BR_NONE)
+		ret |= to_plm;
 
-	return intel_pmu_setup_hw_lbr_filter(event);
+	return ret;
+}
+
+/*
+ * implement actual branch filter based on user demand.
+ * Hardware may not exactly satisfy that request, thus
+ * we need to inspect opcodes. Mismatched branches are
+ * discarded. Therefore, the number of branches returned
+ * in PERF_SAMPLE_BRANCH_STACK sample may vary.
+ */
+static void
+intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
+{
+	u64 from, to;
+	int br_sel = cpuc->br_sel;
+	int i, j, type;
+	bool compress = false;
+
+	/* if sampling all branches, then nothing to filter */
+	if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+		return;
+
+	for (i = 0; i < cpuc->lbr_stack.nr; i++) {
+
+		from = cpuc->lbr_entries[i].from;
+		to = cpuc->lbr_entries[i].to;
+
+		type = branch_type(from, to);
+
+		/* if type does not correspond, then discard */
+		if (type == X86_BR_NONE || (br_sel & type) != type) {
+			cpuc->lbr_entries[i].from = 0;
+			compress = true;
+		}
+	}
+
+	if (!compress)
+		return;
+
+	/* remove all entries with from=0 */
+	for (i = 0; i < cpuc->lbr_stack.nr; ) {
+		if (!cpuc->lbr_entries[i].from) {
+			j = i;
+			while (++j < cpuc->lbr_stack.nr)
+				cpuc->lbr_entries[j-1] = cpuc->lbr_entries[j];
+			cpuc->lbr_stack.nr--;
+			if (!cpuc->lbr_entries[i].from)
+				continue;
+		}
+		i++;
+	}
 }
 
 /*
@@ -354,6 +638,10 @@ void intel_pmu_lbr_init_core(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("4-deep LBR, ");
 }
 
@@ -368,6 +656,13 @@ void intel_pmu_lbr_init_nhm(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - workaround LBR_SEL errata (see above)
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -382,6 +677,12 @@ void intel_pmu_lbr_init_snb(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -403,5 +704,9 @@ void intel_pmu_lbr_init_atom(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("8-deep LBR, ");
 }
-- 
1.7.1



* [PATCH v4 09/18] perf: disable PERF_SAMPLE_BRANCH_* when not supported
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (7 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 08/18] perf: add LBR software filter support " Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-30  3:57   ` Anshuman Khandual
  2012-01-27 20:56 ` [PATCH v4 10/18] perf: add hook to flush branch_stack on context switch Stephane Eranian
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

PERF_SAMPLE_BRANCH_* is disabled for:
- SW events (sw counters, tracepoints)
- HW breakpoints
- all architectures except Intel X86
- AMD64 processors
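
Every init path listed above gains the same guard. A minimal userspace sketch of that guard, assuming has_branch_stack() only tests the PERF_SAMPLE_BRANCH_STACK bit in the event's sample_type (the mock_event type, the MOCK_ constant and the mock_ function names are illustrative and not part of this patch):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit value; the real one comes from the perf_event ABI header. */
#define MOCK_SAMPLE_BRANCH_STACK (1ULL << 11)

/* Hypothetical stand-in for struct perf_event; only the field we need. */
struct mock_event {
	uint64_t sample_type;
};

/* Assumed behavior of has_branch_stack(): test one sample_type bit. */
static bool mock_has_branch_stack(const struct mock_event *event)
{
	return (event->sample_type & MOCK_SAMPLE_BRANCH_STACK) != 0;
}

/* Models an event_init() on a PMU without branch stack support. */
static int mock_event_init(const struct mock_event *event)
{
	/* does not support taken branch sampling */
	if (mock_has_branch_stack(event))
		return -EOPNOTSUPP;

	return 0;
}
```

Rejecting the event at init time means the caller of perf_event_open() sees the error immediately instead of receiving samples with an empty branch stack.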

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/alpha/kernel/perf_event.c       |    4 ++++
 arch/arm/kernel/perf_event.c         |    4 ++++
 arch/mips/kernel/perf_event_mipsxx.c |    4 ++++
 arch/powerpc/kernel/perf_event.c     |    4 ++++
 arch/sh/kernel/perf_event.c          |    4 ++++
 arch/sparc/kernel/perf_event.c       |    4 ++++
 arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
 kernel/events/core.c                 |   24 ++++++++++++++++++++++++
 kernel/events/hw_breakpoint.c        |    6 ++++++
 9 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 8143cd7..0dae252 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 5bb91bf..68bb0ce 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -539,6 +539,10 @@ static int armpmu_event_init(struct perf_event *event)
 	int err = 0;
 	atomic_t *active_events = &armpmu->active_events;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	if (armpmu->map_event(event) == -ENOENT)
 		return -ENOENT;
 
diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
index e3b897a..811084f 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -606,6 +606,10 @@ static int mipspmu_event_init(struct perf_event *event)
 {
 	int err = 0;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
index d614ab5..4e0b265 100644
--- a/arch/powerpc/kernel/perf_event.c
+++ b/arch/powerpc/kernel/perf_event.c
@@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
 	if (!ppmu)
 		return -ENOENT;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_HARDWARE:
 		ev = event->attr.config;
diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
index 10b14e3..068b8a2 100644
--- a/arch/sh/kernel/perf_event.c
+++ b/arch/sh/kernel/perf_event.c
@@ -310,6 +310,10 @@ static int sh_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HW_CACHE:
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 614da62..8e16a4a 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
 	if (atomic_read(&nmi_active) < 0)
 		return -ENODEV;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (attr->type) {
 	case PERF_TYPE_HARDWARE:
 		if (attr->config >= sparc_pmu->max_events)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 0397b23..0d8da03 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -138,6 +138,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
 	if (ret)
 		return ret;
 
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	if (event->attr.exclude_host && event->attr.exclude_guest)
 		/*
 		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c4520a2..431f7b4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5007,6 +5007,12 @@ static int perf_swevent_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_SOFTWARE)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event_id) {
 	case PERF_COUNT_SW_CPU_CLOCK:
 	case PERF_COUNT_SW_TASK_CLOCK:
@@ -5117,6 +5123,12 @@ static int perf_tp_event_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for tracepoint events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	err = perf_trace_init(event);
 	if (err)
 		return err;
@@ -5342,6 +5354,12 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
@@ -5416,6 +5434,12 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index b0309f7..cee5423 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
 	if (bp->attr.type != PERF_TYPE_BREAKPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for breakpoint events
+	 */
+	if (has_branch_stack(bp))
+		return -EOPNOTSUPP;
+
 	err = register_perf_hw_breakpoint(bp);
 	if (err)
 		return err;
-- 
1.7.1



* [PATCH v4 10/18] perf: add hook to flush branch_stack on context switch
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (8 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 09/18] perf: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 11/18] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.

We need a hook on context switch to either flush the branch stack
or save it. This patch adds a new hook in struct pmu which is called
during context switches. The hook is called only when necessary,
that is, when a system-wide context has at least one event which
uses PERF_SAMPLE_BRANCH_STACK. The hook is never called for per-thread
contexts.

In this version, the Intel X86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
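
To illustrate how zeroed entries disappear from a sample, here is a minimal userspace sketch of the compaction step. It is a single-pass variant, not the shifting loop used by the SW filter in this series, and the mock_lbr_entry type and mock_lbr_compact() name are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one LBR entry as seen by the SW filter. */
struct mock_lbr_entry {
	uint64_t from;
	uint64_t to;
};

/*
 * Drop every entry whose 'from' address is zero (flushed or discarded),
 * keeping the remaining entries in their original order.
 * Returns the new number of valid entries.
 */
static size_t mock_lbr_compact(struct mock_lbr_entry *e, size_t nr)
{
	size_t in, out = 0;

	for (in = 0; in < nr; in++) {
		if (e[in].from == 0)
			continue;
		e[out++] = e[in];
	}

	return out;
}
```

Because a flush leaves only zeroed entries behind, a sample taken right after a context switch simply reports fewer branches.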

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |   21 +++++---
 arch/x86/kernel/cpu/perf_event.h       |    1 +
 arch/x86/kernel/cpu/perf_event_intel.c |   13 +++++
 include/linux/perf_event.h             |    9 +++-
 kernel/events/core.c                   |   85 ++++++++++++++++++++++++++++++++
 5 files changed, 121 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 710ec93..01ce138 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1633,25 +1633,32 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
+static void x86_pmu_flush_branch_stack(void)
+{
+	if (x86_pmu.flush_branch_stack)
+		x86_pmu.flush_branch_stack();
+}
+
 static struct pmu pmu = {
-	.pmu_enable	= x86_pmu_enable,
-	.pmu_disable	= x86_pmu_disable,
+	.pmu_enable		= x86_pmu_enable,
+	.pmu_disable		= x86_pmu_disable,
 
 	.attr_groups	= x86_pmu_attr_groups,
 
 	.event_init	= x86_pmu_event_init,
 
-	.add		= x86_pmu_add,
-	.del		= x86_pmu_del,
-	.start		= x86_pmu_start,
-	.stop		= x86_pmu_stop,
-	.read		= x86_pmu_read,
+	.add			= x86_pmu_add,
+	.del			= x86_pmu_del,
+	.start			= x86_pmu_start,
+	.stop			= x86_pmu_stop,
+	.read			= x86_pmu_read,
 
 	.start_txn	= x86_pmu_start_txn,
 	.cancel_txn	= x86_pmu_cancel_txn,
 	.commit_txn	= x86_pmu_commit_txn,
 
 	.event_idx	= x86_pmu_event_idx,
+	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 };
 
 void perf_update_user_clock(struct perf_event_mmap_page *userpg, u64 now)
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index a5281a4..1699d36 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -322,6 +322,7 @@ struct x86_pmu {
 	void		(*cpu_starting)(int cpu);
 	void		(*cpu_dying)(int cpu);
 	void		(*cpu_dead)(int cpu);
+	void		(*flush_branch_stack)(void);
 
 	/*
 	 * Intel Arch Perfmon v2+
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 7cc1e2d..6627089 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1539,6 +1539,18 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_flush_branch_stack(void)
+{
+	/*
+	 * Intel LBR does not tag entries with the
+	 * PID of the current task, so we need to
+	 * flush it on ctxsw.
+	 * For now, we simply reset it
+	 */
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_reset();
+}
+
 static __initconst const struct x86_pmu intel_pmu = {
 	.name			= "Intel",
 	.handle_irq		= intel_pmu_handle_irq,
@@ -1566,6 +1578,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
 	.guest_get_msrs		= intel_guest_get_msrs,
+	.flush_branch_stack	= intel_pmu_flush_branch_stack,
 };
 
 static __init void intel_clovertown_quirk(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3ac057d..6d9d712 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -745,6 +745,11 @@ struct pmu {
 	 * if no implementation is provided it will default to: event->hw.idx + 1.
 	 */
 	int (*event_idx)		(struct perf_event *event); /*optional */
+
+	/*
+	 * flush branch stack on context-switches (needed in cpu-wide mode)
+	 */
+	void (*flush_branch_stack)	(void);
 };
 
 /**
@@ -975,7 +980,8 @@ struct perf_event_context {
 	u64				parent_gen;
 	u64				generation;
 	int				pin_count;
-	int				nr_cgroups; /* cgroup events present */
+	int				nr_cgroups;	 /* cgroup evts */
+	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 };
 
@@ -1040,6 +1046,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
+
 struct perf_sample_data {
 	u64				type;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 431f7b4..7e4d4af 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -137,6 +137,7 @@ enum event_type_t {
  */
 struct jump_label_key_deferred perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
+static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -888,6 +889,9 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack++;
+
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1027,6 +1031,9 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack--;
+
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2202,6 +2209,66 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
+ * When sampling the branch stack in system-wide mode, it may be necessary
+ * to flush the stack on context switch. This happens when the branch
+ * stack does not tag its entries with the pid of the current task.
+ * Otherwise it becomes impossible to associate a branch entry with a
+ * task. This ambiguity is more likely to appear when the branch stack
+ * supports priv level filtering and the user sets it to monitor only
+ * at the user level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch stack with
+ * branches from multiple tasks. Flushing may mean dropping the existing
+ * entries or stashing them somewhere in the PMU specific code layer.
+ *
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when there is at least one system-wide context
+ * with at least one active event using taken branch sampling.
+ */
+static void perf_branch_stack_sched_in(struct task_struct *prev,
+				       struct task_struct *task)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	/* no need to flush branch stack if not changing task */
+	if (prev == task)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pmu, &pmus, entry) {
+		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+		/*
+		 * check if the context has at least one
+		 * event using PERF_SAMPLE_BRANCH_STACK
+		 */
+		if (cpuctx->ctx.nr_branch_stack > 0
+		    && pmu->flush_branch_stack) {
+
+			pmu = cpuctx->ctx.pmu;
+
+			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+			perf_pmu_disable(pmu);
+
+			pmu->flush_branch_stack();
+
+			perf_pmu_enable(pmu);
+
+			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		}
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
+/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2232,6 +2299,10 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	 */
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
+
+	/* check for system-wide branch_stack events */
+	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
+		perf_branch_stack_sched_in(prev, task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -2768,6 +2839,14 @@ static void free_event(struct perf_event *event)
 			atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
 			jump_label_dec_deferred(&perf_sched_events);
 		}
+
+		if (has_branch_stack(event)) {
+			jump_label_dec_deferred(&perf_sched_events);
+			/* is system-wide event */
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_dec(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	if (event->rb) {
@@ -5887,6 +5966,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 				return ERR_PTR(err);
 			}
 		}
+		if (has_branch_stack(event)) {
+			jump_label_inc(&perf_sched_events.key);
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_inc(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	return event;
-- 
1.7.1



* [PATCH v4 11/18] perf: add code to support PERF_SAMPLE_BRANCH_STACK
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (9 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 10/18] perf: add hook to flush branch_stack on context switch Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 12/18] perf: add support for sampling taken branch to perf record Stephane Eranian
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds:
- ability to parse samples with PERF_SAMPLE_BRANCH_STACK
- sort on branches
- build histograms on branches
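
The parsing step relies on the sample layout: a u64 count followed by that many fixed-size entries. A minimal userspace sketch of walking past a branch stack, assuming each entry is three u64 words (from, to, flags); the mock_ types and mock_skip_branch_stack() are illustrative, and flags is kept as a plain u64 here rather than a bitfield:

```c
#include <stdint.h>

/* Hypothetical mirror of one branch entry: three 64-bit words. */
struct mock_branch_entry {
	uint64_t from;
	uint64_t to;
	uint64_t flags;
};

/* Hypothetical mirror of the on-sample layout: count, then entries. */
struct mock_branch_stack {
	uint64_t nr;
	struct mock_branch_entry entries[];
};

/*
 * 'array' points at the nr word of a branch stack inside a sample.
 * Stores a pointer to the branch stack in *bs and returns a pointer
 * to the first u64 that follows it.
 */
static const uint64_t *
mock_skip_branch_stack(const uint64_t *array,
		       const struct mock_branch_stack **bs)
{
	uint64_t sz;

	*bs = (const struct mock_branch_stack *)array;
	array++; /* nr */

	sz = (*bs)->nr * sizeof(struct mock_branch_entry);
	sz /= sizeof(uint64_t);

	return array + sz;
}
```

Since the number of entries varies per sample, any field parsed after the branch stack has to be located this way rather than at a fixed offset.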

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/perf.h          |   17 ++
 tools/perf/util/annotate.c |    2 +-
 tools/perf/util/event.h    |    1 +
 tools/perf/util/evsel.c    |   10 ++
 tools/perf/util/hist.c     |   93 +++++++++---
 tools/perf/util/hist.h     |    7 +
 tools/perf/util/session.c  |   72 +++++++++
 tools/perf/util/session.h  |    4 +
 tools/perf/util/sort.c     |  362 +++++++++++++++++++++++++++++++++-----------
 tools/perf/util/sort.h     |    5 +
 tools/perf/util/symbol.h   |   13 ++
 11 files changed, 475 insertions(+), 111 deletions(-)

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 92af168..8b4d25d 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -180,6 +180,23 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+struct branch_flags {
+	u64 mispred:1;
+	u64 predicted:1;
+	u64 reserved:62;
+};
+
+struct branch_entry {
+	u64				from;
+	u64				to;
+	struct branch_flags flags;
+};
+
+struct branch_stack {
+	u64				nr;
+	struct branch_entry	entries[0];
+};
+
 extern bool perf_host, perf_guest;
 extern const char perf_version_string[];
 
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 011ed26..8248d80 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -64,7 +64,7 @@ int symbol__inc_addr_samples(struct symbol *sym, struct map *map,
 
 	pr_debug3("%s: addr=%#" PRIx64 "\n", __func__, map->unmap_ip(map, addr));
 
-	if (addr >= sym->end)
+	if (addr >= sym->end || addr < sym->start)
 		return 0;
 
 	offset = addr - sym->start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index cbdeaad..1b19728 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -81,6 +81,7 @@ struct perf_sample {
 	u32 raw_size;
 	void *raw_data;
 	struct ip_callchain *callchain;
+	struct branch_stack *branch_stack;
 };
 
 #define BUILD_ID_SIZE 20
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 667f3b7..472fc8c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -575,6 +575,16 @@ int perf_event__parse_sample(const union perf_event *event, u64 type,
 		data->raw_data = (void *) pdata;
 	}
 
+	if (type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 sz;
+
+		data->branch_stack = (struct branch_stack *)array;
+		array++; /* nr */
+
+		sz = data->branch_stack->nr * sizeof(struct branch_entry);
+		sz /= sizeof(uint64_t);
+		array += sz;
+	}
 	return 0;
 }
 
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 6f505d1..66f9936 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -54,9 +54,11 @@ static void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 {
 	u16 len;
 
-	if (h->ms.sym)
-		hists__new_col_len(hists, HISTC_SYMBOL, h->ms.sym->namelen);
-	else {
+	if (h->ms.sym) {
+		int n = (int)h->ms.sym->namelen + 4;
+		int symlen = max(n, BITS_PER_LONG / 4 + 6);
+		hists__new_col_len(hists, HISTC_SYMBOL, symlen);
+	} else {
 		const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
 
 		if (hists__col_len(hists, HISTC_DSO) < unresolved_col_width &&
@@ -195,26 +197,14 @@ static u8 symbol__parent_filter(const struct symbol *parent)
 	return 0;
 }
 
-struct hist_entry *__hists__add_entry(struct hists *hists,
+static struct hist_entry *add_hist_entry(struct hists *hists,
+				      struct hist_entry *entry,
 				      struct addr_location *al,
-				      struct symbol *sym_parent, u64 period)
+				      u64 period)
 {
 	struct rb_node **p;
 	struct rb_node *parent = NULL;
 	struct hist_entry *he;
-	struct hist_entry entry = {
-		.thread	= al->thread,
-		.ms = {
-			.map	= al->map,
-			.sym	= al->sym,
-		},
-		.cpu	= al->cpu,
-		.ip	= al->addr,
-		.level	= al->level,
-		.period	= period,
-		.parent = sym_parent,
-		.filtered = symbol__parent_filter(sym_parent),
-	};
 	int cmp;
 
 	pthread_mutex_lock(&hists->lock);
@@ -225,7 +215,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 		parent = *p;
 		he = rb_entry(parent, struct hist_entry, rb_node_in);
 
-		cmp = hist_entry__cmp(&entry, he);
+		cmp = hist_entry__cmp(entry, he);
 
 		if (!cmp) {
 			he->period += period;
@@ -239,7 +229,7 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 			p = &(*p)->rb_right;
 	}
 
-	he = hist_entry__new(&entry);
+	he = hist_entry__new(entry);
 	if (!he)
 		goto out_unlock;
 
@@ -252,6 +242,69 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 	return he;
 }
 
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info *bi,
+					     u64 period){
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= bi->to.map,
+			.sym	= bi->to.sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= bi->to.addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+		.branch_info = bi,
+	};
+	struct hist_entry *he;
+
+	he = add_hist_entry(self, &entry, al, period);
+	if (!he)
+		return NULL;
+
+	/*
+	 * in branch mode, we do not display al->sym, al->addr
+	 * but instead what is in branch_info. The addresses and
+	 * symbols there may need wider columns, so make sure they
+	 * are taken into account.
+	 *
+	 * hists__calc_col_len() tracks the max column width, so
+	 * we need to call it for both the from and to addresses
+	 */
+	entry.ip     = bi->from.addr;
+	entry.ms.map = bi->from.map;
+	entry.ms.sym = bi->from.sym;
+	hists__calc_col_len(self, &entry);
+
+	return he;
+}
+
+struct hist_entry *__hists__add_entry(struct hists *self,
+				      struct addr_location *al,
+				      struct symbol *sym_parent, u64 period)
+{
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= al->map,
+			.sym	= al->sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= al->addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+	};
+
+	return add_hist_entry(self, &entry, al, period);
+}
+
 int64_t
 hist_entry__cmp(struct hist_entry *left, struct hist_entry *right)
 {
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 0d48613..801a04e 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -41,6 +41,7 @@ enum hist_column {
 	HISTC_COMM,
 	HISTC_PARENT,
 	HISTC_CPU,
+	HISTC_MISPREDICT,
 	HISTC_NR_COLS, /* Last entry */
 };
 
@@ -73,6 +74,12 @@ int hist_entry__snprintf(struct hist_entry *self, char *bf, size_t size,
 			 struct hists *hists);
 void hist_entry__free(struct hist_entry *);
 
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info *bi,
+					     u64 period);
+
 void hists__output_resort(struct hists *self);
 void hists__output_resort_threaded(struct hists *hists);
 void hists__collapse_resort(struct hists *self);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index b5ca255..6643224 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -229,6 +229,63 @@ static bool symbol__match_parent_regex(struct symbol *sym)
 	return 0;
 }
 
+static const u8 cpumodes[] = {
+	PERF_RECORD_MISC_USER,
+	PERF_RECORD_MISC_KERNEL,
+	PERF_RECORD_MISC_GUEST_USER,
+	PERF_RECORD_MISC_GUEST_KERNEL
+};
+#define NCPUMODES (sizeof(cpumodes)/sizeof(u8))
+
+static void ip__resolve_ams(struct machine *self, struct thread *thread,
+			    struct addr_map_symbol *ams,
+			    u64 ip)
+{
+	struct addr_location al;
+	size_t i;
+	u8 m;
+
+	memset(&al, 0, sizeof(al));
+
+	for (i = 0; i < NCPUMODES; i++) {
+		m = cpumodes[i];
+		/*
+		 * we cannot use the header.misc hint to determine whether a
+		 * branch stack address is user, kernel, guest, hypervisor.
+		 * Branches may straddle the kernel/user/hypervisor boundaries.
+		 * Thus, we have to try each mode consecutively until we find
+		 * a match, or else the symbol is unknown
+		 */
+		thread__find_addr_location(thread, self, m, MAP__FUNCTION,
+				ip, &al, NULL);
+		if (al.sym)
+			goto found;
+	}
+found:
+	ams->addr = ip;
+	ams->sym = al.sym;
+	ams->map = al.map;
+}
+
+struct branch_info *perf_session__resolve_bstack(struct machine *self,
+						 struct thread *thr,
+						 struct branch_stack *bs)
+{
+	struct branch_info *bi;
+	unsigned int i;
+
+	bi = calloc(bs->nr, sizeof(struct branch_info));
+	if (!bi)
+		return NULL;
+
+	for (i = 0; i < bs->nr; i++) {
+		ip__resolve_ams(self, thr, &bi[i].to, bs->entries[i].to);
+		ip__resolve_ams(self, thr, &bi[i].from, bs->entries[i].from);
+		bi[i].flags = bs->entries[i].flags;
+	}
+	return bi;
+}
+
 int machine__resolve_callchain(struct machine *self, struct perf_evsel *evsel,
 			       struct thread *thread,
 			       struct ip_callchain *chain,
@@ -697,6 +754,18 @@ static void callchain__printf(struct perf_sample *sample)
 		       i, sample->callchain->ips[i]);
 }
 
+static void branch_stack__printf(struct perf_sample *sample)
+{
+	uint64_t i;
+
+	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
+
+	for (i = 0; i < sample->branch_stack->nr; i++)
+		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 "\n",
+			i, sample->branch_stack->entries[i].from,
+			sample->branch_stack->entries[i].to);
+}
+
 static void perf_session__print_tstamp(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_sample *sample)
@@ -744,6 +813,9 @@ static void dump_sample(struct perf_session *session, union perf_event *event,
 
 	if (session->sample_type & PERF_SAMPLE_CALLCHAIN)
 		callchain__printf(sample);
+
+	if (session->sample_type & PERF_SAMPLE_BRANCH_STACK)
+		branch_stack__printf(sample);
 }
 
 static struct machine *
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 37bc383..f407338 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -73,6 +73,10 @@ int perf_session__resolve_callchain(struct perf_session *self, struct perf_evsel
 				    struct ip_callchain *chain,
 				    struct symbol **parent);
 
+struct branch_info *perf_session__resolve_bstack(struct machine *self,
+						 struct thread *thread,
+						 struct branch_stack *bs);
+
 bool perf_session__has_traces(struct perf_session *self, const char *msg);
 
 void mem_bswap_64(void *src, int byte_size);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 16da30d..1531989 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -8,6 +8,7 @@ const char	default_sort_order[] = "comm,dso,symbol";
 const char	*sort_order = default_sort_order;
 int		sort__need_collapse = 0;
 int		sort__has_parent = 0;
+bool		sort__branch_mode;
 
 enum sort_type	sort__first_dimension;
 
@@ -94,6 +95,26 @@ static int hist_entry__comm_snprintf(struct hist_entry *self, char *bf,
 	return repsep_snprintf(bf, size, "%*s", width, self->thread->comm);
 }
 
+static int64_t _sort__dso_cmp(struct map *map_l, struct map *map_r)
+{
+	struct dso *dso_l = map_l ? map_l->dso : NULL;
+	struct dso *dso_r = map_r ? map_r->dso : NULL;
+	const char *dso_name_l, *dso_name_r;
+
+	if (!dso_l || !dso_r)
+		return cmp_null(dso_l, dso_r);
+
+	if (verbose) {
+		dso_name_l = dso_l->long_name;
+		dso_name_r = dso_r->long_name;
+	} else {
+		dso_name_l = dso_l->short_name;
+		dso_name_r = dso_r->short_name;
+	}
+
+	return strcmp(dso_name_l, dso_name_r);
+}
+
 struct sort_entry sort_comm = {
 	.se_header	= "Command",
 	.se_cmp		= sort__comm_cmp,
@@ -107,36 +128,74 @@ struct sort_entry sort_comm = {
 static int64_t
 sort__dso_cmp(struct hist_entry *left, struct hist_entry *right)
 {
-	struct dso *dso_l = left->ms.map ? left->ms.map->dso : NULL;
-	struct dso *dso_r = right->ms.map ? right->ms.map->dso : NULL;
-	const char *dso_name_l, *dso_name_r;
+	return _sort__dso_cmp(left->ms.map, right->ms.map);
+}
 
-	if (!dso_l || !dso_r)
-		return cmp_null(dso_l, dso_r);
 
-	if (verbose) {
-		dso_name_l = dso_l->long_name;
-		dso_name_r = dso_r->long_name;
-	} else {
-		dso_name_l = dso_l->short_name;
-		dso_name_r = dso_r->short_name;
+static int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r,
+			      u64 ip_l, u64 ip_r)
+{
+	if (!sym_l || !sym_r)
+		return cmp_null(sym_l, sym_r);
+
+	if (sym_l == sym_r)
+		return 0;
+
+	if (sym_l)
+		ip_l = sym_l->start;
+	if (sym_r)
+		ip_r = sym_r->start;
+
+	return (int64_t)(ip_r - ip_l);
+}
+
+static int _hist_entry__dso_snprintf(struct map *map, char *bf,
+				     size_t size, unsigned int width)
+{
+	if (map && map->dso) {
+		const char *dso_name = !verbose ? map->dso->short_name :
+			map->dso->long_name;
+		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
 	}
 
-	return strcmp(dso_name_l, dso_name_r);
+	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
 }
 
 static int hist_entry__dso_snprintf(struct hist_entry *self, char *bf,
 				    size_t size, unsigned int width)
 {
-	if (self->ms.map && self->ms.map->dso) {
-		const char *dso_name = !verbose ? self->ms.map->dso->short_name :
-						  self->ms.map->dso->long_name;
-		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
+	return _hist_entry__dso_snprintf(self->ms.map, bf, size, width);
+}
+
+static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
+				     u64 ip, char level, char *bf, size_t size,
+				     unsigned int width __used)
+{
+	size_t ret = 0;
+
+	if (verbose) {
+		char o = map ? dso__symtab_origin(map->dso) : '!';
+		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
+				       BITS_PER_LONG / 4, ip, o);
 	}
 
-	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
+	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+	if (sym)
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+				       width - ret,
+				       sym->name);
+	else {
+		size_t len = BITS_PER_LONG / 4;
+		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+				       len, ip);
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+				       width - ret, "");
+	}
+
+	return ret;
 }
 
+
 struct sort_entry sort_dso = {
 	.se_header	= "Shared Object",
 	.se_cmp		= sort__dso_cmp,
@@ -144,8 +203,14 @@ struct sort_entry sort_dso = {
 	.se_width_idx	= HISTC_DSO,
 };
 
-/* --sort symbol */
+static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	return _hist_entry__sym_snprintf(self->ms.map, self->ms.sym, self->ip,
+					 self->level, bf, size, width);
+}
 
+/* --sort symbol */
 static int64_t
 sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 {
@@ -154,40 +219,10 @@ sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 	if (!left->ms.sym && !right->ms.sym)
 		return right->level - left->level;
 
-	if (!left->ms.sym || !right->ms.sym)
-		return cmp_null(left->ms.sym, right->ms.sym);
-
-	if (left->ms.sym == right->ms.sym)
-		return 0;
-
 	ip_l = left->ms.sym->start;
 	ip_r = right->ms.sym->start;
 
-	return (int64_t)(ip_r - ip_l);
-}
-
-static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
-				    size_t size, unsigned int width __used)
-{
-	size_t ret = 0;
-
-	if (verbose) {
-		char o = self->ms.map ? dso__symtab_origin(self->ms.map->dso) : '!';
-		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
-				       BITS_PER_LONG / 4, self->ip, o);
-	}
-
-	if (!sort_dso.elide)
-		ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", self->level);
-
-	if (self->ms.sym)
-		ret += repsep_snprintf(bf + ret, size - ret, "%s",
-				       self->ms.sym->name);
-	else
-		ret += repsep_snprintf(bf + ret, size - ret, "%-#*llx",
-				       BITS_PER_LONG / 4, self->ip);
-
-	return ret;
+	return _sort__sym_cmp(left->ms.sym, right->ms.sym, ip_l, ip_r);
 }
 
 struct sort_entry sort_sym = {
@@ -246,6 +281,135 @@ struct sort_entry sort_cpu = {
 	.se_width_idx	= HISTC_CPU,
 };
 
+static int64_t
+sort__dso_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->branch_info->from.map,
+			      right->branch_info->from.map);
+}
+
+static int hist_entry__dso_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->branch_info->from.map,
+					 bf, size, width);
+}
+
+struct sort_entry sort_dso_from = {
+	.se_header	= "Source Shared Object",
+	.se_cmp		= sort__dso_from_cmp,
+	.se_snprintf	= hist_entry__dso_from_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+static int64_t
+sort__dso_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->branch_info->to.map,
+			      right->branch_info->to.map);
+}
+
+static int hist_entry__dso_to_snprintf(struct hist_entry *self, char *bf,
+				       size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->branch_info->to.map,
+					 bf, size, width);
+}
+
+static int64_t
+sort__sym_from_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *from_l = &left->branch_info->from;
+	struct addr_map_symbol *from_r = &right->branch_info->from;
+
+	if (!from_l->sym && !from_r->sym)
+		return right->level - left->level;
+
+	return _sort__sym_cmp(from_l->sym, from_r->sym, from_l->addr,
+			     from_r->addr);
+}
+
+static int64_t
+sort__sym_to_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *to_l = &left->branch_info->to;
+	struct addr_map_symbol *to_r = &right->branch_info->to;
+
+	if (!to_l->sym && !to_r->sym)
+		return right->level - left->level;
+
+	return _sort__sym_cmp(to_l->sym, to_r->sym, to_l->addr, to_r->addr);
+}
+
+static int hist_entry__sym_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *from = &self->branch_info->from;
+	return _hist_entry__sym_snprintf(from->map, from->sym, from->addr,
+					 self->level, bf, size, width);
+
+}
+
+static int hist_entry__sym_to_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *to = &self->branch_info->to;
+	return _hist_entry__sym_snprintf(to->map, to->sym, to->addr,
+					 self->level, bf, size, width);
+
+}
+
+struct sort_entry sort_dso_to = {
+	.se_header	= "Target Shared Object",
+	.se_cmp		= sort__dso_to_cmp,
+	.se_snprintf	= hist_entry__dso_to_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+struct sort_entry sort_sym_from = {
+	.se_header	= "Source Symbol",
+	.se_cmp		= sort__sym_from_cmp,
+	.se_snprintf	= hist_entry__sym_from_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+struct sort_entry sort_sym_to = {
+	.se_header	= "Target Symbol",
+	.se_cmp		= sort__sym_to_cmp,
+	.se_snprintf	= hist_entry__sym_to_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+static int64_t
+sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	const unsigned char mp = left->branch_info->flags.mispred !=
+					right->branch_info->flags.mispred;
+	const unsigned char p = left->branch_info->flags.predicted !=
+					right->branch_info->flags.predicted;
+
+	return mp || p;
+}
+
+static int hist_entry__mispredict_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width){
+	const char *out = "N/A";
+
+	if (self->branch_info->flags.predicted)
+		out = "N";
+	else if (self->branch_info->flags.mispred)
+		out = "Y";
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+struct sort_entry sort_mispredict = {
+	.se_header	= "Branch Mispredicted",
+	.se_cmp		= sort__mispredict_cmp,
+	.se_snprintf	= hist_entry__mispredict_snprintf,
+	.se_width_idx	= HISTC_MISPREDICT,
+};
+
 struct sort_dimension {
 	const char		*name;
 	struct sort_entry	*entry;
@@ -253,14 +417,59 @@ struct sort_dimension {
 };
 
 static struct sort_dimension sort_dimensions[] = {
-	{ .name = "pid",	.entry = &sort_thread,	},
-	{ .name = "comm",	.entry = &sort_comm,	},
-	{ .name = "dso",	.entry = &sort_dso,	},
-	{ .name = "symbol",	.entry = &sort_sym,	},
-	{ .name = "parent",	.entry = &sort_parent,	},
-	{ .name = "cpu",	.entry = &sort_cpu,	},
+	{ .name = "pid",	.entry = &sort_thread,			},
+	{ .name = "comm",	.entry = &sort_comm,			},
+	{ .name = "dso",	.entry = &sort_dso,			},
+	{ .name = "dso_from",	.entry = &sort_dso_from, .taken = true	},
+	{ .name = "dso_to",	.entry = &sort_dso_to,	 .taken = true	},
+	{ .name = "symbol",	.entry = &sort_sym,			},
+	{ .name = "symbol_from",.entry = &sort_sym_from, .taken = true	},
+	{ .name = "symbol_to",	.entry = &sort_sym_to,	 .taken = true	},
+	{ .name = "parent",	.entry = &sort_parent,			},
+	{ .name = "cpu",	.entry = &sort_cpu,			},
+	{ .name = "mispredict", .entry = &sort_mispredict, },
 };
 
+static int _sort_dimension__add(struct sort_dimension *sd)
+{
+	if (sd->entry->se_collapse)
+		sort__need_collapse = 1;
+
+	if (sd->entry == &sort_parent) {
+		int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
+		if (ret) {
+			char err[BUFSIZ];
+
+			regerror(ret, &parent_regex, err, sizeof(err));
+			pr_err("Invalid regex: %s\n%s", parent_pattern, err);
+			return -EINVAL;
+		}
+		sort__has_parent = 1;
+	}
+
+	if (list_empty(&hist_entry__sort_list)) {
+		if (!strcmp(sd->name, "pid"))
+			sort__first_dimension = SORT_PID;
+		else if (!strcmp(sd->name, "comm"))
+			sort__first_dimension = SORT_COMM;
+		else if (!strcmp(sd->name, "dso"))
+			sort__first_dimension = SORT_DSO;
+		else if (!strcmp(sd->name, "symbol"))
+			sort__first_dimension = SORT_SYM;
+		else if (!strcmp(sd->name, "parent"))
+			sort__first_dimension = SORT_PARENT;
+		else if (!strcmp(sd->name, "cpu"))
+			sort__first_dimension = SORT_CPU;
+		else if (!strcmp(sd->name, "mispredict"))
+			sort__first_dimension = SORT_MISPREDICTED;
+	}
+
+	list_add_tail(&sd->entry->list, &hist_entry__sort_list);
+	sd->taken = 1;
+
+	return 0;
+}
+
 int sort_dimension__add(const char *tok)
 {
 	unsigned int i;
@@ -271,48 +480,21 @@ int sort_dimension__add(const char *tok)
 		if (strncasecmp(tok, sd->name, strlen(tok)))
 			continue;
 
-		if (sd->entry == &sort_parent) {
-			int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
-			if (ret) {
-				char err[BUFSIZ];
-
-				regerror(ret, &parent_regex, err, sizeof(err));
-				pr_err("Invalid regex: %s\n%s", parent_pattern, err);
-				return -EINVAL;
-			}
-			sort__has_parent = 1;
-		}
-
 		if (sd->taken)
 			return 0;
 
-		if (sd->entry->se_collapse)
-			sort__need_collapse = 1;
-
-		if (list_empty(&hist_entry__sort_list)) {
-			if (!strcmp(sd->name, "pid"))
-				sort__first_dimension = SORT_PID;
-			else if (!strcmp(sd->name, "comm"))
-				sort__first_dimension = SORT_COMM;
-			else if (!strcmp(sd->name, "dso"))
-				sort__first_dimension = SORT_DSO;
-			else if (!strcmp(sd->name, "symbol"))
-				sort__first_dimension = SORT_SYM;
-			else if (!strcmp(sd->name, "parent"))
-				sort__first_dimension = SORT_PARENT;
-			else if (!strcmp(sd->name, "cpu"))
-				sort__first_dimension = SORT_CPU;
-		}
-
-		list_add_tail(&sd->entry->list, &hist_entry__sort_list);
-		sd->taken = 1;
 
-		return 0;
+		if (sort__branch_mode && (sd->entry == &sort_dso ||
+					sd->entry == &sort_sym)){
+			int err = _sort_dimension__add(sd + 1);
+			return err ?: _sort_dimension__add(sd + 2);
+		} else if (sd->entry == &sort_mispredict && !sort__branch_mode)
+			break;
+		else
+			return _sort_dimension__add(sd);
 	}
-
 	return -ESRCH;
 }
-
 void setup_sorting(const char * const usagestr[], const struct option *opts)
 {
 	char *tmp, *tok, *str = strdup(sort_order);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 3f67ae3..effcae1 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -31,11 +31,14 @@ extern const char *parent_pattern;
 extern const char default_sort_order[];
 extern int sort__need_collapse;
 extern int sort__has_parent;
+extern bool sort__branch_mode;
 extern char *field_sep;
 extern struct sort_entry sort_comm;
 extern struct sort_entry sort_dso;
 extern struct sort_entry sort_sym;
 extern struct sort_entry sort_parent;
+extern struct sort_entry sort_lbr_dso;
+extern struct sort_entry sort_lbr_sym;
 extern enum sort_type sort__first_dimension;
 
 /**
@@ -72,6 +75,7 @@ struct hist_entry {
 		struct hist_entry *pair;
 		struct rb_root	  sorted_chain;
 	};
+	struct branch_info	*branch_info;
 	struct callchain_root	callchain[0];
 };
 
@@ -82,6 +86,7 @@ enum sort_type {
 	SORT_SYM,
 	SORT_PARENT,
 	SORT_CPU,
+	SORT_MISPREDICTED,
 };
 
 /*
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 123c2e1..6297e88 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -5,6 +5,7 @@
 #include <stdbool.h>
 #include <stdint.h>
 #include "map.h"
+#include "../perf.h"
 #include <linux/list.h>
 #include <linux/rbtree.h>
 #include <stdio.h>
@@ -119,6 +120,18 @@ struct map_symbol {
 	bool	      has_children;
 };
 
+struct addr_map_symbol {
+	struct map    *map;
+	struct symbol *sym;
+	u64	      addr;
+};
+
+struct branch_info {
+	struct addr_map_symbol from;
+	struct addr_map_symbol to;
+	struct branch_flags flags;
+};
+
 struct addr_location {
 	struct thread *thread;
 	struct map    *map;
-- 
1.7.1



* [PATCH v4 12/18] perf: add support for sampling taken branch to perf record
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (10 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 11/18] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-31  9:47   ` Anshuman Khandual
  2012-01-27 20:56 ` [PATCH v4 13/18] perf: add support for taken branch sampling to perf report Stephane Eranian
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds a new option to enable taken branch stack
sampling, i.e., leverage the PERF_SAMPLE_BRANCH_STACK feature
of perf_events.

There is a new option to activate this mode: -b.
It is possible to pass a set of filters to select the type of
branches to sample.

The following filters are available:
- any: any type of branch
- any_call: any function call or system call
- any_ret: any function return or system call return
- any_ind: any indirect branch
- u: only when the branch target is at the user level
- k: only when the branch target is in the kernel
- hv: only when the branch target is in the hypervisor

Filters can be combined by passing a comma separated list
to the option:

$ perf record -b any_call,u -e cycles:u branchy
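
As a rough illustration of how such a comma separated list can be turned
into a mask, here is a minimal standalone sketch. It is not the code from
this patch: the BR_* bit values and the function name parse_branch_filters()
are hypothetical, the filter names are taken from the list above, and
matching is case-sensitive for simplicity. As with -b, a list that names
only privilege levels is rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit values, for illustration only; the real
 * PERF_SAMPLE_BRANCH_* constants come from the kernel headers. */
enum {
	BR_USER     = 1u << 0,
	BR_KERNEL   = 1u << 1,
	BR_HV       = 1u << 2,
	BR_ANY      = 1u << 3,
	BR_ANY_CALL = 1u << 4,
	BR_ANY_RET  = 1u << 5,
	BR_ANY_IND  = 1u << 6,
};

/* Bits that only select a privilege level, not a branch type. */
#define BR_PLM_MASK (BR_USER | BR_KERNEL | BR_HV)

struct br_name {
	const char *name;
	uint64_t bit;
};

static const struct br_name br_names[] = {
	{ "u",        BR_USER     },
	{ "k",        BR_KERNEL   },
	{ "hv",       BR_HV       },
	{ "any",      BR_ANY      },
	{ "any_call", BR_ANY_CALL },
	{ "any_ret",  BR_ANY_RET  },
	{ "any_ind",  BR_ANY_IND  },
	{ NULL,       0           }
};

/*
 * Parse a comma separated filter list into *mask.
 * Returns 0 on success, -1 on an unknown filter or when no
 * branch type (only privilege levels) was given.
 */
static int parse_branch_filters(const char *str, uint64_t *mask)
{
	assert(str != NULL && mask != NULL);
	*mask = 0;

	for (;;) {
		const char *end = strchr(str, ',');
		size_t len = end ? (size_t)(end - str) : strlen(str);
		const struct br_name *b;

		/* exact match of the current token against the table */
		for (b = br_names; b->name; b++)
			if (strlen(b->name) == len && !strncmp(str, b->name, len))
				break;
		if (!b->name)
			return -1;

		*mask |= b->bit;
		if (!end)
			break;
		str = end + 1;
	}
	return (*mask & ~(uint64_t)BR_PLM_MASK) ? 0 : -1;
}
```

With this sketch, "any_call,u" yields BR_ANY_CALL | BR_USER, while "u,k"
fails because no branch type was selected.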

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-record.txt |   25 ++++++++++
 tools/perf/builtin-record.c              |   74 ++++++++++++++++++++++++++++++
 tools/perf/perf.h                        |    1 +
 tools/perf/util/evsel.c                  |    4 ++
 4 files changed, 104 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index ff9a66e..288d429 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -152,6 +152,31 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
 corresponding events, i.e., they always refer to events defined earlier on the command
 line.
 
+-b::
+--branch-stack::
+Enable taken branch stack sampling. Each sample captures a series of consecutive
+taken branches. The number of branches captured with each sample depends on the
+underlying hardware, the type of branches of interest, and the executed code.
+It is possible to select the types of branches captured by enabling filters. The
+following filters are defined:
+
+        - any: any type of branch
+        - any_call: any function call or system call
+        - any_ret: any function return or system call return
+        - any_ind: any indirect branch
+        - u: only when the branch target is at the user level
+        - k: only when the branch target is in the kernel
+        - hv: only when the branch target is at the hypervisor level
+
++
+At least one of any, any_call, any_ret, any_ind must be provided. The privilege levels may
+be omitted, in which case, the privilege levels of the associated event are applied to the
+branch filter. Both kernel (k) and hypervisor (hv) privilege levels are subject to
+permissions.  When sampling on multiple events, branch stack sampling is enabled for all
+the sampling events. The sampled branch type is the same for all events.
+Note that taken branch sampling may not be available on all processors.
+The various filters must be specified as a comma separated list: -b any_ret,u,k
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 32870ee..7df6e68 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -637,6 +637,77 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
 	return err;
 }
 
+#define BRANCH_OPT(n, m) \
+	{ .name = n, .mode = (m) }
+
+#define BRANCH_END { .name = NULL }
+
+struct branch_mode {
+	const char *name;
+	int mode;
+};
+
+static const struct branch_mode branch_modes[] = {
+	BRANCH_OPT("u", PERF_SAMPLE_BRANCH_USER),
+	BRANCH_OPT("k", PERF_SAMPLE_BRANCH_KERNEL),
+	BRANCH_OPT("hv", PERF_SAMPLE_BRANCH_HV),
+	BRANCH_OPT("any", PERF_SAMPLE_BRANCH_ANY),
+	BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
+	BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
+	BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+	BRANCH_END
+};
+
+static int
+parse_branch_stack(const struct option *opt, const char *str, int unset __used)
+{
+#define ONLY_PLM \
+	(PERF_SAMPLE_BRANCH_USER	|\
+	 PERF_SAMPLE_BRANCH_KERNEL	|\
+	 PERF_SAMPLE_BRANCH_HV)
+
+	uint64_t *mode = (uint64_t *)opt->value;
+	const struct branch_mode *br;
+	char *s, *os, *p;
+	int ret = -1;
+
+	*mode = 0;
+
+	/* because str is read-only */
+	s = os = strdup(str);
+	if (!s)
+		return -1;
+
+	for (;;) {
+		p = strchr(s, ',');
+		if (p)
+			*p = '\0';
+
+		for (br = branch_modes; br->name; br++) {
+			if (!strcasecmp(s, br->name))
+				break;
+		}
+		if (!br->name)
+			goto error;
+
+		*mode |= br->mode;
+
+		if (!p)
+			break;
+
+		s = p + 1;
+	}
+	ret = 0;
+
+	if ((*mode & ~ONLY_PLM) == 0) {
+		error("need at least one branch type with -b\n");
+		ret = -1;
+	}
+error:
+	free(os);
+	return ret;
+}
+
 static const char * const record_usage[] = {
 	"perf record [<options>] [<command>]",
 	"perf record [<options>] -- <command> [<options>]",
@@ -729,6 +800,9 @@ const struct option record_options[] = {
 		     "monitor event in cgroup name only",
 		     parse_cgroups),
 	OPT_STRING('u', "uid", &record.uid_str, "user", "user to profile"),
+	OPT_CALLBACK('b', "branch-stack", &record.opts.branch_stack,
+		     "branch mode mask", "branch stack sampling modes",
+		     parse_branch_stack),
 	OPT_END()
 };
 
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 8b4d25d..7f8fbab 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -222,6 +222,7 @@ struct perf_record_opts {
 	unsigned int freq;
 	unsigned int mmap_pages;
 	unsigned int user_freq;
+	int	     branch_stack;
 	u64	     default_interval;
 	u64	     user_interval;
 	const char   *cpu_list;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 472fc8c..a65a53c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -126,6 +126,10 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
 		attr->watermark = 0;
 		attr->wakeup_events = 1;
 	}
+	if (opts->branch_stack) {
+		attr->sample_type	|= PERF_SAMPLE_BRANCH_STACK;
+		attr->branch_sample_type = opts->branch_stack;
+	}
 
 	attr->mmap = track;
 	attr->comm = track;
-- 
1.7.1



* [PATCH v4 13/18] perf: add support for taken branch sampling to perf report
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (11 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 12/18] perf: add support for sampling taken branch to perf record Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 14/18] perf: fix endianness detection in perf.data Stephane Eranian
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

From: Roberto Agostino Vitillo <ravitillo@lbl.gov>

This patch adds support for taken branch sampling, i.e., the
PERF_SAMPLE_BRANCH_STACK feature, to perf report. In other
words, it displays histograms based on taken branches rather
than on executed instruction addresses.

The new option is called -b and it takes no argument. To
generate meaningful output, the perf.data must have been
obtained using perf record -b xxx ... where xxx is a branch
filter option.

The output shows symbols and modules, sorted by 'who branches
where' most often. The percentages reported in the first
column refer to the total number of branches captured, not
to the usual number of samples.

Here is a quick example. branchy is a simple test program
which looks as follows:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

Here is the output captured on Nehalem, if we are
only interested in user level function calls.

$ perf record -b any_call,u -e cycles:u branchy

$ perf report -b --sort=symbol
    52.34%  [.] main                   [.] f1
    24.04%  [.] f1                     [.] f3
    23.60%  [.] f1                     [.] f2
     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
     0.01%  [k] __printf               [k] _IO_vfprintf_internal
     0.01%  [k] main                   [k] __printf

About half (52%) of the call branches captured are from main() -> f1().
The other half (24%+23%) is split in two nearly equal shares between
f1() -> f2() and f1() -> f3(). The output is as expected given the code.
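
To see why roughly 50/25/25 is the expected split, the call edges of
branchy can be counted exactly rather than sampled. The sketch below is
only an idealized model of the test program: the names call_edge,
count_call_edges() and edge_percent() are made up for this illustration,
and the model ignores the few libc calls that show up in the real report.

```c
#include <assert.h>
#include <stddef.h>

/* The three call edges taken by the branchy loop. */
enum call_edge { MAIN_TO_F1, F1_TO_F2, F1_TO_F3, NR_EDGES };

/* Count the call edges branchy takes for n loop iterations. */
static void count_call_edges(unsigned long n, unsigned long counts[NR_EDGES])
{
	unsigned long i;

	assert(counts != NULL);
	for (i = 0; i < NR_EDGES; i++)
		counts[i] = 0;

	for (i = 0; i < n; i++) {
		counts[MAIN_TO_F1]++;		/* main() calls f1(i) */
		if (i & 1UL)
			counts[F1_TO_F2]++;	/* odd i: f1() calls f2() */
		else
			counts[F1_TO_F3]++;	/* even i: f1() calls f3() */
	}
}

/* Share of one edge, in percent of all call edges counted. */
static double edge_percent(const unsigned long counts[NR_EDGES],
			   enum call_edge e)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < NR_EDGES; i++)
		total += counts[i];

	return total ? 100.0 * (double)counts[e] / (double)total : 0.0;
}
```

For an even iteration count, this model gives exactly 50% for
main() -> f1() and 25% for each of the two f1() edges. The sampled
figures above deviate slightly because branch stack sampling is
statistical.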

Note that using -b in perf record does not eliminate information
from the perf.data file. Consequently, a typical profile can still
be obtained from perf report by simply omitting its -b option.

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-report.txt |    7 ++
 tools/perf/builtin-report.c              |   98 +++++++++++++++++++++++++++---
 2 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 9b430e9..19b9092 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -153,6 +153,13 @@ OPTIONS
 	information which may be very large and thus may clutter the display.
 	It currently includes: cpu and numa topology of the host system.
 
+-b::
+--branch-stack::
+	Use the addresses of sampled taken branches instead of the instruction
+	address to build the histograms. To generate meaningful output, the
+	perf.data file must have been obtained using perf record -b xxx where
+	xxx is a branch filter option.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-annotate[1]
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 25d34d4..8a8d2f9 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -53,6 +53,50 @@ struct perf_report {
 	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 };
 
+static int perf_session__add_branch_hist_entry(struct perf_tool *tool,
+					struct addr_location *al,
+					struct perf_sample *sample,
+					struct perf_evsel *evsel,
+				      struct machine *machine)
+{
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
+	struct symbol *parent = NULL;
+	int err = 0;
+	unsigned i;
+	struct hist_entry *he;
+	struct branch_info *bi;
+
+	if ((sort__has_parent || symbol_conf.use_callchain)
+	    && sample->callchain) {
+		err = machine__resolve_callchain(machine, evsel, al->thread,
+						 sample->callchain, &parent);
+		if (err)
+			return err;
+	}
+
+	bi = perf_session__resolve_bstack(machine, al->thread,
+					  sample->branch_stack);
+	if (!bi)
+		return -ENOMEM;
+
+	for (i = 0; i < sample->branch_stack->nr; i++) {
+		if (rep->hide_unresolved && !(bi[i].from.sym && bi[i].to.sym))
+			continue;
+		/*
+		 * The report shows the percentage of total branches captured
+		 * and not events sampled. Thus we use a pseudo period of 1.
+		 */
+		he = __hists__add_branch_entry(&evsel->hists, al, parent,
+					       &bi[i], 1);
+		if (he) {
+			evsel->hists.stats.total_period += 1;
+			hists__inc_nr_events(&evsel->hists, PERF_RECORD_SAMPLE);
+		} else
+			return -ENOMEM;
+	}
+	return err;
+}
+
 static int perf_evsel__add_hist_entry(struct perf_evsel *evsel,
 				      struct addr_location *al,
 				      struct perf_sample *sample,
@@ -126,14 +170,21 @@ static int process_sample_event(struct perf_tool *tool,
 	if (rep->cpu_list && !test_bit(sample->cpu, rep->cpu_bitmap))
 		return 0;
 
-	if (al.map != NULL)
-		al.map->dso->hit = 1;
+	if (sort__branch_mode) {
+		if (perf_session__add_branch_hist_entry(tool, &al, sample,
+						    evsel, machine)) {
+			pr_debug("problem adding lbr entry, skipping event\n");
+			return -1;
+		}
+	} else {
+		if (al.map != NULL)
+			al.map->dso->hit = 1;
 
-	if (perf_evsel__add_hist_entry(evsel, &al, sample, machine)) {
-		pr_debug("problem incrementing symbol period, skipping event\n");
-		return -1;
+		if (perf_evsel__add_hist_entry(evsel, &al, sample, machine)) {
+			pr_debug("problem incrementing symbol period, skipping event\n");
+			return -1;
+		}
 	}
-
 	return 0;
 }
 
@@ -188,6 +239,15 @@ static int perf_report__setup_sample_type(struct perf_report *rep)
 			}
 	}
 
+	if (sort__branch_mode) {
+		if (!(self->sample_type & PERF_SAMPLE_BRANCH_STACK)) {
+			fprintf(stderr, "selected -b but no branch data."
+					" Did you call perf record without"
+					" -b?\n");
+			return -1;
+		}
+	}
+
 	return 0;
 }
 
@@ -477,7 +537,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 	OPT_BOOLEAN(0, "stdio", &report.use_stdio,
 		    "Use the stdio interface"),
 	OPT_STRING('s', "sort", &sort_order, "key[,key2...]",
-		   "sort by key(s): pid, comm, dso, symbol, parent"),
+		   "sort by key(s): pid, comm, dso, symbol, parent, dso_to,"
+		   " dso_from, symbol_to, symbol_from, mispredict"),
 	OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
 		    "Show sample percentage for different cpu modes"),
 	OPT_STRING('p', "parent", &parent_pattern, "regex",
@@ -517,6 +578,8 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 		   "Specify disassembler style (e.g. -M intel for intel syntax)"),
 	OPT_BOOLEAN(0, "show-total-period", &symbol_conf.show_total_period,
 		    "Show a column with the sum of periods"),
+	OPT_BOOLEAN('b', "branch-stack", &sort__branch_mode,
+		    "use branch records for histogram filling"),
 	OPT_END()
 	};
 
@@ -537,10 +600,27 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 			report.input_name = "perf.data";
 	}
 
-	if (strcmp(report.input_name, "-") != 0)
+	if (sort__branch_mode) {
+		if (use_browser)
+			fprintf(stderr, "Warning: TUI interface not supported"
+					" in branch mode\n");
+		if (symbol_conf.dso_list_str != NULL)
+			fprintf(stderr, "Warning: dso filtering not supported"
+					" in branch mode\n");
+		if (symbol_conf.sym_list_str != NULL)
+			fprintf(stderr, "Warning: symbol filtering not"
+					" supported in branch mode\n");
+
+		report.use_stdio = true;
+		use_browser = 0;
 		setup_browser(true);
-	else
+		symbol_conf.dso_list_str = NULL;
+		symbol_conf.sym_list_str = NULL;
+	} else if (strcmp(report.input_name, "-") != 0) {
+		setup_browser(true);
+	} else {
 		use_browser = 0;
+	}
 
 	/*
 	 * Only in the newt browser we are doing integrated annotation,
-- 
1.7.1



* [PATCH v4 14/18] perf: fix endianness detection in perf.data
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (12 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 13/18] perf: add support for taken branch sampling to perf report Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-30  5:55   ` Anshuman Khandual
  2012-01-27 20:56 ` [PATCH v4 15/18] perf: add ABI reference sizes Stephane Eranian
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

The current version of perf detects whether or not
the perf.data file is written in a different endianness
using the attr_size field in the header of the file. This
field represents sizeof(struct perf_event_attr) as known
to perf record. If the sizes do not match, then perf tries
the byte-swapped version. If the swapped value matches, then the
tool assumes a different endianness.

The issue with the approach is that it assumes the size of
perf_event_attr always has to match between perf record and
perf report. However, the kernel perf_event ABI is extensible.
New fields can be added to struct perf_event_attr. Consequently,
it is not possible to use attr_size to detect endianness.

This patch takes another approach by using the magic number
written at the beginning of the perf.data file to detect
endianness. The magic number is an eight-byte signature.
Its primary purpose is to identify a perf.data file, but it
can also be used to encode the endianness.

The patch introduces a new value for this signature. The key
difference is that the signature is written differently in
the file depending on the endianness. Thus, by comparing the
signature from the file with the tool's own signature it is
possible to detect endianness. The new signature is "PERFILE2".

Backward compatibility with existing perf.data files is
ensured.
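
As a minimal sketch of the idea described above (not the code in this
patch): the signature is stored as a native 64-bit value, so a reader
with the same byte order sees the same number and a reader with the
opposite byte order sees the byte-reversed number. The helper name
detect_endian() and the MAGIC2 macros are assumptions made for
illustration only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* "PERFILE2" as the writer stores it: a native u64, not a string */
#define MAGIC2    0x32454c4946524550ULL
/* the same eight file bytes as seen by a reader of opposite endianness */
#define MAGIC2_SW 0x50455246494c4532ULL

/*
 * Hypothetical helper: inspect the first eight bytes of a file.
 * Returns 0 and sets *needs_swap when the signature is recognised,
 * -1 otherwise (the legacy "PERFFILE" string is not handled here).
 */
static int detect_endian(const unsigned char first8[8], bool *needs_swap)
{
	uint64_t magic;

	memcpy(&magic, first8, sizeof(magic));

	if (magic == MAGIC2) {
		*needs_swap = false;	/* writer and reader agree */
		return 0;
	}
	if (magic == MAGIC2_SW) {
		*needs_swap = true;	/* writer had the other byte order */
		return 0;
	}
	return -1;
}
```

The comparison is done on integers rather than with memcmp() against a
string, which is what lets the byte order of the writer show through.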

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/header.c |   77 ++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index ecd7f4d..6f4187d 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -63,9 +63,20 @@ char *perf_header__find_event(u64 id)
 	return NULL;
 }
 
-static const char *__perf_magic = "PERFFILE";
+/*
+ * magic2 = "PERFILE2"
+ * must be a numerical value to let the endianness
+ * determine the memory layout. That way we are able
+ * to detect endianness when reading the perf.data file
+ * back.
+ *
+ * we check for legacy (PERFFILE) format.
+ */
+static const char *__perf_magic1 = "PERFFILE";
+static const u64 __perf_magic2    = 0x32454c4946524550ULL;
+static const u64 __perf_magic2_sw = 0x50455246494c4532ULL;
 
-#define PERF_MAGIC	(*(u64 *)__perf_magic)
+#define PERF_MAGIC	__perf_magic2
 
 struct perf_file_attr {
 	struct perf_event_attr	attr;
@@ -1620,24 +1631,59 @@ int perf_header__process_sections(struct perf_header *header, int fd,
 	return err;
 }
 
+static int check_magic_endian(u64 *magic, struct perf_file_header *header,
+			      struct perf_header *ph)
+{
+	int ret;
+
+	/* check for legacy format */
+	ret = memcmp(magic, __perf_magic1, sizeof(*magic));
+	if (ret == 0) {
+		pr_debug("legacy perf.data format\n");
+		if (!header)
+			return -1;
+
+		if (header->attr_size != sizeof(struct perf_file_attr)) {
+			u64 attr_size = bswap_64(header->attr_size);
+
+			if (attr_size != sizeof(struct perf_file_attr))
+				return -1;
+
+			ph->needs_swap = true;
+		}
+		return 0;
+	}
+
+	/* check magic number with same endianness */
+	if (*magic == __perf_magic2)
+		return 0;
+
+	/* check magic number but opposite endianness */
+	if (*magic != __perf_magic2_sw)
+		return -1;
+
+	ph->needs_swap = true;
+
+	return 0;
+}
+
 int perf_file_header__read(struct perf_file_header *header,
 			   struct perf_header *ph, int fd)
 {
+	int ret;
+
 	lseek(fd, 0, SEEK_SET);
 
-	if (readn(fd, header, sizeof(*header)) <= 0 ||
-	    memcmp(&header->magic, __perf_magic, sizeof(header->magic)))
+	ret = readn(fd, header, sizeof(*header));
+	if (ret <= 0)
 		return -1;
 
-	if (header->attr_size != sizeof(struct perf_file_attr)) {
-		u64 attr_size = bswap_64(header->attr_size);
-
-		if (attr_size != sizeof(struct perf_file_attr))
-			return -1;
+	if (check_magic_endian(&header->magic, header, ph) < 0)
+		return -1;
 
+	if (ph->needs_swap) {
 		mem_bswap_64(header, offsetof(struct perf_file_header,
-					    adds_features));
-		ph->needs_swap = true;
+			     adds_features));
 	}
 
 	if (header->size != sizeof(*header)) {
@@ -1873,8 +1919,13 @@ static int perf_file_header__read_pipe(struct perf_pipe_file_header *header,
 				       struct perf_header *ph, int fd,
 				       bool repipe)
 {
-	if (readn(fd, header, sizeof(*header)) <= 0 ||
-	    memcmp(&header->magic, __perf_magic, sizeof(header->magic)))
+	int ret;
+
+	ret = readn(fd, header, sizeof(*header));
+	if (ret <= 0)
+		return -1;
+
+	 if (check_magic_endian(&header->magic, NULL, ph) < 0)
 		return -1;
 
 	if (repipe && do_write(STDOUT_FILENO, header, sizeof(*header)) < 0)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v4 15/18] perf: add ABI reference sizes
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (13 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 14/18] perf: fix endianness detection in perf.data Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 16/18] perf: enable reading of perf.data files from different ABI rev Stephane Eranian
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch adds reference sizes for revision 1
and 2 of the perf_event ABI, i.e., the size of
the perf_event_attr struct.

With Rev1: config2 was added = +8 bytes
With Rev2: branch_sample_type was added = +8 bytes

Adds the definition for Rev1, Rev2.

This is useful for tools trying to decode the revision
numbers based on the size of the struct.
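
A tool could use these sizes roughly as follows. This is only a sketch:
the helper name attr_size_to_abi_rev() is hypothetical, and the sizes
are redefined locally so the fragment stands alone.

```c
#include <stddef.h>

/* local copies of the reference sizes added by this patch */
#define ATTR_SIZE_VER0	64	/* sizeof first published struct */
#define ATTR_SIZE_VER1	72	/* add: config2 */
#define ATTR_SIZE_VER2	80	/* add: branch_sample_type */

/*
 * Hypothetical helper: map a perf_event_attr size to an ABI revision.
 * Returns -1 for a size that matches no known revision.
 */
static int attr_size_to_abi_rev(size_t sz)
{
	switch (sz) {
	case ATTR_SIZE_VER0:
		return 0;
	case ATTR_SIZE_VER1:
		return 1;
	case ATTR_SIZE_VER2:
		return 2;
	default:
		return -1;
	}
}
```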

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 include/linux/perf_event.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6d9d712..91452ed 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -195,6 +195,8 @@ enum perf_event_read_format {
 };
 
 #define PERF_ATTR_SIZE_VER0	64	/* sizeof first published struct */
+#define PERF_ATTR_SIZE_VER1	72	/* add: config2 */
+#define PERF_ATTR_SIZE_VER2	80	/* add: branch_sample_type */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v4 16/18] perf: enable reading of perf.data files from different ABI rev
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (14 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 15/18] perf: add ABI reference sizes Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 17/18] perf: fix bug print_event_desc() Stephane Eranian
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch allows perf to process perf.data files generated
using an ABI that has a different perf_event_attr struct size, i.e.,
a different ABI version.

The perf_event_attr can be extended, yet perf needs to cope with
older perf.data files. Similarly, perf must be able to cope with
a perf.data file which is using a newer version of the ABI than
what it knows about.

This patch adds read_attr(), a routine that reads a perf_event_attr
struct from a file incrementally based on its advertised size. If
the on-file struct is smaller than what perf knows, then the extra
fields are zeroed. If the on-file struct is bigger, then perf only
uses what it knows about, the rest is skipped.
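
The zero-fill and skip rule can be sketched on a memory buffer instead
of a file descriptor. This is an illustration under that assumption,
not read_attr() itself; copy_versioned() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy an on-file struct of on_file_sz bytes into
 * a local struct of our_sz bytes.
 * - on-file struct smaller: the fields it lacks stay zeroed
 * - on-file struct bigger: only the part we know about is copied
 * Returns the number of trailing on-file bytes the caller must skip.
 */
static size_t copy_versioned(void *dst, size_t our_sz,
			     const void *src, size_t on_file_sz)
{
	size_t n = on_file_sz < our_sz ? on_file_sz : our_sz;

	memset(dst, 0, our_sz);
	memcpy(dst, src, n);

	return on_file_sz - n;
}
```

With a file descriptor, the returned count would be the distance to
seek forward after the known fields have been read.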

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/header.c |   49 ++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 6f4187d..8d6c18d 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1959,6 +1959,51 @@ static int perf_header__read_pipe(struct perf_session *session, int fd)
 	return 0;
 }
 
+static int read_attr(int fd, struct perf_header *ph,
+		     struct perf_file_attr *f_attr)
+{
+	struct perf_event_attr *attr = &f_attr->attr;
+	size_t sz, left;
+	size_t our_sz = sizeof(f_attr->attr);
+	int ret;
+
+	memset(f_attr, 0, sizeof(*f_attr));
+
+	/* read minimal guaranteed structure */
+	ret = readn(fd, attr, PERF_ATTR_SIZE_VER0);
+	if (ret <= 0)
+		return -1;
+
+	/* on file perf_event_attr size */
+	sz = attr->size;
+	if (ph->needs_swap)
+		sz = bswap_32(sz);
+
+	if (sz == 0) {
+		/* assume ABI0 */
+		sz =  PERF_ATTR_SIZE_VER0;
+	} else if (sz > our_sz) {
+		/* bigger than what we know about */
+		sz = our_sz;
+
+		/* skip what we do not know about */
+		lseek(fd, attr->size - our_sz, SEEK_CUR);
+	}
+	/* what we have not yet read and that we know about */
+	left = sz - PERF_ATTR_SIZE_VER0;
+	if (left) {
+		void *ptr = attr;
+		ptr += PERF_ATTR_SIZE_VER0;
+
+		ret = readn(fd, ptr, left);
+		if (ret <= 0)
+			return -1;
+	}
+	/* read the ids */
+	ret = readn(fd, &f_attr->ids, sizeof(struct perf_file_section));
+	return ret <= 0 ? -1 : 0;
+}
+
 int perf_session__read_header(struct perf_session *session, int fd)
 {
 	struct perf_header *header = &session->header;
@@ -1979,14 +2024,14 @@ int perf_session__read_header(struct perf_session *session, int fd)
 		return -EINVAL;
 	}
 
-	nr_attrs = f_header.attrs.size / sizeof(f_attr);
+	nr_attrs = f_header.attrs.size / f_header.attr_size;
 	lseek(fd, f_header.attrs.offset, SEEK_SET);
 
 	for (i = 0; i < nr_attrs; i++) {
 		struct perf_evsel *evsel;
 		off_t tmp;
 
-		if (readn(fd, &f_attr, sizeof(f_attr)) <= 0)
+		if (read_attr(fd, header, &f_attr) < 0)
 			goto out_errno;
 
 		if (header->needs_swap)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v4 17/18] perf: fix bug print_event_desc()
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (15 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 16/18] perf: enable reading of perf.data files from different ABI rev Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-27 20:56 ` [PATCH v4 18/18] perf: make perf able to read file from older ABIs Stephane Eranian
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch cleans up local variable types for msz and ret.
They need to be size_t and ssize_t respectively.

It also fixes a bug whereby perf would not read attr struct
with a different size than what it knows about.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/header.c |   19 +++++++++----------
 1 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 8d6c18d..1fb365d 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1144,8 +1144,9 @@ static void print_event_desc(struct perf_header *ph, int fd, FILE *fp)
 	uint64_t id;
 	void *buf = NULL;
 	char *str;
-	u32 nre, sz, nr, i, j, msz;
-	int ret;
+	u32 nre, sz, nr, i, j;
+	ssize_t ret;
+	size_t msz;
 
 	/* number of events */
 	ret = read(fd, &nre, sizeof(nre));
@@ -1162,25 +1163,23 @@ static void print_event_desc(struct perf_header *ph, int fd, FILE *fp)
 	if (ph->needs_swap)
 		sz = bswap_32(sz);
 
-	/*
-	 * ensure it is at least to our ABI rev
-	 */
-	if (sz < (u32)sizeof(attr))
-		goto error;
-
 	memset(&attr, 0, sizeof(attr));
 
-	/* read entire region to sync up to next field */
+	/* buffer to hold on file attr struct */
 	buf = malloc(sz);
 	if (!buf)
 		goto error;
 
 	msz = sizeof(attr);
-	if (sz < msz)
+	if (sz < (ssize_t)msz)
 		msz = sz;
 
 	for (i = 0 ; i < nre; i++) {
 
+		/*
+		 * must read entire on-file attr struct to
+		 * sync up with layout.
+		 */
 		ret = read(fd, buf, sz);
 		if (ret != (ssize_t)sz)
 			goto error;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v4 18/18] perf: make perf able to read file from older ABIs
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (16 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 17/18] perf: fix bug print_event_desc() Stephane Eranian
@ 2012-01-27 20:56 ` Stephane Eranian
  2012-01-31  8:54   ` Anshuman Khandual
  2012-01-30  4:16 ` [PATCH v4 00/18] perf: add support for sampling taken branches Anshuman Khandual
  2012-02-01  8:41 ` Anshuman Khandual
  19 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2012-01-27 20:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, robert.richter, ming.m.lin, andi, asharma,
	ravitillo, vweaver1, khandual, dsahern

This patch provides a way to handle legacy perf.data
files.  Legacy files are those using the older PERFFILE
signature.

For those, it is still necessary to detect endianness but
without comparing their header->attr_size with the
tool's own version as it may be different. Instead, we use
a reference table for all known sizes from the legacy era.

We try all the combinations for sizes and endianness. If we find
a match, we proceed, otherwise we return: "incompatible file format".
This is also done for the pipe-mode file format.
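
The search over sizes and byte orders can be sketched as below. The
names match_legacy_size() and swap64() are hypothetical, and the table
of known sizes is passed in by the caller rather than taken from the
patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* portable 64-bit byte swap, used here instead of bswap_64() */
static uint64_t swap64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8) |
	    ((v >> 8) & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) |
	    ((v >> 16) & 0x0000ffff0000ffffULL);
	return (v << 32) | (v >> 32);
}

/*
 * Hypothetical helper: compare hdr_sz, as read from the file, against
 * every known reference size in both byte orders.
 * Returns the index of the matching entry and sets *needs_swap,
 * or -1 when no combination matches (incompatible file format).
 */
static int match_legacy_size(uint64_t hdr_sz, const uint64_t *known,
			     size_t nr, bool *needs_swap)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (hdr_sz == known[i]) {
			*needs_swap = false;
			return (int)i;
		}
		if (swap64(hdr_sz) == known[i]) {
			*needs_swap = true;
			return (int)i;
		}
	}
	return -1;
}
```

This only works because the reference sizes are small numbers: a small
value and its byte-swapped form cannot both appear in the table.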

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/header.c |  126 +++++++++++++++++++++++++++++++++++----------
 1 files changed, 98 insertions(+), 28 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 1fb365d..a15f451 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1630,35 +1630,102 @@ int perf_header__process_sections(struct perf_header *header, int fd,
 	return err;
 }
 
-static int check_magic_endian(u64 *magic, struct perf_file_header *header,
-			      struct perf_header *ph)
+static const int attr_file_abi_sizes[] = {
+	[0] = PERF_ATTR_SIZE_VER0,
+	[1] = PERF_ATTR_SIZE_VER1,
+	0,
+};
+
+/*
+ * In the legacy file format, the magic number is not used to encode endianness.
+ * hdr_sz was used to encode endianness. But given that hdr_sz can vary based
+ * on ABI revisions, we need to try all combinations for all endianness to
+ * detect the endianness.
+ */
+static int try_all_file_abis(uint64_t hdr_sz, struct perf_header *ph)
 {
-	int ret;
+	uint64_t ref_size, attr_size;
+	int i;
 
-	/* check for legacy format */
-	ret = memcmp(magic, __perf_magic1, sizeof(*magic));
-	if (ret == 0) {
-		pr_debug("legacy perf.data format\n");
-		if (!header)
-			return -1;
+	for (i = 0 ; attr_file_abi_sizes[i]; i++) {
+		ref_size = attr_file_abi_sizes[i]
+			 + sizeof(struct perf_file_section);
+		if (hdr_sz != ref_size) {
+			attr_size = bswap_64(hdr_sz);
+			if (attr_size != ref_size)
+				continue;
 
-		if (header->attr_size != sizeof(struct perf_file_attr)) {
-			u64 attr_size = bswap_64(header->attr_size);
+			ph->needs_swap = true;
+		}
+		pr_debug("ABI%d perf.data file detected, need_swap=%d\n",
+			 i,
+			 ph->needs_swap);
+		return 0;
+	}
+	/* could not determine endianness */
+	return -1;
+}
 
-			if (attr_size != sizeof(struct perf_file_attr))
-				return -1;
+#define PERF_PIPE_HDR_VER0	16
+
+static const size_t attr_pipe_abi_sizes[] = {
+	[0] = PERF_PIPE_HDR_VER0,
+	0,
+};
+
+/*
+ * In the legacy pipe format, there is an implicit assumption that endianness
+ * between host recording the samples, and host parsing the samples is the
+ * same. This is not always the case given that the pipe output may always be
+ * redirected into a file and analyzed on a different machine with possibly a
+ * different endianness and perf_event ABI revisions in the perf tool itself.
+ */
+static int try_all_pipe_abis(uint64_t hdr_sz, struct perf_header *ph)
+{
+	uint64_t ref_size;
+	int i;
+
+	for (i = 0 ; attr_pipe_abi_sizes[i]; i++) {
+		if (hdr_sz != attr_pipe_abi_sizes[i]) {
+			u64 attr_size = bswap_64(hdr_sz);
+
+			if (attr_size != attr_pipe_abi_sizes[i])
+				continue;
 
 			ph->needs_swap = true;
 		}
+		pr_debug("Pipe ABI%d perf.data file detected\n", i);
 		return 0;
 	}
+	return -1;
+}
 
-	/* check magic number with same endianness */
-	if (*magic == __perf_magic2)
+static int check_magic_endian(u64 magic, uint64_t hdr_sz,
+			      bool is_pipe, struct perf_header *ph)
+{
+	int ret;
+
+	/* check for legacy format */
+	ret = memcmp(&magic, __perf_magic1, sizeof(magic));
+	if (ret == 0) {
+		pr_debug("legacy perf.data format\n");
+		if (is_pipe)
+			return try_all_pipe_abis(hdr_sz, ph);
+
+		return try_all_file_abis(hdr_sz, ph);
+	}
+	/*
+	 * the new magic number serves two purposes:
+	 * - unique number to identify actual perf.data files
+	 * - encode endianness of file
+	 */
+
+	/* check magic number with one endianness */
+	if (magic == __perf_magic2)
 		return 0;
 
-	/* check magic number but opposite endianness */
-	if (*magic != __perf_magic2_sw)
+	/* check magic number with opposite endianness */
+	if (magic != __perf_magic2_sw)
 		return -1;
 
 	ph->needs_swap = true;
@@ -1677,8 +1744,11 @@ int perf_file_header__read(struct perf_file_header *header,
 	if (ret <= 0)
 		return -1;
 
-	if (check_magic_endian(&header->magic, header, ph) < 0)
+	if (check_magic_endian(header->magic,
+			       header->attr_size, false, ph) < 0) {
+		pr_debug("magic/endian check failed\n");
 		return -1;
+	}
 
 	if (ph->needs_swap) {
 		mem_bswap_64(header, offsetof(struct perf_file_header,
@@ -1924,21 +1994,17 @@ static int perf_file_header__read_pipe(struct perf_pipe_file_header *header,
 	if (ret <= 0)
 		return -1;
 
-	 if (check_magic_endian(&header->magic, NULL, ph) < 0)
+	if (check_magic_endian(header->magic, header->size, true, ph) < 0) {
+		pr_debug("endian/magic failed\n");
 		return -1;
+	}
+
+	if (ph->needs_swap)
+		header->size = bswap_64(header->size);
 
 	if (repipe && do_write(STDOUT_FILENO, header, sizeof(*header)) < 0)
 		return -1;
 
-	if (header->size != sizeof(*header)) {
-		u64 size = bswap_64(header->size);
-
-		if (size != sizeof(*header))
-			return -1;
-
-		ph->needs_swap = true;
-	}
-
 	return 0;
 }
 
@@ -1975,6 +2041,10 @@ static int read_attr(int fd, struct perf_header *ph,
 
 	/* on file perf_event_attr size */
 	sz = attr->size;
+	if (sz != our_sz)
+		pr_debug("on file attr=%zu vs. %zu bytes,"
+			 " ignoring extra fields\n", sz, our_sz);
+
 	if (ph->needs_swap)
 		sz = bswap_32(sz);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 09/18] perf: disable PERF_SAMPLE_BRANCH_* when not supported
  2012-01-27 20:56 ` [PATCH v4 09/18] perf: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
@ 2012-01-30  3:57   ` Anshuman Khandual
  0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2012-01-30  3:57 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> PERF_SAMPLE_BRANCH_* is disabled for:
> - SW events (sw counters, tracepoints)
> - HW breakpoints
> - ALL but Intel X86 architecture
> - AMD64 processors
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  arch/alpha/kernel/perf_event.c       |    4 ++++
>  arch/arm/kernel/perf_event.c         |    4 ++++
>  arch/mips/kernel/perf_event_mipsxx.c |    4 ++++
>  arch/powerpc/kernel/perf_event.c     |    4 ++++
>  arch/sh/kernel/perf_event.c          |    4 ++++
>  arch/sparc/kernel/perf_event.c       |    4 ++++
>  arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
>  kernel/events/core.c                 |   24 ++++++++++++++++++++++++
>  kernel/events/hw_breakpoint.c        |    6 ++++++
>  9 files changed, 57 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
> index 8143cd7..0dae252 100644
> --- a/arch/alpha/kernel/perf_event.c
> +++ b/arch/alpha/kernel/perf_event.c
> @@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
>  {
>  	int err;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_RAW:
>  	case PERF_TYPE_HARDWARE:
> diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
> index 5bb91bf..68bb0ce 100644
> --- a/arch/arm/kernel/perf_event.c
> +++ b/arch/arm/kernel/perf_event.c
> @@ -539,6 +539,10 @@ static int armpmu_event_init(struct perf_event *event)
>  	int err = 0;
>  	atomic_t *active_events = &armpmu->active_events;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_smpl(event))
Oh, is this function still present? I could not find it defined anywhere in the
patch set.
> +		return -EOPNOTSUPP;
> +
>  	if (armpmu->map_event(event) == -ENOENT)
>  		return -ENOENT;
> 
> diff --git a/arch/mips/kernel/perf_event_mipsxx.c b/arch/mips/kernel/perf_event_mipsxx.c
> index e3b897a..811084f 100644
> --- a/arch/mips/kernel/perf_event_mipsxx.c
> +++ b/arch/mips/kernel/perf_event_mipsxx.c
> @@ -606,6 +606,10 @@ static int mipspmu_event_init(struct perf_event *event)
>  {
>  	int err = 0;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_RAW:
>  	case PERF_TYPE_HARDWARE:
> diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
> index d614ab5..4e0b265 100644
> --- a/arch/powerpc/kernel/perf_event.c
> +++ b/arch/powerpc/kernel/perf_event.c
> @@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
>  	if (!ppmu)
>  		return -ENOENT;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_HARDWARE:
>  		ev = event->attr.config;
> diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
> index 10b14e3..068b8a2 100644
> --- a/arch/sh/kernel/perf_event.c
> +++ b/arch/sh/kernel/perf_event.c
> @@ -310,6 +310,10 @@ static int sh_pmu_event_init(struct perf_event *event)
>  {
>  	int err;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event->attr.type) {
>  	case PERF_TYPE_RAW:
>  	case PERF_TYPE_HW_CACHE:
> diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
> index 614da62..8e16a4a 100644
> --- a/arch/sparc/kernel/perf_event.c
> +++ b/arch/sparc/kernel/perf_event.c
> @@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
>  	if (atomic_read(&nmi_active) < 0)
>  		return -ENODEV;
> 
> +	/* does not support taken branch sampling */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (attr->type) {
>  	case PERF_TYPE_HARDWARE:
>  		if (attr->config >= sparc_pmu->max_events)
> diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
> index 0397b23..0d8da03 100644
> --- a/arch/x86/kernel/cpu/perf_event_amd.c
> +++ b/arch/x86/kernel/cpu/perf_event_amd.c
> @@ -138,6 +138,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
>  	if (ret)
>  		return ret;
> 
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	if (event->attr.exclude_host && event->attr.exclude_guest)
>  		/*
>  		 * When HO == GO == 1 the hardware treats that as GO == HO == 0
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c4520a2..431f7b4 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5007,6 +5007,12 @@ static int perf_swevent_init(struct perf_event *event)
>  	if (event->attr.type != PERF_TYPE_SOFTWARE)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for software events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	switch (event_id) {
>  	case PERF_COUNT_SW_CPU_CLOCK:
>  	case PERF_COUNT_SW_TASK_CLOCK:
> @@ -5117,6 +5123,12 @@ static int perf_tp_event_init(struct perf_event *event)
>  	if (event->attr.type != PERF_TYPE_TRACEPOINT)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for tracepoint events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	err = perf_trace_init(event);
>  	if (err)
>  		return err;
> @@ -5342,6 +5354,12 @@ static int cpu_clock_event_init(struct perf_event *event)
>  	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for software events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	perf_swevent_init_hrtimer(event);
> 
>  	return 0;
> @@ -5416,6 +5434,12 @@ static int task_clock_event_init(struct perf_event *event)
>  	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for software events
> +	 */
> +	if (has_branch_stack(event))
> +		return -EOPNOTSUPP;
> +
>  	perf_swevent_init_hrtimer(event);
> 
>  	return 0;
> diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
> index b0309f7..cee5423 100644
> --- a/kernel/events/hw_breakpoint.c
> +++ b/kernel/events/hw_breakpoint.c
> @@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
>  	if (bp->attr.type != PERF_TYPE_BREAKPOINT)
>  		return -ENOENT;
> 
> +	/*
> +	 * no branch sampling for breakpoint events
> +	 */
> +	if (has_branch_stack(bp))
> +		return -EOPNOTSUPP;
> +
>  	err = register_perf_hw_breakpoint(bp);
>  	if (err)
>  		return err;


-- 
Anshuman Khandual
Linux Technology Centre
IBM Systems and Technology Group


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 00/18] perf: add support for sampling taken branches
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (17 preceding siblings ...)
  2012-01-27 20:56 ` [PATCH v4 18/18] perf: make perf able to read file from older ABIs Stephane Eranian
@ 2012-01-30  4:16 ` Anshuman Khandual
  2012-01-30 10:15   ` Stephane Eranian
  2012-02-01  8:41 ` Anshuman Khandual
  19 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2012-01-30  4:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
> 
> Statistical sampling of taken branches should not be confused
> with branch tracing. Not all branches are necessarily captured.
> 
> Sampling taken branches is important for basic block profiling,
> statistical call graph, function call counts. Many of those
> measurements can help drive a compiler optimizer.
> 
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
> 
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
> 
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
> 
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
> 
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
> 
> Branch sampling is always coupled with an event. It can
> be any PMU event but it can't be a SW or tracepoint event.
> 
> Branch sampling is requested by setting a new sample_type
> flag called: PERF_SAMPLE_BRANCH_STACK.
> 
> To support branch filtering, we introduce a new field
> to the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1, config2 field because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
> 
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : branches when target is at kernel level
> - PERF_SAMPLE_BRANCH_HV      : branches when target is at hypervisor level
> - PERF_SAMPLE_BRANCH_ANY_CALL: call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: indirect calls
> 
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
> 
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
> 
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
> 
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filtering restrictions. When precise_sampling=1, then
> there are no filtering restrictions. When precise_sampling > 1, 
> then only ANY|USER|KERNEL filter can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
> 
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly perf report is enhanced to display a histogram
> of taken branches.
> 
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
> 
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
> 
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>   if (n & 1UL)
>     f2();
>   else
>     f3();
> }
> int main(void)
> {
>   unsigned long i;
> 
>   for (i=0; i < N; i++)
>    f1(i);
>   return 0;
> }
> 
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
> 
>     18.13%  [.] f1            [.] main                          
>     18.10%  [.] main          [.] main                          
>     18.01%  [.] main          [.] f1                            
>     15.69%  [.] f1            [.] f1                            
>      9.11%  [.] f3            [.] f1                            
>      6.78%  [.] f1            [.] f3                            
>      6.74%  [.] f1            [.] f2                            
>      6.71%  [.] f2            [.] f1                            
> 
> Of the total number of branches captured, 18.13% were from f1() -> main().
> 
> Let's make this clearer by filtering the user call branches only:
> 
> $ perf record -b any_call -e cycles:u branchy
> $ perf report -b
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>     52.50%  [.] main                   [.] f1                   
>     23.99%  [.] f1                     [.] f3                   
>     23.48%  [.] f1                     [.] f2                   
>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>      0.01%  [k] _start                 [k] __libc_start_main    
> 
> Now it is more obvious: 52% of all the captured branches were calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
> that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
> 
> 
> Here is a kernel example, where we want to sample indirect calls:
> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10 
> $ perf report -b
> #
> # Overhead  Source Symbol               Target Symbol
> # ........  ..........................  ..........................
> #
>     36.36%  [k] __delay                 [k] delay_tsc             
>      9.09%  [k] ktime_get               [k] read_tsc              
>      9.09%  [k] getnstimeofday          [k] read_tsc              
>      9.09%  [k] notifier_call_chain     [k] tick_notify           
>      4.55%  [k] cpuidle_idle_call       [k] intel_idle            
>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect          
>      2.27%  [k] handle_irq              [k] handle_edge_irq       
>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write 
>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt     
>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn     
>      2.27%  [k] enqueue_task            [k] enqueue_task_rt       
>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt     
>      2.27%  [k] do_timer                [k] read_tsc              
> 
> Due to HW limitations, branch filtering may be approximate on
> Core, Atom processors. It is more accurate on Nehalem, Westmere
> and best on Sandy Bridge.
> 
> In version 2, we've updated the patch to tip/master (commit 5734857) and
> we've incorporated the feedback from v1 concerning the anonymous bitfield
> struct for branch_stack_entry and the handling of i386 ABI binaries
> on 64-bit hosts in the instruction decoder for the LBR SW filter.
> 
> In version 3, we've updated to 3.2.0-tip. The Atom revision
> check has been put into its own patch. We fixed a browser
> issue with perf report. We fixed all the style issues as well.
> 
> In version 4, we've modified the branch stack API to add a missing
> privilege level: hypervisor. There is a new PERF_SAMPLE_BRANCH_HV. It
> is not used on Intel X86. Thanks to khandual@linux.vnet.ibm.com
> for pointing this out. We also fixed a compilation error on ARM.
> 
> In version 4, we also extend the patch to include the changes necessary
> to the perf tool to support reading perf.data files which were produced
> from older perf_event ABI revisions. This patch set extends the ABI
> with a new field in struct perf_event_attr. That struct is saved as
> is in the perf.data file. Therefore, older perf.data files contain
> smaller perf_event_attr struct, yet perf must process them transparently.
> That's not the case today. It dies with 'incompatible file format'.
> 
> The patch solves this problem and, at the same time, decouples endianness
> detection from the size of perf_event_attr. Endianness is now detected via
> the signature (the first 8 bytes of the file). We introduce a new signature
> (PERFILE2). It is not laid out the same way in the file based on the endianness
So as the perf subsystem evolves and we modify the perf_event_attr structure, a new
'PERFFILE<N>' marker is generated (for perf subsystem version N) and placed in perf.data.
The perf tools would then be modified to distinguish between the various versions of the
perf.data file and process them accordingly. Sounds good! PERFFILE, PERFFILE2, PERFFILE3...
We should have named PERFFILE as PERFILE1, expecting this to happen one day :)
> of the host where the file is written. Therefore, we can dynamically detect
> the endianness by simply reading the first 8 bytes. The size of the
> perf_event_attr struct can then be processed according to the endianness.
> The ambiguity between the size being at the same time, the endianness marker
> and the actual size is gone. We can now distinguish an older ABI by the size
> and not confuse it with an endianness mismatch.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> 
> 
> Roberto Agostino Vitillo (3):
>   perf: add code to support PERF_SAMPLE_BRANCH_STACK
>   perf: add support for sampling taken branch to perf record
>   perf: add support for taken branch sampling to perf report
> 
> Stephane Eranian (15):
>   perf: add generic taken branch sampling support
>   perf: add Intel LBR MSR definitions
>   perf: add Intel X86 LBR sharing logic
>   perf: sync branch stack sampling with X86 precise_sampling
>   perf: add LBR mappings for PERF_SAMPLE_BRANCH filters
>   perf: disable LBR support for older Intel Atom processors
>   perf: implement PERF_SAMPLE_BRANCH for Intel X86
>   perf: add LBR software filter support for Intel X86
>   perf: disable PERF_SAMPLE_BRANCH_* when not supported
>   perf: add hook to flush branch_stack on context switch
>   perf: fix endianness detection in perf.data
>   perf: add ABI reference sizes
>   perf: enable reading of perf.data files from different ABI rev
>   perf: fix bug print_event_desc()
>   perf: make perf able to read file from older ABIs
> 
>  arch/alpha/kernel/perf_event.c             |    4 +
>  arch/arm/kernel/perf_event.c               |    4 +
>  arch/mips/kernel/perf_event_mipsxx.c       |    4 +
>  arch/powerpc/kernel/perf_event.c           |    4 +
>  arch/sh/kernel/perf_event.c                |    4 +
>  arch/sparc/kernel/perf_event.c             |    4 +
>  arch/x86/include/asm/msr-index.h           |    7 +
>  arch/x86/kernel/cpu/perf_event.c           |   47 ++-
>  arch/x86/kernel/cpu/perf_event.h           |   19 +
>  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
>  arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  532 ++++++++++++++++++++++++++--
>  include/linux/perf_event.h                 |   82 ++++-
>  kernel/events/core.c                       |  177 +++++++++
>  kernel/events/hw_breakpoint.c              |    6 +
>  tools/perf/Documentation/perf-record.txt   |   25 ++
>  tools/perf/Documentation/perf-report.txt   |    7 +
>  tools/perf/builtin-record.c                |   74 ++++
>  tools/perf/builtin-report.c                |   98 +++++-
>  tools/perf/perf.h                          |   18 +
>  tools/perf/util/annotate.c                 |    2 +-
>  tools/perf/util/event.h                    |    1 +
>  tools/perf/util/evsel.c                    |   14 +
>  tools/perf/util/header.c                   |  231 +++++++++++--
>  tools/perf/util/hist.c                     |   93 ++++-
>  tools/perf/util/hist.h                     |    7 +
>  tools/perf/util/session.c                  |   72 ++++
>  tools/perf/util/session.h                  |    4 +
>  tools/perf/util/sort.c                     |  362 ++++++++++++++-----
>  tools/perf/util/sort.h                     |    5 +
>  tools/perf/util/symbol.h                   |   13 +
>  32 files changed, 1835 insertions(+), 230 deletions(-)
> 


-- 
Anshuman Khandual
Linux Technology Centre
IBM Systems and Technology Group



* Re: [PATCH v4 14/18] perf: fix endianness detection in perf.data
  2012-01-27 20:56 ` [PATCH v4 14/18] perf: fix endianness detection in perf.data Stephane Eranian
@ 2012-01-30  5:55   ` Anshuman Khandual
  0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2012-01-30  5:55 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> The current version of perf detects whether or not
> the perf.data file is written in a different endianness
> using the attr_size field in the header of the file. This
> field represents sizeof(struct perf_event_attr) as known
> to perf record. If the sizes do not match, then perf tries
> the byte-swapped version. If they match, then the tool assumes
> a different endianness.
> 
> The issue with the approach is that it assumes the size of
> perf_event_attr always has to match between perf record and
> perf report. However, the kernel perf_event ABI is extensible.
> New fields can be added to struct perf_event_attr. Consequently,
> it is not possible to use attr_size to detect endianness.
> 
> This patch takes another approach by using the magic number
> written at the beginning of the perf.data file to detect
> endianness. The magic number is an eight-byte signature.
> Its primary purpose is to identify (signature) a perf.data
> file. But it could also be used to encode the endianness.
> 
> The patch introduces a new value for this signature. The key
> difference is that the signature is written differently in
> the file depending on the endianness. Thus, by comparing the
> signature from the file with the tool's own signature it is
> possible to detect endianness. The new signature is "PERFILE2".
> 
> Backward compatibility with existing perf.data file is
> ensured.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/util/header.c |   77 ++++++++++++++++++++++++++++++++++++++--------
>  1 files changed, 64 insertions(+), 13 deletions(-)
> 
> diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
> index ecd7f4d..6f4187d 100644
> --- a/tools/perf/util/header.c
> +++ b/tools/perf/util/header.c
> @@ -63,9 +63,20 @@ char *perf_header__find_event(u64 id)
>  	return NULL;
>  }
> 
> -static const char *__perf_magic = "PERFFILE";
> +/*
> + * magic2 = "PERFILE2"
> + * must be a numerical value to let the endianness
> + * determine the memory layout. That way we are able
> + * to detect endianness when reading the perf.data file
> + * back.
> + *
> + * we check for legacy (PERFFILE) format.
> + */
> +static const char *__perf_magic1 = "PERFFILE";
> +static const u64 __perf_magic2    = 0x32454c4946524550ULL;
> +static const u64 __perf_magic2_sw = 0x50455246494c4532ULL;
In the perf context, the name '__perf_magic2_sw' (I guess 'sw' stands for switch)
sounds like something related to SW events. Could we change it to something like
'__perf_magic2_revend' or simply '__perf_magic2_rev', which would mean reverse endianness?
> 
> -#define PERF_MAGIC	(*(u64 *)__perf_magic)
> +#define PERF_MAGIC	__perf_magic2
> 
>  struct perf_file_attr {
>  	struct perf_event_attr	attr;
> @@ -1620,24 +1631,59 @@ int perf_header__process_sections(struct perf_header *header, int fd,
>  	return err;
>  }
> 
> +static int check_magic_endian(u64 *magic, struct perf_file_header *header,
> +			      struct perf_header *ph)
> +{
> +	int ret;
> +
> +	/* check for legacy format */
> +	ret = memcmp(magic, __perf_magic1, sizeof(*magic));
> +	if (ret == 0) {
> +		pr_debug("legacy perf.data format\n");
> +		if (!header)
> +			return -1;
> +
> +		if (header->attr_size != sizeof(struct perf_file_attr)) {
> +			u64 attr_size = bswap_64(header->attr_size);
> +
> +			if (attr_size != sizeof(struct perf_file_attr))
> +				return -1;
> +
> +			ph->needs_swap = true;
> +		}
> +		return 0;
> +	}
> +
> +	/* check magic number with same endianness */
> +	if (*magic == __perf_magic2)
> +		return 0;
> +
> +	/* check magic number but opposite endianness */
> +	if (*magic != __perf_magic2_sw)
> +		return -1;
> +
> +	ph->needs_swap = true;
> +
> +	return 0;
> +}
> +
>  int perf_file_header__read(struct perf_file_header *header,
>  			   struct perf_header *ph, int fd)
>  {
> +	int ret;
> +
>  	lseek(fd, 0, SEEK_SET);
> 
> -	if (readn(fd, header, sizeof(*header)) <= 0 ||
> -	    memcmp(&header->magic, __perf_magic, sizeof(header->magic)))
> +	ret = readn(fd, header, sizeof(*header));
> +	if (ret <= 0)
>  		return -1;
> 
> -	if (header->attr_size != sizeof(struct perf_file_attr)) {
> -		u64 attr_size = bswap_64(header->attr_size);
> -
> -		if (attr_size != sizeof(struct perf_file_attr))
> -			return -1;
> +	if (check_magic_endian(&header->magic, header, ph) < 0)
> +		return -1;
> 
> +	if (ph->needs_swap) {
>  		mem_bswap_64(header, offsetof(struct perf_file_header,
> -					    adds_features));
> -		ph->needs_swap = true;
> +			     adds_features));
>  	}
> 
>  	if (header->size != sizeof(*header)) {
> @@ -1873,8 +1919,13 @@ static int perf_file_header__read_pipe(struct perf_pipe_file_header *header,
>  				       struct perf_header *ph, int fd,
>  				       bool repipe)
>  {
> -	if (readn(fd, header, sizeof(*header)) <= 0 ||
> -	    memcmp(&header->magic, __perf_magic, sizeof(header->magic)))
> +	int ret;
> +
> +	ret = readn(fd, header, sizeof(*header));
> +	if (ret <= 0)
> +		return -1;
> +
> +	 if (check_magic_endian(&header->magic, NULL, ph) < 0)
>  		return -1;
> 
>  	if (repipe && do_write(STDOUT_FILENO, header, sizeof(*header)) < 0)


-- 
Anshuman Khandual
Linux Technology Centre
IBM Systems and Technology Group



* Re: [PATCH v4 00/18] perf: add support for sampling taken branches
  2012-01-30  4:16 ` [PATCH v4 00/18] perf: add support for sampling taken branches Anshuman Khandual
@ 2012-01-30 10:15   ` Stephane Eranian
  0 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-30 10:15 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

[repost due to stupid MIME encoding]

On Mon, Jan 30, 2012 at 5:16 AM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
>
> On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> > This patchset adds an important and useful new feature to
> > perf_events: branch stack sampling. In other words, the
> > ability to capture taken branches into each sample.
> >
> > Statistical sampling of taken branch should not be confused
> > for branch tracing. Not all branches are necessarily captured
> >
> > Sampling taken branches is important for basic block profiling,
> > statistical call graph, function call counts. Many of those
> > measurements can help drive a compiler optimizer.
> >
> > The branch stack is a software abstraction which sits on top
> > of the PMU hardware. As such, it is not available on all
> > processors. For now, the patch provides the generic interface
> > and the Intel X86 implementation where it leverages the Last
> > Branch Record (LBR) feature (from Core2 to SandyBridge).
> >
> > Branch stack sampling is supported for both per-thread and
> > system-wide modes.
> >
> > It is possible to filter the type and privilege level of branches
> > to sample. The target of the branch is used to determine
> > the privilege level.
> >
> > For each branch, the source and destination are captured. On
> > some hardware platforms, it may be possible to also extract
> > the target prediction and, in that case, it is also exposed
> > to end users.
> >
> > The branch stack can record a variable number of taken
> > branches per sample. Those branches are always consecutive
> > in time. The number of branches captured depends on the
> > filtering and the underlying hardware. On Intel Nehalem
> > and later, up to 16 consecutive branches can be captured
> > per sample.
> >
> > Branch sampling is always coupled with an event. It can
> > be any PMU event but it can't be a SW or tracepoint event.
> >
> > Branch sampling is requested by setting a new sample_type
> > flag called: PERF_SAMPLE_BRANCH_STACK.
> >
> > To support branch filtering, we introduce a new field
> > to the perf_event_attr struct: branch_sample_type. We chose
> > NOT to overload the config1, config2 field because those
> > are related to the event encoding. Branch stack is a
> > separate feature which is combined with the event.
> >
> > The branch_sample_type is a bitmask of possible filters.
> > The following filters are defined (more can be added):
> > - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> > - PERF_SAMPLE_BRANCH_USER    : branches when target is at user level
> > - PERF_SAMPLE_BRANCH_KERNEL  : branches when target is at kernel level
> > - PERF_SAMPLE_BRANCH_HV      : branches when target is at hypervisor level
> > - PERF_SAMPLE_BRANCH_ANY_CALL: call branches (incl. syscalls)
> > - PERF_SAMPLE_BRANCH_ANY_RET : return branches (incl. syscall returns)
> > - PERF_SAMPLE_BRANCH_IND_CALL: indirect calls
> >
> > It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
> >
> > When the privilege level is not specified, the branch stack
> > inherits that of the associated event.
> >
> > Some processors may not offer hardware branch filtering, e.g., Intel
> > Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> > X86 implementation in this patchset also provides a SW branch filter
> > which works on a best effort basis. It can compensate for the lack
> > of LBR filtering. But first and foremost, it helps work around LBR
> > filtering errata. The goal is to only capture the type of branches
> > requested by the user.
> >
> > It is possible to combine branch stack sampling with PEBS on Intel
> > X86 processors. Depending on the precise_sampling mode, there are
> > certain filtering restrictions. When precise_sampling=1, then
> > there are no filtering restrictions. When precise_sampling > 1,
> > then only ANY|USER|KERNEL filter can be used. This comes from
> > the fact that the kernel uses LBR to compensate for the PEBS
> > off-by-1 skid on the instruction pointer.
> >
> > To demonstrate how the perf_event branch stack sampling interface
> > works, the patchset also modifies perf record to capture taken
> > branches. Similarly perf report is enhanced to display a histogram
> > of taken branches.
> >
> > I would like to thank Roberto Vitillo @ LBL for his work on the perf
> > tool for this.
> >
> > Enough talking, let's take a simple example. Our trivial test program
> > goes like this:
> >
> > void f2(void)
> > {}
> > void f3(void)
> > {}
> > void f1(unsigned long n)
> > {
> >   if (n & 1UL)
> >     f2();
> >   else
> >     f3();
> > }
> > int main(void)
> > {
> >   unsigned long i;
> >
> >   for (i=0; i < N; i++)
> >    f1(i);
> >   return 0;
> > }
> >
> > $ perf record -b any branchy
> > $ perf report -b
> > # Events: 23K cycles
> > #
> > # Overhead  Source Symbol     Target Symbol
> > # ........  ................  ................
> >
> >     18.13%  [.] f1            [.] main
> >     18.10%  [.] main          [.] main
> >     18.01%  [.] main          [.] f1
> >     15.69%  [.] f1            [.] f1
> >      9.11%  [.] f3            [.] f1
> >      6.78%  [.] f1            [.] f3
> >      6.74%  [.] f1            [.] f2
> >      6.71%  [.] f2            [.] f1
> >
> > Of the total number of branches captured, 18.13% were from f1() -> main().
> >
> > Let's make this clearer by filtering the user call branches only:
> >
> > $ perf record -b any_call -e cycles:u branchy
> > $ perf report -b
> > # Events: 19K cycles
> > #
> > # Overhead  Source Symbol              Target Symbol
> > # ........  .........................  .........................
> > #
> >     52.50%  [.] main                   [.] f1
> >     23.99%  [.] f1                     [.] f3
> >     23.48%  [.] f1                     [.] f2
> >      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
> >      0.01%  [k] _start                 [k] __libc_start_main
> >
> > Now it is more obvious: 52% of all the captured branches were calls from main() -> f1().
> > The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
> > that f1() dispatches based on odd vs. even values of n, which is constantly increasing.
> >
> >
> > Here is a kernel example, where we want to sample indirect calls:
> > $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
> > $ perf report -b
> > #
> > # Overhead  Source Symbol               Target Symbol
> > # ........  ..........................  ..........................
> > #
> >     36.36%  [k] __delay                 [k] delay_tsc
> >      9.09%  [k] ktime_get               [k] read_tsc
> >      9.09%  [k] getnstimeofday          [k] read_tsc
> >      9.09%  [k] notifier_call_chain     [k] tick_notify
> >      4.55%  [k] cpuidle_idle_call       [k] intel_idle
> >      4.55%  [k] cpuidle_idle_call       [k] menu_reflect
> >      2.27%  [k] handle_irq              [k] handle_edge_irq
> >      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
> >      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
> >      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
> >      2.27%  [k] enqueue_task            [k] enqueue_task_rt
> >      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
> >      2.27%  [k] do_timer                [k] read_tsc
> >
> > Due to HW limitations, branch filtering may be approximate on
> > Core, Atom processors. It is more accurate on Nehalem, Westmere
> > and best on Sandy Bridge.
> >
> > In version 2, we've updated the patch to tip/master (commit 5734857) and
> > we've incorporated the feedback from v1 concerning the anonymous bitfield
> > struct for branch_stack_entry and the handling of i386 ABI binaries
> > on 64-bit hosts in the instruction decoder for the LBR SW filter.
> >
> > In version 3, we've updated to 3.2.0-tip. The Atom revision
> > check has been put into its own patch. We fixed a browser
> > issue with perf report. We fixed all the style issues as well.
> >
> > In version 4, we've modified the branch stack API to add a missing
> > privilege level: hypervisor. There is a new PERF_SAMPLE_BRANCH_HV. It
> > is not used on Intel X86. Thanks to khandual@linux.vnet.ibm.com
> > for pointing this out. We also fixed a compilation error on ARM.
> >
> > In version 4, we also extend the patch to include the changes necessary
> > to the perf tool to support reading perf.data files which were produced
> > from older perf_event ABI revisions. This patch set extends the ABI
> > with a new field in struct perf_event_attr. That struct is saved as
> > is in the perf.data file. Therefore, older perf.data files contain
> > smaller perf_event_attr struct, yet perf must process them transparently.
> > That's not the case today. It dies with 'incompatible file format'.
> >
> > The patch solves this problem and, at the same time, decouples endianness
> > detection from the size of perf_event_attr. Endianness is now detected via
> > the signature (the first 8 bytes of the file). We introduce a new signature
> > (PERFILE2). It is not laid out the same way in the file based on the endianness
> So as the perf subsystem evolves and we modify the perf_event_attr structure, a new
> 'PERFFILE<N>' marker is generated (for perf subsystem version N) and placed in perf.data.
> The perf tools would then be modified to distinguish between the various versions of the
> perf.data file and process them accordingly. Sounds good! PERFFILE, PERFFILE2, PERFFILE3...
> We should have named PERFFILE as PERFILE1, expecting this to happen one day :)


No, as perf_event_attr evolves, we DO NOT need to modify the signature
any longer. The change was necessary only to distinguish the "legacy"
(broken) signature files from the fixed version. In the new version, the
signature is not laid out in the same way for big vs. little endian; as
such, it can be used to detect endianness, thereby "freeing" attr_size
from this role. Using attr_size for that was an error to begin with,
because attr_size can change as the event_attr struct grows.
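
To illustrate the point, here is a minimal standalone sketch (not code
from the patch) of how a reader can tell endianness from the first 8
bytes alone. The two constants are the values from patch 14/18;
detect_swap() is a hypothetical name used only for this example, and its
return convention is an assumption made for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Values from patch 14/18: "PERFILE2" stored as a number, and its byte-swap. */
#define MAGIC2    0x32454c4946524550ULL
#define MAGIC2_SW 0x50455246494c4532ULL

/*
 * Hypothetical helper, for illustration only.
 * Returns  0 if the file was written with the reader's endianness,
 *          1 if it was written with the opposite endianness,
 *         -1 if the first 8 bytes are not the new signature
 *            (e.g. a legacy "PERFFILE" file, which needs other handling).
 */
int detect_swap(const unsigned char first8[8])
{
	uint64_t magic;

	/* Interpret the raw bytes as a native 64-bit integer. */
	memcpy(&magic, first8, sizeof(magic));

	if (magic == MAGIC2)
		return 0;
	if (magic == MAGIC2_SW)
		return 1;
	return -1;
}
```

Because the writer stores the signature as a number, its byte order in
the file follows the writer's endianness, so this check never needs to
look at attr_size.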

>
> > of the host where the file is written. Therefore, we can dynamically detect
> > the endianness by simply reading the first 8 bytes. The size of the
> > perf_event_attr struct can then be processed according to the endianness.
> > The ambiguity between the size being at the same time, the endianness marker
> > and the actual size is gone. We can now distinguish an older ABI by the size
> > and not confuse it with an endianness mismatch.
> >
> > Signed-off-by: Stephane Eranian <eranian@google.com>
> >
> >
> > Roberto Agostino Vitillo (3):
> >   perf: add code to support PERF_SAMPLE_BRANCH_STACK
> >   perf: add support for sampling taken branch to perf record
> >   perf: add support for taken branch sampling to perf report
> >
> > Stephane Eranian (15):
> >   perf: add generic taken branch sampling support
> >   perf: add Intel LBR MSR definitions
> >   perf: add Intel X86 LBR sharing logic
> >   perf: sync branch stack sampling with X86 precise_sampling
> >   perf: add LBR mappings for PERF_SAMPLE_BRANCH filters
> >   perf: disable LBR support for older Intel Atom processors
> >   perf: implement PERF_SAMPLE_BRANCH for Intel X86
> >   perf: add LBR software filter support for Intel X86
> >   perf: disable PERF_SAMPLE_BRANCH_* when not supported
> >   perf: add hook to flush branch_stack on context switch
> >   perf: fix endianness detection in perf.data
> >   perf: add ABI reference sizes
> >   perf: enable reading of perf.data files from different ABI rev
> >   perf: fix bug print_event_desc()
> >   perf: make perf able to read file from older ABIs
> >
> >  arch/alpha/kernel/perf_event.c             |    4 +
> >  arch/arm/kernel/perf_event.c               |    4 +
> >  arch/mips/kernel/perf_event_mipsxx.c       |    4 +
> >  arch/powerpc/kernel/perf_event.c           |    4 +
> >  arch/sh/kernel/perf_event.c                |    4 +
> >  arch/sparc/kernel/perf_event.c             |    4 +
> >  arch/x86/include/asm/msr-index.h           |    7 +
> >  arch/x86/kernel/cpu/perf_event.c           |   47 ++-
> >  arch/x86/kernel/cpu/perf_event.h           |   19 +
> >  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
> >  arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
> >  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
> >  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  532 ++++++++++++++++++++++++++--
> >  include/linux/perf_event.h                 |   82 ++++-
> >  kernel/events/core.c                       |  177 +++++++++
> >  kernel/events/hw_breakpoint.c              |    6 +
> >  tools/perf/Documentation/perf-record.txt   |   25 ++
> >  tools/perf/Documentation/perf-report.txt   |    7 +
> >  tools/perf/builtin-record.c                |   74 ++++
> >  tools/perf/builtin-report.c                |   98 +++++-
> >  tools/perf/perf.h                          |   18 +
> >  tools/perf/util/annotate.c                 |    2 +-
> >  tools/perf/util/event.h                    |    1 +
> >  tools/perf/util/evsel.c                    |   14 +
> >  tools/perf/util/header.c                   |  231 +++++++++++--
> >  tools/perf/util/hist.c                     |   93 ++++-
> >  tools/perf/util/hist.h                     |    7 +
> >  tools/perf/util/session.c                  |   72 ++++
> >  tools/perf/util/session.h                  |    4 +
> >  tools/perf/util/sort.c                     |  362 ++++++++++++++-----
> >  tools/perf/util/sort.h                     |    5 +
> >  tools/perf/util/symbol.h                   |   13 +
> >  32 files changed, 1835 insertions(+), 230 deletions(-)
> >
>
>
> --
> Anshuman Khandual
> Linux Technology Centre
> IBM Systems and Technology Group
>


* Re: [PATCH v4 18/18] perf: make perf able to read file from older ABIs
  2012-01-27 20:56 ` [PATCH v4 18/18] perf: make perf able to read file from older ABIs Stephane Eranian
@ 2012-01-31  8:54   ` Anshuman Khandual
  0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2012-01-31  8:54 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> This patch provides a way to handle legacy perf.data
> files.  Legacy files are those using the older PERFFILE
> signature.
> 
> For those, it is still necessary to detect endianness but
> without comparing their header->attr_size with the
> tool's own version as it may be different. Instead, we use
> a reference table for all known sizes from the legacy era.
> 
> We try all the combinations for sizes and endianness. If we find
> a match, we proceed, otherwise we return: "incompatible file format".
> This is also done for the pipe-mode file format.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/util/header.c |  126 +++++++++++++++++++++++++++++++++++----------
>  1 files changed, 98 insertions(+), 28 deletions(-)
> 
> diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
> index 1fb365d..a15f451 100644
> --- a/tools/perf/util/header.c
> +++ b/tools/perf/util/header.c
> @@ -1630,35 +1630,102 @@ int perf_header__process_sections(struct perf_header *header, int fd,
>  	return err;
>  }
> 
> -static int check_magic_endian(u64 *magic, struct perf_file_header *header,
> -			      struct perf_header *ph)
> +static const int attr_file_abi_sizes[] = {
> +	[0] = PERF_ATTR_SIZE_VER0,
> +	[1] = PERF_ATTR_SIZE_VER1,
> +	0,
> +};
> +
> +/*
> + * In the legacy file format, the magic number is not used to encode endianness.
> + * hdr_sz was used to encode endianness. But given that hdr_sz can vary based
> + * on ABI revisions, we need to try all combinations for all endianness to
> + * detect the endianness.
> + */
> +static int try_all_file_abis(uint64_t hdr_sz, struct perf_header *ph)
>  {
> -	int ret;
> +	uint64_t ref_size, attr_size;
> +	int i;
> 
> -	/* check for legacy format */
> -	ret = memcmp(magic, __perf_magic1, sizeof(*magic));
> -	if (ret == 0) {
> -		pr_debug("legacy perf.data format\n");
> -		if (!header)
> -			return -1;
> +	for (i = 0 ; attr_file_abi_sizes[i]; i++) {
> +		ref_size = attr_file_abi_sizes[i]
> +			 + sizeof(struct perf_file_section);
> +		if (hdr_sz != ref_size) {
> +			attr_size = bswap_64(hdr_sz);
> +			if (attr_size != ref_size)
> +				continue;
> 
> -		if (header->attr_size != sizeof(struct perf_file_attr)) {
> -			u64 attr_size = bswap_64(header->attr_size);
> +			ph->needs_swap = true;
> +		}
> +		pr_debug("ABI%d perf.data file detected, need_swap=%d\n",
> +			 i,
> +			 ph->needs_swap);
> +		return 0;
> +	}
> +	/* could not determine endianness */
> +	return -1;
> +}
> 
> -			if (attr_size != sizeof(struct perf_file_attr))
> -				return -1;
> +#define PERF_PIPE_HDR_VER0	16
> +
> +static const size_t attr_pipe_abi_sizes[] = {
> +	[0] = PERF_PIPE_HDR_VER0,
> +	0,
> +};
> +
> +/*
> + * In the legacy pipe format, there is an implicit assumption that endiannesss
> + * between host recording the samples, and host parsing the samples is the
> + * same. This is not always the case given that the pipe output may always be
> + * redirected into a file and analyzed on a different machine with possibly a
> + * different endianness and perf_event ABI revsions in the perf tool itself.
> + */
> +static int try_all_pipe_abis(uint64_t hdr_sz, struct perf_header *ph)
> +{
> +	uint64_t ref_size;
> +	int i;
> +
> +	for (i = 0 ; attr_pipe_abi_sizes[i]; i++) {
> +		if (hdr_sz != attr_pipe_abi_sizes[i]) {
> +			u64 attr_size = bswap_64(hdr_sz);
> +
> +			if (attr_size != ref_size)

'ref_size' is never assigned a value here, yet it is compared against. This statement
triggers a compilation failure:

cc1: warnings being treated as errors
util/header.c: In function ‘try_all_pipe_abis’:
util/header.c:1692: error: ‘ref_size’ may be used uninitialized in this function
make: *** [util/header.o] Error 1
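
One possible fix, sketched here as a standalone fragment rather than as a
patch: assign ref_size from the table at the top of each iteration, the
same way try_all_file_abis() does. To let it compile outside the perf
tree, struct hdr_state stands in for struct perf_header and swap64()
stands in for bswap_64(); both names are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PERF_PIPE_HDR_VER0	16

static const size_t attr_pipe_abi_sizes[] = {
	[0] = PERF_PIPE_HDR_VER0,
	0,
};

/* Stand-in for the single field of struct perf_header used below. */
struct hdr_state {
	bool needs_swap;
};

/* Portable stand-in for bswap_64(): reverse the 8 bytes of v. */
uint64_t swap64(uint64_t v)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xff);
		v >>= 8;
	}
	return r;
}

int try_all_pipe_abis(uint64_t hdr_sz, struct hdr_state *ph)
{
	uint64_t ref_size;
	int i;

	for (i = 0; attr_pipe_abi_sizes[i]; i++) {
		/* The assignment missing from the posted patch. */
		ref_size = attr_pipe_abi_sizes[i];

		if (hdr_sz != ref_size) {
			/* Try the opposite endianness before giving up. */
			if (swap64(hdr_sz) != ref_size)
				continue;
			ph->needs_swap = true;
		}
		return 0;
	}
	/* No known pipe header size matched in either byte order. */
	return -1;
}
```

With this change the comparison after the byte swap uses the same
reference size as the first comparison, which I assume is what was
intended.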

> +				continue;
> 
>  			ph->needs_swap = true;
>  		}
> +		pr_debug("Pipe ABI%d perf.data file detected\n", i);
>  		return 0;
>  	}
> +	return -1;
> +}
> 
> -	/* check magic number with same endianness */
> -	if (*magic == __perf_magic2)
> +static int check_magic_endian(u64 magic, uint64_t hdr_sz,
> +			      bool is_pipe, struct perf_header *ph)
> +{
> +	int ret;
> +
> +	/* check for legacy format */
> +	ret = memcmp(&magic, __perf_magic1, sizeof(magic));
> +	if (ret == 0) {
> +		pr_debug("legacy perf.data format\n");
> +		if (is_pipe)
> +			return try_all_pipe_abis(hdr_sz, ph);
> +
> +		return try_all_file_abis(hdr_sz, ph);
> +	}
> +	/*
> +	 * the new magic number serves two purposes:
> +	 * - unique number to identify actual perf.data files
> +	 * - encode endianness of file
> +	 */
> +
> +	/* check magic number with one endianness */
> +	if (magic == __perf_magic2)
>  		return 0;
> 
> -	/* check magic number but opposite endianness */
> -	if (*magic != __perf_magic2_sw)
> +	/* check magic number with opposite endianness */
> +	if (magic != __perf_magic2_sw)
>  		return -1;
> 
>  	ph->needs_swap = true;
> @@ -1677,8 +1744,11 @@ int perf_file_header__read(struct perf_file_header *header,
>  	if (ret <= 0)
>  		return -1;
> 
> -	if (check_magic_endian(&header->magic, header, ph) < 0)
> +	if (check_magic_endian(header->magic,
> +			       header->attr_size, false, ph) < 0) {
> +		pr_debug("magic/endian check failed\n");
>  		return -1;
> +	}
> 
>  	if (ph->needs_swap) {
>  		mem_bswap_64(header, offsetof(struct perf_file_header,
> @@ -1924,21 +1994,17 @@ static int perf_file_header__read_pipe(struct perf_pipe_file_header *header,
>  	if (ret <= 0)
>  		return -1;
> 
> -	 if (check_magic_endian(&header->magic, NULL, ph) < 0)
> +	if (check_magic_endian(header->magic, header->size, true, ph) < 0) {
> +		pr_debug("endian/magic failed\n");
>  		return -1;
> +	}
> +
> +	if (ph->needs_swap)
> +		header->size = bswap_64(header->size);
> 
>  	if (repipe && do_write(STDOUT_FILENO, header, sizeof(*header)) < 0)
>  		return -1;
> 
> -	if (header->size != sizeof(*header)) {
> -		u64 size = bswap_64(header->size);
> -
> -		if (size != sizeof(*header))
> -			return -1;
> -
> -		ph->needs_swap = true;
> -	}
> -
>  	return 0;
>  }
> 
> @@ -1975,6 +2041,10 @@ static int read_attr(int fd, struct perf_header *ph,
> 
>  	/* on file perf_event_attr size */
>  	sz = attr->size;
> +	if (sz != our_sz)
> +		pr_debug("on file attr=%zu vs. %zu bytes,"
> +			 " ignoring extra fields\n", sz, our_sz);
> +
>  	if (ph->needs_swap)
>  		sz = bswap_32(sz);
> 


-- 
Anshuman Khandual
Linux Technology Centre
IBM Systems and Technology Group


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 12/18] perf: add support for sampling taken branch to perf record
  2012-01-27 20:56 ` [PATCH v4 12/18] perf: add support for sampling taken branch to perf record Stephane Eranian
@ 2012-01-31  9:47   ` Anshuman Khandual
  2012-01-31 10:31     ` Stephane Eranian
  0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2012-01-31  9:47 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> From: Roberto Agostino Vitillo <ravitillo@lbl.gov>
> 
> This patch adds a new option to enable taken branch stack
> sampling, i.e., leverage the PERF_SAMPLE_BRANCH_STACK feature
> of perf_events.
> 
> There is a new option to active this mode: -b.
> It is possible to pass a set of filters to select the type of
> branches to sample.
> 
> The following filters are available:
> - any : any type of branches
> - any_call : any function call or system call
> - any_ret : any function return or system call return
> - any_ind : any indirect branch
> - u:  only when the branch target is at the user level
> - k: only when the branch target is in the kernel
> - hv: only when the branch target is in the hypervisor
> 
> Filters can be combined by passing a comma separated list
> to the option:
> 
> $ perf record -b any_call,u -e cycles:u branchy
> 
> Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
> Signed-off-by: Stephane Eranian <eranian@google.com>
> ---
>  tools/perf/Documentation/perf-record.txt |   25 ++++++++++
>  tools/perf/builtin-record.c              |   74 ++++++++++++++++++++++++++++++
>  tools/perf/perf.h                        |    1 +
>  tools/perf/util/evsel.c                  |    4 ++
>  4 files changed, 104 insertions(+), 0 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
> index ff9a66e..288d429 100644
> --- a/tools/perf/Documentation/perf-record.txt
> +++ b/tools/perf/Documentation/perf-record.txt
> @@ -152,6 +152,31 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
>  corresponding events, i.e., they always refer to events defined earlier on the command
>  line.
> 
> +-b::
> +--branch-stack::
> +Enable taken branch stack sampling. Each sample captures a series of consecutive
> +taken branches. The number of branches captured with each sample depends on the
> +underlying hardware, the type of branches of interest, and the executed code.
> +It is possible to select the types of branches captured by enabling filters. The
> +following filters are defined:
> +
> +        -  any :  any type of branches
> +        - any_call: any function call or system call
> +        - any_ret: any function return or system call return
> +        - any_ind: any indirect branch
> +        - u:  only when the branch target is at the user level
> +        - k: only when the branch target is in the kernel
> +        - hv: only when the target is at the hypervisor level
> +
> ++
> +At least one of any, any_call, any_ret, any_ind must be provided. The privilege levels may
> +be ommitted, in which case, the privilege levels of the associated event are applied to the
> +branch filter. Both kernel (k) and hypervisor (hv) privilege levels are subject to
> +permissions.  When sampling on multiple events, branch stack sampling is enabled for all
> +the sampling events. The sampled branch type is the same for all events.
> +Note that taken branch sampling may not be available on all processors.
> +The various filters must be specified as a comma separated list: -b any_ret,u,k
> +
>  SEE ALSO
>  --------
>  linkperf:perf-stat[1], linkperf:perf-list[1]
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 32870ee..7df6e68 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -637,6 +637,77 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
>  	return err;
>  }
> 
> +#define BRANCH_OPT(n, m) \
> +	{ .name = n, .mode = (m) }
> +
> +#define BRANCH_END { .name = NULL }
> +
> +struct branch_mode {
> +	const char *name;
> +	int mode;
> +};
> +
> +static const struct branch_mode branch_modes[] = {
> +	BRANCH_OPT("u", PERF_SAMPLE_BRANCH_USER),
> +	BRANCH_OPT("k", PERF_SAMPLE_BRANCH_KERNEL),
> +	BRANCH_OPT("hv", PERF_SAMPLE_BRANCH_HV),
> +	BRANCH_OPT("any", PERF_SAMPLE_BRANCH_ANY),
> +	BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
> +	BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
> +	BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
> +	BRANCH_END
> +};
> +
> +static int
> +parse_branch_stack(const struct option *opt, const char *str, int unset __used)
> +{
> +#define ONLY_PLM \
> +	(PERF_SAMPLE_BRANCH_USER	|\
> +	 PERF_SAMPLE_BRANCH_KERNEL	|\
> +	 PERF_SAMPLE_BRANCH_KERNEL)

I guess this should be PERF_SAMPLE_BRANCH_HV instead of the second
PERF_SAMPLE_BRANCH_KERNEL.
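
To illustrate the effect, here is a small standalone sketch. The BR_* bit
values and has_branch_type() are assumptions made for the example (the real
constants come from perf_event.h); only the shape of the mask matters.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; the real ones come from perf_event.h. */
enum {
	BR_USER     = 1u << 0,
	BR_KERNEL   = 1u << 1,
	BR_HV       = 1u << 2,
	BR_ANY      = 1u << 3,
	BR_ANY_CALL = 1u << 4,
};

/* Mask as posted: KERNEL appears twice, HV is missing. */
#define PLM_AS_POSTED	(BR_USER | BR_KERNEL | BR_KERNEL)
/* Mask with the suggested correction. */
#define PLM_FIXED	(BR_USER | BR_KERNEL | BR_HV)

/* Nonzero when mode holds at least one bit outside the privilege mask. */
int has_branch_type(uint64_t mode, uint64_t plm_mask)
{
	return (mode & ~plm_mask) != 0;
}
```

With the mask as posted, "-b hv" alone would count as having a branch type
and slip past the "need at least one branch type" check; with the corrected
mask it is rejected like "-b u" or "-b k".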

> +
> +	uint64_t *mode = (uint64_t *)opt->value;
> +	const struct branch_mode *br;
> +	char *s, *os, *p;
> +	int ret = -1;
> +
> +	*mode = 0;
> +
> +	/* because str is read-only */
> +	s = os = strdup(str);
> +	if (!s)
> +		return -1;
> +
> +	for (;;) {
> +		p = strchr(s, ',');
> +		if (p)
> +			*p = '\0';
> +
> +		for (br = branch_modes; br->name; br++) {
> +			if (!strcasecmp(s, br->name))
> +				break;
> +		}
> +		if (!br->name)
> +			goto error;
> +
> +		*mode |= br->mode;
> +
> +		if (!p)
> +			break;
> +
> +		s = p + 1;
> +	}
> +	ret = 0;
> +
> +	if ((*mode & ~ONLY_PLM) == 0) {
> +		error("need at least one branch type with -b\n");
> +		ret = -1;
> +	}
> +error:
> +	free(os);
> +	return ret;
> +}
> +
>  static const char * const record_usage[] = {
>  	"perf record [<options>] [<command>]",
>  	"perf record [<options>] -- <command> [<options>]",
> @@ -729,6 +800,9 @@ const struct option record_options[] = {
>  		     "monitor event in cgroup name only",
>  		     parse_cgroups),
>  	OPT_STRING('u', "uid", &record.uid_str, "user", "user to profile"),
> +	OPT_CALLBACK('b', "branch-stack", &record.opts.branch_stack,
> +		     "branch mode mask", "branch stack sampling modes",
> +		     parse_branch_stack),
>  	OPT_END()
>  };
> 
> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
> index 8b4d25d..7f8fbab 100644
> --- a/tools/perf/perf.h
> +++ b/tools/perf/perf.h
> @@ -222,6 +222,7 @@ struct perf_record_opts {
>  	unsigned int freq;
>  	unsigned int mmap_pages;
>  	unsigned int user_freq;
> +	int	     branch_stack;
>  	u64	     default_interval;
>  	u64	     user_interval;
>  	const char   *cpu_list;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 472fc8c..a65a53c 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -126,6 +126,10 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>  		attr->watermark = 0;
>  		attr->wakeup_events = 1;
>  	}
> +	if (opts->branch_stack) {
> +		attr->sample_type	|= PERF_SAMPLE_BRANCH_STACK;
> +		attr->branch_sample_type = opts->branch_stack;
> +	}
> 
>  	attr->mmap = track;
>  	attr->comm = track;


-- 
Anshuman Khandual
Linux Technology Centre
IBM Systems and Technology Group


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 12/18] perf: add support for sampling taken branch to perf record
  2012-01-31  9:47   ` Anshuman Khandual
@ 2012-01-31 10:31     ` Stephane Eranian
  2012-01-31 15:44       ` Anshuman Khandual
  0 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2012-01-31 10:31 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Tue, Jan 31, 2012 at 10:47 AM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
>> From: Roberto Agostino Vitillo <ravitillo@lbl.gov>
>>
>> This patch adds a new option to enable taken branch stack
>> sampling, i.e., leverage the PERF_SAMPLE_BRANCH_STACK feature
>> of perf_events.
>>
>> There is a new option to active this mode: -b.
>> It is possible to pass a set of filters to select the type of
>> branches to sample.
>>
>> The following filters are available:
>> - any : any type of branches
>> - any_call : any function call or system call
>> - any_ret : any function return or system call return
>> - any_ind : any indirect branch
>> - u:  only when the branch target is at the user level
>> - k: only when the branch target is in the kernel
>> - hv: only when the branch target is in the hypervisor
>>
>> Filters can be combined by passing a comma separated list
>> to the option:
>>
>> $ perf record -b any_call,u -e cycles:u branchy
>>
>> Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
>> Signed-off-by: Stephane Eranian <eranian@google.com>
>> ---
>>  tools/perf/Documentation/perf-record.txt |   25 ++++++++++
>>  tools/perf/builtin-record.c              |   74 ++++++++++++++++++++++++++++++
>>  tools/perf/perf.h                        |    1 +
>>  tools/perf/util/evsel.c                  |    4 ++
>>  4 files changed, 104 insertions(+), 0 deletions(-)
>>
>> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
>> index ff9a66e..288d429 100644
>> --- a/tools/perf/Documentation/perf-record.txt
>> +++ b/tools/perf/Documentation/perf-record.txt
>> @@ -152,6 +152,31 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
>>  corresponding events, i.e., they always refer to events defined earlier on the command
>>  line.
>>
>> +-b::
>> +--branch-stack::
>> +Enable taken branch stack sampling. Each sample captures a series of consecutive
>> +taken branches. The number of branches captured with each sample depends on the
>> +underlying hardware, the type of branches of interest, and the executed code.
>> +It is possible to select the types of branches captured by enabling filters. The
>> +following filters are defined:
>> +
>> +        -  any :  any type of branches
>> +        - any_call: any function call or system call
>> +        - any_ret: any function return or system call return
>> +        - any_ind: any indirect branch
>> +        - u:  only when the branch target is at the user level
>> +        - k: only when the branch target is in the kernel
>> +        - hv: only when the target is at the hypervisor level
>> +
>> ++
>> +At least one of any, any_call, any_ret, any_ind must be provided. The privilege levels may
>> +be ommitted, in which case, the privilege levels of the associated event are applied to the
>> +branch filter. Both kernel (k) and hypervisor (hv) privilege levels are subject to
>> +permissions.  When sampling on multiple events, branch stack sampling is enabled for all
>> +the sampling events. The sampled branch type is the same for all events.
>> +Note that taken branch sampling may not be available on all processors.
>> +The various filters must be specified as a comma separated list: -b any_ret,u,k
>> +
>>  SEE ALSO
>>  --------
>>  linkperf:perf-stat[1], linkperf:perf-list[1]
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 32870ee..7df6e68 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -637,6 +637,77 @@ static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
>>       return err;
>>  }
>>
>> +#define BRANCH_OPT(n, m) \
>> +     { .name = n, .mode = (m) }
>> +
>> +#define BRANCH_END { .name = NULL }
>> +
>> +struct branch_mode {
>> +     const char *name;
>> +     int mode;
>> +};
>> +
>> +static const struct branch_mode branch_modes[] = {
>> +     BRANCH_OPT("u", PERF_SAMPLE_BRANCH_USER),
>> +     BRANCH_OPT("k", PERF_SAMPLE_BRANCH_KERNEL),
>> +     BRANCH_OPT("hv", PERF_SAMPLE_BRANCH_HV),
>> +     BRANCH_OPT("any", PERF_SAMPLE_BRANCH_ANY),
>> +     BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
>> +     BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
>> +     BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
>> +     BRANCH_END
>> +};
>> +
>> +static int
>> +parse_branch_stack(const struct option *opt, const char *str, int unset __used)
>> +{
>> +#define ONLY_PLM \
>> +     (PERF_SAMPLE_BRANCH_USER        |\
>> +      PERF_SAMPLE_BRANCH_KERNEL      |\
>> +      PERF_SAMPLE_BRANCH_KERNEL)
>
> I guess this would be PERF_SAMPLE_BRANCH_HV instead of the second
> PERF_SAMPLE_BRANCH_KERNEL.
>
Oops, yes you're right.

There is also something else I realized after the fact that needs to be
tweaked about BRANCH_HV.

The thing is, the x86 code seems to be set up to ignore priv levels it does
not know about. Perf does not set exclude_hv by default. Thus in my patch,
if the user does not specify any branch priv level, it will default to the
levels used for the event. That is fine, but in the x86 code I added a sanity
check to reject BRANCH_HV because the HW does not support it. I think it
should just ignore it. That way, one can do:

    $ perf record -b any_call -e cycles ls

without getting an error (because hv is not supported on branch sampling).
Currently, the workaround is to set the priv level on branches:

    $ perf record -b any_call,u,k -e cycles ls
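
To make the defaulting rule concrete, here is a minimal sketch of it. This is
not the actual kernel code: the BR_* bit values, struct event_excl and
inherit_plm() are names made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; the real ones come from perf_event.h. */
enum {
	BR_USER     = 1u << 0,
	BR_KERNEL   = 1u << 1,
	BR_HV       = 1u << 2,
	BR_ANY_CALL = 1u << 4,
};

#define BR_PLM_ALL	(BR_USER | BR_KERNEL | BR_HV)

/* Hypothetical stand-in for the exclude_* bits of perf_event_attr. */
struct event_excl {
	bool exclude_user;
	bool exclude_kernel;
	bool exclude_hv;
};

/* Fill in branch priv levels from the event when none were requested. */
uint64_t inherit_plm(uint64_t br_type, const struct event_excl *ev)
{
	/* explicit priv levels on the branch filter win */
	if (br_type & BR_PLM_ALL)
		return br_type;

	if (!ev->exclude_user)
		br_type |= BR_USER;
	if (!ev->exclude_kernel)
		br_type |= BR_KERNEL;
	if (!ev->exclude_hv)
		br_type |= BR_HV;

	return br_type;
}
```

Since exclude_hv is not set by default, "-b any_call" picks up the HV bit
this way, and that is what trips the sanity check. Passing u,k explicitly
skips the defaulting altogether.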


>> +
>> +     uint64_t *mode = (uint64_t *)opt->value;
>> +     const struct branch_mode *br;
>> +     char *s, *os, *p;
>> +     int ret = -1;
>> +
>> +     *mode = 0;
>> +
>> +     /* because str is read-only */
>> +     s = os = strdup(str);
>> +     if (!s)
>> +             return -1;
>> +
>> +     for (;;) {
>> +             p = strchr(s, ',');
>> +             if (p)
>> +                     *p = '\0';
>> +
>> +             for (br = branch_modes; br->name; br++) {
>> +                     if (!strcasecmp(s, br->name))
>> +                             break;
>> +             }
>> +             if (!br->name)
>> +                     goto error;
>> +
>> +             *mode |= br->mode;
>> +
>> +             if (!p)
>> +                     break;
>> +
>> +             s = p + 1;
>> +     }
>> +     ret = 0;
>> +
>> +     if ((*mode & ~ONLY_PLM) == 0) {
>> +             error("need at least one branch type with -b\n");
>> +             ret = -1;
>> +     }
>> +error:
>> +     free(os);
>> +     return ret;
>> +}
>> +
>>  static const char * const record_usage[] = {
>>       "perf record [<options>] [<command>]",
>>       "perf record [<options>] -- <command> [<options>]",
>> @@ -729,6 +800,9 @@ const struct option record_options[] = {
>>                    "monitor event in cgroup name only",
>>                    parse_cgroups),
>>       OPT_STRING('u', "uid", &record.uid_str, "user", "user to profile"),
>> +     OPT_CALLBACK('b', "branch-stack", &record.opts.branch_stack,
>> +                  "branch mode mask", "branch stack sampling modes",
>> +                  parse_branch_stack),
>>       OPT_END()
>>  };
>>
>> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
>> index 8b4d25d..7f8fbab 100644
>> --- a/tools/perf/perf.h
>> +++ b/tools/perf/perf.h
>> @@ -222,6 +222,7 @@ struct perf_record_opts {
>>       unsigned int freq;
>>       unsigned int mmap_pages;
>>       unsigned int user_freq;
>> +     int          branch_stack;
>>       u64          default_interval;
>>       u64          user_interval;
>>       const char   *cpu_list;
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 472fc8c..a65a53c 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -126,6 +126,10 @@ void perf_evsel__config(struct perf_evsel *evsel, struct perf_record_opts *opts)
>>               attr->watermark = 0;
>>               attr->wakeup_events = 1;
>>       }
>> +     if (opts->branch_stack) {
>> +             attr->sample_type       |= PERF_SAMPLE_BRANCH_STACK;
>> +             attr->branch_sample_type = opts->branch_stack;
>> +     }
>>
>>       attr->mmap = track;
>>       attr->comm = track;
>
>
> --
> Anshuman Khandual
> Linux Technology Centre
> IBM Systems and Technology Group
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 12/18] perf: add support for sampling taken branch to perf record
  2012-01-31 10:31     ` Stephane Eranian
@ 2012-01-31 15:44       ` Anshuman Khandual
  2012-01-31 15:48         ` Stephane Eranian
  0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2012-01-31 15:44 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Tuesday 31 January 2012 04:01 PM, Stephane Eranian wrote:
>>> +};
>>> +
>>> +static int
>>> +parse_branch_stack(const struct option *opt, const char *str, int unset __used)
>>> +{
>>> +#define ONLY_PLM \
>>> +     (PERF_SAMPLE_BRANCH_USER        |\
>>> +      PERF_SAMPLE_BRANCH_KERNEL      |\
>>> +      PERF_SAMPLE_BRANCH_KERNEL)
>>
>> I guess this would be PERF_SAMPLE_BRANCH_HV instead of the second
>> PERF_SAMPLE_BRANCH_KERNEL.
>>
> Oops, yes you're right.
> 
> There is also something else I realized after the fact that needs to
> be tweaked about
> BRANCH_HV.
> 
> The thing is the X86 code is setup to ignore priv levels it does not
> know about, it seems.
> Perf does not set exclude_hv by default. Thus in my patch, if the user
> does not specify
> any branch priv level, it will default to the level used for the
> event. That is fine but in the
> x86 code, I added a sanity check to reject BRANCH_HV because the HW
> does not support
> it. 

Right. So either we

(1) Set 'exclude_hv' on an x86 system without hypervisor mode (and the
    required HW support) and keep the sanity check for BRANCH_HV

or

(2) Do not set 'exclude_hv' (which is what happens right now by default)
    and remove the sanity check

> I think it should just ignore it. That way, one can do:
> 
>     $ perf record -b any_call -e cycles ls
> 
> without getting an error (because hv is not supported on branch sampling).
> Currently, the workaround is to set the priv level on branches:
> 
>     $ perf record -b any_call,u,k -e cycles ls 
--------------------------------
./perf record -b any_call -e cycles ls
./perf record -b any_call,hv -e cycles ls

  Error: sys_perf_event_open() syscall returned with 95 (Operation not supported).  /bin/dmesg may provide additional information.

  Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

ls: Terminated
--------------------------------

However, these work absolutely fine:

perf record -b any_call,k -e cycles ls
perf record -b any_call,u -e cycles ls
perf record -b any_call,u,k -e cycles ls
-- 
Anshuman Khandual
Linux Technology Centre
IBM Systems and Technology Group


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 12/18] perf: add support for sampling taken branch to perf record
  2012-01-31 15:44       ` Anshuman Khandual
@ 2012-01-31 15:48         ` Stephane Eranian
  0 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-01-31 15:48 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Tue, Jan 31, 2012 at 4:44 PM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> On Tuesday 31 January 2012 04:01 PM, Stephane Eranian wrote:
>>>> +};
>>>> +
>>>> +static int
>>>> +parse_branch_stack(const struct option *opt, const char *str, int unset __used)
>>>> +{
>>>> +#define ONLY_PLM \
>>>> +     (PERF_SAMPLE_BRANCH_USER        |\
>>>> +      PERF_SAMPLE_BRANCH_KERNEL      |\
>>>> +      PERF_SAMPLE_BRANCH_KERNEL)
>>>
>>> I guess this would be PERF_SAMPLE_BRANCH_HV instead of the second
>>> PERF_SAMPLE_BRANCH_KERNEL.
>>>
>> Oops, yes you're right.
>>
>> There is also something else I realized after the fact that needs to
>> be tweaked about
>> BRANCH_HV.
>>
>> The thing is the X86 code is setup to ignore priv levels it does not
>> know about, it seems.
>> Perf does not set exclude_hv by default. Thus in my patch, if the user
>> does not specify
>> any branch priv level, it will default to the level used for the
>> event. That is fine but in the
>> x86 code, I added a sanity check to reject BRANCH_HV because the HW
>> does not support
>> it.
>
> Right. So either we
>
> (1) Set 'exclude_hv' on a X86 system without hypervisor mode (and required HW support) and do the sanity check for BRANCH_HV
>
> or
>
> (2) Do not set 'exclude_hv' (which is happening right now by default) and remove the sanity check
>>I think it should just ignore it. That way, one can do:
>>
>>     $ perf record -b any_call -e cycles ls
>>
>> without getting an error (because hv is not supported on branch sampling).
>> Currently, the workaround is to set the priv level on branches:
>>
>>     $ perf record -b any_call,u,k -e cycles ls
> --------------------------------
> ./perf record -b any_call -e cycles ls
> ./perf record -b any_call,hv -e cycles ls
>
>  Error: sys_perf_event_open() syscall returned with 95 (Operation not supported).  /bin/dmesg may provide additional information.
>
>  Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
>
> ls: Terminated
> --------------------------------
>
> However these works absolutely fine
>
> perf record -b any_call,k -e cycles ls
> perf record -b any_call,u -e cycles ls
> perf record -b any_call,u,k -e cycles ls

Yes, because you only get the problem when the kernel has to figure out the
default priv level for the branches.

I want to make this simplest case work:
     ./perf record -b any_call -e cycles ls

For now, I have reworked the patchset to ignore hv in the x86 LBR code.
That's the simplest fix.
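
Roughly, the idea looks like the sketch below. It is only an illustration:
the BR_* and HW_* bit values and setup_hw_filter() are made up for the
example and are not the real LBR_SELECT encoding or the reworked code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; the real ones come from perf_event.h. */
enum {
	BR_USER   = 1u << 0,
	BR_KERNEL = 1u << 1,
	BR_HV     = 1u << 2,
	BR_ANY    = 1u << 3,
};

/* Hypothetical hardware filter bits, not the real LBR_SELECT layout. */
enum {
	HW_USER   = 1u << 0,
	HW_KERNEL = 1u << 1,
	HW_ANY    = 1u << 2,
};

/*
 * Translate the requested branch filter into hardware filter bits.
 * BR_HV has no hardware equivalent in this sketch, so it is dropped
 * silently instead of failing the whole request. Only bits outside
 * the known set are rejected.
 */
int setup_hw_filter(uint64_t br_type, uint32_t *hw_mask)
{
	uint32_t mask = 0;

	if (br_type & ~(uint64_t)(BR_USER | BR_KERNEL | BR_HV | BR_ANY))
		return -1;

	if (br_type & BR_USER)
		mask |= HW_USER;
	if (br_type & BR_KERNEL)
		mask |= HW_KERNEL;
	if (br_type & BR_ANY)
		mask |= HW_ANY;
	/* BR_HV: intentionally ignored */

	*hw_mask = mask;
	return 0;
}
```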

> --
> Anshuman Khandual
> Linux Technology Centre
> IBM Systems and Technology Group
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v4 00/18] perf: add support for sampling taken branches
  2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
                   ` (18 preceding siblings ...)
  2012-01-30  4:16 ` [PATCH v4 00/18] perf: add support for sampling taken branches Anshuman Khandual
@ 2012-02-01  8:41 ` Anshuman Khandual
  2012-02-02 13:23   ` Stephane Eranian
  19 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2012-02-01  8:41 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
> This patchset adds an important and useful new feature to
> perf_events: branch stack sampling. In other words, the
> ability to capture taken branches into each sample.
> 
> Statistical sampling of taken branch should not be confused
> for branch tracing. Not all branches are necessarily captured
> 
> Sampling taken branches is important for basic block profiling,
> statistical call graph, function call counts. Many of those
> measurements can help drive a compiler optimizer.
> 
> The branch stack is a software abstraction which sits on top
> of the PMU hardware. As such, it is not available on all
> processors. For now, the patch provides the generic interface
> and the Intel X86 implementation where it leverages the Last
> Branch Record (LBR) feature (from Core2 to SandyBridge).
> 
> Branch stack sampling is supported for both per-thread and
> system-wide modes.
> 
> It is possible to filter the type and privilege level of branches
> to sample. The target of the branch is used to determine
> the privilege level.
> 
> For each branch, the source and destination are captured. On
> some hardware platforms, it may be possible to also extract
> the target prediction and, in that case, it is also exposed
> to end users.
> 
> The branch stack can record a variable number of taken
> branches per sample. Those branches are always consecutive
> in time. The number of branches captured depends on the
> filtering and the underlying hardware. On Intel Nehalem
> and later, up to 16 consecutive branches can be captured
> per sample.
> 
> Branch sampling is always coupled with an event. It can
> be any PMU event but it can't be a SW or tracepoint event.
> 
> Branch sampling is requested by setting a new sample_type
> flag called: PERF_SAMPLE_BRANCH_STACK.
> 
> To support branch filtering, we introduce a new field
> to the perf_event_attr struct: branch_sample_type. We chose
> NOT to overload the config1, config2 field because those
> are related to the event encoding. Branch stack is a
> separate feature which is combined with the event.
> 
> The branch_sample_type is a bitmask of possible filters.
> The following filters are defined (more can be added):
> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
> - PERF_SAMPLE_BRANCH_USER    : branches when target is at user level
> - PERF_SAMPLE_BRANCH_KERNEL  : branches when target is at kernel level
> - PERF_SAMPLE_BRANCH_HV      : branches when target is at hypervisor level
> - PERF_SAMPLE_BRANCH_ANY_CALL: call branches (incl. syscalls)
> - PERF_SAMPLE_BRANCH_ANY_RET : return branches (incl. syscall returns)
> - PERF_SAMPLE_BRANCH_IND_CALL: indirect calls
> 
> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
> 
> When the privilege level is not specified, the branch stack
> inherits that of the associated event.
> 
> Some processors may not offer hardware branch filtering, e.g., Intel
> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
> X86 implementation in this patchset also provides a SW branch filter
> which works on a best effort basis. It can compensate for the lack
> of LBR filtering. But first and foremost, it helps work around LBR
> filtering errata. The goal is to only capture the type of branches
> requested by the user.
> 
> It is possible to combine branch stack sampling with PEBS on Intel
> X86 processors. Depending on the precise_sampling mode, there are
> certain filtering restrictions. When precise_sampling=1, then
> there are no filtering restrictions. When precise_sampling > 1, 
> then only ANY|USER|KERNEL filter can be used. This comes from
> the fact that the kernel uses LBR to compensate for the PEBS
> off-by-1 skid on the instruction pointer.
> 
> To demonstrate how the perf_event branch stack sampling interface
> works, the patchset also modifies perf record to capture taken
> branches. Similarly perf report is enhanced to display a histogram
> of taken branches.
> 
> I would like to thank Roberto Vitillo @ LBL for his work on the perf
> tool for this.
> 
> Enough talking, let's take a simple example. Our trivial test program
> goes like this:
> 
> void f2(void)
> {}
> void f3(void)
> {}
> void f1(unsigned long n)
> {
>   if (n & 1UL)
>     f2();
>   else
>     f3();
> }
> int main(void)
> {
>   unsigned long i;
> 
>   for (i=0; i < N; i++)
>    f1(i);
>   return 0;
> }
> 
> $ perf record -b any branchy
> $ perf report -b
> # Events: 23K cycles
> #
> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
> 
>     18.13%  [.] f1            [.] main                          
>     18.10%  [.] main          [.] main                          
>     18.01%  [.] main          [.] f1                            
>     15.69%  [.] f1            [.] f1                            
>      9.11%  [.] f3            [.] f1                            
>      6.78%  [.] f1            [.] f3                            
>      6.74%  [.] f1            [.] f2                            
>      6.71%  [.] f2            [.] f1                            
> 
> Of the total number of branches captured, 18.13% were from f1() -> main().
> 
> Let's make this clearer by filtering the user call branches only:
> 
> $ perf record -b any_call -e cycles:u branchy
> $ perf report -b
> # Events: 19K cycles
> #
> # Overhead  Source Symbol              Target Symbol
> # ........  .........................  .........................
> #
>     52.50%  [.] main                   [.] f1                   
>     23.99%  [.] f1                     [.] f3                   
>     23.48%  [.] f1                     [.] f2                   
>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>      0.01%  [k] _start                 [k] __libc_start_main    
> 
> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
> The rest is split 50/50 between f1() -> f2() and f1() -> f3() which is expected given
> that f1() dispatches based on odd vs. even values of n which is constantly increasing.
> 
> 
> Here is a kernel example, where we want to sample indirect calls:
> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10 
> $ perf report -b
> #
> # Overhead  Source Symbol               Target Symbol
> # ........  ..........................  ..........................
> #
>     36.36%  [k] __delay                 [k] delay_tsc             
>      9.09%  [k] ktime_get               [k] read_tsc              
>      9.09%  [k] getnstimeofday          [k] read_tsc              
>      9.09%  [k] notifier_call_chain     [k] tick_notify           
>      4.55%  [k] cpuidle_idle_call       [k] intel_idle            
>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect          
>      2.27%  [k] handle_irq              [k] handle_edge_irq       
>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write 
>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt     
>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn     
>      2.27%  [k] enqueue_task            [k] enqueue_task_rt       
>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt     
>      2.27%  [k] do_timer                [k] read_tsc              
>

I am wondering whether appending function call chain details to the branch stack
would add value from a system performance event analysis perspective.

perf record -g -b any_call,u -e branch-misses:k ls

15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] getenv              [k] strncmp
15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] strlen
15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] memcpy
15.38% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] mmap64
 7.69% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] __strchrnul
 7.69% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] __execve
 7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] _dl_setup_hash
 7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] close
 7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] memset

From the example above, we can see:

(1) 15.38%  ls  libc-2.11.1.so libc-2.11.1.so [k] getenv [k] strncmp

    '[k] getenv ----> [k] strncmp' happened 15% of the time for the branch-misses
     event overflow.

(2) But this lacks information from the source code point of view, such as
    which code path eventually ended up in the branch (getenv ----> strncmp)
    15.38% of the time for the event. Any number of function call chains might
    lead to the branch (getenv ----> strncmp). Having a percentage distribution
    of the function call chains for every entry in the output (as above) would
    be a good idea. This would give complete information (though statistically
    sampled) on the source code control flow that led to the PMU event.

(3) <percentage of call_chain> <percentage of branch_chain> [EVENT]
    There may be situations where these chains are overlapping with each other
    to some extent.
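
To make the proposal in (2) and (3) a bit more concrete, here is a rough
sketch of the per-entry breakdown. It is only an illustration: the struct,
the function name and the integer ids are hypothetical and do not exist in
perf. It assumes an earlier pass has already assigned a small id to each
branch entry and to each distinct call chain.

```c
#include <stddef.h>

/*
 * Hypothetical sample record: edge_id names one branch entry from the
 * report (e.g. getenv -> strncmp) and chain_id names the call chain
 * that was active when the sample was taken.
 */
struct chain_sample {
	unsigned edge_id;
	unsigned chain_id;
};

/*
 * For one branch edge, fill percent[c] with the share of that edge's
 * samples that arrived through call chain c (0 <= c < nr_chains).
 * Returns the number of samples that matched the edge.
 */
size_t chain_distribution(const struct chain_sample *s, size_t n,
			  unsigned edge_id, double *percent,
			  size_t nr_chains)
{
	size_t total = 0;

	for (size_t c = 0; c < nr_chains; c++)
		percent[c] = 0.0;

	/* First pass: raw counts per call chain for the chosen edge. */
	for (size_t i = 0; i < n; i++) {
		if (s[i].edge_id != edge_id || s[i].chain_id >= nr_chains)
			continue;
		percent[s[i].chain_id] += 1.0;
		total++;
	}

	/* Second pass: turn counts into percentages of the edge total. */
	for (size_t c = 0; c < nr_chains && total; c++)
		percent[c] = 100.0 * percent[c] / (double)total;

	return total;
}
```

With four samples on one edge, three through chain 0 and one through
chain 1, this would report 75% and 25% under that edge.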

If we change to the newt output format, we can display the relative percentages of call
chains when we click on a specific entry of the branch chain, similar to when we
annotate a symbol in the normal perf report newt output.

Any thoughts?



* Re: [PATCH v4 00/18] perf: add support for sampling taken branches
  2012-02-01  8:41 ` Anshuman Khandual
@ 2012-02-02 13:23   ` Stephane Eranian
  0 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2012-02-02 13:23 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, acme, robert.richter, ming.m.lin,
	andi, asharma, ravitillo, vweaver1, dsahern

On Wed, Feb 1, 2012 at 9:41 AM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> On Saturday 28 January 2012 02:26 AM, Stephane Eranian wrote:
>> This patchset adds an important and useful new feature to
>> perf_events: branch stack sampling. In other words, the
>> ability to capture taken branches into each sample.
>>
>> Statistical sampling of taken branch should not be confused
>> for branch tracing. Not all branches are necessarily captured
>>
>> Sampling taken branches is important for basic block profiling,
>> statistical call graph, function call counts. Many of those
>> measurements can help drive a compiler optimizer.
>>
>> The branch stack is a software abstraction which sits on top
>> of the PMU hardware. As such, it is not available on all
>> processors. For now, the patch provides the generic interface
>> and the Intel X86 implementation where it leverages the Last
>> Branch Record (LBR) feature (from Core2 to SandyBridge).
>>
>> Branch stack sampling is supported for both per-thread and
>> system-wide modes.
>>
>> It is possible to filter the type and privilege level of branches
>> to sample. The target of the branch is used to determine
>> the privilege level.
>>
>> For each branch, the source and destination are captured. On
>> some hardware platforms, it may be possible to also extract
>> the target prediction and, in that case, it is also exposed
>> to end users.
>>
>> The branch stack can record a variable number of taken
>> branches per sample. Those branches are always consecutive
>> in time. The number of branches captured depends on the
>> filtering and the underlying hardware. On Intel Nehalem
>> and later, up to 16 consecutive branches can be captured
>> per sample.
>>
>> Branch sampling is always coupled with an event. It can
>> be any PMU event but it can't be a SW or tracepoint event.
>>
>> Branch sampling is requested by setting a new sample_type
>> flag called: PERF_SAMPLE_BRANCH_STACK.
>>
>> To support branch filtering, we introduce a new field
>> to the perf_event_attr struct: branch_sample_type. We chose
>> NOT to overload the config1, config2 field because those
>> are related to the event encoding. Branch stack is a
>> separate feature which is combined with the event.
>>
>> The branch_sample_type is a bitmask of possible filters.
>> The following filters are defined (more can be added):
>> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
>> - PERF_SAMPLE_BRANCH_USER    : branches when target is at user level
>> - PERF_SAMPLE_BRANCH_KERNEL  : branches when target is at kernel level
>> - PERF_SAMPLE_BRANCH_HV      : branches when target is at hypervisor level
>> - PERF_SAMPLE_BRANCH_ANY_CALL: call branches (incl. syscalls)
>> - PERF_SAMPLE_BRANCH_ANY_RET : return branches (incl. syscall returns)
>> - PERF_SAMPLE_BRANCH_IND_CALL: indirect calls
>>
>> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
>>
>> When the privilege level is not specified, the branch stack
>> inherits that of the associated event.
>>
>> Some processors may not offer hardware branch filtering, e.g., Intel
>> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
>> X86 implementation in this patchset also provides a SW branch filter
>> which works on a best effort basis. It can compensate for the lack
>> of LBR filtering. But first and foremost, it helps work around LBR
>> filtering errata. The goal is to only capture the type of branches
>> requested by the user.
>>
>> It is possible to combine branch stack sampling with PEBS on Intel
>> X86 processors. Depending on the precise_sampling mode, there are
>> certain filtering restrictions. When precise_sampling=1, then
>> there are no filtering restrictions. When precise_sampling > 1,
>> then only ANY|USER|KERNEL filter can be used. This comes from
>> the fact that the kernel uses LBR to compensate for the PEBS
>> off-by-1 skid on the instruction pointer.
>>
>> To demonstrate how the perf_event branch stack sampling interface
>> works, the patchset also modifies perf record to capture taken
>> branches. Similarly perf report is enhanced to display a histogram
>> of taken branches.
>>
>> I would like to thank Roberto Vitillo @ LBL for his work on the perf
>> tool for this.
>>
>> Enough talking, let's take a simple example. Our trivial test program
>> goes like this:
>>
>> void f2(void)
>> {}
>> void f3(void)
>> {}
>> void f1(unsigned long n)
>> {
>>   if (n & 1UL)
>>     f2();
>>   else
>>     f3();
>> }
>> int main(void)
>> {
>>   unsigned long i;
>>
>>   for (i=0; i < N; i++)
>>    f1(i);
>>   return 0;
>> }
>>
>> $ perf record -b any branchy
>> $ perf report -b
>> # Events: 23K cycles
>> #
>> # Overhead  Source Symbol     Target Symbol
>> # ........  ................  ................
>>
>>     18.13%  [.] f1            [.] main
>>     18.10%  [.] main          [.] main
>>     18.01%  [.] main          [.] f1
>>     15.69%  [.] f1            [.] f1
>>      9.11%  [.] f3            [.] f1
>>      6.78%  [.] f1            [.] f3
>>      6.74%  [.] f1            [.] f2
>>      6.71%  [.] f2            [.] f1
>>
>> Of the total number of branches captured, 18.13% were from f1() -> main().
>>
>> Let's make this clearer by filtering the user call branches only:
>>
>> $ perf record -b any_call -e cycles:u branchy
>> $ perf report -b
>> # Events: 19K cycles
>> #
>> # Overhead  Source Symbol              Target Symbol
>> # ........  .........................  .........................
>> #
>>     52.50%  [.] main                   [.] f1
>>     23.99%  [.] f1                     [.] f3
>>     23.48%  [.] f1                     [.] f2
>>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>>      0.01%  [k] _start                 [k] __libc_start_main
>>
>> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
>> The rest is split 50/50 between f1() -> f2() and f1() -> f3() which is expected given
>> that f1() dispatches based on odd vs. even values of n which is constantly increasing.
>>
>>
>> Here is a kernel example, where we want to sample indirect calls:
>> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
>> $ perf report -b
>> #
>> # Overhead  Source Symbol               Target Symbol
>> # ........  ..........................  ..........................
>> #
>>     36.36%  [k] __delay                 [k] delay_tsc
>>      9.09%  [k] ktime_get               [k] read_tsc
>>      9.09%  [k] getnstimeofday          [k] read_tsc
>>      9.09%  [k] notifier_call_chain     [k] tick_notify
>>      4.55%  [k] cpuidle_idle_call       [k] intel_idle
>>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>>      2.27%  [k] handle_irq              [k] handle_edge_irq
>>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>>      2.27%  [k] enqueue_task            [k] enqueue_task_rt
>>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>>      2.27%  [k] do_timer                [k] read_tsc
>>
>
> I am wondering whether appending function call chain details to the branch stack
> would add value from a system performance event analysis perspective.
>

> perf record -g -b any_call,u -e branch-misses:k ls
>
Are you talking about using the content of branch_stack as a substitute
for PERF_SAMPLE_CALLCHAIN? You could, assuming you're sampling
only return branches (not call branches).
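
For reference, requesting return-only branch sampling through the interface
described in the cover letter could look roughly like the sketch below. It
assumes the constant names of the mainline <linux/perf_event.h> header, where
the return filter is spelled PERF_SAMPLE_BRANCH_ANY_RETURN (the cover letter
calls it PERF_SAMPLE_BRANCH_ANY_RET). The helper name and the choice of the
cycles event are illustrative only.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Fill a perf_event_attr that samples CPU cycles and attaches a branch
 * stack restricted to user-level return branches.
 * Assumption: init_return_branch_attr() is a made-up helper name; the
 * constants come from the mainline uapi header.
 */
void init_return_branch_attr(struct perf_event_attr *attr,
			     unsigned long long period)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;	/* must be a PMU event */
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = period;
	/* Ask for the branch stack in every sample. */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	/* Filter: return branches whose target is at user level. */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				   PERF_SAMPLE_BRANCH_ANY_RETURN;
	attr->exclude_kernel = 1;
	attr->disabled = 1;
}
```

The filled attr would then be handed to perf_event_open() through syscall(2),
since glibc provides no wrapper for it.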

> 15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] getenv              [k] strncmp
> 15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] strlen
> 15.38% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] memcpy
> 15.38% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] mmap64
>  7.69% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] __strchrnul
>  7.69% ls libc-2.11.1.so  libc-2.11.1.so  [k] __execvpe           [k] __execve
>  7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] _dl_setup_hash
>  7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] close
>  7.69% ls ld-2.11.1.so    ld-2.11.1.so    [k] _dl_map_object_from_fd  [k] memset
>
> From the example above, we can see
>
> (1) 15.38%  ls  libc-2.11.1.so libc-2.11.1.so [k] getenv [k] strncmp
>
>    '[k] getenv ----> [k] strncmp' happened 15% of the time for the branch-misses
>     event overflow.
>
No, that's not how you have to interpret the data. It's not 15.38% of the time.
It's 15.38% of all the captured branches.

One of the goals of this first perf report mode is to show how branch_stack can
be used to statistically capture cross-module (or cross-function) calls. In
other words, who calls who and how often. This can be used by compilers to drive
inlining, for instance. The fact that on NHM/WSM/SNB, it is possible to capture
prediction is also interesting, especially for indirect calls.
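
To illustrate how the Overhead column is meant to be read, here is a minimal
sketch of the counting involved. The structs and function names are
hypothetical stand-ins, not the perf tool's own code. Each captured branch
contributes one (source, target) pair, and the percentage is that pair's share
of all captured branches. The 15.38% and 7.69% figures in your output are
consistent with 2 and 1 entries out of 13 captured branches, though that total
is only inferred from the percentages.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one captured branch (source -> target). */
struct branch_edge {
	uint64_t from;
	uint64_t to;
};

/* One histogram row: a distinct edge and how many times it was captured. */
struct edge_count {
	struct branch_edge edge;
	size_t count;
};

/*
 * Count how often each distinct (from, to) pair occurs in branches[].
 * rows[] must have room for n entries. Returns the number of rows used.
 * A linear search keeps the sketch short; a real tool would hash.
 */
size_t count_edges(const struct branch_edge *branches, size_t n,
		   struct edge_count *rows)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < used; j++) {
			if (rows[j].edge.from == branches[i].from &&
			    rows[j].edge.to == branches[i].to)
				break;
		}
		if (j == used) {
			/* First time this edge is seen: open a new row. */
			rows[used].edge = branches[i];
			rows[used].count = 0;
			used++;
		}
		rows[j].count++;
	}
	return used;
}

/* Overhead column: this edge's share of all captured branches, in percent. */
double edge_share_percent(size_t count, size_t total)
{
	return total ? 100.0 * (double)count / (double)total : 0.0;
}
```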

> (2) But this lacks information from the source code point of view, such as
>    which code path eventually ended up in the branch (getenv ----> strncmp)
>    15.38% of the time for the event. Any number of function call chains might
>    lead to the branch (getenv ----> strncmp). Having a percentage distribution
>    of the function call chains for every entry in the output (as above) would
>    be a good idea. This would give complete information (though statistically
>    sampled) on the source code control flow that led to the PMU event.
>
>
Yes. I think what you are after is more like gprof or perf report -g, i.e., the
callgraph. You can use the branch_stack feature to collect a statistical
callgraph without the need for frame pointers or unwind info. You'd have to
filter on return branches only, then invert the edge. I think we could probably
reuse the existing perf code that handles CALLCHAIN for this. We just haven't
had a chance to look at this yet. But patches can be added later on.
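
As a rough sketch of the "invert the edge" step (the struct and function names
here are hypothetical and not part of the patchset): a return branch goes from
the callee back into the caller, so swapping source and target gives the
caller -> callee edge.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one branch-stack entry. */
struct branch_rec {
	uint64_t from;	/* address of the branch instruction */
	uint64_t to;	/* address the branch landed on */
};

/*
 * Turn sampled return branches into call-graph edges.
 * A return goes from the callee (from) back into the caller (to),
 * so the call edge caller -> callee is the same pair swapped.
 * Note that the resulting 'to' is the address of the return instruction
 * inside the callee, not the callee's entry point; symbol resolution
 * is assumed to map it to the enclosing function.
 */
void returns_to_call_edges(const struct branch_rec *rets, size_t n,
			   struct branch_rec *calls)
{
	for (size_t i = 0; i < n; i++) {
		/* Read both fields first so rets == calls also works. */
		uint64_t caller = rets[i].to;
		uint64_t callee = rets[i].from;

		calls[i].from = caller;
		calls[i].to = callee;
	}
}
```

The inverted edges could then be aggregated per (caller, callee) pair in the
same way call branches are in the current report mode.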

> (3) <percentage of call_chain> <percentage of branch_chain> [EVENT]
>    There may be situations where these chains are overlapping with each other
>    to some extent.
>
> If we change to the newt output format, we can display the relative percentages of call
> chains when we click on a specific entry of the branch chain, similar to when we
> annotate a symbol in the normal perf report newt output.
>
> Any thoughts?
>

Thread overview: 30+ messages
2012-01-27 20:56 [PATCH v4 00/18] perf: add support for sampling taken branches Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 01/18] perf: add generic taken branch sampling support Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 02/18] perf: add Intel LBR MSR definitions Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 03/18] perf: add Intel X86 LBR sharing logic Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 04/18] perf: sync branch stack sampling with X86 precise_sampling Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 05/18] perf: add LBR mappings for PERF_SAMPLE_BRANCH filters Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 06/18] perf: disable LBR support for older Intel Atom processors Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 07/18] perf: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 08/18] perf: add LBR software filter support " Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 09/18] perf: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
2012-01-30  3:57   ` Anshuman Khandual
2012-01-27 20:56 ` [PATCH v4 10/18] perf: add hook to flush branch_stack on context switch Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 11/18] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 12/18] perf: add support for sampling taken branch to perf record Stephane Eranian
2012-01-31  9:47   ` Anshuman Khandual
2012-01-31 10:31     ` Stephane Eranian
2012-01-31 15:44       ` Anshuman Khandual
2012-01-31 15:48         ` Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 13/18] perf: add support for taken branch sampling to perf report Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 14/18] perf: fix endianness detection in perf.data Stephane Eranian
2012-01-30  5:55   ` Anshuman Khandual
2012-01-27 20:56 ` [PATCH v4 15/18] perf: add ABI reference sizes Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 16/18] perf: enable reading of perf.data files from different ABI rev Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 17/18] perf: fix bug print_event_desc() Stephane Eranian
2012-01-27 20:56 ` [PATCH v4 18/18] perf: make perf able to read file from older ABIs Stephane Eranian
2012-01-31  8:54   ` Anshuman Khandual
2012-01-30  4:16 ` [PATCH v4 00/18] perf: add support for sampling taken branches Anshuman Khandual
2012-01-30 10:15   ` Stephane Eranian
2012-02-01  8:41 ` Anshuman Khandual
2012-02-02 13:23   ` Stephane Eranian
