All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/12] perf_events: add support for sampling taken branches
@ 2011-10-06 14:49 Stephane Eranian
  2011-10-06 14:49 ` [PATCH 01/12] perf_events: add generic taken branch sampling support Stephane Eranian
                   ` (13 more replies)
  0 siblings, 14 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patchset adds an important and useful new feature to
perf_events: branch stack sampling. In other words, the
ability to capture taken branches into each sample.

Statistical sampling of taken branch should not be confused
for branch tracing. Not all branches are necessarily captured

Sampling taken branches is important for basic block profiling,
statistical call graph, function call counts. Many of those
measurements can help drive a compiler optimizer.

The branch stack is a software abstraction which sits on top
of the PMU hardware. As such, it is not available on all
processors. For now, the patch provides the generic interface
and the Intel X86 implementation where it leverages the Last
Branch Record (LBR) feature (from Core2 to SandyBridge).

Branch stack sampling is supported for both per-thread and
system-wide modes.

It is possible to filter the type and privilege level of branches
to sample. The target of the branch is used to determine
the privilege level.

For each branch, the source and destination are captured. On
some hardware platforms, it may be possible to also extract
the target prediction and, in that case, it is also exposed
to end users.

The branch stack can record a variable number of taken
branches per sample. Those branches are always consecutive
in time. The number of branches captured depends on the
filtering and the underlying hardware. On Intel Nehalem
and later, up to 16 consecutive branches can be captured
per sample.

Branch sampling is always coupled with an event. It can
be any PMU event but it can't be a SW or tracepoint event.

Branch sampling is requested by setting a new sample_type
flag called: PERF_SAMPLE_BRANCH_STACK.

To support branch filtering, we introduce a new field
to the perf_event_attr struct: branch_sample_type. We chose
NOT to overload the config1, config2 field because those
are related to the event encoding. Branch stack is a
separate feature which is combined with the event.

The branch_sample_type is a bitmask of possible filters.
The following filters are defined (more can be added):
- PERF_SAMPLE_BRANCH_ANY     : any control flow change
- PERF_SAMPLE_BRANCH_USER    : capture branches when target is at user level
- PERF_SAMPLE_BRANCH_KERNEL  : capture branches when target is at user level
- PERF_SAMPLE_BRANCH_ANY_CALL: capture call branches (incl. syscalls)
- PERF_SAMPLE_BRANCH_ANY_RET : capture return branches (incl. syscall returns)
- PERF_SAMPLE_BRANCH_IND_CALL: capture indirect calls

It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.

When the privilege level is not specified, the branch stack
inherits that of the associated event.

Some processors may not offer hardware branch filtering, e.g., Intel
Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
X86 implementation in this patchset also provides a SW branch filter
which works on a best effort basis. It can compensate for the lack
of LBR filtering. But first and foremost, it helps work around LBR
filtering errata. The goal is to only capture the type of branches
requested by the user.

It is possible to combine branch stack sampling with PEBS on Intel
X86 processors. Depending on the precise_sampling mode, there are
certain filterting restrictions. When precise_sampling=1, then
there are no filtering restrictions. When precise_sampling > 1, 
then only ANY|USER|KERNEL filter can be used. This comes from
the fact that the kernel uses LBR to compensate for the PEBS
off-by-1 skid on the instruction pointer.

To demonstrate how the perf_event branch stack sampling interface
works, the patchset also modifies perf record to capture taken
branches. Similarly perf report is enhanced to display a histogram
of taken branches.

I would like to thank Roberto Vitillo @ LBL for his work on the perf
tool for this.

Enough talking, let's take a simple example. Our trivial test program
goes like this:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

$ perf record -b any branchy
$ perf report -b
# Events: 23K cycles
#
# Overhead  Source Symbol     Target Symbol
# ........  ................  ................

    18.13%  [.] f1            [.] main                          
    18.10%  [.] main          [.] main                          
    18.01%  [.] main          [.] f1                            
    15.69%  [.] f1            [.] f1                            
     9.11%  [.] f3            [.] f1                            
     6.78%  [.] f1            [.] f3                            
     6.74%  [.] f1            [.] f2                            
     6.71%  [.] f2            [.] f1                            

Of the total number of branches captured, 18.13% were from f1() -> main().

Let's make this clearer by filtering the user call branches only:

$ perf record -b any_call -e cycles:u branchy
$ perf report
# Events: 19K cycles
#
# Overhead  Source Symbol              Target Symbol
# ........  .........................  .........................
#
    52.50%  [.] main                   [.] f1                   
    23.99%  [.] f1                     [.] f3                   
    23.48%  [.] f1                     [.] f2                   
     0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
     0.01%  [k] _start                 [k] __libc_start_main    

Now it is more obvious. %52 of all the captured branches where calls from main() -> f1().
The rest is split 50/50 between f1() -> f2() and f1() -> f3() which is expected given
that f1() dispatches based on odd vs. even values of n which is constantly increasing.


Signed-off-by: Stephane Eranian <eranian@google.com>


Roberto Agostino Vitillo (2):
  perf: add support for sampling taken branch to perf record
  perf: add support for taken branch sampling to perf report

Stephane Eranian (10):
  perf_events: add generic taken branch sampling support
  perf_events: add Intel LBR MSR definitions
  perf_events: add Intel X86 LBR sharing logic
  perf_events: sync branch stack sampling with X86 precise_sampling
  perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
  perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
  perf_events: add LBR software filter support for Intel X86
  perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
  perf_events: add hook to flush branch_stack on context switch
  perf: add code to support PERF_SAMPLE_BRANCH_STACK

 arch/alpha/kernel/perf_event.c             |    4 +
 arch/arm/kernel/perf_event.c               |    4 +
 arch/mips/kernel/perf_event.c              |    4 +
 arch/powerpc/kernel/perf_event.c           |    4 +
 arch/sh/kernel/perf_event.c                |    4 +
 arch/sparc/kernel/perf_event.c             |    4 +
 arch/x86/include/asm/msr-index.h           |    7 +
 arch/x86/kernel/cpu/perf_event.c           |   62 +++-
 arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
 arch/x86/kernel/cpu/perf_event_intel.c     |  126 +++++--
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   21 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  529 ++++++++++++++++++++++++++--
 include/linux/perf_event.h                 |   74 ++++-
 kernel/events/core.c                       |  167 +++++++++
 kernel/events/hw_breakpoint.c              |    6 +
 tools/perf/Documentation/perf-record.txt   |   18 +
 tools/perf/Documentation/perf-report.txt   |    7 +
 tools/perf/builtin-record.c                |   75 ++++
 tools/perf/builtin-report.c                |   93 +++++-
 tools/perf/perf.h                          |   17 +
 tools/perf/util/annotate.c                 |    2 +-
 tools/perf/util/event.h                    |    1 +
 tools/perf/util/evsel.c                    |   10 +
 tools/perf/util/hist.c                     |   97 ++++--
 tools/perf/util/hist.h                     |    6 +
 tools/perf/util/session.c                  |   72 ++++
 tools/perf/util/session.h                  |    5 +
 tools/perf/util/sort.c                     |  348 ++++++++++++++-----
 tools/perf/util/sort.h                     |    5 +
 tools/perf/util/symbol.h                   |   13 +
 30 files changed, 1584 insertions(+), 204 deletions(-)

-- 
1.7.4.1


^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 01/12] perf_events: add generic taken branch sampling support
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 16:57   ` Peter Zijlstra
  2011-10-06 17:01   ` Peter Zijlstra
  2011-10-06 14:49 ` [PATCH 02/12] perf_events: add Intel LBR MSR definitions Stephane Eranian
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel X86, it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_USER
  only branches whose targets are at the user level

- PERF_SAMPLE_KERNEL
  only branches whose targets are at the kernel level

- PERF_SAMPLE_ANY
  any type of branches (subject to priv levels filters)

- PERF_SAMPLE_ANY_CALL
  any call branches (may incl. syscall on some arch)

- PERF_SAMPLE_ANY_RET
  any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_IND_CALL
  indirect call branches

Obviously filter may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event are used. It
is possible to collect branches at a priv level different from the
associated event.

The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   21 ++++++----
 include/linux/perf_event.h                 |   64 +++++++++++++++++++++++++++-
 kernel/events/core.c                       |   58 +++++++++++++++++++++++++
 3 files changed, 133 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d202c1b..528287a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -138,9 +138,11 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		cpuc->lbr_entries[i].from  = msr_lastbranch.from;
-		cpuc->lbr_entries[i].to    = msr_lastbranch.to;
-		cpuc->lbr_entries[i].flags = 0;
+		cpuc->lbr_entries[i].from	= msr_lastbranch.from;
+		cpuc->lbr_entries[i].to		= msr_lastbranch.to;
+		cpuc->lbr_entries[i].mispred	= 0;
+		cpuc->lbr_entries[i].predicted	= 0;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
@@ -161,19 +163,22 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 
 	for (i = 0; i < x86_pmu.lbr_nr; i++) {
 		unsigned long lbr_idx = (tos - i) & mask;
-		u64 from, to, flags = 0;
+		u64 from, to, mis = 0, pred = 0;
 
 		rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
 		rdmsrl(x86_pmu.lbr_to   + lbr_idx, to);
 
 		if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
-			flags = !!(from & LBR_FROM_FLAG_MISPRED);
+			mis = !!(from & LBR_FROM_FLAG_MISPRED);
+			pred= !mis;
 			from = (u64)((((s64)from) << 1) >> 1);
 		}
 
-		cpuc->lbr_entries[i].from  = from;
-		cpuc->lbr_entries[i].to    = to;
-		cpuc->lbr_entries[i].flags = flags;
+		cpuc->lbr_entries[i].from	= from;
+		cpuc->lbr_entries[i].to		= to;
+		cpuc->lbr_entries[i].mispred	= mis;
+		cpuc->lbr_entries[i].predicted	= pred;
+		cpuc->lbr_entries[i].reserved	= 0;
 	}
 	cpuc->lbr_stack.nr = i;
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c816075..850bbbe 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -128,11 +128,38 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_PERIOD			= 1U << 8,
 	PERF_SAMPLE_STREAM_ID			= 1U << 9,
 	PERF_SAMPLE_RAW				= 1U << 10,
+	PERF_SAMPLE_BRANCH_STACK		= 1U << 11,
 
-	PERF_SAMPLE_MAX = 1U << 11,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 12,		/* non-ABI */
 };
 
 /*
+ * values to program into branch_sample_type when PERF_SAMPLE_BRANCH is set
+ *
+ * If the user does not pass priv level information via branch_sample_type,
+ * the kernel uses the event's priv level. Branch and event priv levels do
+ * not have to match. Branch priv level is checked for permissions.
+ *
+ * The branch types can be combined, however BRANCH_ANY covers all types
+ * of branches and therefore it supersedes all the other types.
+ */
+enum perf_branch_sample_type {
+	PERF_SAMPLE_BRANCH_USER		= 1U << 0, /* user level branches */
+	PERF_SAMPLE_BRANCH_KERNEL	= 1U << 1, /* kernel level branches */
+
+	PERF_SAMPLE_BRANCH_ANY		= 1U << 2, /* any branch types */
+	PERF_SAMPLE_BRANCH_ANY_CALL	= 1U << 3, /* any call branch */
+	PERF_SAMPLE_BRANCH_ANY_RETURN	= 1U << 4, /* any return branch */
+	PERF_SAMPLE_BRANCH_IND_CALL	= 1U << 5, /* indirect calls */
+
+	PERF_SAMPLE_BRANCH_MAX		= 1U << 6,/* non-ABI */
+};
+
+#define PERF_SAMPLE_BRANCH_PLM_ALL \
+	(PERF_SAMPLE_BRANCH_USER|\
+	 PERF_SAMPLE_BRANCH_KERNEL)
+
+/*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
  *
@@ -236,6 +263,7 @@ struct perf_event_attr {
 		__u64		bp_len;
 		__u64		config2; /* extension of config1 */
 	};
+	__u64	branch_sample_type; /* enum branch_sample_type */
 };
 
 /*
@@ -452,6 +480,8 @@ enum perf_event_type {
 	 *
 	 *	{ u32			size;
 	 *	  char                  data[size];}&& PERF_SAMPLE_RAW
+	 *
+	 *	{ u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -524,12 +554,33 @@ struct perf_raw_record {
 	void				*data;
 };
 
+/*
+ * single taken branch record layout:
+ *
+ *      from: source instruction (may not always be a branch insn)
+ *        to: branch target
+ *   mispred: branch target was mispredicted
+ * predicted: branch target was predicted
+ *
+ * support for mispred, predicted is optional. In case it
+ * is not supported mispred = predicted = 0.
+ */
 struct perf_branch_entry {
 	__u64				from;
 	__u64				to;
-	__u64				flags;
+	struct {
+		__u64			mispred:1,  /* target mispredicted */
+					predicted:1,/* target predicted */
+					reserved:62;
+	};
 };
 
+/*
+ * branch stack layout:
+ *  nr: number of taken branches stored in entries[]
+ *
+ * Note that nr can vary from sample to sample
+ */
 struct perf_branch_stack {
 	__u64				nr;
 	struct perf_branch_entry	entries[0];
@@ -560,7 +611,9 @@ struct hw_perf_event {
 			unsigned long	event_base;
 			int		idx;
 			int		last_cpu;
+
 			struct hw_perf_event_extra extra_reg;
+			struct hw_perf_event_extra branch_reg;
 		};
 		struct { /* software */
 			struct hrtimer	hrtimer;
@@ -988,12 +1041,14 @@ struct perf_sample_data {
 	u64				period;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
+	struct perf_branch_stack	*br_stack;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data, u64 addr)
 {
 	data->addr = addr;
 	data->raw  = NULL;
+	data->br_stack = NULL;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
@@ -1132,6 +1187,11 @@ extern void perf_bp_event(struct perf_event *event, void *data);
 # define perf_instruction_pointer(regs)	instruction_pointer(regs)
 #endif
 
+static inline bool has_branch_stack(struct perf_event *event)
+{
+	return event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK;
+}
+
 extern int perf_output_begin(struct perf_output_handle *handle,
 			     struct perf_event *event, unsigned int size);
 extern void perf_output_end(struct perf_output_handle *handle);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0f85778..70a8975 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3986,6 +3986,24 @@ void perf_output_sample(struct perf_output_handle *handle,
 			}
 		}
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		if (data->br_stack) {
+			size_t size;
+
+			size = data->br_stack->nr
+			     * sizeof(struct perf_branch_entry);
+
+			perf_output_put(handle, data->br_stack->nr);
+			perf_output_copy(handle, data->br_stack->entries, size);
+		} else {
+			/*
+			 * we always store at least the value of nr
+			 */
+			u64 nr = 0;
+			perf_output_put(handle, nr);
+		}
+	}
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
@@ -4028,6 +4046,15 @@ void perf_prepare_sample(struct perf_event_header *header,
 		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
+
+	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		int size = sizeof(u64); /* nr */
+		if (data->br_stack) {
+			size += data->br_stack->nr
+			      * sizeof(struct perf_branch_entry);
+		}
+		header->size += size;
+	}
 }
 
 static void perf_event_output(struct perf_event *event,
@@ -5978,6 +6005,37 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (attr->read_format & ~(PERF_FORMAT_MAX-1))
 		return -EINVAL;
 
+	if (attr->sample_type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 mask = attr->branch_sample_type;
+
+		/* only using defined bits */
+		if (mask & ~(PERF_SAMPLE_BRANCH_MAX-1))
+			return -EINVAL;
+
+		/* at least one branch bit must be set */
+		if (!(mask & ~PERF_SAMPLE_BRANCH_PLM_ALL))
+			return -EINVAL;
+
+		/* kernel level capture */
+		if ((mask & PERF_SAMPLE_BRANCH_KERNEL)
+		    && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+			return -EACCES;
+
+		/* propagate priv level, when not set for branch */
+		if (!(mask & PERF_SAMPLE_BRANCH_PLM_ALL)) {
+
+			/* exclude_kernel checked on syscall entry */
+			if (!attr->exclude_kernel)
+				mask |= PERF_SAMPLE_BRANCH_KERNEL;
+
+			if (!attr->exclude_user)
+				mask |= PERF_SAMPLE_BRANCH_USER;
+			/*
+			 * adjust user setting (for HW filter setup)
+			 */
+			attr->branch_sample_type = mask;
+		}
+	}
 out:
 	return ret;
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 02/12] perf_events: add Intel LBR MSR definitions
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
  2011-10-06 14:49 ` [PATCH 01/12] perf_events: add generic taken branch sampling support Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 14:49 ` [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic Stephane Eranian
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architected LBR MSR:
LBR_SELECT, LBRT_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/asm/msr-index.h           |    7 +++++++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   18 +++++++++---------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d52609a..e8f5fbf 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -56,6 +56,13 @@
 #define MSR_OFFCORE_RSP_0		0x000001a6
 #define MSR_OFFCORE_RSP_1		0x000001a7
 
+#define MSR_LBR_SELECT			0x000001c8
+#define MSR_LBR_TOS			0x000001c9
+#define MSR_LBR_NHM_FROM		0x00000680
+#define MSR_LBR_NHM_TO			0x000006c0
+#define MSR_LBR_CORE_FROM		0x00000040
+#define MSR_LBR_CORE_TO			0x00000060
+
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 528287a..e8a6851 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -199,25 +199,25 @@ static void intel_pmu_lbr_read(void)
 static void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
 
 static void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x680;
-	x86_pmu.lbr_to     = 0x6c0;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
 }
 
 static void intel_pmu_lbr_init_atom(void)
 {
 	x86_pmu.lbr_nr	   = 8;
-	x86_pmu.lbr_tos    = 0x01c9;
-	x86_pmu.lbr_from   = 0x40;
-	x86_pmu.lbr_to     = 0x60;
+	x86_pmu.lbr_tos    = MSR_LBR_TOS;
+	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
+	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 }
 
 #endif /* CONFIG_CPU_SUP_INTEL */
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
  2011-10-06 14:49 ` [PATCH 01/12] perf_events: add generic taken branch sampling support Stephane Eranian
  2011-10-06 14:49 ` [PATCH 02/12] perf_events: add Intel LBR MSR definitions Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 14:49 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling Stephane Eranian
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

The Intel LBR on some recent processor is capable
of filtering branches by type. The filter is configurable
via the LBR_SELECT MSR register.

There are limitation on how this register can be used.

On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads
when HT is on. It is private to each core when HT is off.

On SandyBridge, the LBR_SELECT register is private to each thread
when HT is on. It is private to each core when HT is off.

The kernel must manage the sharing of LBR_SELECT. It allows
multiple users on the same logical CPU to use LBR_SELECT as
long as they program it with the same value. Across sibling
CPUs (HT threads), the same restriction applies on NHM/WSM.

This patch implements this sharing logic by leveraging the
mechanism put in place for managing the offcore_response
shared MSR.

We modify __intel_shared_reg_get_constraints() to cause
x86_get_event_constraint() to be called because LBR may
be associated with events that may be counter constrained.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |    8 +++
 arch/x86/kernel/cpu/perf_event_intel.c |   76 ++++++++++++++++++++------------
 2 files changed, 55 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index cfa62ec..6fecd36 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -62,6 +62,7 @@ enum extra_reg_type {
 
 	EXTRA_REG_RSP_0 = 0,	/* offcore_response_0 */
 	EXTRA_REG_RSP_1 = 1,	/* offcore_response_1 */
+	EXTRA_REG_LBR   = 2,	/* lbr_select */
 
 	EXTRA_REG_MAX		/* number of entries needed */
 };
@@ -118,6 +119,7 @@ struct cpu_hw_events {
 	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
+	struct er_account		*lbr_sel;
 
 	/*
 	 * manage shared (per-core, per-cpu) registers
@@ -284,6 +286,8 @@ struct x86_pmu {
 	/*
 	 * Intel LBR
 	 */
+	u64		lbr_sel_mask;		   /* valid bits in LBR_SELECT */
+	const int	*lbr_sel_map;		   /* lbr_select mappings */
 	unsigned long	lbr_tos, lbr_from, lbr_to; /* MSR base regs       */
 	int		lbr_nr;			   /* hardware stack size */
 
@@ -720,6 +724,10 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	/* mark unused */
 	event->hw.extra_reg.idx = EXTRA_REG_NONE;
 
+	/* mark not used */
+	event->hw.extra_reg.idx = EXTRA_REG_NONE;
+	event->hw.branch_reg.idx = EXTRA_REG_NONE;
+
 	return x86_pmu.hw_config(event);
 }
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index f88af2c..e2639f4 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1112,17 +1112,17 @@ static bool intel_try_alt_er(struct perf_event *event, int orig_idx)
  */
 static struct event_constraint *
 __intel_shared_reg_get_constraints(struct cpu_hw_events *cpuc,
-				   struct perf_event *event)
+				   struct perf_event *event,
+				   struct hw_perf_event_extra *reg)
 {
 	struct event_constraint *c = &emptyconstraint;
-	struct hw_perf_event_extra *reg = &event->hw.extra_reg;
 	struct er_account *era;
 	unsigned long flags;
 	int orig_idx = reg->idx;
 
 	/* already allocated shared msr */
 	if (reg->alloc)
-		return &unconstrained;
+		return NULL; /* call x86_get_event_constraint() */
 
 again:
 	era = &cpuc->shared_regs->regs[reg->idx];
@@ -1145,14 +1145,10 @@ again:
 		reg->alloc = 1;
 
 		/*
-		 * All events using extra_reg are unconstrained.
-		 * Avoids calling x86_get_event_constraints()
-		 *
-		 * Must revisit if extra_reg controlling events
-		 * ever have constraints. Worst case we go through
-		 * the regular event constraint table.
+		 * need to call x86_get_event_constraint()
+		 * to check if associated event has constraints
 		 */
-		c = &unconstrained;
+		c = NULL;
 	} else if (intel_try_alt_er(event, orig_idx)) {
 		raw_spin_unlock(&era->lock);
 		goto again;
@@ -1189,11 +1185,23 @@ static struct event_constraint *
 intel_shared_regs_constraints(struct cpu_hw_events *cpuc,
 			      struct perf_event *event)
 {
-	struct event_constraint *c = NULL;
-
-	if (event->hw.extra_reg.idx != EXTRA_REG_NONE)
-		c = __intel_shared_reg_get_constraints(cpuc, event);
-
+	struct event_constraint *c = NULL, *d;
+	struct hw_perf_event_extra *xreg, *breg;
+
+	xreg = &event->hw.extra_reg;
+	if (xreg->idx != EXTRA_REG_NONE) {
+		c = __intel_shared_reg_get_constraints(cpuc, event, xreg);
+		if (c == &emptyconstraint)
+			return c;
+	}
+	breg = &event->hw.branch_reg;
+	if (breg->idx != EXTRA_REG_NONE) {
+		d = __intel_shared_reg_get_constraints(cpuc, event, breg);
+		if (d == &emptyconstraint) {
+			__intel_shared_reg_put_constraints(cpuc, xreg);
+			c = d;
+		}
+	}
 	return c;
 }
 
@@ -1226,6 +1234,10 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
 	reg = &event->hw.extra_reg;
 	if (reg->idx != EXTRA_REG_NONE)
 		__intel_shared_reg_put_constraints(cpuc, reg);
+
+	reg = &event->hw.branch_reg;
+	if (reg->idx != EXTRA_REG_NONE)
+		__intel_shared_reg_put_constraints(cpuc, reg);
 }
 
 static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
@@ -1332,7 +1344,7 @@ static int intel_pmu_cpu_prepare(int cpu)
 {
 	struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
 
-	if (!x86_pmu.extra_regs)
+	if (!(x86_pmu.extra_regs || x86_pmu.lbr_sel_map))
 		return NOTIFY_OK;
 
 	cpuc->shared_regs = allocate_shared_regs(cpu);
@@ -1354,22 +1366,28 @@ static void intel_pmu_cpu_starting(int cpu)
 	 */
 	intel_pmu_lbr_reset();
 
-	if (!cpuc->shared_regs || (x86_pmu.er_flags & ERF_NO_HT_SHARING))
-		return;
-
-	for_each_cpu(i, topology_thread_cpumask(cpu)) {
-		struct intel_shared_regs *pc;
+	cpuc->lbr_sel = NULL;
 
-		pc = per_cpu(cpu_hw_events, i).shared_regs;
-		if (pc && pc->core_id == core_id) {
-			kfree(cpuc->shared_regs);
-			cpuc->shared_regs = pc;
-			break;
-		}
+	if (!cpuc->shared_regs)
+ 		return;
+
+	if (!(x86_pmu.er_flags & ERF_NO_HT_SHARING)) {
+		for_each_cpu(i, topology_thread_cpumask(cpu)) {
+			struct intel_shared_regs *pc;
+
+			pc = per_cpu(cpu_hw_events, i).shared_regs;
+			if (pc && pc->core_id == core_id) {
+				kfree(cpuc->shared_regs);
+				cpuc->shared_regs = pc;
+				break;
+			}
+ 		}
+		cpuc->shared_regs->core_id = core_id;
+		cpuc->shared_regs->refcnt++;
 	}
 
-	cpuc->shared_regs->core_id = core_id;
-	cpuc->shared_regs->refcnt++;
+	if (x86_pmu.lbr_sel_map)
+		cpuc->lbr_sel = &cpuc->shared_regs->regs[EXTRA_REG_LBR];
 }
 
 static void intel_pmu_cpu_dying(int cpu)
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (2 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 17:25   ` Peter Zijlstra
  2011-10-06 14:49 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters Stephane Eranian
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

If precise sampling is enabled on Intel X86, then perf_event uses PEBS.
To correct for the off-by-one error of PEBS, perf_event uses LBR when
precise_sample > 1.

On Intel X86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
therefore both features must be coordinated as they may not
configure LBR the same way.

For PEBS, LBR needs to capture all branches at all priv levels.
This patch sets this up.

The configuration of PERF_SAMPLE_BRANCH_STACK may not be compatible
in which case an error must be returned.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6fecd36..711cbaf 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -654,6 +654,7 @@ static int x86_setup_perfctr(struct perf_event *event)
 static int x86_pmu_hw_config(struct perf_event *event)
 {
 	if (event->attr.precise_ip) {
+		u64 *br_type, br_sel;
 		int precise = 0;
 
 		/* Support for constant skid */
@@ -667,6 +668,27 @@ static int x86_pmu_hw_config(struct perf_event *event)
 
 		if (event->attr.precise_ip > precise)
 			return -EOPNOTSUPP;
+		/*
+		 * check that PEBS LBR correction does not conflict with
+		 * whatever the user is asking with attr->branch_sample_type
+		 */
+		if (event->attr.precise_ip > 1) {
+
+			br_type = &event->attr.branch_sample_type;
+
+			if (has_branch_stack(event)) {
+				br_sel = *br_type & PERF_SAMPLE_BRANCH_ANY;
+				if (br_sel != PERF_SAMPLE_BRANCH_ANY)
+					return -EOPNOTSUPP;
+			} else {
+				/*
+				 * For PEBS fixups, we capture all
+				 * the branches at all priv levels
+				 */
+				*br_type = PERF_SAMPLE_BRANCH_ANY
+					 | PERF_SAMPLE_BRANCH_PLM_ALL;
+			}
+		}
 	}
 
 	/*
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (3 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 14:49 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
filters to the actual Intel X86 LBR filters, whenever they exist.

The patch also adds a restriction on Intel Atom, whereby only
stepping 10 (PineView) and more recent are supported. Older models,
do not have a functional LBR (does not freeze on PMU interrupt).

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c     |    2 +-
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  110 +++++++++++++++++++++++++++-
 2 files changed, 109 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index e2639f4..740857f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1612,7 +1612,7 @@ static __init int intel_pmu_init(void)
 		memcpy(hw_cache_event_ids, snb_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
 
-		intel_pmu_lbr_init_nhm();
+		intel_pmu_lbr_init_snb();
 
 		x86_pmu.event_constraints = intel_snb_event_constraints;
 		x86_pmu.pebs_constraints = intel_snb_pebs_events;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index e8a6851..abcabe3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -8,6 +8,47 @@ enum {
 };
 
 /*
+ * Intel LBR_SELECT bits
+ * Intel Vol3a, April 2011, Section 16.7 Table 16-10
+ *
+ * Hardware branch filter (not available on all CPUs)
+ */
+#define LBR_KERNEL_BIT		0 /* do not capture at ring0 */
+#define LBR_USER_BIT		1 /* do not capture at ring > 0 */
+#define LBR_JCC_BIT		2 /* do not capture conditional branches */
+#define LBR_REL_CALL_BIT	3 /* do not capture relative calls */
+#define LBR_IND_CALL_BIT	4 /* do not capture indirect calls */
+#define LBR_RETURN_BIT		5 /* do not capture near returns */
+#define LBR_IND_JMP_BIT		6 /* do not capture indirect jumps */
+#define LBR_REL_JMP_BIT		7 /* do not capture relative jumps */
+#define LBR_FAR_BIT		8 /* do not capture far branches */
+
+#define LBR_KERNEL	(1 << LBR_KERNEL_BIT)
+#define LBR_USER	(1 << LBR_USER_BIT)
+#define LBR_JCC		(1 << LBR_JCC_BIT)
+#define LBR_REL_CALL	(1 << LBR_REL_CALL_BIT)
+#define LBR_IND_CALL	(1 << LBR_IND_CALL_BIT)
+#define LBR_RETURN	(1 << LBR_RETURN_BIT)
+#define LBR_REL_JMP	(1 << LBR_REL_JMP_BIT)
+#define LBR_IND_JMP	(1 << LBR_IND_JMP_BIT)
+#define LBR_FAR		(1 << LBR_FAR_BIT)
+
+#define LBR_PLM (LBR_KERNEL | LBR_USER)
+
+#define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+
+#define LBR_ANY		 \
+	(LBR_JCC	|\
+	 LBR_REL_CALL	|\
+	 LBR_IND_CALL	|\
+	 LBR_RETURN	|\
+	 LBR_REL_JMP	|\
+	 LBR_IND_JMP	|\
+	 LBR_FAR)
+
+#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -147,8 +188,6 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 	cpuc->lbr_stack.nr = i;
 }
 
-#define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
-
 /*
  * Due to lack of segmentation in Linux the effective address (offset)
  * is the same as the linear address, allowing us to merge the LIP and EIP
@@ -196,28 +235,95 @@ static void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_64(cpuc);
 }
 
+/*
+ * Map interface branch filters onto LBR filters
+ */
+static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX]=
+{
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] =
+		LBR_RETURN | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include REL_JMP+IND_JMP to get CALL branches
+	 */
+	[PERF_SAMPLE_BRANCH_ANY_CALL] =
+		LBR_REL_CALL | LBR_IND_CALL | LBR_REL_JMP | LBR_IND_JMP | LBR_FAR,
+	/*
+	 * NHM/WSM erratum: must include IND_JMP to capture IND_CALL
+	 */
+	[PERF_SAMPLE_BRANCH_IND_CALL] = LBR_IND_CALL | LBR_IND_JMP,
+};
+
+static const int snb_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX]=
+{
+	[PERF_SAMPLE_BRANCH_ANY]        = LBR_ANY,
+	[PERF_SAMPLE_BRANCH_USER]       = LBR_USER,
+	[PERF_SAMPLE_BRANCH_KERNEL]     = LBR_KERNEL,
+	[PERF_SAMPLE_BRANCH_ANY_RETURN] = LBR_RETURN | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_ANY_CALL]   = LBR_REL_CALL | LBR_IND_CALL | LBR_FAR,
+	[PERF_SAMPLE_BRANCH_IND_CALL]   = LBR_IND_CALL,
+};
+
+/* core */
 static void intel_pmu_lbr_init_core(void)
 {
 	x86_pmu.lbr_nr     = 4;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("4-deep LBR, ");
 }
 
+/* nehalem/westmere */
 static void intel_pmu_lbr_init_nhm(void)
 {
 	x86_pmu.lbr_nr     = 16;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_NHM_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
+}
+
+/* sandy bridge */
+static void intel_pmu_lbr_init_snb(void)
+{
+	x86_pmu.lbr_nr	 = 16;
+	x86_pmu.lbr_tos	 = MSR_LBR_TOS;
+	x86_pmu.lbr_from = MSR_LBR_NHM_FROM;
+	x86_pmu.lbr_to   = MSR_LBR_NHM_TO;
+
+	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
+	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
+
+	pr_cont("16-deep LBR, ");
 }
 
+/* atom */
 static void intel_pmu_lbr_init_atom(void)
 {
+	/*
+	 * only models starting at stepping 10 seems
+	 * to have an operational LBR which can freeze
+	 * on PMU interrupt
+	 */
+	if (boot_cpu_data.x86_mask < 10) {
+		pr_cont("LBR disabled due to erratum");
+		return;
+	}
+
 	x86_pmu.lbr_nr	   = 8;
 	x86_pmu.lbr_tos    = MSR_LBR_TOS;
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
+
+	pr_cont("8-deep LBR, ");
 }
 
 #endif /* CONFIG_CPU_SUP_INTEL */
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (4 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 17:54   ` Peter Zijlstra
  2011-10-06 18:05   ` Peter Zijlstra
  2011-10-06 14:49 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch implements PERF_SAMPLE_BRANCH support for Intel
X86 processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.

The patch adds the hooks in the PMU irq handler to save the LBR
on counter overflow for both regular and PEBS modes.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c     |   35 +++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   10 ++--
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |   73 +++++++++++++++++++++++++++-
 include/linux/perf_event.h                 |    3 +
 4 files changed, 113 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 740857f..b5e7c52 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -727,6 +727,19 @@ static __initconst const u64 atom_hw_cache_event_ids
  },
 };
 
+static inline bool intel_pmu_needs_lbr_smpl(struct perf_event *event)
+{
+	/* user explicitly requested branch sampling */
+	if (has_branch_stack(event))
+		return true;
+
+	/* implicit branch sampling to correct PEBS skid */
+	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
+		return true;
+
+	return false;
+}
+
 static void intel_pmu_disable_all(void)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -876,6 +889,13 @@ static void intel_pmu_disable_event(struct perf_event *event)
 		return;
 	}
 
+	/*
+	 * must disable before any actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_disable(event);
+
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_disable_fixed(hwc);
 		return;
@@ -929,6 +949,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
 		intel_pmu_enable_bts(hwc->config);
 		return;
 	}
+	/*
+	 * must enabled before any actual event
+	 * because any event may be combined with LBR
+	 */
+	if (intel_pmu_needs_lbr_smpl(event))
+		intel_pmu_lbr_enable(event);
 
 	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
 		intel_pmu_enable_fixed(hwc);
@@ -1046,6 +1072,9 @@ again:
 
 		data.period = event->hw.last_period;
 
+		if (has_branch_stack(event))
+			data.br_stack = &cpuc->lbr_stack;
+
 		if (perf_event_overflow(event, &data, regs))
 			x86_pmu_stop(event, 0);
 	}
@@ -1279,6 +1308,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		event->hw.config = alt_config;
 	}
 
+	if (intel_pmu_needs_lbr_smpl(event)) {
+		ret = intel_pmu_setup_lbr_filter(event);
+		if (ret)
+			return ret;
+	}
+
 	if (event->attr.type != PERF_TYPE_RAW)
 		return 0;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 1b1ef3a..238d82f 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -455,9 +455,6 @@ static void intel_pmu_pebs_enable(struct perf_event *event)
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
 	WARN_ON_ONCE(cpuc->enabled);
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_enable(event);
 }
 
 static void intel_pmu_pebs_disable(struct perf_event *event)
@@ -470,9 +467,6 @@ static void intel_pmu_pebs_disable(struct perf_event *event)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
 
 	hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
-
-	if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
-		intel_pmu_lbr_disable(event);
 }
 
 static void intel_pmu_pebs_enable_all(void)
@@ -586,6 +580,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	 * both formats and we don't use the other fields in this
 	 * routine.
 	 */
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	struct pebs_record_core *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
@@ -616,6 +611,9 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	else
 		regs.flags &= ~PERF_EFLAGS_EXACT;
 
+	if (has_branch_stack(event))
+		data.br_stack = &cpuc->lbr_stack;
+
 	if (perf_event_overflow(event, &data, &regs))
 		x86_pmu_stop(event, 0);
 }
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index abcabe3..76c4639 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -36,6 +36,7 @@ enum {
 #define LBR_PLM (LBR_KERNEL | LBR_USER)
 
 #define LBR_SEL_MASK	0x1ff /* valid bits in LBR_SELECT */
+#define LBR_NOT_SUPP	-1    /* LBR filter not supported */
 
 #define LBR_ANY		 \
 	(LBR_JCC	|\
@@ -48,6 +49,10 @@ enum {
 
 #define LBR_FROM_FLAG_MISPRED  (1ULL << 63)
 
+#define for_each_branch_sample_type(x) \
+	for ((x) = PERF_SAMPLE_BRANCH_USER; \
+	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
+
 /*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
@@ -56,6 +61,10 @@ enum {
 static void __intel_pmu_lbr_enable(void)
 {
 	u64 debugctl;
+	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+	if (cpuc->lbr_sel)
+		wrmsrl(MSR_LBR_SELECT, cpuc->lbr_sel->config);
 
 	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
 	debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
@@ -113,7 +122,6 @@ static void intel_pmu_lbr_enable(struct perf_event *event)
 	 * Reset the LBR stack if we changed task context to
 	 * avoid data leaks.
 	 */
-
 	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
@@ -132,8 +140,11 @@ static void intel_pmu_lbr_disable(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	if (cpuc->enabled && !cpuc->lbr_users)
+	if (cpuc->enabled && !cpuc->lbr_users) {
 		__intel_pmu_lbr_disable();
+		/* avoid stale pointer */
+		cpuc->lbr_context = NULL;
+	}
 }
 
 static void intel_pmu_lbr_enable_all(void)
@@ -152,6 +163,9 @@ static void intel_pmu_lbr_disable_all(void)
 		__intel_pmu_lbr_disable();
 }
 
+/*
+ * TOS = most recently recorded branch
+ */
 static inline u64 intel_pmu_lbr_tos(void)
 {
 	u64 tos;
@@ -236,6 +250,61 @@ static void intel_pmu_lbr_read(void)
 }
 
 /*
+ * setup the HW LBR filter
+ * Used only when available, may not be enough to disambiguate
+ * all branches, may need the help of the SW filter
+ */
+static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
+{
+	struct hw_perf_event_extra *reg;
+	u64 br_type = event->attr.branch_sample_type;
+	u64 mask = 0, m;
+	u64 v;
+
+	for_each_branch_sample_type(m) {
+		if (!(br_type & m))
+			continue;
+
+		v = x86_pmu.lbr_sel_map[m];
+		if (v == LBR_NOT_SUPP)
+			return -EOPNOTSUPP;
+		mask |= v;
+
+		if (m == PERF_SAMPLE_BRANCH_ANY)
+			break;
+	}
+	reg = &event->hw.branch_reg;
+	reg->idx = EXTRA_REG_LBR;
+
+	/* LBR_SELECT operates in suppress mode so invert mask */
+	reg->config = ~mask & x86_pmu.lbr_sel_mask;
+
+	return 0;
+}
+
+static int intel_pmu_setup_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+
+	/*
+	 * no LBR on this PMU
+	 */
+	if (!x86_pmu.lbr_nr)
+		return -EOPNOTSUPP;
+
+	/*
+	 * if no LBR HW filter, users can only
+	 * capture all branches
+	 */
+	if (!x86_pmu.lbr_sel_map) {
+		if (br_type != PERF_SAMPLE_BRANCH_ALL)
+			return -EOPNOTSUPP;
+		return 0;
+	}
+	return intel_pmu_setup_hw_lbr_filter(event);
+}
+
+/*
  * Map interface branch filters onto LBR filters
  */
 static const int nhm_lbr_sel_map[PERF_SAMPLE_BRANCH_MAX]=
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 850bbbe..0819364 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -159,6 +159,9 @@ enum perf_branch_sample_type {
 	(PERF_SAMPLE_BRANCH_USER|\
 	 PERF_SAMPLE_BRANCH_KERNEL)
 
+#define PERF_SAMPLE_BRANCH_ALL \
+	(PERF_SAMPLE_BRANCH_PLM_ALL|PERF_SAMPLE_BRANCH_ANY)
+
 /*
  * The format of the data returned by read() on a perf event fd,
  * as specified by attr.read_format:
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (5 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 15:32   ` Andi Kleen
  2011-10-06 14:49 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds an internal sofware filter to complement
the (optional) LBR hardware filter.

The software filter is necessary:
- as a substitute when there is no HW LBR filter (e.g., Atom, Core)
- to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
- to provide finer grain filtering (e.g., all processors)

Sometimes, the LBR HW filter cannot distinguish between two types
of branches. For instance, to capture syscall as CALLS, it is necessary
to enable the LBR_FAR filter which will also capture JMP instructions.
Thus, a second pass is necessary to filter those out, this is what the
SW filter can do.

The SW filter is built on top of the internal x86 disassembler. It
is a best effort filter especially for user level code. It is subject
to the availability of the text page of the program.

The SW filter is enabled on all Intel X86 processors. It is bypassed
when the user is capturing all branches at all priv levels.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c           |    1 +
 arch/x86/kernel/cpu/perf_event_intel_ds.c  |   11 -
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  323 +++++++++++++++++++++++++++-
 3 files changed, 316 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 711cbaf..80aa2f1 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -120,6 +120,7 @@ struct cpu_hw_events {
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
 	struct er_account		*lbr_sel;
+	u64				br_sel;
 
 	/*
 	 * manage shared (per-core, per-cpu) registers
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 238d82f..085f412 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -485,17 +485,6 @@ static void intel_pmu_pebs_disable_all(void)
 		wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
 }
 
-#include <asm/insn.h>
-
-static inline bool kernel_ip(unsigned long ip)
-{
-#ifdef CONFIG_X86_32
-	return ip > PAGE_OFFSET;
-#else
-	return (long)ip < 0;
-#endif
-}
-
 static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 76c4639..42baefc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -1,5 +1,7 @@
 #ifdef CONFIG_CPU_SUP_INTEL
 
+#include <asm/insn.h>
+
 enum {
 	LBR_FORMAT_32		= 0x00,
 	LBR_FORMAT_LIP		= 0x01,
@@ -54,6 +56,62 @@ enum {
 	     (x) < PERF_SAMPLE_BRANCH_MAX; (x) <<= 1)
 
 /*
+ * X86 control flow change classification
+ * X86 control flow changes include branches, interrupts, traps, faults
+ */
+enum {
+        X86_BR_NONE    = 0,      /* unknown */
+
+        X86_BR_USER    = 1 << 0, /* branch target is user */
+        X86_BR_KERNEL  = 1 << 1, /* branch target is kernel */
+
+        X86_BR_CALL    = 1 << 2, /* call */
+        X86_BR_RET     = 1 << 3, /* return */
+        X86_BR_SYSCALL = 1 << 4, /* syscall */
+        X86_BR_SYSRET  = 1 << 5, /* syscall return */
+        X86_BR_INT     = 1 << 6, /* sw interrupt */
+        X86_BR_IRET    = 1 << 7, /* return from interrupt */
+        X86_BR_JCC     = 1 << 8, /* conditional */
+        X86_BR_JMP     = 1 << 9, /* jump */
+        X86_BR_IRQ     = 1 << 10,/* hw interrupt or trap or fault */
+        X86_BR_IND_CALL= 1 << 11,/* indirect calls */
+};
+
+#define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
+
+#define X86_BR_ANY       \
+        (X86_BR_CALL    |\
+         X86_BR_RET     |\
+         X86_BR_SYSCALL |\
+         X86_BR_SYSRET  |\
+         X86_BR_INT     |\
+         X86_BR_IRET    |\
+         X86_BR_JCC     |\
+         X86_BR_JMP	 |\
+	 X86_BR_IRQ	 |\
+	 X86_BR_IND_CALL)
+
+#define X86_BR_ALL (X86_BR_PLM | X86_BR_ANY)
+
+#define X86_BR_ANY_CALL 	 \
+	(X86_BR_CALL		|\
+	 X86_BR_IND_CALL	|\
+	 X86_BR_SYSCALL		|\
+	 X86_BR_IRQ		|\
+	 X86_BR_INT)
+
+static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
+
+static inline bool kernel_ip(unsigned long ip)
+{
+#ifdef CONFIG_X86_32
+	return ip > PAGE_OFFSET;
+#else
+	return (long)ip < 0;
+#endif
+}
+
+/*
  * We only support LBR implementations that have FREEZE_LBRS_ON_PMI
  * otherwise it becomes near impossible to get a reliable stack.
  */
@@ -126,6 +184,7 @@ static void intel_pmu_lbr_enable(struct perf_event *event)
 		intel_pmu_lbr_reset();
 		cpuc->lbr_context = event->ctx;
 	}
+	cpuc->br_sel = event->hw.branch_reg.reg;
 
 	cpuc->lbr_users++;
 }
@@ -247,6 +306,45 @@ static void intel_pmu_lbr_read(void)
 		intel_pmu_lbr_read_32(cpuc);
 	else
 		intel_pmu_lbr_read_64(cpuc);
+
+	intel_pmu_lbr_filter(cpuc);
+}
+
+/*
+ * SW filter is used:
+ * - in case there is no HW filter
+ * - in case the HW filter has errata or limitations
+ */
+static void intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
+{
+	u64 br_type = event->attr.branch_sample_type;
+        int mask = 0;
+
+        if (br_type & PERF_SAMPLE_BRANCH_USER)
+                mask |= X86_BR_USER;
+
+        if (br_type & PERF_SAMPLE_BRANCH_KERNEL)
+                mask |= X86_BR_KERNEL;
+
+        if (br_type & PERF_SAMPLE_BRANCH_ANY) {
+                mask |= X86_BR_ANY;
+                goto done;
+        }
+
+        if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+                mask |= X86_BR_ANY_CALL;
+
+        if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+                mask |= X86_BR_RET | X86_BR_IRET | X86_BR_SYSRET;
+
+        if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
+                mask |= X86_BR_IND_CALL;
+done:
+        /*
+         * stash actual user request into reg, it may
+         * be used by fixup code for some CPU
+         */
+        event->hw.branch_reg.reg = mask;
 }
 
 /*
@@ -284,7 +382,7 @@ static int intel_pmu_setup_hw_lbr_filter(struct perf_event *event)
 
 static int intel_pmu_setup_lbr_filter(struct perf_event *event)
 {
-	u64 br_type = event->attr.branch_sample_type;
+	int ret = 0;
 
 	/*
 	 * no LBR on this PMU
@@ -293,15 +391,203 @@ static int intel_pmu_setup_lbr_filter(struct perf_event *event)
 		return -EOPNOTSUPP;
 
 	/*
-	 * if no LBR HW filter, users can only
-	 * capture all branches
+	 * setup SW LBR filter
 	 */
-	if (!x86_pmu.lbr_sel_map) {
-		if (br_type != PERF_SAMPLE_BRANCH_ALL)
-			return -EOPNOTSUPP;
-		return 0;
+	intel_pmu_setup_sw_lbr_filter(event);
+
+	/*
+	 * setup HW LBR filter, if any
+	 */
+        if (x86_pmu.lbr_sel_map)
+                ret = intel_pmu_setup_hw_lbr_filter(event);
+
+	return ret;
+}
+
+/*
+ * return the type of control flow change at address "from"
+ * intruction is not necessarily a branch (in case of interrupt).
+ *
+ * The branch type returned also includes the priv level of the
+ * target of the control flow change (X86_BR_USER, X86_BR_KERNEL).
+ *
+ * If a branch type is unknown OR the instruction cannot be
+ * decoded (e.g., text page not present), then X86_BR_NONE is
+ * returned.
+ */
+static int branch_type(unsigned long from, unsigned long to)
+{
+	struct insn insn;
+	void *kaddr;
+	int bytes, size = MAX_INSN_SIZE;
+	int ret = X86_BR_NONE;
+	int ext, to_plm, from_plm;
+	u8 buf[MAX_INSN_SIZE];
+
+	to_plm = kernel_ip(to) ? X86_BR_KERNEL : X86_BR_USER;
+	from_plm = kernel_ip(from) ? X86_BR_KERNEL : X86_BR_USER;
+
+	/*
+	 * maybe zero if lbr did not fill up after a reset by the time
+	 * we get a PMU interrupt
+	 */
+	if (from == 0 || to == 0)
+		return X86_BR_NONE;
+
+	if (from_plm == X86_BR_USER) {
+		/*
+		 * can happen if measuring at the user level only
+		 * and we interrupt in a kernel thread, e.g., idle.
+		 */
+		if (!current->mm)
+			return X86_BR_NONE;
+
+		/* may fail if text not present */
+		bytes = copy_from_user_nmi(buf, (void __user *)from, size);
+		if (bytes != size)
+			return X86_BR_NONE;
+
+		kaddr = buf;
+	} else
+		kaddr = (void *)from;
+
+
+	kernel_insn_init(&insn, kaddr);
+	insn_get_opcode(&insn);
+
+	switch (insn.opcode.bytes[0]) {
+	case 0xf:
+		switch(insn.opcode.bytes[1]) {
+		case 0x05: /* syscall */
+		case 0x34: /* sysenter */
+			ret = X86_BR_SYSCALL;
+			break;
+		case 0x07: /* sysret */
+		case 0x35: /* sysexit */
+			ret = X86_BR_SYSRET;
+			break;
+		case 0x80 ... 0x8f: /* conditional */
+			ret = X86_BR_JCC;
+			break;
+		default:
+			ret = X86_BR_NONE;
+		}
+		break;
+	case 0x70 ... 0x7f: /* conditional */
+		ret = X86_BR_JCC;
+		break;
+	case 0xc2: /* near ret */
+	case 0xc3: /* near ret */
+	case 0xca: /* far ret */
+	case 0xcb: /* far ret */
+		ret = X86_BR_RET;
+		break;
+	case 0xcf: /* iret */
+		ret = X86_BR_IRET;
+		break;
+	case 0xcc ... 0xce: /* int */
+		ret = X86_BR_INT;
+		break;
+	case 0xe8: /* call near rel */
+	case 0x9a: /* call far absolute */
+		ret = X86_BR_CALL;
+		break;
+	case 0xe0 ... 0xe3: /* loop jmp */
+		ret = X86_BR_JCC;
+		break;
+	case 0xe9 ... 0xeb: /* jmp */
+		ret = X86_BR_JMP;
+		break;
+	case 0xff: /* call near absolute, call far absolute ind */
+		insn_get_modrm(&insn);
+		ext = (insn.modrm.bytes[0] >> 3) & 0x7;
+		switch (ext) {
+		case 2: /* near ind call */
+		case 3: /* far ind call */
+			ret = X86_BR_IND_CALL;
+			break;
+		case 4:
+		case 5:
+			ret = X86_BR_JMP;
+			break;
+		}
+		break;
+	default:
+		ret = X86_BR_NONE;
 	}
-	return intel_pmu_setup_hw_lbr_filter(event);
+	/*
+	 * interrupts, traps, faults (and thus ring transition) may
+	 * occur on any instructions. Thus, to classify them correctly,
+	 * we need to first look at the from and to priv levels. If they
+	 * are different and to is in the kernel, then it indicates
+	 * a ring transition. If the from instruction is not a ring
+	 * transition instr (syscall, systenter, int), then it means
+	 * it was a irq, trap or fault.
+	 *
+	 * we have no way of detecting kernel to kernel faults.
+	 */
+        if (from_plm == X86_BR_USER && to_plm == X86_BR_KERNEL
+	    && ret != X86_BR_SYSCALL && ret != X86_BR_INT)
+                        ret = X86_BR_IRQ;
+
+	/*
+	 * branch priv level determined by target as
+	 * is done by HW when LBR_SELECT is implemented
+	 */
+	if (ret != X86_BR_NONE)
+		ret |= to_plm;
+
+	return ret;
+}
+
+/*
+ * implement actual branch filter based on user demand.
+ * Hardware may not exactly satisfy that request, thus
+ * we need to inspect opcodes. Mismatched branches are
+ * discarded. Therefore, the number of branches returned
+ * in PERF_SAMPLE_BRANCH_STACK sample may vary.
+ */
+static void
+intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
+{
+	u64 from, to;
+        int br_sel = cpuc->br_sel;
+        int i, j, type;
+        bool compress = false;
+
+        /* if sampling all branches, then nothing to filter */
+        if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+                return;
+
+        for (i = 0; i < cpuc->lbr_stack.nr; i++) {
+
+		from = cpuc->lbr_entries[i].from;
+		to = cpuc->lbr_entries[i].to;
+
+                type = branch_type(from, to);
+
+                /* if type does not correspond, then discard */
+                if (type == X86_BR_NONE || (br_sel & type) != type) {
+                        cpuc->lbr_entries[i].from = 0;
+                        compress = true;
+                }
+        }
+
+        if (!compress)
+                return;
+
+        /* remove all entries with from=0 */
+        for (i = 0; i < cpuc->lbr_stack.nr; ) {
+                if (!cpuc->lbr_entries[i].from) {
+                        j = i;
+                        while (++j < cpuc->lbr_stack.nr)
+                                cpuc->lbr_entries[j-1] = cpuc->lbr_entries[j];
+                        cpuc->lbr_stack.nr--;
+                        if (!cpuc->lbr_entries[i].from)
+                                continue;
+                }
+                i++;
+        }
 }
 
 /*
@@ -343,6 +629,10 @@ static void intel_pmu_lbr_init_core(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("4-deep LBR, ");
 }
 
@@ -357,6 +647,13 @@ static void intel_pmu_lbr_init_nhm(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = nhm_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - workaround LBR_SEL errata (see above)
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -371,6 +668,12 @@ static void intel_pmu_lbr_init_snb(void)
 	x86_pmu.lbr_sel_mask = LBR_SEL_MASK;
 	x86_pmu.lbr_sel_map  = snb_lbr_sel_map;
 
+	/*
+	 * SW branch filter usage:
+	 * - support syscall, sysret capture.
+	 *   That requires LBR_FAR but that means far
+	 *   jmp need to be filtered out
+	 */
 	pr_cont("16-deep LBR, ");
 }
 
@@ -392,6 +695,10 @@ static void intel_pmu_lbr_init_atom(void)
 	x86_pmu.lbr_from   = MSR_LBR_CORE_FROM;
 	x86_pmu.lbr_to     = MSR_LBR_CORE_TO;
 
+	/*
+	 * SW branch filter usage:
+	 * - compensate for lack of HW filter
+	 */
 	pr_cont("8-deep LBR, ");
 }
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (6 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 18:53   ` Peter Zijlstra
  2011-10-06 14:49 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch Stephane Eranian
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

PERF_SAMPLE_BRANCH_* is disabled for:
- SW events (sw counters, tracepoints)
- HW breakpoints
- ALL but Intel X86 architecture
- AMD64 processors

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/alpha/kernel/perf_event.c       |    4 ++++
 arch/arm/kernel/perf_event.c         |    4 ++++
 arch/mips/kernel/perf_event.c        |    4 ++++
 arch/powerpc/kernel/perf_event.c     |    4 ++++
 arch/sh/kernel/perf_event.c          |    4 ++++
 arch/sparc/kernel/perf_event.c       |    4 ++++
 arch/x86/kernel/cpu/perf_event_amd.c |    3 +++
 kernel/events/core.c                 |   24 ++++++++++++++++++++++++
 kernel/events/hw_breakpoint.c        |    6 ++++++
 9 files changed, 57 insertions(+), 0 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 8143cd7..0dae252 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -685,6 +685,10 @@ static int alpha_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 53c9c26..bcb0dd1 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -544,6 +544,10 @@ static int armpmu_event_init(struct perf_event *event)
 {
 	int err = 0;
 
+	/* does not support taken branch sampling */
+	if (has_branch_smpl(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/mips/kernel/perf_event.c b/arch/mips/kernel/perf_event.c
index 0aee944..425c35a 100644
--- a/arch/mips/kernel/perf_event.c
+++ b/arch/mips/kernel/perf_event.c
@@ -370,6 +370,10 @@ static int mipspmu_event_init(struct perf_event *event)
 {
 	int err = 0;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HARDWARE:
diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
index 10a140f..5701051 100644
--- a/arch/powerpc/kernel/perf_event.c
+++ b/arch/powerpc/kernel/perf_event.c
@@ -1078,6 +1078,10 @@ static int power_pmu_event_init(struct perf_event *event)
 	if (!ppmu)
 		return -ENOENT;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_HARDWARE:
 		ev = event->attr.config;
diff --git a/arch/sh/kernel/perf_event.c b/arch/sh/kernel/perf_event.c
index 2ee21a4..7cc9066 100644
--- a/arch/sh/kernel/perf_event.c
+++ b/arch/sh/kernel/perf_event.c
@@ -309,6 +309,10 @@ static int sh_pmu_event_init(struct perf_event *event)
 {
 	int err;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event->attr.type) {
 	case PERF_TYPE_RAW:
 	case PERF_TYPE_HW_CACHE:
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 614da62..8e16a4a 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1105,6 +1105,10 @@ static int sparc_pmu_event_init(struct perf_event *event)
 	if (atomic_read(&nmi_active) < 0)
 		return -ENODEV;
 
+	/* does not support taken branch sampling */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (attr->type) {
 	case PERF_TYPE_HARDWARE:
 		if (attr->config >= sparc_pmu->max_events)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 941caa2..81a9b5a 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -132,6 +132,9 @@ static int amd_pmu_hw_config(struct perf_event *event)
 	if (ret)
 		return ret;
 
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	if (event->attr.type != PERF_TYPE_RAW)
 		return 0;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 70a8975..e5caab5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5104,6 +5104,12 @@ static int perf_swevent_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_SOFTWARE)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	switch (event_id) {
 	case PERF_COUNT_SW_CPU_CLOCK:
 	case PERF_COUNT_SW_TASK_CLOCK:
@@ -5207,6 +5213,12 @@ static int perf_tp_event_init(struct perf_event *event)
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for tracepoint events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	err = perf_trace_init(event);
 	if (err)
 		return err;
@@ -5430,6 +5442,12 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
@@ -5502,6 +5520,12 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for software events
+	 */
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
 	perf_swevent_init_hrtimer(event);
 
 	return 0;
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index b7971d6..e7fb781 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -581,6 +581,12 @@ static int hw_breakpoint_event_init(struct perf_event *bp)
 	if (bp->attr.type != PERF_TYPE_BREAKPOINT)
 		return -ENOENT;
 
+	/*
+	 * no branch sampling for breakpoint events
+	 */
+	if (has_branch_stack(bp))
+		return -EOPNOTSUPP;
+
 	err = register_perf_hw_breakpoint(bp);
 	if (err)
 		return err;
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (7 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 14:49 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.

We need a hook on context switch to either flush the branch stack
or save it. This patch adds a new hook in struct pmu which is called
during context switches. The hook is called only when necessary.
That is when a system-wide context has, at least, one event which
uses PERF_SAMPLE_BRANCH_STACK. The hook is never called for per-thread
context.

In this version, the Intel X86 code simply flushes (reset) the LBR
on context switches (fill with zeroes). Those zeroed branches are
then filtered out by the SW filter.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c       |   31 +++++++-----
 arch/x86/kernel/cpu/perf_event_intel.c |   13 +++++
 include/linux/perf_event.h             |    7 ++-
 kernel/events/core.c                   |   85 ++++++++++++++++++++++++++++++++
 4 files changed, 123 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 80aa2f1..7df913c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -268,7 +268,7 @@ struct x86_pmu {
 	void		(*cpu_starting)(int cpu);
 	void		(*cpu_dying)(int cpu);
 	void		(*cpu_dead)(int cpu);
-
+	void		(*flush_branch_stack)(void);
 	/*
 	 * Intel Arch Perfmon v2+
 	 */
@@ -1827,21 +1827,28 @@ static int x86_pmu_event_init(struct perf_event *event)
 	return err;
 }
 
+static void x86_pmu_flush_branch_stack(void)
+{
+	if (x86_pmu.flush_branch_stack)
+		x86_pmu.flush_branch_stack();
+}
+
 static struct pmu pmu = {
-	.pmu_enable	= x86_pmu_enable,
-	.pmu_disable	= x86_pmu_disable,
+	.pmu_enable		= x86_pmu_enable,
+	.pmu_disable		= x86_pmu_disable,
 
-	.event_init	= x86_pmu_event_init,
+	.event_init		= x86_pmu_event_init,
 
-	.add		= x86_pmu_add,
-	.del		= x86_pmu_del,
-	.start		= x86_pmu_start,
-	.stop		= x86_pmu_stop,
-	.read		= x86_pmu_read,
+	.add			= x86_pmu_add,
+	.del			= x86_pmu_del,
+	.start			= x86_pmu_start,
+	.stop			= x86_pmu_stop,
+	.read			= x86_pmu_read,
 
-	.start_txn	= x86_pmu_start_txn,
-	.cancel_txn	= x86_pmu_cancel_txn,
-	.commit_txn	= x86_pmu_commit_txn,
+	.start_txn		= x86_pmu_start_txn,
+	.cancel_txn		= x86_pmu_cancel_txn,
+	.commit_txn		= x86_pmu_commit_txn,
+	.flush_branch_stack	= x86_pmu_flush_branch_stack,
 };
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index b5e7c52..3bb1614 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1440,6 +1440,18 @@ static void intel_pmu_cpu_dying(int cpu)
 	fini_debug_store_on_cpu(cpu);
 }
 
+static void intel_pmu_flush_branch_stack(void)
+{
+	/*
+	 * Intel LBR does not tag entries with the
+	 * PID of the current task, then we need to
+	 * flush it on ctxsw
+	 * For now, we simply reset it
+	 */
+	if (x86_pmu.lbr_nr)
+		intel_pmu_lbr_reset();
+}
+
 static __initconst const struct x86_pmu intel_pmu = {
 	.name			= "Intel",
 	.handle_irq		= intel_pmu_handle_irq,
@@ -1466,6 +1478,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.cpu_prepare		= intel_pmu_cpu_prepare,
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,
+	.flush_branch_stack	= intel_pmu_flush_branch_stack,
 };
 
 static void intel_clovertown_quirks(void)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0819364..70d6453 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -732,6 +732,9 @@ struct pmu {
 	 * for each successful ->add() during the transaction.
 	 */
 	void (*cancel_txn)		(struct pmu *pmu); /* optional */
+	/*
+	 * flush branch stack on context-switches (needed in cpu-wide mode) */
+	void (*flush_branch_stack)	(void);
 };
 
 /**
@@ -960,7 +963,8 @@ struct perf_event_context {
 	u64				parent_gen;
 	u64				generation;
 	int				pin_count;
-	int				nr_cgroups; /* cgroup events present */
+	int				nr_cgroups;	 /* cgroup evts */
+	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
 };
 
@@ -1025,6 +1029,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
+
 struct perf_sample_data {
 	u64				type;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e5caab5..a7f388f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -129,6 +129,7 @@ enum event_type_t {
  */
 struct jump_label_key perf_sched_events __read_mostly;
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
+static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
 
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
@@ -877,6 +878,9 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 	if (is_cgroup_event(event))
 		ctx->nr_cgroups++;
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack++;
+
 	list_add_rcu(&event->event_entry, &ctx->event_list);
 	if (!ctx->nr_events)
 		perf_pmu_rotate_start(ctx->pmu);
@@ -1016,6 +1020,9 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 			cpuctx->cgrp = NULL;
 	}
 
+	if (has_branch_stack(event))
+		ctx->nr_branch_stack--;
+
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
@@ -2185,6 +2192,66 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 }
 
 /*
+ * When sampling the branck stack in system-wide, it may be necessary
+ * to flush the stack on context switch. This happens when the branch
+ * stack does not tag its entries with the pid of the current task.
+ * Otherwise it becomes impossible to associate a branch entry with a
+ * task. This ambiguity is more likely to appear when the branch stack
+ * supports priv level filtering and the user sets it to monitor only
+ * at the user level (which could be a useful measurement in system-wide
+ * mode). In that case, the risk is high of having a branch stack with
+ * branch from multiple tasks. Flushing may mean dropping the existing
+ * entries or stashing them somewhere in the PMU specific code layer.
+ *
+ * This function provides the context switch callback to the lower code
+ * layer. It is invoked ONLY when there is at least one system-wide context
+ * with at least one active event using taken branch sampling.
+ */
+static void perf_branch_stack_sched_in(struct task_struct *prev,
+				       struct task_struct *task)
+{
+	struct perf_cpu_context *cpuctx;
+	struct pmu *pmu;
+	unsigned long flags;
+
+	/* no need to flush branch stack if not changing task */
+	if (prev == task)
+		return;
+
+	local_irq_save(flags);
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(pmu, &pmus, entry) {
+		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+
+		/*
+		 * check if the context has at least one
+		 * event using PERF_SAMPLE_BRANCH_STACK
+		 */
+		if (cpuctx->ctx.nr_branch_stack > 0
+		    && pmu->flush_branch_stack) {
+
+			pmu = cpuctx->ctx.pmu;
+
+			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+			perf_pmu_disable(pmu);
+
+			pmu->flush_branch_stack();
+
+			perf_pmu_enable(pmu);
+
+			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		}
+	}
+
+	rcu_read_unlock();
+
+	local_irq_restore(flags);
+}
+
+/*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
  *
@@ -2215,6 +2282,10 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	 */
 	if (atomic_read(&__get_cpu_var(perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
+
+	/* check for system-wide branch_stack events */
+	if (atomic_read(&__get_cpu_var(perf_branch_stack_events)))
+		perf_branch_stack_sched_in(prev, task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -2954,6 +3025,14 @@ static void free_event(struct perf_event *event)
 			atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
 			jump_label_dec(&perf_sched_events);
 		}
+
+		if (has_branch_stack(event)) {
+			jump_label_dec(&perf_sched_events);
+			/* is system-wide event */
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_dec(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	if (event->rb) {
@@ -5960,6 +6039,12 @@ done:
 				return ERR_PTR(err);
 			}
 		}
+		if (has_branch_stack(event)) {
+			jump_label_inc(&perf_sched_events);
+			if (!(event->attach_state & PERF_ATTACH_TASK))
+				atomic_inc(&per_cpu(perf_branch_stack_events,
+						    event->cpu));
+		}
 	}
 
 	return event;
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (8 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 18:50   ` Peter Zijlstra
  2011-10-06 14:49 ` [PATCH 11/12] perf: add support for sampling taken branch to perf record Stephane Eranian
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds:
- ability to parse samples with PERF_SAMPLE_BRANCH_STACK
- sort on branches
- build histograms on branches

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/perf.h          |   17 ++
 tools/perf/util/annotate.c |    2 +-
 tools/perf/util/event.h    |    1 +
 tools/perf/util/evsel.c    |   10 ++
 tools/perf/util/hist.c     |   97 ++++++++++---
 tools/perf/util/hist.h     |    6 +
 tools/perf/util/session.c  |   72 +++++++++
 tools/perf/util/session.h  |    5 +
 tools/perf/util/sort.c     |  348 +++++++++++++++++++++++++++++++++-----------
 tools/perf/util/sort.h     |    5 +
 tools/perf/util/symbol.h   |   13 ++
 11 files changed, 464 insertions(+), 112 deletions(-)

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index a5fc660..549ac58 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -170,6 +170,23 @@ struct ip_callchain {
 	u64 ips[0];
 };
 
+struct branch_flags{
+	u64 mispred : 1;
+	u64 predicted : 1;
+	u64 reserved : 62;
+};
+
+struct branch_entry {
+	u64				from;
+	u64				to;
+	struct branch_flags flags;
+};
+
+struct branch_stack {
+	u64				nr;
+	struct branch_entry	entries[0];
+};
+
 extern bool perf_host, perf_guest;
 
 #endif
diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index e01af2b..532e4a6 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -62,7 +62,7 @@ int symbol__inc_addr_samples(struct symbol *sym, struct map *map,
 
 	pr_debug3("%s: addr=%#" PRIx64 "\n", __func__, map->unmap_ip(map, addr));
 
-	if (addr >= sym->end)
+	if (addr >= sym->end || addr < sym->start)
 		return 0;
 
 	offset = addr - sym->start;
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 357a85b..026b1f6 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -80,6 +80,7 @@ struct perf_sample {
 	u32 raw_size;
 	void *raw_data;
 	struct ip_callchain *callchain;
+	struct branch_stack *branch_stack;
 };
 
 #define BUILD_ID_SIZE 20
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index e389815..cca6c23 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -472,5 +472,15 @@ int perf_event__parse_sample(const union perf_event *event, u64 type,
 		data->raw_data = (void *) pdata;
 	}
 
+	if (type & PERF_SAMPLE_BRANCH_STACK) {
+		u64 sz;
+
+		data->branch_stack = (struct branch_stack *)array;
+		array++; /* nr */
+
+		sz = data->branch_stack->nr * sizeof (struct branch_entry);
+		sz /= sizeof(uint64_t);
+		array += sz;
+	}
 	return 0;
 }
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 677e1da..9421391 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -49,8 +49,10 @@ static void hists__calc_col_len(struct hists *self, struct hist_entry *h)
 {
 	u16 len;
 
-	if (h->ms.sym)
-		hists__new_col_len(self, HISTC_SYMBOL, h->ms.sym->namelen);
+	if (h->ms.sym){
+		int symlen = max((int)h->ms.sym->namelen + 4, BITS_PER_LONG / 4 + 6);
+		hists__new_col_len(self, HISTC_SYMBOL, symlen);
+	}
 	else {
 		const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
 
@@ -128,33 +130,18 @@ static u8 symbol__parent_filter(const struct symbol *parent)
 	return 0;
 }
 
-struct hist_entry *__hists__add_entry(struct hists *self,
-				      struct addr_location *al,
-				      struct symbol *sym_parent, u64 period)
-{
+static struct hist_entry *add_hist_entry(struct hists *self,
+					 struct hist_entry *entry,
+					 struct addr_location *al,
+					 u64 period){
 	struct rb_node **p = &self->entries.rb_node;
 	struct rb_node *parent = NULL;
 	struct hist_entry *he;
-	struct hist_entry entry = {
-		.thread	= al->thread,
-		.ms = {
-			.map	= al->map,
-			.sym	= al->sym,
-		},
-		.cpu	= al->cpu,
-		.ip	= al->addr,
-		.level	= al->level,
-		.period	= period,
-		.parent = sym_parent,
-		.filtered = symbol__parent_filter(sym_parent),
-	};
 	int cmp;
-
 	while (*p != NULL) {
 		parent = *p;
 		he = rb_entry(parent, struct hist_entry, rb_node);
-
-		cmp = hist_entry__cmp(&entry, he);
+		cmp = hist_entry__cmp(entry, he);
 
 		if (!cmp) {
 			he->period += period;
@@ -168,9 +155,10 @@ struct hist_entry *__hists__add_entry(struct hists *self,
 			p = &(*p)->rb_right;
 	}
 
-	he = hist_entry__new(&entry);
+	he = hist_entry__new(entry);
 	if (!he)
 		return NULL;
+
 	rb_link_node(&he->rb_node, parent, p);
 	rb_insert_color(&he->rb_node, &self->entries);
 	hists__inc_nr_entries(self, he);
@@ -179,6 +167,69 @@ out:
 	return he;
 }
 
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info* bi,
+					     u64 period){
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= bi->to.map,
+			.sym	= bi->to.sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= bi->to.addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+		.branch_info = bi,
+	};
+	struct hist_entry *he;
+
+	he = add_hist_entry(self, &entry, al, period);
+	if (!he)
+		return NULL;
+
+	/*
+	 * in branch mode, we do not display al->sym, al->addr
+	 * but instead what is in branch_info. The addresses and
+	 * symbols there may need wider columns, so make sure they
+	 * are taken into account.
+	 *
+	 * hists__calc_col_len() tracks the max column width, so
+	 * we need to call it for both the from and to addresses
+	 */
+	entry.ip     = bi->from.addr;
+	entry.ms.map = bi->from.map;
+	entry.ms.sym = bi->from.sym;
+	hists__calc_col_len(self, &entry);
+
+	return he;
+}
+
+struct hist_entry *__hists__add_entry(struct hists *self,
+				      struct addr_location *al,
+				      struct symbol *sym_parent, u64 period)
+{
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= al->map,
+			.sym	= al->sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= al->addr,
+		.level	= al->level,
+		.period	= period,
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+	};
+
+	return add_hist_entry(self, &entry, al, period);
+}
+
 int64_t
 hist_entry__cmp(struct hist_entry *left, struct hist_entry *right)
 {
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 3beb97c..1b6eda2 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -39,6 +39,7 @@ enum hist_column {
 	HISTC_COMM,
 	HISTC_PARENT,
 	HISTC_CPU,
+	HISTC_MISPREDICT,
 	HISTC_NR_COLS, /* Last entry */
 };
 
@@ -55,6 +56,11 @@ struct hists {
 struct hist_entry *__hists__add_entry(struct hists *self,
 				      struct addr_location *al,
 				      struct symbol *parent, u64 period);
+struct hist_entry *__hists__add_branch_entry(struct hists *self,
+					     struct addr_location *al,
+					     struct symbol *sym_parent,
+					     struct branch_info* bi,
+					     u64 period);
 extern int64_t hist_entry__cmp(struct hist_entry *, struct hist_entry *);
 extern int64_t hist_entry__collapse(struct hist_entry *, struct hist_entry *);
 int hist_entry__fprintf(struct hist_entry *self, struct hists *hists,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 72458d9..ac1dfaea 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -236,6 +236,64 @@ static bool symbol__match_parent_regex(struct symbol *sym)
 	return 0;
 }
 
+
+static const u8 cpumodes[] = {
+	PERF_RECORD_MISC_USER,
+	PERF_RECORD_MISC_KERNEL,
+	PERF_RECORD_MISC_GUEST_USER,
+	PERF_RECORD_MISC_GUEST_KERNEL
+};
+#define NCPUMODES (sizeof(cpumodes)/sizeof(u8))
+
+static void ip__resolve_ams(struct perf_session *self, struct thread *thread,
+                                        struct addr_map_symbol *ams,
+                                        u64 ip)
+{
+	struct addr_location al;
+	size_t i;
+	u8 m;
+
+	memset(&al, 0, sizeof(al));
+
+	for (i = 0; i < NCPUMODES; i++) {
+		m = cpumodes[i];
+		/*
+		 * we cannot use the header.misc hint to determine whether a
+		 * branch stack address is user, kernel, guest, hypervisor.
+		 * Branches may straddle the kernel/user/hypervisor boundaries.
+		 * Thus, we have to try * consecutively until we find a match
+		 * or else, the symbol is unknown
+		 */
+		thread__find_addr_location(thread, self, m, MAP__FUNCTION,
+				thread->pid, ip, &al, NULL);
+		if (al.sym)
+			goto found;
+	}
+found:
+	ams->addr = ip;
+	ams->sym = al.sym;
+	ams->map = al.map;
+}
+
+struct branch_info *perf_session__resolve_bstack(struct perf_session *self,
+						 struct thread *thr,
+						 struct branch_stack *bs)
+{
+	struct branch_info *bi;
+	unsigned int i;
+
+	bi = calloc(bs->nr, sizeof(struct branch_info));
+	if (!bi)
+		return NULL;
+
+	for (i = 0; i < bs->nr; i++) {
+		ip__resolve_ams(self, thr, &bi[i].to, bs->entries[i].to);
+		ip__resolve_ams(self, thr, &bi[i].from, bs->entries[i].from);
+		bi[i].flags = bs->entries[i].flags;
+	}
+	return bi;
+}
+
 int perf_session__resolve_callchain(struct perf_session *self,
 				    struct thread *thread,
 				    struct ip_callchain *chain,
@@ -679,6 +737,17 @@ static void callchain__printf(struct perf_sample *sample)
 		       i, sample->callchain->ips[i]);
 }
 
+static void branch_stack__printf(struct perf_sample *sample)
+{
+	uint64_t i;
+
+	printf("... branch stack: nr:%" PRIu64 "\n", sample->branch_stack->nr);
+
+	for (i = 0; i < sample->branch_stack->nr; i++)
+		printf("..... %2"PRIu64": %016" PRIx64 " -> %016" PRIx64 "\n", i, sample->branch_stack-> \
+			entries[i].from, sample->branch_stack->entries[i].to);
+}
+
 static void perf_session__print_tstamp(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_sample *sample)
@@ -726,6 +795,9 @@ static void dump_sample(struct perf_session *session, union perf_event *event,
 
 	if (session->sample_type & PERF_SAMPLE_CALLCHAIN)
 		callchain__printf(sample);
+
+	if (session->sample_type & PERF_SAMPLE_BRANCH_STACK)
+		branch_stack__printf(sample);
 }
 
 static int perf_session_deliver_event(struct perf_session *session,
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 974d0cb..1bf9ba5 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -100,6 +100,11 @@ int __perf_session__process_events(struct perf_session *self,
 int perf_session__process_events(struct perf_session *self,
 				 struct perf_event_ops *event_ops);
 
+
+struct branch_info *perf_session__resolve_bstack(struct perf_session *self,
+						 struct thread *thread,
+						 struct branch_stack *bs);
+
 int perf_session__resolve_callchain(struct perf_session *self,
 				    struct thread *thread,
 				    struct ip_callchain *chain,
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 1ee8f1e..f6b31f8 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -8,6 +8,7 @@ const char	default_sort_order[] = "comm,dso,symbol";
 const char	*sort_order = default_sort_order;
 int		sort__need_collapse = 0;
 int		sort__has_parent = 0;
+bool		sort__branch_mode = 0;
 
 enum sort_type	sort__first_dimension;
 
@@ -94,6 +95,26 @@ static int hist_entry__comm_snprintf(struct hist_entry *self, char *bf,
 	return repsep_snprintf(bf, size, "%*s", width, self->thread->comm);
 }
 
+static int64_t _sort__dso_cmp(struct map *map_l, struct map *map_r)
+{
+	struct dso *dso_l = map_l ? map_l->dso : NULL;
+	struct dso *dso_r = map_r ? map_r->dso : NULL;
+	const char *dso_name_l, *dso_name_r;
+
+	if (!dso_l || !dso_r)
+		return cmp_null(dso_l, dso_r);
+
+	if (verbose) {
+		dso_name_l = dso_l->long_name;
+		dso_name_r = dso_r->long_name;
+	} else {
+		dso_name_l = dso_l->short_name;
+		dso_name_r = dso_r->short_name;
+	}
+
+	return strcmp(dso_name_l, dso_name_r);
+}
+
 struct sort_entry sort_comm = {
 	.se_header	= "Command",
 	.se_cmp		= sort__comm_cmp,
@@ -107,36 +128,72 @@ struct sort_entry sort_comm = {
 static int64_t
 sort__dso_cmp(struct hist_entry *left, struct hist_entry *right)
 {
-	struct dso *dso_l = left->ms.map ? left->ms.map->dso : NULL;
-	struct dso *dso_r = right->ms.map ? right->ms.map->dso : NULL;
-	const char *dso_name_l, *dso_name_r;
+        return _sort__dso_cmp(left->ms.map, right->ms.map);
+}
 
-	if (!dso_l || !dso_r)
-		return cmp_null(dso_l, dso_r);
 
-	if (verbose) {
-		dso_name_l = dso_l->long_name;
-		dso_name_r = dso_r->long_name;
-	} else {
-		dso_name_l = dso_l->short_name;
-		dso_name_r = dso_r->short_name;
+static int64_t _sort__sym_cmp(struct symbol *sym_l, struct symbol *sym_r,
+			      u64 ip_l, u64 ip_r)
+{
+	if (!sym_l || !sym_r)
+		return cmp_null(sym_l, sym_r);
+
+	if (sym_l == sym_r)
+		return 0;
+
+	if(sym_l)
+		ip_l = sym_l->start;
+	if(sym_r)
+		ip_r = sym_r->start;
+
+	return (int64_t)(ip_r - ip_l);
+}
+
+static int _hist_entry__dso_snprintf(struct map *map, char *bf,
+				     size_t size, unsigned int width)
+{
+	if (map && map->dso) {
+		const char *dso_name = !verbose ? map->dso->short_name :
+			map->dso->long_name;
+		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
 	}
 
-	return strcmp(dso_name_l, dso_name_r);
+	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
 }
 
 static int hist_entry__dso_snprintf(struct hist_entry *self, char *bf,
 				    size_t size, unsigned int width)
 {
-	if (self->ms.map && self->ms.map->dso) {
-		const char *dso_name = !verbose ? self->ms.map->dso->short_name :
-						  self->ms.map->dso->long_name;
-		return repsep_snprintf(bf, size, "%-*s", width, dso_name);
+	return _hist_entry__dso_snprintf(self->ms.map, bf, size, width);
+}
+
+static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
+				     u64 ip, char level, char *bf, size_t size,
+				     unsigned int width __used)
+{
+	size_t ret = 0;
+
+	if (verbose) {
+		char o = map ? dso__symtab_origin(map->dso) : '!';
+		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
+				       BITS_PER_LONG / 4, ip, o);
 	}
 
-	return repsep_snprintf(bf, size, "%-*s", width, "[unknown]");
+	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
+	if (sym)
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s", width - ret,
+				       sym->name);
+	else {
+		size_t len = BITS_PER_LONG / 4;
+		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
+				       len, ip);
+		ret += repsep_snprintf(bf + ret, size - ret, "%-*s", width - ret, "");
+	}
+
+	return ret;
 }
 
+
 struct sort_entry sort_dso = {
 	.se_header	= "Shared Object",
 	.se_cmp		= sort__dso_cmp,
@@ -144,8 +201,14 @@ struct sort_entry sort_dso = {
 	.se_width_idx	= HISTC_DSO,
 };
 
-/* --sort symbol */
+static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	return _hist_entry__sym_snprintf(self->ms.map, self->ms.sym, self->ip,
+					 self->level, bf, size, width);
+}
 
+/* --sort symbol */
 static int64_t
 sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 {
@@ -154,38 +217,10 @@ sort__sym_cmp(struct hist_entry *left, struct hist_entry *right)
 	if (!left->ms.sym && !right->ms.sym)
 		return right->level - left->level;
 
-	if (!left->ms.sym || !right->ms.sym)
-		return cmp_null(left->ms.sym, right->ms.sym);
-
-	if (left->ms.sym == right->ms.sym)
-		return 0;
-
 	ip_l = left->ms.sym->start;
 	ip_r = right->ms.sym->start;
 
-	return (int64_t)(ip_r - ip_l);
-}
-
-static int hist_entry__sym_snprintf(struct hist_entry *self, char *bf,
-				    size_t size, unsigned int width __used)
-{
-	size_t ret = 0;
-
-	if (verbose) {
-		char o = self->ms.map ? dso__symtab_origin(self->ms.map->dso) : '!';
-		ret += repsep_snprintf(bf, size, "%-#*llx %c ",
-				       BITS_PER_LONG / 4, self->ip, o);
-	}
-
-	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", self->level);
-	if (self->ms.sym)
-		ret += repsep_snprintf(bf + ret, size - ret, "%s",
-				       self->ms.sym->name);
-	else
-		ret += repsep_snprintf(bf + ret, size - ret, "%-#*llx",
-				       BITS_PER_LONG / 4, self->ip);
-
-	return ret;
+	return _sort__sym_cmp(left->ms.sym, right->ms.sym, ip_l, ip_r);
 }
 
 struct sort_entry sort_sym = {
@@ -244,6 +279,124 @@ struct sort_entry sort_cpu = {
 	.se_width_idx	= HISTC_CPU,
 };
 
+static int64_t
+sort__dso_from_cmp(struct hist_entry *left, struct hist_entry *right){
+	return _sort__dso_cmp(left->branch_info->from.map,
+				       right->branch_info->from.map);
+}
+
+static int hist_entry__dso_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width){
+	return _hist_entry__dso_snprintf(self->branch_info->from.map,
+					      bf, size, width);
+}
+
+struct sort_entry sort_dso_from = {
+	.se_header	= "Source Shared Object",
+	.se_cmp		= sort__dso_from_cmp,
+	.se_snprintf	= hist_entry__dso_from_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+static int64_t
+sort__dso_to_cmp(struct hist_entry *left, struct hist_entry *right){
+	return _sort__dso_cmp(left->branch_info->to.map,
+				       right->branch_info->to.map);
+}
+
+static int hist_entry__dso_to_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width){
+	return _hist_entry__dso_snprintf(self->branch_info->to.map,
+					      bf, size, width);
+}
+
+static int64_t
+sort__sym_from_cmp(struct hist_entry *left, struct hist_entry *right){
+	struct addr_map_symbol *from_l = &left->branch_info->from;
+	struct addr_map_symbol *from_r = &right->branch_info->from;
+	if (!from_l->sym && !from_r->sym)
+		return right->level - left->level;
+	return _sort__sym_cmp(from_l->sym, from_r->sym, from_l->addr,
+			     from_r->addr);
+}
+
+static int64_t
+sort__sym_to_cmp(struct hist_entry *left, struct hist_entry *right){
+	struct addr_map_symbol *to_l = &left->branch_info->to;
+	struct addr_map_symbol *to_r = &right->branch_info->to;
+	if (!to_l->sym && !to_r->sym)
+		return right->level - left->level;
+	return _sort__sym_cmp(to_l->sym, to_r->sym, to_l->addr, to_r->addr);
+}
+
+static int hist_entry__sym_from_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *from = &self->branch_info->from;
+	return _hist_entry__sym_snprintf(from->map, from->sym, from->addr,
+					 self->level, bf, size, width);
+
+}
+
+static int hist_entry__sym_to_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width __used)
+{
+	struct addr_map_symbol *to = &self->branch_info->to;
+	return _hist_entry__sym_snprintf(to->map, to->sym, to->addr,
+					 self->level, bf, size, width);
+
+}
+
+struct sort_entry sort_dso_to = {
+	.se_header	= "Target Shared Object",
+	.se_cmp		= sort__dso_to_cmp,
+	.se_snprintf	= hist_entry__dso_to_snprintf,
+	.se_width_idx	= HISTC_DSO,
+};
+
+struct sort_entry sort_sym_from = {
+	.se_header	= "Source Symbol",
+	.se_cmp		= sort__sym_from_cmp,
+	.se_snprintf	= hist_entry__sym_from_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+struct sort_entry sort_sym_to = {
+	.se_header	= "Target Symbol",
+	.se_cmp		= sort__sym_to_cmp,
+	.se_snprintf	= hist_entry__sym_to_snprintf,
+	.se_width_idx	= HISTC_SYMBOL,
+};
+
+static int64_t
+sort__mispredict_cmp(struct hist_entry *left, struct hist_entry *right){
+	const unsigned char mp = left->branch_info->flags.mispred !=
+					right->branch_info->flags.mispred;
+	const unsigned char p = left->branch_info->flags.predicted !=
+					right->branch_info->flags.predicted;
+
+	return mp || p;
+}
+
+static int hist_entry__mispredict_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width){
+	static const char *out = "N/A";
+
+	if (self->branch_info->flags.predicted)
+		out = "N";
+	else if (self->branch_info->flags.mispred)
+		out = "Y";
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+struct sort_entry sort_mispredict = {
+	.se_header	= "Branch Mispredicted",
+	.se_cmp		= sort__mispredict_cmp,
+	.se_snprintf	= hist_entry__mispredict_snprintf,
+	.se_width_idx	= HISTC_MISPREDICT,
+};
+
 struct sort_dimension {
 	const char		*name;
 	struct sort_entry	*entry;
@@ -251,14 +404,59 @@ struct sort_dimension {
 };
 
 static struct sort_dimension sort_dimensions[] = {
-	{ .name = "pid",	.entry = &sort_thread,	},
-	{ .name = "comm",	.entry = &sort_comm,	},
-	{ .name = "dso",	.entry = &sort_dso,	},
-	{ .name = "symbol",	.entry = &sort_sym,	},
-	{ .name = "parent",	.entry = &sort_parent,	},
-	{ .name = "cpu",	.entry = &sort_cpu,	},
+	{ .name = "pid",	.entry = &sort_thread,			},
+	{ .name = "comm",	.entry = &sort_comm,			},
+	{ .name = "dso",	.entry = &sort_dso,			},
+	{ .name = "dso_from", 	.entry = &sort_dso_from,.taken = true	},
+	{ .name = "dso_to",	.entry = &sort_dso_to,	.taken = true	},
+	{ .name = "symbol",	.entry = &sort_sym,			},
+	{ .name = "symbol_from",.entry = &sort_sym_from,.taken = true	},
+	{ .name = "symbol_to",	.entry = &sort_sym_to,	.taken = true	},
+	{ .name = "parent",	.entry = &sort_parent,			},
+	{ .name = "cpu",	.entry = &sort_cpu,			},
+	{ .name = "mispredict", .entry = &sort_mispredict, },
 };
 
+static int _sort_dimension__add(struct sort_dimension *sd)
+{
+	if (sd->entry->se_collapse)
+		sort__need_collapse = 1;
+
+	if (sd->entry == &sort_parent) {
+		int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
+		if (ret) {
+			char err[BUFSIZ];
+
+			regerror(ret, &parent_regex, err, sizeof(err));
+			pr_err("Invalid regex: %s\n%s", parent_pattern, err);
+			return -EINVAL;
+		}
+		sort__has_parent = 1;
+	}
+
+	if (list_empty(&hist_entry__sort_list)) {
+		if (!strcmp(sd->name, "pid"))
+			sort__first_dimension = SORT_PID;
+		else if (!strcmp(sd->name, "comm"))
+			sort__first_dimension = SORT_COMM;
+		else if (!strcmp(sd->name, "dso"))
+			sort__first_dimension = SORT_DSO;
+		else if (!strcmp(sd->name, "symbol"))
+			sort__first_dimension = SORT_SYM;
+		else if (!strcmp(sd->name, "parent"))
+			sort__first_dimension = SORT_PARENT;
+		else if (!strcmp(sd->name, "cpu"))
+			sort__first_dimension = SORT_CPU;
+		else if (!strcmp(sd->name, "mispredict"))
+			sort__first_dimension = SORT_MISPREDICTED;
+	}
+
+	list_add_tail(&sd->entry->list, &hist_entry__sort_list);
+	sd->taken = 1;
+
+	return 0;
+}
+
 int sort_dimension__add(const char *tok)
 {
 	unsigned int i;
@@ -269,48 +467,22 @@ int sort_dimension__add(const char *tok)
 		if (strncasecmp(tok, sd->name, strlen(tok)))
 			continue;
 
-		if (sd->entry == &sort_parent) {
-			int ret = regcomp(&parent_regex, parent_pattern, REG_EXTENDED);
-			if (ret) {
-				char err[BUFSIZ];
-
-				regerror(ret, &parent_regex, err, sizeof(err));
-				pr_err("Invalid regex: %s\n%s", parent_pattern, err);
-				return -EINVAL;
-			}
-			sort__has_parent = 1;
-		}
-
 		if (sd->taken)
 			return 0;
 
-		if (sd->entry->se_collapse)
-			sort__need_collapse = 1;
-
-		if (list_empty(&hist_entry__sort_list)) {
-			if (!strcmp(sd->name, "pid"))
-				sort__first_dimension = SORT_PID;
-			else if (!strcmp(sd->name, "comm"))
-				sort__first_dimension = SORT_COMM;
-			else if (!strcmp(sd->name, "dso"))
-				sort__first_dimension = SORT_DSO;
-			else if (!strcmp(sd->name, "symbol"))
-				sort__first_dimension = SORT_SYM;
-			else if (!strcmp(sd->name, "parent"))
-				sort__first_dimension = SORT_PARENT;
-			else if (!strcmp(sd->name, "cpu"))
-				sort__first_dimension = SORT_CPU;
-		}
-
-		list_add_tail(&sd->entry->list, &hist_entry__sort_list);
-		sd->taken = 1;
 
-		return 0;
+		if(sort__branch_mode && (sd->entry == &sort_dso ||
+					sd->entry == &sort_sym)){
+			int err = _sort_dimension__add(sd + 1);
+			return err ?: _sort_dimension__add(sd + 2);
+		}
+		else if(sd->entry == &sort_mispredict && !sort__branch_mode)
+			break;
+		else
+			return _sort_dimension__add(sd);
 	}
-
 	return -ESRCH;
 }
-
 void setup_sorting(const char * const usagestr[], const struct option *opts)
 {
 	char *tmp, *tok, *str = strdup(sort_order);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 77d0388..9cd0106 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -31,11 +31,14 @@ extern const char *parent_pattern;
 extern const char default_sort_order[];
 extern int sort__need_collapse;
 extern int sort__has_parent;
+extern bool sort__branch_mode;
 extern char *field_sep;
 extern struct sort_entry sort_comm;
 extern struct sort_entry sort_dso;
 extern struct sort_entry sort_sym;
 extern struct sort_entry sort_parent;
+extern struct sort_entry sort_lbr_dso;
+extern struct sort_entry sort_lbr_sym;
 extern enum sort_type sort__first_dimension;
 
 /**
@@ -70,6 +73,7 @@ struct hist_entry {
 		struct hist_entry *pair;
 		struct rb_root	  sorted_chain;
 	};
+	struct branch_info	*branch_info;
 	struct callchain_root	callchain[0];
 };
 
@@ -80,6 +84,7 @@ enum sort_type {
 	SORT_SYM,
 	SORT_PARENT,
 	SORT_CPU,
+	SORT_MISPREDICTED,
 };
 
 /*
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 4f377d9..daa251f 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -5,6 +5,7 @@
 #include <stdbool.h>
 #include <stdint.h>
 #include "map.h"
+#include "../perf.h"
 #include <linux/list.h>
 #include <linux/rbtree.h>
 #include <stdio.h>
@@ -115,6 +116,18 @@ struct map_symbol {
 	bool	      has_children;
 };
 
+struct addr_map_symbol {
+	struct map    *map;
+	struct symbol *sym;
+	u64	      addr;
+};
+
+struct branch_info {
+	struct addr_map_symbol from;
+	struct addr_map_symbol to;
+	struct branch_flags flags;
+};
+
 struct addr_location {
 	struct thread *thread;
 	struct map    *map;
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 11/12] perf: add support for sampling taken branch to perf record
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (9 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 14:49 ` [PATCH 12/12] perf: add support for taken branch sampling to perf report Stephane Eranian
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds a new option to enable taken branch stack
sampling, i.e., leverage the PERF_SAMPLE_BRANCH_STACK feature
of perf_events.

There is a new option to active this mode: -b.
It is possible to pass a set of filters to select the type of
branches to sample.

The following filters are available:
- any : any type of branches
- any_call : any function call or system call
- any_ret : any function return or system call return
- any_ind : any indirect branch
- u:  only when the branch target is at the user level
- k: only when the branch target is in the kernel

Filters can be combined by passing a comma separated list
to the option:

$ perf record -b any_call,u -e cycles:u branchy

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-record.txt |   18 +++++++
 tools/perf/builtin-record.c              |   75 ++++++++++++++++++++++++++++++
 2 files changed, 93 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 5a520f8..ddc1999 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -148,6 +148,24 @@ an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must ha
 corresponding events, i.e., they always refer to events defined earlier on the command
 line.
 
+-b::
+--branch-stack::
+Enable taken branch stack sampling. Each sample captures a series of consecutive
+taken branches. The number of branches captured with each sample depends on the
+underlying hardware, the type of branches of interested and the executed code.
+It is possible to filter the types of branches by enabling filters. The
+following filters are defined: any (any type of branches), any_call (any function
+call or system call), any_ret (any function return or system call return), any_ind
+(any indirect branch), u (only when the branch target is at the user level), k (only when
+the branch target is in the kernel). At least one of any, any_call, any_ret, any_ind
+must be provided. The privilege levels may be ommitted, in which case, the privilege
+levels of the associated event is applied to the branch filter. When sampling on multiple
+events, branch stack sampling is enabled for all the sampling events. The sampled branch
+type is the same for all events. The privilege levels are adjusted based on those of
+the associated event unless specified explicitly with this option. Note that taken
+branch sampling may not be available on all processors. The various filters must
+be specified as a comma separated list: -b any_ret,u,k
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f4c3fbe..4464677 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -57,6 +57,7 @@ static pid_t			child_pid			=     -1;
 static bool			no_inherit			=  false;
 static enum write_mode_t	write_mode			= WRITE_FORCE;
 static bool			call_graph			=  false;
+static int			branch_stack			=  0;
 static bool			inherit_stat			=  false;
 static bool			no_samples			=  false;
 static bool			sample_address			=  false;
@@ -204,6 +205,11 @@ static void config_attr(struct perf_evsel *evsel, struct perf_evlist *evlist)
 	if (system_wide)
 		attr->sample_type	|= PERF_SAMPLE_CPU;
 
+	if (branch_stack) {
+		attr->sample_type	|= PERF_SAMPLE_BRANCH_STACK;
+		attr->branch_sample_type = branch_stack;
+	}
+
 	if (sample_id_all_avail &&
 	    (sample_time || system_wide || !no_inherit || cpu_list))
 		attr->sample_type	|= PERF_SAMPLE_TIME;
@@ -717,6 +723,72 @@ out_delete_session:
 	return err;
 }
 
+#define BRANCH_OPT(n, m) \
+	{ .name = n, .mode = (m) }
+
+#define BRANCH_END { .name = NULL }
+
+struct branch_mode {
+	const char *name;
+	int mode;
+};
+
+static const struct branch_mode branch_modes[]={
+	BRANCH_OPT("u", PERF_SAMPLE_BRANCH_USER),
+	BRANCH_OPT("k", PERF_SAMPLE_BRANCH_KERNEL),
+	BRANCH_OPT("any", PERF_SAMPLE_BRANCH_ANY),
+	BRANCH_OPT("any_call", PERF_SAMPLE_BRANCH_ANY_CALL),
+	BRANCH_OPT("any_ret", PERF_SAMPLE_BRANCH_ANY_RETURN),
+	BRANCH_OPT("ind_call", PERF_SAMPLE_BRANCH_IND_CALL),
+	BRANCH_END
+};
+
+static int
+parse_branch_stack(const struct option *opt, const char *str, int unset __used)
+{
+#define ONLY_PLM (PERF_SAMPLE_BRANCH_USER|PERF_SAMPLE_BRANCH_KERNEL)
+	uint64_t *mode = (uint64_t *)opt->value;
+	const struct branch_mode *br;
+	char *s, *os, *p;
+	int ret = -1;
+
+	*mode = 0;
+
+	/* because str is read-only */
+	s = os = strdup(str);
+	if (!s)
+		return -1;
+
+	for (;;) {
+		p = strchr(s, ',');
+		if (p)
+			*p = '\0';
+
+		for (br = branch_modes; br->name; br++) {
+			if (!strcasecmp(s, br->name))
+				break;
+		}
+		if (!br->name)
+			goto error;
+
+		*mode |= br->mode;
+
+		if (!p)
+			break;
+
+		s = p + 1;
+	}
+	ret = 0;
+
+	if ((*mode & ~ONLY_PLM) == 0) {
+		pr_warning("error: needs at least one branch type with -b\n");
+		ret = -1;
+	}
+error:
+	free(os);
+	return ret;
+}
+
 static const char * const record_usage[] = {
 	"perf record [<options>] [<command>]",
 	"perf record [<options>] -- <command> [<options>]",
@@ -777,6 +849,9 @@ const struct option record_options[] = {
 	OPT_CALLBACK('G', "cgroup", &evsel_list, "name",
 		     "monitor event in cgroup name only",
 		     parse_cgroups),
+	OPT_CALLBACK('b', "branch stack", &branch_stack, "branch mode mask",
+		     "branch stack sampling modes",
+		     parse_branch_stack),
 	OPT_END()
 };
 
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH 12/12] perf: add support for taken branch sampling to perf report
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (10 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 11/12] perf: add support for sampling taken branch to perf record Stephane Eranian
@ 2011-10-06 14:49 ` Stephane Eranian
  2011-10-06 15:25 ` [PATCH 00/12] perf_events: add support for sampling taken branches Andi Kleen
  2011-10-06 18:32 ` Peter Zijlstra
  13 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 14:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

This patch adds support for taken branch sampling, i.e, the
PERF_SAMPLE_BRANCH_STACK feature to perf report. In other
words, to display histograms based on taken branches rather
than executed instructions addresses.

The new option is called -b and it takes no argument. To
generate meaningful output, the perf.data must have been
obtained using perf record -b xxx ... where xxx is a branch
filter option.

The output shows symbols, modules, sorted by 'who branches
where' the most often. The percentages reported in the first
column refer to the total number of branches captured and
not the usual number of samples.

Here is a quick example.
Here branchy is simple test program which looks as follows:

void f2(void)
{}
void f3(void)
{}
void f1(unsigned long n)
{
  if (n & 1UL)
    f2();
  else
    f3();
}
int main(void)
{
  unsigned long i;

  for (i=0; i < N; i++)
   f1(i);
  return 0;
}

Here is the output captured on Nehalem, if we are
only interested in user level function calls.

$ perf record -b any_call,u -e cycles:u branchy

$ perf report -b --sort=symbol
    52.34%  [.] main                   [.] f1
    24.04%  [.] f1                     [.] f3
    23.60%  [.] f1                     [.] f2
     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
     0.01%  [k] __printf               [k] _IO_vfprintf_internal
     0.01%  [k] main                   [k] __printf

About half (52%) of the call branches captured are from main() -> f1().
The second half (24%+23%) is split in two equal shares between
f1() -> f2(), f1() ->f3(). The output is as expected given the code.

It should be noted, that using -b in perf record does not eliminate
information in the perf.data file. Consequently, a typical profile
can also be obtained by perf report by simply not using its -b option.

Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-report.txt |    7 ++
 tools/perf/builtin-report.c              |   93 +++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 04253c0..fd132ed 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -134,6 +134,13 @@ OPTIONS
 	CPUs are specified with -: 0-2. Default is to report samples on all
 	CPUs.
 
+-b::
+--branch-stack::
+	Use the addresses of sampled taken branches instead of the instruction
+	address to build the histograms. To generate meaningful output, the
+	perf.data file must have been obtained using perf record -b xxx where
+	xxx is a branch filter option.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1]
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d7ff277..15c60e3 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -54,6 +54,46 @@ static symbol_filter_t	annotate_init;
 static const char	*cpu_list;
 static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 
+static int perf_session__add_branch_hist_entry(struct perf_session *session,
+					struct addr_location *al,
+					struct perf_sample *sample,
+					struct perf_evsel *evsel){
+	struct symbol *parent = NULL;
+	int err = 0;
+	unsigned i;
+	struct hist_entry *he;
+	struct branch_info *bi;
+
+	if ((sort__has_parent || symbol_conf.use_callchain) && sample->callchain) {
+		err = perf_session__resolve_callchain(session, al->thread,
+						      sample->callchain, &parent);
+		if (err)
+			return err;
+	}
+
+	bi = perf_session__resolve_bstack(session, al->thread,
+					  sample->branch_stack);
+	if (!bi)
+		return -ENOMEM;
+
+	for(i = 0; i < sample->branch_stack->nr; i++) {
+		if(hide_unresolved && !(bi[i].from.sym && bi[i].to.sym))
+			continue;
+		/*
+		 * The report shows the percentage of total branches captured
+		 * and not events sampled. Thus we use a pseudo period of 1.
+		 */
+		he = __hists__add_branch_entry(&evsel->hists, al, parent,
+					       &bi[i], 1);
+		if (he) {
+			evsel->hists.stats.total_period += 1;
+			hists__inc_nr_events(&evsel->hists, PERF_RECORD_SAMPLE);
+		} else
+			return -ENOMEM;
+	}
+	return err;
+}
+
 static int perf_session__add_hist_entry(struct perf_session *session,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -119,20 +159,28 @@ static int process_sample_event(union perf_event *event,
 		return -1;
 	}
 
-	if (al.filtered || (hide_unresolved && al.sym == NULL))
-		return 0;
-
 	if (cpu_list && !test_bit(sample->cpu, cpu_bitmap))
 		return 0;
 
-	if (al.map != NULL)
-		al.map->dso->hit = 1;
+	if (sort__branch_mode) {
+		if(perf_session__add_branch_hist_entry(session, &al, sample,
+						    evsel)) {
+			pr_debug("problem adding lbr entry, skipping event\n");
+			return -1;
+		}
+	} else {
+		if (al.filtered || (hide_unresolved && al.sym == NULL))
+			return 0;
 
-	if (perf_session__add_hist_entry(session, &al, sample, evsel)) {
-		pr_debug("problem incrementing symbol period, skipping event\n");
-		return -1;
-	}
+		if (al.map != NULL)
+			al.map->dso->hit = 1;
 
+		if (perf_session__add_hist_entry(session, &al, sample, evsel)) {
+			pr_debug("problem incrementing symbol period, skipping"
+					" event\n");
+			return -1;
+		}
+	}
 	return 0;
 }
 
@@ -182,6 +230,15 @@ static int perf_session__setup_sample_type(struct perf_session *self)
 			}
 	}
 
+	if(sort__branch_mode){
+		if(!(self->sample_type & PERF_SAMPLE_BRANCH_STACK)){
+			fprintf(stderr, "selected -b but no branch data."
+					" Did you call perf record without"
+					" -b?\n");
+			return -1;
+		}
+	}
+
 	return 0;
 }
 
@@ -487,6 +544,8 @@ static const struct option options[] = {
 	OPT_STRING(0, "symfs", &symbol_conf.symfs, "directory",
 		    "Look for files with symbols relative to this directory"),
 	OPT_STRING('c', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
+	OPT_BOOLEAN('b', "branch-stack", &sort__branch_mode,
+		    "use branch records for histogram filling"),
 	OPT_END()
 };
 
@@ -502,6 +561,22 @@ int cmd_report(int argc, const char **argv, const char *prefix __used)
 	if (inverted_callchain)
 		callchain_param.order = ORDER_CALLER;
 
+	if (sort__branch_mode){
+		if(use_browser)
+			fprintf(stderr, "Warning: TUI interface not supported"
+					" in branch mode\n");
+		if(symbol_conf.dso_list_str != NULL)
+			fprintf(stderr, "Warning: dso filtering not supported"
+					" in branch mode\n");
+		if(symbol_conf.sym_list_str != NULL)
+			fprintf(stderr, "Warning: symbol filtering not supported"
+					" in branch mode\n");
+
+		use_browser = 0;
+		symbol_conf.dso_list_str = NULL;
+		symbol_conf.sym_list_str = NULL;
+	}
+
 	if (strcmp(input_name, "-") != 0)
 		setup_browser(true);
 	else
-- 
1.7.4.1


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/12] perf_events: add support for sampling taken branches
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (11 preceding siblings ...)
  2011-10-06 14:49 ` [PATCH 12/12] perf: add support for taken branch sampling to perf report Stephane Eranian
@ 2011-10-06 15:25 ` Andi Kleen
  2011-10-07 10:23   ` Stephane Eranian
  2011-10-06 18:32 ` Peter Zijlstra
  13 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2011-10-06 15:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo


Thanks very nice patchkit (at least from the description,
haven't read it yet)

> # Overhead  Source Symbol     Target Symbol
> # ........  ................  ................
> 
>     18.13%  [.] f1            [.] main                          
>     18.10%  [.] main          [.] main                          
>     18.01%  [.] main          [.] f1                            

The big problem here is no line numbers (something I miss for other analysis
too).  I hope that can be fixed so that analysis inside functions becomes
possible. Something similar to pprofs line mode.

-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 14:49 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
@ 2011-10-06 15:32   ` Andi Kleen
  2011-10-06 16:43     ` Peter Zijlstra
  2011-10-07 10:38     ` Stephane Eranian
  0 siblings, 2 replies; 65+ messages in thread
From: Andi Kleen @ 2011-10-06 15:32 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, acme, ming.m.lin, andi,
	robert.richter, ravitillo

> +	kernel_insn_init(&insn, kaddr);
> +	insn_get_opcode(&insn);

This makes me uncomfortable. AFAIK that's the first use of the opcode
decoder being used directly for user space. It has a quite large attack
surface. Who says it cannot be exploited?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 15:32   ` Andi Kleen
@ 2011-10-06 16:43     ` Peter Zijlstra
  2011-10-06 17:14       ` Andi Kleen
  2011-10-07  7:06       ` Masami Hiramatsu
  2011-10-07 10:38     ` Stephane Eranian
  1 sibling, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 16:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stephane Eranian, linux-kernel, mingo, acme, ming.m.lin,
	robert.richter, ravitillo, masami.hiramatsu.pt, H. Peter Anvin

On Thu, 2011-10-06 at 17:32 +0200, Andi Kleen wrote:
> > +	kernel_insn_init(&insn, kaddr);
> > +	insn_get_opcode(&insn);
> 
> This makes me uncomfortable. AFAIK that's the first use of the opcode
> decoder being used directly for user space. It has a quite large attack
> surface. Who says it cannot be exploited?

You mean:
arch/x86/kernel/cpu/perf_event_intel_ds.c:intel_pmu_pebs_fixup_ip()
doesn't use the opcode decoder on user space code?

ISTR Masami telling me they ran fuzzers on it, feeding it bad
instructions etc. But maybe he can tell more.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support
  2011-10-06 14:49 ` [PATCH 01/12] perf_events: add generic taken branch sampling support Stephane Eranian
@ 2011-10-06 16:57   ` Peter Zijlstra
  2011-10-07 10:28     ` Stephane Eranian
  2011-10-06 17:01   ` Peter Zijlstra
  1 sibling, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 16:57 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
>  struct perf_branch_entry {
>         __u64                           from;
>         __u64                           to;
> +       struct {
> +               __u64                   mispred:1,  /* target mispredicted */
> +                                       predicted:1,/* target predicted */
> +                                       reserved:62;
> +       };
>  }; 

Why that anonymous structure?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support
  2011-10-06 14:49 ` [PATCH 01/12] perf_events: add generic taken branch sampling support Stephane Eranian
  2011-10-06 16:57   ` Peter Zijlstra
@ 2011-10-06 17:01   ` Peter Zijlstra
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 17:01 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> +enum perf_branch_sample_type {
> +       PERF_SAMPLE_BRANCH_USER         = 1U << 0, /* user level branches */
> +       PERF_SAMPLE_BRANCH_KERNEL       = 1U << 1, /* kernel level branches */


I just merged a patch-set that adds guest/host muck as well, would it
make sense to extend this stuff to support that as well?

> +       PERF_SAMPLE_BRANCH_ANY          = 1U << 2, /* any branch types */
> +       PERF_SAMPLE_BRANCH_ANY_CALL     = 1U << 3, /* any call branch */
> +       PERF_SAMPLE_BRANCH_ANY_RETURN   = 1U << 4, /* any return branch */
> +       PERF_SAMPLE_BRANCH_IND_CALL     = 1U << 5, /* indirect calls */
> +
> +       PERF_SAMPLE_BRANCH_MAX          = 1U << 6,/* non-ABI */
> +} 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 16:43     ` Peter Zijlstra
@ 2011-10-06 17:14       ` Andi Kleen
  2011-10-10  6:08         ` Ingo Molnar
  2011-10-07  7:06       ` Masami Hiramatsu
  1 sibling, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2011-10-06 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Stephane Eranian, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo, masami.hiramatsu.pt,
	H. Peter Anvin

On Thu, Oct 06, 2011 at 06:43:44PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-10-06 at 17:32 +0200, Andi Kleen wrote:
> > > +	kernel_insn_init(&insn, kaddr);
> > > +	insn_get_opcode(&insn);
> > 
> > This makes me uncomfortable. AFAIK that's the first use of the opcode
> > decoder being used directly for user space. It has a quite large attack
> > surface. Who says it cannot be exploited?
> 
> You mean:
> arch/x86/kernel/cpu/perf_event_intel_ds.c:intel_pmu_pebs_fixup_ip()
> doesn't use the opcode decoder on user space code?

True it does. Ok the second.
I know uprobes is planning it too, so the problem has to be handled
eventually.

Still it makes me uncomfortable.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling
  2011-10-06 14:49 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling Stephane Eranian
@ 2011-10-06 17:25   ` Peter Zijlstra
  2011-10-07 10:34     ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 17:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> 
> On Intel X86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
> therefore both features must be coordinated as they may not
> configure LBR the same way. 

Differently, you mean? Both wanting the same configuration seems fine.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
  2011-10-06 14:49 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
@ 2011-10-06 17:54   ` Peter Zijlstra
  2011-10-06 18:05   ` Peter Zijlstra
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 17:54 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> @@ -876,6 +889,13 @@ static void intel_pmu_disable_event(struct perf_event *event)
>                 return;
>         }
>  
> +       /*
> +        * must disable before any actual event
> +        * because any event may be combined with LBR
> +        */
> +       if (intel_pmu_needs_lbr_smpl(event))
> +               intel_pmu_lbr_disable(event);
> +
>         if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
>                 intel_pmu_disable_fixed(hwc);
>                 return; 

I don't get that, since until you disable the counter a PMI could happen
(*) you'd need to disable the LBR after you disable the counters.

* doesn't actually happen since we likely have the whole pmu disabled
here, but that's another thing ;-)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86
  2011-10-06 14:49 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
  2011-10-06 17:54   ` Peter Zijlstra
@ 2011-10-06 18:05   ` Peter Zijlstra
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 18:05 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c

> @@ -1046,6 +1072,9 @@ again: 

$ cat ~/.gitconfig 
[diff "default"]
                xfuncname = "^[[:alpha:]$_].*[^:]$"


Cures that. Although there's different ways too..
https://lkml.org/lkml/2011/8/25/613

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/12] perf_events: add support for sampling taken branches
  2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
                   ` (12 preceding siblings ...)
  2011-10-06 15:25 ` [PATCH 00/12] perf_events: add support for sampling taken branches Andi Kleen
@ 2011-10-06 18:32 ` Peter Zijlstra
  2011-10-06 21:41   ` Stephane Eranian
  13 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 18:32 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo


This doesn't apply at all, its against some ancient tree. Please re-spin
against a current tip/master or so.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK
  2011-10-06 14:49 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
@ 2011-10-06 18:50   ` Peter Zijlstra
  2011-10-07 10:25     ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 18:50 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:

Is a From: Roberto ..

Line missing here?

> This patch adds:
> - ability to parse samples with PERF_SAMPLE_BRANCH_STACK
> - sort on branches
> - build histograms on branches
> 
> Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
> Signed-off-by: Stephane Eranian <eranian@google.com> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported
  2011-10-06 14:49 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
@ 2011-10-06 18:53   ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-06 18:53 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> PERF_SAMPLE_BRANCH_* is disabled for:

> - ALL but Intel X86 architecture
> - AMD64 processors
> 
It would be good if you could CC all arch/*/kernel/perf_event*
maintainers on this, that way they are aware of the existence of the
feature. Maybe one of them can actually support some of this.. who
knows ;-)

Robert, does this Fam15 thing have anything useful in this regards?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/12] perf_events: add support for sampling taken branches
  2011-10-06 18:32 ` Peter Zijlstra
@ 2011-10-06 21:41   ` Stephane Eranian
  0 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-06 21:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, Oct 6, 2011 at 8:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> This doesn't apply at all, its against some ancient tree. Please re-spin
> against a current tip/master or so.
>
It is against Linus' 3.1-rc9 commit 976d167.
I will re-spin tomorrow and cc the arch maintainers too.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 16:43     ` Peter Zijlstra
  2011-10-06 17:14       ` Andi Kleen
@ 2011-10-07  7:06       ` Masami Hiramatsu
  1 sibling, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-07  7:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Stephane Eranian, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo, H. Peter Anvin

(2011/10/07 1:43), Peter Zijlstra wrote:
> On Thu, 2011-10-06 at 17:32 +0200, Andi Kleen wrote:
>>> +	kernel_insn_init(&insn, kaddr);
>>> +	insn_get_opcode(&insn);
>>
>> This makes me uncomfortable. AFAIK that's the first use of the opcode
>> decoder being used directly for user space. It has a quite large attack
>> surface. Who says it cannot be exploited?
> 
> You mean:
> arch/x86/kernel/cpu/perf_event_intel_ds.c:intel_pmu_pebs_fixup_ip()
> doesn't use the opcode decoder on user space code?
> 
> ISTR Masami telling me they ran fuzzers on it, feeding it bad
> instructions etc. But maybe he can tell more.

I've tested it by decoding userspace libraries and binaries.
It is also easy to test decoding with random binaries too :)
I'll try to do that.

Thanks,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 00/12] perf_events: add support for sampling taken branches
  2011-10-06 15:25 ` [PATCH 00/12] perf_events: add support for sampling taken branches Andi Kleen
@ 2011-10-07 10:23   ` Stephane Eranian
  0 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:23 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, peterz, mingo, acme, ming.m.lin, robert.richter, ravitillo

On Thu, Oct 6, 2011 at 5:25 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Thanks very nice patchkit (at least from the description,
> haven't read it yet)

Thanks.

>
>> # Overhead  Source Symbol     Target Symbol
>> # ........  ................  ................
>>
>>     18.13%  [.] f1            [.] main
>>     18.10%  [.] main          [.] main
>>     18.01%  [.] main          [.] f1
>
> The big problem here is no line numbers (something I miss for other analysis
> too).  I hope that can be fixed so that analysis inside functions becomes
> possible. Something similar to pprofs line mode.
>
You get line numbers with annotate only today so this is not specific to this
particular patch. But yeah, it would be nice.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK
  2011-10-06 18:50   ` Peter Zijlstra
@ 2011-10-07 10:25     ` Stephane Eranian
  2011-10-07 10:27       ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, Oct 6, 2011 at 8:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
>
> Is a From: Roberto ..
>
> Line missing here?
>
Yes, I can add it back if that's the way the system works.

>> This patch adds:
>> - ability to parse samples with PERF_SAMPLE_BRANCH_STACK
>> - sort on branches
>> - build histograms on branches
>>
>> Signed-off-by: Roberto Agostino Vitillo <ravitillo@lbl.gov>
>> Signed-off-by: Stephane Eranian <eranian@google.com>
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK
  2011-10-07 10:25     ` Stephane Eranian
@ 2011-10-07 10:27       ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 10:27 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Fri, 2011-10-07 at 12:25 +0200, Stephane Eranian wrote:
> On Thu, Oct 6, 2011 at 8:50 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> >
> > Is a From: Roberto ..
> >
> > Line missing here?
> >
> Yes, I can add it back if that's the way the system works.

Please, otherwise it might get attributed to you.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support
  2011-10-06 16:57   ` Peter Zijlstra
@ 2011-10-07 10:28     ` Stephane Eranian
  2011-10-07 10:32       ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, Oct 6, 2011 at 6:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
>>  struct perf_branch_entry {
>>         __u64                           from;
>>         __u64                           to;
>> +       struct {
>> +               __u64                   mispred:1,  /* target mispredicted */
>> +                                       predicted:1,/* target predicted */
>> +                                       reserved:62;
>> +       };
>>  };
>
> Why that anonymous structure?
>
The interface can return more than source/destination, it can also
return prediction infos.
Prediction is boolean, thus we only need a couple of bits. The reason
there are two bits
and not just one is to disambiguate between: predicted correctly and
'prediction reporting
unsupported'. For instance, prediction is also supported since
Nehalem, Core2/Atom do
not have it.
But maybe you're just commenting of the anonymous vs. named struct for that?
It is just for convenience. Isn't that the same argument for the
anonymous bitfield
in struct perf_event_attr?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support
  2011-10-07 10:28     ` Stephane Eranian
@ 2011-10-07 10:32       ` Peter Zijlstra
  2011-10-07 10:44         ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 10:32 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Fri, 2011-10-07 at 12:28 +0200, Stephane Eranian wrote:
> On Thu, Oct 6, 2011 at 6:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> >>  struct perf_branch_entry {
> >>         __u64                           from;
> >>         __u64                           to;
> >> +       struct {
> >> +               __u64                   mispred:1,  /* target mispredicted */
> >> +                                       predicted:1,/* target predicted */
> >> +                                       reserved:62;
> >> +       };
> >>  };
> >
> > Why that anonymous structure?
> >
> The interface can return more than source/destination, it can also
> return prediction infos.
> Prediction is boolean, thus we only need a couple of bits. The reason
> there are two bits
> and not just one is to disambiguate between: predicted correctly and
> 'prediction reporting
> unsupported'. For instance, prediction is also supported since
> Nehalem, Core2/Atom do
> not have it.

Right, I got that.

> But maybe you're just commenting of the anonymous vs. named struct for
> that?

I don't see the need for any struct, why can't the bitfield live in
perf_branch_entry proper?

> It is just for convenience. Isn't that the same argument for the
> anonymous bitfield
> in struct perf_event_attr? 

But that isn't wrapped in a structure either is it..

I guess I'm asking, what's wrong with:

struct perf_branch_entry {
	__u64		from;
	__u64		to;
	__u64		mispred:1,
			predicted:1,
			reserved:62;
};

?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling
  2011-10-06 17:25   ` Peter Zijlstra
@ 2011-10-07 10:34     ` Stephane Eranian
  2011-10-07 10:37       ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Thu, Oct 6, 2011 at 7:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
>>
>> On Intel X86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
>> therefore both features must be coordinated as they may not
>> configure LBR the same way.
>
> Differently, you mean? Both wanting the same configuration seems fine.
>
No, I meant you can allow LBR + precise_sampling>1 ONLY when
users set LBR to record ALL branches and at the same priv levels. In
other words,
you're exposing the LBR content used by the fixup code.

One could argue, that if LBR is set to filter certain branches, it may
also be okay,
it's just that you won't necessarily get the same number of successful
fixups. The
samples are tagged when fixups were successful, so that may also be an viable
option. Best effort given the content of the LBR. Depending on the
code, that might
be slightly better than dropping to precise_sampling=1 (no fixups).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling
  2011-10-07 10:34     ` Stephane Eranian
@ 2011-10-07 10:37       ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 10:37 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Fri, 2011-10-07 at 12:34 +0200, Stephane Eranian wrote:
> On Thu, Oct 6, 2011 at 7:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
> >>
> >> On Intel X86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
> >> therefore both features must be coordinated as they may not
> >> configure LBR the same way.
> >
> > Differently, you mean? Both wanting the same configuration seems fine.
> >
> No, I meant you can allow LBR + precise_sampling>1 ONLY when
> users set LBR to record ALL branches and at the same priv levels. In
> other words,
> you're exposing the LBR content used by the fixup code.

Right. 

> One could argue, that if LBR is set to filter certain branches, it may
> also be okay,
> it's just that you won't necessarily get the same number of successful
> fixups. The
> samples are tagged when fixups were successful, so that may also be an viable
> option. Best effort given the content of the LBR. Depending on the
> code, that might
> be slightly better than dropping to precise_sampling=1 (no fixups).

Possible, but I imagine it might surprise some people.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 15:32   ` Andi Kleen
  2011-10-06 16:43     ` Peter Zijlstra
@ 2011-10-07 10:38     ` Stephane Eranian
  2011-10-07 10:40       ` Stephane Eranian
  2011-10-07 11:25       ` Masami Hiramatsu
  1 sibling, 2 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, peterz, mingo, acme, ming.m.lin, robert.richter, ravitillo

On Thu, Oct 6, 2011 at 5:32 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> +     kernel_insn_init(&insn, kaddr);
>> +     insn_get_opcode(&insn);
>
> This makes me uncomfortable. AFAIK that's the first use of the opcode
> decoder being used directly for user space. It has a quite large attack
> surface. Who says it cannot be exploited?
>
This is not new, it's already used for the PEBS fixups and that includes
user level fixups, if possible.

We are not executing the instruction here, just decoding it to filter it out
from a buffer if necessary.

> -Andi
> --
> ak@linux.intel.com -- Speaking for myself only.
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:38     ` Stephane Eranian
@ 2011-10-07 10:40       ` Stephane Eranian
  2011-10-07 10:42         ` Peter Zijlstra
  2011-10-07 11:25       ` Masami Hiramatsu
  1 sibling, 1 reply; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, peterz, mingo, acme, ming.m.lin, robert.richter, ravitillo

On Fri, Oct 7, 2011 at 12:38 PM, Stephane Eranian <eranian@google.com> wrote:
> On Thu, Oct 6, 2011 at 5:32 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>> +     kernel_insn_init(&insn, kaddr);
>>> +     insn_get_opcode(&insn);
>>
>> This makes me uncomfortable. AFAIK that's the first use of the opcode
>> decoder being used directly for user space. It has a quite large attack
>> surface. Who says it cannot be exploited?
>>
> This is not new, it's already used for the PEBS fixups and that includes
> user level fixups, if possible.
>
> We are not executing the instruction here, just decoding it to filter it out
> from a buffer if necessary.
>
I would add that in this particular usage, the source address is coming
straight from LBR, it's not made up my SW. That means it corresponds
to a point where there was a control flow change. But it can certainly
be any x86 opcode (not just branches). LBR captures control flow changes
due to traps, faults, interrupts.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:40       ` Stephane Eranian
@ 2011-10-07 10:42         ` Peter Zijlstra
  2011-10-07 10:49           ` Stephane Eranian
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 10:42 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, linux-kernel, mingo, acme, ming.m.lin,
	robert.richter, ravitillo

On Fri, 2011-10-07 at 12:40 +0200, Stephane Eranian wrote:
> On Fri, Oct 7, 2011 at 12:38 PM, Stephane Eranian <eranian@google.com> wrote:
> > On Thu, Oct 6, 2011 at 5:32 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >>> +     kernel_insn_init(&insn, kaddr);
> >>> +     insn_get_opcode(&insn);
> >>
> >> This makes me uncomfortable. AFAIK that's the first use of the opcode
> >> decoder being used directly for user space. It has a quite large attack
> >> surface. Who says it cannot be exploited?
> >>
> > This is not new, it's already used for the PEBS fixups and that includes
> > user level fixups, if possible.
> >
> > We are not executing the instruction here, just decoding it to filter it out
> > from a buffer if necessary.
> >
> I would add that in this particular usage, the source address is coming
> straight from LBR, it's not made up my SW. That means it corresponds
> to a point where there was a control flow change. But it can certainly
> be any x86 opcode (not just branches). LBR captures control flow changes
> due to traps, faults, interrupts.

You could still fuzz it after the cpu passed through and before the
kernel reads the LBR. Its a narrow window, but quite feasible.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 01/12] perf_events: add generic taken branch sampling support
  2011-10-07 10:32       ` Peter Zijlstra
@ 2011-10-07 10:44         ` Stephane Eranian
  0 siblings, 0 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, andi, robert.richter, ravitillo

On Fri, Oct 7, 2011 at 12:32 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-10-07 at 12:28 +0200, Stephane Eranian wrote:
>> On Thu, Oct 6, 2011 at 6:57 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, 2011-10-06 at 16:49 +0200, Stephane Eranian wrote:
>> >>  struct perf_branch_entry {
>> >>         __u64                           from;
>> >>         __u64                           to;
>> >> +       struct {
>> >> +               __u64                   mispred:1,  /* target mispredicted */
>> >> +                                       predicted:1,/* target predicted */
>> >> +                                       reserved:62;
>> >> +       };
>> >>  };
>> >
>> > Why that anonymous structure?
>> >
>> The interface can return more than source/destination, it can also
>> return prediction infos.
>> Prediction is boolean, thus we only need a couple of bits. The reason
>> there are two bits
>> and not just one is to disambiguate between: predicted correctly and
>> 'prediction reporting
>> unsupported'. For instance, prediction is also supported since
>> Nehalem, Core2/Atom do
>> not have it.
>
> Right, I got that.
>
>> But maybe you're just commenting of the anonymous vs. named struct for
>> that?
>
> I don't see the need for any struct, why can't the bitfield live in
> perf_branch_entry proper?
>
>> It is just for convenience. Isn't that the same argument for the
>> anonymous bitfield
>> in struct perf_event_attr?
>
> But that isn't wrapped in a structure either is it..
>
> I guess I'm asking, what's wrong with:
>
> struct perf_branch_entry {
>        __u64           from;
>        __u64           to;
>        __u64           mispred:1,
>                        predicted:1,
>                        reserved:62;
> };
>
Ok, I think I missed your original point about the anonymous
struct. What you're suggesting above is certainly fine with me.
I am not just used to bitfields mixed with other fields in a struct.
I'll fix that.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:42         ` Peter Zijlstra
@ 2011-10-07 10:49           ` Stephane Eranian
  2011-10-07 11:18             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Stephane Eranian @ 2011-10-07 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, linux-kernel, mingo, acme, ming.m.lin,
	robert.richter, ravitillo

On Fri, Oct 7, 2011 at 12:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-10-07 at 12:40 +0200, Stephane Eranian wrote:
>> On Fri, Oct 7, 2011 at 12:38 PM, Stephane Eranian <eranian@google.com> wrote:
>> > On Thu, Oct 6, 2011 at 5:32 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >>> +     kernel_insn_init(&insn, kaddr);
>> >>> +     insn_get_opcode(&insn);
>> >>
>> >> This makes me uncomfortable. AFAIK that's the first use of the opcode
>> >> decoder being used directly for user space. It has a quite large attack
>> >> surface. Who says it cannot be exploited?
>> >>
>> > This is not new, it's already used for the PEBS fixups and that includes
>> > user level fixups, if possible.
>> >
>> > We are not executing the instruction here, just decoding it to filter it out
>> > from a buffer if necessary.
>> >
>> I would add that in this particular usage, the source address is coming
>> straight from LBR, it's not made up my SW. That means it corresponds
>> to a point where there was a control flow change. But it can certainly
>> be any x86 opcode (not just branches). LBR captures control flow changes
>> due to traps, faults, interrupts.
>
> You could still fuzz it after the cpu passed through and before the
> kernel reads the LBR. Its a narrow window, but quite feasible.
>
Yeah, depending on the depth of the LBR. But then what happens, you
decode an instruction that is not what was executed. In this scenario,
either that new instruction matches the user LBR filter or it does not.
In the latter situation, the LBR record is not copied into the sampling
buffer. In the former case, it is included and all you've done is generate
a false profile. I think this can happen even without being malicious but
just with self-modifying (JIT) code, though the depth of LBR is such that
this is very unlikely (it must happen in the last 16 branches before the
PMU interrupt).

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:49           ` Stephane Eranian
@ 2011-10-07 11:18             ` Peter Zijlstra
  2011-10-07 11:21             ` Peter Zijlstra
  2011-10-07 15:42             ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Andi Kleen
  2 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 11:18 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, linux-kernel, mingo, acme, ming.m.lin,
	robert.richter, ravitillo

On Fri, 2011-10-07 at 12:49 +0200, Stephane Eranian wrote:
> > You could still fuzz it after the cpu passed through and before the
> > kernel reads the LBR. Its a narrow window, but quite feasible.
> >
> Yeah, depending on the depth of the LBR. But then what happens, you
> decode an instruction that is not what was executed. 

Right, and Andi's concern is that this might cause our instruction
decoder to blow up, or worse.

The whole false profile thing isn't really a problem, I mean, that's
what you get for poking at your own instruction stream.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:49           ` Stephane Eranian
  2011-10-07 11:18             ` Peter Zijlstra
@ 2011-10-07 11:21             ` Peter Zijlstra
  2011-10-07 11:54               ` Masami Hiramatsu
  2011-10-07 15:42             ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Andi Kleen
  2 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 11:21 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, linux-kernel, mingo, acme, ming.m.lin,
	robert.richter, ravitillo, masami.hiramatsu.pt

On Fri, 2011-10-07 at 13:18 +0200, Peter Zijlstra wrote:
> > Yeah, depending on the depth of the LBR. But then what happens, you
> > decode an instruction that is not what was executed. 
> 
> Right, and Andi's concern is that this might cause our instruction
> decoder to blow up, or worse. 

That is, we have a fair confidence that its capable of correctly
decoding correct streams, but I understood from Masami that they haven't
actually tried feeding it crap.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:38     ` Stephane Eranian
  2011-10-07 10:40       ` Stephane Eranian
@ 2011-10-07 11:25       ` Masami Hiramatsu
  2011-10-07 11:40         ` Peter Zijlstra
  2011-10-07 15:09         ` Andi Kleen
  1 sibling, 2 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-07 11:25 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, linux-kernel, peterz, mingo, acme, ming.m.lin,
	robert.richter, ravitillo

(2011/10/07 19:38), Stephane Eranian wrote:
> On Thu, Oct 6, 2011 at 5:32 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>> +     kernel_insn_init(&insn, kaddr);
>>> +     insn_get_opcode(&insn);
>>
>> This makes me uncomfortable. AFAIK that's the first use of the opcode
>> decoder being used directly for user space. It has a quite large attack
>> surface. Who says it cannot be exploited?
>>
> This is not new, it's already used for the PEBS fixups and that includes
> user level fixups, if possible.

Oops, I've thought existing code is only for kernel. Hmm, I guess
this could cause a problem when running 32bit binary on x86-64.

As you know, there are differences between decoding routines of
x86-64 and -32. So insn_init() has the third parameter to give
a flag for changing it.
However, since the kernel itself runs on the native mode, I've added
kernel_insn_init() only for the kernel decoding.

----
/* Init insn for kernel text */
static inline void kernel_insn_init(struct insn *insn, const void *kaddr)
{
#ifdef CONFIG_X86_64
        insn_init(insn, kaddr, 1);
#else /* CONFIG_X86_32 */
        insn_init(insn, kaddr, 0);
#endif
}
----

Thus, I strongly recommend you to check the target instruction bitwidth
on x86-64 as (probably) below;

if (kernel_ip(addr) || !test_tsk_thread_flag(current, TIF_IA32))
	kernel_insn_init(insn, addr);
else
	insn_init(insn, addr, 0);

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 11:25       ` Masami Hiramatsu
@ 2011-10-07 11:40         ` Peter Zijlstra
  2011-10-07 15:44           ` Andi Kleen
  2011-10-07 15:09         ` Andi Kleen
  1 sibling, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 11:40 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Stephane Eranian, Andi Kleen, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo

On Fri, 2011-10-07 at 20:25 +0900, Masami Hiramatsu wrote:
> Oops, I've thought existing code is only for kernel. Hmm, I guess
> this could cause a problem when running 32bit binary on x86-64.
> 
I queued below.
> 
---
Subject: perf, x86: Fix PEBS instruction unwind
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Oct 07 13:36:40 CEST 2011

Masami spotted that we always try to decode the instruction stream as
64bit instructions when running a 64bit kernel, this doesn't work for
ia32-compat proglets.

Use TIF_IA32 to detect if we need to use the 32bit instruction
decoder.

Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/cpu/perf_event_intel_ds.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -493,6 +493,7 @@ static int intel_pmu_pebs_fixup_ip(struc
 	unsigned long from = cpuc->lbr_entries[0].from;
 	unsigned long old_to, to = cpuc->lbr_entries[0].to;
 	unsigned long ip = regs->ip;
+	int is_64bit = 0;
 
 	/*
 	 * We don't need to fixup if the PEBS assist is fault like
@@ -544,7 +545,10 @@ static int intel_pmu_pebs_fixup_ip(struc
 		} else
 			kaddr = (void *)to;
 
-		kernel_insn_init(&insn, kaddr);
+#ifdef CONFIG_X86_64
+		is_64bit = kernel_ip(to) || !test_thread_flag(TIF_IA32);
+#endif
+		insn_init(&insn, kaddr, is_64bit);
 		insn_get_length(&insn);
 		to += insn.length;
 	} while (to < ip);


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 11:21             ` Peter Zijlstra
@ 2011-10-07 11:54               ` Masami Hiramatsu
  2011-10-07 13:31                 ` [PATCH] x86: Fix insn decoder for longer instruction Masami Hiramatsu
  2011-10-10  6:09                 ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Ingo Molnar
  0 siblings, 2 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-07 11:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Andi Kleen, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo

(2011/10/07 20:21), Peter Zijlstra wrote:
> On Fri, 2011-10-07 at 13:18 +0200, Peter Zijlstra wrote:
>>> Yeah, depending on the depth of the LBR. But then what happens, you
>>> decode an instruction that is not what was executed. 
>>
>> Right, and Andi's concern is that this might cause our instruction
>> decoder to blow up, or worse. 
> 
> That is, we have a fair confidence that its capable of correctly
> decoding correct streams, but I understood from Masami that they haven't
> actually tried feeding it crap.

OK, I'll do hardening it.

Thanks,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH] x86: Fix insn decoder for longer instruction
  2011-10-07 11:54               ` Masami Hiramatsu
@ 2011-10-07 13:31                 ` Masami Hiramatsu
  2011-10-10  7:04                   ` Ingo Molnar
  2011-10-10  6:09                 ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Ingo Molnar
  1 sibling, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-07 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Andi Kleen, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo, yrl.pp-manager.tt,
	Masami Hiramatsu, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Peter Zijlstra, Stephane Eranian, Andi Kleen, Srikar Dronamraju

Fix x86 insn decoder for hardening against invalid length
instructions. This adds length checkings for each byte-read
site and if it exceeds MAX_INSN_SIZE, returns immediately.
This can happen when decoding user-space binary.

Caller can check whether it happened by checking insn.*.got
member is set or not.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/x86/lib/insn.c |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/arch/x86/lib/insn.c b/arch/x86/lib/insn.c
index 9f33b98..374562e 100644
--- a/arch/x86/lib/insn.c
+++ b/arch/x86/lib/insn.c
@@ -22,14 +22,23 @@
 #include <asm/inat.h>
 #include <asm/insn.h>
 
-#define get_next(t, insn)	\
-	({t r; r = *(t*)insn->next_byte; insn->next_byte += sizeof(t); r; })
+/* Verify next sizeof(t) bytes can be on the same instruction */
+#define validate_next(t, insn, n)	\
+	((insn)->next_byte + sizeof(t) + n - (insn)->kaddr <= MAX_INSN_SIZE)
+
+#define __get_next(t, insn)	\
+	({ t r = *(t*)insn->next_byte; insn->next_byte += sizeof(t); r; })
+
+#define __peek_nbyte_next(t, insn, n)	\
+	({ t r = *(t*)((insn)->next_byte + n); r; })
 
-#define peek_next(t, insn)	\
-	({t r; r = *(t*)insn->next_byte; r; })
+#define get_next(t, insn)	\
+	({ if (unlikely(!validate_next(t, insn, 0))) goto err_out; __get_next(t, insn); })
 
 #define peek_nbyte_next(t, insn, n)	\
-	({t r; r = *(t*)((insn)->next_byte + n); r; })
+	({ if (unlikely(!validate_next(t, insn, n))) goto err_out; __peek_nbyte_next(t, insn, n); })
+
+#define peek_next(t, insn)	peek_nbyte_next(t, insn, 0)
 
 /**
  * insn_init() - initialize struct insn
@@ -158,6 +167,8 @@ vex_end:
 	insn->vex_prefix.got = 1;
 
 	prefixes->got = 1;
+
+err_out:
 	return;
 }
 
@@ -208,6 +219,9 @@ void insn_get_opcode(struct insn *insn)
 		insn->attr = 0;	/* This instruction is bad */
 end:
 	opcode->got = 1;
+
+err_out:
+	return;
 }
 
 /**
@@ -241,6 +255,9 @@ void insn_get_modrm(struct insn *insn)
 	if (insn->x86_64 && inat_is_force64(insn->attr))
 		insn->opnd_bytes = 8;
 	modrm->got = 1;
+
+err_out:
+	return;
 }
 
 
@@ -290,6 +307,9 @@ void insn_get_sib(struct insn *insn)
 		}
 	}
 	insn->sib.got = 1;
+
+err_out:
+	return;
 }
 
 
@@ -351,6 +371,9 @@ void insn_get_displacement(struct insn *insn)
 	}
 out:
 	insn->displacement.got = 1;
+
+err_out:
+	return;
 }
 
 /* Decode moffset16/32/64 */
@@ -373,6 +396,9 @@ static void __get_moffset(struct insn *insn)
 		break;
 	}
 	insn->moffset1.got = insn->moffset2.got = 1;
+
+err_out:
+	return;
 }
 
 /* Decode imm v32(Iz) */
@@ -389,6 +415,9 @@ static void __get_immv32(struct insn *insn)
 		insn->immediate.nbytes = 4;
 		break;
 	}
+
+err_out:
+	return;
 }
 
 /* Decode imm v64(Iv/Ov) */
@@ -411,6 +440,9 @@ static void __get_immv(struct insn *insn)
 		break;
 	}
 	insn->immediate1.got = insn->immediate2.got = 1;
+
+err_out:
+	return;
 }
 
 /* Decode ptr16:16/32(Ap) */
@@ -432,6 +464,9 @@ static void __get_immptr(struct insn *insn)
 	insn->immediate2.value = get_next(unsigned short, insn);
 	insn->immediate2.nbytes = 2;
 	insn->immediate1.got = insn->immediate2.got = 1;
+
+err_out:
+	return;
 }
 
 /**
@@ -496,6 +531,9 @@ void insn_get_immediate(struct insn *insn)
 	}
 done:
 	insn->immediate.got = 1;
+
+err_out:
+	return;
 }
 
 /**


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 11:25       ` Masami Hiramatsu
  2011-10-07 11:40         ` Peter Zijlstra
@ 2011-10-07 15:09         ` Andi Kleen
  2011-10-07 16:05           ` Peter Zijlstra
  1 sibling, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2011-10-07 15:09 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Stephane Eranian, Andi Kleen, linux-kernel, peterz, mingo, acme,
	ming.m.lin, robert.richter, ravitillo

> if (kernel_ip(addr) || !test_tsk_thread_flag(current, TIF_IA32))

That's not good, a 64bit process can do 32bit syscalls.

The only fully working way is to check the code segment, unfortunately
that requires checking the LDT to be fully correct :-(

In 2.4 we had some code for that if you want to copy, but it disappeared
later.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 10:49           ` Stephane Eranian
  2011-10-07 11:18             ` Peter Zijlstra
  2011-10-07 11:21             ` Peter Zijlstra
@ 2011-10-07 15:42             ` Andi Kleen
  2 siblings, 0 replies; 65+ messages in thread
From: Andi Kleen @ 2011-10-07 15:42 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Peter Zijlstra, Andi Kleen, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo

> just with self-modifying (JIT) code, though the depth of LBR is such that
> this is very unlikely (it must happen in the last 16 branches before the
> PMU interrupt).

Interrupts can make this window arbitary long.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 11:40         ` Peter Zijlstra
@ 2011-10-07 15:44           ` Andi Kleen
  0 siblings, 0 replies; 65+ messages in thread
From: Andi Kleen @ 2011-10-07 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Masami Hiramatsu, Stephane Eranian, Andi Kleen, linux-kernel,
	mingo, acme, ming.m.lin, robert.richter, ravitillo

> -		kernel_insn_init(&insn, kaddr);
> +#ifdef CONFIG_X86_64
> +		is_64bit = kernel_ip(to) || !test_thread_flag(TIF_IA32);

Sorry that's wrong. If anything check the code segment
(but see other email)

-Andi



-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 15:09         ` Andi Kleen
@ 2011-10-07 16:05           ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-07 16:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Masami Hiramatsu, Stephane Eranian, linux-kernel, mingo, acme,
	ming.m.lin, robert.richter, ravitillo

On Fri, 2011-10-07 at 17:09 +0200, Andi Kleen wrote:
> > if (kernel_ip(addr) || !test_tsk_thread_flag(current, TIF_IA32))
> 
> That's not good, a 64bit process can do 32bit syscalls.
> 
> The only fully working way is to check the code segment, unfortunately
> that requires checking the LDT to be fully correct :-(

I can't really be arsed about people doing daft shit like that. The
worst that happens is that their PEBS fixup fails. Honestly if you do
crazy things like that you deserve far far worse.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-06 17:14       ` Andi Kleen
@ 2011-10-10  6:08         ` Ingo Molnar
  2011-10-10  9:39           ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2011-10-10  6:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Stephane Eranian, linux-kernel, acme, ming.m.lin,
	robert.richter, ravitillo, masami.hiramatsu.pt, H. Peter Anvin


* Andi Kleen <andi@firstfloor.org> wrote:

> On Thu, Oct 06, 2011 at 06:43:44PM +0200, Peter Zijlstra wrote:
> > On Thu, 2011-10-06 at 17:32 +0200, Andi Kleen wrote:
> > > > +	kernel_insn_init(&insn, kaddr);
> > > > +	insn_get_opcode(&insn);
> > > 
> > > This makes me uncomfortable. AFAIK that's the first use of the opcode
> > > decoder being used directly for user space. It has a quite large attack
> > > surface. Who says it cannot be exploited?
> > 
> > You mean:
> > arch/x86/kernel/cpu/perf_event_intel_ds.c:intel_pmu_pebs_fixup_ip()
> > doesn't use the opcode decoder on user space code?
> 
> True it does. Ok the second.

That's what matters.

> I know uprobes is planning it too, so the problem has to be handled
> eventually.
> 
> Still it makes me uncomfortable.

FUD is not how we approach these things - you need to say something 
specific.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-07 11:54               ` Masami Hiramatsu
  2011-10-07 13:31                 ` [PATCH] x86: Fix insn decoder for longer instruction Masami Hiramatsu
@ 2011-10-10  6:09                 ` Ingo Molnar
  2011-10-10 14:05                   ` Masami Hiramatsu
  1 sibling, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2011-10-10  6:09 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Stephane Eranian, Andi Kleen, linux-kernel, acme,
	ming.m.lin, robert.richter, ravitillo


* Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:

> (2011/10/07 20:21), Peter Zijlstra wrote:
> > On Fri, 2011-10-07 at 13:18 +0200, Peter Zijlstra wrote:
> >>> Yeah, depending on the depth of the LBR. But then what happens, you
> >>> decode an instruction that is not what was executed. 
> >>
> >> Right, and Andi's concern is that this might cause our instruction
> >> decoder to blow up, or worse. 
> > 
> > That is, we have a fair confidence that its capable of correctly
> > decoding correct streams, but I understood from Masami that they haven't
> > actually tried feeding it crap.
> 
> OK, I'll do hardening it.

Yes, that's a good idea. Could you perhaps add it to the existing 
build time sanity test we do for the instruction decoder, to feed it 
a sequence of /dev/urandom input or such?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] x86: Fix insn decoder for longer instruction
  2011-10-07 13:31                 ` [PATCH] x86: Fix insn decoder for longer instruction Masami Hiramatsu
@ 2011-10-10  7:04                   ` Ingo Molnar
  0 siblings, 0 replies; 65+ messages in thread
From: Ingo Molnar @ 2011-10-10  7:04 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Peter Zijlstra, Stephane Eranian, Andi Kleen, linux-kernel, acme,
	ming.m.lin, robert.richter, ravitillo, yrl.pp-manager.tt,
	Thomas Gleixner, H. Peter Anvin, Srikar Dronamraju


* Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:

> Fix x86 insn decoder for hardening against invalid length
> instructions. This adds length checkings for each byte-read
> site and if it exceeds MAX_INSN_SIZE, returns immediately.
> This can happen when decoding user-space binary.
> 
> Caller can check whether it happened by checking insn.*.got
> member is set or not.

Thanks.

We really need the /dev/urandom based build-time tester as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-10  6:08         ` Ingo Molnar
@ 2011-10-10  9:39           ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2011-10-10  9:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Stephane Eranian, linux-kernel, acme, ming.m.lin,
	robert.richter, ravitillo, masami.hiramatsu.pt, H. Peter Anvin

On Mon, 2011-10-10 at 08:08 +0200, Ingo Molnar wrote:
> FUD is not how we approach these things - you need to say something 
> specific. 

Its Andi, of course its FUD. I mean, he's been telling us to use IRQs
for the PMI instead of NMIs for a while now. If we were to do something
daft like that this whole feature, which he does like, would be
useless/broken.

He really can't seem to make up his mind.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-10  6:09                 ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Ingo Molnar
@ 2011-10-10 14:05                   ` Masami Hiramatsu
  2011-10-10 14:45                     ` Andi Kleen
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-10 14:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Stephane Eranian, Andi Kleen, linux-kernel, acme,
	ming.m.lin, robert.richter, ravitillo

(2011/10/10 15:09), Ingo Molnar wrote:
> 
> * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
> 
>> (2011/10/07 20:21), Peter Zijlstra wrote:
>>> On Fri, 2011-10-07 at 13:18 +0200, Peter Zijlstra wrote:
>>>>> Yeah, depending on the depth of the LBR. But then what happens, you
>>>>> decode an instruction that is not what was executed. 
>>>>
>>>> Right, and Andi's concern is that this might cause our instruction
>>>> decoder to blow up, or worse. 
>>>
>>> That is, we have a fair confidence that its capable of correctly
>>> decoding correct streams, but I understood from Masami that they haven't
>>> actually tried feeding it crap.
>>
>> OK, I'll do hardening it.
> 
> Yes, that's a good idea. Could you perhaps add it to the existing 
> build time sanity test we do for the instruction decoder, to feed it 
> a sequence of /dev/urandom input or such?

Ah, nice. Maybe we need another test binary, since current one is
just ensuring the output of objdump and decoder is same.
anyway it's not so difficult if it feeds random binaries to
ensure the decoder doesn't access bad address.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-10 14:05                   ` Masami Hiramatsu
@ 2011-10-10 14:45                     ` Andi Kleen
  2011-10-11 12:59                       ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2011-10-10 14:45 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Peter Zijlstra, Stephane Eranian, Andi Kleen,
	linux-kernel, acme, ming.m.lin, robert.richter, ravitillo

> Ah, nice. Maybe we need another test binary, since current one is
> just ensuring the output of objdump and decoder is same.
> anyway it's not so difficult if it feeds random binaries to
> ensure the decoder doesn't access bad address.

Pure /dev/urandom is not good because it cannot be ever reproduced.
Better use a PRNG with random seed from urandom, but print the seed.

In addition to random I would do fuzzing: take an existing stream
and just corrupt some bits and groups of bits. This will exercise different 
paths.  In fact fuzzing is probably better than random for most tests.

Not sure it's needed to run on every build though, just checks
now and then should be sufficient.

-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-10 14:45                     ` Andi Kleen
@ 2011-10-11 12:59                       ` Masami Hiramatsu
  2011-10-12  7:06                         ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-11 12:59 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Peter Zijlstra, Stephane Eranian, linux-kernel,
	acme, ming.m.lin, robert.richter, ravitillo

(2011/10/10 23:45), Andi Kleen wrote:
>> Ah, nice. Maybe we need another test binary, since current one is
>> just ensuring the output of objdump and decoder is same.
>> anyway it's not so difficult if it feeds random binaries to
>> ensure the decoder doesn't access bad address.
> 
> Pure /dev/urandom is not good because it cannot be ever reproduced.
> Better use a PRNG with random seed from urandom, but print the seed.

Sure,

> In addition to random I would do fuzzing: take an existing stream
> and just corrupt some bits and groups of bits. This will exercise different 
> paths.  In fact fuzzing is probably better than random for most tests.
> 
> Not sure it's needed to run on every build though, just checks
> now and then should be sufficient.

I made a test for that, and I didn't hit any random
bytes which can not be decoded after I've tested 100,000,000.
(with previous hardening patch)

since the number of the combination is 2^128 (MAX_INSN_SIZE is 16),
too huge to check all of them. Of course, the real number of
possible combination should be smaller than that.

- Because this decoder don't evaluate but just decode,
 we can skip immediates and operands.
- If we hit the non-oprand opcode, mod/rm, sib, or displacement
 in the middle of byte stream, we can stop trying decode tail bytes.

So I think there is smarter way to cover all possible combination
for the decoder than feeding random bytes.

Anyway, the decoder code can evolve in the future, I think it
should be put into the linux kernel.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-11 12:59                       ` Masami Hiramatsu
@ 2011-10-12  7:06                         ` Ingo Molnar
  2011-10-13 10:54                           ` Masami Hiramatsu
  2011-10-13 11:01                           ` [RFC PATCH] x86: Add a sanity test of x86 decoder Masami Hiramatsu
  0 siblings, 2 replies; 65+ messages in thread
From: Ingo Molnar @ 2011-10-12  7:06 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andi Kleen, Peter Zijlstra, Stephane Eranian, linux-kernel, acme,
	ming.m.lin, robert.richter, ravitillo


* Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:

> (2011/10/10 23:45), Andi Kleen wrote:
> >> Ah, nice. Maybe we need another test binary, since current one is
> >> just ensuring the output of objdump and decoder is same.
> >> anyway it's not so difficult if it feeds random binaries to
> >> ensure the decoder doesn't access bad address.
> > 
> > Pure /dev/urandom is not good because it cannot be ever reproduced.
> > Better use a PRNG with random seed from urandom, but print the seed.
> 
> Sure,

Obviously the urandom data should be kbuild generated and only 
cleared on 'make clean' - not regenerated on every run. So if there's 
a failure the failing urandom data stays around readily.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 07/12] perf_events: add LBR software filter support for Intel X86
  2011-10-12  7:06                         ` Ingo Molnar
@ 2011-10-13 10:54                           ` Masami Hiramatsu
  2011-10-13 11:01                           ` [RFC PATCH] x86: Add a sanity test of x86 decoder Masami Hiramatsu
  1 sibling, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-13 10:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Peter Zijlstra, Stephane Eranian, linux-kernel, acme,
	ming.m.lin, robert.richter, ravitillo

(2011/10/12 16:06), Ingo Molnar wrote:
> 
> * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
> 
>> (2011/10/10 23:45), Andi Kleen wrote:
>>>> Ah, nice. Maybe we need another test binary, since current one is
>>>> just ensuring the output of objdump and decoder is same.
>>>> anyway it's not so difficult if it feeds random binaries to
>>>> ensure the decoder doesn't access bad address.
>>>
>>> Pure /dev/urandom is not good because it cannot be ever reproduced.
>>> Better use a PRNG with random seed from urandom, but print the seed.
>>
>> Sure,
> 
> Obviously the urandom data should be kbuild generated and only 
> cleared on 'make clean' - not regenerated on every run. So if there's 
> a failure the failing urandom data stays around readily.

I've added options for passing a combination of seed and seqno
or instruction sequence directly. So if someone find an error
message, they can easily reproduce it or report it.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH] x86: Add a sanity test of x86 decoder
  2011-10-12  7:06                         ` Ingo Molnar
  2011-10-13 10:54                           ` Masami Hiramatsu
@ 2011-10-13 11:01                           ` Masami Hiramatsu
  2011-10-18  6:54                             ` Ingo Molnar
  1 sibling, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-13 11:01 UTC (permalink / raw)
  To: Ingo Molnar, Andi Kleen, Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, robert.richter, ravitillo,
	yrl.pp-manager.tt, Masami Hiramatsu, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Andi Kleen, Peter Zijlstra,
	Stephane Eranian

Add a sanity test of x86 insn decoder against the random
input. This test is also able to reproduce the bug by
passing random-seed and iteration-number, or by passing
input while which has invalid byte code.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
---

 arch/x86/tools/Makefile      |   10 +-
 arch/x86/tools/insn_sanity.c |  253 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 262 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/tools/insn_sanity.c

diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile
index f820826..3255c3d 100644
--- a/arch/x86/tools/Makefile
+++ b/arch/x86/tools/Makefile
@@ -18,14 +18,22 @@ chkobjdump = $(srctree)/arch/x86/tools/chkobjdump.awk
 quiet_cmd_posttest = TEST    $@
       cmd_posttest = ($(OBJDUMP) -v | $(AWK) -f $(chkobjdump)) || $(OBJDUMP) -d -j .text $(objtree)/vmlinux | $(AWK) -f $(distill_awk) | $(obj)/test_get_len $(posttest_64bit) $(posttest_verbose)
 
-posttest: $(obj)/test_get_len vmlinux
+quiet_cmd_sanitytest = TEST    $@
+      cmd_sanitytest = $(obj)/insn_sanity $(posttest_64bit) -m 1000000
+
+posttest: $(obj)/test_get_len vmlinux $(obj)/insn_sanity
 	$(call cmd,posttest)
+	$(call cmd,sanitytest)
 
 hostprogs-y	:= test_get_len
+hostprogs-y	:= insn_sanity
 
 # -I needed for generated C source and C source which in the kernel tree.
 HOSTCFLAGS_test_get_len.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ -I$(srctree)/include/
 
+HOSTCFLAGS_insn_sanity.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ -I$(srctree)/include/
+
 # Dependencies are also needed.
 $(obj)/test_get_len.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
 
+$(obj)/insn_sanity.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
diff --git a/arch/x86/tools/insn_sanity.c b/arch/x86/tools/insn_sanity.c
new file mode 100644
index 0000000..9ba74fe
--- /dev/null
+++ b/arch/x86/tools/insn_sanity.c
@@ -0,0 +1,253 @@
+/*
+ * x86 decoder sanity test - based on test_get_insn.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ * Copyright (C) Hitachi, Ltd., 2011
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#define unlikely(cond) (cond)
+
+#include <asm/insn.h>
+#include <inat.c>
+#include <insn.c>
+
+/*
+ * Test of instruction analysis against tampering.
+ * Feed random binary to instruction decoder and ensure not to
+ * access out-of-instruction-buffer.
+ */
+
+#define DEFAULT_MAX_ITER	10000
+#define INSN_NOP 0x90
+
+static const char *prog;
+static int verbose;
+static int x86_64;
+static unsigned int seed;
+static unsigned long nr_iter;
+static unsigned long max_iter = DEFAULT_MAX_ITER;
+static char insn_buf[MAX_INSN_SIZE * 2];
+static FILE *input_file;
+
+static void usage(const char *err)
+{
+	if (err)
+		fprintf(stderr, "Error: %s\n\n", err);
+	fprintf(stderr, "Usage: %s [-y|-n|-v] [-s seed[,no]] [-m max] [-i input]\n", prog);
+	fprintf(stderr, "\t-y	64bit mode\n");
+	fprintf(stderr, "\t-n	32bit mode\n");
+	fprintf(stderr, "\t-v	Verbose mode\n");
+	fprintf(stderr, "\t-s	Give a random seed (and iter number)\n");
+	fprintf(stderr, "\t-m	Give a max iteration number\n");
+	fprintf(stderr, "\t-i	Give an input file with decoded binary\n");
+	exit(1);
+}
+
+static void dump_field(FILE *fp, const char *name, const char *indent,
+		       struct insn_field *field)
+{
+	fprintf(fp, "%s.%s = {\n", indent, name);
+	fprintf(fp, "%s\t.value = %d, bytes[] = {%x, %x, %x, %x},\n",
+		indent, field->value, field->bytes[0], field->bytes[1],
+		field->bytes[2], field->bytes[3]);
+	fprintf(fp, "%s\t.got = %d, .nbytes = %d},\n", indent,
+		field->got, field->nbytes);
+}
+
+static void dump_insn(FILE *fp, struct insn *insn)
+{
+	fprintf(fp, "Instruction = {\n");
+	dump_field(fp, "prefixes", "\t",	&insn->prefixes);
+	dump_field(fp, "rex_prefix", "\t",	&insn->rex_prefix);
+	dump_field(fp, "vex_prefix", "\t",	&insn->vex_prefix);
+	dump_field(fp, "opcode", "\t",		&insn->opcode);
+	dump_field(fp, "modrm", "\t",		&insn->modrm);
+	dump_field(fp, "sib", "\t",		&insn->sib);
+	dump_field(fp, "displacement", "\t",	&insn->displacement);
+	dump_field(fp, "immediate1", "\t",	&insn->immediate1);
+	dump_field(fp, "immediate2", "\t",	&insn->immediate2);
+	fprintf(fp, "\t.attr = %x, .opnd_bytes = %d, .addr_bytes = %d,\n",
+		insn->attr, insn->opnd_bytes, insn->addr_bytes);
+	fprintf(fp, "\t.length = %d, .x86_64 = %d, .kaddr = %p}\n",
+		insn->length, insn->x86_64, insn->kaddr);
+}
+
+static void dump_stream(FILE *fp, const char *msg, struct insn *insn)
+{
+	int i;
+	fprintf(fp, "%s:\n code:", msg);
+	for (i = 0; i < MAX_INSN_SIZE; i++)
+		fprintf(fp, " %02x", insn_buf[i]);
+	fprintf(fp, "\n");
+	dump_insn(stderr, insn);
+	fprintf(fp, "\n This can be reproduced with \"-s 0x%x,%lu\" option\n",
+		seed, nr_iter);
+}
+
+static void get_seed(void)
+{
+	int fd;
+
+	fd = open("/dev/urandom", O_RDONLY);
+	if (!fd)
+		usage("Failed to open /dev/urandom");
+	if (read(fd, &seed, sizeof(seed)) != sizeof(seed))
+		usage("Failed to read /dev/urandom");
+	close(fd);
+}
+
+static int get_next_insn(void)
+{
+	char buf[256]  = "", *tmp;
+	int i;
+
+	tmp = fgets(buf, 256, input_file);
+	if (tmp == NULL || feof(input_file))
+		return 0;
+
+	for (i = 0; i < MAX_INSN_SIZE; i++) {
+		((unsigned char *)insn_buf)[i] = (unsigned char)strtoul(tmp, &tmp, 16);
+		if (*tmp != ' ')
+			break;
+	}
+	return ++nr_iter;
+}
+
+/* generate a random instruction binary */
+static int generate_random_insn(void)
+{
+	int i;
+	if (nr_iter >= max_iter)
+		return 0;
+
+	if (input_file)
+		return get_next_insn();
+
+	/* Fills buffer with random binary up to MAX_INSN_SIZE */
+	if (nr_iter == 0)
+		srand(seed);
+	for (i = 0; i < MAX_INSN_SIZE; i += 2)
+		*(unsigned short *)(&insn_buf[i]) = random() & 0xffff;
+	return ++nr_iter;
+}
+
+static void init_test(int argc, char **argv)
+{
+	int c;
+	char *tmp = NULL;
+	unsigned long iter = 0;
+	int set_seed = 0;
+
+	prog = argv[0];
+	while ((c = getopt(argc, argv, "ynvs:m:i:")) != -1) {
+		switch (c) {
+		case 'y':
+			x86_64 = 1;
+			break;
+		case 'n':
+			x86_64 = 0;
+			break;
+		case 'v':
+			verbose = 1;
+			break;
+		case 'i':
+			if (strcmp("-", optarg) == 0)
+				input_file = stdin;
+			else
+				input_file = fopen(optarg, "r");
+			if (!input_file)
+				usage("Failed to open input file");
+			break;
+		case 's':
+			seed = (unsigned int)strtoul(optarg, &tmp, 0);
+			if (*tmp == ',') {
+				optarg = tmp + 1;
+				iter = strtoul(optarg, &tmp, 0);
+			}
+			if (*tmp != '\0' || tmp == optarg)
+				usage("Failed to parse seed");
+			set_seed = 1;
+			break;
+		case 'm':
+			max_iter = strtoul(optarg, &tmp, 0);
+			if (*tmp != '\0' || tmp == optarg)
+				usage("Failed to parse max_iter");
+			break;
+		default:
+			usage(NULL);
+		}
+	}
+
+	if (max_iter < iter)
+		usage("Max iteration number must be bigger than iter-num");
+
+	if (!set_seed && !input_file)	/* Don't get a seed */
+		get_seed();
+	if (set_seed && input_file)
+		usage("Don't use input file (-i) with random seed (-s)");
+
+	/* Prepare stop bytes with NOPs */
+	memset(insn_buf + MAX_INSN_SIZE, INSN_NOP, MAX_INSN_SIZE);
+	/* Skip first nth states */
+	for (; iter > 0; iter--)
+		if (!generate_random_insn())
+			usage("Failed to generate random insns\n");
+}
+
+#define insn_complete(insn)	\
+	(insn.opcode.got && insn.modrm.got && insn.sib.got && \
+	 insn.displacement.got && insn.immediate.got)
+
+int main(int argc, char **argv)
+{
+	struct insn insn;
+	int insns = 0;
+	int warnings = 0;
+
+	init_test(argc, argv);
+
+	while (generate_random_insn()) {
+		/* Decode an instruction */
+		insn_init(&insn, insn_buf, x86_64);
+		insn_get_length(&insn);
+
+		if (verbose && !insn_complete(insn))
+			dump_stream(stdout, "Undecodable input", &insn);
+
+		if (insn.next_byte <= insn.kaddr ||
+		    insn.kaddr + MAX_INSN_SIZE < insn.next_byte) {
+			/* Access out-of-range memory */
+			dump_stream(stdout, "Find access violation", &insn);
+			warnings++;
+		}
+		insns++;
+	}
+	if (warnings)
+		fprintf(stdout, "Warning: decoded and checked %d insns with %d warnings (seed:0x%x)\n", insns, warnings, seed);
+	else
+		fprintf(stdout, "Succeed: decoded and checked %d insns (seed:0x%x)\n", insns, seed);
+	return 0;
+}


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH] x86: Add a sanity test of x86 decoder
  2011-10-13 11:01                           ` [RFC PATCH] x86: Add a sanity test of x86 decoder Masami Hiramatsu
@ 2011-10-18  6:54                             ` Ingo Molnar
  2011-10-19  4:29                               ` Masami Hiramatsu
  0 siblings, 1 reply; 65+ messages in thread
From: Ingo Molnar @ 2011-10-18  6:54 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andi Kleen, Peter Zijlstra, linux-kernel, acme, ming.m.lin,
	robert.richter, ravitillo, yrl.pp-manager.tt, Thomas Gleixner,
	H. Peter Anvin, Stephane Eranian


* Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:

> Add a sanity test of x86 insn decoder against the random
> input. This test is also able to reproduce the bug by
> passing random-seed and iteration-number, or by passing
> input while which has invalid byte code.

Looks good in general.

a few nitpicking details:

> -posttest: $(obj)/test_get_len vmlinux
> +quiet_cmd_sanitytest = TEST    $@
> +      cmd_sanitytest = $(obj)/insn_sanity $(posttest_64bit) -m 1000000

Just curious, what's the execution time for this? I'd expect 
milliseconds, but there will also be urandom overhead.

> +#define unlikely(cond) (cond)
> +
> +#include <asm/insn.h>
> +#include <inat.c>
> +#include <insn.c>
> +
> +/*
> + * Test of instruction analysis against tampering.
> + * Feed random binary to instruction decoder and ensure not to
> + * access out-of-instruction-buffer.
> + */
> +
> +#define DEFAULT_MAX_ITER	10000
> +#define INSN_NOP 0x90
> +
> +static const char *prog;
> +static int verbose;
> +static int x86_64;
> +static unsigned int seed;
> +static unsigned long nr_iter;
> +static unsigned long max_iter = DEFAULT_MAX_ITER;
> +static char insn_buf[MAX_INSN_SIZE * 2];
> +static FILE *input_file;

This needs more comments and a bit more vertical structure.

> +static void dump_stream(FILE *fp, const char *msg, struct insn *insn)
> +{
> +	int i;
> +	fprintf(fp, "%s:\n code:", msg);

missing newline.

> +static int get_next_insn(void)
> +{
> +	char buf[256]  = "", *tmp;
> +	int i;
> +
> +	tmp = fgets(buf, 256, input_file);

ARRAY_SIZE().

> +	if (tmp == NULL || feof(input_file))
> +		return 0;
> +
> +	for (i = 0; i < MAX_INSN_SIZE; i++) {
> +		((unsigned char *)insn_buf)[i] = (unsigned char)strtoul(tmp, &tmp, 16);

why is this cast needed? Shouldnt insn_buf[] have this type if this 
is how it's used?

> +{
> +	int i;
> +	if (nr_iter >= max_iter)

missing newline.

> +	for (i = 0; i < MAX_INSN_SIZE; i += 2)
> +		*(unsigned short *)(&insn_buf[i]) = random() & 0xffff;

this silently assumes that MAX_INSN_SIZE is a multiple of 2. Which is 
true ... for now.

> +#define insn_complete(insn)	\
> +	(insn.opcode.got && insn.modrm.got && insn.sib.got && \
> +	 insn.displacement.got && insn.immediate.got)

This could move into insn.h (even though only user-space uses it), 
right?

> +	while (generate_random_insn()) {


this loop is really weird: half of it is hidden in 
generate_random_insn()'s use of nr_iter global variable!

Why not just do it in the old-fashioned way:

	for (i = 0; i < max_iter; i++) {
		...
	}

and keep generate_random_insn() loop-invariant?

> +		if (insn.next_byte <= insn.kaddr ||
> +		    insn.kaddr + MAX_INSN_SIZE < insn.next_byte) {
> +			/* Access out-of-range memory */
> +			dump_stream(stdout, "Find access violation", &insn);
> +			warnings++;

s/Find/Found ?

> +	if (warnings)
> +		fprintf(stdout, "Warning: decoded and checked %d insns with %d warnings (seed:0x%x)\n", insns, warnings, seed);
> +	else
> +		fprintf(stdout, "Succeed: decoded and checked %d insns (seed:0x%x)\n", insns, seed);

s/Succeed/Success ?

Also, s/insns/random instructions - there's rarely any good reason to 
abbreviate in human readable output.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH] x86: Add a sanity test of x86 decoder
  2011-10-18  6:54                             ` Ingo Molnar
@ 2011-10-19  4:29                               ` Masami Hiramatsu
  2011-10-19  6:44                                 ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-19  4:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Peter Zijlstra, linux-kernel, acme, ming.m.lin,
	robert.richter, ravitillo, yrl.pp-manager.tt, Thomas Gleixner,
	H. Peter Anvin, Stephane Eranian

(2011/10/18 15:54), Ingo Molnar wrote:
> 
> * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:
> 
>> Add a sanity test of x86 insn decoder against the random
>> input. This test is also able to reproduce the bug by
>> passing random-seed and iteration-number, or by passing
>> input while which has invalid byte code.
> 
> Looks good in general.
> 
> a few nitpicking details:

Thank you for the comments :)

> 
>> -posttest: $(obj)/test_get_len vmlinux
>> +quiet_cmd_sanitytest = TEST    $@
>> +      cmd_sanitytest = $(obj)/insn_sanity $(posttest_64bit) -m 1000000
> 
> Just curious, what's the execution time for this? I'd expect 
> milliseconds, but there will also be urandom overhead.

I measured it with time command.
---
Succeed: decoded and checked 1000000 insns (seed:0x73f2b3bb)

real    0m0.152s
user    0m0.149s
sys     0m0.002s
---

Here, you can see how long it takes. It actually refers /dev/urandom
just one time at start, so I guess there is no urandom overhead.

>> +#define unlikely(cond) (cond)
>> +
>> +#include <asm/insn.h>
>> +#include <inat.c>
>> +#include <insn.c>
>> +
>> +/*
>> + * Test of instruction analysis against tampering.
>> + * Feed random binary to instruction decoder and ensure not to
>> + * access out-of-instruction-buffer.
>> + */
>> +
>> +#define DEFAULT_MAX_ITER	10000
>> +#define INSN_NOP 0x90
>> +
>> +static const char *prog;
>> +static int verbose;
>> +static int x86_64;
>> +static unsigned int seed;
>> +static unsigned long nr_iter;
>> +static unsigned long max_iter = DEFAULT_MAX_ITER;
>> +static char insn_buf[MAX_INSN_SIZE * 2];
>> +static FILE *input_file;
> 
> This needs more comments and a bit more vertical structure.

OK.

>> +static void dump_stream(FILE *fp, const char *msg, struct insn *insn)
>> +{
>> +	int i;
>> +	fprintf(fp, "%s:\n code:", msg);
> 
> missing newline.

This prints a header of code sequence, so that we'll see a
message like below;

Error message:
code: 00 01 02 03 04 ...

>> +static int get_next_insn(void)
>> +{
>> +	char buf[256]  = "", *tmp;
>> +	int i;
>> +
>> +	tmp = fgets(buf, 256, input_file);
> 
> ARRAY_SIZE().

OK.

>> +	if (tmp == NULL || feof(input_file))
>> +		return 0;
>> +
>> +	for (i = 0; i < MAX_INSN_SIZE; i++) {
>> +		((unsigned char *)insn_buf)[i] = (unsigned char)strtoul(tmp, &tmp, 16);
> 
> why is this cast needed? Shouldnt insn_buf[] have this type if this 
> is how it's used?

Yes, the type of insn_buf can be changed.

> 
>> +{
>> +	int i;
>> +	if (nr_iter >= max_iter)
> 
> missing newline.
> 
>> +	for (i = 0; i < MAX_INSN_SIZE; i += 2)
>> +		*(unsigned short *)(&insn_buf[i]) = random() & 0xffff;
> 
> this silently assumes that MAX_INSN_SIZE is a multiple of 2. Which is 
> true ... for now.

Ah, right. OK, add a line to fill the last byte if needed.

>> +#define insn_complete(insn)	\
>> +	(insn.opcode.got && insn.modrm.got && insn.sib.got && \
>> +	 insn.displacement.got && insn.immediate.got)
> 
> This could move into insn.h (even though only user-space uses it), 
> right?

Right.

> 
>> +	while (generate_random_insn()) {
> 
> 
> this loop is really weird: half of it is hidden in 
> generate_random_insn()'s use of nr_iter global variable!
> 
> Why not just do it in the old-fashioned way:
> 
> 	for (i = 0; i < max_iter; i++) {
> 		...
> 	}
> 
> and keep generate_random_insn() loop-invariant?

OK, I'll change this.

> 
>> +		if (insn.next_byte <= insn.kaddr ||
>> +		    insn.kaddr + MAX_INSN_SIZE < insn.next_byte) {
>> +			/* Access out-of-range memory */
>> +			dump_stream(stdout, "Find access violation", &insn);
>> +			warnings++;
> 
> s/Find/Found ?
> 
>> +	if (warnings)
>> +		fprintf(stdout, "Warning: decoded and checked %d insns with %d warnings (seed:0x%x)\n", insns, warnings, seed);
>> +	else
>> +		fprintf(stdout, "Succeed: decoded and checked %d insns (seed:0x%x)\n", insns, seed);
> 
> s/Succeed/Success ?

Oops...

> Also, s/insns/random instructions - there's rarely any good reason to 
> abbreviate in human readable output.

Agreed.

Thank you!

> 
> Thanks,
> 
> 	Ingo


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH] x86: Add a sanity test of x86 decoder
  2011-10-19  4:29                               ` Masami Hiramatsu
@ 2011-10-19  6:44                                 ` Ingo Molnar
  2011-10-20 14:01                                   ` [RFC PATCH v2 1/2] " Masami Hiramatsu
  2011-10-20 14:01                                   ` [RFC PATCH v2 2/2] [RESEND] x86: Fix insn decoder for longer instruction Masami Hiramatsu
  0 siblings, 2 replies; 65+ messages in thread
From: Ingo Molnar @ 2011-10-19  6:44 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Andi Kleen, Peter Zijlstra, linux-kernel, acme, ming.m.lin,
	robert.richter, ravitillo, yrl.pp-manager.tt, Thomas Gleixner,
	H. Peter Anvin, Stephane Eranian


* Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> wrote:

> >> +static void dump_stream(FILE *fp, const char *msg, struct insn *insn)
> >> +{
> >> +	int i;
> >> +	fprintf(fp, "%s:\n code:", msg);
> > 
> > missing newline.
> 
> This prints a header of code sequence, so that we'll see a
> message like below;

I mean, it's missing the standard newline between local variable(s) 
and the first statement:

static void dump_stream(FILE *fp, const char *msg, struct insn *insn)
{
	int i;

	fprintf(fp, "%s:\n code:", msg);
[...]

We do this for readability.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH v2 1/2] x86: Add a sanity test of x86 decoder
  2011-10-19  6:44                                 ` Ingo Molnar
@ 2011-10-20 14:01                                   ` Masami Hiramatsu
  2011-11-18 23:16                                     ` [tip:perf/core] x86, perf: Add a build-time sanity test to the " tip-bot for Masami Hiramatsu
  2011-10-20 14:01                                   ` [RFC PATCH v2 2/2] [RESEND] x86: Fix insn decoder for longer instruction Masami Hiramatsu
  1 sibling, 1 reply; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-20 14:01 UTC (permalink / raw)
  To: Ingo Molnar, Andi Kleen, Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, robert.richter, ravitillo,
	yrl.pp-manager.tt, Masami Hiramatsu, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Andi Kleen, Peter Zijlstra,
	Stephane Eranian

Add a sanity test of x86 insn decoder against the random
input. This test is also able to reproduce the bug by
passing random-seed and iteration-number, or by passing
input while which has invalid byte code.

Changes in V2:
 - Code cleanup.
 - Show how to reproduce the error by insn_sanity test.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
---

 arch/x86/include/asm/insn.h  |    7 +
 arch/x86/tools/Makefile      |   10 +-
 arch/x86/tools/insn_sanity.c |  275 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/tools/insn_sanity.c

diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
index 88c765e..74df3f1 100644
--- a/arch/x86/include/asm/insn.h
+++ b/arch/x86/include/asm/insn.h
@@ -137,6 +137,13 @@ static inline int insn_is_avx(struct insn *insn)
 	return (insn->vex_prefix.value != 0);
 }
 
+/* Ensure this instruction is decoded completely */
+static inline int insn_complete(struct insn *insn)
+{
+	return insn->opcode.got && insn->modrm.got && insn->sib.got &&
+		insn->displacement.got && insn->immediate.got;
+}
+
 static inline insn_byte_t insn_vex_m_bits(struct insn *insn)
 {
 	if (insn->vex_prefix.nbytes == 2)	/* 2 bytes VEX */
diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile
index f820826..3255c3d 100644
--- a/arch/x86/tools/Makefile
+++ b/arch/x86/tools/Makefile
@@ -18,14 +18,22 @@ chkobjdump = $(srctree)/arch/x86/tools/chkobjdump.awk
 quiet_cmd_posttest = TEST    $@
       cmd_posttest = ($(OBJDUMP) -v | $(AWK) -f $(chkobjdump)) || $(OBJDUMP) -d -j .text $(objtree)/vmlinux | $(AWK) -f $(distill_awk) | $(obj)/test_get_len $(posttest_64bit) $(posttest_verbose)
 
-posttest: $(obj)/test_get_len vmlinux
+quiet_cmd_sanitytest = TEST    $@
+      cmd_sanitytest = $(obj)/insn_sanity $(posttest_64bit) -m 1000000
+
+posttest: $(obj)/test_get_len vmlinux $(obj)/insn_sanity
 	$(call cmd,posttest)
+	$(call cmd,sanitytest)
 
 hostprogs-y	:= test_get_len
+hostprogs-y	:= insn_sanity
 
 # -I needed for generated C source and C source which in the kernel tree.
 HOSTCFLAGS_test_get_len.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ -I$(srctree)/include/
 
+HOSTCFLAGS_insn_sanity.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ -I$(srctree)/include/
+
 # Dependencies are also needed.
 $(obj)/test_get_len.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
 
+$(obj)/insn_sanity.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
diff --git a/arch/x86/tools/insn_sanity.c b/arch/x86/tools/insn_sanity.c
new file mode 100644
index 0000000..478b35a
--- /dev/null
+++ b/arch/x86/tools/insn_sanity.c
@@ -0,0 +1,275 @@
+/*
+ * x86 decoder sanity test - based on test_get_insn.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ * Copyright (C) Hitachi, Ltd., 2011
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#define unlikely(cond) (cond)
+#define ARRAY_SIZE(a)	(sizeof(a)/sizeof(a[0]))
+
+#include <asm/insn.h>
+#include <inat.c>
+#include <insn.c>
+
+/*
+ * Test of instruction analysis against tampering.
+ * Feed random binary to instruction decoder and ensure not to
+ * access out-of-instruction-buffer.
+ */
+
+#define DEFAULT_MAX_ITER	10000
+#define INSN_NOP 0x90
+
+static const char 	*prog;		/* Program name */
+static int		verbose;	/* Verbosity */
+static int		x86_64;		/* x86-64 bit mode flag */
+static unsigned int	seed;		/* Random seed */
+static unsigned long	iter_start;	/* Start of iteration number */
+static unsigned long	iter_end = DEFAULT_MAX_ITER;	/* End of iteration number */
+static FILE		*input_file;	/* Input file name */
+
+static void usage(const char *err)
+{
+	if (err)
+		fprintf(stderr, "Error: %s\n\n", err);
+	fprintf(stderr, "Usage: %s [-y|-n|-v] [-s seed[,no]] [-m max] [-i input]\n", prog);
+	fprintf(stderr, "\t-y	64bit mode\n");
+	fprintf(stderr, "\t-n	32bit mode\n");
+	fprintf(stderr, "\t-v	Verbose mode\n");
+	fprintf(stderr, "\t-s	Give a random seed (and iteration number)\n");
+	fprintf(stderr, "\t-m	Give a maximum iteration number\n");
+	fprintf(stderr, "\t-i	Give an input file with decoded binary\n");
+	exit(1);
+}
+
+static void dump_field(FILE *fp, const char *name, const char *indent,
+		       struct insn_field *field)
+{
+	fprintf(fp, "%s.%s = {\n", indent, name);
+	fprintf(fp, "%s\t.value = %d, bytes[] = {%x, %x, %x, %x},\n",
+		indent, field->value, field->bytes[0], field->bytes[1],
+		field->bytes[2], field->bytes[3]);
+	fprintf(fp, "%s\t.got = %d, .nbytes = %d},\n", indent,
+		field->got, field->nbytes);
+}
+
+static void dump_insn(FILE *fp, struct insn *insn)
+{
+	fprintf(fp, "Instruction = {\n");
+	dump_field(fp, "prefixes", "\t",	&insn->prefixes);
+	dump_field(fp, "rex_prefix", "\t",	&insn->rex_prefix);
+	dump_field(fp, "vex_prefix", "\t",	&insn->vex_prefix);
+	dump_field(fp, "opcode", "\t",		&insn->opcode);
+	dump_field(fp, "modrm", "\t",		&insn->modrm);
+	dump_field(fp, "sib", "\t",		&insn->sib);
+	dump_field(fp, "displacement", "\t",	&insn->displacement);
+	dump_field(fp, "immediate1", "\t",	&insn->immediate1);
+	dump_field(fp, "immediate2", "\t",	&insn->immediate2);
+	fprintf(fp, "\t.attr = %x, .opnd_bytes = %d, .addr_bytes = %d,\n",
+		insn->attr, insn->opnd_bytes, insn->addr_bytes);
+	fprintf(fp, "\t.length = %d, .x86_64 = %d, .kaddr = %p}\n",
+		insn->length, insn->x86_64, insn->kaddr);
+}
+
+static void dump_stream(FILE *fp, const char *msg, unsigned long nr_iter,
+			unsigned char *insn_buf, struct insn *insn)
+{
+	int i;
+
+	fprintf(fp, "%s:\n", msg);
+
+	dump_insn(stderr, insn);
+
+	fprintf(fp, "You can reproduce this with below command(s);\n");
+
+	/* Input a decoded instruction sequence directly */
+	fprintf(fp, " $ echo ");
+	for (i = 0; i < MAX_INSN_SIZE; i++)
+		fprintf(fp, " %02x", insn_buf[i]);
+	fprintf(fp, " | %s -i -\n", prog);
+
+	if (!input_file) {
+		fprintf(fp, "Or \n");
+		/* Give a seed and iteration number */
+		fprintf(fp, " $ %s -s 0x%x,%lu\n", prog, seed, nr_iter);
+	}
+}
+
+static void init_random_seed(void)
+{
+	int fd;
+
+	fd = open("/dev/urandom", O_RDONLY);
+	if (fd < 0)
+		goto fail;
+
+	if (read(fd, &seed, sizeof(seed)) != sizeof(seed))
+		goto fail;
+
+	close(fd);
+	return;
+fail:
+	usage("Failed to open /dev/urandom");
+}
+
+/* Read given instruction sequence from the input file */
+static int read_next_insn(unsigned char *insn_buf)
+{
+	char buf[256]  = "", *tmp;
+	int i;
+
+	tmp = fgets(buf, ARRAY_SIZE(buf), input_file);
+	if (tmp == NULL || feof(input_file))
+		return 0;
+
+	for (i = 0; i < MAX_INSN_SIZE; i++) {
+		insn_buf[i] = (unsigned char)strtoul(tmp, &tmp, 16);
+		if (*tmp != ' ')
+			break;
+	}
+
+	return i;
+}
+
+static int generate_insn(unsigned char *insn_buf)
+{
+	int i;
+
+	if (input_file)
+		return read_next_insn(insn_buf);
+
+	/* Fills buffer with random binary up to MAX_INSN_SIZE */
+	for (i = 0; i < MAX_INSN_SIZE - 1; i += 2)
+		*(unsigned short *)(&insn_buf[i]) = random() & 0xffff;
+
+	while (i < MAX_INSN_SIZE)
+		insn_buf[i++] = random() & 0xff;
+
+	return i;
+}
+
+static void parse_args(int argc, char **argv)
+{
+	int c;
+	char *tmp = NULL;
+	int set_seed = 0;
+
+	prog = argv[0];
+	while ((c = getopt(argc, argv, "ynvs:m:i:")) != -1) {
+		switch (c) {
+		case 'y':
+			x86_64 = 1;
+			break;
+		case 'n':
+			x86_64 = 0;
+			break;
+		case 'v':
+			verbose = 1;
+			break;
+		case 'i':
+			if (strcmp("-", optarg) == 0)
+				input_file = stdin;
+			else
+				input_file = fopen(optarg, "r");
+			if (!input_file)
+				usage("Failed to open input file");
+			break;
+		case 's':
+			seed = (unsigned int)strtoul(optarg, &tmp, 0);
+			if (*tmp == ',') {
+				optarg = tmp + 1;
+				iter_start = strtoul(optarg, &tmp, 0);
+			}
+			if (*tmp != '\0' || tmp == optarg)
+				usage("Failed to parse seed");
+			set_seed = 1;
+			break;
+		case 'm':
+			iter_end = strtoul(optarg, &tmp, 0);
+			if (*tmp != '\0' || tmp == optarg)
+				usage("Failed to parse max_iter");
+			break;
+		default:
+			usage(NULL);
+		}
+	}
+
+	/* Check errors */
+	if (iter_end < iter_start)
+		usage("Max iteration number must be bigger than iter-num");
+
+	if (set_seed && input_file)
+		usage("Don't use input file (-i) with random seed (-s)");
+
+	/* Initialize random seed */
+	if (!input_file) {
+		if (!set_seed)	/* No seed is given */
+			init_random_seed();
+		srand(seed);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct insn insn;
+	int insns = 0;
+	int errors = 0;
+	unsigned long i;
+	unsigned char insn_buf[MAX_INSN_SIZE * 2];
+
+	parse_args(argc, argv);
+
+	/* Prepare stop bytes with NOPs */
+	memset(insn_buf + MAX_INSN_SIZE, INSN_NOP, MAX_INSN_SIZE);
+
+	for (i = 0; i < iter_end; i++) {
+		if (generate_insn(insn_buf) <= 0)
+			break;
+
+		if (i < iter_start)	/* Skip to given iteration number */
+			continue;
+
+		/* Decode an instruction */
+		insn_init(&insn, insn_buf, x86_64);
+		insn_get_length(&insn);
+
+		if (verbose && !insn_complete(&insn))
+			dump_stream(stdout, "Info: Found an undecodable input", i, insn_buf, &insn);
+
+		if (insn.next_byte <= insn.kaddr ||
+		    insn.kaddr + MAX_INSN_SIZE < insn.next_byte) {
+			/* Access out-of-range memory */
+			dump_stream(stdout, "Error: Found an access violation", i, insn_buf, &insn);
+			errors++;
+		}
+		insns++;
+	}
+
+	fprintf(stdout, "%s: decoded and checked %d %s instructions with %d errors (seed:0x%x)\n", (errors) ? "Failure" : "Success", insns, (input_file) ? "given" : "random", errors, seed);
+
+	return errors ? 1 : 0;
+}


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [RFC PATCH v2 2/2] [RESEND] x86: Fix insn decoder for longer instruction
  2011-10-19  6:44                                 ` Ingo Molnar
  2011-10-20 14:01                                   ` [RFC PATCH v2 1/2] " Masami Hiramatsu
@ 2011-10-20 14:01                                   ` Masami Hiramatsu
  1 sibling, 0 replies; 65+ messages in thread
From: Masami Hiramatsu @ 2011-10-20 14:01 UTC (permalink / raw)
  To: Ingo Molnar, Andi Kleen, Peter Zijlstra
  Cc: linux-kernel, mingo, acme, ming.m.lin, robert.richter, ravitillo,
	yrl.pp-manager.tt, Masami Hiramatsu, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Peter Zijlstra, Stephane Eranian,
	Andi Kleen, Srikar Dronamraju

Fix x86 insn decoder for hardening against invalid length
instructions. This adds length checkings for each byte-read
site and if it exceeds MAX_INSN_SIZE, returns immediately.
This can happen when decoding unreliable binary.

 For example, without this patch,

$ echo  66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 |\
 ./insn_sanity -i -
Error: Found an access violation:
Instruction = {
        .prefixes = {
                .value = 1711276134, bytes[] = {66, 0, 0, 66},
                .got = 1, .nbytes = 16},
        .rex_prefix = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .vex_prefix = {
                .value = 0, bytes[] = {0, 0, 0, 0},
                .got = 1, .nbytes = 0},
        .opcode = {
                .value = 144, bytes[] = {90, 0, 0, 0},
                .got = 1, .nbytes = 1},
...
        .attr = 0, .opnd_bytes = 2, .addr_bytes = 4,
        .length = 17, .x86_64 = 0, .kaddr = 0x7fffedc67ae0}
You can reproduce this with below command(s);
 $ echo  66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 | ./insn_sanity -i -
Failure: decoded and checked 1 given instructions with 1 errors (seed:0x0)


 With this patch;


 $ echo  66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 |\
  ./insn_sanity -i -
 Success: decoded and checked 1 given instructions with 0
 errors (seed:0x0)


Caller can check whether it happened by insn_complete().

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

 arch/x86/lib/insn.c |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/arch/x86/lib/insn.c b/arch/x86/lib/insn.c
index 9f33b98..374562e 100644
--- a/arch/x86/lib/insn.c
+++ b/arch/x86/lib/insn.c
@@ -22,14 +22,23 @@
 #include <asm/inat.h>
 #include <asm/insn.h>
 
-#define get_next(t, insn)	\
-	({t r; r = *(t*)insn->next_byte; insn->next_byte += sizeof(t); r; })
+/* Verify next sizeof(t) bytes can be on the same instruction */
+#define validate_next(t, insn, n)	\
+	((insn)->next_byte + sizeof(t) + n - (insn)->kaddr <= MAX_INSN_SIZE)
+
+#define __get_next(t, insn)	\
+	({ t r = *(t*)insn->next_byte; insn->next_byte += sizeof(t); r; })
+
+#define __peek_nbyte_next(t, insn, n)	\
+	({ t r = *(t*)((insn)->next_byte + n); r; })
 
-#define peek_next(t, insn)	\
-	({t r; r = *(t*)insn->next_byte; r; })
+#define get_next(t, insn)	\
+	({ if (unlikely(!validate_next(t, insn, 0))) goto err_out; __get_next(t, insn); })
 
 #define peek_nbyte_next(t, insn, n)	\
-	({t r; r = *(t*)((insn)->next_byte + n); r; })
+	({ if (unlikely(!validate_next(t, insn, n))) goto err_out; __peek_nbyte_next(t, insn, n); })
+
+#define peek_next(t, insn)	peek_nbyte_next(t, insn, 0)
 
 /**
  * insn_init() - initialize struct insn
@@ -158,6 +167,8 @@ vex_end:
 	insn->vex_prefix.got = 1;
 
 	prefixes->got = 1;
+
+err_out:
 	return;
 }
 
@@ -208,6 +219,9 @@ void insn_get_opcode(struct insn *insn)
 		insn->attr = 0;	/* This instruction is bad */
 end:
 	opcode->got = 1;
+
+err_out:
+	return;
 }
 
 /**
@@ -241,6 +255,9 @@ void insn_get_modrm(struct insn *insn)
 	if (insn->x86_64 && inat_is_force64(insn->attr))
 		insn->opnd_bytes = 8;
 	modrm->got = 1;
+
+err_out:
+	return;
 }
 
 
@@ -290,6 +307,9 @@ void insn_get_sib(struct insn *insn)
 		}
 	}
 	insn->sib.got = 1;
+
+err_out:
+	return;
 }
 
 
@@ -351,6 +371,9 @@ void insn_get_displacement(struct insn *insn)
 	}
 out:
 	insn->displacement.got = 1;
+
+err_out:
+	return;
 }
 
 /* Decode moffset16/32/64 */
@@ -373,6 +396,9 @@ static void __get_moffset(struct insn *insn)
 		break;
 	}
 	insn->moffset1.got = insn->moffset2.got = 1;
+
+err_out:
+	return;
 }
 
 /* Decode imm v32(Iz) */
@@ -389,6 +415,9 @@ static void __get_immv32(struct insn *insn)
 		insn->immediate.nbytes = 4;
 		break;
 	}
+
+err_out:
+	return;
 }
 
 /* Decode imm v64(Iv/Ov) */
@@ -411,6 +440,9 @@ static void __get_immv(struct insn *insn)
 		break;
 	}
 	insn->immediate1.got = insn->immediate2.got = 1;
+
+err_out:
+	return;
 }
 
 /* Decode ptr16:16/32(Ap) */
@@ -432,6 +464,9 @@ static void __get_immptr(struct insn *insn)
 	insn->immediate2.value = get_next(unsigned short, insn);
 	insn->immediate2.nbytes = 2;
 	insn->immediate1.got = insn->immediate2.got = 1;
+
+err_out:
+	return;
 }
 
 /**
@@ -496,6 +531,9 @@ void insn_get_immediate(struct insn *insn)
 	}
 done:
 	insn->immediate.got = 1;
+
+err_out:
+	return;
 }
 
 /**


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [tip:perf/core] x86, perf: Add a build-time sanity test to the x86 decoder
  2011-10-20 14:01                                   ` [RFC PATCH v2 1/2] " Masami Hiramatsu
@ 2011-11-18 23:16                                     ` tip-bot for Masami Hiramatsu
  0 siblings, 0 replies; 65+ messages in thread
From: tip-bot for Masami Hiramatsu @ 2011-11-18 23:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, andi, peterz,
	masami.hiramatsu.pt, tglx, mingo

Commit-ID:  1ec454baf1245df4fdb5dae728da3363630ce6de
Gitweb:     http://git.kernel.org/tip/1ec454baf1245df4fdb5dae728da3363630ce6de
Author:     Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
AuthorDate: Thu, 20 Oct 2011 23:01:09 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 10 Nov 2011 12:38:51 +0100

x86, perf: Add a build-time sanity test to the x86 decoder

Add a sanity test of x86 insn decoder against a stream
of randomly generated input, at build time.

This test is also able to reproduce any bug that might
trigger by allowing the passing of random-seed and
iteration-number to the test, or by passing input
which has invalid byte code.

Changes in V2:
 - Code cleanup.
 - Show how to reproduce the error by insn_sanity test.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: acme@redhat.com
Cc: ming.m.lin@intel.com
Cc: robert.richter@amd.com
Cc: ravitillo@lbl.gov
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20111020140109.20938.92572.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/insn.h  |    7 +
 arch/x86/tools/Makefile      |   10 ++-
 arch/x86/tools/insn_sanity.c |  275 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/insn.h b/arch/x86/include/asm/insn.h
index 88c765e..74df3f1 100644
--- a/arch/x86/include/asm/insn.h
+++ b/arch/x86/include/asm/insn.h
@@ -137,6 +137,13 @@ static inline int insn_is_avx(struct insn *insn)
 	return (insn->vex_prefix.value != 0);
 }
 
+/* Ensure this instruction is decoded completely */
+static inline int insn_complete(struct insn *insn)
+{
+	return insn->opcode.got && insn->modrm.got && insn->sib.got &&
+		insn->displacement.got && insn->immediate.got;
+}
+
 static inline insn_byte_t insn_vex_m_bits(struct insn *insn)
 {
 	if (insn->vex_prefix.nbytes == 2)	/* 2 bytes VEX */
diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile
index f820826..3255c3d 100644
--- a/arch/x86/tools/Makefile
+++ b/arch/x86/tools/Makefile
@@ -18,14 +18,22 @@ chkobjdump = $(srctree)/arch/x86/tools/chkobjdump.awk
 quiet_cmd_posttest = TEST    $@
       cmd_posttest = ($(OBJDUMP) -v | $(AWK) -f $(chkobjdump)) || $(OBJDUMP) -d -j .text $(objtree)/vmlinux | $(AWK) -f $(distill_awk) | $(obj)/test_get_len $(posttest_64bit) $(posttest_verbose)
 
-posttest: $(obj)/test_get_len vmlinux
+quiet_cmd_sanitytest = TEST    $@
+      cmd_sanitytest = $(obj)/insn_sanity $(posttest_64bit) -m 1000000
+
+posttest: $(obj)/test_get_len vmlinux $(obj)/insn_sanity
 	$(call cmd,posttest)
+	$(call cmd,sanitytest)
 
 hostprogs-y	:= test_get_len
+hostprogs-y	:= insn_sanity
 
 # -I needed for generated C source and C source which in the kernel tree.
 HOSTCFLAGS_test_get_len.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ -I$(srctree)/include/
 
+HOSTCFLAGS_insn_sanity.o := -Wall -I$(objtree)/arch/x86/lib/ -I$(srctree)/arch/x86/include/ -I$(srctree)/arch/x86/lib/ -I$(srctree)/include/
+
 # Dependencies are also needed.
 $(obj)/test_get_len.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
 
+$(obj)/insn_sanity.o: $(srctree)/arch/x86/lib/insn.c $(srctree)/arch/x86/lib/inat.c $(srctree)/arch/x86/include/asm/inat_types.h $(srctree)/arch/x86/include/asm/inat.h $(srctree)/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c
diff --git a/arch/x86/tools/insn_sanity.c b/arch/x86/tools/insn_sanity.c
new file mode 100644
index 0000000..334d9de
--- /dev/null
+++ b/arch/x86/tools/insn_sanity.c
@@ -0,0 +1,275 @@
+/*
+ * x86 decoder sanity test - based on test_get_insn.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2009
+ * Copyright (C) Hitachi, Ltd., 2011
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#define unlikely(cond) (cond)
+#define ARRAY_SIZE(a)	(sizeof(a)/sizeof(a[0]))
+
+#include <asm/insn.h>
+#include <inat.c>
+#include <insn.c>
+
+/*
+ * Test of instruction analysis against tampering.
+ * Feed random binary to instruction decoder and ensure not to
+ * access out-of-instruction-buffer.
+ */
+
+#define DEFAULT_MAX_ITER	10000
+#define INSN_NOP 0x90
+
+static const char	*prog;		/* Program name */
+static int		verbose;	/* Verbosity */
+static int		x86_64;		/* x86-64 bit mode flag */
+static unsigned int	seed;		/* Random seed */
+static unsigned long	iter_start;	/* Start of iteration number */
+static unsigned long	iter_end = DEFAULT_MAX_ITER;	/* End of iteration number */
+static FILE		*input_file;	/* Input file name */
+
+static void usage(const char *err)
+{
+	if (err)
+		fprintf(stderr, "Error: %s\n\n", err);
+	fprintf(stderr, "Usage: %s [-y|-n|-v] [-s seed[,no]] [-m max] [-i input]\n", prog);
+	fprintf(stderr, "\t-y	64bit mode\n");
+	fprintf(stderr, "\t-n	32bit mode\n");
+	fprintf(stderr, "\t-v	Verbose mode\n");
+	fprintf(stderr, "\t-s	Give a random seed (and iteration number)\n");
+	fprintf(stderr, "\t-m	Give a maximum iteration number\n");
+	fprintf(stderr, "\t-i	Give an input file with decoded binary\n");
+	exit(1);
+}
+
+static void dump_field(FILE *fp, const char *name, const char *indent,
+		       struct insn_field *field)
+{
+	fprintf(fp, "%s.%s = {\n", indent, name);
+	fprintf(fp, "%s\t.value = %d, bytes[] = {%x, %x, %x, %x},\n",
+		indent, field->value, field->bytes[0], field->bytes[1],
+		field->bytes[2], field->bytes[3]);
+	fprintf(fp, "%s\t.got = %d, .nbytes = %d},\n", indent,
+		field->got, field->nbytes);
+}
+
+static void dump_insn(FILE *fp, struct insn *insn)
+{
+	fprintf(fp, "Instruction = {\n");
+	dump_field(fp, "prefixes", "\t",	&insn->prefixes);
+	dump_field(fp, "rex_prefix", "\t",	&insn->rex_prefix);
+	dump_field(fp, "vex_prefix", "\t",	&insn->vex_prefix);
+	dump_field(fp, "opcode", "\t",		&insn->opcode);
+	dump_field(fp, "modrm", "\t",		&insn->modrm);
+	dump_field(fp, "sib", "\t",		&insn->sib);
+	dump_field(fp, "displacement", "\t",	&insn->displacement);
+	dump_field(fp, "immediate1", "\t",	&insn->immediate1);
+	dump_field(fp, "immediate2", "\t",	&insn->immediate2);
+	fprintf(fp, "\t.attr = %x, .opnd_bytes = %d, .addr_bytes = %d,\n",
+		insn->attr, insn->opnd_bytes, insn->addr_bytes);
+	fprintf(fp, "\t.length = %d, .x86_64 = %d, .kaddr = %p}\n",
+		insn->length, insn->x86_64, insn->kaddr);
+}
+
+static void dump_stream(FILE *fp, const char *msg, unsigned long nr_iter,
+			unsigned char *insn_buf, struct insn *insn)
+{
+	int i;
+
+	fprintf(fp, "%s:\n", msg);
+
+	dump_insn(stderr, insn);
+
+	fprintf(fp, "You can reproduce this with below command(s);\n");
+
+	/* Input a decoded instruction sequence directly */
+	fprintf(fp, " $ echo ");
+	for (i = 0; i < MAX_INSN_SIZE; i++)
+		fprintf(fp, " %02x", insn_buf[i]);
+	fprintf(fp, " | %s -i -\n", prog);
+
+	if (!input_file) {
+		fprintf(fp, "Or \n");
+		/* Give a seed and iteration number */
+		fprintf(fp, " $ %s -s 0x%x,%lu\n", prog, seed, nr_iter);
+	}
+}
+
+static void init_random_seed(void)
+{
+	int fd;
+
+	fd = open("/dev/urandom", O_RDONLY);
+	if (fd < 0)
+		goto fail;
+
+	if (read(fd, &seed, sizeof(seed)) != sizeof(seed))
+		goto fail;
+
+	close(fd);
+	return;
+fail:
+	usage("Failed to open /dev/urandom");
+}
+
+/* Read given instruction sequence from the input file */
+static int read_next_insn(unsigned char *insn_buf)
+{
+	char buf[256]  = "", *tmp;
+	int i;
+
+	tmp = fgets(buf, ARRAY_SIZE(buf), input_file);
+	if (tmp == NULL || feof(input_file))
+		return 0;
+
+	for (i = 0; i < MAX_INSN_SIZE; i++) {
+		insn_buf[i] = (unsigned char)strtoul(tmp, &tmp, 16);
+		if (*tmp != ' ')
+			break;
+	}
+
+	return i;
+}
+
+static int generate_insn(unsigned char *insn_buf)
+{
+	int i;
+
+	if (input_file)
+		return read_next_insn(insn_buf);
+
+	/* Fills buffer with random binary up to MAX_INSN_SIZE */
+	for (i = 0; i < MAX_INSN_SIZE - 1; i += 2)
+		*(unsigned short *)(&insn_buf[i]) = random() & 0xffff;
+
+	while (i < MAX_INSN_SIZE)
+		insn_buf[i++] = random() & 0xff;
+
+	return i;
+}
+
+static void parse_args(int argc, char **argv)
+{
+	int c;
+	char *tmp = NULL;
+	int set_seed = 0;
+
+	prog = argv[0];
+	while ((c = getopt(argc, argv, "ynvs:m:i:")) != -1) {
+		switch (c) {
+		case 'y':
+			x86_64 = 1;
+			break;
+		case 'n':
+			x86_64 = 0;
+			break;
+		case 'v':
+			verbose = 1;
+			break;
+		case 'i':
+			if (strcmp("-", optarg) == 0)
+				input_file = stdin;
+			else
+				input_file = fopen(optarg, "r");
+			if (!input_file)
+				usage("Failed to open input file");
+			break;
+		case 's':
+			seed = (unsigned int)strtoul(optarg, &tmp, 0);
+			if (*tmp == ',') {
+				optarg = tmp + 1;
+				iter_start = strtoul(optarg, &tmp, 0);
+			}
+			if (*tmp != '\0' || tmp == optarg)
+				usage("Failed to parse seed");
+			set_seed = 1;
+			break;
+		case 'm':
+			iter_end = strtoul(optarg, &tmp, 0);
+			if (*tmp != '\0' || tmp == optarg)
+				usage("Failed to parse max_iter");
+			break;
+		default:
+			usage(NULL);
+		}
+	}
+
+	/* Check errors */
+	if (iter_end < iter_start)
+		usage("Max iteration number must be bigger than iter-num");
+
+	if (set_seed && input_file)
+		usage("Don't use input file (-i) with random seed (-s)");
+
+	/* Initialize random seed */
+	if (!input_file) {
+		if (!set_seed)	/* No seed is given */
+			init_random_seed();
+		srand(seed);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct insn insn;
+	int insns = 0;
+	int errors = 0;
+	unsigned long i;
+	unsigned char insn_buf[MAX_INSN_SIZE * 2];
+
+	parse_args(argc, argv);
+
+	/* Prepare stop bytes with NOPs */
+	memset(insn_buf + MAX_INSN_SIZE, INSN_NOP, MAX_INSN_SIZE);
+
+	for (i = 0; i < iter_end; i++) {
+		if (generate_insn(insn_buf) <= 0)
+			break;
+
+		if (i < iter_start)	/* Skip to given iteration number */
+			continue;
+
+		/* Decode an instruction */
+		insn_init(&insn, insn_buf, x86_64);
+		insn_get_length(&insn);
+
+		if (verbose && !insn_complete(&insn))
+			dump_stream(stdout, "Info: Found an undecodable input", i, insn_buf, &insn);
+
+		if (insn.next_byte <= insn.kaddr ||
+		    insn.kaddr + MAX_INSN_SIZE < insn.next_byte) {
+			/* Access out-of-range memory */
+			dump_stream(stdout, "Error: Found an access violation", i, insn_buf, &insn);
+			errors++;
+		}
+		insns++;
+	}
+
+	fprintf(stdout, "%s: decoded and checked %d %s instructions with %d errors (seed:0x%x)\n", (errors) ? "Failure" : "Success", insns, (input_file) ? "given" : "random", errors, seed);
+
+	return errors ? 1 : 0;
+}

^ permalink raw reply related	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2011-11-18 23:18 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-06 14:49 [PATCH 00/12] perf_events: add support for sampling taken branches Stephane Eranian
2011-10-06 14:49 ` [PATCH 01/12] perf_events: add generic taken branch sampling support Stephane Eranian
2011-10-06 16:57   ` Peter Zijlstra
2011-10-07 10:28     ` Stephane Eranian
2011-10-07 10:32       ` Peter Zijlstra
2011-10-07 10:44         ` Stephane Eranian
2011-10-06 17:01   ` Peter Zijlstra
2011-10-06 14:49 ` [PATCH 02/12] perf_events: add Intel LBR MSR definitions Stephane Eranian
2011-10-06 14:49 ` [PATCH 03/12] perf_events: add Intel X86 LBR sharing logic Stephane Eranian
2011-10-06 14:49 ` [PATCH 04/12] perf_events: sync branch stack sampling with X86 precise_sampling Stephane Eranian
2011-10-06 17:25   ` Peter Zijlstra
2011-10-07 10:34     ` Stephane Eranian
2011-10-07 10:37       ` Peter Zijlstra
2011-10-06 14:49 ` [PATCH 05/12] perf_events: add LBR mappings for PERF_SAMPLE_BRANCH filters Stephane Eranian
2011-10-06 14:49 ` [PATCH 06/12] perf_events: implement PERF_SAMPLE_BRANCH for Intel X86 Stephane Eranian
2011-10-06 17:54   ` Peter Zijlstra
2011-10-06 18:05   ` Peter Zijlstra
2011-10-06 14:49 ` [PATCH 07/12] perf_events: add LBR software filter support " Stephane Eranian
2011-10-06 15:32   ` Andi Kleen
2011-10-06 16:43     ` Peter Zijlstra
2011-10-06 17:14       ` Andi Kleen
2011-10-10  6:08         ` Ingo Molnar
2011-10-10  9:39           ` Peter Zijlstra
2011-10-07  7:06       ` Masami Hiramatsu
2011-10-07 10:38     ` Stephane Eranian
2011-10-07 10:40       ` Stephane Eranian
2011-10-07 10:42         ` Peter Zijlstra
2011-10-07 10:49           ` Stephane Eranian
2011-10-07 11:18             ` Peter Zijlstra
2011-10-07 11:21             ` Peter Zijlstra
2011-10-07 11:54               ` Masami Hiramatsu
2011-10-07 13:31                 ` [PATCH] x86: Fix insn decoder for longer instruction Masami Hiramatsu
2011-10-10  7:04                   ` Ingo Molnar
2011-10-10  6:09                 ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Ingo Molnar
2011-10-10 14:05                   ` Masami Hiramatsu
2011-10-10 14:45                     ` Andi Kleen
2011-10-11 12:59                       ` Masami Hiramatsu
2011-10-12  7:06                         ` Ingo Molnar
2011-10-13 10:54                           ` Masami Hiramatsu
2011-10-13 11:01                           ` [RFC PATCH] x86: Add a sanity test of x86 decoder Masami Hiramatsu
2011-10-18  6:54                             ` Ingo Molnar
2011-10-19  4:29                               ` Masami Hiramatsu
2011-10-19  6:44                                 ` Ingo Molnar
2011-10-20 14:01                                   ` [RFC PATCH v2 1/2] " Masami Hiramatsu
2011-11-18 23:16                                     ` [tip:perf/core] x86, perf: Add a build-time sanity test to the " tip-bot for Masami Hiramatsu
2011-10-20 14:01                                   ` [RFC PATCH v2 2/2] [RESEND] x86: Fix insn decoder for longer instruction Masami Hiramatsu
2011-10-07 15:42             ` [PATCH 07/12] perf_events: add LBR software filter support for Intel X86 Andi Kleen
2011-10-07 11:25       ` Masami Hiramatsu
2011-10-07 11:40         ` Peter Zijlstra
2011-10-07 15:44           ` Andi Kleen
2011-10-07 15:09         ` Andi Kleen
2011-10-07 16:05           ` Peter Zijlstra
2011-10-06 14:49 ` [PATCH 08/12] perf_events: disable PERF_SAMPLE_BRANCH_* when not supported Stephane Eranian
2011-10-06 18:53   ` Peter Zijlstra
2011-10-06 14:49 ` [PATCH 09/12] perf_events: add hook to flush branch_stack on context switch Stephane Eranian
2011-10-06 14:49 ` [PATCH 10/12] perf: add code to support PERF_SAMPLE_BRANCH_STACK Stephane Eranian
2011-10-06 18:50   ` Peter Zijlstra
2011-10-07 10:25     ` Stephane Eranian
2011-10-07 10:27       ` Peter Zijlstra
2011-10-06 14:49 ` [PATCH 11/12] perf: add support for sampling taken branch to perf record Stephane Eranian
2011-10-06 14:49 ` [PATCH 12/12] perf: add support for taken branch sampling to perf report Stephane Eranian
2011-10-06 15:25 ` [PATCH 00/12] perf_events: add support for sampling taken branches Andi Kleen
2011-10-07 10:23   ` Stephane Eranian
2011-10-06 18:32 ` Peter Zijlstra
2011-10-06 21:41   ` Stephane Eranian

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.