linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 00/18] perf: add memory access sampling support
@ 2013-01-15 15:39 Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 01/18] perf, x86: Support CPU specific sysfs events Stephane Eranian
                   ` (18 more replies)
  0 siblings, 19 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch series adds a new feature to the kernel perf_events
interface and the corresponding user level tool, perf.

With this patch, it is possible to sample (not trace) memory
accesses (load, store). For loads, the instruction and data
addresses are captured along with the latency and data source.
For stores, the instruction and data addresses are captured
along with limited cache and TLB information.

For the load data source, the memory hierarchy level and the TLB,
snoop, and lock information are captured.

Although the perf_event interface is extended in a generic manner,
sampling memory accesses requires HW support. The current patches
implement the feature on Intel processors starting with Nehalem.
The patches leverage the PEBS Load Latency and Precise Store
mechanisms. Precise Store is present only on Sandy Bridge and
Ivy Bridge based processors.

The perf tool is extended to make capturing and analyzing the
data easier with a new command: perf mem.

$ perf mem -t load rec triad
$ perf mem -t load rep --stdio
# Samples: 19K of event 'cpu/mem-loads/pp'
# Total cost : 1013994
# Sort order : cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked
#
# Overhead      Samples     Cost  Memory access Symbol      Shared Obj  Data Symbol             Data Object Snoop  TLB access   Locked
# ........  ...........  .......  ............. ..........  ........... ......................  ........... ...... ............ ......
#
     0.10%            1      986  LFB hit       [.] triad   triad       [.] 0x00007f67dffe8038  [unknown]    None  L1 or L2 hit  No    
     0.09%            1      890  LFB hit       [.] triad   triad       [.] 0x00007f67df91a750  [unknown]    None  L1 or L2 hit  No    
     0.08%            1      826  LFB hit       [.] triad   triad       [.] 0x00007f67e288fba8  [unknown]    None  L1 or L2 hit  No    
     0.08%            1      825  LFB hit       [.] triad   triad       [.] 0x00007f67dea28c80  [unknown]    None  L1 or L2 hit  No    
     0.08%            1      787  LFB hit       [.] triad   triad       [.] 0x00007f67df055a60  [unknown]    None  L1 or L2 hit  No    
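
As a cross-check on the table above: 986 / 1013994 is roughly 0.10%, which
matches the first row, so Overhead appears to be each entry's cost as a share
of the total cost rather than a share of the sample count. A minimal sketch
of that arithmetic, under that assumption (the function name is hypothetical
and is not part of the perf tool):

```c
#include <stdint.h>

/*
 * Share of the total cost carried by one histogram entry, in percent.
 * Assumption: Overhead is weighted by cost, not by number of samples.
 */
double cost_overhead_percent(uint64_t entry_cost, uint64_t total_cost)
{
	/* avoid dividing by zero when no cost was recorded */
	if (total_cost == 0)
		return 0.0;
	return 100.0 * (double)entry_cost / (double)total_cost;
}
```

Calling cost_overhead_percent(986, 1013994) gives a value that rounds to the
0.10% shown in the first row.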

The perf mem command is a wrapper around perf record/report. It passes the
right options to the report and record commands. Note that the TUI mode is
supported.

One powerful feature of perf is that users can change the sort order to display
the information in a different format or from a different angle. This is particularly
useful with memory sampling:

$ perf mem -t load rep --sort=mem
# Samples: 19K of event 'cpu/mem-loads/pp'
# Total cost : 1013994
# Sort order : mem
#
# Overhead      Samples             Memory access
# ........  ...........  ........................
#
    85.26%        10633  LFB hit                 
     7.35%         8151  L1 hit                  
     3.13%          383  L3 hit                  
     3.09%          195  Local RAM hit           
     1.16%          259  L2 hit                  
     0.00%            4  Uncached hit            

Or if one is interested in the data view:
$ perf mem -t load rep --sort=symbol_daddr,cost
# Samples: 19K of event 'cpu/mem-loads/pp'
# Total cost : 1013994
# Sort order : symbol_daddr,cost
#
# Overhead      Samples             Data Symbol     Cost
# ........  ...........  ......................  .......
#
     0.10%            1  [.] 0x00007f67dffe8038      986
     0.09%            1  [.] 0x00007f67df91a750      890
     0.08%            1  [.] 0x00007f67e288fba8      826


One note on the cost displayed: On Intel processors with PEBS Load Latency, as described
in the SDM, the cost encompasses the number of cycles from dispatch to Globally Observable
(GO) state. That means that it includes OOO execution. It is not unusual to see L1D hits
with a cost of > 100 cycles. Always look at the memory level for an approximation of the
access penalty, then interpret the cost value accordingly.

Data symbolization is working for initialized global variables. Dynamically allocated
data and bss symbolization is currently non-functional.

There is no cost associated with stores.

In v2, we leverage some of Andi Kleen's Haswell patches, namely the weighted
samples and perf tool event parser fixes. We also introduce PERF_RECORD_MISC_MMAP_DATA
to tag mmaps for data vs. code. This helps the perf tool distinguish data vs. code
mmaps (and therefore symbols). We have also integrated the feedback from v1. Note that in
v2 data symbol resolution is not yet fully operational, but there is a slight improvement.

In v3, we rebased the patch on 3.7.0-rc6, which includes some of Namhyung's patches.

In v4, we rebased to v3.7 tip and included the fixes from Namhyung Kim for
symbolization of data addresses. Symbolization of accesses to global variables now works.

In v5, we rebased to 3.8.0-rc1. We also updated the WEIGHT patches from Andi to
fix a couple of issues, integrated the feedback from jolsa@, and
reintegrated the man page.

In v6, we rebased to 3.8.0-rc3 and fixed the issue reported by jolsa@ related
to hist_entry->mem_info maps not being marked as referenced.


Andi Kleen (2):
  perf, x86: Support CPU specific sysfs events
  perf, core: Add a concept of a weightened sample

Namhyung Kim (2):
  perf tools: Ignore ABS symbols when loading data maps
  perf tools: Fix output of symbol_daddr offset

Stephane Eranian (14):
  perf/x86: improve sysfs event mapping with event string
  perf/x86: add flags to event constraints
  perf: add minimal support for PERF_SAMPLE_WEIGHT
  perf: add support for PERF_SAMPLE_ADDR in dump_sampple()
  perf: add generic memory sampling interface
  perf/x86: add memory profiling via PEBS Load Latency
  perf/x86: export PEBS load latency threshold register to sysfs
  perf/x86: add support for PEBS Precise Store
  perf tools: add mem access sampling core support
  perf report: add support for mem access profiling
  perf record: add support for mem access profiling
  perf tools: add new mem command for memory access profiling
  perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP
  perf tools: detect data vs. text mappings

 arch/x86/include/uapi/asm/msr-index.h         |    1 +
 arch/x86/kernel/cpu/perf_event.c              |   67 +++--
 arch/x86/kernel/cpu/perf_event.h              |   62 ++++-
 arch/x86/kernel/cpu/perf_event_intel.c        |   34 ++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c     |  182 +++++++++++++-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c |    2 +-
 include/linux/perf_event.h                    |    5 +
 include/uapi/linux/perf_event.h               |   70 +++++-
 kernel/events/core.c                          |   15 ++
 tools/perf/Documentation/perf-mem.txt         |   48 ++++
 tools/perf/Makefile                           |    1 +
 tools/perf/builtin-mem.c                      |  238 ++++++++++++++++++
 tools/perf/builtin-record.c                   |    2 +
 tools/perf/builtin-report.c                   |  131 +++++++++-
 tools/perf/builtin.h                          |    1 +
 tools/perf/command-list.txt                   |    1 +
 tools/perf/perf.c                             |    1 +
 tools/perf/perf.h                             |    1 +
 tools/perf/util/event.h                       |    2 +
 tools/perf/util/evsel.c                       |   16 ++
 tools/perf/util/hist.c                        |   77 +++++-
 tools/perf/util/hist.h                        |   13 +
 tools/perf/util/machine.c                     |   10 +-
 tools/perf/util/session.c                     |   45 ++++
 tools/perf/util/session.h                     |    4 +
 tools/perf/util/sort.c                        |  325 ++++++++++++++++++++++++-
 tools/perf/util/sort.h                        |   10 +-
 tools/perf/util/symbol-elf.c                  |    3 +
 tools/perf/util/symbol.h                      |    7 +
 29 files changed, 1326 insertions(+), 48 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-mem.txt
 create mode 100644 tools/perf/builtin-mem.c

-- 
1.7.9.5



* [PATCH v6 01/18] perf, x86: Support CPU specific sysfs events
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 02/18] perf/x86: improve sysfs event mapping with event string Stephane Eranian
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

From: Andi Kleen <ak@linux.intel.com>

Add a way for the CPU initialization code to register additional events,
and merge them into the events attribute directory. Used in the next
patch.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c |   33 +++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event.h |    1 +
 2 files changed, 34 insertions(+)
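
Note, not part of the patch: the merge described in the commit message can be
tried out in userspace. The following is a minimal sketch, assuming plain
malloc() in place of kmalloc() and void pointers in place of struct attribute
pointers; the function names are hypothetical.

```c
#include <stdlib.h>

/* Count entries in a NULL-terminated pointer array (terminator excluded). */
size_t ptr_array_len(void **v)
{
	size_t n = 0;

	while (v[n])
		n++;
	return n;
}

/*
 * Return a newly allocated NULL-terminated array holding the entries of a
 * followed by the entries of b, or NULL on allocation failure.
 * The caller owns the result and releases it with free().
 */
void **merge_ptr_arrays(void **a, void **b)
{
	size_t na = ptr_array_len(a);
	size_t nb = ptr_array_len(b);
	size_t i, j = 0;
	/* one extra slot for the terminating NULL */
	void **out = malloc((na + nb + 1) * sizeof(*out));

	if (!out)
		return NULL;
	for (i = 0; i < na; i++)
		out[j++] = a[i];
	for (i = 0; i < nb; i++)
		out[j++] = b[i];
	out[j] = NULL;
	return out;
}
```

Only the pointers are copied, so the merged array shares its entries with
both inputs.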

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4428fd1..26df58b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1341,6 +1341,30 @@ static void __init filter_events(struct attribute **attrs)
 	}
 }
 
+/* Merge two pointer arrays */
+static __init struct attribute **merge_attr(struct attribute **a,
+					    struct attribute **b)
+{
+	struct attribute **new;
+	int j, i;
+
+	for (j = 0; a[j]; j++)
+		;
+	for (i = 0; b[i]; i++)
+		j++;
+	j++;
+	new = kmalloc(sizeof(struct attribute *) * j, GFP_KERNEL);
+	if (!new)
+		return NULL;
+	j = 0;
+	for (i = 0; a[i]; i++)
+		new[j++] = a[i];
+	for (i = 0; b[i]; i++)
+		new[j++] = b[i];
+	new[j] = NULL;
+	return new;
+}
+
 static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
 			  char *page)
 {
@@ -1482,6 +1506,15 @@ static int __init init_hw_perf_events(void)
 	else
 		filter_events(x86_pmu_events_group.attrs);
 
+	if (x86_pmu.cpu_events) {
+		struct attribute **tmp;
+
+		tmp = merge_attr(x86_pmu_events_group.attrs,
+				 x86_pmu.cpu_events);
+		if (!WARN_ON(!tmp))
+			x86_pmu_events_group.attrs = tmp;
+	}
+
 	pr_info("... version:                %d\n",     x86_pmu.version);
 	pr_info("... bit width:              %d\n",     x86_pmu.cntval_bits);
 	pr_info("... generic registers:      %d\n",     x86_pmu.num_counters);
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 115c1ea..4170043 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -355,6 +355,7 @@ struct x86_pmu {
 	struct attribute **format_attrs;
 
 	ssize_t		(*events_sysfs_show)(char *page, u64 config);
+	struct attribute **cpu_events;
 
 	/*
 	 * CPU Hotplug hooks
-- 
1.7.9.5



* [PATCH v6 02/18] perf/x86: improve sysfs event mapping with event string
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 01/18] perf, x86: Support CPU specific sysfs events Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 22:57   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 03/18] perf/x86: add flags to event constraints Stephane Eranian
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch extends Jiri's changes to make generic
events mapping visible via sysfs. The patch extends
the mechanism to non-generic events by allowing
the mappings to be hardcoded in strings.

This mechanism will be used by the PEBS-LL patch
later on.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c |   27 ++++++++++++---------------
 arch/x86/kernel/cpu/perf_event.h |   23 +++++++++++++++++++++++
 2 files changed, 35 insertions(+), 15 deletions(-)
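
Note, not part of the patch: the "string trumps id" rule can be sketched in
userspace as below. The struct is a simplified stand-in for
perf_pmu_events_attr, demo_event_map() is a hypothetical replacement for
x86_pmu.event_map(), and the fallback format string is illustrative only.

```c
#include <stdio.h>

/* Simplified stand-in for struct perf_pmu_events_attr. */
struct events_attr_sketch {
	unsigned long long id;
	const char *event_str;	/* NULL when the event is described by id */
};

/* Hypothetical id-to-config table; the values are made up for the demo. */
unsigned long long demo_event_map(unsigned long long id)
{
	return id == 0 ? 0x3c : 0;
}

/*
 * Format the sysfs text for one event into page and return the number of
 * characters snprintf() reports.
 */
int events_show_sketch(const struct events_attr_sketch *attr,
		       unsigned long long (*config_for_id)(unsigned long long),
		       char *page, size_t len)
{
	/* a hardcoded string takes precedence over the id lookup */
	if (attr->event_str)
		return snprintf(page, len, "%s", attr->event_str);

	return snprintf(page, len, "event=0x%02llx", config_for_id(attr->id));
}
```

An attribute with a non-NULL event_str never reaches the lookup, which is
why the string variant can leave id at 0.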

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 26df58b..5744dc9 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1316,20 +1316,22 @@ static struct attribute_group x86_pmu_format_group = {
 	.attrs = NULL,
 };
 
-struct perf_pmu_events_attr {
-	struct device_attribute attr;
-	u64 id;
-};
-
 /*
  * Remove all undefined events (x86_pmu.event_map(id) == 0)
  * out of events_attr attributes.
  */
 static void __init filter_events(struct attribute **attrs)
 {
+	struct device_attribute *d;
+	struct perf_pmu_events_attr *pmu_attr;
 	int i, j;
 
 	for (i = 0; attrs[i]; i++) {
+		d = (struct device_attribute *)attrs[i];
+		pmu_attr = container_of(d, struct perf_pmu_events_attr, attr);
+		/* str trumps id */
+		if (pmu_attr->event_str)
+			continue;
 		if (x86_pmu.event_map(i))
 			continue;
 
@@ -1370,19 +1372,14 @@ static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *at
 {
 	struct perf_pmu_events_attr *pmu_attr = \
 		container_of(attr, struct perf_pmu_events_attr, attr);
-
 	u64 config = x86_pmu.event_map(pmu_attr->id);
-	return x86_pmu.events_sysfs_show(page, config);
-}
 
-#define EVENT_VAR(_id)  event_attr_##_id
-#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
+	/* string trumps id */
+	if (pmu_attr->event_str)
+		return sprintf(page, "%s", pmu_attr->event_str);
 
-#define EVENT_ATTR(_name, _id)					\
-static struct perf_pmu_events_attr EVENT_VAR(_id) = {		\
-	.attr = __ATTR(_name, 0444, events_sysfs_show, NULL),	\
-	.id   =  PERF_COUNT_HW_##_id,				\
-};
+	return x86_pmu.events_sysfs_show(page, config);
+}
 
 EVENT_ATTR(cpu-cycles,			CPU_CYCLES		);
 EVENT_ATTR(instructions,		INSTRUCTIONS		);
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 4170043..3f4380c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -420,6 +420,29 @@ do {									\
 #define ERF_NO_HT_SHARING	1
 #define ERF_HAS_RSP_1		2
 
+#define EVENT_VAR(_id)  event_attr_##_id
+#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
+
+#define EVENT_ATTR(_name, _id)					\
+static struct perf_pmu_events_attr EVENT_VAR(_id) = {		\
+	.attr = __ATTR(_name, 0444, events_sysfs_show, NULL),	\
+	.id   =  PERF_COUNT_HW_##_id,				\
+	.event_str = NULL,					\
+};
+
+#define EVENT_ATTR_STR(_name, v, str)				  \
+static struct perf_pmu_events_attr event_attr_##v = {		  \
+	.attr      = __ATTR(_name, 0444, events_sysfs_show, NULL),\
+	.id        =  0,					  \
+	.event_str =  str,					  \
+};
+
+struct perf_pmu_events_attr {
+	struct device_attribute attr;
+	u64 id;
+	const char *event_str;
+};
+
 extern struct x86_pmu x86_pmu __read_mostly;
 
 DECLARE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
-- 
1.7.9.5



* [PATCH v6 03/18] perf/x86: add flags to event constraints
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 01/18] perf, x86: Support CPU specific sysfs events Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 02/18] perf/x86: improve sysfs event mapping with event string Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 22:59   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 04/18] perf, core: Add a concept of a weightened sample Stephane Eranian
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch adds a flags field to each event constraint.
It can be used to store event specific features which can
then later be used by scheduling code or low-level x86 code.

The flags are propagated into event->hw.flags during the
get_event_constraint() call. They are cleared during the
put_event_constraint() call.

This mechanism is going to be used by the PEBS-LL patches.
It avoids defining yet another table to hold event specific
information.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.c              |    2 +-
 arch/x86/kernel/cpu/perf_event.h              |    8 +++++---
 arch/x86/kernel/cpu/perf_event_intel.c        |    5 ++++-
 arch/x86/kernel/cpu/perf_event_intel_ds.c     |    4 +++-
 arch/x86/kernel/cpu/perf_event_intel_uncore.c |    2 +-
 include/linux/perf_event.h                    |    1 +
 6 files changed, 15 insertions(+), 7 deletions(-)
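
Note, not part of the patch: the flag propagation described in the commit
message can be sketched in userspace as below. The structs are reduced
stand-ins for event_constraint and hw_perf_event, and all names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for struct event_constraint. */
struct constraint_sketch {
	uint64_t code;
	uint64_t cmask;
	int flags;
};

/* Reduced stand-in for struct hw_perf_event. */
struct hw_event_sketch {
	uint64_t config;
	int flags;
};

/*
 * Return the first constraint whose code matches the masked config and
 * copy its flags into the event; return NULL when nothing matches.
 */
const struct constraint_sketch *
get_constraint_sketch(struct hw_event_sketch *ev,
		      const struct constraint_sketch *tbl, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if ((ev->config & tbl[i].cmask) == tbl[i].code) {
			ev->flags |= tbl[i].flags;
			return &tbl[i];
		}
	}
	return NULL;
}

/* Releasing the constraint drops the flags picked up by the get call. */
void put_constraint_sketch(struct hw_event_sketch *ev)
{
	ev->flags = 0;
}
```

Keeping the flags in the constraint table means one lookup yields both the
counter mask and the per-event features.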

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 5744dc9..35b516a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1493,7 +1493,7 @@ static int __init init_hw_perf_events(void)
 
 	unconstrained = (struct event_constraint)
 		__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
-				   0, x86_pmu.num_counters, 0);
+				   0, x86_pmu.num_counters, 0, 0);
 
 	x86_pmu.attr_rdpmc = 1; /* enable userspace RDPMC usage by default */
 	x86_pmu_format_group.attrs = x86_pmu.format_attrs;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3f4380c..3f10cfe 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -59,6 +59,7 @@ struct event_constraint {
 	u64	cmask;
 	int	weight;
 	int	overlap;
+	int	flags;
 };
 
 struct amd_nb {
@@ -170,16 +171,17 @@ struct cpu_hw_events {
 	void				*kfree_on_online;
 };
 
-#define __EVENT_CONSTRAINT(c, n, m, w, o) {\
+#define __EVENT_CONSTRAINT(c, n, m, w, o, f) {\
 	{ .idxmsk64 = (n) },		\
 	.code = (c),			\
 	.cmask = (m),			\
 	.weight = (w),			\
 	.overlap = (o),			\
+	.flags = (f),			\
 }
 
 #define EVENT_CONSTRAINT(c, n, m)	\
-	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0)
+	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0, 0)
 
 /*
  * The overlap flag marks event constraints with overlapping counter
@@ -203,7 +205,7 @@ struct cpu_hw_events {
  * and its counter masks must be kept at a minimum.
  */
 #define EVENT_CONSTRAINT_OVERLAP(c, n, m)	\
-	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1)
+	__EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1, 0)
 
 /*
  * Constraint on the Event code.
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 93b9e11..57d6527 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1367,8 +1367,10 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
 
 	if (x86_pmu.event_constraints) {
 		for_each_event_constraint(c, x86_pmu.event_constraints) {
-			if ((event->hw.config & c->cmask) == c->code)
+			if ((event->hw.config & c->cmask) == c->code) {
+				event->hw.flags |= c->flags;
 				return c;
+			}
 		}
 	}
 
@@ -1413,6 +1415,7 @@ intel_put_shared_regs_event_constraints(struct cpu_hw_events *cpuc,
 static void intel_put_event_constraints(struct cpu_hw_events *cpuc,
 					struct perf_event *event)
 {
+	event->hw.flags = 0;
 	intel_put_shared_regs_event_constraints(cpuc, event);
 }
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 826054a..f30d85b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -430,8 +430,10 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
 
 	if (x86_pmu.pebs_constraints) {
 		for_each_event_constraint(c, x86_pmu.pebs_constraints) {
-			if ((event->hw.config & c->cmask) == c->code)
+			if ((event->hw.config & c->cmask) == c->code) {
+				event->hw.flags |= c->flags;
 				return c;
+			}
 		}
 	}
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index b43200d..75da9e1 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -2438,7 +2438,7 @@ static int __init uncore_type_init(struct intel_uncore_type *type)
 
 	type->unconstrainted = (struct event_constraint)
 		__EVENT_CONSTRAINT(0, (1ULL << type->num_counters) - 1,
-				0, type->num_counters, 0);
+				0, type->num_counters, 0, 0);
 
 	for (i = 0; i < type->num_boxes; i++) {
 		pmus[i].func_id = -1;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6bfb2faa..484cfbc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -128,6 +128,7 @@ struct hw_perf_event {
 			int		event_base_rdpmc;
 			int		idx;
 			int		last_cpu;
+			int		flags;
 
 			struct hw_perf_event_extra extra_reg;
 			struct hw_perf_event_extra branch_reg;
-- 
1.7.9.5



* [PATCH v6 04/18] perf, core: Add a concept of a weightened sample
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (2 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 03/18] perf/x86: add flags to event constraints Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT Stephane Eranian
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

From: Andi Kleen <ak@linux.intel.com>

For some events it's useful to weight a sample with a hardware-provided
number. This expresses how expensive the action the sample represents
was. This allows the profiler to scale the samples to be more
informative to the programmer.

There is already the period which is used similarly, but it means
something different, so I chose to not overload it. Instead
a new sample type for WEIGHT is added.

This can be used for multiple things. Initially it is used for TSX abort costs
and profiling by memory latencies (to make expensive loads appear higher
up in the histograms). The concept is quite generic and can be extended
to many other kinds of events or architectures, as long as the hardware
provides suitable auxiliary values. In principle it could also be
used for software tracepoints.

This adds the generic glue: a new optional sample format for a 64-bit
weight value.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/perf_event.h      |    2 ++
 include/uapi/linux/perf_event.h |    5 ++++-
 kernel/events/core.c            |    6 ++++++
 3 files changed, 12 insertions(+), 1 deletion(-)
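
Note, not part of the patch: the "optional sample format" idea can be
sketched from the writer's side as below. This is a userspace sketch that
models only the weight field and ignores every other sample field; the bit
value matches PERF_SAMPLE_WEIGHT in this patch, while the function names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Same bit as PERF_SAMPLE_WEIGHT in this patch. */
#define SKETCH_WEIGHT_BIT (1U << 14)

/* Bytes the optional weight field adds to a sample record. */
size_t weight_field_size(uint64_t sample_type)
{
	return (sample_type & SKETCH_WEIGHT_BIT) ? sizeof(uint64_t) : 0;
}

/*
 * Store the weight into out when the bit is requested and return the
 * number of u64 words written (0 or 1).
 */
size_t emit_weight(uint64_t sample_type, uint64_t weight, uint64_t *out)
{
	if (!(sample_type & SKETCH_WEIGHT_BIT))
		return 0;
	*out = weight;
	return 1;
}
```

The size accounting and the output step test the same bit, so a record never
reserves room for a weight it does not write.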

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 484cfbc..bb2429d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -584,6 +584,7 @@ struct perf_sample_data {
 	struct perf_branch_stack	*br_stack;
 	struct perf_regs_user		regs_user;
 	u64				stack_user_size;
+	u64				weight;
 };
 
 static inline void perf_sample_data_init(struct perf_sample_data *data,
@@ -597,6 +598,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->regs_user.abi = PERF_SAMPLE_REGS_ABI_NONE;
 	data->regs_user.regs = NULL;
 	data->stack_user_size = 0;
+	data->weight = 0;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4f63c05..7e24641 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -132,8 +132,10 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_BRANCH_STACK		= 1U << 11,
 	PERF_SAMPLE_REGS_USER			= 1U << 12,
 	PERF_SAMPLE_STACK_USER			= 1U << 13,
+	PERF_SAMPLE_WEIGHT			= 1U << 14,
+
+	PERF_SAMPLE_MAX = 1U << 15,		/* non-ABI */
 
-	PERF_SAMPLE_MAX = 1U << 14,		/* non-ABI */
 };
 
 /*
@@ -587,6 +589,7 @@ enum perf_event_type {
 	 * 	{ u64			size;
 	 * 	  char			data[size];
 	 * 	  u64			dyn_size; } && PERF_SAMPLE_STACK_USER
+	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 301079d..bc2ce07 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -955,6 +955,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_READ)
 		size += event->read_size;
 
+	if (sample_type & PERF_SAMPLE_WEIGHT)
+		size += sizeof(data->weight);
+
 	event->header_size = size;
 }
 
@@ -4169,6 +4172,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_sample_ustack(handle,
 					  data->stack_user_size,
 					  data->regs_user.regs);
+
+	if (sample_type & PERF_SAMPLE_WEIGHT)
+		perf_output_put(handle, data->weight);
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
-- 
1.7.9.5



* [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (3 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 04/18] perf, core: Add a concept of a weightened sample Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 23:00   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 06/18] perf: add support for PERF_SAMPLE_ADDR in dump_sampple() Stephane Eranian
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

Ensure we grab the weight from the raw sample struct
and that we can dump it via perf report -D.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/event.h   |    1 +
 tools/perf/util/evsel.c   |    5 +++++
 tools/perf/util/session.c |    3 +++
 3 files changed, 9 insertions(+)
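
Note, not part of the patch: the reader side can be sketched in userspace as
below, assuming a cursor that already points at the position where the
weight would be. The names are hypothetical; only the bit value is taken
from the series.

```c
#include <stdint.h>

/* Same bit as PERF_SAMPLE_WEIGHT in this series. */
#define PARSE_WEIGHT_BIT (1U << 14)

/*
 * Consume the optional weight at the cursor and return the advanced
 * cursor. *weight is left untouched when the bit is not set.
 */
const uint64_t *parse_weight(uint64_t sample_type, const uint64_t *array,
			     uint64_t *weight)
{
	if (sample_type & PARSE_WEIGHT_BIT) {
		*weight = *array;
		array++;
	}
	return array;
}
```

Returning the cursor lets a caller chain further optional fields after the
weight in the same style.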

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 0d573ff..cf52977 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -90,6 +90,7 @@ struct perf_sample {
 	u64 period;
 	u32 cpu;
 	u32 raw_size;
+	u64 weight;
 	void *raw_data;
 	struct ip_callchain *callchain;
 	struct branch_stack *branch_stack;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1b16dd1..e08ce12 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1058,6 +1058,11 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
 		}
 	}
 
+	if (type & PERF_SAMPLE_WEIGHT) {
+		data->weight = *array;
+		array++;
+	}
+
 	return 0;
 }
 
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ce6f511..117983e 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1006,6 +1006,9 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
 
 	if (sample_type & PERF_SAMPLE_STACK_USER)
 		stack_user__printf(&sample->user_stack);
+
+	if (sample_type & PERF_SAMPLE_WEIGHT)
+		printf(" ... weight: %"PRIu64"\n", sample->weight);
 }
 
 static struct machine *
-- 
1.7.9.5



* [PATCH v6 06/18] perf: add support for PERF_SAMPLE_ADDR in dump_sampple()
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (4 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 07/18] perf: add generic memory sampling interface Stephane Eranian
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This was missing from the current code, even though PERF_SAMPLE_ADDR
has been present for a long time. It is needed for PEBS-LL mode.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/session.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 117983e..9900778 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1009,6 +1009,10 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
 
 	if (sample_type & PERF_SAMPLE_WEIGHT)
 		printf(" ... weight: %"PRIu64"\n", sample->weight);
+
+	if (sample_type & PERF_SAMPLE_ADDR)
+		printf(" ..... data: 0x%"PRIx64"\n", sample->addr);
+
 }
 
 static struct machine *
-- 
1.7.9.5



* [PATCH v6 07/18] perf: add generic memory sampling interface
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (5 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 06/18] perf: add support for PERF_SAMPLE_ADDR in dump_sampple() Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 23:06   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 08/18] perf/x86: add memory profiling via PEBS Load Latency Stephane Eranian
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch adds PERF_SAMPLE_DSRC.

PERF_SAMPLE_DSRC collects the data source, i.e., where
the data associated with the sampled instruction
came from. The information is stored in a perf_mem_dsrc
union. It contains opcode, mem level, tlb, snoop, and
lock information, subject to availability in hardware.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 include/linux/perf_event.h      |    2 ++
 include/uapi/linux/perf_event.h |   68 +++++++++++++++++++++++++++++++++++++--
 kernel/events/core.c            |    6 ++++
 3 files changed, 74 insertions(+), 2 deletions(-)
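
Note, not part of the patch: the encoding can be exercised in userspace with
plain shifts and masks. This is a minimal sketch; the values and shifts are
copied from the PERF_MEM_* definitions in this patch, the masks are derived
from the bitfield widths in union perf_mem_dsrc, and the DSRC_* and dsrc_*
names are hypothetical. The TLB field is left out.

```c
#include <stdint.h>

/* Values and shifts copied from the PERF_MEM_* definitions in this patch. */
#define DSRC_OP_LOAD		0x02
#define DSRC_OP_SHIFT		0
#define DSRC_LVL_HIT		0x02
#define DSRC_LVL_L1		0x08
#define DSRC_LVL_SHIFT		5
#define DSRC_SNOOP_NONE		0x02
#define DSRC_SNOOP_SHIFT	19
#define DSRC_LOCK_LOCKED	0x02
#define DSRC_LOCK_SHIFT		24

/* Masks derived from the field widths: op 5, lvl 14, snoop 5, lock 2. */
#define DSRC_OP_MASK		0x1fULL
#define DSRC_LVL_MASK		0x3fffULL
#define DSRC_SNOOP_MASK		0x1fULL
#define DSRC_LOCK_MASK		0x3ULL

/* Pack the four fields into one 64-bit data source value. */
uint64_t dsrc_pack(uint64_t op, uint64_t lvl, uint64_t snoop, uint64_t lock)
{
	return ((op & DSRC_OP_MASK) << DSRC_OP_SHIFT) |
	       ((lvl & DSRC_LVL_MASK) << DSRC_LVL_SHIFT) |
	       ((snoop & DSRC_SNOOP_MASK) << DSRC_SNOOP_SHIFT) |
	       ((lock & DSRC_LOCK_MASK) << DSRC_LOCK_SHIFT);
}

/* Field extractors, one per packed field. */
uint64_t dsrc_op(uint64_t val)
{
	return (val >> DSRC_OP_SHIFT) & DSRC_OP_MASK;
}

uint64_t dsrc_lvl(uint64_t val)
{
	return (val >> DSRC_LVL_SHIFT) & DSRC_LVL_MASK;
}

uint64_t dsrc_snoop(uint64_t val)
{
	return (val >> DSRC_SNOOP_SHIFT) & DSRC_SNOOP_MASK;
}

uint64_t dsrc_lock(uint64_t val)
{
	return (val >> DSRC_LOCK_SHIFT) & DSRC_LOCK_MASK;
}
```

A load that hits in L1 with no snoop would be packed as
dsrc_pack(DSRC_OP_LOAD, DSRC_LVL_HIT | DSRC_LVL_L1, DSRC_SNOOP_NONE, 0).
Shifts and masks are used here instead of the bitfield members because C
leaves the in-memory layout of bitfields to the implementation.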

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bb2429d..8fe4610 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -579,6 +579,7 @@ struct perf_sample_data {
 		u32	reserved;
 	}				cpu_entry;
 	u64				period;
+	union  perf_mem_dsrc		dsrc;
 	struct perf_callchain_entry	*callchain;
 	struct perf_raw_record		*raw;
 	struct perf_branch_stack	*br_stack;
@@ -599,6 +600,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->regs_user.regs = NULL;
 	data->stack_user_size = 0;
 	data->weight = 0;
+	data->dsrc.val = 0;
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7e24641..8283218 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -133,9 +133,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_REGS_USER			= 1U << 12,
 	PERF_SAMPLE_STACK_USER			= 1U << 13,
 	PERF_SAMPLE_WEIGHT			= 1U << 14,
+	PERF_SAMPLE_DSRC			= 1U << 15,
 
-	PERF_SAMPLE_MAX = 1U << 15,		/* non-ABI */
-
+	PERF_SAMPLE_MAX = 1U << 16,		/* non-ABI */
 };
 
 /*
@@ -590,6 +590,7 @@ enum perf_event_type {
 	 * 	  char			data[size];
 	 * 	  u64			dyn_size; } && PERF_SAMPLE_STACK_USER
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
+	 *	{ u64			dsrc;     } && PERF_SAMPLE_DSRC
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
@@ -615,4 +616,67 @@ enum perf_callchain_context {
 #define PERF_FLAG_FD_OUTPUT		(1U << 1)
 #define PERF_FLAG_PID_CGROUP		(1U << 2) /* pid=cgroup id, per-cpu mode only */
 
+union perf_mem_dsrc {
+	__u64 val;
+	struct {
+		__u64   mem_op:5,	/* type of opcode */
+			mem_lvl:14,	/* memory hierarchy level */
+			mem_snoop:5,	/* snoop mode */
+			mem_lock:2,	/* lock instr */
+			mem_dtlb:7,	/* tlb access */
+			mem_rsvd:31;
+	};
+};
+
+/* type of opcode (load/store/prefetch/code) */
+#define PERF_MEM_OP_NA		0x01 /* not available */
+#define PERF_MEM_OP_LOAD	0x02 /* load instruction */
+#define PERF_MEM_OP_STORE	0x04 /* store instruction */
+#define PERF_MEM_OP_PFETCH	0x08 /* prefetch */
+#define PERF_MEM_OP_EXEC	0x10 /* code (execution) */
+#define PERF_MEM_OP_SHIFT	0
+
+/* memory hierarchy (memory level, hit or miss) */
+#define PERF_MEM_LVL_NA		0x01  /* not available */
+#define PERF_MEM_LVL_HIT	0x02  /* hit level */
+#define PERF_MEM_LVL_MISS	0x04  /* miss level  */
+#define PERF_MEM_LVL_L1		0x08  /* L1 */
+#define PERF_MEM_LVL_LFB	0x10  /* Line Fill Buffer */
+#define PERF_MEM_LVL_L2		0x20  /* L2 hit */
+#define PERF_MEM_LVL_L3		0x40  /* L3 hit */
+#define PERF_MEM_LVL_LOC_RAM	0x80  /* Local DRAM */
+#define PERF_MEM_LVL_REM_RAM1	0x100 /* Remote DRAM (1 hop) */
+#define PERF_MEM_LVL_REM_RAM2	0x200 /* Remote DRAM (2 hops) */
+#define PERF_MEM_LVL_REM_CCE1	0x400 /* Remote Cache (1 hop) */
+#define PERF_MEM_LVL_REM_CCE2	0x800 /* Remote Cache (2 hops) */
+#define PERF_MEM_LVL_IO		0x1000 /* I/O memory */
+#define PERF_MEM_LVL_UNC	0x2000 /* Uncached memory */
+#define PERF_MEM_LVL_SHIFT	5
+
+/* snoop mode */
+#define PERF_MEM_SNOOP_NA	0x01 /* not available */
+#define PERF_MEM_SNOOP_NONE	0x02 /* no snoop */
+#define PERF_MEM_SNOOP_HIT	0x04 /* snoop hit */
+#define PERF_MEM_SNOOP_MISS	0x08 /* snoop miss */
+#define PERF_MEM_SNOOP_HITM	0x10 /* snoop hit modified */
+#define PERF_MEM_SNOOP_SHIFT	19
+
+/* locked instruction */
+#define PERF_MEM_LOCK_NA	0x01 /* not available */
+#define PERF_MEM_LOCK_LOCKED	0x02 /* locked transaction */
+#define PERF_MEM_LOCK_SHIFT	24
+
+/* TLB access */
+#define PERF_MEM_TLB_NA		0x01 /* not available */
+#define PERF_MEM_TLB_HIT	0x02 /* hit level */
+#define PERF_MEM_TLB_MISS	0x04 /* miss level */
+#define PERF_MEM_TLB_L1		0x08 /* L1 */
+#define PERF_MEM_TLB_L2		0x10 /* L2 */
+#define PERF_MEM_TLB_WK		0x20 /* Hardware Walker */
+#define PERF_MEM_TLB_OS		0x40 /* OS fault handler */
+#define PERF_MEM_TLB_SHIFT	26
+
+#define PERF_MEM_S(a, s) \
+	(((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
+
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bc2ce07..fd4ceea 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -958,6 +958,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_WEIGHT)
 		size += sizeof(data->weight);
 
+	if (sample_type & PERF_SAMPLE_DSRC)
+		size += sizeof(data->dsrc.val);
+
 	event->header_size = size;
 }
 
@@ -4175,6 +4178,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 
 	if (sample_type & PERF_SAMPLE_WEIGHT)
 		perf_output_put(handle, data->weight);
+
+	if (sample_type & PERF_SAMPLE_DSRC)
+		perf_output_put(handle, data->dsrc.val);
 }
 
 void perf_prepare_sample(struct perf_event_header *header,
-- 
1.7.9.5



* [PATCH v6 08/18] perf/x86: add memory profiling via PEBS Load Latency
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (6 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 07/18] perf: add generic memory sampling interface Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 23:12   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 09/18] perf/x86: export PEBS load latency threshold register to sysfs Stephane Eranian
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch adds support for memory profiling using the
PEBS Load Latency facility.

Load accesses are sampled by HW. The instruction address,
data address, load latency, data source, TLB and lock
information can be saved in the sampling buffer by using
the PERF_SAMPLE_WEIGHT (for latency), PERF_SAMPLE_ADDR
and PERF_SAMPLE_DSRC sample types.
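
A rough sketch of how a tool might build the sample_type mask for
such an event follows. It is an illustration, not code from this
series: the SKETCH_* names are hypothetical, the bit values for
WEIGHT and DSRC are the ones this series assigns in
include/uapi/linux/perf_event.h (the latency is delivered through
PERF_SAMPLE_WEIGHT), and the size helper only mirrors the two checks
added to perf_event__header_size().

```c
#include <stdint.h>

/* Sample type bits; WEIGHT and DSRC use the values assigned in this series. */
#define SKETCH_SAMPLE_IP	(1U << 0)
#define SKETCH_SAMPLE_ADDR	(1U << 3)
#define SKETCH_SAMPLE_WEIGHT	(1U << 14)
#define SKETCH_SAMPLE_DSRC	(1U << 15)

/* Mask requesting instruction address, data address, latency and data source. */
static uint64_t mem_sample_type(void)
{
	return SKETCH_SAMPLE_IP | SKETCH_SAMPLE_ADDR |
	       SKETCH_SAMPLE_WEIGHT | SKETCH_SAMPLE_DSRC;
}

/* Bytes the weight and dsrc fields add to each sample record (one u64 each). */
static unsigned int mem_extra_bytes(uint64_t sample_type)
{
	unsigned int size = 0;

	if (sample_type & SKETCH_SAMPLE_WEIGHT)
		size += sizeof(uint64_t);
	if (sample_type & SKETCH_SAMPLE_DSRC)
		size += sizeof(uint64_t);
	return size;
}
```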

To enable PEBS Load Latency, users have to use the
model-specific event:
- on NHM/WSM: MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD
- on SNB/IVB: MEM_TRANS_RETIRED:LATENCY_ABOVE_THRESHOLD

To make things easier, this patch also exports a generic
alias via sysfs: mem-loads. It exports the right event
encoding based on the host CPU and can be used directly
by the perf tool.

Loosely based on a patch by Intel's Lin Ming posted to
LKML in July 2011.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/include/uapi/asm/msr-index.h     |    1 +
 arch/x86/kernel/cpu/perf_event.c          |    5 +-
 arch/x86/kernel/cpu/perf_event.h          |   25 +++++-
 arch/x86/kernel/cpu/perf_event_intel.c    |   24 ++++++
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  133 +++++++++++++++++++++++++++--
 5 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index 433a59f..1031604 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -71,6 +71,7 @@
 #define MSR_IA32_PEBS_ENABLE		0x000003f1
 #define MSR_IA32_DS_AREA		0x00000600
 #define MSR_IA32_PERF_CAPABILITIES	0x00000345
+#define MSR_PEBS_LD_LAT_THRESHOLD	0x000003f6
 
 #define MSR_MTRRfix64K_00000		0x00000250
 #define MSR_MTRRfix16K_80000		0x00000258
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 35b516a..1a2b337 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1367,7 +1367,7 @@ static __init struct attribute **merge_attr(struct attribute **a,
 	return new;
 }
 
-static ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
+ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
 			  char *page)
 {
 	struct perf_pmu_events_attr *pmu_attr = \
@@ -1498,6 +1498,9 @@ static int __init init_hw_perf_events(void)
 	x86_pmu.attr_rdpmc = 1; /* enable userspace RDPMC usage by default */
 	x86_pmu_format_group.attrs = x86_pmu.format_attrs;
 
+	if (x86_pmu.event_attrs)
+		x86_pmu_events_group.attrs = x86_pmu.event_attrs;
+
 	if (!x86_pmu.events_sysfs_show)
 		x86_pmu_events_group.attrs = &empty_attrs;
 	else
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3f10cfe..3f91411 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -46,6 +46,7 @@ enum extra_reg_type {
 	EXTRA_REG_RSP_0 = 0,	/* offcore_response_0 */
 	EXTRA_REG_RSP_1 = 1,	/* offcore_response_1 */
 	EXTRA_REG_LBR   = 2,	/* lbr_select */
+	EXTRA_REG_LDLAT = 3,	/* ld_lat_threshold */
 
 	EXTRA_REG_MAX		/* number of entries needed */
 };
@@ -61,6 +62,10 @@ struct event_constraint {
 	int	overlap;
 	int	flags;
 };
+/*
+ * struct event_constraint flags
+ */
+#define PERF_X86_EVENT_PEBS_LDLAT	0x1 /* ld+ldlat data address sampling */
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
@@ -233,6 +238,10 @@ struct cpu_hw_events {
 #define INTEL_UEVENT_CONSTRAINT(c, n)	\
 	EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)
 
+#define INTEL_PLD_CONSTRAINT(c, n)	\
+	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
+
 #define EVENT_CONSTRAINT_END		\
 	EVENT_CONSTRAINT(0, 0, 0)
 
@@ -262,12 +271,22 @@ struct extra_reg {
 	.msr = (ms),		\
 	.config_mask = (m),	\
 	.valid_mask = (vm),	\
-	.idx = EXTRA_REG_##i	\
+	.idx = EXTRA_REG_##i,	\
 	}
 
 #define INTEL_EVENT_EXTRA_REG(event, msr, vm, idx)	\
 	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx)
 
+#define INTEL_UEVENT_EXTRA_REG(event, msr, vm, idx) \
+	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
+			ARCH_PERFMON_EVENTSEL_UMASK, vm, idx)
+
+#define INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(c) \
+	INTEL_UEVENT_EXTRA_REG(c, \
+			       MSR_PEBS_LD_LAT_THRESHOLD, \
+			       0xffff, \
+			       LDLAT)
+
 #define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
 
 union perf_capabilities {
@@ -355,6 +374,7 @@ struct x86_pmu {
 	 */
 	int		attr_rdpmc;
 	struct attribute **format_attrs;
+	struct attribute **event_attrs;
 
 	ssize_t		(*events_sysfs_show)(char *page, u64 config);
 	struct attribute **cpu_events;
@@ -659,6 +679,9 @@ int p6_pmu_init(void);
 
 int knc_pmu_init(void);
 
+ssize_t events_sysfs_show(struct device *dev, struct device_attribute *attr,
+			  char *page);
+
 #else /* CONFIG_CPU_SUP_INTEL */
 
 static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 57d6527..1d2a7ca 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -81,6 +81,7 @@ static struct event_constraint intel_nehalem_event_constraints[] __read_mostly =
 static struct extra_reg intel_nehalem_extra_regs[] __read_mostly =
 {
 	INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0xffff, RSP_0),
+	INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x100b),
 	EVENT_EXTRA_END
 };
 
@@ -111,6 +112,7 @@ static struct extra_reg intel_westmere_extra_regs[] __read_mostly =
 {
 	INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0xffff, RSP_0),
 	INTEL_EVENT_EXTRA_REG(0xbb, MSR_OFFCORE_RSP_1, 0xffff, RSP_1),
+	INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x100b),
 	EVENT_EXTRA_END
 };
 
@@ -130,9 +132,23 @@ static struct event_constraint intel_gen_event_constraints[] __read_mostly =
 static struct extra_reg intel_snb_extra_regs[] __read_mostly = {
 	INTEL_EVENT_EXTRA_REG(0xb7, MSR_OFFCORE_RSP_0, 0x3fffffffffull, RSP_0),
 	INTEL_EVENT_EXTRA_REG(0xbb, MSR_OFFCORE_RSP_1, 0x3fffffffffull, RSP_1),
+	INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(0x01cd),
 	EVENT_EXTRA_END
 };
 
+EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
+EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
+
+struct attribute *nhm_events_attrs[] = {
+	EVENT_PTR(mem_ld_nhm),
+	NULL,
+};
+
+struct attribute *snb_events_attrs[] = {
+	EVENT_PTR(mem_ld_snb),
+	NULL,
+};
+
 static u64 intel_pmu_event_map(int hw_event)
 {
 	return intel_perfmon_event_map[hw_event];
@@ -2009,6 +2025,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.enable_all = intel_pmu_nhm_enable_all;
 		x86_pmu.extra_regs = intel_nehalem_extra_regs;
 
+		x86_pmu.cpu_events = nhm_events_attrs;
+
 		/* UOPS_ISSUED.STALLED_CYCLES */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
@@ -2049,6 +2067,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.extra_regs = intel_westmere_extra_regs;
 		x86_pmu.er_flags |= ERF_HAS_RSP_1;
 
+		x86_pmu.cpu_events = nhm_events_attrs;
+
 		/* UOPS_ISSUED.STALLED_CYCLES */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
@@ -2077,6 +2097,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.er_flags |= ERF_HAS_RSP_1;
 		x86_pmu.er_flags |= ERF_NO_HT_SHARING;
 
+		x86_pmu.cpu_events = snb_events_attrs;
+
 		/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
@@ -2102,6 +2124,8 @@ __init int intel_pmu_init(void)
 		x86_pmu.er_flags |= ERF_HAS_RSP_1;
 		x86_pmu.er_flags |= ERF_NO_HT_SHARING;
 
+		x86_pmu.cpu_events = snb_events_attrs;
+
 		/* UOPS_ISSUED.ANY,c=1,i=1 to count stall cycles */
 		intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] =
 			X86_CONFIG(.event=0x0e, .umask=0x01, .inv=1, .cmask=1);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index f30d85b..bdd73bc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -24,6 +24,92 @@ struct pebs_record_32 {
 
  */
 
+union intel_x86_pebs_dse {
+	u64 val;
+	struct {
+		unsigned int ld_dse:4;
+		unsigned int ld_stlb_miss:1;
+		unsigned int ld_locked:1;
+		unsigned int ld_reserved:26;
+	};
+	struct {
+		unsigned int st_l1d_hit:1;
+		unsigned int st_reserved1:3;
+		unsigned int st_stlb_miss:1;
+		unsigned int st_locked:1;
+		unsigned int st_reserved2:26;
+	};
+};
+
+
+/*
+ * Map PEBS Load Latency Data Source encodings to generic
+ * memory data source information
+ */
+#define P(a, b) PERF_MEM_S(a, b)
+#define OP_LH (P(OP, LOAD) | P(LVL, HIT))
+#define SNOOP_NONE_MISS (P(SNOOP, NONE) | P(SNOOP, MISS))
+
+static const u64 pebs_data_source[] = {
+	P(OP, LOAD) | P(LVL, MISS) | P(LVL, L3) | P(SNOOP, NA),/* 0x00:ukn L3 */
+	OP_LH | P(LVL, L1) | P(SNOOP, NONE),	/* 0x01: L1 local */
+	OP_LH | P(LVL, LFB)| P(SNOOP, NONE),	/* 0x02: LFB hit */
+	OP_LH | P(LVL, L2) | P(SNOOP, NONE),	/* 0x03: L2 hit */
+	OP_LH | P(LVL, L3) | P(SNOOP, NONE),	/* 0x04: L3 hit */
+	OP_LH | P(LVL, L3) | P(SNOOP, MISS),	/* 0x05: L3 hit, snoop miss */
+	OP_LH | P(LVL, L3) | P(SNOOP, HIT),	/* 0x06: L3 hit, snoop hit */
+	OP_LH | P(LVL, L3) | P(SNOOP, HITM),	/* 0x07: L3 hit, snoop hitm */
+	OP_LH | P(LVL, REM_CCE1) | P(SNOOP, HIT),  /* 0x08: L3 miss snoop hit */
+	OP_LH | P(LVL, REM_CCE1) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
+	OP_LH | P(LVL, LOC_RAM)  | P(SNOOP, HIT),  /* 0x0a: L3 miss, shared */
+	OP_LH | P(LVL, REM_RAM1) | P(SNOOP, HIT),  /* 0x0b: L3 miss, shared */
+	OP_LH | P(LVL, LOC_RAM)  | SNOOP_NONE_MISS,/* 0x0c: L3 miss, excl */
+	OP_LH | P(LVL, REM_RAM1) | SNOOP_NONE_MISS,/* 0x0d: L3 miss, excl */
+	OP_LH | P(LVL, IO) | P(SNOOP, NONE), /* 0x0e: I/O */
+	OP_LH | P(LVL, UNC) | P(SNOOP, NONE), /* 0x0f: uncached */
+};
+
+static u64 load_latency_data(u64 status)
+{
+	union intel_x86_pebs_dse dse;
+	u64 val;
+	int model = boot_cpu_data.x86_model;
+	int fam = boot_cpu_data.x86;
+
+	dse.val = status;
+
+	/*
+	 * use the mapping table for bit 0-3
+	 */
+	val = pebs_data_source[dse.ld_dse];
+
+	/*
+	 * Nehalem models do not support TLB, Lock infos
+	 */
+	if (fam == 0x6 && (model == 26 || model == 30
+	    || model == 31 || model == 46)) {
+		val |= P(TLB, NA) | P(LOCK, NA);
+		return val;
+	}
+	/*
+	 * bit 4: TLB access
+	 * 0 = did not miss 2nd level TLB
+	 * 1 = missed 2nd level TLB
+	 */
+	if (dse.ld_stlb_miss)
+		val |= P(TLB, MISS) | P(TLB, L2);
+	else
+		val |= P(TLB, HIT) | P(TLB, L1) | P(TLB, L2);
+
+	/*
+	 * bit 5: locked prefix
+	 */
+	if (dse.ld_locked)
+		val |= P(LOCK, LOCKED);
+
+	return val;
+}
+
 struct pebs_record_core {
 	u64 flags, ip;
 	u64 ax, bx, cx, dx;
@@ -364,7 +450,7 @@ struct event_constraint intel_atom_pebs_event_constraints[] = {
 };
 
 struct event_constraint intel_nehalem_pebs_event_constraints[] = {
-	INTEL_EVENT_CONSTRAINT(0x0b, 0xf),    /* MEM_INST_RETIRED.* */
+	INTEL_PLD_CONSTRAINT(0x100b, 0xf),      /* MEM_INST_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0x0f, 0xf),    /* MEM_UNCORE_RETIRED.* */
 	INTEL_UEVENT_CONSTRAINT(0x010c, 0xf), /* MEM_STORE_RETIRED.DTLB_MISS */
 	INTEL_EVENT_CONSTRAINT(0xc0, 0xf),    /* INST_RETIRED.ANY */
@@ -379,7 +465,7 @@ struct event_constraint intel_nehalem_pebs_event_constraints[] = {
 };
 
 struct event_constraint intel_westmere_pebs_event_constraints[] = {
-	INTEL_EVENT_CONSTRAINT(0x0b, 0xf),    /* MEM_INST_RETIRED.* */
+	INTEL_PLD_CONSTRAINT(0x100b, 0xf),      /* MEM_INST_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0x0f, 0xf),    /* MEM_UNCORE_RETIRED.* */
 	INTEL_UEVENT_CONSTRAINT(0x010c, 0xf), /* MEM_STORE_RETIRED.DTLB_MISS */
 	INTEL_EVENT_CONSTRAINT(0xc0, 0xf),    /* INSTR_RETIRED.* */
@@ -399,7 +485,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
 	INTEL_UEVENT_CONSTRAINT(0x02c2, 0xf), /* UOPS_RETIRED.RETIRE_SLOTS */
 	INTEL_EVENT_CONSTRAINT(0xc4, 0xf),    /* BR_INST_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0xc5, 0xf),    /* BR_MISP_RETIRED.* */
-	INTEL_EVENT_CONSTRAINT(0xcd, 0x8),    /* MEM_TRANS_RETIRED.* */
+	INTEL_PLD_CONSTRAINT(0x01cd, 0x8),    /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
 	INTEL_EVENT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -413,7 +499,7 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
         INTEL_UEVENT_CONSTRAINT(0x02c2, 0xf), /* UOPS_RETIRED.RETIRE_SLOTS */
         INTEL_EVENT_CONSTRAINT(0xc4, 0xf),    /* BR_INST_RETIRED.* */
         INTEL_EVENT_CONSTRAINT(0xc5, 0xf),    /* BR_MISP_RETIRED.* */
-        INTEL_EVENT_CONSTRAINT(0xcd, 0x8),    /* MEM_TRANS_RETIRED.* */
+        INTEL_PLD_CONSTRAINT(0x01cd, 0x8),    /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
         INTEL_EVENT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
         INTEL_EVENT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
         INTEL_EVENT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -448,6 +534,9 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 	hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
 
 	cpuc->pebs_enabled |= 1ULL << hwc->idx;
+
+	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
+		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -560,20 +649,48 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 				   struct pt_regs *iregs, void *__pebs)
 {
 	/*
-	 * We cast to pebs_record_core since that is a subset of
-	 * both formats and we don't use the other fields in this
-	 * routine.
+	 * We cast to pebs_record_nhm to get the load latency data
+	 * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
 	 */
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-	struct pebs_record_core *pebs = __pebs;
+	struct pebs_record_nhm *pebs = __pebs;
 	struct perf_sample_data data;
 	struct pt_regs regs;
+	u64 sample_type;
+	int fll;
 
 	if (!intel_pmu_save_and_restart(event))
 		return;
 
+	fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
+
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 
+	data.period = event->hw.last_period;
+	sample_type = event->attr.sample_type;
+
+	/*
+	 * if PEBS-LL or PreciseStore
+	 */
+	if (fll) {
+		if (sample_type & PERF_SAMPLE_ADDR)
+			data.addr = pebs->dla;
+
+		/*
+		 * Use latency for weight (only avail with PEBS-LL)
+		 */
+		if (fll && (sample_type & PERF_SAMPLE_WEIGHT))
+			data.weight = pebs->lat;
+
+		/*
+		 * data.dsrc encodes the data source
+		 */
+		if (sample_type & PERF_SAMPLE_DSRC) {
+			if (fll)
+				data.dsrc.val = load_latency_data(pebs->dse);
+		}
+	}
+
 	/*
 	 * We use the interrupt regs as a base because the PEBS record
 	 * does not contain a full regs set, specifically it seems to
-- 
1.7.9.5



* [PATCH v6 09/18] perf/x86: export PEBS load latency threshold register to sysfs
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (7 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 08/18] perf/x86: add memory profiling via PEBS Load Latency Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 10/18] perf/x86: add support for PEBS Precise Store Stephane Eranian
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

Make the PEBS Load Latency threshold register layout
and encoding visible to user level tools.
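
To illustrate what the exported format allows, the sketch below turns
an alias string such as "event=0xcd,umask=0x1,ldlat=3" into config and
config1 values. It is a simplified stand-in for what a tool could do,
not the perf tool's actual parser: struct ev_encoding and
parse_alias() are hypothetical names, the ldlat placement
(config1:0-15) comes from this patch, and event in config:0-7 with
umask in config:8-15 is assumed to be the usual x86 layout.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: the two attr fields the alias terms fill in. */
struct ev_encoding {
	uint64_t config;
	uint64_t config1;
};

/* Parse "key=value,key=value,..." terms; returns 0 on success, -1 on error. */
static int parse_alias(const char *alias, struct ev_encoding *out)
{
	const char *p = alias;

	out->config = 0;
	out->config1 = 0;

	while (*p) {
		const char *eq = strchr(p, '=');
		char *end;
		uint64_t v;
		size_t klen;

		if (!eq)
			return -1;
		klen = (size_t)(eq - p);
		v = strtoull(eq + 1, &end, 0);	/* base 0 accepts 0x... and decimal */
		if (end == eq + 1)
			return -1;		/* no digits after '=' */

		if (klen == 5 && !strncmp(p, "event", 5))
			out->config |= v & 0xff;		/* config:0-7 */
		else if (klen == 5 && !strncmp(p, "umask", 5))
			out->config |= (v & 0xff) << 8;		/* config:8-15 */
		else if (klen == 5 && !strncmp(p, "ldlat", 5))
			out->config1 |= v & 0xffff;		/* config1:0-15 */
		else
			return -1;		/* unknown term */

		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -1;
		p = end;
	}
	return 0;
}
```

With the two mem-loads aliases from the previous patch, the resulting
config values match the 0x100b and 0x01cd codes used in the extra_reg
and PEBS constraint tables.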

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 1d2a7ca..8eced4c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1755,6 +1755,8 @@ static void intel_pmu_flush_branch_stack(void)
 
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
 
+PMU_FORMAT_ATTR(ldlat, "config1:0-15");
+
 static struct attribute *intel_arch3_formats_attr[] = {
 	&format_attr_event.attr,
 	&format_attr_umask.attr,
@@ -1765,6 +1767,7 @@ static struct attribute *intel_arch3_formats_attr[] = {
 	&format_attr_cmask.attr,
 
 	&format_attr_offcore_rsp.attr, /* XXX do NHM/WSM + SNB breakout */
+	&format_attr_ldlat.attr, /* PEBS load latency */
 	NULL,
 };
 
-- 
1.7.9.5



* [PATCH v6 10/18] perf/x86: add support for PEBS Precise Store
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (8 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 09/18] perf/x86: export PEBS load latency threshold register to sysfs Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 23:21   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 11/18] perf tools: add mem access sampling core support Stephane Eranian
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch adds support for PEBS Precise Store,
which is available on Intel Sandy Bridge and
Ivy Bridge processors.

To use Precise Store, the proper PEBS event
must be used: mem_trans_retired:precise_stores.
For the perf tool, the generic mem-stores event
exported via sysfs can be used directly.
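
For readers who want to see the status decoding in isolation, here is
a stand-alone sketch of the three status bits that
precise_store_data() in the diff below looks at. The bit positions
(bit 0: L1D hit, bit 4: second-level TLB miss, bit 5: locked) follow
the store view of union intel_x86_pebs_dse; struct store_info and
decode_store_status() are hypothetical names used only for this
illustration.

```c
#include <stdint.h>

/* Hypothetical decoded view of a Precise Store status word. */
struct store_info {
	int l1d_hit;	/* bit 0: store hit the L1 data cache */
	int stlb_miss;	/* bit 4: store missed the 2nd level TLB */
	int locked;	/* bit 5: locked prefix */
};

/* Pick the three documented bits out of the raw status value. */
static struct store_info decode_store_status(uint64_t status)
{
	struct store_info s;

	s.l1d_hit   = (int)((status >> 0) & 1);
	s.stlb_miss = (int)((status >> 4) & 1);
	s.locked    = (int)((status >> 5) & 1);
	return s;
}
```

When l1d_hit is clear, all that is known is that the store missed
L1D; the hardware does not report which level finally served it.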

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event.h          |    5 +++
 arch/x86/kernel/cpu/perf_event_intel.c    |    2 ++
 arch/x86/kernel/cpu/perf_event_intel_ds.c |   49 +++++++++++++++++++++++++++--
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3f91411..645b864 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -66,6 +66,7 @@ struct event_constraint {
  * struct event_constraint flags
  */
 #define PERF_X86_EVENT_PEBS_LDLAT	0x1 /* ld+ldlat data address sampling */
+#define PERF_X86_EVENT_PEBS_ST		0x2 /* st data address sampling */
 
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
@@ -242,6 +243,10 @@ struct cpu_hw_events {
 	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
 			   HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_LDLAT)
 
+#define INTEL_PST_CONSTRAINT(c, n)	\
+	__EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK, \
+			  HWEIGHT(n), 0, PERF_X86_EVENT_PEBS_ST)
+
 #define EVENT_CONSTRAINT_END		\
 	EVENT_CONSTRAINT(0, 0, 0)
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 8eced4c..665e26c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -138,6 +138,7 @@ static struct extra_reg intel_snb_extra_regs[] __read_mostly = {
 
 EVENT_ATTR_STR(mem-loads, mem_ld_nhm, "event=0x0b,umask=0x10,ldlat=3");
 EVENT_ATTR_STR(mem-loads, mem_ld_snb, "event=0xcd,umask=0x1,ldlat=3");
+EVENT_ATTR_STR(mem-stores, mem_st_snb, "event=0xcd,umask=0x2");
 
 struct attribute *nhm_events_attrs[] = {
 	EVENT_PTR(mem_ld_nhm),
@@ -146,6 +147,7 @@ struct attribute *nhm_events_attrs[] = {
 
 struct attribute *snb_events_attrs[] = {
 	EVENT_PTR(mem_ld_snb),
+	EVENT_PTR(mem_st_snb),
 	NULL,
 };
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index bdd73bc..cfbd469 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -69,6 +69,44 @@ static const u64 pebs_data_source[] = {
 	OP_LH | P(LVL, UNC) | P(SNOOP, NONE), /* 0x0f: uncached */
 };
 
+static u64 precise_store_data(u64 status)
+{
+	union intel_x86_pebs_dse dse;
+	u64 val = P(OP, STORE) | P(SNOOP, NA) | P(LVL, L1) | P(TLB, L2);
+
+	dse.val = status;
+
+	/*
+	 * bit 4: TLB access
+	 * 1 = stored missed 2nd level TLB
+	 *
+	 * so it either hit the walker or the OS
+	 * otherwise hit 2nd level TLB
+	 */
+	if (dse.st_stlb_miss)
+		val |= P(TLB, MISS);
+	else
+		val |= P(TLB, HIT);
+
+	/*
+	 * bit 0: hit L1 data cache
+	 * if not set, then all we know is that
+	 * it missed L1D
+	 */
+	if (dse.st_l1d_hit)
+		val |= P(LVL, HIT);
+	else
+		val |= P(LVL, MISS);
+
+	/*
+	 * bit 5: Locked prefix
+	 */
+	if (dse.st_locked)
+		val |= P(LOCK, LOCKED);
+
+	return val;
+}
+
 static u64 load_latency_data(u64 status)
 {
 	union intel_x86_pebs_dse dse;
@@ -486,6 +524,7 @@ struct event_constraint intel_snb_pebs_event_constraints[] = {
 	INTEL_EVENT_CONSTRAINT(0xc4, 0xf),    /* BR_INST_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0xc5, 0xf),    /* BR_MISP_RETIRED.* */
 	INTEL_PLD_CONSTRAINT(0x01cd, 0x8),    /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
+	INTEL_PST_CONSTRAINT(0x02cd, 0x8),    /* MEM_TRANS_RETIRED.PRECISE_STORES */
 	INTEL_EVENT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
 	INTEL_EVENT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -500,6 +539,7 @@ struct event_constraint intel_ivb_pebs_event_constraints[] = {
         INTEL_EVENT_CONSTRAINT(0xc4, 0xf),    /* BR_INST_RETIRED.* */
         INTEL_EVENT_CONSTRAINT(0xc5, 0xf),    /* BR_MISP_RETIRED.* */
         INTEL_PLD_CONSTRAINT(0x01cd, 0x8),    /* MEM_TRANS_RETIRED.LAT_ABOVE_THR */
+	INTEL_PST_CONSTRAINT(0x02cd, 0x8),    /* MEM_TRANS_RETIRED.PRECISE_STORES */
         INTEL_EVENT_CONSTRAINT(0xd0, 0xf),    /* MEM_UOP_RETIRED.* */
         INTEL_EVENT_CONSTRAINT(0xd1, 0xf),    /* MEM_LOAD_UOPS_RETIRED.* */
         INTEL_EVENT_CONSTRAINT(0xd2, 0xf),    /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
@@ -537,6 +577,8 @@ void intel_pmu_pebs_enable(struct perf_event *event)
 
 	if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
 		cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
+	else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
+		cpuc->pebs_enabled |= 1ULL << 63;
 }
 
 void intel_pmu_pebs_disable(struct perf_event *event)
@@ -657,12 +699,13 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	struct perf_sample_data data;
 	struct pt_regs regs;
 	u64 sample_type;
-	int fll;
+	int fll, fst;
 
 	if (!intel_pmu_save_and_restart(event))
 		return;
 
 	fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
+	fst = event->hw.flags & PERF_X86_EVENT_PEBS_ST;
 
 	perf_sample_data_init(&data, 0, event->hw.last_period);
 
@@ -672,7 +715,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	/*
 	 * if PEBS-LL or PreciseStore
 	 */
-	if (fll) {
+	if (fll || fst) {
 		if (sample_type & PERF_SAMPLE_ADDR)
 			data.addr = pebs->dla;
 
@@ -688,6 +731,8 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 		if (sample_type & PERF_SAMPLE_DSRC) {
 			if (fll)
 				data.dsrc.val = load_latency_data(pebs->dse);
+			else
+				data.dsrc.val = precise_store_data(pebs->dse);
 		}
 	}
 
-- 
1.7.9.5



* [PATCH v6 11/18] perf tools: add mem access sampling core support
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (9 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 10/18] perf/x86: add support for PEBS Precise Store Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 23:25   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 12/18] perf report: add support for mem access profiling Stephane Eranian
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch adds the sorting and histogram support
functions to enable profiling of memory accesses.

The following sorting orders are added:
 - symbol_daddr: data address symbol (or raw address)
 - dso_daddr: data address shared object
 - weight: access cost
 - locked: access uses locked transaction
 - tlb: TLB access
 - mem: memory level of the access (L1, L2, L3, RAM, ...)
 - snoop: access snoop mode
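
As a loose illustration of what ordering by weight means, the sketch
below sorts a flat array of samples by descending access cost with
qsort(). It does not reflect how the perf tool's sort keys and
histogram entries are implemented: struct mem_sample,
cmp_weight_desc() and sort_by_weight() are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flattened sample: data address plus access cost. */
struct mem_sample {
	uint64_t daddr;
	uint64_t weight;
};

/* qsort() comparator: the most expensive accesses come first. */
static int cmp_weight_desc(const void *a, const void *b)
{
	const struct mem_sample *l = a;
	const struct mem_sample *r = b;

	if (l->weight < r->weight)
		return 1;
	if (l->weight > r->weight)
		return -1;
	return 0;
}

/* Sort 'n' samples in place by descending weight. */
static void sort_by_weight(struct mem_sample *samples, size_t n)
{
	qsort(samples, n, sizeof(*samples), cmp_weight_desc);
}
```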

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/event.h   |    1 +
 tools/perf/util/evsel.c   |    5 +
 tools/perf/util/hist.c    |   76 ++++++++++-
 tools/perf/util/hist.h    |   13 ++
 tools/perf/util/session.c |   38 ++++++
 tools/perf/util/session.h |    4 +
 tools/perf/util/sort.c    |  325 ++++++++++++++++++++++++++++++++++++++++++++-
 tools/perf/util/sort.h    |   10 +-
 tools/perf/util/symbol.h  |    7 +
 9 files changed, 469 insertions(+), 10 deletions(-)

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index cf52977..ad66b44 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -91,6 +91,7 @@ struct perf_sample {
 	u32 cpu;
 	u32 raw_size;
 	u64 weight;
+	u64 dsrc;
 	void *raw_data;
 	struct ip_callchain *callchain;
 	struct branch_stack *branch_stack;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index e08ce12..9f334dd 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1063,6 +1063,11 @@ int perf_evsel__parse_sample(struct perf_evsel *evsel, union perf_event *event,
 		array++;
 	}
 
+	if (type & PERF_SAMPLE_DSRC) {
+		data->dsrc = *array;
+		array++;
+	}
+
 	return 0;
 }
 
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index cb17e2a..b5259c9 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -66,12 +66,16 @@ static void hists__set_unres_dso_col_len(struct hists *hists, int dso)
 void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 {
 	const unsigned int unresolved_col_width = BITS_PER_LONG / 4;
+	int symlen;
 	u16 len;
 
 	if (h->ms.sym)
 		hists__new_col_len(hists, HISTC_SYMBOL, h->ms.sym->namelen + 4);
-	else
+	else {
+		symlen = unresolved_col_width + 4 + 2;
+		hists__new_col_len(hists, HISTC_SYMBOL, symlen);
 		hists__set_unres_dso_col_len(hists, HISTC_DSO);
+	}
 
 	len = thread__comm_len(h->thread);
 	if (hists__new_col_len(hists, HISTC_COMM, len))
@@ -83,7 +87,6 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 	}
 
 	if (h->branch_info) {
-		int symlen;
 		/*
 		 * +4 accounts for '[x] ' priv level info
 		 * +2 account of 0x prefix on raw addresses
@@ -111,7 +114,36 @@ void hists__calc_col_len(struct hists *hists, struct hist_entry *h)
 			hists__new_col_len(hists, HISTC_SYMBOL_TO, symlen);
 			hists__set_unres_dso_col_len(hists, HISTC_DSO_TO);
 		}
+	} else if (h->mem_info) {
+		/*
+		 * +4 accounts for '[x] ' priv level info
+		 * +2 account of 0x prefix on raw addresses
+		 */
+		if (h->mem_info->daddr.sym) {
+			symlen = (int)h->mem_info->daddr.sym->namelen + 4
+			       + unresolved_col_width + 2;
+			hists__new_col_len(hists, HISTC_MEM_DADDR_SYMBOL,
+					   symlen);
+		} else {
+			symlen = unresolved_col_width + 4 + 2;
+			hists__new_col_len(hists, HISTC_MEM_DADDR_SYMBOL,
+					   symlen);
+		}
+		if (h->mem_info->daddr.map) {
+			symlen = dso__name_len(h->mem_info->daddr.map->dso);
+			hists__new_col_len(hists, HISTC_MEM_DADDR_DSO,
+					   symlen);
+		} else {
+			symlen = unresolved_col_width + 4 + 2;
+			hists__set_unres_dso_col_len(hists, HISTC_MEM_DADDR_DSO);
+		}
+		hists__new_col_len(hists, HISTC_MEM_COST, 7);
+		hists__new_col_len(hists, HISTC_MEM_LOCKED, 6);
+		hists__new_col_len(hists, HISTC_MEM_TLB, 22);
+		hists__new_col_len(hists, HISTC_MEM_SNOOP, 12);
+		hists__new_col_len(hists, HISTC_MEM_LVL, 21+3);
 	}
+
 }
 
 void hists__output_recalc_col_len(struct hists *hists, int max_rows)
@@ -235,13 +267,19 @@ void hists__decay_entries_threaded(struct hists *hists,
 static struct hist_entry *hist_entry__new(struct hist_entry *template)
 {
 	size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
-	struct hist_entry *he = malloc(sizeof(*he) + callchain_size);
+	struct hist_entry *he = calloc(1, sizeof(*he) + callchain_size);
 
 	if (he != NULL) {
 		*he = *template;
 
 		if (he->ms.map)
 			he->ms.map->referenced = true;
+		if (he->mem_info) {
+			if (he->mem_info->iaddr.map)
+				he->mem_info->iaddr.map->referenced = true;
+			if (he->mem_info->daddr.map)
+				he->mem_info->daddr.map->referenced = true;
+		}
 		if (symbol_conf.use_callchain)
 			callchain_init(he->callchain);
 
@@ -323,6 +361,35 @@ static struct hist_entry *add_hist_entry(struct hists *hists,
 	return he;
 }
 
+struct hist_entry *__hists__add_mem_entry(struct hists *self,
+					  struct addr_location *al,
+					  struct symbol *sym_parent,
+					  struct mem_info *mi,
+					  u64 weight)
+{
+	struct hist_entry entry = {
+		.thread	= al->thread,
+		.ms = {
+			.map	= al->map,
+			.sym	= al->sym,
+		},
+		.cpu	= al->cpu,
+		.ip	= al->addr,
+		.level	= al->level,
+		.stat = {
+			.period = weight,
+			.nr_events = 1,
+		},
+		.parent = sym_parent,
+		.filtered = symbol__parent_filter(sym_parent),
+		.hists = self,
+		.mem_info = mi,
+		.branch_info = NULL,
+	};
+
+	return add_hist_entry(self, &entry, al, weight);
+}
+
 struct hist_entry *__hists__add_branch_entry(struct hists *self,
 					     struct addr_location *al,
 					     struct symbol *sym_parent,
@@ -346,6 +413,7 @@ struct hist_entry *__hists__add_branch_entry(struct hists *self,
 		.filtered = symbol__parent_filter(sym_parent),
 		.branch_info = bi,
 		.hists	= self,
+		.mem_info = NULL,
 	};
 
 	return add_hist_entry(self, &entry, al, period);
@@ -371,6 +439,8 @@ struct hist_entry *__hists__add_entry(struct hists *self,
 		.parent = sym_parent,
 		.filtered = symbol__parent_filter(sym_parent),
 		.hists	= self,
+		.branch_info = NULL,
+		.mem_info = NULL,
 	};
 
 	return add_hist_entry(self, &entry, al, period);
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 8b091a5..464d9d2 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -49,6 +49,13 @@ enum hist_column {
 	HISTC_DSO_FROM,
 	HISTC_DSO_TO,
 	HISTC_SRCLINE,
+	HISTC_MEM_DADDR_SYMBOL,
+	HISTC_MEM_DADDR_DSO,
+	HISTC_MEM_COST,
+	HISTC_MEM_LOCKED,
+	HISTC_MEM_TLB,
+	HISTC_MEM_LVL,
+	HISTC_MEM_SNOOP,
 	HISTC_NR_COLS, /* Last entry */
 };
 
@@ -86,6 +93,12 @@ struct hist_entry *__hists__add_branch_entry(struct hists *self,
 					     struct branch_info *bi,
 					     u64 period);
 
+struct hist_entry *__hists__add_mem_entry(struct hists *self,
+					  struct addr_location *al,
+					  struct symbol *sym_parent,
+					  struct mem_info *mi,
+					  u64 period);
+
 void hists__output_resort(struct hists *self);
 void hists__output_resort_threaded(struct hists *hists);
 void hists__collapse_resort(struct hists *self);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 9900778..49ca66b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -273,6 +273,42 @@ static void ip__resolve_ams(struct machine *self, struct thread *thread,
 	ams->map = al.map;
 }
 
+static void ip__resolve_data(struct machine *self, struct thread *thread,
+			     u8 m,
+			    struct addr_map_symbol *ams,
+			    u64 addr)
+{
+	struct addr_location al;
+
+	memset(&al, 0, sizeof(al));
+
+	thread__find_addr_location(thread, self, m, MAP__VARIABLE, addr, &al,
+				   NULL);
+	ams->addr = addr;
+	ams->al_addr = al.addr;
+	ams->sym = al.sym;
+	ams->map = al.map;
+}
+
+struct mem_info *machine__resolve_mem(struct machine *self,
+				      struct thread *thr,
+				      struct perf_sample *sample,
+				      u8 cpumode)
+{
+	struct mem_info *mi;
+
+	mi = calloc(1, sizeof(struct mem_info));
+	if (!mi)
+		return NULL;
+
+	ip__resolve_ams(self, thr, &mi->iaddr, sample->ip);
+	ip__resolve_data(self, thr, cpumode, &mi->daddr, sample->addr);
+	mi->cost = sample->weight;
+	mi->dsrc.val = sample->dsrc;
+
+	return mi;
+}
+
 struct branch_info *machine__resolve_bstack(struct machine *self,
 					    struct thread *thr,
 					    struct branch_stack *bs)
@@ -1013,6 +1049,8 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
 	if (sample_type & PERF_SAMPLE_ADDR)
 		printf(" ..... data: 0x%"PRIx64"\n", sample->addr);
 
+	if (sample_type & PERF_SAMPLE_DSRC)
+		printf(" . data_src: 0x%"PRIx64"\n", sample->dsrc);
 }
 
 static struct machine *
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index cea133a..f3ea026 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -69,6 +69,10 @@ int perf_session__resolve_callchain(struct perf_session *self, struct perf_evsel
 				    struct ip_callchain *chain,
 				    struct symbol **parent);
 
+struct mem_info *machine__resolve_mem(struct machine *self,
+				      struct thread *thread,
+				      struct perf_sample *sample, u8 cpumode);
+
 bool perf_session__has_traces(struct perf_session *self, const char *msg);
 
 void mem_bswap_64(void *src, int byte_size);
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index cfd1c0f..8164544 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -182,11 +182,19 @@ static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
 	}
 
 	ret += repsep_snprintf(bf + ret, size - ret, "[%c] ", level);
-	if (sym)
-		ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
-				       width - ret,
-				       sym->name);
-	else {
+	if (sym) {
+		if (map->type == MAP__VARIABLE) {
+			ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
+			ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
+					ip - sym->start);
+			ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+				       width - ret, "");
+		} else {
+			ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
+					       width - ret,
+					       sym->name);
+		}
+	} else {
 		size_t len = BITS_PER_LONG / 4;
 		ret += repsep_snprintf(bf + ret, size - ret, "%-#.*llx",
 				       len, ip);
@@ -469,6 +477,239 @@ static int hist_entry__mispredict_snprintf(struct hist_entry *self, char *bf,
 	return repsep_snprintf(bf, size, "%-*s", width, out);
 }
 
+/* --sort daddr_sym */
+static int64_t
+sort__daddr_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	struct addr_map_symbol *l = &left->mem_info->daddr;
+	struct addr_map_symbol *r = &right->mem_info->daddr;
+
+	return (int64_t)(r->addr - l->addr);
+}
+
+static int hist_entry__daddr_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	return _hist_entry__sym_snprintf(self->mem_info->daddr.map,
+					 self->mem_info->daddr.sym,
+					 self->mem_info->daddr.addr,
+					 self->level, bf, size, width);
+}
+
+static int64_t
+sort__dso_daddr_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	return _sort__dso_cmp(left->mem_info->daddr.map, right->mem_info->daddr.map);
+}
+
+static int hist_entry__dso_daddr_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	return _hist_entry__dso_snprintf(self->mem_info->daddr.map, bf, size,
+					 width);
+}
+
+static int64_t
+sort__cost_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	u64 cost_l = left->mem_info->cost;
+	u64 cost_r = right->mem_info->cost;
+
+	return (int64_t)(cost_r - cost_l);
+}
+
+static int hist_entry__cost_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	if (self->mem_info->cost == 0)
+		return repsep_snprintf(bf, size, "%*s", width, "N/A");
+	return repsep_snprintf(bf, size, "%*"PRIu64, width, self->mem_info->cost);
+}
+
+static int64_t
+sort__locked_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+	union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+	return (int64_t)(dsrc_r.mem_lock - dsrc_l.mem_lock);
+}
+
+static int hist_entry__locked_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	const char *out = "??";
+	u64 mask = self->mem_info->dsrc.mem_lock;
+
+	if (mask & PERF_MEM_LOCK_NA)
+		out = "N/A";
+	else if (mask & PERF_MEM_LOCK_LOCKED)
+		out = "Yes";
+	else
+		out = "No";
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+static int64_t
+sort__tlb_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+	union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+	return (int64_t)(dsrc_r.mem_dtlb - dsrc_l.mem_dtlb);
+}
+
+static const char * const tlb_access[] = {
+	"N/A",
+	"HIT",
+	"MISS",
+	"L1",
+	"L2",
+	"Walker",
+	"Fault",
+};
+#define NUM_TLB_ACCESS (sizeof(tlb_access)/sizeof(const char *))
+
+static int hist_entry__tlb_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	char out[64];
+	size_t sz = sizeof(out) - 1; /* -1 for null termination */
+	size_t l = 0, i;
+	u64 m = self->mem_info->dsrc.mem_dtlb;
+	u64 hit, miss;
+
+	out[0] = '\0';
+
+	hit = m & PERF_MEM_TLB_HIT;
+	miss = m & PERF_MEM_TLB_MISS;
+
+	/* already taken care of */
+	m &= ~(PERF_MEM_TLB_HIT|PERF_MEM_TLB_MISS);
+
+	for (i = 0; m && i < NUM_TLB_ACCESS; i++, m >>= 1) {
+		if (!(m & 0x1))
+			continue;
+		if (l) {
+			strcat(out, " or ");
+			l += 4;
+		}
+		strncat(out, tlb_access[i], sz - l);
+		l += strlen(tlb_access[i]);
+	}
+	if (hit)
+		strncat(out, " hit", sz - l);
+	if (miss)
+		strncat(out, " miss", sz - l);
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+static int64_t
+sort__lvl_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+	union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+	return (int64_t)(dsrc_r.mem_lvl - dsrc_l.mem_lvl);
+}
+
+static const char * const mem_lvl[] = {
+	"N/A",
+	"HIT",
+	"MISS",
+	"L1",
+	"LFB",
+	"L2",
+	"L3",
+	"Local RAM",
+	"Remote RAM (1 hop)",
+	"Remote RAM (2 hops)",
+	"Remote Cache (1 hop)",
+	"Remote Cache (2 hops)",
+	"I/O",
+	"Uncached",
+};
+#define NUM_MEM_LVL (sizeof(mem_lvl)/sizeof(const char *))
+
+static int hist_entry__lvl_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	char out[64];
+	size_t sz = sizeof(out) - 1; /* -1 for null termination */
+	size_t i, l = 0;
+	u64 m = self->mem_info->dsrc.mem_lvl;
+	u64 hit, miss;
+
+	out[0] = '\0';
+
+	hit = m & PERF_MEM_LVL_HIT;
+	miss = m & PERF_MEM_LVL_MISS;
+
+	/* already taken care of */
+	m &= ~(PERF_MEM_LVL_HIT|PERF_MEM_LVL_MISS);
+
+	for (i = 0; m && i < NUM_MEM_LVL; i++, m >>= 1) {
+		if (!(m & 0x1))
+			continue;
+		if (l) {
+			strcat(out, " or ");
+			l += 4;
+		}
+		strncat(out, mem_lvl[i], sz - l);
+		l += strlen(mem_lvl[i]);
+	}
+	if (hit)
+		strncat(out, " hit", sz - l);
+	if (miss)
+		strncat(out, " miss", sz - l);
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
+static int64_t
+sort__snoop_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+	union perf_mem_dsrc dsrc_l = left->mem_info->dsrc;
+	union perf_mem_dsrc dsrc_r = right->mem_info->dsrc;
+
+	return (int64_t)(dsrc_r.mem_snoop - dsrc_l.mem_snoop);
+}
+
+static const char * const snoop_access[] = {
+	"N/A",
+	"None",
+	"Miss",
+	"Hit",
+	"HitM",
+};
+#define NUM_SNOOP_ACCESS (sizeof(snoop_access)/sizeof(const char *))
+
+static int hist_entry__snoop_snprintf(struct hist_entry *self, char *bf,
+				    size_t size, unsigned int width)
+{
+	char out[64];
+	size_t sz = sizeof(out) - 1; /* -1 for null termination */
+	size_t i, l = 0;
+	u64 m = self->mem_info->dsrc.mem_snoop;
+
+	out[0] = '\0';
+
+	for (i = 0; m && i < NUM_SNOOP_ACCESS; i++, m >>= 1) {
+		if (!(m & 0x1))
+			continue;
+		if (l) {
+			strcat(out, " or ");
+			l += 4;
+		}
+		strncat(out, snoop_access[i], sz - l);
+		l += strlen(snoop_access[i]);
+	}
+
+	return repsep_snprintf(bf, size, "%-*s", width, out);
+}
+
 struct sort_entry sort_mispredict = {
 	.se_header	= "Branch Mispredicted",
 	.se_cmp		= sort__mispredict_cmp,
@@ -476,6 +717,56 @@ struct sort_entry sort_mispredict = {
 	.se_width_idx	= HISTC_MISPREDICT,
 };
 
+struct sort_entry sort_mem_daddr_sym = {
+	.se_header	= "Data Symbol",
+	.se_cmp		= sort__daddr_cmp,
+	.se_snprintf	= hist_entry__daddr_snprintf,
+	.se_width_idx	= HISTC_MEM_DADDR_SYMBOL,
+};
+
+struct sort_entry sort_mem_daddr_dso = {
+	.se_header	= "Data Object",
+	.se_cmp		= sort__dso_daddr_cmp,
+	.se_snprintf	= hist_entry__dso_daddr_snprintf,
+	.se_width_idx	= HISTC_MEM_DADDR_DSO,
+};
+
+struct sort_entry sort_mem_cost = {
+	.se_header	= "Cost",
+	.se_cmp		= sort__cost_cmp,
+	.se_snprintf	= hist_entry__cost_snprintf,
+	.se_width_idx	= HISTC_MEM_COST,
+};
+
+struct sort_entry sort_mem_locked = {
+	.se_header	= "Locked",
+	.se_cmp		= sort__locked_cmp,
+	.se_snprintf	= hist_entry__locked_snprintf,
+	.se_width_idx	= HISTC_MEM_LOCKED,
+};
+
+struct sort_entry sort_mem_tlb = {
+	.se_header	= "TLB access",
+	.se_cmp		= sort__tlb_cmp,
+	.se_snprintf	= hist_entry__tlb_snprintf,
+	.se_width_idx	= HISTC_MEM_TLB,
+};
+
+struct sort_entry sort_mem_lvl = {
+	.se_header	= "Memory access",
+	.se_cmp		= sort__lvl_cmp,
+	.se_snprintf	= hist_entry__lvl_snprintf,
+	.se_width_idx	= HISTC_MEM_LVL,
+};
+
+struct sort_entry sort_mem_snoop = {
+	.se_header	= "Snoop",
+	.se_cmp		= sort__snoop_cmp,
+	.se_snprintf	= hist_entry__snoop_snprintf,
+	.se_width_idx	= HISTC_MEM_SNOOP,
+};
+
+
 struct sort_dimension {
 	const char		*name;
 	struct sort_entry	*entry;
@@ -497,6 +788,13 @@ static struct sort_dimension sort_dimensions[] = {
 	DIM(SORT_CPU, "cpu", sort_cpu),
 	DIM(SORT_MISPREDICT, "mispredict", sort_mispredict),
 	DIM(SORT_SRCLINE, "srcline", sort_srcline),
+	DIM(SORT_MEM_DADDR_SYMBOL, "symbol_daddr", sort_mem_daddr_sym),
+	DIM(SORT_MEM_DADDR_DSO, "dso_daddr", sort_mem_daddr_dso),
+	DIM(SORT_MEM_COST, "cost", sort_mem_cost),
+	DIM(SORT_MEM_LOCKED, "locked", sort_mem_locked),
+	DIM(SORT_MEM_TLB, "tlb", sort_mem_tlb),
+	DIM(SORT_MEM_LVL, "mem", sort_mem_lvl),
+	DIM(SORT_MEM_SNOOP, "snoop", sort_mem_snoop),
 };
 
 int sort_dimension__add(const char *tok)
@@ -520,7 +818,8 @@ int sort_dimension__add(const char *tok)
 			sort__has_parent = 1;
 		} else if (sd->entry == &sort_sym ||
 			   sd->entry == &sort_sym_from ||
-			   sd->entry == &sort_sym_to) {
+			   sd->entry == &sort_sym_to ||
+			   sd->entry == &sort_mem_daddr_sym) {
 			sort__has_sym = 1;
 		}
 
@@ -553,6 +852,20 @@ int sort_dimension__add(const char *tok)
 				sort__first_dimension = SORT_DSO_TO;
 			else if (!strcmp(sd->name, "mispredict"))
 				sort__first_dimension = SORT_MISPREDICT;
+			else if (!strcmp(sd->name, "symbol_daddr"))
+				sort__first_dimension = SORT_MEM_DADDR_SYMBOL;
+			else if (!strcmp(sd->name, "dso_daddr"))
+				sort__first_dimension = SORT_MEM_DADDR_DSO;
+			else if (!strcmp(sd->name, "cost"))
+				sort__first_dimension = SORT_MEM_COST;
+			else if (!strcmp(sd->name, "locked"))
+				sort__first_dimension = SORT_MEM_LOCKED;
+			else if (!strcmp(sd->name, "tlb"))
+				sort__first_dimension = SORT_MEM_TLB;
+			else if (!strcmp(sd->name, "mem"))
+				sort__first_dimension = SORT_MEM_LVL;
+			else if (!strcmp(sd->name, "snoop"))
+				sort__first_dimension = SORT_MEM_SNOOP;
 		}
 
 		list_add_tail(&sd->entry->list, &hist_entry__sort_list);
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index b4e8c3b..adee5eb 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -103,7 +103,8 @@ struct hist_entry {
 	struct rb_root		sorted_chain;
 	struct branch_info	*branch_info;
 	struct hists		*hists;
-	struct callchain_root	callchain[0];
+	struct mem_info		*mem_info;
+	struct callchain_root	callchain[0]; /* must be last member */
 };
 
 static inline bool hist_entry__has_pairs(struct hist_entry *he)
@@ -137,6 +138,13 @@ enum sort_type {
 	SORT_SYM_TO,
 	SORT_MISPREDICT,
 	SORT_SRCLINE,
+	SORT_MEM_DADDR_SYMBOL,
+	SORT_MEM_DADDR_DSO,
+	SORT_MEM_COST,
+	SORT_MEM_LOCKED,
+	SORT_MEM_TLB,
+	SORT_MEM_LVL,
+	SORT_MEM_SNOOP,
 };
 
 /*
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index de68f98..fdf6371 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -152,6 +152,13 @@ struct branch_info {
 	struct branch_flags flags;
 };
 
+struct mem_info {
+	struct addr_map_symbol iaddr;
+	struct addr_map_symbol daddr;
+	u64	cost;
+	union perf_mem_dsrc dsrc;
+};
+
 struct addr_location {
 	struct thread *thread;
 	struct map    *map;
-- 
1.7.9.5



* [PATCH v6 12/18] perf report: add support for mem access profiling
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (10 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 11/18] perf tools: add mem access sampling core support Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 13/18] perf record: " Stephane Eranian
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This patch adds the --mem-mode option to perf report.

This mode requires a perf.data file created with memory
access samples.
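
As a rough illustration of how the mem-mode report weights entries, the sketch
below computes an entry's overhead as its cost divided by the total cost. This
reading is an assumption, but it is consistent with the sample output in the
cover letter (cost 986 out of a total cost of 1013994 shown as 0.10%). Both
helper names are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: a sample that carries no cost is given a
 * weight of 1, as perf_report__add_mem_hist_entry() does.
 */
uint64_t sketch_mem_sample_weight(uint64_t cost)
{
	return cost ? cost : 1;
}

/*
 * Hypothetical helper: share of the total cost attributed to one
 * histogram entry, expressed as a percentage.
 */
double sketch_mem_overhead_percent(uint64_t entry_cost, uint64_t total_cost)
{
	if (total_cost == 0)
		return 0.0;
	return 100.0 * (double)entry_cost / (double)total_cost;
}
```

With the cover-letter numbers, sketch_mem_overhead_percent(986, 1013994) is a
little under 0.1, which the report rounds to 0.10%.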

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/builtin-report.c |  131 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 127 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index fc25100..231165a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -46,6 +46,7 @@ struct perf_report {
 	bool			show_full_info;
 	bool			show_threads;
 	bool			inverted_callchain;
+	bool			mem_mode;
 	struct perf_read_values	show_threads_values;
 	const char		*pretty_printing_style;
 	symbol_filter_t		annotate_init;
@@ -54,6 +55,96 @@ struct perf_report {
 	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
 };
 
+static int perf_report__add_mem_hist_entry(struct perf_tool *tool,
+					   struct addr_location *al,
+					   struct perf_sample *sample,
+					   struct perf_evsel *evsel,
+					   struct machine *machine,
+					   union perf_event *event)
+{
+	struct perf_report *rep = container_of(tool, struct perf_report, tool);
+	struct symbol *parent = NULL;
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	int err = 0;
+	struct hist_entry *he;
+	struct mem_info *mi, *mx;
+	uint64_t cost;
+
+	if ((sort__has_parent || symbol_conf.use_callchain)
+	    && sample->callchain) {
+		err = machine__resolve_callchain(machine, evsel, al->thread,
+						 sample, &parent);
+		if (err)
+			return err;
+	}
+
+	mi = machine__resolve_mem(machine, al->thread, sample, cpumode);
+	if (!mi)
+		return -ENOMEM;
+
+	if (rep->hide_unresolved && !al->sym)
+		return 0;
+
+	cost = mi->cost;
+	if (!cost)
+		cost = 1;
+
+	/*
+	 * The report shows the percentage of the total access cost
+	 * and not events sampled. Thus we use the cost as the period.
+	 * Only in the newt browser we are doing integrated annotation,
+	 * so we don't allocate the extra space needed because the stdio
+	 * code will not use it.
+	 */
+	he = __hists__add_mem_entry(&evsel->hists, al, parent, mi,
+				    cost);
+	if (!he)
+		return -ENOMEM;
+
+	if (sort__has_sym && he->ms.sym && use_browser > 0) {
+		struct annotation *notes = symbol__annotation(he->ms.sym);
+
+		assert(evsel != NULL);
+
+		if (notes->src == NULL && symbol__alloc_hist(he->ms.sym) < 0)
+			goto out;
+
+		err = hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
+		if (err)
+			goto out;
+	}
+
+	if (sort__has_sym && he->mem_info->daddr.sym && use_browser > 0) {
+		struct annotation *notes;
+
+		mx = he->mem_info;
+
+		notes = symbol__annotation(mx->daddr.sym);
+		if (!notes->src
+		    && symbol__alloc_hist(mx->daddr.sym) < 0)
+			goto out;
+
+		err = symbol__inc_addr_samples(mx->daddr.sym,
+					       mx->daddr.map,
+					       evsel->idx,
+					       mx->daddr.al_addr);
+		if (err)
+			goto out;
+	}
+
+	evsel->hists.stats.total_period += sample->period;
+	hists__inc_nr_events(&evsel->hists, PERF_RECORD_SAMPLE);
+	err = 0;
+
+	if (symbol_conf.use_callchain) {
+		err = callchain_append(he->callchain,
+				       &callchain_cursor,
+				       sample->period);
+	}
+out:
+	return err;
+}
+
 static int perf_report__add_branch_hist_entry(struct perf_tool *tool,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -209,6 +300,12 @@ static int process_sample_event(struct perf_tool *tool,
 			pr_debug("problem adding lbr entry, skipping event\n");
 			return -1;
 		}
+	} else if (rep->mem_mode == 1) {
+		if (perf_report__add_mem_hist_entry(tool, &al, sample,
+						    evsel, machine, event)) {
+			pr_debug("problem adding mem entry, skipping event\n");
+			return -1;
+		}
 	} else {
 		if (al.map != NULL)
 			al.map->dso->hit = 1;
@@ -292,7 +389,8 @@ static void sig_handler(int sig __maybe_unused)
 	session_done = 1;
 }
 
-static size_t hists__fprintf_nr_sample_events(struct hists *self,
+static size_t hists__fprintf_nr_sample_events(struct perf_report *rep,
+					      struct hists *self,
 					      const char *evname, FILE *fp)
 {
 	size_t ret;
@@ -305,7 +403,11 @@ static size_t hists__fprintf_nr_sample_events(struct hists *self,
 	if (evname != NULL)
 		ret += fprintf(fp, " of event '%s'", evname);
 
-	ret += fprintf(fp, "\n# Event count (approx.): %" PRIu64, nr_events);
+	if (rep->mem_mode) {
+		ret += fprintf(fp, "\n# Total cost : %" PRIu64, nr_events);
+		ret += fprintf(fp, "\n# Sort order : %s", sort_order);
+	} else
+		ret += fprintf(fp, "\n# Event count (approx.): %" PRIu64, nr_events);
 	return ret + fprintf(fp, "\n#\n");
 }
 
@@ -319,7 +421,7 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist *evlist,
 		struct hists *hists = &pos->hists;
 		const char *evname = perf_evsel__name(pos);
 
-		hists__fprintf_nr_sample_events(hists, evname, stdout);
+		hists__fprintf_nr_sample_events(rep, hists, evname, stdout);
 		hists__fprintf(hists, true, 0, 0, stdout);
 		fprintf(stdout, "\n\n");
 	}
@@ -596,7 +698,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 		    "Use the stdio interface"),
 	OPT_STRING('s', "sort", &sort_order, "key[,key2...]",
 		   "sort by key(s): pid, comm, dso, symbol, parent, dso_to,"
-		   " dso_from, symbol_to, symbol_from, mispredict"),
+		   " dso_from, symbol_to, symbol_from, mispredict, mem, cost, symbol_daddr, dso_daddr, tlb, snoop, locked"),
 	OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
 		    "Show sample percentage for different cpu modes"),
 	OPT_STRING('p', "parent", &parent_pattern, "regex",
@@ -642,6 +744,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 		    "use branch records for histogram filling", parse_branch_mode),
 	OPT_STRING(0, "objdump", &objdump_path, "path",
 		   "objdump binary to use for disassembly and annotations"),
+	OPT_BOOLEAN(0, "mem-mode", &report.mem_mode, "mem access profile"),
 	OPT_END()
 	};
 
@@ -687,6 +790,18 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 				     "dso_to,symbol_to";
 
 	}
+	if (report.mem_mode) {
+		if (sort__branch_mode == 1) {
+			fprintf(stderr, "branch and mem mode incompatible\n");
+			goto error;
+		}
+		/*
+		 * if no sort_order is provided, then specify
+		 * mem-mode specific order
+		 */
+		if (sort_order == default_sort_order)
+			sort_order = "cost,mem,sym,dso,symbol_daddr,dso_daddr,snoop,tlb,locked";
+	}
 
 	if (strcmp(input_name, "-") != 0)
 		setup_browser(true);
@@ -758,6 +873,14 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 		sort_entry__setup_elide(&sort_sym_from, symbol_conf.sym_from_list, "sym_from", stdout);
 		sort_entry__setup_elide(&sort_sym_to, symbol_conf.sym_to_list, "sym_to", stdout);
 	} else {
+		if (report.mem_mode) {
+			sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "symbol_daddr", stdout);
+			sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "dso_daddr", stdout);
+			sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "mem", stdout);
+			sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "cost", stdout);
+			sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "tlb", stdout);
+			sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "snoop", stdout);
+		}
 		sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list, "dso", stdout);
 		sort_entry__setup_elide(&sort_sym, symbol_conf.sym_list, "symbol", stdout);
 	}
-- 
1.7.9.5



* [PATCH v6 13/18] perf record: add support for mem access profiling
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (11 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 12/18] perf report: add support for mem access profiling Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 14/18] perf tools: add new mem command for memory " Stephane Eranian
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

Add the -l option to perf record to enable
access cost sampling.

Data address sampling is obtained via the -d option.
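
The mapping from these two options to sample_type bits can be sketched as
follows. This is only an illustration: the struct, the function and the bit
values are placeholders invented for the example, while the real bits are
PERF_SAMPLE_WEIGHT and PERF_SAMPLE_DSRC from the perf_event header of this
series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; not the real UAPI values. */
#define SKETCH_SAMPLE_WEIGHT	(1ULL << 0)
#define SKETCH_SAMPLE_DSRC	(1ULL << 1)

/* Hypothetical stand-in for the two relevant perf_record_opts fields. */
struct sketch_record_opts {
	bool weight;		/* set by -l */
	bool sample_address;	/* set by -d */
};

/* Return the sample_type bits this patch adds for the given options. */
uint64_t sketch_mem_sample_type(const struct sketch_record_opts *opts)
{
	uint64_t sample_type = 0;

	if (opts->weight)
		sample_type |= SKETCH_SAMPLE_WEIGHT;
	if (opts->sample_address)
		sample_type |= SKETCH_SAMPLE_DSRC;

	return sample_type;
}
```

Tying the data source to -d means any session that records data addresses also
records where the data came from, without a separate option.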

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/builtin-record.c |    2 ++
 tools/perf/perf.h           |    1 +
 tools/perf/util/evsel.c     |    6 ++++++
 3 files changed, 9 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f3151d3..341f9c1 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1059,6 +1059,8 @@ const struct option record_options[] = {
 	OPT_CALLBACK('j', "branch-filter", &record.opts.branch_stack,
 		     "branch filter mask", "branch stack filter modes",
 		     parse_branch_stack),
+	OPT_BOOLEAN('l', "cost", &record.opts.weight,
+		    "event cost"),
 	OPT_END()
 };
 
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 2c340e7..c976553 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -240,6 +240,7 @@ struct perf_record_opts {
 	bool	     sample_id_all_missing;
 	bool	     exclude_guest_missing;
 	bool	     period;
+	bool         weight;
 	unsigned int freq;
 	unsigned int mmap_pages;
 	unsigned int user_freq;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 9f334dd..ad0c649 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -501,6 +501,12 @@ void perf_evsel__config(struct perf_evsel *evsel,
 		attr->sample_type	|= PERF_SAMPLE_CPU;
 	}
 
+	if (opts->weight)
+		attr->sample_type	|= PERF_SAMPLE_WEIGHT;
+
+	if (opts->sample_address)
+		attr->sample_type	|= PERF_SAMPLE_DSRC;
+
 	if (opts->no_delay) {
 		attr->watermark = 0;
 		attr->wakeup_events = 1;
-- 
1.7.9.5



* [PATCH v6 14/18] perf tools: add new mem command for memory access profiling
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (12 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 13/18] perf record: " Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 15/18] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP Stephane Eranian
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

This new command is a wrapper on top of perf record and
perf report that makes them easier to configure for memory
access profiling.

To record loads:
$ perf mem -t load rec .....

To record stores:
$ perf mem -t store rec .....

To get the report:
$ perf mem -t load rep

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/Documentation/perf-mem.txt |   48 +++++++
 tools/perf/Makefile                   |    1 +
 tools/perf/builtin-mem.c              |  242 +++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                  |    1 +
 tools/perf/command-list.txt           |    1 +
 tools/perf/perf.c                     |    1 +
 tools/perf/util/hist.c                |    1 +
 7 files changed, 295 insertions(+)
 create mode 100644 tools/perf/Documentation/perf-mem.txt
 create mode 100644 tools/perf/builtin-mem.c

diff --git a/tools/perf/Documentation/perf-mem.txt b/tools/perf/Documentation/perf-mem.txt
new file mode 100644
index 0000000..888d511
--- /dev/null
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -0,0 +1,48 @@
+perf-mem(1)
+===========
+
+NAME
+----
+perf-mem - Profile memory accesses
+
+SYNOPSIS
+--------
+[verse]
+'perf mem' [<options>] (record [<command>] | report)
+
+DESCRIPTION
+-----------
+"perf mem -t <TYPE> record" runs a command and gathers memory operation data
+from it, into perf.data. Perf record options are accepted and are passed through.
+
+"perf mem -t <TYPE> report" displays the result. It invokes perf report with the
+right set of options to display a memory access profile.
+
+OPTIONS
+-------
+<command>...::
+	Any command you can specify in a shell.
+
+-t::
+--type=::
+	Select the memory operation type: load or store (default: load)
+
+-D::
+--dump-raw-samples=::
+	Dump the raw decoded samples on the screen in a format that is easy to parse with
+	one sample per line.
+
+-x::
+--field-separator::
+	Specify the field separator used when dumping raw samples (-D option). By default,
+	the separator is the space character.
+
+-C::
+--cpu-list::
+	Restrict dump of raw samples to the CPUs provided via this option. Note that the same
+	option can be passed in record mode. It will be interpreted the same way as in perf
+	record.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-report[1]
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 994f0f6..be8f4697 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -505,6 +505,7 @@ BUILTIN_OBJS += $(OUTPUT)builtin-lock.o
 BUILTIN_OBJS += $(OUTPUT)builtin-kvm.o
 BUILTIN_OBJS += $(OUTPUT)builtin-inject.o
 BUILTIN_OBJS += $(OUTPUT)tests/builtin-test.o
+BUILTIN_OBJS += $(OUTPUT)builtin-mem.o
 
 PERFLIBS = $(LIB_FILE) $(LIBTRACEEVENT)
 
diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
new file mode 100644
index 0000000..b5f593f
--- /dev/null
+++ b/tools/perf/builtin-mem.c
@@ -0,0 +1,242 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+#include "util/tool.h"
+#include "util/session.h"
+
+#define MEM_OPERATION_LOAD	"load"
+#define MEM_OPERATION_STORE	"store"
+
+static const char	*mem_operation		= MEM_OPERATION_LOAD;
+
+struct perf_mem {
+	struct perf_tool	tool;
+	char const		*input_name;
+	symbol_filter_t		annotate_init;
+	bool			hide_unresolved;
+	bool			dump_raw;
+	const char		*cpu_list;
+	DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
+};
+
+static const char * const mem_usage[] = {
+	"perf mem [<options>] {record <command> |report}",
+	NULL
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+	int rec_argc, i = 0, j;
+	const char **rec_argv;
+	char event[64];
+	int ret;
+
+	rec_argc = argc + 4;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+	if (!rec_argv)
+		return -1;
+
+	rec_argv[i++] = strdup("record");
+	if (!strcmp(mem_operation, MEM_OPERATION_LOAD))
+		rec_argv[i++] = strdup("-l");
+	rec_argv[i++] = strdup("-d");
+	rec_argv[i++] = strdup("-e");
+
+	if (strcmp(mem_operation, MEM_OPERATION_LOAD))
+		sprintf(event, "cpu/mem-stores/pp");
+	else
+		sprintf(event, "cpu/mem-loads/pp");
+
+	rec_argv[i++] = strdup(event);
+	for (j = 1; j < argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	ret = cmd_record(i, rec_argv, NULL);
+	free(rec_argv);
+	return ret;
+}
+
+static int
+dump_raw_samples(struct perf_tool *tool,
+		 union perf_event *event,
+		 struct perf_sample *sample,
+		 struct perf_evsel *evsel __maybe_unused,
+		 struct machine *machine)
+{
+	struct perf_mem *mem = container_of(tool, struct perf_mem, tool);
+	struct addr_location al;
+	const char *fmt;
+
+	if (perf_event__preprocess_sample(event, machine, &al, sample,
+				mem->annotate_init) < 0) {
+		fprintf(stderr, "problem processing %d event, skipping it.\n",
+				event->header.type);
+		return -1;
+	}
+
+	if (al.filtered || (mem->hide_unresolved && al.sym == NULL))
+		return 0;
+
+	if (al.map != NULL)
+		al.map->dso->hit = 1;
+
+	if (symbol_conf.field_sep) {
+		fmt = "%d%s%d%s0x%"PRIx64"%s0x%"PRIx64"%s%"PRIu64
+		      "%s0x%"PRIx64"%s%s:%s\n";
+	} else {
+		fmt = "%5d%s%5d%s0x%016"PRIx64"%s0x%016"PRIx64
+		      "%s%5"PRIu64"%s0x%06"PRIx64"%s%s:%s\n";
+		symbol_conf.field_sep = " ";
+	}
+
+	printf(fmt,
+		sample->pid,
+		symbol_conf.field_sep,
+		sample->tid,
+		symbol_conf.field_sep,
+		event->ip.ip,
+		symbol_conf.field_sep,
+		sample->addr,
+		symbol_conf.field_sep,
+		sample->weight,
+		symbol_conf.field_sep,
+		sample->dsrc,
+		symbol_conf.field_sep,
+		al.map ? (al.map->dso ? al.map->dso->long_name : "???") : "???",
+		al.sym ? al.sym->name : "???");
+
+	return 0;
+}
+
+static int process_sample_event(struct perf_tool *tool,
+				union perf_event *event,
+				struct perf_sample *sample,
+				struct perf_evsel *evsel,
+				struct machine *machine)
+{
+	return dump_raw_samples(tool, event, sample, evsel, machine);
+}
+
+static int report_raw_events(struct perf_mem *mem)
+{
+	int err = -EINVAL;
+	int ret;
+	struct perf_session *session = perf_session__new(input_name, O_RDONLY,
+							 0, false, &mem->tool);
+
+	if (session == NULL)
+		return -ENOMEM;
+
+	if (mem->cpu_list) {
+		ret = perf_session__cpu_bitmap(session, mem->cpu_list,
+					       mem->cpu_bitmap);
+		if (ret)
+			goto out_delete;
+	}
+
+	if (symbol__init() < 0)
+		goto out_delete;
+
+	printf("# PID, TID, IP, ADDR, COST, DSRC, SYMBOL\n");
+
+	err = perf_session__process_events(session, &mem->tool);
+	if (err)
+		return err;
+
+	return 0;
+
+out_delete:
+	perf_session__delete(session);
+	return err;
+}
+
+static int report_events(int argc, const char **argv, struct perf_mem *mem)
+{
+	const char **rep_argv;
+	int ret, i = 0, j, rep_argc;
+
+	if (mem->dump_raw)
+		return report_raw_events(mem);
+
+	rep_argc = argc + 3;
+	rep_argv = calloc(rep_argc + 1, sizeof(char *));
+	if (!rep_argv)
+		return -1;
+
+	rep_argv[i++] = strdup("report");
+	rep_argv[i++] = strdup("--mem-mode");
+	rep_argv[i++] = strdup("-n"); /* display number of samples */
+
+	/*
+	 * there is no cost associated with stores, so don't print
+	 * the column
+	 */
+	if (strcmp(mem_operation, MEM_OPERATION_LOAD))
+		rep_argv[i++] = strdup("--sort=mem,sym,dso,symbol_daddr,"
+				       "dso_daddr,tlb,locked");
+
+	for (j = 1; j < argc; j++, i++)
+		rep_argv[i] = argv[j];
+
+	ret = cmd_report(i, rep_argv, NULL);
+	free(rep_argv);
+	return ret;
+}
+
+int cmd_mem(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	struct stat st;
+	struct perf_mem mem = {
+		.tool = {
+			.sample		= process_sample_event,
+			.mmap		= perf_event__process_mmap,
+			.comm		= perf_event__process_comm,
+			.lost		= perf_event__process_lost,
+			.fork		= perf_event__process_fork,
+			.build_id	= perf_event__process_build_id,
+			.ordered_samples = true,
+		},
+		.input_name		 = "perf.data",
+	};
+	const struct option mem_options[] = {
+		OPT_STRING('t', "type", &mem_operation,
+			   "type", "memory operations(load/store)"),
+		OPT_BOOLEAN('D', "dump-raw-samples", &mem.dump_raw,
+			    "dump raw samples in ASCII"),
+		OPT_BOOLEAN('U', "hide-unresolved", &mem.hide_unresolved,
+			    "Only display entries resolved to a symbol"),
+		OPT_STRING('i', "input", &input_name, "file",
+			   "input file name"),
+		OPT_STRING('C', "cpu", &mem.cpu_list, "cpu",
+			   "list of cpus to profile"),
+		OPT_STRING('x', "field-separator", &symbol_conf.field_sep,
+			   "separator",
+			   "separator for columns, no spaces will be added"
+			   " between columns '.' is reserved."),
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, mem_options, mem_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+
+	if (!argc || !(strncmp(argv[0], "rec", 3) || mem_operation))
+		usage_with_options(mem_usage, mem_options);
+
+	if (!mem.input_name || !strlen(mem.input_name)) {
+		if (!fstat(STDIN_FILENO, &st) && S_ISFIFO(st.st_mode))
+			mem.input_name = "-";
+		else
+			mem.input_name = "perf.data";
+	}
+
+	if (!strncmp(argv[0], "rec", 3))
+		return __cmd_record(argc, argv);
+	else if (!strncmp(argv[0], "rep", 3))
+		return report_events(argc, argv, &mem);
+	else
+		usage_with_options(mem_usage, mem_options);
+
+	return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index 08143bd..b210d62 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -36,6 +36,7 @@ extern int cmd_kvm(int argc, const char **argv, const char *prefix);
 extern int cmd_test(int argc, const char **argv, const char *prefix);
 extern int cmd_trace(int argc, const char **argv, const char *prefix);
 extern int cmd_inject(int argc, const char **argv, const char *prefix);
+extern int cmd_mem(int argc, const char **argv, const char *prefix);
 
 extern int find_scripts(char **scripts_array, char **scripts_path_array);
 #endif
diff --git a/tools/perf/command-list.txt b/tools/perf/command-list.txt
index 3e86bbd..2c5b621 100644
--- a/tools/perf/command-list.txt
+++ b/tools/perf/command-list.txt
@@ -24,3 +24,4 @@ perf-kmem			mainporcelain common
 perf-lock			mainporcelain common
 perf-kvm			mainporcelain common
 perf-test			mainporcelain common
+perf-mem			mainporcelain common
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index 0f661fb..682340e 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -60,6 +60,7 @@ static struct cmd_struct commands[] = {
 	{ "trace",	cmd_trace,	0 },
 #endif
 	{ "inject",	cmd_inject,	0 },
+	{ "mem",	cmd_mem,	0 },
 };
 
 struct pager_config {
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index b5259c9..fc05b9f 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -483,6 +483,7 @@ hist_entry__collapse(struct hist_entry *left, struct hist_entry *right)
 void hist_entry__free(struct hist_entry *he)
 {
 	free(he->branch_info);
+	free(he->mem_info);
 	free(he);
 }
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v6 15/18] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (13 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 14/18] perf tools: add new mem command for memory " Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-18 23:25   ` Andi Kleen
  2013-01-15 15:39 ` [PATCH v6 16/18] perf tools: detect data vs. text mappings Stephane Eranian
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

The type of mapping was lost, which made it hard for a tool
to distinguish code vs. data mmaps. Perf has the ability
to distinguish the two.

Use a bit in the header->misc bitmask to keep track of
the mmap type. If PERF_RECORD_MISC_MMAP_DATA is set then
the mapping is not executable (!VM_EXEC). If not set, then
the mapping is executable.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 include/uapi/linux/perf_event.h |    1 +
 kernel/events/core.c            |    3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 8283218..5b80f14 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -445,6 +445,7 @@ struct perf_event_mmap_page {
 #define PERF_RECORD_MISC_GUEST_KERNEL		(4 << 0)
 #define PERF_RECORD_MISC_GUEST_USER		(5 << 0)
 
+#define PERF_RECORD_MISC_MMAP_DATA		(1 << 13)
 /*
  * Indicates that the content of PERF_SAMPLE_IP points to
  * the actual instruction that triggered the event. See also
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fd4ceea..8292f15 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4764,6 +4764,9 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
 	mmap_event->file_name = name;
 	mmap_event->file_size = size;
 
+	if (!(vma->vm_flags & VM_EXEC))
+		mmap_event->event_id.header.misc |= PERF_RECORD_MISC_MMAP_DATA;
+
 	mmap_event->event_id.header.size = sizeof(mmap_event->event_id) + size;
 
 	rcu_read_lock();
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v6 16/18] perf tools: detect data vs. text mappings
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (14 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 15/18] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 17/18] perf tools: Ignore ABS symbols when loading data maps Stephane Eranian
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim

Leverages the PERF_RECORD_MISC_MMAP_DATA bit in
the RECORD_MMAP record header. When the bit is set
then the mapping type is set to MAP__VARIABLE.

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 tools/perf/util/machine.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 1f09d05..d1c3e48 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -379,6 +379,7 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
 	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
 	struct thread *thread;
 	struct map *map;
+	enum map_type type;
 	int ret = 0;
 
 	if (dump_trace)
@@ -395,10 +396,17 @@ int machine__process_mmap_event(struct machine *machine, union perf_event *event
 	thread = machine__findnew_thread(machine, event->mmap.pid);
 	if (thread == NULL)
 		goto out_problem;
+
+	if (event->header.misc & PERF_RECORD_MISC_MMAP_DATA)
+		type = MAP__VARIABLE;
+	else
+		type = MAP__FUNCTION;
+
 	map = map__new(&machine->user_dsos, event->mmap.start,
 			event->mmap.len, event->mmap.pgoff,
 			event->mmap.pid, event->mmap.filename,
-			MAP__FUNCTION);
+			type);
+
 	if (map == NULL)
 		goto out_problem;
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v6 17/18] perf tools: Ignore ABS symbols when loading data maps
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (15 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 16/18] perf tools: detect data vs. text mappings Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-15 15:39 ` [PATCH v6 18/18] perf tools: Fix output of symbol_daddr offset Stephane Eranian
  2013-01-24 11:56 ` [PATCH v6 00/18] perf: add memory access sampling support Ingo Molnar
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim, Namhyung Kim

From: Namhyung Kim <namhyung.kim@lge.com>

When loading symbols in a data mapping, ABS symbols (which have a value
of SHN_ABS in their st_shndx) fail at elf_getscn().  That marks the whole
load as a failure, so already-loaded symbols cannot be fixed up.

I'm not sure what should be done. Just ignore them for now. :)
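
As an illustration only (not part of this patch), the skip can be sketched
on a simplified symbol table. The struct and the names ex_sym,
EX_SHN_ABS and ex_count_loadable are hypothetical stand-ins; 0xfff1 is
the SHN_ABS value from the ELF specification.

```c
#include <stddef.h>
#include <stdint.h>

/* SHN_ABS from the ELF specification, renamed to avoid clashing with <elf.h>. */
#define EX_SHN_ABS 0xfff1

/* Hypothetical, reduced symbol entry: only the fields the sketch needs. */
struct ex_sym {
	const char *name;
	uint16_t st_shndx;	/* section index the symbol is defined in */
};

/*
 * Count the symbols that would go on to the section lookup.
 * ABS symbols have no section, so they are skipped instead of
 * being treated as a fatal lookup failure.
 */
static size_t ex_count_loadable(const struct ex_sym *syms, size_t n)
{
	size_t i, kept = 0;

	for (i = 0; i < n; i++) {
		if (syms[i].st_shndx == EX_SHN_ABS)
			continue;
		kept++;
	}
	return kept;
}
```

With a table holding one ABS symbol and two section-relative symbols,
ex_count_loadable() returns 2 and the remaining symbols still get
processed.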

Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/symbol-elf.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
index db0cc92..00cf128 100644
--- a/tools/perf/util/symbol-elf.c
+++ b/tools/perf/util/symbol-elf.c
@@ -719,6 +719,9 @@ int dso__load_sym(struct dso *dso, struct map *map,
 			used_opd = true;
 		}
 
+		if (sym.st_shndx == SHN_ABS)
+			continue;
+
 		sec = elf_getscn(runtime_ss->elf, sym.st_shndx);
 		if (!sec)
 			goto out_elf_end;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v6 18/18] perf tools: Fix output of symbol_daddr offset
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (16 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 17/18] perf tools: Ignore ABS symbols when loading data maps Stephane Eranian
@ 2013-01-15 15:39 ` Stephane Eranian
  2013-01-24 11:56 ` [PATCH v6 00/18] perf: add memory access sampling support Ingo Molnar
  18 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-15 15:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, ak, acme, jolsa, namhyung.kim, Namhyung Kim

From: Namhyung Kim <namhyung.kim@lge.com>

The symbol addresses in a dso are relative offsets from the start of
a mapping.  So in order to output the correct offset value from @ip, one of
them has to be converted.
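
A minimal sketch of the arithmetic, for illustration only. The structs
are reduced stand-ins, and ex_unmap_ip() assumes the conversion is
"relative address + map start - page offset"; the real map->unmap_ip
callback may differ per map.

```c
#include <stdint.h>

typedef uint64_t u64;

/* Reduced stand-ins for the perf structures (assumption, not the real layout). */
struct ex_map {
	u64 start;	/* absolute address where the mapping begins */
	u64 pgoff;	/* file offset of the mapping */
};

struct ex_symbol {
	u64 start;	/* map-relative address of the symbol */
};

/* Assumed conversion from a map-relative address to an absolute one. */
static u64 ex_unmap_ip(const struct ex_map *map, u64 rip)
{
	return rip + map->start - map->pgoff;
}

/* Offset of an absolute data address inside the symbol that contains it. */
static u64 ex_daddr_offset(const struct ex_map *map,
			   const struct ex_symbol *sym, u64 ip)
{
	return ip - ex_unmap_ip(map, sym->start);
}
```

For a mapping at 0x7f0000000000 with pgoff 0 and a symbol at relative
address 0x1000, the absolute address 0x7f0000001038 gives offset 0x38.
Subtracting sym->start directly would give a meaningless huge value.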

Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/sort.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 8164544..d9cb158 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -186,7 +186,7 @@ static int _hist_entry__sym_snprintf(struct map *map, struct symbol *sym,
 		if (map->type == MAP__VARIABLE) {
 			ret += repsep_snprintf(bf + ret, size - ret, "%s", sym->name);
 			ret += repsep_snprintf(bf + ret, size - ret, "+0x%llx",
-					ip - sym->start);
+					ip - map->unmap_ip(map, sym->start));
 			ret += repsep_snprintf(bf + ret, size - ret, "%-*s",
 				       width - ret, "");
 		} else {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 02/18] perf/x86: improve sysfs event mapping with event string
  2013-01-15 15:39 ` [PATCH v6 02/18] perf/x86: improve sysfs event mapping with event string Stephane Eranian
@ 2013-01-18 22:57   ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 22:57 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

On Tue, Jan 15, 2013 at 04:39:30PM +0100, Stephane Eranian wrote:
> This patch extends Jiri's changes to make generic
> events mapping visible via sysfs. The patch extends
> the mechanism to non-generic events by allowing
> the mappings to be hardcoded in strings.
> 
> This mechanism will be used by the PEBS-LL patch
> later on.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>

Looks good to me.

Acked-by: Andi Kleen <ak@linux.intel.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 03/18] perf/x86: add flags to event constraints
  2013-01-15 15:39 ` [PATCH v6 03/18] perf/x86: add flags to event constraints Stephane Eranian
@ 2013-01-18 22:59   ` Andi Kleen
  2013-01-22 14:22     ` Stephane Eranian
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 22:59 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1367,8 +1367,10 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
>  
>  	if (x86_pmu.event_constraints) {
>  		for_each_event_constraint(c, x86_pmu.event_constraints) {
> -			if ((event->hw.config & c->cmask) == c->code)
> +			if ((event->hw.config & c->cmask) == c->code) {
> +				event->hw.flags |= c->flags;
>  				return c;
> +			}

It's not fully clear where that hw.flags field gets initially zeroed. Is that implicit
in the allocation? Some comments would be good about its life cycle.

Or just use a = instead of |=? Why would you have multiple flags in different places?

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT
  2013-01-15 15:39 ` [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT Stephane Eranian
@ 2013-01-18 23:00   ` Andi Kleen
  2013-01-22 14:30     ` Stephane Eranian
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 23:00 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

On Tue, Jan 15, 2013 at 04:39:33PM +0100, Stephane Eranian wrote:
> Ensure we grab the weight from the raw sample struct
> and that we can dump it via perf report -D.

AFAIK I have a similar patch in the Haswell patchkit.

It's straightforward of course, we just have to choose one.
I think mine does slightly more.

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 07/18] perf: add generic memory sampling interface
  2013-01-15 15:39 ` [PATCH v6 07/18] perf: add generic memory sampling interface Stephane Eranian
@ 2013-01-18 23:06   ` Andi Kleen
  2013-01-23 16:54     ` Stephane Eranian
  0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 23:06 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

>  extern void perf_output_sample(struct perf_output_handle *handle,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 7e24641..8283218 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -133,9 +133,9 @@ enum perf_event_sample_format {
>  	PERF_SAMPLE_REGS_USER			= 1U << 12,
>  	PERF_SAMPLE_STACK_USER			= 1U << 13,
>  	PERF_SAMPLE_WEIGHT			= 1U << 14,
> +	PERF_SAMPLE_DSRC			= 1U << 15,

This conflicts with similar extensions in the Haswell patchkit,
but that can be worked out by just moving some numbers (and making
sure the input/output calls are still in the right place)


> +union perf_mem_dsrc {
> +	__u64 val;
> +	struct {
> +		__u64   mem_op:5,	/* type of opcode */
> +			mem_lvl:14,	/* memory hierarchy level */
> +			mem_snoop:5,	/* snoop mode */
> +			mem_lock:2,	/* lock instr */
> +			mem_dtlb:7,	/* tlb access */
> +			mem_rsvd:31;
> +	};
> +};
> +
> +/* type of opcode (load/store/prefetch,code) */
> +#define PERF_MEM_OP_NA		0x01 /* not available */
> +#define PERF_MEM_OP_LOAD	0x02 /* load instruction */
> +#define PERF_MEM_OP_STORE	0x04 /* store instruction */
> +#define PERF_MEM_OP_PFETCH	0x08 /* prefetch */
> +#define PERF_MEM_OP_EXEC	0x10 /* code (execution) */
> +#define PERF_MEM_OP_SHIFT	0

Do we really need the shift? it's implicit in the bitfield right?

> +/* memory hierarchy (memory level, hit or miss) */
> +#define PERF_MEM_LVL_NA		0x01  /* not available */
> +#define PERF_MEM_LVL_HIT	0x02  /* hit level */
> +#define PERF_MEM_LVL_MISS	0x04  /* miss level  */
> +#define PERF_MEM_LVL_L1		0x08  /* L1 */
> +#define PERF_MEM_LVL_LFB	0x10  /* Line Fill Buffer */
> +#define PERF_MEM_LVL_L2		0x20  /* L2 hit */
> +#define PERF_MEM_LVL_L3		0x40  /* L3 hit */
> +#define PERF_MEM_LVL_LOC_RAM	0x80  /* Local DRAM */
> +#define PERF_MEM_LVL_REM_RAM1	0x100 /* Remote DRAM (1 hop) */
> +#define PERF_MEM_LVL_REM_RAM2	0x200 /* Remote DRAM (2 hops) */
> +#define PERF_MEM_LVL_REM_CCE1	0x400 /* Remote Cache (1 hop) */
> +#define PERF_MEM_LVL_REM_CCE2	0x800 /* Remote Cache (2 hops) */
> +#define PERF_MEM_LVL_IO		0x1000 /* I/O memory */
> +#define PERF_MEM_LVL_UNC	0x2000 /* Uncached memory */

I would leave some free bits here, obviously this doesn't cover all
that may be possible in system architecture. Also why is this a bit mask, 
you can only hit one level right? So perhaps a number.

> +/* TLB access */
> +#define PERF_MEM_TLB_NA		0x01 /* not available */
> +#define PERF_MEM_TLB_HIT	0x02 /* hit level */
> +#define PERF_MEM_TLB_MISS	0x04 /* miss level */
> +#define PERF_MEM_TLB_L1		0x08 /* L1 */
> +#define PERF_MEM_TLB_L2		0x10 /* L2 */
> +#define PERF_MEM_TLB_WK		0x20 /* Hardware Walker*/
> +#define PERF_MEM_TLB_OS		0x40 /* OS fault handler */


Same


> +#define PERF_MEM_TLB_SHIFT	26
> +
> +#define PERF_MEM_S(a, s) \
> +	(((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

Is that used by anything?


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 08/18] perf/x86: add memory profiling via PEBS Load Latency
  2013-01-15 15:39 ` [PATCH v6 08/18] perf/x86: add memory profiling via PEBS Load Latency Stephane Eranian
@ 2013-01-18 23:12   ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 23:12 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

> +	sample_type = event->attr.sample_type;
> +
> +	/*
> +	 * if PEBS-LL or PreciseStore
> +	 */
> +	if (fll) {
> +		if (sample_type & PERF_SAMPLE_ADDR)
> +			data.addr = pebs->dla;
> +
> +		/*
> +		 * Use latency for weight (only avail with PEBS-LL)
> +		 */
> +		if (fll && (sample_type & PERF_SAMPLE_WEIGHT))

The extra fll tests here don't make sense because it's always true inside
the if. You could remove the variable and the tests and only check once
in the if.

The rest looks good to me. There will be some conflicts with the Haswell
patches, but either of us can rebase.

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 10/18] perf/x86: add support for PEBS Precise Store
  2013-01-15 15:39 ` [PATCH v6 10/18] perf/x86: add support for PEBS Precise Store Stephane Eranian
@ 2013-01-18 23:21   ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 23:21 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

>  	u64 sample_type;
> -	int fll;
> +	int fll, fst;
>  
>  	if (!intel_pmu_save_and_restart(event))
>  		return;
>  
>  	fll = event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT;
> +	fst = event->hw.flags & PERF_X86_EVENT_PEBS_ST;
>  
>  	perf_sample_data_init(&data, 0, event->hw.last_period);
>  
> @@ -672,7 +715,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>  	/*
>  	 * if PEBS-LL or PreciseStore
>  	 */
> -	if (fll) {
> +	if (fll || fst) {

OK, I understand now why the other patch looked so strange.
It makes sense and you can disregard those comments.

Looks good

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 11/18] perf tools: add mem access sampling core support
  2013-01-15 15:39 ` [PATCH v6 11/18] perf tools: add mem access sampling core support Stephane Eranian
@ 2013-01-18 23:25   ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 23:25 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

On Tue, Jan 15, 2013 at 04:39:39PM +0100, Stephane Eranian wrote:
> This patch adds the sorting and histogram support
> functions to enable profiling of memory accesses.
> 
> The following sorting orders are added:
>  - symbol_daddr: data address symbol (or raw address)
>  - dso_daddr: data address shared object
>  - weight: access cost
>  - locked: access uses locked transaction
>  - tlb : TLB access
>  - mem : memory level of the access (L1, L2, L3, RAM, ...)
>  - snoop: access snoop mode

Do they actually sort? I found that the sort keys do not actually
sort in most cases. This is mostly an issue with numbers like
"weight" and of course not added by your patch.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 15/18] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP
  2013-01-15 15:39 ` [PATCH v6 15/18] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP Stephane Eranian
@ 2013-01-18 23:25   ` Andi Kleen
  0 siblings, 0 replies; 33+ messages in thread
From: Andi Kleen @ 2013-01-18 23:25 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, peterz, mingo, acme, jolsa, namhyung.kim

On Tue, Jan 15, 2013 at 04:39:43PM +0100, Stephane Eranian wrote:
> The type of mapping was lost, which made it hard for a tool
> to distinguish code vs. data mmaps. Perf has the ability
> to distinguish the two.
> 
> Use a bit in the header->misc bitmask to keep track of
> the mmap type. If PERF_RECORD_MISC_MMAP_DATA is set then
> the mapping is not executable (!VM_EXEC). If not set, then
> the mapping is executable.
> 
> Signed-off-by: Stephane Eranian <eranian@google.com>

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 03/18] perf/x86: add flags to event constraints
  2013-01-18 22:59   ` Andi Kleen
@ 2013-01-22 14:22     ` Stephane Eranian
  0 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-22 14:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim

On Fri, Jan 18, 2013 at 11:59 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> @@ -1367,8 +1367,10 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
>>
>>       if (x86_pmu.event_constraints) {
>>               for_each_event_constraint(c, x86_pmu.event_constraints) {
>> -                     if ((event->hw.config & c->cmask) == c->code)
>> +                     if ((event->hw.config & c->cmask) == c->code) {
>> +                             event->hw.flags |= c->flags;
>>                               return c;
>> +                     }
>
> It's not fully clear where that hw.flags field gets initially zeroed. Is that implicit
> in the allocation? Some comments would be good about its life cycle.
>
Yes, this is by allocation.
I used |= in case we need to add more flags in the future.
I will add a comment.
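
For illustration only, the life cycle described above can be sketched in
user space. Everything prefixed with ex_/EX_ is hypothetical: calloc()
stands in for the zeroing allocation, and the flag values are made up
since the thread names the flags but not their values.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag values, chosen only so that they do not overlap. */
#define EX_EVENT_PEBS_LDLAT	0x1
#define EX_EVENT_PEBS_ST	0x2

struct ex_hw_event {
	uint64_t config;
	int flags;
};

struct ex_constraint {
	uint64_t code;
	uint64_t cmask;
	int flags;
};

/* Zeroing allocation: flags starts at 0 without explicit initialization. */
static struct ex_hw_event *ex_alloc_event(uint64_t config)
{
	struct ex_hw_event *hw = calloc(1, sizeof(*hw));

	if (hw)
		hw->config = config;
	return hw;
}

/* On a match, OR the constraint flags in so earlier flags are preserved. */
static const struct ex_constraint *
ex_get_constraint(struct ex_hw_event *hw,
		  const struct ex_constraint *tbl, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if ((hw->config & tbl[i].cmask) == tbl[i].code) {
			hw->flags |= tbl[i].flags;
			return &tbl[i];
		}
	}
	return NULL;
}
```

With a plain assignment instead of |=, a flag set elsewhere before the
constraint lookup would be lost, which is the case the |= guards against.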

> Or just use a = instead of |=? Why would you have multiple flags in different places?
>
> -Andi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT
  2013-01-18 23:00   ` Andi Kleen
@ 2013-01-22 14:30     ` Stephane Eranian
  0 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-22 14:30 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim

On Sat, Jan 19, 2013 at 12:00 AM, Andi Kleen <ak@linux.intel.com> wrote:
> On Tue, Jan 15, 2013 at 04:39:33PM +0100, Stephane Eranian wrote:
>> Ensure we grab the weight from the raw sample struct
>> and that we can dump it via perf report -D.
>
> AFAIK I have a similar patch in the Haswell patchkit.
>
> It's straightforward of course, we just have to choose one.
> I think mine does slightly more.
>
Yes, mine is a tiny subset of yours. Should use yours instead.
Will swap with yours if that's easy to do.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 07/18] perf: add generic memory sampling interface
  2013-01-18 23:06   ` Andi Kleen
@ 2013-01-23 16:54     ` Stephane Eranian
  0 siblings, 0 replies; 33+ messages in thread
From: Stephane Eranian @ 2013-01-23 16:54 UTC (permalink / raw)
  To: Andi Kleen
  Cc: LKML, Peter Zijlstra, mingo, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim

On Sat, Jan 19, 2013 at 12:06 AM, Andi Kleen <ak@linux.intel.com> wrote:
>>  extern void perf_output_sample(struct perf_output_handle *handle,
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 7e24641..8283218 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -133,9 +133,9 @@ enum perf_event_sample_format {
>>       PERF_SAMPLE_REGS_USER                   = 1U << 12,
>>       PERF_SAMPLE_STACK_USER                  = 1U << 13,
>>       PERF_SAMPLE_WEIGHT                      = 1U << 14,
>> +     PERF_SAMPLE_DSRC                        = 1U << 15,
>
> This conflicts with similar extensions in the Haswell patchkit,
> but that can be worked out by just moving some numbers (and making
> sure the input/output calls are still in the right place)
>
Yes, it all depends on which patch goes in first. No big deal.

>
>> +union perf_mem_dsrc {
>> +     __u64 val;
>> +     struct {
>> +             __u64   mem_op:5,       /* type of opcode */
>> +                     mem_lvl:14,     /* memory hierarchy level */
>> +                     mem_snoop:5,    /* snoop mode */
>> +                     mem_lock:2,     /* lock instr */
>> +                     mem_dtlb:7,     /* tlb access */
>> +                     mem_rsvd:31;
>> +     };
>> +};
>> +
>> +/* type of opcode (load/store/prefetch,code) */
>> +#define PERF_MEM_OP_NA               0x01 /* not available */
>> +#define PERF_MEM_OP_LOAD     0x02 /* load instruction */
>> +#define PERF_MEM_OP_STORE    0x04 /* store instruction */
>> +#define PERF_MEM_OP_PFETCH   0x08 /* prefetch */
>> +#define PERF_MEM_OP_EXEC     0x10 /* code (execution) */
>> +#define PERF_MEM_OP_SHIFT    0
>
> Do we really need the shift? it's implicit in the bitfield right?
>
The bitfield is provided for reference for user code. It is not
used by the kernel code. We use a plain u64 instead, thus we
need the shift. This is used for the static pebs_data_source[]
table.

>> +/* memory hierarchy (memory level, hit or miss) */
>> +#define PERF_MEM_LVL_NA              0x01  /* not available */
>> +#define PERF_MEM_LVL_HIT     0x02  /* hit level */
>> +#define PERF_MEM_LVL_MISS    0x04  /* miss level  */
>> +#define PERF_MEM_LVL_L1              0x08  /* L1 */
>> +#define PERF_MEM_LVL_LFB     0x10  /* Line Fill Buffer */
>> +#define PERF_MEM_LVL_L2              0x20  /* L2 hit */
>> +#define PERF_MEM_LVL_L3              0x40  /* L3 hit */
>> +#define PERF_MEM_LVL_LOC_RAM 0x80  /* Local DRAM */
>> +#define PERF_MEM_LVL_REM_RAM1        0x100 /* Remote DRAM (1 hop) */
>> +#define PERF_MEM_LVL_REM_RAM2        0x200 /* Remote DRAM (2 hops) */
>> +#define PERF_MEM_LVL_REM_CCE1        0x400 /* Remote Cache (1 hop) */
>> +#define PERF_MEM_LVL_REM_CCE2        0x800 /* Remote Cache (2 hops) */
>> +#define PERF_MEM_LVL_IO              0x1000 /* I/O memory */
>> +#define PERF_MEM_LVL_UNC     0x2000 /* Uncached memory */
>
> I would leave some free bits here, obviously this doesn't cover all
> that may be possible in system architecture. Also why is this a bit mask,
> you can only hit one level right? So perhaps a number.
>
Yeah, I have been going back and forth on how best to define this to leave
some room for extensions. For now, we have 31 bits left in the MSB
part of the u64. We could either leave them there or try to commandeer some
for the memory hierarchy field.

This can be a bitmask on architectures where the HW cannot determine
for sure where the line came from. It may provide a best-effort answer such as
missed L2 or L3.

>> +/* TLB access */
>> +#define PERF_MEM_TLB_NA              0x01 /* not available */
>> +#define PERF_MEM_TLB_HIT     0x02 /* hit level */
>> +#define PERF_MEM_TLB_MISS    0x04 /* miss level */
>> +#define PERF_MEM_TLB_L1              0x08 /* L1 */
>> +#define PERF_MEM_TLB_L2              0x10 /* L2 */
>> +#define PERF_MEM_TLB_WK              0x20 /* Hardware Walker*/
>> +#define PERF_MEM_TLB_OS              0x40 /* OS fault handler */
>
>
The current x86 PEBS-LL is an example of HW that cannot disambiguate
where the TLB access actually was. It can for instance return that the access
did not miss the 2nd level TLB, which means: hit L1 TLB or L2 TLB. That's
why you need a bitmask.
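
As a sketch of what the bitmask buys (illustration only, not code from
the series): the TLB constants below are the ones quoted above, while
ex_tlb_to_string() and its level names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* TLB access bits as quoted from the patch. */
#define PERF_MEM_TLB_NA		0x01 /* not available */
#define PERF_MEM_TLB_HIT	0x02 /* hit level */
#define PERF_MEM_TLB_MISS	0x04 /* miss level */
#define PERF_MEM_TLB_L1		0x08 /* L1 */
#define PERF_MEM_TLB_L2		0x10 /* L2 */
#define PERF_MEM_TLB_WK		0x20 /* Hardware Walker */
#define PERF_MEM_TLB_OS		0x40 /* OS fault handler */

/* Hypothetical decoder: several level bits may be set at the same time. */
static void ex_tlb_to_string(uint64_t tlb, char *buf, size_t size)
{
	static const struct {
		uint64_t bit;
		const char *name;
	} levels[] = {
		{ PERF_MEM_TLB_L1, "L1" },
		{ PERF_MEM_TLB_L2, "L2" },
		{ PERF_MEM_TLB_WK, "Walker" },
		{ PERF_MEM_TLB_OS, "Fault" },
	};
	size_t i, len = 0;
	int n;

	if (size == 0)
		return;
	buf[0] = '\0';

	if (tlb & PERF_MEM_TLB_NA) {
		snprintf(buf, size, "N/A");
		return;
	}

	/* Join every level that is set with " or ". */
	for (i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
		if (!(tlb & levels[i].bit))
			continue;
		n = snprintf(buf + len, size - len, "%s%s",
			     len ? " or " : "", levels[i].name);
		if (n < 0 || (size_t)n >= size - len)
			return;	/* error or truncated output */
		len += (size_t)n;
	}

	if (tlb & PERF_MEM_TLB_HIT)
		snprintf(buf + len, size - len, "%shit", len ? " " : "");
	else if (tlb & PERF_MEM_TLB_MISS)
		snprintf(buf + len, size - len, "%smiss", len ? " " : "");
}
```

The "did not miss the 2nd level TLB" case is then HIT | L1 | L2, which
this decoder renders as "L1 or L2 hit". A plain number could only name
one level.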


> Same
>
>
>> +#define PERF_MEM_TLB_SHIFT   26
>> +
>> +#define PERF_MEM_S(a, s) \
>> +     (((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
>
> Is that used by anything?
>
>
Yes, it is used to populate the pebs_data_source[] in perf_event_intel_ds.c
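
A sketch of how such a table entry could be composed, for illustration
only. The OP and LVL constants and the PERF_MEM_S() macro are the ones
quoted above. PERF_MEM_LVL_SHIFT is not quoted in this thread, so the
value 5 is an assumption derived from mem_op:5 preceding mem_lvl in the
bitfield. ex_l1_hit_load() is a hypothetical helper, not an entry copied
from pebs_data_source[].

```c
#include <stdint.h>

typedef uint64_t u64;

/* Quoted from the patch. */
#define PERF_MEM_OP_LOAD	0x02 /* load instruction */
#define PERF_MEM_OP_SHIFT	0

#define PERF_MEM_LVL_HIT	0x02 /* hit level */
#define PERF_MEM_LVL_L1		0x08 /* L1 */
/* Assumption: mem_lvl starts right after the 5-bit mem_op field. */
#define PERF_MEM_LVL_SHIFT	5

#define PERF_MEM_S(a, s) \
	(((u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

/* Hypothetical entry: a load that hit the L1 cache, as a plain u64. */
static u64 ex_l1_hit_load(void)
{
	return PERF_MEM_S(OP, LOAD) |
	       PERF_MEM_S(LVL, HIT) |
	       PERF_MEM_S(LVL, L1);
}
```

Under that assumption the value is 0x02 | (0x02 << 5) | (0x08 << 5),
i.e. 0x142. The shift lets the kernel build the value without going
through the bitfield.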

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 00/18] perf: add memory access sampling support
  2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
                   ` (17 preceding siblings ...)
  2013-01-15 15:39 ` [PATCH v6 18/18] perf tools: Fix output of symbol_daddr offset Stephane Eranian
@ 2013-01-24 11:56 ` Ingo Molnar
  2013-01-24 13:39   ` Stephane Eranian
  18 siblings, 1 reply; 33+ messages in thread
From: Ingo Molnar @ 2013-01-24 11:56 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, ak, acme, jolsa, namhyung.kim


* Stephane Eranian <eranian@google.com> wrote:

> This patch series had a new feature to the kernel perf_events 
> interface and corresponding user level tool, perf.

Would be nice to merge this with the overlapping parts of Andi's 
Haswell series.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 00/18] perf: add memory access sampling support
  2013-01-24 11:56 ` [PATCH v6 00/18] perf: add memory access sampling support Ingo Molnar
@ 2013-01-24 13:39   ` Stephane Eranian
  2013-01-24 13:41     ` Ingo Molnar
  0 siblings, 1 reply; 33+ messages in thread
From: Stephane Eranian @ 2013-01-24 13:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Peter Zijlstra, mingo, ak, Arnaldo Carvalho de Melo,
	Jiri Olsa, Namhyung Kim

On Thu, Jan 24, 2013 at 12:56 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Stephane Eranian <eranian@google.com> wrote:
>
>> This patch series had a new feature to the kernel perf_events
>> interface and corresponding user level tool, perf.
>
> Would be nice to merge this with the overlapping parts of Andi's
> Haswell series.
>
That's what I have now. Merged with a couple of patches from Andi's.
Will post update today.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v6 00/18] perf: add memory access sampling support
  2013-01-24 13:39   ` Stephane Eranian
@ 2013-01-24 13:41     ` Ingo Molnar
  0 siblings, 0 replies; 33+ messages in thread
From: Ingo Molnar @ 2013-01-24 13:41 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, Peter Zijlstra, mingo, ak, Arnaldo Carvalho de Melo,
	Jiri Olsa, Namhyung Kim


* Stephane Eranian <eranian@google.com> wrote:

> On Thu, Jan 24, 2013 at 12:56 PM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Stephane Eranian <eranian@google.com> wrote:
> >
> >> This patch series had a new feature to the kernel perf_events
> >> interface and corresponding user level tool, perf.
> >
> > Would be nice to merge this with the overlapping parts of Andi's
> > Haswell series.
>
> That's what I have now. Merged with a couple of patches from 
> Andi's. Will post update today.

Ok, great!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2013-01-24 13:42 UTC | newest]

Thread overview: 33+ messages
2013-01-15 15:39 [PATCH v6 00/18] perf: add memory access sampling support Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 01/18] perf, x86: Support CPU specific sysfs events Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 02/18] perf/x86: improve sysfs event mapping with event string Stephane Eranian
2013-01-18 22:57   ` Andi Kleen
2013-01-15 15:39 ` [PATCH v6 03/18] perf/x86: add flags to event constraints Stephane Eranian
2013-01-18 22:59   ` Andi Kleen
2013-01-22 14:22     ` Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 04/18] perf, core: Add a concept of a weightened sample Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 05/18] perf: add minimal support for PERF_SAMPLE_WEIGHT Stephane Eranian
2013-01-18 23:00   ` Andi Kleen
2013-01-22 14:30     ` Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 06/18] perf: add support for PERF_SAMPLE_ADDR in dump_sampple() Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 07/18] perf: add generic memory sampling interface Stephane Eranian
2013-01-18 23:06   ` Andi Kleen
2013-01-23 16:54     ` Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 08/18] perf/x86: add memory profiling via PEBS Load Latency Stephane Eranian
2013-01-18 23:12   ` Andi Kleen
2013-01-15 15:39 ` [PATCH v6 09/18] perf/x86: export PEBS load latency threshold register to sysfs Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 10/18] perf/x86: add support for PEBS Precise Store Stephane Eranian
2013-01-18 23:21   ` Andi Kleen
2013-01-15 15:39 ` [PATCH v6 11/18] perf tools: add mem access sampling core support Stephane Eranian
2013-01-18 23:25   ` Andi Kleen
2013-01-15 15:39 ` [PATCH v6 12/18] perf report: add support for mem access profiling Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 13/18] perf record: " Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 14/18] perf tools: add new mem command for memory " Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 15/18] perf: add PERF_RECORD_MISC_MMAP_DATA to RECORD_MMAP Stephane Eranian
2013-01-18 23:25   ` Andi Kleen
2013-01-15 15:39 ` [PATCH v6 16/18] perf tools: detect data vs. text mappings Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 17/18] perf tools: Ignore ABS symbols when loading data maps Stephane Eranian
2013-01-15 15:39 ` [PATCH v6 18/18] perf tools: Fix output of symbol_daddr offset Stephane Eranian
2013-01-24 11:56 ` [PATCH v6 00/18] perf: add memory access sampling support Ingo Molnar
2013-01-24 13:39   ` Stephane Eranian
2013-01-24 13:41     ` Ingo Molnar
