* [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
@ 2014-02-10 17:28 Don Zickus
  2014-02-10 17:28 ` [PATCH 03/21] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
                   ` (25 more replies)
  0 siblings, 26 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:28 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

With the introduction of NUMA systems came the possibility of remote memory accesses.
Combine those remote memory accesses with contention on the remote node (i.e. a modified
cacheline) and you have the possibility of very long latencies.  These latencies can
bottleneck a program.

The program added by these patches helps detect the situation where two nodes are
'tugging' on the same _data_ cacheline.  The term used throughout this program and
the various changelogs is HITM.  This means nodeX went to read a cacheline and it
was discovered to be loaded in nodeY's LLC (hence the cache HIT).  The remote
cacheline was also in a 'M'odified state, thus creating a 'HIT M': a hit in a
modified state.  HITMs can happen locally and remotely.  This program's interest
is mainly in remote HITMs, as they cause the longest latencies.

Why a program has a remote HITM derives from how the two nodes are 'sharing' the
cacheline: is the sharing intentional ("true") or unintentional ("false")?  We have seen
lots of "false" sharing cases, which lead to simple solutions such as separating the data
onto different cachelines.
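
For the record, a minimal false-sharing reproducer of the kind this tool is meant to
catch (illustration only, not part of this series; the iteration count and layout are
arbitrary).  Two threads bump adjacent counters that land in the same 64-byte cacheline;
if the threads run on different nodes the line ping-pongs between them as remote HITMs,
and uncommenting the padding pushes the counters onto separate cachelines:

#include <pthread.h>
#include <stdio.h>

static struct {
	volatile long a;	/* written only by thread 0 */
	/* char pad[56]; */	/* uncomment to separate the cachelines */
	volatile long b;	/* written only by thread 1 */
} counters;

static void *bump_a(void *arg)
{
	(void)arg;
	for (long i = 0; i < 100000000; i++)
		counters.a++;
	return NULL;
}

static void *bump_b(void *arg)
{
	(void)arg;
	for (long i = 0; i < 100000000; i++)
		counters.b++;
	return NULL;
}

int main(void)
{
	pthread_t t[2];

	pthread_create(&t[0], NULL, bump_a, NULL);
	pthread_create(&t[1], NULL, bump_b, NULL);
	pthread_join(t[0], NULL);
	pthread_join(t[1], NULL);
	printf("%ld %ld\n", counters.a, counters.b);
	return 0;
}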

This tool does not distinguish between 'true' and 'false' sharing; instead it just points to
the more expensive sharing situations under the current workload.  It is up to the user
to understand what the workload is doing in order to determine whether a problem exists and
how to report it.

The data output is verbose and there are lots of data tables that interpret the latencies
and data addresses in different ways to help show where bottlenecks might lie.

Most of the ideas, work and calculations were done by Dick Fowles.  My work mainly
consisted of porting it to perf.  Joe Mario has contributed greatly with ideas to make the
output more informative based on his usage of the tool.  Joe has found a handful of
bottlenecks using various industry benchmarks and has worked with developers to fix
them.

I would also like to thank Stephane Eranian for his early help and guidance on 
navigating the differences between the current perf tool and how similar tools
looked at HP.  And also his tireless work in getting the MMAP2 interface to stick.

Also thanks to Arnaldo and Jiri Olsa for their help and suggestions for this tool.

I also have a test program that generates a controlled number of HITMs, which we used
frequently to validate our early work (the Intel docs were not always clear about which
bits had to be set, and some arches do not work well).  I would like to add it, but
didn't know how (nor did I spend any serious time looking).

This program has been tested primarily on Intel's Ivy Bridge platforms.  The Sandy Bridge
platforms had some quirks that were fixed on Ivy Bridge.  We haven't tried Haswell as
that has a re-worked latency event implementation.

A handful of patches include re-enabling MMAP2 support and some fixes to perf itself.  One
in particular hacks up how standard deviation is calculated.  It works with our calculations
but may break other tools' expectations.  Feedback is welcomed.

Comments, feedback, anything else welcomed.

Signed-off-by: Don Zickus <dzickus@redhat.com>

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (19):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, machine: Use map as success in ip__resolve_ams
  perf, session: Change header.misc dump from decimal to hex
  perf, stat: FIXME Stddev calculation is incorrect
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add rbtree sorted on mmap2 data
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Fixup tid because of perf map is broken
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table
  perf, c2c: Add framework to analyze latency and display summary stats
  perf, c2c: Add selected extreme latencies to output cacheline stats
    table
  perf, c2c: Add summary latency table for various parts of caches

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 2963 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evlist.c            |   37 +
 tools/perf/util/evlist.h            |    7 +
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.h              |    4 +
 tools/perf/util/machine.c           |    2 +-
 tools/perf/util/session.c           |    2 +-
 tools/perf/util/stat.c              |    3 +-
 15 files changed, 3097 insertions(+), 24 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 03/21] Revert "perf: Disable PERF_RECORD_MMAP2 support"
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
@ 2014-02-10 17:28 ` Don Zickus
  2014-02-10 17:28 ` [PATCH 04/21] perf, machine: Use map as success in ip__resolve_ams Don Zickus
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:28 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.

With the introduction of the c2c tool, we now have a user of MMAP2.

Signed-off-by: Don Zickus <dzickus@redhat.com>

---
 kernel/events/core.c    |  4 ----
 tools/perf/util/event.c | 36 +++++++++++++++++++-----------------
 tools/perf/util/evsel.c |  1 +
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 56003c6..f18cbb8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6832,10 +6832,6 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (ret)
 		return -EFAULT;
 
-	/* disabled for now */
-	if (attr->mmap2)
-		return -EINVAL;
-
 	if (attr->__reserved_1)
 		return -EINVAL;
 
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index b0f3ca8..086c7c8 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -201,13 +201,14 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		return -1;
 	}
 
-	event->header.type = PERF_RECORD_MMAP;
+	event->header.type = PERF_RECORD_MMAP2;
 
 	while (1) {
 		char bf[BUFSIZ];
 		char prot[5];
 		char execname[PATH_MAX];
 		char anonstr[] = "//anon";
+		unsigned int ino;
 		size_t size;
 		ssize_t n;
 
@@ -218,14 +219,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		strcpy(execname, "");
 
 		/* 00400000-0040c000 r-xp 00000000 fd:01 41038  /bin/cat */
-		n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %*x:%*x %*u %s\n",
-		       &event->mmap.start, &event->mmap.len, prot,
-		       &event->mmap.pgoff,
-		       execname);
-		/*
- 		 * Anon maps don't have the execname.
- 		 */
-		if (n < 4)
+		n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %x:%x %u %s\n",
+		       &event->mmap2.start, &event->mmap2.len, prot,
+		       &event->mmap2.pgoff, &event->mmap2.maj,
+		       &event->mmap2.min,
+		       &ino, execname);
+
+		event->mmap2.ino = (u64)ino;
+
+		if (n < 7)
 			continue;
 		/*
 		 * Just like the kernel, see __perf_event_mmap in kernel/perf_event.c
@@ -246,15 +248,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 			strcpy(execname, anonstr);
 
 		size = strlen(execname) + 1;
-		memcpy(event->mmap.filename, execname, size);
+		memcpy(event->mmap2.filename, execname, size);
 		size = PERF_ALIGN(size, sizeof(u64));
-		event->mmap.len -= event->mmap.start;
-		event->mmap.header.size = (sizeof(event->mmap) -
-					(sizeof(event->mmap.filename) - size));
-		memset(event->mmap.filename + size, 0, machine->id_hdr_size);
-		event->mmap.header.size += machine->id_hdr_size;
-		event->mmap.pid = tgid;
-		event->mmap.tid = pid;
+		event->mmap2.len -= event->mmap.start;
+		event->mmap2.header.size = (sizeof(event->mmap2) -
+					(sizeof(event->mmap2.filename) - size));
+		memset(event->mmap2.filename + size, 0, machine->id_hdr_size);
+		event->mmap2.header.size += machine->id_hdr_size;
+		event->mmap2.pid = tgid;
+		event->mmap2.tid = pid;
 
 		if (process(tool, event, &synth_sample, machine) != 0) {
 			rc = -1;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 55407c5..65db757 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -640,6 +640,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
 		perf_evsel__set_sample_bit(evsel, WEIGHT);
 
 	attr->mmap  = track;
+	attr->mmap2 = track && !perf_missing_features.mmap2;
 	attr->comm  = track;
 
 	if (opts->sample_transaction)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 04/21] perf, machine: Use map as success in ip__resolve_ams
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
  2014-02-10 17:28 ` [PATCH 03/21] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
@ 2014-02-10 17:28 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex Don Zickus
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:28 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

When trying to map a bunch of instruction addresses to their respective
threads, I kept getting a lot of bogus entries [I forget the exact reason
as I patched my code months ago].

Looking through ip__resolve_ams, I noticed the check for

if (al.sym)

and realized that most of the time I have an al.map defined, but sometimes
al.sym is undefined.  In the cases where al.sym is undefined, the loop
keeps going even though a valid al.map exists.

Modify this check to use the more reliable al.map.  This fixed my bogus
entries.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/machine.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index c872991..620a198 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1213,7 +1213,7 @@ static void ip__resolve_ams(struct machine *machine, struct thread *thread,
 		 */
 		thread__find_addr_location(thread, machine, m, MAP__FUNCTION,
 				ip, &al);
-		if (al.sym)
+		if (al.map)
 			goto found;
 	}
 found:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
  2014-02-10 17:28 ` [PATCH 03/21] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
  2014-02-10 17:28 ` [PATCH 04/21] perf, machine: Use map as success in ip__resolve_ams Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-18 12:56   ` Jiri Olsa
  2014-02-10 17:29 ` [PATCH 06/21] perf, stat: FIXME Stddev calculation is incorrect Don Zickus
                   ` (22 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

When printing the raw dump of a data file, the header.misc is
printed as a decimal.  Unfortunately, that field is a bit mask, so
it is hard to interpret as a decimal.

Print in hex, so the user can easily see what bits are set and more
importantly what type of info it is conveying.
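
For example, a precise userspace sample has misc 0x4002, which reads directly as
PERF_RECORD_MISC_USER | PERF_RECORD_MISC_EXACT_IP, whereas the decimal form 16386
conveys nothing at a glance.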

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/session.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 0b39a48..d1ad10f 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -793,7 +793,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
 	if (!dump_trace)
 		return;
 
-	printf("(IP, %d): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
+	printf("(IP, %x): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
 	       event->header.misc, sample->pid, sample->tid, sample->ip,
 	       sample->period, sample->addr);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 06/21] perf, stat: FIXME Stddev calculation is incorrect
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (2 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 07/21] perf, callchain: Add generic callchain print handler for stdio Don Zickus
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Fix the standard deviation calculation.  The current calculation computes the
standard deviation of the mean (the standard error), which is a slightly
different quantity.
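
For reference, a quick sketch of the two quantities (assuming only the stats->M2 and
stats->n fields used below, where M2 accumulates the sum of squared deviations from the
running mean; the reduced struct is just for illustration):

#include <math.h>

/* reduced stand-in for struct stats, illustration only */
struct stats {
	double M2;
	unsigned long long n;
};

/* sample standard deviation -- what the latency tables in this series expect */
static double sample_stddev(struct stats *s)
{
	return sqrt(s->M2 / (s->n - 1));
}

/* standard deviation of the mean (standard error) -- what stddev_stats()
 * returns today: sqrt((M2 / (n - 1)) / n) == sample_stddev / sqrt(n) */
static double stddev_of_mean(struct stats *s)
{
	return sqrt(s->M2 / (s->n - 1) / s->n);
}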

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/stat.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 6506b3d..58aa661 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -47,7 +47,8 @@ double stddev_stats(struct stats *stats)
 		return 0.0;
 
 	variance = stats->M2 / (stats->n - 1);
-	variance_mean = variance / stats->n;
+	//variance_mean = variance / stats->n;
+	variance_mean = variance;
 
 	return sqrt(variance_mean);
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 07/21] perf, callchain: Add generic callchain print handler for stdio
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (3 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 06/21] perf, stat: FIXME Stddev calculation is incorrect Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 08/21] perf, c2c: Rework setup code to prepare for features Don Zickus
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

My initial implementation for rbtree sorting in the c2c tool does not use the
normal history elements.  As a result, adding callchain support (which is
deeply integrated with history elements) is more challenging when trying to
display its output.

To make things simpler for myself (and to avoid rewriting the same code into
the c2c tool), I provided a generic interface that takes an unsorted callchain
list along with its total and relative sample size, and sorts it locally based
on period and calls the appropriate graph function (passing the correct sample
size).

This makes things easier because the c2c tool can be dumber and just collect
callchains and not worry about the magic needed to sort and display them
correctly.

Unfortunately, this assumes stdio output only and does not cover the other
GUI-type outputs.

Regardless, this patch provides useful info for the tool right now.  Tweaks and
recommendations for a better approach are welcomed. :-)
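
As a rough illustration of the intended call site (hypothetical caller, not part of
this patch; the entry type and the trick of allocating the callchain_root right
behind it come from the c2c patches later in this series):

#include <stdio.h>
#include "util/callchain.h"	/* struct callchain_root */
#include "util/hist.h"		/* generic_entry_callchain__fprintf() */

/*
 * A tool that keeps its own entry type, with a callchain_root allocated
 * immediately after it, can print the chain without going through
 * hist_entry.  Sorting by period and picking the graph/flat printer is
 * done inside the helper.
 */
static size_t entry__fprintf_chain(void *entry, size_t entry_size,
				   u64 total_period, u64 entry_period,
				   FILE *fp)
{
	struct callchain_root *chain;

	chain = (struct callchain_root *)((char *)entry + entry_size);

	return generic_entry_callchain__fprintf(chain, total_period,
						entry_period,
						4 /* left_margin */, fp);
}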

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/ui/stdio/hist.c | 37 +++++++++++++++++++++++++++++++++++++
 tools/perf/util/hist.h     |  4 ++++
 2 files changed, 41 insertions(+)

diff --git a/tools/perf/ui/stdio/hist.c b/tools/perf/ui/stdio/hist.c
index 831fbb7..0a40f59 100644
--- a/tools/perf/ui/stdio/hist.c
+++ b/tools/perf/ui/stdio/hist.c
@@ -536,3 +536,40 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp)
 
 	return ret;
 }
+
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+					u64 total_samples, u64 relative_samples,
+					int left_margin, FILE *fp)
+{
+	struct rb_root sorted_chain;
+	u64 min_callchain_hits;
+
+	if (!symbol_conf.use_callchain)
+		return 0;
+
+	min_callchain_hits = total_samples * (callchain_param.min_percent / 100);
+
+	callchain_param.sort(&sorted_chain, unsorted_callchain,
+				min_callchain_hits, &callchain_param);
+
+	switch (callchain_param.mode) {
+	case CHAIN_GRAPH_REL:
+		return callchain__fprintf_graph(fp, &sorted_chain, relative_samples,
+						left_margin);
+		break;
+	case CHAIN_GRAPH_ABS:
+		return callchain__fprintf_graph(fp, &sorted_chain, total_samples,
+						left_margin);
+		break;
+	case CHAIN_FLAT:
+		return callchain__fprintf_flat(fp, &sorted_chain, total_samples);
+		break;
+	case CHAIN_NONE:
+		break;
+	default:
+		pr_err("Bad callchain mode\n");
+	}
+
+	return 0;
+}
+
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a59743f..4fe5de5 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -111,6 +111,10 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp);
 size_t hists__fprintf(struct hists *hists, bool show_header, int max_rows,
 		      int max_cols, float min_pcnt, FILE *fp);
 
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+					u64 total_samples, u64 relative_samples,
+					int left_margin, FILE *fp);
+
 void hists__filter_by_dso(struct hists *hists);
 void hists__filter_by_thread(struct hists *hists);
 void hists__filter_by_symbol(struct hists *hists);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 08/21] perf, c2c: Rework setup code to prepare for features
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (4 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 07/21] perf, callchain: Add generic callchain print handler for stdio Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-18 13:02   ` Jiri Olsa
  2014-02-10 17:29 ` [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data Don Zickus
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

A basic patch that re-arranges some of the c2c code and adds a couple
of small features to lay the groundwork for the rest of the patch
series.

Changes include:

o reworking the report path
o creating an initial entry struct
o replacing preprocess_sample with simpler calls
o reworking the raw output to handle separators
o removing the phys id gunk
o adding some generic options

There isn't much meat in this patch, just a bunch of code movement and cleanups.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 163 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 129 insertions(+), 34 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a5dc412..b062485 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,6 +5,7 @@
 #include "util/parse-options.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/debug.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -14,6 +15,15 @@ struct perf_c2c {
 	bool		 raw_records;
 };
 
+struct c2c_entry {
+	struct thread		*thread;
+	struct mem_info		*mi;
+	u32			cpu;
+	u8			cpumode;
+	int			weight;
+	int			period;
+};
+
 enum { OP, LVL, SNP, LCK, TLB };
 
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
@@ -105,34 +115,69 @@ static int perf_c2c__fprintf_header(FILE *fp)
 }
 
 static int perf_sample__fprintf(struct perf_sample *sample, char tag,
-				const char *reason, struct addr_location *al, FILE *fp)
+				const char *reason, struct mem_info *mi, FILE *fp)
 {
 	char data_src[61];
+	const char *fmt, *sep;
+	struct map *map = mi->iaddr.map;
 
 	perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
 
-	return fprintf(fp, "%c %-16s  %6d  %6d  %4d  %#18" PRIx64 "  %#18" PRIx64 "  %#18" PRIx64 "  %6" PRIu64 "  %#10" PRIx64 " %-60.60s %s:%s\n",
-		       tag,
-		       reason ?: "valid record",
-		       sample->pid,
-		       sample->tid,
-		       sample->cpu,
-		       sample->ip,
-		       sample->addr,
-		       0UL,
-		       sample->weight,
-		       sample->data_src,
-		       data_src,
-		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
-		       al->sym ? al->sym->name : "???");
+	if (symbol_conf.field_sep) {
+		fmt = "%c%s%s%s%d%s%d%s%d%s%#"PRIx64"%s%#"PRIx64"%s"
+		      "%"PRIu64"%s%#"PRIx64"%s%s%s%s:%s\n";
+		sep = symbol_conf.field_sep;
+	} else {
+		fmt = "%c%s%-16s%s%6d%s%6d%s%4d%s%#18"PRIx64"%s%#18"PRIx64"%s"
+		      "%6"PRIu64"%s%#10"PRIx64"%s%-60.60s%s%s:%s\n";
+		sep = " ";
+	}
+
+	return fprintf(fp, fmt,
+		       tag,				sep,
+		       reason ?: "valid record",	sep,
+		       sample->pid,			sep,
+		       sample->tid,			sep,
+		       sample->cpu,			sep,
+		       sample->ip,			sep,
+		       sample->addr,			sep,
+		       sample->weight,			sep,
+		       sample->data_src,		sep,
+		       data_src,			sep,
+		       map ? (map->dso ? map->dso->long_name : "???") : "???",
+		       mi->iaddr.sym ? mi->iaddr.sym->name : "???");
 }
 
-static int perf_c2c__process_load_store(struct perf_c2c *c2c,
-					struct perf_sample *sample,
-					struct addr_location *al)
+static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
+					struct thread *thread,
+					struct mem_info *mi,
+					u8 cpumode)
 {
-	if (c2c->raw_records)
-		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+	size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+	struct c2c_entry *entry = zalloc(sizeof(*entry) + callchain_size);
+
+	if (entry != NULL) {
+		entry->thread = thread;
+		entry->mi = mi;
+		entry->cpu = sample->cpu;
+		entry->cpumode = cpumode;
+		entry->weight = sample->weight;
+		if (sample->period)
+			entry->period = sample->period;
+		else
+			entry->period = 1;
+	}
+
+	return entry;
+}
+
+static int perf_c2c__process_load_store(struct perf_c2c *c2c __maybe_unused,
+				  struct perf_sample *sample __maybe_unused,
+				  struct c2c_entry *entry)
+{
+	/* don't lose the maps if remapped */
+	entry->mi->iaddr.map->referenced = true;
+	entry->mi->daddr.map->referenced = true;
 
 	return 0;
 }
@@ -144,7 +189,7 @@ static const struct perf_evsel_str_handler handlers[] = {
 
 typedef int (*sample_handler)(struct perf_c2c *c2c,
 			      struct perf_sample *sample,
-			      struct addr_location *al);
+			      struct c2c_entry *entry);
 
 static int perf_c2c__process_sample(struct perf_tool *tool,
 				    union perf_event *event,
@@ -153,20 +198,63 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 				    struct machine *machine)
 {
 	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
-	struct addr_location al;
-	int err = 0;
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	struct mem_info *mi;
+	struct thread *thread;
+	struct c2c_entry *entry;
+	sample_handler f;
+	int err = -1;
+
+	if (evsel->handler.func == NULL)
+		return 0;
+
+	thread = machine__find_thread(machine, sample->tid);
+	if (thread == NULL)
+		goto err;
+
+	mi = machine__resolve_mem(machine, thread, sample, cpumode);
+	if (mi == NULL)
+		goto err;
 
-	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
-		pr_err("problem processing %d event, skipping it.\n",
-		       event->header.type);
-		return -1;
+	if (c2c->raw_records) {
+		perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
+		free(mi);
+		return 0;
 	}
 
-	if (evsel->handler.func != NULL) {
-		sample_handler f = evsel->handler.func;
-		err = f(c2c, sample, &al);
+	entry = c2c_entry__new(sample, thread, mi, cpumode);
+	if (entry == NULL)
+		goto err_mem;
+
+	f = evsel->handler.func;
+	err = f(c2c, sample, entry);
+	if (err)
+		goto err_entry;
+
+	return 0;
+
+err_entry:
+	free(entry);
+err_mem:
+	free(mi);
+err:
+	if (err > 0)
+		err = 0;
+	return err;
+}
+
+static int perf_c2c__process_events(struct perf_session *session,
+				    struct perf_c2c *c2c)
+{
+	int err = -1;
+
+	err = perf_session__process_events(session, &c2c->tool);
+	if (err) {
+		pr_err("Failed to process count events, error %d\n", err);
+		goto err;
 	}
 
+err:
 	return err;
 }
 
@@ -184,9 +272,7 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 	if (perf_evlist__set_handlers(session->evlist, handlers))
 		goto out_delete;
 
-	err = perf_session__process_events(session, &c2c->tool);
-	if (err)
-		pr_err("Failed to process events, error %d", err);
+	err = perf_c2c__process_events(session, c2c);
 
 out_delete:
 	perf_session__delete(session);
@@ -210,7 +296,6 @@ static int perf_c2c__record(int argc, const char **argv)
 	const char **rec_argv;
 	const char * const record_args[] = {
 		"record",
-		/* "--phys-addr", */
 		"-W",
 		"-d",
 		"-a",
@@ -243,6 +328,8 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	struct perf_c2c c2c = {
 		.tool = {
 			.sample		 = perf_c2c__process_sample,
+			.mmap2           = perf_event__process_mmap2,
+			.mmap            = perf_event__process_mmap,
 			.comm		 = perf_event__process_comm,
 			.exit		 = perf_event__process_exit,
 			.fork		 = perf_event__process_fork,
@@ -252,6 +339,14 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	};
 	const struct option c2c_options[] = {
 	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+	OPT_INCR('v', "verbose", &verbose,
+		 "be more verbose (show counter open errors, etc)"),
+	OPT_STRING('i', "input", &input_name, "file",
+		   "the input file to process"),
+	OPT_STRING('x', "field-separator", &symbol_conf.field_sep,
+		   "separator",
+		   "separator for columns, no spaces will be added"
+		   " between columns '.' is reserved."),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (5 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 08/21] perf, c2c: Rework setup code to prepare for features Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-18 13:04   ` Jiri Olsa
  2014-02-10 17:29 ` [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

In order for the c2c tool to work correctly, it needs to properly
sort all the records on uniquely identifiable data addresses.  These
unique addresses are converted from virtual addresses provided by the
hardware into a kernel address using an mmap2 record as the decoder.
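
A sketch of what "unique address" means here (an assumption on my part, mirroring
the al_addr comment further down): for file-backed maps the raw virtual address is
rebased onto the file, roughly

	unique_addr = vaddr - map->start + map->pgoff	/* what al_addr holds */

so the same data mapped at different virtual addresses in different processes
compares equal.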

Once the unique addresses are converted, we can sort on them based on
various rules.  Then it becomes clear which addresses overlap with each
other across mmap regions or pid spaces.

This patch just creates the rules and inserts the records into an
rbtree for safe keeping until later patches process them.

The general sorting rule is:

o group cpumodes together
o group similar major, minor, inode, inode generation numbers together
o if (nonzero major/minor number - ie mmap'd areas)
  o sort on data addresses
  o sort on instruction address
  o sort on pid
  o sort on tid
o if cpumode is kernel
  o sort on data addresses
  o sort on instruction address
  o sort on pid
  o sort on tid
o else (private to pid space)
  o sort on pid
  o sort on tid
  o sort on data addresses
  o sort on instruction address

I also hacked in the concept of 'color'.  The purpose of that bit is to provide
hints later, when processing these records, that a new unique address has been
encountered.  Because later processing only checks the data addresses, there is a
theoretical scenario in which similar sequential data addresses (when walking the
rbtree) could be misinterpreted as overlapping when in fact they are not.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 144 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index b062485..a9c536b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -13,15 +13,20 @@
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
+	struct rb_root   tree_physid;
 };
 
+#define REGION_SAME	1 << 0;
+
 struct c2c_entry {
+	struct rb_node		rb_node;
 	struct thread		*thread;
 	struct mem_info		*mi;
 	u32			cpu;
 	u8			cpumode;
 	int			weight;
 	int			period;
+	int			color;
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -96,6 +101,133 @@ static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 	return printed;
 }
 
+static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
+{
+	u64 l, r;
+	struct map *l_map = left->mi->daddr.map;
+	struct map *r_map = right->mi->daddr.map;
+
+	/* group event types together */
+	if (left->cpumode > right->cpumode) return 1;
+	if (left->cpumode < right->cpumode) return -1;
+
+	if (l_map->maj > r_map->maj) return 1;
+	if (l_map->maj < r_map->maj) return -1;
+
+	if (l_map->min > r_map->min) return 1;
+	if (l_map->min < r_map->min) return -1;
+
+	if (l_map->ino > r_map->ino) return 1;
+	if (l_map->ino < r_map->ino) return -1;
+
+	if (l_map->ino_generation > r_map->ino_generation) return 1;
+	if (l_map->ino_generation < r_map->ino_generation) return -1;
+
+	/*
+	 * Addresses with no major/minor numbers are assumed to be
+	 * anonymous in userspace.  Sort those on pid then address.
+	 *
+	 * The kernel and non-zero major/minor mapped areas are
+	 * assumed to be unity mapped.  Sort those on address then pid.
+	 */
+
+	/* al_addr does all the right addr - start + offset calculations */
+	l = left->mi->daddr.al_addr;
+	r = right->mi->daddr.al_addr;
+
+	if (l_map->maj || l_map->min) {
+		/* mmapped areas */
+
+		/* hack to mark similar regions, 'right' is new entry */
+		/* entries with same maj/min/ino/inogen are in same address space */
+		right->color = REGION_SAME;
+
+		if (l > r) return 1;
+		if (l < r) return -1;
+
+		/* sorting by iaddr makes calculations easier later */
+		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
+		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
+
+		if (left->thread->pid_ > right->thread->pid_) return 1;
+		if (left->thread->pid_ < right->thread->pid_) return -1;
+
+		if (left->thread->tid > right->thread->tid) return 1;
+		if (left->thread->tid < right->thread->tid) return -1;
+	} else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
+		/* kernel mapped areas where 'start' doesn't matter */
+
+		/* hack to mark similar regions, 'right' is new entry */
+		/* whole kernel region is in the same address space */
+		right->color = REGION_SAME;
+
+		if (l > r) return 1;
+		if (l < r) return -1;
+
+		/* sorting by iaddr makes calculations easier later */
+		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
+		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
+
+		if (left->thread->pid_ > right->thread->pid_) return 1;
+		if (left->thread->pid_ < right->thread->pid_) return -1;
+
+		if (left->thread->tid > right->thread->tid) return 1;
+		if (left->thread->tid < right->thread->tid) return -1;
+	} else {
+		/* userspace anonymous */
+		if (left->thread->pid_ > right->thread->pid_) return 1;
+		if (left->thread->pid_ < right->thread->pid_) return -1;
+
+		if (left->thread->tid > right->thread->tid) return 1;
+		if (left->thread->tid < right->thread->tid) return -1;
+
+		/* hack to mark similar regions, 'right' is new entry */
+		/* userspace anonymous address space is contained within pid */
+		right->color = REGION_SAME;
+
+		if (l > r) return 1;
+		if (l < r) return -1;
+
+		/* sorting by iaddr makes calculations easier later */
+		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
+		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
+	}
+
+	return 0;
+}
+static struct c2c_entry *c2c_entry__add_to_list(struct perf_c2c *c2c, struct c2c_entry *entry)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_entry *ce;
+	int64_t cmp;
+
+	p = &c2c->tree_physid.rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		ce = rb_entry(parent, struct c2c_entry, rb_node);
+
+		cmp = physid_cmp(ce, entry);
+
+		/* FIXME wrap this with a #ifdef debug or something */
+		if (!cmp)
+			if ((entry->mi->daddr.map != ce->mi->daddr.map) &&
+				!entry->mi->daddr.map->maj && !entry->mi->daddr.map->min)
+				pr_err("Similar entries have different maps\n");
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&entry->rb_node, parent, p);
+	rb_insert_color(&entry->rb_node, &c2c->tree_physid);
+
+	return entry;
+}
+
 static int perf_c2c__fprintf_header(FILE *fp)
 {
 	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
@@ -171,10 +303,12 @@ static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
 	return entry;
 }
 
-static int perf_c2c__process_load_store(struct perf_c2c *c2c __maybe_unused,
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 				  struct perf_sample *sample __maybe_unused,
 				  struct c2c_entry *entry)
 {
+	c2c_entry__add_to_list(c2c, entry);
+
 	/* don't lose the maps if remapped */
 	entry->mi->iaddr.map->referenced = true;
 	entry->mi->daddr.map->referenced = true;
@@ -280,10 +414,19 @@ out:
 	return err;
 }
 
+static int perf_c2c__init(struct perf_c2c *c2c)
+{
+	c2c->tree_physid = RB_ROOT;
+
+	return 0;
+}
 static int perf_c2c__report(struct perf_c2c *c2c)
 {
 	setup_pager();
 
+	if (perf_c2c__init(c2c))
+		return -1;
+
 	if (c2c->raw_records)
 		perf_c2c__fprintf_header(stdout);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (6 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-18 13:05   ` Jiri Olsa
  2014-02-10 17:29 ` [PATCH 11/21] perf, c2c: Sort based on hottest cache line Don Zickus
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This patch adds a bunch of stats that will be used later in post-processing
to determine where and with what frequency the HITMs are coming from.

Most of the stats are decoded from the data source response.  Another
piece of the stats is tracking which cpu the record came in on.

In order to properly build a cpu map to map where interesting events are coming
from, I shamelessly copy-n-pasted the cpu->NUMA node code from builtin-kmem.c.

As HITMs are most expensive when going across NUMA nodes, it only made sense
to create a quick cpu->NUMA lookup for when processing the records.
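
With the map in place, the lookup at record-processing time reduces to a single
array access; a sketch of the intended use (the real consumer arrives with the node
columns in later patches):

	/* hypothetical call site; cpunode_map[] entries stay -1 for any cpu
	 * that never appeared under /sys/devices/system/node/nodeN/ */
	int node = cpunode_map[entry->cpu];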

Credit to Dick Fowles for determining which bits are important and how to
properly track them.  Ported to perf by me.

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 327 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 326 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a9c536b..360fbcf 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,15 +5,54 @@
 #include "util/parse-options.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/stat.h"
+#include "util/cpumap.h"
 #include "util/debug.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
+#include <sched.h>
+
+typedef struct {
+	int  locks;               /* count of 'lock' transactions */
+	int  store;               /* count of all stores in trace */
+	int  st_uncache;          /* stores to uncacheable address */
+	int  st_noadrs;           /* cacheable store with no address */
+	int  st_l1hit;            /* count of stores that hit L1D */
+	int  st_l1miss;           /* count of stores that miss L1D */
+	int  load;                /* count of all loads in trace */
+	int  ld_excl;             /* exclusive loads, rmt/lcl DRAM - snp none/miss */
+	int  ld_shared;           /* shared loads, rmt/lcl DRAM - snp hit */
+	int  ld_uncache;          /* loads to uncacheable address */
+	int  ld_noadrs;           /* cacheable load with no address */
+	int  ld_fbhit;            /* count of loads hitting Fill Buffer */
+	int  ld_l1hit;            /* count of loads that hit L1D */
+	int  ld_l2hit;            /* count of loads that hit L2D */
+	int  ld_llchit;           /* count of loads that hit LLC */
+	int  lcl_hitm;            /* count of loads with local HITM  */
+	int  rmt_hitm;            /* count of loads with remote HITM */
+	int  rmt_hit;             /* count of loads with remote hit clean; */
+	int  lcl_dram;            /* count of loads miss to local DRAM */
+	int  rmt_dram;            /* count of loads miss to remote DRAM */
+	int  nomap;               /* count of load/stores with no phys adrs */
+	int  remap;               /* count of virt->phys remappings */
+} trinfo_t;
+
+struct c2c_stats {
+	cpu_set_t		cpuset;
+	int			nr_entries;
+	u64			total_period;
+	trinfo_t		t;
+	struct stats		stats;
+};
 
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
 	struct rb_root   tree_physid;
+
+	/* stats */
+	struct c2c_stats	stats;
 };
 
 #define REGION_SAME	1 << 0;
@@ -31,6 +70,179 @@ struct c2c_entry {
 
 enum { OP, LVL, SNP, LCK, TLB };
 
+#define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
+#define RMT_LLC              (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
+
+#define L1CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_HIT))
+#define FILLBUF_HIT(a)       (((a) & PERF_MEM_LVL_LFB) && ((a) & PERF_MEM_LVL_HIT))
+#define L2CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L2 ) && ((a) & PERF_MEM_LVL_HIT))
+#define L3CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_HIT))
+
+#define L1CACHE_MISS(a)      (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_MISS))
+#define L3CACHE_MISS(a)      (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_MISS))
+
+#define LD_UNCACHED(a)       (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+#define ST_UNCACHED(a)       (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+
+#define RMT_LLCHIT(a)        (((a) & RMT_LLC) && ((a) & PERF_MEM_LVL_HIT))
+#define RMT_HIT(a,b)         (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HIT))
+#define RMT_HITM(a,b)        (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HITM))
+#define RMT_MEM(a)           (((a) & RMT_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+#define LCL_HIT(a,b)         (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HIT))
+#define LCL_HITM(a,b)        (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
+#define LCL_MEM(a)           (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+static int			max_cpu_num;
+static int			max_node_num;
+static int			*cpunode_map;
+
+#define PATH_SYS_NODE   "/sys/devices/system/node"
+
+/* Determine highest possible cpu in the system for sparse allocation */
+static void set_max_cpu_num(void)
+{
+	FILE *fp;
+	char buf[256];
+	int num;
+
+	/* set up default */
+	max_cpu_num = 4096;
+
+	/* get the highest possible cpu number for a sparse allocation */
+	fp = fopen("/sys/devices/system/cpu/possible", "r");
+	if (!fp)
+		goto out;
+
+	num = fread(&buf, 1, sizeof(buf), fp);
+	if (!num)
+		goto out_close;
+	buf[num] = '\0';
+
+	/* start on the right, to find highest cpu num */
+	while (--num) {
+		if ((buf[num] == ',') || (buf[num] == '-')) {
+			num++;
+			break;
+		}
+	}
+	if (sscanf(&buf[num], "%d", &max_cpu_num) < 1)
+		goto out_close;
+
+	max_cpu_num++;
+
+	fclose(fp);
+	return;
+
+out_close:
+	fclose(fp);
+out:
+	pr_err("Failed to read max cpus, using default of %d\n",
+		max_cpu_num);
+	return;
+}
+
+/* Determine highest possible node in the system for sparse allocation */
+static void set_max_node_num(void)
+{
+	FILE *fp;
+	char buf[256];
+	int num;
+
+	/* set up default */
+	max_node_num = 8;
+
+	/* get the highest possible cpu number for a sparse allocation */
+	fp = fopen("/sys/devices/system/node/possible", "r");
+	if (!fp)
+		goto out;
+
+	num = fread(&buf, 1, sizeof(buf), fp);
+	if (!num)
+		goto out_close;
+	buf[num] = '\0';
+
+	/* start on the right, to find highest node num */
+	while (--num) {
+		if ((buf[num] == ',') || (buf[num] == '-')) {
+			num++;
+			break;
+		}
+	}
+	if (sscanf(&buf[num], "%d", &max_node_num) < 1)
+		goto out_close;
+
+	max_node_num++;
+
+	fclose(fp);
+	return;
+
+out_close:
+	fclose(fp);
+out:
+	pr_err("Failed to read max nodes, using default of %d\n",
+		max_node_num);
+	return;
+}
+
+static int init_cpunode_map(void)
+{
+	int i;
+
+	set_max_cpu_num();
+	set_max_node_num();
+
+	cpunode_map = calloc(max_cpu_num, sizeof(int));
+	if (!cpunode_map) {
+		pr_err("%s: calloc failed\n", __func__);
+		goto out;
+	}
+
+	for (i = 0; i < max_cpu_num; i++)
+		cpunode_map[i] = -1;
+
+	return 0;
+out:
+	return -1;
+}
+
+static int setup_cpunode_map(void)
+{
+	struct dirent *dent1, *dent2;
+	DIR *dir1, *dir2;
+	unsigned int cpu, mem;
+	char buf[PATH_MAX];
+
+	/* initialize globals */
+	if (init_cpunode_map())
+		return -1;
+
+	dir1 = opendir(PATH_SYS_NODE);
+	if (!dir1)
+		return 0;
+
+	/* walk tree and setup map */
+	while ((dent1 = readdir(dir1)) != NULL) {
+		if (dent1->d_type != DT_DIR ||
+		    sscanf(dent1->d_name, "node%u", &mem) < 1)
+			continue;
+
+		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
+		dir2 = opendir(buf);
+		if (!dir2)
+			continue;
+		while ((dent2 = readdir(dir2)) != NULL) {
+			if (dent2->d_type != DT_LNK ||
+			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
+				continue;
+			cpunode_map[cpu] = mem;
+		}
+		closedir(dir2);
+	}
+	closedir(dir1);
+	return 0;
+}
+
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
 #define PREFIX       "["
@@ -303,17 +515,120 @@ static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
 	return entry;
 }
 
+static int c2c_decode_stats(struct c2c_stats *stats, struct c2c_entry *entry)
+{
+	union perf_mem_data_src *data_src = &entry->mi->data_src;
+	u64 daddr = entry->mi->daddr.addr;
+	u64 weight = entry->weight;
+	int err = 0;
+
+	u64 op = data_src->mem_op;
+	u64 lvl = data_src->mem_lvl;
+	u64 snoop = data_src->mem_snoop;
+	u64 lock = data_src->mem_lock;
+
+#define P(a,b) PERF_MEM_##a##_##b
+
+	stats->nr_entries++;
+	stats->total_period += entry->period;
+
+	if (lock & P(LOCK,LOCKED)) stats->t.locks++;
+
+	if (op & P(OP,LOAD)) {
+		stats->t.load++;
+
+		if (!daddr) {
+			stats->t.ld_noadrs++;
+			return -1;
+		}
+
+		if (lvl & P(LVL,HIT)) {
+			if (lvl & P(LVL,UNC)) stats->t.ld_uncache++;
+			if (lvl & P(LVL,LFB)) stats->t.ld_fbhit++;
+			if (lvl & P(LVL,L1 )) stats->t.ld_l1hit++;
+			if (lvl & P(LVL,L2 )) stats->t.ld_l2hit++;
+			if (lvl & P(LVL,L3 )) {
+				if (snoop & P(SNOOP,HITM))
+					stats->t.lcl_hitm++;
+				else
+					stats->t.ld_llchit++;
+			}
+
+			if (lvl & P(LVL,LOC_RAM)) {
+				stats->t.lcl_dram++;
+				if (snoop & P(SNOOP,HIT))
+					stats->t.ld_shared++;
+				else
+					stats->t.ld_excl++;
+			}
+
+			if ((lvl & P(LVL,REM_RAM1)) ||
+			    (lvl & P(LVL,REM_RAM2))) {
+				stats->t.rmt_dram++;
+				if (snoop & P(SNOOP,HIT))
+					stats->t.ld_shared++;
+				else
+					stats->t.ld_excl++;
+			}
+		}
+
+		if ((lvl & P(LVL,REM_CCE1)) ||
+		    (lvl & P(LVL,REM_CCE2))) {
+			if (snoop & P(SNOOP, HIT))
+				stats->t.rmt_hit++;
+			else if (snoop & P(SNOOP, HITM)) {
+				stats->t.rmt_hitm++;
+				update_stats(&stats->stats, weight);
+			}
+		}
+
+	} else if (op & P(OP,STORE)) {
+		/* store */
+		stats->t.store++;
+
+		if (!daddr) {
+			stats->t.st_noadrs++;
+			return -1;
+		}
+
+		if (lvl & P(LVL,HIT)) {
+			if (lvl & P(LVL,UNC)) stats->t.st_uncache++;
+			if (lvl & P(LVL,L1 )) stats->t.st_l1hit++;
+		}
+		if (lvl & P(LVL,MISS))
+			if (lvl & P(LVL,L1)) stats->t.st_l1miss++;
+	} else {
+		/* unparsable data_src? */
+		return -1;
+	}
+
+	if (!entry->mi->daddr.map || !entry->mi->iaddr.map)
+		return -1;
+
+	return err;
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 				  struct perf_sample *sample __maybe_unused,
 				  struct c2c_entry *entry)
 {
+	int err = 0;
+
+	err = c2c_decode_stats(&c2c->stats, entry);
+	if (err < 0) {
+		err = 1;
+		goto err;
+	}
+	err = 0;
+
 	c2c_entry__add_to_list(c2c, entry);
 
 	/* don't lose the maps if remapped */
 	entry->mi->iaddr.map->referenced = true;
 	entry->mi->daddr.map->referenced = true;
 
-	return 0;
+err:
+	return err;
 }
 
 static const struct perf_evsel_str_handler handlers[] = {
@@ -403,6 +718,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 		goto out;
 	}
 
+	if (symbol__init() < 0)
+		goto out_delete;
+
 	if (perf_evlist__set_handlers(session->evlist, handlers))
 		goto out_delete;
 
@@ -416,6 +734,13 @@ out:
 
 static int perf_c2c__init(struct perf_c2c *c2c)
 {
+	/* setup cpu map */
+	if (setup_cpunode_map() < 0) {
+		pr_err("can not setup cpu map\n");
+		return -1;
+	}
+
+	CPU_ZERO(&c2c->stats.cpuset);
 	c2c->tree_physid = RB_ROOT;
 
 	return 0;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 11/21] perf, c2c: Sort based on hottest cache line
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (7 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 12/21] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Now that we have all the events sorted on a unique address, we can walk
the rbtree sequentially and count up all the HITMs for each cacheline
fairly easily.
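
(With the CACHE_LINESIZE/CLADRS/CLOFFSET macros below, "cacheline" means the 64-byte
aligned block containing the data address; e.g. a daddr of 0x7f63a025 falls in
cacheline 0x7f63a000 at offset 0x25.)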

Once we encounter an event on a different cacheline, we process the previous
cacheline.  That includes determining whether any HITMs were present on that
cacheline and, if so, adding it to another rbtree sorted on the number of HITMs.

This second rbtree, sorted on the number of HITMs, holds the interesting data
we want to report; it is displayed in a follow-up patch.

For now, organize the data properly.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 219 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 360fbcf..8b26ea2 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -59,6 +59,7 @@ struct perf_c2c {
 
 struct c2c_entry {
 	struct rb_node		rb_node;
+	struct list_head	scratch;  /* scratch list for resorting */
 	struct thread		*thread;
 	struct mem_info		*mi;
 	u32			cpu;
@@ -68,6 +69,25 @@ struct c2c_entry {
 	int			color;
 };
 
+#define CACHE_LINESIZE       64
+#define CLINE_OFFSET_MSK     (CACHE_LINESIZE - 1)
+#define CLADRS(a)            ((a) & ~(CLINE_OFFSET_MSK))
+#define CLOFFSET(a)          (int)((a) &  (CLINE_OFFSET_MSK))
+
+struct c2c_hit {
+	struct rb_node	rb_node;
+	struct rb_root  tree;
+	struct list_head list;
+	u64		cacheline;
+	int		color;
+	struct c2c_stats	stats;
+	pid_t		pid;
+	pid_t		tid;
+	u64		daddr;
+	u64		iaddr;
+	struct mem_info	*mi;
+};
+
 enum { OP, LVL, SNP, LCK, TLB };
 
 #define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
@@ -440,6 +460,44 @@ static struct c2c_entry *c2c_entry__add_to_list(struct perf_c2c *c2c, struct c2c
 	return entry;
 }
 
+static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_hit *he;
+	int64_t cmp;
+	u64 l_hitms, r_hitms;
+
+	p = &root->rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		he = rb_entry(parent, struct c2c_hit, rb_node);
+
+		/* sort on remote hitms first */
+		l_hitms = he->stats.t.rmt_hitm;
+		r_hitms = h->stats.t.rmt_hitm;
+		cmp = r_hitms - l_hitms;
+
+		if (!cmp) {
+			/* sort on local hitms */
+			l_hitms = he->stats.t.lcl_hitm;
+			r_hitms = h->stats.t.lcl_hitm;
+			cmp = r_hitms - l_hitms;
+		}
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&h->rb_node, parent, p);
+	rb_insert_color(&h->rb_node, root);
+
+	return 0;
+}
+
 static int perf_c2c__fprintf_header(FILE *fp)
 {
 	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
@@ -608,6 +666,50 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct c2c_entry *entry)
 	return err;
 }
 
+static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
+{
+	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+
+	if (!h) {
+		pr_err("Could not allocate c2c_hit memory\n");
+		return NULL;
+	}
+
+	CPU_ZERO(&h->stats.cpuset);
+	INIT_LIST_HEAD(&h->list);
+	init_stats(&h->stats.stats);
+	h->tree = RB_ROOT;
+	h->cacheline = cacheline;
+	h->pid = entry->thread->pid_;
+	h->tid = entry->thread->tid;
+
+	/* use original addresses here, not adjusted al_addr */
+	h->iaddr = entry->mi->iaddr.addr;
+	h->daddr = entry->mi->daddr.addr;
+
+	h->mi = entry->mi;
+	return h;
+}
+
+static void c2c_hit__update_strings(struct c2c_hit *h,
+				    struct c2c_entry *n)
+{
+	if (h->pid != n->thread->pid_)
+		h->pid = -1;
+
+	if (h->tid != n->thread->tid)
+		h->tid = -1;
+
+	/* use original addresses here, not adjusted al_addr */
+	if (h->iaddr != n->mi->iaddr.addr)
+		h->iaddr = -1;
+
+	if (CLADRS(h->daddr) != CLADRS(n->mi->daddr.addr))
+		h->daddr = -1;
+
+	CPU_SET(n->cpu, &h->stats.cpuset);
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 				  struct perf_sample *sample __maybe_unused,
 				  struct c2c_entry *entry)
@@ -692,6 +794,121 @@ err:
 	return err;
 }
 
+#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
+
+static void c2c_hit__update_stats(struct c2c_stats *new,
+				  struct c2c_stats *old)
+{
+	new->t.load		+= old->t.load;
+	new->t.ld_fbhit		+= old->t.ld_fbhit;
+	new->t.ld_l1hit		+= old->t.ld_l1hit;
+	new->t.ld_l2hit		+= old->t.ld_l2hit;
+	new->t.ld_llchit	+= old->t.ld_llchit;
+	new->t.locks		+= old->t.locks;
+	new->t.lcl_dram		+= old->t.lcl_dram;
+	new->t.rmt_dram		+= old->t.rmt_dram;
+	new->t.lcl_hitm		+= old->t.lcl_hitm;
+	new->t.rmt_hitm		+= old->t.rmt_hitm;
+	new->t.rmt_hit		+= old->t.rmt_hit;
+	new->t.store		+= old->t.store;
+	new->t.st_l1hit		+= old->t.st_l1hit;
+
+	new->total_period	+= old->total_period;
+}
+
+static void dump_tree_hitm(struct rb_root *tree,
+			   struct perf_c2c *c2c __maybe_unused)
+{
+	struct rb_node *next = rb_first(tree);
+	struct c2c_hit *h;
+
+	printf("%16s %8s %8s %8s\n",
+		"Cacheline", "nr", "loads", "stores");
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		printf("%16lx %8d %8d %8d\n",
+			h->cacheline,
+			h->stats.nr_entries,
+			h->stats.t.load,
+			h->stats.t.store);
+	}
+}
+
+static void c2c_analyze_hitms(struct perf_c2c *c2c)
+{
+
+	struct rb_node *next = rb_first(&c2c->tree_physid);
+	struct c2c_entry *n;
+	struct c2c_hit *h = NULL;
+	struct c2c_stats hitm_stats;
+	struct rb_root hitm_tree = RB_ROOT;
+	int shared_clines = 0;
+	u64 cl = 0;
+
+	memset(&hitm_stats, 0, sizeof(struct c2c_stats));
+
+	/* find HITMs */
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, rb_node);
+		next = rb_next(&n->rb_node);
+
+		cl = n->mi->daddr.al_addr;
+
+		/* switch cache line objects */
+		/* 'color' forces a boundary change based on the original sort */
+		if (!h || !n->color || (CLADRS(cl) != h->cacheline)) {
+			if (h && HAS_HITMS(h)) {
+				c2c_hit__update_stats(&hitm_stats, &h->stats);
+
+				/* sort based on hottest cacheline */
+				c2c_hitm__add_to_list(&hitm_tree, h);
+				shared_clines++;
+			} else {
+				/* stores-only are un-interesting */
+				free(h);
+			}
+			h = c2c_hit__new(CLADRS(cl), n);
+			if (!h)
+				goto cleanup;
+		}
+
+
+		c2c_decode_stats(&h->stats, n);
+
+		/* filter out non-hitms as un-interesting noise */
+		if (valid_hitm_or_store(&n->mi->data_src)) {
+			/* save the entry for later processing */
+			list_add_tail(&n->scratch, &h->list);
+
+			c2c_hit__update_strings(h, n);
+		}
+	}
+
+	/* last chunk */
+	if (HAS_HITMS(h)) {
+		c2c_hit__update_stats(&hitm_stats, &h->stats);
+		c2c_hitm__add_to_list(&hitm_tree, h);
+		shared_clines++;
+	} else
+		free(h);
+
+	if (verbose > 2)
+		dump_tree_hitm(&c2c->tree_hitm, c2c);
+
+cleanup:
+	next = rb_first(&hitm_tree);
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+		rb_erase(&h->rb_node, &hitm_tree);
+
+		free(h);
+	}
+	return;
+}
+
 static int perf_c2c__process_events(struct perf_session *session,
 				    struct perf_c2c *c2c)
 {
@@ -703,6 +920,8 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	c2c_analyze_hitms(c2c);
+
 err:
 	return err;
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 12/21] perf, c2c: Display cacheline HITM analysis to stdout
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (8 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 11/21] perf, c2c: Sort based on hottest cache line Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 13/21] perf, c2c: Add callchain support Don Zickus
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This patch mainly focuses on processing and displaying the collected
HITMs to stdout.  Most of it is just printing data in a pretty way.

There is one trick used when walking the cachelines.  By the time we get this
far we have two rbtrees: one holds every record sorted on a unique id (using
the mmap2 decoder), the other holds every cacheline with at least one HITM,
sorted on the number of HITMs.

To display the output, the tool walks the second rbtree to show the hottest
cachelines.  Inside each hot cacheline, the tool displays the offsets and the
loads/stores they generate.  To determine the cacheline offsets, it uses the
linked list inside the cacheline element to walk the first rbtree's elements
for that particular cacheline.

The first rbtree elements are already sorted correctly in offset order, so
processing the offsets is fairly trivial and is done sequentially.

This is why you will see two while loops in the print_c2c_hitm_report(),
the outer one walks the cachelines, the inner one walks the offsets.
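
In shape, the walk looks roughly like this (sketch only; the helper names are
placeholders and all the real formatting lives in print_c2c_hitm_report() in the
patch body):

	struct rb_node *next = rb_first(&hitm_tree);	/* hottest cachelines first */

	while (next) {
		struct c2c_hit *h = rb_entry(next, struct c2c_hit, rb_node);
		struct c2c_entry *entry;

		print_cacheline_header(h);			/* placeholder name */

		/* entries were queued on h->list in physid (i.e. offset) order */
		list_for_each_entry(entry, &h->list, scratch)
			print_cacheline_offset(h, entry);	/* placeholder name */

		next = rb_next(&h->rb_node);
	}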

A knob has been added to display node information, which is useful
to see which cpus are involved in the contention and their nodes.

Another knob has been added to change the coalescing levels.  You can
coalesce the output based on pid, tid, ip, and/or symbol.
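
For example (a hypothetical invocation, assuming the default non-record mode
simply reads the perf.data file, as wired up earlier in the series):

  perf c2c -i perf.data -N          # add per-node %hitm/%store numbers
  perf c2c -i perf.data -N -N       # show per-node cpu lists instead
  perf c2c -i perf.data -c 3        # coalesce offsets on data/inst address only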

Original output and statistics done by Dick Fowles, backported by me.

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 528 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 515 insertions(+), 13 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 8b26ea2..39fd233 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -69,10 +69,13 @@ struct c2c_entry {
 	int			color;
 };
 
+#define DISPLAY_LINE_LIMIT  0.0015
 #define CACHE_LINESIZE       64
 #define CLINE_OFFSET_MSK     (CACHE_LINESIZE - 1)
 #define CLADRS(a)            ((a) & ~(CLINE_OFFSET_MSK))
 #define CLOFFSET(a)          (int)((a) &  (CLINE_OFFSET_MSK))
+#define MAXTITLE_SZ          400
+#define MAXLBL_SZ            256
 
 struct c2c_hit {
 	struct rb_node	rb_node;
@@ -113,6 +116,11 @@ enum { OP, LVL, SNP, LCK, TLB };
 #define LCL_HITM(a,b)        (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
 #define LCL_MEM(a)           (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
 
+enum { LVL0, LVL1, LVL2, LVL3, LVL4, MAX_LVL };
+static int cloffset = LVL1;
+static int node_info = 0;
+static int coalesce_level = LVL1;
+
 static int			max_cpu_num;
 static int			max_node_num;
 static int			*cpunode_map;
@@ -710,6 +718,80 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
 	CPU_SET(n->cpu, &h->stats.cpuset);
 }
 
+static inline bool matching_coalescing(struct c2c_hit *h,
+				       struct c2c_entry *e)
+{
+	bool value = false;
+	struct mem_info *mi = e->mi;
+
+	if (coalesce_level > MAX_LVL)
+		printf("DON: bad coalesce level %d\n", coalesce_level);
+
+	if (e->cpumode != PERF_RECORD_MISC_KERNEL) {
+
+		switch (coalesce_level) {
+
+		case LVL0:
+		case LVL1:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->tid   == e->thread->tid) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL2:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL3:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL4:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->mi->iaddr.sym == mi->iaddr.sym));
+			break;
+
+		default:
+			break;
+
+		}
+
+	} else {
+
+		switch (coalesce_level) {
+
+		case LVL0:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->tid   == e->thread->tid) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL1:
+		case LVL2:
+		case LVL3:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL4:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->mi->iaddr.sym == mi->iaddr.sym));
+			break;
+
+		default:
+			break;
+
+		}
+	}
+
+	return value;
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 				  struct perf_sample *sample __maybe_unused,
 				  struct c2c_entry *entry)
@@ -816,26 +898,442 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
-static void dump_tree_hitm(struct rb_root *tree,
-			   struct perf_c2c *c2c __maybe_unused)
+static void print_hitm_cacheline_header(void)
+{
+#define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
+#define PARTICIPANTS1         "Node{cpus %hitms %stores} Node{cpus %hitms %stores} ..."
+#define PARTICIPANTS2         "Node{cpu list}; Node{cpu list}; Node{cpu list}; ..."
+
+	int i;
+	const  char *docptr;
+	static char  delimit[MAXTITLE_SZ];
+	static char  title2[MAXTITLE_SZ];
+	int       pad;
+
+	docptr = " ";
+	if (node_info == 1)
+		docptr = PARTICIPANTS1;
+	if (node_info  == 2)
+		docptr = PARTICIPANTS2;
+
+	sprintf(title2, "%4s %6s %6s %6s %6s %8s %8s %8s %8s %18s %6s %6s %18s %8s %8s %8s %6s %-30s %-20s %s",
+			"Num",
+			"%dist",
+			"%cumm",
+			"%dist",
+			"%cumm",
+			"LLCmiss",
+			"LLChit",
+			"L1 hit",
+			"L1 Miss",
+			"Data Address",
+			"Pid",
+			"Tid",
+			"Inst Address",
+			"median",
+			"mean",
+			"CV  ",
+			"cnt",
+			"Symbol",
+			"Object",
+			docptr);
+
+	for (i = 0; i < (int)strlen(title2); i++) strcat(delimit, "=");
+
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("\n");
+
+	pad = (strlen(title2)/2) - (strlen(SHARING_REPORT_TITLE)/2);
+	for (i = 0; i < pad; i++) printf(" ");
+	printf("%s\n", SHARING_REPORT_TITLE);
+
+	printf("\n");
+	printf("%4s %13s %13s %17s %8s %8s %18s %6s %6s %18s %26s %6s %30s %20s %s\n",
+		" ",
+		"---- All ----",
+		"-- Shared --",
+		"---- HITM ----",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		"Load Inst Execute Latency",
+		" ",
+		" ",
+		" ",
+		node_info  ? "Shared Data Participants" : " ");
+
+
+	printf("%4s %13s %13s %8s %8s %17s %18s %6s %6s %17s %18s\n",
+		" ",
+		" Data Misses",
+		" Data Misses",
+		"Remote",
+		"Local",
+		"-- Store Refs --",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ");
+
+	printf("%4s %13s %13s %8s %8s %8s %8s %18s %6s %6s %17s %18s %8s %6s\n",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		"---- cycles ----",
+		" ",
+		"cpu");
+
+	printf("%s\n", title2);
+	printf("%s\n", delimit);
+}
+
+static void print_hitm_cacheline(struct c2c_hit *h,
+				 int record,
+				 double tot_cumm,
+				 double ld_cumm,
+				 double tot_dist,
+				 double ld_dist)
+{
+	char pidstr[7];
+	char addrstr[20];
+	static char  summary[MAXLBL_SZ];
+	int j;
+
+	if (h->pid > 0)
+		sprintf(pidstr, "%6d", h->pid);
+	else
+		sprintf(pidstr, "***");
+	/*
+	 * It is possible to have non-distinct virtual addresses
+	 * pointing to a distinct System V shared memory region.
+	 * If there are multiple virtual addresses, the address
+	 * field will be asterisks.  It would be possible to substitute
+	 * the physical address, but this could be confusing as
+	 * sometimes the field is a virtual address while other times
+	 * it may be a physical address, which may lead to confusion.
+	 */
+	if (h->daddr != ~0UL)
+		sprintf(addrstr, "%#18lx", CLADRS(h->daddr));
+	else
+		sprintf(addrstr, "****************");
+
+
+	sprintf(summary, "%4d %5.1f%% %5.1f%% %5.1f%% %5.1f%% %8d %8d %8d %8d %18s %6s\n",
+			record,
+			tot_dist * 100.,
+			tot_cumm * 100.,
+			ld_dist * 100.,
+			ld_cumm * 100.,
+			h->stats.t.rmt_hitm,
+			h->stats.t.lcl_hitm,
+			h->stats.t.st_l1hit,
+			h->stats.t.st_l1miss,
+			addrstr,
+			pidstr);
+
+	for (j = 0; j < (int)strlen(summary); j++) printf("-");
+	printf("\n");
+	printf("%s", summary);
+	for (j = 0; j < (int)strlen(summary); j++) printf("-");
+	printf("\n");
+}
+
+static void print_socket_stats_str(struct c2c_hit *clo,
+				   struct c2c_stats *node_stats)
 {
-	struct rb_node *next = rb_first(tree);
-	struct c2c_hit *h;
+	int i, j;
+
+	if (!node_stats)
+		return;
+
+	for (i = 0; i < max_node_num; i++) {
+		struct c2c_stats *stats = &node_stats[i];
+		int num = CPU_COUNT(&stats->cpuset);
+
+		if (!num) {
+			/* pad align socket info */
+			for (j = 0; j < 21; j++)
+				printf(" ");
+			continue;
+		}
+
+		printf("%2d{%2d ", i, num);
+
+		if (clo->stats.t.rmt_hitm > 0)
+			printf("%5.1f%% ", 100. * ((double)stats->t.rmt_hitm / (double) clo->stats.t.rmt_hitm));
+		else
+			printf("%6s ", "n/a");
+
+		if (clo->stats.t.store > 0)
+			printf("%5.1f%%} ", 100. * ((double)stats->t.store / (double)clo->stats.t.store));
+		else
+			printf("%6s} ", "n/a");
+	}
+}
+
+static void print_socket_shared_str(struct c2c_stats *node_stats)
+{
+	int i, j;
+
+	if (!node_stats)
+		return;
+
+	for (i = 0; i < max_node_num; i++) {
+		struct c2c_stats *stats = &node_stats[i];
+		int num = CPU_COUNT(&stats->cpuset);
+		int start = -1;
+		bool first = true;
+
+		if (!num)
+			continue;
+
+		printf("%d{", i);
+
+		for (j = 0; j < max_cpu_num; j++) {
+			if (!CPU_ISSET(j, &stats->cpuset)) {
+				if (start != -1) {
+					if ((j-1) - start)
+						/* print the range */
+						printf("%s%d-%d", (first ? "" : ","), start, j-1);
+					else
+						/* standalone */
+						printf("%s%d", (first ? "" : ",") , start);
+					start = -1;
+					first = false;
+				}
+				continue;
+			}
+
+			if (start == -1)
+				start = j;
+		}
+		/* last chunk */
+		if (start != -1) {
+			if ((j-1) - start)
+				/* print the range */
+				printf("%s%d-%d", (first ? "" : ","), start, j-1);
+			else
+				/* standalone */
+				printf("%s%d", (first ? "" : ",") , start);
+		}
+
+		printf("}; ");
+	}
+}
+
+static void print_hitm_cacheline_offset(struct c2c_hit *clo,
+					struct c2c_hit *h,
+					struct c2c_stats *node_stats)
+{
+#define SHORT_STR_LEN	7
+#define LONG_STR_LEN	30
+
+	char pidstr[SHORT_STR_LEN];
+	char tidstr[SHORT_STR_LEN];
+	char addrstr[LONG_STR_LEN];
+	char latstr[LONG_STR_LEN];
+	char objptr[LONG_STR_LEN];
+	char symptr[LONG_STR_LEN];
+	struct c2c_stats *stats = &clo->stats;
+	struct addr_map_symbol *ams;
+
+	ams = &clo->mi->iaddr;
+
+	if (clo->pid >= 0)
+		snprintf(pidstr, SHORT_STR_LEN, "%6d", clo->pid);
+	else
+		sprintf(pidstr, "***");
+
+	if (clo->tid >= 0)
+		snprintf(tidstr, SHORT_STR_LEN, "%6d", clo->tid);
+	else
+		sprintf(tidstr, "***");
+
+	if (clo->iaddr != ~0UL)
+		snprintf(addrstr, LONG_STR_LEN, "%#18lx", clo->iaddr);
+	else
+		sprintf(addrstr, "****************");
+	snprintf(objptr, LONG_STR_LEN, "%-18s", ams->map->dso->short_name);
+	snprintf(symptr, LONG_STR_LEN, "%-18s", (ams->sym ? ams->sym->name : "?????"));
+
+	if (stats->t.rmt_hitm > 0) {
+		double mean = avg_stats(&stats->stats);
+		double std = stddev_stats(&stats->stats);
+
+		sprintf(latstr, "%8.0f %8.0f %7.1f%%",
+			-1.0, /* FIXME */
+			mean,
+			rel_stddev_stats(std, mean));
+	} else {
+		sprintf(latstr, "%8s %8s %8s",
+			"n/a",
+			"n/a",
+			"n/a");
+
+	}
+
+	/*
+	 * implicit assumption that we are not coalescing over IPs
+	 */
+	printf("%4s %6s %6s %6s %6s %7.1f%% %7.1f%% %7.1f%% %7.1f%% %14s0x%02lx %6s %6s %18s %8s %6d %-30s %-20s",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		(stats->t.rmt_hitm  > 0) ? (100. * ((double)stats->t.rmt_hitm  / (double)h->stats.t.rmt_hitm))  : 0.0,
+		(stats->t.lcl_hitm  > 0) ? (100. * ((double)stats->t.lcl_hitm  / (double)h->stats.t.lcl_hitm))  : 0.0,
+		(stats->t.st_l1hit  > 0) ? (100. * ((double)stats->t.st_l1hit  / (double)h->stats.t.st_l1hit))  : 0.0,
+		(stats->t.st_l1miss > 0) ? (100. * ((double)stats->t.st_l1miss / (double)h->stats.t.st_l1miss)) : 0.0,
+		" ",
+		(cloffset == LVL2) ? (clo->daddr & 0xff) : CLOFFSET(clo->daddr),
+		pidstr,
+		tidstr,
+		addrstr,
+		latstr,
+		CPU_COUNT(&clo->stats.cpuset),
+		symptr,
+		objptr);
+
+	if (node_info == 0)
+		printf("  ");
+	else if (node_info  == 1)
+		print_socket_stats_str(clo, node_stats);
+	else if (node_info  == 2)
+		print_socket_shared_str(node_stats);
+
+	printf("\n");
+}
+
+static void print_c2c_hitm_report(struct rb_root *hitm_tree,
+				  struct c2c_stats *hitm_stats __maybe_unused,
+				  struct c2c_stats *c2c_stats)
+{
+	struct rb_node	*next = rb_first(hitm_tree);
+	struct c2c_hit	*h, *clo = NULL;
+	u64		addr;
+	double		tot_dist, tot_cumm;
+	double		ld_dist, ld_cumm;
+	int		llc_misses;
+	int		record = 0;
+	struct c2c_stats *node_stats = NULL;
+
+	if (node_info) {
+		node_stats = zalloc(sizeof(struct c2c_stats) * max_node_num);
+		if (!node_stats) {
+			printf("Can not allocate stats for node output\n");
+			return;
+		}
+	}
+
+	print_hitm_cacheline_header();
+
+	llc_misses = c2c_stats->t.lcl_dram +
+		     c2c_stats->t.rmt_dram +
+		     c2c_stats->t.rmt_hit +
+		     c2c_stats->t.rmt_hitm;
+
+	/*
+	 * generate distinct cache line report
+	 */
+	tot_cumm = 0.0;
+	ld_cumm  = 0.0;
 
-	printf("%16s %8s %8s %8s\n",
-		"Cacheline", "nr", "loads", "stores");
 	while (next) {
+		struct c2c_entry *entry;
+
 		h = rb_entry(next, struct c2c_hit, rb_node);
 		next = rb_next(&h->rb_node);
 
-		printf("%16lx %8d %8d %8d\n",
-			h->cacheline,
-			h->stats.nr_entries,
-			h->stats.t.load,
-			h->stats.t.store);
+		tot_dist  = ((double)h->stats.t.rmt_hitm / llc_misses);
+		tot_cumm += tot_dist;
+
+		ld_dist  = ((double)h->stats.t.rmt_hitm / c2c_stats->t.rmt_hitm);
+		ld_cumm += ld_dist;
+
+		/*
+		 * don't display lines with insignificant sharing contribution
+		 */
+		if (ld_dist < DISPLAY_LINE_LIMIT)
+			break;
+
+		print_hitm_cacheline(h, record, tot_cumm, ld_cumm, tot_dist, ld_dist);
+
+		list_for_each_entry(entry, &h->list, scratch) {
+
+			if (!clo || !matching_coalescing(clo, entry)) {
+				if (clo)
+					print_hitm_cacheline_offset(clo, h, node_stats);
+
+				free(clo);
+				addr = entry->mi->iaddr.al_addr;
+				clo = c2c_hit__new(addr, entry);
+				if (node_info)
+					memset(node_stats, 0, sizeof(struct c2c_stats) * max_node_num);
+			}
+			c2c_decode_stats(&clo->stats, entry);
+			c2c_hit__update_strings(clo, entry);
+
+			if (node_info) {
+				int node = cpunode_map[entry->cpu];
+				c2c_decode_stats(&node_stats[node], entry);
+				CPU_SET(entry->cpu, &(node_stats[node].cpuset));
+			}
+
+		}
+		if (clo) {
+			print_hitm_cacheline_offset(clo, h, node_stats);
+			free(clo);
+			clo = NULL;
+		}
+
+		if (node_info)
+			memset(node_stats, 0, sizeof(struct c2c_stats) * max_node_num);
+
+		printf("\n");
+		record++;
 	}
 }
 
+static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
+{
+	return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
+		(dsrc->mem_op & P(OP,STORE)));
+}
+
+static void print_shared_cacheline_info(struct c2c_stats *stats, int cline_cnt)
+{
+	int hitm_cnt = stats->t.lcl_hitm + stats->t.rmt_hitm;
+
+	printf("=================================================\n");
+	printf("    Global Shared Cache Line Event Information   \n");
+	printf("=================================================\n");
+	printf("  Total Shared Cache Lines          : %10d\n", cline_cnt);
+	printf("  Load HITs on shared lines         : %10d\n", stats->t.load);
+	printf("  Fill Buffer Hits on shared lines  : %10d\n", stats->t.ld_fbhit);
+	printf("  L1D hits on shared lines          : %10d\n", stats->t.ld_l1hit);
+	printf("  L2D hits on shared lines          : %10d\n", stats->t.ld_l2hit);
+	printf("  LLC hits on shared lines          : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+	printf("  Locked Access on shared lines     : %10d\n", stats->t.locks);
+	printf("  Store HITs on shared lines        : %10d\n", stats->t.store);
+	printf("  Store L1D hits on shared lines    : %10d\n", stats->t.st_l1hit);
+	printf("  Total Merged records              : %10d\n", hitm_cnt + stats->t.store);
+}
+
 static void c2c_analyze_hitms(struct perf_c2c *c2c)
 {
 
@@ -894,8 +1392,8 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
 	} else
 		free(h);
 
-	if (verbose > 2)
-		dump_tree_hitm(&c2c->tree_hitm, c2c);
+	print_shared_cacheline_info(&hitm_stats, shared_clines);
+	print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
 
 cleanup:
 	next = rb_first(&hitm_tree);
@@ -1026,6 +1524,10 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	};
 	const struct option c2c_options[] = {
 	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+	OPT_INCR('N', "node-info", &node_info,
+		 "show extra node info in report (repeat for more info)"),
+	OPT_INTEGER('c', "coalesce-level", &coalesce_level,
+		 "how much coalescing for tid, pid, and ip is done (repeat for more coalescing)"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show counter open errors, etc)"),
 	OPT_STRING('i', "input", &input_name, "file",
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 13/21] perf, c2c: Add callchain support
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (9 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 12/21] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-18 13:07   ` Jiri Olsa
  2014-02-10 17:29 ` [PATCH 14/21] perf, c2c: Output summary stats Don Zickus
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Seeing cacheline statistics is useful by itself.  Seeing the callchain
for these cache contentions saves time tracking things down.

This patch tries to add callchain support.  I had to use the generic
interface from a previous patch to output things to stdout easily.

Other than displaying the results, collecting the callchain and
merging it was fairly straightforward.

I used a lot of copying-n-pasting from other builtin tools to get
the initial parameter setup correct and the automatic reading of
'symbol_conf.use_callchain' from the data file.

Hopefully this is all correct.  The amount of memory corruption (from the
callchain dynamic array) seems to have dwindled down to nothing. :-)
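
The 'dynamic array' is a zero-length struct callchain_root member kept as the
last field of the entry/hit structures, so the allocation just tacks the extra
space on when callchains are enabled.  A condensed sketch of what
c2c_hit__new() does in the hunk below:

  size_t extra = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
  struct c2c_hit *h = zalloc(sizeof(*h) + extra);  /* callchain[0] must stay last */

  if (h && symbol_conf.use_callchain)
          callchain_init(h->callchain);

Per-sample chains are resolved with machine__resolve_callchain() and appended
with callchain_append(); at report time they are merged per offset with
callchain_merge() before being printed.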

Suggested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 159 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 39fd233..047fe26 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -49,6 +49,7 @@ struct c2c_stats {
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
+	bool		 call_graph;
 	struct rb_root   tree_physid;
 
 	/* stats */
@@ -67,6 +68,8 @@ struct c2c_entry {
 	int			weight;
 	int			period;
 	int			color;
+
+	struct callchain_root   callchain[0]; /* must be last member */
 };
 
 #define DISPLAY_LINE_LIMIT  0.0015
@@ -89,6 +92,8 @@ struct c2c_hit {
 	u64		daddr;
 	u64		iaddr;
 	struct mem_info	*mi;
+
+	struct callchain_root   callchain[0]; /* must be last member */
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -676,7 +681,8 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct c2c_entry *entry)
 
 static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
 {
-	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+	size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit) + callchain_size);
 
 	if (!h) {
 		pr_err("Could not allocate c2c_hit memory\n");
@@ -690,6 +696,8 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
 	h->cacheline = cacheline;
 	h->pid = entry->thread->pid_;
 	h->tid = entry->thread->tid;
+	if (symbol_conf.use_callchain)
+		callchain_init(h->callchain);
 
 	/* use original addresses here, not adjusted al_addr */
 	h->iaddr = entry->mi->iaddr.addr;
@@ -834,6 +842,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
 	struct mem_info *mi;
 	struct thread *thread;
+	struct symbol *parent = NULL;
 	struct c2c_entry *entry;
 	sample_handler f;
 	int err = -1;
@@ -864,6 +873,19 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 	if (err)
 		goto err_entry;
 
+	/* attach callchain if everything is good */
+	if (symbol_conf.use_callchain && sample->callchain) {
+		callchain_init(entry->callchain);
+
+		err = machine__resolve_callchain(machine, evsel, thread,
+						 sample, &parent, NULL);
+		if (!err)
+			err = callchain_append(entry->callchain,
+					       &callchain_cursor,
+					       entry->period);
+		if (err)
+			pr_err("Could not attach callchain, skipping\n");
+	}
 	return 0;
 
 err_entry:
@@ -1217,6 +1239,13 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
 		print_socket_shared_str(node_stats);
 
 	printf("\n");
+
+	if (symbol_conf.use_callchain) {
+		generic_entry_callchain__fprintf(clo->callchain,
+						 h->stats.total_period,
+						 clo->stats.total_period,
+						 23, stdout);
+	}
 }
 
 static void print_c2c_hitm_report(struct rb_root *hitm_tree,
@@ -1293,6 +1322,12 @@ static void print_c2c_hitm_report(struct rb_root *hitm_tree,
 				c2c_decode_stats(&node_stats[node], entry);
 				CPU_SET(entry->cpu, &(node_stats[node].cpuset));
 			}
+			if (symbol_conf.use_callchain) {
+				callchain_cursor_reset(&callchain_cursor);
+				callchain_merge(&callchain_cursor,
+						clo->callchain,
+						entry->callchain);
+			}
 
 		}
 		if (clo) {
@@ -1424,6 +1459,30 @@ err:
 	return err;
 }
 
+static int perf_c2c__setup_sample_type(struct perf_c2c *c2c,
+				       struct perf_session *session)
+{
+	u64 sample_type = perf_evlist__combined_sample_type(session->evlist);
+
+	if (!(sample_type & PERF_SAMPLE_CALLCHAIN)) {
+		if (symbol_conf.use_callchain) {
+			printf("Selected -g but no callchain data. Did "
+				  "you call 'perf c2c record' without -g?\n");
+			return -1;
+		}
+	} else if (callchain_param.mode != CHAIN_NONE &&
+		   !symbol_conf.use_callchain) {
+			symbol_conf.use_callchain = true;
+			c2c->call_graph = true;
+			if (callchain_register_param(&callchain_param) < 0) {
+				printf("Can't register callchain params.\n");
+				return -EINVAL;
+			}
+	}
+
+	return 0;
+}
+
 static int perf_c2c__read_events(struct perf_c2c *c2c)
 {
 	int err = -1;
@@ -1438,6 +1497,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 	if (symbol__init() < 0)
 		goto out_delete;
 
+	if (perf_c2c__setup_sample_type(c2c, session) < 0)
+		goto out_delete;
+
 	if (perf_evlist__set_handlers(session->evlist, handlers))
 		goto out_delete;
 
@@ -1508,8 +1570,101 @@ static int perf_c2c__record(int argc, const char **argv)
 	return cmd_record(i, rec_argv, NULL);
 }
 
+static int
+opt_callchain_cb(const struct option *opt, const char *arg, int unset)
+{
+	struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
+	char *tok, *tok2;
+	char *endptr;
+
+	/*
+	 * --no-call-graph
+	 */
+	if (unset) {
+		c2c->call_graph = false;
+		return 0;
+	}
+
+	symbol_conf.use_callchain = true;
+	c2c->call_graph = true;
+
+	if (!arg)
+		return 0;
+
+	tok = strtok((char *)arg, ",");
+	if (!tok)
+		return -1;
+
+	/* get the output mode */
+	if (!strncmp(tok, "graph", strlen(arg)))
+		callchain_param.mode = CHAIN_GRAPH_ABS;
+
+	else if (!strncmp(tok, "flat", strlen(arg)))
+		callchain_param.mode = CHAIN_FLAT;
+
+	else if (!strncmp(tok, "fractal", strlen(arg)))
+		callchain_param.mode = CHAIN_GRAPH_REL;
+
+	else if (!strncmp(tok, "none", strlen(arg))) {
+		callchain_param.mode = CHAIN_NONE;
+		symbol_conf.use_callchain = false;
+
+		return 0;
+	}
+
+	else
+		return -1;
+
+	/* get the min percentage */
+	tok = strtok(NULL, ",");
+	if (!tok)
+		goto setup;
+
+	callchain_param.min_percent = strtod(tok, &endptr);
+	if (tok == endptr)
+		return -1;
+
+	/* get the print limit */
+	tok2 = strtok(NULL, ",");
+	if (!tok2)
+		goto setup;
+
+	if (tok2[0] != 'c') {
+		callchain_param.print_limit = strtoul(tok2, &endptr, 0);
+		tok2 = strtok(NULL, ",");
+		if (!tok2)
+			goto setup;
+	}
+
+	/* get the call chain order */
+	if (!strncmp(tok2, "caller", strlen("caller")))
+		callchain_param.order = ORDER_CALLER;
+	else if (!strncmp(tok2, "callee", strlen("callee")))
+		callchain_param.order = ORDER_CALLEE;
+	else
+		return -1;
+
+	/* Get the sort key */
+	tok2 = strtok(NULL, ",");
+	if (!tok2)
+		goto setup;
+	if (!strncmp(tok2, "function", strlen("function")))
+		callchain_param.key = CCKEY_FUNCTION;
+	else if (!strncmp(tok2, "address", strlen("address")))
+		callchain_param.key = CCKEY_ADDRESS;
+	else
+		return -1;
+setup:
+	if (callchain_register_param(&callchain_param) < 0) {
+		fprintf(stderr, "Can't register callchain params\n");
+		return -1;
+	}
+	return 0;
+}
+
 int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 {
+	char callchain_default_opt[] = "fractal,0.05,callee";
 	struct perf_c2c c2c = {
 		.tool = {
 			.sample		 = perf_c2c__process_sample,
@@ -1536,6 +1691,9 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		   "separator",
 		   "separator for columns, no spaces will be added"
 		   " between columns '.' is reserved."),
+	OPT_CALLBACK_DEFAULT('g', "call-graph", &c2c, "output_type,min_percent[,print_limit],call_order",
+			     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
+			     "Default: fractal,0.5,callee,function", &opt_callchain_cb, callchain_default_opt),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 14/21] perf, c2c: Output summary stats
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (10 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 13/21] perf, c2c: Add callchain support Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 15/21] perf, c2c: Dump rbtree for debugging Don Zickus
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Output some summary stats based on the processed records.  Mainly for
diagnostic use.

Stats done by Dick Fowles, backported by me.

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 047fe26..c8e76dc 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -1247,7 +1247,6 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
 						 23, stdout);
 	}
 }
-
 static void print_c2c_hitm_report(struct rb_root *hitm_tree,
 				  struct c2c_stats *hitm_stats __maybe_unused,
 				  struct c2c_stats *c2c_stats)
@@ -1442,6 +1441,48 @@ cleanup:
 	return;
 }
 
+static void print_c2c_trace_report(struct perf_c2c *c2c)
+{
+	int llc_misses;
+	struct c2c_stats *stats = &c2c->stats;
+
+	llc_misses = stats->t.lcl_dram +
+		     stats->t.rmt_dram +
+		     stats->t.rmt_hit +
+		     stats->t.rmt_hitm;
+
+	printf("=================================================\n");
+	printf("            Trace Event Information              \n");
+	printf("=================================================\n");
+	printf("  Total records                     : %10d\n", c2c->stats.nr_entries);
+	printf("  Locked Load/Store Operations      : %10d\n", stats->t.locks);
+	printf("  Load Operations                   : %10d\n", stats->t.load);
+	printf("  Loads - uncacheable               : %10d\n", stats->t.ld_uncache);
+	printf("  Loads - no mapping                : %10d\n", stats->t.ld_noadrs);
+	printf("  Load Fill Buffer Hit              : %10d\n", stats->t.ld_fbhit);
+	printf("  Load L1D hit                      : %10d\n", stats->t.ld_l1hit);
+	printf("  Load L2D hit                      : %10d\n", stats->t.ld_l2hit);
+	printf("  Load LLC hit                      : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+	printf("  Load Local HITM                   : %10d\n", stats->t.lcl_hitm);
+	printf("  Load Remote HITM                  : %10d\n", stats->t.rmt_hitm);
+	printf("  Load Remote HIT                   : %10d\n", stats->t.rmt_hit);
+	printf("  Load Local DRAM                   : %10d\n", stats->t.lcl_dram);
+	printf("  Load Remote DRAM                  : %10d\n", stats->t.rmt_dram);
+	printf("  Load MESI State Exclusive         : %10d\n", stats->t.ld_excl);
+	printf("  Load MESI State Shared            : %10d\n", stats->t.ld_shared);
+	printf("  Load LLC Misses                   : %10d\n", llc_misses);
+	printf("  LLC Misses to Local DRAM          : %10.1f%%\n", ((double)stats->t.lcl_dram/(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote DRAM         : %10.1f%%\n", ((double)stats->t.rmt_dram/(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote cache (HIT)  : %10.1f%%\n", ((double)stats->t.rmt_hit /(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote cache (HITM) : %10.1f%%\n", ((double)stats->t.rmt_hitm/(double)llc_misses) * 100.);
+	printf("  Store Operations                  : %10d\n", stats->t.store);
+	printf("  Store - uncacheable               : %10d\n", stats->t.st_uncache);
+	printf("  Store - no mapping                : %10d\n", stats->t.st_noadrs);
+	printf("  Store L1D Hit                     : %10d\n", stats->t.st_l1hit);
+	printf("  Virt -> Phys Remap Rejects        : %10d\n", stats->t.remap);
+	printf("  No Page Map Rejects               : %10d\n", stats->t.nomap);
+}
+
 static int perf_c2c__process_events(struct perf_session *session,
 				    struct perf_c2c *c2c)
 {
@@ -1453,6 +1494,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
 
 err:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 15/21] perf, c2c: Dump rbtree for debugging
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (11 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 14/21] perf, c2c: Output summary stats Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 16/21] perf, c2c: Fixup tid because of perf map is broken Don Zickus
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Sometimes you want to verify the rbtree sorting on a unique id
is working correctly.  This allows you to dump it.
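
The dump is gated on verbose > 2, so it takes three -v flags, e.g. (again
assuming the default analysis mode reads perf.data):

  perf c2c -i perf.data -v -v -v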

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c8e76dc..0760f6a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -900,6 +900,34 @@ err:
 
 #define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
 
+static void dump_rb_tree(struct rb_root *tree,
+			 struct perf_c2c *c2c __maybe_unused)
+{
+	struct rb_node *next = rb_first(tree);
+	struct c2c_entry *n;
+
+	printf("%3s %3s %8s %8s %6s %16s %16s %16s %16s %16s %8s\n",
+		"Maj", "Min", "Ino", "InoGen", "Pid", "Start",
+		"Vaddr", "al_addr", "ip addr", "pgoff", "cpumode");
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, rb_node);
+		next = rb_next(&n->rb_node);
+
+		printf("%3x %3x %8lx %8lx %6d %16lx %16lx %16lx %16lx %16lx %8x\n",
+			n->mi->daddr.map->maj,
+			n->mi->daddr.map->min,
+			n->mi->daddr.map->ino,
+			n->mi->daddr.map->ino_generation,
+			n->thread->pid_,
+			n->mi->daddr.map->start,
+			n->mi->daddr.addr,
+			n->mi->daddr.al_addr,
+			n->mi->iaddr.al_addr,
+			n->mi->daddr.map->pgoff,
+			n->cpumode);
+	}
+}
+
 static void c2c_hit__update_stats(struct c2c_stats *new,
 				  struct c2c_stats *old)
 {
@@ -1494,6 +1522,8 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	if (verbose > 2)
+		dump_rb_tree(&c2c->tree_physid, c2c);
 	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 16/21] perf, c2c: Fixup tid because of perf map is broken
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (12 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 15/21] perf, c2c: Dump rbtree for debugging Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 17/21] perf, c2c: Add symbol count table Don Zickus
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

When perf tries to load the initial mmaps, it grabs data from /proc/<pid>/maps files
and feeds them into perf as synthesized MMAP events.  Unfortunately, when the system
has threads running, those thread maps are not generated (because perf doesn't know the
history of the fork events leading to the threads).

As a result, when trying to map a data source address (not the IP) to a thread map, the
lookup fails and the map returns NULL.  Feeding perf the pid instead gets us the correct
map; however, the TID stored in the thread struct is now incorrect (it holds a PID where
a TID should be).

That breaks any cache-to-cache contention analysis, so for now save the TID for local
use and do not rely on the TID in perf's thread maps.
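
In code terms the workaround is small; roughly (as in the hunks below):

  /* look the thread up by pid so the synthesized maps resolve... */
  thread = machine__find_thread(machine, sample->pid);

  /* ...but remember the real tid ourselves; thread->tid is not reliable here */
  entry->tid = sample->tid;

All later comparisons (physid_cmp(), the coalescing checks) then use entry->tid
instead of entry->thread->tid.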

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 0760f6a..32c2319 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -62,6 +62,7 @@ struct c2c_entry {
 	struct rb_node		rb_node;
 	struct list_head	scratch;  /* scratch list for resorting */
 	struct thread		*thread;
+	int			tid;  /* FIXME perf maps broken */
 	struct mem_info		*mi;
 	u32			cpu;
 	u8			cpumode;
@@ -397,8 +398,8 @@ static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
 		if (left->thread->pid_ > right->thread->pid_) return 1;
 		if (left->thread->pid_ < right->thread->pid_) return -1;
 
-		if (left->thread->tid > right->thread->tid) return 1;
-		if (left->thread->tid < right->thread->tid) return -1;
+		if (left->tid > right->tid) return 1;
+		if (left->tid < right->tid) return -1;
 	} else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
 		/* kernel mapped areas where 'start' doesn't matter */
 
@@ -416,15 +417,15 @@ static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
 		if (left->thread->pid_ > right->thread->pid_) return 1;
 		if (left->thread->pid_ < right->thread->pid_) return -1;
 
-		if (left->thread->tid > right->thread->tid) return 1;
-		if (left->thread->tid < right->thread->tid) return -1;
+		if (left->tid > right->tid) return 1;
+		if (left->tid < right->tid) return -1;
 	} else {
 		/* userspace anonymous */
 		if (left->thread->pid_ > right->thread->pid_) return 1;
 		if (left->thread->pid_ < right->thread->pid_) return -1;
 
-		if (left->thread->tid > right->thread->tid) return 1;
-		if (left->thread->tid < right->thread->tid) return -1;
+		if (left->tid > right->tid) return 1;
+		if (left->tid < right->tid) return -1;
 
 		/* hack to mark similar regions, 'right' is new entry */
 		/* userspace anonymous address space is contained within pid */
@@ -574,6 +575,7 @@ static struct c2c_entry *c2c_entry__new(struct perf_sample *sample,
 	if (entry != NULL) {
 		entry->thread = thread;
 		entry->mi = mi;
+		entry->tid = sample->tid;
 		entry->cpu = sample->cpu;
 		entry->cpumode = cpumode;
 		entry->weight = sample->weight;
@@ -695,7 +697,7 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct c2c_entry *entry)
 	h->tree = RB_ROOT;
 	h->cacheline = cacheline;
 	h->pid = entry->thread->pid_;
-	h->tid = entry->thread->tid;
+	h->tid = entry->tid;
 	if (symbol_conf.use_callchain)
 		callchain_init(h->callchain);
 
@@ -713,7 +715,7 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
 	if (h->pid != n->thread->pid_)
 		h->pid = -1;
 
-	if (h->tid != n->thread->tid)
+	if (h->tid != n->tid)
 		h->tid = -1;
 
 	/* use original addresses here, not adjusted al_addr */
@@ -743,7 +745,7 @@ static inline bool matching_coalescing(struct c2c_hit *h,
 		case LVL1:
 			value = ((h->daddr == mi->daddr.addr) &&
 				 (h->pid   == e->thread->pid_) &&
-				 (h->tid   == e->thread->tid) &&
+				 (h->tid   == e->tid) &&
 				 (h->iaddr == mi->iaddr.addr));
 			break;
 
@@ -775,7 +777,7 @@ static inline bool matching_coalescing(struct c2c_hit *h,
 		case LVL0:
 			value = ((h->daddr == mi->daddr.addr) &&
 				 (h->pid   == e->thread->pid_) &&
-				 (h->tid   == e->thread->tid) &&
+				 (h->tid   == e->tid) &&
 				 (h->iaddr == mi->iaddr.addr));
 			break;
 
@@ -850,7 +852,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 	if (evsel->handler.func == NULL)
 		return 0;
 
-	thread = machine__find_thread(machine, sample->tid);
+	thread = machine__find_thread(machine, sample->pid);
 	if (thread == NULL)
 		goto err;
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 17/21] perf, c2c: Add symbol count table
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (13 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 16/21] perf, c2c: Fixup tid because of perf map is broken Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-18 13:09   ` Jiri Olsa
  2014-02-10 17:29 ` [PATCH 18/21] perf, c2c: Add shared cachline summary table Don Zickus
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Just another table that displays the referenced symbols in the analysis
report.  The table lists the most frequently used symbols first.

It is just another way to look at similar data to figure out who
is causing the most contention (based on the workload used).
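
The counting itself is a simple list keyed on the dso short name; a condensed
sketch of update_ref_tree() from the hunk below (error handling dropped):

  struct dso *dso = entry->mi->iaddr.map->dso;
  struct refs *p;

  list_for_each_entry(p, &ref_tree, list)
          if (!strcmp(p->name, dso->short_name))
                  goto found;

  p = zalloc(sizeof(*p));                 /* first reference to this object */
  p->name = dso->short_name;
  p->long_name = dso->long_name;
  list_add_tail(&p->list, &ref_tree);
found:
  p->nr++;

The sorted table is then produced by insertion-sorting these entries into a
second list in descending p->nr order.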

Originally done by Dick Fowles and ported by me.

Suggested-by: Joe Mario <jmario@redhat.com>
Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 32c2319..979187f 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -950,6 +950,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
+LIST_HEAD(ref_tree);
+LIST_HEAD(ref_tree_sorted);
+struct refs {
+	struct list_head	list;
+	int			nr;
+	const char		*name;
+	char			*long_name;
+};
+
+static int update_ref_tree(struct c2c_entry *entry)
+{
+	struct refs *p;
+	struct dso *dso = entry->mi->iaddr.map->dso;
+	const char *name = dso->short_name;
+
+	list_for_each_entry(p, &ref_tree, list) {
+		if (!strcmp(p->name, name))
+			goto found;
+	}
+
+	p = zalloc(sizeof(struct refs));
+	if (!p)
+		return -1;
+	p->name = name;
+	p->long_name = dso->long_name;
+	list_add_tail(&p->list, &ref_tree);
+
+found:
+	p->nr++;
+	return 0;
+}
+
+static void print_symbol_record_count(struct rb_root *tree)
+{
+	struct rb_node *next = rb_first(tree);
+	struct c2c_entry *n;
+	struct refs *p, *q, *pn;
+	char      string[256];
+	char      delimit[256];
+	int       i;
+	int       idx = 0;
+
+	/* gather symbol references */
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, rb_node);
+		next = rb_next(&n->rb_node);
+
+		if (update_ref_tree(n)) {
+			pr_err("Could not update reference tree\n");
+			goto cleanup;
+		}
+	}
+
+	/* sort on number of references per symbol */
+	list_for_each_entry_safe(p, pn, &ref_tree, list) {
+		list_del_init(&p->list);
+		list_for_each_entry(q, &ref_tree_sorted, list) {
+			if (p->nr > q->nr) {
+				list_add_tail(&p->list, &q->list);
+				break;
+			}
+		}
+		if (list_empty(&p->list))
+			list_add_tail(&p->list, &ref_tree_sorted);
+	}
+
+	/* print header info */
+	sprintf(string, "%5s   %8s   %-32s  %-80s",
+		"Index",
+		"Records",
+		"Object Name",
+		"Object Path");
+
+	delimit[0] = '\0';
+	for (i = 0; i < (int)strlen(string); i++) strcat(delimit, "=");
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("%50s %s\n", " ", "Object Name, Path & Reference Counts");
+	printf("\n");
+	printf("%s\n", string);
+	printf("%s\n", delimit);
+
+	/* print out table */
+	list_for_each_entry(p, &ref_tree_sorted, list) {
+		printf("%5d   %8d   %-32s  %-80s\n",
+			idx, p->nr, p->name, p->long_name);
+		idx++;
+	}
+	printf("\n");
+
+cleanup:
+	list_for_each_entry_safe(p, pn, &ref_tree_sorted, list) {
+		list_del(&p->list);
+		free(p);
+	}
+}
+
 static void print_hitm_cacheline_header(void)
 {
 #define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
@@ -1528,6 +1626,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 		dump_rb_tree(&c2c->tree_physid, c2c);
 	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
+	print_symbol_record_count(&c2c->tree_physid);
 
 err:
 	return err;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 18/21] perf, c2c: Add shared cachline summary table
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (14 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 17/21] perf, c2c: Add symbol count table Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 19/21] perf, c2c: Add framework to analyze latency and display summary stats Don Zickus
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This adds a quick summary of the hottest contended cachelines based
on the input data.  It summarizes what the broken-out pareto table shows,
so you can see at a glance which cachelines are interesting.
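
The per-line columns are all derived counts; a rough sketch using the field
names from the loop below ('t' stands for the cacheline's counter block, and
total_rmt_hitm/total_llc_misses are placeholders for the global totals):

  u32 lclmiss  = t->lcl_dram + t->rmt_dram + t->rmt_hitm + t->rmt_hit;
  u32 ldcnt    = lclmiss + t->ld_fbhit + t->ld_l1hit + t->ld_l2hit +
                 t->ld_llchit + t->lcl_hitm;
  u32 crecords = ldcnt + t->st_l1hit + t->st_l1miss;        /* "Records" column */

  double p_hitm = (double)t->rmt_hitm / total_rmt_hitm;     /* "%hitm" column */
  double p_all  = (double)t->rmt_hitm / total_llc_misses;   /* "%All" column  */

Lines whose p_hitm drops below DISPLAY_LINE_LIMIT are not printed.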

Originally done by Dick Fowles and ported by me.

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 979187f..014c9b0 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -1048,6 +1048,141 @@ cleanup:
 	}
 }
 
+static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
+					      struct c2c_stats *shared_stats __maybe_unused,
+					      struct c2c_stats *c2c_stats __maybe_unused)
+{
+#define   SHM_TITLE  "Shared Data Cache Line Table"
+
+	struct rb_node	*next = rb_first(hitm_tree);
+	struct c2c_hit	*h;
+	char		header[256];
+	char		delimit[256];
+	u32		crecords;
+	u32		lclmiss;
+	u32		ldcnt;
+	double		p_hitm;
+	double		p_all;
+	int		totmiss;
+	int		rmt_hitm;
+	int		len;
+	int		pad;
+	int		i;
+
+	sprintf(header,"%28s  %8s  %8s  %8s  %8s  %28s  %18s  %28s  %18s  %8s  %28s",
+		" ",
+		"Total",
+		"%All ",
+		" ",
+		"Total",
+		"---- Core Load Hit ----",
+		"-- LLC Load Hit --",
+		"----- LLC Load Hitm -----",
+		"-- Load Dram --",
+		"LLC  ",
+		"---- Store Reference ----");
+
+	len = strlen(header);
+	delimit[0] = '\0';
+
+	for (i = 0; i < len; i++)
+		strcat(delimit, "=");
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("\n");
+	pad = (strlen(header)/2) - (strlen(SHM_TITLE)/2);
+	for (i = 0; i < pad; i++)
+		printf(" ");
+	printf("%s\n", SHM_TITLE);
+	printf("\n");
+	printf("%s\n", header);
+
+	sprintf(header, "%8s  %18s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s",
+		"Index",
+		"Phys Adrs",
+		"Records",
+		"Ld Miss",
+		"%hitm",
+		"Loads",
+		"FB",
+		"L1D",
+		"L2D",
+		"Lcl",
+		"Rmt",
+		"Total",
+		"Lcl",
+		"Rmt",
+		"Lcl",
+		"Rmt",
+		"Ld Miss",
+		"Total",
+		"L1Hit",
+		"L1Miss");
+
+	printf("%s\n", header);
+	printf("%s\n", delimit);
+
+	rmt_hitm    = c2c_stats->t.rmt_hitm;
+	totmiss     = c2c_stats->t.lcl_dram +
+		      c2c_stats->t.rmt_dram +
+		      c2c_stats->t.rmt_hit +
+		      c2c_stats->t.rmt_hitm;
+
+	i = 0;
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		lclmiss  = h->stats.t.lcl_dram +
+			   h->stats.t.rmt_dram +
+			   h->stats.t.rmt_hitm +
+			   h->stats.t.rmt_hit;
+
+		ldcnt    = lclmiss +
+			   h->stats.t.ld_fbhit +
+			   h->stats.t.ld_l1hit +
+			   h->stats.t.ld_l2hit +
+			   h->stats.t.ld_llchit +
+			   h->stats.t.lcl_hitm;
+
+		crecords = ldcnt +
+			   h->stats.t.st_l1hit +
+			   h->stats.t.st_l1miss;
+
+		p_hitm = (double)h->stats.t.rmt_hitm / (double)rmt_hitm;
+		p_all  = (double)h->stats.t.rmt_hitm / (double)totmiss;
+
+		/* stop when the percentage gets too low */
+		if (p_hitm < DISPLAY_LINE_LIMIT)
+			break;
+
+		printf("%8d  %#18lx  %8u  %7.2f%%  %7.2f%%  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u\n",
+			i,
+			h->cacheline,
+			crecords,
+			100. * p_all,
+			100. * p_hitm,
+			ldcnt,
+			h->stats.t.ld_fbhit,
+			h->stats.t.ld_l1hit,
+			h->stats.t.ld_l2hit,
+			h->stats.t.ld_llchit,
+			h->stats.t.rmt_hit,
+			h->stats.t.lcl_hitm + h->stats.t.rmt_hitm,
+			h->stats.t.lcl_hitm,
+			h->stats.t.rmt_hitm,
+			h->stats.t.lcl_dram,
+			h->stats.t.rmt_dram,
+			lclmiss,
+			h->stats.t.store,
+			h->stats.t.st_l1hit,
+			h->stats.t.st_l1miss);
+
+		i++;
+	}
+}
+
 static void print_hitm_cacheline_header(void)
 {
 #define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
@@ -1555,6 +1690,7 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
 		free(h);
 
 	print_shared_cacheline_info(&hitm_stats, shared_clines);
+	print_c2c_shared_cacheline_report(&hitm_tree, &hitm_stats, &c2c->stats);
 	print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
 
 cleanup:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 19/21] perf, c2c: Add framework to analyze latency and display summary stats
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (15 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 18/21] perf, c2c: Add shared cachline summary table Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 20/21] perf, c2c: Add selected extreme latencies to output cacheline stats table Don Zickus
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

The overall goal of the cache-to-cache contention tool is to find extreme latencies and
help point out the problems so they can be fixed.  The big assumption is that remote cache
hits cause the biggest contention.

Those are summarized by the previous patches.  However, we still have non-remote cache hits
with high latency.  We display those here in a table, identify the outliers, and focus on
them.
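
The selection logic is the same at each level: gather basic statistics over the
load latencies, treat everything above mean + 3 * stddev as "extreme", then
repeat on that subset.  A condensed sketch using perf's stats helpers and the
tdist() table added below:

  struct stats s;                 /* running mean/stddev over load weights */
  double threshold, ci;

  init_stats(&s);
  /* ... update_stats(&s, n->weight) for every load kept by the snoop filter ... */

  threshold = avg_stats(&s) + 3 * stddev_stats(&s);          /* outlier cut-off */
  ci = tdist(C90, s.n) * stddev_stats(&s) / sqrt(s.n);       /* 90% confidence interval */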

Original work done by Dick Fowles.  I just ported it over to the perf framework.

Suggested-by: Joe Mario <jmario@redhat.com>
Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 456 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 456 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 014c9b0..b1d4a8b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -12,6 +12,7 @@
 #include <linux/compiler.h>
 #include <linux/kernel.h>
 #include <sched.h>
+#include <math.h>
 
 typedef struct {
 	int  locks;               /* count of 'lock' transactions */
@@ -46,6 +47,20 @@ struct c2c_stats {
 	struct stats		stats;
 };
 
+struct c2c_latency_stats {
+	int    min;
+	int    mode;
+	int    max;
+	int    cnt;
+	int    thrshld;
+	double median;
+	double stddev;
+	double cv;
+	double ci;
+	double mean;
+	struct rb_node *start;
+};
+
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
@@ -60,6 +75,7 @@ struct perf_c2c {
 
 struct c2c_entry {
 	struct rb_node		rb_node;
+	struct rb_node		latency;
 	struct list_head	scratch;  /* scratch list for resorting */
 	struct thread		*thread;
 	int			tid;  /* FIXME perf maps broken */
@@ -97,6 +113,21 @@ struct c2c_hit {
 	struct callchain_root   callchain[0]; /* must be last member */
 };
 
+typedef struct {
+	const char *label;
+	const char *fmt;
+	void       *overall;
+	void       *extremes;
+	void       *analyze;
+} stats_t;
+
+
+enum { EMPTY, SYMBOL, OBJECT };
+enum { OVERALL, EXTREMES, ANALYZE, SCOPES };
+enum { INVALID, NONE, INTEGER, DOUBLE };
+
+struct c2c_latency_stats hist_info[SCOPES];
+
 enum { OP, LVL, SNP, LCK, TLB };
 
 #define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
@@ -512,6 +543,34 @@ static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
 	return 0;
 }
 
+static int c2c_latency__add_to_list(struct rb_root *root, struct c2c_entry *n)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_entry *ne;
+	int64_t cmp;
+
+	p = &root->rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		ne = rb_entry(parent, struct c2c_entry, latency);
+
+		/* sort on weight */
+		cmp = ne->weight - n->weight;
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&n->latency, parent, p);
+	rb_insert_color(&n->latency, root);
+
+	return 0;
+}
+
 static int perf_c2c__fprintf_header(FILE *fp)
 {
 	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
@@ -1048,6 +1107,402 @@ cleanup:
 	}
 }
 
+stats_t data[] = {
+	{ "Samples           ", "%20d",   &hist_info[OVERALL].cnt,     &hist_info[EXTREMES].cnt,     &hist_info[ANALYZE].cnt     },
+	{ " ",                   NULL,    NULL,                        NULL,                         NULL                        },
+	{ "Minimum           ", "%20d",   &hist_info[OVERALL].min,     &hist_info[EXTREMES].min,     &hist_info[ANALYZE].min     },
+	{ "Maximum           ", "%20d",   &hist_info[OVERALL].max,     &hist_info[EXTREMES].max,     &hist_info[ANALYZE].max     },
+	{ "Threshold         ", "%20d",   &hist_info[OVERALL].thrshld, &hist_info[EXTREMES].thrshld, &hist_info[ANALYZE].thrshld },
+	{ " ",                   NULL,    NULL,                        NULL,                         NULL                        },
+	{ "Mode              ", "%20d",   &hist_info[OVERALL].mode,    &hist_info[EXTREMES].mode,    &hist_info[ANALYZE].mode    },
+	{ "Median            ", "%20.0f", &hist_info[OVERALL].median,  &hist_info[EXTREMES].median,  &hist_info[ANALYZE].median  },
+	{ "Mean              ", "%20.0f", &hist_info[OVERALL].mean,    &hist_info[EXTREMES].mean,    &hist_info[ANALYZE].mean    },
+	{ " ",                   NULL,    NULL,                        NULL,                         NULL                        },
+	{ "Std Dev           ", "%20.1f", &hist_info[OVERALL].stddev,  &hist_info[EXTREMES].stddev,  &hist_info[ANALYZE].stddev   },
+	{ "Coeff of Variation", "%20.3f", &hist_info[OVERALL].cv,      &hist_info[EXTREMES].cv,      &hist_info[ANALYZE].cv      },
+	{ "Confid Interval   ", "%20.1f", &hist_info[OVERALL].ci,      &hist_info[EXTREMES].ci,      &hist_info[ANALYZE].ci      },
+};
+
+#define STATS_ENTRIES (sizeof(data) / sizeof(stats_t))
+enum { C90, C95 };
+
+static double tdist(int ci, int num_samples)
+{
+	#define MAX_ENTRIES       32
+	#define INFINITE_SAMPLES  (MAX_ENTRIES-1)
+
+	/*
+	 * Student's T-distribution for 90% & 95% confidence intervals
+	 * The last entry is the value for infinite degrees of freedom
+	 */
+
+	static double t_dist[MAX_ENTRIES][2] = {
+				{ NAN,     NAN   },        /*   0  */
+				{ 6.31,   12.71  },        /*   1  */
+				{ 2.92,    4.30  },        /*   2  */
+				{ 2.35,    3.18  },        /*   3  */
+				{ 2.13,    2.78  },        /*   4  */
+				{ 2.02,    2.57  },        /*   5  */
+				{ 1.94,    2.45  },        /*   6  */
+				{ 1.90,    2.36  },        /*   7  */
+				{ 1.86,    2.31  },        /*   8  */
+				{ 1.83,    2.26  },        /*   9  */
+				{ 1.81,    2.23  },        /*  10  */
+				{ 1.80,    2.20  },        /*  11  */
+				{ 1.78,    2.18  },        /*  12  */
+				{ 1.77,    2.16  },        /*  13  */
+				{ 1.76,    2.14  },        /*  14  */
+				{ 1.75,    2.13  },        /*  15  */
+				{ 1.75,    2.12  },        /*  16  */
+				{ 1.74,    2.11  },        /*  17  */
+				{ 1.73,    2.10  },        /*  18  */
+				{ 1.73,    2.09  },        /*  19  */
+				{ 1.72,    2.09  },        /*  20  */
+				{ 1.72,    2.08  },        /*  21  */
+				{ 1.72,    2.07  },        /*  22  */
+				{ 1.71,    2.07  },        /*  23  */
+				{ 1.71,    2.06  },        /*  24  */
+				{ 1.71,    2.06  },        /*  25  */
+				{ 1.71,    2.06  },        /*  26  */
+				{ 1.70,    2.05  },        /*  27  */
+				{ 1.70,    2.05  },        /*  28  */
+				{ 1.70,    2.04  },        /*  29  */
+				{ 1.70,    2.04  },        /*  30  */
+				{ 1.645,   1.96  },        /*  31  */
+	};
+
+	double tvalue;
+
+	tvalue = 0;
+
+	switch (ci) {
+
+	case C90: /* 90% CI */
+		tvalue = ((num_samples-1) > 30) ? t_dist[INFINITE_SAMPLES][ci] : t_dist[num_samples-1][ci];
+		break;
+
+	case C95: /* 95% CI */
+		tvalue = ((num_samples-1) > 30) ? t_dist[INFINITE_SAMPLES][ci] : t_dist[num_samples-1][ci];
+		break;
+
+	default:
+		printf("internal error - invalid confidence interval value specified");
+		break;
+	}
+
+	return tvalue;
+}
+
+static inline void print_latency_info_header(void)
+{
+	const char	*title;
+	char		hdrstr[256];
+	int		twidth;
+	int		size;
+	int		pad;
+	int		i;
+
+	twidth = sprintf(hdrstr, "%-20s%20s%20s%20s",
+				"Metric", "Overall", "Extremes", "Selected");
+	title = "Execution Latency For Loads to Non Shared Memory";
+	size  = strlen(title);
+	pad   = (twidth - size)/2;
+
+	printf("\n\n");
+	for (i = 0; i < twidth; i++) printf("=");
+	printf("\n");
+
+	if (pad > 0) {
+		for (i = 0; i < pad; i++) printf(" ");
+	}
+	printf("%s\n", title);
+	printf("\n");
+
+	printf("%s\n", hdrstr);
+
+	for (i = 0; i < twidth; i++) printf("=");
+	printf("\n");
+}
+
+static void print_latency_info(void)
+{
+#define  LBLFMT "%-20s"
+
+	char	fmtstr[32];
+	int	i, dtype;
+	stats_t *ptr;
+
+	print_latency_info_header();
+
+	for (i = 0; i < (int)STATS_ENTRIES; i++) {
+
+		ptr = &data[i];
+
+		dtype = INVALID;
+
+		if (ptr->fmt == NULL)  {
+
+			dtype = EMPTY;
+
+		} else {
+
+			if (strchr(ptr->fmt, 'd') != NULL) dtype = INTEGER;
+			if (strchr(ptr->fmt, 'f') != NULL) dtype = DOUBLE;
+
+			strcpy(fmtstr, ptr->fmt);
+			strtok(fmtstr, ".d");
+			strcat(fmtstr, "s");
+
+		}
+
+		switch (dtype) {
+
+		case INTEGER:
+			printf(LBLFMT, ptr->label);
+			(ptr->overall  != NULL) ? printf(ptr->fmt, *((int *)ptr->overall))  : printf(fmtstr, "na");
+			(ptr->extremes != NULL) ? printf(ptr->fmt, *((int *)ptr->extremes)) : printf(fmtstr, "na");
+			(ptr->analyze  != NULL) ? printf(ptr->fmt, *((int *)ptr->analyze))  : printf(fmtstr, "na");
+			printf("\n");
+			break;
+
+
+		case DOUBLE:
+			printf(LBLFMT, ptr->label);
+			(ptr->overall  != NULL) ? printf(ptr->fmt, *((double *)ptr->overall))  : printf(fmtstr, "na");
+			(ptr->extremes != NULL) ? printf(ptr->fmt, *((double *)ptr->extremes)) : printf(fmtstr, "na");
+			(ptr->analyze  != NULL) ? printf(ptr->fmt, *((double *)ptr->analyze))  : printf(fmtstr, "na");
+			printf("\n");
+			break;
+
+
+		case EMPTY:
+			printf("\n");
+			break;
+
+
+		default:
+			printf("internal error - unsupported format specifier : %s\n", ptr->fmt);
+			break;
+
+		};
+
+	}
+	printf("\n\n");
+}
+
+static void calculate_latency_info(struct rb_root *tree,
+				   struct stats *stats,
+				   struct c2c_latency_stats *overall,
+				   struct c2c_latency_stats *extremes,
+				   struct c2c_latency_stats *selected)
+{
+	struct rb_node *next = rb_first(tree);
+	struct rb_node *start = NULL;
+	struct c2c_entry *n;
+	int count = 0, weight = 0;
+	int mode = 0, mode_count = 0, idx = 0;
+	int median;
+	double threshold;
+	struct stats s;
+
+
+	median = stats->n / 2.0;
+
+	overall->cnt = stats->n;
+	overall->min = stats->min;
+	overall->max = stats->max;
+	overall->thrshld = 0;
+	overall->mean = avg_stats(stats);
+	overall->stddev = stddev_stats(stats);
+	overall->cv = overall->stddev / overall->mean;
+	overall->ci = (tdist(C90, stats->n) * stddev_stats(stats)) / sqrt(stats->n);
+	overall->start = next;
+
+	/* set threshold to mean + 3 * stddev of stats */
+	threshold = avg_stats(stats) + 3 * stddev_stats(stats);
+	init_stats(&s);
+
+	/* calculate overall latency */
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, latency);
+		next = rb_next(&n->latency);
+
+		/* sorted on weight, makes counting easy, look for boundary */
+		if (n->weight != weight) {
+			if (count > mode_count) {
+				mode = weight;
+				mode_count = count;
+			}
+			count = 0;
+			weight = n->weight;
+		}
+		count++;
+
+		if (idx == median)
+			overall->median = n->weight;
+
+		/* save start for extreme latency calculation */
+		if (n->weight > threshold) {
+			if (!start)
+				start = next;
+
+			update_stats(&s, n->weight);
+		}
+
+		idx++;
+	}
+	/* count last set */
+	if (count > mode_count)
+		mode = weight;
+
+	overall->mode = mode;
+
+	/* calculate extreme latency */
+	next = start;
+	start = NULL;
+	idx = 0;
+	count = 0;
+	mode_count = 0;
+	mode = 0;
+	weight = 0;
+	median = s.n / 2.0;
+
+	extremes->cnt = s.n;
+	extremes->min = s.min;
+	extremes->max = s.max;
+	extremes->thrshld = threshold;
+	extremes->mean = avg_stats(&s);
+	extremes->stddev = stddev_stats(&s);
+	extremes->cv = extremes->stddev / extremes->mean;
+	extremes->ci = (tdist(C90, s.n) * stddev_stats(&s)) / sqrt(s.n);
+	extremes->start = next;
+
+	/* set threshold to mean + 3 * stddev of stats */
+	threshold = avg_stats(&s) + 3 * stddev_stats(&s);
+	init_stats(&s);
+
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, latency);
+		next = rb_next(&n->latency);
+
+		/* sorted on weight, makes counting easy, look for boundary */
+		if (n->weight != weight) {
+			if (count > mode_count) {
+				mode = weight;
+				mode_count = count;
+			}
+			count = 0;
+			weight = n->weight;
+		}
+		count++;
+
+		if (idx == median)
+			extremes->median = n->weight;
+
+		/* save start for extreme latency calculation */
+		if (n->weight > threshold) {
+			if (!start)
+				start = next;
+
+			update_stats(&s, n->weight);
+		}
+
+		idx++;
+	}
+	/* count last set */
+	if (count > mode_count)
+		mode = weight;
+
+	extremes->mode = mode;
+
+	/* calculate analyze latency */
+	next = start;
+	idx = 0;
+	count = 0;
+	mode_count = 0;
+	mode = 0;
+	weight = 0;
+	median = s.n / 2.0;
+
+	selected->cnt = s.n;
+	selected->min = s.min;
+	selected->max = s.max;
+	selected->thrshld = threshold;
+	selected->mean = avg_stats(&s);
+	selected->stddev = stddev_stats(&s);
+	selected->cv = selected->stddev / selected->mean;
+	selected->ci = (tdist(C90, s.n) * stddev_stats(&s)) / sqrt(s.n);
+	selected->start = next;
+
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, latency);
+		next = rb_next(&n->latency);
+
+		/* sorted on weight, makes counting easy, look for boundary */
+		if (n->weight != weight) {
+			if (count > mode_count) {
+				mode = weight;
+				mode_count = count;
+			}
+			count = 0;
+			weight = n->weight;
+		}
+		count++;
+
+		if (idx == median)
+			selected->median = n->weight;
+
+		idx++;
+	}
+	/* count last set */
+	if (count > mode_count)
+		mode = weight;
+
+	selected->mode = mode;
+}
+
+static void c2c_analyze_latency(struct perf_c2c *c2c)
+{
+
+	struct rb_node *next = rb_first(&c2c->tree_physid);
+	struct c2c_entry *n;
+	struct c2c_latency_stats *overall, *extremes, *selected;
+	struct rb_root lat_tree = RB_ROOT;
+	struct c2c_stats lat_stats;
+	u64 snoop;
+	struct stats s;
+
+	init_stats(&s);
+	memset(&lat_stats, 0, sizeof(struct c2c_stats));
+	memset(&hist_info, 0, sizeof(struct c2c_latency_stats) * SCOPES);
+
+	overall = &hist_info[OVERALL];
+	extremes = &hist_info[EXTREMES];
+	selected = &hist_info[ANALYZE];
+
+	/* sort on latency */
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, rb_node);
+		next = rb_next(&n->rb_node);
+
+		snoop  = n->mi->data_src.mem_snoop;
+
+		/* filter out HITs as un-interesting */
+		if ((snoop & P(SNOOP, HIT)) ||
+		    (snoop & P(SNOOP, HITM)) ||
+		    (snoop & P(SNOOP, NA)))
+			continue;
+
+		c2c_latency__add_to_list(&lat_tree, n);
+		update_stats(&s, n->weight);
+	}
+
+	calculate_latency_info(&lat_tree, &s, overall, extremes, selected);
+	print_latency_info();
+
+	return;
+}
+
 static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
 					      struct c2c_stats *shared_stats __maybe_unused,
 					      struct c2c_stats *c2c_stats __maybe_unused)
@@ -1761,6 +2216,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 	if (verbose > 2)
 		dump_rb_tree(&c2c->tree_physid, c2c);
 	print_c2c_trace_report(c2c);
+	c2c_analyze_latency(c2c);
 	c2c_analyze_hitms(c2c);
 	print_symbol_record_count(&c2c->tree_physid);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 20/21] perf, c2c: Add selected extreme latencies to output cacheline stats table
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (16 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 19/21] perf, c2c: Add framework to analyze latency and display summary stats Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 17:29 ` [PATCH 21/21] perf, c2c: Add summary latency table for various parts of caches Don Zickus
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This just takes the previously calculated extreme latencies and prints them
in a pretty table, with the cacheline and its offsets exposed to help further
understand where they are coming from.
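
As a reading aid (judging from print_latency_select_info() below): the %dist
column for a cacheline is that line's selected-sample count over the total
selected count, %cumm is the running sum of %dist, and the per-offset rows
inside each cacheline block show that offset's share of the line's samples.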

Original work done by Dick Fowles, ported to perf by me.

Suggested-by: Joe Mario <jmario@redhat.com>
Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 265 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 265 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index b1d4a8b..1fa21b4 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -76,6 +76,7 @@ struct perf_c2c {
 struct c2c_entry {
 	struct rb_node		rb_node;
 	struct rb_node		latency;
+	struct rb_node		latency_scratch;
 	struct list_head	scratch;  /* scratch list for resorting */
 	struct thread		*thread;
 	int			tid;  /* FIXME perf maps broken */
@@ -571,6 +572,62 @@ static int c2c_latency__add_to_list(struct rb_root *root, struct c2c_entry *n)
 	return 0;
 }
 
+static struct c2c_entry *c2c_latency__add_to_list_physid(struct rb_root *root,
+							 struct c2c_entry *entry)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_entry *ce;
+	int64_t cmp;
+
+	p = &root->rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		ce = rb_entry(parent, struct c2c_entry, latency_scratch);
+
+		cmp = physid_cmp(ce, entry);
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&entry->latency_scratch, parent, p);
+	rb_insert_color(&entry->latency_scratch, root);
+
+	return entry;
+}
+
+static int c2c_latency__add_to_list_count(struct rb_root *root,
+					  struct c2c_hit *h)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_hit *he;
+	int64_t cmp;
+
+	p = &root->rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		he = rb_entry(parent, struct c2c_hit, rb_node);
+
+		cmp = h->stats.stats.n - he->stats.stats.n;
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&h->rb_node, parent, p);
+	rb_insert_color(&h->rb_node, root);
+
+	return 0;
+}
+
 static int perf_c2c__fprintf_header(FILE *fp)
 {
 	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
@@ -1107,6 +1164,209 @@ cleanup:
 	}
 }
 
+static void print_latency_select_cacheline_offset(struct c2c_hit *offset,
+						  int total)
+{
+	struct stats *s = &offset->stats.stats;
+	struct addr_map_symbol *ams = &offset->mi->iaddr;
+
+	printf("%5s %6s %6s %7.1f%% %14s0x%02lx %#18lx %8ld %7.1f %8ld %7.1f %7.1f%%  %-30s %-20s\n",
+		" ",
+		" ",
+		" ",
+		((double) s->n / (double)total) * 100.0,
+		" ",
+		(cloffset == LVL2) ? (offset->mi->daddr.addr & 0xff) : CLOFFSET(offset->mi->daddr.addr),
+		offset->mi->iaddr.addr,
+		s->min,
+		0.0,
+		s->max,
+		avg_stats(s),
+		(stddev_stats(s)/avg_stats(s) * 100.0),
+		(ams->sym ? ams->sym->name : "?????"),
+		ams->map->dso->short_name);
+}
+
+static void print_latency_select_header(void)
+{
+#define EXCESS_LATENCY_TITLE "Non Shared Data Loads With Excessive Execution Latency"
+
+	static char delimit[MAXTITLE_SZ];
+	static char title[MAXTITLE_SZ];
+	int      pad;
+	int      i;
+
+	sprintf(title, "%5s %6s %6s %8s %18s %18s %8s %8s %8s %8s %8s  %-30s %-20s",
+		"Num",
+		"%dist",
+		"%cumm",
+		"Count",
+		"Data Address",
+		"Inst Address",
+		"Min",
+		"Median",
+		"Max",
+		"Mean",
+		"CV",
+		"Symbol",
+		"Object");
+
+	memset(delimit, 0, sizeof(delimit));
+	for (i = 0; i < (int)strlen(title); i++) delimit[i] = '=';
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+
+	pad = (strlen(title)/2) - (strlen(EXCESS_LATENCY_TITLE)/2);
+	for (i = 0; i < pad; i++) printf(" ");
+	printf("%s\n", EXCESS_LATENCY_TITLE);
+	printf("\n");
+
+	printf("%5s %6s %6s %8s %18s %18s %44s  %-30s %-20s\n",
+		" ",
+		" ",
+		" ",
+		"Load",
+		" ",
+		" ",
+		"------ Load Inst Execute Latency ------",
+		" ",
+		" ");
+
+	printf("%s\n", title);
+	printf("%s\n", delimit);
+}
+
+static void print_latency_select_info(struct rb_root *root,
+				      struct c2c_stats *stats)
+{
+#define XLAT_DIST_LIMIT 0.1
+
+	struct rb_node *next = rb_first(root);
+	struct c2c_hit *h, *clo = NULL;
+	struct c2c_entry *entry;
+	double tot_dist, tot_cumm;
+	int idx = 0, j;
+	static char delimit[MAXTITLE_SZ];
+	static char summary[MAXTITLE_SZ];
+
+	print_latency_select_header();
+
+	tot_cumm = 0.0;
+
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		tot_dist  = ((double)h->stats.stats.n / stats->stats.n);
+		tot_cumm += tot_dist;
+
+		/*
+		 * don't display lines with insignificant sharing contribution
+		 */
+		if (tot_dist*100.0 < XLAT_DIST_LIMIT)
+			break;
+
+		sprintf(summary, "%5d %5.1f%% %5.1f%% %8d %#18lx",
+			idx,
+			tot_dist*100.0,
+			tot_cumm*100.0,
+			(int)h->stats.stats.n,
+			h->cacheline);
+
+		if (delimit[0] != '-') {
+			memset(delimit, 0, sizeof(delimit));
+			for (j = 0; j < (int)strlen(summary); j++) delimit[j] = '-';
+		}
+
+		printf("%s\n", delimit);
+		printf("%s\n", summary);
+		printf("%s\n", delimit);
+
+		list_for_each_entry(entry, &h->list, scratch) {
+
+			if (!clo || !matching_coalescing(clo, entry)) {
+				u64 addr;
+
+				if (clo)
+					print_latency_select_cacheline_offset(clo, h->stats.stats.n);
+
+				free(clo);
+				addr = entry->mi->iaddr.al_addr;
+				clo = c2c_hit__new(addr, entry);
+			}
+			update_stats(&clo->stats.stats, entry->weight);
+		}
+		if (clo) {
+			print_latency_select_cacheline_offset(clo, h->stats.stats.n);
+			free(clo);
+			clo = NULL;
+		}
+
+		idx++;
+	}
+	printf("\n\n");
+}
+
+static void calculate_latency_selected_info(struct rb_root *root,
+					    struct rb_node *start,
+					    struct c2c_stats *lat_stats)
+{
+	struct rb_node *next = start;
+	struct rb_root lat_tree = RB_ROOT;
+	struct c2c_hit *h = NULL;
+	struct c2c_entry *n;
+	u64 cl;
+
+	/* new sort of 'selected' tree using physid_cmp */
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, latency);
+		next = rb_next(&n->latency);
+
+		c2c_latency__add_to_list_physid(&lat_tree, n);
+	}
+
+	/* resort based on number of entries in each cacheline */
+	next = rb_first(&lat_tree);
+	while (next) {
+		n = rb_entry(next, struct c2c_entry, latency_scratch);
+		next = rb_next(&n->latency_scratch);
+
+		cl = n->mi->daddr.al_addr;
+
+		/* switch cache line objects */
+		/* 'color' forces a boundary change based on the original sort */
+		if (!h || !n->color || (CLADRS(cl) != h->cacheline)) {
+			if (h)
+				c2c_latency__add_to_list_count(root, h);
+
+			h = c2c_hit__new(CLADRS(cl), n);
+			if (!h)
+				goto cleanup;
+		}
+
+		update_stats(&h->stats.stats, n->weight);
+		update_stats(&lat_stats->stats, n->weight);
+
+		/* save the entry for later processing */
+		list_add_tail(&n->scratch, &h->list);
+	}
+	/* last chunk */
+	if (h)
+		c2c_latency__add_to_list_count(root, h);
+	return;
+
+cleanup:
+	next = rb_first(root);
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+		rb_erase(&h->rb_node, root);
+
+		free(h);
+	}
+}
+
 stats_t data[] = {
 	{ "Samples           ", "%20d",   &hist_info[OVERALL].cnt,     &hist_info[EXTREMES].cnt,     &hist_info[ANALYZE].cnt     },
 	{ " ",                   NULL,    NULL,                        NULL,                         NULL                        },
@@ -1471,6 +1731,8 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
 	struct c2c_stats lat_stats;
 	u64 snoop;
 	struct stats s;
+	int i;
+	struct rb_root lat_select_tree = RB_ROOT;
 
 	init_stats(&s);
 	memset(&lat_stats, 0, sizeof(struct c2c_stats));
@@ -1500,6 +1762,9 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
 	calculate_latency_info(&lat_tree, &s, overall, extremes, selected);
 	print_latency_info();
 
+	calculate_latency_selected_info(&lat_select_tree, selected->start, &lat_stats);
+	print_latency_select_info(&lat_select_tree, &lat_stats);
+
 	return;
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 21/21] perf, c2c: Add summary latency table for various parts of caches
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (17 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 20/21] perf, c2c: Add selected extreme latencies to output cacheline stats table Don Zickus
@ 2014-02-10 17:29 ` Don Zickus
  2014-02-10 18:59 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Davidlohr Bueso
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 17:29 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Just a simple summary table of load latencies for the different levels of the
memory hierarchy (L1, LFB, L2, LLC [local/remote], DRAM [local/remote]).

Of course, this is based on the original ldlat filter level, which is 30 cycles
as of this writing.  Since loads that complete in fewer than 30 cycles are never
sampled, this makes the L1, LFB, and L2 numbers slightly misleading.
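
As an example of how samples land in these rows (going by decode_src() below):
a record decoded as [LOAD,LCL_LLC,HIT,SNP HITM], like one of the raw records
dumped in patch 02, is counted under the "L3 Hit - Snp Hitm - Lcl Cache" row,
while [LOAD,L1,HIT,SNP NONE] lands in "L1 Hit - Snp None".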

Original work done by Dick Fowles and ported to perf by me.

Suggested-by: Joe Mario <jmario@redhat.com>
Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 215 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 215 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 1fa21b4..a73535a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -122,6 +122,41 @@ typedef struct {
 	void       *analyze;
 } stats_t;
 
+enum {
+	LD_L1HIT_NONE,
+	LD_LFBHIT_NONE,
+	LD_L2HIT_NONE,
+	LD_L3HIT_NONE,
+	LD_L3HIT_MISS,           /* other core snoop miss */
+	LD_L3HIT_HIT,            /* hit on other core within socket, no fwd */
+	LD_L3HIT_HITM,           /* hitm on other core within socket */
+	LD_L3MISS_HIT_CACHE,     /* remote cache hit, fwd data? */
+	LD_L3MISS_HITM_CACHE,    /* remote cache hitm, C2C, implicit WB, invalidate */
+	LD_L3MISS_HIT_LDRAM,     /* load shared from local dram */
+	LD_L3MISS_HIT_RDRAM,     /* load shared from remote dram */
+	LD_L3MISS_MISS_LDRAM,    /* load exclusive from local dram */
+	LD_L3MISS_MISS_RDRAM,    /* load exclusive from remote dram */
+	LD_L3MISS_NA,
+	LD_UNCACHED,
+	LOAD_CATAGORIES,
+	ST_L1HIT_NA,
+	ST_L1MISS_NA,
+	ST_UNCACHED,
+	LOCK,                    /* defines a bit flag to represent locked events */
+	ALL_CATAGORIES
+};
+
+struct ld_lat_stats {
+	struct stats	stats;
+	u64		total;
+};
+
+struct ld_lat_stats ld_lat_stats[ALL_CATAGORIES];
+
+typedef struct {
+	const char  *name;
+	int    id;
+} xref_t;
 
 enum { EMPTY, SYMBOL, OBJECT };
 enum { OVERALL, EXTREMES, ANALYZE, SCOPES };
@@ -131,6 +166,16 @@ struct c2c_latency_stats hist_info[SCOPES];
 
 enum { OP, LVL, SNP, LCK, TLB };
 
+#define LOAD_OP(a)           ((a) & PERF_MEM_OP_LOAD  )
+#define STORE_OP(a)          ((a) & PERF_MEM_OP_STORE )
+#define LOCKED_OP(a)         ((a) & PERF_MEM_LOCK_LOCKED)
+
+#define SNOOP_NA(a)          ((a) & PERF_MEM_SNOOP_NA)
+#define SNOOP_NONE(a)        ((a) & PERF_MEM_SNOOP_NONE)
+#define SNOOP_MISS(a)        ((a) & PERF_MEM_SNOOP_MISS)
+#define SNOOP_HIT(a)         ((a) & PERF_MEM_SNOOP_HIT)
+#define SNOOP_HITM(a)        ((a) & PERF_MEM_SNOOP_HITM)
+
 #define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
 #define RMT_LLC              (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
 
@@ -1066,6 +1111,87 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
+xref_t names[LOAD_CATAGORIES] = {
+	{ "L1  Hit  - Snp None            ",  LD_L1HIT_NONE        },
+	{ "LFB Hit  - Snp None            ",  LD_LFBHIT_NONE       },
+	{ "L2  Hit  - Snp None            ",  LD_L2HIT_NONE        },
+	{ "L3  Hit  - Snp None            ",  LD_L3HIT_NONE        },
+	{ "L3  Hit  - Snp Miss            ",  LD_L3HIT_MISS        },
+	{ "L3  Hit  - Snp Hit  - Lcl Cache",  LD_L3HIT_HIT         },
+	{ "L3  Hit  - Snp Hitm - Lcl Cache",  LD_L3HIT_HITM        },
+	{ "L3  Miss - Snp Hit  - Rmt Cache",  LD_L3MISS_HIT_CACHE  },
+	{ "L3  Miss - Snp Hitm - Rmt Cache",  LD_L3MISS_HITM_CACHE },
+	{ "L3  Miss - Snp Hit  - Lcl Dram ",  LD_L3MISS_HIT_LDRAM  },
+	{ "L3  Miss - Snp Hit  - Rmt Dram ",  LD_L3MISS_HIT_RDRAM  },
+	{ "L3  Miss - Snp Miss - Lcl Dram ",  LD_L3MISS_MISS_LDRAM },
+	{ "L3  Miss - Snp Miss - Rmt Dram ",  LD_L3MISS_MISS_RDRAM },
+	{ "L3  Miss - Snp NA              ",  LD_L3MISS_NA         },
+	{ "Ld  UNC  - Snp None            ",  LD_UNCACHED          },
+};
+
+static void print_latency_load_info(void)
+{
+#define TITLE "Load Access & Execute Latency Information"
+
+	char     title_str[256];
+	double   stddev;
+	double   mean;
+	double   covar;
+	uint64_t cycles;
+	int      pad;
+	int      idx;
+	int      i;
+
+
+	cycles = 0;
+
+	for (i = 0; i < LOAD_CATAGORIES; i++)
+		cycles += ld_lat_stats[i].total;
+
+	sprintf(title_str, "%32s  %10s  %10s  %10s  %10s  %10s  %10s",
+		" ",
+		"Count",
+		"Minimum",
+		"Average",
+		"CV  ",
+		"Maximum",
+		"%dist");
+
+	pad = (strlen(title_str)/2) - (strlen(TITLE)/2);
+
+	printf("\n\n");
+	for (i = 0; i < (int)strlen(title_str); i++) printf("=");
+	printf("\n");
+	for (i = 0; i < pad; i++) printf(" ");
+	printf("%s\n", TITLE);
+	printf("\n");
+	printf("%s\n", title_str);
+	for (i = 0; i < (int)strlen(title_str); i++) printf("=");
+	printf("\n");
+
+	for (i = 0; i < LOAD_CATAGORIES; i++) {
+
+		idx    = names[i].id;
+
+		mean   = avg_stats(&ld_lat_stats[idx].stats);
+		stddev = stddev_stats(&ld_lat_stats[idx].stats);
+		covar  = stddev / mean;
+
+		printf("%-32s  %10lu  %10lu  %10.0f  %10.4f  %10lu  %10.1f%%\n",
+			names[i].name,
+			(u64)ld_lat_stats[idx].stats.n,
+			ld_lat_stats[idx].stats.min,
+			ld_lat_stats[idx].stats.mean,
+			covar,
+			ld_lat_stats[idx].stats.max,
+			100. * ((double)ld_lat_stats[idx].total / (double)cycles));
+
+	}
+
+	printf("\n");
+
+}
+
 LIST_HEAD(ref_tree);
 LIST_HEAD(ref_tree_sorted);
 struct refs {
@@ -1721,6 +1847,88 @@ static void calculate_latency_info(struct rb_root *tree,
 	selected->mode = mode;
 }
 
+static int decode_src(union perf_mem_data_src dsrc)
+{
+	if (LOAD_OP(dsrc.mem_op)) {
+
+		if (FILLBUF_HIT(dsrc.mem_lvl)) return(LD_LFBHIT_NONE);
+		if (L1CACHE_HIT(dsrc.mem_lvl)) return(LD_L1HIT_NONE);
+		if (L2CACHE_HIT(dsrc.mem_lvl)) return(LD_L2HIT_NONE);
+
+		if (L3CACHE_HIT(dsrc.mem_lvl)) {
+
+			if (SNOOP_HITM(dsrc.mem_snoop)) return(LD_L3HIT_HITM);
+			if (SNOOP_HIT(dsrc.mem_snoop))  return(LD_L3HIT_HIT);
+			if (SNOOP_MISS(dsrc.mem_snoop)) return(LD_L3HIT_MISS);
+			if (SNOOP_NONE(dsrc.mem_snoop)) return(LD_L3HIT_NONE);
+
+		}
+
+		if (L3CACHE_MISS(dsrc.mem_lvl)) {
+
+			if (SNOOP_NA(dsrc.mem_snoop)) return(LD_L3MISS_NA);
+
+		}
+
+		if (RMT_LLCHIT(dsrc.mem_lvl)) {
+
+			if (SNOOP_HITM(dsrc.mem_snoop)) return(LD_L3MISS_HITM_CACHE);
+			if (SNOOP_HIT(dsrc.mem_snoop))  return(LD_L3MISS_HIT_CACHE);
+
+		}
+
+
+		if (LCL_MEM(dsrc.mem_lvl)) {
+
+			if (SNOOP_MISS(dsrc.mem_snoop)) return(LD_L3MISS_MISS_LDRAM);
+			if (SNOOP_HIT(dsrc.mem_snoop))  return(LD_L3MISS_HIT_LDRAM);
+
+		}
+
+
+		if (RMT_MEM(dsrc.mem_lvl)) {
+
+			if (SNOOP_MISS(dsrc.mem_snoop)) return(LD_L3MISS_MISS_RDRAM);
+			if (SNOOP_HIT(dsrc.mem_snoop))  return(LD_L3MISS_HIT_RDRAM);
+
+		}
+
+		if (LD_UNCACHED(dsrc.mem_lvl)) {
+			if (SNOOP_NONE(dsrc.mem_snoop)) return(LD_UNCACHED);
+		}
+
+	}
+
+
+	if (STORE_OP(dsrc.mem_op)) {
+
+		if (SNOOP_NA(dsrc.mem_snoop)) {
+
+			if (L1CACHE_HIT(dsrc.mem_lvl))  return(ST_L1HIT_NA);
+			if (L1CACHE_MISS(dsrc.mem_lvl)) return(ST_L1MISS_NA);
+
+		}
+
+	}
+	return -1;
+}
+
+static void latency_update_stats(union perf_mem_data_src src,
+				u64 weight)
+{
+	int id = decode_src(src);
+
+	if (id < 0) {
+		pr_err("Bad data_src: %llx\n", src.val);
+		return;
+	}
+
+	update_stats(&ld_lat_stats[id].stats, weight);
+	ld_lat_stats[id].total += weight;
+
+	return;
+}
+
 static void c2c_analyze_latency(struct perf_c2c *c2c)
 {
 
@@ -1742,6 +1950,9 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
 	extremes = &hist_info[EXTREMES];
 	selected = &hist_info[ANALYZE];
 
+	for (i = 0; i < LOAD_CATAGORIES; i++)
+		init_stats(&ld_lat_stats[i].stats);
+
 	/* sort on latency */
 	while (next) {
 		n = rb_entry(next, struct c2c_entry, rb_node);
@@ -1749,6 +1960,9 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
 
 		snoop  = n->mi->data_src.mem_snoop;
 
+		/* piggy back updating load latency stats */
+		latency_update_stats(n->mi->data_src, n->weight);
+
 		/* filter out HITs as un-interesting */
 		if ((snoop & P(SNOOP, HIT)) ||
 		    (snoop & P(SNOOP, HITM)) ||
@@ -1765,6 +1979,7 @@ static void c2c_analyze_latency(struct perf_c2c *c2c)
 	calculate_latency_selected_info(&lat_select_tree, selected->start, &lat_stats);
 	print_latency_select_info(&lat_select_tree, &lat_stats);
 
+	print_latency_load_info();
 	return;
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (18 preceding siblings ...)
  2014-02-10 17:29 ` [PATCH 21/21] perf, c2c: Add summary latency table for various parts of caches Don Zickus
@ 2014-02-10 18:59 ` Davidlohr Bueso
  2014-02-10 19:17   ` Don Zickus
  2014-02-10 19:18 ` [PATCH 01/21] perf c2c: Shared data analyser Don Zickus
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Davidlohr Bueso @ 2014-02-10 18:59 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

This can be really useful for us performance folks, thanks. It seems
however that the first two patches in the series are missing.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 18:59 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Davidlohr Bueso
@ 2014-02-10 19:17   ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 19:17 UTC (permalink / raw)
  To: Davidlohr Bueso; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 10:59:30AM -0800, Davidlohr Bueso wrote:
> This can be really useful for us performance folks, thanks. It seems
> however that the first two patches in the series are missing.

Odd, yes.  For some reason they cc'd to me fine, just never made it to
lkml.  Let me resend them. Thanks.

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (19 preceding siblings ...)
  2014-02-10 18:59 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Davidlohr Bueso
@ 2014-02-10 19:18 ` Don Zickus
  2014-02-10 22:10   ` Davidlohr Bueso
  2014-02-10 19:18 ` [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits Don Zickus
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-10 19:18 UTC (permalink / raw)
  To: acme
  Cc: LKML, jolsa, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Don Zickus, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Peter Zijlstra, Richard Fowles

From: Arnaldo Carvalho de Melo <acme@redhat.com>

This is the start of a new perf tool that will collect information about
memory accesses and analyse it to find things like hot cachelines, etc.

This is basically trying to get a prototype written by Richard Fowles
rewritten using the tools/perf coding style and libraries.

Started from 'perf sched', this patch begins the process by adding the
'record' subcommand to collect the needed mem loads and stores samples.
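
(Judging from record_args[] and handlers[] in this patch, 'perf c2c record
<cmd>' ends up invoking roughly "perf record -W -d -a
-e cpu/mem-loads,ldlat=30/pp -e cpu/mem-stores/pp <cmd>", with the
"--phys-addr" option still commented out.)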

It also has the basic 'report' skeleton, resolving the sample address
and hooking the events found in a perf.data file with methods to handle
them, right now just printing the resolved perf_sample data structure
after each event name.
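
(Concretely: perf_c2c__process_sample() resolves each sample with
perf_event__preprocess_sample() and then calls whatever handler
perf_evlist__set_handlers() attached to the evsel, which at this point is
just the perf_sample__fprintf() dump.)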

Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fowles <rfowles@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-xfq0d75x7ndoejy7ozjlpy0i@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-c2c.c |  22 +++++
 tools/perf/Makefile.perf            |   1 +
 tools/perf/builtin-c2c.c            | 174 ++++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |   1 +
 tools/perf/perf.c                   |   1 +
 tools/perf/util/evlist.c            |  37 ++++++++
 tools/perf/util/evlist.h            |   7 ++
 7 files changed, 243 insertions(+)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

diff --git a/tools/perf/Documentation/perf-c2c.c b/tools/perf/Documentation/perf-c2c.c
new file mode 100644
index 0000000..4d52798
--- /dev/null
+++ b/tools/perf/Documentation/perf-c2c.c
@@ -0,0 +1,22 @@
+perf-c2c(1)
+===========
+
+NAME
+----
+perf-c2c - Shared Data C2C/HITM Analyzer.
+
+SYNOPSIS
+--------
+[verse]
+'perf c2c' record
+
+DESCRIPTION
+-----------
+These are the variants of perf c2c:
+
+  'perf c2c record <command>' to record the memory accesses of an arbitrary
+  workload.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-mem[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 7257e7e..3b21f5b 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -421,6 +421,7 @@ endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
 
+BUILTIN_OBJS += $(OUTPUT)builtin-c2c.o
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
 BUILTIN_OBJS += $(OUTPUT)builtin-help.o
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
new file mode 100644
index 0000000..897eadb
--- /dev/null
+++ b/tools/perf/builtin-c2c.c
@@ -0,0 +1,174 @@
+#include "builtin.h"
+#include "cache.h"
+
+#include "util/evlist.h"
+#include "util/parse-options.h"
+#include "util/session.h"
+#include "util/tool.h"
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct perf_c2c {
+	struct perf_tool tool;
+};
+
+static int perf_sample__fprintf(struct perf_sample *sample,
+				struct perf_evsel *evsel,
+				struct addr_location *al, FILE *fp)
+{
+	return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x%016" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
+		       perf_evsel__name(evsel),
+		       sample->pid, sample->tid, sample->ip, sample->addr,
+		       sample->weight, sample->data_src,
+		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+		       al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load(struct perf_evsel *evsel,
+				  struct perf_sample *sample,
+				  struct addr_location *al)
+{
+	perf_sample__fprintf(sample, evsel, al, stdout);
+	return 0;
+}
+
+static int perf_c2c__process_store(struct perf_evsel *evsel,
+				   struct perf_sample *sample,
+				   struct addr_location *al)
+{
+	perf_sample__fprintf(sample, evsel, al, stdout);
+	return 0;
+}
+
+static const struct perf_evsel_str_handler handlers[] = {
+	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
+	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
+};
+
+typedef int (*sample_handler)(struct perf_evsel *evsel,
+			      struct perf_sample *sample,
+			      struct addr_location *al);
+
+static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+				    union perf_event *event,
+				    struct perf_sample *sample,
+				    struct perf_evsel *evsel,
+				    struct machine *machine)
+{
+	struct addr_location al;
+	int err = 0;
+
+	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
+		pr_err("problem processing %d event, skipping it.\n",
+		       event->header.type);
+		return -1;
+	}
+
+	if (evsel->handler.func != NULL) {
+		sample_handler f = evsel->handler.func;
+		err = f(evsel, sample, &al);
+	}
+
+	return err;
+}
+
+static int perf_c2c__read_events(struct perf_c2c *c2c)
+{
+	int err = -1;
+	struct perf_session *session;
+
+	session = perf_session__new(input_name, O_RDONLY, 0, false, &c2c->tool);
+	if (session == NULL) {
+		pr_debug("No memory for session\n");
+		goto out;
+	}
+
+	if (perf_evlist__set_handlers(session->evlist, handlers))
+		goto out_delete;
+
+	err = perf_session__process_events(session, &c2c->tool);
+	if (err)
+		pr_err("Failed to process events, error %d", err);
+
+out_delete:
+	perf_session__delete(session);
+out:
+	return err;
+}
+
+static int perf_c2c__report(struct perf_c2c *c2c)
+{
+	setup_pager();
+	return perf_c2c__read_events(c2c);
+}
+
+static int perf_c2c__record(int argc, const char **argv)
+{
+	unsigned int rec_argc, i, j;
+	const char **rec_argv;
+	const char * const record_args[] = {
+		"record",
+		/* "--phys-addr", */
+		"-W",
+		"-d",
+		"-a",
+	};
+
+	rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+
+	if (rec_argv == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+		rec_argv[i] = strdup(record_args[i]);
+
+	for (j = 0; j < ARRAY_SIZE(handlers); j++) {
+		rec_argv[i++] = strdup("-e");
+		rec_argv[i++] = strdup(handlers[j].name);
+	}
+
+	for (j = 1; j < (unsigned int)argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	BUG_ON(i != rec_argc);
+
+	return cmd_record(i, rec_argv, NULL);
+}
+
+int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	struct perf_c2c c2c = {
+		.tool = {
+			.sample		 = perf_c2c__process_sample,
+			.comm		 = perf_event__process_comm,
+			.exit		 = perf_event__process_exit,
+			.fork		 = perf_event__process_fork,
+			.lost		 = perf_event__process_lost,
+			.ordered_samples = true,
+		},
+	};
+	const struct option c2c_options[] = {
+	OPT_END()
+	};
+	const char * const c2c_usage[] = {
+		"perf c2c {record|report}",
+		NULL
+	};
+
+	argc = parse_options(argc, argv, c2c_options, c2c_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+	if (!argc)
+		usage_with_options(c2c_usage, c2c_options);
+
+	if (!strncmp(argv[0], "rec", 3)) {
+		return perf_c2c__record(argc, argv);
+	} else if (!strncmp(argv[0], "rep", 3)) {
+		return perf_c2c__report(&c2c);
+	} else {
+		usage_with_options(c2c_usage, c2c_options);
+	}
+
+	return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index b210d62..2d0b1b5 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -17,6 +17,7 @@ extern int cmd_annotate(int argc, const char **argv, const char *prefix);
 extern int cmd_bench(int argc, const char **argv, const char *prefix);
 extern int cmd_buildid_cache(int argc, const char **argv, const char *prefix);
 extern int cmd_buildid_list(int argc, const char **argv, const char *prefix);
+extern int cmd_c2c(int argc, const char **argv, const char *prefix);
 extern int cmd_diff(int argc, const char **argv, const char *prefix);
 extern int cmd_evlist(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index 431798a..c7012a3 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -35,6 +35,7 @@ struct cmd_struct {
 static struct cmd_struct commands[] = {
 	{ "buildid-cache", cmd_buildid_cache, 0 },
 	{ "buildid-list", cmd_buildid_list, 0 },
+	{ "c2c",	cmd_c2c,	0 },
 	{ "diff",	cmd_diff,	0 },
 	{ "evlist",	cmd_evlist,	0 },
 	{ "help",	cmd_help,	0 },
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 59ef280..faf29b0 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1243,3 +1243,40 @@ void perf_evlist__to_front(struct perf_evlist *evlist,
 
 	list_splice(&move, &evlist->entries);
 }
+
+static struct perf_evsel *perf_evlist__find_by_name(struct perf_evlist *evlist,
+						    const char *name)
+{
+	struct perf_evsel *evsel;
+
+	list_for_each_entry(evsel, &evlist->entries, node) {
+		if (strcmp(name, perf_evsel__name(evsel)) == 0)
+			return evsel;
+	}
+
+	return NULL;
+}
+
+int __perf_evlist__set_handlers(struct perf_evlist *evlist,
+				const struct perf_evsel_str_handler *assocs,
+				size_t nr_assocs)
+{
+	struct perf_evsel *evsel;
+	size_t i;
+	int err = -EEXIST;
+
+	for (i = 0; i < nr_assocs; i++) {
+		evsel = perf_evlist__find_by_name(evlist, assocs[i].name);
+		if (evsel == NULL)
+			continue;
+
+		if (evsel->handler.func != NULL)
+			goto out;
+
+		evsel->handler.func = assocs[i].handler;
+	}
+
+	err = 0;
+out:
+	return err;
+}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index f5173cd..76f77c8 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
 	void	   *handler;
 };
 
+int __perf_evlist__set_handlers(struct perf_evlist *evlist,
+				const struct perf_evsel_str_handler *assocs,
+				size_t nr_assocs);
+
+#define perf_evlist__set_handlers(evlist, array) \
+	__perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
+
 struct perf_evlist *perf_evlist__new(void);
 struct perf_evlist *perf_evlist__new_default(void);
 void perf_evlist__init(struct perf_evlist *evlist, struct cpu_map *cpus,
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (20 preceding siblings ...)
  2014-02-10 19:18 ` [PATCH 01/21] perf c2c: Shared data analyser Don Zickus
@ 2014-02-10 19:18 ` Don Zickus
  2014-02-10 21:18 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Peter Zijlstra
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 19:18 UTC (permalink / raw)
  To: acme
  Cc: LKML, jolsa, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Don Zickus, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Peter Zijlstra, Richard Fowles

From: Arnaldo Carvalho de Melo <acme@redhat.com>

From the c2c prototype:

[root@sandy ~]# perf c2c -r report | head -7
T Status    Pid Tid CPU          Inst Adrs     Virt Data Adrs Phys Data Adrs Cycles Source      Decoded Source                ObJect:Symbol
--------------------------------------------------------------------------------------------------------------------------------------------
  raw input 779 779   7 0xffffffff810865dd 0xffff8803f4d75ec8              0    370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA]    [kernel.kallsyms]:try_to_wake_up
  raw input 779 779   7 0xffffffff8107acb3 0xffff8802a5b73158              0    297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
  raw input 779 779   7       0x3b7e009814     0x7fff87429ea0              0    925 0x68100142 [LOAD,L1,HIT,SNP NONE]        ???:???
  raw input   0   0   1 0xffffffff8108bf81 0xffff8803eafebf50              0    172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM]   [kernel.kallsyms]:update_stats_wait_end
  raw input 779 779   7       0x3b7e0097cc     0x7fac94b69068              0    228 0x68100242 [LOAD,LFB,HIT,SNP NONE]       ???:???
[root@sandy ~]#

The "Phys Data Adrs" column is not available at this point.
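
To make the decoded column easier to follow, here is a minimal decode sketch
for the third raw record above (0x68100142), assuming the perf_mem_data_src
bit layout from the uapi header of this era (mem_op:5, mem_lvl:14,
mem_snoop:5, mem_lock:2, mem_dtlb:7):

	#include <stdio.h>
	#include <linux/perf_event.h>	/* union perf_mem_data_src */

	int main(void)
	{
		union perf_mem_data_src dsrc = { .val = 0x68100142 };

		printf("op=%#llx lvl=%#llx snoop=%#llx\n",
		       (unsigned long long)dsrc.mem_op,     /* 0x2 = PERF_MEM_OP_LOAD    */
		       (unsigned long long)dsrc.mem_lvl,    /* 0xa = LVL_L1 | LVL_HIT    */
		       (unsigned long long)dsrc.mem_snoop); /* 0x2 = PERF_MEM_SNOOP_NONE */
		return 0;
	}

which is what the decoder above renders as [LOAD,L1,HIT,SNP NONE].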

Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fowles <rfowles@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-prwkuoz6y4g1qf3od8lksl1b@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-c2c.c | 148 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 125 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 897eadb..a5dc412 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -11,51 +11,148 @@
 
 struct perf_c2c {
 	struct perf_tool tool;
+	bool		 raw_records;
 };
 
-static int perf_sample__fprintf(struct perf_sample *sample,
-				struct perf_evsel *evsel,
-				struct addr_location *al, FILE *fp)
+enum { OP, LVL, SNP, LCK, TLB };
+
+static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
-	return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x%016" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
-		       perf_evsel__name(evsel),
-		       sample->pid, sample->tid, sample->ip, sample->addr,
-		       sample->weight, sample->data_src,
-		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
-		       al->sym ? al->sym->name : "???");
+#define PREFIX       "["
+#define SUFFIX       "]"
+#define ELLIPSIS     "..."
+	static const struct {
+		uint64_t   bit;
+		int64_t    field;
+		const char *name;
+	} decode_bits[] = {
+	{ PERF_MEM_OP_LOAD,       OP,  "LOAD"     },
+	{ PERF_MEM_OP_STORE,      OP,  "STORE"    },
+	{ PERF_MEM_OP_NA,         OP,  "OP_NA"    },
+	{ PERF_MEM_LVL_LFB,       LVL, "LFB"      },
+	{ PERF_MEM_LVL_L1,        LVL, "L1"       },
+	{ PERF_MEM_LVL_L2,        LVL, "L2"       },
+	{ PERF_MEM_LVL_L3,        LVL, "LCL_LLC"  },
+	{ PERF_MEM_LVL_LOC_RAM,   LVL, "LCL_RAM"  },
+	{ PERF_MEM_LVL_REM_RAM1,  LVL, "RMT_RAM"  },
+	{ PERF_MEM_LVL_REM_RAM2,  LVL, "RMT_RAM"  },
+	{ PERF_MEM_LVL_REM_CCE1,  LVL, "RMT_LLC"  },
+	{ PERF_MEM_LVL_REM_CCE2,  LVL, "RMT_LLC"  },
+	{ PERF_MEM_LVL_IO,        LVL, "I/O"	  },
+	{ PERF_MEM_LVL_UNC,       LVL, "UNCACHED" },
+	{ PERF_MEM_LVL_NA,        LVL, "N"        },
+	{ PERF_MEM_LVL_HIT,       LVL, "HIT"      },
+	{ PERF_MEM_LVL_MISS,      LVL, "MISS"     },
+	{ PERF_MEM_SNOOP_NONE,    SNP, "SNP NONE" },
+	{ PERF_MEM_SNOOP_HIT,     SNP, "SNP HIT"  },
+	{ PERF_MEM_SNOOP_MISS,    SNP, "SNP MISS" },
+	{ PERF_MEM_SNOOP_HITM,    SNP, "SNP HITM" },
+	{ PERF_MEM_SNOOP_NA,      SNP, "SNP NA"   },
+	{ PERF_MEM_LOCK_LOCKED,   LCK, "LOCKED"   },
+	{ PERF_MEM_LOCK_NA,       LCK, "LOCK_NA"  },
+	};
+	union perf_mem_data_src dsrc = { .val = val, };
+	int printed = scnprintf(bf, size, PREFIX);
+	size_t i;
+	bool first_present = true;
+
+	for (i = 0; i < ARRAY_SIZE(decode_bits); i++) {
+		int bitval;
+
+		switch (decode_bits[i].field) {
+		case OP:  bitval = decode_bits[i].bit & dsrc.mem_op;    break;
+		case LVL: bitval = decode_bits[i].bit & dsrc.mem_lvl;   break;
+		case SNP: bitval = decode_bits[i].bit & dsrc.mem_snoop; break;
+		case LCK: bitval = decode_bits[i].bit & dsrc.mem_lock;  break;
+		case TLB: bitval = decode_bits[i].bit & dsrc.mem_dtlb;  break;
+		default: bitval = 0;					break;
+		}
+
+		if (!bitval)
+			continue;
+
+		if (strlen(decode_bits[i].name) + !!i > size - printed - sizeof(SUFFIX)) {
+			sprintf(bf + size - sizeof(SUFFIX) - sizeof(ELLIPSIS) + 1, ELLIPSIS);
+			printed = size - sizeof(SUFFIX);
+			break;
+		}
+
+		printed += scnprintf(bf + printed, size - printed, "%s%s",
+				     first_present ? "" : ",", decode_bits[i].name);
+		first_present = false;
+	}
+
+	printed += scnprintf(bf + printed, size - printed, SUFFIX);
+	return printed;
 }
 
-static int perf_c2c__process_load(struct perf_evsel *evsel,
-				  struct perf_sample *sample,
-				  struct addr_location *al)
+static int perf_c2c__fprintf_header(FILE *fp)
 {
-	perf_sample__fprintf(sample, evsel, al, stdout);
-	return 0;
+	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
+			      'T',
+			      "Status",
+			      "Pid",
+			      "Tid",
+			      "CPU",
+			      "Inst Adrs",
+			      "Virt Data Adrs",
+			      "Phys Data Adrs",
+			      "Cycles",
+			      "Source",
+			      "  Decoded Source",
+			      "ObJect:Symbol");
+	return printed + fprintf(fp, "%-*.*s\n", printed, printed, graph_dotted_line);
 }
 
-static int perf_c2c__process_store(struct perf_evsel *evsel,
-				   struct perf_sample *sample,
-				   struct addr_location *al)
+static int perf_sample__fprintf(struct perf_sample *sample, char tag,
+				const char *reason, struct addr_location *al, FILE *fp)
 {
-	perf_sample__fprintf(sample, evsel, al, stdout);
+	char data_src[61];
+
+	perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
+
+	return fprintf(fp, "%c %-16s  %6d  %6d  %4d  %#18" PRIx64 "  %#18" PRIx64 "  %#18" PRIx64 "  %6" PRIu64 "  %#10" PRIx64 " %-60.60s %s:%s\n",
+		       tag,
+		       reason ?: "valid record",
+		       sample->pid,
+		       sample->tid,
+		       sample->cpu,
+		       sample->ip,
+		       sample->addr,
+		       0UL,
+		       sample->weight,
+		       sample->data_src,
+		       data_src,
+		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+		       al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+					struct perf_sample *sample,
+					struct addr_location *al)
+{
+	if (c2c->raw_records)
+		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+
 	return 0;
 }
 
 static const struct perf_evsel_str_handler handlers[] = {
-	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
-	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
+	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
+	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },
 };
 
-typedef int (*sample_handler)(struct perf_evsel *evsel,
+typedef int (*sample_handler)(struct perf_c2c *c2c,
 			      struct perf_sample *sample,
 			      struct addr_location *al);
 
-static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+static int perf_c2c__process_sample(struct perf_tool *tool,
 				    union perf_event *event,
 				    struct perf_sample *sample,
 				    struct perf_evsel *evsel,
 				    struct machine *machine)
 {
+	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
 	struct addr_location al;
 	int err = 0;
 
@@ -67,7 +164,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
 
 	if (evsel->handler.func != NULL) {
 		sample_handler f = evsel->handler.func;
-		err = f(evsel, sample, &al);
+		err = f(c2c, sample, &al);
 	}
 
 	return err;
@@ -100,6 +197,10 @@ out:
 static int perf_c2c__report(struct perf_c2c *c2c)
 {
 	setup_pager();
+
+	if (c2c->raw_records)
+		perf_c2c__fprintf_header(stdout);
+
 	return perf_c2c__read_events(c2c);
 }
 
@@ -150,6 +251,7 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		},
 	};
 	const struct option c2c_options[] = {
+	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (21 preceding siblings ...)
  2014-02-10 19:18 ` [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits Don Zickus
@ 2014-02-10 21:18 ` Peter Zijlstra
  2014-02-10 22:11   ` Don Zickus
  2014-02-10 21:29 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  25 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-10 21:18 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> With the introduction of NUMA systems, came the possibility of remote memory accesses.
> Combine those remote memory accesses with contention on the remote node (ie a modified
> cacheline) and you have a possibility for very long latencies.  These latencies can
> bottleneck a program.
> 
> The program added by these patches, helps detect the situation where two nodes are
> 'tugging' on the same _data_ cacheline.  The term used through out this program and
> the various changelogs is called a HITM.  This means nodeX went to read a cacheline
> and it was discovered to be loaded in nodeY's LLC cache (hence the cacheHIT). The 
> remote cacheline was also in a 'M'odified state thus creating a 'HIT M' for hit in
> a modified state.  HITMs can happen locally and remotely.  This program's interest
> is mainly in remote HITMs as they cause the longest latencies.

All of that is true of the traditional SMP system too. Just use lower
level caches.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (22 preceding siblings ...)
  2014-02-10 21:18 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Peter Zijlstra
@ 2014-02-10 21:29 ` Peter Zijlstra
  2014-02-10 22:20   ` Don Zickus
  2014-02-10 22:21   ` Stephane Eranian
       [not found] ` <1392053356-23024-2-git-send-email-dzickus@redhat.com>
       [not found] ` <1392053356-23024-3-git-send-email-dzickus@redhat.com>
  25 siblings, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-10 21:29 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> The data output is verbose and there are lots of data tables that interprit the latencies
> and data addresses in different ways to help see where bottlenecks might be lying.

Would be good to see what the output looks like.

What I haven't seen, and what I would find most useful, is using the IP
+ dwarf info to map it back to a data structure member.

Since you're already using the PEBS data-source fields, you can also
have a precise IP. For many cases it's possible to reconstruct the exact
data member the instruction is modifying.

At that point you can do pahole-like output of data structures, showing
which members are 'hot' on misses etc.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-10 19:18 ` [PATCH 01/21] perf c2c: Shared data analyser Don Zickus
@ 2014-02-10 22:10   ` Davidlohr Bueso
  2014-02-11 11:24     ` Jiri Olsa
  2014-02-11 11:31     ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 72+ messages in thread
From: Davidlohr Bueso @ 2014-02-10 22:10 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <acme@redhat.com>
> 
> This is the start of a new perf tool that will collect information about
> memory accesses and analyse it to find things like hot cachelines, etc.
> 
> This is basically trying to get a prototype written by Richard Fowles
> rewritten using the tools/perf coding style and libraries.
> 
> Started from 'perf sched', this patch begins the process by adding the
> 'record' subcommand to collect the needed mem loads and stores samples.
> 
> It also has the basic 'report' skeleton, resolving the sample address
> and hooking the events found in a perf.data file with methods to handle
> them, right now just printing the resolved perf_sample data structure
> after each event name.

What tree/branch is this developed against? I'm getting the following
with Linus' latest and tip tree:

builtin-c2c.c: In function ‘perf_c2c__process_sample’:
builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
builtin-c2c.c: In function ‘perf_c2c__read_events’:
builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
In file included from builtin-c2c.c:6:0:
util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
In file included from builtin-c2c.c:6:0:
util/session.h:52:22: note: declared here



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 21:18 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Peter Zijlstra
@ 2014-02-10 22:11   ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 22:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 10:18:25PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> > With the introduction of NUMA systems, came the possibility of remote memory accesses.
> > Combine those remote memory accesses with contention on the remote node (ie a modified
> > cacheline) and you have a possibility for very long latencies.  These latencies can
> > bottleneck a program.
> > 
> > The program added by these patches, helps detect the situation where two nodes are
> > 'tugging' on the same _data_ cacheline.  The term used through out this program and
> > the various changelogs is called a HITM.  This means nodeX went to read a cacheline
> > and it was discovered to be loaded in nodeY's LLC cache (hence the cacheHIT). The 
> > remote cacheline was also in a 'M'odified state thus creating a 'HIT M' for hit in
> > a modified state.  HITMs can happen locally and remotely.  This program's interest
> > is mainly in remote HITMs as they cause the longest latencies.
> 
> All of that is true of the traditional SMP system too. Just use lower
> level caches.

Yup.  We just focused on the longer latencies, which is the remote case.  I
think the idea was that overflowing an L1 and L2 wasn't that hard, so the gain
from solving local LLC HITMs wouldn't be that much.  Maybe we are wrong.

Anyway, if this tool can help solve any bottlenecks, NUMA or non-NUMA,
that would be great. :-)

Cheers,
Don


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 21:29 ` Peter Zijlstra
@ 2014-02-10 22:20   ` Don Zickus
  2014-02-10 22:21   ` Stephane Eranian
  1 sibling, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-10 22:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 10:29:55PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> > The data output is verbose and there are lots of data tables that interprit the latencies
> > and data addresses in different ways to help see where bottlenecks might be lying.
> 
> Would be good to see what the output looks like.

Hehe.  Unfortunately, my node info is causing double frees now, but the
output is attached below (without node info).

> 
> What I haven't seen, and what I would find most useful, is using the IP
> + dwarf info to map it back to a data structure member.

Yeah, we would like that too. :-)

> 
> Since you're already using the PEBS data-source fields, you can also
> have a precise IP. For many cases it's possible to reconstruct the exact
> data member the instruction is modifying.
> 
> At that point you can do pahole-like output of data structures, showing
> which members are 'hot' on misses etc.

Yeah, Arnaldo promised to look into that.  I think Stephane was doing some
research into that too.

Cheers,
Don


=================================================
            Trace Event Information              
=================================================
  Total records                     :    1322047
  Locked Load/Store Operations      :     206317
  Load Operations                   :     355701
  Loads - uncacheable               :        590
  Loads - no mapping                :        207
  Load Fill Buffer Hit              :     100214
  Load L1D hit                      :     148454
  Load L2D hit                      :      15170
  Load LLC hit                      :      53872
  Load Local HITM                   :      15388
  Load Remote HITM                  :      26760
  Load Remote HIT                   :       3910
  Load Local DRAM                   :       2436
  Load Remote DRAM                  :       3648
  Load MESI State Exclusive         :       2883
  Load MESI State Shared            :       3201
  Load LLC Misses                   :      36754
  LLC Misses to Local DRAM          :        6.6%
  LLC Misses to Remote DRAM         :        9.9%
  LLC Misses to Remote cache (HIT)  :       10.6%
  LLC Misses to Remote cache (HITM) :       72.8%
  Store Operations                  :     966322
  Store - uncacheable               :          0
  Store - no mapping                :      42931
  Store L1D Hit                     :     915696
  Virt -> Phys Remap Rejects        :          0
  No Page Map Rejects               :          0


================================================================================
                Execution Latency For Loads to Non Shared Memory

Metric                           Overall            Extremes            Selected
================================================================================
Samples                           301189                3454                 104

Minimum                               32                1006                4095
Maximum                             8149                8149                8149
Threshold                              0                1005                4042

Mode                                  34                1152                4556
Median                               136                1250                5163
Mean                                 236                1524                5337

Std Dev                            256.3               839.7               979.5
Coeff of Variation                 1.086               0.551               0.184
Confid Interval                      0.8                23.5                41.3
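
(For reference: per calculate_latency_info() in patch 19, "Extremes" covers
the samples above the Overall mean + 3 * stddev (236 + 3 * 256.3, roughly the
1005 Threshold shown), and "Selected" repeats that cut on the Extremes
distribution (1524 + 3 * 839.7, roughly the 4042 Threshold).)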




====================================================================================================================================================================
                                                       Non Shared Data Loads With Excessive Execution Latency

                        Load                                            ------ Load Inst Execute Latency ------                                                     
  Num  %dist  %cumm    Count       Data Address       Inst Address      Min   Median      Max     Mean       CV  Symbol                         Object              
====================================================================================================================================================================
-----------------------------------------------
    0  57.3%  57.3%       59 0xffffffff81c57ac0
-----------------------------------------------
                       33.9%               0x00 0xffffffff81098c43     4169     0.0     8149  5479.0    20.5%  update_cfs_shares              [kernel.kallsyms]   
                       66.1%               0x20 0xffffffff81094929     4155     0.0     7492  5286.6    16.9%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
    1   5.8%  63.1%        6 0xffffffff818a1180
-----------------------------------------------
                       50.0%               0x04 0xffffffff815c4f1e     4866     0.0     8116  6745.7    25.0%  _raw_spin_lock                 [kernel.kallsyms]   
                       50.0%               0x04 0xffffffff815c4f47     4556     0.0     7389  6099.3    23.5%  _raw_spin_lock                 [kernel.kallsyms]   
-----------------------------------------------
    2   4.9%  68.0%        5 0xffff881fbf608ac0
-----------------------------------------------
                      100.0%               0x24 0xffffffff810981dd     4201     0.0     6653  5429.4    17.4%  task_tick_numa                 [kernel.kallsyms]   
-----------------------------------------------
    3   2.9%  70.9%        3 0xffff881fa55b0140
-----------------------------------------------
                      100.0%               0x38 0xffffffff81082e8a     4317     0.0     4571  4418.7     3.0%  mspin_lock                     [kernel.kallsyms]   
-----------------------------------------------
    4   2.9%  73.8%        3 0xffff883fff834700
-----------------------------------------------
                      100.0%               0x30 0xffffffff810948e4     4906     0.0     6078  5532.3    10.7%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
    5   2.9%  76.7%        3 0xffff885fff834700
-----------------------------------------------
                      100.0%               0x30 0xffffffff810948e4     4921     0.0     6703  5828.7    15.3%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
    6   2.9%  79.6%        3 0xffff885fff8b4700
-----------------------------------------------
                      100.0%               0x30 0xffffffff810948e4     4101     0.0     6022  5166.7    18.9%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
    7   2.9%  82.5%        3 0xffff885fff9d4700
-----------------------------------------------
                      100.0%               0x30 0xffffffff810948e4     4319     0.0     4486  4381.7     2.1%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
    8   1.9%  84.5%        2 0xffff885fff854700
-----------------------------------------------
                      100.0%               0x30 0xffffffff810948e4     5434     0.0     6075  5754.5     7.9%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
    9   1.9%  86.4%        2 0xffff885fff974700
-----------------------------------------------
                      100.0%               0x30 0xffffffff810948e4     5326     0.0     5589  5457.5     3.4%  update_cfs_rq_blocked_load     [kernel.kallsyms]   
-----------------------------------------------
   10   1.9%  88.3%        2 0xffffffff819bd400
-----------------------------------------------
                       50.0%               0x04 0xffffffff810ae886     5427     0.0     5427  5427.0     0.0%  ktime_get                      [kernel.kallsyms]   
                       50.0%               0x04 0xffffffff810aed07     4334     0.0     4334  4334.0     0.0%  update_wall_time               [kernel.kallsyms]   
-----------------------------------------------
   11   1.0%  89.3%        1 0xffff881fc5600040
-----------------------------------------------
                      100.0%               0x20 0xffffffffa162cb70     5606     0.0     5606  5606.0     0.0%  mtip_irq_handler               [mtip32xx]          

<snip 11 records>
...




========================================================================================================
                                Load Access & Execute Latency Information

                                       Count     Minimum     Average        CV       Maximum       %dist
========================================================================================================
L1  Hit  - Snp None                   148444          32         242      1.1501        7492        39.5%
LFB Hit  - Snp None                   100190          32         271      0.8572        8149        29.8%
L2  Hit  - Snp None                    15154          32         227      0.9071        1682         3.8%
L3  Hit  - Snp None                    32029          32          70      1.7761        6353         2.5%
L3  Hit  - Snp Miss                     2489          38         373      0.7307        5306         1.0%
L3  Hit  - Snp Hit  - Lcl Cache         3802          32         150      0.9289        3225         0.6%
L3  Hit  - Snp Hitm - Lcl Cache        15388          32         187      1.0542        8485         3.2%
L3  Miss - Snp Hit  - Rmt Cache         3910          32         355      0.3318        3972         1.5%
L3  Miss - Snp Hitm - Rmt Cache        26760          32         493      0.5116        6236        14.5%
L3  Miss - Snp Hit  - Lcl Dram          1029          32         400      0.5783        3578         0.5%
L3  Miss - Snp Hit  - Rmt Dram          2170          32         541      0.7315        9967         1.3%
L3  Miss - Snp Miss - Lcl Dram          1406          32         431      0.9117        8116         0.7%
L3  Miss - Snp Miss - Rmt Dram          1477          34         554      0.4437        2956         0.9%
L3  Miss - Snp NA                        440           0         607      0.6940        2717         0.3%
Ld  UNC  - Snp None                        0  18446744073709551615           0        -nan           0         0.0%

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :       1327
  Load HITs on shared lines         :     167131
  Fill Buffer Hits on shared lines  :      43469
  L1D hits on shared lines          :      50497
  L2D hits on shared lines          :        960
  LLC hits on shared lines          :      38467
  Locked Access on shared lines     :     100032
  Store HITs on shared lines        :     118659
  Store L1D hits on shared lines    :     113783
  Total Merged records              :     160807


================================================================================================================================================================================================================

                                                                                          Shared Data Cache Line Table

                                 Total     %All                Total       ---- Core Load Hit ----  -- LLC Load Hit --     ----- LLC Load Hitm -----     -- Load Dram --     LLC       ---- Store Reference ----
   Index           Phys Adrs   Records   Ld Miss     %hitm     Loads        FB       L1D       L2D       Lcl       Rmt     Total       Lcl       Rmt       Lcl       Rmt   Ld Miss     Total     L1Hit    L1Miss
================================================================================================================================================================================================================
       0  0xffff881fa55b0140     72006    16.97%    23.31%     43095     13591     16860        45      2651        25      9526      3288      6238       266       131      6660     28911     28098       813
       1  0xffff881fba47f000     21854     5.29%     7.26%     13938      3887      6941        15         1         7      3087      1143      1944         0         0      1951      7916      7916         0
       2  0xffff881fc21b9cc0      2153     1.61%     2.21%       862        32        70         0        15         1       740       148       592         0         4       597      1291      1235        56
       3  0xffff881fc7d91cc0      1957     1.40%     1.92%       866        34        94         0        14         3       720       207       513         0         1       517      1091      1028        63
       4  0xffff881fba539cc0      1813     1.35%     1.85%       808        33        84         3        14         1       665       170       495         8         0       504      1005       967        38
       5  0xffff881fc770bcc0      1939     1.30%     1.78%       827        36        70         0        16         1       700       223       477         4         0       482      1112      1058        54
       6  0xffff881fbc8adcc0      1854     1.25%     1.72%       788        21        77         1        12         2       674       215       459         0         1       462      1066       965       101
       7  0xffff881fc6c03cc0      1825     1.19%     1.63%       800        20        80         1        16         3       677       240       437         1         2       443      1025       973        52
       8  0xffff881fb93f1cc0      1934     1.17%     1.61%       846        42        79         2         7         2       711       280       431         0         3       436      1088      1022        66
       9  0xffff881fd1391cc0      1901     1.16%     1.59%       840        35        97         0        10         0       693       267       426         0         5       431      1061      1000        61
      10  0xffff881fd0771cc0      1731     1.14%     1.57%       779        19        62         1        22         9       663       244       419         1         2       431       952       890        62
      11  0xffff881fc7d31cc0      1971     1.13%     1.56%       826        29        91         2        19         5       677       260       417         0         3       425      1145      1064        81
      12  0xffff881fb9dcdcc0      1821     1.13%     1.55%       784        31        67         0        13         8       663       249       414         0         2       424      1037       973        64
      13  0xffff881fc3febcc0      1795     1.10%     1.51%       788        33        74         1        11         4       663       258       405         1         1       411      1007       936        71
      14  0xffff881fc7d29cc0      1837     1.10%     1.51%       756        30        71         0        19         1       634       229       405         1         0       407      1081      1023        58
      15  0xffff881fbc365cc0      1961     1.06%     1.46%       850        23        73         0        16         0       736       345       391         2         0       393      1111      1047        64
      16  0xffff881fd0259cc0      1896     1.04%     1.43%       779        28        68         1        15         0       665       282       383         1         1       385      1117      1052        65
      17  0xffff881fd0589cc0      1848     1.03%     1.42%       860        26        95         0        22         2       714       334       380         0         1       383       988       921        67
      18  0xffff881fd01a1cc0      1822     1.02%     1.40%       823        38        78         0        17         1       688       313       375         1         0       377       999       926        73
      19  0xffff881fb7ce1cc0      1833     1.00%     1.38%       761        28        72         0        11         2       642       274       368         1         5       376      1072       975        97
      20  0xffff881fb8099cc0      1846     0.94%     1.30%       779        23        52         1        11         2       684       337       347         1         5       355      1067       991        76
      21  0xffff881fc7e91cc0      1751     0.91%     1.25%       792        29        79         0        20         1       662       327       335         0         1       337       959       905        54
      22  0xffff881fbf608000       625     0.33%     0.45%       551       162        17        27       189        18       134        13       121         4         0       143        74        64        10

<snip 40 lines>
....


================================================================================================================================================================================================================

                                                                                      Shared Cache Line Distribution Pareto

     ---- All ----  -- Shared --    ---- HITM ----                                                                        Load Inst Execute Latency                                                             
       Data Misses   Data Misses   Remote    Local  -- Store Refs --                                                                      
                                                                                                                          ---- cycles ----             cpu
 Num  %dist  %cumm  %dist  %cumm  LLCmiss   LLChit   L1 hit  L1 Miss       Data Address    Pid    Tid       Inst Address   median     mean     CV      cnt Symbol                         Object                
================================================================================================================================================================================================================
-----------------------------------------------------------------------------------------------
   0  17.0%  17.0%  23.3%  23.3%     6238     3288    28098      813 0xffff881fa55b0140    ***
-----------------------------------------------------------------------------------------------
                                     0.0%     0.0%     0.0%     0.0%               0x00    375    375 0xffffffffa018ff5b      n/a      n/a      n/a      1 ext4_bio_write_page            [ext4]                
                                     0.0%     0.0%     0.0%     0.0%               0x08  18156  18165 0xffffffffa018b7f9       -1      384     0.0%      1 ext4_mark_iloc_dirty           [ext4]                
                                     0.2%     0.0%     0.0%     0.0%               0x10  18156    *** 0xffffffff811ca1aa       -1      387    10.7%      7 __mark_inode_dirty             [kernel.kallsyms]     
                                     0.7%     0.1%    15.9%     0.0%               0x18  18156    *** 0xffffffff815c15b1       -1     1241    24.0%     51 mutex_unlock                   [kernel.kallsyms]     
                                     0.0%     0.0%    23.2%     0.0%               0x18  18156    *** 0xffffffff815c1615       -1      684     0.0%     50 mutex_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.5%     3.1%               0x18  18156    *** 0xffffffff815c2082      n/a      n/a      n/a     38 __mutex_unlock_slowpath        [kernel.kallsyms]     
                                     0.2%     3.3%     0.0%     0.0%               0x18  18156    *** 0xffffffff815c2139       -1      496    22.7%     31 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.2%     3.4%     5.1%     0.0%               0x18  18156    *** 0xffffffff815c2142       -1      821    13.2%     50 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x18  18156    *** 0xffffffff815c21bf       -1     1203     0.0%     18 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     1.2%     0.1%     0.0%     0.0%               0x18  18156    *** 0xffffffff815c21ed       -1      671    42.9%     37 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.1%     0.1%     0.1%               0x18  18156    *** 0xffffffff815c21f6       -1      971    16.5%     25 __mutex_lock_slowpath          [kernel.kallsyms]     
                                    10.9%     6.2%     2.5%    89.3%               0x1c  18156    *** 0xffffffff815c4e4a       -1      478    51.7%     50 _raw_spin_unlock               [kernel.kallsyms]     
                                    11.8%     2.0%    18.4%     0.0%               0x1c  18156    *** 0xffffffff815c4f1e       -1     1276    22.0%     50 _raw_spin_lock                 [kernel.kallsyms]     
                                     3.2%     2.6%     0.0%     0.0%               0x1c  18156    *** 0xffffffff815c4f47       -1      831    54.2%     48 _raw_spin_lock                 [kernel.kallsyms]     
                                     0.8%     0.4%     0.0%     0.0%               0x20  18156    *** 0xffffffff815c207f       -1      669    26.3%     26 __mutex_unlock_slowpath        [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%     0.6%               0x28  18156    *** 0xffffffff812b7e4e      n/a      n/a      n/a      5 __list_add                     [kernel.kallsyms]     
                                     0.0%     0.1%     0.0%     0.0%               0x30  18156    *** 0xffffffff81082f19       -1      738    58.4%      5 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.1%     8.8%     0.0%     0.0%               0x30  18156    *** 0xffffffff81082f55       -1      730    33.9%     23 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.4%               0x30  18156    *** 0xffffffff815c15a6      n/a      n/a      n/a     27 mutex_unlock                   [kernel.kallsyms]     
                                     0.0%     0.0%     2.8%     6.5%               0x30  18156    *** 0xffffffff815c1628      n/a      n/a      n/a     50 mutex_lock                     [kernel.kallsyms]     
                                     0.4%     0.4%     0.0%     0.0%               0x30  18156    *** 0xffffffff815c20d6       -1      860    50.5%     17 __mutex_lock_slowpath          [kernel.kallsyms]     
                                    60.3%    66.7%     0.0%     0.0%               0x30  18156    *** 0xffffffff815c211d       -1      471    46.6%     50 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%     0.0%               0x30  18156  18165 0xffffffff815c2157      n/a      n/a      n/a      1 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     9.8%     5.8%    31.0%     0.0%               0x38  18156    *** 0xffffffff81082e8a       -1      960    33.3%     50 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x38  18156    *** 0xffffffff81082ec4       -1      588    42.1%     18 mspin_unlock                   [kernel.kallsyms]     

-----------------------------------------------------------------------------------------------
   1   5.3%  22.3%   7.3%  30.6%     1944     1143     7916        0 0xffff881fba47f000  18156
-----------------------------------------------------------------------------------------------
                                   100.0%   100.0%     0.0%     0.0%               0x00  18156    *** 0xffffffffa01b410e       -1      401    13.5%     50 __ext4_journal_start_sb        [ext4]                
                                     0.0%     0.0%    10.1%     0.0%               0x28  18156    *** 0xffffffffa0167409      n/a      n/a      n/a     50 start_this_handle              [jbd2]                
                                     0.0%     0.0%    89.9%     0.0%               0x28  18156    *** 0xffffffff815c4be9      n/a      n/a      n/a     50 _raw_read_lock                 [kernel.kallsyms]     

-----------------------------------------------------------------------------------------------
   2   1.6%  23.9%   2.2%  32.8%      592      148     1235       56 0xffff881fc21b9cc0  18156
-----------------------------------------------------------------------------------------------
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18172 0xffffffff81082e75      n/a      n/a      n/a      2 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18172 0xffffffff81082f15      n/a      n/a      n/a      2 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.3%     0.0%     0.0%     0.0%               0x00  18156  18172 0xffffffff81082f5a       -1      449     0.3%      1 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18172 0xffffffff810908ac      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     0.9%     0.0%               0x00  18156  18172 0xffffffff81090a73      n/a      n/a      n/a      5 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     2.3%     0.0%               0x00  18156  18172 0xffffffff8113524c      n/a      n/a      n/a      7 generic_segment_checks         [kernel.kallsyms]     
                                     0.0%     0.0%     1.1%     0.0%               0x00  18156  18172 0xffffffff81135f09      n/a      n/a      n/a      3 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%     3.8%     0.0%               0x00  18156  18172 0xffffffff811ba059      n/a      n/a      n/a      6 file_update_time               [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%     1.8%               0x00  18156  18172 0xffffffff815c4275      n/a      n/a      n/a      1 schedule_preempt_disabled      [kernel.kallsyms]     
                                     2.9%     0.0%     0.0%     0.0%               0x00  18156  18172 0xffffffff815c4299       -1      353    15.2%      6 schedule_preempt_disabled      [kernel.kallsyms]     
                                     0.0%     0.0%     1.9%     0.0%               0x08  18156  18172 0xffffffff81135245      n/a      n/a      n/a      6 generic_segment_checks         [kernel.kallsyms]     
                                     0.0%     0.0%     0.5%     0.0%               0x08  18156  18172 0xffffffff81135f05      n/a      n/a      n/a      3 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%    10.0%     0.0%               0x08  18156  18172 0xffffffff811b9cb5      n/a      n/a      n/a      8 file_remove_suid               [kernel.kallsyms]     
                                     0.0%     0.0%     2.3%     0.0%               0x08  18156  18172 0xffffffff811ba055      n/a      n/a      n/a      8 file_update_time               [kernel.kallsyms]     
                                     0.0%     0.0%     2.8%     3.6%               0x08  18156  18172 0xffffffff815c2106      n/a      n/a      n/a      5 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%    12.2%     0.0%               0x08  18156  18172 0xffffffff815c2118      n/a      n/a      n/a      9 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     2.9%    33.9%               0x08  18156  18172 0xffffffff815c212c      n/a      n/a      n/a      8 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%     7.1%               0x08  18156  18172 0xffffffff815c21e8      n/a      n/a      n/a      3 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.3%     0.0%     0.0%     0.0%               0x08  18156  18172 0xffffffff815c429a       -1      314    28.6%      2 schedule_preempt_disabled      [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x10  18156  18172 0xffffffff81082e80      n/a      n/a      n/a      1 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%    17.9%               0x10  18156    *** 0xffffffff81082e9b      n/a      n/a      n/a      8 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x10  18156  18172 0xffffffff810908a2      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.4%     0.0%               0x10  18156  18172 0xffffffff81137c4d      n/a      n/a      n/a      7 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     7.4%     0.0%               0x10  18156  18172 0xffffffff81137d7b      n/a      n/a      n/a      9 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x10  18156  18172 0xffffffff81137dbc      n/a      n/a      n/a      2 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.2%     0.7%     0.0%     0.0%               0x10  18156    *** 0xffffffff812b7e3b       -1      507     0.0%      2 __list_add                     [kernel.kallsyms]     
                                    89.0%    98.0%     0.0%     0.0%               0x18  18156  18172 0xffffffff81082ea7       -1      471    28.5%      9 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%    33.9%               0x18  18156    *** 0xffffffff81082eda      n/a      n/a      n/a     15 mspin_unlock                   [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x18  18156  18172 0xffffffff810908a0      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%    13.8%     1.8%               0x18  18156  18172 0xffffffff811360de      n/a      n/a      n/a      9 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%     6.6%     0.0%               0x18  18156  18172 0xffffffff81137dab      n/a      n/a      n/a      9 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x20  18156  18172 0xffffffff8109089e      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     7.3%     1.4%     0.0%     0.0%               0x20  18156    *** 0xffffffff815c208e       -1      402    13.6%     29 __mutex_unlock_slowpath        [kernel.kallsyms]     
                                     0.0%     0.0%    19.5%     0.0%               0x28  18156  18172 0xffffffff81137d63      n/a      n/a      n/a      9 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x30  18156  18172 0xffffffff81137c49      n/a      n/a      n/a      1 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     5.4%     0.0%               0x30  18156  18172 0xffffffff81137dc1      n/a      n/a      n/a      6 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     2.6%     0.0%               0x38  18156  18172 0xffffffff81090b4e      n/a      n/a      n/a      7 wake_up_process                [kernel.kallsyms]     
                                     0.0%     0.0%     1.3%     0.0%               0x38  18156  18172 0xffffffff81137c2c      n/a      n/a      n/a      4 __generic_file_aio_write       [kernel.kallsyms]     

-----------------------------------------------------------------------------------------------
   3   1.4%  25.3%   1.9%  34.7%      513      207     1028       63 0xffff881fc7d91cc0  18156
-----------------------------------------------------------------------------------------------
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18159 0xffffffff81082e75      n/a      n/a      n/a      2 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18159 0xffffffff81082f15      n/a      n/a      n/a      1 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.4%     0.5%     0.0%     0.0%               0x00  18156  18159 0xffffffff81082f5a       -1      446     2.5%      2 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.0%     0.0%     0.3%     0.0%               0x00  18156  18159 0xffffffff810908ac      n/a      n/a      n/a      3 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18159 0xffffffff810908fc      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.6%     0.0%               0x00  18156  18159 0xffffffff81090a73      n/a      n/a      n/a      7 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     2.9%     0.0%               0x00  18156  18159 0xffffffff8113524c      n/a      n/a      n/a      6 generic_segment_checks         [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18159 0xffffffff81135f09      n/a      n/a      n/a      1 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18159 0xffffffff811ba059      n/a      n/a      n/a      1 file_update_time               [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18159 0xffffffff815c4275      n/a      n/a      n/a      1 schedule_preempt_disabled      [kernel.kallsyms]     
                                     3.1%     0.0%     0.0%     0.0%               0x00  18156  18159 0xffffffff815c4299       -1      376    18.4%      5 schedule_preempt_disabled      [kernel.kallsyms]     
                                     0.2%     0.0%     0.0%     0.0%               0x00  18156  18159 0xffffffff815c4f4f       -1      431     0.0%      1 _raw_spin_lock                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.9%     0.0%               0x08  18156  18159 0xffffffff81135245      n/a      n/a      n/a      5 generic_segment_checks         [kernel.kallsyms]     
                                     0.0%     0.0%     1.6%     0.0%               0x08  18156  18159 0xffffffff811b9cb5      n/a      n/a      n/a      5 file_remove_suid               [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x08  18156  18159 0xffffffff811ba055      n/a      n/a      n/a      1 file_update_time               [kernel.kallsyms]     
                                     0.0%     0.0%     2.9%     3.2%               0x08  18156  18159 0xffffffff815c2106      n/a      n/a      n/a      5 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%    17.6%     0.0%               0x08  18156  18159 0xffffffff815c2118      n/a      n/a      n/a      7 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     6.2%    31.7%               0x08  18156  18159 0xffffffff815c212c      n/a      n/a      n/a      7 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x08  18156  18159 0xffffffff815c21aa      n/a      n/a      n/a      1 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%     9.5%               0x08  18156  18159 0xffffffff815c21e8      n/a      n/a      n/a      5 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x08  18156  18159 0xffffffff815c2246      n/a      n/a      n/a      1 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.4%     0.0%     0.0%     0.0%               0x08  18156  18159 0xffffffff815c429a       -1      358    35.7%      2 schedule_preempt_disabled      [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%    23.8%               0x10  18156    *** 0xffffffff81082e9b      n/a      n/a      n/a     10 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x10  18156  18159 0xffffffff810908a2      n/a      n/a      n/a      2 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.3%     0.0%               0x10  18156  18159 0xffffffff81137c4d      n/a      n/a      n/a      4 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     4.2%     0.0%               0x10  18156  18159 0xffffffff81137d7b      n/a      n/a      n/a      6 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     2.0%     0.0%               0x10  18156  18159 0xffffffff81137dbc      n/a      n/a      n/a      5 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.2%     0.0%     0.0%     0.0%               0x10  18156  18175 0xffffffff812b7e3b       -1      502     0.0%      1 __list_add                     [kernel.kallsyms]     
                                    90.8%    92.3%     0.0%     0.0%               0x18  18156  18159 0xffffffff81082ea7       -1      481    25.2%      7 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%    31.7%               0x18  18156    *** 0xffffffff81082eda      n/a      n/a      n/a     12 mspin_unlock                   [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x18  18156  18159 0xffffffff810908a0      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%    24.2%     0.0%               0x18  18156  18159 0xffffffff811360de      n/a      n/a      n/a      7 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%     7.8%     0.0%               0x18  18156  18159 0xffffffff81137dab      n/a      n/a      n/a      6 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x20  18156  18159 0xffffffff8109089e      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     4.9%     7.2%     0.0%     0.0%               0x20  18156    *** 0xffffffff815c208e       -1      381    15.7%     28 __mutex_unlock_slowpath        [kernel.kallsyms]     
                                     0.0%     0.0%    15.9%     0.0%               0x28  18156  18159 0xffffffff81137d63      n/a      n/a      n/a      7 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x30  18156  18159 0xffffffff81090895      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x30  18156  18159 0xffffffff81137c49      n/a      n/a      n/a      1 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     4.3%     0.0%               0x30  18156  18159 0xffffffff81137dc1      n/a      n/a      n/a      7 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x30  18156  18159 0xffffffff815ce68c      n/a      n/a      n/a      1 apic_timer_interrupt           [kernel.kallsyms]     
                                     0.0%     0.0%     2.3%     0.0%               0x38  18156  18159 0xffffffff81090b4e      n/a      n/a      n/a      5 wake_up_process                [kernel.kallsyms]     
                                     0.0%     0.0%     1.3%     0.0%               0x38  18156  18159 0xffffffff81137c2c      n/a      n/a      n/a      6 __generic_file_aio_write       [kernel.kallsyms]     

-----------------------------------------------------------------------------------------------
   4   1.3%  26.6%   1.8%  36.6%      495      170      967       38 0xffff881fba539cc0  18156
-----------------------------------------------------------------------------------------------
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18169 0xffffffff81082e75      n/a      n/a      n/a      2 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x00  18156  18169 0xffffffff81082f15      n/a      n/a      n/a      2 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.0%     0.6%     0.0%     0.0%               0x00  18156  18169 0xffffffff81082f5a      n/a      n/a      n/a      1 mutex_spin_on_owner            [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18169 0xffffffff810908ac      n/a      n/a      n/a      1 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.0%     0.0%               0x00  18156  18169 0xffffffff81090a73      n/a      n/a      n/a      5 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     2.1%     0.0%               0x00  18156  18169 0xffffffff8113524c      n/a      n/a      n/a      7 generic_segment_checks         [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18169 0xffffffff81135f09      n/a      n/a      n/a      1 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%     0.4%     0.0%               0x00  18156  18169 0xffffffff811ba059      n/a      n/a      n/a      3 file_update_time               [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x00  18156  18169 0xffffffff815c4275      n/a      n/a      n/a      1 schedule_preempt_disabled      [kernel.kallsyms]     
                                     1.2%     0.0%     0.0%     0.0%               0x00  18156  18169 0xffffffff815c4299       -1      298     8.7%      6 schedule_preempt_disabled      [kernel.kallsyms]     
                                     0.2%     0.0%     0.0%     0.0%               0x00  18156  18169 0xffffffff815c4f2c       -1      441     0.0%      1 _raw_spin_lock                 [kernel.kallsyms]     
                                     0.4%     0.0%     0.0%     0.0%               0x00  18156  18169 0xffffffff815c4f4f       -1      353     8.8%      2 _raw_spin_lock                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.2%     0.0%               0x08  18156  18169 0xffffffff81135245      n/a      n/a      n/a      9 generic_segment_checks         [kernel.kallsyms]     
                                     0.0%     0.0%     1.3%     0.0%               0x08  18156  18169 0xffffffff811b9cb5      n/a      n/a      n/a      8 file_remove_suid               [kernel.kallsyms]     
                                     0.0%     0.0%     0.6%     0.0%               0x08  18156  18169 0xffffffff811ba055      n/a      n/a      n/a      4 file_update_time               [kernel.kallsyms]     
                                     0.0%     0.0%     4.2%    13.2%               0x08  18156  18169 0xffffffff815c2106      n/a      n/a      n/a      9 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%    13.8%     0.0%               0x08  18156  18169 0xffffffff815c2118      n/a      n/a      n/a     12 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     5.0%    50.0%               0x08  18156  18169 0xffffffff815c212c      n/a      n/a      n/a     11 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x08  18156  18169 0xffffffff815c21e0      n/a      n/a      n/a      1 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%    15.8%               0x08  18156  18169 0xffffffff815c21e8      n/a      n/a      n/a      5 __mutex_lock_slowpath          [kernel.kallsyms]     
                                     1.0%     0.0%     0.0%     0.0%               0x08  18156  18169 0xffffffff815c429a       -1      286     5.8%      3 schedule_preempt_disabled      [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x10  18156  18169 0xffffffff81082e80      n/a      n/a      n/a      1 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x10  18156  18169 0xffffffff810908a2      n/a      n/a      n/a      2 try_to_wake_up                 [kernel.kallsyms]     
                                     0.0%     0.0%     1.2%     0.0%               0x10  18156  18169 0xffffffff81137c4d      n/a      n/a      n/a      6 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     5.2%     0.0%               0x10  18156  18169 0xffffffff81137d7b      n/a      n/a      n/a     12 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     1.9%     0.0%               0x10  18156  18169 0xffffffff81137dbc      n/a      n/a      n/a      8 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.2%     0.0%     0.0%     0.0%               0x10  18156  18175 0xffffffff812b7e3b       -1      355     0.0%      1 __list_add                     [kernel.kallsyms]     
                                    90.9%    93.5%     0.0%     0.0%               0x18  18156  18169 0xffffffff81082ea7       -1      613    28.9%     13 mspin_lock                     [kernel.kallsyms]     
                                     0.0%     0.0%     0.0%    21.1%               0x18  18156    *** 0xffffffff81082eda      n/a      n/a      n/a      7 mspin_unlock                   [kernel.kallsyms]     
                                     0.0%     0.0%    26.0%     0.0%               0x18  18156  18169 0xffffffff811360de      n/a      n/a      n/a     12 generic_file_buffered_write    [kernel.kallsyms]     
                                     0.0%     0.0%     9.0%     0.0%               0x18  18156  18169 0xffffffff81137dab      n/a      n/a      n/a     12 __generic_file_aio_write       [kernel.kallsyms]     
                                     6.1%     5.9%     0.0%     0.0%               0x20  18156    *** 0xffffffff815c208e       -1      377    19.3%     25 __mutex_unlock_slowpath        [kernel.kallsyms]     
                                     0.0%     0.0%    18.2%     0.0%               0x28  18156  18169 0xffffffff81137d63      n/a      n/a      n/a     12 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.2%     0.0%               0x30  18156  18169 0xffffffff81137c49      n/a      n/a      n/a      2 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     0.1%     0.0%               0x30  18156  18169 0xffffffff81137d67      n/a      n/a      n/a      1 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     3.1%     0.0%               0x30  18156  18169 0xffffffff81137dc1      n/a      n/a      n/a      8 __generic_file_aio_write       [kernel.kallsyms]     
                                     0.0%     0.0%     3.1%     0.0%               0x38  18156  18169 0xffffffff81090b4e      n/a      n/a      n/a     11 wake_up_process                [kernel.kallsyms]     
                                     0.0%     0.0%     1.1%     0.0%               0x38  18156  18169 0xffffffff81137c2c      n/a      n/a      n/a      8 __generic_file_aio_write       [kernel.kallsyms]     


<snip 58 records>
....



=====================================================================================================================================
                                                   Object Name, Path & Reference Counts

Index    Records   Object Name                       Object Path                                                                     
=====================================================================================================================================
    0     931379   [kernel.kallsyms]                 [kernel.kallsyms]                                                               
    1     192258   fio                               /home/joe/old_fio-2.0.15/fio                                                    
    2      80302   [jbd2]                            /lib/modules/3.10.0c2c_all+/kernel/fs/jbd2/jbd2.ko                              
    3      65392   [ext4]                            /lib/modules/3.10.0c2c_all+/kernel/fs/ext4/ext4.ko                              
    4       8236   libpthread-2.17.so                /usr/lib64/libpthread-2.17.so                                                   
    5         19   [ip_tables]                       /lib/modules/3.10.0c2c_all+/kernel/net/ipv4/netfilter/ip_tables.ko              
    6         17   [ixgbe]                           /lib/modules/3.10.0c2c_all+/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko    
    7         17   perf                              /home/root/git/rhel7.don/tools/perf/perf                                        
    8         13   [ipmi_si]                         /lib/modules/3.10.0c2c_all+/kernel/drivers/char/ipmi/ipmi_si.ko                 
    9         11   libc-2.17.so                      /usr/lib64/libc-2.17.so                                                         
   10         10   [megaraid_sas]                    /lib/modules/3.10.0c2c_all+/kernel/drivers/scsi/megaraid/megaraid_sas.ko        
   11          9   [mtip32xx]                        /lib/modules/3.10.0c2c_all+/kernel/drivers/block/mtip32xx/mtip32xx.ko           
   12          6   irqbalance                        /usr/sbin/irqbalance                                                            
   13          6   libpython2.7.so.1.0               /usr/lib64/libpython2.7.so.1.0                                                  
   14          5   [ip6_tables]                      /lib/modules/3.10.0c2c_all+/kernel/net/ipv6/netfilter/ip6_tables.ko             
   15          4   [nf_conntrack]                    /lib/modules/3.10.0c2c_all+/kernel/net/netfilter/nf_conntrack.ko                
   16          2   [dm_mod]                          /lib/modules/3.10.0c2c_all+/kernel/drivers/md/dm-mod.ko                         
   17          2   sshd                              /usr/sbin/sshd                                                                  
   18          1   [iptable_raw]                     /lib/modules/3.10.0c2c_all+/kernel/net/ipv4/netfilter/iptable_raw.ko            
   19          1   [sb_edac]                         /lib/modules/3.10.0c2c_all+/kernel/drivers/edac/sb_edac.ko                      
   20          1   [edac_core]                       /lib/modules/3.10.0c2c_all+/kernel/drivers/edac/edac_core.ko                    
   21          1   libdbus-1.so.3.7.4                /usr/lib64/libdbus-1.so.3.7.4                                                   


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 21:29 ` Peter Zijlstra
  2014-02-10 22:20   ` Don Zickus
@ 2014-02-10 22:21   ` Stephane Eranian
  2014-02-11  7:14     ` Peter Zijlstra
  1 sibling, 1 reply; 72+ messages in thread
From: Stephane Eranian @ 2014-02-10 22:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Mon, Feb 10, 2014 at 10:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
>> The data output is verbose and there are lots of data tables that interprit the latencies
>> and data addresses in different ways to help see where bottlenecks might be lying.
>
> Would be good to see what the output looks like.
>
> What I haven't seen; and what I would find most useful; is using the IP
> + dwarf info to map it back to a data structure member.
>
> Since you're already using the PEBS data-source fields, you can also
> have a precise IP. For many cases its possible to reconstruct the exact
> data member the instruction is modifying.
>
The tool already uses precise=2 to get the precise IP.

To get from IP to data member, you'd need some debug info which is not
yet emitted by the compiler.

> At that point you can do pahole like output of data structures, showing
> which members are 'hot' on misses etc.
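
(For reference, "precise=2" maps to the precise_ip field of struct
perf_event_attr.  A minimal sketch of opening one such load-latency sampling
event follows; the raw encoding 0x1cd with the ldlat threshold in config1 is
an assumption based on what perf's mem-loads alias uses on Ivy Bridge, not
necessarily what c2c itself programs.)

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size          = sizeof(attr);
	attr.type          = PERF_TYPE_RAW;
	attr.config        = 0x1cd;	/* assumed: MEM_TRANS_RETIRED.LOAD_LATENCY */
	attr.config1       = 30;	/* assumed: ldlat threshold, in cycles */
	attr.sample_period = 1000;
	attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
			     PERF_SAMPLE_ADDR | PERF_SAMPLE_WEIGHT |
			     PERF_SAMPLE_DATA_SRC;
	attr.precise_ip    = 2;		/* PEBS: ask for a precise IP */

	fd = syscall(__NR_perf_event_open, &attr, 0 /* this thread */,
		     -1 /* any cpu */, -1 /* no group */, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	close(fd);
	return 0;
}

Each sample then carries the precise IP together with the data address, data
source and weight (latency), which is the raw material the tables earlier in
the thread are built from.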

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-10 22:21   ` Stephane Eranian
@ 2014-02-11  7:14     ` Peter Zijlstra
  2014-02-11 10:35       ` Stephane Eranian
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11  7:14 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Mon, Feb 10, 2014 at 11:21:53PM +0100, Stephane Eranian wrote:
> On Mon, Feb 10, 2014 at 10:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
> >> The data output is verbose and there are lots of data tables that interprit the latencies
> >> and data addresses in different ways to help see where bottlenecks might be lying.
> >
> > Would be good to see what the output looks like.
> >
> > What I haven't seen; and what I would find most useful; is using the IP
> > + dwarf info to map it back to a data structure member.
> >
> > Since you're already using the PEBS data-source fields, you can also
> > have a precise IP. For many cases its possible to reconstruct the exact
> > data member the instruction is modifying.
> >
> The tool already uses precise=2 to get the precise IP.
> 
> To get from IP to data member, you'd need some debug info which is not
> yet emitted
> by the compiler.

That blows; how much is missing?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11  7:14     ` Peter Zijlstra
@ 2014-02-11 10:35       ` Stephane Eranian
  2014-02-11 10:52         ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Stephane Eranian @ 2014-02-11 10:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Feb 10, 2014 at 11:21:53PM +0100, Stephane Eranian wrote:
>> On Mon, Feb 10, 2014 at 10:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Feb 10, 2014 at 12:28:55PM -0500, Don Zickus wrote:
>> >> The data output is verbose and there are lots of data tables that interprit the latencies
>> >> and data addresses in different ways to help see where bottlenecks might be lying.
>> >
>> > Would be good to see what the output looks like.
>> >
>> > What I haven't seen; and what I would find most useful; is using the IP
>> > + dwarf info to map it back to a data structure member.
>> >
>> > Since you're already using the PEBS data-source fields, you can also
>> > have a precise IP. For many cases its possible to reconstruct the exact
>> > data member the instruction is modifying.
>> >
>> The tool already uses precise=2 to get the precise IP.
>>
>> To get from IP to data member, you'd need some debug info which is not
>> yet emitted
>> by the compiler.
>
> That blows; how much is missing?

They need to annotate loads and stores. I asked for that feature a while ago.
It will come.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 10:35       ` Stephane Eranian
@ 2014-02-11 10:52         ` Peter Zijlstra
  2014-02-11 10:58           ` Stephane Eranian
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 10:52 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > That blows; how much is missing?
> 
> They need to annotate load and stores. I asked for that feature a while ago.
> It will come.

And there is no way to deduce the information? We have type information
for all arguments and local variables, right? So we can follow that.

struct foo {
	int ponies;
	int moar_ponies;
};

struct bar {
	int my_ponies;
	struct foo *foo;
};

int moo(struct bar *bar)
{
	return bar->foo->moar_ponies;
}

Since we have the argument type, we can find the type for both loads,
the first load:

  *bar+8, we know is: struct foo * bar::foo
  *foo+4, we know is: int foo::moar_ponies

Or am I missing something?
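
As a quick sanity check of those two offsets (assuming the usual LP64 x86-64
layout: 4-byte int, 8-byte pointers, hence 4 bytes of padding after
my_ponies), a throwaway program:

#include <stdio.h>
#include <stddef.h>

struct foo {
	int ponies;
	int moar_ponies;
};

struct bar {
	int my_ponies;
	struct foo *foo;	/* 4 bytes of padding precede this on LP64 */
};

/* typically lowered to:  mov 0x8(%rdi),%rax ; mov 0x4(%rax),%eax ; ret */
int moo(struct bar *bar)
{
	return bar->foo->moar_ponies;
}

int main(void)
{
	struct foo f = { 0, 42 };
	struct bar b = { 0, &f };

	printf("offsetof(struct bar, foo)         = %zu\n",
	       offsetof(struct bar, foo));		/* 8 */
	printf("offsetof(struct foo, moar_ponies) = %zu\n",
	       offsetof(struct foo, moar_ponies));	/* 4 */
	printf("moo() = %d\n", moo(&b));
	return 0;
}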

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 10:52         ` Peter Zijlstra
@ 2014-02-11 10:58           ` Stephane Eranian
  2014-02-11 11:02             ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Stephane Eranian @ 2014-02-11 10:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
>> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > That blows; how much is missing?
>>
>> They need to annotate load and stores. I asked for that feature a while ago.
>> It will come.
>
> And there is no way to deduce the information? We have type information
> for all arguments and local variables, right? So we can follow that.
>
> struct foo {
>         int ponies;
>         int moar_ponies;
> };
>
> struct bar {
>         int my_ponies;
>         struct foo *foo;
> };
>
> int moo(struct bar *bar)
> {
>         return bar->foo->moar_ponies;
> }
>
> Since we have the argument type, we can find the type for both loads,
> the first load:
>
>   *bar+8, we know is: struct foo * bar::foo
>   *foo+4, we know is: int foo::moar_ponies
>
> Or am I missing something?

How do you know that load at addr 0x1000 is accessing variable bar?
The IP gives you line number, and then what?
I think dwarf has the mapping regs -> variable and yes, the type info.
But I am not sure that's enough.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 10:58           ` Stephane Eranian
@ 2014-02-11 11:02             ` Peter Zijlstra
  2014-02-11 11:04               ` Stephane Eranian
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 11:02 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 11:58:45AM +0100, Stephane Eranian wrote:
> On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
> >> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> >
> >> > That blows; how much is missing?
> >>
> >> They need to annotate load and stores. I asked for that feature a while ago.
> >> It will come.
> >
> > And there is no way to deduce the information? We have type information
> > for all arguments and local variables, right? So we can follow that.
> >
> > struct foo {
> >         int ponies;
> >         int moar_ponies;
> > };
> >
> > struct bar {
> >         int my_ponies;
> >         struct foo *foo;
> > };
> >
> > int moo(struct bar *bar)
> > {
> >         return bar->foo->moar_ponies;
> > }
> >
> > Since we have the argument type, we can find the type for both loads,
> > the first load:
> >
> >   *bar+8, we know is: struct foo * bar::foo
> >   *foo+4, we know is: int foo::moar_ponies
> >
> > Or am I missing something?
> 
> How do you know that load at addr 0x1000 is accessing variable bar?
> The IP gives you line number, and then what?
> I think dwarf has the mapping regs -> variable and yes, the type info.
> But I am not sure that's enough.

Ah, but if you have the instruction, you can decode it and obtain the
reg and thus type-info, no?


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:02             ` Peter Zijlstra
@ 2014-02-11 11:04               ` Stephane Eranian
  2014-02-11 11:08                 ` Peter Zijlstra
  2014-02-11 11:08                 ` Stephane Eranian
  0 siblings, 2 replies; 72+ messages in thread
From: Stephane Eranian @ 2014-02-11 11:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 11, 2014 at 11:58:45AM +0100, Stephane Eranian wrote:
>> On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
>> >> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> >
>> >> > That blows; how much is missing?
>> >>
>> >> They need to annotate load and stores. I asked for that feature a while ago.
>> >> It will come.
>> >
>> > And there is no way to deduce the information? We have type information
>> > for all arguments and local variables, right? So we can follow that.
>> >
>> > struct foo {
>> >         int ponies;
>> >         int moar_ponies;
>> > };
>> >
>> > struct bar {
>> >         int my_ponies;
>> >         struct foo *foo;
>> > };
>> >
>> > int moo(struct bar *bar)
>> > {
>> >         return bar->foo->moar_ponies;
>> > }
>> >
>> > Since we have the argument type, we can find the type for both loads,
>> > the first load:
>> >
>> >   *bar+8, we know is: struct foo * bar::foo
>> >   *foo+4, we know is: int foo::moar_ponies
>> >
>> > Or am I missing something?
>>
>> How do you know that load at addr 0x1000 is accessing variable bar?
>> The IP gives you line number, and then what?
>> I think dwarf has the mapping regs -> variable and yes, the type info.
>> But I am not sure that's enough.
>
> Ah, but if you have the instruction, you can decode it and obtain the
> reg and thus type-info, no?
>
But on x86, you can load directly from memory, you'd only have the
target reg for the load. Not enough.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:04               ` Stephane Eranian
@ 2014-02-11 11:08                 ` Peter Zijlstra
  2014-02-11 11:08                 ` Stephane Eranian
  1 sibling, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 11:08 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:04:23PM +0100, Stephane Eranian wrote:
> >> How do you know that load at addr 0x1000 is accessing variable bar?
> >> The IP gives you line number, and then what?
> >> I think dwarf has the mapping regs -> variable and yes, the type info.
> >> But I am not sure that's enough.
> >
> > Ah, but if you have the instruction, you can decode it and obtain the
> > reg and thus type-info, no?
> >
> But on x86, you can load directly from memory, you'd only have the
> target reg for the load. Not enough.

But if you load an immediate, you should be able to find it in the
symbol table.

Any other load will have a register base and will thus have type-info
therefrom.
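
For illustration only (hypothetical codegen, compiler and ABI dependent),
the two cases look roughly like this in C: a load from an absolute
address names a symbol we can look up, while a pointer-based load has a
base register whose DWARF location entry leads back to a typed variable:

struct foo { int ponies; int moar_ponies; };

struct foo global_foo;			/* present in the symbol table */

int absolute_load(void)
{
	/* e.g. movl global_foo+4(%rip), %eax -- address names the symbol */
	return global_foo.moar_ponies;
}

int register_based_load(struct foo *f)
{
	/* e.g. movl 4(%rdi), %eax -- %rdi is 'f', typed 'struct foo *' */
	return f->moar_ponies;
}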

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:04               ` Stephane Eranian
  2014-02-11 11:08                 ` Peter Zijlstra
@ 2014-02-11 11:08                 ` Stephane Eranian
  2014-02-11 11:14                   ` Peter Zijlstra
  1 sibling, 1 reply; 72+ messages in thread
From: Stephane Eranian @ 2014-02-11 11:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:04 PM, Stephane Eranian <eranian@google.com> wrote:
> On Tue, Feb 11, 2014 at 12:02 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Feb 11, 2014 at 11:58:45AM +0100, Stephane Eranian wrote:
>>> On Tue, Feb 11, 2014 at 11:52 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Tue, Feb 11, 2014 at 11:35:45AM +0100, Stephane Eranian wrote:
>>> >> On Tue, Feb 11, 2014 at 8:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> >> >
>>> >> > That blows; how much is missing?
>>> >>
>>> >> They need to annotate load and stores. I asked for that feature a while ago.
>>> >> It will come.
>>> >
>>> > And there is no way to deduce the information? We have type information
>>> > for all arguments and local variables, right? So we can follow that.
>>> >
>>> > struct foo {
>>> >         int ponies;
>>> >         int moar_ponies;
>>> > };
>>> >
>>> > struct bar {
>>> >         int my_ponies;
>>> >         struct foo *foo;
>>> > };
>>> >
>>> > int moo(struct bar *bar)
>>> > {
>>> >         return bar->foo->moar_ponies;
>>> > }
>>> >
>>> > Since we have the argument type, we can find the type for both loads,
>>> > the first load:
>>> >
>>> >   *bar+8, we know is: struct foo * bar::foo
>>> >   *foo+4, we know is: int foo::moar_ponies
>>> >
>>> > Or am I missing something?
>>>
>>> How do you know that load at addr 0x1000 is accessing variable bar?
>>> The IP gives you line number, and then what?
>>> I think dwarf has the mapping regs -> variable and yes, the type info.
>>> But I am not sure that's enough.
>>
>> Ah, but if you have the instruction, you can decode it and obtain the
>> reg and thus type-info, no?
>>
> But on x86, you can load directly from memory, you'd only have the
> target reg for the load. Not enough.

Assuming you can decode and get the info about the base registers used,
you'd have to do this for each arch with load/store sampling capabilities.
this is painful compared to getting the portable info from dwarf directly.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:08                 ` Stephane Eranian
@ 2014-02-11 11:14                   ` Peter Zijlstra
  2014-02-11 11:28                     ` Stephane Eranian
  2014-02-11 11:50                     ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 11:14 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> Assuming you can decode and get the info about the base registers used,
> you'd have to do this for each arch with load/store sampling capabilities.
> this is painful compared to getting the portable info from dwarf directly.

But its useful now, as compared to whenever GCC gets around to
implementing more dwarves and that GCC getting used widely enough to
actually rely on it.

All you need for the decode is a disassembler, and every arch should
already have multiple of those. Should be easy to reuse one, right?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-10 22:10   ` Davidlohr Bueso
@ 2014-02-11 11:24     ` Jiri Olsa
  2014-02-11 11:31     ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 72+ messages in thread
From: Jiri Olsa @ 2014-02-11 11:24 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Don Zickus, acme, LKML, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > 
> > This is the start of a new perf tool that will collect information about
> > memory accesses and analyse it to find things like hot cachelines, etc.
> > 
> > This is basically trying to get a prototype written by Richard Fowles
> > written using the tools/perf coding style and libraries.
> > 
> > Start it from 'perf sched', this patch starts the process by adding the
> > 'record' subcommand to collect the needed mem loads and stores samples.
> > 
> > It also have the basic 'report' skeleton, resolving the sample address
> > and hooking the events found in a perf.data file with methods to handle
> > them, right now just printing the resolved perf_sample data structure
> > after each event name.
> 
> What tree/branch is this developed against? I'm getting the following
> with Linus' latest and tip tree:
> 
> builtin-c2c.c: In function ‘perf_c2c__process_sample’:
> builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c: In function ‘perf_c2c__read_events’:
> builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
> builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: declared here

got the same one.. Don, what did you base it on?

please, rebase to latest acme's perf/core or send your HEAD ;-)

thanks,
jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:14                   ` Peter Zijlstra
@ 2014-02-11 11:28                     ` Stephane Eranian
  2014-02-11 11:31                       ` Peter Zijlstra
  2014-02-11 11:50                     ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 72+ messages in thread
From: Stephane Eranian @ 2014-02-11 11:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
>> Assuming you can decode and get the info about the base registers used,
>> you'd have to do this for each arch with load/store sampling capabilities.
>> this is painful compared to getting the portable info from dwarf directly.
>
> But its useful now, as compared to whenever GCC gets around to
> implementing more dwarves and that GCC getting used widely enough to
> actually rely on it.
>
> All you need for the decode is a disassembler, and every arch should
> already have multiple of those. Should be easy to reuse one, right?

I know, and you want to pull this into the perf tool?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-10 22:10   ` Davidlohr Bueso
  2014-02-11 11:24     ` Jiri Olsa
@ 2014-02-11 11:31     ` Arnaldo Carvalho de Melo
  2014-02-11 13:54       ` Don Zickus
  2014-02-11 14:36       ` Don Zickus
  1 sibling, 2 replies; 72+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-02-11 11:31 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Don Zickus, LKML, jolsa, jmario, fowles, eranian, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

Em Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso escreveu:
> On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > 
> > This is the start of a new perf tool that will collect information about
> > memory accesses and analyse it to find things like hot cachelines, etc.
> > 
> > This is basically trying to get a prototype written by Richard Fowles
> > written using the tools/perf coding style and libraries.
> > 
> > Start it from 'perf sched', this patch starts the process by adding the
> > 'record' subcommand to collect the needed mem loads and stores samples.
> > 
> > It also have the basic 'report' skeleton, resolving the sample address
> > and hooking the events found in a perf.data file with methods to handle
> > them, right now just printing the resolved perf_sample data structure
> > after each event name.
> 
> What tree/branch is this developed against? I'm getting the following
> with Linus' latest and tip tree:

I'll try refreshing it on top of my perf/core branch today
 
> builtin-c2c.c: In function ‘perf_c2c__process_sample’:
> builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
> builtin-c2c.c: In function ‘perf_c2c__read_events’:
> builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
> builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
> In file included from builtin-c2c.c:6:0:
> util/session.h:52:22: note: declared here
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:28                     ` Stephane Eranian
@ 2014-02-11 11:31                       ` Peter Zijlstra
  2014-02-11 11:51                         ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 11:31 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:28:47PM +0100, Stephane Eranian wrote:
> On Tue, Feb 11, 2014 at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> >> Assuming you can decode and get the info about the base registers used,
> >> you'd have to do this for each arch with load/store sampling capabilities.
> >> this is painful compared to getting the portable info from dwarf directly.
> >
> > But its useful now, as compared to whenever GCC gets around to
> > implementing more dwarves and that GCC getting used widely enough to
> > actually rely on it.
> >
> > All you need for the decode is a disassembler, and every arch should
> > already have multiple of those. Should be easy to reuse one, right?
> 
> I know, and you want to pull this into the perf tool?

Sure why not, its already got the world and then some :/

It would be just another dynamic lib.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:14                   ` Peter Zijlstra
  2014-02-11 11:28                     ` Stephane Eranian
@ 2014-02-11 11:50                     ` Arnaldo Carvalho de Melo
  2014-02-11 12:09                       ` Peter Zijlstra
  2014-02-13 13:02                       ` Jiri Olsa
  1 sibling, 2 replies; 72+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-02-11 11:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, Don Zickus, LKML, Jiri Olsa, Joe Mario, Richard Fowles

Em Tue, Feb 11, 2014 at 12:14:21PM +0100, Peter Zijlstra escreveu:
> On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> > Assuming you can decode and get the info about the base registers used,
> > you'd have to do this for each arch with load/store sampling capabilities.
> > this is painful compared to getting the portable info from dwarf directly.
 
> But its useful now, as compared to whenever GCC gets around to
> implementing more dwarves and that GCC getting used widely enough to
> actually rely on it.
 
> All you need for the decode is a disassembler, and every arch should
> already have multiple of those. Should be easy to reuse one, right?

Yeah, I never got around to actually try to implement this, but my
feeling was that all the bits and pieces were there already:

1) the precise IP for the instruction, that disassembled would tell
which registers were being operated on, or memory that we would "reverse
map" to a register

2) DWARF expression locations that allows us to go from registers to a
variable/parameter and thus to a type

3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
it? Jiri?)

4) libunwind have register maps for various arches, so probably
something there could be reused here as well (Jiri?)

Get that and generate a series of (type,offset) tuples for the samples
and get pahole to highlight the members with different colours, just
like 'annotate' does with source code/asm.

That way we would reuse 'pahole' in much the same way as we reuse
'objdump'. Given some more time to revisit the libdwarves APIs, we could
then use it directly in perf, or perhaps extract just what is needed
and merge it into the kernel sources.

- Arnaldo
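
A rough sketch of the aggregation step described above, with made-up
names (none of these exist in perf); once a sample has been resolved to
a (type name, member offset) pair, something as simple as this could
count hits per member for a pahole-style annotator to colour:

#include <stdlib.h>
#include <string.h>

struct type_offset_stat {
	struct type_offset_stat	*next;
	char			*type_name;	/* e.g. "struct foo" */
	unsigned long		offset;		/* byte offset within the type */
	unsigned long		nr_samples;
};

static struct type_offset_stat *stats;

static void account_sample(const char *type_name, unsigned long offset)
{
	struct type_offset_stat *s;

	for (s = stats; s; s = s->next) {
		if (s->offset == offset && !strcmp(s->type_name, type_name)) {
			s->nr_samples++;
			return;
		}
	}

	s = calloc(1, sizeof(*s));
	if (s == NULL)
		return;

	s->type_name  = strdup(type_name);
	s->offset     = offset;
	s->nr_samples = 1;
	s->next       = stats;
	stats         = s;
}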

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:31                       ` Peter Zijlstra
@ 2014-02-11 11:51                         ` Peter Zijlstra
  0 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 11:51 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Don Zickus, Arnaldo Carvalho de Melo, LKML, Jiri Olsa, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 12:31:49PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 11, 2014 at 12:28:47PM +0100, Stephane Eranian wrote:
> > On Tue, Feb 11, 2014 at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> > >> Assuming you can decode and get the info about the base registers used,
> > >> you'd have to do this for each arch with load/store sampling capabilities.
> > >> this is painful compared to getting the portable info from dwarf directly.
> > >
> > > But its useful now, as compared to whenever GCC gets around to
> > > implementing more dwarves and that GCC getting used widely enough to
> > > actually rely on it.
> > >
> > > All you need for the decode is a disassembler, and every arch should
> > > already have multiple of those. Should be easy to reuse one, right?
> > 
> > I know, and you want to pull this into the perf tool?
> 
> Sure why not, its already got the world and then some :/
> 
> It would be just another dynamic lib.

The added benefit is that we could get rid of the objdump usage for
annotate if we find a usable disasm lib, at which point we can start
improving the annotations with this variable/type information as well.



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:50                     ` Arnaldo Carvalho de Melo
@ 2014-02-11 12:09                       ` Peter Zijlstra
  2014-02-13 13:02                       ` Jiri Olsa
  1 sibling, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-02-11 12:09 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Stephane Eranian, Don Zickus, LKML, Jiri Olsa, Joe Mario, Richard Fowles

On Tue, Feb 11, 2014 at 08:50:13AM -0300, Arnaldo Carvalho de Melo wrote:
> 3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
> it? Jiri?)

Note that the regs are in the POST instruction state, so any op that
does something like:

  MOV %edx, $(eax+edx*8)

Will have lost the original index value.
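
A small sketch of that case (the codegen shown is hypothetical; a
compiler is free to reuse the index register as the load's destination):

long pick(long *table, long idx)
{
	/* plausibly compiled to:  mov (%rdi,%rsi,8), %rsi
	 *
	 * %rsi held 'idx' before the instruction but holds table[idx] in
	 * the sampled post-instruction register state, so the original
	 * index can no longer be recovered from the registers alone.
	 */
	return table[idx];
}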

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-11 11:31     ` Arnaldo Carvalho de Melo
@ 2014-02-11 13:54       ` Don Zickus
  2014-02-11 14:36       ` Don Zickus
  1 sibling, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-11 13:54 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Davidlohr Bueso, LKML, jolsa, jmario, fowles, eranian,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

On Tue, Feb 11, 2014 at 08:31:27AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso escreveu:
> > On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > 
> > > This is the start of a new perf tool that will collect information about
> > > memory accesses and analyse it to find things like hot cachelines, etc.
> > > 
> > > This is basically trying to get a prototype written by Richard Fowles
> > > written using the tools/perf coding style and libraries.
> > > 
> > > Start it from 'perf sched', this patch starts the process by adding the
> > > 'record' subcommand to collect the needed mem loads and stores samples.
> > > 
> > > It also have the basic 'report' skeleton, resolving the sample address
> > > and hooking the events found in a perf.data file with methods to handle
> > > them, right now just printing the resolved perf_sample data structure
> > > after each event name.
> > 
> > What tree/branch is this developed against? I'm getting the following
> > with Linus' latest and tip tree:
> 
> I'll try refreshing it on top of my perf/core branch today

Sorry everyone,  I managed to rebase it on top of Ingo's master branch
f58a0b1790e3973b23548e297db60c18b29b0818

Let me find your perf/core branch.

Cheers,
Don

>  
> > builtin-c2c.c: In function ‘perf_c2c__process_sample’:
> > builtin-c2c.c:68:20: error: request for member ‘func’ in something not a structure or union
> > builtin-c2c.c:69:36: error: request for member ‘func’ in something not a structure or union
> > builtin-c2c.c: In function ‘perf_c2c__read_events’:
> > builtin-c2c.c:81:2: error: passing argument 1 of ‘perf_session__new’ from incompatible pointer type [-Werror]
> > In file included from builtin-c2c.c:6:0:
> > util/session.h:52:22: note: expected ‘struct perf_data_file *’ but argument is of type ‘const char *’
> > builtin-c2c.c:81:2: error: too many arguments to function ‘perf_session__new’
> > In file included from builtin-c2c.c:6:0:
> > util/session.h:52:22: note: declared here
> > 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-11 11:31     ` Arnaldo Carvalho de Melo
  2014-02-11 13:54       ` Don Zickus
@ 2014-02-11 14:36       ` Don Zickus
  2014-02-11 15:41         ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-11 14:36 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Davidlohr Bueso, LKML, jolsa, jmario, fowles, eranian,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

On Tue, Feb 11, 2014 at 08:31:27AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso escreveu:
> > On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > 
> > > This is the start of a new perf tool that will collect information about
> > > memory accesses and analyse it to find things like hot cachelines, etc.
> > > 
> > > This is basically trying to get a prototype written by Richard Fowles
> > > written using the tools/perf coding style and libraries.
> > > 
> > > Start it from 'perf sched', this patch starts the process by adding the
> > > 'record' subcommand to collect the needed mem loads and stores samples.
> > > 
> > > It also have the basic 'report' skeleton, resolving the sample address
> > > and hooking the events found in a perf.data file with methods to handle
> > > them, right now just printing the resolved perf_sample data structure
> > > after each event name.
> > 
> > What tree/branch is this developed against? I'm getting the following
> > with Linus' latest and tip tree:
> 
> I'll try refreshing it on top of my perf/core branch today

Sorry for the trouble.  I guess I missed all the function cleanups from
last month.  Attached is a patch that gets things to compile.

I'll split this patch up to the right spots on my next refresh.

Cheers,
Don

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a73535a..b55f281 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -1009,15 +1009,20 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 	struct c2c_entry *entry;
 	sample_handler f;
 	int err = -1;
+	struct addr_location al = {
+			.machine = machine,
+			.cpumode = cpumode,
+	};
 
-	if (evsel->handler.func == NULL)
+	if (evsel->handler == NULL)
 		return 0;
 
 	thread = machine__find_thread(machine, sample->pid);
 	if (thread == NULL)
 		goto err;
 
-	mi = machine__resolve_mem(machine, thread, sample, cpumode);
+	al.thread = thread;
+	mi = sample__resolve_mem(sample, &al);
 	if (mi == NULL)
 		goto err;
 
@@ -1031,7 +1036,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 	if (entry == NULL)
 		goto err_mem;
 
-	f = evsel->handler.func;
+	f = evsel->handler;
 	err = f(c2c, sample, entry);
 	if (err)
 		goto err_entry;
@@ -1040,8 +1045,8 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 	if (symbol_conf.use_callchain && sample->callchain) {
 		callchain_init(entry->callchain);
 
-		err = machine__resolve_callchain(machine, evsel, thread,
-						 sample, &parent, NULL);
+		err = sample__resolve_callchain(sample, &parent, evsel, &al,
+						PERF_MAX_STACK_DEPTH);
 		if (!err)
 			err = callchain_append(entry->callchain,
 					       &callchain_cursor,
@@ -1198,7 +1203,7 @@ struct refs {
 	struct list_head	list;
 	int			nr;
 	const char		*name;
-	char			*long_name;
+	const char		*long_name;
 };
 
 static int update_ref_tree(struct c2c_entry *entry)
@@ -2732,8 +2737,12 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 {
 	int err = -1;
 	struct perf_session *session;
+	struct perf_data_file file = {
+			.path = input_name,
+			.mode = PERF_DATA_MODE_READ,
+	};
 
-	session = perf_session__new(input_name, O_RDONLY, 0, false, &c2c->tool);
+	session = perf_session__new(&file, 0, &c2c->tool);
 	if (session == NULL) {
 		pr_debug("No memory for session\n");
 		goto out;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index faf29b0..49e0328 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1270,10 +1270,10 @@ int __perf_evlist__set_handlers(struct perf_evlist *evlist,
 		if (evsel == NULL)
 			continue;
 
-		if (evsel->handler.func != NULL)
+		if (evsel->handler != NULL)
 			goto out;
 
-		evsel->handler.func = assocs[i].handler;
+		evsel->handler = assocs[i].handler;
 	}
 
 	err = 0;

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-11 14:36       ` Don Zickus
@ 2014-02-11 15:41         ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 72+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-02-11 15:41 UTC (permalink / raw)
  To: Don Zickus
  Cc: Davidlohr Bueso, LKML, jolsa, jmario, fowles, eranian,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

Em Tue, Feb 11, 2014 at 09:36:43AM -0500, Don Zickus escreveu:
> On Tue, Feb 11, 2014 at 08:31:27AM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Feb 10, 2014 at 02:10:04PM -0800, Davidlohr Bueso escreveu:
> > > On Mon, 2014-02-10 at 14:18 -0500, Don Zickus wrote:
> > > > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > > > 
> > > > This is the start of a new perf tool that will collect information about
> > > > memory accesses and analyse it to find things like hot cachelines, etc.
> > > > 
> > > > This is basically trying to get a prototype written by Richard Fowles
> > > > written using the tools/perf coding style and libraries.
> > > > 
> > > > Start it from 'perf sched', this patch starts the process by adding the
> > > > 'record' subcommand to collect the needed mem loads and stores samples.
> > > > 
> > > > It also have the basic 'report' skeleton, resolving the sample address
> > > > and hooking the events found in a perf.data file with methods to handle
> > > > them, right now just printing the resolved perf_sample data structure
> > > > after each event name.
> > > 
> > > What tree/branch is this developed against? I'm getting the following
> > > with Linus' latest and tip tree:
> > 
> > I'll try refreshing it on top of my perf/core branch today
> 
> Sorry for the trouble.  I guess I missed all the function cleanups from
> last month.  Attached is a patch that gets things to compile.
> 
> I'll split this patch up to the right spots on my next refresh.

I keep cleaning up the libraries, trying to simplify their use as I see
how new tools use the core, hopefully making them easier/less
boilerplate'ish.

Will look at how you used it to see what can be folded into the
libraries, thanks!

- Arnaldo
 
> Cheers,
> Don
> 
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index a73535a..b55f281 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -1009,15 +1009,20 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
>  	struct c2c_entry *entry;
>  	sample_handler f;
>  	int err = -1;
> +	struct addr_location al = {
> +			.machine = machine,
> +			.cpumode = cpumode,
> +	};
>  
> -	if (evsel->handler.func == NULL)
> +	if (evsel->handler == NULL)
>  		return 0;
>  
>  	thread = machine__find_thread(machine, sample->pid);
>  	if (thread == NULL)
>  		goto err;
>  
> -	mi = machine__resolve_mem(machine, thread, sample, cpumode);
> +	al.thread = thread;
> +	mi = sample__resolve_mem(sample, &al);
>  	if (mi == NULL)
>  		goto err;
>  
> @@ -1031,7 +1036,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
>  	if (entry == NULL)
>  		goto err_mem;
>  
> -	f = evsel->handler.func;
> +	f = evsel->handler;
>  	err = f(c2c, sample, entry);
>  	if (err)
>  		goto err_entry;
> @@ -1040,8 +1045,8 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
>  	if (symbol_conf.use_callchain && sample->callchain) {
>  		callchain_init(entry->callchain);
>  
> -		err = machine__resolve_callchain(machine, evsel, thread,
> -						 sample, &parent, NULL);
> +		err = sample__resolve_callchain(sample, &parent, evsel, &al,
> +						PERF_MAX_STACK_DEPTH);
>  		if (!err)
>  			err = callchain_append(entry->callchain,
>  					       &callchain_cursor,
> @@ -1198,7 +1203,7 @@ struct refs {
>  	struct list_head	list;
>  	int			nr;
>  	const char		*name;
> -	char			*long_name;
> +	const char		*long_name;
>  };
>  
>  static int update_ref_tree(struct c2c_entry *entry)
> @@ -2732,8 +2737,12 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
>  {
>  	int err = -1;
>  	struct perf_session *session;
> +	struct perf_data_file file = {
> +			.path = input_name,
> +			.mode = PERF_DATA_MODE_READ,
> +	};
>  
> -	session = perf_session__new(input_name, O_RDONLY, 0, false, &c2c->tool);
> +	session = perf_session__new(&file, 0, &c2c->tool);
>  	if (session == NULL) {
>  		pr_debug("No memory for session\n");
>  		goto out;
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index faf29b0..49e0328 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1270,10 +1270,10 @@ int __perf_evlist__set_handlers(struct perf_evlist *evlist,
>  		if (evsel == NULL)
>  			continue;
>  
> -		if (evsel->handler.func != NULL)
> +		if (evsel->handler != NULL)
>  			goto out;
>  
> -		evsel->handler.func = assocs[i].handler;
> +		evsel->handler = assocs[i].handler;
>  	}
>  
>  	err = 0;

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-11 11:50                     ` Arnaldo Carvalho de Melo
  2014-02-11 12:09                       ` Peter Zijlstra
@ 2014-02-13 13:02                       ` Jiri Olsa
  2014-02-13 13:10                         ` Stephane Eranian
  1 sibling, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-13 13:02 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Stephane Eranian, Don Zickus, LKML, Joe Mario,
	Richard Fowles

On Tue, Feb 11, 2014 at 08:50:13AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Feb 11, 2014 at 12:14:21PM +0100, Peter Zijlstra escreveu:
> > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
> > > Assuming you can decode and get the info about the base registers used,
> > > you'd have to do this for each arch with load/store sampling capabilities.
> > > this is painful compared to getting the portable info from dwarf directly.
>  
> > But its useful now, as compared to whenever GCC gets around to
> > implementing more dwarves and that GCC getting used widely enough to
> > actually rely on it.
>  
> > All you need for the decode is a disassembler, and every arch should
> > already have multiple of those. Should be easy to reuse one, right?
> 
> Yeah, I never got around to actually try to implement this, but my
> feeling was that all the bits and pieces were there already:
> 
> 1) the precise IP for the instruction, that disassembled would tell
> which registers were being operated on, or memory that we would "reverse
> map" to a register
> 
> 2) DWARF expression locations that allows us to go from registers to a
> variable/parameter and thus to a type
> 
> 3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
> it? Jiri?)

well, it was meant to store user registers only,
to assist user DWARF unwind

we can add PERF_SAMPLE_REGS_KERNEL

> 
> 4) libunwind have register maps for various arches, so probably
> something there could be reused here as well (Jiri?)

not sure what you mean by 'something' here.. but yep,
libunwind does have register maps for various arches

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-13 13:02                       ` Jiri Olsa
@ 2014-02-13 13:10                         ` Stephane Eranian
  0 siblings, 0 replies; 72+ messages in thread
From: Stephane Eranian @ 2014-02-13 13:10 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Don Zickus, LKML,
	Joe Mario, Richard Fowles

On Thu, Feb 13, 2014 at 2:02 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> On Tue, Feb 11, 2014 at 08:50:13AM -0300, Arnaldo Carvalho de Melo wrote:
>> Em Tue, Feb 11, 2014 at 12:14:21PM +0100, Peter Zijlstra escreveu:
>> > On Tue, Feb 11, 2014 at 12:08:56PM +0100, Stephane Eranian wrote:
>> > > Assuming you can decode and get the info about the base registers used,
>> > > you'd have to do this for each arch with load/store sampling capabilities.
>> > > this is painful compared to getting the portable info from dwarf directly.
>>
>> > But its useful now, as compared to whenever GCC gets around to
>> > implementing more dwarves and that GCC getting used widely enough to
>> > actually rely on it.
>>
>> > All you need for the decode is a disassembler, and every arch should
>> > already have multiple of those. Should be easy to reuse one, right?
>>
>> Yeah, I never got around to actually try to implement this, but my
>> feeling was that all the bits and pieces were there already:
>>
>> 1) the precise IP for the instruction, that disassembled would tell
>> which registers were being operated on, or memory that we would "reverse
>> map" to a register
>>
>> 2) DWARF expression locations that allows us to go from registers to a
>> variable/parameter and thus to a type
>>
>> 3) PERF_SAMPLE_REGS_USER (from a quick look, why do we have "USER" in
>> it? Jiri?)
>
> well, it was meant to store user registers only,
> to assist user DWARF unwind
>
But it does capture the user state regardless of the stack snapshotting.

> we can add PERF_SAMPLE_REGS_KERNEL
>
I have a patch series to do this and more. It will be ready next month
hopefully.

>>
>> 4) libunwind have register maps for various arches, so probably
>> something there could be reused here as well (Jiri?)
>
> not sure what you mean by 'something' here.. but yep,
> libunwind does have register maps for various arches
>
> jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
       [not found] ` <1392053356-23024-2-git-send-email-dzickus@redhat.com>
@ 2014-02-18 12:52   ` Jiri Olsa
  2014-02-18 12:56     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 12:52 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

On Mon, Feb 10, 2014 at 12:28:56PM -0500, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <acme@redhat.com>
> 
> This is the start of a new perf tool that will collect information about
> memory accesses and analyse it to find things like hot cachelines, etc.
> 
> This is basically trying to get a prototype written by Richard Fowles
> written using the tools/perf coding style and libraries.
> 
> Start it from 'perf sched', this patch starts the process by adding the
> 'record' subcommand to collect the needed mem loads and stores samples.
> 
> It also have the basic 'report' skeleton, resolving the sample address
> and hooking the events found in a perf.data file with methods to handle
> them, right now just printing the resolved perf_sample data structure
> after each event name.

SNIP

> +		evsel->handler.func = assocs[i].handler;
> +	}
> +
> +	err = 0;
> +out:
> +	return err;
> +}
> diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
> index f5173cd..76f77c8 100644
> --- a/tools/perf/util/evlist.h
> +++ b/tools/perf/util/evlist.h
> @@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
>  	void	   *handler;
>  };
>  
> +int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> +				const struct perf_evsel_str_handler *assocs,
> +				size_t nr_assocs);
> +
> +#define perf_evlist__set_handlers(evlist, array) \
> +	__perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
> +

this is already implemented in the session object.. it just needs to be
changed to work over any event so it's globally usable

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits
       [not found] ` <1392053356-23024-3-git-send-email-dzickus@redhat.com>
@ 2014-02-18 12:53   ` Jiri Olsa
  2014-02-18 13:49     ` Arnaldo Carvalho de Melo
  2014-02-19  3:04     ` Don Zickus
  0 siblings, 2 replies; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 12:53 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

On Mon, Feb 10, 2014 at 12:28:57PM -0500, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <acme@redhat.com>
> 
> From the c2c prototype:
> 
> [root@sandy ~]# perf c2c -r report | head -7
> T Status    Pid Tid CPU          Inst Adrs     Virt Data Adrs Phys Data Adrs Cycles Source      Decoded Source                ObJect:Symbol
> --------------------------------------------------------------------------------------------------------------------------------------------
>   raw input 779 779   7 0xffffffff810865dd 0xffff8803f4d75ec8              0    370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA]    [kernel.kallsyms]:try_to_wake_up
>   raw input 779 779   7 0xffffffff8107acb3 0xffff8802a5b73158              0    297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
>   raw input 779 779   7       0x3b7e009814     0x7fff87429ea0              0    925 0x68100142 [LOAD,L1,HIT,SNP NONE]        ???:???
>   raw input   0   0   1 0xffffffff8108bf81 0xffff8803eafebf50              0    172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM]   [kernel.kallsyms]:update_stats_wait_end
>   raw input 779 779   7       0x3b7e0097cc     0x7fac94b69068              0    228 0x68100242 [LOAD,LFB,HIT,SNP NONE]       ???:???
> [root@sandy ~]#
> 
> The "Phys Data Adrs" column is not available at this point.

SNIP

> +		       sample->data_src,
> +		       data_src,
> +		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
> +		       al->sym ? al->sym->name : "???");
> +}
> +
> +static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> +					struct perf_sample *sample,
> +					struct addr_location *al)
> +{
> +	if (c2c->raw_records)
> +		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> +
>  	return 0;
>  }
>  
>  static const struct perf_evsel_str_handler handlers[] = {
> -	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> -	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
> +	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },

hm.. so it's only one function for both handlers.. no need
to use handlers at all then, right?

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex
  2014-02-10 17:29 ` [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex Don Zickus
@ 2014-02-18 12:56   ` Jiri Olsa
  2014-02-19  2:40     ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 12:56 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:29:00PM -0500, Don Zickus wrote:
> When printing the raw dump of a data file, the header.misc is
> printed as a decimal.  Unfortunately, that field is a bit mask, so
> it is hard to interpret as a decimal.
> 
> Print in hex, so the user can easily see what bits are set and more
> importantly what type of info it is conveying.
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  tools/perf/util/session.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 0b39a48..d1ad10f 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -793,7 +793,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
>  	if (!dump_trace)
>  		return;
>  
> -	printf("(IP, %d): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> +	printf("(IP, %x): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
>  	       event->header.misc, sample->pid, sample->tid, sample->ip,
>  	       sample->period, sample->addr);

nit, maybe use '0x%x' ? hum, but probably nobody is actually parsing this..
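
Just to illustrate the nit (the value below is an arbitrary example, not
a real misc encoding): a bitmask like 0x4002 is unreadable as decimal
but obvious in hex, with either the '%#x' or the '0x%x' spelling:

#include <stdio.h>

int main(void)
{
	unsigned int misc = 0x4002;	/* arbitrary example bits */

	/* prints: 16386 vs 0x4002 vs 0x4002 */
	printf("%u vs %#x vs 0x%x\n", misc, misc, misc);
	return 0;
}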

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-18 12:52   ` [PATCH 01/21] perf c2c: Shared data analyser Jiri Olsa
@ 2014-02-18 12:56     ` Arnaldo Carvalho de Melo
  2014-02-19  2:42       ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-02-18 12:56 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Don Zickus, LKML, jmario, fowles, eranian, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

Em Tue, Feb 18, 2014 at 01:52:06PM +0100, Jiri Olsa escreveu:
> On Mon, Feb 10, 2014 at 12:28:56PM -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>

<SNIP>

> > +++ b/tools/perf/util/evlist.h
> > @@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
> >  	void	   *handler;
> >  };
> >  
> > +int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> > +				const struct perf_evsel_str_handler *assocs,
> > +				size_t nr_assocs);
> > +
> > +#define perf_evlist__set_handlers(evlist, array) \
> > +	__perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
> > +
> 
> this is already implemented in the session object.. it just needs to be
> changed to work over any event so it's globally usable

This probably has some historical baggage, i.e. I introduced this in
this cset and later brought it to the trunk, perhaps; I need to review
this to figure it out, but I'm busy right now.

- Arnaldo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 08/21] perf, c2c: Rework setup code to prepare for features
  2014-02-10 17:29 ` [PATCH 08/21] perf, c2c: Rework setup code to prepare for features Don Zickus
@ 2014-02-18 13:02   ` Jiri Olsa
  2014-02-19  2:45     ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 13:02 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:29:03PM -0500, Don Zickus wrote:
> A basic patch that re-arranges some of the c2c code and adds a couple
> of small features to lay the ground work for the rest of the patch
> series.
> 
> Changes include:
> 
> o reworking the report path
> o creating an initial entry struct
> o replace preprocess_sample with simpler calls
> o rework raw output to handle separators
> o remove phys id gunk
> o add some generic options
> 
> There isn't much meat in this patch just a bunch of code movement and cleanups.
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---

SNIP

>  static int perf_c2c__process_sample(struct perf_tool *tool,
>  				    union perf_event *event,
> @@ -153,20 +198,63 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
>  				    struct machine *machine)
>  {
>  	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
> -	struct addr_location al;
> -	int err = 0;
> +	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> +	struct mem_info *mi;
> +	struct thread *thread;
> +	struct c2c_entry *entry;
> +	sample_handler f;
> +	int err = -1;
> +
> +	if (evsel->handler.func == NULL)
> +		return 0;
> +
> +	thread = machine__find_thread(machine, sample->tid);
> +	if (thread == NULL)
> +		goto err;
> +
> +	mi = machine__resolve_mem(machine, thread, sample, cpumode);
> +	if (mi == NULL)
> +		goto err;
>  
> -	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
> -		pr_err("problem processing %d event, skipping it.\n",
> -		       event->header.type);
> -		return -1;
> +	if (c2c->raw_records) {
> +		perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
> +		free(mi);
> +		return 0;
>  	}
>  
> -	if (evsel->handler.func != NULL) {
> -		sample_handler f = evsel->handler.func;
> -		err = f(c2c, sample, &al);
> +	entry = c2c_entry__new(sample, thread, mi, cpumode);
> +	if (entry == NULL)
> +		goto err_mem;
> +
> +	f = evsel->handler.func;
> +	err = f(c2c, sample, entry);
> +	if (err)
> +		goto err_entry;
> +
> +	return 0;

this looks like a new mode for namhyung's iterator patchset
http://marc.info/?l=linux-kernel&m=138967747319160&w=2

git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
perf/cumulate-v8

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data
  2014-02-10 17:29 ` [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data Don Zickus
@ 2014-02-18 13:04   ` Jiri Olsa
  2014-02-19  2:48     ` Don Zickus
  2014-02-21  2:45     ` Don Zickus
  0 siblings, 2 replies; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 13:04 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> In order for the c2c tool to work correctly, it needs to properly
> sort all the records on uniquely identifiable data addresses.  These
> unique addresses are converted from virtual addresses provided by the
> hardware into a kernel address using an mmap2 record as the decoder.
> 

SNIP

> +static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
> +{
> +	u64 l, r;
> +	struct map *l_map = left->mi->daddr.map;
> +	struct map *r_map = right->mi->daddr.map;
> +
> +	/* group event types together */
> +	if (left->cpumode > right->cpumode) return 1;
> +	if (left->cpumode < right->cpumode) return -1;
> +
> +	if (l_map->maj > r_map->maj) return 1;
> +	if (l_map->maj < r_map->maj) return -1;
> +
> +	if (l_map->min > r_map->min) return 1;
> +	if (l_map->min < r_map->min) return -1;
> +
> +	if (l_map->ino > r_map->ino) return 1;
> +	if (l_map->ino < r_map->ino) return -1;
> +
> +	if (l_map->ino_generation > r_map->ino_generation) return 1;
> +	if (l_map->ino_generation < r_map->ino_generation) return -1;
> +
> +	/*
> +	 * Addresses with no major/minor numbers are assumed to be
> +	 * anonymous in userspace.  Sort those on pid then address.
> +	 *
> +	 * The kernel and non-zero major/minor mapped areas are
> +	 * assumed to be unity mapped.  Sort those on address then pid.
> +	 */
> +
> +	/* al_addr does all the right addr - start + offset calculations */
> +	l = left->mi->daddr.al_addr;
> +	r = right->mi->daddr.al_addr;
> +
> +	if (l_map->maj || l_map->min) {
> +		/* mmapped areas */
> +
> +		/* hack to mark similar regions, 'right' is new entry */
> +		/* entries with same maj/min/ino/inogen are in same address space */
> +		right->color = REGION_SAME;
> +
> +		if (l > r) return 1;
> +		if (l < r) return -1;
> +
> +		/* sorting by iaddr makes calculations easier later */
> +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> +
> +		if (left->thread->pid_ > right->thread->pid_) return 1;
> +		if (left->thread->pid_ < right->thread->pid_) return -1;
> +
> +		if (left->thread->tid > right->thread->tid) return 1;
> +		if (left->thread->tid < right->thread->tid) return -1;
> +	} else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> +		/* kernel mapped areas where 'start' doesn't matter */
> +
> +		/* hack to mark similar regions, 'right' is new entry */
> +		/* whole kernel region is in the same address space */
> +		right->color = REGION_SAME;
> +
> +		if (l > r) return 1;
> +		if (l < r) return -1;
> +
> +		/* sorting by iaddr makes calculations easier later */
> +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> +
> +		if (left->thread->pid_ > right->thread->pid_) return 1;
> +		if (left->thread->pid_ < right->thread->pid_) return -1;
> +
> +		if (left->thread->tid > right->thread->tid) return 1;
> +		if (left->thread->tid < right->thread->tid) return -1;
> +	} else {
> +		/* userspace anonymous */
> +		if (left->thread->pid_ > right->thread->pid_) return 1;
> +		if (left->thread->pid_ < right->thread->pid_) return -1;
> +
> +		if (left->thread->tid > right->thread->tid) return 1;
> +		if (left->thread->tid < right->thread->tid) return -1;
> +
> +		/* hack to mark similar regions, 'right' is new entry */
> +		/* userspace anonymous address space is contained within pid */
> +		right->color = REGION_SAME;
> +
> +		if (l > r) return 1;
> +		if (l < r) return -1;
> +
> +		/* sorting by iaddr makes calculations easier later */
> +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> +	}
> +
> +	return 0;
> +}

there's a sort object doing exactly this over hist_entry's

Is there any reason not to use hist_entries?

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps
  2014-02-10 17:29 ` [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
@ 2014-02-18 13:05   ` Jiri Olsa
  2014-02-19  2:51     ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 13:05 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:29:05PM -0500, Don Zickus wrote:
> This patch adds a bunch of stats that will be used later in post-processing
> to determine where and with what frequency the HITMs are coming from.
> 
> Most of the stats are decoded from the data source response.  Another
> piece of the stats is tracking which cpu the record came in on.
> 
> In order to properly build a cpu map to map where interesting events are coming
> from, I shamelessly copy-n-pasted the cpu->NUMA node code from builtin-kmem.c.
> 
> As HITMs are most expensive when going across NUMA nodes, it only made sense
> to create a quick cpu->NUMA lookup for when processing the records.
> 
> Credit to Dick Fowles for determining which bits are important and how to
> properly track them.  Ported to perf by me.
> 
> Original-by: Dick Fowles <rfowles@redhat.com>
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---

SNIP

> +
> +static int setup_cpunode_map(void)
> +{
> +	struct dirent *dent1, *dent2;
> +	DIR *dir1, *dir2;
> +	unsigned int cpu, mem;
> +	char buf[PATH_MAX];
> +
> +	/* initialize globals */
> +	if (init_cpunode_map())
> +		return -1;
> +
> +	dir1 = opendir(PATH_SYS_NODE);
> +	if (!dir1)
> +		return 0;
> +
> +	/* walk tree and setup map */
> +	while ((dent1 = readdir(dir1)) != NULL) {
> +		if (dent1->d_type != DT_DIR ||
> +		    sscanf(dent1->d_name, "node%u", &mem) < 1)
> +			continue;
> +
> +		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
> +		dir2 = opendir(buf);
> +		if (!dir2)
> +			continue;
> +		while ((dent2 = readdir(dir2)) != NULL) {
> +			if (dent2->d_type != DT_LNK ||
> +			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
> +				continue;
> +			cpunode_map[cpu] = mem;
> +		}
> +		closedir(dir2);
> +	}
> +	closedir(dir1);
> +	return 0;
> +}


There's already setup_cpunode_map interface in builtin-kmem.c
Please make it global (maybe place in separate object?)
and use this one.

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 13/21] perf, c2c: Add callchain support
  2014-02-10 17:29 ` [PATCH 13/21] perf, c2c: Add callchain support Don Zickus
@ 2014-02-18 13:07   ` Jiri Olsa
  2014-02-19  2:54     ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 13:07 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:29:08PM -0500, Don Zickus wrote:
> Seeing cacheline statistics is useful by itself.  Seeing the callchain
> for these cache contentions saves time tracking things down.
> 
> This patch tries to add callchain support.  I had to use the generic
> interface from a previous patch to output things to stdout easily.
> 
> Other than the displaying the results, collecting the callchain and
> merging it was fairly straightforward.
> 
> I used a lot of copying-n-pasting from other builtin tools to get
> the intial parameter setup correctly and the automatic reading of
> 'symbol_conf.use_callchain' from the data file.
> 
> Hopefully this is all correct.  The amount of memory corruption (from the
> callchain dynamic array) seems to have dwindled done to nothing. :-)

hum.. the report command already has all this.. if we could go the
hist_entry way, there'd be no need to reimplement this

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/21] perf, c2c: Add symbol count table
  2014-02-10 17:29 ` [PATCH 17/21] perf, c2c: Add symbol count table Don Zickus
@ 2014-02-18 13:09   ` Jiri Olsa
  2014-02-19  2:56     ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-18 13:09 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Mon, Feb 10, 2014 at 12:29:12PM -0500, Don Zickus wrote:
> Just another table that displays the referenced symbols in the analysis
> report.  The table lists the most frequently used symbols first.
> 
> It is just another way to look at similar data to figure out who
> is causing the most contention (based on the workload used).
> 
> Originally done by Dick Fowles and ported by me.
> 
> Suggested-by: Joe Mario <jmario@redhat.com>
> Original-by: Dick Fowles <rfowles@redhat.com>
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 99 insertions(+)
> 
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index 32c2319..979187f 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -950,6 +950,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
>  	new->total_period	+= old->total_period;
>  }
>  
> +LIST_HEAD(ref_tree);
> +LIST_HEAD(ref_tree_sorted);
> +struct refs {
> +	struct list_head	list;
> +	int			nr;
> +	const char		*name;
> +	char			*long_name;
> +};
> +
> +static int update_ref_tree(struct c2c_entry *entry)
> +{
> +	struct refs *p;
> +	struct dso *dso = entry->mi->iaddr.map->dso;
> +	const char *name = dso->short_name;
> +
> +	list_for_each_entry(p, &ref_tree, list) {
> +		if (!strcmp(p->name, name))
> +			goto found;
> +	}
> +
> +	p = zalloc(sizeof(struct refs));
> +	if (!p)
> +		return -1;
> +	p->name = name;
> +	p->long_name = dso->long_name;
> +	list_add_tail(&p->list, &ref_tree);

so this is a tree, which is actually a list ;-)

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits
  2014-02-18 12:53   ` [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits Jiri Olsa
@ 2014-02-18 13:49     ` Arnaldo Carvalho de Melo
  2014-02-19  3:04     ` Don Zickus
  1 sibling, 0 replies; 72+ messages in thread
From: Arnaldo Carvalho de Melo @ 2014-02-18 13:49 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Don Zickus, LKML, jmario, fowles, eranian, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

Em Tue, Feb 18, 2014 at 01:53:35PM +0100, Jiri Olsa escreveu:
> On Mon, Feb 10, 2014 at 12:28:57PM -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
<SNIP>
> > +static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> > +					struct perf_sample *sample,
> > +					struct addr_location *al)
> > +{
> > +	if (c2c->raw_records)
> > +		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> > +
> >  	return 0;
> >  }

> >  static const struct perf_evsel_str_handler handlers[] = {
> > -	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > -	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
> > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },
 
> hm.. so it's only one function for both handlers.. no need
> to use handlers at all then, right?

This was just a skeleton from which to continue, so no, the idea isn't
to have just one function for both.
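
For illustration, a minimal sketch of how the split might eventually look,
with each handler keeping its own counters; the c2c->stats fields below are
hypothetical placeholders, not the actual c2c layout:

/* Sketch only: separate load and store handlers so each can keep its
 * own statistics; the c2c->stats fields below are hypothetical. */
static int perf_c2c__process_load(struct perf_c2c *c2c,
				  struct perf_sample *sample,
				  struct addr_location *al)
{
	if (c2c->raw_records)
		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);

	c2c->stats.nr_loads++;			/* hypothetical counter */
	c2c->stats.load_weight += sample->weight;
	return 0;
}

static int perf_c2c__process_store(struct perf_c2c *c2c,
				   struct perf_sample *sample,
				   struct addr_location *al)
{
	if (c2c->raw_records)
		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);

	c2c->stats.nr_stores++;			/* hypothetical counter */
	return 0;
}

static const struct perf_evsel_str_handler handlers[] = {
	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
};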

- Arnaldo

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex
  2014-02-18 12:56   ` Jiri Olsa
@ 2014-02-19  2:40     ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:40 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 01:56:44PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:00PM -0500, Don Zickus wrote:
> > When printing the raw dump of a data file, the header.misc is
> > printed as a decimal.  Unfortunately, that field is a bit mask, so
> > it is hard to interpret as a decimal.
> > 
> > Print in hex, so the user can easily see what bits are set and more
> > importantly what type of info it is conveying.
> > 
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> >  tools/perf/util/session.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> > index 0b39a48..d1ad10f 100644
> > --- a/tools/perf/util/session.c
> > +++ b/tools/perf/util/session.c
> > @@ -793,7 +793,7 @@ static void dump_sample(struct perf_evsel *evsel, union perf_event *event,
> >  	if (!dump_trace)
> >  		return;
> >  
> > -	printf("(IP, %d): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> > +	printf("(IP, %x): %d/%d: %#" PRIx64 " period: %" PRIu64 " addr: %#" PRIx64 "\n",
> >  	       event->header.misc, sample->pid, sample->tid, sample->ip,
> >  	       sample->period, sample->addr);
> 
> nit, maybe use '0x%x' ? hum, but probably nobody is actually parsing this..

Fair enough. :-)

Cheers,
Don


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 01/21] perf c2c: Shared data analyser
  2014-02-18 12:56     ` Arnaldo Carvalho de Melo
@ 2014-02-19  2:42       ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:42 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, LKML, jmario, fowles, eranian, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

On Tue, Feb 18, 2014 at 09:56:47AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Feb 18, 2014 at 01:52:06PM +0100, Jiri Olsa escreveu:
> > On Mon, Feb 10, 2014 at 12:28:56PM -0500, Don Zickus wrote:
> > > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> 
> <SNIP>
> 
> > > +++ b/tools/perf/util/evlist.h
> > > @@ -52,6 +52,13 @@ struct perf_evsel_str_handler {
> > >  	void	   *handler;
> > >  };
> > >  
> > > +int __perf_evlist__set_handlers(struct perf_evlist *evlist,
> > > +				const struct perf_evsel_str_handler *assocs,
> > > +				size_t nr_assocs);
> > > +
> > > +#define perf_evlist__set_handlers(evlist, array) \
> > > +	__perf_evlist__set_handlers(evlist, array, ARRAY_SIZE(array))
> > > +
> > 
> > this is already implemented in the session object.. it just needs to be
> > changed to work over any event so it's globally usable
> 
> This probably has some historical baggage, i.e. I introduced this in
> this cset and later brought it to the trunk, perhaps; I need to review
> this to figure it out, but I'm busy right now.

No worries, I will dig into it and try to use the right function calls.

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 08/21] perf, c2c: Rework setup code to prepare for features
  2014-02-18 13:02   ` Jiri Olsa
@ 2014-02-19  2:45     ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:45 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 02:02:23PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:03PM -0500, Don Zickus wrote:
> > A basic patch that re-arranges some of the c2c code and adds a couple
> > of small features to lay the groundwork for the rest of the patch
> > series.
> > 
> > Changes include:
> > 
> > o reworking the report path
> > o creating an initial entry struct
> > o replace preprocess_sample with simpler calls
> > o rework raw output to handle separators
> > o remove phys id gunk
> > o add some generic options
> > 
> > There isn't much meat in this patch, just a bunch of code movement and cleanups.
> > 
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> 
> SNIP
> 
> >  static int perf_c2c__process_sample(struct perf_tool *tool,
> >  				    union perf_event *event,
> > @@ -153,20 +198,63 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
> >  				    struct machine *machine)
> >  {
> >  	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
> > -	struct addr_location al;
> > -	int err = 0;
> > +	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
> > +	struct mem_info *mi;
> > +	struct thread *thread;
> > +	struct c2c_entry *entry;
> > +	sample_handler f;
> > +	int err = -1;
> > +
> > +	if (evsel->handler.func == NULL)
> > +		return 0;
> > +
> > +	thread = machine__find_thread(machine, sample->tid);
> > +	if (thread == NULL)
> > +		goto err;
> > +
> > +	mi = machine__resolve_mem(machine, thread, sample, cpumode);
> > +	if (mi == NULL)
> > +		goto err;
> >  
> > -	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
> > -		pr_err("problem processing %d event, skipping it.\n",
> > -		       event->header.type);
> > -		return -1;
> > +	if (c2c->raw_records) {
> > +		perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
> > +		free(mi);
> > +		return 0;
> >  	}
> >  
> > -	if (evsel->handler.func != NULL) {
> > -		sample_handler f = evsel->handler.func;
> > -		err = f(c2c, sample, &al);
> > +	entry = c2c_entry__new(sample, thread, mi, cpumode);
> > +	if (entry == NULL)
> > +		goto err_mem;
> > +
> > +	f = evsel->handler.func;
> > +	err = f(c2c, sample, entry);
> > +	if (err)
> > +		goto err_entry;
> > +
> > +	return 0;
> 
> this looks like a new mode for Namhyung's iterator patchset
> http://marc.info/?l=linux-kernel&m=138967747319160&w=2
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> perf/cumulate-v8

I'll take a look at it and see what is useful for me.  Thanks!

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data
  2014-02-18 13:04   ` Jiri Olsa
@ 2014-02-19  2:48     ` Don Zickus
  2014-02-21  2:45     ` Don Zickus
  1 sibling, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:48 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> > In order for the c2c tool to work correctly, it needs to properly
> > sort all the records on uniquely identifiable data addresses.  These
> > unique addresses are converted from virtual addresses provided by the
> > hardware into a kernel address using an mmap2 record as the decoder.
> > 
> 
> SNIP
> 
> > +static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
> > +{
> > +	u64 l, r;
> > +	struct map *l_map = left->mi->daddr.map;
> > +	struct map *r_map = right->mi->daddr.map;
> > +
> > +	/* group event types together */
> > +	if (left->cpumode > right->cpumode) return 1;
> > +	if (left->cpumode < right->cpumode) return -1;
> > +
> > +	if (l_map->maj > r_map->maj) return 1;
> > +	if (l_map->maj < r_map->maj) return -1;
> > +
> > +	if (l_map->min > r_map->min) return 1;
> > +	if (l_map->min < r_map->min) return -1;
> > +
> > +	if (l_map->ino > r_map->ino) return 1;
> > +	if (l_map->ino < r_map->ino) return -1;
> > +
> > +	if (l_map->ino_generation > r_map->ino_generation) return 1;
> > +	if (l_map->ino_generation < r_map->ino_generation) return -1;
> > +
> > +	/*
> > +	 * Addresses with no major/minor numbers are assumed to be
> > +	 * anonymous in userspace.  Sort those on pid then address.
> > +	 *
> > +	 * The kernel and non-zero major/minor mapped areas are
> > +	 * assumed to be unity mapped.  Sort those on address then pid.
> > +	 */
> > +
> > +	/* al_addr does all the right addr - start + offset calculations */
> > +	l = left->mi->daddr.al_addr;
> > +	r = right->mi->daddr.al_addr;
> > +
> > +	if (l_map->maj || l_map->min) {
> > +		/* mmapped areas */
> > +
> > +		/* hack to mark similar regions, 'right' is new entry */
> > +		/* entries with same maj/min/ino/inogen are in same address space */
> > +		right->color = REGION_SAME;
> > +
> > +		if (l > r) return 1;
> > +		if (l < r) return -1;
> > +
> > +		/* sorting by iaddr makes calculations easier later */
> > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > +		if (left->thread->pid_ > right->thread->pid_) return 1;
> > +		if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > +		if (left->thread->tid > right->thread->tid) return 1;
> > +		if (left->thread->tid < right->thread->tid) return -1;
> > +	} else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> > +		/* kernel mapped areas where 'start' doesn't matter */
> > +
> > +		/* hack to mark similar regions, 'right' is new entry */
> > +		/* whole kernel region is in the same address space */
> > +		right->color = REGION_SAME;
> > +
> > +		if (l > r) return 1;
> > +		if (l < r) return -1;
> > +
> > +		/* sorting by iaddr makes calculations easier later */
> > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > +		if (left->thread->pid_ > right->thread->pid_) return 1;
> > +		if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > +		if (left->thread->tid > right->thread->tid) return 1;
> > +		if (left->thread->tid < right->thread->tid) return -1;
> > +	} else {
> > +		/* userspace anonymous */
> > +		if (left->thread->pid_ > right->thread->pid_) return 1;
> > +		if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > +		if (left->thread->tid > right->thread->tid) return 1;
> > +		if (left->thread->tid < right->thread->tid) return -1;
> > +
> > +		/* hack to mark similar regions, 'right' is new entry */
> > +		/* userspace anonymous address space is contained within pid */
> > +		right->color = REGION_SAME;
> > +
> > +		if (l > r) return 1;
> > +		if (l < r) return -1;
> > +
> > +		/* sorting by iaddr makes calculations easier later */
> > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> there's a sort object doing exactly this over hist_entries
> 
> Is there any reason not to use hist_entries?

I started there but had trouble wrapping my head around how I wanted the
above implemented (it took several iterations to sort correctly), so I
took the standalone approach first.

I need to double-check how easy it is to manipulate the hist_entry tree
once sorted.  I have to re-sort the objects into another rbtree based on
cacheline HITMs.
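
For reference, a minimal sketch of that second pass could walk the sorted
tree and re-insert each entry into a new rbtree keyed on its HITM count;
the c2c_entry layout below is an assumption for illustration only:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative only: each c2c_entry is assumed to carry an rb_node and a
 * per-cacheline HITM count accumulated during the first pass. */
struct c2c_entry {
	struct rb_node	rb_node;
	u64		hitm_count;		/* hypothetical field */
};

static void resort_by_hitm(struct rb_root *src, struct rb_root *dst)
{
	struct rb_node *next = rb_first(src);

	while (next) {
		struct c2c_entry *entry = rb_entry(next, struct c2c_entry, rb_node);
		struct rb_node **p = &dst->rb_node, *parent = NULL;

		next = rb_next(next);		/* advance before unlinking */
		rb_erase(&entry->rb_node, src);

		/* plain rbtree insert, descending by HITM count */
		while (*p) {
			struct c2c_entry *e;

			parent = *p;
			e = rb_entry(parent, struct c2c_entry, rb_node);
			if (entry->hitm_count > e->hitm_count)
				p = &parent->rb_left;
			else
				p = &parent->rb_right;
		}
		rb_link_node(&entry->rb_node, parent, p);
		rb_insert_color(&entry->rb_node, dst);
	}
}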

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps
  2014-02-18 13:05   ` Jiri Olsa
@ 2014-02-19  2:51     ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:51 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 02:05:31PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:05PM -0500, Don Zickus wrote:
> > This patch adds a bunch of stats that will be used later in post-processing
> > to determine where and with what frequency the HITMs are coming from.
> > 
> > Most of the stats are decoded from the data source response.  Another
> > piece of the stats is tracking which cpu the record came in on.
> > 
> > In order to properly build a cpu map to map where interesting events are coming
> > from, I shamelessly copy-n-pasted the cpu->NUMA node code from builtin-kmem.c.
> > 
> > As HITMs are most expensive when going across NUMA nodes, it only made sense
> > to create a quick cpu->NUMA lookup for when processing the records.
> > 
> > Credit to Dick Fowles for determining which bits are important and how to
> > properly track them.  Ported to perf by me.
> > 
> > Original-by: Dick Fowles <rfowles@redhat.com>
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> 
> SNIP
> 
> > +
> > +static int setup_cpunode_map(void)
> > +{
> > +	struct dirent *dent1, *dent2;
> > +	DIR *dir1, *dir2;
> > +	unsigned int cpu, mem;
> > +	char buf[PATH_MAX];
> > +
> > +	/* initialize globals */
> > +	if (init_cpunode_map())
> > +		return -1;
> > +
> > +	dir1 = opendir(PATH_SYS_NODE);
> > +	if (!dir1)
> > +		return 0;
> > +
> > +	/* walk tree and setup map */
> > +	while ((dent1 = readdir(dir1)) != NULL) {
> > +		if (dent1->d_type != DT_DIR ||
> > +		    sscanf(dent1->d_name, "node%u", &mem) < 1)
> > +			continue;
> > +
> > +		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
> > +		dir2 = opendir(buf);
> > +		if (!dir2)
> > +			continue;
> > +		while ((dent2 = readdir(dir2)) != NULL) {
> > +			if (dent2->d_type != DT_LNK ||
> > +			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
> > +				continue;
> > +			cpunode_map[cpu] = mem;
> > +		}
> > +		closedir(dir2);
> > +	}
> > +	closedir(dir1);
> > +	return 0;
> > +}
> 
> 
> There's already setup_cpunode_map interface in builtin-kmem.c
> Please make it global (maybe place in separate object?)
> and use this one.

Heh, where do you think I got this from? :-)  Though I did tweak it for my
needs, namely I used 'possible' cpus as opposed to 'online' cpus to deal
with hotplug.

I also ran into a bug here, where this code populates an array based on
what is on the running system, not the system where the data was
collected.  Is it possible to have perf-archive add this info?

I'll try to make this function global in the next version.
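
A rough sketch of the shared helper, assuming it moves into a common object
and is sized off possible rather than online CPUs (function name and
placement are illustrative, not the eventual implementation):

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PATH_SYS_NODE "/sys/devices/system/node"

static int *cpunode_map;
static int max_cpu_num;

/* Sketch of a shared helper: build the cpu->node map from sysfs, sized
 * for all possible CPUs so hotplug on the recording machine still fits. */
int cpu__setup_cpunode_map(void)
{
	struct dirent *dent1, *dent2;
	DIR *dir1, *dir2;
	unsigned int cpu, mem;
	char buf[PATH_MAX];
	int i;

	max_cpu_num = sysconf(_SC_NPROCESSORS_CONF);
	cpunode_map = calloc(max_cpu_num, sizeof(int));
	if (!cpunode_map)
		return -1;
	for (i = 0; i < max_cpu_num; i++)
		cpunode_map[i] = -1;		/* unknown node by default */

	dir1 = opendir(PATH_SYS_NODE);
	if (!dir1)
		return 0;			/* no NUMA topology exposed */

	while ((dent1 = readdir(dir1)) != NULL) {
		if (dent1->d_type != DT_DIR ||
		    sscanf(dent1->d_name, "node%u", &mem) < 1)
			continue;

		snprintf(buf, sizeof(buf), "%s/%s", PATH_SYS_NODE, dent1->d_name);
		dir2 = opendir(buf);
		if (!dir2)
			continue;
		while ((dent2 = readdir(dir2)) != NULL) {
			if (dent2->d_type != DT_LNK ||
			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
				continue;
			if (cpu < (unsigned int)max_cpu_num)
				cpunode_map[cpu] = mem;
		}
		closedir(dir2);
	}
	closedir(dir1);
	return 0;
}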

Cheers,
Don

> 
> jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 13/21] perf, c2c: Add callchain support
  2014-02-18 13:07   ` Jiri Olsa
@ 2014-02-19  2:54     ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:54 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 02:07:31PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:08PM -0500, Don Zickus wrote:
> > Seeing cacheline statistics is useful by itself.  Seeing the callchain
> > for these cache contentions saves time tracking things down.
> > 
> > This patch tries to add callchain support.  I had to use the generic
> > interface from a previous patch to output things to stdout easily.
> > 
> > Other than displaying the results, collecting the callchain and
> > merging it was fairly straightforward.
> > 
> > I used a lot of copying-n-pasting from other builtin tools to get
> > the initial parameter setup correct and the automatic reading of
> > 'symbol_conf.use_callchain' from the data file.
> > 
> > Hopefully this is all correct.  The amount of memory corruption (from the
> > callchain dynamic array) seems to have dwindled down to nothing. :-)
> 
> hum.. the report command already has all this.. if we could go the
> hist_entry way, there'd be no need to reimplement this

Sure.  I will look into hist_entry.  It would also be nice if the
callchain support had a better API so those commands that can't hook in
through hist_entry had a simpler way to utilize callchains. :-)

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 17/21] perf, c2c: Add symbol count table
  2014-02-18 13:09   ` Jiri Olsa
@ 2014-02-19  2:56     ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  2:56 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 02:09:29PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:12PM -0500, Don Zickus wrote:
> > Just another table that displays the referenced symbols in the analysis
> > report.  The table lists the most frequently used symbols first.
> > 
> > It is just another way to look at similar data to figure out who
> > is causing the most contention (based on the workload used).
> > 
> > Originally done by Dick Fowles and ported by me.
> > 
> > Suggested-by: Joe Mario <jmario@redhat.com>
> > Original-by: Dick Fowles <rfowles@redhat.com>
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> >  tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 99 insertions(+)
> > 
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index 32c2319..979187f 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -950,6 +950,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
> >  	new->total_period	+= old->total_period;
> >  }
> >  
> > +LIST_HEAD(ref_tree);
> > +LIST_HEAD(ref_tree_sorted);
> > +struct refs {
> > +	struct list_head	list;
> > +	int			nr;
> > +	const char		*name;
> > +	char			*long_name;
> > +};
> > +
> > +static int update_ref_tree(struct c2c_entry *entry)
> > +{
> > +	struct refs *p;
> > +	struct dso *dso = entry->mi->iaddr.map->dso;
> > +	const char *name = dso->short_name;
> > +
> > +	list_for_each_entry(p, &ref_tree, list) {
> > +		if (!strcmp(p->name, name))
> > +			goto found;
> > +	}
> > +
> > +	p = zalloc(sizeof(struct refs));
> > +	if (!p)
> > +		return -1;
> > +	p->name = name;
> > +	p->long_name = dso->long_name;
> > +	list_add_tail(&p->list, &ref_tree);
> 
> so this is a tree, which is actually a list ;-)

It used to be a tree; now it is a stick. :-)

It's old code that needs to be renamed.  I can update that.
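
A minimal sketch of the rename, keeping the existing find-or-create
behaviour; the ref_list/ref_node names are only illustrative:

/* Sketch only: the "tree" renamed to what it is, a list of per-DSO
 * reference counts, looked up by the dso short name. */
static LIST_HEAD(ref_list);

struct ref_node {
	struct list_head	list;
	int			nr;
	const char		*name;
	char			*long_name;
};

static struct ref_node *ref_list__findnew(struct dso *dso)
{
	struct ref_node *node;

	list_for_each_entry(node, &ref_list, list) {
		if (!strcmp(node->name, dso->short_name))
			return node;
	}

	node = zalloc(sizeof(*node));
	if (!node)
		return NULL;
	node->name = dso->short_name;
	node->long_name = dso->long_name;
	list_add_tail(&node->list, &ref_list);
	return node;
}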

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits
  2014-02-18 12:53   ` [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits Jiri Olsa
  2014-02-18 13:49     ` Arnaldo Carvalho de Melo
@ 2014-02-19  3:04     ` Don Zickus
  1 sibling, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-19  3:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: acme, LKML, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

On Tue, Feb 18, 2014 at 01:53:35PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:28:57PM -0500, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> > 
> > From the c2c prototype:
> > 
> > [root@sandy ~]# perf c2c -r report | head -7
> > T Status    Pid Tid CPU          Inst Adrs     Virt Data Adrs Phys Data Adrs Cycles Source      Decoded Source                ObJect:Symbol
> > --------------------------------------------------------------------------------------------------------------------------------------------
> >   raw input 779 779   7 0xffffffff810865dd 0xffff8803f4d75ec8              0    370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA]    [kernel.kallsyms]:try_to_wake_up
> >   raw input 779 779   7 0xffffffff8107acb3 0xffff8802a5b73158              0    297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
> >   raw input 779 779   7       0x3b7e009814     0x7fff87429ea0              0    925 0x68100142 [LOAD,L1,HIT,SNP NONE]        ???:???
> >   raw input   0   0   1 0xffffffff8108bf81 0xffff8803eafebf50              0    172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM]   [kernel.kallsyms]:update_stats_wait_end
> >   raw input 779 779   7       0x3b7e0097cc     0x7fac94b69068              0    228 0x68100242 [LOAD,LFB,HIT,SNP NONE]       ???:???
> > [root@sandy ~]#
> > 
> > The "Phys Data Adrs" column is not available at this point.
> 
> SNIP
> 
> > +		       sample->data_src,
> > +		       data_src,
> > +		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
> > +		       al->sym ? al->sym->name : "???");
> > +}
> > +
> > +static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> > +					struct perf_sample *sample,
> > +					struct addr_location *al)
> > +{
> > +	if (c2c->raw_records)
> > +		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> > +
> >  	return 0;
> >  }
> >  
> >  static const struct perf_evsel_str_handler handlers[] = {
> > -	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > -	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
> > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },
> 
> hm.. so it's only one function for both handlers.. no need
> to use handlers at all then, right?

I implemented them separately but then realized they looked identical once
everything was working, so I combined them again.  I keep thinking there
has to be some advantage to having them separate, but I haven't found a use
case.

You still need to use the handlers, in case you want to add some other events
into the mix and have them filtered out with this tool.

However, I do have the problem of trying to figure out a good way to
dynamically adjust the '30' above.  Since Intel doesn't publish L1, LFB,
and L2 latency numbers, we have been guessing at 30 cycles for an LLC hit.
It would probably be nice to adjust that on the command line as opposed to
recompiling.  Small issue.
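
A small sketch of how that threshold could become a command-line knob,
assuming perf's usual option parsing and the existing name/handler table;
the option name and the setup helper are illustrative:

/* Sketch only: make the load latency threshold an option instead of a
 * hard-coded '30' in the event string. */
static int ldlat = 30;		/* guessed LLC-hit latency, in cycles */

static const struct option c2c_options[] = {
	OPT_INTEGER(0, "ldlat", &ldlat,
		    "load latency threshold for mem-loads (cycles)"),
	OPT_END()
};

static int setup_handlers(struct perf_evlist *evlist)
{
	static char mem_loads[64];
	const struct perf_evsel_str_handler handlers[] = {
		{ mem_loads,		 perf_c2c__process_load_store, },
		{ "cpu/mem-stores/pp",	 perf_c2c__process_load_store, },
	};

	scnprintf(mem_loads, sizeof(mem_loads),
		  "cpu/mem-loads,ldlat=%d/pp", ldlat);

	return __perf_evlist__set_handlers(evlist, handlers,
					   ARRAY_SIZE(handlers));
}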

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data
  2014-02-18 13:04   ` Jiri Olsa
  2014-02-19  2:48     ` Don Zickus
@ 2014-02-21  2:45     ` Don Zickus
  2014-02-21 16:59       ` Jiri Olsa
  1 sibling, 1 reply; 72+ messages in thread
From: Don Zickus @ 2014-02-21  2:45 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> > In order for the c2c tool to work correctly, it needs to properly
> > sort all the records on uniquely identifiable data addresses.  These
> > unique addresses are converted from virtual addresses provided by the
> > hardware into a kernel address using an mmap2 record as the decoder.
> > 
> 
> SNIP
> 
> > +static int physid_cmp(struct c2c_entry *left, struct c2c_entry *right)
> > +{
> > +	u64 l, r;
> > +	struct map *l_map = left->mi->daddr.map;
> > +	struct map *r_map = right->mi->daddr.map;
> > +
> > +	/* group event types together */
> > +	if (left->cpumode > right->cpumode) return 1;
> > +	if (left->cpumode < right->cpumode) return -1;
> > +
> > +	if (l_map->maj > r_map->maj) return 1;
> > +	if (l_map->maj < r_map->maj) return -1;
> > +
> > +	if (l_map->min > r_map->min) return 1;
> > +	if (l_map->min < r_map->min) return -1;
> > +
> > +	if (l_map->ino > r_map->ino) return 1;
> > +	if (l_map->ino < r_map->ino) return -1;
> > +
> > +	if (l_map->ino_generation > r_map->ino_generation) return 1;
> > +	if (l_map->ino_generation < r_map->ino_generation) return -1;
> > +
> > +	/*
> > +	 * Addresses with no major/minor numbers are assumed to be
> > +	 * anonymous in userspace.  Sort those on pid then address.
> > +	 *
> > +	 * The kernel and non-zero major/minor mapped areas are
> > +	 * assumed to be unity mapped.  Sort those on address then pid.
> > +	 */
> > +
> > +	/* al_addr does all the right addr - start + offset calculations */
> > +	l = left->mi->daddr.al_addr;
> > +	r = right->mi->daddr.al_addr;
> > +
> > +	if (l_map->maj || l_map->min) {
> > +		/* mmapped areas */
> > +
> > +		/* hack to mark similar regions, 'right' is new entry */
> > +		/* entries with same maj/min/ino/inogen are in same address space */
> > +		right->color = REGION_SAME;
> > +
> > +		if (l > r) return 1;
> > +		if (l < r) return -1;
> > +
> > +		/* sorting by iaddr makes calculations easier later */
> > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > +		if (left->thread->pid_ > right->thread->pid_) return 1;
> > +		if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > +		if (left->thread->tid > right->thread->tid) return 1;
> > +		if (left->thread->tid < right->thread->tid) return -1;
> > +	} else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> > +		/* kernel mapped areas where 'start' doesn't matter */
> > +
> > +		/* hack to mark similar regions, 'right' is new entry */
> > +		/* whole kernel region is in the same address space */
> > +		right->color = REGION_SAME;
> > +
> > +		if (l > r) return 1;
> > +		if (l < r) return -1;
> > +
> > +		/* sorting by iaddr makes calculations easier later */
> > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +
> > +		if (left->thread->pid_ > right->thread->pid_) return 1;
> > +		if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > +		if (left->thread->tid > right->thread->tid) return 1;
> > +		if (left->thread->tid < right->thread->tid) return -1;
> > +	} else {
> > +		/* userspace anonymous */
> > +		if (left->thread->pid_ > right->thread->pid_) return 1;
> > +		if (left->thread->pid_ < right->thread->pid_) return -1;
> > +
> > +		if (left->thread->tid > right->thread->tid) return 1;
> > +		if (left->thread->tid < right->thread->tid) return -1;
> > +
> > +		/* hack to mark similar regions, 'right' is new entry */
> > +		/* userspace anonymous address space is contained within pid */
> > +		right->color = REGION_SAME;
> > +
> > +		if (l > r) return 1;
> > +		if (l < r) return -1;
> > +
> > +		/* sorting by iaddr makes calculations easier later */
> > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> there's a sort object doing exactly this over hist_entries
> 
> Is there any reason not to use hist_entries?

So looking over hist_entry, I have to ask: what do I gain?  I implemented it
and realized I had to add 'cpumode', 'tid', and a 'private' field to
struct hist_entry.  Then, because I have my own report implementation, I
still have to copy and paste a ton of stuff from builtin-report over
here (including callchain support).

Not unless you are expecting me to add giant chunks of code to
builtin-report.c?

Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data
  2014-02-21  2:45     ` Don Zickus
@ 2014-02-21 16:59       ` Jiri Olsa
  2014-02-26  3:12         ` Don Zickus
  0 siblings, 1 reply; 72+ messages in thread
From: Jiri Olsa @ 2014-02-21 16:59 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Thu, Feb 20, 2014 at 09:45:53PM -0500, Don Zickus wrote:
> On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> > On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:

SNIP

> > > +
> > > +		if (l > r) return 1;
> > > +		if (l < r) return -1;
> > > +
> > > +		/* sorting by iaddr makes calculations easier later */
> > > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > 
> > there's a sort object doing exactly this over hist_entries
> > 
> > Is there any reason not to use hist_entries?
> 
> So looking over hist_entry, I have to ask: what do I gain?  I implemented it
> and realized I had to add 'cpumode', 'tid', and a 'private' field to
> struct hist_entry.  Then, because I have my own report implementation, I
> still have to copy and paste a ton of stuff from builtin-report over
> here (including callchain support).

you mean new sort_entry objects?

> 
> Not unless you are expecting me to add giant chunks of code to
> builtin-report.c?

it can be a separate object implementing a new report iterator

I think that we should go on with the existing sort code we have..
but I understand you might need some special usage.. I'll dive
in and try to find an answer ;-)
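
For example, a c2c-specific key could be expressed as a sort_entry,
assuming the se_cmp/se_snprintf callbacks of struct sort_entry in
util/sort.h and the hypothetical cpumode field mentioned above:

/* Sketch only: a c2c key expressed through the existing sort machinery.
 * Assumes a cpumode member added to struct hist_entry. */
static int64_t
sort__cpumode_cmp(struct hist_entry *left, struct hist_entry *right)
{
	/* group event types together, as physid_cmp does today */
	if (left->cpumode > right->cpumode)
		return 1;
	if (left->cpumode < right->cpumode)
		return -1;
	return 0;
}

static int sort__cpumode_snprintf(struct hist_entry *he, char *bf,
				  size_t size, unsigned int width)
{
	return scnprintf(bf, size, "%*d", width, he->cpumode);
}

struct sort_entry sort_cpumode = {
	.se_header	= "Mode",
	.se_cmp		= sort__cpumode_cmp,
	.se_snprintf	= sort__cpumode_snprintf,
	.se_width_idx	= HISTC_CPU,	/* reuse an existing width slot */
};

Keys like this, registered alongside the stock ones, could let the common
collapse and output code do the rest.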

jirka

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data
  2014-02-21 16:59       ` Jiri Olsa
@ 2014-02-26  3:12         ` Don Zickus
  0 siblings, 0 replies; 72+ messages in thread
From: Don Zickus @ 2014-02-26  3:12 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Fri, Feb 21, 2014 at 05:59:28PM +0100, Jiri Olsa wrote:
> On Thu, Feb 20, 2014 at 09:45:53PM -0500, Don Zickus wrote:
> > On Tue, Feb 18, 2014 at 02:04:05PM +0100, Jiri Olsa wrote:
> > > On Mon, Feb 10, 2014 at 12:29:04PM -0500, Don Zickus wrote:
> 
> SNIP
> 
> > > > +
> > > > +		if (l > r) return 1;
> > > > +		if (l < r) return -1;
> > > > +
> > > > +		/* sorting by iaddr makes calculations easier later */
> > > > +		if (left->mi->iaddr.al_addr > right->mi->iaddr.al_addr) return 1;
> > > > +		if (left->mi->iaddr.al_addr < right->mi->iaddr.al_addr) return -1;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > 
> > > there's a sort object doing exactly this over hist_entries
> > > 
> > > Is there any reason not to use hist_entries?
> > 
> > So looking over hist_entry, I have to ask: what do I gain?  I implemented it
> > and realized I had to add 'cpumode', 'tid', and a 'private' field to
> > struct hist_entry.  Then, because I have my own report implementation, I
> > still have to copy and paste a ton of stuff from builtin-report over
> > here (including callchain support).
> 
> you mean new sort_entry objects?
> 
> > 
> > Not unless you are expecting me to add giant chunks of code to
> > builtin-report.c?
> 
> it can be a separate object implementing a new report iterator
> 
> I think that we should go on with the existing sort code we have..
> but I understand you might need some special usage.. I'll dive
> in and try to find an answer ;-)

Do things fall apart if I do not use evsel->hists to store the hist_entry
tree?  I need to combine two events (store and load) onto the same tree.
Cheers,
Don

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2014-02-26  3:12 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-10 17:28 [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
2014-02-10 17:28 ` [PATCH 03/21] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
2014-02-10 17:28 ` [PATCH 04/21] perf, machine: Use map as success in ip__resolve_ams Don Zickus
2014-02-10 17:29 ` [PATCH 05/21] perf, session: Change header.misc dump from decimal to hex Don Zickus
2014-02-18 12:56   ` Jiri Olsa
2014-02-19  2:40     ` Don Zickus
2014-02-10 17:29 ` [PATCH 06/21] perf, stat: FIXME Stddev calculation is incorrect Don Zickus
2014-02-10 17:29 ` [PATCH 07/21] perf, callchain: Add generic callchain print handler for stdio Don Zickus
2014-02-10 17:29 ` [PATCH 08/21] perf, c2c: Rework setup code to prepare for features Don Zickus
2014-02-18 13:02   ` Jiri Olsa
2014-02-19  2:45     ` Don Zickus
2014-02-10 17:29 ` [PATCH 09/21] perf, c2c: Add rbtree sorted on mmap2 data Don Zickus
2014-02-18 13:04   ` Jiri Olsa
2014-02-19  2:48     ` Don Zickus
2014-02-21  2:45     ` Don Zickus
2014-02-21 16:59       ` Jiri Olsa
2014-02-26  3:12         ` Don Zickus
2014-02-10 17:29 ` [PATCH 10/21] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
2014-02-18 13:05   ` Jiri Olsa
2014-02-19  2:51     ` Don Zickus
2014-02-10 17:29 ` [PATCH 11/21] perf, c2c: Sort based on hottest cache line Don Zickus
2014-02-10 17:29 ` [PATCH 12/21] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
2014-02-10 17:29 ` [PATCH 13/21] perf, c2c: Add callchain support Don Zickus
2014-02-18 13:07   ` Jiri Olsa
2014-02-19  2:54     ` Don Zickus
2014-02-10 17:29 ` [PATCH 14/21] perf, c2c: Output summary stats Don Zickus
2014-02-10 17:29 ` [PATCH 15/21] perf, c2c: Dump rbtree for debugging Don Zickus
2014-02-10 17:29 ` [PATCH 16/21] perf, c2c: Fixup tid because of perf map is broken Don Zickus
2014-02-10 17:29 ` [PATCH 17/21] perf, c2c: Add symbol count table Don Zickus
2014-02-18 13:09   ` Jiri Olsa
2014-02-19  2:56     ` Don Zickus
2014-02-10 17:29 ` [PATCH 18/21] perf, c2c: Add shared cachline summary table Don Zickus
2014-02-10 17:29 ` [PATCH 19/21] perf, c2c: Add framework to analyze latency and display summary stats Don Zickus
2014-02-10 17:29 ` [PATCH 20/21] perf, c2c: Add selected extreme latencies to output cacheline stats table Don Zickus
2014-02-10 17:29 ` [PATCH 21/21] perf, c2c: Add summary latency table for various parts of caches Don Zickus
2014-02-10 18:59 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Davidlohr Bueso
2014-02-10 19:17   ` Don Zickus
2014-02-10 19:18 ` [PATCH 01/21] perf c2c: Shared data analyser Don Zickus
2014-02-10 22:10   ` Davidlohr Bueso
2014-02-11 11:24     ` Jiri Olsa
2014-02-11 11:31     ` Arnaldo Carvalho de Melo
2014-02-11 13:54       ` Don Zickus
2014-02-11 14:36       ` Don Zickus
2014-02-11 15:41         ` Arnaldo Carvalho de Melo
2014-02-10 19:18 ` [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits Don Zickus
2014-02-10 21:18 ` [PATCH 00/21] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Peter Zijlstra
2014-02-10 22:11   ` Don Zickus
2014-02-10 21:29 ` Peter Zijlstra
2014-02-10 22:20   ` Don Zickus
2014-02-10 22:21   ` Stephane Eranian
2014-02-11  7:14     ` Peter Zijlstra
2014-02-11 10:35       ` Stephane Eranian
2014-02-11 10:52         ` Peter Zijlstra
2014-02-11 10:58           ` Stephane Eranian
2014-02-11 11:02             ` Peter Zijlstra
2014-02-11 11:04               ` Stephane Eranian
2014-02-11 11:08                 ` Peter Zijlstra
2014-02-11 11:08                 ` Stephane Eranian
2014-02-11 11:14                   ` Peter Zijlstra
2014-02-11 11:28                     ` Stephane Eranian
2014-02-11 11:31                       ` Peter Zijlstra
2014-02-11 11:51                         ` Peter Zijlstra
2014-02-11 11:50                     ` Arnaldo Carvalho de Melo
2014-02-11 12:09                       ` Peter Zijlstra
2014-02-13 13:02                       ` Jiri Olsa
2014-02-13 13:10                         ` Stephane Eranian
     [not found] ` <1392053356-23024-2-git-send-email-dzickus@redhat.com>
2014-02-18 12:52   ` [PATCH 01/21] perf c2c: Shared data analyser Jiri Olsa
2014-02-18 12:56     ` Arnaldo Carvalho de Melo
2014-02-19  2:42       ` Don Zickus
     [not found] ` <1392053356-23024-3-git-send-email-dzickus@redhat.com>
2014-02-18 12:53   ` [PATCH 02/21] perf c2c: Dump raw records, decode data_src bits Jiri Olsa
2014-02-18 13:49     ` Arnaldo Carvalho de Melo
2014-02-19  3:04     ` Don Zickus
