linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
@ 2014-02-28 17:42 Don Zickus
  2014-02-28 17:42 ` [PATCH 01/19] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
                   ` (19 more replies)
  0 siblings, 20 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

With the introduction of NUMA systems came the possibility of remote memory accesses.
Combine those remote memory accesses with contention on the remote node (i.e. a modified
cacheline) and you have the possibility of very long latencies.  These latencies can
bottleneck a program.

The program added by these patches helps detect the situation where two nodes are
'tugging' on the same _data_ cacheline.  The term used throughout this program and
the various changelogs is HITM.  This means nodeX went to read a cacheline
and it was discovered to be loaded in nodeY's LLC (hence the cache HIT).  The
remote cacheline was also in a 'M'odified state, thus creating a 'HIT M' for hit in
a modified state.  HITMs can happen locally and remotely.  This program's interest
is mainly in remote HITMs as they cause the longest latencies.

Why a program has a remote HITM derives from how the two nodes are 'sharing' the
cacheline.  Is the sharing intentional ("true") or unintentional ("false")?  We have seen
lots of "false" sharing cases, which lead to simple solutions such as separating the data
onto different cachelines.
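
As a rough illustration of that kind of fix (this is not code from the
patches), the sketch below shows two counters that share a cacheline and a
variant that places each on its own line.  The struct names, the helper and
the 64-byte line size are assumptions made for the example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cacheline size; 64 bytes is common on x86, but check your part. */
#define CACHELINE_SIZE 64

/* Both counters land in one cacheline: writers on two nodes will collide. */
struct counters_shared {
	uint64_t node0_hits;
	uint64_t node1_hits;
};

/* Each counter is aligned to its own cacheline, so the writers stay apart. */
struct counters_split {
	_Alignas(CACHELINE_SIZE) uint64_t node0_hits;
	_Alignas(CACHELINE_SIZE) uint64_t node1_hits;
};

/* The padded layout costs memory: one full line per counter. */
static_assert(sizeof(struct counters_split) == 2 * CACHELINE_SIZE,
	      "each counter should occupy its own cacheline");

/* Return the index of the cacheline that a byte offset falls into. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_SIZE;
}
```

Whether the extra memory is worth it depends on the workload; when the
sharing is "true", the layout change alone will not help.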

This tool does not distinguish between 'true' and 'false' sharing; instead it just points to
the more expensive sharing situations under the current workload.  It is up to the user
to understand what the workload is doing to determine whether a problem exists or not and
how to report it.

The data output is verbose and there are lots of data tables that interpret the latencies
and data addresses in different ways to help see where bottlenecks might be lying.

Most of this idea, work and calculations were done by Dick Fowles.  My work mainly
includes porting it to perf.  Joe Mario has contributed greatly with ideas to make the
output more informative based on his usage of the tool.  Joe has found a handful of
bottlenecks using various industry benchmarks and has worked with developers to fix
them.

I would also like to thank Stephane Eranian for his early help and guidance on
navigating the differences between the current perf tool and how similar tools
looked at HP, and also for his tireless work in getting the MMAP2 interface to stick.

Also thanks to Arnaldo and Jiri Olsa for their help and suggestions for this tool.

I also have a test program that generates a controlled number of HITMs that we used
frequently to validate our early work (the Intel docs were not always clear which
bits had to be set and some arches do not work well).  I would like to add it, but
didn't know how (nor did I spend any serious time looking either).

This program has been tested primarily on Intel's Ivy Bridge platforms.  The Sandy Bridge
platforms had some quirks that were fixed on Ivy Bridge.  We haven't tried Haswell as
that has a re-worked latency event implementation.

A handful of patches include re-enabling MMAP2 support and some fixes to perf itself.  One
in particular hacks up how standard deviation is calculated.  It works with our calculations
but may break other tools' expectations.  Feedback is welcome.

Comments, feedback, anything else is welcome.

V2: updated to latest perf/core branch 1029f9fedf87fa6
    switched to hist_entry based on Jiri O's suggestion
    dropped latency analysis for now until this patchset is accepted
    little fixes and tweaks

Signed-off-by: Don Zickus <dzickus@redhat.com>

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (19):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, machine: Use map as success in ip__resolve_ams
  perf, session: Change header.misc dump from decimal to hex
  perf, stat: FIXME Stddev calculation is incorrect
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add rbtree sorted on mmap2 data
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Fixup tid because of perf map is broken
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table
  perf, c2c: Add framework to analyze latency and display summary stats
  perf, c2c: Add selected extreme latencies to output cacheline stats
    table
  perf, c2c: Add summary latency table for various parts of caches

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 2963 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evlist.c            |   37 +
 tools/perf/util/evlist.h            |    7 +
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.h              |    4 +
 tools/perf/util/machine.c           |    2 +-
 tools/perf/util/session.c           |    2 +-
 tools/perf/util/stat.c              |    3 +-
 15 files changed, 3097 insertions(+), 24 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (17):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, sort:  Add physid sorting based on mmap2 data
  perf, sort:  Allow unique sorting instead of combining hist_entries
  perf: Allow ability to map cpus to nodes easily
  perf, kmem: Utilize the new generic cpunode_map
  perf: Fix stddev calculation
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add in sort on physid
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 1787 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin-kmem.c           |   78 +-
 tools/perf/builtin-report.c         |    2 +-
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/cpumap.c            |  150 +++
 tools/perf/util/cpumap.h            |   35 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.c              |   10 +-
 tools/perf/util/hist.h              |    5 +
 tools/perf/util/sort.c              |  149 +++
 tools/perf/util/sort.h              |    4 +
 tools/perf/util/stat.c              |   13 +
 tools/perf/util/stat.h              |    1 +
 19 files changed, 2236 insertions(+), 101 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7



* [PATCH 01/19] Revert "perf: Disable PERF_RECORD_MMAP2 support"
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:42 ` [PATCH 02/19] perf, sort: Add physid sorting based on mmap2 data Don Zickus
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.

Conflicts:
	tools/perf/util/event.c
---
 kernel/events/core.c    |  4 ----
 tools/perf/util/event.c | 36 +++++++++++++++++++-----------------
 tools/perf/util/evsel.c |  1 +
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 45e5543..a4ab184 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6851,10 +6851,6 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	if (ret)
 		return -EFAULT;
 
-	/* disabled for now */
-	if (attr->mmap2)
-		return -EINVAL;
-
 	if (attr->__reserved_1)
 		return -EINVAL;
 
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 55eebe9..82fb890 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -155,13 +155,14 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		return -1;
 	}
 
-	event->header.type = PERF_RECORD_MMAP;
+	event->header.type = PERF_RECORD_MMAP2;
 
 	while (1) {
 		char bf[BUFSIZ];
 		char prot[5];
 		char execname[PATH_MAX];
 		char anonstr[] = "//anon";
+		unsigned int ino;
 		size_t size;
 		ssize_t n;
 
@@ -172,14 +173,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 		strcpy(execname, "");
 
 		/* 00400000-0040c000 r-xp 00000000 fd:01 41038  /bin/cat */
-		n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %*x:%*x %*u %s\n",
-		       &event->mmap.start, &event->mmap.len, prot,
-		       &event->mmap.pgoff,
-		       execname);
-		/*
- 		 * Anon maps don't have the execname.
- 		 */
-		if (n < 4)
+		n = sscanf(bf, "%"PRIx64"-%"PRIx64" %s %"PRIx64" %x:%x %u %s\n",
+		       &event->mmap2.start, &event->mmap2.len, prot,
+		       &event->mmap2.pgoff, &event->mmap2.maj,
+		       &event->mmap2.min,
+		       &ino, execname);
+
+		event->mmap2.ino = (u64)ino;
+
+		if (n < 7)
 			continue;
 		/*
 		 * Just like the kernel, see __perf_event_mmap in kernel/perf_event.c
@@ -200,15 +202,15 @@ int perf_event__synthesize_mmap_events(struct perf_tool *tool,
 			strcpy(execname, anonstr);
 
 		size = strlen(execname) + 1;
-		memcpy(event->mmap.filename, execname, size);
+		memcpy(event->mmap2.filename, execname, size);
 		size = PERF_ALIGN(size, sizeof(u64));
-		event->mmap.len -= event->mmap.start;
-		event->mmap.header.size = (sizeof(event->mmap) -
-					(sizeof(event->mmap.filename) - size));
-		memset(event->mmap.filename + size, 0, machine->id_hdr_size);
-		event->mmap.header.size += machine->id_hdr_size;
-		event->mmap.pid = tgid;
-		event->mmap.tid = pid;
+		event->mmap2.len -= event->mmap2.start;
+		event->mmap2.header.size = (sizeof(event->mmap2) -
+					(sizeof(event->mmap2.filename) - size));
+		memset(event->mmap2.filename + size, 0, machine->id_hdr_size);
+		event->mmap2.header.size += machine->id_hdr_size;
+		event->mmap2.pid = tgid;
+		event->mmap2.tid = pid;
 
 		if (process(tool, event, &synth_sample, machine) != 0) {
 			rc = -1;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index adc94dd..f2ab0b3 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -640,6 +640,7 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts)
 		perf_evsel__set_sample_bit(evsel, WEIGHT);
 
 	attr->mmap  = track;
+	attr->mmap2 = track && !perf_missing_features.mmap2;
 	attr->comm  = track;
 
 	if (opts->sample_transaction)
-- 
1.7.11.7



* [PATCH 02/19] perf, sort:  Add physid sorting based on mmap2 data
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
  2014-02-28 17:42 ` [PATCH 01/19] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-03-19 10:45   ` Jiri Olsa
  2014-02-28 17:42 ` [PATCH 03/19] perf, sort: Allow unique sorting instead of combining hist_entries Don Zickus
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

In order for the c2c tool to work correctly, it needs to properly
sort all the records on uniquely identifiable data addresses.  These
unique addresses are converted from the virtual addresses provided by
the hardware into kernel addresses using an mmap2 record as the decoder.
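
To make the idea concrete, here is a minimal sketch of that decoding step.
It is not the code in this patch: 'struct region', 'struct unique_addr',
resolve() and same_data() are hypothetical names, and the arithmetic simply
follows the "addr - start + offset" calculation mentioned in the comment
above al_addr in sort__physid_cmp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the fields an mmap2 record carries. */
struct region {
	uint64_t start;          /* virtual start address of the mapping */
	uint64_t len;            /* length of the mapping in bytes */
	uint64_t pgoff;          /* offset of the mapping in the backing file */
	uint32_t maj, min;       /* device numbers of the backing file */
	uint64_t ino, ino_generation;
};

/* A process-independent identity for one byte of file-backed data. */
struct unique_addr {
	uint32_t maj, min;
	uint64_t ino, ino_generation;
	uint64_t offset;
};

/* Translate vaddr through the region; false if vaddr lies outside it. */
static bool resolve(const struct region *r, uint64_t vaddr,
		    struct unique_addr *out)
{
	assert(r && out);
	if (vaddr < r->start || vaddr - r->start >= r->len)
		return false;

	out->maj = r->maj;
	out->min = r->min;
	out->ino = r->ino;
	out->ino_generation = r->ino_generation;
	out->offset = vaddr - r->start + r->pgoff;
	return true;
}

/* Two samples touch the same data if every component matches. */
static bool same_data(const struct unique_addr *a, const struct unique_addr *b)
{
	return a->maj == b->maj && a->min == b->min &&
	       a->ino == b->ino && a->ino_generation == b->ino_generation &&
	       a->offset == b->offset;
}
```

With this, two processes that map the same file at different virtual
addresses resolve to the same unique address for the same byte of the file.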

Once the addresses are converted, we can sort on them based on
various rules.  Then it becomes clear which addresses overlap
with each other across mmap regions or pid spaces.

This patch just creates the rules and inserts the records into a
sort entry for safe keeping until later patches process them.

The general sorting rule is:

o group cpumodes together
o group similar major, minor, inode, inode generation numbers together
o if (nonzero major/minor number - ie mmap'd areas)
  o sort on data addresses
  o sort on instruction address
  o sort on pid
  o sort on tid
o if cpumode is kernel
  o sort on data addresses
  o sort on instruction address
  o sort on pid
  o sort on tid
o else (private to pid space)
  o sort on pid
  o sort on tid
  o sort on data addresses
  o sort on instruction address

I also hacked in the concept of 'color'.  The purpose of that bit is to
provide hints later, when processing these records, that a new unique
address has been encountered.  Because later processing only checks the data
addresses, there is a theoretical scenario in which similar sequential data
addresses (when walking the rbtree) could be misinterpreted as overlapping
when in fact they are not.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-report.c |   2 +-
 tools/perf/util/hist.c      |   7 ++-
 tools/perf/util/hist.h      |   1 +
 tools/perf/util/sort.c      | 148 ++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/sort.h      |   3 +
 5 files changed, 157 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index d882b6f..ec797da 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -755,7 +755,7 @@ int cmd_report(int argc, const char **argv, const char *prefix __maybe_unused)
 		   "sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline,"
 		   " dso_to, dso_from, symbol_to, symbol_from, mispredict,"
 		   " weight, local_weight, mem, symbol_daddr, dso_daddr, tlb, "
-		   "snoop, locked, abort, in_tx, transaction"),
+		   "snoop, locked, abort, in_tx, transaction, physid"),
 	OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
 		    "Show sample percentage for different cpu modes"),
 	OPT_STRING('p', "parent", &parent_pattern, "regex",
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 0466efa..ea54db3 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -420,9 +420,10 @@ struct hist_entry *__hists__add_entry(struct hists *hists,
 			.map	= al->map,
 			.sym	= al->sym,
 		},
-		.cpu	= al->cpu,
-		.ip	= al->addr,
-		.level	= al->level,
+		.cpu	 = al->cpu,
+		.cpumode = al->cpumode,
+		.ip	 = al->addr,
+		.level	 = al->level,
 		.stat = {
 			.nr_events = 1,
 			.period	= period,
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index a59743f..d226c5b 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -62,6 +62,7 @@ enum hist_column {
 	HISTC_MEM_LVL,
 	HISTC_MEM_SNOOP,
 	HISTC_TRANSACTION,
+	HISTC_PHYSID,
 	HISTC_NR_COLS, /* Last entry */
 };
 
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 635cd8f..0cb43a5 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -977,6 +977,151 @@ struct sort_entry sort_transaction = {
 	.se_width_idx	= HISTC_TRANSACTION,
 };
 
+static int64_t
+sort__physid_cmp(struct hist_entry *left, struct hist_entry *right)
+{
+        u64 l, r;
+        struct map *l_map = left->mem_info->daddr.map;
+        struct map *r_map = right->mem_info->daddr.map;
+
+	/* store all NULL mem maps at the bottom */
+	/* shouldn't even need this check, should have stubs */
+	if (!left->mem_info->daddr.map || !right->mem_info->daddr.map)
+		return 1;
+
+        /* group event types together */
+        if (left->cpumode > right->cpumode) return -1;
+        if (left->cpumode < right->cpumode) return 1;
+
+        if (l_map->maj > r_map->maj) return -1;
+        if (l_map->maj < r_map->maj) return 1;
+
+        if (l_map->min > r_map->min) return -1;
+        if (l_map->min < r_map->min) return 1;
+
+        if (l_map->ino > r_map->ino) return -1;
+        if (l_map->ino < r_map->ino) return 1;
+
+        if (l_map->ino_generation > r_map->ino_generation) return -1;
+        if (l_map->ino_generation < r_map->ino_generation) return 1;
+
+        /*
+         * Addresses with no major/minor numbers are assumed to be
+         * anonymous in userspace.  Sort those on pid then address.
+         *
+         * The kernel and non-zero major/minor mapped areas are
+         * assumed to be unity mapped.  Sort those on address then pid.
+         */
+
+        /* al_addr does all the right addr - start + offset calculations */
+        l = left->mem_info->daddr.al_addr;
+        r = right->mem_info->daddr.al_addr;
+
+        if (l_map->maj || l_map->min || l_map->ino || l_map-> ino_generation) {
+                /* mmapped areas */
+
+                /* hack to mark similar regions, 'right' is new entry */
+                /* entries with same maj/min/ino/inogen are in same address space */
+                right->color = TRUE;
+
+                if (l > r) return -1;
+                if (l < r) return 1;
+
+                /* sorting by iaddr makes calculations easier later */
+                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
+                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
+
+                if (left->thread->pid_ > right->thread->pid_) return -1;
+                if (left->thread->pid_ < right->thread->pid_) return 1;
+
+                if (left->thread->tid > right->thread->tid) return -1;
+                if (left->thread->tid < right->thread->tid) return 1;
+        } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
+                /* kernel mapped areas where 'start' doesn't matter */
+
+                /* hack to mark similar regions, 'right' is new entry */
+                /* whole kernel region is in the same address space */
+                right->color = TRUE;
+
+                if (l > r) return -1;
+                if (l < r) return 1;
+
+                /* sorting by iaddr makes calculations easier later */
+                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
+                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
+
+                if (left->thread->pid_ > right->thread->pid_) return -1;
+                if (left->thread->pid_ < right->thread->pid_) return 1;
+
+                if (left->thread->tid > right->thread->tid) return -1;
+                if (left->thread->tid < right->thread->tid) return 1;
+        } else {
+                /* userspace anonymous */
+                if (left->thread->pid_ > right->thread->pid_) return -1;
+                if (left->thread->pid_ < right->thread->pid_) return 1;
+
+                if (left->thread->tid > right->thread->tid) return -1;
+                if (left->thread->tid < right->thread->tid) return 1;
+
+	         /* hack to mark similar regions, 'right' is new entry */
+                /* userspace anonymous address space is contained within pid */
+                right->color = TRUE;
+
+                if (l > r) return -1;
+                if (l < r) return 1;
+
+                /* sorting by iaddr makes calculations easier later */
+                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
+                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
+        }
+
+	/* sanity check the maps; only mmaped areas should have different maps */
+	if ((left->mem_info->daddr.map != right->mem_info->daddr.map) &&
+	     !right->mem_info->daddr.map->maj && !right->mem_info->daddr.map->min)
+		pr_debug("physid_cmp: Similar entries have different maps\n");
+
+        return 0;
+}
+
+static int hist_entry__physid_snprintf(struct hist_entry *he, char *bf,
+					    size_t size, unsigned int width)
+{
+	char buf[256];
+	char *p = buf;
+
+	if (!he->mem_info->daddr.map) {
+        	sprintf(p, "%3x %3x %8lx %8lx %6d %16lx %16lx %16lx %8x\n",
+                        -1,
+                        -1,
+                        -1UL,
+                        -1UL,
+                        he->thread->pid_,
+                        -1UL,
+                        he->mem_info->daddr.addr,
+                        he->mem_info->iaddr.al_addr,
+                        he->cpumode);
+	} else {
+	        sprintf(p, "%3x %3x %8lx %8lx %6d %16lx %16lx %16lx %8x\n",
+                        he->mem_info->daddr.map->maj,
+                        he->mem_info->daddr.map->min,
+                        he->mem_info->daddr.map->ino,
+                        he->mem_info->daddr.map->ino_generation,
+                        he->thread->pid_,
+                        he->mem_info->daddr.map->start,
+                        he->mem_info->daddr.addr,
+                        he->mem_info->iaddr.al_addr,
+                        he->cpumode);
+	}
+	return repsep_snprintf(bf, size, "%-*s", width, buf);
+}
+
+struct sort_entry sort_physid = {
+	.se_header	= "Physid (major, minor, inode, inode generation, pid, start, Data addr, IP, cpumode)",
+	.se_cmp		= sort__physid_cmp,
+	.se_snprintf	= hist_entry__physid_snprintf,
+	.se_width_idx	= HISTC_PHYSID,
+};
+
 struct sort_dimension {
 	const char		*name;
 	struct sort_entry	*entry;
@@ -1023,6 +1168,7 @@ static struct sort_dimension memory_sort_dimensions[] = {
 	DIM(SORT_MEM_TLB, "tlb", sort_mem_tlb),
 	DIM(SORT_MEM_LVL, "mem", sort_mem_lvl),
 	DIM(SORT_MEM_SNOOP, "snoop", sort_mem_snoop),
+	DIM(SORT_MEM_PHYSID, "physid", sort_physid),
 };
 
 #undef DIM
@@ -1182,6 +1328,8 @@ void sort__setup_elide(FILE *output)
 					"tlb", output);
 		sort_entry__setup_elide(&sort_dso, symbol_conf.dso_list,
 					"snoop", output);
+		sort_entry__setup_elide(&sort_physid, symbol_conf.dso_list,
+					"physid", output);
 	}
 
 	/*
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index 43e5ff4..eb8cd50 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -87,11 +87,13 @@ struct hist_entry {
 	u64			ip;
 	u64			transaction;
 	s32			cpu;
+	u8			cpumode;
 
 	struct hist_entry_diff	diff;
 
 	/* We are added by hists__add_dummy_entry. */
 	bool			dummy;
+	bool			color;
 
 	/* XXX These two should move to some tree widget lib */
 	u16			row_offset;
@@ -166,6 +168,7 @@ enum sort_type {
 	SORT_MEM_TLB,
 	SORT_MEM_LVL,
 	SORT_MEM_SNOOP,
+	SORT_MEM_PHYSID,
 };
 
 /*
-- 
1.7.11.7



* [PATCH 03/19] perf, sort:  Allow unique sorting instead of combining hist_entries
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
  2014-02-28 17:42 ` [PATCH 01/19] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
  2014-02-28 17:42 ` [PATCH 02/19] perf, sort: Add physid sorting based on mmap2 data Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:42 ` [PATCH 04/19] perf: Allow ability to map cpus to nodes easily Don Zickus
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

The cache contention tool needs to keep all the perf records unique in order
to properly parse all the data.  Currently add_hist_entry() will combine
a duplicate record with the existing one and add its weight/period to it.

This throws away the unique data the cache contention tool needs (mainly
the data source).  Create a flag to force the records to stay unique.
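
The effect of such a flag can be sketched with a toy table.  This is only
an illustration under simplifying assumptions: perf keeps hist entries in
an rbtree, while the hypothetical 'struct table' below is a flat array, and
table_add() is not a perf function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

struct entry {
	uint64_t key;       /* stands in for the sort keys of a hist entry */
	uint64_t period;    /* accumulated weight/period */
	uint32_t data_src;  /* per-sample detail that merging would drop */
};

struct table {
	struct entry entries[MAX_ENTRIES];
	size_t nr;
	bool wants_unique;  /* plays the role of sort__wants_unique */
};

/* Add one sample; returns the entry that holds it, or NULL when full. */
static struct entry *table_add(struct table *t, uint64_t key,
			       uint64_t period, uint32_t data_src)
{
	assert(t);

	/* Default behaviour: fold the sample into a matching entry. */
	if (!t->wants_unique) {
		for (size_t i = 0; i < t->nr; i++) {
			if (t->entries[i].key == key) {
				t->entries[i].period += period;
				return &t->entries[i];
			}
		}
	}

	if (t->nr == MAX_ENTRIES)
		return NULL;

	/* Unique mode (or no match): keep the sample as its own entry. */
	t->entries[t->nr] = (struct entry){
		.key = key, .period = period, .data_src = data_src,
	};
	return &t->entries[t->nr++];
}
```

In merged mode the second sample's data_src is lost, which is the
information the cache contention tool needs to keep.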

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/hist.c | 3 ++-
 tools/perf/util/sort.c | 1 +
 tools/perf/util/sort.h | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index ea54db3..cf85d7d 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -365,7 +365,8 @@ static struct hist_entry *add_hist_entry(struct hists *hists,
 		 */
 		cmp = hist_entry__cmp(he, entry);
 
-		if (!cmp) {
+		if (!cmp && !sort__wants_unique) {
+
 			he_stat__add_period(&he->stat, period, weight);
 
 			/*
diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
index 0cb43a5..453b8f0 100644
--- a/tools/perf/util/sort.c
+++ b/tools/perf/util/sort.c
@@ -14,6 +14,7 @@ int		sort__need_collapse = 0;
 int		sort__has_parent = 0;
 int		sort__has_sym = 0;
 int		sort__has_dso = 0;
+int		sort__wants_unique = 0;
 enum sort_mode	sort__mode = SORT_MODE__NORMAL;
 
 enum sort_type	sort__first_dimension;
diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
index eb8cd50..4e960fd 100644
--- a/tools/perf/util/sort.h
+++ b/tools/perf/util/sort.h
@@ -33,6 +33,7 @@ extern int have_ignore_callees;
 extern int sort__need_collapse;
 extern int sort__has_parent;
 extern int sort__has_sym;
+extern int sort__wants_unique;
 extern enum sort_mode sort__mode;
 extern struct sort_entry sort_comm;
 extern struct sort_entry sort_dso;
-- 
1.7.11.7



* [PATCH 04/19] perf: Allow ability to map cpus to nodes easily
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (2 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 03/19] perf, sort: Allow unique sorting instead of combining hist_entries Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-03-19 12:48   ` Jiri Olsa
  2014-03-19 13:22   ` Jiri Olsa
  2014-02-28 17:42 ` [PATCH 05/19] perf, kmem: Utilize the new generic cpunode_map Don Zickus
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This patch figures out the max number of cpus and nodes that are on the
system and creates a map of cpu to node.  This allows us to provide a cpu
and quickly get the node associated with it.

It was mostly copied from builtin-kmem.c and tweaked slightly to use less memory
(use possible cpus instead of max).  It also calculates the max number of nodes.
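
For reference, the "possible" files hold a list such as "0-3,8-15".  The
sketch below shows one way to pull the highest id out of such a string.
It is a standalone illustration, not the code in this patch:
max_id_plus_one() is a hypothetical helper, and it assumes the list is in
ascending order, as the examples above are.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return one more than the highest id in a list such as "0-3,8-15\n",
 * or -1 if no number can be parsed.  Assumes ascending order.
 */
static int max_id_plus_one(const char *list)
{
	const char *p, *last;
	char *end;
	long val;

	assert(list);

	/* The highest id follows the last ',' or '-', or starts the string. */
	last = list;
	for (p = list; *p; p++)
		if (*p == ',' || *p == '-')
			last = p + 1;

	val = strtol(last, &end, 10);
	if (end == last || val < 0)
		return -1;

	return (int)val + 1;
}
```

Returning the highest id plus one gives a value that can be used directly
as the element count for an array indexed by cpu or node number.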

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/cpumap.c | 150 +++++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/util/cpumap.h |  35 +++++++++++
 2 files changed, 185 insertions(+)

diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 7fe4994..d19d5b3 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -317,3 +317,153 @@ int cpu_map__build_core_map(struct cpu_map *cpus, struct cpu_map **corep)
 {
 	return cpu_map__build_map(cpus, corep, cpu_map__get_core);
 }
+
+/* setup simple routines to easily access node numbers given a cpu number */
+
+
+#define PATH_SYS_NODE   "/sys/devices/system/node"
+
+/* Determine highest possible cpu in the system for sparse allocation */
+static void set_max_cpu_num(void)
+{
+	FILE *fp;
+	char buf[256];
+	int num;
+
+	/* set up default */
+	max_cpu_num = 4096;
+
+	/* get the highest possible cpu number for a sparse allocation */
+	fp = fopen("/sys/devices/system/cpu/possible", "r");
+	if (!fp)
+		goto out;
+
+	num = fread(&buf, 1, sizeof(buf) - 1, fp);
+	if (!num)
+		goto out_close;
+	buf[num] = '\0';
+
+	/* start on the right, to find highest cpu num */
+	while (--num) {
+		if ((buf[num] == ',') || (buf[num] == '-')) {
+			num++;
+			break;
+		}
+	}
+	if (sscanf(&buf[num], "%d", &max_cpu_num) < 1)
+		goto out_close;
+
+	max_cpu_num++;
+
+	fclose(fp);
+	return;
+
+out_close:
+	fclose(fp);
+out:
+	pr_err("Failed to read max cpus, using default of %d\n",
+		max_cpu_num);
+	return;
+}
+
+/* Determine highest possible node in the system for sparse allocation */
+static void set_max_node_num(void)
+{
+	FILE *fp;
+	char buf[256];
+	int num;
+
+	/* set up default */
+	max_node_num = 8;
+
+	/* get the highest possible node number for a sparse allocation */
+	fp = fopen("/sys/devices/system/node/possible", "r");
+	if (!fp)
+		goto out;
+
+	num = fread(&buf, 1, sizeof(buf) - 1, fp);
+	if (!num)
+		goto out_close;
+	buf[num] = '\0';
+
+	/* start on the right, to find highest node num */
+	while (--num) {
+		if ((buf[num] == ',') || (buf[num] == '-')) {
+			num++;
+			break;
+		}
+	}
+	if (sscanf(&buf[num], "%d", &max_node_num) < 1)
+		goto out_close;
+
+	max_node_num++;
+
+	fclose(fp);
+	return;
+
+out_close:
+	fclose(fp);
+out:
+	pr_err("Failed to read max nodes, using default of %d\n",
+		max_node_num);
+	return;
+}
+
+static int init_cpunode_map(void)
+{
+	int i;
+
+	set_max_cpu_num();
+	set_max_node_num();
+
+	cpunode_map = calloc(max_cpu_num, sizeof(int));
+	if (!cpunode_map) {
+		pr_err("%s: calloc failed\n", __func__);
+		goto out;
+	}
+
+	for (i = 0; i < max_cpu_num; i++)
+		cpunode_map[i] = -1;
+
+	return 0;
+out:
+	return -1;
+}
+
+int cpu_map__setup_cpunode_map(void)
+{
+	struct dirent *dent1, *dent2;
+	DIR *dir1, *dir2;
+	unsigned int cpu, mem;
+	char buf[PATH_MAX];
+
+	/* initialize globals */
+	if (init_cpunode_map())
+		return -1;
+
+	dir1 = opendir(PATH_SYS_NODE);
+	if (!dir1)
+		return 0;
+
+	/* walk tree and setup map */
+	while ((dent1 = readdir(dir1)) != NULL) {
+		if (dent1->d_type != DT_DIR ||
+		    sscanf(dent1->d_name, "node%u", &mem) < 1)
+			continue;
+
+		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
+		dir2 = opendir(buf);
+		if (!dir2)
+			continue;
+		while ((dent2 = readdir(dir2)) != NULL) {
+			if (dent2->d_type != DT_LNK ||
+			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
+				continue;
+			cpunode_map[cpu] = mem;
+		}
+		closedir(dir2);
+	}
+	closedir(dir1);
+	return 0;
+}
+
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index b123bb9..d6fde2b 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -4,6 +4,9 @@
 #include <stdio.h>
 #include <stdbool.h>
 
+#include "perf.h"
+#include "util/debug.h"
+
 struct cpu_map {
 	int nr;
 	int map[];
@@ -46,4 +49,36 @@ static inline bool cpu_map__empty(const struct cpu_map *map)
 	return map ? map->map[0] == -1 : true;
 }
 
+int max_cpu_num;
+int max_node_num;
+int *cpunode_map;
+
+int cpu_map__setup_cpunode_map(void);
+
+static inline int cpu_map__max_node(void)
+{
+	if (unlikely(!max_node_num))
+		pr_debug("cpu_map not initialized\n");
+
+	return max_node_num;
+}
+
+static inline int cpu_map__max_cpu(void)
+{
+	if (unlikely(!max_cpu_num))
+		pr_debug("cpu_map not initialized\n");
+
+	return max_cpu_num;
+}
+
+static inline int cpu_map__get_node(int cpu)
+{
+	if (unlikely(cpunode_map == NULL)) {
+		pr_debug("cpu_map not initialized\n");
+		return -1;
+	}
+
+	return cpunode_map[cpu];
+}
+
 #endif /* __PERF_CPUMAP_H */
-- 
1.7.11.7



* [PATCH 05/19] perf, kmem: Utilize the new generic cpunode_map
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (3 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 04/19] perf: Allow ability to map cpus to nodes easily Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:42 ` [PATCH 06/19] perf: Fix stddev calculation Don Zickus
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus, Li Zefan

Use the cpunode_map implementation from the previous patch in builtin-kmem.c.
There should be no functional difference.
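
For readers unfamiliar with the code being consolidated, the cpu-to-node
lookup can be sketched roughly as below.  This is a minimal illustration and
not the perf implementation: demo_map, demo_map_add() and DEMO_MAX_CPUS are
hypothetical names, and the sysfs directory walk is replaced by plain strings
so the snippet runs anywhere.  It also adds a bounds check on the lookup,
which the helper in the previous patch does not have.

```c
#include <stdio.h>

#define DEMO_MAX_CPUS 8	/* hypothetical; perf derives this from kernel_max */

static int demo_map[DEMO_MAX_CPUS];

/* Mark every cpu as unmapped, mirroring the -1 initialization in perf. */
static void demo_map_init(void)
{
	int i;

	for (i = 0; i < DEMO_MAX_CPUS; i++)
		demo_map[i] = -1;
}

/*
 * Record that the directory entry cpu_name ("cpu3") was found under the
 * node directory node_name ("node1").  Entries such as "cpulist" fail the
 * sscanf() conversion and are rejected.
 */
static int demo_map_add(const char *node_name, const char *cpu_name)
{
	unsigned int node, cpu;

	if (sscanf(node_name, "node%u", &node) < 1)
		return -1;
	if (sscanf(cpu_name, "cpu%u", &cpu) < 1)
		return -1;
	if (cpu >= DEMO_MAX_CPUS)
		return -1;

	demo_map[cpu] = (int)node;
	return 0;
}

/* Bounds-checked lookup; -1 means the node is unknown. */
static int demo_get_node(int cpu)
{
	if (cpu < 0 || cpu >= DEMO_MAX_CPUS)
		return -1;
	return demo_map[cpu];
}
```

A caller would run demo_map_init() once, feed every (node, cpu) directory
pair to demo_map_add(), and then use demo_get_node(sample_cpu) in place of
indexing the array directly.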

Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-kmem.c | 78 ++---------------------------------------------
 1 file changed, 3 insertions(+), 75 deletions(-)

diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 929462a..9ff7892 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -14,6 +14,7 @@
 #include "util/parse-options.h"
 #include "util/trace-event.h"
 #include "util/data.h"
+#include "util/cpumap.h"
 
 #include "util/debug.h"
 
@@ -31,9 +32,6 @@ static int			caller_lines = -1;
 
 static bool			raw_ip;
 
-static int			*cpunode_map;
-static int			max_cpu_num;
-
 struct alloc_stat {
 	u64	call_site;
 	u64	ptr;
@@ -55,76 +53,6 @@ static struct rb_root root_caller_sorted;
 static unsigned long total_requested, total_allocated;
 static unsigned long nr_allocs, nr_cross_allocs;
 
-#define PATH_SYS_NODE	"/sys/devices/system/node"
-
-static int init_cpunode_map(void)
-{
-	FILE *fp;
-	int i, err = -1;
-
-	fp = fopen("/sys/devices/system/cpu/kernel_max", "r");
-	if (!fp) {
-		max_cpu_num = 4096;
-		return 0;
-	}
-
-	if (fscanf(fp, "%d", &max_cpu_num) < 1) {
-		pr_err("Failed to read 'kernel_max' from sysfs");
-		goto out_close;
-	}
-
-	max_cpu_num++;
-
-	cpunode_map = calloc(max_cpu_num, sizeof(int));
-	if (!cpunode_map) {
-		pr_err("%s: calloc failed\n", __func__);
-		goto out_close;
-	}
-
-	for (i = 0; i < max_cpu_num; i++)
-		cpunode_map[i] = -1;
-
-	err = 0;
-out_close:
-	fclose(fp);
-	return err;
-}
-
-static int setup_cpunode_map(void)
-{
-	struct dirent *dent1, *dent2;
-	DIR *dir1, *dir2;
-	unsigned int cpu, mem;
-	char buf[PATH_MAX];
-
-	if (init_cpunode_map())
-		return -1;
-
-	dir1 = opendir(PATH_SYS_NODE);
-	if (!dir1)
-		return 0;
-
-	while ((dent1 = readdir(dir1)) != NULL) {
-		if (dent1->d_type != DT_DIR ||
-		    sscanf(dent1->d_name, "node%u", &mem) < 1)
-			continue;
-
-		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
-		dir2 = opendir(buf);
-		if (!dir2)
-			continue;
-		while ((dent2 = readdir(dir2)) != NULL) {
-			if (dent2->d_type != DT_LNK ||
-			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
-				continue;
-			cpunode_map[cpu] = mem;
-		}
-		closedir(dir2);
-	}
-	closedir(dir1);
-	return 0;
-}
-
 static int insert_alloc_stat(unsigned long call_site, unsigned long ptr,
 			     int bytes_req, int bytes_alloc, int cpu)
 {
@@ -235,7 +163,7 @@ static int perf_evsel__process_alloc_node_event(struct perf_evsel *evsel,
 	int ret = perf_evsel__process_alloc_event(evsel, sample);
 
 	if (!ret) {
-		int node1 = cpunode_map[sample->cpu],
+		int node1 = cpu_map__get_node(sample->cpu),
 		    node2 = perf_evsel__intval(evsel, sample, "node");
 
 		if (node1 != node2)
@@ -770,7 +698,7 @@ int cmd_kmem(int argc, const char **argv, const char *prefix __maybe_unused)
 	if (!strncmp(argv[0], "rec", 3)) {
 		return __cmd_record(argc, argv);
 	} else if (!strcmp(argv[0], "stat")) {
-		if (setup_cpunode_map())
+		if (cpu_map__setup_cpunode_map())
 			return -1;
 
 		if (list_empty(&caller_sort))
-- 
1.7.11.7


* [PATCH 06/19] perf: Fix stddev calculation
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (4 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 05/19] perf, kmem: Utilize the new generic cpunode_map Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:42 ` [PATCH 07/19] perf, callchain: Add generic callchain print handler for stdio Don Zickus
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

The existing stddev calculation actually computed the standard error of the
mean.  As a result, the relative stddev between runs derived from it was
not accurate.

Update the formula to compute the traditional stddev, and keep the old
calculation under the name stderr_stats in case someone wants to use it.
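
The difference between the two quantities can be sketched as follows.  This
is an illustration under stated assumptions, not the code in util/stat.c:
struct demo_stats and the demo_* helpers are hypothetical stand-ins, and the
functions return variances so the snippet has no libm dependency.  Taking the
square root of each result gives the standard deviation and the standard
error respectively.

```c
/* Hypothetical stand-in for perf's struct stats (running moments). */
struct demo_stats {
	double n;	/* number of samples seen so far */
	double mean;	/* running mean */
	double M2;	/* running sum of squared deviations from the mean */
};

/* Welford-style online update of n, mean and M2. */
static void demo_update(struct demo_stats *s, double val)
{
	double delta;

	s->n++;
	delta = val - s->mean;
	s->mean += delta / s->n;
	s->M2 += delta * (val - s->mean);
}

/* Sample variance: M2 / (n - 1).  stddev is its square root. */
static double demo_variance(const struct demo_stats *s)
{
	if (s->n < 2)
		return 0.0;
	return s->M2 / (s->n - 1);
}

/*
 * Variance of the mean: variance / n.  Its square root is the standard
 * error, which shrinks as more samples are collected.
 */
static double demo_variance_mean(const struct demo_stats *s)
{
	if (s->n < 2)
		return 0.0;
	return demo_variance(s) / s->n;
}
```

For the samples 1, 2, 3, 4, 5 the variance is 2.5 and the variance of the
mean is 0.5, so a relative stddev computed from the latter understates the
spread of the data by a factor of sqrt(n).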

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/stat.c | 13 +++++++++++++
 tools/perf/util/stat.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 6506b3d..0cb4dbc 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -33,6 +33,7 @@ double avg_stats(struct stats *stats)
  * http://en.wikipedia.org/wiki/Stddev
  *
  * The std dev of the mean is related to the std dev by:
+ * (also known as standard error)
  *
  *             s
  * s_mean = -------
@@ -41,6 +42,18 @@ double avg_stats(struct stats *stats)
  */
 double stddev_stats(struct stats *stats)
 {
+	double variance;
+
+	if (stats->n < 2)
+		return 0.0;
+
+	variance = stats->M2 / (stats->n - 1);
+
+	return sqrt(variance);
+}
+
+double stderr_stats(struct stats *stats)
+{
 	double variance, variance_mean;
 
 	if (stats->n < 2)
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index ae8ccd7..6f61615 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -12,6 +12,7 @@ struct stats
 void update_stats(struct stats *stats, u64 val);
 double avg_stats(struct stats *stats);
 double stddev_stats(struct stats *stats);
+double stderr_stats(struct stats *stats);
 double rel_stddev_stats(double stddev, double avg);
 
 static inline void init_stats(struct stats *stats)
-- 
1.7.11.7


* [PATCH 07/19] perf, callchain: Add generic callchain print handler for stdio
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (5 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 06/19] perf: Fix stddev calculation Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:42 ` [PATCH 08/19] perf c2c: Shared data analyser Don Zickus
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

My initial implementation for rbtree sorting in the c2c tool does not use the
normal history elements.  As a result, adding callchain support (which is
deeply integrated with history elements) is more challenging when trying to
display its output.

To make things simpler for myself (and to avoid rewriting the same code into
the c2c tool), I provided a generic interface that takes an unsorted callchain
list along with its total and relative sample size, and sorts it locally based
on period and calls the appropriate graph function (passing the correct sample
size).

This makes things easier because the c2c tool can be dumber and just collect
callchains and not worry about the magic needed to sort and display them
correctly.

Unfortunately, this assumes stdio output only and does not support the other
GUI output types.

Regardless, this patch provides useful info for the tool right now.  Tweaks and
recommendations for a better approach are welcome. :-)
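
The two decisions the new helper makes, namely the minimum hit threshold and
which sample count the percentages are taken against, can be sketched in
isolation as below.  The enum and function names are hypothetical; only the
arithmetic and the per-mode choice follow the patch.

```c
#include <stdint.h>

/* Hypothetical mirror of the callchain display modes used in the patch. */
enum demo_chain_mode {
	DEMO_CHAIN_NONE,
	DEMO_CHAIN_GRAPH_ABS,
	DEMO_CHAIN_GRAPH_REL,
	DEMO_CHAIN_FLAT,
};

/* Chains with fewer hits than this are dropped while sorting. */
static uint64_t demo_min_hits(uint64_t total_samples, double min_percent)
{
	return (uint64_t)(total_samples * (min_percent / 100));
}

/* Sample count that the printed percentages are relative to. */
static uint64_t demo_denominator(enum demo_chain_mode mode,
				 uint64_t total_samples,
				 uint64_t relative_samples)
{
	switch (mode) {
	case DEMO_CHAIN_GRAPH_REL:
		return relative_samples;
	case DEMO_CHAIN_GRAPH_ABS:
	case DEMO_CHAIN_FLAT:
		return total_samples;
	case DEMO_CHAIN_NONE:
	default:
		return 0;	/* nothing is printed in this mode */
	}
}
```

Note that the threshold is always derived from the total, even in relative
mode; only the denominator changes with the mode.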

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/ui/stdio/hist.c | 37 +++++++++++++++++++++++++++++++++++++
 tools/perf/util/hist.h     |  4 ++++
 2 files changed, 41 insertions(+)

diff --git a/tools/perf/ui/stdio/hist.c b/tools/perf/ui/stdio/hist.c
index 831fbb7..0a40f59 100644
--- a/tools/perf/ui/stdio/hist.c
+++ b/tools/perf/ui/stdio/hist.c
@@ -536,3 +536,40 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp)
 
 	return ret;
 }
+
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+					u64 total_samples, u64 relative_samples,
+					int left_margin, FILE *fp)
+{
+	struct rb_root sorted_chain;
+	u64 min_callchain_hits;
+
+	if (!symbol_conf.use_callchain)
+		return 0;
+
+	min_callchain_hits = total_samples * (callchain_param.min_percent / 100);
+
+	callchain_param.sort(&sorted_chain, unsorted_callchain,
+				min_callchain_hits, &callchain_param);
+
+	switch (callchain_param.mode) {
+	case CHAIN_GRAPH_REL:
+		return callchain__fprintf_graph(fp, &sorted_chain, relative_samples,
+						left_margin);
+		break;
+	case CHAIN_GRAPH_ABS:
+		return callchain__fprintf_graph(fp, &sorted_chain, total_samples,
+						left_margin);
+		break;
+	case CHAIN_FLAT:
+		return callchain__fprintf_flat(fp, &sorted_chain, total_samples);
+		break;
+	case CHAIN_NONE:
+		break;
+	default:
+		pr_err("Bad callchain mode\n");
+	}
+
+	return 0;
+}
+
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index d226c5b..df981ce 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -112,6 +112,10 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp);
 size_t hists__fprintf(struct hists *hists, bool show_header, int max_rows,
 		      int max_cols, float min_pcnt, FILE *fp);
 
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+					u64 total_samples, u64 relative_samples,
+					int left_margin, FILE *fp);
+
 void hists__filter_by_dso(struct hists *hists);
 void hists__filter_by_thread(struct hists *hists);
 void hists__filter_by_symbol(struct hists *hists);
-- 
1.7.11.7


* [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (6 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 07/19] perf, callchain: Add generic callchain print handler for stdio Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 19:08   ` Andi Kleen
  2014-02-28 17:42 ` [PATCH 09/19] perf c2c: Dump raw records, decode data_src bits Don Zickus
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme
  Cc: LKML, jolsa, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Don Zickus, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Peter Zijlstra, Richard Fowles

From: Arnaldo Carvalho de Melo <acme@redhat.com>

This is the start of a new perf tool that will collect information about
memory accesses and analyse it to find things like hot cachelines, etc.

This is basically an attempt to rewrite a prototype by Richard Fowles
using the tools/perf coding style and libraries.

Using 'perf sched' as the starting point, this patch begins the process by
adding the 'record' subcommand to collect the needed mem-loads and mem-stores
samples.

It also has the basic 'report' skeleton: it resolves the sample address
and hooks the events found in a perf.data file to methods that handle
them, which for now just print the resolved perf_sample data structure
after each event name.
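
The way 'record' assembles the argument vector it hands to cmd_record() can
be sketched as a standalone function.  This is an illustrative variant, not
the code in the patch: the demo_* names are hypothetical, and the sketch
borrows the string pointers instead of calling strdup() on them.

```c
#include <stdlib.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Fixed leading arguments, as in the patch's record_args[]. */
static const char * const demo_fixed[] = { "record", "-W", "-d", "-a" };

/* Event names, as in the patch's handlers[] table. */
static const char * const demo_events[] = {
	"cpu/mem-loads,ldlat=30/pp",
	"cpu/mem-stores/pp",
};

/*
 * Build a NULL-terminated argv: the fixed arguments, then "-e <event>" for
 * each event, then the caller's arguments with argv[0] (the subcommand
 * name) skipped.  Returns NULL on allocation failure.
 */
static const char **demo_build_argv(int argc, const char **argv, int *out_argc)
{
	int extra = argc > 1 ? argc - 1 : 0;
	size_t total = DEMO_ARRAY_SIZE(demo_fixed) +
		       2 * DEMO_ARRAY_SIZE(demo_events) + (size_t)extra;
	const char **out = calloc(total + 1, sizeof(*out));
	size_t i = 0, j;
	int k;

	if (out == NULL)
		return NULL;

	for (j = 0; j < DEMO_ARRAY_SIZE(demo_fixed); j++)
		out[i++] = demo_fixed[j];

	for (j = 0; j < DEMO_ARRAY_SIZE(demo_events); j++) {
		out[i++] = "-e";
		out[i++] = demo_events[j];
	}

	for (k = 1; k < argc; k++)
		out[i++] = argv[k];

	*out_argc = (int)i;
	return out;
}
```

With the input { "record", "sleep", "1" } this yields ten arguments, which
corresponds to running 'perf record -W -d -a -e cpu/mem-loads,ldlat=30/pp
-e cpu/mem-stores/pp sleep 1'.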

[dcz: refreshed to latest upstream changes]

Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fowles <rfowles@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-c2c.c |  22 +++++
 tools/perf/Makefile.perf            |   1 +
 tools/perf/builtin-c2c.c            | 185 ++++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |   1 +
 tools/perf/perf.c                   |   1 +
 5 files changed, 210 insertions(+)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

diff --git a/tools/perf/Documentation/perf-c2c.c b/tools/perf/Documentation/perf-c2c.c
new file mode 100644
index 0000000..4d52798
--- /dev/null
+++ b/tools/perf/Documentation/perf-c2c.c
@@ -0,0 +1,22 @@
+perf-c2c(1)
+===========
+
+NAME
+----
+perf-c2c - Shared Data C2C/HITM Analyzer.
+
+SYNOPSIS
+--------
+[verse]
+'perf c2c' record
+
+DESCRIPTION
+-----------
+These are the variants of perf c2c:
+
+  'perf c2c record <command>' to record the memory accesses of an arbitrary
+  workload.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-mem[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 1f7ec48..a9eebb4 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -427,6 +427,7 @@ endif
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memcpy.o
 BUILTIN_OBJS += $(OUTPUT)bench/mem-memset.o
 
+BUILTIN_OBJS += $(OUTPUT)builtin-c2c.o
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
 BUILTIN_OBJS += $(OUTPUT)builtin-help.o
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
new file mode 100644
index 0000000..2935484
--- /dev/null
+++ b/tools/perf/builtin-c2c.c
@@ -0,0 +1,185 @@
+#include "builtin.h"
+#include "cache.h"
+
+#include "util/evlist.h"
+#include "util/parse-options.h"
+#include "util/session.h"
+#include "util/tool.h"
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct perf_c2c {
+	struct perf_tool tool;
+};
+
+static int perf_sample__fprintf(struct perf_sample *sample,
+				struct perf_evsel *evsel,
+				struct addr_location *al, FILE *fp)
+{
+	return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x016%" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
+		       perf_evsel__name(evsel),
+		       sample->pid, sample->tid, sample->ip, sample->addr,
+		       sample->weight, sample->data_src,
+		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+		       al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load(struct perf_evsel *evsel,
+				  struct perf_sample *sample,
+				  struct addr_location *al)
+{
+	perf_sample__fprintf(sample, evsel, al, stdout);
+	return 0;
+}
+
+static int perf_c2c__process_store(struct perf_evsel *evsel,
+				   struct perf_sample *sample,
+				   struct addr_location *al)
+{
+	perf_sample__fprintf(sample, evsel, al, stdout);
+	return 0;
+}
+
+static const struct perf_evsel_str_handler handlers[] = {
+	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
+	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
+};
+
+typedef int (*sample_handler)(struct perf_evsel *evsel,
+			      struct perf_sample *sample,
+			      struct addr_location *al);
+
+static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+				    union perf_event *event,
+				    struct perf_sample *sample,
+				    struct perf_evsel *evsel,
+				    struct machine *machine)
+{
+	struct addr_location al;
+	int err = 0;
+
+	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
+		pr_err("problem processing %d event, skipping it.\n",
+		       event->header.type);
+		return -1;
+	}
+
+	if (evsel->handler != NULL) {
+		sample_handler f = evsel->handler;
+		err = f(evsel, sample, &al);
+	}
+
+	return err;
+}
+
+static int perf_c2c__read_events(struct perf_c2c *c2c)
+{
+	int err = -1;
+	struct perf_session *session;
+	struct perf_data_file file = {
+			.path = input_name,
+			.mode = PERF_DATA_MODE_READ,
+	};
+	struct perf_evsel *evsel;
+
+	session = perf_session__new(&file, 0, &c2c->tool);
+	if (session == NULL) {
+		pr_debug("No memory for session\n");
+		goto out;
+	}
+
+	/* setup the evsel handlers for each event type */
+	evlist__for_each(session->evlist, evsel) {
+		const char *name = perf_evsel__name(evsel);
+		unsigned int i;
+
+		for (i = 0; i < ARRAY_SIZE(handlers); i++) {
+			if (!strcmp(name, handlers[i].name))
+				evsel->handler = handlers[i].handler;
+		}
+	}
+
+	err = perf_session__process_events(session, &c2c->tool);
+	if (err)
+		pr_err("Failed to process events, error %d\n", err);
+
+out:
+	return err;
+}
+
+static int perf_c2c__report(struct perf_c2c *c2c)
+{
+	setup_pager();
+	return perf_c2c__read_events(c2c);
+}
+
+static int perf_c2c__record(int argc, const char **argv)
+{
+	unsigned int rec_argc, i, j;
+	const char **rec_argv;
+	const char * const record_args[] = {
+		"record",
+		/* "--phys-addr", */
+		"-W",
+		"-d",
+		"-a",
+	};
+
+	rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+
+	if (rec_argv == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+		rec_argv[i] = strdup(record_args[i]);
+
+	for (j = 0; j < ARRAY_SIZE(handlers); j++) {
+		rec_argv[i++] = strdup("-e");
+		rec_argv[i++] = strdup(handlers[j].name);
+	}
+
+	for (j = 1; j < (unsigned int)argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	BUG_ON(i != rec_argc);
+
+	return cmd_record(i, rec_argv, NULL);
+}
+
+int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	struct perf_c2c c2c = {
+		.tool = {
+			.sample		 = perf_c2c__process_sample,
+			.comm		 = perf_event__process_comm,
+			.exit		 = perf_event__process_exit,
+			.fork		 = perf_event__process_fork,
+			.lost		 = perf_event__process_lost,
+			.ordered_samples = true,
+		},
+	};
+	const struct option c2c_options[] = {
+	OPT_END()
+	};
+	const char * const c2c_usage[] = {
+		"perf c2c {record|report}",
+		NULL
+	};
+
+	argc = parse_options(argc, argv, c2c_options, c2c_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+	if (!argc)
+		usage_with_options(c2c_usage, c2c_options);
+
+	if (!strncmp(argv[0], "rec", 3)) {
+		return perf_c2c__record(argc, argv);
+	} else if (!strncmp(argv[0], "rep", 3)) {
+		return perf_c2c__report(&c2c);
+	} else {
+		usage_with_options(c2c_usage, c2c_options);
+	}
+
+	return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index b210d62..2d0b1b5 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -17,6 +17,7 @@ extern int cmd_annotate(int argc, const char **argv, const char *prefix);
 extern int cmd_bench(int argc, const char **argv, const char *prefix);
 extern int cmd_buildid_cache(int argc, const char **argv, const char *prefix);
 extern int cmd_buildid_list(int argc, const char **argv, const char *prefix);
+extern int cmd_c2c(int argc, const char **argv, const char *prefix);
 extern int cmd_diff(int argc, const char **argv, const char *prefix);
 extern int cmd_evlist(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index 431798a..c7012a3 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -35,6 +35,7 @@ struct cmd_struct {
 static struct cmd_struct commands[] = {
 	{ "buildid-cache", cmd_buildid_cache, 0 },
 	{ "buildid-list", cmd_buildid_list, 0 },
+	{ "c2c",	cmd_c2c,	0 },
 	{ "diff",	cmd_diff,	0 },
 	{ "evlist",	cmd_evlist,	0 },
 	{ "help",	cmd_help,	0 },
-- 
1.7.11.7


* [PATCH 09/19] perf c2c: Dump raw records, decode data_src bits
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (7 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 08/19] perf c2c: Shared data analyser Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:42 ` [PATCH 10/19] perf, c2c: Rework setup code to prepare for features Don Zickus
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme
  Cc: LKML, jolsa, jmario, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Don Zickus, Frederic Weisbecker, Mike Galbraith,
	Paul Mackerras, Peter Zijlstra, Richard Fowles

From: Arnaldo Carvalho de Melo <acme@redhat.com>

From the c2c prototype:

[root@sandy ~]# perf c2c -r report | head -7
T Status    Pid Tid CPU          Inst Adrs     Virt Data Adrs Phys Data Adrs Cycles Source      Decoded Source                ObJect:Symbol
--------------------------------------------------------------------------------------------------------------------------------------------
  raw input 779 779   7 0xffffffff810865dd 0xffff8803f4d75ec8              0    370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA]    [kernel.kallsyms]:try_to_wake_up
  raw input 779 779   7 0xffffffff8107acb3 0xffff8802a5b73158              0    297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
  raw input 779 779   7       0x3b7e009814     0x7fff87429ea0              0    925 0x68100142 [LOAD,L1,HIT,SNP NONE]        ???:???
  raw input   0   0   1 0xffffffff8108bf81 0xffff8803eafebf50              0    172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM]   [kernel.kallsyms]:update_stats_wait_end
  raw input 779 779   7       0x3b7e0097cc     0x7fac94b69068              0    228 0x68100242 [LOAD,LFB,HIT,SNP NONE]       ???:???
[root@sandy ~]#

The "Phys Data Adrs" column is not available at this point.
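
How the "Source" value turns into the "Decoded Source" column can be sketched
with a reduced decode table.  The field positions and flag values below are
assumptions taken from the layout of union perf_mem_data_src in the
perf_event.h uapi header (mem_op 5 bits, mem_lvl 14 bits, mem_snoop 5 bits,
mem_lock 2 bits); the demo_* names are hypothetical and only a subset of the
flags is listed.

```c
#include <stdint.h>
#include <stdio.h>

enum { DEMO_OP, DEMO_LVL, DEMO_SNP, DEMO_LCK, DEMO_NR_FIELDS };

/* Reduced table; flag values assumed to match the uapi header. */
static const struct {
	uint64_t bit;
	int field;
	const char *name;
} demo_bits[] = {
	{ 0x02, DEMO_OP,  "LOAD"     },
	{ 0x04, DEMO_OP,  "STORE"    },
	{ 0x10, DEMO_LVL, "LFB"      },
	{ 0x08, DEMO_LVL, "L1"       },
	{ 0x20, DEMO_LVL, "L2"       },
	{ 0x40, DEMO_LVL, "LCL_LLC"  },
	{ 0x02, DEMO_LVL, "HIT"      },
	{ 0x04, DEMO_LVL, "MISS"     },
	{ 0x02, DEMO_SNP, "SNP NONE" },
	{ 0x10, DEMO_SNP, "SNP HITM" },
	{ 0x01, DEMO_SNP, "SNP NA"   },
	{ 0x02, DEMO_LCK, "LOCKED"   },
};

/* Append s at offset len, never writing past size; returns the new length. */
static size_t demo_append(char *bf, size_t size, size_t len, const char *s)
{
	int n;

	if (len >= size)
		return len;
	n = snprintf(bf + len, size - len, "%s", s);
	if (n < 0)
		return len;
	return (size_t)n < size - len ? len + (size_t)n : size - 1;
}

/* Decode val into "[A,B,...]"; returns the number of characters written. */
static size_t demo_decode(char *bf, size_t size, uint64_t val)
{
	uint64_t fields[DEMO_NR_FIELDS];
	size_t i, len;
	int first = 1;

	fields[DEMO_OP]  = val & 0x1f;
	fields[DEMO_LVL] = (val >> 5) & 0x3fff;
	fields[DEMO_SNP] = (val >> 19) & 0x1f;
	fields[DEMO_LCK] = (val >> 24) & 0x3;

	len = demo_append(bf, size, 0, "[");
	for (i = 0; i < sizeof(demo_bits) / sizeof(demo_bits[0]); i++) {
		if (!(demo_bits[i].bit & fields[demo_bits[i].field]))
			continue;
		if (!first)
			len = demo_append(bf, size, len, ",");
		len = demo_append(bf, size, len, demo_bits[i].name);
		first = 0;
	}
	return demo_append(bf, size, len, "]");
}
```

Feeding it the Source values from the sample output above reproduces the
corresponding decoded strings, for example 0x68800842 for the HITM record.
Unlike the patch, the sketch simply truncates on overflow instead of writing
an ellipsis.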

Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fowles <rfowles@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-c2c.c | 148 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 125 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 2935484..7082913 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -11,51 +11,148 @@
 
 struct perf_c2c {
 	struct perf_tool tool;
+	bool		 raw_records;
 };
 
-static int perf_sample__fprintf(struct perf_sample *sample,
-				struct perf_evsel *evsel,
-				struct addr_location *al, FILE *fp)
+enum { OP, LVL, SNP, LCK, TLB };
+
+static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
-	return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x016%" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
-		       perf_evsel__name(evsel),
-		       sample->pid, sample->tid, sample->ip, sample->addr,
-		       sample->weight, sample->data_src,
-		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
-		       al->sym ? al->sym->name : "???");
+#define PREFIX       "["
+#define SUFFIX       "]"
+#define ELLIPSIS     "..."
+	static const struct {
+		uint64_t   bit;
+		int64_t    field;
+		const char *name;
+	} decode_bits[] = {
+	{ PERF_MEM_OP_LOAD,       OP,  "LOAD"     },
+	{ PERF_MEM_OP_STORE,      OP,  "STORE"    },
+	{ PERF_MEM_OP_NA,         OP,  "OP_NA"    },
+	{ PERF_MEM_LVL_LFB,       LVL, "LFB"      },
+	{ PERF_MEM_LVL_L1,        LVL, "L1"       },
+	{ PERF_MEM_LVL_L2,        LVL, "L2"       },
+	{ PERF_MEM_LVL_L3,        LVL, "LCL_LLC"  },
+	{ PERF_MEM_LVL_LOC_RAM,   LVL, "LCL_RAM"  },
+	{ PERF_MEM_LVL_REM_RAM1,  LVL, "RMT_RAM"  },
+	{ PERF_MEM_LVL_REM_RAM2,  LVL, "RMT_RAM"  },
+	{ PERF_MEM_LVL_REM_CCE1,  LVL, "RMT_LLC"  },
+	{ PERF_MEM_LVL_REM_CCE2,  LVL, "RMT_LLC"  },
+	{ PERF_MEM_LVL_IO,        LVL, "I/O"	  },
+	{ PERF_MEM_LVL_UNC,       LVL, "UNCACHED" },
+	{ PERF_MEM_LVL_NA,        LVL, "N"        },
+	{ PERF_MEM_LVL_HIT,       LVL, "HIT"      },
+	{ PERF_MEM_LVL_MISS,      LVL, "MISS"     },
+	{ PERF_MEM_SNOOP_NONE,    SNP, "SNP NONE" },
+	{ PERF_MEM_SNOOP_HIT,     SNP, "SNP HIT"  },
+	{ PERF_MEM_SNOOP_MISS,    SNP, "SNP MISS" },
+	{ PERF_MEM_SNOOP_HITM,    SNP, "SNP HITM" },
+	{ PERF_MEM_SNOOP_NA,      SNP, "SNP NA"   },
+	{ PERF_MEM_LOCK_LOCKED,   LCK, "LOCKED"   },
+	{ PERF_MEM_LOCK_NA,       LCK, "LOCK_NA"  },
+	};
+	union perf_mem_data_src dsrc = { .val = val, };
+	int printed = scnprintf(bf, size, PREFIX);
+	size_t i;
+	bool first_present = true;
+
+	for (i = 0; i < ARRAY_SIZE(decode_bits); i++) {
+		int bitval;
+
+		switch (decode_bits[i].field) {
+		case OP:  bitval = decode_bits[i].bit & dsrc.mem_op;    break;
+		case LVL: bitval = decode_bits[i].bit & dsrc.mem_lvl;   break;
+		case SNP: bitval = decode_bits[i].bit & dsrc.mem_snoop; break;
+		case LCK: bitval = decode_bits[i].bit & dsrc.mem_lock;  break;
+		case TLB: bitval = decode_bits[i].bit & dsrc.mem_dtlb;  break;
+		default: bitval = 0;					break;
+		}
+
+		if (!bitval)
+			continue;
+
+		if (strlen(decode_bits[i].name) + !!i > size - printed - sizeof(SUFFIX)) {
+			sprintf(bf + size - sizeof(SUFFIX) - sizeof(ELLIPSIS) + 1, ELLIPSIS);
+			printed = size - sizeof(SUFFIX);
+			break;
+		}
+
+		printed += scnprintf(bf + printed, size - printed, "%s%s",
+				     first_present ? "" : ",", decode_bits[i].name);
+		first_present = false;
+	}
+
+	printed += scnprintf(bf + printed, size - printed, SUFFIX);
+	return printed;
 }
 
-static int perf_c2c__process_load(struct perf_evsel *evsel,
-				  struct perf_sample *sample,
-				  struct addr_location *al)
+static int perf_c2c__fprintf_header(FILE *fp)
 {
-	perf_sample__fprintf(sample, evsel, al, stdout);
-	return 0;
+	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
+			      'T',
+			      "Status",
+			      "Pid",
+			      "Tid",
+			      "CPU",
+			      "Inst Adrs",
+			      "Virt Data Adrs",
+			      "Phys Data Adrs",
+			      "Cycles",
+			      "Source",
+			      "  Decoded Source",
+			      "ObJect:Symbol");
+	return printed + fprintf(fp, "%-*.*s\n", printed, printed, graph_dotted_line);
+}
+
+static int perf_sample__fprintf(struct perf_sample *sample, char tag,
+				const char *reason, struct addr_location *al, FILE *fp)
+{
+	char data_src[61];
+
+	perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
+
+	return fprintf(fp, "%c %-16s  %6d  %6d  %4d  %#18" PRIx64 "  %#18" PRIx64 "  %#18" PRIx64 "  %6" PRIu64 "  %#10" PRIx64 " %-60.60s %s:%s\n",
+		       tag,
+		       reason ?: "valid record",
+		       sample->pid,
+		       sample->tid,
+		       sample->cpu,
+		       sample->ip,
+		       sample->addr,
+		       0UL,
+		       sample->weight,
+		       sample->data_src,
+		       data_src,
+		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+		       al->sym ? al->sym->name : "???");
 }
 
-static int perf_c2c__process_store(struct perf_evsel *evsel,
-				   struct perf_sample *sample,
-				   struct addr_location *al)
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+					struct perf_sample *sample,
+					struct addr_location *al)
 {
-	perf_sample__fprintf(sample, evsel, al, stdout);
+	if (c2c->raw_records)
+		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+
 	return 0;
 }
 
 static const struct perf_evsel_str_handler handlers[] = {
-	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
-	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
+	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
+	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },
 };
 
-typedef int (*sample_handler)(struct perf_evsel *evsel,
+typedef int (*sample_handler)(struct perf_c2c *c2c,
 			      struct perf_sample *sample,
 			      struct addr_location *al);
 
-static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+static int perf_c2c__process_sample(struct perf_tool *tool,
 				    union perf_event *event,
 				    struct perf_sample *sample,
 				    struct perf_evsel *evsel,
 				    struct machine *machine)
 {
+	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
 	struct addr_location al;
 	int err = 0;
 
@@ -67,7 +164,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
 
 	if (evsel->handler != NULL) {
 		sample_handler f = evsel->handler;
-		err = f(evsel, sample, &al);
+		err = f(c2c, sample, &al);
 	}
 
 	return err;
@@ -111,6 +208,10 @@ out:
 static int perf_c2c__report(struct perf_c2c *c2c)
 {
 	setup_pager();
+
+	if (c2c->raw_records)
+		perf_c2c__fprintf_header(stdout);
+
 	return perf_c2c__read_events(c2c);
 }
 
@@ -161,6 +262,7 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		},
 	};
 	const struct option c2c_options[] = {
+	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


* [PATCH 10/19] perf, c2c: Rework setup code to prepare for features
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (8 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 09/19] perf c2c: Dump raw records, decode data_src bits Don Zickus
@ 2014-02-28 17:42 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 11/19] perf, c2c: Add in sort on physid Don Zickus
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:42 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

A basic patch that rearranges some of the c2c code and adds a couple
of small features to lay the groundwork for the rest of the patch
series.

Changes include:

o rework the report path
o replace preprocess_sample with simpler calls
o rework raw output to handle separators
o remove phys id gunk
o add some generic options

There isn't much meat in this patch, just a bunch of code movement and cleanups.
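
The separator handling in the raw output boils down to choosing between two
format strings.  A reduced sketch with fewer columns follows; the function
and parameter names are hypothetical, and in perf the separator comes from
symbol_conf.field_sep rather than an argument.

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Format one record into bf.  With a separator the fields are emitted
 * unpadded for machine parsing; without one (sep == NULL) they are padded
 * into aligned columns joined by a single space.
 */
static int demo_format_record(char *bf, size_t size, const char *sep,
			      int pid, int cpu, uint64_t addr, uint64_t weight)
{
	const char *fmt;

	if (sep) {
		fmt = "%d%s%d%s%#" PRIx64 "%s%" PRIu64 "\n";
	} else {
		fmt = "%6d%s%4d%s%#18" PRIx64 "%s%6" PRIu64 "\n";
		sep = " ";
	}

	return snprintf(bf, size, fmt, pid, sep, cpu, sep, addr, sep, weight);
}
```

Passing "," as the separator gives one comma-delimited line per record,
which is convenient for post-processing scripts; passing NULL keeps the
human-readable column layout.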

V2: refresh to latest upstream changes

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 125 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 92 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 7082913..367d6c1 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,6 +5,7 @@
 #include "util/parse-options.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/debug.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -105,34 +106,55 @@ static int perf_c2c__fprintf_header(FILE *fp)
 }
 
 static int perf_sample__fprintf(struct perf_sample *sample, char tag,
-				const char *reason, struct addr_location *al, FILE *fp)
+				const char *reason, struct mem_info *mi, FILE *fp)
 {
 	char data_src[61];
+	const char *fmt, *sep;
+	struct map *map = mi->iaddr.map;
 
 	perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
 
-	return fprintf(fp, "%c %-16s  %6d  %6d  %4d  %#18" PRIx64 "  %#18" PRIx64 "  %#18" PRIx64 "  %6" PRIu64 "  %#10" PRIx64 " %-60.60s %s:%s\n",
-		       tag,
-		       reason ?: "valid record",
-		       sample->pid,
-		       sample->tid,
-		       sample->cpu,
-		       sample->ip,
-		       sample->addr,
-		       0UL,
-		       sample->weight,
-		       sample->data_src,
-		       data_src,
-		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
-		       al->sym ? al->sym->name : "???");
+	if (symbol_conf.field_sep) {
+		fmt = "%c%s%s%s%d%s%d%s%d%s%#"PRIx64"%s%#"PRIx64"%s"
+		      "%"PRIu64"%s%#"PRIx64"%s%s%s%s:%s\n";
+		sep = symbol_conf.field_sep;
+	} else {
+		fmt = "%c%s%-16s%s%6d%s%6d%s%4d%s%#18"PRIx64"%s%#18"PRIx64"%s"
+		      "%6"PRIu64"%s%#10"PRIx64"%s%-60.60s%s%s:%s\n";
+		sep = " ";
+	}
+
+	return fprintf(fp, fmt,
+		       tag,				sep,
+		       reason ?: "valid record",	sep,
+		       sample->pid,			sep,
+		       sample->tid,			sep,
+		       sample->cpu,			sep,
+		       sample->ip,			sep,
+		       sample->addr,			sep,
+		       sample->weight,			sep,
+		       sample->data_src,		sep,
+		       data_src,			sep,
+		       map ? (map->dso ? map->dso->long_name : "???") : "???",
+		       mi->iaddr.sym ? mi->iaddr.sym->name : "???");
 }
 
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+					struct addr_location *al,
 					struct perf_sample *sample,
-					struct addr_location *al)
+					struct perf_evsel *evsel)
 {
-	if (c2c->raw_records)
-		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+	struct mem_info *mi;
+
+	mi = sample__resolve_mem(sample, al);
+	if (!mi)
+		return -ENOMEM;
+
+	if (c2c->raw_records) {
+		perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
+		free(mi);
+		return 0;
+	}
 
 	return 0;
 }
@@ -143,8 +165,9 @@ static const struct perf_evsel_str_handler handlers[] = {
 };
 
 typedef int (*sample_handler)(struct perf_c2c *c2c,
+			      struct addr_location *al,
 			      struct perf_sample *sample,
-			      struct addr_location *al);
+			      struct perf_evsel *evsel);
 
 static int perf_c2c__process_sample(struct perf_tool *tool,
 				    union perf_event *event,
@@ -153,20 +176,49 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 				    struct machine *machine)
 {
 	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
-	struct addr_location al;
-	int err = 0;
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	struct thread *thread;
+	sample_handler f;
+	int err = -1;
+	struct addr_location al = {
+			.machine = machine,
+			.cpu = sample->cpu,
+			.cpumode = cpumode,
+	};
 
-	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
-		pr_err("problem processing %d event, skipping it.\n",
-		       event->header.type);
-		return -1;
-	}
+	if (evsel->handler == NULL)
+		return 0;
+
+	thread = machine__find_thread(machine, sample->tid);
+	if (thread == NULL)
+		goto err;
+
+	al.thread = thread;
 
-	if (evsel->handler != NULL) {
-		sample_handler f = evsel->handler;
-		err = f(c2c, sample, &al);
+	f = evsel->handler;
+	err = f(c2c, &al, sample, evsel);
+	if (err)
+		goto err;
+
+	return 0;
+err:
+	if (err > 0)
+		err = 0;
+	return err;
+}
+
+static int perf_c2c__process_events(struct perf_session *session,
+				    struct perf_c2c *c2c)
+{
+	int err = -1;
+
+	err = perf_session__process_events(session, &c2c->tool);
+	if (err) {
+		pr_err("Failed to process count events, error %d\n", err);
+		goto err;
 	}
 
+err:
 	return err;
 }
 
@@ -197,9 +249,7 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 		}
 	}
 
-	err = perf_session__process_events(session, &c2c->tool);
-	if (err)
-		pr_err("Failed to process events, error %d", err);
+	err = perf_c2c__process_events(session, c2c);
 
 out:
 	return err;
@@ -221,7 +271,6 @@ static int perf_c2c__record(int argc, const char **argv)
 	const char **rec_argv;
 	const char * const record_args[] = {
 		"record",
-		/* "--phys-addr", */
 		"-W",
 		"-d",
 		"-a",
@@ -254,6 +303,8 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	struct perf_c2c c2c = {
 		.tool = {
 			.sample		 = perf_c2c__process_sample,
+			.mmap2           = perf_event__process_mmap2,
+			.mmap            = perf_event__process_mmap,
 			.comm		 = perf_event__process_comm,
 			.exit		 = perf_event__process_exit,
 			.fork		 = perf_event__process_fork,
@@ -263,6 +314,14 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	};
 	const struct option c2c_options[] = {
 	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+	OPT_INCR('v', "verbose", &verbose,
+		 "be more verbose (show counter open errors, etc)"),
+	OPT_STRING('i', "input", &input_name, "file",
+		   "the input file to process"),
+	OPT_STRING('x', "field-separator", &symbol_conf.field_sep,
+		   "separator",
+		   "separator for columns, no spaces will be added"
+		   " between columns '.' is reserved."),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7



* [PATCH 11/19] perf, c2c: Add in sort on physid
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (9 preceding siblings ...)
  2014-02-28 17:42 ` [PATCH 10/19] perf, c2c: Rework setup code to prepare for features Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 18:59   ` Andi Kleen
  2014-02-28 17:43 ` [PATCH 12/19] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Now that the infrastructure is set, add in the support to use
hist_entry to sort on physid.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 367d6c1..472d4d9 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -6,6 +6,7 @@
 #include "util/session.h"
 #include "util/tool.h"
 #include "util/debug.h"
+#include "util/annotate.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -13,6 +14,7 @@
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
+	struct hists	 hists;
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -144,7 +146,11 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct perf_sample *sample,
 					struct perf_evsel *evsel)
 {
-	struct mem_info *mi;
+	struct symbol *parent = NULL;
+	struct hist_entry *he;
+	struct mem_info *mi, *mx;
+	uint64_t cost;
+	int err;
 
 	mi = sample__resolve_mem(sample, al);
 	if (!mi)
@@ -156,7 +162,42 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 		return 0;
 	}
 
-	return 0;
+	cost = sample->weight;
+	if (!cost)
+		cost = 1;
+
+	/*
+	 * must pass period=weight in order to get the correct
+	 * sorting from hists__collapse_resort() which is solely
+	 * based on periods. We want sorting be done on nr_events * weight
+	 * and this is indirectly achieved by passing period=weight here
+	 * and the he_stat__add_period() function.
+	 */
+	he = __hists__add_entry(&c2c->hists, al, parent, NULL, mi,
+				cost, cost, 0);
+	if (!he) {
+		err = -ENOMEM;
+		goto out_mem;
+	}
+
+	err = hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
+	if (err)
+		goto out;
+
+	mx = he->mem_info;
+	err = addr_map_symbol__inc_samples(&mx->daddr, evsel->idx);
+	if (err)
+		goto out;
+
+	c2c->hists.stats.total_period += cost;
+	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+	return err;
+
+out_mem:
+	/* implicitly freed by __hists__add_entry */
+	free(mi);
+out:
+	return err;
 }
 
 static const struct perf_evsel_str_handler handlers[] = {
@@ -255,10 +296,28 @@ out:
 	return err;
 }
 
+static int perf_c2c__init(struct perf_c2c *c2c)
+{
+	sort__mode = SORT_MODE__MEMORY;
+	sort__wants_unique = 1;
+	sort_order = "physid";
+
+	if (setup_sorting() < 0) {
+		pr_err("can not setup sorting\n");
+		return -1;
+	}
+
+	hists__init(&c2c->hists);
+
+	return 0;
+}
 static int perf_c2c__report(struct perf_c2c *c2c)
 {
 	setup_pager();
 
+	if (perf_c2c__init(c2c))
+		return -1;
+
 	if (c2c->raw_records)
 		perf_c2c__fprintf_header(stdout);
 
-- 
1.7.11.7



* [PATCH 12/19] perf, c2c: Add stats to track data source bits and cpu to node maps
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (10 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 11/19] perf, c2c: Add in sort on physid Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 13/19] perf, c2c: Sort based on hottest cache line Don Zickus
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This patch adds a bunch of stats that will be used later in post-processing
to determine where and with what frequency the HITMs are coming from.

Most of the stats are decoded from the data source response.  Another
piece of the stats is tracking which cpu the record came in on.

Credit to Dick Fowles for determining which bits are important and how to
properly track them.  Ported to perf by me.
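
A minimal sketch of the kind of decoding described above, not the code in
this patch.  It classifies one load from its level and snoop bits.  The EX_*
constants, enum ex_kind and ex_classify_load() are hypothetical names; the
constant values are assumed to mirror PERF_MEM_LVL_* and PERF_MEM_SNOOP_* in
<linux/perf_event.h> and should be checked against that header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed copies of the PERF_MEM_LVL_* bits; verify against the uapi header. */
enum {
	EX_LVL_HIT      = 0x02,
	EX_LVL_L3       = 0x40,
	EX_LVL_REM_CCE1 = 0x400,
	EX_LVL_REM_CCE2 = 0x800,
};

/* Assumed copies of the PERF_MEM_SNOOP_* bits. */
enum {
	EX_SNOOP_HIT  = 0x04,
	EX_SNOOP_HITM = 0x10,
};

enum ex_kind { EX_OTHER, EX_LCL_HITM, EX_RMT_HIT, EX_RMT_HITM };

/* Classify a single load from its mem_lvl and mem_snoop fields. */
static enum ex_kind ex_classify_load(uint64_t lvl, uint64_t snoop)
{
	/* mem_lvl is a 14-bit field in union perf_mem_data_src */
	assert((lvl >> 14) == 0);

	/* remote cache: a clean hit is checked first, then a modified hit */
	if (lvl & (EX_LVL_REM_CCE1 | EX_LVL_REM_CCE2)) {
		if (snoop & EX_SNOOP_HIT)
			return EX_RMT_HIT;
		if (snoop & EX_SNOOP_HITM)
			return EX_RMT_HITM;
		return EX_OTHER;
	}

	/* local LLC hit that found the line modified in another core */
	if ((lvl & EX_LVL_HIT) && (lvl & EX_LVL_L3) && (snoop & EX_SNOOP_HITM))
		return EX_LCL_HITM;

	return EX_OTHER;
}
```

The remote branch checks the clean hit before the modified hit, the same
order c2c_decode_stats() uses in the diff below.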

V2: refresh with hist_entry

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 184 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 472d4d9..5deb7cc 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,20 +5,84 @@
 #include "util/parse-options.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/stat.h"
+#include "util/cpumap.h"
 #include "util/debug.h"
 #include "util/annotate.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
+#include <sched.h>
+
+typedef struct {
+	int  locks;               /* count of 'lock' transactions */
+	int  store;               /* count of all stores in trace */
+	int  st_uncache;          /* stores to uncacheable address */
+	int  st_noadrs;           /* cacheable store with no address */
+	int  st_l1hit;            /* count of stores that hit L1D */
+	int  st_l1miss;           /* count of stores that miss L1D */
+	int  load;                /* count of all loads in trace */
+	int  ld_excl;             /* exclusive loads, rmt/lcl DRAM - snp none/miss */
+	int  ld_shared;           /* shared loads, rmt/lcl DRAM - snp hit */
+	int  ld_uncache;          /* loads to uncacheable address */
+	int  ld_io;               /* loads to io address */
+	int  ld_miss;             /* loads miss */
+	int  ld_noadrs;           /* cacheable load with no address */
+	int  ld_fbhit;            /* count of loads hitting Fill Buffer */
+	int  ld_l1hit;            /* count of loads that hit L1D */
+	int  ld_l2hit;            /* count of loads that hit L2D */
+	int  ld_llchit;           /* count of loads that hit LLC */
+	int  lcl_hitm;            /* count of loads with local HITM  */
+	int  rmt_hitm;            /* count of loads with remote HITM */
+	int  rmt_hit;             /* count of loads with remote hit clean; */
+	int  lcl_dram;            /* count of loads miss to local DRAM */
+	int  rmt_dram;            /* count of loads miss to remote DRAM */
+	int  nomap;               /* count of load/stores with no phys adrs */
+	int  noparse;             /* count of unparsable data sources */
+} trinfo_t;
+
+struct c2c_stats {
+	cpu_set_t		cpuset;
+	int			nr_entries;
+	u64			total_period;
+	trinfo_t		t;
+	struct stats		stats;
+};
 
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
 	struct hists	 hists;
+
+	/* stats */
+	struct c2c_stats	stats;
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
 
+#define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
+#define RMT_LLC              (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
+
+#define L1CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_HIT))
+#define FILLBUF_HIT(a)       (((a) & PERF_MEM_LVL_LFB) && ((a) & PERF_MEM_LVL_HIT))
+#define L2CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L2 ) && ((a) & PERF_MEM_LVL_HIT))
+#define L3CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_HIT))
+
+#define L1CACHE_MISS(a)      (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_MISS))
+#define L3CACHE_MISS(a)      (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_MISS))
+
+#define LD_UNCACHED(a)       (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+#define ST_UNCACHED(a)       (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+
+#define RMT_LLCHIT(a)        (((a) & RMT_LLC) && ((a) & PERF_MEM_LVL_HIT))
+#define RMT_HIT(a,b)         (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HIT))
+#define RMT_HITM(a,b)        (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HITM))
+#define RMT_MEM(a)           (((a) & RMT_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+#define LCL_HIT(a,b)         (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HIT))
+#define LCL_HITM(a,b)        (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
+#define LCL_MEM(a)           (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
 #define PREFIX       "["
@@ -141,6 +205,106 @@ static int perf_sample__fprintf(struct perf_sample *sample, char tag,
 		       mi->iaddr.sym ? mi->iaddr.sym->name : "???");
 }
 
+static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
+{
+	union perf_mem_data_src *data_src = &entry->mem_info->data_src;
+	u64 daddr = entry->mem_info->daddr.addr;
+	u64 weight = entry->stat.weight;
+	int err = 0;
+
+	u64 op = data_src->mem_op;
+	u64 lvl = data_src->mem_lvl;
+	u64 snoop = data_src->mem_snoop;
+	u64 lock = data_src->mem_lock;
+
+#define P(a,b) PERF_MEM_##a##_##b
+
+	stats->nr_entries++;
+	stats->total_period += entry->stat.period;
+
+	if (lock & P(LOCK,LOCKED)) stats->t.locks++;
+
+	if (op & P(OP,LOAD)) {
+		stats->t.load++;
+
+		if (!daddr) {
+			stats->t.ld_noadrs++;
+			return -1;
+		}
+
+		if (lvl & P(LVL,HIT)) {
+			if (lvl & P(LVL,UNC)) stats->t.ld_uncache++;
+			if (lvl & P(LVL,IO))  stats->t.ld_io++;
+			if (lvl & P(LVL,LFB)) stats->t.ld_fbhit++;
+			if (lvl & P(LVL,L1 )) stats->t.ld_l1hit++;
+			if (lvl & P(LVL,L2 )) stats->t.ld_l2hit++;
+			if (lvl & P(LVL,L3 )) {
+				if (snoop & P(SNOOP,HITM))
+					stats->t.lcl_hitm++;
+				else
+					stats->t.ld_llchit++;
+			}
+
+			if (lvl & P(LVL,LOC_RAM)) {
+				stats->t.lcl_dram++;
+				if (snoop & P(SNOOP,HIT))
+					stats->t.ld_shared++;
+				else
+					stats->t.ld_excl++;
+			}
+
+			if ((lvl & P(LVL,REM_RAM1)) ||
+			    (lvl & P(LVL,REM_RAM2))) {
+				stats->t.rmt_dram++;
+				if (snoop & P(SNOOP,HIT))
+					stats->t.ld_shared++;
+				else
+					stats->t.ld_excl++;
+			}
+		}
+
+		if ((lvl & P(LVL,REM_CCE1)) ||
+		    (lvl & P(LVL,REM_CCE2))) {
+			if (snoop & P(SNOOP, HIT))
+				stats->t.rmt_hit++;
+			else if (snoop & P(SNOOP, HITM)) {
+				stats->t.rmt_hitm++;
+				update_stats(&stats->stats, weight);
+			}
+		}
+
+		if ((lvl & P(LVL,MISS)))
+			stats->t.ld_miss++;
+
+	} else if (op & P(OP,STORE)) {
+		/* store */
+		stats->t.store++;
+
+		if (!daddr) {
+			stats->t.st_noadrs++;
+			return -1;
+		}
+
+		if (lvl & P(LVL,HIT)) {
+			if (lvl & P(LVL,UNC)) stats->t.st_uncache++;
+			if (lvl & P(LVL,L1 )) stats->t.st_l1hit++;
+		}
+		if (lvl & P(LVL,MISS))
+			if (lvl & P(LVL,L1)) stats->t.st_l1miss++;
+	} else {
+		/* unparsable data_src? */
+		stats->t.noparse++;
+		return -1;
+	}
+
+	if (!entry->mem_info->daddr.map || !entry->mem_info->iaddr.map) {
+		stats->t.nomap++;
+		return -1;
+	}
+
+	return err;
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -180,6 +344,14 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 		goto out_mem;
 	}
 
+	err = c2c_decode_stats(&c2c->stats, he);
+	if (err < 0) {
+		err = 0;
+		rb_erase(&he->rb_node_in, c2c->hists.entries_in);
+		free(he);
+		goto out;
+	}
+
 	err = hist_entry__inc_addr_samples(he, evsel->idx, al->addr);
 	if (err)
 		goto out;
@@ -279,6 +451,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 		goto out;
 	}
 
+	if (symbol__init() < 0)
+		goto out_delete;
+
 	/* setup the evsel handlers for each event type */
 	evlist__for_each(session->evlist, evsel) {
 		const char *name = perf_evsel__name(evsel);
@@ -292,12 +467,20 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 
 	err = perf_c2c__process_events(session, c2c);
 
+out_delete:
+	perf_session__delete(session);
 out:
 	return err;
 }
 
 static int perf_c2c__init(struct perf_c2c *c2c)
 {
+	/* setup cpu map */
+	if (cpu_map__setup_cpunode_map() < 0) {
+		pr_err("can not setup cpu map\n");
+		return -1;
+	}
+
 	sort__mode = SORT_MODE__MEMORY;
 	sort__wants_unique = 1;
 	sort_order = "physid";
@@ -308,6 +491,7 @@ static int perf_c2c__init(struct perf_c2c *c2c)
 	}
 
 	hists__init(&c2c->hists);
+	CPU_ZERO(&c2c->stats.cpuset);
 
 	return 0;
 }
-- 
1.7.11.7



* [PATCH 13/19] perf, c2c: Sort based on hottest cache line
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (11 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 12/19] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 14/19] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Now that we have all the events sorted on a unique address, we can walk
the rbtree sequentially and count up all the HITMs for each cacheline
fairly easily.

Once we encounter a new event on a different cacheline, process the previous
cacheline.  That includes determining if any HITMs were present on that
cacheline and if so, add it to another rbtree sorted on the number of HITMs.

This second rbtree, sorted on the number of HITMs, holds the interesting data
we want to report and will be displayed in a follow-up patch.

For now, organize the data properly.
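
A rough sketch of the grouping described above, not the code in this patch.
It uses a plain array and qsort() in place of the two rbtrees, and it assumes
the input is already sorted by data address.  struct ex_sample, struct
ex_line and ex_group_by_cacheline() are hypothetical names; CLADRS() follows
the macro added in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINESIZE   64
#define CLINE_OFFSET_MSK ((uint64_t)CACHE_LINESIZE - 1)
#define CLADRS(a)        ((a) & ~CLINE_OFFSET_MSK)

/* hypothetical stand-in for one decoded record */
struct ex_sample {
	uint64_t daddr;   /* data address of the access */
	int      is_hitm; /* non-zero if the load was a HITM */
};

/* hypothetical per-cacheline summary */
struct ex_line {
	uint64_t cacheline;
	int      hitms;
};

/* order summaries so the line with the most HITMs comes first */
static int ex_cmp_hottest(const void *a, const void *b)
{
	const struct ex_line *l = a, *r = b;

	if (l->hitms != r->hitms)
		return (l->hitms < r->hitms) ? 1 : -1;
	return 0;
}

/*
 * 'in' must be sorted by daddr.  Stores up to 'max' cachelines that saw at
 * least one HITM into 'out', hottest first, and returns how many were stored.
 */
static size_t ex_group_by_cacheline(const struct ex_sample *in, size_t n,
				    struct ex_line *out, size_t max)
{
	struct ex_line cur = { 0, 0 };
	int have_cur = 0;
	size_t i, nr = 0;

	assert(n == 0 || in != NULL);

	for (i = 0; i < n; i++) {
		uint64_t cl = CLADRS(in[i].daddr);

		/* new cacheline: process the previous one first */
		if (!have_cur || cl != cur.cacheline) {
			if (have_cur && cur.hitms && nr < max)
				out[nr++] = cur;
			cur.cacheline = cl;
			cur.hitms = 0;
			have_cur = 1;
		}

		if (in[i].is_hitm)
			cur.hitms++;
	}

	/* last chunk */
	if (have_cur && cur.hitms && nr < max)
		out[nr++] = cur;

	if (nr > 1)
		qsort(out, nr, sizeof(*out), ex_cmp_hottest);

	return nr;
}
```

Cachelines without a single HITM are dropped at the boundary, which is the
same filtering the commit message describes.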

V2: re-work using hist_entries

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 201 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 201 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 5deb7cc..30cb8c3 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -58,6 +58,25 @@ struct perf_c2c {
 	struct c2c_stats	stats;
 };
 
+#define CACHE_LINESIZE       64
+#define CLINE_OFFSET_MSK     (CACHE_LINESIZE - 1)
+#define CLADRS(a)            ((a) & ~(CLINE_OFFSET_MSK))
+#define CLOFFSET(a)          (int)((a) &  (CLINE_OFFSET_MSK))
+
+struct c2c_hit {
+	struct rb_node	rb_node;
+	struct rb_root  tree;
+	struct list_head list;
+	u64		cacheline;
+	int		color;
+	struct c2c_stats	stats;
+	pid_t		pid;
+	pid_t		tid;
+	u64		daddr;
+	u64		iaddr;
+	struct mem_info	*mi;
+};
+
 enum { OP, LVL, SNP, LCK, TLB };
 
 #define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
@@ -153,6 +172,44 @@ static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 	return printed;
 }
 
+static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_hit *he;
+	int64_t cmp;
+	u64 l_hitms, r_hitms;
+
+	p = &root->rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		he = rb_entry(parent, struct c2c_hit, rb_node);
+
+		/* sort on remote hitms first */
+		l_hitms = he->stats.t.rmt_hitm;
+		r_hitms = h->stats.t.rmt_hitm;
+		cmp = r_hitms - l_hitms;
+
+		if (!cmp) {
+			/* sort on local hitms */
+			l_hitms = he->stats.t.lcl_hitm;
+			r_hitms = h->stats.t.lcl_hitm;
+			cmp = r_hitms - l_hitms;
+		}
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&h->rb_node, parent, p);
+	rb_insert_color(&h->rb_node, root);
+
+	return 0;
+}
+
 static int perf_c2c__fprintf_header(FILE *fp)
 {
 	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
@@ -305,6 +362,50 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
 	return err;
 }
 
+static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
+{
+	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+
+	if (!h) {
+		pr_err("Could not allocate c2c_hit memory\n");
+		return NULL;
+	}
+
+	CPU_ZERO(&h->stats.cpuset);
+	INIT_LIST_HEAD(&h->list);
+	init_stats(&h->stats.stats);
+	h->tree = RB_ROOT;
+	h->cacheline = cacheline;
+	h->pid = entry->thread->pid_;
+	h->tid = entry->thread->tid;
+
+	/* use original addresses here, not adjusted al_addr */
+	h->iaddr = entry->mem_info->iaddr.addr;
+	h->daddr = entry->mem_info->daddr.addr;
+
+	h->mi = entry->mem_info;
+	return h;
+}
+
+static void c2c_hit__update_strings(struct c2c_hit *h,
+				    struct hist_entry *n)
+{
+	if (h->pid != n->thread->pid_)
+		h->pid = -1;
+
+	if (h->tid != n->thread->tid)
+		h->tid = -1;
+
+	/* use original addresses here, not adjusted al_addr */
+	if (h->iaddr != n->mem_info->iaddr.addr)
+		h->iaddr = -1;
+
+	if (CLADRS(h->daddr) != CLADRS(n->mem_info->daddr.addr))
+		h->daddr = -1;
+
+	CPU_SET(n->cpu, &h->stats.cpuset);
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -420,6 +521,104 @@ err:
 	return err;
 }
 
+#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
+
+static void c2c_hit__update_stats(struct c2c_stats *new,
+				  struct c2c_stats *old)
+{
+	new->t.load		+= old->t.load;
+	new->t.ld_fbhit		+= old->t.ld_fbhit;
+	new->t.ld_l1hit		+= old->t.ld_l1hit;
+	new->t.ld_l2hit		+= old->t.ld_l2hit;
+	new->t.ld_llchit	+= old->t.ld_llchit;
+	new->t.locks		+= old->t.locks;
+	new->t.lcl_dram		+= old->t.lcl_dram;
+	new->t.rmt_dram		+= old->t.rmt_dram;
+	new->t.lcl_hitm		+= old->t.lcl_hitm;
+	new->t.rmt_hitm		+= old->t.rmt_hitm;
+	new->t.rmt_hit		+= old->t.rmt_hit;
+	new->t.store		+= old->t.store;
+	new->t.st_l1hit		+= old->t.st_l1hit;
+
+	new->total_period	+= old->total_period;
+}
+
+static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
+{
+	return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
+		(dsrc->mem_op & P(OP,STORE)));
+}
+
+static void c2c_analyze_hitms(struct perf_c2c *c2c)
+{
+
+	struct rb_node *next = rb_first(c2c->hists.entries_in);
+	struct hist_entry *he;
+	struct c2c_hit *h = NULL;
+	struct c2c_stats hitm_stats;
+	struct rb_root hitm_tree = RB_ROOT;
+	int shared_clines = 0;
+	u64 cl = 0;
+
+	memset(&hitm_stats, 0, sizeof(struct c2c_stats));
+
+	/* find HITMs */
+	while (next) {
+		he = rb_entry(next, struct hist_entry, rb_node_in);
+		next = rb_next(&he->rb_node_in);
+
+		cl = he->mem_info->daddr.al_addr;
+
+		/* switch cache line objects */
+		/* 'color' forces a boundary change based on the original sort */
+		if (!h || !he->color || (CLADRS(cl) != h->cacheline)) {
+			if (h && HAS_HITMS(h)) {
+				c2c_hit__update_stats(&hitm_stats, &h->stats);
+
+				/* sort based on hottest cacheline */
+				c2c_hitm__add_to_list(&hitm_tree, h);
+				shared_clines++;
+			} else {
+				/* stores-only are un-interesting */
+				free(h);
+			}
+			h = c2c_hit__new(CLADRS(cl), he);
+			if (!h)
+				goto cleanup;
+		}
+
+
+		c2c_decode_stats(&h->stats, he);
+
+		/* filter out non-hitms as un-interesting noise */
+		if (valid_hitm_or_store(&he->mem_info->data_src)) {
+			/* save the entry for later processing */
+			list_add_tail(&he->pairs.node, &h->list);
+
+			c2c_hit__update_strings(h, he);
+		}
+	}
+
+	/* last chunk */
+	if (HAS_HITMS(h)) {
+		c2c_hit__update_stats(&hitm_stats, &h->stats);
+		c2c_hitm__add_to_list(&hitm_tree, h);
+		shared_clines++;
+	} else
+		free(h);
+
+cleanup:
+	next = rb_first(&hitm_tree);
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+		rb_erase(&h->rb_node, &hitm_tree);
+
+		free(h);
+	}
+	return;
+}
+
 static int perf_c2c__process_events(struct perf_session *session,
 				    struct perf_c2c *c2c)
 {
@@ -431,6 +630,8 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	c2c_analyze_hitms(c2c);
+
 err:
 	return err;
 }
-- 
1.7.11.7



* [PATCH 14/19] perf, c2c: Display cacheline HITM analysis to stdout
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (12 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 13/19] perf, c2c: Sort based on hottest cache line Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 15/19] perf, c2c: Add callchain support Don Zickus
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This patch mainly focuses on processing and displaying the collected
HITMs to stdout.  Most of it is just printing data in a pretty way.

There is one trick used when walking the cacheline.  When we get this
far we have two rbtrees.  One rbtree holds every record sorted on a
unique id (using the mmap2 decoder), the other rbtree holds every
cacheline with at least one HITM sorted on number of HITMs.

To display the output, the tool walks the second rbtree to display
the hottest cachelines.  Inside each hot cacheline, the tool displays
the offsets and the loads/stores it generates.  To determine the
cacheline offset, it uses the linked list inside the cacheline element
to walk the first rbtree elements for that particular cacheline.

The first rbtree elements are already sorted correctly in offset order, so
processing the offsets is fairly trivial and is done sequentially.

This is why you will see two while loops in print_c2c_hitm_report():
the outer one walks the cachelines, the inner one walks the offsets.
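
A small sketch of that two-loop shape, not the code in this patch.  It
assumes a flat array of data addresses sorted in ascending order, so the
outer loop visits cachelines in address order rather than hottest first.
struct ex_offset_row and ex_walk_offsets() are hypothetical names; CLADRS()
and CLOFFSET() follow the macros in builtin-c2c.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINESIZE   64
#define CLINE_OFFSET_MSK ((uint64_t)CACHE_LINESIZE - 1)
#define CLADRS(a)        ((a) & ~CLINE_OFFSET_MSK)
#define CLOFFSET(a)      ((int)((a) & CLINE_OFFSET_MSK))

/* hypothetical output row: one offset inside one cacheline */
struct ex_offset_row {
	uint64_t cacheline;
	int      offset;
	int      count;   /* number of records seen at this offset */
};

/*
 * 'daddr' must be sorted ascending.  Fills 'rows' with one entry per
 * (cacheline, offset) pair, up to 'max', and returns the number filled.
 */
static size_t ex_walk_offsets(const uint64_t *daddr, size_t n,
			      struct ex_offset_row *rows, size_t max)
{
	size_t i = 0, nr = 0;

	assert(n == 0 || daddr != NULL);

	/* outer loop: one pass per cacheline */
	while (i < n) {
		uint64_t cl = CLADRS(daddr[i]);

		/* inner loop: records of this cacheline, in offset order */
		while (i < n && CLADRS(daddr[i]) == cl) {
			int off = CLOFFSET(daddr[i]);

			if (nr > 0 && rows[nr - 1].cacheline == cl &&
			    rows[nr - 1].offset == off) {
				/* same offset as the previous record */
				rows[nr - 1].count++;
			} else if (nr < max) {
				rows[nr].cacheline = cl;
				rows[nr].offset = off;
				rows[nr].count = 1;
				nr++;
			}
			i++;
		}
	}

	return nr;
}
```

Because the records arrive in address order, records at the same offset are
adjacent and can be merged without any lookup.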

A knob has been added to display node information, which is useful
to see which cpus are involved in the contention and their nodes.

Another knob has been added to change the coalescing levels.  You can
coalesce the output based on pid, tid, ip, and/or symbol.

Original output and statistics done by Dick Fowles, backported by me.

Sample output:

=================================================
    Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines          :       1327
  Load HITs on shared lines         :     167131
  Fill Buffer Hits on shared lines  :      43469
  L1D hits on shared lines          :      50497
  L2D hits on shared lines          :        960
  LLC hits on shared lines          :      38467
  Locked Access on shared lines     :     100032
  Store HITs on shared lines        :     118659
  Store L1D hits on shared lines    :     113783
  Total Merged records              :     160807

===========================================================================================================================================================

                                                                                      Shared Cache Line Distribution Pareto

     ---- All ----  -- Shared --    ---- HITM ----                                                                        Load Inst Execute Latency
       Data Misses   Data Misses   Remote    Local  -- Store Refs --
                                                                                                                          ---- cycles ----             cpu
 Num  %dist  %cumm  %dist  %cumm  LLCmiss   LLChit   L1 hit  L1 Miss       Data Address    Pid    Tid       Inst Address   median     mean     CV      cnt
==========================================================================================================================================================
-----------------------------------------------------------------------------------------------
   0  17.0%  17.0%  23.3%  23.3%     6238     3288    28098      813 0xffff881fa55b0140    ***
-----------------------------------------------------------------------------------------------
                                     0.0%     0.0%     0.0%     0.0%               0x00    375    375 0xffffffffa018ff5b      n/a      n/a      n/a      1
                                     0.0%     0.0%     0.0%     0.0%               0x08  18156  18156 0xffffffffa018b7f9       -1      384     0.0%      1
                                     0.2%     0.0%     0.0%     0.0%               0x10  18156  18156 0xffffffff811ca1aa       -1      387    10.7%      7
                                     0.0%     0.0%    23.2%     0.0%               0x18  18156  18156 0xffffffff815c1615       -1      684     0.0%     50

-----------------------------------------------------------------------------------------------
   1   5.3%  22.3%   7.3%  30.6%     1944     1143     7916        0 0xffff881fba47f000  18156
-----------------------------------------------------------------------------------------------
                                   100.0%   100.0%     0.0%     0.0%               0x00  18156  18156 0xffffffffa01b410e       -1      401    13.5%     50
                                     0.0%     0.0%    10.1%     0.0%               0x28  18156  18156 0xffffffffa0167409      n/a      n/a      n/a     50
                                     0.0%     0.0%    89.9%     0.0%               0x28  18156  18156 0xffffffff815c4be9      n/a      n/a      n/a     50

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 519 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 519 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 30cb8c3..c5f4b5a 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -58,10 +58,13 @@ struct perf_c2c {
 	struct c2c_stats	stats;
 };
 
+#define DISPLAY_LINE_LIMIT  0.0015
 #define CACHE_LINESIZE       64
 #define CLINE_OFFSET_MSK     (CACHE_LINESIZE - 1)
 #define CLADRS(a)            ((a) & ~(CLINE_OFFSET_MSK))
 #define CLOFFSET(a)          (int)((a) &  (CLINE_OFFSET_MSK))
+#define MAXTITLE_SZ          400
+#define MAXLBL_SZ            256
 
 struct c2c_hit {
 	struct rb_node	rb_node;
@@ -102,6 +105,11 @@ enum { OP, LVL, SNP, LCK, TLB };
 #define LCL_HITM(a,b)        (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
 #define LCL_MEM(a)           (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
 
+enum { LVL0, LVL1, LVL2, LVL3, LVL4, MAX_LVL };
+static int cloffset = LVL1;
+static int node_info = 0;
+static int coalesce_level = LVL1;
+
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
 #define PREFIX       "["
@@ -406,6 +414,80 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
 	CPU_SET(n->cpu, &h->stats.cpuset);
 }
 
+static inline bool matching_coalescing(struct c2c_hit *h,
+				       struct hist_entry *e)
+{
+	bool value = false;
+	struct mem_info *mi = e->mem_info;
+
+	if (coalesce_level > MAX_LVL)
+		printf("DON: bad coalesce level %d\n", coalesce_level);
+
+	if (e->cpumode != PERF_RECORD_MISC_KERNEL) {
+
+		switch (coalesce_level) {
+
+		case LVL0:
+		case LVL1:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->tid   == e->thread->tid) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL2:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL3:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL4:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->mi->iaddr.sym == mi->iaddr.sym));
+			break;
+
+		default:
+			break;
+
+		}
+
+	} else {
+
+		switch (coalesce_level) {
+
+		case LVL0:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->tid   == e->thread->tid) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL1:
+		case LVL2:
+		case LVL3:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL4:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->mi->iaddr.sym == mi->iaddr.sym));
+			break;
+
+		default:
+			break;
+
+		}
+	}
+
+	return value;
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -543,12 +625,442 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
+static void print_hitm_cacheline_header(void)
+{
+#define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
+#define PARTICIPANTS1         "Node{cpus %hitms %stores} Node{cpus %hitms %stores} ..."
+#define PARTICIPANTS2         "Node{cpu list}; Node{cpu list}; Node{cpu list}; ..."
+
+	int i;
+	const  char *docptr;
+	static char  delimit[MAXTITLE_SZ];
+	static char  title2[MAXTITLE_SZ];
+	int       pad;
+
+	docptr = " ";
+	if (node_info == 1)
+		docptr = PARTICIPANTS1;
+	if (node_info  == 2)
+		docptr = PARTICIPANTS2;
+
+	sprintf(title2, "%4s %6s %6s %6s %6s %8s %8s %8s %8s %18s %6s %6s %18s %8s %8s %8s %6s %-30s %-20s %s",
+			"Num",
+			"%dist",
+			"%cumm",
+			"%dist",
+			"%cumm",
+			"LLCmiss",
+			"LLChit",
+			"L1 hit",
+			"L1 Miss",
+			"Data Address",
+			"Pid",
+			"Tid",
+			"Inst Address",
+			"median",
+			"mean",
+			"CV  ",
+			"cnt",
+			"Symbol",
+			"Object",
+			docptr);
+
+	for (i = 0; i < (int)strlen(title2); i++) strcat(delimit, "=");
+
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("\n");
+
+	pad = (strlen(title2)/2) - (strlen(SHARING_REPORT_TITLE)/2);
+	for (i = 0; i < pad; i++) printf(" ");
+	printf("%s\n", SHARING_REPORT_TITLE);
+
+	printf("\n");
+	printf("%4s %13s %13s %17s %8s %8s %18s %6s %6s %18s %26s %6s %30s %20s %s\n",
+		" ",
+		"---- All ----",
+		"-- Shared --",
+		"---- HITM ----",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		"Load Inst Execute Latency",
+		" ",
+		" ",
+		" ",
+		node_info  ? "Shared Data Participants" : " ");
+
+
+	printf("%4s %13s %13s %8s %8s %17s %18s %6s %6s %17s %18s\n",
+		" ",
+		" Data Misses",
+		" Data Misses",
+		"Remote",
+		"Local",
+		"-- Store Refs --",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ");
+
+	printf("%4s %13s %13s %8s %8s %8s %8s %18s %6s %6s %17s %18s %8s %6s\n",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		"---- cycles ----",
+		" ",
+		"cpu");
+
+	printf("%s\n", title2);
+	printf("%s\n", delimit);
+}
+
+static void print_hitm_cacheline(struct c2c_hit *h,
+				 int record,
+				 double tot_cumm,
+				 double ld_cumm,
+				 double tot_dist,
+				 double ld_dist)
+{
+	char pidstr[7];
+	char addrstr[20];
+	static char  summary[MAXLBL_SZ];
+	int j;
+
+	if (h->pid > 0)
+		sprintf(pidstr, "%6d", h->pid);
+	else
+		sprintf(pidstr, "***");
+	/*
+	 * It is possible to have non-distinct virtual addresses
+	 * pointing to a distinct System V shared memory region.
+	 * If there are multiple virtual addresses, the address
+	 * field will be asterisks. It would be possible to substitute
+	 * the physical address, but this could be confusing, as
+	 * sometimes the field would be a virtual address while other
+	 * times it would be a physical address.
+	 */
+	if (h->daddr != ~0UL)
+		sprintf(addrstr, "%#18lx", CLADRS(h->daddr));
+	else
+		sprintf(addrstr, "****************");
+
+
+	sprintf(summary, "%4d %5.1f%% %5.1f%% %5.1f%% %5.1f%% %8d %8d %8d %8d %18s %6s\n",
+			record,
+			tot_dist * 100.,
+			tot_cumm * 100.,
+			ld_dist * 100.,
+			ld_cumm * 100.,
+			h->stats.t.rmt_hitm,
+			h->stats.t.lcl_hitm,
+			h->stats.t.st_l1hit,
+			h->stats.t.st_l1miss,
+			addrstr,
+			pidstr);
+
+	for (j = 0; j < (int)strlen(summary); j++) printf("-");
+	printf("\n");
+	printf("%s", summary);
+	for (j = 0; j < (int)strlen(summary); j++) printf("-");
+	printf("\n");
+}
+
+static void print_socket_stats_str(struct c2c_hit *clo,
+				   struct c2c_stats *node_stats)
+{
+	int i, j;
+
+	if (!node_stats)
+		return;
+
+	for (i = 0; i < max_node_num; i++) {
+		struct c2c_stats *stats = &node_stats[i];
+		int num = CPU_COUNT(&stats->cpuset);
+
+		if (!num) {
+			/* pad align socket info */
+			for (j = 0; j < 21; j++)
+				printf(" ");
+			continue;
+		}
+
+		printf("%2d{%2d ", i, num);
+
+		if (clo->stats.t.rmt_hitm > 0)
+			printf("%5.1f%% ", 100. * ((double)stats->t.rmt_hitm / (double) clo->stats.t.rmt_hitm));
+		else
+			printf("%6s ", "n/a");
+
+		if (clo->stats.t.store > 0)
+			printf("%5.1f%%} ", 100. * ((double)stats->t.store / (double)clo->stats.t.store));
+		else
+			printf("%6s} ", "n/a");
+	}
+}
+
+static void print_socket_shared_str(struct c2c_stats *node_stats)
+{
+	int i, j;
+
+	if (!node_stats)
+		return;
+
+	for (i = 0; i < max_node_num; i++) {
+		struct c2c_stats *stats = &node_stats[i];
+		int num = CPU_COUNT(&stats->cpuset);
+		int start = -1;
+		bool first = true;
+
+		if (!num)
+			continue;
+
+		printf("%d{", i);
+
+		for (j = 0; j < max_cpu_num; j++) {
+			if (!CPU_ISSET(j, &stats->cpuset)) {
+				if (start != -1) {
+					if ((j-1) - start)
+						/* print the range */
+						printf("%s%d-%d", (first ? "" : ","), start, j-1);
+					else
+						/* standalone */
+						printf("%s%d", (first ? "" : ",") , start);
+					start = -1;
+					first = false;
+				}
+				continue;
+			}
+
+			if (start == -1)
+				start = j;
+		}
+		/* last chunk */
+		if (start != -1) {
+			if ((j-1) - start)
+				/* print the range */
+				printf("%s%d-%d", (first ? "" : ","), start, j-1);
+			else
+				/* standalone */
+				printf("%s%d", (first ? "" : ",") , start);
+		}
+
+		printf("}; ");
+	}
+}
+
+static void print_hitm_cacheline_offset(struct c2c_hit *clo,
+					struct c2c_hit *h,
+					struct c2c_stats *node_stats)
+{
+#define SHORT_STR_LEN	7
+#define LONG_STR_LEN	30
+
+	char pidstr[SHORT_STR_LEN];
+	char tidstr[SHORT_STR_LEN];
+	char addrstr[LONG_STR_LEN];
+	char latstr[LONG_STR_LEN];
+	char objptr[LONG_STR_LEN];
+	char symptr[LONG_STR_LEN];
+	struct c2c_stats *stats = &clo->stats;
+	struct addr_map_symbol *ams;
+
+	ams = &clo->mi->iaddr;
+
+	if (clo->pid >= 0)
+		snprintf(pidstr, SHORT_STR_LEN, "%6d", clo->pid);
+	else
+		sprintf(pidstr, "***");
+
+	if (clo->tid >= 0)
+		snprintf(tidstr, SHORT_STR_LEN, "%6d", clo->tid);
+	else
+		sprintf(tidstr, "***");
+
+	if (clo->iaddr != ~0UL)
+		snprintf(addrstr, LONG_STR_LEN, "%#18lx", clo->iaddr);
+	else
+		sprintf(addrstr, "****************");
+	snprintf(objptr, LONG_STR_LEN, "%-18s", ams->map->dso->short_name);
+	snprintf(symptr, LONG_STR_LEN, "%-18s", (ams->sym ? ams->sym->name : "?????"));
+
+	if (stats->t.rmt_hitm > 0) {
+		double mean = avg_stats(&stats->stats);
+		double std = stddev_stats(&stats->stats);
+
+		sprintf(latstr, "%8.0f %8.0f %7.1f%%",
+			-1.0, /* FIXME */
+			mean,
+			rel_stddev_stats(std, mean));
+	} else {
+		sprintf(latstr, "%8s %8s %8s",
+			"n/a",
+			"n/a",
+			"n/a");
+
+	}
+
+	/*
+	 * implicit assumption that we are not coalescing over IPs
+	 */
+	printf("%4s %6s %6s %6s %6s %7.1f%% %7.1f%% %7.1f%% %7.1f%% %14s0x%02lx %6s %6s %18s %8s %6d %-30s %-20s",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		(stats->t.rmt_hitm  > 0) ? (100. * ((double)stats->t.rmt_hitm  / (double)h->stats.t.rmt_hitm))  : 0.0,
+		(stats->t.lcl_hitm  > 0) ? (100. * ((double)stats->t.lcl_hitm  / (double)h->stats.t.lcl_hitm))  : 0.0,
+		(stats->t.st_l1hit  > 0) ? (100. * ((double)stats->t.st_l1hit  / (double)h->stats.t.st_l1hit))  : 0.0,
+		(stats->t.st_l1miss > 0) ? (100. * ((double)stats->t.st_l1miss / (double)h->stats.t.st_l1miss)) : 0.0,
+		" ",
+		(cloffset == LVL2) ? (clo->daddr & 0xff) : CLOFFSET(clo->daddr),
+		pidstr,
+		tidstr,
+		addrstr,
+		latstr,
+		CPU_COUNT(&clo->stats.cpuset),
+		symptr,
+		objptr);
+
+	if (node_info == 0)
+		printf("  ");
+	else if (node_info  == 1)
+		print_socket_stats_str(clo, node_stats);
+	else if (node_info  == 2)
+		print_socket_shared_str(node_stats);
+
+	printf("\n");
+}
+
+static void print_c2c_hitm_report(struct rb_root *hitm_tree,
+				  struct c2c_stats *hitm_stats __maybe_unused,
+				  struct c2c_stats *c2c_stats)
+{
+	struct rb_node	*next = rb_first(hitm_tree);
+	struct c2c_hit	*h, *clo = NULL;
+	u64		addr;
+	double		tot_dist, tot_cumm;
+	double		ld_dist, ld_cumm;
+	int		llc_misses;
+	int		record = 0;
+	struct c2c_stats *node_stats = NULL;
+
+	if (node_info) {
+		node_stats = zalloc(sizeof(struct c2c_stats) * cpu_map__max_node());
+		if (!node_stats) {
+			printf("Can not allocate stats for node output\n");
+			return;
+		}
+	}
+
+	print_hitm_cacheline_header();
+
+	llc_misses = c2c_stats->t.lcl_dram +
+		     c2c_stats->t.rmt_dram +
+		     c2c_stats->t.rmt_hit +
+		     c2c_stats->t.rmt_hitm;
+
+	/*
+	 * generate distinct cache line report
+	 */
+	tot_cumm = 0.0;
+	ld_cumm  = 0.0;
+
+	while (next) {
+		struct hist_entry *entry;
+
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		tot_dist  = ((double)h->stats.t.rmt_hitm / llc_misses);
+		tot_cumm += tot_dist;
+
+		ld_dist  = ((double)h->stats.t.rmt_hitm / c2c_stats->t.rmt_hitm);
+		ld_cumm += ld_dist;
+
+		/*
+		 * don't display lines with insignificant sharing contribution
+		 */
+		if (ld_dist < DISPLAY_LINE_LIMIT)
+			break;
+
+		print_hitm_cacheline(h, record, tot_cumm, ld_cumm, tot_dist, ld_dist);
+
+		list_for_each_entry(entry, &h->list, pairs.node) {
+
+			if (!clo || !matching_coalescing(clo, entry)) {
+				if (clo)
+					print_hitm_cacheline_offset(clo, h, node_stats);
+
+				free(clo);
+				addr = entry->mem_info->iaddr.al_addr;
+				clo = c2c_hit__new(addr, entry);
+				if (node_info)
+					memset(node_stats, 0, sizeof(struct c2c_stats) * cpu_map__max_node());
+			}
+			c2c_decode_stats(&clo->stats, entry);
+			c2c_hit__update_strings(clo, entry);
+
+			if (node_info) {
+				int node = cpu_map__get_node(entry->cpu);
+				c2c_decode_stats(&node_stats[node], entry);
+				CPU_SET(entry->cpu, &(node_stats[node].cpuset));
+			}
+
+		}
+		if (clo) {
+			print_hitm_cacheline_offset(clo, h, node_stats);
+			free(clo);
+			clo = NULL;
+		}
+
+		if (node_info)
+			memset(node_stats, 0, sizeof(struct c2c_stats) * cpu_map__max_node());
+
+		printf("\n");
+		record++;
+	}
+}
+
 static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
 {
 	return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
 		(dsrc->mem_op & P(OP,STORE)));
 }
 
+static void print_shared_cacheline_info(struct c2c_stats *stats, int cline_cnt)
+{
+	int hitm_cnt = stats->t.lcl_hitm + stats->t.rmt_hitm;
+
+	printf("=================================================\n");
+	printf("    Global Shared Cache Line Event Information   \n");
+	printf("=================================================\n");
+	printf("  Total Shared Cache Lines          : %10d\n", cline_cnt);
+	printf("  Load HITs on shared lines         : %10d\n", stats->t.load);
+	printf("  Fill Buffer Hits on shared lines  : %10d\n", stats->t.ld_fbhit);
+	printf("  L1D hits on shared lines          : %10d\n", stats->t.ld_l1hit);
+	printf("  L2D hits on shared lines          : %10d\n", stats->t.ld_l2hit);
+	printf("  LLC hits on shared lines          : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+	printf("  Locked Access on shared lines     : %10d\n", stats->t.locks);
+	printf("  Store HITs on shared lines        : %10d\n", stats->t.store);
+	printf("  Store L1D hits on shared lines    : %10d\n", stats->t.st_l1hit);
+	printf("  Total Merged records              : %10d\n", hitm_cnt + stats->t.store);
+}
+
 static void c2c_analyze_hitms(struct perf_c2c *c2c)
 {
 
@@ -607,6 +1119,9 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
 	} else
 		free(h);
 
+	print_shared_cacheline_info(&hitm_stats, shared_clines);
+	print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
+
 cleanup:
 	next = rb_first(&hitm_tree);
 	while (next) {
@@ -758,6 +1273,10 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	};
 	const struct option c2c_options[] = {
 	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+	OPT_INCR('N', "node-info", &node_info,
+		 "show extra node info in report (repeat for more info)"),
+	OPT_INTEGER('c', "coalesce-level", &coalesce_level,
+		 "how much coalescing for tid, pid, and ip is done (higher levels coalesce more)"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show counter open errors, etc)"),
 	OPT_STRING('i', "input", &input_name, "file",
-- 
1.7.11.7



* [PATCH 15/19] perf, c2c: Add callchain support
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (13 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 14/19] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-03-19 13:00   ` Jiri Olsa
  2014-02-28 17:43 ` [PATCH 16/19] perf, c2c: Output summary stats Don Zickus
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Seeing cacheline statistics is useful by itself.  Seeing the callchain
for these cache contentions saves time tracking things down.

This patch tries to add callchain support.  I had to use the generic
interface from a previous patch to output things to stdout easily.

Other than displaying the results, collecting the callchain and
merging it was fairly straightforward.
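
For readers following along, here is a minimal sketch of the allocation
pattern the patch uses for the per-hit callchain root: a trailing array
member that only gets storage when callchains are enabled.  The names
struct chain_root, struct hit and hit_new() are hypothetical stand-ins,
not the perf definitions, and the sketch uses a C99 flexible array member
where the patch uses a GNU zero-length array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for perf's struct callchain_root. */
struct chain_root {
	unsigned long long max_depth;
	unsigned long long hits;
};

/* Hypothetical, trimmed-down stand-in for struct c2c_hit. */
struct hit {
	unsigned long long cacheline;
	struct chain_root callchain[];	/* must be last member */
};

/*
 * Allocate a hit, reserving room for the trailing callchain root
 * only when callchains are in use.
 */
static struct hit *hit_new(unsigned long long cacheline, bool use_callchain)
{
	size_t extra = use_callchain ? sizeof(struct chain_root) : 0;
	struct hit *h = calloc(1, sizeof(*h) + extra);

	if (!h)
		return NULL;

	h->cacheline = cacheline;

	/* h->callchain may only be touched when storage was reserved */
	if (use_callchain)
		h->callchain[0].max_depth = 0;

	return h;
}
```

The cost of this layout is that every user of the trailing member has to
re-check the same condition that was used at allocation time, which is
why the patch guards each callchain access with symbol_conf.use_callchain.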

I used a lot of copying-n-pasting from other builtin tools to get
the initial parameter setup correctly and the automatic reading of
'symbol_conf.use_callchain' from the data file.
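
As an illustration of the parameter setup, a simplified sketch of parsing
a call-graph option string such as "fractal,0.5,callee" follows.  It only
handles mode, min percent and order (no print limit or sort key), treats
empty fields as errors, and every name in it (struct chain_opts,
parse_chain_opts(), next_field()) is hypothetical rather than taken from
perf.

```c
#include <stdlib.h>
#include <string.h>

enum chain_mode { MODE_NONE, MODE_FLAT, MODE_GRAPH_ABS, MODE_GRAPH_REL };
enum chain_order { ORDER_CALLEE_FIRST, ORDER_CALLER_FIRST };

struct chain_opts {
	enum chain_mode mode;
	double min_percent;
	enum chain_order order;
};

/*
 * Copy the next comma-separated field of *p into out and advance *p.
 * Returns 0 once no field remains.
 */
static int next_field(const char **p, char *out, size_t outsz)
{
	size_t len, n;

	if (!*p)
		return 0;

	len = strcspn(*p, ",");
	n = len < outsz - 1 ? len : outsz - 1;
	memcpy(out, *p, n);
	out[n] = '\0';
	*p = ((*p)[len] == ',') ? *p + len + 1 : NULL;
	return 1;
}

/*
 * Parse "mode[,min_percent[,order]]".  Fields that are absent leave
 * the caller's defaults in *o untouched.  Returns 0 on success and
 * -1 on malformed input.
 */
static int parse_chain_opts(const char *arg, struct chain_opts *o)
{
	char tok[32];
	char *end;

	if (!next_field(&arg, tok, sizeof(tok)))
		return -1;

	if (!strcmp(tok, "graph"))
		o->mode = MODE_GRAPH_ABS;
	else if (!strcmp(tok, "flat"))
		o->mode = MODE_FLAT;
	else if (!strcmp(tok, "fractal"))
		o->mode = MODE_GRAPH_REL;
	else if (!strcmp(tok, "none")) {
		o->mode = MODE_NONE;
		return 0;
	} else
		return -1;

	if (!next_field(&arg, tok, sizeof(tok)))
		return 0;
	o->min_percent = strtod(tok, &end);
	if (end == tok)
		return -1;

	if (!next_field(&arg, tok, sizeof(tok)))
		return 0;
	if (!strcmp(tok, "caller"))
		o->order = ORDER_CALLER_FIRST;
	else if (!strcmp(tok, "callee"))
		o->order = ORDER_CALLEE_FIRST;
	else
		return -1;

	return 0;
}
```

Unlike the strtok() approach in the patch, this sketch never writes to
the argument string, so it can be handed a string literal.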

Hopefully this is all correct.  The amount of memory corruption (from the
callchain dynamic array) seems to have dwindled down to nothing. :-)

V2: update to latest api

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 150 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c5f4b5a..8756ca5 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -52,6 +52,7 @@ struct c2c_stats {
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
+	bool		 call_graph;
 	struct hists	 hists;
 
 	/* stats */
@@ -78,6 +79,8 @@ struct c2c_hit {
 	u64		daddr;
 	u64		iaddr;
 	struct mem_info	*mi;
+
+	struct callchain_root   callchain[0]; /* must be last member */
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -372,7 +375,8 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
 
 static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
 {
-	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+	size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit) + callchain_size);
 
 	if (!h) {
 		pr_err("Could not allocate c2c_hit memory\n");
@@ -386,6 +390,8 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
 	h->cacheline = cacheline;
 	h->pid = entry->thread->pid_;
 	h->tid = entry->thread->tid;
+	if (symbol_conf.use_callchain)
+		callchain_init(h->callchain);
 
 	/* use original addresses here, not adjusted al_addr */
 	h->iaddr = entry->mem_info->iaddr.addr;
@@ -509,6 +515,10 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 		return 0;
 	}
 
+	err = sample__resolve_callchain(sample, &parent, evsel, al, PERF_MAX_STACK_DEPTH);
+	if (err)
+		return err;
+
 	cost = sample->weight;
 	if (!cost)
 		cost = 1;
@@ -544,8 +554,9 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 	if (err)
 		goto out;
 
-	c2c->hists.stats.total_period += cost;
-	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+        c2c->hists.stats.total_period += cost;
+        hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+        err = hist_entry__append_callchain(he, sample);
 	return err;
 
 out_mem:
@@ -944,6 +955,13 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
 		print_socket_shared_str(node_stats);
 
 	printf("\n");
+
+	if (symbol_conf.use_callchain) {
+		generic_entry_callchain__fprintf(clo->callchain,
+						 h->stats.total_period,
+						 clo->stats.total_period,
+						 23, stdout);
+	}
 }
 
 static void print_c2c_hitm_report(struct rb_root *hitm_tree,
@@ -1020,6 +1038,12 @@ static void print_c2c_hitm_report(struct rb_root *hitm_tree,
 				c2c_decode_stats(&node_stats[node], entry);
 				CPU_SET(entry->cpu, &(node_stats[node].cpuset));
 			}
+			if (symbol_conf.use_callchain) {
+				callchain_cursor_reset(&callchain_cursor);
+				callchain_merge(&callchain_cursor,
+						clo->callchain,
+						entry->callchain);
+			}
 
 		}
 		if (clo) {
@@ -1151,6 +1175,30 @@ err:
 	return err;
 }
 
+static int perf_c2c__setup_sample_type(struct perf_c2c *c2c,
+				       struct perf_session *session)
+{
+	u64 sample_type = perf_evlist__combined_sample_type(session->evlist);
+
+	if (!(sample_type & PERF_SAMPLE_CALLCHAIN)) {
+		if (symbol_conf.use_callchain) {
+			printf("Selected -g but no callchain data. Did "
+				  "you call 'perf c2c record' without -g?\n");
+			return -1;
+		}
+	} else if (callchain_param.mode != CHAIN_NONE &&
+		   !symbol_conf.use_callchain) {
+			symbol_conf.use_callchain = true;
+			c2c->call_graph = true;
+			if (callchain_register_param(&callchain_param) < 0) {
+				printf("Can't register callchain params.\n");
+				return -EINVAL;
+			}
+	}
+
+	return 0;
+}
+
 static int perf_c2c__read_events(struct perf_c2c *c2c)
 {
 	int err = -1;
@@ -1170,6 +1218,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 	if (symbol__init() < 0)
 		goto out_delete;
 
+	if (perf_c2c__setup_sample_type(c2c, session) < 0)
+		goto out_delete;
+
 	/* setup the evsel handlers for each event type */
 	evlist__for_each(session->evlist, evsel) {
 		const char *name = perf_evsel__name(evsel);
@@ -1257,8 +1308,101 @@ static int perf_c2c__record(int argc, const char **argv)
 	return cmd_record(i, rec_argv, NULL);
 }
 
+static int
+opt_callchain_cb(const struct option *opt, const char *arg, int unset)
+{
+	struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
+	char *tok, *tok2;
+	char *endptr;
+
+	/*
+	 * --no-call-graph
+	 */
+	if (unset) {
+		c2c->call_graph = false;
+		return 0;
+	}
+
+	symbol_conf.use_callchain = true;
+	c2c->call_graph = true;
+
+	if (!arg)
+		return 0;
+
+	tok = strtok((char *)arg, ",");
+	if (!tok)
+		return -1;
+
+	/* get the output mode */
+	if (!strncmp(tok, "graph", strlen(arg)))
+		callchain_param.mode = CHAIN_GRAPH_ABS;
+
+	else if (!strncmp(tok, "flat", strlen(arg)))
+		callchain_param.mode = CHAIN_FLAT;
+
+	else if (!strncmp(tok, "fractal", strlen(arg)))
+		callchain_param.mode = CHAIN_GRAPH_REL;
+
+	else if (!strncmp(tok, "none", strlen(arg))) {
+		callchain_param.mode = CHAIN_NONE;
+		symbol_conf.use_callchain = false;
+
+		return 0;
+	}
+
+	else
+		return -1;
+
+	/* get the min percentage */
+	tok = strtok(NULL, ",");
+	if (!tok)
+		goto setup;
+
+	callchain_param.min_percent = strtod(tok, &endptr);
+	if (tok == endptr)
+		return -1;
+
+	/* get the print limit */
+	tok2 = strtok(NULL, ",");
+	if (!tok2)
+		goto setup;
+
+	if (tok2[0] != 'c') {
+		callchain_param.print_limit = strtoul(tok2, &endptr, 0);
+		tok2 = strtok(NULL, ",");
+		if (!tok2)
+			goto setup;
+	}
+
+	/* get the call chain order */
+	if (!strncmp(tok2, "caller", strlen("caller")))
+		callchain_param.order = ORDER_CALLER;
+	else if (!strncmp(tok2, "callee", strlen("callee")))
+		callchain_param.order = ORDER_CALLEE;
+	else
+		return -1;
+
+	/* Get the sort key */
+	tok2 = strtok(NULL, ",");
+	if (!tok2)
+		goto setup;
+	if (!strncmp(tok2, "function", strlen("function")))
+		callchain_param.key = CCKEY_FUNCTION;
+	else if (!strncmp(tok2, "address", strlen("address")))
+		callchain_param.key = CCKEY_ADDRESS;
+	else
+		return -1;
+setup:
+	if (callchain_register_param(&callchain_param) < 0) {
+		fprintf(stderr, "Can't register callchain params\n");
+		return -1;
+	}
+	return 0;
+}
+
 int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 {
+	char callchain_default_opt[] = "fractal,0.05,callee";
 	struct perf_c2c c2c = {
 		.tool = {
 			.sample		 = perf_c2c__process_sample,
@@ -1285,6 +1429,9 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		   "separator",
 		   "separator for columns, no spaces will be added"
 		   " between columns '.' is reserved."),
+	OPT_CALLBACK_DEFAULT('g', "call-graph", &c2c, "output_type,min_percent[,print_limit],call_order",
+			     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
+			     "Default: fractal,0.5,callee,function", &opt_callchain_cb, callchain_default_opt),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7



* [PATCH 16/19] perf, c2c: Output summary stats
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (14 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 15/19] perf, c2c: Add callchain support Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 17/19] perf, c2c: Dump rbtree for debugging Don Zickus
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Output some summary stats based on the processed records.
Mainly for diagnostic use.

Stats done by Dick Fowles, backported by me.

Sample output:

=================================================
            Trace Event Information
=================================================
  Total records                     :    1322047
  Locked Load/Store Operations      :     206317
  Load Operations                   :     355701
  Loads - uncacheable               :        590
  Loads - IO                        :          0
  Loads - Miss                      :        440
  Loads - no mapping                :        207
  Load Fill Buffer Hit              :     100214
  Load L1D hit                      :     148454
  Load L2D hit                      :      15170
  Load LLC hit                      :      53872
  Load Local HITM                   :      15388
  Load Remote HITM                  :      26760
  Load Remote HIT                   :       3910
  Load Local DRAM                   :       2436
  Load Remote DRAM                  :       3648
  Load MESI State Exclusive         :       2883
  Load MESI State Shared            :       3201
  Load LLC Misses                   :      36754
  LLC Misses to Local DRAM          :        6.6%
  LLC Misses to Remote DRAM         :        9.9%
  LLC Misses to Remote cache (HIT)  :       10.6%
  LLC Misses to Remote cache (HITM) :       72.8%
  Store Operations                  :     966322
  Store - uncacheable               :          0
  Store - no mapping                :      42931
  Store L1D Hit                     :     915696
  Store L1D Miss                    :       7695
  No Page Map Rejects               :       1193
  Unable to parse data source       :         24
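
The "Load LLC Misses" line and the four percentages under it are derived
from the other counters: the miss total is the sum of local DRAM, remote
DRAM, remote HIT and remote HITM loads.  A small sketch of that
arithmetic follows; struct llc_counts, llc_misses() and llc_pct() are
hypothetical names, and the zero-total guard is an addition that the
patch itself does not have.

```c
/* Hypothetical subset of the load counters used by the report. */
struct llc_counts {
	int lcl_dram;
	int rmt_dram;
	int rmt_hit;
	int rmt_hitm;
};

/* Loads that missed the local LLC, wherever they were finally served. */
static int llc_misses(const struct llc_counts *c)
{
	return c->lcl_dram + c->rmt_dram + c->rmt_hit + c->rmt_hitm;
}

/* Share of LLC misses as a percentage; 0.0 when there were no misses. */
static double llc_pct(int part, int total)
{
	return total > 0 ? 100.0 * (double)part / (double)total : 0.0;
}
```

With the sample counters above (2436, 3648, 3910 and 26760) the total is
36754, and the shares come out as 6.6%, 9.9%, 10.6% and 72.8%.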

V2: refresh to hist_entry

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 8756ca5..3b0e0b2 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -963,7 +963,6 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
 						 23, stdout);
 	}
 }
-
 static void print_c2c_hitm_report(struct rb_root *hitm_tree,
 				  struct c2c_stats *hitm_stats __maybe_unused,
 				  struct c2c_stats *c2c_stats)
@@ -1158,6 +1157,51 @@ cleanup:
 	return;
 }
 
+static void print_c2c_trace_report(struct perf_c2c *c2c)
+{
+	int llc_misses;
+	struct c2c_stats *stats = &c2c->stats;
+
+	llc_misses = stats->t.lcl_dram +
+		     stats->t.rmt_dram +
+		     stats->t.rmt_hit +
+		     stats->t.rmt_hitm;
+
+	printf("=================================================\n");
+	printf("            Trace Event Information              \n");
+	printf("=================================================\n");
+	printf("  Total records                     : %10d\n", c2c->stats.nr_entries);
+	printf("  Locked Load/Store Operations      : %10d\n", stats->t.locks);
+	printf("  Load Operations                   : %10d\n", stats->t.load);
+	printf("  Loads - uncacheable               : %10d\n", stats->t.ld_uncache);
+	printf("  Loads - IO                        : %10d\n", stats->t.ld_io);
+	printf("  Loads - Miss                      : %10d\n", stats->t.ld_miss);
+	printf("  Loads - no mapping                : %10d\n", stats->t.ld_noadrs);
+	printf("  Load Fill Buffer Hit              : %10d\n", stats->t.ld_fbhit);
+	printf("  Load L1D hit                      : %10d\n", stats->t.ld_l1hit);
+	printf("  Load L2D hit                      : %10d\n", stats->t.ld_l2hit);
+	printf("  Load LLC hit                      : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+	printf("  Load Local HITM                   : %10d\n", stats->t.lcl_hitm);
+	printf("  Load Remote HITM                  : %10d\n", stats->t.rmt_hitm);
+	printf("  Load Remote HIT                   : %10d\n", stats->t.rmt_hit);
+	printf("  Load Local DRAM                   : %10d\n", stats->t.lcl_dram);
+	printf("  Load Remote DRAM                  : %10d\n", stats->t.rmt_dram);
+	printf("  Load MESI State Exclusive         : %10d\n", stats->t.ld_excl);
+	printf("  Load MESI State Shared            : %10d\n", stats->t.ld_shared);
+	printf("  Load LLC Misses                   : %10d\n", llc_misses);
+	printf("  LLC Misses to Local DRAM          : %10.1f%%\n", ((double)stats->t.lcl_dram/(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote DRAM         : %10.1f%%\n", ((double)stats->t.rmt_dram/(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote cache (HIT)  : %10.1f%%\n", ((double)stats->t.rmt_hit /(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote cache (HITM) : %10.1f%%\n", ((double)stats->t.rmt_hitm/(double)llc_misses) * 100.);
+	printf("  Store Operations                  : %10d\n", stats->t.store);
+	printf("  Store - uncacheable               : %10d\n", stats->t.st_uncache);
+	printf("  Store - no mapping                : %10d\n", stats->t.st_noadrs);
+	printf("  Store L1D Hit                     : %10d\n", stats->t.st_l1hit);
+	printf("  Store L1D Miss                    : %10d\n", stats->t.st_l1miss);
+	printf("  No Page Map Rejects               : %10d\n", stats->t.nomap);
+	printf("  Unable to parse data source       : %10d\n", stats->t.noparse);
+}
+
 static int perf_c2c__process_events(struct perf_session *session,
 				    struct perf_c2c *c2c)
 {
@@ -1169,6 +1213,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
 
 err:
-- 
1.7.11.7



* [PATCH 17/19] perf, c2c: Dump rbtree for debugging
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (15 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 16/19] perf, c2c: Output summary stats Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 18/19] perf, c2c: Add symbol count table Don Zickus
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Sometimes you want to verify the rbtree sorting on a unique id
is working correctly.  This allows you to dump it.

Sample output:

   Idx Hit Maj Min      Ino           InoGen    Pid            Daddr            Iaddr         Data Src                         (string)  cpumode
     0       0   0        0                0      0               22 ffffffff813044cf         48080184           [STORE,L1,MISS,SNP NA]        1

     1       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     2       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     3       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     4       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     5       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     6       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1

     7       0   0        0                0  18179          135f860 ffffffff812ad509         68100242          [LOAD,LFB,HIT,SNP NONE]        1

     8       0   0        0                0  18179     7ff9d7fbaf98 ffffffff812ad509         68100242          [LOAD,LFB,HIT,SNP NONE]        1
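
The blank lines in the dump separate entries that fall on different
cachelines.  A rough sketch of that grouping follows, assuming 64-byte
cachelines (the patch takes its mask from CLINE_OFFSET_MSK, whose value
is not shown here) and ignoring the extra he->color check.  All ex_*
names and count_cacheline_groups() are hypothetical.

```c
#include <stdint.h>

/* Assumption: 64-byte cachelines. */
#define EX_CLINE_SIZE	64ULL
#define EX_CLINE_MASK	(EX_CLINE_SIZE - 1)

/* Base address of the cacheline containing addr. */
static uint64_t ex_cladrs(uint64_t addr)
{
	return addr & ~EX_CLINE_MASK;
}

/* Byte offset of addr within its cacheline. */
static int ex_cloffset(uint64_t addr)
{
	return (int)(addr & EX_CLINE_MASK);
}

/*
 * Count the groups in a list sorted by data address: a new group
 * starts whenever the cacheline base changes.
 */
static int count_cacheline_groups(const uint64_t *daddr, int n)
{
	uint64_t cl = 0;
	int i, groups = 0;

	for (i = 0; i < n; i++) {
		uint64_t base = ex_cladrs(daddr[i]);

		if (i == 0 || base != cl) {
			groups++;
			cl = base;
		}
	}

	return groups;
}
```

Under that assumption the Daddr values in the sample (0x22, six entries
at 0xca3, 0x135f860 and 0x7ff9d7fbaf98) map to four distinct cachelines,
which matches the four blocks in the output.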

V2: refresh with hist_entry

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 3b0e0b2..c095f1b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -616,6 +616,55 @@ err:
 
 #define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
 
+static void dump_rb_tree(struct rb_root *tree,
+			 struct perf_c2c *c2c __maybe_unused)
+{
+	struct rb_node *next = rb_first(tree);
+	struct hist_entry *he;
+	u64 cl = 0;
+	int idx = 0;
+
+	printf("# Summary: Total entries - %d\n", c2c->stats.nr_entries);
+	printf("# HITMs: Local - %d   Remote - %d     Total - %d\n",
+		c2c->stats.t.lcl_hitm, c2c->stats.t.rmt_hitm,
+		(c2c->stats.t.lcl_hitm + c2c->stats.t.rmt_hitm));
+
+	printf("%6s %3s %3s %3s %8s %16s %6s %16s %16s %16s %32s %8s\n",
+		"Idx", "Hit", "Maj", "Min", "Ino", "InoGen", "Pid", 
+		"Daddr", "Iaddr", "Data Src", "(string)", "cpumode");
+	while (next) {
+		char data_src[32];
+		u64 val;
+
+		he = rb_entry(next, struct hist_entry, rb_node_in);
+		next = rb_next(&he->rb_node_in);
+
+		if ((!he->color) || (cl != CLADRS(he->mem_info->daddr.al_addr))) {
+			printf("\n");
+			cl = CLADRS(he->mem_info->daddr.al_addr);
+		}
+
+		val = he->mem_info->data_src.val;
+		perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), val);
+
+		printf("%6d %3s %3x %3x %8lx %16lx %6d %16lx %16lx %16lx %32s %8x\n",
+			idx,
+			(PERF_MEM_S(SNOOP,HITM) & val) ? " * " : "   ",
+			he->mem_info->daddr.map->maj,
+			he->mem_info->daddr.map->min,
+			he->mem_info->daddr.map->ino,
+			he->mem_info->daddr.map->ino_generation,
+			he->thread->pid_,
+			he->mem_info->daddr.addr,
+			he->mem_info->iaddr.addr,
+			val,
+			data_src,
+			he->cpumode);
+
+		idx++;
+	}
+}
+
 static void c2c_hit__update_stats(struct c2c_stats *new,
 				  struct c2c_stats *old)
 {
@@ -1213,6 +1262,8 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	if (verbose > 2)
+		dump_rb_tree(c2c->hists.entries_in, c2c);
 	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
 
-- 
1.7.11.7



* [PATCH 18/19] perf, c2c: Add symbol count table
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (16 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 17/19] perf, c2c: Dump rbtree for debugging Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 17:43 ` [PATCH 19/19] perf, c2c: Add shared cachline summary table Don Zickus
  2014-02-28 18:57 ` [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Andi Kleen
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

Just another table that displays the referenced symbols in the analysis
report.  The table lists the most frequently used symbols first.

It is just another way to look at similar data to figure out who
is causing the most contention (based on the workload used).
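
The count-then-sort idea can be sketched outside of perf roughly as
follows.  This is only an illustration: struct ref_count, ref_count_add()
and ref_count_sort() are hypothetical names, and the sketch uses a fixed
array plus qsort() where the patch itself uses kernel-style linked lists.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_REFS 64	/* arbitrary limit chosen for this sketch */

/* Hypothetical stand-in for the patch's 'struct refs'. */
struct ref_count {
	const char *name;	/* object (DSO) short name */
	int nr;			/* number of records that referenced it */
};

/*
 * Bump the counter for 'name', adding a new slot the first time it is seen.
 * Returns the number of slots in use, or -1 if the table is full.
 */
static int ref_count_add(struct ref_count *tbl, int used, const char *name)
{
	int i;

	for (i = 0; i < used; i++) {
		if (!strcmp(tbl[i].name, name)) {
			tbl[i].nr++;
			return used;
		}
	}

	if (used >= MAX_REFS)
		return -1;

	tbl[used].name = name;
	tbl[used].nr = 1;
	return used + 1;
}

/* qsort() comparator: larger counts sort first. */
static int ref_count_cmp(const void *a, const void *b)
{
	const struct ref_count *ra = a;
	const struct ref_count *rb = b;

	return (rb->nr > ra->nr) - (rb->nr < ra->nr);
}

/* Order the table so the most frequently referenced object is at index 0. */
static void ref_count_sort(struct ref_count *tbl, int used)
{
	qsort(tbl, used, sizeof(*tbl), ref_count_cmp);
}
```

Note that qsort() is not stable, so objects with equal counts may come
out in any order.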

Originally done by Dick Fowles, backported by me.

Sample output:

=======================================================================================================
                                                   Object Name, Path & Reference Counts

Index    Records   Object Name                       Object Path
=======================================================================================================
    0     931379   [kernel.kallsyms]                 [kernel.kallsyms]
    1     192258   fio                               /home/joe/old_fio-2.0.15/fio
    2      80302   [jbd2]                            /lib/modules/3.10.0c2c_all+/kernel/fs/jbd2/jbd2.ko
    3      65392   [ext4]                            /lib/modules/3.10.0c2c_all+/kernel/fs/ext4/ext4.ko

V2: refresh to latest upstream changes and hist_entry

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index c095f1b..0749ea6 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -685,6 +685,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
+LIST_HEAD(ref_tree);
+LIST_HEAD(ref_tree_sorted);
+struct refs {
+	struct list_head	list;
+	int			nr;
+	const char		*name;
+	const char		*long_name;
+};
+
+static int update_ref_list(struct hist_entry *entry)
+{
+	struct refs *p;
+	struct dso *dso = entry->mem_info->iaddr.map->dso;
+	const char *name = dso->short_name;
+
+	list_for_each_entry(p, &ref_tree, list) {
+		if (!strcmp(p->name, name))
+			goto found;
+	}
+
+	p = zalloc(sizeof(struct refs));
+	if (!p)
+		return -1;
+	p->name = name;
+	p->long_name = dso->long_name;
+	list_add_tail(&p->list, &ref_tree);
+
+found:
+	p->nr++;
+	return 0;
+}
+
+static void print_symbol_record_count(struct rb_root *tree)
+{
+	struct rb_node *next = rb_first(tree);
+	struct hist_entry *he;
+	struct refs *p, *q, *pn;
+	char      string[256];
+	char      delimit[256];
+	int       i;
+	int       idx = 0;
+
+	/* gather symbol references */
+	while (next) {
+		he = rb_entry(next, struct hist_entry, rb_node_in);
+		next = rb_next(&he->rb_node_in);
+
+		if (update_ref_list(he)) {
+			pr_err("Could not update reference tree\n");
+			goto cleanup;
+		}
+	}
+
+	/* sort on number of references per symbol */
+	list_for_each_entry_safe(p, pn, &ref_tree, list) {
+		list_del_init(&p->list);
+		list_for_each_entry(q, &ref_tree_sorted, list) {
+			if (p->nr > q->nr) {
+				list_add_tail(&p->list, &q->list);
+				break;
+			}
+		}
+		if (list_empty(&p->list))
+			list_add_tail(&p->list, &ref_tree_sorted);
+	}
+
+	/* print header info */
+	sprintf(string, "%5s   %8s   %-32s  %-80s",
+		"Index",
+		"Records",
+		"Object Name",
+		"Object Path");
+
+	delimit[0] = '\0';
+	for (i = 0; i < (int)strlen(string); i++) strcat(delimit, "=");
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("%50s %s\n", " ", "Object Name, Path & Reference Counts");
+	printf("\n");
+	printf("%s\n", string);
+	printf("%s\n", delimit);
+
+	/* print out table */
+	list_for_each_entry(p, &ref_tree_sorted, list) {
+		printf("%5d   %8d   %-32s  %-80s\n",
+			idx, p->nr, p->name, p->long_name);
+		idx++;
+	}
+	printf("\n");
+
+cleanup:
+	list_for_each_entry_safe(p, pn, &ref_tree_sorted, list) {
+		list_del(&p->list);
+		free(p);
+	}
+}
+
 static void print_hitm_cacheline_header(void)
 {
 #define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
@@ -1266,6 +1364,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 		dump_rb_tree(c2c->hists.entries_in, c2c);
 	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
+	print_symbol_record_count(c2c->hists.entries_in);
 
 err:
 	return err;
-- 
1.7.11.7



* [PATCH 19/19] perf, c2c: Add shared cachline summary table
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (17 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 18/19] perf, c2c: Add symbol count table Don Zickus
@ 2014-02-28 17:43 ` Don Zickus
  2014-02-28 18:57 ` [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Andi Kleen
  19 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-02-28 17:43 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, eranian, Don Zickus

This adds a quick summary of the hottest cache contention lines based
on the input data.  This summarizes what the broken table shows you,
so you can see at a quick glance which cachelines are interesting.
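
For readers of the table, the two percentage columns are derived from
the remote HITM count of each cacheline.  The sketch below mirrors that
arithmetic under stated assumptions: struct cl_counts and the helper
names are hypothetical, and the divide-by-zero guard is an addition for
the sketch, not something the patch does.

```c
#include <stdint.h>

/* Hypothetical subset of the per-cacheline counters used by the report. */
struct cl_counts {
	uint32_t lcl_dram;
	uint32_t rmt_dram;
	uint32_t rmt_hit;
	uint32_t rmt_hitm;
};

/* Percentage helper; returns 0 rather than dividing by zero. */
static double pct(uint32_t part, uint32_t whole)
{
	return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

/* "%hitm": this line's remote HITMs relative to all remote HITMs. */
static double pct_hitm(const struct cl_counts *line,
		       const struct cl_counts *total)
{
	return pct(line->rmt_hitm, total->rmt_hitm);
}

/* "%All Ld Miss": this line's remote HITMs relative to all load misses. */
static double pct_all_miss(const struct cl_counts *line,
			   const struct cl_counts *total)
{
	uint32_t totmiss = total->lcl_dram + total->rmt_dram +
			   total->rmt_hit + total->rmt_hitm;

	return pct(line->rmt_hitm, totmiss);
}
```

With totals of 10/20/30/40 for the four counters and a line that has 10
remote HITMs, pct_hitm() gives 25% and pct_all_miss() gives 10%.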

Originally done by Dick Fowles, backported by me.

Sample output (width trimmed):

===================================================================================================================================================

                                                                                          Shared Data Cache Line Table

                                 Total     %All                Total       ---- Core Load Hit ----  -- LLC Load Hit --     ----- LLC Load Hitm -----
   Index           Phys Adrs   Records   Ld Miss     %hitm     Loads        FB       L1D       L2D       Lcl       Rmt     Total       Lcl       Rmt
====================================================================================================================================================
       0  0xffff881fa55b0140     72006    16.97%    23.31%     43095     13591     16860        45      2651        25      9526      3288      6238
       1  0xffff881fba47f000     21854     5.29%     7.26%     13938      3887      6941        15         1         7      3087      1143      1944
       2  0xffff881fc21b9cc0      2153     1.61%     2.21%       862        32        70         0        15         1       740       148       592
       3  0xffff881fc7d91cc0      1957     1.40%     1.92%       866        34        94         0        14         3       720       207       513
       4  0xffff881fba539cc0      1813     1.35%     1.85%       808        33        84         3        14         1       665       170       495

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 0749ea6..57441b9 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -783,6 +783,141 @@ cleanup:
 	}
 }
 
+static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
+					      struct c2c_stats *shared_stats __maybe_unused,
+					      struct c2c_stats *c2c_stats __maybe_unused)
+{
+#define   SHM_TITLE  "Shared Data Cache Line Table"
+
+	struct rb_node	*next = rb_first(hitm_tree);
+	struct c2c_hit	*h;
+	char		header[256];
+	char		delimit[256];
+	u32		crecords;
+	u32		lclmiss;
+	u32		ldcnt;
+	double		p_hitm;
+	double		p_all;
+	int		totmiss;
+	int		rmt_hitm;
+	int		len;
+	int		pad;
+	int		i;
+
+	sprintf(header,"%28s  %8s  %8s  %8s  %8s  %28s  %18s  %28s  %18s  %8s  %28s",
+		" ",
+		"Total",
+		"%All ",
+		" ",
+		"Total",
+		"---- Core Load Hit ----",
+		"-- LLC Load Hit --",
+		"----- LLC Load Hitm -----",
+		"-- Load Dram --",
+		"LLC  ",
+		"---- Store Reference ----");
+
+	len = strlen(header);
+	delimit[0] = '\0';
+
+	for (i = 0; i < len; i++)
+		strcat(delimit, "=");
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("\n");
+	pad = (strlen(header)/2) - (strlen(SHM_TITLE)/2);
+	for (i = 0; i < pad; i++)
+		printf(" ");
+	printf("%s\n", SHM_TITLE);
+	printf("\n");
+	printf("%s\n", header);
+
+	sprintf(header, "%8s  %18s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s",
+		"Index",
+		"Phys Adrs",
+		"Records",
+		"Ld Miss",
+		"%hitm",
+		"Loads",
+		"FB",
+		"L1D",
+		"L2D",
+		"Lcl",
+		"Rmt",
+		"Total",
+		"Lcl",
+		"Rmt",
+		"Lcl",
+		"Rmt",
+		"Ld Miss",
+		"Total",
+		"L1Hit",
+		"L1Miss");
+
+	printf("%s\n", header);
+	printf("%s\n", delimit);
+
+	rmt_hitm    = c2c_stats->t.rmt_hitm;
+	totmiss     = c2c_stats->t.lcl_dram +
+		      c2c_stats->t.rmt_dram +
+		      c2c_stats->t.rmt_hit +
+		      c2c_stats->t.rmt_hitm;
+
+	i = 0;
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		lclmiss  = h->stats.t.lcl_dram +
+			   h->stats.t.rmt_dram +
+			   h->stats.t.rmt_hitm +
+			   h->stats.t.rmt_hit;
+
+		ldcnt    = lclmiss +
+			   h->stats.t.ld_fbhit +
+			   h->stats.t.ld_l1hit +
+			   h->stats.t.ld_l2hit +
+			   h->stats.t.ld_llchit +
+			   h->stats.t.lcl_hitm;
+
+		crecords = ldcnt +
+			   h->stats.t.st_l1hit +
+			   h->stats.t.st_l1miss;
+
+		p_hitm = (double)h->stats.t.rmt_hitm / (double)rmt_hitm;
+		p_all  = (double)h->stats.t.rmt_hitm / (double)totmiss;
+
+		/* stop when the percentage gets too low */
+		if (p_hitm < DISPLAY_LINE_LIMIT)
+			break;
+
+		printf("%8d  %#18lx  %8u  %7.2f%%  %7.2f%%  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u\n",
+			i,
+			h->cacheline,
+			crecords,
+			100. * p_all,
+			100. * p_hitm,
+			ldcnt,
+			h->stats.t.ld_fbhit,
+			h->stats.t.ld_l1hit,
+			h->stats.t.ld_l2hit,
+			h->stats.t.ld_llchit,
+			h->stats.t.rmt_hit,
+			h->stats.t.lcl_hitm + h->stats.t.rmt_hitm,
+			h->stats.t.lcl_hitm,
+			h->stats.t.rmt_hitm,
+			h->stats.t.lcl_dram,
+			h->stats.t.rmt_dram,
+			lclmiss,
+			h->stats.t.store,
+			h->stats.t.st_l1hit,
+			h->stats.t.st_l1miss);
+
+		i++;
+	}
+}
+
 static void print_hitm_cacheline_header(void)
 {
 #define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
@@ -1290,6 +1425,7 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
 		free(h);
 
 	print_shared_cacheline_info(&hitm_stats, shared_clines);
+	print_c2c_shared_cacheline_report(&hitm_tree, &hitm_stats, &c2c->stats);
 	print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
 
 cleanup:
-- 
1.7.11.7



* Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (18 preceding siblings ...)
  2014-02-28 17:43 ` [PATCH 19/19] perf, c2c: Add shared cachline summary table Don Zickus
@ 2014-02-28 18:57 ` Andi Kleen
  2014-02-28 19:42   ` Don Zickus
  19 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-02-28 18:57 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

Don Zickus <dzickus@redhat.com> writes:
>
> A handful of patches include re-enabling MMAP2 support and some fixes
> to perf itself.

I would suggest to pursue the lone kernel patch separately. Hopefully
that can be merged soon, once the remaining problems with that are
addressed.

>
> Comments, feedback, anything else welcomed.

As a high level comment, can you add support for CSV mode? 
(-x, in other tools)

I assume most people would data mine the output in some form,
and that's much easier with a more machine oriented format.

Also you should probably have the standard perf conventions
for output fds (--log-fd, --output, default to stderr). Otherwise
the output may often be lost.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 11/19] perf, c2c: Add in sort on physid
  2014-02-28 17:43 ` [PATCH 11/19] perf, c2c: Add in sort on physid Don Zickus
@ 2014-02-28 18:59   ` Andi Kleen
  2014-02-28 19:44     ` Don Zickus
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-02-28 18:59 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

Don Zickus <dzickus@redhat.com> writes:
> +
> +	/*
> +	 * must pass period=weight in order to get the correct
> +	 * sorting from hists__collapse_resort() which is solely
> +	 * based on periods. We want sorting be done on nr_events * weight
> +	 * and this is indirectly achieved by passing period=weight here
> +	 * and the he_stat__add_period() function.
> +	 */

sort.c really must be fixed to support multiple sort keys properly.
Lots of other code has workarounds for this too (e.g. it is a bit
problem for TSX abort weights)

I would suggest to fix that instead of hacking around it.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 17:42 ` [PATCH 08/19] perf c2c: Shared data analyser Don Zickus
@ 2014-02-28 19:08   ` Andi Kleen
  2014-02-28 19:46     ` Don Zickus
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-02-28 19:08 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

Don Zickus <dzickus@redhat.com> writes:
> +
> +static const struct perf_evsel_str_handler handlers[] = {
> +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },

The 30 magic number should probably be configurable.

Using load-latency here rules out Atom, so at some point
you would need to get rid of that.

I suspect on most systems you should rather use p 
instead of pp to get the overhead down (before Haswell pp
is expensive)

> +static int perf_c2c__record(int argc, const char **argv)
> +{
> +	unsigned int rec_argc, i, j;
> +	const char **rec_argv;
> +	const char * const record_args[] = {
> +		"record",
> +		/* "--phys-addr", */

So is that needed or not?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-28 18:57 ` [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Andi Kleen
@ 2014-02-28 19:42   ` Don Zickus
  2014-02-28 21:54     ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 19:42 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 10:57:48AM -0800, Andi Kleen wrote:
> Don Zickus <dzickus@redhat.com> writes:
> >
> > A handful of patches include re-enabling MMAP2 support and some fixes
> > to perf itself.
> 
> I would suggest to pursue the lone kernel patch separately. Hopefully
> that can be merged soon, once the remaining problems with that are
> addressed.

Good point.

> 
> >
> > Comments, feedback, anything else welcomed.
> 
> As a high level comment, can you add support for CSV mode? 
> (-x, in other tools)

It's there. :-)

perf c2c -r -x, report

will spit out raw records in CSV format.

> 
> I assume most people would data mine the output in some form,
> and that's much easier with a more machine oriented format.
> 
> Also you should probably have the standard perf conventions
> for output fds (--log-fd, --output, default to stderr). Otherwise
> the output may often be lost.

Ok,  I'll look into what that entails.

Thanks!

Cheers,
Don


* Re: [PATCH 11/19] perf, c2c: Add in sort on physid
  2014-02-28 18:59   ` Andi Kleen
@ 2014-02-28 19:44     ` Don Zickus
  2014-03-01  1:07       ` Andi Kleen
  0 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 19:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 10:59:18AM -0800, Andi Kleen wrote:
> Don Zickus <dzickus@redhat.com> writes:
> > +
> > +	/*
> > +	 * must pass period=weight in order to get the correct
> > +	 * sorting from hists__collapse_resort() which is solely
> > +	 * based on periods. We want sorting be done on nr_events * weight
> > +	 * and this is indirectly achieved by passing period=weight here
> > +	 * and the he_stat__add_period() function.
> > +	 */
> 
> sort.c really must be fixed to support multiple sort keys properly.
> Lots of other code has workarounds for this too (e.g. it is a bit
> problem for TSX abort weights)
> 
> I would suggest to fix that instead of hacking around it.

I don't think I understand the problem enough to know what to fix.  I just
copied this piece of code from builtin-report.c and things seemed to work.

Mind giving me some details and I can look at fixing it. :-)

Cheers,
Don


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 19:08   ` Andi Kleen
@ 2014-02-28 19:46     ` Don Zickus
  2014-02-28 21:03       ` Davidlohr Bueso
  0 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-02-28 19:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> Don Zickus <dzickus@redhat.com> writes:
> > +
> > +static const struct perf_evsel_str_handler handlers[] = {
> > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> 
> The 30 magic number should probably be configurable.

Yeah, I just haven't figured out yet how to make it configurable within
this string.
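
One possible direction, sketched under assumptions: build the event
string at runtime instead of hard-coding it.  build_load_event() and
DEFAULT_LDLAT are hypothetical names, and how the result would be fed
back into the handlers[] table is not shown here.

```c
#include <stdio.h>

#define DEFAULT_LDLAT 30	/* the value currently hard-coded in the patch */

/*
 * Format the load event name for a given latency threshold.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int build_load_event(char *buf, size_t len, unsigned int ldlat)
{
	int n = snprintf(buf, len, "cpu/mem-loads,ldlat=%u/pp", ldlat);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling build_load_event(buf, sizeof(buf), DEFAULT_LDLAT) reproduces the
current string, so a command line option could simply override the
third argument.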

> 
> Using load-latency here rules out Atom, so at some point
> you would need to get rid of that.

Oh.  How do you get load-latency for Atom then?

> 
> I suspect on most systems you should rather use p 
> instead of pp to get the overhead down (before Haswell pp
> is expensive)

Ok.  Good to know.

> 
> > +static int perf_c2c__record(int argc, const char **argv)
> > +{
> > +	unsigned int rec_argc, i, j;
> > +	const char **rec_argv;
> > +	const char * const record_args[] = {
> > +		"record",
> > +		/* "--phys-addr", */
> 
> So is that needed or not?

No.  It is legacy code from before we had MMAP2 support.  I'll remove it in
the next respin.

Cheers,
Don


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 19:46     ` Don Zickus
@ 2014-02-28 21:03       ` Davidlohr Bueso
  2014-02-28 22:28         ` Joe Mario
                           ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2014-02-28 21:03 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andi Kleen, acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
> On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> > Don Zickus <dzickus@redhat.com> writes:
> > > +
> > > +static const struct perf_evsel_str_handler handlers[] = {
> > > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> > 

Hmm I'm getting this when running a simple record command. 

invalid or unsupported event: 'cpu/mem-loads/pp'

This only occurs with c2c, other subcommands work normally. It's as if
it were an old kernel, but it's Linus' latest. Is this an issue with the
patch or something I'm missing?

Furthermore, I see:
ls /sys/bus/event_source/devices/cpu/events
branch-instructions  branch-misses  cache-misses  cache-references  cpu-cycles  instructions  mem-loads

Thanks!





* Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-28 19:42   ` Don Zickus
@ 2014-02-28 21:54     ` Andi Kleen
  2014-03-03 14:04       ` Don Zickus
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-02-28 21:54 UTC (permalink / raw)
  To: Don Zickus; +Cc: Andi Kleen, acme, LKML, jolsa, jmario, fowles, eranian

> It's there. :-)
> 
> perf c2c -r -x, report
> 
> will spit out raw records in CSV format.

But most of the cooked data seems not? I saw
lots of printfs without separator support.

Sorry I meant data mining the cooked data,
e.g. plotting it.
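
To illustrate what separator support for the cooked tables could look
like: each row goes through one helper that switches between padded and
delimited output.  This is a minimal sketch, assuming a hypothetical
format_row() helper and only three of the columns; none of these names
come from the patch set.

```c
#include <stdio.h>

/*
 * Format one summary row.  With a separator (e.g. ","), fields are emitted
 * unpadded and delimited; with sep == NULL the fixed-width layout is used.
 * Returns the snprintf() result (length the full row needs).
 */
static int format_row(char *buf, size_t len, const char *sep,
		      int idx, unsigned long cacheline, unsigned int records)
{
	if (sep)
		return snprintf(buf, len, "%d%s%#lx%s%u",
				idx, sep, cacheline, sep, records);

	return snprintf(buf, len, "%8d  %#18lx  %8u", idx, cacheline, records);
}
```

The point of the design is that the column list lives in one place, so
the human-readable and machine-readable outputs cannot drift apart.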

-Andi


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 21:03       ` Davidlohr Bueso
@ 2014-02-28 22:28         ` Joe Mario
  2014-03-01  0:50           ` Andi Kleen
  2014-03-03 14:13         ` Don Zickus
  2014-03-03 15:05         ` Don Zickus
  2 siblings, 1 reply; 56+ messages in thread
From: Joe Mario @ 2014-02-28 22:28 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Don Zickus, Andi Kleen, acme, LKML, jolsa, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

Apologies for the resend.  My first msg contained html in it.

On 02/28/2014 04:03 PM, Davidlohr Bueso wrote:
> On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
>> On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
>>> Don Zickus <dzickus@redhat.com> writes:
>>>> +
>>>> +static const struct perf_evsel_str_handler handlers[] = {
>>>> +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
>>>> +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
>>>
>
> Hmm I'm getting this when running a simple record command.
>
> invalid or unsupported event: 'cpu/mem-loads/pp'
>
> This only occurs with c2c, other subcommands work normally. It's as if
> it were an old kernel, but it's Linus' latest. Is this an issue with the
> patch or something I'm missing?
>
> Furthermore, I see:
> ls /sys/bus/event_source/devices/cpu/events
> branch-instructions  branch-misses  cache-misses  cache-references  cpu-cycles  instructions  mem-loads

David:
  It looks like you're running on an older Intel processor, which is missing necessary events for C2C to work.
  As Don noted in his patch 00/19, this was primarily developed and tested on Intel's Ivy Bridge platform.
  If you rerun this on an Ivy Bridge, it should work fine.
  We should add a runtime check for supported platforms.
Joe

> Thanks!
>
>
>



* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 22:28         ` Joe Mario
@ 2014-03-01  0:50           ` Andi Kleen
  0 siblings, 0 replies; 56+ messages in thread
From: Andi Kleen @ 2014-03-01  0:50 UTC (permalink / raw)
  To: Joe Mario
  Cc: Davidlohr Bueso, Don Zickus, Andi Kleen, acme, LKML, jolsa,
	fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

> David:
>  It looks like you're running on an older Intel processor, which is missing necessary events for C2C to work.

mem-loads should be supported on Nehalem and up.
mem-stores is Sandy Bridge and up.

You can check with perf list.

-Andi


* Re: [PATCH 11/19] perf, c2c: Add in sort on physid
  2014-02-28 19:44     ` Don Zickus
@ 2014-03-01  1:07       ` Andi Kleen
  2014-03-01  1:27         ` Namhyung Kim
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-03-01  1:07 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andi Kleen, acme, LKML, jolsa, jmario, fowles, eranian, namhyung

> I don't think I understand the problem enough to know what to fix.  I just
> copied this piece of code from builtin-report.c and things seemed to work.
> 
> Mind giving me some details and I can look at fixing it. :-)

Even though sort.c has all these sort keys, it only sorts by period.
It should instead sort by all the specified keys, in order.
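
As a generic illustration of sorting by several keys in order (not a
patch against sort.c): the comparator walks a list of per-key compare
functions and stops at the first one that sees a difference.  struct
entry, its two fields and all function names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical entry with two sort keys. */
struct entry {
	unsigned long cacheline;
	unsigned long weight;
};

typedef int (*key_cmp_t)(const struct entry *a, const struct entry *b);

/* Ascending by cacheline address. */
static int cmp_cacheline(const struct entry *a, const struct entry *b)
{
	return (a->cacheline > b->cacheline) - (a->cacheline < b->cacheline);
}

/* Descending by weight: heavier entries first. */
static int cmp_weight(const struct entry *a, const struct entry *b)
{
	return (b->weight > a->weight) - (b->weight < a->weight);
}

/* Keys are consulted in this order; a later key only breaks ties. */
static const key_cmp_t sort_keys[] = { cmp_cacheline, cmp_weight };

static int cmp_entries(const void *pa, const void *pb)
{
	const struct entry *a = pa;
	const struct entry *b = pb;
	size_t i;

	for (i = 0; i < sizeof(sort_keys) / sizeof(sort_keys[0]); i++) {
		int r = sort_keys[i](a, b);

		if (r)
			return r;
	}
	return 0;
}

static void sort_entries(struct entry *e, size_t n)
{
	qsort(e, n, sizeof(*e), cmp_entries);
}
```

With this shape, the period (or weight) becomes just one more key in the
list rather than the only thing the final sort looks at.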

Namhyung looked at it at some point.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 11/19] perf, c2c: Add in sort on physid
  2014-03-01  1:07       ` Andi Kleen
@ 2014-03-01  1:27         ` Namhyung Kim
  0 siblings, 0 replies; 56+ messages in thread
From: Namhyung Kim @ 2014-03-01  1:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Don Zickus, Arnaldo Melo, LKML, Jiri Olsa, jmario, fowles,
	Stephane Eranian

Hi Andi,

On Sat, Mar 1, 2014 at 1:07 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> I don't think I understand the problem enough to know what to fix.  I just
>> copied this piece of code from builtin-report.c and things seemed to work.
>>
>> Mind giving me some details and I can look at fixing it. :-)
>
> Even though sort.c has all these sort keys, it only sorts by period.
> It should instead sort by all the specified keys, in order.
>
> Namhyung looked at it at some point.

Yes, I'm working on it now, but I only have a little time.  Hopefully
I can send an RFC version next week.

Thanks,
Namhyung


* Re: [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
  2014-02-28 21:54     ` Andi Kleen
@ 2014-03-03 14:04       ` Don Zickus
  0 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-03 14:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: acme, LKML, jolsa, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 10:54:42PM +0100, Andi Kleen wrote:
> > It's there. :-)
> > 
> > perf c2c -r -x, report
> > 
> > will spit out raw records in CSV format.
> 
> But most of the cooked data seems not? I saw
> lots of printfs without separator support.
> 
> Sorry I meant data mining the cooked data,
> e.g. plotting it.

Ah. Ok. I will look into that.

Cheers,
Don


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 21:03       ` Davidlohr Bueso
  2014-02-28 22:28         ` Joe Mario
@ 2014-03-03 14:13         ` Don Zickus
  2014-03-03 15:05         ` Don Zickus
  2 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-03 14:13 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andi Kleen, acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Fri, Feb 28, 2014 at 01:03:31PM -0800, Davidlohr Bueso wrote:
> On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
> > On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> > > Don Zickus <dzickus@redhat.com> writes:
> > > > +
> > > > +static const struct perf_evsel_str_handler handlers[] = {
> > > > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > > > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> > > 
> 
> Hmm I'm getting this when running a simple record command. 
> 
> invalid or unsupported event: 'cpu/mem-loads/pp'
> 
> This only occurs with c2c, other subcommands work normally. It's as if
> it were an old kernel, but it's Linus' latest. Is this an issue with the
> patch or something I'm missing?

Does rebooting or running

  echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate

fix the problem?  Perf throttling might have lowered your sample rate so
low that you can't do anything any more.  Though if you are running the
latest from Linus a lot of fixes should be in there to minimize that
problem.

What processor are you on?

Cheers,
Don

> 
> Furthermore, I see:
> ls /sys/bus/event_source/devices/cpu/events
> branch-instructions  branch-misses  cache-misses  cache-references  cpu-cycles  instructions  mem-loads
> 
> Thanks!
> 
> 
> 


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-02-28 21:03       ` Davidlohr Bueso
  2014-02-28 22:28         ` Joe Mario
  2014-03-03 14:13         ` Don Zickus
@ 2014-03-03 15:05         ` Don Zickus
  2014-03-03 17:23           ` Andi Kleen
  2014-03-03 18:21           ` Davidlohr Bueso
  2 siblings, 2 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-03 15:05 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andi Kleen, acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Fri, Feb 28, 2014 at 01:03:31PM -0800, Davidlohr Bueso wrote:
> On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
> > On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> > > Don Zickus <dzickus@redhat.com> writes:
> > > > +
> > > > +static const struct perf_evsel_str_handler handlers[] = {
> > > > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > > > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> > > 
> 
> Hmm I'm getting this when running a simple record command. 
> 
> invalid or unsupported event: 'cpu/mem-loads/pp'
> 
> This only occurs with c2c, other subcommands work normally. It's as if
> it were an old kernel, but it's Linus' latest. Is this an issue with the
> patch or something I'm missing?
> 
> Furthermore, I see:
> ls /sys/bus/event_source/devices/cpu/events
> branch-instructions  branch-misses  cache-misses  cache-references  cpu-cycles  instructions  mem-loads

Hmm, so based on Andi's reply, I am assuming you are running on a Westmere
(or Nehalem) due to the lack of mem-stores.

If you don't have mem-stores, this tool isn't going to work.  The tool can
only detect contention when sampling reads _and_ writes to the same
addresses.
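
A runtime check along these lines could catch the missing event up
front.  This is a minimal sketch, assuming the events show up as files
in the sysfs directory quoted above; event_exists() and
c2c_events_supported() are hypothetical helpers, not existing perf
functions.

```c
#include <stdio.h>
#include <unistd.h>

/* Return 1 if 'dir'/'event' exists and is readable, 0 otherwise. */
static int event_exists(const char *dir, const char *event)
{
	char path[256];
	int n = snprintf(path, sizeof(path), "%s/%s", dir, event);

	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;

	return access(path, R_OK) == 0;
}

/* The tool needs both load and store sampling to detect contention. */
static int c2c_events_supported(const char *dir)
{
	return event_exists(dir, "mem-loads") &&
	       event_exists(dir, "mem-stores");
}
```

The caller would pass /sys/bus/event_source/devices/cpu/events as 'dir'
and bail out with a readable error message when the check fails, instead
of the "invalid or unsupported event" message reported above.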

Cheers,
Don


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 15:05         ` Don Zickus
@ 2014-03-03 17:23           ` Andi Kleen
  2014-03-03 18:07             ` Joe Mario
  2014-03-03 20:26             ` Don Zickus
  2014-03-03 18:21           ` Davidlohr Bueso
  1 sibling, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2014-03-03 17:23 UTC (permalink / raw)
  To: Don Zickus
  Cc: Davidlohr Bueso, Andi Kleen, acme, LKML, jolsa, jmario, fowles,
	eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

> Hmm, so based on Andi's reply, I am assuming you are running on a Westmere
> (or Nehalem) due to the lack of mem-stores.
> 
> If you don't have mem-stores, this tool isn't going to work.  The tool can
> only detect contention when sampling reads _and_ writes to the same
> addresses.

On these CPUs you could simply sample on HITM. You won't get addresses,
but at least IPs and call stacks.

-Andi


* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 17:23           ` Andi Kleen
@ 2014-03-03 18:07             ` Joe Mario
  2014-03-03 18:41               ` Peter Zijlstra
  2014-03-03 20:26             ` Don Zickus
  1 sibling, 1 reply; 56+ messages in thread
From: Joe Mario @ 2014-03-03 18:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Don Zickus, Davidlohr Bueso, acme, LKML, jolsa, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On 03/03/2014 12:23 PM, Andi Kleen wrote:
>> Hmm, so based on Andi's reply, I am assuming you are running on a Westmere
>> (or Nehalem) due to the lack of mem-stores.
>>
>> If you don't have mem-stores, this tool isn't going to work.  The tool can
>> only detect contention when sampling reads _and_ writes to the same
>> addresses.
>
> On these CPUs you could simply sample on HITM. You won't get addresses,
> but at least IPs and call stacks.
>
> -Andi
>

If you only sample on the HITMs then you don't get the store misses.  That means you'll not be able to detect who is simultaneously tugging on the same cache lines.  That gives up much of the value of "perf c2c".

As we developed this, we ended up settling on Ivy Bridge to get the behavior we wanted.

Joe

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 15:05         ` Don Zickus
  2014-03-03 17:23           ` Andi Kleen
@ 2014-03-03 18:21           ` Davidlohr Bueso
  1 sibling, 0 replies; 56+ messages in thread
From: Davidlohr Bueso @ 2014-03-03 18:21 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andi Kleen, acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Mon, 2014-03-03 at 10:05 -0500, Don Zickus wrote:
> On Fri, Feb 28, 2014 at 01:03:31PM -0800, Davidlohr Bueso wrote:
> > On Fri, 2014-02-28 at 14:46 -0500, Don Zickus wrote:
> > > On Fri, Feb 28, 2014 at 11:08:59AM -0800, Andi Kleen wrote:
> > > > Don Zickus <dzickus@redhat.com> writes:
> > > > > +
> > > > > +static const struct perf_evsel_str_handler handlers[] = {
> > > > > +	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
> > > > > +	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
> > > > 
> > 
> > Hmm I'm getting this when running a simple record command. 
> > 
> > invalid or unsupported event: 'cpu/mem-loads/pp'
> > 
> > This only occurs with c2c, other subcommands work normally. It's as if
> > it were an old kernel, but it's Linus' latest. Is this an issue with the
> > patch or something I'm missing?
> > 
> > Furthermore, I see:
> > ls /sys/bus/event_source/devices/cpu/events
> > branch-instructions  branch-misses  cache-misses  cache-references  cpu-cycles  instructions  mem-loads
> 
> Hmm, so based on Andi's reply, I am assuming you are running on a Westmere
> (or Nehalem) due to the lack of mem-stores.

Indeed, I had definitely overlooked this detail.

Performance Events: PEBS fmt1+, 16-deep LBR, Westmere events, Broken BIOS detected, complain to your hardware vendor.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 18:07             ` Joe Mario
@ 2014-03-03 18:41               ` Peter Zijlstra
  2014-03-03 18:58                 ` Andi Kleen
  2014-03-03 20:30                 ` Don Zickus
  0 siblings, 2 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-03-03 18:41 UTC (permalink / raw)
  To: Joe Mario
  Cc: Andi Kleen, Don Zickus, Davidlohr Bueso, acme, LKML, jolsa,
	fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 01:07:00PM -0500, Joe Mario wrote:
> If you only sample on the HITMs then you don't get the store misses.
> That means you'll not be able to detect who is simultaneously tugging
> on the same cache lines.  That gives up much of the value of "perf
> c2c".

As long as you know which lines are hurting bringing in (loads) you can
often figure out who is doing the stores on them.

> As we developed this, we ended up settling on Ivy Bridge to get the
> behavior we wanted.

Wouldn't SNB also work?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 18:41               ` Peter Zijlstra
@ 2014-03-03 18:58                 ` Andi Kleen
  2014-03-03 19:48                   ` Peter Zijlstra
  2014-03-03 20:32                   ` Don Zickus
  2014-03-03 20:30                 ` Don Zickus
  1 sibling, 2 replies; 56+ messages in thread
From: Andi Kleen @ 2014-03-03 18:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joe Mario, Andi Kleen, Don Zickus, Davidlohr Bueso, acme, LKML,
	jolsa, fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 07:41:17PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 03, 2014 at 01:07:00PM -0500, Joe Mario wrote:
> > If you only sample on the HITMs then you don't get the store misses.
> > That means you'll not be able to detect who is simultaneously tugging
> > on the same cache lines.  That gives up much of the value of "perf
> > c2c".
> 
> As long as you know which lines are hurting bringing in (loads) you can
> often figure out who is doing the stores on them.

Yes, especially since every store is a load too (unless you're talking
WC)

The method c2c uses is more exact, but keep in mind it's a sampling
heuristic in any case, with some potential bias. load-latency tags
the loads randomly and there's no guarantee that tagging is fully
uniform. Also you only see a subset in any case.

> 
> > As we developed this, we ended up settling on Ivy Bridge to get the
> > behavior we wanted.
> 
> Wouldn't SNB also work?

Yes.

Haswell is best however because it can report addresses on far more
events.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 18:58                 ` Andi Kleen
@ 2014-03-03 19:48                   ` Peter Zijlstra
  2014-03-03 20:32                   ` Don Zickus
  1 sibling, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-03-03 19:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Joe Mario, Don Zickus, Davidlohr Bueso, acme, LKML, jolsa,
	fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 07:58:19PM +0100, Andi Kleen wrote:
> Haswell is best however because it can report addresses on far more
> events.

Sure; however I think the code to data structure mapping might be more
useful still. It's very hard to map a random address into a heap and find
a fitting data structure. You'd need some allocator with a reverse map,
which to me seems like too much overhead for normal runtime use.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 17:23           ` Andi Kleen
  2014-03-03 18:07             ` Joe Mario
@ 2014-03-03 20:26             ` Don Zickus
  2014-03-03 21:36               ` Andi Kleen
  1 sibling, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-03-03 20:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Davidlohr Bueso, acme, LKML, jolsa, jmario, fowles, eranian,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Peter Zijlstra, Richard Fowles

On Mon, Mar 03, 2014 at 06:23:16PM +0100, Andi Kleen wrote:
> > Hmm, so based on Andi's reply, I am assuming you are running on a Westmere
> > (or Nehalem) due to the lack of mem-stores.
> > 
> > If you don't have mem-stores, this tool isn't going to work.  The tool can
> > only detect contention when sampling reads _and_ writes to the same
> > addresses.
> 
> On these CPUs you could simply sample on HITM. You won't get addresses,
> but at least IPs and call stacks.

Heh.  I never thought about that.  And sure enough a quick test with
mem-stores commented out produced the same results (minus the stores).

One would just have to 'figure' out what cacheline offsets are causing the
HITMs.
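
For reference, when a sampled data address is available, the cacheline and the offset within it fall out of simple masking; without addresses that offset has to be worked out by hand from the code. A minimal sketch, assuming 64-byte cachelines; the macro and helper names are hypothetical and not taken from the patch set:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines. */
#define CACHELINE_SIZE 64ULL

/* Address of the first byte of the cacheline containing addr. */
static inline uint64_t cacheline_base(uint64_t addr)
{
	return addr & ~(CACHELINE_SIZE - 1);
}

/* Byte offset of addr within its cacheline (0..63). */
static inline unsigned int cacheline_offset(uint64_t addr)
{
	return (unsigned int)(addr & (CACHELINE_SIZE - 1));
}

/* True if both addresses fall into the same cacheline. */
static inline bool same_cacheline(uint64_t a, uint64_t b)
{
	return cacheline_base(a) == cacheline_base(b);
}
```

Two accesses contend only when same_cacheline() holds for them; differing offsets within one line are the usual sign of false sharing.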

Cheers,
Don

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 18:41               ` Peter Zijlstra
  2014-03-03 18:58                 ` Andi Kleen
@ 2014-03-03 20:30                 ` Don Zickus
  1 sibling, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-03 20:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joe Mario, Andi Kleen, Davidlohr Bueso, acme, LKML, jolsa,
	fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 07:41:17PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 03, 2014 at 01:07:00PM -0500, Joe Mario wrote:
> > If you only sample on the HITMs then you don't get the store misses.
> > That means you'll not be able to detect who is simultaneously tugging
> > on the same cache lines.  That gives up much of the value of "perf
> > c2c".
> 
> As long as you know which lines are hurting bringing in (loads) you can
> often figure out who is doing the stores on them.

If you are familiar with the code, I guess.  One scenario where that would be
difficult is shared memory (ie databases), as it may not be obvious who
is writing to a cacheline.

> 
> > As we developed this, we ended up settling on Ivy Bridge to get the
> > behavior we wanted.
> 
> Wouldn't SNB also work?

We started with SNB, but because latency info was not working (due to a GO
bug) and some other event quirks, we found ourselves migrating to a more
stable IVB platform instead.

Now that we have stabilized our test case and tool, SNB may work.  We just
haven't tried in a long time.

Cheers,
Don


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 18:58                 ` Andi Kleen
  2014-03-03 19:48                   ` Peter Zijlstra
@ 2014-03-03 20:32                   ` Don Zickus
  2014-03-03 21:38                     ` Andi Kleen
  1 sibling, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-03-03 20:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Joe Mario, Davidlohr Bueso, acme, LKML, jolsa,
	fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 07:58:19PM +0100, Andi Kleen wrote:
> On Mon, Mar 03, 2014 at 07:41:17PM +0100, Peter Zijlstra wrote:
> > On Mon, Mar 03, 2014 at 01:07:00PM -0500, Joe Mario wrote:
> > > If you only sample on the HITMs then you don't get the store misses.
> > > That means you'll not be able to detect who is simultaneously tugging
> > > on the same cache lines.  That gives up much of the value of "perf
> > > c2c".
> > 
> > As long as you know which lines are hurting bringing in (loads) you can
> > often figure out who is doing the stores on them.
> 
> Yes, especially since every store is a load too (unless you're talking
> WC)

Thoughts on how to determine which load is a potential store?  I agree
every store needs to load the cacheline, but I wasn't sure if there was an
approach that could be applied to determine anything useful.

Cheers,
Don

> 
> The method c2c uses is more exact, but keep in mind it's a sampling
> heuristic in any case, with some potential bias. load-latency tags
> the loads randomly and there's no guarantee that tagging is fully
> uniform. Also you only see a subset in any case.
> 
> > 
> > > As we developed this, we ended up settling on Ivy Bridge to get the
> > > behavior we wanted.
> > 
> > Wouldn't SNB also work?
> 
> Yes.
> 
> Haswell is best however because it can report addresses on far more
> events.
> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 20:26             ` Don Zickus
@ 2014-03-03 21:36               ` Andi Kleen
  2014-03-04  9:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-03-03 21:36 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andi Kleen, Davidlohr Bueso, acme, LKML, jolsa, jmario, fowles,
	eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Peter Zijlstra, Richard Fowles

> Heh.  I never thought about that.  And sure enough a quick test with
> mem-stores commented out produced the same results (minus the stores).
> 
> One would just have to 'figure' out what cacheline offsets are causing the
> HITMs.

Often that can be determined statically from the instruction (register-offset)

However keep in mind that there is some skid so the instruction may not
be correct.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 20:32                   ` Don Zickus
@ 2014-03-03 21:38                     ` Andi Kleen
  2014-03-03 21:41                       ` Don Zickus
  0 siblings, 1 reply; 56+ messages in thread
From: Andi Kleen @ 2014-03-03 21:38 UTC (permalink / raw)
  To: Don Zickus
  Cc: Andi Kleen, Peter Zijlstra, Joe Mario, Davidlohr Bueso, acme,
	LKML, jolsa, fowles, eranian, Arnaldo Carvalho de Melo,
	David Ahern, Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

> Thoughts on how to determine which load is a potential store?  I agree
> every store needs to load the cacheline, but I wasn't sure if there was an
> approach that could be applied to determine anything useful.

HITM covers both load and stores.

But it may have some skid.  The main advantage of the address is that
it is skidless.

-Andi

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 21:38                     ` Andi Kleen
@ 2014-03-03 21:41                       ` Don Zickus
  0 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-03 21:41 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Joe Mario, Davidlohr Bueso, acme, LKML, jolsa,
	fowles, eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 10:38:36PM +0100, Andi Kleen wrote:
> > Thoughts on how to determine which load is a potential store?  I agree
> > every store needs to load the cacheline, but I wasn't sure if there was an
> > approach that could be applied to determine anything useful.
> 
> HITM covers both load and stores.
> 
> But it may have some skid.  The main advantage of the address is that
> it is skidless.

I guess I don't follow, how does having a skidless address help me when I
don't have any stores (on a Westmere for example)?

Cheers,
Don

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/19] perf c2c: Shared data analyser
  2014-03-03 21:36               ` Andi Kleen
@ 2014-03-04  9:42                 ` Peter Zijlstra
  0 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2014-03-04  9:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Don Zickus, Davidlohr Bueso, acme, LKML, jolsa, jmario, fowles,
	eranian, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

On Mon, Mar 03, 2014 at 10:36:25PM +0100, Andi Kleen wrote:
> > Heh.  I never thought about that.  And sure enough a quick test with
> > mem-stores commented out produced the same results (minus the stores).
> > 
> > One would just have to 'figure' out what cacheline offsets are causing the
> > HITMs.
> 
> Often that can be determined statically from the instruction (register-offset)
> 
> However keep in mind that there is some skid so the instruction may not
> be correct.

Even with the real_ip pebs field? I thought the whole purpose of that
was to tag the actual instruction matching the event.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/19] perf, sort:  Add physid sorting based on mmap2 data
  2014-02-28 17:42 ` [PATCH 02/19] perf, sort: Add physid sorting based on mmap2 data Don Zickus
@ 2014-03-19 10:45   ` Jiri Olsa
  2014-03-19 13:36     ` Don Zickus
  0 siblings, 1 reply; 56+ messages in thread
From: Jiri Olsa @ 2014-03-19 10:45 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 12:42:51PM -0500, Don Zickus wrote:

SNIP

> +static int64_t
> +sort__physid_cmp(struct hist_entry *left, struct hist_entry *right)
> +{
> +        u64 l, r;
> +        struct map *l_map = left->mem_info->daddr.map;
> +        struct map *r_map = right->mem_info->daddr.map;
> +
> +	/* store all NULL mem maps at the bottom */
> +	/* shouldn't even need this check, should have stubs */
> +	if (!left->mem_info->daddr.map || !right->mem_info->daddr.map)
> +		return 1;
> +
> +        /* group event types together */
> +        if (left->cpumode > right->cpumode) return -1;
> +        if (left->cpumode < right->cpumode) return 1;
> +
> +        if (l_map->maj > r_map->maj) return -1;
> +        if (l_map->maj < r_map->maj) return 1;
> +
> +        if (l_map->min > r_map->min) return -1;
> +        if (l_map->min < r_map->min) return 1;
> +
> +        if (l_map->ino > r_map->ino) return -1;
> +        if (l_map->ino < r_map->ino) return 1;
> +
> +        if (l_map->ino_generation > r_map->ino_generation) return -1;
> +        if (l_map->ino_generation < r_map->ino_generation) return 1;
> +
> +        /*
> +         * Addresses with no major/minor numbers are assumed to be
> +         * anonymous in userspace.  Sort those on pid then address.
> +         *
> +         * The kernel and non-zero major/minor mapped areas are
> +         * assumed to be unity mapped.  Sort those on address then pid.
> +         */
> +
> +        /* al_addr does all the right addr - start + offset calculations */
> +        l = left->mem_info->daddr.al_addr;
> +        r = right->mem_info->daddr.al_addr;
> +
> +        if (l_map->maj || l_map->min || l_map->ino || l_map-> ino_generation) {
> +                /* mmapped areas */
> +
> +                /* hack to mark similar regions, 'right' is new entry */
> +                /* entries with same maj/min/ino/inogen are in same address space */
> +                right->color = TRUE;
> +
> +                if (l > r) return -1;
> +                if (l < r) return 1;
> +
> +                /* sorting by iaddr makes calculations easier later */
> +                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
> +                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
> +
> +                if (left->thread->pid_ > right->thread->pid_) return -1;
> +                if (left->thread->pid_ < right->thread->pid_) return 1;
> +
> +                if (left->thread->tid > right->thread->tid) return -1;
> +                if (left->thread->tid < right->thread->tid) return 1;
> +        } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> +                /* kernel mapped areas where 'start' doesn't matter */
> +
> +                /* hack to mark similar regions, 'right' is new entry */
> +                /* whole kernel region is in the same address space */
> +                right->color = TRUE;
> +
> +                if (l > r) return -1;
> +                if (l < r) return 1;
> +
> +                /* sorting by iaddr makes calculations easier later */
> +                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
> +                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
> +
> +                if (left->thread->pid_ > right->thread->pid_) return -1;
> +                if (left->thread->pid_ < right->thread->pid_) return 1;
> +
> +                if (left->thread->tid > right->thread->tid) return -1;
> +                if (left->thread->tid < right->thread->tid) return 1;
> +        } else {
> +                /* userspace anonymous */
> +                if (left->thread->pid_ > right->thread->pid_) return -1;
> +                if (left->thread->pid_ < right->thread->pid_) return 1;
> +
> +                if (left->thread->tid > right->thread->tid) return -1;
> +                if (left->thread->tid < right->thread->tid) return 1;
> +
> +	         /* hack to mark similar regions, 'right' is new entry */
> +                /* userspace anonymous address space is contained within pid */
> +                right->color = TRUE;
> +
> +                if (l > r) return -1;
> +                if (l < r) return 1;
> +
> +                /* sorting by iaddr makes calculations easier later */
> +                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
> +                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
> +        }

do you need a single column for 'physid'?

my first idea was to have separate sort entries for all checked entries
(the same way as for memory_sort_dimensions):

  - mem_info->daddr.al_addr (we already have 'addr' check)
  - mem_info->iaddr.al_addr
  - thread->pid_ (we have only 'tid' check so far)
  - l_map->maj, l_map->min, l_map->ino, l_map->ino_generation (we could probably group these)

and init sort order with:

   sort_order = "physid,pid,...";

'...' is whatever name you choose for above entries


you can check the sort entries cmp function call
logic in hist_entry__cmp function
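
The idea of chaining separate sort entries can be sketched as below. This is a simplified stand-in, not the perf code: struct sample, the cmp_* helpers and sample_cmp() are hypothetical, and the sketch assumes the chain walks the configured keys in order and returns the first non-zero result. It uses the usual convention of a negative value when left sorts before right.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a hist entry. */
struct sample {
	uint64_t daddr;	/* data address */
	uint64_t iaddr;	/* instruction address */
	int pid;
};

typedef int64_t (*cmp_fn)(const struct sample *l, const struct sample *r);

static int64_t cmp_daddr(const struct sample *l, const struct sample *r)
{
	return (l->daddr > r->daddr) - (l->daddr < r->daddr);
}

static int64_t cmp_iaddr(const struct sample *l, const struct sample *r)
{
	return (l->iaddr > r->iaddr) - (l->iaddr < r->iaddr);
}

static int64_t cmp_pid(const struct sample *l, const struct sample *r)
{
	return (l->pid > r->pid) - (l->pid < r->pid);
}

/* Walk the sort keys in order; the first key that differs decides. */
static int64_t sample_cmp(const cmp_fn *keys, size_t nr_keys,
			  const struct sample *l, const struct sample *r)
{
	for (size_t i = 0; i < nr_keys; i++) {
		int64_t cmp = keys[i](l, r);

		if (cmp)
			return cmp;
	}
	return 0;
}
```

With such a chain, the order of the keys array plays the role of the sort_order string: changing the order changes the sort without touching the individual compare functions.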

jirka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/19] perf: Allow ability to map cpus to nodes easily
  2014-02-28 17:42 ` [PATCH 04/19] perf: Allow ability to map cpus to nodes easily Don Zickus
@ 2014-03-19 12:48   ` Jiri Olsa
  2014-03-19 13:38     ` Don Zickus
  2014-03-19 13:22   ` Jiri Olsa
  1 sibling, 1 reply; 56+ messages in thread
From: Jiri Olsa @ 2014-03-19 12:48 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 12:42:53PM -0500, Don Zickus wrote:
> This patch figures out the max number of cpus and nodes that are on the

SNIP

> +
> +int cpu_map__setup_cpunode_map(void)
> +{
> +	struct dirent *dent1, *dent2;
> +	DIR *dir1, *dir2;
> +	unsigned int cpu, mem;
> +	char buf[PATH_MAX];
> +
> +	/* initialize globals */
> +	if (init_cpunode_map())
> +		return -1;
> +
> +	dir1 = opendir(PATH_SYS_NODE);
> +	if (!dir1)
> +		return 0;
> +
> +	/* walk tree and setup map */
> +	while ((dent1 = readdir(dir1)) != NULL) {
> +		if (dent1->d_type != DT_DIR ||
> +		    sscanf(dent1->d_name, "node%u", &mem) < 1)
> +			continue;
> +
> +		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
> +		dir2 = opendir(buf);
> +		if (!dir2)
> +			continue;
> +		while ((dent2 = readdir(dir2)) != NULL) {
> +			if (dent2->d_type != DT_LNK ||
> +			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
> +				continue;
> +			cpunode_map[cpu] = mem;
> +		}
> +		closedir(dir2);
> +	}
> +	closedir(dir1);
> +	return 0;
> +}
> +
> diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
> index b123bb9..d6fde2b 100644
> --- a/tools/perf/util/cpumap.h
> +++ b/tools/perf/util/cpumap.h
> @@ -4,6 +4,9 @@
>  #include <stdio.h>
>  #include <stdbool.h>
>  
> +#include "perf.h"
> +#include "util/debug.h"
> +
>  struct cpu_map {
>  	int nr;
>  	int map[];
> @@ -46,4 +49,36 @@ static inline bool cpu_map__empty(const struct cpu_map *map)
>  	return map ? map->map[0] == -1 : true;
>  }
>  
> +int max_cpu_num;
> +int max_node_num;
> +int *cpunode_map;
> +
> +int cpu_map__setup_cpunode_map(void);
> +
> +static inline int cpu_map__max_node(void)
> +{
> +	if (unlikely(!max_node_num))
> +		pr_debug("cpu_map not initiailzed\n");
> +
> +	return max_node_num;
> +}
> +
> +static inline int cpu_map__max_cpu(void)
> +{
> +	if (unlikely(!max_cpu_num))
> +		pr_debug("cpu_map not initiailzed\n");
> +
> +	return max_cpu_num;
> +}
> +
> +static inline int cpu_map__get_node(int cpu)
> +{
> +	if (unlikely(cpunode_map == NULL)) {
> +		pr_debug("cpu_map not initialized\n");
> +		return -1;
> +	}
> +
> +	return cpunode_map[cpu];
> +}

cool, maybe above function names should not have 'cpu_map..',
it's like pure cpu-ish stuff, maybe:

  setup_cpunode_map
  cpu__max_node
  cpu__max_cpu
  cpu__get_node

or something like that ;-)

jirka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/19] perf, c2c: Add callchain support
  2014-02-28 17:43 ` [PATCH 15/19] perf, c2c: Add callchain support Don Zickus
@ 2014-03-19 13:00   ` Jiri Olsa
  2014-03-19 13:53     ` Don Zickus
  0 siblings, 1 reply; 56+ messages in thread
From: Jiri Olsa @ 2014-03-19 13:00 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 12:43:04PM -0500, Don Zickus wrote:

SNIP

>  
> +static int
> +opt_callchain_cb(const struct option *opt, const char *arg, int unset)
> +{
> +	struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
> +	char *tok, *tok2;
> +	char *endptr;


this still needs to be shared with report code, but
I'm guessing V2 was about moving to hist_entry-ies

jirka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/19] perf: Allow ability to map cpus to nodes easily
  2014-02-28 17:42 ` [PATCH 04/19] perf: Allow ability to map cpus to nodes easily Don Zickus
  2014-03-19 12:48   ` Jiri Olsa
@ 2014-03-19 13:22   ` Jiri Olsa
  1 sibling, 0 replies; 56+ messages in thread
From: Jiri Olsa @ 2014-03-19 13:22 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Fri, Feb 28, 2014 at 12:42:53PM -0500, Don Zickus wrote:

SNIP

> +
> +/* Determine highest possible node in the system for sparse allocation */
> +static void set_max_node_num(void)
> +{
> +	FILE *fp;
> +	char buf[256];
> +	int num;
> +
> +	/* set up default */
> +	max_node_num = 8;
> +
> +	/* get the highest possible cpu number for a sparse allocation */
> +	fp = fopen("/sys/devices/system/node/possible", "r");
> +	if (!fp)
> +		goto out;
> +
> +	num = fread(&buf, 1, sizeof(buf), fp);
> +	if (!num)
> +		goto out_close;
> +	buf[num] = '\0';
> +
> +	/* start on the right, to find highest node num */
> +	while (--num) {
> +		if ((buf[num] == ',') || (buf[num] == '-')) {
> +			num++;
> +			break;
> +		}
> +	}
> +	if (sscanf(&buf[num], "%d", &max_node_num) < 1)
> +		goto out_close;
> +
> +	max_node_num++;
> +
> +	fclose(fp);
> +	return;

these ^^^ inners should go into a single function and be used in
both set_max_cpu_num and set_max_node_num functions
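
A minimal sketch of what such a shared helper might look like. The names parse_max_index() and read_max_index() are hypothetical; the sketch assumes the sysfs file holds a range list such as "0-3" or "0,2-7", and it reads at most sizeof(buf) - 1 bytes so the terminating NUL always fits:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given a range list such as "0-3\n" or "0,2-7",
 * return the highest index plus one, or -1 if no number can be parsed.
 */
static int parse_max_index(const char *list)
{
	const char *p = list + strlen(list);
	int max;

	/* start on the right and walk back to the start of the last number */
	while (p > list && p[-1] != ',' && p[-1] != '-')
		p--;

	if (sscanf(p, "%d", &max) < 1)
		return -1;

	return max + 1;
}

/* Hypothetical helper: read 'path' and parse it, falling back to 'def'. */
static int read_max_index(const char *path, int def)
{
	char buf[256];
	size_t num;
	int max;
	FILE *fp = fopen(path, "r");

	if (!fp)
		return def;

	num = fread(buf, 1, sizeof(buf) - 1, fp);
	fclose(fp);
	buf[num] = '\0';

	max = parse_max_index(buf);
	return max < 0 ? def : max;
}
```

Both callers would then reduce to one line each, for example read_max_index("/sys/devices/system/node/possible", 8) for the node case.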

jirka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/19] perf, sort:  Add physid sorting based on mmap2 data
  2014-03-19 10:45   ` Jiri Olsa
@ 2014-03-19 13:36     ` Don Zickus
  0 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-19 13:36 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Wed, Mar 19, 2014 at 11:45:15AM +0100, Jiri Olsa wrote:
> > +static int64_t
> > +sort__physid_cmp(struct hist_entry *left, struct hist_entry *right)
> > +{
> > +        u64 l, r;
> > +        struct map *l_map = left->mem_info->daddr.map;
> > +        struct map *r_map = right->mem_info->daddr.map;
> > +
> > +	/* store all NULL mem maps at the bottom */
> > +	/* shouldn't even need this check, should have stubs */
> > +	if (!left->mem_info->daddr.map || !right->mem_info->daddr.map)
> > +		return 1;
> > +
> > +        /* group event types together */
> > +        if (left->cpumode > right->cpumode) return -1;
> > +        if (left->cpumode < right->cpumode) return 1;
> > +
> > +        if (l_map->maj > r_map->maj) return -1;
> > +        if (l_map->maj < r_map->maj) return 1;
> > +
> > +        if (l_map->min > r_map->min) return -1;
> > +        if (l_map->min < r_map->min) return 1;
> > +
> > +        if (l_map->ino > r_map->ino) return -1;
> > +        if (l_map->ino < r_map->ino) return 1;
> > +
> > +        if (l_map->ino_generation > r_map->ino_generation) return -1;
> > +        if (l_map->ino_generation < r_map->ino_generation) return 1;
> > +
> > +        /*
> > +         * Addresses with no major/minor numbers are assumed to be
> > +         * anonymous in userspace.  Sort those on pid then address.
> > +         *
> > +         * The kernel and non-zero major/minor mapped areas are
> > +         * assumed to be unity mapped.  Sort those on address then pid.
> > +         */
> > +
> > +        /* al_addr does all the right addr - start + offset calculations */
> > +        l = left->mem_info->daddr.al_addr;
> > +        r = right->mem_info->daddr.al_addr;
> > +
> > +        if (l_map->maj || l_map->min || l_map->ino || l_map-> ino_generation) {
> > +                /* mmapped areas */
> > +
> > +                /* hack to mark similar regions, 'right' is new entry */
> > +                /* entries with same maj/min/ino/inogen are in same address space */
> > +                right->color = TRUE;
> > +
> > +                if (l > r) return -1;
> > +                if (l < r) return 1;
> > +
> > +                /* sorting by iaddr makes calculations easier later */
> > +                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
> > +                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
> > +
> > +                if (left->thread->pid_ > right->thread->pid_) return -1;
> > +                if (left->thread->pid_ < right->thread->pid_) return 1;
> > +
> > +                if (left->thread->tid > right->thread->tid) return -1;
> > +                if (left->thread->tid < right->thread->tid) return 1;
> > +        } else if (left->cpumode == PERF_RECORD_MISC_KERNEL) {
> > +                /* kernel mapped areas where 'start' doesn't matter */
> > +
> > +                /* hack to mark similar regions, 'right' is new entry */
> > +                /* whole kernel region is in the same address space */
> > +                right->color = TRUE;
> > +
> > +                if (l > r) return -1;
> > +                if (l < r) return 1;
> > +
> > +                /* sorting by iaddr makes calculations easier later */
> > +                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
> > +                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
> > +
> > +                if (left->thread->pid_ > right->thread->pid_) return -1;
> > +                if (left->thread->pid_ < right->thread->pid_) return 1;
> > +
> > +                if (left->thread->tid > right->thread->tid) return -1;
> > +                if (left->thread->tid < right->thread->tid) return 1;
> > +        } else {
> > +                /* userspace anonymous */
> > +                if (left->thread->pid_ > right->thread->pid_) return -1;
> > +                if (left->thread->pid_ < right->thread->pid_) return 1;
> > +
> > +                if (left->thread->tid > right->thread->tid) return -1;
> > +                if (left->thread->tid < right->thread->tid) return 1;
> > +
> > +	         /* hack to mark similar regions, 'right' is new entry */
> > +                /* userspace anonymous address space is contained within pid */
> > +                right->color = TRUE;
> > +
> > +                if (l > r) return -1;
> > +                if (l < r) return 1;
> > +
> > +                /* sorting by iaddr makes calculations easier later */
> > +                if (left->mem_info->iaddr.al_addr > right->mem_info->iaddr.al_addr) return -1;
> > +                if (left->mem_info->iaddr.al_addr < right->mem_info->iaddr.al_addr) return 1;
> > +        }
> 
> do you need single column for 'physid' ?
> 
> my first idea was to have separate sort entries for all checked entries:
> (the same way as for memory_sort_dimensions)
> 
>   - mem_info->daddr.al_addr (we already have 'addr' check)
>   - mem_info->iaddr.al_addr
>   - thread->pid_ (we have only 'tid' check so far)
>   - l_map->maj, l_map->min, l_map->ino, l_map->ino_generation (we could probably group these)
> 
> and init sort order with:
> 
>    sort_order = "physid,pid,...";
> 
> '...' is whatever name you choose for above entries

The problem is (as you can see in the code above) that physid is _not_
just one piece of data.  It depends mainly on four things:

- cpumode
- major, minor, inode, inode_generation (call it mmap2 data)
- data address
- pid

If cpumode == KERNEL, sort in this order
  - mmap2 data
  - data address
  - pid (optional)

If cpumode == USERSPACE and mmap2 data != 0, sort in this order
  - mmap2 data
  - data address
  - pid (optional)

If cpumode == USERSPACE and mmap2 data == 0, sort in this order
  - pid
  - data address

Notice how sorting on the pid is different depending on the scenario.
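
A minimal sketch of the three orderings above as a single comparator.
The struct sample_key, its field names, and physid_cmp() are
hypothetical and flattened for illustration; the patch itself reads
these values from hist_entry.  Two things here are assumptions that the
list above does not settle: entries from different cases are ordered by
case first, and values sort ascending (the quoted patch returns -1 when
left > right).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened key, not the real hist_entry layout. */
struct sample_key {
	bool     kernel;   /* cpumode == KERNEL */
	uint32_t maj, min; /* mmap2 data: device major/minor */
	uint64_t ino;      /* mmap2 data: inode */
	uint64_t ino_gen;  /* mmap2 data: inode generation */
	uint64_t daddr;    /* data address */
	int      pid;
};

static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

static int cmp_int(int a, int b)
{
	return (a > b) - (a < b);
}

/* 0 = kernel, 1 = userspace with mmap2 data, 2 = userspace anonymous */
static int key_class(const struct sample_key *k)
{
	if (k->kernel)
		return 0;
	return (k->maj || k->min || k->ino || k->ino_gen) ? 1 : 2;
}

static int cmp_mmap2(const struct sample_key *l, const struct sample_key *r)
{
	int ret;

	if ((ret = cmp_u64(l->maj, r->maj)))
		return ret;
	if ((ret = cmp_u64(l->min, r->min)))
		return ret;
	if ((ret = cmp_u64(l->ino, r->ino)))
		return ret;
	return cmp_u64(l->ino_gen, r->ino_gen);
}

int physid_cmp(const struct sample_key *l, const struct sample_key *r)
{
	int lc, rc, ret;

	assert(l && r);
	lc = key_class(l);
	rc = key_class(r);

	/* Assumption: different cases never compare equal. */
	if (lc != rc)
		return cmp_int(lc, rc);

	if (lc == 2) {
		/* Anonymous memory is private to a pid: pid comes first. */
		if ((ret = cmp_int(l->pid, r->pid)))
			return ret;
		return cmp_u64(l->daddr, r->daddr);
	}

	/* Kernel or file-backed: mmap2 data, then address, then pid. */
	if ((ret = cmp_mmap2(l, r)))
		return ret;
	if ((ret = cmp_u64(l->daddr, r->daddr)))
		return ret;
	return cmp_int(l->pid, r->pid); /* the optional pid step */
}
```

The position of the pid comparison is the only thing that moves between
the branches, which is why a flat "physid,pid,..." sort order cannot
express it.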

I'll agree that sorting on iaddr and tid can be filtered out as a nice to
have.

But in order to sort on physid, you really need lots of pieces.
Otherwise, what do you consider the definition of 'physid'? :-)

Cheers,
Don


* Re: [PATCH 04/19] perf: Allow ability to map cpus to nodes easily
  2014-03-19 12:48   ` Jiri Olsa
@ 2014-03-19 13:38     ` Don Zickus
  0 siblings, 0 replies; 56+ messages in thread
From: Don Zickus @ 2014-03-19 13:38 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Wed, Mar 19, 2014 at 01:48:27PM +0100, Jiri Olsa wrote:
> On Fri, Feb 28, 2014 at 12:42:53PM -0500, Don Zickus wrote:
> > This patch figures out the max number of cpus and nodes that are on the
> 
> SNIP
> 
> > +
> > +int cpu_map__setup_cpunode_map(void)
> > +{
> > +	struct dirent *dent1, *dent2;
> > +	DIR *dir1, *dir2;
> > +	unsigned int cpu, mem;
> > +	char buf[PATH_MAX];
> > +
> > +	/* initialize globals */
> > +	if (init_cpunode_map())
> > +		return -1;
> > +
> > +	dir1 = opendir(PATH_SYS_NODE);
> > +	if (!dir1)
> > +		return 0;
> > +
> > +	/* walk tree and setup map */
> > +	while ((dent1 = readdir(dir1)) != NULL) {
> > +		if (dent1->d_type != DT_DIR ||
> > +		    sscanf(dent1->d_name, "node%u", &mem) < 1)
> > +			continue;
> > +
> > +		snprintf(buf, PATH_MAX, "%s/%s", PATH_SYS_NODE, dent1->d_name);
> > +		dir2 = opendir(buf);
> > +		if (!dir2)
> > +			continue;
> > +		while ((dent2 = readdir(dir2)) != NULL) {
> > +			if (dent2->d_type != DT_LNK ||
> > +			    sscanf(dent2->d_name, "cpu%u", &cpu) < 1)
> > +				continue;
> > +			cpunode_map[cpu] = mem;
> > +		}
> > +		closedir(dir2);
> > +	}
> > +	closedir(dir1);
> > +	return 0;
> > +}
> > +
> > diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
> > index b123bb9..d6fde2b 100644
> > --- a/tools/perf/util/cpumap.h
> > +++ b/tools/perf/util/cpumap.h
> > @@ -4,6 +4,9 @@
> >  #include <stdio.h>
> >  #include <stdbool.h>
> >  
> > +#include "perf.h"
> > +#include "util/debug.h"
> > +
> >  struct cpu_map {
> >  	int nr;
> >  	int map[];
> > @@ -46,4 +49,36 @@ static inline bool cpu_map__empty(const struct cpu_map *map)
> >  	return map ? map->map[0] == -1 : true;
> >  }
> >  
> > +int max_cpu_num;
> > +int max_node_num;
> > +int *cpunode_map;
> > +
> > +int cpu_map__setup_cpunode_map(void);
> > +
> > +static inline int cpu_map__max_node(void)
> > +{
> > +	if (unlikely(!max_node_num))
> > +		pr_debug("cpu_map not initialized\n");
> > +
> > +	return max_node_num;
> > +}
> > +
> > +static inline int cpu_map__max_cpu(void)
> > +{
> > +	if (unlikely(!max_cpu_num))
> > +		pr_debug("cpu_map not initialized\n");
> > +
> > +	return max_cpu_num;
> > +}
> > +
> > +static inline int cpu_map__get_node(int cpu)
> > +{
> > +	if (unlikely(cpunode_map == NULL)) {
> > +		pr_debug("cpu_map not initialized\n");
> > +		return -1;
> > +	}
> > +
> > +	return cpunode_map[cpu];
> > +}
> 
> cool, maybe above function names should not have 'cpu_map..',
> it's like pure cpu-ish stuff, maybe:
> 
>   setup_cpunode_map
>   cpu__max_node
>   cpu__max_cpu
>   cpu__get_node
> 
> or something like that ;-)

Sure.  That works for me. :-)
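
For illustration only, a small standalone sketch of a cpu-to-node
lookup in the spirit of the suggested names.  It is not the patch:
struct cpunode_map, cpunode_map__init() and cpunode_map__add() are
hypothetical, and cpu__get_node() takes an extra map argument here.
The sketch keeps its state in a struct rather than in globals defined
in a header, and it range-checks the cpu index before touching the
array, which the quoted code does not do.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container; the patch uses global variables instead. */
struct cpunode_map {
	int  nr_cpus;
	int *node; /* node[cpu], -1 when unknown */
};

int cpunode_map__init(struct cpunode_map *m, int nr_cpus)
{
	int i;

	assert(m != NULL);
	if (nr_cpus <= 0)
		return -1;

	m->node = malloc(nr_cpus * sizeof(*m->node));
	if (!m->node)
		return -1;

	m->nr_cpus = nr_cpus;
	for (i = 0; i < nr_cpus; i++)
		m->node[i] = -1;
	return 0;
}

/*
 * Record one directory entry pair, e.g. "node1" containing "cpu12".
 * Names that do not match (such as "cpulist") are rejected, and so is
 * any cpu number that would index past the end of the array.
 */
int cpunode_map__add(struct cpunode_map *m, const char *node_name,
		     const char *cpu_name)
{
	unsigned int node, cpu;

	if (sscanf(node_name, "node%u", &node) != 1 ||
	    sscanf(cpu_name, "cpu%u", &cpu) != 1)
		return -1;
	if (cpu >= (unsigned int)m->nr_cpus)
		return -1;

	m->node[cpu] = (int)node;
	return 0;
}

int cpu__get_node(const struct cpunode_map *m, int cpu)
{
	if (!m || !m->node || cpu < 0 || cpu >= m->nr_cpus)
		return -1;
	return m->node[cpu];
}
```

A caller would feed cpunode_map__add() with the d_name values found
while walking the node directories, as the readdir() loops in the
quoted patch do.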

Cheers,
Don


* Re: [PATCH 15/19] perf, c2c: Add callchain support
  2014-03-19 13:00   ` Jiri Olsa
@ 2014-03-19 13:53     ` Don Zickus
  2014-03-19 14:05       ` Jiri Olsa
  0 siblings, 1 reply; 56+ messages in thread
From: Don Zickus @ 2014-03-19 13:53 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, eranian

On Wed, Mar 19, 2014 at 02:00:35PM +0100, Jiri Olsa wrote:
> On Fri, Feb 28, 2014 at 12:43:04PM -0500, Don Zickus wrote:
> 
> SNIP
> 
> >  
> > +static int
> > +opt_callchain_cb(const struct option *opt, const char *arg, int unset)
> > +{
> > +	struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
> > +	char *tok, *tok2;
> > +	char *endptr;
> 
> 
> this still needs to be shared with report code, but
> I'm guessing V2 was about moving to hist_entry-ies

I guess I don't understand this?  I don't use the report code.  Is there
another way I can get this?

Cheers,
Don


* Re: [PATCH 15/19] perf, c2c: Add callchain support
  2014-03-19 13:53     ` Don Zickus
@ 2014-03-19 14:05       ` Jiri Olsa
  0 siblings, 0 replies; 56+ messages in thread
From: Jiri Olsa @ 2014-03-19 14:05 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, eranian

On Wed, Mar 19, 2014 at 09:53:28AM -0400, Don Zickus wrote:
> On Wed, Mar 19, 2014 at 02:00:35PM +0100, Jiri Olsa wrote:
> > On Fri, Feb 28, 2014 at 12:43:04PM -0500, Don Zickus wrote:
> > 
> > SNIP
> > 
> > >  
> > > +static int
> > > +opt_callchain_cb(const struct option *opt, const char *arg, int unset)
> > > +{
> > > +	struct perf_c2c *c2c = (struct perf_c2c *)opt->value;
> > > +	char *tok, *tok2;
> > > +	char *endptr;
> > 
> > 
> > this still needs to be shared with report code, but
> > I'm guessing V2 was about moving to hist_entry-ies
> 
> I guess I don't understand this?  I don't use the report code.  Is there
> another way I can get this?

opt_callchain_cb is the same as the parse_callchain_opt function,
so the argument parsing code could be shared
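
A rough sketch of what sharing could look like: one helper that turns
the option string into a plain struct, called from both option
callbacks.  Everything named here (struct cc_opts, enum cc_mode,
cc_parse) is hypothetical, and the accepted grammar "mode[,min_percent]"
is a simplification for illustration, not the full set of arguments the
report code accepts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, chosen for this sketch only. */
enum cc_mode { CC_NONE, CC_FLAT, CC_GRAPH, CC_FRACTAL };

struct cc_opts {
	enum cc_mode mode;
	double       min_percent;
};

/* Parse "mode[,min_percent]"; returns 0 on success, -1 on bad input. */
int cc_parse(const char *arg, struct cc_opts *out)
{
	struct cc_opts tmp = { CC_NONE, 0.0 };
	char buf[64], *comma, *end;

	assert(out != NULL);
	if (!arg || strlen(arg) >= sizeof(buf))
		return -1;
	strcpy(buf, arg);

	/* Split off the optional second field. */
	comma = strchr(buf, ',');
	if (comma)
		*comma++ = '\0';

	if (!strcmp(buf, "none"))
		tmp.mode = CC_NONE;
	else if (!strcmp(buf, "flat"))
		tmp.mode = CC_FLAT;
	else if (!strcmp(buf, "graph"))
		tmp.mode = CC_GRAPH;
	else if (!strcmp(buf, "fractal"))
		tmp.mode = CC_FRACTAL;
	else
		return -1;

	if (comma) {
		tmp.min_percent = strtod(comma, &end);
		if (end == comma || *end != '\0')
			return -1;
	}

	/* Only write the result once the whole string is valid. */
	*out = tmp;
	return 0;
}
```

Each tool's callback would then shrink to a call to the shared helper
plus whatever tool-specific state it has to update.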

jirka


Thread overview: 56+ messages
2014-02-28 17:42 [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
2014-02-28 17:42 ` [PATCH 01/19] Revert "perf: Disable PERF_RECORD_MMAP2 support" Don Zickus
2014-02-28 17:42 ` [PATCH 02/19] perf, sort: Add physid sorting based on mmap2 data Don Zickus
2014-03-19 10:45   ` Jiri Olsa
2014-03-19 13:36     ` Don Zickus
2014-02-28 17:42 ` [PATCH 03/19] perf, sort: Allow unique sorting instead of combining hist_entries Don Zickus
2014-02-28 17:42 ` [PATCH 04/19] perf: Allow ability to map cpus to nodes easily Don Zickus
2014-03-19 12:48   ` Jiri Olsa
2014-03-19 13:38     ` Don Zickus
2014-03-19 13:22   ` Jiri Olsa
2014-02-28 17:42 ` [PATCH 05/19] perf, kmem: Utilize the new generic cpunode_map Don Zickus
2014-02-28 17:42 ` [PATCH 06/19] perf: Fix stddev calculation Don Zickus
2014-02-28 17:42 ` [PATCH 07/19] perf, callchain: Add generic callchain print handler for stdio Don Zickus
2014-02-28 17:42 ` [PATCH 08/19] perf c2c: Shared data analyser Don Zickus
2014-02-28 19:08   ` Andi Kleen
2014-02-28 19:46     ` Don Zickus
2014-02-28 21:03       ` Davidlohr Bueso
2014-02-28 22:28         ` Joe Mario
2014-03-01  0:50           ` Andi Kleen
2014-03-03 14:13         ` Don Zickus
2014-03-03 15:05         ` Don Zickus
2014-03-03 17:23           ` Andi Kleen
2014-03-03 18:07             ` Joe Mario
2014-03-03 18:41               ` Peter Zijlstra
2014-03-03 18:58                 ` Andi Kleen
2014-03-03 19:48                   ` Peter Zijlstra
2014-03-03 20:32                   ` Don Zickus
2014-03-03 21:38                     ` Andi Kleen
2014-03-03 21:41                       ` Don Zickus
2014-03-03 20:30                 ` Don Zickus
2014-03-03 20:26             ` Don Zickus
2014-03-03 21:36               ` Andi Kleen
2014-03-04  9:42                 ` Peter Zijlstra
2014-03-03 18:21           ` Davidlohr Bueso
2014-02-28 17:42 ` [PATCH 09/19] perf c2c: Dump raw records, decode data_src bits Don Zickus
2014-02-28 17:42 ` [PATCH 10/19] perf, c2c: Rework setup code to prepare for features Don Zickus
2014-02-28 17:43 ` [PATCH 11/19] perf, c2c: Add in sort on physid Don Zickus
2014-02-28 18:59   ` Andi Kleen
2014-02-28 19:44     ` Don Zickus
2014-03-01  1:07       ` Andi Kleen
2014-03-01  1:27         ` Namhyung Kim
2014-02-28 17:43 ` [PATCH 12/19] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
2014-02-28 17:43 ` [PATCH 13/19] perf, c2c: Sort based on hottest cache line Don Zickus
2014-02-28 17:43 ` [PATCH 14/19] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
2014-02-28 17:43 ` [PATCH 15/19] perf, c2c: Add callchain support Don Zickus
2014-03-19 13:00   ` Jiri Olsa
2014-03-19 13:53     ` Don Zickus
2014-03-19 14:05       ` Jiri Olsa
2014-02-28 17:43 ` [PATCH 16/19] perf, c2c: Output summary stats Don Zickus
2014-02-28 17:43 ` [PATCH 17/19] perf, c2c: Dump rbtree for debugging Don Zickus
2014-02-28 17:43 ` [PATCH 18/19] perf, c2c: Add symbol count table Don Zickus
2014-02-28 17:43 ` [PATCH 19/19] perf, c2c: Add shared cachline summary table Don Zickus
2014-02-28 18:57 ` [PATCH 00/19 V2] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Andi Kleen
2014-02-28 19:42   ` Don Zickus
2014-02-28 21:54     ` Andi Kleen
2014-03-03 14:04       ` Don Zickus
