All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
@ 2014-03-24 19:36 Don Zickus
  2014-03-24 19:36 ` [PATCH 01/15 V3] perf: Fix stddev calculation Don Zickus
                   ` (14 more replies)
  0 siblings, 15 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

With the introduction of NUMA systems, came the possibility of remote memory accesses.
Combine those remote memory accesses with contention on the remote node (ie a modified
cacheline) and you have a possibility for very long latencies.  These latencies can
bottleneck a program.

The program added by these patches, helps detect the situation where two nodes are
'tugging' on the same _data_ cacheline.  The term used through out this program and
the various changelogs is called a HITM.  This means nodeX went to read a cacheline
and it was discovered to be loaded in nodeY's LLC cache (hence the cacheHIT). The 
remote cacheline was also in a 'M'odified state thus creating a 'HIT M' for hit in
a modified state.  HITMs can happen locally and remotely.  This program's interest
is mainly in remote HITMs as they cause the longest latencies.

Why a program has a remote HITM derives from how the two nodes are 'sharing' the
cacheline.  Is the sharing intentional ("true") or unintentional ("false").  We have seen
lots of "false" sharing cases, which lead to simple solutions such as seperating the data
onto different cachelines.  

This tool does not distinguish between 'true' or 'false' sharing, instead it just points to
the more expensive sharing situations under the current workload.  It is up to the user
to understand what the workload is doing to determine whether a problem exists or not and
how to report it.

The data output is verbose and there are lots of data tables that interprit the latencies
and data addresses in different ways to help see where bottlenecks might be lying.

Most of this idea, work and calculations were done by Dick Fowles.  My work mainly
includes porting it to perf.  Joe Mario has contributed greatly with ideas to make the
output more informative based on his usage of the tool.  Joe has found a handful of
bottlenecks using various industry benchmarks and has worked with developers to fix
them.

I would also like to thank Stephane Eranian for his early help and guidance on 
navigating the differences between the current perf tool and how similar tools
looked at HP.  And also his tireless work in getting the MMAP2 interface to stick.

Also thanks to Arnaldo and Jiri Olso for their help in suggestions for this tool.

I also have a test program that generated a controlled number of HITMs that we used
frequently to validate our early work (the Intel docs were not always clear which
bits had to be set and some arches do not work well).  I would like to add it, but
didn't know how (nor did I spend any serious time looking either).

This program has been tested primarily on Intel's Ivy Bridge platforms.  The Sandy Bridge
platforms had some quirks that were fixed on Ivy Bridge.  We haven't tried Haswell as
that has a re-worked latency event implementation.

A handful of patches include re-enabling MMAP2 support and some fixes to perf itself.  One
in particular hacks up how standard deviation is calculated.  It works with our calculations
but may break other tools expectations.  Feedback is welcomed.

Comemnts, feedback, anything else welcomed.

V3: updated to latest perf/core branch 5aed711
    support for older machines like Westmere (Peter Z)
    cleanups and fixes
    broke out physid support into another patchset
    still need to address output deficiencies (Andi Kleen)

V2: updated to latest perf/core branch 1029f9fedf87fa6
    switched to hist_entry based on Jiri O's suggestion
    dropped latency analyze for now until this patchset is accepted
    little fixes and tweaks

Signed-off-by: Don Zickus <dzickus@redhat.com>

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (19):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, machine: Use map as success in ip__resolve_ams
  perf, session: Change header.misc dump from decimal to hex
  perf, stat: FIXME Stddev calculation is incorrect
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add rbtree sorted on mmap2 data
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Fixup tid because of perf map is broken
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table
  perf, c2c: Add framework to analyze latency and display summary stats
  perf, c2c: Add selected extreme latencies to output cacheline stats
    table
  perf, c2c: Add summary latency table for various parts of caches

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 2963 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evlist.c            |   37 +
 tools/perf/util/evlist.h            |    7 +
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.h              |    4 +
 tools/perf/util/machine.c           |    2 +-
 tools/perf/util/session.c           |    2 +-
 tools/perf/util/stat.c              |    3 +-
 15 files changed, 3097 insertions(+), 24 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (17):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, sort:  Add physid sorting based on mmap2 data
  perf, sort:  Allow unique sorting instead of combining hist_entries
  perf: Allow ability to map cpus to nodes easily
  perf, kmem: Utilize the new generic cpunode_map
  perf: Fix stddev calculation
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add in sort on physid
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 1787 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin-kmem.c           |   78 +-
 tools/perf/builtin-report.c         |    2 +-
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/cpumap.c            |  150 +++
 tools/perf/util/cpumap.h            |   35 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.c              |   10 +-
 tools/perf/util/hist.h              |    5 +
 tools/perf/util/sort.c              |  149 +++
 tools/perf/util/sort.h              |    4 +
 tools/perf/util/stat.c              |   13 +
 tools/perf/util/stat.h              |    1 +
 19 files changed, 2236 insertions(+), 101 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7

*** BLURB HERE ***

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (13):
  perf: Fix stddev calculation
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add in new options to configure latency and stores
  perf, c2c: Add in sort on physid
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table

 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 1760 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/hist.h              |    4 +
 tools/perf/util/stat.c              |   13 +
 tools/perf/util/stat.h              |    1 +
 9 files changed, 1840 insertions(+)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/15 V3] perf: Fix stddev calculation
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-03-24 19:36 ` [PATCH 02/15 V3] perf, callchain: Add generic callchain print handler for stdio Don Zickus
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

The stddev calculation written matched standard error.  As a result when
using this result to find the relative stddev between runs, it was not
accurate.

Update the formula to match traditional stddev.  Then rename the old
stddev calculation to stderr_stats in case someone wants to use it.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/util/stat.c | 13 +++++++++++++
 tools/perf/util/stat.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/tools/perf/util/stat.c b/tools/perf/util/stat.c
index 6506b3d..0cb4dbc 100644
--- a/tools/perf/util/stat.c
+++ b/tools/perf/util/stat.c
@@ -33,6 +33,7 @@ double avg_stats(struct stats *stats)
  * http://en.wikipedia.org/wiki/Stddev
  *
  * The std dev of the mean is related to the std dev by:
+ * (also known as standard error)
  *
  *             s
  * s_mean = -------
@@ -41,6 +42,18 @@ double avg_stats(struct stats *stats)
  */
 double stddev_stats(struct stats *stats)
 {
+	double variance;
+
+	if (stats->n < 2)
+		return 0.0;
+
+	variance = stats->M2 / (stats->n - 1);
+
+	return sqrt(variance);
+}
+
+double stderr_stats(struct stats *stats)
+{
 	double variance, variance_mean;
 
 	if (stats->n < 2)
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index ae8ccd7..6f61615 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -12,6 +12,7 @@ struct stats
 void update_stats(struct stats *stats, u64 val);
 double avg_stats(struct stats *stats);
 double stddev_stats(struct stats *stats);
+double stderr_stats(struct stats *stats);
 double rel_stddev_stats(double stddev, double avg);
 
 static inline void init_stats(struct stats *stats)
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 02/15 V3] perf, callchain: Add generic callchain print handler for stdio
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
  2014-03-24 19:36 ` [PATCH 01/15 V3] perf: Fix stddev calculation Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-03-24 19:36 ` [PATCH 03/15 V3] perf c2c: Shared data analyser Don Zickus
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

My initial implementation for rbtree sorting in the c2c tool does not use the
normal history elements.  As a result, adding callchain support (which is
deeply integrated with history elements) is more challenging when trying to
display its output.

To make things simpler for myself (and to avoid rewriting the same code into
the c2c tool), I provided a generic interface that takes an unsorted callchain
list along with its total and relative sample size, and sorts it locally based
on period and calls the appropriate graph function (passing the correct sample
size).

This makes things easier because the c2c tool can be dumber and just collect
callchains and not worry about the magic needed to sort and display them
correctly.

Unfortunately, this is assuming a stdio output only and does not use the other
gui type outputs.

Regardless, this patch provides useful info for the tool right now.  Tweaks and
recommendations for a better approach are welcomed. :-)

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/ui/stdio/hist.c | 37 +++++++++++++++++++++++++++++++++++++
 tools/perf/util/hist.h     |  4 ++++
 2 files changed, 41 insertions(+)

diff --git a/tools/perf/ui/stdio/hist.c b/tools/perf/ui/stdio/hist.c
index d59893e..c98ea06 100644
--- a/tools/perf/ui/stdio/hist.c
+++ b/tools/perf/ui/stdio/hist.c
@@ -529,3 +529,40 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp)
 
 	return ret;
 }
+
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+					u64 total_samples, u64 relative_samples,
+					int left_margin, FILE *fp)
+{
+	struct rb_root sorted_chain;
+	u64 min_callchain_hits;
+
+	if (!symbol_conf.use_callchain)
+		return 0;
+
+	min_callchain_hits = total_samples * (callchain_param.min_percent / 100);
+
+	callchain_param.sort(&sorted_chain, unsorted_callchain,
+				min_callchain_hits, &callchain_param);
+
+	switch (callchain_param.mode) {
+	case CHAIN_GRAPH_REL:
+		return callchain__fprintf_graph(fp, &sorted_chain, relative_samples,
+						left_margin);
+		break;
+	case CHAIN_GRAPH_ABS:
+		return callchain__fprintf_graph(fp, &sorted_chain, total_samples,
+						left_margin);
+		break;
+	case CHAIN_FLAT:
+		return callchain__fprintf_flat(fp, &sorted_chain, total_samples);
+		break;
+	case CHAIN_NONE:
+		break;
+	default:
+		pr_err("Bad callchain mode\n");
+	}
+
+	return 0;
+}
+
diff --git a/tools/perf/util/hist.h b/tools/perf/util/hist.h
index 664d83f..9da9981 100644
--- a/tools/perf/util/hist.h
+++ b/tools/perf/util/hist.h
@@ -128,6 +128,10 @@ size_t events_stats__fprintf(struct events_stats *stats, FILE *fp);
 size_t hists__fprintf(struct hists *hists, bool show_header, int max_rows,
 		      int max_cols, float min_pcnt, FILE *fp);
 
+size_t generic_entry_callchain__fprintf(struct callchain_root *unsorted_callchain,
+					u64 total_samples, u64 relative_samples,
+					int left_margin, FILE *fp);
+
 void hists__filter_by_dso(struct hists *hists);
 void hists__filter_by_thread(struct hists *hists);
 void hists__filter_by_symbol(struct hists *hists);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 03/15 V3] perf c2c: Shared data analyser
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
  2014-03-24 19:36 ` [PATCH 01/15 V3] perf: Fix stddev calculation Don Zickus
  2014-03-24 19:36 ` [PATCH 02/15 V3] perf, callchain: Add generic callchain print handler for stdio Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-04-08  6:59   ` Namhyung Kim
  2014-03-24 19:36 ` [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits Don Zickus
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme
  Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Don Zickus,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

From: Arnaldo Carvalho de Melo <acme@redhat.com>

This is the start of a new perf tool that will collect information about
memory accesses and analyse it to find things like hot cachelines, etc.

This is basically trying to get a prototype written by Richard Fowles
written using the tools/perf coding style and libraries.

Start it from 'perf sched', this patch starts the process by adding the
'record' subcommand to collect the needed mem loads and stores samples.

It also have the basic 'report' skeleton, resolving the sample address
and hooking the events found in a perf.data file with methods to handle
them, right now just printing the resolved perf_sample data structure
after each event name.

[dcz: refreshed to latest upstream changes]

Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fowles <rfowles@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-c2c.c |  22 +++++
 tools/perf/Makefile.perf            |   1 +
 tools/perf/builtin-c2c.c            | 185 ++++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |   1 +
 tools/perf/perf.c                   |   1 +
 5 files changed, 210 insertions(+)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

diff --git a/tools/perf/Documentation/perf-c2c.c b/tools/perf/Documentation/perf-c2c.c
new file mode 100644
index 0000000..4d52798
--- /dev/null
+++ b/tools/perf/Documentation/perf-c2c.c
@@ -0,0 +1,22 @@
+perf-c2c(1)
+===========
+
+NAME
+----
+perf-c2c - Shared Data C2C/HITM Analyzer.
+
+SYNOPSIS
+--------
+[verse]
+'perf c2c' record
+
+DESCRIPTION
+-----------
+These are the variants of perf c2c:
+
+  'perf c2c record <command>' to record the memory accesses of an arbitrary
+  workload.
+
+SEE ALSO
+--------
+linkperf:perf-record[1], linkperf:perf-mem[1]
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 667e85a..069bdca 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -431,6 +431,7 @@ BUILTIN_OBJS += $(OUTPUT)bench/futex-hash.o
 BUILTIN_OBJS += $(OUTPUT)bench/futex-wake.o
 BUILTIN_OBJS += $(OUTPUT)bench/futex-requeue.o
 
+BUILTIN_OBJS += $(OUTPUT)builtin-c2c.o
 BUILTIN_OBJS += $(OUTPUT)builtin-diff.o
 BUILTIN_OBJS += $(OUTPUT)builtin-evlist.o
 BUILTIN_OBJS += $(OUTPUT)builtin-help.o
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
new file mode 100644
index 0000000..2935484
--- /dev/null
+++ b/tools/perf/builtin-c2c.c
@@ -0,0 +1,185 @@
+#include "builtin.h"
+#include "cache.h"
+
+#include "util/evlist.h"
+#include "util/parse-options.h"
+#include "util/session.h"
+#include "util/tool.h"
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct perf_c2c {
+	struct perf_tool tool;
+};
+
+static int perf_sample__fprintf(struct perf_sample *sample,
+				struct perf_evsel *evsel,
+				struct addr_location *al, FILE *fp)
+{
+	return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x016%" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
+		       perf_evsel__name(evsel),
+		       sample->pid, sample->tid, sample->ip, sample->addr,
+		       sample->weight, sample->data_src,
+		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+		       al->sym ? al->sym->name : "???");
+}
+
+static int perf_c2c__process_load(struct perf_evsel *evsel,
+				  struct perf_sample *sample,
+				  struct addr_location *al)
+{
+	perf_sample__fprintf(sample, evsel, al, stdout);
+	return 0;
+}
+
+static int perf_c2c__process_store(struct perf_evsel *evsel,
+				   struct perf_sample *sample,
+				   struct addr_location *al)
+{
+	perf_sample__fprintf(sample, evsel, al, stdout);
+	return 0;
+}
+
+static const struct perf_evsel_str_handler handlers[] = {
+	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
+	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
+};
+
+typedef int (*sample_handler)(struct perf_evsel *evsel,
+			      struct perf_sample *sample,
+			      struct addr_location *al);
+
+static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+				    union perf_event *event,
+				    struct perf_sample *sample,
+				    struct perf_evsel *evsel,
+				    struct machine *machine)
+{
+	struct addr_location al;
+	int err = 0;
+
+	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
+		pr_err("problem processing %d event, skipping it.\n",
+		       event->header.type);
+		return -1;
+	}
+
+	if (evsel->handler != NULL) {
+		sample_handler f = evsel->handler;
+		err = f(evsel, sample, &al);
+	}
+
+	return err;
+}
+
+static int perf_c2c__read_events(struct perf_c2c *c2c)
+{
+	int err = -1;
+	struct perf_session *session;
+	struct perf_data_file file = {
+			.path = input_name,
+			.mode = PERF_DATA_MODE_READ,
+	};
+	struct perf_evsel *evsel;
+
+	session = perf_session__new(&file, 0, &c2c->tool);
+	if (session == NULL) {
+		pr_debug("No memory for session\n");
+		goto out;
+	}
+
+	/* setup the evsel handlers for each event type */
+	evlist__for_each(session->evlist, evsel) {
+		const char *name = perf_evsel__name(evsel);
+		unsigned int i;
+
+		for (i = 0; i < ARRAY_SIZE(handlers); i++) {
+			if (!strcmp(name, handlers[i].name))
+				evsel->handler = handlers[i].handler;
+		}
+	}
+
+	err = perf_session__process_events(session, &c2c->tool);
+	if (err)
+		pr_err("Failed to process events, error %d", err);
+
+out:
+	return err;
+}
+
+static int perf_c2c__report(struct perf_c2c *c2c)
+{
+	setup_pager();
+	return perf_c2c__read_events(c2c);
+}
+
+static int perf_c2c__record(int argc, const char **argv)
+{
+	unsigned int rec_argc, i, j;
+	const char **rec_argv;
+	const char * const record_args[] = {
+		"record",
+		/* "--phys-addr", */
+		"-W",
+		"-d",
+		"-a",
+	};
+
+	rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
+	rec_argv = calloc(rec_argc + 1, sizeof(char *));
+
+	if (rec_argv == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < ARRAY_SIZE(record_args); i++)
+		rec_argv[i] = strdup(record_args[i]);
+
+	for (j = 0; j < ARRAY_SIZE(handlers); j++) {
+		rec_argv[i++] = strdup("-e");
+		rec_argv[i++] = strdup(handlers[j].name);
+	}
+
+	for (j = 1; j < (unsigned int)argc; j++, i++)
+		rec_argv[i] = argv[j];
+
+	BUG_ON(i != rec_argc);
+
+	return cmd_record(i, rec_argv, NULL);
+}
+
+int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	struct perf_c2c c2c = {
+		.tool = {
+			.sample		 = perf_c2c__process_sample,
+			.comm		 = perf_event__process_comm,
+			.exit		 = perf_event__process_exit,
+			.fork		 = perf_event__process_fork,
+			.lost		 = perf_event__process_lost,
+			.ordered_samples = true,
+		},
+	};
+	const struct option c2c_options[] = {
+	OPT_END()
+	};
+	const char * const c2c_usage[] = {
+		"perf c2c {record|report}",
+		NULL
+	};
+
+	argc = parse_options(argc, argv, c2c_options, c2c_usage,
+			     PARSE_OPT_STOP_AT_NON_OPTION);
+	if (!argc)
+		usage_with_options(c2c_usage, c2c_options);
+
+	if (!strncmp(argv[0], "rec", 3)) {
+		return perf_c2c__record(argc, argv);
+	} else if (!strncmp(argv[0], "rep", 3)) {
+		return perf_c2c__report(&c2c);
+	} else {
+		usage_with_options(c2c_usage, c2c_options);
+	}
+
+	return 0;
+}
diff --git a/tools/perf/builtin.h b/tools/perf/builtin.h
index b210d62..2d0b1b5 100644
--- a/tools/perf/builtin.h
+++ b/tools/perf/builtin.h
@@ -17,6 +17,7 @@ extern int cmd_annotate(int argc, const char **argv, const char *prefix);
 extern int cmd_bench(int argc, const char **argv, const char *prefix);
 extern int cmd_buildid_cache(int argc, const char **argv, const char *prefix);
 extern int cmd_buildid_list(int argc, const char **argv, const char *prefix);
+extern int cmd_c2c(int argc, const char **argv, const char *prefix);
 extern int cmd_diff(int argc, const char **argv, const char *prefix);
 extern int cmd_evlist(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/tools/perf/perf.c b/tools/perf/perf.c
index 431798a..c7012a3 100644
--- a/tools/perf/perf.c
+++ b/tools/perf/perf.c
@@ -35,6 +35,7 @@ struct cmd_struct {
 static struct cmd_struct commands[] = {
 	{ "buildid-cache", cmd_buildid_cache, 0 },
 	{ "buildid-list", cmd_buildid_list, 0 },
+	{ "c2c",	cmd_c2c,	0 },
 	{ "diff",	cmd_diff,	0 },
 	{ "evlist",	cmd_evlist,	0 },
 	{ "help",	cmd_help,	0 },
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (2 preceding siblings ...)
  2014-03-24 19:36 ` [PATCH 03/15 V3] perf c2c: Shared data analyser Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-04-08  7:09   ` Namhyung Kim
  2014-03-24 19:36 ` [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features Don Zickus
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme
  Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Don Zickus,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

From: Arnaldo Carvalho de Melo <acme@redhat.com>

>From the c2c prototype:

[root@sandy ~]# perf c2c -r report | head -7
T Status    Pid Tid CPU          Inst Adrs     Virt Data Adrs Phys Data Adrs Cycles Source      Decoded Source                ObJect:Symbol
--------------------------------------------------------------------------------------------------------------------------------------------
  raw input 779 779   7 0xffffffff810865dd 0xffff8803f4d75ec8              0    370 0x68080882 [LOAD,LCL_LLC,MISS,SNP NA]    [kernel.kallsyms]:try_to_wake_up
  raw input 779 779   7 0xffffffff8107acb3 0xffff8802a5b73158              0    297 0x6a100142 [LOAD,L1,HIT,SNP NONE,LOCKED] [kernel.kallsyms]:up_read
  raw input 779 779   7       0x3b7e009814     0x7fff87429ea0              0    925 0x68100142 [LOAD,L1,HIT,SNP NONE]        ???:???
  raw input   0   0   1 0xffffffff8108bf81 0xffff8803eafebf50              0    172 0x68800842 [LOAD,LCL_LLC,HIT,SNP HITM]   [kernel.kallsyms]:update_stats_wait_end
  raw input 779 779   7       0x3b7e0097cc     0x7fac94b69068              0    228 0x68100242 [LOAD,LFB,HIT,SNP NONE]       ???:???
[root@sandy ~]#

The "Phys Data Adrs" column is not available at this point.

Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Joe Mario <jmario@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fowles <rfowles@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-c2c.c | 148 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 125 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 2935484..7082913 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -11,51 +11,148 @@
 
 struct perf_c2c {
 	struct perf_tool tool;
+	bool		 raw_records;
 };
 
-static int perf_sample__fprintf(struct perf_sample *sample,
-				struct perf_evsel *evsel,
-				struct addr_location *al, FILE *fp)
+enum { OP, LVL, SNP, LCK, TLB };
+
+static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
-	return fprintf(fp, "%25.25s: %5d %5d 0x%016" PRIx64 " 0x016%" PRIx64 " %5" PRIu64 " 0x%06" PRIx64 " %s:%s\n",
-		       perf_evsel__name(evsel),
-		       sample->pid, sample->tid, sample->ip, sample->addr,
-		       sample->weight, sample->data_src,
-		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
-		       al->sym ? al->sym->name : "???");
+#define PREFIX       "["
+#define SUFFIX       "]"
+#define ELLIPSIS     "..."
+	static const struct {
+		uint64_t   bit;
+		int64_t    field;
+		const char *name;
+	} decode_bits[] = {
+	{ PERF_MEM_OP_LOAD,       OP,  "LOAD"     },
+	{ PERF_MEM_OP_STORE,      OP,  "STORE"    },
+	{ PERF_MEM_OP_NA,         OP,  "OP_NA"    },
+	{ PERF_MEM_LVL_LFB,       LVL, "LFB"      },
+	{ PERF_MEM_LVL_L1,        LVL, "L1"       },
+	{ PERF_MEM_LVL_L2,        LVL, "L2"       },
+	{ PERF_MEM_LVL_L3,        LVL, "LCL_LLC"  },
+	{ PERF_MEM_LVL_LOC_RAM,   LVL, "LCL_RAM"  },
+	{ PERF_MEM_LVL_REM_RAM1,  LVL, "RMT_RAM"  },
+	{ PERF_MEM_LVL_REM_RAM2,  LVL, "RMT_RAM"  },
+	{ PERF_MEM_LVL_REM_CCE1,  LVL, "RMT_LLC"  },
+	{ PERF_MEM_LVL_REM_CCE2,  LVL, "RMT_LLC"  },
+	{ PERF_MEM_LVL_IO,        LVL, "I/O"	  },
+	{ PERF_MEM_LVL_UNC,       LVL, "UNCACHED" },
+	{ PERF_MEM_LVL_NA,        LVL, "N"        },
+	{ PERF_MEM_LVL_HIT,       LVL, "HIT"      },
+	{ PERF_MEM_LVL_MISS,      LVL, "MISS"     },
+	{ PERF_MEM_SNOOP_NONE,    SNP, "SNP NONE" },
+	{ PERF_MEM_SNOOP_HIT,     SNP, "SNP HIT"  },
+	{ PERF_MEM_SNOOP_MISS,    SNP, "SNP MISS" },
+	{ PERF_MEM_SNOOP_HITM,    SNP, "SNP HITM" },
+	{ PERF_MEM_SNOOP_NA,      SNP, "SNP NA"   },
+	{ PERF_MEM_LOCK_LOCKED,   LCK, "LOCKED"   },
+	{ PERF_MEM_LOCK_NA,       LCK, "LOCK_NA"  },
+	};
+	union perf_mem_data_src dsrc = { .val = val, };
+	int printed = scnprintf(bf, size, PREFIX);
+	size_t i;
+	bool first_present = true;
+
+	for (i = 0; i < ARRAY_SIZE(decode_bits); i++) {
+		int bitval;
+
+		switch (decode_bits[i].field) {
+		case OP:  bitval = decode_bits[i].bit & dsrc.mem_op;    break;
+		case LVL: bitval = decode_bits[i].bit & dsrc.mem_lvl;   break;
+		case SNP: bitval = decode_bits[i].bit & dsrc.mem_snoop; break;
+		case LCK: bitval = decode_bits[i].bit & dsrc.mem_lock;  break;
+		case TLB: bitval = decode_bits[i].bit & dsrc.mem_dtlb;  break;
+		default: bitval = 0;					break;
+		}
+
+		if (!bitval)
+			continue;
+
+		if (strlen(decode_bits[i].name) + !!i > size - printed - sizeof(SUFFIX)) {
+			sprintf(bf + size - sizeof(SUFFIX) - sizeof(ELLIPSIS) + 1, ELLIPSIS);
+			printed = size - sizeof(SUFFIX);
+			break;
+		}
+
+		printed += scnprintf(bf + printed, size - printed, "%s%s",
+				     first_present ? "" : ",", decode_bits[i].name);
+		first_present = false;
+	}
+
+	printed += scnprintf(bf + printed, size - printed, SUFFIX);
+	return printed;
 }
 
-static int perf_c2c__process_load(struct perf_evsel *evsel,
-				  struct perf_sample *sample,
-				  struct addr_location *al)
+static int perf_c2c__fprintf_header(FILE *fp)
 {
-	perf_sample__fprintf(sample, evsel, al, stdout);
-	return 0;
+	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
+			      'T',
+			      "Status",
+			      "Pid",
+			      "Tid",
+			      "CPU",
+			      "Inst Adrs",
+			      "Virt Data Adrs",
+			      "Phys Data Adrs",
+			      "Cycles",
+			      "Source",
+			      "  Decoded Source",
+			      "ObJect:Symbol");
+	return printed + fprintf(fp, "%-*.*s\n", printed, printed, graph_dotted_line);
+}
+
+static int perf_sample__fprintf(struct perf_sample *sample, char tag,
+				const char *reason, struct addr_location *al, FILE *fp)
+{
+	char data_src[61];
+
+	perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
+
+	return fprintf(fp, "%c %-16s  %6d  %6d  %4d  %#18" PRIx64 "  %#18" PRIx64 "  %#18" PRIx64 "  %6" PRIu64 "  %#10" PRIx64 " %-60.60s %s:%s\n",
+		       tag,
+		       reason ?: "valid record",
+		       sample->pid,
+		       sample->tid,
+		       sample->cpu,
+		       sample->ip,
+		       sample->addr,
+		       0UL,
+		       sample->weight,
+		       sample->data_src,
+		       data_src,
+		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
+		       al->sym ? al->sym->name : "???");
 }
 
-static int perf_c2c__process_store(struct perf_evsel *evsel,
-				   struct perf_sample *sample,
-				   struct addr_location *al)
+static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+					struct perf_sample *sample,
+					struct addr_location *al)
 {
-	perf_sample__fprintf(sample, evsel, al, stdout);
+	if (c2c->raw_records)
+		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+
 	return 0;
 }
 
 static const struct perf_evsel_str_handler handlers[] = {
-	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load, },
-	{ "cpu/mem-stores/pp",	       perf_c2c__process_store, },
+	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
+	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },
 };
 
-typedef int (*sample_handler)(struct perf_evsel *evsel,
+typedef int (*sample_handler)(struct perf_c2c *c2c,
 			      struct perf_sample *sample,
 			      struct addr_location *al);
 
-static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
+static int perf_c2c__process_sample(struct perf_tool *tool,
 				    union perf_event *event,
 				    struct perf_sample *sample,
 				    struct perf_evsel *evsel,
 				    struct machine *machine)
 {
+	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
 	struct addr_location al;
 	int err = 0;
 
@@ -67,7 +164,7 @@ static int perf_c2c__process_sample(struct perf_tool *tool __maybe_unused,
 
 	if (evsel->handler != NULL) {
 		sample_handler f = evsel->handler;
-		err = f(evsel, sample, &al);
+		err = f(c2c, sample, &al);
 	}
 
 	return err;
@@ -111,6 +208,10 @@ out:
 static int perf_c2c__report(struct perf_c2c *c2c)
 {
 	setup_pager();
+
+	if (c2c->raw_records)
+		perf_c2c__fprintf_header(stdout);
+
 	return perf_c2c__read_events(c2c);
 }
 
@@ -161,6 +262,7 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		},
 	};
 	const struct option c2c_options[] = {
+	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (3 preceding siblings ...)
  2014-03-24 19:36 ` [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-03-29 17:10   ` Jiri Olsa
  2014-04-08  7:18   ` Namhyung Kim
  2014-03-24 19:36 ` [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores Don Zickus
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

A basic patch that re-arranges some of the c2c code and adds a couple
of small features to lay the ground work for the rest of the patch
series.

Changes include:

o reworking the report path
o replace preprocess_sample with simpler calls
o rework raw output to handle separators
o remove phys id gunk
o add some generic options

There isn't much meat in this patch just a bunch of code movement and cleanups.

V2: refresh to latest upstream changes

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 125 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 93 insertions(+), 32 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 7082913..72dd9d5 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,6 +5,7 @@
 #include "util/parse-options.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/debug.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -105,34 +106,55 @@ static int perf_c2c__fprintf_header(FILE *fp)
 }
 
 static int perf_sample__fprintf(struct perf_sample *sample, char tag,
-				const char *reason, struct addr_location *al, FILE *fp)
+				const char *reason, struct mem_info *mi, FILE *fp)
 {
 	char data_src[61];
+	const char *fmt, *sep;
+	struct map *map = mi->iaddr.map;
 
 	perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), sample->data_src);
 
-	return fprintf(fp, "%c %-16s  %6d  %6d  %4d  %#18" PRIx64 "  %#18" PRIx64 "  %#18" PRIx64 "  %6" PRIu64 "  %#10" PRIx64 " %-60.60s %s:%s\n",
-		       tag,
-		       reason ?: "valid record",
-		       sample->pid,
-		       sample->tid,
-		       sample->cpu,
-		       sample->ip,
-		       sample->addr,
-		       0UL,
-		       sample->weight,
-		       sample->data_src,
-		       data_src,
-		       al->map ? (al->map->dso ? al->map->dso->long_name : "???") : "???",
-		       al->sym ? al->sym->name : "???");
+	if (symbol_conf.field_sep) {
+		fmt = "%c%s%s%s%d%s%d%s%d%s%#"PRIx64"%s%#"PRIx64"%s"
+		      "%"PRIu64"%s%#"PRIx64"%s%s%s%s:%s\n";
+		sep = symbol_conf.field_sep;
+	} else {
+		fmt = "%c%s%-16s%s%6d%s%6d%s%4d%s%#18"PRIx64"%s%#18"PRIx64"%s"
+		      "%6"PRIu64"%s%#10"PRIx64"%s%-60.60s%s%s:%s\n";
+		sep = " ";
+	}
+
+	return fprintf(fp, fmt,
+		       tag,				sep,
+		       reason ?: "valid record",	sep,
+		       sample->pid,			sep,
+		       sample->tid,			sep,
+		       sample->cpu,			sep,
+		       sample->ip,			sep,
+		       sample->addr,			sep,
+		       sample->weight,			sep,
+		       sample->data_src,		sep,
+		       data_src,			sep,
+		       map ? (map->dso ? map->dso->long_name : "???") : "???",
+		       mi->iaddr.sym ? mi->iaddr.sym->name : "???");
 }
 
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
+					struct addr_location *al,
 					struct perf_sample *sample,
-					struct addr_location *al)
+					struct perf_evsel *evsel)
 {
-	if (c2c->raw_records)
-		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
+	struct mem_info *mi;
+
+	mi = sample__resolve_mem(sample, al);
+	if (!mi)
+		return -ENOMEM;
+
+	if (c2c->raw_records) {
+		perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
+		free(mi);
+		return 0;
+	}
 
 	return 0;
 }
@@ -143,8 +165,9 @@ static const struct perf_evsel_str_handler handlers[] = {
 };
 
 typedef int (*sample_handler)(struct perf_c2c *c2c,
+			      struct addr_location *al,
 			      struct perf_sample *sample,
-			      struct addr_location *al);
+			      struct perf_evsel *evsel);
 
 static int perf_c2c__process_sample(struct perf_tool *tool,
 				    union perf_event *event,
@@ -153,20 +176,51 @@ static int perf_c2c__process_sample(struct perf_tool *tool,
 				    struct machine *machine)
 {
 	struct perf_c2c *c2c = container_of(tool, struct perf_c2c, tool);
-	struct addr_location al;
-	int err = 0;
+	u8 cpumode = event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+	struct thread *thread;
+	sample_handler f;
+	int err = -1;
+	struct addr_location al = {
+			.machine = machine,
+			.cpu = sample->cpu,
+			.cpumode = cpumode,
+	};
+
+	if (evsel->handler == NULL)
+		return 0;
 
-	if (perf_event__preprocess_sample(event, machine, &al, sample) < 0) {
-		pr_err("problem processing %d event, skipping it.\n",
-		       event->header.type);
-		return -1;
+	thread = machine__find_thread(machine, sample->pid, sample->tid);
+	if (thread == NULL) {
+		printf("Could not find thread for tid %d\n", sample->tid);
+		return 0;
 	}
 
-	if (evsel->handler != NULL) {
-		sample_handler f = evsel->handler;
-		err = f(c2c, sample, &al);
+	al.thread = thread;
+
+	f = evsel->handler;
+	err = f(c2c, &al, sample, evsel);
+	if (err)
+		goto err;
+
+	return 0;
+err:
+	if (err > 0)
+		err = 0;
+	return err;
+}
+
+static int perf_c2c__process_events(struct perf_session *session,
+				    struct perf_c2c *c2c)
+{
+	int err = -1;
+
+	err = perf_session__process_events(session, &c2c->tool);
+	if (err) {
+		pr_err("Failed to process count events, error %d\n", err);
+		goto err;
 	}
 
+err:
 	return err;
 }
 
@@ -197,9 +251,7 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 		}
 	}
 
-	err = perf_session__process_events(session, &c2c->tool);
-	if (err)
-		pr_err("Failed to process events, error %d", err);
+	err = perf_c2c__process_events(session, c2c);
 
 out:
 	return err;
@@ -221,7 +273,6 @@ static int perf_c2c__record(int argc, const char **argv)
 	const char **rec_argv;
 	const char * const record_args[] = {
 		"record",
-		/* "--phys-addr", */
 		"-W",
 		"-d",
 		"-a",
@@ -254,6 +305,8 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	struct perf_c2c c2c = {
 		.tool = {
 			.sample		 = perf_c2c__process_sample,
+			.mmap2           = perf_event__process_mmap2,
+			.mmap            = perf_event__process_mmap,
 			.comm		 = perf_event__process_comm,
 			.exit		 = perf_event__process_exit,
 			.fork		 = perf_event__process_fork,
@@ -263,6 +316,14 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	};
 	const struct option c2c_options[] = {
 	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+	OPT_INCR('v', "verbose", &verbose,
+		 "be more verbose (show counter open errors, etc)"),
+	OPT_STRING('i', "input", &input_name, "file",
+		   "the input file to process"),
+	OPT_STRING('x', "field-separator", &symbol_conf.field_sep,
+		   "separator",
+		   "separator for columns, no spaces will be added"
+		   " between columns '.' is reserved."),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (4 preceding siblings ...)
  2014-03-24 19:36 ` [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-03-29 17:11   ` Jiri Olsa
  2014-04-08  7:31   ` Namhyung Kim
  2014-03-24 19:36 ` [PATCH 07/15 V3] perf, c2c: Add in sort on physid Don Zickus
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Modified the code to allow latency settings to be tweaked on the command line
and also the ability to dynamically profile stores (or disable using stores).

This allows the tool to be used on older Intel platforms like Westmere.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 73 +++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 66 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 72dd9d5..d7eaf81 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -6,6 +6,7 @@
 #include "util/session.h"
 #include "util/tool.h"
 #include "util/debug.h"
+#include <api/fs/fs.h>
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -17,6 +18,12 @@ struct perf_c2c {
 
 enum { OP, LVL, SNP, LCK, TLB };
 
+#define DEFAULT_LATENCY_THRES	30
+#define DEFAULT_PRECISION	1
+static int lat_level = DEFAULT_LATENCY_THRES;
+static int prec_level = DEFAULT_PRECISION;
+static bool no_stores = false;
+
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
 #define PREFIX       "["
@@ -159,9 +166,10 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 	return 0;
 }
 
+/* setup the basic events for most coverage, options added later */
 static const struct perf_evsel_str_handler handlers[] = {
-	{ "cpu/mem-loads,ldlat=30/pp", perf_c2c__process_load_store, },
-	{ "cpu/mem-stores/pp",	       perf_c2c__process_load_store, },
+	{ "cpu/mem-loads", 		perf_c2c__process_load_store, },
+	{ "cpu/mem-stores",		perf_c2c__process_load_store, },
 };
 
 typedef int (*sample_handler)(struct perf_c2c *c2c,
@@ -246,7 +254,8 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 		unsigned int i;
 
 		for (i = 0; i < ARRAY_SIZE(handlers); i++) {
-			if (!strcmp(name, handlers[i].name))
+			if (!strncmp(name, handlers[i].name,
+				     sizeof(handlers[i].name)))
 				evsel->handler = handlers[i].handler;
 		}
 	}
@@ -269,7 +278,7 @@ static int perf_c2c__report(struct perf_c2c *c2c)
 
 static int perf_c2c__record(int argc, const char **argv)
 {
-	unsigned int rec_argc, i, j;
+	unsigned int rec_argc, i, j, n;
 	const char **rec_argv;
 	const char * const record_args[] = {
 		"record",
@@ -277,8 +286,21 @@ static int perf_c2c__record(int argc, const char **argv)
 		"-d",
 		"-a",
 	};
+	char lat[16], prec[4] = "/";
+	char buf[256];
+
+	snprintf(lat, sizeof(lat), "ldlat=%d", lat_level);
+
+	if (prec_level > 3)
+		prec_level = 3;
+
+	for (i = 0; i < (unsigned int)prec_level; i++)
+		prec[i+1] = 'p';
+
+	/* how many events to pass in */
+	n = (no_stores) ? 1 : 2;
 
-	rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
+	rec_argc = ARRAY_SIZE(record_args) + 2 * n + argc - 1;
 	rec_argv = calloc(rec_argc + 1, sizeof(char *));
 
 	if (rec_argv == NULL)
@@ -287,9 +309,16 @@ static int perf_c2c__record(int argc, const char **argv)
 	for (i = 0; i < ARRAY_SIZE(record_args); i++)
 		rec_argv[i] = strdup(record_args[i]);
 
-	for (j = 0; j < ARRAY_SIZE(handlers); j++) {
+	/* mem-loads is first index */
+	snprintf(buf, sizeof(buf), "%s,%s%s", handlers[0].name, lat, prec);
+	rec_argv[i++] = strdup("-e");
+	rec_argv[i++] = strdup(buf);
+
+	if (!no_stores) {
+		/* mem-stores is second index */
 		rec_argv[i++] = strdup("-e");
-		rec_argv[i++] = strdup(handlers[j].name);
+		snprintf(buf, sizeof(buf), "%s%s", handlers[1].name, prec);
+		rec_argv[i++] = strdup(buf);
 	}
 
 	for (j = 1; j < (unsigned int)argc; j++, i++)
@@ -300,6 +329,30 @@ static int perf_c2c__record(int argc, const char **argv)
 	return cmd_record(i, rec_argv, NULL);
 }
 
+static int
+opt_no_stores_cb(const struct option *opt __maybe_unused, const char *arg __maybe_unused, int unset)
+{
+	int ret;
+
+	if (unset) {
+		struct stat st;
+		char path[PATH_MAX];
+
+		snprintf(path, PATH_MAX, "%s/bus/event_source/devices/cpu/events/mem-stores",
+			 sysfs__mountpoint());
+
+		ret = stat(path, &st);
+		if (ret) {
+			pr_debug("Omitting mem-stores event");
+			no_stores = true;
+		}
+		return 0;
+	}
+
+	no_stores = true;
+	return 0;
+}
+
 int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 {
 	struct perf_c2c c2c = {
@@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 	};
 	const struct option c2c_options[] = {
 	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
+	OPT_INTEGER('l', "latency-level", &lat_level,
+		 "specify the latency threshold for loads [default=30]"),
+	OPT_INTEGER('p', "precision-level", &prec_level,
+		 "specify the precision level of events (0,1,2,3) [default=1]"),
+	OPT_CALLBACK_NOOPT(0, "no-stores", &no_stores, "",
+			   "do not record stores", &opt_no_stores_cb),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show counter open errors, etc)"),
 	OPT_STRING('i', "input", &input_name, "file",
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 07/15 V3] perf, c2c: Add in sort on physid
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (5 preceding siblings ...)
  2014-03-24 19:36 ` [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-04-08  7:56   ` Namhyung Kim
  2014-03-24 19:36 ` [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Now that the infrastructure is set, add in the support to use
hist_entry to sort on physid.

V2: use new mmap2 sort

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index d7eaf81..b5742bd 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -7,6 +7,7 @@
 #include "util/tool.h"
 #include "util/debug.h"
 #include <api/fs/fs.h>
+#include "util/annotate.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -14,6 +15,7 @@
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
+	struct hists	 hists;
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -151,7 +153,11 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct perf_sample *sample,
 					struct perf_evsel *evsel)
 {
+	struct symbol *parent = NULL;
+	struct hist_entry *he;
 	struct mem_info *mi;
+	uint64_t cost;
+	int err;
 
 	mi = sample__resolve_mem(sample, al);
 	if (!mi)
@@ -163,7 +169,33 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 		return 0;
 	}
 
-	return 0;
+	cost = sample->weight;
+	if (!cost)
+		cost = 1;
+
+	/*
+	 * must pass period=weight in order to get the correct
+	 * sorting from hists__collapse_resort() which is solely
+	 * based on periods. We want sorting be done on nr_events * weight
+	 * and this is indirectly achieved by passing period=weight here
+	 * and the he_stat__add_period() function.
+	 */
+	he = __hists__add_entry(&c2c->hists, al, parent, NULL, mi,
+				cost, cost, 0);
+	if (!he) {
+		err = -ENOMEM;
+		goto out_mem;
+	}
+
+	c2c->hists.stats.total_period += cost;
+	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+	return err;
+
+out_mem:
+	/* implicitly freed by __hists__add_entry */
+	free(mi);
+out:
+	return err;
 }
 
 /* setup the basic events for most coverage, options added later */
@@ -266,10 +298,28 @@ out:
 	return err;
 }
 
+static int perf_c2c__init(struct perf_c2c *c2c)
+{
+	sort__mode = SORT_MODE__PHYSID;
+	sort__wants_unique = 1;
+	sort_order = "daddr,iaddr,pid,tid";
+
+	if (setup_sorting() < 0) {
+		pr_err("can not setup sorting\n");
+		return -1;
+	}
+
+	hists__init(&c2c->hists);
+
+	return 0;
+}
 static int perf_c2c__report(struct perf_c2c *c2c)
 {
 	setup_pager();
 
+	if (perf_c2c__init(c2c))
+		return -1;
+
 	if (c2c->raw_records)
 		perf_c2c__fprintf_header(stdout);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (6 preceding siblings ...)
  2014-03-24 19:36 ` [PATCH 07/15 V3] perf, c2c: Add in sort on physid Don Zickus
@ 2014-03-24 19:36 ` Don Zickus
  2014-04-08  8:05   ` Namhyung Kim
  2014-03-24 19:37 ` [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line Don Zickus
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

This patch adds a bunch of stats that will be used later in post-processing
to determine where and with what frequency the HITMs are coming from.

Most of the stats are decoded from the data source response.  Another
piece of the stats is tracking which cpu the record came in on.

Credit to Dick Fowles for determining which bits are important and how to
properly track them.  Ported to perf by me.

V2: refresh with hist_entry

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 187 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 187 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index b5742bd..55c5ce9 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -5,17 +5,58 @@
 #include "util/parse-options.h"
 #include "util/session.h"
 #include "util/tool.h"
+#include "util/stat.h"
+#include "util/cpumap.h"
 #include "util/debug.h"
 #include <api/fs/fs.h>
 #include "util/annotate.h"
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
+#include <sched.h>
+
+typedef struct {
+	int  locks;               /* count of 'lock' transactions */
+	int  store;               /* count of all stores in trace */
+	int  st_uncache;          /* stores to uncacheable address */
+	int  st_noadrs;           /* cacheable store with no address */
+	int  st_l1hit;            /* count of stores that hit L1D */
+	int  st_l1miss;           /* count of stores that miss L1D */
+	int  load;                /* count of all loads in trace */
+	int  ld_excl;             /* exclusive loads, rmt/lcl DRAM - snp none/miss */
+	int  ld_shared;           /* shared loads, rmt/lcl DRAM - snp hit */
+	int  ld_uncache;          /* loads to uncacheable address */
+	int  ld_io;               /* loads to io address */
+	int  ld_miss;             /* loads miss */
+	int  ld_noadrs;           /* cacheable load with no address */
+	int  ld_fbhit;            /* count of loads hitting Fill Buffer */
+	int  ld_l1hit;            /* count of loads that hit L1D */
+	int  ld_l2hit;            /* count of loads that hit L2D */
+	int  ld_llchit;           /* count of loads that hit LLC */
+	int  lcl_hitm;            /* count of loads with local HITM  */
+	int  rmt_hitm;            /* count of loads with remote HITM */
+	int  rmt_hit;             /* count of loads with remote hit clean; */
+	int  lcl_dram;            /* count of loads miss to local DRAM */
+	int  rmt_dram;            /* count of loads miss to remote DRAM */
+	int  nomap;               /* count of load/stores with no phys adrs */
+	int  noparse;             /* count of unparsable data sources */
+} trinfo_t;
+
+struct c2c_stats {
+	cpu_set_t		cpuset;
+	int			nr_entries;
+	u64			total_period;
+	trinfo_t		t;
+	struct stats		stats;
+};
 
 struct perf_c2c {
 	struct perf_tool tool;
 	bool		 raw_records;
 	struct hists	 hists;
+
+	/* stats */
+	struct c2c_stats	stats;
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -26,6 +67,29 @@ static int lat_level = DEFAULT_LATENCY_THRES;
 static int prec_level = DEFAULT_PRECISION;
 static bool no_stores = false;
 
+#define RMT_RAM              (PERF_MEM_LVL_REM_RAM1 | PERF_MEM_LVL_REM_RAM2)
+#define RMT_LLC              (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)
+
+#define L1CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_HIT))
+#define FILLBUF_HIT(a)       (((a) & PERF_MEM_LVL_LFB) && ((a) & PERF_MEM_LVL_HIT))
+#define L2CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L2 ) && ((a) & PERF_MEM_LVL_HIT))
+#define L3CACHE_HIT(a)       (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_HIT))
+
+#define L1CACHE_MISS(a)      (((a) & PERF_MEM_LVL_L1 ) && ((a) & PERF_MEM_LVL_MISS))
+#define L3CACHE_MISS(a)      (((a) & PERF_MEM_LVL_L3 ) && ((a) & PERF_MEM_LVL_MISS))
+
+#define LD_UNCACHED(a)       (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+#define ST_UNCACHED(a)       (((a) & PERF_MEM_LVL_UNC) && ((a) & PERF_MEM_LVL_HIT))
+
+#define RMT_LLCHIT(a)        (((a) & RMT_LLC) && ((a) & PERF_MEM_LVL_HIT))
+#define RMT_HIT(a,b)         (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HIT))
+#define RMT_HITM(a,b)        (((a) & RMT_LLC) && ((b) & PERF_MEM_SNOOP_HITM))
+#define RMT_MEM(a)           (((a) & RMT_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
+#define LCL_HIT(a,b)         (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HIT))
+#define LCL_HITM(a,b)        (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
+#define LCL_MEM(a)           (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
+
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
 #define PREFIX       "["
@@ -148,6 +212,109 @@ static int perf_sample__fprintf(struct perf_sample *sample, char tag,
 		       mi->iaddr.sym ? mi->iaddr.sym->name : "???");
 }
 
+static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
+{
+	union perf_mem_data_src *data_src = &entry->mem_info->data_src;
+	u64 daddr = entry->mem_info->daddr.addr;
+	u64 weight = entry->stat.weight;
+	int err = 0;
+
+	u64 op = data_src->mem_op;
+	u64 lvl = data_src->mem_lvl;
+	u64 snoop = data_src->mem_snoop;
+	u64 lock = data_src->mem_lock;
+
+#define P(a,b) PERF_MEM_##a##_##b
+
+	stats->nr_entries++;
+	stats->total_period += entry->stat.period;
+
+	if (lock & P(LOCK,LOCKED)) stats->t.locks++;
+
+	if (op & P(OP,LOAD)) {
+		stats->t.load++;
+
+		if (!daddr) {
+			stats->t.ld_noadrs++;
+			return -1;
+		}
+
+		if (lvl & P(LVL,HIT)) {
+			if (lvl & P(LVL,UNC)) stats->t.ld_uncache++;
+			if (lvl & P(LVL,IO))  stats->t.ld_io++;
+			if (lvl & P(LVL,LFB)) stats->t.ld_fbhit++;
+			if (lvl & P(LVL,L1 )) stats->t.ld_l1hit++;
+			if (lvl & P(LVL,L2 )) stats->t.ld_l2hit++;
+			if (lvl & P(LVL,L3 )) {
+				if (snoop & P(SNOOP,HITM))
+					stats->t.lcl_hitm++;
+				else
+					stats->t.ld_llchit++;
+			}
+
+			if (lvl & P(LVL,LOC_RAM)) {
+				stats->t.lcl_dram++;
+				if (snoop & P(SNOOP,HIT))
+					stats->t.ld_shared++;
+				else
+					stats->t.ld_excl++;
+			}
+
+			if ((lvl & P(LVL,REM_RAM1)) ||
+			    (lvl & P(LVL,REM_RAM2))) {
+				stats->t.rmt_dram++;
+				if (snoop & P(SNOOP,HIT))
+					stats->t.ld_shared++;
+				else
+					stats->t.ld_excl++;
+			}
+		}
+
+		if ((lvl & P(LVL,REM_CCE1)) ||
+		    (lvl & P(LVL,REM_CCE2))) {
+			if (snoop & P(SNOOP, HIT))
+				stats->t.rmt_hit++;
+			else if (snoop & P(SNOOP, HITM)) {
+				stats->t.rmt_hitm++;
+				update_stats(&stats->stats, weight);
+			}
+		}
+
+		if ((lvl & P(LVL,MISS)))
+			stats->t.ld_miss++;
+
+	} else if (op & P(OP,STORE)) {
+		/* store */
+		stats->t.store++;
+
+		if (!daddr) {
+			stats->t.st_noadrs++;
+			return -1;
+		}
+
+		if (lvl & P(LVL,HIT)) {
+			if (lvl & P(LVL,UNC)) stats->t.st_uncache++;
+			if (lvl & P(LVL,L1 )) stats->t.st_l1hit++;
+		}
+		if (lvl & P(LVL,MISS))
+			if (lvl & P(LVL,L1)) stats->t.st_l1miss++;
+	} else {
+		/* unparsable data_src? */
+		stats->t.noparse++;
+		return -1;
+	}
+
+	if (!entry->mem_info->daddr.map || !entry->mem_info->iaddr.map) {
+		stats->t.nomap++;
+		pr_debug("Dropping data 0x%lx (%p) and inst 0x%lx (%p)\n",
+			 entry->mem_info->daddr.addr, entry->mem_info->daddr.map,
+			 entry->mem_info->iaddr.addr, entry->mem_info->iaddr.map);
+		return -1;
+	}
+
+	return err;
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -187,6 +354,14 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 		goto out_mem;
 	}
 
+	err = c2c_decode_stats(&c2c->stats, he);
+	if (err < 0) {
+		err = 0;
+		rb_erase(&he->rb_node_in, c2c->hists.entries_in);
+		free(he);
+		goto out;
+	}
+
 	c2c->hists.stats.total_period += cost;
 	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
 	return err;
@@ -280,6 +455,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 		goto out;
 	}
 
+	if (symbol__init() < 0)
+		goto out_delete;
+
 	/* setup the evsel handlers for each event type */
 	evlist__for_each(session->evlist, evsel) {
 		const char *name = perf_evsel__name(evsel);
@@ -294,12 +472,20 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 
 	err = perf_c2c__process_events(session, c2c);
 
+out_delete:
+	perf_session__delete(session);
 out:
 	return err;
 }
 
 static int perf_c2c__init(struct perf_c2c *c2c)
 {
+	/* setup cpu map */
+	if (cpu__setup_cpunode_map() < 0) {
+		pr_err("can not setup cpu map\n");
+		return -1;
+	}
+
 	sort__mode = SORT_MODE__PHYSID;
 	sort__wants_unique = 1;
 	sort_order = "daddr,iaddr,pid,tid";
@@ -310,6 +496,7 @@ static int perf_c2c__init(struct perf_c2c *c2c)
 	}
 
 	hists__init(&c2c->hists);
+	CPU_ZERO(&c2c->stats.cpuset);
 
 	return 0;
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (7 preceding siblings ...)
  2014-03-24 19:36 ` [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  2014-04-08  8:23   ` Namhyung Kim
  2014-03-24 19:37 ` [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Now that we have all the events sort on a unique address, we can walk
the rbtree sequential and count up all the HITMs for each cacheline
fairly easily.

Once we encounter a new event on a different cacheline, process the previous
cacheline.  That includes determining if any HITMs were present on that
cacheline and if so, add it to another rbtree sorted on the number of HITMs.

This second rbtree sorted on number of HITMs will be the interesting data
we want to report and will be displayed in a follow up patch.

For now, organize the data properly.

V2: re-work using hist_entries

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 201 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 201 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 55c5ce9..8674626 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -59,6 +59,25 @@ struct perf_c2c {
 	struct c2c_stats	stats;
 };
 
+#define CACHE_LINESIZE       64
+#define CLINE_OFFSET_MSK     (CACHE_LINESIZE - 1)
+#define CLADRS(a)            ((a) & ~(CLINE_OFFSET_MSK))
+#define CLOFFSET(a)          (int)((a) &  (CLINE_OFFSET_MSK))
+
+struct c2c_hit {
+	struct rb_node	rb_node;
+	struct rb_root  tree;
+	struct list_head list;
+	u64		cacheline;
+	int		color;
+	struct c2c_stats	stats;
+	pid_t		pid;
+	pid_t		tid;
+	u64		daddr;
+	u64		iaddr;
+	struct mem_info	*mi;
+};
+
 enum { OP, LVL, SNP, LCK, TLB };
 
 #define DEFAULT_LATENCY_THRES	30
@@ -160,6 +179,44 @@ static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 	return printed;
 }
 
+static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct c2c_hit *he;
+	int64_t cmp;
+	u64 l_hitms, r_hitms;
+
+	p = &root->rb_node;
+
+	while (*p != NULL) {
+		parent = *p;
+		he = rb_entry(parent, struct c2c_hit, rb_node);
+
+		/* sort on remote hitms first */
+		l_hitms = he->stats.t.rmt_hitm;
+		r_hitms = h->stats.t.rmt_hitm;
+		cmp = r_hitms - l_hitms;
+
+		if (!cmp) {
+			/* sort on local hitms */
+			l_hitms = he->stats.t.lcl_hitm;
+			r_hitms = h->stats.t.lcl_hitm;
+			cmp = r_hitms - l_hitms;
+		}
+
+		if (cmp > 0)
+			p = &(*p)->rb_left;
+		else
+			p = &(*p)->rb_right;
+	}
+
+	rb_link_node(&h->rb_node, parent, p);
+	rb_insert_color(&h->rb_node, root);
+
+	return 0;
+}
+
 static int perf_c2c__fprintf_header(FILE *fp)
 {
 	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
@@ -315,6 +372,50 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
 	return err;
 }
 
+static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
+{
+	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+
+	if (!h) {
+		pr_err("Could not allocate c2c_hit memory\n");
+		return NULL;
+	}
+
+	CPU_ZERO(&h->stats.cpuset);
+	INIT_LIST_HEAD(&h->list);
+	init_stats(&h->stats.stats);
+	h->tree = RB_ROOT;
+	h->cacheline = cacheline;
+	h->pid = entry->thread->pid_;
+	h->tid = entry->thread->tid;
+
+	/* use original addresses here, not adjusted al_addr */
+	h->iaddr = entry->mem_info->iaddr.addr;
+	h->daddr = entry->mem_info->daddr.addr;
+
+	h->mi = entry->mem_info;
+	return h;
+}
+
+static void c2c_hit__update_strings(struct c2c_hit *h,
+				    struct hist_entry *n)
+{
+	if (h->pid != n->thread->pid_)
+		h->pid = -1;
+
+	if (h->tid != n->thread->tid)
+		h->tid = -1;
+
+	/* use original addresses here, not adjusted al_addr */
+	if (h->iaddr != n->mem_info->iaddr.addr)
+		h->iaddr = -1;
+
+	if (CLADRS(h->daddr) != CLADRS(n->mem_info->daddr.addr))
+		h->daddr = -1;
+
+	CPU_SET(n->cpu, &h->stats.cpuset);
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -424,6 +525,104 @@ err:
 	return err;
 }
 
+#define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
+
+static void c2c_hit__update_stats(struct c2c_stats *new,
+				  struct c2c_stats *old)
+{
+	new->t.load		+= old->t.load;
+	new->t.ld_fbhit		+= old->t.ld_fbhit;
+	new->t.ld_l1hit		+= old->t.ld_l1hit;
+	new->t.ld_l2hit		+= old->t.ld_l2hit;
+	new->t.ld_llchit	+= old->t.ld_llchit;
+	new->t.locks		+= old->t.locks;
+	new->t.lcl_dram		+= old->t.lcl_dram;
+	new->t.rmt_dram		+= old->t.rmt_dram;
+	new->t.lcl_hitm		+= old->t.lcl_hitm;
+	new->t.rmt_hitm		+= old->t.rmt_hitm;
+	new->t.rmt_hit		+= old->t.rmt_hit;
+	new->t.store		+= old->t.store;
+	new->t.st_l1hit		+= old->t.st_l1hit;
+
+	new->total_period	+= old->total_period;
+}
+
+static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
+{
+	return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
+		(dsrc->mem_op & P(OP,STORE)));
+}
+
+static void c2c_analyze_hitms(struct perf_c2c *c2c)
+{
+
+	struct rb_node *next = rb_first(c2c->hists.entries_in);
+	struct hist_entry *he;
+	struct c2c_hit *h = NULL;
+	struct c2c_stats hitm_stats;
+	struct rb_root hitm_tree = RB_ROOT;
+	int shared_clines = 0;
+	u64 cl = 0;
+
+	memset(&hitm_stats, 0, sizeof(struct c2c_stats));
+
+	/* find HITMs */
+	while (next) {
+		he = rb_entry(next, struct hist_entry, rb_node_in);
+		next = rb_next(&he->rb_node_in);
+
+		cl = he->mem_info->daddr.al_addr;
+
+		/* switch cache line objects */
+		/* 'color' forces a boundary change based on the original sort */
+		if (!h || !he->color || (CLADRS(cl) != h->cacheline)) {
+			if (h && HAS_HITMS(h)) {
+				c2c_hit__update_stats(&hitm_stats, &h->stats);
+
+				/* sort based on hottest cacheline */
+				c2c_hitm__add_to_list(&hitm_tree, h);
+				shared_clines++;
+			} else {
+				/* stores-only are un-interesting */
+				free(h);
+			}
+			h = c2c_hit__new(CLADRS(cl), he);
+			if (!h)
+				goto cleanup;
+		}
+
+
+		c2c_decode_stats(&h->stats, he);
+
+		/* filter out non-hitms as un-interesting noise */
+		if (valid_hitm_or_store(&he->mem_info->data_src)) {
+			/* save the entry for later processing */
+			list_add_tail(&he->pairs.node, &h->list);
+
+			c2c_hit__update_strings(h, he);
+		}
+	}
+
+	/* last chunk */
+	if (h && HAS_HITMS(h)) {
+		c2c_hit__update_stats(&hitm_stats, &h->stats);
+		c2c_hitm__add_to_list(&hitm_tree, h);
+		shared_clines++;
+	} else
+		free(h);
+
+cleanup:
+	next = rb_first(&hitm_tree);
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+		rb_erase(&h->rb_node, &hitm_tree);
+
+		free(h);
+	}
+	return;
+}
+
 static int perf_c2c__process_events(struct perf_session *session,
 				    struct perf_c2c *c2c)
 {
@@ -435,6 +634,8 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	c2c_analyze_hitms(c2c);
+
 err:
 	return err;
 }
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (8 preceding siblings ...)
  2014-03-24 19:37 ` [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  2014-04-08  8:26   ` Namhyung Kim
  2014-04-08 23:46   ` Namhyung Kim
  2014-03-24 19:37 ` [PATCH 11/15 V3] perf, c2c: Add callchain support Don Zickus
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

This patch mainly focuses on processing and displaying the collected
HITMs to stdout.  Most of it is just printing data in a pretty way.

There is one trick used when walking the cacheline.  When we get this
far we have two rbtrees.  One rbtree holds every record sorted on a
unique id (using the mmap2 decoder), the other rbtree holds every
cacheline with at least one HITM sorted on number of HITMs.

To display the output, the tool walks the second rbtree to display
the hottest cachelines.  Inside each hot cacheline, the tool displays
the offsets and the loads/stores it generates.  To determine the
cacheline offset, it uses linked list inside the cacheline elment
to walk the first rbtree elements for that particular cacheline.

The first rbtree elements are already sorted correctly in offset order, so
processing the offsets is fairly trivial and is done sequentially.

This is why you will see two while loops in the print_c2c_hitm_report(),
the outer one walks the cachelines, the inner one walks the offsets.

A knob has been added to display node information, which is useful
to see which cpus are involved in the contention and their nodes.

Another knob has been added to change the coalescing levels.  You can
coalesce the output based on pid, tid, ip, and/or symbol.

Original output and statistics done by Dick Fowles, backported by me.

Sample output:

=================================================
    Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines          :       1327
  Load HITs on shared lines         :     167131
  Fill Buffer Hits on shared lines  :      43469
  L1D hits on shared lines          :      50497
  L2D hits on shared lines          :        960
  LLC hits on shared lines          :      38467
  Locked Access on shared lines     :     100032
  Store HITs on shared lines        :     118659
  Store L1D hits on shared lines    :     113783
  Total Merged records              :     160807

===========================================================================================================================================================

                                                                                      Shared Cache Line Distribution Pareto

     ---- All ----  -- Shared --    ---- HITM ----                                                                        Load Inst Execute Latency
       Data Misses   Data Misses   Remote    Local  -- Store Refs --
                                                                                                                          ---- cycles ----             cpu
 Num  %dist  %cumm  %dist  %cumm  LLCmiss   LLChit   L1 hit  L1 Miss       Data Address    Pid    Tid       Inst Address   median     mean     CV      cnt
==========================================================================================================================================================
-----------------------------------------------------------------------------------------------
   0  17.0%  17.0%  23.3%  23.3%     6238     3288    28098      813 0xffff881fa55b0140    ***
-----------------------------------------------------------------------------------------------
                                     0.0%     0.0%     0.0%     0.0%               0x00    375    375 0xffffffffa018ff5b      n/a      n/a      n/a      1
                                     0.0%     0.0%     0.0%     0.0%               0x08  18156  18156 0xffffffffa018b7f9       -1      384     0.0%      1
                                     0.2%     0.0%     0.0%     0.0%               0x10  18156  18156 0xffffffff811ca1aa       -1      387    10.7%      7
                                     0.0%     0.0%    23.2%     0.0%               0x18  18156  18156 0xffffffff815c1615       -1      684     0.0%     50

-----------------------------------------------------------------------------------------------
   1   5.3%  22.3%   7.3%  30.6%     1944     1143     7916        0 0xffff881fba47f000  18156
-----------------------------------------------------------------------------------------------
                                   100.0%   100.0%     0.0%     0.0%               0x00  18156  18156 0xffffffffa01b410e       -1      401    13.5%     50
                                     0.0%     0.0%    10.1%     0.0%               0x28  18156  18156 0xffffffffa0167409      n/a      n/a      n/a     50
                                     0.0%     0.0%    89.9%     0.0%               0x28  18156  18156 0xffffffff815c4be9      n/a      n/a      n/a     50

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 519 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 519 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 8674626..e3dbb76 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -59,10 +59,13 @@ struct perf_c2c {
 	struct c2c_stats	stats;
 };
 
+#define DISPLAY_LINE_LIMIT  0.0015
 #define CACHE_LINESIZE       64
 #define CLINE_OFFSET_MSK     (CACHE_LINESIZE - 1)
 #define CLADRS(a)            ((a) & ~(CLINE_OFFSET_MSK))
 #define CLOFFSET(a)          (int)((a) &  (CLINE_OFFSET_MSK))
+#define MAXTITLE_SZ          400
+#define MAXLBL_SZ            256
 
 struct c2c_hit {
 	struct rb_node	rb_node;
@@ -109,6 +112,11 @@ static bool no_stores = false;
 #define LCL_HITM(a,b)        (L3CACHE_HIT(a) && ((b) & PERF_MEM_SNOOP_HITM))
 #define LCL_MEM(a)           (((a) & PERF_MEM_LVL_LOC_RAM) && ((a) & PERF_MEM_LVL_HIT))
 
+enum { LVL0, LVL1, LVL2, LVL3, LVL4, MAX_LVL };
+static int cloffset = LVL1;
+static int node_info = 0;
+static int coalesce_level = LVL1;
+
 static int perf_c2c__scnprintf_data_src(char *bf, size_t size, uint64_t val)
 {
 #define PREFIX       "["
@@ -416,6 +424,80 @@ static void c2c_hit__update_strings(struct c2c_hit *h,
 	CPU_SET(n->cpu, &h->stats.cpuset);
 }
 
+static inline bool matching_coalescing(struct c2c_hit *h,
+				       struct hist_entry *e)
+{
+	bool value = false;
+	struct mem_info *mi = e->mem_info;
+
+	if (coalesce_level > MAX_LVL)
+		printf("DON: bad coalesce level %d\n", coalesce_level);
+
+	if (e->cpumode != PERF_RECORD_MISC_KERNEL) {
+
+		switch (coalesce_level) {
+
+		case LVL0:
+		case LVL1:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->tid   == e->thread->tid) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL2:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL3:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL4:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->mi->iaddr.sym == mi->iaddr.sym));
+			break;
+
+		default:
+			break;
+
+		}
+
+	} else {
+
+		switch (coalesce_level) {
+
+		case LVL0:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->pid   == e->thread->pid_) &&
+				 (h->tid   == e->thread->tid) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL1:
+		case LVL2:
+		case LVL3:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->iaddr == mi->iaddr.addr));
+			break;
+
+		case LVL4:
+			value = ((h->daddr == mi->daddr.addr) &&
+				 (h->mi->iaddr.sym == mi->iaddr.sym));
+			break;
+
+		default:
+			break;
+
+		}
+	}
+
+	return value;
+}
+
 static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 					struct addr_location *al,
 					struct perf_sample *sample,
@@ -547,12 +629,442 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
+static void print_hitm_cacheline_header(void)
+{
+#define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
+#define PARTICIPANTS1         "Node{cpus %hitms %stores} Node{cpus %hitms %stores} ..."
+#define PARTICIPANTS2         "Node{cpu list}; Node{cpu list}; Node{cpu list}; ..."
+
+	int i;
+	const  char *docptr;
+	static char  delimit[MAXTITLE_SZ];
+	static char  title2[MAXTITLE_SZ];
+	int       pad;
+
+	docptr = " ";
+	if (node_info == 1)
+		docptr = PARTICIPANTS1;
+	if (node_info  == 2)
+		docptr = PARTICIPANTS2;
+
+	sprintf(title2, "%4s %6s %6s %6s %6s %8s %8s %8s %8s %18s %6s %6s %18s %8s %8s %8s %6s %-30s %-20s %s",
+			"Num",
+			"%dist",
+			"%cumm",
+			"%dist",
+			"%cumm",
+			"LLCmiss",
+			"LLChit",
+			"L1 hit",
+			"L1 Miss",
+			"Data Address",
+			"Pid",
+			"Tid",
+			"Inst Address",
+			"median",
+			"mean",
+			"CV  ",
+			"cnt",
+			"Symbol",
+			"Object",
+			docptr);
+
+	for (i = 0; i < (int)strlen(title2); i++) strcat(delimit, "=");
+
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("\n");
+
+	pad = (strlen(title2)/2) - (strlen(SHARING_REPORT_TITLE)/2);
+	for (i = 0; i < pad; i++) printf(" ");
+	printf("%s\n", SHARING_REPORT_TITLE);
+
+	printf("\n");
+	printf("%4s %13s %13s %17s %8s %8s %18s %6s %6s %18s %26s %6s %30s %20s %s\n",
+		" ",
+		"---- All ----",
+		"-- Shared --",
+		"---- HITM ----",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		"Load Inst Execute Latency",
+		" ",
+		" ",
+		" ",
+		node_info  ? "Shared Data Participants" : " ");
+
+
+	printf("%4s %13s %13s %8s %8s %17s %18s %6s %6s %17s %18s\n",
+		" ",
+		" Data Misses",
+		" Data Misses",
+		"Remote",
+		"Local",
+		"-- Store Refs --",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ");
+
+	printf("%4s %13s %13s %8s %8s %8s %8s %18s %6s %6s %17s %18s %8s %6s\n",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		"---- cycles ----",
+		" ",
+		"cpu");
+
+	printf("%s\n", title2);
+	printf("%s\n", delimit);
+}
+
+static void print_hitm_cacheline(struct c2c_hit *h,
+				 int record,
+				 double tot_cumm,
+				 double ld_cumm,
+				 double tot_dist,
+				 double ld_dist)
+{
+	char pidstr[7];
+	char addrstr[20];
+	static char  summary[MAXLBL_SZ];
+	int j;
+
+	if (h->pid > 0)
+		sprintf(pidstr, "%6d", h->pid);
+	else
+		sprintf(pidstr, "***");
+	/*
+	 * It is possible to have none distinct virtual addresses
+	 * pointing to a distinct SYstem V shared memory region.
+	 * if there are multple virtual addresses the address
+	 * field will be astericks. It would be possible to subsitute
+	 * the physical address but this count be confusing as some
+	 * times the field is a virtual address while or times it
+	 * may be a physical address which may lead to confusion.
+	 */
+	if (h->daddr != ~0UL)
+		sprintf(addrstr, "%#18lx", CLADRS(h->daddr));
+	else
+		sprintf(addrstr, "****************");
+
+
+	sprintf(summary, "%4d %5.1f%% %5.1f%% %5.1f%% %5.1f%% %8d %8d %8d %8d %18s %6s\n",
+			record,
+			tot_dist * 100.,
+			tot_cumm * 100.,
+			ld_dist * 100.,
+			ld_cumm * 100.,
+			h->stats.t.rmt_hitm,
+			h->stats.t.lcl_hitm,
+			h->stats.t.st_l1hit,
+			h->stats.t.st_l1miss,
+			addrstr,
+			pidstr);
+
+	for (j = 0; j < (int)strlen(summary); j++) printf("-");
+	printf("\n");
+	printf("%s", summary);
+	for (j = 0; j < (int)strlen(summary); j++) printf("-");
+	printf("\n");
+}
+
+static void print_socket_stats_str(struct c2c_hit *clo,
+				   struct c2c_stats *node_stats)
+{
+	int i, j;
+
+	if (!node_stats)
+		return;
+
+	for (i = 0; i < max_node_num; i++) {
+		struct c2c_stats *stats = &node_stats[i];
+		int num = CPU_COUNT(&stats->cpuset);
+
+		if (!num) {
+			/* pad align socket info */
+			for (j = 0; j < 21; j++)
+				printf(" ");
+			continue;
+		}
+
+		printf("%2d{%2d ", i, num);
+
+		if (clo->stats.t.rmt_hitm > 0)
+			printf("%5.1f%% ", 100. * ((double)stats->t.rmt_hitm / (double) clo->stats.t.rmt_hitm));
+		else
+			printf("%6s ", "n/a");
+
+		if (clo->stats.t.store > 0)
+			printf("%5.1f%%} ", 100. * ((double)stats->t.store / (double)clo->stats.t.store));
+		else
+			printf("%6s} ", "n/a");
+	}
+}
+
+static void print_socket_shared_str(struct c2c_stats *node_stats)
+{
+	int i, j;
+
+	if (!node_stats)
+		return;
+
+	for (i = 0; i < max_node_num; i++) {
+		struct c2c_stats *stats = &node_stats[i];
+		int num = CPU_COUNT(&stats->cpuset);
+		int start = -1;
+		bool first = true;
+
+		if (!num)
+			continue;
+
+		printf("%d{", i);
+
+		for (j = 0; j < max_cpu_num; j++) {
+			if (!CPU_ISSET(j, &stats->cpuset)) {
+				if (start != -1) {
+					if ((j-1) - start)
+						/* print the range */
+						printf("%s%d-%d", (first ? "" : ","), start, j-1);
+					else
+						/* standalone */
+						printf("%s%d", (first ? "" : ",") , start);
+					start = -1;
+					first = false;
+				}
+				continue;
+			}
+
+			if (start == -1)
+				start = j;
+		}
+		/* last chunk */
+		if (start != -1) {
+			if ((j-1) - start)
+				/* print the range */
+				printf("%s%d-%d", (first ? "" : ","), start, j-1);
+			else
+				/* standalone */
+				printf("%s%d", (first ? "" : ",") , start);
+		}
+
+		printf("}; ");
+	}
+}
+
+static void print_hitm_cacheline_offset(struct c2c_hit *clo,
+					struct c2c_hit *h,
+					struct c2c_stats *node_stats)
+{
+#define SHORT_STR_LEN	7
+#define LONG_STR_LEN	30
+
+	char pidstr[SHORT_STR_LEN];
+	char tidstr[SHORT_STR_LEN];
+	char addrstr[LONG_STR_LEN];
+	char latstr[LONG_STR_LEN];
+	char objptr[LONG_STR_LEN];
+	char symptr[LONG_STR_LEN];
+	struct c2c_stats *stats = &clo->stats;
+	struct addr_map_symbol *ams;
+
+	ams = &clo->mi->iaddr;
+
+	if (clo->pid >= 0)
+		snprintf(pidstr, SHORT_STR_LEN, "%6d", clo->pid);
+	else
+		sprintf(pidstr, "***");
+
+	if (clo->tid >= 0)
+		snprintf(tidstr, SHORT_STR_LEN, "%6d", clo->tid);
+	else
+		sprintf(tidstr, "***");
+
+	if (clo->iaddr != ~0UL)
+		snprintf(addrstr, LONG_STR_LEN, "%#18lx", clo->iaddr);
+	else
+		sprintf(addrstr, "****************");
+	snprintf(objptr, LONG_STR_LEN, "%-18s", ams->map->dso->short_name);
+	snprintf(symptr, LONG_STR_LEN, "%-18s", (ams->sym ? ams->sym->name : "?????"));
+
+	if (stats->t.rmt_hitm > 0) {
+		double mean = avg_stats(&stats->stats);
+		double std = stddev_stats(&stats->stats);
+
+		sprintf(latstr, "%8.0f %8.0f %7.1f%%",
+			-1.0, /* FIXME */
+			mean,
+			rel_stddev_stats(std, mean));
+	} else {
+		sprintf(latstr, "%8s %8s %8s",
+			"n/a",
+			"n/a",
+			"n/a");
+
+	}
+
+	/*
+	 * implicit assumption that we are not coalescing over IPs
+	 */
+	printf("%4s %6s %6s %6s %6s %7.1f%% %7.1f%% %7.1f%% %7.1f%% %14s0x%02lx %6s %6s %18s %8s %6d %-30s %-20s",
+		" ",
+		" ",
+		" ",
+		" ",
+		" ",
+		(stats->t.rmt_hitm  > 0) ? (100. * ((double)stats->t.rmt_hitm  / (double)h->stats.t.rmt_hitm))  : 0.0,
+		(stats->t.lcl_hitm  > 0) ? (100. * ((double)stats->t.lcl_hitm  / (double)h->stats.t.lcl_hitm))  : 0.0,
+		(stats->t.st_l1hit  > 0) ? (100. * ((double)stats->t.st_l1hit  / (double)h->stats.t.st_l1hit))  : 0.0,
+		(stats->t.st_l1miss > 0) ? (100. * ((double)stats->t.st_l1miss / (double)h->stats.t.st_l1miss)) : 0.0,
+		" ",
+		(cloffset == LVL2) ? (clo->daddr & 0xff) : CLOFFSET(clo->daddr),
+		pidstr,
+		tidstr,
+		addrstr,
+		latstr,
+		CPU_COUNT(&clo->stats.cpuset),
+		symptr,
+		objptr);
+
+	if (node_info == 0)
+		printf("  ");
+	else if (node_info  == 1)
+		print_socket_stats_str(clo, node_stats);
+	else if (node_info  == 2)
+		print_socket_shared_str(node_stats);
+
+	printf("\n");
+}
+
+static void print_c2c_hitm_report(struct rb_root *hitm_tree,
+				  struct c2c_stats *hitm_stats __maybe_unused,
+				  struct c2c_stats *c2c_stats)
+{
+	struct rb_node	*next = rb_first(hitm_tree);
+	struct c2c_hit	*h, *clo = NULL;
+	u64		addr;
+	double		tot_dist, tot_cumm;
+	double		ld_dist, ld_cumm;
+	int		llc_misses;
+	int		record = 0;
+	struct c2c_stats *node_stats = NULL;
+
+	if (node_info) {
+		node_stats = zalloc(sizeof(struct c2c_stats) * cpu__max_node());
+		if (!node_stats) {
+			printf("Can not allocate stats for node output\n");
+			return;
+		}
+	}
+
+	print_hitm_cacheline_header();
+
+	llc_misses = c2c_stats->t.lcl_dram +
+		     c2c_stats->t.rmt_dram +
+		     c2c_stats->t.rmt_hit +
+		     c2c_stats->t.rmt_hitm;
+
+	/*
+	 * generate distinct cache line report
+	 */
+	tot_cumm = 0.0;
+	ld_cumm  = 0.0;
+
+	while (next) {
+		struct hist_entry *entry;
+
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		tot_dist  = ((double)h->stats.t.rmt_hitm / llc_misses);
+		tot_cumm += tot_dist;
+
+		ld_dist  = ((double)h->stats.t.rmt_hitm / c2c_stats->t.rmt_hitm);
+		ld_cumm += ld_dist;
+
+		/*
+		 * don't display lines with insignificant sharing contribution
+		 */
+		if (ld_dist < DISPLAY_LINE_LIMIT)
+			break;
+
+		print_hitm_cacheline(h, record, tot_cumm, ld_cumm, tot_dist, ld_dist);
+
+		list_for_each_entry(entry, &h->list, pairs.node) {
+
+			if (!clo || !matching_coalescing(clo, entry)) {
+				if (clo)
+					print_hitm_cacheline_offset(clo, h, node_stats);
+
+				free(clo);
+				addr = entry->mem_info->iaddr.al_addr;
+				clo = c2c_hit__new(addr, entry);
+				if (node_info)
+					memset(node_stats, 0, sizeof(struct c2c_stats) * cpu__max_node());
+			}
+			c2c_decode_stats(&clo->stats, entry);
+			c2c_hit__update_strings(clo, entry);
+
+			if (node_info) {
+				int node = cpu__get_node(entry->cpu);
+				c2c_decode_stats(&node_stats[node], entry);
+				CPU_SET(entry->cpu, &(node_stats[node].cpuset));
+			}
+
+		}
+		if (clo) {
+			print_hitm_cacheline_offset(clo, h, node_stats);
+			free(clo);
+			clo = NULL;
+		}
+
+		if (node_info)
+			memset(node_stats, 0, sizeof(struct c2c_stats) * cpu__max_node());
+
+		printf("\n");
+		record++;
+	}
+}
+
 static inline int valid_hitm_or_store(union perf_mem_data_src *dsrc)
 {
 	return ((dsrc->mem_snoop & P(SNOOP,HITM)) ||
 		(dsrc->mem_op & P(OP,STORE)));
 }
 
+static void print_shared_cacheline_info(struct c2c_stats *stats, int cline_cnt)
+{
+	int hitm_cnt = stats->t.lcl_hitm + stats->t.rmt_hitm;
+
+	printf("=================================================\n");
+	printf("    Global Shared Cache Line Event Information   \n");
+	printf("=================================================\n");
+	printf("  Total Shared Cache Lines          : %10d\n", cline_cnt);
+	printf("  Load HITs on shared lines         : %10d\n", stats->t.load);
+	printf("  Fill Buffer Hits on shared lines  : %10d\n", stats->t.ld_fbhit);
+	printf("  L1D hits on shared lines          : %10d\n", stats->t.ld_l1hit);
+	printf("  L2D hits on shared lines          : %10d\n", stats->t.ld_l2hit);
+	printf("  LLC hits on shared lines          : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+	printf("  Locked Access on shared lines     : %10d\n", stats->t.locks);
+	printf("  Store HITs on shared lines        : %10d\n", stats->t.store);
+	printf("  Store L1D hits on shared lines    : %10d\n", stats->t.st_l1hit);
+	printf("  Total Merged records              : %10d\n", hitm_cnt + stats->t.store);
+}
+
 static void c2c_analyze_hitms(struct perf_c2c *c2c)
 {
 
@@ -611,6 +1123,9 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
 	} else
 		free(h);
 
+	print_shared_cacheline_info(&hitm_stats, shared_clines);
+	print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
+
 cleanup:
 	next = rb_first(&hitm_tree);
 	while (next) {
@@ -813,6 +1328,10 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		 "specify the precision level of events (0,1,2,3) [default=1]"),
 	OPT_CALLBACK_NOOPT(0, "no-stores", &no_stores, "",
 			   "do not record stores", &opt_no_stores_cb),
+	OPT_INCR('N', "node-info", &node_info,
+		 "show extra node info in report (repeat for more info)"),
+	OPT_INTEGER('c', "coalesce-level", &coalesce_level,
+		 "how much coalescing for tid, pid, and ip is done (repeat for more coalescing)"),
 	OPT_INCR('v', "verbose", &verbose,
 		 "be more verbose (show counter open errors, etc)"),
 	OPT_STRING('i', "input", &input_name, "file",
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 11/15 V3] perf, c2c: Add callchain support
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (9 preceding siblings ...)
  2014-03-24 19:37 ` [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  2014-03-24 19:37 ` [PATCH 12/15 V3] perf, c2c: Output summary stats Don Zickus
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Seeing cacheline statistics is useful by itself.  Seeing the callchain
for these cache contentions saves time tracking things down.

This patch tries to add callchain support.  I had to use the generic
interface from a previous patch to output things to stdout easily.

Other than the displaying the results, collecting the callchain and
merging it was fairly straightforward.

I used a lot of copying-n-pasting from other builtin tools to get
the intial parameter setup correctly and the automatic reading of
'symbol_conf.use_callchain' from the data file.

Hopefully this is all correct.  The amount of memory corruption (from the
callchain dynamic array) seems to have dwindled done to nothing. :-)

V2: update to latest api
V3: remove call_graph variable, unused

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index e3dbb76..363deec 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -79,6 +79,8 @@ struct c2c_hit {
 	u64		daddr;
 	u64		iaddr;
 	struct mem_info	*mi;
+
+	struct callchain_root   callchain[0]; /* must be last member */
 };
 
 enum { OP, LVL, SNP, LCK, TLB };
@@ -382,7 +384,8 @@ static int c2c_decode_stats(struct c2c_stats *stats, struct hist_entry *entry)
 
 static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
 {
-	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit));
+	size_t callchain_size = symbol_conf.use_callchain ? sizeof(struct callchain_root) : 0;
+	struct c2c_hit *h = zalloc(sizeof(struct c2c_hit) + callchain_size);
 
 	if (!h) {
 		pr_err("Could not allocate c2c_hit memory\n");
@@ -396,6 +399,8 @@ static struct c2c_hit *c2c_hit__new(u64 cacheline, struct hist_entry *entry)
 	h->cacheline = cacheline;
 	h->pid = entry->thread->pid_;
 	h->tid = entry->thread->tid;
+	if (symbol_conf.use_callchain)
+		callchain_init(h->callchain);
 
 	/* use original addresses here, not adjusted al_addr */
 	h->iaddr = entry->mem_info->iaddr.addr;
@@ -519,6 +524,10 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 		return 0;
 	}
 
+	err = sample__resolve_callchain(sample, &parent, evsel, al, PERF_MAX_STACK_DEPTH);
+	if (err)
+		return err;
+
 	cost = sample->weight;
 	if (!cost)
 		cost = 1;
@@ -547,6 +556,7 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
 
 	c2c->hists.stats.total_period += cost;
 	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
+	err = hist_entry__append_callchain(he, sample);
 	return err;
 
 out_mem:
@@ -948,6 +958,13 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
 		print_socket_shared_str(node_stats);
 
 	printf("\n");
+
+	if (symbol_conf.use_callchain) {
+		generic_entry_callchain__fprintf(clo->callchain,
+						 h->stats.total_period,
+						 clo->stats.total_period,
+						 23, stdout);
+	}
 }
 
 static void print_c2c_hitm_report(struct rb_root *hitm_tree,
@@ -1024,6 +1041,12 @@ static void print_c2c_hitm_report(struct rb_root *hitm_tree,
 				c2c_decode_stats(&node_stats[node], entry);
 				CPU_SET(entry->cpu, &(node_stats[node].cpuset));
 			}
+			if (symbol_conf.use_callchain) {
+				callchain_cursor_reset(&callchain_cursor);
+				callchain_merge(&callchain_cursor,
+						clo->callchain,
+						entry->callchain);
+			}
 
 		}
 		if (clo) {
@@ -1155,6 +1178,29 @@ err:
 	return err;
 }
 
+static int perf_c2c__setup_sample_type(struct perf_c2c *c2c __maybe_unused,
+				       struct perf_session *session)
+{
+	u64 sample_type = perf_evlist__combined_sample_type(session->evlist);
+
+	if (!(sample_type & PERF_SAMPLE_CALLCHAIN)) {
+		if (symbol_conf.use_callchain) {
+			printf("Selected -g but no callchain data. Did "
+				  "you call 'perf c2c record' without -g?\n");
+			return -1;
+		}
+	} else if (callchain_param.mode != CHAIN_NONE &&
+		   !symbol_conf.use_callchain) {
+			symbol_conf.use_callchain = true;
+			if (callchain_register_param(&callchain_param) < 0) {
+				printf("Can't register callchain params.\n");
+				return -EINVAL;
+			}
+	}
+
+	return 0;
+}
+
 static int perf_c2c__read_events(struct perf_c2c *c2c)
 {
 	int err = -1;
@@ -1174,6 +1220,9 @@ static int perf_c2c__read_events(struct perf_c2c *c2c)
 	if (symbol__init() < 0)
 		goto out_delete;
 
+	if (perf_c2c__setup_sample_type(c2c, session) < 0)
+		goto out_delete;
+
 	/* setup the evsel handlers for each event type */
 	evlist__for_each(session->evlist, evsel) {
 		const char *name = perf_evsel__name(evsel);
@@ -1306,8 +1355,21 @@ opt_no_stores_cb(const struct option *opt __maybe_unused, const char *arg __mayb
 	return 0;
 }
 
+static int
+opt_callchain_cb(const struct option *opt __maybe_unused, const char *arg, int unset)
+{
+	/*
+	 * --no-call-graph
+	 */
+	if (unset)
+		return 0;
+
+	return report_parse_callchain_opt(arg);
+}
+
 int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 {
+	char callchain_default_opt[] = "fractal,0.05,callee";
 	struct perf_c2c c2c = {
 		.tool = {
 			.sample		 = perf_c2c__process_sample,
@@ -1340,6 +1402,9 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
 		   "separator",
 		   "separator for columns, no spaces will be added"
 		   " between columns '.' is reserved."),
+	OPT_CALLBACK_DEFAULT('g', "call-graph", &c2c, "output_type,min_percent[,print_limit],call_order",
+			     "Display callchains using output_type (graph, flat, fractal, or none) , min percent threshold, optional print limit, callchain order, key (function or address). "
+			     "Default: fractal,0.5,callee,function", &opt_callchain_cb, callchain_default_opt),
 	OPT_END()
 	};
 	const char * const c2c_usage[] = {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 12/15 V3] perf, c2c: Output summary stats
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (10 preceding siblings ...)
  2014-03-24 19:37 ` [PATCH 11/15 V3] perf, c2c: Add callchain support Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  2014-03-24 19:37 ` [PATCH 13/15 V3] perf, c2c: Dump rbtree for debugging Don Zickus
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Output some summary stats based on the processed records.
Mainly diagnostic uses.

Stats done by Dick Fowles, backported by me.

Sample output:

=================================================
            Trace Event Information
=================================================
  Total records                     :    1322047
  Locked Load/Store Operations      :     206317
  Load Operations                   :     355701
  Loads - uncacheable               :        590
  Loads - IO                        :          0
  Loads - Miss                      :        440
  Loads - no mapping                :        207
  Load Fill Buffer Hit              :     100214
  Load L1D hit                      :     148454
  Load L2D hit                      :      15170
  Load LLC hit                      :      53872
  Load Local HITM                   :      15388
  Load Remote HITM                  :      26760
  Load Remote HIT                   :       3910
  Load Local DRAM                   :       2436
  Load Remote DRAM                  :       3648
  Load MESI State Exclusive         :       2883
  Load MESI State Shared            :       3201
  Load LLC Misses                   :      36754
  LLC Misses to Local DRAM          :        6.6%
  LLC Misses to Remote DRAM         :        9.9%
  LLC Misses to Remote cache (HIT)  :       10.6%
  LLC Misses to Remote cache (HITM) :       72.8%
  Store Operations                  :     966322
  Store - uncacheable               :          0
  Store - no mapping                :      42931
  Store L1D Hit                     :     915696
  Store L1D Miss                    :       7695
  No Page Map Rejects               :       1193
  Unable to parse data source       :         24

V2: refresh to hist_entry

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 363deec..37bf0bd 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -966,7 +966,6 @@ static void print_hitm_cacheline_offset(struct c2c_hit *clo,
 						 23, stdout);
 	}
 }
-
 static void print_c2c_hitm_report(struct rb_root *hitm_tree,
 				  struct c2c_stats *hitm_stats __maybe_unused,
 				  struct c2c_stats *c2c_stats)
@@ -1161,6 +1160,51 @@ cleanup:
 	return;
 }
 
+static void print_c2c_trace_report(struct perf_c2c *c2c)
+{
+	int llc_misses;
+	struct c2c_stats *stats = &c2c->stats;
+
+	llc_misses = stats->t.lcl_dram +
+		     stats->t.rmt_dram +
+		     stats->t.rmt_hit +
+		     stats->t.rmt_hitm;
+
+	printf("=================================================\n");
+	printf("            Trace Event Information              \n");
+	printf("=================================================\n");
+	printf("  Total records                     : %10d\n", c2c->stats.nr_entries);
+	printf("  Locked Load/Store Operations      : %10d\n", stats->t.locks);
+	printf("  Load Operations                   : %10d\n", stats->t.load);
+	printf("  Loads - uncacheable               : %10d\n", stats->t.ld_uncache);
+	printf("  Loads - IO                        : %10d\n", stats->t.ld_io);
+	printf("  Loads - Miss                      : %10d\n", stats->t.ld_miss);
+	printf("  Loads - no mapping                : %10d\n", stats->t.ld_noadrs);
+	printf("  Load Fill Buffer Hit              : %10d\n", stats->t.ld_fbhit);
+	printf("  Load L1D hit                      : %10d\n", stats->t.ld_l1hit);
+	printf("  Load L2D hit                      : %10d\n", stats->t.ld_l2hit);
+	printf("  Load LLC hit                      : %10d\n", stats->t.ld_llchit + stats->t.lcl_hitm);
+	printf("  Load Local HITM                   : %10d\n", stats->t.lcl_hitm);
+	printf("  Load Remote HITM                  : %10d\n", stats->t.rmt_hitm);
+	printf("  Load Remote HIT                   : %10d\n", stats->t.rmt_hit);
+	printf("  Load Local DRAM                   : %10d\n", stats->t.lcl_dram);
+	printf("  Load Remote DRAM                  : %10d\n", stats->t.rmt_dram);
+	printf("  Load MESI State Exclusive         : %10d\n", stats->t.ld_excl);
+	printf("  Load MESI State Shared            : %10d\n", stats->t.ld_shared);
+	printf("  Load LLC Misses                   : %10d\n", llc_misses);
+	printf("  LLC Misses to Local DRAM          : %10.1f%%\n", ((double)stats->t.lcl_dram/(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote DRAM         : %10.1f%%\n", ((double)stats->t.rmt_dram/(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote cache (HIT)  : %10.1f%%\n", ((double)stats->t.rmt_hit /(double)llc_misses) * 100.);
+	printf("  LLC Misses to Remote cache (HITM) : %10.1f%%\n", ((double)stats->t.rmt_hitm/(double)llc_misses) * 100.);
+	printf("  Store Operations                  : %10d\n", stats->t.store);
+	printf("  Store - uncacheable               : %10d\n", stats->t.st_uncache);
+	printf("  Store - no mapping                : %10d\n", stats->t.st_noadrs);
+	printf("  Store L1D Hit                     : %10d\n", stats->t.st_l1hit);
+	printf("  Store L1D Miss                    : %10d\n", stats->t.st_l1miss);
+	printf("  No Page Map Rejects               : %10d\n", stats->t.nomap);
+	printf("  Unable to parse data source       : %10d\n", stats->t.noparse);
+}
+
 static int perf_c2c__process_events(struct perf_session *session,
 				    struct perf_c2c *c2c)
 {
@@ -1172,6 +1216,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
 
 err:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 13/15 V3] perf, c2c: Dump rbtree for debugging
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (11 preceding siblings ...)
  2014-03-24 19:37 ` [PATCH 12/15 V3] perf, c2c: Output summary stats Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  2014-03-24 19:37 ` [PATCH 14/15 V3] perf, c2c: Add symbol count table Don Zickus
  2014-03-24 19:37 ` [PATCH 15/15 V3] perf, c2c: Add shared cachline summary table Don Zickus
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Sometimes you want to verify the rbtree sorting on a unique id
is working correctly.  This allows you to dump it.

Sample output:

   Idx Hit Maj Min      Ino           InoGen    Pid            Daddr            Iaddr         Data Src                         (string)  cpumode
     0       0   0        0                0      0               22 ffffffff813044cf         48080184           [STORE,L1,MISS,SNP NA]        1

     1       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     2       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     3       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     4       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     5       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1
     6       0   0        0                0   2332              ca3 ffffffffa0226032         48080184           [STORE,L1,MISS,SNP NA]        1

     7       0   0        0                0  18179          135f860 ffffffff812ad509         68100242          [LOAD,LFB,HIT,SNP NONE]        1

     8       0   0        0                0  18179     7ff9d7fbaf98 ffffffff812ad509         68100242          [LOAD,LFB,HIT,SNP NONE]        1

V2: refresh with hist_entry

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index 37bf0bd..a937c4b 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -619,6 +619,55 @@ err:
 
 #define HAS_HITMS(h) (h->stats.t.lcl_hitm || h->stats.t.rmt_hitm)
 
+static void dump_rb_tree(struct rb_root *tree,
+			 struct perf_c2c *c2c __maybe_unused)
+{
+	struct rb_node *next = rb_first(tree);
+	struct hist_entry *he;
+	u64 cl = 0;
+	int idx = 0;
+
+	printf("# Summary: Total entries - %d\n", c2c->stats.nr_entries);
+	printf("# HITMs: Local - %d   Remote - %d     Total - %d\n",
+		c2c->stats.t.lcl_hitm, c2c->stats.t.rmt_hitm,
+		(c2c->stats.t.lcl_hitm + c2c->stats.t.rmt_hitm));
+
+	printf("%6s %3s %3s %3s %8s %16s %6s %16s %16s %16s %32s %8s\n",
+		"Idx", "Hit", "Maj", "Min", "Ino", "InoGen", "Pid", 
+		"Daddr", "Iaddr", "Data Src", "(string)", "cpumode");
+	while (next) {
+		char data_src[32];
+		u64 val;
+
+		he = rb_entry(next, struct hist_entry, rb_node_in);
+		next = rb_next(&he->rb_node_in);
+
+		if ((!he->color) || (cl != CLADRS(he->mem_info->daddr.al_addr))) {
+			printf("\n");
+			cl = CLADRS(he->mem_info->daddr.al_addr);
+		}
+
+		val = he->mem_info->data_src.val;
+		perf_c2c__scnprintf_data_src(data_src, sizeof(data_src), val);
+
+		printf("%6d %3s %3x %3x %8lx %16lx %6d %16lx %16lx %16lx %32s %8x\n",
+			idx,
+			(PERF_MEM_S(SNOOP,HITM) & val) ? " * " : "   ",
+			he->mem_info->daddr.map->maj,
+			he->mem_info->daddr.map->min,
+			he->mem_info->daddr.map->ino,
+			he->mem_info->daddr.map->ino_generation,
+			he->thread->pid_,
+			he->mem_info->daddr.addr,
+			he->mem_info->iaddr.addr,
+			val,
+			data_src,
+			he->cpumode);
+
+		idx++;
+	}
+}
+
 static void c2c_hit__update_stats(struct c2c_stats *new,
 				  struct c2c_stats *old)
 {
@@ -1216,6 +1265,8 @@ static int perf_c2c__process_events(struct perf_session *session,
 		goto err;
 	}
 
+	if (verbose > 2)
+		dump_rb_tree(c2c->hists.entries_in, c2c);
 	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 14/15 V3] perf, c2c: Add symbol count table
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (12 preceding siblings ...)
  2014-03-24 19:37 ` [PATCH 13/15 V3] perf, c2c: Dump rbtree for debugging Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  2014-03-24 19:37 ` [PATCH 15/15 V3] perf, c2c: Add shared cachline summary table Don Zickus
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

Just another table that displays the referenced symbols in the analysis
report.  The table lists the most frequently used symbols first.

It is just another way to look at similar data to figure out who
is causing the most contention (based on the workload used).

Original done by Dick Fowles, backported by me.

Sample output:

=======================================================================================================
                                                   Object Name, Path & Reference Counts

Index    Records   Object Name                       Object Path
=======================================================================================================
    0     931379   [kernel.kallsyms]                 [kernel.kallsyms]
    1     192258   fio                               /home/joe/old_fio-2.0.15/fio
    2      80302   [jbd2]                            /lib/modules/3.10.0c2c_all+/kernel/fs/jbd2/jbd2.ko
    3      65392   [ext4]                            /lib/modules/3.10.0c2c_all+/kernel/fs/ext4/ext4.ko

V2: refresh to latest upstream changes and hist_entry

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a937c4b..e30ce2f 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -688,6 +688,104 @@ static void c2c_hit__update_stats(struct c2c_stats *new,
 	new->total_period	+= old->total_period;
 }
 
+LIST_HEAD(ref_tree);
+LIST_HEAD(ref_tree_sorted);
+struct refs {
+	struct list_head	list;
+	int			nr;
+	const char		*name;
+	const char		*long_name;
+};
+
+static int update_ref_list(struct hist_entry *entry)
+{
+	struct refs *p;
+	struct dso *dso = entry->mem_info->iaddr.map->dso;
+	const char *name = dso->short_name;
+
+	list_for_each_entry(p, &ref_tree, list) {
+		if (!strcmp(p->name, name))
+			goto found;
+	}
+
+	p = zalloc(sizeof(struct refs));
+	if (!p)
+		return -1;
+	p->name = name;
+	p->long_name = dso->long_name;
+	list_add_tail(&p->list, &ref_tree);
+
+found:
+	p->nr++;
+	return 0;
+}
+
+static void print_symbol_record_count(struct rb_root *tree)
+{
+	struct rb_node *next = rb_first(tree);
+	struct hist_entry *he;
+	struct refs *p, *q, *pn;
+	char      string[256];
+	char      delimit[256];
+	int       i;
+	int       idx = 0;
+
+	/* gather symbol references */
+	while (next) {
+		he = rb_entry(next, struct hist_entry, rb_node_in);
+		next = rb_next(&he->rb_node_in);
+
+		if (update_ref_list(he)) {
+			pr_err("Could not update reference tree\n");
+			goto cleanup;
+		}
+	}
+
+	/* sort on number of references per symbol */
+	list_for_each_entry_safe(p, pn, &ref_tree, list) {
+		list_del_init(&p->list);
+		list_for_each_entry(q, &ref_tree_sorted, list) {
+			if (p->nr > q->nr) {
+				list_add_tail(&p->list, &q->list);
+				break;
+			}
+		}
+		if (list_empty(&p->list))
+			list_add_tail(&p->list, &ref_tree_sorted);
+	}
+
+	/* print header info */
+	sprintf(string, "%5s   %8s   %-32s  %-80s",
+		"Index",
+		"Records",
+		"Object Name",
+		"Object Path");
+
+	delimit[0] = '\0';
+	for (i = 0; i < (int)strlen(string); i++) strcat(delimit, "=");
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("%50s %s\n", " ", "Object Name, Path & Reference Counts");
+	printf("\n");
+	printf("%s\n", string);
+	printf("%s\n", delimit);
+
+	/* print out table */
+	list_for_each_entry(p, &ref_tree_sorted, list) {
+		printf("%5d   %8d   %-32s  %-80s\n",
+			idx, p->nr, p->name, p->long_name);
+		idx++;
+	}
+	printf("\n");
+
+cleanup:
+	list_for_each_entry_safe(p, pn, &ref_tree_sorted, list) {
+		list_del(&p->list);
+		free(p);
+	}
+}
+
 static void print_hitm_cacheline_header(void)
 {
 #define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
@@ -1269,6 +1367,7 @@ static int perf_c2c__process_events(struct perf_session *session,
 		dump_rb_tree(c2c->hists.entries_in, c2c);
 	print_c2c_trace_report(c2c);
 	c2c_analyze_hitms(c2c);
+	print_symbol_record_count(c2c->hists.entries_in);
 
 err:
 	return err;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 15/15 V3] perf, c2c: Add shared cachline summary table
  2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
                   ` (13 preceding siblings ...)
  2014-03-24 19:37 ` [PATCH 14/15 V3] perf, c2c: Add symbol count table Don Zickus
@ 2014-03-24 19:37 ` Don Zickus
  14 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:37 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

This adds a quick summary of the hottest cache contention lines based
on the input data.  This summarizes what the broken table shows you,
so you can see at a quick glance which cachelines are interesting.

Originally done by Dick Fowles, backported by me.

Sample output (width trimmed):

===================================================================================================================================================

                                                                                          Shared Data Cache Line Table

                                 Total     %All                Total       ---- Core Load Hit ----  -- LLC Load Hit --     ----- LLC Load Hitm -----
   Index           Phys Adrs   Records   Ld Miss     %hitm     Loads        FB       L1D       L2D       Lcl       Rmt     Total       Lcl       Rmt
====================================================================================================================================================
       0  0xffff881fa55b0140     72006    16.97%    23.31%     43095     13591     16860        45      2651        25      9526      3288      6238
       1  0xffff881fba47f000     21854     5.29%     7.26%     13938      3887      6941        15         1         7      3087      1143      1944
       2  0xffff881fc21b9cc0      2153     1.61%     2.21%       862        32        70         0        15         1       740       148       592
       3  0xffff881fc7d91cc0      1957     1.40%     1.92%       866        34        94         0        14         3       720       207       513
       4  0xffff881fba539cc0      1813     1.35%     1.85%       808        33        84         3        14         1       665       170       495

Original-by: Dick Fowles <rfowles@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 tools/perf/builtin-c2c.c | 136 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)

diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index e30ce2f..e33f548 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -786,6 +786,141 @@ cleanup:
 	}
 }
 
+static void print_c2c_shared_cacheline_report(struct rb_root *hitm_tree,
+					      struct c2c_stats *shared_stats __maybe_unused,
+					      struct c2c_stats *c2c_stats __maybe_unused)
+{
+#define   SHM_TITLE  "Shared Data Cache Line Table"
+
+	struct rb_node	*next = rb_first(hitm_tree);
+	struct c2c_hit	*h;
+	char		header[256];
+	char		delimit[256];
+	u32		crecords;
+	u32		lclmiss;
+	u32		ldcnt;
+	double		p_hitm;
+	double		p_all;
+	int		totmiss;
+	int		rmt_hitm;
+	int		len;
+	int		pad;
+	int		i;
+
+	sprintf(header, "%28s  %8s  %8s  %8s  %8s  %28s  %18s  %28s  %18s  %8s  %28s",
+		" ",
+		"Total",
+		"%All ",
+		" ",
+		"Total",
+		"---- Core Load Hit ----",
+		"-- LLC Load Hit --",
+		"----- LLC Load Hitm -----",
+		"-- Load Dram --",
+		"LLC  ",
+		"---- Store Reference ----");
+
+	len = strlen(header);
+	delimit[0] = '\0';
+
+	for (i = 0; i < len; i++)
+		strcat(delimit, "=");
+
+	printf("\n\n");
+	printf("%s\n", delimit);
+	printf("\n");
+	pad = (strlen(header)/2) - (strlen(SHM_TITLE)/2);
+	for (i = 0; i < pad; i++)
+		printf(" ");
+	printf("%s\n", SHM_TITLE);
+	printf("\n");
+	printf("%s\n", header);
+
+	sprintf(header, "%8s  %18s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s  %8s",
+		"Index",
+		"Phys Adrs",
+		"Records",
+		"Ld Miss",
+		"%hitm",
+		"Loads",
+		"FB",
+		"L1D",
+		"L2D",
+		"Lcl",
+		"Rmt",
+		"Total",
+		"Lcl",
+		"Rmt",
+		"Lcl",
+		"Rmt",
+		"Ld Miss",
+		"Total",
+		"L1Hit",
+		"L1Miss");
+
+	printf("%s\n", header);
+	printf("%s\n", delimit);
+
+	rmt_hitm    = c2c_stats->t.rmt_hitm;
+	totmiss     = c2c_stats->t.lcl_dram +
+		      c2c_stats->t.rmt_dram +
+		      c2c_stats->t.rmt_hit +
+		      c2c_stats->t.rmt_hitm;
+
+	i = 0;
+	while (next) {
+		h = rb_entry(next, struct c2c_hit, rb_node);
+		next = rb_next(&h->rb_node);
+
+		lclmiss  = h->stats.t.lcl_dram +
+			   h->stats.t.rmt_dram +
+			   h->stats.t.rmt_hitm +
+			   h->stats.t.rmt_hit;
+
+		ldcnt    = lclmiss +
+			   h->stats.t.ld_fbhit +
+			   h->stats.t.ld_l1hit +
+			   h->stats.t.ld_l2hit +
+			   h->stats.t.ld_llchit +
+			   h->stats.t.lcl_hitm;
+
+		crecords = ldcnt +
+			   h->stats.t.st_l1hit +
+			   h->stats.t.st_l1miss;
+
+		p_hitm = (double)h->stats.t.rmt_hitm / (double)rmt_hitm;
+		p_all  = (double)h->stats.t.rmt_hitm / (double)totmiss;
+
+		/* stop when the percentage gets to low */
+		if (p_hitm < DISPLAY_LINE_LIMIT)
+			break;
+
+		printf("%8d  %#18lx  %8u  %7.2f%%  %7.2f%%  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u  %8u\n",
+			i,
+			h->cacheline,
+			crecords,
+			100. * p_all,
+			100. * p_hitm,
+			ldcnt,
+			h->stats.t.ld_fbhit,
+			h->stats.t.ld_l1hit,
+			h->stats.t.ld_l2hit,
+			h->stats.t.ld_llchit,
+			h->stats.t.rmt_hit,
+			h->stats.t.lcl_hitm + h->stats.t.rmt_hitm,
+			h->stats.t.lcl_hitm,
+			h->stats.t.rmt_hitm,
+			h->stats.t.lcl_dram,
+			h->stats.t.rmt_dram,
+			lclmiss,
+			h->stats.t.store,
+			h->stats.t.st_l1hit,
+			h->stats.t.st_l1miss);
+
+		i++;
+	}
+}
+
 static void print_hitm_cacheline_header(void)
 {
 #define SHARING_REPORT_TITLE  "Shared Cache Line Distribution Pareto"
@@ -1293,6 +1428,7 @@ static void c2c_analyze_hitms(struct perf_c2c *c2c)
 		free(h);
 
 	print_shared_cacheline_info(&hitm_stats, shared_clines);
+	print_c2c_shared_cacheline_report(&hitm_tree, &hitm_stats, &c2c->stats);
 	print_c2c_hitm_report(&hitm_tree, &hitm_stats, &c2c->stats);
 
 cleanup:
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-03-24 19:36 ` [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features Don Zickus
@ 2014-03-29 17:10   ` Jiri Olsa
  2014-04-01  2:52     ` Don Zickus
  2014-04-08  7:41     ` Namhyung Kim
  2014-04-08  7:18   ` Namhyung Kim
  1 sibling, 2 replies; 45+ messages in thread
From: Jiri Olsa @ 2014-03-29 17:10 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:

SNIP

>  
>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> +					struct addr_location *al,
>  					struct perf_sample *sample,
> -					struct addr_location *al)
> +					struct perf_evsel *evsel)
>  {
> -	if (c2c->raw_records)
> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> +	struct mem_info *mi;
> +
> +	mi = sample__resolve_mem(sample, al);
> +	if (!mi)
> +		return -ENOMEM;

perhaps not directly related to this patchset, but I needed
attached patch to get resolved data in .bss (static), which
for some reason happened to be located in executable segment

I wonder why we need this VARIABLE/FUNCTION separation at all,
I think Arnaldo told me some archs could have same addresses
for data and functions.. will check ;-)

jirka

---
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index d9d86c6..be6d7cf 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1220,6 +1220,10 @@ static void ip__resolve_data(struct machine *machine, struct thread *thread,
 
 	thread__find_addr_location(thread, machine, m, MAP__VARIABLE, addr,
 				   &al);
+	if (!al.map)
+		thread__find_addr_location(thread, machine, m, MAP__FUNCTION, addr,
+					   &al);
+
 	ams->addr = addr;
 	ams->al_addr = al.addr;
 	ams->sym = al.sym;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-03-24 19:36 ` [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores Don Zickus
@ 2014-03-29 17:11   ` Jiri Olsa
  2014-04-01  2:55     ` Don Zickus
  2014-04-08  7:31   ` Namhyung Kim
  1 sibling, 1 reply; 45+ messages in thread
From: Jiri Olsa @ 2014-03-29 17:11 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Mon, Mar 24, 2014 at 03:36:57PM -0400, Don Zickus wrote:
> Modified the code to allow latency settings to be tweaked on the command line
> and also the ability to dynamically profile stores (or disable using stores).
> 
> This allows the tool to be used on older Intel platforms like Westmere.
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---

SNIP

> @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
>  	};
>  	const struct option c2c_options[] = {
>  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
> +	OPT_INTEGER('l', "latency-level", &lat_level,
> +		 "specify the latency threshold for loads [default=30]"),
> +	OPT_INTEGER('p', "precision-level", &prec_level,
> +		 "specify the precision level of events (0,1,2,3) [default=1]"),

could we get also option for user space modifier?
I needed to hack it in for my tests ;-)

thanks,
jirka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-03-29 17:10   ` Jiri Olsa
@ 2014-04-01  2:52     ` Don Zickus
  2014-04-08  7:41     ` Namhyung Kim
  1 sibling, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-04-01  2:52 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Sat, Mar 29, 2014 at 06:10:18PM +0100, Jiri Olsa wrote:
> On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:
> 
> SNIP
> 
> >  
> >  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> > +					struct addr_location *al,
> >  					struct perf_sample *sample,
> > -					struct addr_location *al)
> > +					struct perf_evsel *evsel)
> >  {
> > -	if (c2c->raw_records)
> > -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> > +	struct mem_info *mi;
> > +
> > +	mi = sample__resolve_mem(sample, al);
> > +	if (!mi)
> > +		return -ENOMEM;
> 
> perhaps not directly related to this patchset, but I needed
> attached patch to get resolved data in .bss (static), which
> for some reason happened to be located in executable segment
> 
> I wonder why we need this VARIABLE/FUNCTION separation at all,
> I think Arnaldo told me some archs could have same addresses
> for data and functions.. will check ;-)

Well that hack is a lot prettier than the one I came up with a couple of
weeks ago. :-)  Though I was more strict and only went down this path if
the protection bit had PROT_READ set.

Note to Arnaldo, I think this is only needed in //anon regions or the
[stack] (basically some private memory area that had PROT_READ and
PROT_EXEC set).  I noticed it with the java programs I was running.

Cheers,
Don
> 
> jirka
> 
> ---
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index d9d86c6..be6d7cf 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1220,6 +1220,10 @@ static void ip__resolve_data(struct machine *machine, struct thread *thread,
>  
>  	thread__find_addr_location(thread, machine, m, MAP__VARIABLE, addr,
>  				   &al);
> +	if (!al.map)
> +		thread__find_addr_location(thread, machine, m, MAP__FUNCTION, addr,
> +					   &al);
> +
>  	ams->addr = addr;
>  	ams->al_addr = al.addr;
>  	ams->sym = al.sym;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-03-29 17:11   ` Jiri Olsa
@ 2014-04-01  2:55     ` Don Zickus
  2014-04-06 13:14       ` Jiri Olsa
  0 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-04-01  2:55 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Sat, Mar 29, 2014 at 06:11:38PM +0100, Jiri Olsa wrote:
> On Mon, Mar 24, 2014 at 03:36:57PM -0400, Don Zickus wrote:
> > Modified the code to allow latency settings to be tweaked on the command line
> > and also the ability to dynamically profile stores (or disable using stores).
> > 
> > This allows the tool to be used on older Intel platforms like Westmere.
> > 
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> 
> SNIP
> 
> > @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
> >  	};
> >  	const struct option c2c_options[] = {
> >  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
> > +	OPT_INTEGER('l', "latency-level", &lat_level,
> > +		 "specify the latency threshold for loads [default=30]"),
> > +	OPT_INTEGER('p', "precision-level", &prec_level,
> > +		 "specify the precision level of events (0,1,2,3) [default=1]"),
> 
> could we get also option for user space modifier?

You mean the 'u' modifier, ie cpu/mem-loads/u ?  If so, then I can do that
but will that work with the -a option (which is hardcoded in the c2c
tool [system-wide mode])?

> I needed to hack it in for my tests ;-)
> 
> thanks,
> jirka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-04-01  2:55     ` Don Zickus
@ 2014-04-06 13:14       ` Jiri Olsa
  2014-04-07 18:16         ` Don Zickus
  2014-04-08  7:37         ` Namhyung Kim
  0 siblings, 2 replies; 45+ messages in thread
From: Jiri Olsa @ 2014-04-06 13:14 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Mon, Mar 31, 2014 at 10:55:35PM -0400, Don Zickus wrote:
> On Sat, Mar 29, 2014 at 06:11:38PM +0100, Jiri Olsa wrote:
> > On Mon, Mar 24, 2014 at 03:36:57PM -0400, Don Zickus wrote:
> > > Modified the code to allow latency settings to be tweaked on the command line
> > > and also the ability to dynamically profile stores (or disable using stores).
> > > 
> > > This allows the tool to be used on older Intel platforms like Westmere.
> > > 
> > > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > > ---
> > 
> > SNIP
> > 
> > > @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
> > >  	};
> > >  	const struct option c2c_options[] = {
> > >  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
> > > +	OPT_INTEGER('l', "latency-level", &lat_level,
> > > +		 "specify the latency threshold for loads [default=30]"),
> > > +	OPT_INTEGER('p', "precision-level", &prec_level,
> > > +		 "specify the precision level of events (0,1,2,3) [default=1]"),
> > 
> > could we get also option for user space modifier?
> 
> You mean the 'u' modifier, ie cpu/mem-loads/u ?  If so, then I can do that
> but will that work with the -a option (which is hardcoded in the c2c
> tool [system-wide mode])?

right, forgot about that.. could the -a option be optional as well?
probably the same way as for record would be the best:

  perf c2c record ./foo       # workload specific
  perf c2c record -a sleep 3  # system wide

jirka

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-04-06 13:14       ` Jiri Olsa
@ 2014-04-07 18:16         ` Don Zickus
  2014-04-09  0:17           ` Namhyung Kim
  2014-04-08  7:37         ` Namhyung Kim
  1 sibling, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-04-07 18:16 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Sun, Apr 06, 2014 at 03:14:22PM +0200, Jiri Olsa wrote:
> On Mon, Mar 31, 2014 at 10:55:35PM -0400, Don Zickus wrote:
> > On Sat, Mar 29, 2014 at 06:11:38PM +0100, Jiri Olsa wrote:
> > > On Mon, Mar 24, 2014 at 03:36:57PM -0400, Don Zickus wrote:
> > > > Modified the code to allow latency settings to be tweaked on the command line
> > > > and also the ability to dynamically profile stores (or disable using stores).
> > > > 
> > > > This allows the tool to be used on older Intel platforms like Westmere.
> > > > 
> > > > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > > > ---
> > > 
> > > SNIP
> > > 
> > > > @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
> > > >  	};
> > > >  	const struct option c2c_options[] = {
> > > >  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
> > > > +	OPT_INTEGER('l', "latency-level", &lat_level,
> > > > +		 "specify the latency threshold for loads [default=30]"),
> > > > +	OPT_INTEGER('p', "precision-level", &prec_level,
> > > > +		 "specify the precision level of events (0,1,2,3) [default=1]"),
> > > 
> > > could we get also option for user space modifier?
> > 
> > You mean the 'u' modifier, ie cpu/mem-loads/u ?  If so, then I can do that
> > but will that work with the -a option (which is hardcoded in the c2c
> > tool [system-wide mode])?
> 
> right, forgot about that.. could the -a option be optional as well?
> probably the same way as for record would be the best:
> 
>   perf c2c record ./foo       # workload specific
>   perf c2c record -a sleep 3  # system wide

I understand what you are saying, but our tool was written to find cache
contention across the system, so -a is usually implied.  Most of our
profiling is done system-wide.  I would rather implement the opposite
option --no-system-wide, if that is ok.

Cheers,
Don

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/15 V3] perf c2c: Shared data analyser
  2014-03-24 19:36 ` [PATCH 03/15 V3] perf c2c: Shared data analyser Don Zickus
@ 2014-04-08  6:59   ` Namhyung Kim
  2014-04-08 14:22     ` Don Zickus
  2014-04-08 14:23     ` Don Zickus
  0 siblings, 2 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  6:59 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Richard Fowles

Hi Don,

On Mon, 24 Mar 2014 15:36:54 -0400, Don Zickus wrote:
> From: Arnaldo Carvalho de Melo <acme@redhat.com>
>
> This is the start of a new perf tool that will collect information about
> memory accesses and analyse it to find things like hot cachelines, etc.

So why not integrating this into existing 'perf mem' command if it's all
about analyzing memory accesses?

>
> This is basically trying to get a prototype written by Richard Fowles
> written using the tools/perf coding style and libraries.
>
> Start it from 'perf sched', this patch starts the process by adding the
> 'record' subcommand to collect the needed mem loads and stores samples.
>
> It also have the basic 'report' skeleton, resolving the sample address
> and hooking the events found in a perf.data file with methods to handle
> them, right now just printing the resolved perf_sample data structure
> after each event name.
>
> [dcz: refreshed to latest upstream changes]

[SNIP]
> +perf-c2c(1)
> +===========
> +
> +NAME
> +----
> +perf-c2c - Shared Data C2C/HITM Analyzer.
> +
> +SYNOPSIS
> +--------
> +[verse]
> +'perf c2c' record
> +
> +DESCRIPTION
> +-----------
> +These are the variants of perf c2c:
> +
> +  'perf c2c record <command>' to record the memory accesses of an arbitrary
> +  workload.
> +
> +SEE ALSO
> +--------
> +linkperf:perf-record[1], linkperf:perf-mem[1]

This document is very terse and only memtions the 'record' subcommand -
also it's not updated throughout the series.  So I'd like to suggest
adding a separate documentation patch with full/verbose descriptions at
the end of this series.


[SNIP]
> +static int perf_c2c__read_events(struct perf_c2c *c2c)
> +{
> +	int err = -1;
> +	struct perf_session *session;
> +	struct perf_data_file file = {
> +			.path = input_name,
> +			.mode = PERF_DATA_MODE_READ,
> +	};
> +	struct perf_evsel *evsel;
> +
> +	session = perf_session__new(&file, 0, &c2c->tool);
> +	if (session == NULL) {
> +		pr_debug("No memory for session\n");
> +		goto out;
> +	}
> +
> +	/* setup the evsel handlers for each event type */
> +	evlist__for_each(session->evlist, evsel) {
> +		const char *name = perf_evsel__name(evsel);
> +		unsigned int i;
> +
> +		for (i = 0; i < ARRAY_SIZE(handlers); i++) {
> +			if (!strcmp(name, handlers[i].name))
> +				evsel->handler = handlers[i].handler;
> +		}
> +	}
> +
> +	err = perf_session__process_events(session, &c2c->tool);
> +	if (err)
> +		pr_err("Failed to process events, error %d", err);

You may want to add perf_session__delete() here.

> +
> +out:
> +	return err;
> +}


[SNIP]
> +int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
> +{
> +	struct perf_c2c c2c = {
> +		.tool = {
> +			.sample		 = perf_c2c__process_sample,
> +			.comm		 = perf_event__process_comm,
> +			.exit		 = perf_event__process_exit,
> +			.fork		 = perf_event__process_fork,
> +			.lost		 = perf_event__process_lost,

It seems that it also needs to handle mmap[2] events otherwise it cannot
find symbols from an address.

Thanks,
Namhyung


> +			.ordered_samples = true,
> +		},
> +	};

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits
  2014-03-24 19:36 ` [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits Don Zickus
@ 2014-04-08  7:09   ` Namhyung Kim
  0 siblings, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  7:09 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Richard Fowles

On Mon, 24 Mar 2014 15:36:55 -0400, Don Zickus wrote:
> +	union perf_mem_data_src dsrc = { .val = val, };
> +	int printed = scnprintf(bf, size, PREFIX);
> +	size_t i;
> +	bool first_present = true;
> +
> +	for (i = 0; i < ARRAY_SIZE(decode_bits); i++) {
> +		int bitval;
> +
> +		switch (decode_bits[i].field) {
> +		case OP:  bitval = decode_bits[i].bit & dsrc.mem_op;    break;
> +		case LVL: bitval = decode_bits[i].bit & dsrc.mem_lvl;   break;
> +		case SNP: bitval = decode_bits[i].bit & dsrc.mem_snoop; break;
> +		case LCK: bitval = decode_bits[i].bit & dsrc.mem_lock;  break;
> +		case TLB: bitval = decode_bits[i].bit & dsrc.mem_dtlb;  break;
> +		default: bitval = 0;					break;
> +		}
> +
> +		if (!bitval)
> +			continue;
> +
> +		if (strlen(decode_bits[i].name) + !!i > size - printed - sizeof(SUFFIX)) {
> +			sprintf(bf + size - sizeof(SUFFIX) - sizeof(ELLIPSIS) + 1, ELLIPSIS);
> +			printed = size - sizeof(SUFFIX);
> +			break;
> +		}
> +
> +		printed += scnprintf(bf + printed, size - printed, "%s%s",
> +				     first_present ? "" : ",", decode_bits[i].name);
> +		first_present = false;
> +	}
> +
> +	printed += scnprintf(bf + printed, size - printed, SUFFIX);

It seems that adjusting bf and size as printed might make the code
somewhat simpler and readable - i.e.:

        ret = scnprintf(bf, size, "%s", ...);

        bf += ret;
        size -= ret;
        printed += ret;


> +	return printed;
>  }
>  
> -static int perf_c2c__process_load(struct perf_evsel *evsel,
> -				  struct perf_sample *sample,
> -				  struct addr_location *al)
> +static int perf_c2c__fprintf_header(FILE *fp)
>  {
> -	perf_sample__fprintf(sample, evsel, al, stdout);
> -	return 0;
> +	int printed = fprintf(fp, "%c %-16s  %6s  %6s  %4s  %18s  %18s  %18s  %6s  %-10s %-60s %s\n", 
> +			      'T',
> +			      "Status",
> +			      "Pid",
> +			      "Tid",
> +			      "CPU",
> +			      "Inst Adrs",
> +			      "Virt Data Adrs",
> +			      "Phys Data Adrs",
> +			      "Cycles",
> +			      "Source",
> +			      "  Decoded Source",
> +			      "ObJect:Symbol");

s/ObJect/Object/ ?

Thanks,
Namhyung


> +	return printed + fprintf(fp, "%-*.*s\n", printed, printed, graph_dotted_line);
> +}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-03-24 19:36 ` [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features Don Zickus
  2014-03-29 17:10   ` Jiri Olsa
@ 2014-04-08  7:18   ` Namhyung Kim
  1 sibling, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  7:18 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Mon, 24 Mar 2014 15:36:56 -0400, Don Zickus wrote:
> A basic patch that re-arranges some of the c2c code and adds a couple
> of small features to lay the ground work for the rest of the patch
> series.
>
> Changes include:
>
> o reworking the report path
> o replace preprocess_sample with simpler calls
> o rework raw output to handle separators
> o remove phys id gunk
> o add some generic options

[SNIP]
>
> +	return fprintf(fp, fmt,
> +		       tag,				sep,
> +		       reason ?: "valid record",	sep,
> +		       sample->pid,			sep,
> +		       sample->tid,			sep,
> +		       sample->cpu,			sep,
> +		       sample->ip,			sep,
> +		       sample->addr,			sep,
> +		       sample->weight,			sep,
> +		       sample->data_src,		sep,
> +		       data_src,			sep,
> +		       map ? (map->dso ? map->dso->long_name : "???") : "???",

How about using this instead?

		(map && map->dso) ? map->dso->long_name : "???".


> +		       mi->iaddr.sym ? mi->iaddr.sym->name : "???");
>  }
>  
>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> +					struct addr_location *al,
>  					struct perf_sample *sample,
> -					struct addr_location *al)
> +					struct perf_evsel *evsel)
>  {
> -	if (c2c->raw_records)
> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> +	struct mem_info *mi;
> +
> +	mi = sample__resolve_mem(sample, al);
> +	if (!mi)
> +		return -ENOMEM;
> +
> +	if (c2c->raw_records) {
> +		perf_sample__fprintf(sample, ' ', "raw input", mi, stdout);
> +		free(mi);
> +		return 0;
> +	}

It seems the 'mi' should be freed here as well?

>  
>  	return 0;
>  }


[SNIP]
> +static int perf_c2c__process_events(struct perf_session *session,
> +				    struct perf_c2c *c2c)
> +{
> +	int err = -1;

No need to set.

> +
> +	err = perf_session__process_events(session, &c2c->tool);
> +	if (err) {
> +		pr_err("Failed to process count events, error %d\n", err);
> +		goto err;

Unnecessary goto.

Thanks,
Namhyung


>  	}
>  
> +err:
>  	return err;
>  }

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-03-24 19:36 ` [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores Don Zickus
  2014-03-29 17:11   ` Jiri Olsa
@ 2014-04-08  7:31   ` Namhyung Kim
  1 sibling, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  7:31 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Mon, 24 Mar 2014 15:36:57 -0400, Don Zickus wrote:
> Modified the code to allow latency settings to be tweaked on the command line
> and also the ability to dynamically profile stores (or disable using stores).
>
> This allows the tool to be used on older Intel platforms like Westmere.
>

[SNIP]
> @@ -277,8 +286,21 @@ static int perf_c2c__record(int argc, const char **argv)
>  		"-d",
>  		"-a",
>  	};
> +	char lat[16], prec[4] = "/";
> +	char buf[256];
> +
> +	snprintf(lat, sizeof(lat), "ldlat=%d", lat_level);
> +
> +	if (prec_level > 3)
> +		prec_level = 3;
> +
> +	for (i = 0; i < (unsigned int)prec_level; i++)
> +		prec[i+1] = 'p';

Hmm.. it seems that negative prec_level can be a problem here - how
about making it unsigned?  Also the 'prec' is not null-terminated and
needs to be increased to 5 to accept max level 3 ('/' + 'ppp' + '\0').

> +
> +	/* how many events to pass in */
> +	n = (no_stores) ? 1 : 2;
>  
> -	rec_argc = ARRAY_SIZE(record_args) + 2 * ARRAY_SIZE(handlers) + argc - 1;
> +	rec_argc = ARRAY_SIZE(record_args) + 2 * n + argc - 1;
>  	rec_argv = calloc(rec_argc + 1, sizeof(char *));
>  
>  	if (rec_argv == NULL)
> @@ -287,9 +309,16 @@ static int perf_c2c__record(int argc, const char **argv)
>  	for (i = 0; i < ARRAY_SIZE(record_args); i++)
>  		rec_argv[i] = strdup(record_args[i]);
>  
> -	for (j = 0; j < ARRAY_SIZE(handlers); j++) {
> +	/* mem-loads is first index */
> +	snprintf(buf, sizeof(buf), "%s,%s%s", handlers[0].name, lat, prec);
> +	rec_argv[i++] = strdup("-e");
> +	rec_argv[i++] = strdup(buf);
> +
> +	if (!no_stores) {
> +		/* mem-stores is second index */
>  		rec_argv[i++] = strdup("-e");
> -		rec_argv[i++] = strdup(handlers[j].name);
> +		snprintf(buf, sizeof(buf), "%s%s", handlers[1].name, prec);
> +		rec_argv[i++] = strdup(buf);
>  	}
>  
>  	for (j = 1; j < (unsigned int)argc; j++, i++)
> @@ -300,6 +329,30 @@ static int perf_c2c__record(int argc, const char **argv)
>  	return cmd_record(i, rec_argv, NULL);
>  }
>  
> +static int
> +opt_no_stores_cb(const struct option *opt __maybe_unused, const char *arg __maybe_unused, int unset)
> +{
> +	int ret;
> +
> +	if (unset) {
> +		struct stat st;
> +		char path[PATH_MAX];
> +
> +		snprintf(path, PATH_MAX, "%s/bus/event_source/devices/cpu/events/mem-stores",
> +			 sysfs__mountpoint());
> +
> +		ret = stat(path, &st);
> +		if (ret) {
> +			pr_debug("Omitting mem-stores event");
> +			no_stores = true;
> +		}
> +		return 0;
> +	}
> +
> +	no_stores = true;
> +	return 0;
> +}
> +
>  int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
>  {
>  	struct perf_c2c c2c = {
> @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
>  	};
>  	const struct option c2c_options[] = {
>  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
> +	OPT_INTEGER('l', "latency-level", &lat_level,
> +		 "specify the latency threshold for loads [default=30]"),
> +	OPT_INTEGER('p', "precision-level", &prec_level,
> +		 "specify the precision level of events (0,1,2,3) [default=1]"),

Please use OPT_UINTEGER as I said above.


> +	OPT_CALLBACK_NOOPT(0, "no-stores", &no_stores, "",
> +			   "do not record stores", &opt_no_stores_cb),

Why not using just plain "stores" boolean option - it'd also provide the
"--no-stores" option - and checking availability of "mem-stores" events
before starting record separately?  Negation is always confusing.. (at
least for me ;-p).

Thanks,
Namhyung


>  	OPT_INCR('v', "verbose", &verbose,
>  		 "be more verbose (show counter open errors, etc)"),
>  	OPT_STRING('i', "input", &input_name, "file",

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-04-06 13:14       ` Jiri Olsa
  2014-04-07 18:16         ` Don Zickus
@ 2014-04-08  7:37         ` Namhyung Kim
  1 sibling, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  7:37 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Don Zickus, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

Hi Jiri,

On Sun, 6 Apr 2014 15:14:22 +0200, Jiri Olsa wrote:
> On Mon, Mar 31, 2014 at 10:55:35PM -0400, Don Zickus wrote:
>> On Sat, Mar 29, 2014 at 06:11:38PM +0100, Jiri Olsa wrote:
>> > On Mon, Mar 24, 2014 at 03:36:57PM -0400, Don Zickus wrote:
>> > > Modified the code to allow latency settings to be tweaked on the command line
>> > > and also the ability to dynamically profile stores (or disable using stores).
>> > > 
>> > > This allows the tool to be used on older Intel platforms like Westmere.
>> > > 
>> > > Signed-off-by: Don Zickus <dzickus@redhat.com>
>> > > ---
>> > 
>> > SNIP
>> > 
>> > > @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
>> > >  	};
>> > >  	const struct option c2c_options[] = {
>> > >  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
>> > > +	OPT_INTEGER('l', "latency-level", &lat_level,
>> > > +		 "specify the latency threshold for loads [default=30]"),
>> > > +	OPT_INTEGER('p', "precision-level", &prec_level,
>> > > +		 "specify the precision level of events (0,1,2,3) [default=1]"),
>> > 
>> > could we get also option for user space modifier?
>> 
>> You mean the 'u' modifier, ie cpu/mem-loads/u ?  If so, then I can do that
>> but will that work with the -a option (which is hardcoded in the c2c
>> tool [system-wide mode])?
>
> right, forgot about that.. 

Why?  I think the -a option behavior is orthogonal to the user/kernel
modifier behavior, no?  Do the mem-loads/stores events have any
restriction on that?


> could the -a option be optional as well?
> probably the same way as for record would be the best:
>
>   perf c2c record ./foo       # workload specific
>   perf c2c record -a sleep 3  # system wide

Agreed.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-03-29 17:10   ` Jiri Olsa
  2014-04-01  2:52     ` Don Zickus
@ 2014-04-08  7:41     ` Namhyung Kim
  2014-04-08 14:11       ` Don Zickus
  1 sibling, 1 reply; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  7:41 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Don Zickus, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Sat, 29 Mar 2014 18:10:18 +0100, Jiri Olsa wrote:
> On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:
>
> SNIP
>
>>  
>>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
>> +					struct addr_location *al,
>>  					struct perf_sample *sample,
>> -					struct addr_location *al)
>> +					struct perf_evsel *evsel)
>>  {
>> -	if (c2c->raw_records)
>> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
>> +	struct mem_info *mi;
>> +
>> +	mi = sample__resolve_mem(sample, al);
>> +	if (!mi)
>> +		return -ENOMEM;
>
> perhaps not directly related to this patchset, but I needed
> attached patch to get resolved data in .bss (static), which
> for some reason happened to be located in executable segment

Wasn't it a read-only/const data?

Hmm.. maybe we could copy the text map and add it to a variable map
group if it has any variables in it..

Thanks,
Namhyung

>
> I wonder why we need this VARIABLE/FUNCTION separation at all,
> I think Arnaldo told me some archs could have same addresses
> for data and functions.. will check ;-)
>
> jirka
>
> ---
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index d9d86c6..be6d7cf 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1220,6 +1220,10 @@ static void ip__resolve_data(struct machine *machine, struct thread *thread,
>  
>  	thread__find_addr_location(thread, machine, m, MAP__VARIABLE, addr,
>  				   &al);
> +	if (!al.map)
> +		thread__find_addr_location(thread, machine, m, MAP__FUNCTION, addr,
> +					   &al);
> +
>  	ams->addr = addr;
>  	ams->al_addr = al.addr;
>  	ams->sym = al.sym;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15 V3] perf, c2c: Add in sort on physid
  2014-03-24 19:36 ` [PATCH 07/15 V3] perf, c2c: Add in sort on physid Don Zickus
@ 2014-04-08  7:56   ` Namhyung Kim
  2014-04-08 14:17     ` Don Zickus
  0 siblings, 1 reply; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  7:56 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Mon, 24 Mar 2014 15:36:58 -0400, Don Zickus wrote:
> Now that the infrastructure is set, add in the support to use
> hist_entry to sort on physid.
>
> V2: use new mmap2 sort
>
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  tools/perf/builtin-c2c.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index d7eaf81..b5742bd 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -7,6 +7,7 @@
>  #include "util/tool.h"
>  #include "util/debug.h"
>  #include <api/fs/fs.h>
> +#include "util/annotate.h"

Why is this #include needed?


[SNIP]
> @@ -163,7 +169,33 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
>  		return 0;
>  	}
>  
> -	return 0;
> +	cost = sample->weight;
> +	if (!cost)
> +		cost = 1;
> +
> +	/*
> +	 * must pass period=weight in order to get the correct
> +	 * sorting from hists__collapse_resort() which is solely
> +	 * based on periods. We want sorting be done on nr_events * weight
> +	 * and this is indirectly achieved by passing period=weight here
> +	 * and the he_stat__add_period() function.
> +	 */
> +	he = __hists__add_entry(&c2c->hists, al, parent, NULL, mi,
> +				cost, cost, 0);
> +	if (!he) {
> +		err = -ENOMEM;
> +		goto out_mem;
> +	}
> +
> +	c2c->hists.stats.total_period += cost;
> +	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
> +	return err;
> +
> +out_mem:
> +	/* implicitly freed by __hists__add_entry */
> +	free(mi);
> +out:

It seems this 'out' lable is not used anywhere..


> +	return err;
>  }
>  
>  /* setup the basic events for most coverage, options added later */
> @@ -266,10 +298,28 @@ out:
>  	return err;
>  }
>  
> +static int perf_c2c__init(struct perf_c2c *c2c)
> +{
> +	sort__mode = SORT_MODE__PHYSID;
> +	sort__wants_unique = 1;
> +	sort_order = "daddr,iaddr,pid,tid";

Where are the SORT_MODE__PHYSID, sort__wants_unique and "daddr", "iaddr"
sort keys defined?

Also, more importantly, I think the sort order should contain at least
"mem" and "snoop" keys to group samples based on the exact hitm
information.

In my understanding, if two samples are captured at exactly same
position with a same data access but different hitm info, they cannot be
classified and just use the hitm info of first entry.

Thanks,
Namhyung

> +
> +	if (setup_sorting() < 0) {
> +		pr_err("can not setup sorting\n");
> +		return -1;
> +	}
> +
> +	hists__init(&c2c->hists);
> +
> +	return 0;
> +}
>  static int perf_c2c__report(struct perf_c2c *c2c)
>  {
>  	setup_pager();
>  
> +	if (perf_c2c__init(c2c))
> +		return -1;
> +
>  	if (c2c->raw_records)
>  		perf_c2c__fprintf_header(stdout);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps
  2014-03-24 19:36 ` [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
@ 2014-04-08  8:05   ` Namhyung Kim
  0 siblings, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  8:05 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Mon, 24 Mar 2014 15:36:59 -0400, Don Zickus wrote:
> This patch adds a bunch of stats that will be used later in post-processing
> to determine where and with what frequency the HITMs are coming from.
>
> Most of the stats are decoded from the data source response.  Another
> piece of the stats is tracking which cpu the record came in on.
>
> Credit to Dick Fowles for determining which bits are important and how to
> properly track them.  Ported to perf by me.
> @@ -187,6 +354,14 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
>  		goto out_mem;
>  	}
>  
> +	err = c2c_decode_stats(&c2c->stats, he);
> +	if (err < 0) {
> +		err = 0;
> +		rb_erase(&he->rb_node_in, c2c->hists.entries_in);
> +		free(he);

Please use hist_entry__free() to free he->mem_info also.

Thanks,
Namhyung


> +		goto out;
> +	}
> +
>  	c2c->hists.stats.total_period += cost;
>  	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
>  	return err;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line
  2014-03-24 19:37 ` [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line Don Zickus
@ 2014-04-08  8:23   ` Namhyung Kim
  0 siblings, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  8:23 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Mon, 24 Mar 2014 15:37:00 -0400, Don Zickus wrote:
> Now that we have all the events sort on a unique address, we can walk
> the rbtree sequential and count up all the HITMs for each cacheline
> fairly easily.
>
> Once we encounter a new event on a different cacheline, process the previous
> cacheline.  That includes determining if any HITMs were present on that
> cacheline and if so, add it to another rbtree sorted on the number of HITMs.
>
> This second rbtree sorted on number of HITMs will be the interesting data
> we want to report and will be displayed in a follow up patch.
>
> For now, organize the data properly.

just a few minor nitpicks below..


> +static int c2c_hitm__add_to_list(struct rb_root *root, struct c2c_hit *h)

Why does it have 'list' in the name while it's not a list?  Maybe just
using c2c_hit__add_entry() ?


> +{
> +	struct rb_node **p;
> +	struct rb_node *parent = NULL;
> +	struct c2c_hit *he;
> +	int64_t cmp;
> +	u64 l_hitms, r_hitms;
> +
> +	p = &root->rb_node;
> +
> +	while (*p != NULL) {
> +		parent = *p;
> +		he = rb_entry(parent, struct c2c_hit, rb_node);
> +
> +		/* sort on remote hitms first */
> +		l_hitms = he->stats.t.rmt_hitm;
> +		r_hitms = h->stats.t.rmt_hitm;
> +		cmp = r_hitms - l_hitms;
> +
> +		if (!cmp) {
> +			/* sort on local hitms */
> +			l_hitms = he->stats.t.lcl_hitm;
> +			r_hitms = h->stats.t.lcl_hitm;
> +			cmp = r_hitms - l_hitms;
> +		}
> +
> +		if (cmp > 0)
> +			p = &(*p)->rb_left;
> +		else
> +			p = &(*p)->rb_right;
> +	}
> +
> +	rb_link_node(&h->rb_node, parent, p);
> +	rb_insert_color(&h->rb_node, root);
> +
> +	return 0;
> +}


[SNIP]
> +static void c2c_hit__update_strings(struct c2c_hit *h,
> +				    struct hist_entry *n)

This one also has nothing with any string IMHO.


> +{
> +	if (h->pid != n->thread->pid_)
> +		h->pid = -1;
> +
> +	if (h->tid != n->thread->tid)
> +		h->tid = -1;
> +
> +	/* use original addresses here, not adjusted al_addr */
> +	if (h->iaddr != n->mem_info->iaddr.addr)
> +		h->iaddr = -1;
> +
> +	if (CLADRS(h->daddr) != CLADRS(n->mem_info->daddr.addr))
> +		h->daddr = -1;
> +
> +	CPU_SET(n->cpu, &h->stats.cpuset);
> +}


[SNIP]
> +static void c2c_analyze_hitms(struct perf_c2c *c2c)
> +{
> +
> +	struct rb_node *next = rb_first(c2c->hists.entries_in);
> +	struct hist_entry *he;
> +	struct c2c_hit *h = NULL;
> +	struct c2c_stats hitm_stats;
> +	struct rb_root hitm_tree = RB_ROOT;
> +	int shared_clines = 0;

It seems the shared_clines is set but not used in this patch.  Maybe
it'd better moving it to a patch which use it actually.

Thanks,
Namhyung


> +	u64 cl = 0;
> +
> +	memset(&hitm_stats, 0, sizeof(struct c2c_stats));
> +
> +	/* find HITMs */
> +	while (next) {
> +		he = rb_entry(next, struct hist_entry, rb_node_in);
> +		next = rb_next(&he->rb_node_in);
> +
> +		cl = he->mem_info->daddr.al_addr;
> +
> +		/* switch cache line objects */
> +		/* 'color' forces a boundary change based on the original sort */
> +		if (!h || !he->color || (CLADRS(cl) != h->cacheline)) {
> +			if (h && HAS_HITMS(h)) {
> +				c2c_hit__update_stats(&hitm_stats, &h->stats);
> +
> +				/* sort based on hottest cacheline */
> +				c2c_hitm__add_to_list(&hitm_tree, h);
> +				shared_clines++;
> +			} else {
> +				/* stores-only are un-interesting */
> +				free(h);
> +			}
> +			h = c2c_hit__new(CLADRS(cl), he);
> +			if (!h)
> +				goto cleanup;
> +		}
> +
> +
> +		c2c_decode_stats(&h->stats, he);
> +
> +		/* filter out non-hitms as un-interesting noise */
> +		if (valid_hitm_or_store(&he->mem_info->data_src)) {
> +			/* save the entry for later processing */
> +			list_add_tail(&he->pairs.node, &h->list);
> +
> +			c2c_hit__update_strings(h, he);
> +		}
> +	}
> +
> +	/* last chunk */
> +	if (h && HAS_HITMS(h)) {
> +		c2c_hit__update_stats(&hitm_stats, &h->stats);
> +		c2c_hitm__add_to_list(&hitm_tree, h);
> +		shared_clines++;
> +	} else
> +		free(h);
> +
> +cleanup:
> +	next = rb_first(&hitm_tree);
> +	while (next) {
> +		h = rb_entry(next, struct c2c_hit, rb_node);
> +		next = rb_next(&h->rb_node);
> +		rb_erase(&h->rb_node, &hitm_tree);
> +
> +		free(h);
> +	}
> +	return;
> +}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout
  2014-03-24 19:37 ` [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
@ 2014-04-08  8:26   ` Namhyung Kim
  2014-04-08 23:46   ` Namhyung Kim
  1 sibling, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08  8:26 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

> +	if (coalesce_level > MAX_LVL)
> +		printf("DON: bad coalesce level %d\n", coalesce_level);

Please use pr_warning() with the "DON" prefix removed :)

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-04-08  7:41     ` Namhyung Kim
@ 2014-04-08 14:11       ` Don Zickus
  2014-04-09  1:12         ` Namhyung Kim
  0 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-04-08 14:11 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Tue, Apr 08, 2014 at 04:41:29PM +0900, Namhyung Kim wrote:
> On Sat, 29 Mar 2014 18:10:18 +0100, Jiri Olsa wrote:
> > On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:
> >
> > SNIP
> >
> >>  
> >>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> >> +					struct addr_location *al,
> >>  					struct perf_sample *sample,
> >> -					struct addr_location *al)
> >> +					struct perf_evsel *evsel)
> >>  {
> >> -	if (c2c->raw_records)
> >> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> >> +	struct mem_info *mi;
> >> +
> >> +	mi = sample__resolve_mem(sample, al);
> >> +	if (!mi)
> >> +		return -ENOMEM;
> >
> > perhaps not directly related to this patchset, but I needed
> > attached patch to get resolved data in .bss (static), which
> > for some reason happened to be located in executable segment
> 
> Wasn't it a read-only/const data?

I believe it had the 'x' bit set.  Or the kernel was not passing any
protection bits, so it defaulted to MAP_FUNCTION?

Cheers,
Don

> 
> Hmm.. maybe we could copy the text map and add it to a variable map
> group if it has any variables in it..
> 
> Thanks,
> Namhyung
> 
> >
> > I wonder why we need this VARIABLE/FUNCTION separation at all,
> > I think Arnaldo told me some archs could have same addresses
> > for data and functions.. will check ;-)
> >
> > jirka
> >
> > ---
> > diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> > index d9d86c6..be6d7cf 100644
> > --- a/tools/perf/util/machine.c
> > +++ b/tools/perf/util/machine.c
> > @@ -1220,6 +1220,10 @@ static void ip__resolve_data(struct machine *machine, struct thread *thread,
> >  
> >  	thread__find_addr_location(thread, machine, m, MAP__VARIABLE, addr,
> >  				   &al);
> > +	if (!al.map)
> > +		thread__find_addr_location(thread, machine, m, MAP__FUNCTION, addr,
> > +					   &al);
> > +
> >  	ams->addr = addr;
> >  	ams->al_addr = al.addr;
> >  	ams->sym = al.sym;

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15 V3] perf, c2c: Add in sort on physid
  2014-04-08  7:56   ` Namhyung Kim
@ 2014-04-08 14:17     ` Don Zickus
  2014-04-09  1:30       ` Namhyung Kim
  0 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-04-08 14:17 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Tue, Apr 08, 2014 at 04:56:25PM +0900, Namhyung Kim wrote:
> On Mon, 24 Mar 2014 15:36:58 -0400, Don Zickus wrote:
> > Now that the infrastructure is set, add in the support to use
> > hist_entry to sort on physid.
> >
> > V2: use new mmap2 sort
> >
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> >  tools/perf/builtin-c2c.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 51 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> > index d7eaf81..b5742bd 100644
> > --- a/tools/perf/builtin-c2c.c
> > +++ b/tools/perf/builtin-c2c.c
> > @@ -7,6 +7,7 @@
> >  #include "util/tool.h"
> >  #include "util/debug.h"
> >  #include <api/fs/fs.h>
> > +#include "util/annotate.h"
> 
> Why is this #include needed?
> 
> 
> [SNIP]
> > @@ -163,7 +169,33 @@ static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> >  		return 0;
> >  	}
> >  
> > -	return 0;
> > +	cost = sample->weight;
> > +	if (!cost)
> > +		cost = 1;
> > +
> > +	/*
> > +	 * must pass period=weight in order to get the correct
> > +	 * sorting from hists__collapse_resort() which is solely
> > +	 * based on periods. We want sorting be done on nr_events * weight
> > +	 * and this is indirectly achieved by passing period=weight here
> > +	 * and the he_stat__add_period() function.
> > +	 */
> > +	he = __hists__add_entry(&c2c->hists, al, parent, NULL, mi,
> > +				cost, cost, 0);
> > +	if (!he) {
> > +		err = -ENOMEM;
> > +		goto out_mem;
> > +	}
> > +
> > +	c2c->hists.stats.total_period += cost;
> > +	hists__inc_nr_events(&c2c->hists, PERF_RECORD_SAMPLE);
> > +	return err;
> > +
> > +out_mem:
> > +	/* implicitly freed by __hists__add_entry */
> > +	free(mi);
> > +out:
> 
> It seems this 'out' lable is not used anywhere..
> 
> 
> > +	return err;
> >  }
> >  
> >  /* setup the basic events for most coverage, options added later */
> > @@ -266,10 +298,28 @@ out:
> >  	return err;
> >  }
> >  
> > +static int perf_c2c__init(struct perf_c2c *c2c)
> > +{
> > +	sort__mode = SORT_MODE__PHYSID;
> > +	sort__wants_unique = 1;
> > +	sort_order = "daddr,iaddr,pid,tid";
> 
> Where are the SORT_MODE__PHYSID, sort__wants_unique and "daddr", "iaddr"
> sort keys defined?

In a previous patchset that enables the mmap2 interface.

> 
> Also, more importantly, I think the sort order should contain at least
> "mem" and "snoop" keys to group samples based on the exact hitm
> information.

I can look into it, but after iaddr, pid and tid, sorting on snoop doesn't
really change anything if I recall.  The hitms are scattered across iaddr.

> 
> In my understanding, if two samples are captured at exactly same
> position with a same data access but different hitm info, they cannot be
> classified and just use the hitm info of first entry.

Why?  If the first hitm access was local and the second one remote,
doesn't that indicate the accessed data is being pulled onto different
nodes?

Cheers,
Don

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/15 V3] perf c2c: Shared data analyser
  2014-04-08  6:59   ` Namhyung Kim
@ 2014-04-08 14:22     ` Don Zickus
  2014-04-09  0:58       ` Namhyung Kim
  2014-04-08 14:23     ` Don Zickus
  1 sibling, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-04-08 14:22 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Richard Fowles

On Tue, Apr 08, 2014 at 03:59:15PM +0900, Namhyung Kim wrote:
> Hi Don,
> 
> On Mon, 24 Mar 2014 15:36:54 -0400, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> >
> > This is the start of a new perf tool that will collect information about
> > memory accesses and analyse it to find things like hot cachelines, etc.
> 
> So why not integrating this into existing 'perf mem' command if it's all
> about analyzing memory accesses?

Our expectations were different.  We expeted to do system-wide analysis
with loads and stores.  With 'perf mem' you didn't have the ability to
anlayze both load and stores at the same time.

In all my private conversations with Stephane, Arnalado and Jiri, it was
never brought up.  We had just assumed that is made more sense to keep it
separate.

> 
> >
> > This is basically trying to get a prototype written by Richard Fowles
> > written using the tools/perf coding style and libraries.
> >
> > Start it from 'perf sched', this patch starts the process by adding the
> > 'record' subcommand to collect the needed mem loads and stores samples.
> >
> > It also have the basic 'report' skeleton, resolving the sample address
> > and hooking the events found in a perf.data file with methods to handle
> > them, right now just printing the resolved perf_sample data structure
> > after each event name.
> >
> > [dcz: refreshed to latest upstream changes]
> 
> [SNIP]
> > +perf-c2c(1)
> > +===========
> > +
> > +NAME
> > +----
> > +perf-c2c - Shared Data C2C/HITM Analyzer.
> > +
> > +SYNOPSIS
> > +--------
> > +[verse]
> > +'perf c2c' record
> > +
> > +DESCRIPTION
> > +-----------
> > +These are the variants of perf c2c:
> > +
> > +  'perf c2c record <command>' to record the memory accesses of an arbitrary
> > +  workload.
> > +
> > +SEE ALSO
> > +--------
> > +linkperf:perf-record[1], linkperf:perf-mem[1]
> 
> This document is very terse and only memtions the 'record' subcommand -
> also it's not updated throughout the series.  So I'd like to suggest
> adding a separate documentation patch with full/verbose descriptions at
> the end of this series.
> 
> 
> [SNIP]
> > +static int perf_c2c__read_events(struct perf_c2c *c2c)
> > +{
> > +	int err = -1;
> > +	struct perf_session *session;
> > +	struct perf_data_file file = {
> > +			.path = input_name,
> > +			.mode = PERF_DATA_MODE_READ,
> > +	};
> > +	struct perf_evsel *evsel;
> > +
> > +	session = perf_session__new(&file, 0, &c2c->tool);
> > +	if (session == NULL) {
> > +		pr_debug("No memory for session\n");
> > +		goto out;
> > +	}
> > +
> > +	/* setup the evsel handlers for each event type */
> > +	evlist__for_each(session->evlist, evsel) {
> > +		const char *name = perf_evsel__name(evsel);
> > +		unsigned int i;
> > +
> > +		for (i = 0; i < ARRAY_SIZE(handlers); i++) {
> > +			if (!strcmp(name, handlers[i].name))
> > +				evsel->handler = handlers[i].handler;
> > +		}
> > +	}
> > +
> > +	err = perf_session__process_events(session, &c2c->tool);
> > +	if (err)
> > +		pr_err("Failed to process events, error %d", err);
> 
> You may want to add perf_session__delete() here.
> 
> > +
> > +out:
> > +	return err;
> > +}
> 
> 
> [SNIP]
> > +int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
> > +{
> > +	struct perf_c2c c2c = {
> > +		.tool = {
> > +			.sample		 = perf_c2c__process_sample,
> > +			.comm		 = perf_event__process_comm,
> > +			.exit		 = perf_event__process_exit,
> > +			.fork		 = perf_event__process_fork,
> > +			.lost		 = perf_event__process_lost,
> 
> It seems that it also needs to handle mmap[2] events otherwise it cannot
> find symbols from an address.
> 
> Thanks,
> Namhyung
> 
> 
> > +			.ordered_samples = true,
> > +		},
> > +	};

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/15 V3] perf c2c: Shared data analyser
  2014-04-08  6:59   ` Namhyung Kim
  2014-04-08 14:22     ` Don Zickus
@ 2014-04-08 14:23     ` Don Zickus
  1 sibling, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-04-08 14:23 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Richard Fowles

On Tue, Apr 08, 2014 at 03:59:15PM +0900, Namhyung Kim wrote:
> Hi Don,

Oh by the way, thank you for your review.  I will clean up a bunch of
stuff based on your suggestions.

Cheers,
Don

> 
> On Mon, 24 Mar 2014 15:36:54 -0400, Don Zickus wrote:
> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
> >
> > This is the start of a new perf tool that will collect information about
> > memory accesses and analyse it to find things like hot cachelines, etc.
> 
> So why not integrating this into existing 'perf mem' command if it's all
> about analyzing memory accesses?
> 
> >
> > This is basically trying to get a prototype written by Richard Fowles
> > written using the tools/perf coding style and libraries.
> >
> > Start it from 'perf sched', this patch starts the process by adding the
> > 'record' subcommand to collect the needed mem loads and stores samples.
> >
> > It also have the basic 'report' skeleton, resolving the sample address
> > and hooking the events found in a perf.data file with methods to handle
> > them, right now just printing the resolved perf_sample data structure
> > after each event name.
> >
> > [dcz: refreshed to latest upstream changes]
> 
> [SNIP]
> > +perf-c2c(1)
> > +===========
> > +
> > +NAME
> > +----
> > +perf-c2c - Shared Data C2C/HITM Analyzer.
> > +
> > +SYNOPSIS
> > +--------
> > +[verse]
> > +'perf c2c' record
> > +
> > +DESCRIPTION
> > +-----------
> > +These are the variants of perf c2c:
> > +
> > +  'perf c2c record <command>' to record the memory accesses of an arbitrary
> > +  workload.
> > +
> > +SEE ALSO
> > +--------
> > +linkperf:perf-record[1], linkperf:perf-mem[1]
> 
> This document is very terse and only memtions the 'record' subcommand -
> also it's not updated throughout the series.  So I'd like to suggest
> adding a separate documentation patch with full/verbose descriptions at
> the end of this series.
> 
> 
> [SNIP]
> > +static int perf_c2c__read_events(struct perf_c2c *c2c)
> > +{
> > +	int err = -1;
> > +	struct perf_session *session;
> > +	struct perf_data_file file = {
> > +			.path = input_name,
> > +			.mode = PERF_DATA_MODE_READ,
> > +	};
> > +	struct perf_evsel *evsel;
> > +
> > +	session = perf_session__new(&file, 0, &c2c->tool);
> > +	if (session == NULL) {
> > +		pr_debug("No memory for session\n");
> > +		goto out;
> > +	}
> > +
> > +	/* setup the evsel handlers for each event type */
> > +	evlist__for_each(session->evlist, evsel) {
> > +		const char *name = perf_evsel__name(evsel);
> > +		unsigned int i;
> > +
> > +		for (i = 0; i < ARRAY_SIZE(handlers); i++) {
> > +			if (!strcmp(name, handlers[i].name))
> > +				evsel->handler = handlers[i].handler;
> > +		}
> > +	}
> > +
> > +	err = perf_session__process_events(session, &c2c->tool);
> > +	if (err)
> > +		pr_err("Failed to process events, error %d", err);
> 
> You may want to add perf_session__delete() here.
> 
> > +
> > +out:
> > +	return err;
> > +}
> 
> 
> [SNIP]
> > +int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
> > +{
> > +	struct perf_c2c c2c = {
> > +		.tool = {
> > +			.sample		 = perf_c2c__process_sample,
> > +			.comm		 = perf_event__process_comm,
> > +			.exit		 = perf_event__process_exit,
> > +			.fork		 = perf_event__process_fork,
> > +			.lost		 = perf_event__process_lost,
> 
> It seems that it also needs to handle mmap[2] events otherwise it cannot
> find symbols from an address.
> 
> Thanks,
> Namhyung
> 
> 
> > +			.ordered_samples = true,
> > +		},
> > +	};

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout
  2014-03-24 19:37 ` [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
  2014-04-08  8:26   ` Namhyung Kim
@ 2014-04-08 23:46   ` Namhyung Kim
  1 sibling, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-08 23:46 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

Hi Don,

On Mon, 24 Mar 2014 15:37:01 -0400, Don Zickus wrote:
> +	sprintf(summary, "%4d %5.1f%% %5.1f%% %5.1f%% %5.1f%% %8d %8d %8d %8d %18s %6s\n",
> +			record,
> +			tot_dist * 100.,
> +			tot_cumm * 100.,
> +			ld_dist * 100.,
> +			ld_cumm * 100.,

I think adding a 0 after decimal point helps readability little bit.


> +			h->stats.t.rmt_hitm,
> +			h->stats.t.lcl_hitm,
> +			h->stats.t.st_l1hit,
> +			h->stats.t.st_l1miss,
> +			addrstr,
> +			pidstr);
> +
> +	for (j = 0; j < (int)strlen(summary); j++) printf("-");
> +	printf("\n");
> +	printf("%s", summary);
> +	for (j = 0; j < (int)strlen(summary); j++) printf("-");
> +	printf("\n");
> +}
> +
> +static void print_socket_stats_str(struct c2c_hit *clo,
> +				   struct c2c_stats *node_stats)
> +{
> +	int i, j;
> +
> +	if (!node_stats)
> +		return;
> +
> +	for (i = 0; i < max_node_num; i++) {
> +		struct c2c_stats *stats = &node_stats[i];
> +		int num = CPU_COUNT(&stats->cpuset);
> +
> +		if (!num) {
> +			/* pad align socket info */
> +			for (j = 0; j < 21; j++)
> +				printf(" ");
> +			continue;
> +		}
> +
> +		printf("%2d{%2d ", i, num);
> +
> +		if (clo->stats.t.rmt_hitm > 0)
> +			printf("%5.1f%% ", 100. * ((double)stats->t.rmt_hitm / (double) clo->stats.t.rmt_hitm));

Just for your information, most perf tools show 2 digit after decimal
point so it'd be "%5.2f%%" but it's not a strict rule.

Anyway I think multiplying 100.0 first can remove the casts.  So code
below will do the same thing..

		100.0 * stats->t.rmt_hitm / clo->stats.t.rmt_hitm

The same goes to other percentage calculations too.

Thanks,
Namhyung


> +		else
> +			printf("%6s ", "n/a");
> +
> +		if (clo->stats.t.store > 0)
> +			printf("%5.1f%%} ", 100. * ((double)stats->t.store / (double)clo->stats.t.store));
> +		else
> +			printf("%6s} ", "n/a");
> +	}
> +}

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores
  2014-04-07 18:16         ` Don Zickus
@ 2014-04-09  0:17           ` Namhyung Kim
  0 siblings, 0 replies; 45+ messages in thread
From: Namhyung Kim @ 2014-04-09  0:17 UTC (permalink / raw)
  To: Don Zickus
  Cc: Jiri Olsa, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Mon, 7 Apr 2014 14:16:18 -0400, Don Zickus wrote:
> On Sun, Apr 06, 2014 at 03:14:22PM +0200, Jiri Olsa wrote:
>> On Mon, Mar 31, 2014 at 10:55:35PM -0400, Don Zickus wrote:
>> > On Sat, Mar 29, 2014 at 06:11:38PM +0100, Jiri Olsa wrote:
>> > > > @@ -316,6 +369,12 @@ int cmd_c2c(int argc, const char **argv, const char *prefix __maybe_unused)
>> > > >  	};
>> > > >  	const struct option c2c_options[] = {
>> > > >  	OPT_BOOLEAN('r', "raw_records", &c2c.raw_records, "dump raw events"),
>> > > > +	OPT_INTEGER('l', "latency-level", &lat_level,
>> > > > +		 "specify the latency threshold for loads [default=30]"),
>> > > > +	OPT_INTEGER('p', "precision-level", &prec_level,
>> > > > +		 "specify the precision level of events (0,1,2,3) [default=1]"),
>> > > 
>> > > could we get also option for user space modifier?
>> > 
>> > You mean the 'u' modifier, ie cpu/mem-loads/u ?  If so, then I can do that
>> > but will that work with the -a option (which is hardcoded in the c2c
>> > tool [system-wide mode])?
>> 
>> right, forgot about that.. could the -a option be optional as well?
>> probably the same way as for record would be the best:
>> 
>>   perf c2c record ./foo       # workload specific
>>   perf c2c record -a sleep 3  # system wide
>
> I understand what you are saying, but our tool was written to find cache
> contention across the system, so -a is usually implied.  Most of our
> profiling is done system-wide.

Hmm.. any chance it can be used for non-system-wide analysis?  I think
tool should provide a way to do it if it's a valid usecase.  And adding
-a option when recording doesn't look too hard. :)


> I would rather implement the opposite option --no-system-wide, if that
> is ok.

Please just add -a/--system-wide (maybe use can make it default if no
argument/workload is given like perf top does) then tool will provide
the --no-* option automatically.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/15 V3] perf c2c: Shared data analyser
  2014-04-08 14:22     ` Don Zickus
@ 2014-04-09  0:58       ` Namhyung Kim
  2014-04-09  1:29         ` Andi Kleen
  0 siblings, 1 reply; 45+ messages in thread
From: Namhyung Kim @ 2014-04-09  0:58 UTC (permalink / raw)
  To: Don Zickus
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen,
	Arnaldo Carvalho de Melo, David Ahern, Frederic Weisbecker,
	Mike Galbraith, Paul Mackerras, Richard Fowles

On Tue, 8 Apr 2014 10:22:26 -0400, Don Zickus wrote:
> On Tue, Apr 08, 2014 at 03:59:15PM +0900, Namhyung Kim wrote:
>> Hi Don,
>> 
>> On Mon, 24 Mar 2014 15:36:54 -0400, Don Zickus wrote:
>> > From: Arnaldo Carvalho de Melo <acme@redhat.com>
>> >
>> > This is the start of a new perf tool that will collect information about
>> > memory accesses and analyse it to find things like hot cachelines, etc.
>> 
>> So why not integrating this into existing 'perf mem' command if it's all
>> about analyzing memory accesses?
>
> Our expectations were different.  We expeted to do system-wide analysis
> with loads and stores.  With 'perf mem' you didn't have the ability to
> anlayze both load and stores at the same time.

But it's very simple to change perf mem to work with the both IMHO.

>
> In all my private conversations with Stephane, Arnalado and Jiri, it was
> never brought up.  We had just assumed that is made more sense to keep it
> separate.

Well, I'm not sure ;-)  Yes, the c2c is a complex tool which might
deserve an own command, but the functionality is very similar and I
guess there's something to share between them.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-04-08 14:11       ` Don Zickus
@ 2014-04-09  1:12         ` Namhyung Kim
  2014-04-09  1:36           ` Don Zickus
  0 siblings, 1 reply; 45+ messages in thread
From: Namhyung Kim @ 2014-04-09  1:12 UTC (permalink / raw)
  To: Don Zickus
  Cc: Jiri Olsa, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Tue, 8 Apr 2014 10:11:07 -0400, Don Zickus wrote:
> On Tue, Apr 08, 2014 at 04:41:29PM +0900, Namhyung Kim wrote:
>> On Sat, 29 Mar 2014 18:10:18 +0100, Jiri Olsa wrote:
>> > On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:
>> >
>> > SNIP
>> >
>> >>  
>> >>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
>> >> +					struct addr_location *al,
>> >>  					struct perf_sample *sample,
>> >> -					struct addr_location *al)
>> >> +					struct perf_evsel *evsel)
>> >>  {
>> >> -	if (c2c->raw_records)
>> >> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
>> >> +	struct mem_info *mi;
>> >> +
>> >> +	mi = sample__resolve_mem(sample, al);
>> >> +	if (!mi)
>> >> +		return -ENOMEM;
>> >
>> > perhaps not directly related to this patchset, but I needed
>> > attached patch to get resolved data in .bss (static), which
>> > for some reason happened to be located in executable segment
>> 
>> Wasn't it a read-only/const data?
>
> I believe it had the 'x' bit set.  Or the kernel was not passing any
> protection bits, so it defaulted to MAP_FUNCTION?

The perf treats a mapping as a data mapping (MAP_VARIABLE) by default if
the 'x' bit is not set.  As Jiri said its a static data, I guessed it's
a const data (set to 0?) and moved into .rodata section and then to the
text segment.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/15 V3] perf c2c: Shared data analyser
  2014-04-09  0:58       ` Namhyung Kim
@ 2014-04-09  1:29         ` Andi Kleen
  0 siblings, 0 replies; 45+ messages in thread
From: Andi Kleen @ 2014-04-09  1:29 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Don Zickus, acme, LKML, jolsa, jmario, fowles, peterz, eranian,
	andi.kleen, Arnaldo Carvalho de Melo, David Ahern,
	Frederic Weisbecker, Mike Galbraith, Paul Mackerras,
	Richard Fowles

Namhyung Kim <namhyung@kernel.org> writes:
>
> Well, I'm not sure ;-)  Yes, the c2c is a complex tool which might
> deserve an own command, but the functionality is very similar and I
> guess there's something to share between them.

They work very differently.  I don't see a lot of potential
for sharing.

perf mem is basically just a way to annotate normal samples slightly
with addresses, while c2c is fundamentally address driven.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15 V3] perf, c2c: Add in sort on physid
  2014-04-08 14:17     ` Don Zickus
@ 2014-04-09  1:30       ` Namhyung Kim
  2014-04-09  1:56         ` Don Zickus
  0 siblings, 1 reply; 45+ messages in thread
From: Namhyung Kim @ 2014-04-09  1:30 UTC (permalink / raw)
  To: Don Zickus; +Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Tue, 8 Apr 2014 10:17:58 -0400, Don Zickus wrote:
> On Tue, Apr 08, 2014 at 04:56:25PM +0900, Namhyung Kim wrote:
>> On Mon, 24 Mar 2014 15:36:58 -0400, Don Zickus wrote:
>> > +static int perf_c2c__init(struct perf_c2c *c2c)
>> > +{
>> > +	sort__mode = SORT_MODE__PHYSID;
>> > +	sort__wants_unique = 1;
>> > +	sort_order = "daddr,iaddr,pid,tid";
>> 
>> Where are the SORT_MODE__PHYSID, sort__wants_unique and "daddr", "iaddr"
>> sort keys defined?
>
> In a previous patchset that enables the mmap2 interface.

Ah, missed that.. will look at it soon.

>
>> 
>> Also, more importantly, I think the sort order should contain at least
>> "mem" and "snoop" keys to group samples based on the exact hitm
>> information.
>
> I can look into it, but after iaddr, pid and tid, sorting on snoop doesn't
> really change anything if I recall.  The hitms are scattered across iaddr.

But it doesn't guarantee that all hitms are scattered, right?  Also if
it's the case I guess adding more sort keys are not harmful since they
don't even have a chance to test.

I think you can check hist_entry->stat.nr_events always being 1.

>
>> 
>> In my understanding, if two samples are captured at exactly same
>> position with a same data access but different hitm info, they cannot be
>> classified and just use the hitm info of first entry.
>
> Why?  If the first hitm access was local and the second one remote,
> doesn't that indicate the accessed data is being pulled onto different
> nodes?

But "hist_entry" won't have the information after calling
__hists__add__entry() called unless it has 'mem' and 'snoop' sort keys.
Since two samples have same daddr, iaddr, pid and tid, it'd think those
two samples are same and then add stats of second one to the first and
finally discard the second.  So first one will have a double weight for
the local hitm case only.

This is the case what I worry about.  Am I missing something?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-04-09  1:12         ` Namhyung Kim
@ 2014-04-09  1:36           ` Don Zickus
  2014-04-11 14:57             ` Jiri Olsa
  0 siblings, 1 reply; 45+ messages in thread
From: Don Zickus @ 2014-04-09  1:36 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Wed, Apr 09, 2014 at 10:12:32AM +0900, Namhyung Kim wrote:
> On Tue, 8 Apr 2014 10:11:07 -0400, Don Zickus wrote:
> > On Tue, Apr 08, 2014 at 04:41:29PM +0900, Namhyung Kim wrote:
> >> On Sat, 29 Mar 2014 18:10:18 +0100, Jiri Olsa wrote:
> >> > On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:
> >> >
> >> > SNIP
> >> >
> >> >>  
> >> >>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> >> >> +					struct addr_location *al,
> >> >>  					struct perf_sample *sample,
> >> >> -					struct addr_location *al)
> >> >> +					struct perf_evsel *evsel)
> >> >>  {
> >> >> -	if (c2c->raw_records)
> >> >> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> >> >> +	struct mem_info *mi;
> >> >> +
> >> >> +	mi = sample__resolve_mem(sample, al);
> >> >> +	if (!mi)
> >> >> +		return -ENOMEM;
> >> >
> >> > perhaps not directly related to this patchset, but I needed
> >> > attached patch to get resolved data in .bss (static), which
> >> > for some reason happened to be located in executable segment
> >> 
> >> Wasn't it a read-only/const data?
> >
> > I believe it had the 'x' bit set.  Or the kernel was not passing any
> > protection bits, so it defaulted to MAP_FUNCTION?
> 
> The perf treats a mapping as a data mapping (MAP_VARIABLE) by default if
> the 'x' bit is not set.  As Jiri said its a static data, I guessed it's
> a const data (set to 0?) and moved into .rodata section and then to the
> text segment.

Unfortunately, his patch will be needed eventually, if not for his reason,
I had a java JAR file example that was pulling data addresses out of a
shared memory region with the protection bits set to 'rwx' (in the
/proc/<pid>/maps area).  I was losing lots of samples until I came up with
a more complicated hack.

I will try to dig up my example, so you can see, so it doesn't sound like
I am making this up. :-)

Cheers,
Don

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/15 V3] perf, c2c: Add in sort on physid
  2014-04-09  1:30       ` Namhyung Kim
@ 2014-04-09  1:56         ` Don Zickus
  0 siblings, 0 replies; 45+ messages in thread
From: Don Zickus @ 2014-04-09  1:56 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: acme, LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen

On Wed, Apr 09, 2014 at 10:30:56AM +0900, Namhyung Kim wrote:
> On Tue, 8 Apr 2014 10:17:58 -0400, Don Zickus wrote:
> > On Tue, Apr 08, 2014 at 04:56:25PM +0900, Namhyung Kim wrote:
> >> On Mon, 24 Mar 2014 15:36:58 -0400, Don Zickus wrote:
> >> > +static int perf_c2c__init(struct perf_c2c *c2c)
> >> > +{
> >> > +	sort__mode = SORT_MODE__PHYSID;
> >> > +	sort__wants_unique = 1;
> >> > +	sort_order = "daddr,iaddr,pid,tid";
> >> 
> >> Where are the SORT_MODE__PHYSID, sort__wants_unique and "daddr", "iaddr"
> >> sort keys defined?
> >
> > In a previous patchset that enables the mmap2 interface.
> 
> Ah, missed that.. will look at it soon.
> 
> >
> >> 
> >> Also, more importantly, I think the sort order should contain at least
> >> "mem" and "snoop" keys to group samples based on the exact hitm
> >> information.
> >
> > I can look into it, but after iaddr, pid and tid, sorting on snoop doesn't
> > really change anything if I recall.  The hitms are scattered across iaddr.
> 
> But it doesn't guarantee that all hitms are scattered, right?  Also if
> it's the case I guess adding more sort keys are not harmful since they
> don't even have a chance to test.
> 
> I think you can check hist_entry->stat.nr_events always being 1.
> 
> >
> >> 
> >> In my understanding, if two samples are captured at exactly same
> >> position with a same data access but different hitm info, they cannot be
> >> classified and just use the hitm info of first entry.
> >
> > Why?  If the first hitm access was local and the second one remote,
> > doesn't that indicate the accessed data is being pulled onto different
> > nodes?
> 
> But "hist_entry" won't have the information after calling
> __hists__add__entry() called unless it has 'mem' and 'snoop' sort keys.
> Since two samples have same daddr, iaddr, pid and tid, it'd think those
> two samples are same and then add stats of second one to the first and
> finally discard the second.  So first one will have a double weight for
> the local hitm case only.
> 
> This is the case what I worry about.  Am I missing something?

My patch 6/6 of the enable mmap2 support. :-)  It specifically forces all
the data to remain separate to avoid this issue.  We couldn't have the
data merged because it messed up our stats.

Cheers,
Don

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features
  2014-04-09  1:36           ` Don Zickus
@ 2014-04-11 14:57             ` Jiri Olsa
  0 siblings, 0 replies; 45+ messages in thread
From: Jiri Olsa @ 2014-04-11 14:57 UTC (permalink / raw)
  To: Don Zickus
  Cc: Namhyung Kim, acme, LKML, jmario, fowles, peterz, eranian, andi.kleen

On Tue, Apr 08, 2014 at 09:36:58PM -0400, Don Zickus wrote:
> On Wed, Apr 09, 2014 at 10:12:32AM +0900, Namhyung Kim wrote:
> > On Tue, 8 Apr 2014 10:11:07 -0400, Don Zickus wrote:
> > > On Tue, Apr 08, 2014 at 04:41:29PM +0900, Namhyung Kim wrote:
> > >> On Sat, 29 Mar 2014 18:10:18 +0100, Jiri Olsa wrote:
> > >> > On Mon, Mar 24, 2014 at 03:36:56PM -0400, Don Zickus wrote:
> > >> >
> > >> > SNIP
> > >> >
> > >> >>  
> > >> >>  static int perf_c2c__process_load_store(struct perf_c2c *c2c,
> > >> >> +					struct addr_location *al,
> > >> >>  					struct perf_sample *sample,
> > >> >> -					struct addr_location *al)
> > >> >> +					struct perf_evsel *evsel)
> > >> >>  {
> > >> >> -	if (c2c->raw_records)
> > >> >> -		perf_sample__fprintf(sample, ' ', "raw input", al, stdout);
> > >> >> +	struct mem_info *mi;
> > >> >> +
> > >> >> +	mi = sample__resolve_mem(sample, al);
> > >> >> +	if (!mi)
> > >> >> +		return -ENOMEM;
> > >> >
> > >> > perhaps not directly related to this patchset, but I needed
> > >> > attached patch to get resolved data in .bss (static), which
> > >> > for some reason happened to be located in executable segment
> > >> 
> > >> Wasn't it a read-only/const data?
> > >
> > > I believe it had the 'x' bit set.  Or the kernel was not passing any
> > > protection bits, so it defaulted to MAP_FUNCTION?
> > 
> > The perf treats a mapping as a data mapping (MAP_VARIABLE) by default if
> > the 'x' bit is not set.  As Jiri said its a static data, I guessed it's
> > a const data (set to 0?) and moved into .rodata section and then to the
> > text segment.

it was not initialized static data, .. not sure why it ended up
in text segment, because I dont see it anymore :-)

> 
> Unfortunately, his patch will be needed eventually, if not for his reason,
> I had a java JAR file example that was pulling data addresses out of a
> shared memory region with the protection bits set to 'rwx' (in the
> /proc/<pid>/maps area).  I was losing lots of samples until I came up with
> a more complicated hack.

also, I think nothing prevents user from mapping memory as 'rwx'
and load/store data there.. we should be able to handle this case

jirka

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2014-04-11 14:58 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
2014-03-24 19:36 ` [PATCH 01/15 V3] perf: Fix stddev calculation Don Zickus
2014-03-24 19:36 ` [PATCH 02/15 V3] perf, callchain: Add generic callchain print handler for stdio Don Zickus
2014-03-24 19:36 ` [PATCH 03/15 V3] perf c2c: Shared data analyser Don Zickus
2014-04-08  6:59   ` Namhyung Kim
2014-04-08 14:22     ` Don Zickus
2014-04-09  0:58       ` Namhyung Kim
2014-04-09  1:29         ` Andi Kleen
2014-04-08 14:23     ` Don Zickus
2014-03-24 19:36 ` [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits Don Zickus
2014-04-08  7:09   ` Namhyung Kim
2014-03-24 19:36 ` [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features Don Zickus
2014-03-29 17:10   ` Jiri Olsa
2014-04-01  2:52     ` Don Zickus
2014-04-08  7:41     ` Namhyung Kim
2014-04-08 14:11       ` Don Zickus
2014-04-09  1:12         ` Namhyung Kim
2014-04-09  1:36           ` Don Zickus
2014-04-11 14:57             ` Jiri Olsa
2014-04-08  7:18   ` Namhyung Kim
2014-03-24 19:36 ` [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores Don Zickus
2014-03-29 17:11   ` Jiri Olsa
2014-04-01  2:55     ` Don Zickus
2014-04-06 13:14       ` Jiri Olsa
2014-04-07 18:16         ` Don Zickus
2014-04-09  0:17           ` Namhyung Kim
2014-04-08  7:37         ` Namhyung Kim
2014-04-08  7:31   ` Namhyung Kim
2014-03-24 19:36 ` [PATCH 07/15 V3] perf, c2c: Add in sort on physid Don Zickus
2014-04-08  7:56   ` Namhyung Kim
2014-04-08 14:17     ` Don Zickus
2014-04-09  1:30       ` Namhyung Kim
2014-04-09  1:56         ` Don Zickus
2014-03-24 19:36 ` [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
2014-04-08  8:05   ` Namhyung Kim
2014-03-24 19:37 ` [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line Don Zickus
2014-04-08  8:23   ` Namhyung Kim
2014-03-24 19:37 ` [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
2014-04-08  8:26   ` Namhyung Kim
2014-04-08 23:46   ` Namhyung Kim
2014-03-24 19:37 ` [PATCH 11/15 V3] perf, c2c: Add callchain support Don Zickus
2014-03-24 19:37 ` [PATCH 12/15 V3] perf, c2c: Output summary stats Don Zickus
2014-03-24 19:37 ` [PATCH 13/15 V3] perf, c2c: Dump rbtree for debugging Don Zickus
2014-03-24 19:37 ` [PATCH 14/15 V3] perf, c2c: Add symbol count table Don Zickus
2014-03-24 19:37 ` [PATCH 15/15 V3] perf, c2c: Add shared cachline summary table Don Zickus

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.