* [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems
@ 2014-03-24 19:36 Don Zickus
  2014-03-24 19:36 ` [PATCH 01/15 V3] perf: Fix stddev calculation Don Zickus
                   ` (14 more replies)
  0 siblings, 15 replies; 45+ messages in thread
From: Don Zickus @ 2014-03-24 19:36 UTC (permalink / raw)
  To: acme; +Cc: LKML, jolsa, jmario, fowles, peterz, eranian, andi.kleen, Don Zickus

With the introduction of NUMA systems came the possibility of remote memory accesses.
Combine those remote memory accesses with contention on the remote node (i.e. a modified
cacheline) and you have the potential for very long latencies.  These latencies can
bottleneck a program.

The program added by these patches helps detect the situation where two nodes are
'tugging' on the same _data_ cacheline.  The term used throughout this program and
the various changelogs is HITM.  This means nodeX went to read a cacheline
and it was discovered to be loaded in nodeY's LLC (hence the cache HIT).  The
remote cacheline was also in a 'M'odified state, thus creating a 'HIT M' for a hit in
a modified state.  HITMs can happen locally and remotely.  This program's interest
is mainly in remote HITMs, as they cause the longest latencies.
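
As a rough illustration of the remote HITM definition above, the sketch below
classifies one sample from its data_src value using the bit definitions exported
in <linux/perf_event.h>.  The helper name is_remote_hitm() is hypothetical, and
whether the patches test exactly these bits is an assumption, not something
stated in this cover letter.

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the patches): returns true when a
 * sampled load was served from a remote node's cache (1 or 2 hops away)
 * and the snoop found that line in Modified state, i.e. a remote HITM.
 */
static inline bool is_remote_hitm(uint64_t data_src_val)
{
	union perf_mem_data_src src = { .val = data_src_val };

	/* Memory level: was the line found in a remote cache? */
	bool remote_cache =
		(src.mem_lvl & (PERF_MEM_LVL_REM_CCE1 | PERF_MEM_LVL_REM_CCE2)) != 0;

	/* Snoop result: was the line hit in Modified state? */
	bool hitm = (src.mem_snoop & PERF_MEM_SNOOP_HITM) != 0;

	return remote_cache && hitm;
}
```

A HITM reported at the local L3 level would return false here, which matches the
local versus remote distinction drawn above.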

Why a program has a remote HITM derives from how the two nodes are 'sharing' the
cacheline: is the sharing intentional ("true") or unintentional ("false")?  We have seen
lots of "false" sharing cases, which lead to simple solutions such as separating the data
onto different cachelines.
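
A minimal sketch of that kind of fix, assuming 64-byte cachelines (typical on x86
but not guaranteed everywhere).  The struct and field names are made up for
illustration and do not come from any workload mentioned here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SIZE 64	/* assumption: 64-byte cachelines */

/* Before: both counters land in one cacheline, so writers on different
 * nodes keep pulling the modified line away from each other. */
struct counters_shared {
	uint64_t node0_hits;
	uint64_t node1_hits;
};

/* After: each counter starts on its own cacheline boundary. */
struct counters_split {
	alignas(CACHELINE_SIZE) uint64_t node0_hits;
	alignas(CACHELINE_SIZE) uint64_t node1_hits;
};

/* Compile-time check that the two fields are at least one line apart. */
static_assert(offsetof(struct counters_split, node1_hits) -
	      offsetof(struct counters_split, node0_hits) >= CACHELINE_SIZE,
	      "counters must not share a cacheline");

/* Cacheline index of a byte offset, relative to a line-aligned base. */
static inline size_t cacheline_index(size_t offset)
{
	return offset / CACHELINE_SIZE;
}
```

The cost is padding: the split struct grows from 16 to 128 bytes, so this only
makes sense for data that is actually contended.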

This tool does not distinguish between 'true' and 'false' sharing; instead it just points to
the more expensive sharing situations under the current workload.  It is up to the user
to understand what the workload is doing to determine whether a problem exists or not and
how to report it.

The data output is verbose and there are lots of data tables that interpret the latencies
and data addresses in different ways to help see where bottlenecks might lie.

Most of the ideas, work and calculations were done by Dick Fowles.  My work mainly
consisted of porting it to perf.  Joe Mario has contributed greatly with ideas to make the
output more informative based on his usage of the tool.  Joe has found a handful of
bottlenecks using various industry benchmarks and has worked with developers to fix
them.

I would also like to thank Stephane Eranian for his early help and guidance on
navigating the differences between the current perf tool and how similar tools
looked at HP, and also for his tireless work in getting the MMAP2 interface to stick.

Also thanks to Arnaldo and Jiri Olsa for their help and suggestions for this tool.

I also have a test program that generates a controlled number of HITMs, which we used
frequently to validate our early work (the Intel docs were not always clear which
bits had to be set and some arches do not work well).  I would like to add it, but
I didn't know how (nor did I spend any serious time looking either).

This program has been tested primarily on Intel's Ivy Bridge platforms.  The Sandy Bridge
platforms had some quirks that were fixed on Ivy Bridge.  We haven't tried Haswell as
that has a re-worked latency event implementation.

A handful of patches include re-enabling MMAP2 support and some fixes to perf itself.  One
in particular hacks up how standard deviation is calculated.  It works with our calculations
but may break other tools' expectations.  Feedback is welcome.

Comments, feedback, anything else welcome.

V3: updated to latest perf/core branch 5aed711
    support for older machines like Westmere (Peter Z)
    cleanups and fixes
    broke out physid support into another patchset
    still need to address output deficiencies (Andi Kleen)

V2: updated to latest perf/core branch 1029f9fedf87fa6
    switched to hist_entry based on Jiri O's suggestion
    dropped latency analysis for now until this patchset is accepted
    little fixes and tweaks

Signed-off-by: Don Zickus <dzickus@redhat.com>

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (19):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, machine: Use map as success in ip__resolve_ams
  perf, session: Change header.misc dump from decimal to hex
  perf, stat: FIXME Stddev calculation is incorrect
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add rbtree sorted on mmap2 data
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Fixup tid because of perf map is broken
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table
  perf, c2c: Add framework to analyze latency and display summary stats
  perf, c2c: Add selected extreme latencies to output cacheline stats
    table
  perf, c2c: Add summary latency table for various parts of caches

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 2963 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evlist.c            |   37 +
 tools/perf/util/evlist.h            |    7 +
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.h              |    4 +
 tools/perf/util/machine.c           |    2 +-
 tools/perf/util/session.c           |    2 +-
 tools/perf/util/stat.c              |    3 +-
 15 files changed, 3097 insertions(+), 24 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (17):
  Revert "perf: Disable PERF_RECORD_MMAP2 support"
  perf, sort:  Add physid sorting based on mmap2 data
  perf, sort:  Allow unique sorting instead of combining hist_entries
  perf: Allow ability to map cpus to nodes easily
  perf, kmem: Utilize the new generic cpunode_map
  perf: Fix stddev calculation
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add in sort on physid
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table

 kernel/events/core.c                |    4 -
 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 1787 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin-kmem.c           |   78 +-
 tools/perf/builtin-report.c         |    2 +-
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/cpumap.c            |  150 +++
 tools/perf/util/cpumap.h            |   35 +
 tools/perf/util/event.c             |   36 +-
 tools/perf/util/evsel.c             |    1 +
 tools/perf/util/hist.c              |   10 +-
 tools/perf/util/hist.h              |    5 +
 tools/perf/util/sort.c              |  149 +++
 tools/perf/util/sort.h              |    4 +
 tools/perf/util/stat.c              |   13 +
 tools/perf/util/stat.h              |    1 +
 19 files changed, 2236 insertions(+), 101 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7

Arnaldo Carvalho de Melo (2):
  perf c2c: Shared data analyser
  perf c2c: Dump raw records, decode data_src bits

Don Zickus (13):
  perf: Fix stddev calculation
  perf, callchain: Add generic callchain print handler for stdio
  perf, c2c: Rework setup code to prepare for features
  perf, c2c: Add in new options to configure latency and stores
  perf, c2c: Add in sort on physid
  perf, c2c: Add stats to track data source bits and cpu to node maps
  perf, c2c: Sort based on hottest cache line
  perf, c2c: Display cacheline HITM analysis to stdout
  perf, c2c: Add callchain support
  perf, c2c: Output summary stats
  perf, c2c: Dump rbtree for debugging
  perf, c2c: Add symbol count table
  perf, c2c: Add shared cachline summary table

 tools/perf/Documentation/perf-c2c.c |   22 +
 tools/perf/Makefile.perf            |    1 +
 tools/perf/builtin-c2c.c            | 1760 +++++++++++++++++++++++++++++++++++
 tools/perf/builtin.h                |    1 +
 tools/perf/perf.c                   |    1 +
 tools/perf/ui/stdio/hist.c          |   37 +
 tools/perf/util/hist.h              |    4 +
 tools/perf/util/stat.c              |   13 +
 tools/perf/util/stat.h              |    1 +
 9 files changed, 1840 insertions(+)
 create mode 100644 tools/perf/Documentation/perf-c2c.c
 create mode 100644 tools/perf/builtin-c2c.c

-- 
1.7.11.7


Thread overview: 45+ messages
2014-03-24 19:36 [PATCH 00/15 V3] perf, c2c: Add new tool to analyze cacheline contention on NUMA systems Don Zickus
2014-03-24 19:36 ` [PATCH 01/15 V3] perf: Fix stddev calculation Don Zickus
2014-03-24 19:36 ` [PATCH 02/15 V3] perf, callchain: Add generic callchain print handler for stdio Don Zickus
2014-03-24 19:36 ` [PATCH 03/15 V3] perf c2c: Shared data analyser Don Zickus
2014-04-08  6:59   ` Namhyung Kim
2014-04-08 14:22     ` Don Zickus
2014-04-09  0:58       ` Namhyung Kim
2014-04-09  1:29         ` Andi Kleen
2014-04-08 14:23     ` Don Zickus
2014-03-24 19:36 ` [PATCH 04/15 V3] perf c2c: Dump raw records, decode data_src bits Don Zickus
2014-04-08  7:09   ` Namhyung Kim
2014-03-24 19:36 ` [PATCH 05/15 V3] perf, c2c: Rework setup code to prepare for features Don Zickus
2014-03-29 17:10   ` Jiri Olsa
2014-04-01  2:52     ` Don Zickus
2014-04-08  7:41     ` Namhyung Kim
2014-04-08 14:11       ` Don Zickus
2014-04-09  1:12         ` Namhyung Kim
2014-04-09  1:36           ` Don Zickus
2014-04-11 14:57             ` Jiri Olsa
2014-04-08  7:18   ` Namhyung Kim
2014-03-24 19:36 ` [PATCH 06/15 V3] perf, c2c: Add in new options to configure latency and stores Don Zickus
2014-03-29 17:11   ` Jiri Olsa
2014-04-01  2:55     ` Don Zickus
2014-04-06 13:14       ` Jiri Olsa
2014-04-07 18:16         ` Don Zickus
2014-04-09  0:17           ` Namhyung Kim
2014-04-08  7:37         ` Namhyung Kim
2014-04-08  7:31   ` Namhyung Kim
2014-03-24 19:36 ` [PATCH 07/15 V3] perf, c2c: Add in sort on physid Don Zickus
2014-04-08  7:56   ` Namhyung Kim
2014-04-08 14:17     ` Don Zickus
2014-04-09  1:30       ` Namhyung Kim
2014-04-09  1:56         ` Don Zickus
2014-03-24 19:36 ` [PATCH 08/15 V3] perf, c2c: Add stats to track data source bits and cpu to node maps Don Zickus
2014-04-08  8:05   ` Namhyung Kim
2014-03-24 19:37 ` [PATCH 09/15 V3] perf, c2c: Sort based on hottest cache line Don Zickus
2014-04-08  8:23   ` Namhyung Kim
2014-03-24 19:37 ` [PATCH 10/15 V3] perf, c2c: Display cacheline HITM analysis to stdout Don Zickus
2014-04-08  8:26   ` Namhyung Kim
2014-04-08 23:46   ` Namhyung Kim
2014-03-24 19:37 ` [PATCH 11/15 V3] perf, c2c: Add callchain support Don Zickus
2014-03-24 19:37 ` [PATCH 12/15 V3] perf, c2c: Output summary stats Don Zickus
2014-03-24 19:37 ` [PATCH 13/15 V3] perf, c2c: Dump rbtree for debugging Don Zickus
2014-03-24 19:37 ` [PATCH 14/15 V3] perf, c2c: Add symbol count table Don Zickus
2014-03-24 19:37 ` [PATCH 15/15 V3] perf, c2c: Add shared cachline summary table Don Zickus