* [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
@ 2019-01-22 17:45 Alexey Budankov
From: Alexey Budankov @ 2019-01-22 17:45 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel


It has been observed that the trace reading thread runs on the same hw thread
most of the time during perf record sampling collection. This scheduling
layout leads to up to 30% profiling overhead when a cpu intensive workload
fully utilizes a large server system with NUMA. The overhead usually arises
from remote (cross node) HW and memory references, which have much longer
latencies than local ones [1].

This patch set implements the --affinity option, which eliminates the 30%
overhead for serial trace streaming (--affinity=cpu) and lowers it from 30%
to 10% for AIO1 (--aio=1) trace streaming (--affinity=node|cpu).
See the OVERHEAD section below for more details.

The implemented extension provides users with the capability to instruct the
perf tool to bounce the trace reading thread's affinity mask between NUMA
nodes (--affinity=node) or to pin the thread to the exact cpu (--affinity=cpu)
that the trace buffer being processed belongs to.
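
For illustration, here is a minimal standalone sketch of the cpu mode built
on sched_setaffinity() [3] (hypothetical helper name, not the patch code;
the actual implementation is record__adjust_affinity() in patch 3/4):

	#define _GNU_SOURCE
	#include <sched.h>

	/*
	 * Bounce the calling (trace reading) thread onto the cpu that
	 * owns the mmap buffer about to be drained (--affinity=cpu).
	 * For --affinity=node the mask would instead hold all cpus of
	 * the buffer's NUMA node.
	 */
	static int bounce_to_buffer_cpu(int buf_cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(buf_cpu, &mask);
		/* pid 0 applies the mask to the calling thread */
		return sched_setaffinity(0, sizeof(mask), &mask);
	}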

The extension brings improvement in the case of full system utilization,
when the perf tool process contends with the workload processes for cpu
cores. If a system has free cores to execute the perf tool process during
profiling, the default system scheduling layout already induces the lowest
overhead.

The patch set has been validated on the BT benchmark from the NAS Parallel
Benchmarks [2] running on a dual socket, 44 cores, 88 hw threads Broadwell
system with kernels v4.4.0-21-generic (Ubuntu 16.04) and v4.20.0-rc5
(tip perf/core).

The patch set is for Arnaldo's perf/core repository.

OVERHEAD:
			       BENCH REPORT BASED   ELAPSED TIME BASED
	  v4.20.0-rc5 (tip perf/core):

(current) SERIAL-SYS  / BASE : 1.27x (14.37/11.31), 1.29x (15.19/11.69)
	  SERIAL-NODE / BASE : 1.15x (13.04/11.31), 1.17x (13.79/11.69)
	  SERIAL-CPU  / BASE : 1.00x (11.32/11.31), 1.01x (11.89/11.69)

	  AIO1-SYS    / BASE : 1.29x (14.58/11.31), 1.29x (15.26/11.69)
	  AIO1-NODE   / BASE : 1.08x (12.23/11.31), 1.11x (13.01/11.69)
	  AIO1-CPU    / BASE : 1.07x (12.14/11.31), 1.08x (12.83/11.69)

	  v4.4.0-21-generic (Ubuntu 16.04 LTS):

(current) SERIAL-SYS  / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
	  SERIAL-NODE / BASE : 1.19x (13.02/10.87), 1.23x (14.03/11.32)
	  SERIAL-CPU  / BASE : 1.03x (11.21/10.87), 1.07x (12.18/11.32)

	  AIO1-SYS    / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
	  AIO1-NODE   / BASE : 1.10x (12.04/10.87), 1.15x (13.03/11.32)
	  AIO1-CPU    / BASE : 1.12x (12.20/10.87), 1.15x (13.09/11.32)
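
(Reading the numbers: each entry is the profiled run time divided by the
BASE run time; the first ratio uses the benchmark-reported "Time in
seconds", the second the /usr/bin/time elapsed time from the logs below.
E.g. SERIAL-SYS on v4.4.0-21-generic: 13.73/10.87 = 1.26x by bench report
and 14.69/11.32 = 1.29x by elapsed time, i.e. about 26-29% overhead.)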

---
Alexey Budankov (4):
  perf record: allocate affinity masks
  perf record: bind the AIO user space buffers to nodes
  perf record: apply affinity masks when reading mmap buffers
  perf record: implement --affinity=node|cpu option

 tools/perf/Documentation/perf-record.txt |   5 ++
 tools/perf/builtin-record.c              |  45 +++++++++-
 tools/perf/perf.h                        |   8 ++
 tools/perf/util/cpumap.c                 |  10 +++
 tools/perf/util/cpumap.h                 |   1 +
 tools/perf/util/evlist.c                 |   6 +-
 tools/perf/util/evlist.h                 |   2 +-
 tools/perf/util/mmap.c                   | 105 ++++++++++++++++++++++-
 tools/perf/util/mmap.h                   |   3 +-
 9 files changed, 175 insertions(+), 10 deletions(-)

---
Changes in v5:
- avoided multiple allocations of online cpu maps by 
  implementing it once in cpu_map__online()
- reduced indentation at record__parse_affinity()

Changes in v4:
- fixed compilation issue converting pr_warn() to pr_warning()
- implemented stop if mbind() fails
- corrected mmap_params->cpu_map initialization to be based on /sys/devices/system/cpu/online
- separated node cpu map generation into build_node_mask()

Changes in v3:
- converted PERF_AFFINITY_EOF to PERF_AFFINITY_MAX
- corrected code style issues
- adjusted __aio_alloc,__aio_bind,__aio_free() implementation
- separated mask manipulations into __adjust_affinity() and __setup_affinity_mask()
- implemented mapping of c index into online cpu index
- adjusted indentation at record__parse_affinity()

Changes in v2:
- made debug affinity mode message user friendly
- converted affinity mode defines to enum values
- implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
  and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
- separated AIO buffers binding to patch 2/4

---
[1] https://en.wikipedia.org/wiki/Non-uniform_memory_access
[2] https://www.nas.nasa.gov/publications/npb.html
[3] http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html
[4] http://man7.org/linux/man-pages/man2/mbind.2.html

---
ENVIRONMENT AND MEASUREMENTS:

  MACHINE:

	Broadwell, dual socket, 44 cores, 88 hw threads

	/proc/cpuinfo

	processor	: 87
	vendor_id	: GenuineIntel
	cpu family	: 6
	model		: 79
	model name	: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
	stepping	: 1
	microcode	: 0xb000019
	cpu MHz		: 1200.117
	cache size	: 56320 KB
	physical id	: 1
	siblings	: 44
	core id		: 28
	cpu cores	: 22
	apicid		: 121
	initial apicid	: 121
	fpu		: yes
	fpu_exception	: yes
	cpuid level	: 20
	wp		: yes
	flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
	bugs		:
	bogomips	: 4391.42
	clflush size	: 64
	cache_alignment	: 64
	address sizes	: 46 bits physical, 48 bits virtual
	power management:
  		
  BASE:

	/usr/bin/time ./bt.B.x 

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
	
	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88
	
	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    10.87
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 64608.74
	Mop/s/thread    =                   734.19
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018
	
	956.25user 19.14system 0:11.32elapsed 8616%CPU (0avgtext+0avgdata 210496maxresident)k
	0inputs+0outputs (0major+57939minor)pagefaults 0swaps

  SERIAL-SYS:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 0
	affinity (UNSET:0, NODE:1, CPU:2) = 0
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    13.73
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 51136.52
	Mop/s/thread    =                   581.10
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1661,120 MB perf.data ]

	1184.84user 40.70system 0:14.69elapsed 8341%CPU (0avgtext+0avgdata 208612maxresident)k
	0inputs+3402072outputs (0major+137077minor)pagefaults 0swaps

  SERIAL-NODE:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=node -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 0
	affinity (UNSET:0, NODE:1, CPU:2) = 1
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    13.02
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 53924.69
	Mop/s/thread    =                   612.78
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1557,152 MB perf.data ]

	1120.42user 29.92system 0:14.03elapsed 8198%CPU (0avgtext+0avgdata 206388maxresident)k
	0inputs+3189128outputs (0major+149207minor)pagefaults 0swaps

  SERIAL-CPU:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=cpu -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 0
	affinity (UNSET:0, NODE:1, CPU:2) = 2
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    11.21
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 62642.24
	Mop/s/thread    =                   711.84
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1365,043 MB perf.data ]

	976.06user 31.35system 0:12.18elapsed 8264%CPU (0avgtext+0avgdata 208488maxresident)k
	0inputs+2795704outputs (0major+126032minor)pagefaults 0swaps

  AIO1-SYS:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 1
	affinity (UNSET:0, NODE:1, CPU:2) = 0
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    14.23
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 49338.27
	Mop/s/thread    =                   560.66
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1720,590 MB perf.data ]

	1229.19user 41.99system 0:15.22elapsed 8350%CPU (0avgtext+0avgdata 208604maxresident)k
	0inputs+3523880outputs (0major+124670minor)pagefaults 0swaps

  AIO1-NODE:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=node -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 1
	affinity (UNSET:0, NODE:1, CPU:2) = 1
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    12.04
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 58313.58
	Mop/s/thread    =                   662.65
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1471,279 MB perf.data ]

	1055.62user 30.43system 0:13.03elapsed 8333%CPU (0avgtext+0avgdata 208424maxresident)k
	0inputs+3013288outputs (0major+79088minor)pagefaults 0swaps

  AIO1-CPU:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=cpu -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 1
	affinity (UNSET:0, NODE:1, CPU:2) = 2
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    12.20
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 57538.84
	Mop/s/thread    =                   653.85
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1486,859 MB perf.data ]

	1051.97user 42.07system 0:13.09elapsed 8352%CPU (0avgtext+0avgdata 206388maxresident)k
	0inputs+3045168outputs (0major+174612minor)pagefaults 0swaps


* [PATCH v5 1/4] perf record: allocate affinity masks
@ 2019-01-22 17:47 Alexey Budankov
From: Alexey Budankov @ 2019-01-22 17:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel


Allocate the affinity option and affinity masks for the mmap data buffers
and the record thread, and initialize the allocated objects.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v3:
- converted PERF_AFFINITY_EOF to PERF_AFFINITY_MAX

Changes in v2:
- made debug affinity mode message user friendly
- converted affinity mode defines to enum values
---
 tools/perf/builtin-record.c | 13 ++++++++++++-
 tools/perf/perf.h           |  8 ++++++++
 tools/perf/util/evlist.c    |  6 +++---
 tools/perf/util/evlist.h    |  2 +-
 tools/perf/util/mmap.c      |  2 ++
 tools/perf/util/mmap.h      |  3 ++-
 6 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 88ea11d57c6f..370a68487532 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -82,12 +82,17 @@ struct record {
 	bool			timestamp_boundary;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
+	cpu_set_t		affinity_mask;
 };
 
 static volatile int auxtrace_record__snapshot_started;
 static DEFINE_TRIGGER(auxtrace_snapshot_trigger);
 static DEFINE_TRIGGER(switch_output_trigger);
 
+static const char *affinity_tags[PERF_AFFINITY_MAX] = {
+	"SYS", "NODE", "CPU"
+};
+
 static bool switch_output_signal(struct record *rec)
 {
 	return rec->switch_output.signal &&
@@ -534,7 +539,8 @@ static int record__mmap_evlist(struct record *rec,
 
 	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 				 opts->auxtrace_mmap_pages,
-				 opts->auxtrace_snapshot_mode, opts->nr_cblocks) < 0) {
+				 opts->auxtrace_snapshot_mode,
+				 opts->nr_cblocks, opts->affinity) < 0) {
 		if (errno == EPERM) {
 			pr_err("Permission error mapping pages.\n"
 			       "Consider increasing "
@@ -1987,6 +1993,9 @@ int cmd_record(int argc, const char **argv)
 # undef REASON
 #endif
 
+	CPU_ZERO(&rec->affinity_mask);
+	rec->opts.affinity = PERF_AFFINITY_SYS;
+
 	rec->evlist = perf_evlist__new();
 	if (rec->evlist == NULL)
 		return -ENOMEM;
@@ -2150,6 +2159,8 @@ int cmd_record(int argc, const char **argv)
 	if (verbose > 0)
 		pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
 
+	pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
+
 	err = __cmd_record(&record, argc, argv);
 out:
 	perf_evlist__delete(rec->evlist);
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 5941fb6eccfc..b120e547ddc7 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -84,6 +84,14 @@ struct record_opts {
 	clockid_t    clockid;
 	u64          clockid_res_ns;
 	int	     nr_cblocks;
+	int	     affinity;
+};
+
+enum perf_affinity {
+	PERF_AFFINITY_SYS = 0,
+	PERF_AFFINITY_NODE,
+	PERF_AFFINITY_CPU,
+	PERF_AFFINITY_MAX
 };
 
 struct option;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 5f01bced3116..6d462ee1b529 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1035,7 +1035,7 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
  */
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite, int nr_cblocks)
+			 bool auxtrace_overwrite, int nr_cblocks, int affinity)
 {
 	struct perf_evsel *evsel;
 	const struct cpu_map *cpus = evlist->cpus;
@@ -1045,7 +1045,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	 * Its value is decided by evsel's write_backward.
 	 * So &mp should not be passed through const pointer.
 	 */
-	struct mmap_params mp = { .nr_cblocks = nr_cblocks };
+	struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity };
 
 	if (!evlist->mmap)
 		evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1077,7 +1077,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
 {
-	return perf_evlist__mmap_ex(evlist, pages, 0, false, 0);
+	return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS);
 }
 
 int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 18365b1f80b0..bb33872d12fc 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -165,7 +165,7 @@ unsigned long perf_event_mlock_kb_in_pages(void);
 
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite, int nr_cblocks);
+			 bool auxtrace_overwrite, int nr_cblocks, int affinity);
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
 void perf_evlist__munmap(struct perf_evlist *evlist);
 
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 8fc39311a30d..e68ba754a8e2 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -343,6 +343,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
 	map->fd = fd;
 	map->cpu = cpu;
 
+	CPU_ZERO(&map->affinity_mask);
+
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
 				&mp->auxtrace_mp, map->base, fd))
 		return -1;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index aeb6942fdb00..e566c19b242b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -38,6 +38,7 @@ struct perf_mmap {
 		int		 nr_cblocks;
 	} aio;
 #endif
+	cpu_set_t	affinity_mask;
 };
 
 /*
@@ -69,7 +70,7 @@ enum bkw_mmap_state {
 };
 
 struct mmap_params {
-	int			    prot, mask, nr_cblocks;
+	int			    prot, mask, nr_cblocks, affinity;
 	struct auxtrace_mmap_params auxtrace_mp;
 };


* [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes
@ 2019-01-22 17:48 Alexey Budankov
From: Alexey Budankov @ 2019-01-22 17:48 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel


Allocate AIO user space buffers and bind them to the memory nodes that
the mmap kernel buffers are bound to.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v4:
- fixed compilation issue converting pr_warn() to pr_warning()
- implemented stop if mbind() fails

Changes in v3:
- corrected code style issues
- adjusted __aio_alloc,__aio_bind,__aio_free() implementation

Changes in v2:
- implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
  and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
---
 tools/perf/util/mmap.c | 77 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index e68ba754a8e2..34be9f900575 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -10,6 +10,9 @@
 #include <sys/mman.h>
 #include <inttypes.h>
 #include <asm/bug.h>
+#ifdef HAVE_LIBNUMA_SUPPORT
+#include <numaif.h>
+#endif
 #include "debug.h"
 #include "event.h"
 #include "mmap.h"
@@ -154,9 +157,72 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 }
 
 #ifdef HAVE_AIO_SUPPORT
+
+#ifdef HAVE_LIBNUMA_SUPPORT
+static int perf_mmap__aio_alloc(struct perf_mmap *map, int index)
+{
+	map->aio.data[index] = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ|PROT_WRITE,
+				    MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+	if (map->aio.data[index] == MAP_FAILED) {
+		map->aio.data[index] = NULL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static void perf_mmap__aio_free(struct perf_mmap *map, int index)
+{
+	if (map->aio.data[index]) {
+		munmap(map->aio.data[index], perf_mmap__mmap_len(map));
+		map->aio.data[index] = NULL;
+	}
+}
+
+static int perf_mmap__aio_bind(struct perf_mmap *map, int index, int cpu, int affinity)
+{
+	void *data;
+	size_t mmap_len;
+	unsigned long node_mask;
+
+	if (affinity != PERF_AFFINITY_SYS && cpu__max_node() > 1) {
+		data = map->aio.data[index];
+		mmap_len = perf_mmap__mmap_len(map);
+		node_mask = 1UL << cpu__get_node(cpu);
+		if (mbind(data, mmap_len, MPOL_BIND, &node_mask, 1, 0)) {
+			pr_err("Failed to bind [%p-%p] AIO buffer to node %d: error %m\n",
+				data, data + mmap_len, cpu__get_node(cpu));
+			return -1;
+		}
+	}
+
+	return 0;
+}
+#else
+static int perf_mmap__aio_alloc(struct perf_mmap *map, int index)
+{
+	map->aio.data[index] = malloc(perf_mmap__mmap_len(map));
+	if (map->aio.data[index] == NULL)
+		return -1;
+
+	return 0;
+}
+
+static void perf_mmap__aio_free(struct perf_mmap *map, int index)
+{
+	zfree(&(map->aio.data[index]));
+}
+
+static int perf_mmap__aio_bind(struct perf_mmap *map __maybe_unused, int index __maybe_unused,
+		int cpu __maybe_unused, int affinity __maybe_unused)
+{
+	return 0;
+}
+#endif
+
 static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
 {
-	int delta_max, i, prio;
+	int delta_max, i, prio, ret;
 
 	map->aio.nr_cblocks = mp->nr_cblocks;
 	if (map->aio.nr_cblocks) {
@@ -177,11 +243,14 @@ static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
 		}
 		delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
 		for (i = 0; i < map->aio.nr_cblocks; ++i) {
-			map->aio.data[i] = malloc(perf_mmap__mmap_len(map));
-			if (!map->aio.data[i]) {
+			ret = perf_mmap__aio_alloc(map, i);
+			if (ret == -1) {
 				pr_debug2("failed to allocate data buffer area, error %m");
 				return -1;
 			}
+			ret = perf_mmap__aio_bind(map, i, map->cpu, mp->affinity);
+			if (ret == -1)
+				return -1;
 			/*
 			 * Use cblock.aio_fildes value different from -1
 			 * to denote started aio write operation on the
@@ -210,7 +279,7 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map)
 	int i;
 
 	for (i = 0; i < map->aio.nr_cblocks; ++i)
-		zfree(&map->aio.data[i]);
+		perf_mmap__aio_free(map, i);
 	if (map->aio.data)
 		zfree(&map->aio.data);
 	zfree(&map->aio.cblocks);


* [PATCH v5 3/4] perf record: apply affinity masks when reading mmap buffers
@ 2019-01-22 17:50 Alexey Budankov
From: Alexey Budankov @ 2019-01-22 17:50 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel


Build node cpu masks for the mmap data buffers. Apply a node cpu mask to
the tool thread every time it references data buffers across nodes or
across cpus.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v5:
- avoided multiple allocations of online cpu maps by 
  implementing it once in cpu_map__online()

Changes in v4:
- corrected mmap_params->cpu_map initialization to be based on /sys/devices/system/cpu/online
- separated node cpu map generation into build_node_mask()

Changes in v3:
- separated mask manipulations into __adjust_affinity() and __setup_affinity_mask()
- implemented mapping of c index into online cpu index

Changes in v2:
- separated AIO buffers binding to patch 2/4
---
 tools/perf/builtin-record.c | 14 ++++++++++++++
 tools/perf/util/cpumap.c    | 10 ++++++++++
 tools/perf/util/cpumap.h    |  1 +
 tools/perf/util/mmap.c      | 28 +++++++++++++++++++++++++++-
 4 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 370a68487532..142d109bc53d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -537,6 +537,9 @@ static int record__mmap_evlist(struct record *rec,
 	struct record_opts *opts = &rec->opts;
 	char msg[512];
 
+	if (opts->affinity != PERF_AFFINITY_SYS)
+		cpu__setup_cpunode_map();
+
 	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 				 opts->auxtrace_mmap_pages,
 				 opts->auxtrace_snapshot_mode,
@@ -729,6 +732,16 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
+static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
+{
+	if (rec->opts.affinity != PERF_AFFINITY_SYS &&
+	    !CPU_EQUAL(&rec->affinity_mask, &map->affinity_mask)) {
+		CPU_ZERO(&rec->affinity_mask);
+		CPU_OR(&rec->affinity_mask, &rec->affinity_mask, &map->affinity_mask);
+		sched_setaffinity(0, sizeof(rec->affinity_mask), &rec->affinity_mask);
+	}
+}
+
 static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
 				    bool overwrite)
 {
@@ -756,6 +769,7 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 		struct perf_mmap *map = &maps[i];
 
 		if (map->base) {
+			record__adjust_affinity(rec, map);
 			if (!record__aio_enabled(rec)) {
 				if (perf_mmap__push(map, rec, record__pushfn) != 0) {
 					rc = -1;
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 1ccbd3342069..a5523ba05cf1 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -723,3 +723,13 @@ size_t cpu_map__snprint_mask(struct cpu_map *map, char *buf, size_t size)
 	buf[size - 1] = '\0';
 	return ptr - buf;
 }
+
+const struct cpu_map *cpu_map__online(void) /* thread unsafe */
+{
+	static const struct cpu_map *online = NULL;
+
+	if (!online)
+		online = cpu_map__new(NULL); /* from /sys/devices/system/cpu/online */
+
+	return online;
+}
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index ed8999d1a640..f00ce624b9f7 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -29,6 +29,7 @@ int cpu_map__get_core_id(int cpu);
 int cpu_map__get_core(struct cpu_map *map, int idx, void *data);
 int cpu_map__build_socket_map(struct cpu_map *cpus, struct cpu_map **sockp);
 int cpu_map__build_core_map(struct cpu_map *cpus, struct cpu_map **corep);
+const struct cpu_map *cpu_map__online(void); /* thread unsafe */
 
 struct cpu_map *cpu_map__get(struct cpu_map *map);
 void cpu_map__put(struct cpu_map *map);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 34be9f900575..995c282fcd21 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -383,6 +383,32 @@ void perf_mmap__munmap(struct perf_mmap *map)
 	auxtrace_mmap__munmap(&map->auxtrace_mmap);
 }
 
+static void build_node_mask(int node, cpu_set_t *mask)
+{
+	int c, cpu, nr_cpus;
+	const struct cpu_map *cpu_map = NULL;
+
+	cpu_map = cpu_map__online();
+	if (!cpu_map)
+		return;
+
+	nr_cpus = cpu_map__nr(cpu_map);
+	for (c = 0; c < nr_cpus; c++) {
+		cpu = cpu_map->map[c]; /* map c index to online cpu index */
+		if (cpu__get_node(cpu) == node)
+			CPU_SET(cpu, mask);
+	}
+}
+
+static void perf_mmap__setup_affinity_mask(struct perf_mmap *map, struct mmap_params *mp)
+{
+	CPU_ZERO(&map->affinity_mask);
+	if (mp->affinity == PERF_AFFINITY_NODE && cpu__max_node() > 1)
+		build_node_mask(cpu__get_node(map->cpu), &map->affinity_mask);
+	else if (mp->affinity == PERF_AFFINITY_CPU)
+		CPU_SET(map->cpu, &map->affinity_mask);
+}
+
 int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int cpu)
 {
 	/*
@@ -412,7 +438,7 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
 	map->fd = fd;
 	map->cpu = cpu;
 
-	CPU_ZERO(&map->affinity_mask);
+	perf_mmap__setup_affinity_mask(map, mp);
 
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
 				&mp->auxtrace_mp, map->base, fd))


* [PATCH v5 4/4] perf record: implement --affinity=node|cpu option
@ 2019-01-22 17:52 Alexey Budankov
From: Alexey Budankov @ 2019-01-22 17:52 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel


Implement the --affinity=node|cpu option for record mode; the default
remains the system affinity mask.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
Changes in v5:
- reduced indentation at record__parse_affinity()

Changes in v3:
- adjusted indentation at record__parse_affinity()
---
 tools/perf/Documentation/perf-record.txt |  5 +++++
 tools/perf/builtin-record.c              | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index d232b13ea713..efb839784f32 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -440,6 +440,11 @@ Use <n> control blocks in asynchronous (Posix AIO) trace writing mode (default:
 Asynchronous mode is supported only when linking Perf tool with libc library
 providing implementation for Posix AIO API.
 
+--affinity=mode::
+Set affinity mask of trace reading thread according to the policy defined by 'mode' value:
+  node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
+  cpu  - thread affinity mask is set to cpu of the processed mmap buffer
+
 --all-kernel::
 Configure all used events to run in kernel space.
 
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 142d109bc53d..c2a18261e595 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1665,6 +1665,21 @@ static int parse_clockid(const struct option *opt, const char *str, int unset)
 	return -1;
 }
 
+static int record__parse_affinity(const struct option *opt, const char *str, int unset)
+{
+	struct record_opts *opts = (struct record_opts *)opt->value;
+
+	if (unset || !str)
+		return 0;
+
+	if (!strcasecmp(str, "node"))
+		opts->affinity = PERF_AFFINITY_NODE;
+	else if (!strcasecmp(str, "cpu"))
+		opts->affinity = PERF_AFFINITY_CPU;
+
+	return 0;
+}
+
 static int record__parse_mmap_pages(const struct option *opt,
 				    const char *str,
 				    int unset __maybe_unused)
@@ -1973,6 +1988,9 @@ static struct option __record_options[] = {
 		     &nr_cblocks_default, "n", "Use <n> control blocks in asynchronous trace writing mode (default: 1, max: 4)",
 		     record__aio_parse),
 #endif
+	OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
+		     "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
+		     record__parse_affinity),
 	OPT_END()
 };


* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
@ 2019-01-28  7:13 Alexey Budankov
From: Alexey Budankov @ 2019-01-28  7:13 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, Namhyung Kim, Alexander Shishkin,
	Andi Kleen, linux-kernel

Hi Jiri, Arnaldo,

On 22.01.2019 20:45, Alexey Budankov wrote:

SNIP

> ---
> Changes in v5:
> - avoided multiple allocations of online cpu maps by 
>   implementing it once in cpu_map__online()
> - reduced indentation at record__parse_affinity()

Are there any more comments on this patch set?

Thanks,
Alexey



* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
@ 2019-01-28  8:20 Jiri Olsa
From: Jiri Olsa @ 2019-01-28  8:20 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel

On Mon, Jan 28, 2019 at 10:13:20AM +0300, Alexey Budankov wrote:

SNIP

> Are there any more comments on this patch set?

sry for late reply.. there was a mayhem in here
last week because of the devconf ;-)

I'll review it today, but I think there were no more issues

jirka


* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
@ 2019-01-28  8:39 Alexey Budankov
From: Alexey Budankov @ 2019-01-28  8:39 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel

On 28.01.2019 11:20, Jiri Olsa wrote:
> On Mon, Jan 28, 2019 at 10:13:20AM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>> Are there any more comments on this patch set?
> 
> sry for late reply.. there was a mayhem in here
> last week because of the devconf ;-)

Aww, I see. Thanks for update.

-Alexey



* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
@ 2019-01-28 11:27 Jiri Olsa
From: Jiri Olsa @ 2019-01-28 11:27 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel

On Tue, Jan 22, 2019 at 08:45:12PM +0300, Alexey Budankov wrote:

SNIP

> ---
> Changes in v5:
> - avoided multiple allocations of online cpu maps by 
>   implementing it once in cpu_map__online()
> - reduced indentation at record__parse_affinity()

Reviewed-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
  2019-01-22 17:45 [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems Alexey Budankov
                   ` (5 preceding siblings ...)
  2019-01-28 11:27 ` Jiri Olsa
@ 2019-01-29  9:14 ` Arnaldo Carvalho de Melo
  2019-01-29 10:22   ` Alexey Budankov
  6 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-01-29  9:14 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel

Em Tue, Jan 22, 2019 at 08:45:12PM +0300, Alexey Budankov escreveu:
> 
> It has been observed that trace reading thread runs on the same hw thread
> most of the time during perf record sampling collection. This scheduling
> layout leads up to 30% profiling overhead in case when some cpu intensive
> workload fully utilizes a large server system with NUMA. Overhead usually
> arises from remote (cross node) HW and memory references that have much
> longer latencies than local ones [1].
> 
> This patch set implements --affinity option that lowers 30% overhead
> completely for serial trace streaming (--affinity=cpu) and from 30% to
> 10% for AIO1 (--aio=1) trace streaming (--affinity=node|cpu).
> See OVERHEAD section below for more details.
> 
> Implemented extension provides users with capability to instruct Perf 
> tool to bounce trace reading thread's affinity mask between NUMA nodes 
> (--affinity=node) or assign the thread to the exact cpu (--affinity=cpu) 
> that trace buffer being processed belongs to.
> 
> The extension brings improvement in case of full system utilization when 
> Perf tool process contends with workload process on cpu cores. In case a 
> system has free cores to execute Perf tool process during profiling the 
> default system scheduling layout induces the lowest overhead.
> 
> The patch set has been validated on BT benchmark from NAS Parallel 
> Benchmarks [2] running on dual socket, 44 cores, 88 hw threads Broadwell 
> system with kernels v4.4-21-generic (Ubuntu 16.04) and v4.20.0-rc5 
> (tip perf/core). 
> 
> The patch set is for Arnaldo's perf/core repository.
> 
> OVERHEAD:
> 			       BENCH REPORT BASED   ELAPSED TIME BASED
> 	  v4.20.0-rc5 
>           (tip perf/core):
> 				
> (current) SERIAL-SYS  / BASE : 1.27x (14.37/11.31), 1.29x (15.19/11.69)
> 	  SERIAL-NODE / BASE : 1.15x (13.04/11.31), 1.17x (13.79/11.69)
> 	  SERIAL-CPU  / BASE : 1.00x (11.32/11.31), 1.01x (11.89/11.69)

Can you explain those numbers a bit? You mean that in (14.37/11.31)
without this patch your test workload (NAS Parallel Benchmarks) took
14.37 seconds to complete and with the patch applied it took 11.31
seconds?

If so, it's really interesting that all the "after" numbers, in seconds,
are the same in _all_ the tests, is that right?

Please clarify.

- Arnaldo

> 	
> 	  AIO1-SYS    / BASE : 1.29x (14.58/11.31), 1.29x (15.26/11.69)
> 	  AIO1-NODE   / BASE : 1.08x (12.23/11.31), 1.11x (13.01/11.69)
> 	  AIO1-CPU    / BASE : 1.07x (12.14/11.31), 1.08x (12.83/11.69)
> 
> 	  v4.4.0-21-generic
>           (Ubuntu 16.04 LTS):
> 
> (current) SERIAL-SYS  / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
> 	  SERIAL-NODE / BASE : 1.19x (13.02/10.87), 1.23x (14.03/11.32)
> 	  SERIAL-CPU  / BASE : 1.03x (11.21/10.87), 1.07x (12.18/11.32)
> 	
> 	  AIO1-SYS    / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
> 	  AIO1-NODE   / BASE : 1.10x (12.04/10.87), 1.15x (13.03/11.32)
> 	  AIO1-CPU    / BASE : 1.12x (12.20/10.87), 1.15x (13.09/11.32)
> 
> ---
> Alexey Budankov (4):
>   perf record: allocate affinity masks
>   perf record: bind the AIO user space buffers to nodes
>   perf record: apply affinity masks when reading mmap buffers
>   perf record: implement --affinity=node|cpu option
> 
>  tools/perf/Documentation/perf-record.txt |   5 ++
>  tools/perf/builtin-record.c              |  45 +++++++++-
>  tools/perf/perf.h                        |   8 ++
>  tools/perf/util/cpumap.c                 |  10 +++
>  tools/perf/util/cpumap.h                 |   1 +
>  tools/perf/util/evlist.c                 |   6 +-
>  tools/perf/util/evlist.h                 |   2 +-
>  tools/perf/util/mmap.c                   | 105 ++++++++++++++++++++++-
>  tools/perf/util/mmap.h                   |   3 +-
>  9 files changed, 175 insertions(+), 10 deletions(-)
> 
> ---
> Changes in v5:
> - avoided multiple allocations of online cpu maps by 
>   implementing it once in cpu_map__online()
> - reduced indentation at record__parse_affinity()
> 
> Changes in v4:
> - fixed compilation issue converting pr_warn() to pr_warning()
> - implemented stop if mbind() fails
> - corrected mmap_params->cpu_map initialization to be based on /sys/devices/system/cpu/online
> - separated node cpu map generation into build_node_mask()
> 
> Changes in v3:
> - converted PERF_AFFINITY_EOF to PERF_AFFINITY_MAX
> - corrected code style issues
> - adjusted __aio_alloc,__aio_bind,__aio_free() implementation
> - separated mask manipulations into __adjust_affinity() and __setup_affinity_mask()
> - implemented mapping of c index into online cpu index
> - adjusted indentation at record__parse_affinity()
> 
> Changes in v2:
> - made debug affinity mode message user friendly
> - converted affinity mode defines to enum values
> - implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
>   and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
> - separated AIO buffers binding to patch 2/4
> 
> ---
> [1] https://en.wikipedia.org/wiki/Non-uniform_memory_access
> [2] https://www.nas.nasa.gov/publications/npb.html
> [3] http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html
> [4] http://man7.org/linux/man-pages/man2/mbind.2.html
> 
> ---
> ENVIRONMENT AND MEASUREMENTS:
> 
>   MACHINE:
> 
> 	broadwell, dual socket, 44 core, 88 threads
> 
> 	/proc/cpuinfo
> 
> 	processor	: 87
> 	vendor_id	: GenuineIntel
> 	cpu family	: 6
> 	model		: 79
> 	model name	: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> 	stepping	: 1
> 	microcode	: 0xb000019
> 	cpu MHz		: 1200.117
> 	cache size	: 56320 KB
> 	physical id	: 1
> 	siblings	: 44
> 	core id		: 28
> 	cpu cores	: 22
> 	apicid		: 121
> 	initial apicid	: 121
> 	fpu		: yes
> 	fpu_exception	: yes
> 	cpuid level	: 20
> 	wp		: yes
> 	flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
> 	bugs		:
> 	bogomips	: 4391.42
> 	clflush size	: 64
> 	cache_alignment	: 64
> 	address sizes	: 46 bits physical, 48 bits virtual
> 	power management:
>   		
>   BASE:
> 
> 	/usr/bin/time ./bt.B.x 
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 	
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 	
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    10.87
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 64608.74
> 	Mop/s/thread    =                   734.19
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 	
> 	956.25user 19.14system 0:11.32elapsed 8616%CPU (0avgtext+0avgdata 210496maxresident)k
> 	0inputs+0outputs (0major+57939minor)pagefaults 0swaps
> 
>   SERIAL-SYS:
> 
> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 -a -e cycles -- ./bt.B.x 
> 	Using CPUID GenuineIntel-6-4F-1
> 	nr_cblocks: 0
> 	affinity (UNSET:0, NODE:1, CPU:2) = 0
> 	mmap size 528384B
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    13.73
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 51136.52
> 	Mop/s/thread    =                   581.10
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 
> 	[ perf record: Captured and wrote 1661,120 MB perf.data ]
> 
> 	1184.84user 40.70system 0:14.69elapsed 8341%CPU (0avgtext+0avgdata 208612maxresident)k
> 	0inputs+3402072outputs (0major+137077minor)pagefaults 0swaps
> 
>   SERIAL-NODE:
> 
> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=node -a -e cycles -- ./bt.B.x 
> 	Using CPUID GenuineIntel-6-4F-1
> 	nr_cblocks: 0
> 	affinity (UNSET:0, NODE:1, CPU:2) = 1
> 	mmap size 528384B
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    13.02
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 53924.69
> 	Mop/s/thread    =                   612.78
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 
> 	[ perf record: Captured and wrote 1557,152 MB perf.data ]
> 
> 	1120.42user 29.92system 0:14.03elapsed 8198%CPU (0avgtext+0avgdata 206388maxresident)k
> 	0inputs+3189128outputs (0major+149207minor)pagefaults 0swaps
> 
>   SERIAL-CPU:
> 
> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=cpu -a -e cycles -- ./bt.B.x 
> 	Using CPUID GenuineIntel-6-4F-1
> 	nr_cblocks: 0
> 	affinity (UNSET:0, NODE:1, CPU:2) = 2
> 	mmap size 528384B
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    11.21
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 62642.24
> 	Mop/s/thread    =                   711.84
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 
> 	[ perf record: Captured and wrote 1365,043 MB perf.data ]
> 
> 	976.06user 31.35system 0:12.18elapsed 8264%CPU (0avgtext+0avgdata 208488maxresident)k
> 	0inputs+2795704outputs (0major+126032minor)pagefaults 0swaps
> 
>   AIO1-SYS:
> 
> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 -a -e cycles -- ./bt.B.x 
> 	Using CPUID GenuineIntel-6-4F-1
> 	nr_cblocks: 1
> 	affinity (UNSET:0, NODE:1, CPU:2) = 0
> 	mmap size 528384B
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    14.23
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 49338.27
> 	Mop/s/thread    =                   560.66
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 
> 	[ perf record: Captured and wrote 1720,590 MB perf.data ]
> 
> 	1229.19user 41.99system 0:15.22elapsed 8350%CPU (0avgtext+0avgdata 208604maxresident)k
> 	0inputs+3523880outputs (0major+124670minor)pagefaults 0swaps
> 
>   AIO1-NODE:
> 
> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=node -a -e cycles -- ./bt.B.x 
> 	Using CPUID GenuineIntel-6-4F-1
> 	nr_cblocks: 1
> 	affinity (UNSET:0, NODE:1, CPU:2) = 1
> 	mmap size 528384B
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    12.04
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 58313.58
> 	Mop/s/thread    =                   662.65
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 
> 	[ perf record: Captured and wrote 1471,279 MB perf.data ]
> 
> 	1055.62user 30.43system 0:13.03elapsed 8333%CPU (0avgtext+0avgdata 208424maxresident)k
> 	0inputs+3013288outputs (0major+79088minor)pagefaults 0swaps
> 
>   AIO1-CPU:
> 
> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=cpu -a -e cycles -- ./bt.B.x 
> 	Using CPUID GenuineIntel-6-4F-1
> 	nr_cblocks: 1
> 	affinity (UNSET:0, NODE:1, CPU:2) = 2
> 	mmap size 528384B
> 
> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
> 
> 	No input file inputbt.data. Using compiled defaults
> 	Size:  102x 102x 102
> 	Iterations:  200       dt:   0.0003000
> 	Number of available threads:    88
> 
> 	BT Benchmark Completed.
> 	Class           =                        B
> 	Size            =            102x 102x 102
> 	Iterations      =                      200
> 	Time in seconds =                    12.20
> 	Total threads   =                       88
> 	Avail threads   =                       88
> 	Mop/s total     =                 57538.84
> 	Mop/s/thread    =                   653.85
> 	Operation type  =           floating point
> 	Verification    =               SUCCESSFUL
> 	Version         =                    3.3.1
> 	Compile date    =              20 Sep 2018
> 
> 	[ perf record: Captured and wrote 1486,859 MB perf.data ]
> 
> 	1051.97user 42.07system 0:13.09elapsed 8352%CPU (0avgtext+0avgdata 206388maxresident)k
> 	0inputs+3045168outputs (0major+174612minor)pagefaults 0swaps

-- 

- Arnaldo

* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
  2019-01-29  9:14 ` Arnaldo Carvalho de Melo
@ 2019-01-29 10:22   ` Alexey Budankov
  0 siblings, 0 replies; 21+ messages in thread
From: Alexey Budankov @ 2019-01-29 10:22 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel

Hi,
On 29.01.2019 12:14, Arnaldo Carvalho de Melo wrote:
> Em Tue, Jan 22, 2019 at 08:45:12PM +0300, Alexey Budankov escreveu:
>>
>> It has been observed that trace reading thread runs on the same hw thread
>> most of the time during perf record sampling collection. This scheduling
>> layout leads up to 30% profiling overhead in case when some cpu intensive
>> workload fully utilizes a large server system with NUMA. Overhead usually
>> arises from remote (cross node) HW and memory references that have much
>> longer latencies than local ones [1].
>>
>> This patch set implements --affinity option that lowers 30% overhead
>> completely for serial trace streaming (--affinity=cpu) and from 30% to
>> 10% for AIO1 (--aio=1) trace streaming (--affinity=node|cpu).
>> See OVERHEAD section below for more details.
>>
>> Implemented extension provides users with capability to instruct Perf 
>> tool to bounce trace reading thread's affinity mask between NUMA nodes 
>> (--affinity=node) or assign the thread to the exact cpu (--affinity=cpu) 
>> that trace buffer being processed belongs to.
>>
>> The extension brings improvement in case of full system utilization when 
>> Perf tool process contends with workload process on cpu cores. In case a 
>> system has free cores to execute Perf tool process during profiling the 
>> default system scheduling layout induces the lowest overhead.
>>
>> The patch set has been validated on BT benchmark from NAS Parallel 
>> Benchmarks [2] running on dual socket, 44 cores, 88 hw threads Broadwell 
>> system with kernels v4.4-21-generic (Ubuntu 16.04) and v4.20.0-rc5 
>> (tip perf/core). 
>>
>> The patch set is for Arnaldo's perf/core repository.
>>
>> OVERHEAD:
>> 			       BENCH REPORT BASED   ELAPSED TIME BASED
>> 	  v4.20.0-rc5 
>>           (tip perf/core):
>> 				
>> (current) SERIAL-SYS  / BASE : 1.27x (14.37/11.31), 1.29x (15.19/11.69)
>> 	  SERIAL-NODE / BASE : 1.15x (13.04/11.31), 1.17x (13.79/11.69)
>> 	  SERIAL-CPU  / BASE : 1.00x (11.32/11.31), 1.01x (11.89/11.69)
> 
> Can you explain those numbers a bit? You mean that in (14.37/11.31)
> without this patch your test workload (NAS Parallel Benchmarks) took
> 14.37 seconds to complete and with the patch applied it took 11.31
> seconds?

With the patches applied, the workload takes 13.04 sec (--affinity=node) or 
11.32 sec (--affinity=cpu) under profiling, depending on the value of the option.

11.31 sec (BASE) is the elapsed time of the workload without profiling, as 
reported by the benchmark itself (Time in seconds =).
11.69 sec (BASE) is the elapsed time of the workload without profiling, as 
reported by /usr/bin/time.

14.37 sec (SERIAL-SYS) is the elapsed time of the workload under profiling, 
as reported by the benchmark itself (Time in seconds =), without the --aio 
or --affinity options, i.e. with serial trace writing and the default 
system affinity:
/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 -a -e cycles -- ./bt.B.x
15.19 sec (SERIAL-SYS) is the elapsed time of the workload under profiling, 
as reported by /usr/bin/time.

13.04 sec and 13.79 sec (SERIAL-NODE) are the elapsed times of the workload 
under profiling with --affinity=node specified, as reported by the benchmark 
itself and by /usr/bin/time:
/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=node -a -e cycles -- ./bt.B.x

11.32 sec and 11.89 sec (SERIAL-CPU) are the elapsed times of the workload 
under profiling with --affinity=cpu specified, as reported by the benchmark 
itself and by /usr/bin/time:
/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=cpu -a -e cycles -- ./bt.B.x

For AIO trace writing (--aio=1) the experiment is the same:

AIO1-SYS:  /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 -a -e cycles -- ./bt.B.x 
AIO1-NODE: /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=node -a -e cycles -- ./bt.B.x
AIO1-CPU:  /usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=cpu -a -e cycles -- ./bt.B.x
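
For illustration, the cpu mode conceptually boils down to pinning the trace 
reading thread to the cpu that the mmap buffer being drained belongs to, 
via sched_setaffinity(2) [3]. A minimal standalone sketch (the helper name 
is made up here; this is not the actual tool code):

	#define _GNU_SOURCE
	#include <sched.h>

	/* Pin the calling (trace reading) thread to the cpu that the
	 * mmap buffer about to be processed belongs to. */
	static int bind_reader_to_cpu(int cpu)
	{
		cpu_set_t mask;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		return sched_setaffinity(0, sizeof(mask), &mask);
	}

The node mode does the same with a mask covering all cpus of the buffer's 
NUMA node instead of a single bit.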

Thanks,
Alexey

> 
> If so, it's really interesting that all the "after" numbers, in seconds,
> are the same in _all_ the tests, is that right?
> 
> Please clarify.
> 
> - Arnaldo
> 
>> 	
>> 	  AIO1-SYS    / BASE : 1.29x (14.58/11.31), 1.29x (15.26/11.69)
>> 	  AIO1-NODE   / BASE : 1.08x (12.23/11.31), 1.11x (13.01/11.69)
>> 	  AIO1-CPU    / BASE : 1.07x (12.14/11.31), 1.08x (12.83/11.69)
>>
>> 	  v4.4.0-21-generic
>>           (Ubuntu 16.04 LTS):
>>
>> (current) SERIAL-SYS  / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
>> 	  SERIAL-NODE / BASE : 1.19x (13.02/10.87), 1.23x (14.03/11.32)
>> 	  SERIAL-CPU  / BASE : 1.03x (11.21/10.87), 1.07x (12.18/11.32)
>> 	
>> 	  AIO1-SYS    / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
>> 	  AIO1-NODE   / BASE : 1.10x (12.04/10.87), 1.15x (13.03/11.32)
>> 	  AIO1-CPU    / BASE : 1.12x (12.20/10.87), 1.15x (13.09/11.32)
>>
>> ---
>> Alexey Budankov (4):
>>   perf record: allocate affinity masks
>>   perf record: bind the AIO user space buffers to nodes
>>   perf record: apply affinity masks when reading mmap buffers
>>   perf record: implement --affinity=node|cpu option
>>
>>  tools/perf/Documentation/perf-record.txt |   5 ++
>>  tools/perf/builtin-record.c              |  45 +++++++++-
>>  tools/perf/perf.h                        |   8 ++
>>  tools/perf/util/cpumap.c                 |  10 +++
>>  tools/perf/util/cpumap.h                 |   1 +
>>  tools/perf/util/evlist.c                 |   6 +-
>>  tools/perf/util/evlist.h                 |   2 +-
>>  tools/perf/util/mmap.c                   | 105 ++++++++++++++++++++++-
>>  tools/perf/util/mmap.h                   |   3 +-
>>  9 files changed, 175 insertions(+), 10 deletions(-)
>>
>> ---
>> Changes in v5:
>> - avoided multiple allocations of online cpu maps by 
>>   implementing it once in cpu_map__online()
>> - reduced indentation at record__parse_affinity()
>>
>> Changes in v4:
>> - fixed compilation issue converting pr_warn() to pr_warning()
>> - implemented stop if mbind() fails
>> - corrected mmap_params->cpu_map initialization to be based on /sys/devices/system/cpu/online
>> - separated node cpu map generation into build_node_mask()
>>
>> Changes in v3:
>> - converted PERF_AFFINITY_EOF to PERF_AFFINITY_MAX
>> - corrected code style issues
>> - adjusted __aio_alloc,__aio_bind,__aio_free() implementation
>> - separated mask manipulations into __adjust_affinity() and __setup_affinity_mask()
>> - implemented mapping of c index into online cpu index
>> - adjusted indentation at record__parse_affinity()
>>
>> Changes in v2:
>> - made debug affinity mode message user friendly
>> - converted affinity mode defines to enum values
>> - implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
>>   and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
>> - separated AIO buffers binding to patch 2/4
>>
>> ---
>> [1] https://en.wikipedia.org/wiki/Non-uniform_memory_access
>> [2] https://www.nas.nasa.gov/publications/npb.html
>> [3] http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html
>> [4] http://man7.org/linux/man-pages/man2/mbind.2.html
>>
>> ---
>> ENVIRONMENT AND MEASUREMENTS:
>>
>>   MACHINE:
>>
>> 	broadwell, dual socket, 44 core, 88 threads
>>
>> 	/proc/cpuinfo
>>
>> 	processor	: 87
>> 	vendor_id	: GenuineIntel
>> 	cpu family	: 6
>> 	model		: 79
>> 	model name	: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> 	stepping	: 1
>> 	microcode	: 0xb000019
>> 	cpu MHz		: 1200.117
>> 	cache size	: 56320 KB
>> 	physical id	: 1
>> 	siblings	: 44
>> 	core id		: 28
>> 	cpu cores	: 22
>> 	apicid		: 121
>> 	initial apicid	: 121
>> 	fpu		: yes
>> 	fpu_exception	: yes
>> 	cpuid level	: 20
>> 	wp		: yes
>> 	flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
>> 	bugs		:
>> 	bogomips	: 4391.42
>> 	clflush size	: 64
>> 	cache_alignment	: 64
>> 	address sizes	: 46 bits physical, 48 bits virtual
>> 	power management:
>>   		
>>   BASE:
>>
>> 	/usr/bin/time ./bt.B.x 
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>> 	
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>> 	
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    10.87
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 64608.74
>> 	Mop/s/thread    =                   734.19
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>> 	
>> 	956.25user 19.14system 0:11.32elapsed 8616%CPU (0avgtext+0avgdata 210496maxresident)k
>> 	0inputs+0outputs (0major+57939minor)pagefaults 0swaps
>>
>>   SERIAL-SYS:
>>
>> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 -a -e cycles -- ./bt.B.x 
>> 	Using CPUID GenuineIntel-6-4F-1
>> 	nr_cblocks: 0
>> 	affinity (UNSET:0, NODE:1, CPU:2) = 0
>> 	mmap size 528384B
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>>
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>>
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    13.73
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 51136.52
>> 	Mop/s/thread    =                   581.10
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>>
>> 	[ perf record: Captured and wrote 1661,120 MB perf.data ]
>>
>> 	1184.84user 40.70system 0:14.69elapsed 8341%CPU (0avgtext+0avgdata 208612maxresident)k
>> 	0inputs+3402072outputs (0major+137077minor)pagefaults 0swaps
>>
>>   SERIAL-NODE:
>>
>> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=node -a -e cycles -- ./bt.B.x 
>> 	Using CPUID GenuineIntel-6-4F-1
>> 	nr_cblocks: 0
>> 	affinity (UNSET:0, NODE:1, CPU:2) = 1
>> 	mmap size 528384B
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>>
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>>
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    13.02
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 53924.69
>> 	Mop/s/thread    =                   612.78
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>>
>> 	[ perf record: Captured and wrote 1557,152 MB perf.data ]
>>
>> 	1120.42user 29.92system 0:14.03elapsed 8198%CPU (0avgtext+0avgdata 206388maxresident)k
>> 	0inputs+3189128outputs (0major+149207minor)pagefaults 0swaps
>>
>>   SERIAL-CPU:
>>
>> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=cpu -a -e cycles -- ./bt.B.x 
>> 	Using CPUID GenuineIntel-6-4F-1
>> 	nr_cblocks: 0
>> 	affinity (UNSET:0, NODE:1, CPU:2) = 2
>> 	mmap size 528384B
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>>
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>>
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    11.21
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 62642.24
>> 	Mop/s/thread    =                   711.84
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>>
>> 	[ perf record: Captured and wrote 1365,043 MB perf.data ]
>>
>> 	976.06user 31.35system 0:12.18elapsed 8264%CPU (0avgtext+0avgdata 208488maxresident)k
>> 	0inputs+2795704outputs (0major+126032minor)pagefaults 0swaps
>>
>>   AIO1-SYS:
>>
>> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 -a -e cycles -- ./bt.B.x 
>> 	Using CPUID GenuineIntel-6-4F-1
>> 	nr_cblocks: 1
>> 	affinity (UNSET:0, NODE:1, CPU:2) = 0
>> 	mmap size 528384B
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>>
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>>
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    14.23
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 49338.27
>> 	Mop/s/thread    =                   560.66
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>>
>> 	[ perf record: Captured and wrote 1720,590 MB perf.data ]
>>
>> 	1229.19user 41.99system 0:15.22elapsed 8350%CPU (0avgtext+0avgdata 208604maxresident)k
>> 	0inputs+3523880outputs (0major+124670minor)pagefaults 0swaps
>>
>>   AIO1-NODE:
>>
>> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=node -a -e cycles -- ./bt.B.x 
>> 	Using CPUID GenuineIntel-6-4F-1
>> 	nr_cblocks: 1
>> 	affinity (UNSET:0, NODE:1, CPU:2) = 1
>> 	mmap size 528384B
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>>
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>>
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    12.04
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 58313.58
>> 	Mop/s/thread    =                   662.65
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>>
>> 	[ perf record: Captured and wrote 1471,279 MB perf.data ]
>>
>> 	1055.62user 30.43system 0:13.03elapsed 8333%CPU (0avgtext+0avgdata 208424maxresident)k
>> 	0inputs+3013288outputs (0major+79088minor)pagefaults 0swaps
>>
>>   AIO1-CPU:
>>
>> 	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=cpu -a -e cycles -- ./bt.B.x 
>> 	Using CPUID GenuineIntel-6-4F-1
>> 	nr_cblocks: 1
>> 	affinity (UNSET:0, NODE:1, CPU:2) = 2
>> 	mmap size 528384B
>>
>> 	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark
>>
>> 	No input file inputbt.data. Using compiled defaults
>> 	Size:  102x 102x 102
>> 	Iterations:  200       dt:   0.0003000
>> 	Number of available threads:    88
>>
>> 	BT Benchmark Completed.
>> 	Class           =                        B
>> 	Size            =            102x 102x 102
>> 	Iterations      =                      200
>> 	Time in seconds =                    12.20
>> 	Total threads   =                       88
>> 	Avail threads   =                       88
>> 	Mop/s total     =                 57538.84
>> 	Mop/s/thread    =                   653.85
>> 	Operation type  =           floating point
>> 	Verification    =               SUCCESSFUL
>> 	Version         =                    3.3.1
>> 	Compile date    =              20 Sep 2018
>>
>> 	[ perf record: Captured and wrote 1486,859 MB perf.data ]
>>
>> 	1051.97user 42.07system 0:13.09elapsed 8352%CPU (0avgtext+0avgdata 206388maxresident)k
>> 	0inputs+3045168outputs (0major+174612minor)pagefaults 0swaps
> 

* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
  2019-01-28 11:27 ` Jiri Olsa
@ 2019-01-31  9:52   ` Alexey Budankov
  2019-02-01 16:31   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 21+ messages in thread
From: Alexey Budankov @ 2019-01-31  9:52 UTC (permalink / raw)
  To: Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Namhyung Kim, Alexander Shishkin,
	Andi Kleen, linux-kernel

On 28.01.2019 14:27, Jiri Olsa wrote:
> On Tue, Jan 22, 2019 at 08:45:12PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>> The patch set has been validated on BT benchmark from NAS Parallel 
>> Benchmarks [2] running on dual socket, 44 cores, 88 hw threads Broadwell 
>> system with kernels v4.4-21-generic (Ubuntu 16.04) and v4.20.0-rc5 
>> (tip perf/core). 
>>
>> The patch set is for Arnaldo's perf/core repository.
>>
>> OVERHEAD:
>> 			       BENCH REPORT BASED   ELAPSED TIME BASED
>> 	  v4.20.0-rc5 
>>           (tip perf/core):
>> 				
>> (current) SERIAL-SYS  / BASE : 1.27x (14.37/11.31), 1.29x (15.19/11.69)
>> 	  SERIAL-NODE / BASE : 1.15x (13.04/11.31), 1.17x (13.79/11.69)
>> 	  SERIAL-CPU  / BASE : 1.00x (11.32/11.31), 1.01x (11.89/11.69)
>> 	
>> 	  AIO1-SYS    / BASE : 1.29x (14.58/11.31), 1.29x (15.26/11.69)
>> 	  AIO1-NODE   / BASE : 1.08x (12.23/11.31), 1.11x (13.01/11.69)
>> 	  AIO1-CPU    / BASE : 1.07x (12.14/11.31), 1.08x (12.83/11.69)
>>
>> 	  v4.4.0-21-generic
>>           (Ubuntu 16.04 LTS):
>>
>> (current) SERIAL-SYS  / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
>> 	  SERIAL-NODE / BASE : 1.19x (13.02/10.87), 1.23x (14.03/11.32)
>> 	  SERIAL-CPU  / BASE : 1.03x (11.21/10.87), 1.07x (12.18/11.32)
>> 	
>> 	  AIO1-SYS    / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
>> 	  AIO1-NODE   / BASE : 1.10x (12.04/10.87), 1.15x (13.03/11.32)
>> 	  AIO1-CPU    / BASE : 1.12x (12.20/10.87), 1.15x (13.09/11.32)
>>
>> ---
>> Alexey Budankov (4):
>>   perf record: allocate affinity masks
>>   perf record: bind the AIO user space buffers to nodes
>>   perf record: apply affinity masks when reading mmap buffers
>>   perf record: implement --affinity=node|cpu option
>>
>>  tools/perf/Documentation/perf-record.txt |   5 ++
>>  tools/perf/builtin-record.c              |  45 +++++++++-
>>  tools/perf/perf.h                        |   8 ++
>>  tools/perf/util/cpumap.c                 |  10 +++
>>  tools/perf/util/cpumap.h                 |   1 +
>>  tools/perf/util/evlist.c                 |   6 +-
>>  tools/perf/util/evlist.h                 |   2 +-
>>  tools/perf/util/mmap.c                   | 105 ++++++++++++++++++++++-
>>  tools/perf/util/mmap.h                   |   3 +-
>>  9 files changed, 175 insertions(+), 10 deletions(-)
>>
>> ---
>> Changes in v5:
>> - avoided multiple allocations of online cpu maps by 
>>   implementing it once in cpu_map__online()
>> - reduced indentation at record__parse_affinity()
> 
> Reviewed-by: Jiri Olsa <jolsa@kernel.org>

Thanks! 
Alexey

> 
> thanks,
> jirka
> 

* Re: [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
  2019-01-28 11:27 ` Jiri Olsa
  2019-01-31  9:52   ` Alexey Budankov
@ 2019-02-01 16:31   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-01 16:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexey Budankov, Ingo Molnar, Peter Zijlstra, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel

Em Mon, Jan 28, 2019 at 12:27:48PM +0100, Jiri Olsa escreveu:
> On Tue, Jan 22, 2019 at 08:45:12PM +0300, Alexey Budankov wrote:
> > Changes in v5:
> > - avoided multiple allocations of online cpu maps by 
> >   implementing it once in cpu_map__online()
> > - reduced indentation at record__parse_affinity()
 
> Reviewed-by: Jiri Olsa <jolsa@kernel.org>

Applied, now going through testing, let's see if it builds in all the
containers.

- Arnaldo

* Re: [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes
  2019-01-22 17:48 ` [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes Alexey Budankov
@ 2019-02-04 19:29   ` Arnaldo Carvalho de Melo
  2019-02-04 19:47     ` Alexey Budankov
  0 siblings, 1 reply; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-04 19:29 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel

Em Tue, Jan 22, 2019 at 08:48:54PM +0300, Alexey Budankov escreveu:
> 
> Allocate and bind AIO user space buffers to the memory nodes
> that mmap kernel buffers are bound to.

[root@quaco amazonlinux]# perf test -v python
18: 'import perf' in python                               :
--- start ---
test child forked, pid 526
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /tmp/build/perf/python/perf.so: undefined symbol: mbind
test child finished with -1
---- end ----
'import perf' in python: FAILED!
[root@quaco amazonlinux]#


Please always use 'perf test' before pushing upstream, I'll try to fix
this one, either by linking libnuma into the python binding or by moving
the routines using it to a separate file.

Thanks,

- Arnaldo
 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
> Changes in v4:
> - fixed compilation issue converting pr_warn() to pr_warning()
> - implemented stop if mbind() fails
> 
> Changes in v3:
> - corrected code style issues
> - adjusted __aio_alloc,__aio_bind,__aio_free() implementation
> 
> Changes in v2:
> - implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
>   and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
> ---
>  tools/perf/util/mmap.c | 77 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 73 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> index e68ba754a8e2..34be9f900575 100644
> --- a/tools/perf/util/mmap.c
> +++ b/tools/perf/util/mmap.c
> @@ -10,6 +10,9 @@
>  #include <sys/mman.h>
>  #include <inttypes.h>
>  #include <asm/bug.h>
> +#ifdef HAVE_LIBNUMA_SUPPORT
> +#include <numaif.h>
> +#endif
>  #include "debug.h"
>  #include "event.h"
>  #include "mmap.h"
> @@ -154,9 +157,72 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
>  }
>  
>  #ifdef HAVE_AIO_SUPPORT
> +
> +#ifdef HAVE_LIBNUMA_SUPPORT
> +static int perf_mmap__aio_alloc(struct perf_mmap *map, int index)
> +{
> +	map->aio.data[index] = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ|PROT_WRITE,
> +				    MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
> +	if (map->aio.data[index] == MAP_FAILED) {
> +		map->aio.data[index] = NULL;
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void perf_mmap__aio_free(struct perf_mmap *map, int index)
> +{
> +	if (map->aio.data[index]) {
> +		munmap(map->aio.data[index], perf_mmap__mmap_len(map));
> +		map->aio.data[index] = NULL;
> +	}
> +}
> +
> +static int perf_mmap__aio_bind(struct perf_mmap *map, int index, int cpu, int affinity)
> +{
> +	void *data;
> +	size_t mmap_len;
> +	unsigned long node_mask;
> +
> +	if (affinity != PERF_AFFINITY_SYS && cpu__max_node() > 1) {
> +		data = map->aio.data[index];
> +		mmap_len = perf_mmap__mmap_len(map);
> +		node_mask = 1UL << cpu__get_node(cpu);
> +		if (mbind(data, mmap_len, MPOL_BIND, &node_mask, 1, 0)) {
> +			pr_err("Failed to bind [%p-%p] AIO buffer to node %d: error %m\n",
> +				data, data + mmap_len, cpu__get_node(cpu));
> +			return -1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +#else
> +static int perf_mmap__aio_alloc(struct perf_mmap *map, int index)
> +{
> +	map->aio.data[index] = malloc(perf_mmap__mmap_len(map));
> +	if (map->aio.data[index] == NULL)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static void perf_mmap__aio_free(struct perf_mmap *map, int index)
> +{
> +	zfree(&(map->aio.data[index]));
> +}
> +
> +static int perf_mmap__aio_bind(struct perf_mmap *map __maybe_unused, int index __maybe_unused,
> +		int cpu __maybe_unused, int affinity __maybe_unused)
> +{
> +	return 0;
> +}
> +#endif
> +
>  static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
>  {
> -	int delta_max, i, prio;
> +	int delta_max, i, prio, ret;
>  
>  	map->aio.nr_cblocks = mp->nr_cblocks;
>  	if (map->aio.nr_cblocks) {
> @@ -177,11 +243,14 @@ static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
>  		}
>  		delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
>  		for (i = 0; i < map->aio.nr_cblocks; ++i) {
> -			map->aio.data[i] = malloc(perf_mmap__mmap_len(map));
> -			if (!map->aio.data[i]) {
> +			ret = perf_mmap__aio_alloc(map, i);
> +			if (ret == -1) {
>  				pr_debug2("failed to allocate data buffer area, error %m");
>  				return -1;
>  			}
> +			ret = perf_mmap__aio_bind(map, i, map->cpu, mp->affinity);
> +			if (ret == -1)
> +				return -1;
>  			/*
>  			 * Use cblock.aio_fildes value different from -1
>  			 * to denote started aio write operation on the
> @@ -210,7 +279,7 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map)
>  	int i;
>  
>  	for (i = 0; i < map->aio.nr_cblocks; ++i)
> -		zfree(&map->aio.data[i]);
> +		perf_mmap__aio_free(map, i);
>  	if (map->aio.data)
>  		zfree(&map->aio.data);
>  	zfree(&map->aio.cblocks);
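
As a side note for readers following the thread: the allocate-and-bind 
pattern the patch adds boils down to roughly the sketch below (simplified, 
helper name made up, error handling trimmed; the mbind(2) [4] wrapper is 
declared in numaif.h and lives in libnuma, so this needs -lnuma):

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <numaif.h>

	/* Allocate an anonymous buffer and bind its pages to one NUMA node.
	 * mbind() needs a page-aligned range, hence mmap() and not malloc(). */
	static void *alloc_on_node(size_t len, int node)
	{
		unsigned long nodemask = 1UL << node;
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return NULL;
		if (mbind(buf, len, MPOL_BIND, &nodemask,
			  sizeof(nodemask) * 8, 0)) {
			munmap(buf, len);
			return NULL;
		}
		return buf;
	}

That libnuma dependency is exactly what bites the python binding test above.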

-- 

- Arnaldo

* Re: [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes
  2019-02-04 19:29   ` Arnaldo Carvalho de Melo
@ 2019-02-04 19:47     ` Alexey Budankov
  2019-02-05 15:15       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Budankov @ 2019-02-04 19:47 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel


On 04.02.2019 22:29, Arnaldo Carvalho de Melo wrote:
> Em Tue, Jan 22, 2019 at 08:48:54PM +0300, Alexey Budankov escreveu:
>>
>> Allocate and bind AIO user space buffers to the memory nodes
>> that mmap kernel buffers are bound to.
> 
> [root@quaco amazonlinux]# perf test -v python
> 18: 'import perf' in python                               :
> --- start ---
> test child forked, pid 526
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: /tmp/build/perf/python/perf.so: undefined symbol: mbind

Argh. Missed that.

> test child finished with -1
> ---- end ----
> 'import perf' in python: FAILED!
> [root@quaco amazonlinux]#
> 
> 
> Please always use 'perf test' before pushing upstream, I'll try to fix
> this one, either by linking libnuma into the python binding or by moving
> the routines using it to a separate file.

Will do. Thanks for the follow-up.

- Alexey

> 
> Thanks,
> 
> - Arnaldo
>  
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>> Changes in v4:
>> - fixed compilation issue converting pr_warn() to pr_warning()
>> - implemented stop if mbind() fails
>>
>> Changes in v3:
>> - corrected code style issues
>> - adjusted __aio_alloc,__aio_bind,__aio_free() implementation
>>
>> Changes in v2:
>> - implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
>>   and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
>> ---
>>  tools/perf/util/mmap.c | 77 +++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
>> index e68ba754a8e2..34be9f900575 100644
>> --- a/tools/perf/util/mmap.c
>> +++ b/tools/perf/util/mmap.c
>> @@ -10,6 +10,9 @@
>>  #include <sys/mman.h>
>>  #include <inttypes.h>
>>  #include <asm/bug.h>
>> +#ifdef HAVE_LIBNUMA_SUPPORT
>> +#include <numaif.h>
>> +#endif
>>  #include "debug.h"
>>  #include "event.h"
>>  #include "mmap.h"
>> @@ -154,9 +157,72 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
>>  }
>>  
>>  #ifdef HAVE_AIO_SUPPORT
>> +
>> +#ifdef HAVE_LIBNUMA_SUPPORT
>> +static int perf_mmap__aio_alloc(struct perf_mmap *map, int index)
>> +{
>> +	map->aio.data[index] = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ|PROT_WRITE,
>> +				    MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
>> +	if (map->aio.data[index] == MAP_FAILED) {
>> +		map->aio.data[index] = NULL;
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void perf_mmap__aio_free(struct perf_mmap *map, int index)
>> +{
>> +	if (map->aio.data[index]) {
>> +		munmap(map->aio.data[index], perf_mmap__mmap_len(map));
>> +		map->aio.data[index] = NULL;
>> +	}
>> +}
>> +
>> +static int perf_mmap__aio_bind(struct perf_mmap *map, int index, int cpu, int affinity)
>> +{
>> +	void *data;
>> +	size_t mmap_len;
>> +	unsigned long node_mask;
>> +
>> +	if (affinity != PERF_AFFINITY_SYS && cpu__max_node() > 1) {
>> +		data = map->aio.data[index];
>> +		mmap_len = perf_mmap__mmap_len(map);
>> +		node_mask = 1UL << cpu__get_node(cpu);
>> +		if (mbind(data, mmap_len, MPOL_BIND, &node_mask, 1, 0)) {
>> +			pr_err("Failed to bind [%p-%p] AIO buffer to node %d: error %m\n",
>> +				data, data + mmap_len, cpu__get_node(cpu));
>> +			return -1;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +#else
>> +static int perf_mmap__aio_alloc(struct perf_mmap *map, int index)
>> +{
>> +	map->aio.data[index] = malloc(perf_mmap__mmap_len(map));
>> +	if (map->aio.data[index] == NULL)
>> +		return -1;
>> +
>> +	return 0;
>> +}
>> +
>> +static void perf_mmap__aio_free(struct perf_mmap *map, int index)
>> +{
>> +	zfree(&(map->aio.data[index]));
>> +}
>> +
>> +static int perf_mmap__aio_bind(struct perf_mmap *map __maybe_unused, int index __maybe_unused,
>> +		int cpu __maybe_unused, int affinity __maybe_unused)
>> +{
>> +	return 0;
>> +}
>> +#endif
>> +
>>  static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
>>  {
>> -	int delta_max, i, prio;
>> +	int delta_max, i, prio, ret;
>>  
>>  	map->aio.nr_cblocks = mp->nr_cblocks;
>>  	if (map->aio.nr_cblocks) {
>> @@ -177,11 +243,14 @@ static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
>>  		}
>>  		delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
>>  		for (i = 0; i < map->aio.nr_cblocks; ++i) {
>> -			map->aio.data[i] = malloc(perf_mmap__mmap_len(map));
>> -			if (!map->aio.data[i]) {
>> +			ret = perf_mmap__aio_alloc(map, i);
>> +			if (ret == -1) {
>>  				pr_debug2("failed to allocate data buffer area, error %m");
>>  				return -1;
>>  			}
>> +			ret = perf_mmap__aio_bind(map, i, map->cpu, mp->affinity);
>> +			if (ret == -1)
>> +				return -1;
>>  			/*
>>  			 * Use cblock.aio_fildes value different from -1
>>  			 * to denote started aio write operation on the
>> @@ -210,7 +279,7 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map)
>>  	int i;
>>  
>>  	for (i = 0; i < map->aio.nr_cblocks; ++i)
>> -		zfree(&map->aio.data[i]);
>> +		perf_mmap__aio_free(map, i);
>>  	if (map->aio.data)
>>  		zfree(&map->aio.data);
>>  	zfree(&map->aio.cblocks);
> 

* Re: [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes
  2019-02-04 19:47     ` Alexey Budankov
@ 2019-02-05 15:15       ` Arnaldo Carvalho de Melo
  2019-02-05 15:34         ` Alexey Budankov
  2019-02-09 12:46         ` [tip:perf/core] perf record: Bind " tip-bot for Alexey Budankov
  0 siblings, 2 replies; 21+ messages in thread
From: Arnaldo Carvalho de Melo @ 2019-02-05 15:15 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel

Em Mon, Feb 04, 2019 at 10:47:03PM +0300, Alexey Budankov escreveu:
> 
> On 04.02.2019 22:29, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Jan 22, 2019 at 08:48:54PM +0300, Alexey Budankov escreveu:
> >>
> >> Allocate and bind AIO user space buffers to the memory nodes
> >> that mmap kernel buffers are bound to.
> > 
> > [root@quaco amazonlinux]# perf test -v python
> > 18: 'import perf' in python                               :
> > --- start ---
> > test child forked, pid 526
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > ImportError: /tmp/build/perf/python/perf.so: undefined symbol: mbind
> 
> Argh. Missed that.
> 
> > test child finished with -1
> > ---- end ----
> > 'import perf' in python: FAILED!
> > [root@quaco amazonlinux]#
> > 
> > 
> > Please always use 'perf test' before pushing upstream, I'll try to fix
> > this one, either by linking libnuma into the python binding or by moving
> > the routines using it to a separate file.
> 
> Will do. Thanks for followup.

This seems to do the trick:

diff --git a/tools/perf/util/setup.py b/tools/perf/util/setup.py
index d3ffc18424b5..88ffa995b44b 100644
--- a/tools/perf/util/setup.py
+++ b/tools/perf/util/setup.py
@@ -56,6 +56,7 @@ ext_sources = list(map(lambda x: '%s/%s' % (src_perf, x) , ext_sources))
 perf = Extension('perf',
 		  sources = ext_sources,
 		  include_dirs = ['util/include'],
+		  libraries = ['numa'],
 		  extra_compile_args = cflags,
 		  extra_objects = [libtraceevent, libapikfs],
                  )

------------------------------------------------

[root@quaco ~]# ldd /tmp/build/perf/python/perf.cpython-37m-x86_64-linux-gnu.so
	linux-vdso.so.1 (0x00007ffdf53c3000)
	libunwind-x86_64.so.8 => /lib64/libunwind-x86_64.so.8 (0x00007fa538b82000)
	libunwind.so.8 => /lib64/libunwind.so.8 (0x00007fa538b66000)
	liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa538b3d000)
	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fa538b2f000)
	libpython3.7m.so.1.0 => /lib64/libpython3.7m.so.1.0 (0x00007fa5387b7000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa538795000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa5385cd000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa5385b2000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa538c1a000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fa5385ac000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007fa5385a7000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fa538423000)
[root@quaco ~]#

[root@quaco ~]# perf test python
18: 'import perf' in python                               : Ok
[root@quaco ~]#
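
For the record, this class of breakage can also be caught without importing 
the module: ldd -r does relocation processing and reports any symbol that 
none of the needed libraries resolve, e.g.

	ldd -r /tmp/build/perf/python/perf.cpython-37m-x86_64-linux-gnu.so

which, without the setup.py hunk above, would have flagged the undefined 
mbind.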


So I'm amending that hunk to the patch that introduces that mbind
usage.

Thanks,

- Arnaldo

* Re: [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes
  2019-02-05 15:15       ` Arnaldo Carvalho de Melo
@ 2019-02-05 15:34         ` Alexey Budankov
  2019-02-09 12:46         ` [tip:perf/core] perf record: Bind " tip-bot for Alexey Budankov
  1 sibling, 0 replies; 21+ messages in thread
From: Alexey Budankov @ 2019-02-05 15:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ingo Molnar, Peter Zijlstra, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Andi Kleen, linux-kernel


On 05.02.2019 18:15, Arnaldo Carvalho de Melo wrote:
> Em Mon, Feb 04, 2019 at 10:47:03PM +0300, Alexey Budankov escreveu:
>>
>> On 04.02.2019 22:29, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Jan 22, 2019 at 08:48:54PM +0300, Alexey Budankov escreveu:
>>>>
>>>> Allocate and bind AIO user space buffers to the memory nodes
>>>> that mmap kernel buffers are bound to.
>>>
>>> [root@quaco amazonlinux]# perf test -v python
>>> 18: 'import perf' in python                               :
>>> --- start ---
>>> test child forked, pid 526
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> ImportError: /tmp/build/perf/python/perf.so: undefined symbol: mbind
>>
>> Argh. Missed that.
>>
>>> test child finished with -1
>>> ---- end ----
>>> 'import perf' in python: FAILED!
>>> [root@quaco amazonlinux]#
>>>
>>>
>>> Please always use 'perf test' before pushing upstream, I'll try to fix
>>> this one, either by linking libnuma into the python binding or by moving
>>> the routines using it to a separate file.
>>
>> Will do. Thanks for followup.
> 
> this seems to do the trick:
> 
> diff --git a/tools/perf/util/setup.py b/tools/perf/util/setup.py
> index d3ffc18424b5..88ffa995b44b 100644
> --- a/tools/perf/util/setup.py
> +++ b/tools/perf/util/setup.py
> @@ -56,6 +56,7 @@ ext_sources = list(map(lambda x: '%s/%s' % (src_perf, x) , ext_sources))
>  perf = Extension('perf',
>  		  sources = ext_sources,
>  		  include_dirs = ['util/include'],
> +		  libraries = ['numa'],
>  		  extra_compile_args = cflags,
>  		  extra_objects = [libtraceevent, libapikfs],
>                   )
> 
> ------------------------------------------------
> 
> [root@quaco ~]# ldd /tmp/build/perf/python/perf.cpython-37m-x86_64-linux-gnu.so
> 	linux-vdso.so.1 (0x00007ffdf53c3000)
> 	libunwind-x86_64.so.8 => /lib64/libunwind-x86_64.so.8 (0x00007fa538b82000)
> 	libunwind.so.8 => /lib64/libunwind.so.8 (0x00007fa538b66000)
> 	liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa538b3d000)
> 	libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fa538b2f000)
> 	libpython3.7m.so.1.0 => /lib64/libpython3.7m.so.1.0 (0x00007fa5387b7000)
> 	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa538795000)
> 	libc.so.6 => /lib64/libc.so.6 (0x00007fa5385cd000)
> 	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa5385b2000)
> 	/lib64/ld-linux-x86-64.so.2 (0x00007fa538c1a000)
> 	libdl.so.2 => /lib64/libdl.so.2 (0x00007fa5385ac000)
> 	libutil.so.1 => /lib64/libutil.so.1 (0x00007fa5385a7000)
> 	libm.so.6 => /lib64/libm.so.6 (0x00007fa538423000)
> [root@quaco ~]#
> 
> [root@quaco ~]# perf test python
> 18: 'import perf' in python                               : Ok
> [root@quaco ~]#
> 
> 
> So I'm ammending that hunk to the patch that introduces that mbind
> usage.

Again, thanks a lot, Arnaldo!

Alexey

> 
> Thanks,
> 
> - Arnaldo
> 

* [tip:perf/core] perf record: Allocate affinity masks
  2019-01-22 17:47 ` [PATCH v5 1/4] perf record: allocate affinity masks Alexey Budankov
@ 2019-02-09 12:45   ` tip-bot for Alexey Budankov
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Alexey Budankov @ 2019-02-09 12:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mingo, alexey.budankov, hpa, alexander.shishkin,
	namhyung, acme, ak, tglx, peterz, jolsa

Commit-ID:  9d2ed64587c045304efe8872b0258c30803d370c
Gitweb:     https://git.kernel.org/tip/9d2ed64587c045304efe8872b0258c30803d370c
Author:     Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate: Tue, 22 Jan 2019 20:47:43 +0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:39 -0300

perf record: Allocate affinity masks

Allocate the affinity option and affinity masks for the mmap data buffers
and the record thread, and initialize the allocated objects.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/526fa2b0-07de-6dbd-a7e9-26ba875593c9@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-record.c | 13 ++++++++++++-
 tools/perf/perf.h           |  8 ++++++++
 tools/perf/util/evlist.c    |  6 +++---
 tools/perf/util/evlist.h    |  2 +-
 tools/perf/util/mmap.c      |  2 ++
 tools/perf/util/mmap.h      |  3 ++-
 6 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 56e9d9e8c174..7ced5f3e8100 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -81,12 +81,17 @@ struct record {
 	bool			timestamp_boundary;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
+	cpu_set_t		affinity_mask;
 };
 
 static volatile int auxtrace_record__snapshot_started;
 static DEFINE_TRIGGER(auxtrace_snapshot_trigger);
 static DEFINE_TRIGGER(switch_output_trigger);
 
+static const char *affinity_tags[PERF_AFFINITY_MAX] = {
+	"SYS", "NODE", "CPU"
+};
+
 static bool switch_output_signal(struct record *rec)
 {
 	return rec->switch_output.signal &&
@@ -533,7 +538,8 @@ static int record__mmap_evlist(struct record *rec,
 
 	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 				 opts->auxtrace_mmap_pages,
-				 opts->auxtrace_snapshot_mode, opts->nr_cblocks) < 0) {
+				 opts->auxtrace_snapshot_mode,
+				 opts->nr_cblocks, opts->affinity) < 0) {
 		if (errno == EPERM) {
 			pr_err("Permission error mapping pages.\n"
 			       "Consider increasing "
@@ -1977,6 +1983,9 @@ int cmd_record(int argc, const char **argv)
 # undef REASON
 #endif
 
+	CPU_ZERO(&rec->affinity_mask);
+	rec->opts.affinity = PERF_AFFINITY_SYS;
+
 	rec->evlist = perf_evlist__new();
 	if (rec->evlist == NULL)
 		return -ENOMEM;
@@ -2140,6 +2149,8 @@ int cmd_record(int argc, const char **argv)
 	if (verbose > 0)
 		pr_info("nr_cblocks: %d\n", rec->opts.nr_cblocks);
 
+	pr_debug("affinity: %s\n", affinity_tags[rec->opts.affinity]);
+
 	err = __cmd_record(&record, argc, argv);
 out:
 	perf_evlist__delete(rec->evlist);
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 5941fb6eccfc..b120e547ddc7 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -84,6 +84,14 @@ struct record_opts {
 	clockid_t    clockid;
 	u64          clockid_res_ns;
 	int	     nr_cblocks;
+	int	     affinity;
+};
+
+enum perf_affinity {
+	PERF_AFFINITY_SYS = 0,
+	PERF_AFFINITY_NODE,
+	PERF_AFFINITY_CPU,
+	PERF_AFFINITY_MAX
 };
 
 struct option;
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8c902276d4b4..08cedb643ea6 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1022,7 +1022,7 @@ int perf_evlist__parse_mmap_pages(const struct option *opt, const char *str,
  */
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite, int nr_cblocks)
+			 bool auxtrace_overwrite, int nr_cblocks, int affinity)
 {
 	struct perf_evsel *evsel;
 	const struct cpu_map *cpus = evlist->cpus;
@@ -1032,7 +1032,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	 * Its value is decided by evsel's write_backward.
 	 * So &mp should not be passed through const pointer.
 	 */
-	struct mmap_params mp = { .nr_cblocks = nr_cblocks };
+	struct mmap_params mp = { .nr_cblocks = nr_cblocks, .affinity = affinity };
 
 	if (!evlist->mmap)
 		evlist->mmap = perf_evlist__alloc_mmap(evlist, false);
@@ -1064,7 +1064,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages)
 {
-	return perf_evlist__mmap_ex(evlist, pages, 0, false, 0);
+	return perf_evlist__mmap_ex(evlist, pages, 0, false, 0, PERF_AFFINITY_SYS);
 }
 
 int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 00ab43c6dd15..744906dd4887 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -165,7 +165,7 @@ unsigned long perf_event_mlock_kb_in_pages(void);
 
 int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 			 unsigned int auxtrace_pages,
-			 bool auxtrace_overwrite, int nr_cblocks);
+			 bool auxtrace_overwrite, int nr_cblocks, int affinity);
 int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages);
 void perf_evlist__munmap(struct perf_evlist *evlist);
 
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 8fc39311a30d..e68ba754a8e2 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -343,6 +343,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
 	map->fd = fd;
 	map->cpu = cpu;
 
+	CPU_ZERO(&map->affinity_mask);
+
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
 				&mp->auxtrace_mp, map->base, fd))
 		return -1;
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index aeb6942fdb00..e566c19b242b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -38,6 +38,7 @@ struct perf_mmap {
 		int		 nr_cblocks;
 	} aio;
 #endif
+	cpu_set_t	affinity_mask;
 };
 
 /*
@@ -69,7 +70,7 @@ enum bkw_mmap_state {
 };
 
 struct mmap_params {
-	int			    prot, mask, nr_cblocks;
+	int			    prot, mask, nr_cblocks, affinity;
 	struct auxtrace_mmap_params auxtrace_mp;
 };
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [tip:perf/core] perf record: Bind the AIO user space buffers to nodes
  2019-02-05 15:15       ` Arnaldo Carvalho de Melo
  2019-02-05 15:34         ` Alexey Budankov
@ 2019-02-09 12:46         ` tip-bot for Alexey Budankov
  1 sibling, 0 replies; 21+ messages in thread
From: tip-bot for Alexey Budankov @ 2019-02-09 12:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, namhyung, alexander.shishkin, ak, acme, linux-kernel,
	alexey.budankov, tglx, hpa, mingo, jolsa

Commit-ID:  c44a8b44ca9f156a5395597109987d1c462ba655
Gitweb:     https://git.kernel.org/tip/c44a8b44ca9f156a5395597109987d1c462ba655
Author:     Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate: Tue, 22 Jan 2019 20:48:54 +0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:39 -0300

perf record: Bind the AIO user space buffers to nodes

Allocate and bind AIO user space buffers to the memory nodes that mmap
kernel buffers are bound to.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5a5adebc-afe0-4806-81cd-180d49ec043f@linux.intel.com
[ Do not use 'index' as a variable name, it is a define in older glibcs ]
Link: http://lkml.kernel.org/r/20190205151526.GC10613@kernel.org
[ Add -lnuma to the python build when -DHAVE_LIBNUMA_SUPPORT is present, fixing 'perf test python' ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/mmap.c   | 77 +++++++++++++++++++++++++++++++++++++++++++++---
 tools/perf/util/setup.py |  5 ++++
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index e68ba754a8e2..d882f43148c3 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -10,6 +10,9 @@
 #include <sys/mman.h>
 #include <inttypes.h>
 #include <asm/bug.h>
+#ifdef HAVE_LIBNUMA_SUPPORT
+#include <numaif.h>
+#endif
 #include "debug.h"
 #include "event.h"
 #include "mmap.h"
@@ -154,9 +157,72 @@ void __weak auxtrace_mmap_params__set_idx(struct auxtrace_mmap_params *mp __mayb
 }
 
 #ifdef HAVE_AIO_SUPPORT
+
+#ifdef HAVE_LIBNUMA_SUPPORT
+static int perf_mmap__aio_alloc(struct perf_mmap *map, int idx)
+{
+	map->aio.data[idx] = mmap(NULL, perf_mmap__mmap_len(map), PROT_READ|PROT_WRITE,
+				  MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
+	if (map->aio.data[idx] == MAP_FAILED) {
+		map->aio.data[idx] = NULL;
+		return -1;
+	}
+
+	return 0;
+}
+
+static void perf_mmap__aio_free(struct perf_mmap *map, int idx)
+{
+	if (map->aio.data[idx]) {
+		munmap(map->aio.data[idx], perf_mmap__mmap_len(map));
+		map->aio.data[idx] = NULL;
+	}
+}
+
+static int perf_mmap__aio_bind(struct perf_mmap *map, int idx, int cpu, int affinity)
+{
+	void *data;
+	size_t mmap_len;
+	unsigned long node_mask;
+
+	if (affinity != PERF_AFFINITY_SYS && cpu__max_node() > 1) {
+		data = map->aio.data[idx];
+		mmap_len = perf_mmap__mmap_len(map);
+		node_mask = 1UL << cpu__get_node(cpu);
+		if (mbind(data, mmap_len, MPOL_BIND, &node_mask, 1, 0)) {
+			pr_err("Failed to bind [%p-%p] AIO buffer to node %d: error %m\n",
+				data, data + mmap_len, cpu__get_node(cpu));
+			return -1;
+		}
+	}
+
+	return 0;
+}
+#else
+static int perf_mmap__aio_alloc(struct perf_mmap *map, int idx)
+{
+	map->aio.data[idx] = malloc(perf_mmap__mmap_len(map));
+	if (map->aio.data[idx] == NULL)
+		return -1;
+
+	return 0;
+}
+
+static void perf_mmap__aio_free(struct perf_mmap *map, int idx)
+{
+	zfree(&(map->aio.data[idx]));
+}
+
+static int perf_mmap__aio_bind(struct perf_mmap *map __maybe_unused, int idx __maybe_unused,
+		int cpu __maybe_unused, int affinity __maybe_unused)
+{
+	return 0;
+}
+#endif
+
 static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
 {
-	int delta_max, i, prio;
+	int delta_max, i, prio, ret;
 
 	map->aio.nr_cblocks = mp->nr_cblocks;
 	if (map->aio.nr_cblocks) {
@@ -177,11 +243,14 @@ static int perf_mmap__aio_mmap(struct perf_mmap *map, struct mmap_params *mp)
 		}
 		delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
 		for (i = 0; i < map->aio.nr_cblocks; ++i) {
-			map->aio.data[i] = malloc(perf_mmap__mmap_len(map));
-			if (!map->aio.data[i]) {
+			ret = perf_mmap__aio_alloc(map, i);
+			if (ret == -1) {
 				pr_debug2("failed to allocate data buffer area, error %m");
 				return -1;
 			}
+			ret = perf_mmap__aio_bind(map, i, map->cpu, mp->affinity);
+			if (ret == -1)
+				return -1;
 			/*
 			 * Use cblock.aio_fildes value different from -1
 			 * to denote started aio write operation on the
@@ -210,7 +279,7 @@ static void perf_mmap__aio_munmap(struct perf_mmap *map)
 	int i;
 
 	for (i = 0; i < map->aio.nr_cblocks; ++i)
-		zfree(&map->aio.data[i]);
+		perf_mmap__aio_free(map, i);
 	if (map->aio.data)
 		zfree(&map->aio.data);
 	zfree(&map->aio.cblocks);
diff --git a/tools/perf/util/setup.py b/tools/perf/util/setup.py
index d3ffc18424b5..5b5a167b43ce 100644
--- a/tools/perf/util/setup.py
+++ b/tools/perf/util/setup.py
@@ -53,9 +53,14 @@ ext_sources = [f.strip() for f in open('util/python-ext-sources')
 # use full paths with source files
 ext_sources = list(map(lambda x: '%s/%s' % (src_perf, x) , ext_sources))
 
+extra_libraries = []
+if '-DHAVE_LIBNUMA_SUPPORT' in cflags:
+    extra_libraries = [ 'numa' ]
+
 perf = Extension('perf',
 		  sources = ext_sources,
 		  include_dirs = ['util/include'],
+		  libraries = extra_libraries,
 		  extra_compile_args = cflags,
 		  extra_objects = [libtraceevent, libapikfs],
                  )

^ permalink raw reply related	[flat|nested] 21+ messages in thread
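
For readers who have not used mbind() before, here is a minimal standalone sketch of the allocate-then-bind pattern the patch applies per AIO buffer. This is an illustrative helper, not perf code: the node and length are caller-chosen, and it must be linked with -lnuma, since glibc does not provide the mbind symbol (which is exactly why the python binding broke above).

	/* bind_buf.c: illustrative only -- not perf code. Link with -lnuma. */
	#include <numaif.h>     /* mbind(), MPOL_BIND */
	#include <sys/mman.h>   /* mmap(), munmap() */
	#include <stdio.h>      /* perror(), size_t */

	/* Allocate an anonymous buffer and bind its pages to one NUMA node,
	 * the way the patch binds each AIO buffer to the node of its mmap
	 * kernel buffer. */
	static void *alloc_on_node(size_t len, int node)
	{
		unsigned long node_mask = 1UL << node;
		void *data = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (data == MAP_FAILED)
			return NULL;
		/* maxnode is the size of the mask in bits; a full word
		 * covers any node a single-word mask can express */
		if (mbind(data, len, MPOL_BIND, &node_mask,
			  sizeof(node_mask) * 8, 0)) {
			perror("mbind");
			munmap(data, len);
			return NULL;
		}
		return data;
	}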

* [tip:perf/core] perf record: Apply affinity masks when reading mmap buffers
  2019-01-22 17:50 ` [PATCH v5 3/4] perf record: apply affinity masks when reading mmap buffers Alexey Budankov
@ 2019-02-09 12:47   ` tip-bot for Alexey Budankov
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Alexey Budankov @ 2019-02-09 12:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, acme, mingo, peterz, tglx, ak, hpa,
	alexey.budankov, namhyung, alexander.shishkin, jolsa

Commit-ID:  f13de6609a9a25ce4d6dc37c4427f5bc90072fb0
Gitweb:     https://git.kernel.org/tip/f13de6609a9a25ce4d6dc37c4427f5bc90072fb0
Author:     Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate: Tue, 22 Jan 2019 20:50:57 +0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Wed, 6 Feb 2019 10:00:39 -0300

perf record: Apply affinity masks when reading mmap buffers

Build node cpu masks for the mmap data buffers. Apply a buffer's node cpu
mask to the tool thread every time the thread moves on to a buffer on
another node or cpu.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/b25e4ebc-078d-2c7b-216c-f0bed108d073@linux.intel.com
[ Use cpu-set-sched.h to get the CPU_{EQUAL,OR}() fallbacks for older systems ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/builtin-record.c | 15 +++++++++++++++
 tools/perf/util/cpumap.c    | 10 ++++++++++
 tools/perf/util/cpumap.h    |  1 +
 tools/perf/util/mmap.c      | 28 +++++++++++++++++++++++++++-
 4 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 7ced5f3e8100..3fdfbaebd95e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -38,6 +38,7 @@
 #include "util/bpf-loader.h"
 #include "util/trigger.h"
 #include "util/perf-hooks.h"
+#include "util/cpu-set-sched.h"
 #include "util/time-utils.h"
 #include "util/units.h"
 #include "util/bpf-event.h"
@@ -536,6 +537,9 @@ static int record__mmap_evlist(struct record *rec,
 	struct record_opts *opts = &rec->opts;
 	char msg[512];
 
+	if (opts->affinity != PERF_AFFINITY_SYS)
+		cpu__setup_cpunode_map();
+
 	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 				 opts->auxtrace_mmap_pages,
 				 opts->auxtrace_snapshot_mode,
@@ -719,6 +723,16 @@ static struct perf_event_header finished_round_event = {
 	.type = PERF_RECORD_FINISHED_ROUND,
 };
 
+static void record__adjust_affinity(struct record *rec, struct perf_mmap *map)
+{
+	if (rec->opts.affinity != PERF_AFFINITY_SYS &&
+	    !CPU_EQUAL(&rec->affinity_mask, &map->affinity_mask)) {
+		CPU_ZERO(&rec->affinity_mask);
+		CPU_OR(&rec->affinity_mask, &rec->affinity_mask, &map->affinity_mask);
+		sched_setaffinity(0, sizeof(rec->affinity_mask), &rec->affinity_mask);
+	}
+}
+
 static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evlist,
 				    bool overwrite)
 {
@@ -746,6 +760,7 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 		struct perf_mmap *map = &maps[i];
 
 		if (map->base) {
+			record__adjust_affinity(rec, map);
 			if (!record__aio_enabled(rec)) {
 				if (perf_mmap__push(map, rec, record__pushfn) != 0) {
 					rc = -1;
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 383674f448fc..0bbc3feb0894 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -730,3 +730,13 @@ size_t cpu_map__snprint_mask(struct cpu_map *map, char *buf, size_t size)
 	buf[size - 1] = '\0';
 	return ptr - buf;
 }
+
+const struct cpu_map *cpu_map__online(void) /* thread unsafe */
+{
+	static const struct cpu_map *online = NULL;
+
+	if (!online)
+		online = cpu_map__new(NULL); /* from /sys/devices/system/cpu/online */
+
+	return online;
+}
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index ed8999d1a640..f00ce624b9f7 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -29,6 +29,7 @@ int cpu_map__get_core_id(int cpu);
 int cpu_map__get_core(struct cpu_map *map, int idx, void *data);
 int cpu_map__build_socket_map(struct cpu_map *cpus, struct cpu_map **sockp);
 int cpu_map__build_core_map(struct cpu_map *cpus, struct cpu_map **corep);
+const struct cpu_map *cpu_map__online(void); /* thread unsafe */
 
 struct cpu_map *cpu_map__get(struct cpu_map *map);
 void cpu_map__put(struct cpu_map *map);
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index d882f43148c3..cdc7740fc181 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -383,6 +383,32 @@ void perf_mmap__munmap(struct perf_mmap *map)
 	auxtrace_mmap__munmap(&map->auxtrace_mmap);
 }
 
+static void build_node_mask(int node, cpu_set_t *mask)
+{
+	int c, cpu, nr_cpus;
+	const struct cpu_map *cpu_map = NULL;
+
+	cpu_map = cpu_map__online();
+	if (!cpu_map)
+		return;
+
+	nr_cpus = cpu_map__nr(cpu_map);
+	for (c = 0; c < nr_cpus; c++) {
+		cpu = cpu_map->map[c]; /* map c index to online cpu index */
+		if (cpu__get_node(cpu) == node)
+			CPU_SET(cpu, mask);
+	}
+}
+
+static void perf_mmap__setup_affinity_mask(struct perf_mmap *map, struct mmap_params *mp)
+{
+	CPU_ZERO(&map->affinity_mask);
+	if (mp->affinity == PERF_AFFINITY_NODE && cpu__max_node() > 1)
+		build_node_mask(cpu__get_node(map->cpu), &map->affinity_mask);
+	else if (mp->affinity == PERF_AFFINITY_CPU)
+		CPU_SET(map->cpu, &map->affinity_mask);
+}
+
 int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int cpu)
 {
 	/*
@@ -412,7 +438,7 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd, int c
 	map->fd = fd;
 	map->cpu = cpu;
 
-	CPU_ZERO(&map->affinity_mask);
+	perf_mmap__setup_affinity_mask(map, mp);
 
 	if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
 				&mp->auxtrace_mp, map->base, fd))

^ permalink raw reply related	[flat|nested] 21+ messages in thread
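
The migration step in isolation -- a minimal sketch of what record__adjust_affinity() does before a buffer is drained. Illustrative helper, not perf code: the affinity != PERF_AFFINITY_SYS guard is omitted, and CPU_EQUAL() needs _GNU_SOURCE on glibc.

	#define _GNU_SOURCE
	#include <sched.h>  /* cpu_set_t, CPU_EQUAL(), sched_setaffinity() */

	/* Move the calling thread onto the cpu set of the buffer it is
	 * about to read, remembering the mask so that reading several
	 * buffers with the same mask costs nothing extra. */
	static void bounce_to(cpu_set_t *cur, cpu_set_t *target)
	{
		if (CPU_EQUAL(cur, target))
			return;             /* already on the right cpus */
		*cur = *target;             /* struct copy of the mask */
		sched_setaffinity(0, sizeof(*cur), cur); /* 0: calling thread */
	}

The patch performs the copy with CPU_ZERO() plus CPU_OR() instead of a struct assignment, staying within the CPU_{EQUAL,OR}() fallbacks from cpu-set-sched.h mentioned in the committer note above.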

* [tip:perf/core] perf record: Implement --affinity=node|cpu option
  2019-01-22 17:52 ` [PATCH v5 4/4] perf record: implement --affinity=node|cpu option Alexey Budankov
@ 2019-02-15  9:25   ` tip-bot for Alexey Budankov
  0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Alexey Budankov @ 2019-02-15  9:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, hpa, peterz, alexander.shishkin,
	alexey.budankov, namhyung, jolsa, mingo, acme, ak

Commit-ID:  f4fe11b7bf7ff6a1ccf15d7a9484f0ff7d1e92ae
Gitweb:     https://git.kernel.org/tip/f4fe11b7bf7ff6a1ccf15d7a9484f0ff7d1e92ae
Author:     Alexey Budankov <alexey.budankov@linux.intel.com>
AuthorDate: Tue, 22 Jan 2019 20:52:03 +0300
Committer:  Arnaldo Carvalho de Melo <acme@redhat.com>
CommitDate: Mon, 11 Feb 2019 12:32:21 -0300

perf record: Implement --affinity=node|cpu option

Implement the --affinity=node|cpu option for the record mode; the default
remains the system affinity mask, with no bouncing.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/083f5422-ece9-10dd-8305-bf59c860f10f@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/Documentation/perf-record.txt |  5 +++++
 tools/perf/builtin-record.c              | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 02b4aa2621e7..8f0c2be34848 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -454,6 +454,11 @@ Use <n> control blocks in asynchronous (Posix AIO) trace writing mode (default:
 Asynchronous mode is supported only when linking Perf tool with libc library
 providing implementation for Posix AIO API.
 
+--affinity=mode::
+Set affinity mask of trace reading thread according to the policy defined by 'mode' value:
+  node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer
+  cpu  - thread affinity mask is set to cpu of the processed mmap buffer
+
 --all-kernel::
 Configure all used events to run in kernel space.
 
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3fdfbaebd95e..6c3719ac901d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1656,6 +1656,21 @@ static int parse_clockid(const struct option *opt, const char *str, int unset)
 	return -1;
 }
 
+static int record__parse_affinity(const struct option *opt, const char *str, int unset)
+{
+	struct record_opts *opts = (struct record_opts *)opt->value;
+
+	if (unset || !str)
+		return 0;
+
+	if (!strcasecmp(str, "node"))
+		opts->affinity = PERF_AFFINITY_NODE;
+	else if (!strcasecmp(str, "cpu"))
+		opts->affinity = PERF_AFFINITY_CPU;
+
+	return 0;
+}
+
 static int record__parse_mmap_pages(const struct option *opt,
 				    const char *str,
 				    int unset __maybe_unused)
@@ -1964,6 +1979,9 @@ static struct option __record_options[] = {
 		     &nr_cblocks_default, "n", "Use <n> control blocks in asynchronous trace writing mode (default: 1, max: 4)",
 		     record__aio_parse),
 #endif
+	OPT_CALLBACK(0, "affinity", &record.opts, "node|cpu",
+		     "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
+		     record__parse_affinity),
 	OPT_END()
 };
 

^ permalink raw reply related	[flat|nested] 21+ messages in thread
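
With the whole series applied, typical invocations look like this (./workload stands in for any profiled command):

	# default: the reader thread keeps the system affinity mask
	$ perf record -a -- ./workload
	# serial streaming, reader pinned to the cpu of the buffer being read
	$ perf record --affinity=cpu -a -- ./workload
	# AIO streaming, reader bounced between NUMA node masks
	$ perf record --aio=1 --affinity=node -a -- ./workload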

end of thread

Thread overview: 21+ messages
2019-01-22 17:45 [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems Alexey Budankov
2019-01-22 17:47 ` [PATCH v5 1/4] perf record: allocate affinity masks Alexey Budankov
2019-02-09 12:45   ` [tip:perf/core] perf record: Allocate " tip-bot for Alexey Budankov
2019-01-22 17:48 ` [PATCH v5 2/4] perf record: bind the AIO user space buffers to nodes Alexey Budankov
2019-02-04 19:29   ` Arnaldo Carvalho de Melo
2019-02-04 19:47     ` Alexey Budankov
2019-02-05 15:15       ` Arnaldo Carvalho de Melo
2019-02-05 15:34         ` Alexey Budankov
2019-02-09 12:46         ` [tip:perf/core] perf record: Bind " tip-bot for Alexey Budankov
2019-01-22 17:50 ` [PATCH v5 3/4] perf record: apply affinity masks when reading mmap buffers Alexey Budankov
2019-02-09 12:47   ` [tip:perf/core] perf record: Apply " tip-bot for Alexey Budankov
2019-01-22 17:52 ` [PATCH v5 4/4] perf record: implement --affinity=node|cpu option Alexey Budankov
2019-02-15  9:25   ` [tip:perf/core] perf record: Implement " tip-bot for Alexey Budankov
2019-01-28  7:13 ` [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems Alexey Budankov
2019-01-28  8:20   ` Jiri Olsa
2019-01-28  8:39     ` Alexey Budankov
2019-01-28 11:27 ` Jiri Olsa
2019-01-31  9:52   ` Alexey Budankov
2019-02-01 16:31   ` Arnaldo Carvalho de Melo
2019-01-29  9:14 ` Arnaldo Carvalho de Melo
2019-01-29 10:22   ` Alexey Budankov
