[PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems

* [PATCH v5 0/4] Reduce NUMA related overhead in perf record profiling on large server systems
@ 2019-01-22 17:45 Alexey Budankov
  2019-01-22 17:47 ` [PATCH v5 1/4] perf record: allocate affinity masks Alexey Budankov
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Alexey Budankov @ 2019-01-22 17:45 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Andi Kleen, linux-kernel

It has been observed that trace reading thread runs on the same hw thread
most of the time during perf record sampling collection. This scheduling
layout leads up to 30% profiling overhead in case when some cpu intensive
workload fully utilizes a large server system with NUMA. Overhead usually
arises from remote (cross node) HW and memory references that have much
longer latencies than local ones [1].

This patch set implements --affinity option that lowers 30% overhead
completely for serial trace streaming (--affinity=cpu) and from 30% to
10% for AIO1 (--aio=1) trace streaming (--affinity=node|cpu).
See OVERHEAD section below for more details.

Implemented extension provides users with capability to instruct Perf 
tool to bounce trace reading thread's affinity mask between NUMA nodes 
(--affinity=node) or assign the thread to the exact cpu (--affinity=cpu) 
that trace buffer being processed belongs to.

The extension brings improvement in case of full system utilization when 
Perf tool process contends with workload process on cpu cores. In case a 
system has free cores to execute Perf tool process during profiling the 
default system scheduling layout induces the lowest overhead.

The patch set has been validated on BT benchmark from NAS Parallel 
Benchmarks [2] running on dual socket, 44 cores, 88 hw threads Broadwell 
system with kernels v4.4-21-generic (Ubuntu 16.04) and v4.20.0-rc5 
(tip perf/core). 

The patch set is for Arnaldo's perf/core repository.

OVERHEAD:
			       BENCH REPORT BASED   ELAPSED TIME BASED
	  v4.20.0-rc5 
          (tip perf/core):

(current) SERIAL-SYS  / BASE : 1.27x (14.37/11.31), 1.29x (15.19/11.69)
	  SERIAL-NODE / BASE : 1.15x (13.04/11.31), 1.17x (13.79/11.69)
	  SERIAL-CPU  / BASE : 1.00x (11.32/11.31), 1.01x (11.89/11.69)

	  AIO1-SYS    / BASE : 1.29x (14.58/11.31), 1.29x (15.26/11.69)
	  AIO1-NODE   / BASE : 1.08x (12.23/11.31), 1,11x (13.01/11.69)
	  AIO1-CPU    / BASE : 1.07x (12.14/11.31), 1.08x (12.83/11.69)

	  v4.4.0-21-generic
          (Ubuntu 16.04 LTS):

(current) SERIAL-SYS  / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
	  SERIAL-NODE / BASE : 1.19x (13.02/10.87), 1.23x (14.03/11.32)
	  SERIAL-CPU  / BASE : 1.03x (11.21/10.87), 1.07x (12.18/11.32)

	  AIO1-SYS    / BASE : 1.26x (13.73/10.87), 1.29x (14.69/11.32)
	  AIO1-NODE   / BASE : 1.10x (12.04/10.87), 1.15x (13.03/11.32)
	  AIO1-CPU    / BASE : 1.12x (12.20/10.87), 1.15x (13.09/11.32)

---
Alexey Budankov (4):
  perf record: allocate affinity masks
  perf record: bind the AIO user space buffers to nodes
  perf record: apply affinity masks when reading mmap buffers
  perf record: implement --affinity=node|cpu option

 tools/perf/Documentation/perf-record.txt |   5 ++
 tools/perf/builtin-record.c              |  45 +++++++++-
 tools/perf/perf.h                        |   8 ++
 tools/perf/util/cpumap.c                 |  10 +++
 tools/perf/util/cpumap.h                 |   1 +
 tools/perf/util/evlist.c                 |   6 +-
 tools/perf/util/evlist.h                 |   2 +-
 tools/perf/util/mmap.c                   | 105 ++++++++++++++++++++++-
 tools/perf/util/mmap.h                   |   3 +-
 9 files changed, 175 insertions(+), 10 deletions(-)

---
Changes in v5:
- avoided multiple allocations of online cpu maps by 
  implementing it once in cpu_map__online()
- reduced indentation at record__parse_affinity()

Changes in v4:
- fixed compilation issue converting pr_warn() to pr_warning()
- implemented stop if mbind() fails
- corrected mmap_params->cpu_map initialization to be based on /sys/devices/system/cpu/online
- separated node cpu map generation into build_node_mask()

Changes in v3:
- converted PERF_AFFINITY_EOF to PERF_AFFINITY_MAX
- corrected code style issues
- adjusted __aio_alloc,__aio_bind,__aio_free() implementation
- separated mask manipulations into __adjust_affinity() and __setup_affinity_mask()
- implemented mapping of c index into online cpu index
- adjusted indentation at record__parse_affinity()

Changes in v2:
- made debug affinity mode message user friendly
- converted affinity mode defines to enum values
- implemented perf_mmap__aio_alloc, perf_mmap__aio_free, perf_mmap__aio_bind 
  and put HAVE_LIBNUMA_SUPPORT #ifdefs in there
- separated AIO buffers binding to patch 2/4

---
[1] https://en.wikipedia.org/wiki/Non-uniform_memory_access
[2] https://www.nas.nasa.gov/publications/npb.html
[3] http://man7.org/linux/man-pages/man2/sched_setaffinity.2.html
[4] http://man7.org/linux/man-pages/man2/mbind.2.html

---
ENVIRONMENT AND MEASUREMENTS:

  MACHINE:

	broadwell, dual socket, 44 core, 88 threads

	/proc/cpuinfo

	processor	: 87
	vendor_id	: GenuineIntel
	cpu family	: 6
	model		: 79
	model name	: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
	stepping	: 1
	microcode	: 0xb000019
	cpu MHz		: 1200.117
	cache size	: 56320 KB
	physical id	: 1
	siblings	: 44
	core id		: 28
	cpu cores	: 22
	apicid		: 121
	initial apicid	: 121
	fpu		: yes
	fpu_exception	: yes
	cpuid level	: 20
	wp		: yes
	flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
	bugs		:
	bogomips	: 4391.42
	clflush size	: 64
	cache_alignment	: 64
	address sizes	: 46 bits physical, 48 bits virtual
	power management:

  BASE:

	/usr/bin/time ./bt.B.x 

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    10.87
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 64608.74
	Mop/s/thread    =                   734.19
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	956.25user 19.14system 0:11.32elapsed 8616%CPU (0avgtext+0avgdata 210496maxresident)k
	0inputs+0outputs (0major+57939minor)pagefaults 0swaps

  SERIAL-SYS:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 0
	affinity (UNSET:0, NODE:1, CPU:2) = 0
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    13.73
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 51136.52
	Mop/s/thread    =                   581.10
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1661,120 MB perf.data ]

	1184.84user 40.70system 0:14.69elapsed 8341%CPU (0avgtext+0avgdata 208612maxresident)k
	0inputs+3402072outputs (0major+137077minor)pagefaults 0swaps

  SERIAL-NODE:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=node -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 0
	affinity (UNSET:0, NODE:1, CPU:2) = 1
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    13.02
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 53924.69
	Mop/s/thread    =                   612.78
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1557,152 MB perf.data ]

	1120.42user 29.92system 0:14.03elapsed 8198%CPU (0avgtext+0avgdata 206388maxresident)k
	0inputs+3189128outputs (0major+149207minor)pagefaults 0swaps

  SERIAL-CPU:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --affinity=cpu -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 0
	affinity (UNSET:0, NODE:1, CPU:2) = 2
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    11.21
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 62642.24
	Mop/s/thread    =                   711.84
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1365,043 MB perf.data ]

	976.06user 31.35system 0:12.18elapsed 8264%CPU (0avgtext+0avgdata 208488maxresident)k
	0inputs+2795704outputs (0major+126032minor)pagefaults 0swaps

  AIO1-SYS:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 1
	affinity (UNSET:0, NODE:1, CPU:2) = 0
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    14.23
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 49338.27
	Mop/s/thread    =                   560.66
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1720,590 MB perf.data ]

	1229.19user 41.99system 0:15.22elapsed 8350%CPU (0avgtext+0avgdata 208604maxresident)k
	0inputs+3523880outputs (0major+124670minor)pagefaults 0swaps

  AIO1-NODE:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=node -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 1
	affinity (UNSET:0, NODE:1, CPU:2) = 1
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    12.04
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 58313.58
	Mop/s/thread    =                   662.65
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1471,279 MB perf.data ]

	1055.62user 30.43system 0:13.03elapsed 8333%CPU (0avgtext+0avgdata 208424maxresident)k
	0inputs+3013288outputs (0major+79088minor)pagefaults 0swaps

  AIO1-CPU:

	/usr/bin/time ./tip/tools/perf/perf record -v -N -B -T -R -F 25000 --aio=1 --affinity=cpu -a -e cycles -- ./bt.B.x 
	Using CPUID GenuineIntel-6-4F-1
	nr_cblocks: 1
	affinity (UNSET:0, NODE:1, CPU:2) = 2
	mmap size 528384B

	NAS Parallel Benchmarks (NPB3.3-OMP) - BT Benchmark

	No input file inputbt.data. Using compiled defaults
	Size:  102x 102x 102
	Iterations:  200       dt:   0.0003000
	Number of available threads:    88

	BT Benchmark Completed.
	Class           =                        B
	Size            =            102x 102x 102
	Iterations      =                      200
	Time in seconds =                    12.20
	Total threads   =                       88
	Avail threads   =                       88
	Mop/s total     =                 57538.84
	Mop/s/thread    =                   653.85
	Operation type  =           floating point
	Verification    =               SUCCESSFUL
	Version         =                    3.3.1
	Compile date    =              20 Sep 2018

	[ perf record: Captured and wrote 1486,859 MB perf.data ]

	1051.97user 42.07system 0:13.09elapsed 8352%CPU (0avgtext+0avgdata 206388maxresident)k
	0inputs+3045168outputs (0major+174612minor)pagefaults 0swaps

^ permalink raw reply	[flat|nested] 21+ messages in thread