* [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation
@ 2022-01-17 18:34 Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks Alexey Bayduraev
                   ` (16 more replies)
  0 siblings, 17 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Changes in v13:
- fixed error handling in record__mmap_cpu_mask_alloc()
- removed redundant record__thread_mask_clear()
- added notes about evlist__ctlfd_update() to [v13 05/16]
- fixed build on systems w/o pthread_attr_setaffinity_np() and syscall.h
- fixed samples zeroing before process_buildids()
- added notes about valid parallel masks to the documentation and sources
- fixed masks releasing in record__init_thread_cpu_masks
- added and fixed some error messages

v12: https://lore.kernel.org/lkml/cover.1637675515.git.alexey.v.bayduraev@linux.intel.com/

Changes in v12:
- fixed nr_threads=1 cases
- fixed "Woken up %ld times" message
- removed unnecessary record__fini_thread_masks function
- moved bytes written/compressed statistics to struct record_thread
- moved all unnecessary debug messages to verbose=2 level
- renamed "socket" option to "package" for consistency with util/cputopo.h
- excluded single trace file reading patches

v11: https://lore.kernel.org/lkml/cover.1629186429.git.alexey.v.bayduraev@linux.intel.com/

Changes in v11:
- removed python dependency on zstd (perf test 19)
- captured tags from Riccardo Mancini 

v10: https://lore.kernel.org/lkml/cover.1626072008.git.alexey.v.bayduraev@linux.intel.com/

Changes in v10:
- renamed fdarray__clone to fdarray__dup_entry_from
- captured Acked-by: tags by Namhyung Kim for 09/24

v9: https://lore.kernel.org/lkml/cover.1625227739.git.alexey.v.bayduraev@linux.intel.com/

Changes in v9:
- fixes in [v9 01/24]:
  - move 'nr_threads' to before 'thread_masks'
  - combined decl+assign into one line in record__thread_mask_alloc
  - releasing masks inplace in record__alloc_thread_masks
- split patch [v8 02/22] to [v9 02/24] and [v9 03/24]
- fixes in [v9 03/24]:
  - renamed 'struct thread_data' to 'struct record_thread'
  - moved nr_mmaps after ctlfd_pos
  - releasing resources inplace in record__thread_data_init_maps
  - initializing pipes by -1 value
  - added temporary gettid() wrapper
- split patch [v8 03/22] to [v9 04/24] and [v9 05/24] 
- removed upstreamed [v8 09/22]
- split [v8 10/22] to [v9 12/24] and [v9 13/24]
- moved --threads documentation to the related patches
- fixed output of written/compressed stats in [v9 10/24]
- split patch [v8 12/22] to [v9 15/24] and [v9 16/24]
- fixed order of error checking for decompressed events in [v9 16/24]
- merged patch [v8 21/22] with [v9 23/24] and [v9 24/24]
- moved patch [v8 22/22] to [v9 09/24]
- added max reader size constant in [v9 24/24]

v8: https://lore.kernel.org/lkml/cover.1625065643.git.alexey.v.bayduraev@linux.intel.com/

Changes in v8:
- captured Acked-by: tags by Namhyung Kim
- merged with origin/perf/core
- added patch 21/22 introducing READER_NODATA state
- added patch 22/22 fixing --max-size option

v7: https://lore.kernel.org/lkml/cover.1624350588.git.alexey.v.bayduraev@linux.intel.com/

Changes in v7:
- fixed possible crash after out_free_threads label
- added missing pthread_attr_destroy() call
- added check of correctness of user masks 
- fixed zsts_data finalization

v6: https://lore.kernel.org/lkml/cover.1622025774.git.alexey.v.bayduraev@linux.intel.com/

Changes in v6:
- fixed leaks and possible double free in record__thread_mask_alloc()
- fixed leaks in record__init_thread_user_masks()
- fixed final mmaps flushing for threads id > 0
- merged with origin/perf/core

v5: https://lore.kernel.org/lkml/cover.1619781188.git.alexey.v.bayduraev@linux.intel.com/

Changes in v5:
- fixed leaks in record__init_thread_masks_spec()
- fixed leaks after failed realloc
- replaced "%m" with strerror()
- added masks examples to the documentation
- captured Acked-by: tags by Andi Kleen
- do not allow --thread option for full_auxtrace mode 
- split patch 06/12 to 06/20 and 07/20
- split patch 08/12 to 09/20 and 10/20
- split patches 11/12 and 11/12 to 13/20-20/20

v4: https://lore.kernel.org/lkml/6c15adcb-6a9d-320e-70b5-957c4c8b6ff2@linux.intel.com/

Changes in v4:
- renamed 'comm' structure to 'pipes'
- moved thread fd/maps messages to verbose=2
- fixed leaks during allocation of thread_data structures
- fixed leaks during allocation of thread masks
- fixed possible fails when releasing thread masks

v3: https://lore.kernel.org/lkml/7d197a2d-56e2-896d-bf96-6de0a4db1fb8@linux.intel.com/

Changes in v3:
- avoided skipped redundant patch 3/15
- applied "data file" and "data directory" terms all over the patch set
- captured Acked-by: tags by Namhyung Kim
- avoided braces where not needed
- employed thread local variable for serial trace streaming 
- added specs for --thread option - core, socket, numa and user defined
- added parallel loading of data directory files similar to the prototype [1]

v2: https://lore.kernel.org/lkml/1ec29ed6-0047-d22f-630b-a7f5ccee96b4@linux.intel.com/

Changes in v2:
- explicitly added credit tags to patches 6/15 and 15/15,
  in addition to the cites [1], [2]
- updated description of 3/15 to explicitly mention the reason
  to open data directories in read access mode (e.g. for perf report)
- implemented fix for compilation error of 2/15
- explicitly elaborated on found issues to be resolved for
  threaded AUX trace capture

v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/

This patch set provides a parallel threaded trace streaming mode for
the basic perf record operation. The mode mitigates profiling data
losses and resolves the scalability issues of the serial and
asynchronous (--aio) trace streaming modes on multicore server
systems. The design and implementation are based on the prototype
[1], [2].

Parallel threaded mode executes trace streaming threads that read the
kernel data buffers and write the captured data into several data
files located in a data directory. The layout of the trace streaming
threads and their mapping to the data buffers to read can be
configured using the value of the --threads command line option. The
specification value consists of masks separated by colons; each mask
defines the CPUs to be monitored by one thread, and the thread
affinity mask is separated by a slash. For example,
<cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>
specifies a parallel threads layout that consists of two threads with
the corresponding assigned CPUs to be monitored. The specification
value can also be a string, e.g. "cpu", "core" or "socket", meaning
creation of a data streaming thread for monitoring every CPU, whole
core or whole socket. The option provided with no value or an empty
value defaults to the "cpu" layout, creating a data streaming thread
for every CPU being monitored. The specification masks are filtered
by the mask provided via the -C option.

Parallel streaming mode is compatible with Zstd compression/decompression
(--compression-level) and external control commands (--control). The
mode is not enabled for pipe mode, nor for AUX area tracing and the
related and derived modes like --snapshot or --aux-sample. The
--switch-output-* and --timestamp-filename options are not enabled
for parallel streaming. The initial intent to enable AUX area tracing
faced the need to define some optimal way to store index data in the
data directory, and the --switch-output-* and --timestamp-filename
use cases are not clear for data directories. Asynchronous (--aio)
trace streaming and affinity (--affinity) modes are mutually
exclusive with parallel streaming mode.

Basic analysis of data directories is provided in perf report mode.
Raw dump and aggregated reports are available for data directories,
though still without memory consumption optimizations.

Tested:

tools/perf/perf record -o prof.data --threads -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads= -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=cpu -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=socket -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 2,5 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 3,4 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=core -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=numa -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 --compression-level=3 -- matrix.gcc.g.O3
tools/perf/perf record -o prof.data --threads -a
tools/perf/perf record -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
tools/perf/perf record --threads -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30

tools/perf/perf report -i prof.data
tools/perf/perf report -i prof.data --call-graph=callee
tools/perf/perf report -i prof.data --stdio --header
tools/perf/perf report -i prof.data -D --header

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/

Alexey Bayduraev (16):
  perf record: Introduce thread affinity and mmap masks
  tools lib: Introduce fdarray duplicate function
  perf record: Introduce thread specific data array
  perf record: Introduce function to propagate control commands
  perf record: Introduce thread local variable
  perf record: Stop threads in the end of trace streaming
  perf record: Start threads in the beginning of trace streaming
  perf record: Introduce data file at mmap buffer object
  perf record: Introduce bytes written stats
  perf record: Introduce compressor at mmap buffer object
  perf record: Introduce data transferred and compressed stats
  perf record: Introduce --threads command line option
  perf record: Extend --threads command line option
  perf record: Implement compatibility checks
  perf session: Load data directory files for analysis
  perf report: Output data file name in raw trace dump

 tools/lib/api/fd/array.c                 |   17 +
 tools/lib/api/fd/array.h                 |    1 +
 tools/perf/Documentation/perf-record.txt |   34 +
 tools/perf/builtin-inject.c              |    3 +-
 tools/perf/builtin-kvm.c                 |    2 +-
 tools/perf/builtin-record.c              | 1164 ++++++++++++++++++++--
 tools/perf/builtin-top.c                 |    2 +-
 tools/perf/builtin-trace.c               |    2 +-
 tools/perf/util/evlist.c                 |   16 +
 tools/perf/util/evlist.h                 |    1 +
 tools/perf/util/mmap.c                   |   10 +
 tools/perf/util/mmap.h                   |    3 +
 tools/perf/util/ordered-events.c         |    3 +-
 tools/perf/util/ordered-events.h         |    3 +-
 tools/perf/util/record.h                 |    2 +
 tools/perf/util/session.c                |  208 +++-
 tools/perf/util/session.h                |    3 +-
 tools/perf/util/tool.h                   |    3 +-
 18 files changed, 1373 insertions(+), 104 deletions(-)

-- 
2.19.0


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-31 21:00   ` Arnaldo Carvalho de Melo
  2022-04-04 22:25   ` Ian Rogers
  2022-01-17 18:34 ` [PATCH v13 02/16] tools lib: Introduce fdarray duplicate function Alexey Bayduraev
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce affinity and mmap thread masks. The thread affinity mask
defines the CPUs that a thread is allowed to run on. The thread maps
mask defines the mmap data buffers that the thread serves to stream
profiling data from.

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index bb716c953d02..41998f2140cd 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -87,6 +87,11 @@ struct switch_output {
 	int		 cur_file;
 };
 
+struct thread_mask {
+	struct mmap_cpu_mask	maps;
+	struct mmap_cpu_mask	affinity;
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -112,6 +117,8 @@ struct record {
 	struct mmap_cpu_mask	affinity_mask;
 	unsigned long		output_max_size;	/* = 0: unlimited */
 	struct perf_debuginfod	debuginfod;
+	int			nr_threads;
+	struct thread_mask	*thread_masks;
 };
 
 static volatile int done;
@@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
 	return 0;
 }
 
+static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
+{
+	mask->nbits = nr_bits;
+	mask->bits = bitmap_zalloc(mask->nbits);
+	if (!mask->bits)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
+{
+	bitmap_free(mask->bits);
+	mask->nbits = 0;
+}
+
+static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
+{
+	int ret;
+
+	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
+	if (ret) {
+		mask->affinity.bits = NULL;
+		return ret;
+	}
+
+	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
+	if (ret) {
+		record__mmap_cpu_mask_free(&mask->maps);
+		mask->maps.bits = NULL;
+	}
+
+	return ret;
+}
+
+static void record__thread_mask_free(struct thread_mask *mask)
+{
+	record__mmap_cpu_mask_free(&mask->maps);
+	record__mmap_cpu_mask_free(&mask->affinity);
+}
+
 static int parse_output_max_size(const struct option *opt,
 				 const char *str, int unset)
 {
@@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
 
 struct option *record_options = __record_options;
 
+static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
+{
+	int c;
+
+	for (c = 0; c < cpus->nr; c++)
+		set_bit(cpus->map[c].cpu, mask->bits);
+}
+
+static void record__free_thread_masks(struct record *rec, int nr_threads)
+{
+	int t;
+
+	if (rec->thread_masks)
+		for (t = 0; t < nr_threads; t++)
+			record__thread_mask_free(&rec->thread_masks[t]);
+
+	zfree(&rec->thread_masks);
+}
+
+static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
+{
+	int t, ret;
+
+	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
+	if (!rec->thread_masks) {
+		pr_err("Failed to allocate thread masks\n");
+		return -ENOMEM;
+	}
+
+	for (t = 0; t < nr_threads; t++) {
+		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
+		if (ret) {
+			pr_err("Failed to allocate thread masks[%d]\n", t);
+			goto out_free;
+		}
+	}
+
+	return 0;
+
+out_free:
+	record__free_thread_masks(rec, nr_threads);
+
+	return ret;
+}
+
+static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+	int ret;
+
+	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
+	if (ret)
+		return ret;
+
+	record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
+
+	rec->nr_threads = 1;
+
+	return 0;
+}
+
+static int record__init_thread_masks(struct record *rec)
+{
+	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
+
+	return record__init_thread_default_masks(rec, cpus);
+}
+
 int cmd_record(int argc, const char **argv)
 {
 	int err;
@@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
 		goto out;
 	}
 
+	err = record__init_thread_masks(rec);
+	if (err) {
+		pr_err("Failed to initialize parallel data streaming masks\n");
+		goto out;
+	}
+
 	if (rec->opts.nr_cblocks > nr_cblocks_max)
 		rec->opts.nr_cblocks = nr_cblocks_max;
 	pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
@@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
 	symbol__exit();
 	auxtrace_record__free(rec->itr);
 out_opts:
+	record__free_thread_masks(rec, rec->nr_threads);
+	rec->nr_threads = 0;
 	evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
 	return err;
 }
-- 
2.19.0



* [PATCH v13 02/16] tools lib: Introduce fdarray duplicate function
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 03/16] perf record: Introduce thread specific data array Alexey Bayduraev
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce a function to duplicate an existing file descriptor entry
in the fdarray structure. The function returns the position of the
duplicated file descriptor in the destination array.

Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/lib/api/fd/array.c | 17 +++++++++++++++++
 tools/lib/api/fd/array.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/tools/lib/api/fd/array.c b/tools/lib/api/fd/array.c
index 5e6cb9debe37..f0f195207fca 100644
--- a/tools/lib/api/fd/array.c
+++ b/tools/lib/api/fd/array.c
@@ -88,6 +88,23 @@ int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags
 	return pos;
 }
 
+int fdarray__dup_entry_from(struct fdarray *fda, int pos, struct fdarray *from)
+{
+	struct pollfd *entry;
+	int npos;
+
+	if (pos >= from->nr)
+		return -EINVAL;
+
+	entry = &from->entries[pos];
+
+	npos = fdarray__add(fda, entry->fd, entry->events, from->priv[pos].flags);
+	if (npos >= 0)
+		fda->priv[npos] = from->priv[pos];
+
+	return npos;
+}
+
 int fdarray__filter(struct fdarray *fda, short revents,
 		    void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
 		    void *arg)
diff --git a/tools/lib/api/fd/array.h b/tools/lib/api/fd/array.h
index 7fcf21a33c0c..60ad197c8ee9 100644
--- a/tools/lib/api/fd/array.h
+++ b/tools/lib/api/fd/array.h
@@ -42,6 +42,7 @@ struct fdarray *fdarray__new(int nr_alloc, int nr_autogrow);
 void fdarray__delete(struct fdarray *fda);
 
 int fdarray__add(struct fdarray *fda, int fd, short revents, enum fdarray_flags flags);
+int fdarray__dup_entry_from(struct fdarray *fda, int pos, struct fdarray *from);
 int fdarray__poll(struct fdarray *fda, int timeout);
 int fdarray__filter(struct fdarray *fda, short revents,
 		    void (*entry_destructor)(struct fdarray *fda, int fd, void *arg),
-- 
2.19.0



* [PATCH v13 03/16] perf record: Introduce thread specific data array
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 02/16] tools lib: Introduce fdarray duplicate function Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-31 21:39   ` Arnaldo Carvalho de Melo
  2022-01-17 18:34 ` [PATCH v13 04/16] perf record: Introduce function to propagate control commands Alexey Bayduraev
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce a thread specific data object and an array of such objects
to store and manage thread local data. Implement functions to
allocate, initialize, finalize and release thread specific data.

The thread local maps and overwrite_maps arrays keep pointers to the
mmap buffer objects to be served according to the maps thread mask.
The thread local pollfd array keeps the event fds connected to the
mmap buffers according to the maps thread mask.

Thread control commands are delivered via the thread local comm
pipes and the ctlfd_pos fd. External control commands (--control
option) are delivered via the evlist ctlfd_pos fd and handled by the
main tool thread.

Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 247 +++++++++++++++++++++++++++++++++++-
 1 file changed, 244 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 41998f2140cd..0d4a34c66274 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -58,6 +58,9 @@
 #include <poll.h>
 #include <pthread.h>
 #include <unistd.h>
+#ifndef HAVE_GETTID
+#include <syscall.h>
+#endif
 #include <sched.h>
 #include <signal.h>
 #ifdef HAVE_EVENTFD_SUPPORT
@@ -92,6 +95,21 @@ struct thread_mask {
 	struct mmap_cpu_mask	affinity;
 };
 
+struct record_thread {
+	pid_t			tid;
+	struct thread_mask	*mask;
+	struct {
+		int		msg[2];
+		int		ack[2];
+	} pipes;
+	struct fdarray		pollfd;
+	int			ctlfd_pos;
+	int			nr_mmaps;
+	struct mmap		**maps;
+	struct mmap		**overwrite_maps;
+	struct record		*rec;
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -119,6 +137,7 @@ struct record {
 	struct perf_debuginfod	debuginfod;
 	int			nr_threads;
 	struct thread_mask	*thread_masks;
+	struct record_thread	*thread_data;
 };
 
 static volatile int done;
@@ -131,6 +150,13 @@ static const char *affinity_tags[PERF_AFFINITY_MAX] = {
 	"SYS", "NODE", "CPU"
 };
 
+#ifndef HAVE_GETTID
+static inline pid_t gettid(void)
+{
+	return (pid_t)syscall(__NR_gettid);
+}
+#endif
+
 static bool switch_output_signal(struct record *rec)
 {
 	return rec->switch_output.signal &&
@@ -848,9 +874,218 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
 	return kcore_copy(from_dir, kcore_dir);
 }
 
+static void record__thread_data_init_pipes(struct record_thread *thread_data)
+{
+	thread_data->pipes.msg[0] = -1;
+	thread_data->pipes.msg[1] = -1;
+	thread_data->pipes.ack[0] = -1;
+	thread_data->pipes.ack[1] = -1;
+}
+
+static int record__thread_data_open_pipes(struct record_thread *thread_data)
+{
+	if (pipe(thread_data->pipes.msg))
+		return -EINVAL;
+
+	if (pipe(thread_data->pipes.ack)) {
+		close(thread_data->pipes.msg[0]);
+		thread_data->pipes.msg[0] = -1;
+		close(thread_data->pipes.msg[1]);
+		thread_data->pipes.msg[1] = -1;
+		return -EINVAL;
+	}
+
+	pr_debug2("thread_data[%p]: msg=[%d,%d], ack=[%d,%d]\n", thread_data,
+		 thread_data->pipes.msg[0], thread_data->pipes.msg[1],
+		 thread_data->pipes.ack[0], thread_data->pipes.ack[1]);
+
+	return 0;
+}
+
+static void record__thread_data_close_pipes(struct record_thread *thread_data)
+{
+	if (thread_data->pipes.msg[0] != -1) {
+		close(thread_data->pipes.msg[0]);
+		thread_data->pipes.msg[0] = -1;
+	}
+	if (thread_data->pipes.msg[1] != -1) {
+		close(thread_data->pipes.msg[1]);
+		thread_data->pipes.msg[1] = -1;
+	}
+	if (thread_data->pipes.ack[0] != -1) {
+		close(thread_data->pipes.ack[0]);
+		thread_data->pipes.ack[0] = -1;
+	}
+	if (thread_data->pipes.ack[1] != -1) {
+		close(thread_data->pipes.ack[1]);
+		thread_data->pipes.ack[1] = -1;
+	}
+}
+
+static int record__thread_data_init_maps(struct record_thread *thread_data, struct evlist *evlist)
+{
+	int m, tm, nr_mmaps = evlist->core.nr_mmaps;
+	struct mmap *mmap = evlist->mmap;
+	struct mmap *overwrite_mmap = evlist->overwrite_mmap;
+	struct perf_cpu_map *cpus = evlist->core.cpus;
+
+	thread_data->nr_mmaps = bitmap_weight(thread_data->mask->maps.bits,
+					      thread_data->mask->maps.nbits);
+	if (mmap) {
+		thread_data->maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
+		if (!thread_data->maps)
+			return -ENOMEM;
+	}
+	if (overwrite_mmap) {
+		thread_data->overwrite_maps = zalloc(thread_data->nr_mmaps * sizeof(struct mmap *));
+		if (!thread_data->overwrite_maps) {
+			zfree(&thread_data->maps);
+			return -ENOMEM;
+		}
+	}
+	pr_debug2("thread_data[%p]: nr_mmaps=%d, maps=%p, ow_maps=%p\n", thread_data,
+		 thread_data->nr_mmaps, thread_data->maps, thread_data->overwrite_maps);
+
+	for (m = 0, tm = 0; m < nr_mmaps && tm < thread_data->nr_mmaps; m++) {
+		if (test_bit(cpus->map[m].cpu, thread_data->mask->maps.bits)) {
+			if (thread_data->maps) {
+				thread_data->maps[tm] = &mmap[m];
+				pr_debug2("thread_data[%p]: cpu%d: maps[%d] -> mmap[%d]\n",
+					  thread_data, cpus->map[m].cpu, tm, m);
+			}
+			if (thread_data->overwrite_maps) {
+				thread_data->overwrite_maps[tm] = &overwrite_mmap[m];
+				pr_debug2("thread_data[%p]: cpu%d: ow_maps[%d] -> ow_mmap[%d]\n",
+					  thread_data, cpus->map[m].cpu, tm, m);
+			}
+			tm++;
+		}
+	}
+
+	return 0;
+}
+
+static int record__thread_data_init_pollfd(struct record_thread *thread_data, struct evlist *evlist)
+{
+	int f, tm, pos;
+	struct mmap *map, *overwrite_map;
+
+	fdarray__init(&thread_data->pollfd, 64);
+
+	for (tm = 0; tm < thread_data->nr_mmaps; tm++) {
+		map = thread_data->maps ? thread_data->maps[tm] : NULL;
+		overwrite_map = thread_data->overwrite_maps ?
+				thread_data->overwrite_maps[tm] : NULL;
+
+		for (f = 0; f < evlist->core.pollfd.nr; f++) {
+			void *ptr = evlist->core.pollfd.priv[f].ptr;
+
+			if ((map && ptr == map) || (overwrite_map && ptr == overwrite_map)) {
+				pos = fdarray__dup_entry_from(&thread_data->pollfd, f,
+							      &evlist->core.pollfd);
+				if (pos < 0)
+					return pos;
+				pr_debug2("thread_data[%p]: pollfd[%d] <- event_fd=%d\n",
+					 thread_data, pos, evlist->core.pollfd.entries[f].fd);
+			}
+		}
+	}
+
+	return 0;
+}
+
+static void record__free_thread_data(struct record *rec)
+{
+	int t;
+	struct record_thread *thread_data = rec->thread_data;
+
+	if (thread_data == NULL)
+		return;
+
+	for (t = 0; t < rec->nr_threads; t++) {
+		record__thread_data_close_pipes(&thread_data[t]);
+		zfree(&thread_data[t].maps);
+		zfree(&thread_data[t].overwrite_maps);
+		fdarray__exit(&thread_data[t].pollfd);
+	}
+
+	zfree(&rec->thread_data);
+}
+
+static int record__alloc_thread_data(struct record *rec, struct evlist *evlist)
+{
+	int t, ret;
+	struct record_thread *thread_data;
+
+	rec->thread_data = zalloc(rec->nr_threads * sizeof(*(rec->thread_data)));
+	if (!rec->thread_data) {
+		pr_err("Failed to allocate thread data\n");
+		return -ENOMEM;
+	}
+	thread_data = rec->thread_data;
+
+	for (t = 0; t < rec->nr_threads; t++)
+		record__thread_data_init_pipes(&thread_data[t]);
+
+	for (t = 0; t < rec->nr_threads; t++) {
+		thread_data[t].rec = rec;
+		thread_data[t].mask = &rec->thread_masks[t];
+		ret = record__thread_data_init_maps(&thread_data[t], evlist);
+		if (ret) {
+			pr_err("Failed to initialize thread[%d] maps\n", t);
+			goto out_free;
+		}
+		ret = record__thread_data_init_pollfd(&thread_data[t], evlist);
+		if (ret) {
+			pr_err("Failed to initialize thread[%d] pollfd\n", t);
+			goto out_free;
+		}
+		if (t) {
+			thread_data[t].tid = -1;
+			ret = record__thread_data_open_pipes(&thread_data[t]);
+			if (ret) {
+				pr_err("Failed to open thread[%d] communication pipes\n", t);
+				goto out_free;
+			}
+			ret = fdarray__add(&thread_data[t].pollfd, thread_data[t].pipes.msg[0],
+					   POLLIN | POLLERR | POLLHUP, fdarray_flag__nonfilterable);
+			if (ret < 0) {
+				pr_err("Failed to add descriptor to thread[%d] pollfd\n", t);
+				goto out_free;
+			}
+			thread_data[t].ctlfd_pos = ret;
+			pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
+				 thread_data, thread_data[t].ctlfd_pos,
+				 thread_data[t].pipes.msg[0]);
+		} else {
+			thread_data[t].tid = gettid();
+			if (evlist->ctl_fd.pos == -1)
+				continue;
+			ret = fdarray__dup_entry_from(&thread_data[t].pollfd, evlist->ctl_fd.pos,
+						      &evlist->core.pollfd);
+			if (ret < 0) {
+				pr_err("Failed to duplicate descriptor in main thread pollfd\n");
+				goto out_free;
+			}
+			thread_data[t].ctlfd_pos = ret;
+			pr_debug2("thread_data[%p]: pollfd[%d] <- ctl_fd=%d\n",
+				 thread_data, thread_data[t].ctlfd_pos,
+				 evlist->core.pollfd.entries[evlist->ctl_fd.pos].fd);
+		}
+	}
+
+	return 0;
+
+out_free:
+	record__free_thread_data(rec);
+
+	return ret;
+}
+
 static int record__mmap_evlist(struct record *rec,
 			       struct evlist *evlist)
 {
+	int ret;
 	struct record_opts *opts = &rec->opts;
 	bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
 				  opts->auxtrace_sample_mode;
@@ -881,6 +1116,14 @@ static int record__mmap_evlist(struct record *rec,
 				return -EINVAL;
 		}
 	}
+
+	if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
+		return -1;
+
+	ret = record__alloc_thread_data(rec, evlist);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -1862,9 +2105,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		evlist__start_workload(rec->evlist);
 	}
 
-	if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
-		goto out_child;
-
 	if (opts->initial_delay) {
 		pr_info(EVLIST_DISABLED_MSG);
 		if (opts->initial_delay > 0) {
@@ -2022,6 +2262,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 out_child:
 	evlist__finalize_ctlfd(rec->evlist);
 	record__mmap_read_all(rec, true);
+	record__free_thread_data(rec);
 	record__aio_mmap_read_sync(rec);
 
 	if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
-- 
2.19.0



* [PATCH v13 04/16] perf record: Introduce function to propagate control commands
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (2 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 03/16] perf record: Introduce thread specific data array Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 05/16] perf record: Introduce thread local variable Alexey Bayduraev
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce the evlist__ctlfd_update() function to propagate external
control commands to the global evlist object.

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/util/evlist.c | 16 ++++++++++++++++
 tools/perf/util/evlist.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 6e88d404b5b3..48a865221636 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -2131,6 +2131,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
 	return err;
 }
 
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
+{
+	int ctlfd_pos = evlist->ctl_fd.pos;
+	struct pollfd *entries = evlist->core.pollfd.entries;
+
+	if (!evlist__ctlfd_initialized(evlist))
+		return 0;
+
+	if (entries[ctlfd_pos].fd != update->fd ||
+	    entries[ctlfd_pos].events != update->events)
+		return -1;
+
+	entries[ctlfd_pos].revents = update->revents;
+	return 0;
+}
+
 struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
 {
 	struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 64cba56fbc74..a21daaa5fc1b 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -410,6 +410,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
 int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
 int evlist__finalize_ctlfd(struct evlist *evlist);
 bool evlist__ctlfd_initialized(struct evlist *evlist);
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
 int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
 int evlist__ctlfd_ack(struct evlist *evlist);
 
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 05/16] perf record: Introduce thread local variable
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (3 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 04/16] perf record: Introduce function to propagate control commands Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-31 21:42   ` Arnaldo Carvalho de Melo
  2022-01-31 21:45   ` Arnaldo Carvalho de Melo
  2022-01-17 18:34 ` [PATCH v13 06/16] perf record: Stop threads in the end of trace streaming Alexey Bayduraev
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce a thread-local variable and use it for threaded trace streaming.
Use the thread affinity mask instead of the record affinity mask in affinity
modes. Use evlist__ctlfd_update() to propagate control commands from the
thread object to the global evlist object to enable the evlist__ctlfd_*
functionality. Move the waking and sample statistics to struct record_thread
and introduce the record__waking() function to calculate the total number of
wakeups.
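
The key property of a `__thread` variable is that each streaming thread can
bind the same name to its own per-thread state, so shared helpers can
dereference it without locking. A small sketch under hypothetical names (not
the perf code):

```c
#include <pthread.h>
#include <stddef.h>

/* Each worker binds the TLS pointer to its own state, so code shared by
 * all threads can update "tls->samples" without synchronization. */
struct tls_state { unsigned long long samples; };

static __thread struct tls_state *tls;

static void *worker(void *arg)
{
	tls = arg;		/* bind this thread's private state */
	tls->samples = 42;	/* touches only this thread's counter */
	return NULL;
}

static int run_two_workers(struct tls_state *a, struct tls_state *b)
{
	pthread_t t1, t2;

	if (pthread_create(&t1, NULL, worker, a) ||
	    pthread_create(&t2, NULL, worker, b))
		return -1;
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
```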

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 140 ++++++++++++++++++++++++------------
 1 file changed, 94 insertions(+), 46 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0d4a34c66274..163d261dd293 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -108,8 +108,12 @@ struct record_thread {
 	struct mmap		**maps;
 	struct mmap		**overwrite_maps;
 	struct record		*rec;
+	unsigned long long	samples;
+	unsigned long		waking;
 };
 
+static __thread struct record_thread *thread;
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -132,7 +136,6 @@ struct record {
 	bool			timestamp_boundary;
 	struct switch_output	switch_output;
 	unsigned long long	samples;
-	struct mmap_cpu_mask	affinity_mask;
 	unsigned long		output_max_size;	/* = 0: unlimited */
 	struct perf_debuginfod	debuginfod;
 	int			nr_threads;
@@ -575,7 +578,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
 		bf   = map->data;
 	}
 
-	rec->samples++;
+	thread->samples++;
 	return record__write(rec, map, bf, size);
 }
 
@@ -1315,15 +1318,17 @@ static struct perf_event_header finished_round_event = {
 static void record__adjust_affinity(struct record *rec, struct mmap *map)
 {
 	if (rec->opts.affinity != PERF_AFFINITY_SYS &&
-	    !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
-			  rec->affinity_mask.nbits)) {
-		bitmap_zero(rec->affinity_mask.bits, rec->affinity_mask.nbits);
-		bitmap_or(rec->affinity_mask.bits, rec->affinity_mask.bits,
-			  map->affinity_mask.bits, rec->affinity_mask.nbits);
-		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&rec->affinity_mask),
-				  (cpu_set_t *)rec->affinity_mask.bits);
-		if (verbose == 2)
-			mmap_cpu_mask__scnprintf(&rec->affinity_mask, "thread");
+	    !bitmap_equal(thread->mask->affinity.bits, map->affinity_mask.bits,
+			  thread->mask->affinity.nbits)) {
+		bitmap_zero(thread->mask->affinity.bits, thread->mask->affinity.nbits);
+		bitmap_or(thread->mask->affinity.bits, thread->mask->affinity.bits,
+			  map->affinity_mask.bits, thread->mask->affinity.nbits);
+		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
+					(cpu_set_t *)thread->mask->affinity.bits);
+		if (verbose == 2) {
+			pr_debug("threads[%d]: running on cpu%d: ", thread->tid, sched_getcpu());
+			mmap_cpu_mask__scnprintf(&thread->mask->affinity, "affinity");
+		}
 	}
 }
 
@@ -1364,14 +1369,17 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	u64 bytes_written = rec->bytes_written;
 	int i;
 	int rc = 0;
-	struct mmap *maps;
+	int nr_mmaps;
+	struct mmap **maps;
 	int trace_fd = rec->data.file.fd;
 	off_t off = 0;
 
 	if (!evlist)
 		return 0;
 
-	maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
+	nr_mmaps = thread->nr_mmaps;
+	maps = overwrite ? thread->overwrite_maps : thread->maps;
+
 	if (!maps)
 		return 0;
 
@@ -1381,9 +1389,9 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	if (record__aio_enabled(rec))
 		off = record__aio_get_pos(trace_fd);
 
-	for (i = 0; i < evlist->core.nr_mmaps; i++) {
+	for (i = 0; i < nr_mmaps; i++) {
 		u64 flush = 0;
-		struct mmap *map = &maps[i];
+		struct mmap *map = maps[i];
 
 		if (map->core.base) {
 			record__adjust_affinity(rec, map);
@@ -1446,6 +1454,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
 	return record__mmap_read_evlist(rec, rec->evlist, true, synch);
 }
 
+static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
+					   void *arg __maybe_unused)
+{
+	struct perf_mmap *map = fda->priv[fd].ptr;
+
+	if (map)
+		perf_mmap__put(map);
+}
+
 static void record__init_features(struct record *rec)
 {
 	struct perf_session *session = rec->session;
@@ -1869,11 +1886,44 @@ static void record__uniquify_name(struct record *rec)
 	}
 }
 
+static int record__start_threads(struct record *rec)
+{
+	struct record_thread *thread_data = rec->thread_data;
+
+	thread = &thread_data[0];
+
+	pr_debug("threads[%d]: started on cpu%d\n", thread->tid, sched_getcpu());
+
+	return 0;
+}
+
+static int record__stop_threads(struct record *rec)
+{
+	int t;
+	struct record_thread *thread_data = rec->thread_data;
+
+	for (t = 0; t < rec->nr_threads; t++)
+		rec->samples += thread_data[t].samples;
+
+	return 0;
+}
+
+static unsigned long record__waking(struct record *rec)
+{
+	int t;
+	unsigned long waking = 0;
+	struct record_thread *thread_data = rec->thread_data;
+
+	for (t = 0; t < rec->nr_threads; t++)
+		waking += thread_data[t].waking;
+
+	return waking;
+}
+
 static int __cmd_record(struct record *rec, int argc, const char **argv)
 {
 	int err;
 	int status = 0;
-	unsigned long waking = 0;
 	const bool forks = argc > 0;
 	struct perf_tool *tool = &rec->tool;
 	struct record_opts *opts = &rec->opts;
@@ -1977,7 +2027,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 
 	if (record__open(rec) != 0) {
 		err = -1;
-		goto out_child;
+		goto out_free_threads;
 	}
 	session->header.env.comp_mmap_len = session->evlist->core.mmap_len;
 
@@ -1985,7 +2035,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		err = record__kcore_copy(&session->machines.host, data);
 		if (err) {
 			pr_err("ERROR: Failed to copy kcore\n");
-			goto out_child;
+			goto out_free_threads;
 		}
 	}
 
@@ -1996,7 +2046,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
 		pr_err("ERROR: Apply config to BPF failed: %s\n",
 			 errbuf);
-		goto out_child;
+		goto out_free_threads;
 	}
 
 	/*
@@ -2014,11 +2064,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	if (data->is_pipe) {
 		err = perf_header__write_pipe(fd);
 		if (err < 0)
-			goto out_child;
+			goto out_free_threads;
 	} else {
 		err = perf_session__write_header(session, rec->evlist, fd, false);
 		if (err < 0)
-			goto out_child;
+			goto out_free_threads;
 	}
 
 	err = -1;
@@ -2026,16 +2076,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	    && !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
 		pr_err("Couldn't generate buildids. "
 		       "Use --no-buildid to profile anyway.\n");
-		goto out_child;
+		goto out_free_threads;
 	}
 
 	err = record__setup_sb_evlist(rec);
 	if (err)
-		goto out_child;
+		goto out_free_threads;
 
 	err = record__synthesize(rec, false);
 	if (err < 0)
-		goto out_child;
+		goto out_free_threads;
 
 	if (rec->realtime_prio) {
 		struct sched_param param;
@@ -2044,10 +2094,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		if (sched_setscheduler(0, SCHED_FIFO, &param)) {
 			pr_err("Could not set realtime priority.\n");
 			err = -1;
-			goto out_child;
+			goto out_free_threads;
 		}
 	}
 
+	if (record__start_threads(rec))
+		goto out_free_threads;
+
 	/*
 	 * When perf is starting the traced process, all the events
 	 * (apart from group members) have enable_on_exec=1 set,
@@ -2118,7 +2171,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	trigger_ready(&switch_output_trigger);
 	perf_hooks__invoke_record_start();
 	for (;;) {
-		unsigned long long hits = rec->samples;
+		unsigned long long hits = thread->samples;
 
 		/*
 		 * rec->evlist->bkw_mmap_state is possible to be
@@ -2172,8 +2225,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 
 			if (!quiet)
 				fprintf(stderr, "[ perf record: dump data: Woken up %ld times ]\n",
-					waking);
-			waking = 0;
+					record__waking(rec));
+			thread->waking = 0;
 			fd = record__switch_output(rec, false);
 			if (fd < 0) {
 				pr_err("Failed to switch to new file\n");
@@ -2187,20 +2240,24 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 				alarm(rec->switch_output.time);
 		}
 
-		if (hits == rec->samples) {
+		if (hits == thread->samples) {
 			if (done || draining)
 				break;
-			err = evlist__poll(rec->evlist, -1);
+			err = fdarray__poll(&thread->pollfd, -1);
 			/*
 			 * Propagate error, only if there's any. Ignore positive
 			 * number of returned events and interrupt error.
 			 */
 			if (err > 0 || (err < 0 && errno == EINTR))
 				err = 0;
-			waking++;
+			thread->waking++;
 
-			if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
+			if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
+					    record__thread_munmap_filtered, NULL) == 0)
 				draining = true;
+
+			evlist__ctlfd_update(rec->evlist,
+				&thread->pollfd.entries[thread->ctlfd_pos]);
 		}
 
 		if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
@@ -2254,15 +2311,18 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	}
 
 	if (!quiet)
-		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
+		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n",
+			record__waking(rec));
 
 	if (target__none(&rec->opts.target))
 		record__synthesize_workload(rec, true);
 
 out_child:
-	evlist__finalize_ctlfd(rec->evlist);
+	record__stop_threads(rec);
 	record__mmap_read_all(rec, true);
+out_free_threads:
 	record__free_thread_data(rec);
+	evlist__finalize_ctlfd(rec->evlist);
 	record__aio_mmap_read_sync(rec);
 
 	if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
@@ -3164,17 +3224,6 @@ int cmd_record(int argc, const char **argv)
 
 	symbol__init(NULL);
 
-	if (rec->opts.affinity != PERF_AFFINITY_SYS) {
-		rec->affinity_mask.nbits = cpu__max_cpu().cpu;
-		rec->affinity_mask.bits = bitmap_zalloc(rec->affinity_mask.nbits);
-		if (!rec->affinity_mask.bits) {
-			pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
-			err = -ENOMEM;
-			goto out_opts;
-		}
-		pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
-	}
-
 	err = record__auxtrace_init(rec);
 	if (err)
 		goto out;
@@ -3323,7 +3372,6 @@ int cmd_record(int argc, const char **argv)
 
 	err = __cmd_record(&record, argc, argv);
 out:
-	bitmap_free(rec->affinity_mask.bits);
 	evlist__delete(rec->evlist);
 	symbol__exit();
 	auxtrace_record__free(rec->itr);
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 06/16] perf record: Stop threads in the end of trace streaming
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (4 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 05/16] perf record: Introduce thread local variable Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 07/16] perf record: Start threads in the beginning " Alexey Bayduraev
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Signal a thread to terminate by closing the write fd of its message pipe.
Receive the THREAD_MSG__READY message as confirmation of the thread's
termination. Stop the threads created for parallel trace streaming prior
to processing their stats.
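
The shape of that handshake can be sketched standalone: the worker waits on
the read end of a message pipe, the closed write end wakes it with POLLHUP,
and it acknowledges over a second pipe before exiting. This is a hypothetical
helper mirroring the handshake, not the perf code:

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static int msg_pipe[2], ack_pipe[2];

static void *worker(void *arg)
{
	struct pollfd pfd = { .fd = msg_pipe[0], .events = POLLIN };
	char ready = 'R';

	(void)arg;
	poll(&pfd, 1, -1);		/* wakes with POLLHUP after close() */
	close(msg_pipe[0]);
	(void)write(ack_pipe[1], &ready, 1);	/* confirm termination */
	return NULL;
}

static int terminate_worker(void)
{
	pthread_t t;
	char ack = 0;

	if (pipe(msg_pipe) || pipe(ack_pipe))
		return -1;
	if (pthread_create(&t, NULL, worker, NULL))
		return -1;

	close(msg_pipe[1]);		/* signal: no more messages */
	if (read(ack_pipe[0], &ack, 1) != 1 || ack != 'R')
		return -1;
	return pthread_join(t, NULL);
}
```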

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 163d261dd293..0e65b80927b7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -114,6 +114,16 @@ struct record_thread {
 
 static __thread struct record_thread *thread;
 
+enum thread_msg {
+	THREAD_MSG__UNDEFINED = 0,
+	THREAD_MSG__READY,
+	THREAD_MSG__MAX,
+};
+
+static const char *thread_msg_tags[THREAD_MSG__MAX] = {
+	"UNDEFINED", "READY"
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -1886,6 +1896,24 @@ static void record__uniquify_name(struct record *rec)
 	}
 }
 
+static int record__terminate_thread(struct record_thread *thread_data)
+{
+	int err;
+	enum thread_msg ack = THREAD_MSG__UNDEFINED;
+	pid_t tid = thread_data->tid;
+
+	close(thread_data->pipes.msg[1]);
+	thread_data->pipes.msg[1] = -1;
+	err = read(thread_data->pipes.ack[0], &ack, sizeof(ack));
+	if (err > 0)
+		pr_debug2("threads[%d]: sent %s\n", tid, thread_msg_tags[ack]);
+	else
+		pr_warning("threads[%d]: failed to receive termination notification from %d\n",
+			   thread->tid, tid);
+
+	return 0;
+}
+
 static int record__start_threads(struct record *rec)
 {
 	struct record_thread *thread_data = rec->thread_data;
@@ -1902,6 +1930,9 @@ static int record__stop_threads(struct record *rec)
 	int t;
 	struct record_thread *thread_data = rec->thread_data;
 
+	for (t = 1; t < rec->nr_threads; t++)
+		record__terminate_thread(&thread_data[t]);
+
 	for (t = 0; t < rec->nr_threads; t++)
 		rec->samples += thread_data[t].samples;
 
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 07/16] perf record: Start threads in the beginning of trace streaming
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (5 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 06/16] perf record: Stop threads in the end of trace streaming Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 08/16] perf record: Introduce data file at mmap buffer object Alexey Bayduraev
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Start threads in the detached state because their management is implemented
via messaging, to avoid any scaling issues. Block signals prior to thread
start so that only the main tool thread is notified of external async
signals during data collection. The thread affinity mask is used to assign
the eligible CPUs for the thread to run on. Wait for and synchronize on
thread start using the thread ack pipe.

Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 121 +++++++++++++++++++++++++++++++++++-
 tools/perf/util/record.h    |   1 +
 2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0e65b80927b7..517520ae1520 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -170,6 +170,11 @@ static inline pid_t gettid(void)
 }
 #endif
 
+static int record__threads_enabled(struct record *rec)
+{
+	return rec->opts.threads_spec;
+}
+
 static bool switch_output_signal(struct record *rec)
 {
 	return rec->switch_output.signal &&
@@ -1473,6 +1478,68 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
 		perf_mmap__put(map);
 }
 
+static void *record__thread(void *arg)
+{
+	enum thread_msg msg = THREAD_MSG__READY;
+	bool terminate = false;
+	struct fdarray *pollfd;
+	int err, ctlfd_pos;
+
+	thread = arg;
+	thread->tid = gettid();
+
+	err = write(thread->pipes.ack[1], &msg, sizeof(msg));
+	if (err == -1)
+		pr_warning("threads[%d]: failed to notify on start: %s\n",
+			   thread->tid, strerror(errno));
+
+	pr_debug("threads[%d]: started on cpu%d\n", thread->tid, sched_getcpu());
+
+	pollfd = &thread->pollfd;
+	ctlfd_pos = thread->ctlfd_pos;
+
+	for (;;) {
+		unsigned long long hits = thread->samples;
+
+		if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
+			break;
+
+		if (hits == thread->samples) {
+
+			err = fdarray__poll(pollfd, -1);
+			/*
+			 * Propagate error, only if there's any. Ignore positive
+			 * number of returned events and interrupt error.
+			 */
+			if (err > 0 || (err < 0 && errno == EINTR))
+				err = 0;
+			thread->waking++;
+
+			if (fdarray__filter(pollfd, POLLERR | POLLHUP,
+					    record__thread_munmap_filtered, NULL) == 0)
+				break;
+		}
+
+		if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
+			terminate = true;
+			close(thread->pipes.msg[0]);
+			thread->pipes.msg[0] = -1;
+			pollfd->entries[ctlfd_pos].fd = -1;
+			pollfd->entries[ctlfd_pos].events = 0;
+		}
+
+		pollfd->entries[ctlfd_pos].revents = 0;
+	}
+	record__mmap_read_all(thread->rec, true);
+
+	err = write(thread->pipes.ack[1], &msg, sizeof(msg));
+	if (err == -1)
+		pr_warning("threads[%d]: failed to notify on termination: %s\n",
+			   thread->tid, strerror(errno));
+
+	return NULL;
+}
+
 static void record__init_features(struct record *rec)
 {
 	struct perf_session *session = rec->session;
@@ -1916,13 +1983,65 @@ static int record__terminate_thread(struct record_thread *thread_data)
 
 static int record__start_threads(struct record *rec)
 {
+	int t, tt, err, ret = 0, nr_threads = rec->nr_threads;
 	struct record_thread *thread_data = rec->thread_data;
+	sigset_t full, mask;
+	pthread_t handle;
+	pthread_attr_t attrs;
 
 	thread = &thread_data[0];
 
+	if (!record__threads_enabled(rec))
+		return 0;
+
+	sigfillset(&full);
+	if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+		pr_err("Failed to block signals on threads start: %s\n", strerror(errno));
+		return -1;
+	}
+
+	pthread_attr_init(&attrs);
+	pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+	for (t = 1; t < nr_threads; t++) {
+		enum thread_msg msg = THREAD_MSG__UNDEFINED;
+
+#ifdef HAVE_PTHREAD_ATTR_SETAFFINITY_NP
+		pthread_attr_setaffinity_np(&attrs,
+					    MMAP_CPU_MASK_BYTES(&(thread_data[t].mask->affinity)),
+					    (cpu_set_t *)(thread_data[t].mask->affinity.bits));
+#endif
+		if (pthread_create(&handle, &attrs, record__thread, &thread_data[t])) {
+			for (tt = 1; tt < t; tt++)
+				record__terminate_thread(&thread_data[tt]);
+			pr_err("Failed to start threads: %s\n", strerror(errno));
+			ret = -1;
+			goto out_err;
+		}
+
+		err = read(thread_data[t].pipes.ack[0], &msg, sizeof(msg));
+		if (err > 0)
+			pr_debug2("threads[%d]: sent %s\n", rec->thread_data[t].tid,
+				  thread_msg_tags[msg]);
+		else
+			pr_warning("threads[%d]: failed to receive start notification from %d\n",
+				   thread->tid, rec->thread_data[t].tid);
+	}
+
+	sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
+			(cpu_set_t *)thread->mask->affinity.bits);
+
 	pr_debug("threads[%d]: started on cpu%d\n", thread->tid, sched_getcpu());
 
-	return 0;
+out_err:
+	pthread_attr_destroy(&attrs);
+
+	if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
+		pr_err("Failed to unblock signals on threads start: %s\n", strerror(errno));
+		ret = -1;
+	}
+
+	return ret;
 }
 
 static int record__stop_threads(struct record *rec)
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ef6c2715fdd9..ad08c092f3dd 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -78,6 +78,7 @@ struct record_opts {
 	int	      ctl_fd_ack;
 	bool	      ctl_fd_close;
 	int	      synth;
+	int	      threads_spec;
 };
 
 extern const char * const *record_usage;
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 08/16] perf record: Introduce data file at mmap buffer object
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (6 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 07/16] perf record: Start threads in the beginning " Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 09/16] perf record: Introduce bytes written stats Alexey Bayduraev
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce a data file object into the mmap object so it can be used to
process and store the data stream from the corresponding kernel data buffer.
Initialize the data files attached to the mmap buffer objects so trace data
can be written into several data files located in the data directory.
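
The routing decision in record__write() then becomes: if the buffer carries
its own file, the data goes there; otherwise it goes to the session file, and
only session writes feed the global byte counter. A sketch with hypothetical
types (not the perf structures):

```c
/* Per-buffer output routing sketch mirroring the record__write() change. */
struct out_file { unsigned long bytes; };
struct buffer   { struct out_file *file; };

static void emit(struct out_file *session, struct buffer *buf,
		 unsigned long size, unsigned long *session_bytes)
{
	struct out_file *file = session;

	if (buf && buf->file)
		file = buf->file;	/* per-buffer data file */

	file->bytes += size;
	if (!(buf && buf->file))
		*session_bytes += size;	/* only session writes count here */
}
```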

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 37 ++++++++++++++++++++++++++++++++-----
 tools/perf/util/mmap.h      |  1 +
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 517520ae1520..8766a3dc9440 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -205,12 +205,16 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
 {
 	struct perf_data_file *file = &rec->session->data->file;
 
+	if (map && map->file)
+		file = map->file;
+
 	if (perf_data_file__write(file, bf, size) < 0) {
 		pr_err("failed to write perf data, error: %m\n");
 		return -1;
 	}
 
-	rec->bytes_written += size;
+	if (!(map && map->file))
+		rec->bytes_written += size;
 
 	if (record__output_max_size_exceeded(rec) && !done) {
 		fprintf(stderr, "[ perf record: perf size limit reached (%" PRIu64 " KB),"
@@ -1103,7 +1107,7 @@ static int record__alloc_thread_data(struct record *rec, struct evlist *evlist)
 static int record__mmap_evlist(struct record *rec,
 			       struct evlist *evlist)
 {
-	int ret;
+	int i, ret;
 	struct record_opts *opts = &rec->opts;
 	bool auxtrace_overwrite = opts->auxtrace_snapshot_mode ||
 				  opts->auxtrace_sample_mode;
@@ -1142,6 +1146,18 @@ static int record__mmap_evlist(struct record *rec,
 	if (ret)
 		return ret;
 
+	if (record__threads_enabled(rec)) {
+		ret = perf_data__create_dir(&rec->data, evlist->core.nr_mmaps);
+		if (ret)
+			return ret;
+		for (i = 0; i < evlist->core.nr_mmaps; i++) {
+			if (evlist->mmap)
+				evlist->mmap[i].file = &rec->data.dir.files[i];
+			if (evlist->overwrite_mmap)
+				evlist->overwrite_mmap[i].file = &rec->data.dir.files[i];
+		}
+	}
+
 	return 0;
 }
 
@@ -1448,8 +1464,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	/*
 	 * Mark the round finished in case we wrote
 	 * at least one event.
+	 *
+	 * No need for round events in directory mode,
+	 * because per-cpu maps and files have data
+	 * sorted by kernel.
 	 */
-	if (bytes_written != rec->bytes_written)
+	if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
 		rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
 
 	if (overwrite)
@@ -1566,7 +1586,9 @@ static void record__init_features(struct record *rec)
 	if (!rec->opts.use_clockid)
 		perf_header__clear_feat(&session->header, HEADER_CLOCK_DATA);
 
-	perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+	if (!record__threads_enabled(rec))
+		perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+
 	if (!record__comp_enabled(rec))
 		perf_header__clear_feat(&session->header, HEADER_COMPRESSED);
 
@@ -1576,6 +1598,7 @@ static void record__init_features(struct record *rec)
 static void
 record__finish_output(struct record *rec)
 {
+	int i;
 	struct perf_data *data = &rec->data;
 	int fd = perf_data__fd(data);
 
@@ -1584,6 +1607,10 @@ record__finish_output(struct record *rec)
 
 	rec->session->header.data_size += rec->bytes_written;
 	data->file.size = lseek(perf_data__fd(data), 0, SEEK_CUR);
+	if (record__threads_enabled(rec)) {
+		for (i = 0; i < data->dir.nr; i++)
+			data->dir.files[i].size = lseek(data->dir.files[i].fd, 0, SEEK_CUR);
+	}
 
 	if (!rec->no_buildid) {
 		process_buildids(rec);
@@ -3330,7 +3357,7 @@ int cmd_record(int argc, const char **argv)
 		goto out_opts;
 	}
 
-	if (rec->opts.kcore)
+	if (rec->opts.kcore || record__threads_enabled(rec))
 		rec->data.is_dir = true;
 
 	if (rec->opts.comp_level != 0) {
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 83f6bd4d4082..62f38d7977bb 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -45,6 +45,7 @@ struct mmap {
 	struct mmap_cpu_mask	affinity_mask;
 	void		*data;
 	int		comp_level;
+	struct perf_data_file *file;
 };
 
 struct mmap_params {
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 09/16] perf record: Introduce bytes written stats
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (7 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 08/16] perf record: Introduce data file at mmap buffer object Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object Alexey Bayduraev
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce a function to calculate the total amount of data written
and use it to support the --max-size option.
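
Since each thread keeps a private bytes_written counter, the total is the
session counter plus the per-thread counters, summed by the main thread
without locks. A sketch under assumed names (not the perf code):

```c
/* --max-size sketch: sum the session counter and per-thread counters. */
struct thr { unsigned long long bytes_written; };

static unsigned long long total_written(unsigned long long session_bytes,
					const struct thr *threads, int nr)
{
	unsigned long long total = session_bytes;
	int t;

	for (t = 0; t < nr; t++)
		total += threads[t].bytes_written;
	return total;
}

static int max_size_exceeded(unsigned long long total, unsigned long long max)
{
	return max && total >= max;	/* max == 0 means unlimited */
}
```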

Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8766a3dc9440..50981bbc98bb 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -110,6 +110,7 @@ struct record_thread {
 	struct record		*rec;
 	unsigned long long	samples;
 	unsigned long		waking;
+	u64			bytes_written;
 };
 
 static __thread struct record_thread *thread;
@@ -194,10 +195,22 @@ static bool switch_output_time(struct record *rec)
 	       trigger_is_ready(&switch_output_trigger);
 }
 
+static u64 record__bytes_written(struct record *rec)
+{
+	int t;
+	u64 bytes_written = rec->bytes_written;
+	struct record_thread *thread_data = rec->thread_data;
+
+	for (t = 0; t < rec->nr_threads; t++)
+		bytes_written += thread_data[t].bytes_written;
+
+	return bytes_written;
+}
+
 static bool record__output_max_size_exceeded(struct record *rec)
 {
 	return rec->output_max_size &&
-	       (rec->bytes_written >= rec->output_max_size);
+	       (record__bytes_written(rec) >= rec->output_max_size);
 }
 
 static int record__write(struct record *rec, struct mmap *map __maybe_unused,
@@ -213,13 +226,15 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
 		return -1;
 	}
 
-	if (!(map && map->file))
+	if (map && map->file)
+		thread->bytes_written += size;
+	else
 		rec->bytes_written += size;
 
 	if (record__output_max_size_exceeded(rec) && !done) {
 		fprintf(stderr, "[ perf record: perf size limit reached (%" PRIu64 " KB),"
 				" stopping session ]\n",
-				rec->bytes_written >> 10);
+				record__bytes_written(rec) >> 10);
 		done = 1;
 	}
 
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (8 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 09/16] perf record: Introduce bytes written stats Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-31 21:56   ` Arnaldo Carvalho de Melo
  2022-01-17 18:34 ` [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats Alexey Bayduraev
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce a compressor object into the mmap object so it can be used to
pack the data stream from the corresponding kernel data buffer.
Initialize and make use of the introduced per-mmap compressor.

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 18 +++++++++++-------
 tools/perf/util/mmap.c      | 10 ++++++++++
 tools/perf/util/mmap.h      |  2 ++
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 50981bbc98bb..7d0338b5a0e3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -246,8 +246,8 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
 
 static int record__aio_enabled(struct record *rec);
 static int record__comp_enabled(struct record *rec);
-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
-			    void *src, size_t src_size);
+static size_t zstd_compress(struct perf_session *session, struct mmap *map,
+			    void *dst, size_t dst_size, void *src, size_t src_size);
 
 #ifdef HAVE_AIO_SUPPORT
 static int record__aio_write(struct aiocb *cblock, int trace_fd,
@@ -381,7 +381,7 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
 	 */
 
 	if (record__comp_enabled(aio->rec)) {
-		size = zstd_compress(aio->rec->session, aio->data + aio->size,
+		size = zstd_compress(aio->rec->session, NULL, aio->data + aio->size,
 				     mmap__mmap_len(map) - aio->size,
 				     buf, size);
 	} else {
@@ -608,7 +608,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
 	struct record *rec = to;
 
 	if (record__comp_enabled(rec)) {
-		size = zstd_compress(rec->session, map->data, mmap__mmap_len(map), bf, size);
+		size = zstd_compress(rec->session, map, map->data, mmap__mmap_len(map), bf, size);
 		bf   = map->data;
 	}
 
@@ -1394,13 +1394,17 @@ static size_t process_comp_header(void *record, size_t increment)
 	return size;
 }
 
-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
-			    void *src, size_t src_size)
+static size_t zstd_compress(struct perf_session *session, struct mmap *map,
+			    void *dst, size_t dst_size, void *src, size_t src_size)
 {
 	size_t compressed;
 	size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct perf_record_compressed) - 1;
+	struct zstd_data *zstd_data = &session->zstd_data;
 
-	compressed = zstd_compress_stream_to_records(&session->zstd_data, dst, dst_size, src, src_size,
+	if (map && map->file)
+		zstd_data = &map->zstd_data;
+
+	compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
 						     max_record_size, process_comp_header);
 
 	session->bytes_transferred += src_size;
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 12261ed8c15b..8bf97d9b8424 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -230,6 +230,10 @@ void mmap__munmap(struct mmap *map)
 {
 	bitmap_free(map->affinity_mask.bits);
 
+#ifndef PYTHON_PERF
+	zstd_fini(&map->zstd_data);
+#endif
+
 	perf_mmap__aio_munmap(map);
 	if (map->data != NULL) {
 		munmap(map->data, mmap__mmap_len(map));
@@ -292,6 +296,12 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, struct perf_cpu
 	map->core.flush = mp->flush;
 
 	map->comp_level = mp->comp_level;
+#ifndef PYTHON_PERF
+	if (zstd_init(&map->zstd_data, map->comp_level)) {
+		pr_debug2("failed to init mmap compressor, error %d\n", errno);
+		return -1;
+	}
+#endif
 
 	if (map->comp_level && !perf_mmap__aio_enabled(map)) {
 		map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 62f38d7977bb..cd8b0777473b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -15,6 +15,7 @@
 #endif
 #include "auxtrace.h"
 #include "event.h"
+#include "util/compress.h"
 
 struct aiocb;
 
@@ -46,6 +47,7 @@ struct mmap {
 	void		*data;
 	int		comp_level;
 	struct perf_data_file *file;
+	struct zstd_data      zstd_data;
 };
 
 struct mmap_params {
-- 
2.19.0



* [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (9 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-24 15:28   ` Arnaldo Carvalho de Melo
  2022-01-17 18:34 ` [PATCH v13 12/16] perf record: Introduce --threads command line option Alexey Bayduraev
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Introduce bytes_transferred and bytes_compressed stats to capture
per-thread statistics for the related data buffer transfers.

Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 7d0338b5a0e3..0f8488d12f44 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -111,6 +111,8 @@ struct record_thread {
 	unsigned long long	samples;
 	unsigned long		waking;
 	u64			bytes_written;
+	u64			bytes_transferred;
+	u64			bytes_compressed;
 };
 
 static __thread struct record_thread *thread;
@@ -1407,8 +1409,13 @@ static size_t zstd_compress(struct perf_session *session, struct mmap *map,
 	compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
 						     max_record_size, process_comp_header);
 
-	session->bytes_transferred += src_size;
-	session->bytes_compressed  += compressed;
+	if (map && map->file) {
+		thread->bytes_transferred += src_size;
+		thread->bytes_compressed  += compressed;
+	} else {
+		session->bytes_transferred += src_size;
+		session->bytes_compressed  += compressed;
+	}
 
 	return compressed;
 }
@@ -2098,8 +2105,20 @@ static int record__stop_threads(struct record *rec)
 	for (t = 1; t < rec->nr_threads; t++)
 		record__terminate_thread(&thread_data[t]);
 
-	for (t = 0; t < rec->nr_threads; t++)
+	for (t = 0; t < rec->nr_threads; t++) {
 		rec->samples += thread_data[t].samples;
+		if (!record__threads_enabled(rec))
+			continue;
+		rec->session->bytes_transferred += thread_data[t].bytes_transferred;
+		rec->session->bytes_compressed += thread_data[t].bytes_compressed;
+		pr_debug("threads[%d]: samples=%lld, wakes=%ld, ", thread_data[t].tid,
+			 thread_data[t].samples, thread_data[t].waking);
+		if (thread_data[t].bytes_transferred && thread_data[t].bytes_compressed)
+			pr_debug("transferred=%ld, compressed=%ld\n",
+				 thread_data[t].bytes_transferred, thread_data[t].bytes_compressed);
+		else
+			pr_debug("written=%ld\n", thread_data[t].bytes_written);
+	}
 
 	return 0;
 }
-- 
2.19.0



* [PATCH v13 12/16] perf record: Introduce --threads command line option
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (10 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 13/16] perf record: Extend " Alexey Bayduraev
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Provide the --threads option in the perf record command line interface.
The option creates a data streaming thread for each CPU in the system.
Document the --threads option in Documentation/perf-record.txt.

Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/Documentation/perf-record.txt |  4 ++
 tools/perf/builtin-record.c              | 48 +++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 9ccc75935bc5..b9c6b112bf46 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -713,6 +713,10 @@ measurements:
  wait -n ${perf_pid}
  exit $?
 
+--threads::
+Write collected trace data into several data files using parallel threads.
+The option creates a data streaming thread for each CPU in the system.
+
 include::intel-hybrid.txt[]
 
 --debuginfod[=URLs]::
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0f8488d12f44..ba1622a192a9 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -127,6 +127,11 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
 	"UNDEFINED", "READY"
 };
 
+enum thread_spec {
+	THREAD_SPEC__UNDEFINED = 0,
+	THREAD_SPEC__CPU,
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -2768,6 +2773,16 @@ static void record__thread_mask_free(struct thread_mask *mask)
 	record__mmap_cpu_mask_free(&mask->affinity);
 }
 
+static int record__parse_threads(const struct option *opt, const char *str, int unset)
+{
+	struct record_opts *opts = opt->value;
+
+	if (unset || !str || !strlen(str))
+		opts->threads_spec = THREAD_SPEC__CPU;
+
+	return 0;
+}
+
 static int parse_output_max_size(const struct option *opt,
 				 const char *str, int unset)
 {
@@ -3242,6 +3257,9 @@ static struct option __record_options[] = {
 			  &record.debuginfod.set, "debuginfod urls",
 			  "Enable debuginfod data retrieval from DEBUGINFOD_URLS or specified urls",
 			  "system"),
+	OPT_CALLBACK_OPTARG(0, "threads", &record.opts, NULL, "spec",
+			    "write collected trace data into several data files using parallel threads",
+			    record__parse_threads),
 	OPT_END()
 };
 
@@ -3292,6 +3310,31 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
 	return ret;
 }
 
+static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+	int t, ret, nr_cpus = perf_cpu_map__nr(cpus);
+
+	ret = record__alloc_thread_masks(rec, nr_cpus, cpu__max_cpu().cpu);
+	if (ret)
+		return ret;
+
+	rec->nr_threads = nr_cpus;
+	pr_debug("nr_threads: %d\n", rec->nr_threads);
+
+	for (t = 0; t < rec->nr_threads; t++) {
+		set_bit(cpus->map[t].cpu, rec->thread_masks[t].maps.bits);
+		set_bit(cpus->map[t].cpu, rec->thread_masks[t].affinity.bits);
+		if (verbose) {
+			pr_debug("thread_masks[%d]: ", t);
+			mmap_cpu_mask__scnprintf(&rec->thread_masks[t].maps, "maps");
+			pr_debug("thread_masks[%d]: ", t);
+			mmap_cpu_mask__scnprintf(&rec->thread_masks[t].affinity, "affinity");
+		}
+	}
+
+	return 0;
+}
+
 static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
 {
 	int ret;
@@ -3311,7 +3354,10 @@ static int record__init_thread_masks(struct record *rec)
 {
 	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
 
-	return record__init_thread_default_masks(rec, cpus);
+	if (!record__threads_enabled(rec))
+		return record__init_thread_default_masks(rec, cpus);
+
+	return record__init_thread_cpu_masks(rec, cpus);
 }
 
 int cmd_record(int argc, const char **argv)
-- 
2.19.0



* [PATCH v13 13/16] perf record: Extend --threads command line option
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (11 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 12/16] perf record: Introduce --threads command line option Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 14/16] perf record: Implement compatibility checks Alexey Bayduraev
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Extend the --threads option of the perf record command line interface.
The option value can be a set of masks specifying the CPUs to be
monitored by data streaming threads and the layout of those threads in
the system topology. The masks can be filtered using the CPU mask
provided via the -C option.

The specification value can be a user-defined list of masks. Masks
separated by colons define the CPUs to be monitored by one thread, and
the affinity mask of that thread is separated by a slash. For example:
<cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
specifies a parallel threads layout consisting of two threads with the
corresponding CPUs assigned for monitoring.

The specification value can also be a string, e.g. "cpu", "core" or
"package", which creates a data streaming thread for every CPU, core
or package, monitoring distinct CPUs or CPUs grouped by core or
package.

The option provided with no value or an empty value defaults to the
per-CPU parallel threads layout, creating a data streaming thread for
every monitored CPU.

Document the --threads option syntax and parallel data streaming modes
in Documentation/perf-record.txt.

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/Documentation/perf-record.txt |  34 ++-
 tools/perf/builtin-record.c              | 318 ++++++++++++++++++++++-
 tools/perf/util/record.h                 |   1 +
 3 files changed, 349 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index b9c6b112bf46..465be4e62a17 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -713,9 +713,39 @@ measurements:
  wait -n ${perf_pid}
  exit $?
 
---threads::
+--threads=<spec>::
 Write collected trace data into several data files using parallel threads.
-The option creates a data streaming thread for each CPU in the system.
+<spec> value can be a user-defined list of masks. Masks separated by
+colons define the CPUs to be monitored by a thread, and the affinity
+mask of that thread is separated by a slash:
+
+    <cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>:...
+
+CPUs or affinity masks must not overlap with other corresponding masks.
+Invalid CPUs are ignored, but masks containing only invalid CPUs are not
+allowed.
+
+For example, a user specification like the following:
+
+    0,2-4/2-4:1,5-7/5-7
+
+specifies a parallel threads layout that consists of two threads:
+the first thread monitors CPUs 0 and 2-4 with the affinity mask 2-4,
+the second monitors CPUs 1 and 5-7 with the affinity mask 5-7.
+
+<spec> value can also be a string selecting a predefined parallel
+threads layout:
+
+    cpu     - create new data streaming thread for every monitored cpu
+    core    - create new thread to monitor CPUs grouped by a core
+    package - create new thread to monitor CPUs grouped by a package
+    numa    - create new thread to monitor CPUs grouped by a NUMA domain
+
+Predefined layouts can be used on systems with a large number of CPUs
+to avoid spawning one streaming thread per CPU while still avoiding
+LOST events in data directory files. The option specified with no or
+an empty value defaults to the CPU layout. Masks defined or provided
+by the option value are filtered through the mask provided by the -C
+option.
 
 include::intel-hybrid.txt[]
 
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba1622a192a9..e60bf0e3bc25 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -51,6 +51,7 @@
 #include "util/evlist-hybrid.h"
 #include "asm/bug.h"
 #include "perf.h"
+#include "cputopo.h"
 
 #include <errno.h>
 #include <inttypes.h>
@@ -130,6 +131,15 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
 enum thread_spec {
 	THREAD_SPEC__UNDEFINED = 0,
 	THREAD_SPEC__CPU,
+	THREAD_SPEC__CORE,
+	THREAD_SPEC__PACKAGE,
+	THREAD_SPEC__NUMA,
+	THREAD_SPEC__USER,
+	THREAD_SPEC__MAX,
+};
+
+static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
+	"undefined", "cpu", "core", "package", "numa", "user"
 };
 
 struct record {
@@ -2775,10 +2785,31 @@ static void record__thread_mask_free(struct thread_mask *mask)
 
 static int record__parse_threads(const struct option *opt, const char *str, int unset)
 {
+	int s;
 	struct record_opts *opts = opt->value;
 
-	if (unset || !str || !strlen(str))
+	if (unset || !str || !strlen(str)) {
 		opts->threads_spec = THREAD_SPEC__CPU;
+	} else {
+		for (s = 1; s < THREAD_SPEC__MAX; s++) {
+			if (s == THREAD_SPEC__USER) {
+				opts->threads_user_spec = strdup(str);
+				if (!opts->threads_user_spec)
+					return -ENOMEM;
+				opts->threads_spec = THREAD_SPEC__USER;
+				break;
+			}
+			if (!strncasecmp(str, thread_spec_tags[s], strlen(thread_spec_tags[s]))) {
+				opts->threads_spec = s;
+				break;
+			}
+		}
+	}
+
+	if (opts->threads_spec == THREAD_SPEC__USER)
+		pr_debug("threads_spec: %s\n", opts->threads_user_spec);
+	else
+		pr_debug("threads_spec: %s\n", thread_spec_tags[opts->threads_spec]);
 
 	return 0;
 }
@@ -3273,6 +3304,21 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
 		set_bit(cpus->map[c].cpu, mask->bits);
 }
 
+static int record__mmap_cpu_mask_init_spec(struct mmap_cpu_mask *mask, const char *mask_spec)
+{
+	struct perf_cpu_map *cpus;
+
+	cpus = perf_cpu_map__new(mask_spec);
+	if (!cpus)
+		return -ENOMEM;
+
+	bitmap_zero(mask->bits, mask->nbits);
+	record__mmap_cpu_mask_init(mask, cpus);
+	perf_cpu_map__put(cpus);
+
+	return 0;
+}
+
 static void record__free_thread_masks(struct record *rec, int nr_threads)
 {
 	int t;
@@ -3335,6 +3381,253 @@ static int record__init_thread_cpu_masks(struct record *rec, struct perf_cpu_map
 	return 0;
 }
 
+static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
+					  const char **maps_spec, const char **affinity_spec,
+					  u32 nr_spec)
+{
+	u32 s;
+	int ret = 0, t = 0;
+	struct mmap_cpu_mask cpus_mask;
+	struct thread_mask thread_mask, full_mask, *thread_masks;
+
+	ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu().cpu);
+	if (ret) {
+		pr_err("Failed to allocate CPUs mask\n");
+		return ret;
+	}
+	record__mmap_cpu_mask_init(&cpus_mask, cpus);
+
+	ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu().cpu);
+	if (ret) {
+		pr_err("Failed to allocate full mask\n");
+		goto out_free_cpu_mask;
+	}
+
+	ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu().cpu);
+	if (ret) {
+		pr_err("Failed to allocate thread mask\n");
+		goto out_free_full_and_cpu_masks;
+	}
+
+	for (s = 0; s < nr_spec; s++) {
+		ret = record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
+		if (ret) {
+			pr_err("Failed to initialize maps thread mask\n");
+			goto out_free;
+		}
+		ret = record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
+		if (ret) {
+			pr_err("Failed to initialize affinity thread mask\n");
+			goto out_free;
+		}
+
+		/* ignore invalid CPUs but do not allow empty masks */
+		if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
+				cpus_mask.bits, thread_mask.maps.nbits)) {
+			pr_err("Empty maps mask: %s\n", maps_spec[s]);
+			ret = -EINVAL;
+			goto out_free;
+		}
+		if (!bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
+				cpus_mask.bits, thread_mask.affinity.nbits)) {
+			pr_err("Empty affinity mask: %s\n", affinity_spec[s]);
+			ret = -EINVAL;
+			goto out_free;
+		}
+
+		/* do not allow intersection with other masks (full_mask) */
+		if (bitmap_intersects(thread_mask.maps.bits, full_mask.maps.bits,
+				      thread_mask.maps.nbits)) {
+			pr_err("Intersecting maps mask: %s\n", maps_spec[s]);
+			ret = -EINVAL;
+			goto out_free;
+		}
+		if (bitmap_intersects(thread_mask.affinity.bits, full_mask.affinity.bits,
+				      thread_mask.affinity.nbits)) {
+			pr_err("Intersecting affinity mask: %s\n", affinity_spec[s]);
+			ret = -EINVAL;
+			goto out_free;
+		}
+
+		bitmap_or(full_mask.maps.bits, full_mask.maps.bits,
+			  thread_mask.maps.bits, full_mask.maps.nbits);
+		bitmap_or(full_mask.affinity.bits, full_mask.affinity.bits,
+			  thread_mask.affinity.bits, full_mask.maps.nbits);
+
+		thread_masks = realloc(rec->thread_masks, (t + 1) * sizeof(struct thread_mask));
+		if (!thread_masks) {
+			pr_err("Failed to reallocate thread masks\n");
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		rec->thread_masks = thread_masks;
+		rec->thread_masks[t] = thread_mask;
+		if (verbose) {
+			pr_debug("thread_masks[%d]: ", t);
+			mmap_cpu_mask__scnprintf(&rec->thread_masks[t].maps, "maps");
+			pr_debug("thread_masks[%d]: ", t);
+			mmap_cpu_mask__scnprintf(&rec->thread_masks[t].affinity, "affinity");
+		}
+		t++;
+		ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu().cpu);
+		if (ret) {
+			pr_err("Failed to allocate thread mask\n");
+			goto out_free_full_and_cpu_masks;
+		}
+	}
+	rec->nr_threads = t;
+	pr_debug("nr_threads: %d\n", rec->nr_threads);
+	if (!rec->nr_threads)
+		ret = -EINVAL;
+
+out_free:
+	record__thread_mask_free(&thread_mask);
+out_free_full_and_cpu_masks:
+	record__thread_mask_free(&full_mask);
+out_free_cpu_mask:
+	record__mmap_cpu_mask_free(&cpus_mask);
+
+	return ret;
+}
+
+static int record__init_thread_core_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+	int ret;
+	struct cpu_topology *topo;
+
+	topo = cpu_topology__new();
+	if (!topo) {
+		pr_err("Failed to allocate CPU topology\n");
+		return -ENOMEM;
+	}
+
+	ret = record__init_thread_masks_spec(rec, cpus, topo->core_cpus_list,
+					     topo->core_cpus_list, topo->core_cpus_lists);
+	cpu_topology__delete(topo);
+
+	return ret;
+}
+
+static int record__init_thread_package_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+	int ret;
+	struct cpu_topology *topo;
+
+	topo = cpu_topology__new();
+	if (!topo) {
+		pr_err("Failed to allocate CPU topology\n");
+		return -ENOMEM;
+	}
+
+	ret = record__init_thread_masks_spec(rec, cpus, topo->package_cpus_list,
+					     topo->package_cpus_list, topo->package_cpus_lists);
+	cpu_topology__delete(topo);
+
+	return ret;
+}
+
+static int record__init_thread_numa_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+	u32 s;
+	int ret;
+	const char **spec;
+	struct numa_topology *topo;
+
+	topo = numa_topology__new();
+	if (!topo) {
+		pr_err("Failed to allocate NUMA topology\n");
+		return -ENOMEM;
+	}
+
+	spec = zalloc(topo->nr * sizeof(char *));
+	if (!spec) {
+		pr_err("Failed to allocate NUMA spec\n");
+		ret = -ENOMEM;
+		goto out_delete_topo;
+	}
+	for (s = 0; s < topo->nr; s++)
+		spec[s] = topo->nodes[s].cpus;
+
+	ret = record__init_thread_masks_spec(rec, cpus, spec, spec, topo->nr);
+
+	zfree(&spec);
+
+out_delete_topo:
+	numa_topology__delete(topo);
+
+	return ret;
+}
+
+static int record__init_thread_user_masks(struct record *rec, struct perf_cpu_map *cpus)
+{
+	int t, ret;
+	u32 s, nr_spec = 0;
+	char **maps_spec = NULL, **affinity_spec = NULL, **tmp_spec;
+	char *user_spec, *spec, *spec_ptr, *mask, *mask_ptr, *dup_mask = NULL;
+
+	for (t = 0, user_spec = (char *)rec->opts.threads_user_spec; ; t++, user_spec = NULL) {
+		spec = strtok_r(user_spec, ":", &spec_ptr);
+		if (spec == NULL)
+			break;
+		pr_debug2("threads_spec[%d]: %s\n", t, spec);
+		mask = strtok_r(spec, "/", &mask_ptr);
+		if (mask == NULL)
+			break;
+		pr_debug2("  maps mask: %s\n", mask);
+		tmp_spec = realloc(maps_spec, (nr_spec + 1) * sizeof(char *));
+		if (!tmp_spec) {
+			pr_err("Failed to reallocate maps spec\n");
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		maps_spec = tmp_spec;
+		maps_spec[nr_spec] = dup_mask = strdup(mask);
+		if (!maps_spec[nr_spec]) {
+			pr_err("Failed to allocate maps spec[%d]\n", nr_spec);
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		mask = strtok_r(NULL, "/", &mask_ptr);
+		if (mask == NULL) {
+			pr_err("Invalid thread maps or affinity specs\n");
+			ret = -EINVAL;
+			goto out_free;
+		}
+		pr_debug2("  affinity mask: %s\n", mask);
+		tmp_spec = realloc(affinity_spec, (nr_spec + 1) * sizeof(char *));
+		if (!tmp_spec) {
+			pr_err("Failed to reallocate affinity spec\n");
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		affinity_spec = tmp_spec;
+		affinity_spec[nr_spec] = strdup(mask);
+		if (!affinity_spec[nr_spec]) {
+			pr_err("Failed to allocate affinity spec[%d]\n", nr_spec);
+			ret = -ENOMEM;
+			goto out_free;
+		}
+		dup_mask = NULL;
+		nr_spec++;
+	}
+
+	ret = record__init_thread_masks_spec(rec, cpus, (const char **)maps_spec,
+					     (const char **)affinity_spec, nr_spec);
+
+out_free:
+	free(dup_mask);
+	for (s = 0; s < nr_spec; s++) {
+		if (maps_spec)
+			free(maps_spec[s]);
+		if (affinity_spec)
+			free(affinity_spec[s]);
+	}
+	free(affinity_spec);
+	free(maps_spec);
+
+	return ret;
+}
+
 static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
 {
 	int ret;
@@ -3352,12 +3645,33 @@ static int record__init_thread_default_masks(struct record *rec, struct perf_cpu
 
 static int record__init_thread_masks(struct record *rec)
 {
+	int ret = 0;
 	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
 
 	if (!record__threads_enabled(rec))
 		return record__init_thread_default_masks(rec, cpus);
 
-	return record__init_thread_cpu_masks(rec, cpus);
+	switch (rec->opts.threads_spec) {
+	case THREAD_SPEC__CPU:
+		ret = record__init_thread_cpu_masks(rec, cpus);
+		break;
+	case THREAD_SPEC__CORE:
+		ret = record__init_thread_core_masks(rec, cpus);
+		break;
+	case THREAD_SPEC__PACKAGE:
+		ret = record__init_thread_package_masks(rec, cpus);
+		break;
+	case THREAD_SPEC__NUMA:
+		ret = record__init_thread_numa_masks(rec, cpus);
+		break;
+	case THREAD_SPEC__USER:
+		ret = record__init_thread_user_masks(rec, cpus);
+		break;
+	default:
+		break;
+	}
+
+	return ret;
 }
 
 int cmd_record(int argc, const char **argv)
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index ad08c092f3dd..be9a957501f4 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -79,6 +79,7 @@ struct record_opts {
 	bool	      ctl_fd_close;
 	int	      synth;
 	int	      threads_spec;
+	const char    *threads_user_spec;
 };
 
 extern const char * const *record_usage;
-- 
2.19.0



* [PATCH v13 14/16] perf record: Implement compatibility checks
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (12 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 13/16] perf record: Extend " Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 15/16] perf session: Load data directory files for analysis Alexey Bayduraev
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Implement compatibility checks for the other modes and related command
line options: asynchronous trace streaming (--aio) and affinity
(--affinity) modes, pipe mode, the AUX area tracing --snapshot and
--aux-sample options, and the --switch-output, --switch-output-event,
--switch-max-files and --timestamp-filename options. Parallel data
streaming is compatible with Zstd compression (--compression-level)
and external control commands (--control). The CPU mask provided via
the -C option filters the --threads specification masks.

Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-record.c | 49 ++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index e60bf0e3bc25..9b7102262b20 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -805,6 +805,12 @@ static int record__auxtrace_init(struct record *rec)
 {
 	int err;
 
+	if ((rec->opts.auxtrace_snapshot_opts || rec->opts.auxtrace_sample_opts)
+	    && record__threads_enabled(rec)) {
+		pr_err("AUX area tracing options are not available in parallel streaming mode.\n");
+		return -EINVAL;
+	}
+
 	if (!rec->itr) {
 		rec->itr = auxtrace_record__init(rec->evlist, &err);
 		if (err)
@@ -2198,6 +2204,17 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		return PTR_ERR(session);
 	}
 
+	if (record__threads_enabled(rec)) {
+		if (perf_data__is_pipe(&rec->data)) {
+			pr_err("Parallel trace streaming is not available in pipe mode.\n");
+			return -1;
+		}
+		if (rec->opts.full_auxtrace) {
+			pr_err("Parallel trace streaming is not available in AUX area tracing mode.\n");
+			return -1;
+		}
+	}
+
 	fd = perf_data__fd(data);
 	rec->session = session;
 
@@ -2938,12 +2955,22 @@ static int switch_output_setup(struct record *rec)
 	 * --switch-output=signal, as we'll send a SIGUSR2 from the side band
 	 *  thread to its parent.
 	 */
-	if (rec->switch_output_event_set)
+	if (rec->switch_output_event_set) {
+		if (record__threads_enabled(rec)) {
+			pr_warning("WARNING: --switch-output-event option is not available in parallel streaming mode.\n");
+			return 0;
+		}
 		goto do_signal;
+	}
 
 	if (!s->set)
 		return 0;
 
+	if (record__threads_enabled(rec)) {
+		pr_warning("WARNING: --switch-output option is not available in parallel streaming mode.\n");
+		return 0;
+	}
+
 	if (!strcmp(s->str, "signal")) {
 do_signal:
 		s->signal = true;
@@ -3262,8 +3289,8 @@ static struct option __record_options[] = {
 		     "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
 		     record__parse_affinity),
 #ifdef HAVE_ZSTD_SUPPORT
-	OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
-			    "n", "Compressed records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
+	OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default, "n",
+			    "Compress records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
 			    record__parse_comp_level),
 #endif
 	OPT_CALLBACK(0, "max-size", &record.output_max_size,
@@ -3758,6 +3785,17 @@ int cmd_record(int argc, const char **argv)
 	if (rec->opts.kcore || record__threads_enabled(rec))
 		rec->data.is_dir = true;
 
+	if (record__threads_enabled(rec)) {
+		if (rec->opts.affinity != PERF_AFFINITY_SYS) {
+			pr_err("--affinity option is mutually exclusive with parallel streaming mode.\n");
+			goto out_opts;
+		}
+		if (record__aio_enabled(rec)) {
+			pr_err("Asynchronous streaming mode (--aio) is mutually exclusive with parallel streaming mode.\n");
+			goto out_opts;
+		}
+	}
+
 	if (rec->opts.comp_level != 0) {
 		pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
 		rec->no_buildid = true;
@@ -3791,6 +3829,11 @@ int cmd_record(int argc, const char **argv)
 		}
 	}
 
+	if (rec->timestamp_filename && record__threads_enabled(rec)) {
+		rec->timestamp_filename = false;
+		pr_warning("WARNING: --timestamp-filename option is not available in parallel streaming mode.\n");
+	}
+
 	/*
 	 * Allow aliases to facilitate the lookup of symbols for address
 	 * filters. Refer to auxtrace_parse_filters().
-- 
2.19.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v13 15/16] perf session: Load data directory files for analysis
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (13 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 14/16] perf record: Implement compatibility checks Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-17 18:34 ` [PATCH v13 16/16] perf report: Output data file name in raw trace dump Alexey Bayduraev
  2022-01-24 15:45 ` [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Load data directory files and provide basic raw dump and aggregated
analysis support of data directories in report mode, still with no
memory consumption optimizations.

READER_MAX_SIZE was chosen based on measurements taken on different
machines with perf.data directory sizes above 1GB. On machines with
a large core count (192 cores) the difference between 1MB and 2MB is
about 4%, and sizes above 2MB perform roughly the same as 2MB. On
machines with a small core count (4-24) there is no difference
between sizes of 1-16 MB. So the constant is set to 2MB.

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/util/session.c | 133 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index f19348dddd55..c6605eab61c1 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2185,6 +2185,8 @@ struct reader {
 	u64		 file_pos;
 	u64		 file_offset;
 	u64		 head;
+	u64		 size;
+	bool		 done;
 	struct zstd_data   zstd_data;
 	struct decomp_data decomp_data;
 };
@@ -2302,6 +2304,7 @@ reader__read_event(struct reader *rd, struct perf_session *session,
 	if (skip)
 		size += skip;
 
+	rd->size += size;
 	rd->head += size;
 	rd->file_pos += size;
 
@@ -2410,6 +2413,133 @@ static int __perf_session__process_events(struct perf_session *session)
 	return err;
 }
 
+/*
+ * Processing 2 MB of data from each reader in sequence,
+ * because that's the way the ordered events sorting works
+ * most efficiently.
+ */
+#define READER_MAX_SIZE (2 * 1024 * 1024)
+
+/*
+ * This function reads, merges and processes directory data.
+ * It assumes version 1 of the directory data, where each
+ * data file holds per-cpu data, already sorted by the kernel.
+ */
+static int __perf_session__process_dir_events(struct perf_session *session)
+{
+	struct perf_data *data = session->data;
+	struct perf_tool *tool = session->tool;
+	int i, ret, readers, nr_readers;
+	struct ui_progress prog;
+	u64 total_size = perf_data__size(session->data);
+	struct reader *rd;
+
+	perf_tool__fill_defaults(tool);
+
+	ui_progress__init_size(&prog, total_size, "Sorting events...");
+
+	nr_readers = 1;
+	for (i = 0; i < data->dir.nr; i++) {
+		if (data->dir.files[i].size)
+			nr_readers++;
+	}
+
+	rd = zalloc(nr_readers * sizeof(struct reader));
+	if (!rd)
+		return -ENOMEM;
+
+	rd[0] = (struct reader) {
+		.fd		 = perf_data__fd(session->data),
+		.data_size	 = session->header.data_size,
+		.data_offset	 = session->header.data_offset,
+		.process	 = process_simple,
+		.in_place_update = session->data->in_place_update,
+	};
+	ret = reader__init(&rd[0], NULL);
+	if (ret)
+		goto out_err;
+	ret = reader__mmap(&rd[0], session);
+	if (ret)
+		goto out_err;
+	readers = 1;
+
+	for (i = 0; i < data->dir.nr; i++) {
+		if (!data->dir.files[i].size)
+			continue;
+		rd[readers] = (struct reader) {
+			.fd		 = data->dir.files[i].fd,
+			.data_size	 = data->dir.files[i].size,
+			.data_offset	 = 0,
+			.process	 = process_simple,
+			.in_place_update = session->data->in_place_update,
+		};
+		ret = reader__init(&rd[readers], NULL);
+		if (ret)
+			goto out_err;
+		ret = reader__mmap(&rd[readers], session);
+		if (ret)
+			goto out_err;
+		readers++;
+	}
+
+	i = 0;
+	while (readers) {
+		if (session_done())
+			break;
+
+		if (rd[i].done) {
+			i = (i + 1) % nr_readers;
+			continue;
+		}
+		if (reader__eof(&rd[i])) {
+			rd[i].done = true;
+			readers--;
+			continue;
+		}
+
+		session->active_decomp = &rd[i].decomp_data;
+		ret = reader__read_event(&rd[i], session, &prog);
+		if (ret < 0) {
+			goto out_err;
+		} else if (ret == READER_NODATA) {
+			ret = reader__mmap(&rd[i], session);
+			if (ret)
+				goto out_err;
+		}
+
+		if (rd[i].size >= READER_MAX_SIZE) {
+			rd[i].size = 0;
+			i = (i + 1) % nr_readers;
+		}
+	}
+
+	ret = ordered_events__flush(&session->ordered_events, OE_FLUSH__FINAL);
+	if (ret)
+		goto out_err;
+
+	ret = perf_session__flush_thread_stacks(session);
+out_err:
+	ui_progress__finish();
+
+	if (!tool->no_warn)
+		perf_session__warn_about_errors(session);
+
+	/*
+	 * We may be switching the perf.data output, so make
+	 * ordered_events reusable.
+	 */
+	ordered_events__reinit(&session->ordered_events);
+
+	session->one_mmap = false;
+
+	session->active_decomp = &session->decomp_data;
+	for (i = 0; i < nr_readers; i++)
+		reader__release_decomp(&rd[i]);
+	zfree(&rd);
+
+	return ret;
+}
+
 int perf_session__process_events(struct perf_session *session)
 {
 	if (perf_session__register_idle_thread(session) < 0)
@@ -2418,6 +2548,9 @@ int perf_session__process_events(struct perf_session *session)
 	if (perf_data__is_pipe(session->data))
 		return __perf_session__process_pipe_events(session);
 
+	if (perf_data__is_dir(session->data))
+		return __perf_session__process_dir_events(session);
+
 	return __perf_session__process_events(session);
 }
 
-- 
2.19.0



* [PATCH v13 16/16] perf report: Output data file name in raw trace dump
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (14 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 15/16] perf session: Load data directory files for analysis Alexey Bayduraev
@ 2022-01-17 18:34 ` Alexey Bayduraev
  2022-01-24 15:45 ` [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
  16 siblings, 0 replies; 38+ messages in thread
From: Alexey Bayduraev @ 2022-01-17 18:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Print the path and name of the data file in the raw dump (-D)
as <file_offset>@<path/file>:

  0x2226a@perf.data [0x30]: event: 9
or
  0x15cc36@perf.data/data.7 [0x30]: event: 9

Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
Tested-by: Riccardo Mancini <rickyman7@gmail.com>
Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
---
 tools/perf/builtin-inject.c      |  3 +-
 tools/perf/builtin-kvm.c         |  2 +-
 tools/perf/builtin-top.c         |  2 +-
 tools/perf/builtin-trace.c       |  2 +-
 tools/perf/util/ordered-events.c |  3 +-
 tools/perf/util/ordered-events.h |  3 +-
 tools/perf/util/session.c        | 75 ++++++++++++++++++++------------
 tools/perf/util/session.h        |  3 +-
 tools/perf/util/tool.h           |  3 +-
 9 files changed, 59 insertions(+), 37 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 409b721666cb..16acebd62b65 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -110,7 +110,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,
 
 static int perf_event__repipe_op4_synth(struct perf_session *session,
 					union perf_event *event,
-					u64 data __maybe_unused)
+					u64 data __maybe_unused,
+					const char *str __maybe_unused)
 {
 	return perf_event__repipe_synth(session->tool, event);
 }
diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index c6f352ee57e6..b23a1f3eaeda 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -771,7 +771,7 @@ static s64 perf_kvm__mmap_read_idx(struct perf_kvm_stat *kvm, int idx,
 			return -1;
 		}
 
-		err = perf_session__queue_event(kvm->session, event, timestamp, 0);
+		err = perf_session__queue_event(kvm->session, event, timestamp, 0, NULL);
 		/*
 		 * FIXME: Here we can't consume the event, as perf_session__queue_event will
 		 *        point to it, and it'll get possibly overwritten by the kernel.
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 1fc390f136dd..92b314fa7223 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -888,7 +888,7 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
 		if (ret && ret != -1)
 			break;
 
-		ret = ordered_events__queue(top->qe.in, event, last_timestamp, 0);
+		ret = ordered_events__queue(top->qe.in, event, last_timestamp, 0, NULL);
 		if (ret)
 			break;
 
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 46bca1f9ab9e..a89304b55309 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3780,7 +3780,7 @@ static int trace__deliver_event(struct trace *trace, union perf_event *event)
 	if (err && err != -1)
 		return err;
 
-	err = ordered_events__queue(&trace->oe.data, event, trace->oe.last, 0);
+	err = ordered_events__queue(&trace->oe.data, event, trace->oe.last, 0, NULL);
 	if (err)
 		return err;
 
diff --git a/tools/perf/util/ordered-events.c b/tools/perf/util/ordered-events.c
index 48c8f609441b..b887dfeea673 100644
--- a/tools/perf/util/ordered-events.c
+++ b/tools/perf/util/ordered-events.c
@@ -192,7 +192,7 @@ void ordered_events__delete(struct ordered_events *oe, struct ordered_event *eve
 }
 
 int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
-			  u64 timestamp, u64 file_offset)
+			  u64 timestamp, u64 file_offset, const char *file_path)
 {
 	struct ordered_event *oevent;
 
@@ -217,6 +217,7 @@ int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
 		return -ENOMEM;
 
 	oevent->file_offset = file_offset;
+	oevent->file_path = file_path;
 	return 0;
 }
 
diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 75345946c4b9..0b05c3c0aeaa 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -9,6 +9,7 @@ struct perf_sample;
 struct ordered_event {
 	u64			timestamp;
 	u64			file_offset;
+	const char		*file_path;
 	union perf_event	*event;
 	struct list_head	list;
 };
@@ -53,7 +54,7 @@ struct ordered_events {
 };
 
 int ordered_events__queue(struct ordered_events *oe, union perf_event *event,
-			  u64 timestamp, u64 file_offset);
+			  u64 timestamp, u64 file_offset, const char *file_path);
 void ordered_events__delete(struct ordered_events *oe, struct ordered_event *event);
 int ordered_events__flush(struct ordered_events *oe, enum oe_flush how);
 int ordered_events__flush_time(struct ordered_events *oe, u64 timestamp);
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index c6605eab61c1..f4e0944c5a2d 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -39,7 +39,8 @@
 
 #ifdef HAVE_ZSTD_SUPPORT
 static int perf_session__process_compressed_event(struct perf_session *session,
-						  union perf_event *event, u64 file_offset)
+						  union perf_event *event, u64 file_offset,
+						  const char *file_path)
 {
 	void *src;
 	size_t decomp_size, src_size;
@@ -61,6 +62,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
 	}
 
 	decomp->file_pos = file_offset;
+	decomp->file_path = file_path;
 	decomp->mmap_len = mmap_len;
 	decomp->head = 0;
 
@@ -100,7 +102,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
 static int perf_session__deliver_event(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_tool *tool,
-				       u64 file_offset);
+				       u64 file_offset,
+				       const char *file_path);
 
 static int perf_session__open(struct perf_session *session, int repipe_fd)
 {
@@ -182,7 +185,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
 						    ordered_events);
 
 	return perf_session__deliver_event(session, event->event,
-					   session->tool, event->file_offset);
+					   session->tool, event->file_offset,
+					   event->file_path);
 }
 
 struct perf_session *__perf_session__new(struct perf_data *data,
@@ -471,7 +475,8 @@ static int process_event_time_conv_stub(struct perf_session *perf_session __mayb
 
 static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
 						       union perf_event *event __maybe_unused,
-						       u64 file_offset __maybe_unused)
+						       u64 file_offset __maybe_unused,
+						       const char *file_path __maybe_unused)
 {
        dump_printf(": unhandled!\n");
        return 0;
@@ -1072,9 +1077,9 @@ static int process_finished_round(struct perf_tool *tool __maybe_unused,
 }
 
 int perf_session__queue_event(struct perf_session *s, union perf_event *event,
-			      u64 timestamp, u64 file_offset)
+			      u64 timestamp, u64 file_offset, const char *file_path)
 {
-	return ordered_events__queue(&s->ordered_events, event, timestamp, file_offset);
+	return ordered_events__queue(&s->ordered_events, event, timestamp, file_offset, file_path);
 }
 
 static void callchain__lbr_callstack_printf(struct perf_sample *sample)
@@ -1277,13 +1282,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
 }
 
 static void dump_event(struct evlist *evlist, union perf_event *event,
-		       u64 file_offset, struct perf_sample *sample)
+		       u64 file_offset, struct perf_sample *sample,
+		       const char *file_path)
 {
 	if (!dump_trace)
 		return;
 
-	printf("\n%#" PRIx64 " [%#x]: event: %d\n",
-	       file_offset, event->header.size, event->header.type);
+	printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
+	       file_offset, file_path, event->header.size, event->header.type);
 
 	trace_event(event);
 	if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
@@ -1486,12 +1492,13 @@ static int machines__deliver_event(struct machines *machines,
 				   struct evlist *evlist,
 				   union perf_event *event,
 				   struct perf_sample *sample,
-				   struct perf_tool *tool, u64 file_offset)
+				   struct perf_tool *tool, u64 file_offset,
+				   const char *file_path)
 {
 	struct evsel *evsel;
 	struct machine *machine;
 
-	dump_event(evlist, event, file_offset, sample);
+	dump_event(evlist, event, file_offset, sample, file_path);
 
 	evsel = evlist__id2evsel(evlist, sample->id);
 
@@ -1572,7 +1579,8 @@ static int machines__deliver_event(struct machines *machines,
 static int perf_session__deliver_event(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_tool *tool,
-				       u64 file_offset)
+				       u64 file_offset,
+				       const char *file_path)
 {
 	struct perf_sample sample;
 	int ret = evlist__parse_sample(session->evlist, event, &sample);
@@ -1589,7 +1597,7 @@ static int perf_session__deliver_event(struct perf_session *session,
 		return 0;
 
 	ret = machines__deliver_event(&session->machines, session->evlist,
-				      event, &sample, tool, file_offset);
+				      event, &sample, tool, file_offset, file_path);
 
 	if (dump_trace && sample.aux_sample.size)
 		auxtrace__dump_auxtrace_sample(session, &sample);
@@ -1599,7 +1607,8 @@ static int perf_session__deliver_event(struct perf_session *session,
 
 static s64 perf_session__process_user_event(struct perf_session *session,
 					    union perf_event *event,
-					    u64 file_offset)
+					    u64 file_offset,
+					    const char *file_path)
 {
 	struct ordered_events *oe = &session->ordered_events;
 	struct perf_tool *tool = session->tool;
@@ -1609,7 +1618,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 
 	if (event->header.type != PERF_RECORD_COMPRESSED ||
 	    tool->compressed == perf_session__process_compressed_event_stub)
-		dump_event(session->evlist, event, file_offset, &sample);
+		dump_event(session->evlist, event, file_offset, &sample, file_path);
 
 	/* These events are processed right away */
 	switch (event->header.type) {
@@ -1668,9 +1677,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 	case PERF_RECORD_HEADER_FEATURE:
 		return tool->feature(session, event);
 	case PERF_RECORD_COMPRESSED:
-		err = tool->compressed(session, event, file_offset);
+		err = tool->compressed(session, event, file_offset, file_path);
 		if (err)
-			dump_event(session->evlist, event, file_offset, &sample);
+			dump_event(session->evlist, event, file_offset, &sample, file_path);
 		return err;
 	default:
 		return -EINVAL;
@@ -1687,9 +1696,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
 	events_stats__inc(&evlist->stats, event->header.type);
 
 	if (event->header.type >= PERF_RECORD_USER_TYPE_START)
-		return perf_session__process_user_event(session, event, 0);
+		return perf_session__process_user_event(session, event, 0, NULL);
 
-	return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
+	return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
 }
 
 static void event_swap(union perf_event *event, bool sample_id_all)
@@ -1786,7 +1795,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
 }
 
 static s64 perf_session__process_event(struct perf_session *session,
-				       union perf_event *event, u64 file_offset)
+				       union perf_event *event, u64 file_offset,
+				       const char *file_path)
 {
 	struct evlist *evlist = session->evlist;
 	struct perf_tool *tool = session->tool;
@@ -1801,7 +1811,7 @@ static s64 perf_session__process_event(struct perf_session *session,
 	events_stats__inc(&evlist->stats, event->header.type);
 
 	if (event->header.type >= PERF_RECORD_USER_TYPE_START)
-		return perf_session__process_user_event(session, event, file_offset);
+		return perf_session__process_user_event(session, event, file_offset, file_path);
 
 	if (tool->ordered_events) {
 		u64 timestamp = -1ULL;
@@ -1810,12 +1820,12 @@ static s64 perf_session__process_event(struct perf_session *session,
 		if (ret && ret != -1)
 			return ret;
 
-		ret = perf_session__queue_event(session, event, timestamp, file_offset);
+		ret = perf_session__queue_event(session, event, timestamp, file_offset, file_path);
 		if (ret != -ETIME)
 			return ret;
 	}
 
-	return perf_session__deliver_event(session, event, tool, file_offset);
+	return perf_session__deliver_event(session, event, tool, file_offset, file_path);
 }
 
 void perf_event_header__bswap(struct perf_event_header *hdr)
@@ -2042,7 +2052,7 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
 		}
 	}
 
-	if ((skip = perf_session__process_event(session, event, head)) < 0) {
+	if ((skip = perf_session__process_event(session, event, head, "pipe")) < 0) {
 		pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
 		       head, event->header.size, event->header.type);
 		err = -EINVAL;
@@ -2139,7 +2149,8 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
 		size = event->header.size;
 
 		if (size < sizeof(struct perf_event_header) ||
-		    (skip = perf_session__process_event(session, event, decomp->file_pos)) < 0) {
+		    (skip = perf_session__process_event(session, event, decomp->file_pos,
+							decomp->file_path)) < 0) {
 			pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
 				decomp->file_pos + decomp->head, event->header.size, event->header.type);
 			return -EINVAL;
@@ -2170,10 +2181,12 @@ struct reader;
 
 typedef s64 (*reader_cb_t)(struct perf_session *session,
 			   union perf_event *event,
-			   u64 file_offset);
+			   u64 file_offset,
+			   const char *file_path);
 
 struct reader {
 	int		 fd;
+	const char	 *path;
 	u64		 data_size;
 	u64		 data_offset;
 	reader_cb_t	 process;
@@ -2293,7 +2306,7 @@ reader__read_event(struct reader *rd, struct perf_session *session,
 	skip = -EINVAL;
 
 	if (size < sizeof(struct perf_event_header) ||
-	    (skip = rd->process(session, event, rd->file_pos)) < 0) {
+	    (skip = rd->process(session, event, rd->file_pos, rd->path)) < 0) {
 		pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
 		       rd->file_offset + rd->head, event->header.size,
 		       event->header.type, strerror(-skip));
@@ -2361,15 +2374,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
 
 static s64 process_simple(struct perf_session *session,
 			  union perf_event *event,
-			  u64 file_offset)
+			  u64 file_offset,
+			  const char *file_path)
 {
-	return perf_session__process_event(session, event, file_offset);
+	return perf_session__process_event(session, event, file_offset, file_path);
 }
 
 static int __perf_session__process_events(struct perf_session *session)
 {
 	struct reader rd = {
 		.fd		= perf_data__fd(session->data),
+		.path		= session->data->file.path,
 		.data_size	= session->header.data_size,
 		.data_offset	= session->header.data_offset,
 		.process	= process_simple,
@@ -2450,6 +2465,7 @@ static int __perf_session__process_dir_events(struct perf_session *session)
 
 	rd[0] = (struct reader) {
 		.fd		 = perf_data__fd(session->data),
+		.path		 = session->data->file.path,
 		.data_size	 = session->header.data_size,
 		.data_offset	 = session->header.data_offset,
 		.process	 = process_simple,
@@ -2468,6 +2484,7 @@ static int __perf_session__process_dir_events(struct perf_session *session)
 			continue;
 		rd[readers] = (struct reader) {
 			.fd		 = data->dir.files[i].fd,
+			.path		 = data->dir.files[i].path,
 			.data_size	 = data->dir.files[i].size,
 			.data_offset	 = 0,
 			.process	 = process_simple,
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 46c854292ad6..34500a3da735 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -52,6 +52,7 @@ struct perf_session {
 struct decomp {
 	struct decomp *next;
 	u64 file_pos;
+	const char *file_path;
 	size_t mmap_len;
 	u64 head;
 	size_t size;
@@ -87,7 +88,7 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
 int perf_session__process_events(struct perf_session *session);
 
 int perf_session__queue_event(struct perf_session *s, union perf_event *event,
-			      u64 timestamp, u64 file_offset);
+			      u64 timestamp, u64 file_offset, const char *file_path);
 
 void perf_tool__fill_defaults(struct perf_tool *tool);
 
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index ef873f2cc38f..f2352dba1875 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
 
 typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
 typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
-typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
+			 const char *str);
 
 typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
 			struct ordered_events *oe);
-- 
2.19.0



* Re: [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats
  2022-01-17 18:34 ` [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats Alexey Bayduraev
@ 2022-01-24 15:28   ` Arnaldo Carvalho de Melo
  2022-01-24 16:39     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-24 15:28 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 17, 2022 at 09:34:31PM +0300, Alexey Bayduraev escreveu:
> Introduce bytes_transferred and bytes_compressed stats so they
> would capture statistics for the related data buffer transfers.
> 
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 25 ++++++++++++++++++++++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 7d0338b5a0e3..0f8488d12f44 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -111,6 +111,8 @@ struct record_thread {
>  	unsigned long long	samples;
>  	unsigned long		waking;
>  	u64			bytes_written;
> +	u64			bytes_transferred;
> +	u64			bytes_compressed;
>  };
>  
>  static __thread struct record_thread *thread;
> @@ -1407,8 +1409,13 @@ static size_t zstd_compress(struct perf_session *session, struct mmap *map,
>  	compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
>  						     max_record_size, process_comp_header);
>  
> -	session->bytes_transferred += src_size;
> -	session->bytes_compressed  += compressed;
> +	if (map && map->file) {
> +		thread->bytes_transferred += src_size;
> +		thread->bytes_compressed  += compressed;
> +	} else {
> +		session->bytes_transferred += src_size;
> +		session->bytes_compressed  += compressed;
> +	}
>  
>  	return compressed;
>  }
> @@ -2098,8 +2105,20 @@ static int record__stop_threads(struct record *rec)
>  	for (t = 1; t < rec->nr_threads; t++)
>  		record__terminate_thread(&thread_data[t]);
>  
> -	for (t = 0; t < rec->nr_threads; t++)
> +	for (t = 0; t < rec->nr_threads; t++) {
>  		rec->samples += thread_data[t].samples;
> +		if (!record__threads_enabled(rec))
> +			continue;
> +		rec->session->bytes_transferred += thread_data[t].bytes_transferred;
> +		rec->session->bytes_compressed += thread_data[t].bytes_compressed;
> +		pr_debug("threads[%d]: samples=%lld, wakes=%ld, ", thread_data[t].tid,
> +			 thread_data[t].samples, thread_data[t].waking);
> +		if (thread_data[t].bytes_transferred && thread_data[t].bytes_compressed)
> +			pr_debug("trasferred=%ld, compressed=%ld\n",
> +				 thread_data[t].bytes_transferred, thread_data[t].bytes_compressed);
> +		else
> +			pr_debug("written=%ld\n", thread_data[t].bytes_written);

In file included from builtin-record.c:22:
builtin-record.c: In function 'record__stop_threads':
builtin-record.c:2138:13: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'u64' {aka 'long long unsigned int'} [-Werror=format=]
 2138 |    pr_debug("trasferred=%ld, compressed=%ld\n",
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
util/debug.h:18:21: note: in definition of macro 'pr_fmt'
   18 | #define pr_fmt(fmt) fmt
      |                     ^~~
builtin-record.c:2138:4: note: in expansion of macro 'pr_debug'
 2138 |    pr_debug("trasferred=%ld, compressed=%ld\n",
      |    ^~~~~~~~
builtin-record.c:2138:27: note: format string is defined here
 2138 |    pr_debug("trasferred=%ld, compressed=%ld\n",
      |                         ~~^
      |                           |
      |                           long int
      |                         %lld
In file included from builtin-record.c:22:
builtin-record.c:2138:13: error: format '%ld' expects argument of type 'long int', but argument 5 has type 'u64' {aka 'long long unsigned int'} [-Werror=format=]
 2138 |    pr_debug("trasferred=%ld, compressed=%ld\n",
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
util/debug.h:18:21: note: in definition of macro 'pr_fmt'
   18 | #define pr_fmt(fmt) fmt
      |                     ^~~
builtin-record.c:2138:4: note: in expansion of macro 'pr_debug'
 2138 |    pr_debug("trasferred=%ld, compressed=%ld\n",
      |    ^~~~~~~~
  LINK    /tmp/build/perf/libtraceevent.a
builtin-record.c:2138:43: note: format string is defined here
 2138 |    pr_debug("trasferred=%ld, compressed=%ld\n",
      |                                         ~~^
      |                                           |
      |                                           long int
      |                                         %lld
In file included from builtin-record.c:22:
builtin-record.c:2141:13: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'u64' {aka 'long long unsigned int'} [-Werror=format=]
 2141 |    pr_debug("written=%ld\n", thread_data[t].bytes_written);
      |             ^~~~~~~~~~~~~~~
util/debug.h:18:21: note: in definition of macro 'pr_fmt'
   18 | #define pr_fmt(fmt) fmt
      |                     ^~~
builtin-record.c:2141:4: note: in expansion of macro 'pr_debug'
 2141 |    pr_debug("written=%ld\n", thread_data[t].bytes_written);
      |    ^~~~~~~~
builtin-record.c:2141:24: note: format string is defined here
 2141 |    pr_debug("written=%ld\n", thread_data[t].bytes_written);
      |                      ~~^
      |                        |
      |                        long int
      |                      %lld

Fixed with the following patch, no need to resend, I'll fix several
other similar issues and put the result in a tmp.perf/thread branch
while I review/test it.

- Arnaldo

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 0f8488d12f446b84..d19d0639c3f1abc0 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -2114,10 +2114,10 @@ static int record__stop_threads(struct record *rec)
 		pr_debug("threads[%d]: samples=%lld, wakes=%ld, ", thread_data[t].tid,
 			 thread_data[t].samples, thread_data[t].waking);
 		if (thread_data[t].bytes_transferred && thread_data[t].bytes_compressed)
-			pr_debug("trasferred=%ld, compressed=%ld\n",
+			pr_debug("transferred=%" PRIu64 ", compressed=%" PRIu64 "\n",
 				 thread_data[t].bytes_transferred, thread_data[t].bytes_compressed);
 		else
-			pr_debug("written=%ld\n", thread_data[t].bytes_written);
+			pr_debug("written=%" PRIu64 "\n", thread_data[t].bytes_written);
 	}
 
 	return 0;


* Re: [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation
  2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
                   ` (15 preceding siblings ...)
  2022-01-17 18:34 ` [PATCH v13 16/16] perf report: Output data file name in raw trace dump Alexey Bayduraev
@ 2022-01-24 15:45 ` Jiri Olsa
  16 siblings, 0 replies; 38+ messages in thread
From: Jiri Olsa @ 2022-01-24 15:45 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Peter Zijlstra, Ingo Molnar, linux-kernel, Andi Kleen,
	Adrian Hunter, Alexander Antonov, Alexei Budankov,
	Riccardo Mancini

On Mon, Jan 17, 2022 at 09:34:20PM +0300, Alexey Bayduraev wrote:
> Changes in v13:
> - fixed error handling in record__mmap_cpu_mask_alloc()
> - removed redundant record__thread_mask_clear()
> - added notes about evlist__ctlfd_update() to [v13 05/16]
> - fixed build on systems w/o pthread_attr_setaffinity_np() and syscall.h
> - fixed samples zeroing before process_buildids()
> - added notes about valid parallel masks to the documentation and sources
> - fixed masks releasing in record__init_thread_cpu_masks
> - added and fixed some error messages

Acked-by: Jiri Olsa <jolsa@redhat.com>

thanks,
jirka

> 
> v12: https://lore.kernel.org/lkml/cover.1637675515.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v12:
> - fixed nr_threads=1 cases
> - fixed "Woken up %ld times" message
> - removed unnecessary record__fini_thread_masks function
> - moved bytes written/compressed statistics to struct record_thread
> - moved all unnecessary debug messages to verbose=2 level
> - renamed "socket" option to "package" for consistency with util/cputopo.h
> - excluded single trace file reading patches
> 
> v11: https://lore.kernel.org/lkml/cover.1629186429.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v11:
> - removed python dependency on zstd (perf test 19)
> - captured tags from Riccardo Mancini 
> 
> v10: https://lore.kernel.org/lkml/cover.1626072008.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v10:
> - renamed fdarray__clone to fdarray__dup_entry_from
> - captured Acked-by: tags by Namhyung Kim for 09/24
> 
> v9: https://lore.kernel.org/lkml/cover.1625227739.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v9:
> - fixes in [v9 01/24]:
>   - move 'nr_threads' to before 'thread_masks'
>   - combined decl+assign into one line in record__thread_mask_alloc
>   - releasing masks inplace in record__alloc_thread_masks
> - split patch [v8 02/22] to [v9 02/24] and [v9 03/24]
> - fixes in [v9 03/24]:
>   - renamed 'struct thread_data' to 'struct record_thread'
>   - moved nr_mmaps after ctlfd_pos
>   - releasing resources inplace in record__thread_data_init_maps
>   - initializing pipes by -1 value
>   - added temporary gettid() wrapper
> - split patch [v8 03/22] to [v9 04/24] and [v9 05/24] 
> - removed upstreamed [v8 09/22]
> - split [v8 10/22] to [v9 12/24] and [v9 13/24]
> - moved --threads documentation to the related patches
> - fixed output of written/compressed stats in [v9 10/24]
> - split patch [v8 12/22] to [v9 15/24] and [v9 16/24]
> - fixed order of error checking for decompressed events in [v9 16/24]
> - merged patch [v8 21/22] with [v9 23/24] and [v9 24/24]
> - moved patch [v8 22/22] to [v9 09/24]
> - added max reader size constant in [v9 24/24]
> 
> v8: https://lore.kernel.org/lkml/cover.1625065643.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v8:
> - captured Acked-by: tags by Namhyung Kim
> - merged with origin/perf/core
> - added patch 21/22 introducing READER_NODATA state
> - added patch 22/22 fixing --max-size option
> 
> v7: https://lore.kernel.org/lkml/cover.1624350588.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v7:
> - fixed possible crash after out_free_threads label
> - added missing pthread_attr_destroy() call
> - added check of correctness of user masks 
> - fixed zsts_data finalization
> 
> v6: https://lore.kernel.org/lkml/cover.1622025774.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v6:
> - fixed leaks and possible double free in record__thread_mask_alloc()
> - fixed leaks in record__init_thread_user_masks()
> - fixed final mmaps flushing for threads id > 0
> - merged with origin/perf/core
> 
> v5: https://lore.kernel.org/lkml/cover.1619781188.git.alexey.v.bayduraev@linux.intel.com/
> 
> Changes in v5:
> - fixed leaks in record__init_thread_masks_spec()
> - fixed leaks after failed realloc
> - replaced "%m" to strerror()
> - added masks examples to the documentation
> - captured Acked-by: tags by Andi Kleen
> - do not allow --thread option for full_auxtrace mode 
> - split patch 06/12 to 06/20 and 07/20
> - split patch 08/12 to 09/20 and 10/20
> - split patches 11/12 and 11/12 to 13/20-20/20
> 
> v4: https://lore.kernel.org/lkml/6c15adcb-6a9d-320e-70b5-957c4c8b6ff2@linux.intel.com/
> 
> Changes in v4:
> - renamed 'comm' structure to 'pipes'
> - moved thread fd/maps messages to verbose=2
> - fixed leaks during allocation of thread_data structures
> - fixed leaks during allocation of thread masks
> - fixed possible fails when releasing thread masks
> 
> v3: https://lore.kernel.org/lkml/7d197a2d-56e2-896d-bf96-6de0a4db1fb8@linux.intel.com/
> 
> Changes in v3:
> - avoided skipped redundant patch 3/15
> - applied "data file" and "data directory" terms allover the patch set
> - captured Acked-by: tags by Namhyung Kim
> - avoided braces where don't needed
> - employed thread local variable for serial trace streaming 
> - added specs for --thread option - core, socket, numa and user defined
> - added parallel loading of data directory files similar to the prototype [1]
> 
> v2: https://lore.kernel.org/lkml/1ec29ed6-0047-d22f-630b-a7f5ccee96b4@linux.intel.com/
> 
> Changes in v2:
> - explicitly added credit tags to patches 6/15 and 15/15,
>   additionally to cites [1], [2]
> - updated description of 3/15 to explicitly mention the reason
>   to open data directories in read access mode (e.g. for perf report)
> - implemented fix for compilation error of 2/15
> - explicitly elaborated on found issues to be resolved for
>   threaded AUX trace capture
> 
> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
> 
> This patch set provides a parallel threaded trace streaming mode for the
> basic perf record operation. The mode mitigates profiling data loss and
> resolves the scalability issues of the serial and asynchronous (--aio)
> trace streaming modes on multicore server systems. The design and
> implementation are based on the prototype [1], [2].
> 
> Parallel threaded mode runs trace streaming threads that read kernel
> data buffers and write the captured data into several data files located
> in a data directory. The layout of trace streaming threads, and their
> mapping to the data buffers they read, can be configured via the value
> of the --threads command line option. A specification value consists of
> masks separated by colons; each mask defines the CPUs to be monitored by
> one thread, and a thread affinity mask can be appended after a slash.
> Thus <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
> specifies a layout of two parallel threads with the corresponding
> assigned CPUs to be monitored. The specification value can also be a
> string, e.g. "cpu", "core" or "socket", meaning creation of one data
> streaming thread per monitored CPU, core or socket, respectively. The
> option provided with no or an empty value defaults to the "cpu" layout,
> creating a data streaming thread for every CPU being monitored.
> Specification masks are filtered by the mask provided via the -C option.
> 
> Parallel streaming mode is compatible with Zstd compression/decompression
> (--compression-level) and external control commands (--control). The mode
> is not enabled for pipe mode, nor for AUX area tracing and related or
> derived modes such as --snapshot or --aux-sample: the initial intent to
> enable AUX area tracing ran into the need to define an optimal way to
> store index data in the data directory. The --switch-output-* and
> --timestamp-filename options are not enabled for parallel streaming
> either, as their use cases are not clear for data directories.
> Asynchronous (--aio) trace streaming and affinity (--affinity) modes are
> mutually exclusive with parallel streaming mode.
> 
> Basic analysis of data directories is provided in perf report mode.
> Raw dump and aggregated reports are available for data directories,
> though still without memory consumption optimizations.
> 
> Tested:
> 
> tools/perf/perf record -o prof.data --threads -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads= -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=cpu -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=core -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=socket -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=numa -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 2,5 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 3,4 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=core -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=numa -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 --compression-level=3 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads -a
> tools/perf/perf record -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
> tools/perf/perf record --threads -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
> 
> tools/perf/perf report -i prof.data
> tools/perf/perf report -i prof.data --call-graph=callee
> tools/perf/perf report -i prof.data --stdio --header
> tools/perf/perf report -i prof.data -D --header
> 
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
> 
> Alexey Bayduraev (16):
>   perf record: Introduce thread affinity and mmap masks
>   tools lib: Introduce fdarray duplicate function
>   perf record: Introduce thread specific data array
>   perf record: Introduce function to propagate control commands
>   perf record: Introduce thread local variable
>   perf record: Stop threads in the end of trace streaming
>   perf record: Start threads in the beginning of trace streaming
>   perf record: Introduce data file at mmap buffer object
>   perf record: Introduce bytes written stats
>   perf record: Introduce compressor at mmap buffer object
>   perf record: Introduce data transferred and compressed stats
>   perf record: Introduce --threads command line option
>   perf record: Extend --threads command line option
>   perf record: Implement compatibility checks
>   perf session: Load data directory files for analysis
>   perf report: Output data file name in raw trace dump
> 
>  tools/lib/api/fd/array.c                 |   17 +
>  tools/lib/api/fd/array.h                 |    1 +
>  tools/perf/Documentation/perf-record.txt |   34 +
>  tools/perf/builtin-inject.c              |    3 +-
>  tools/perf/builtin-kvm.c                 |    2 +-
>  tools/perf/builtin-record.c              | 1164 ++++++++++++++++++++--
>  tools/perf/builtin-top.c                 |    2 +-
>  tools/perf/builtin-trace.c               |    2 +-
>  tools/perf/util/evlist.c                 |   16 +
>  tools/perf/util/evlist.h                 |    1 +
>  tools/perf/util/mmap.c                   |   10 +
>  tools/perf/util/mmap.h                   |    3 +
>  tools/perf/util/ordered-events.c         |    3 +-
>  tools/perf/util/ordered-events.h         |    3 +-
>  tools/perf/util/record.h                 |    2 +
>  tools/perf/util/session.c                |  208 +++-
>  tools/perf/util/session.h                |    3 +-
>  tools/perf/util/tool.h                   |    3 +-
>  18 files changed, 1373 insertions(+), 104 deletions(-)
> 
> -- 
> 2.19.0
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats
  2022-01-24 15:28   ` Arnaldo Carvalho de Melo
@ 2022-01-24 16:39     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-24 16:39 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 24, 2022 at 12:28:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> builtin-record.c:2141:4: note: in expansion of macro 'pr_debug'
>  2141 |    pr_debug("written=%ld\n", thread_data[t].bytes_written);
>       |    ^~~~~~~~
> builtin-record.c:2141:24: note: format string is defined here
>  2141 |    pr_debug("written=%ld\n", thread_data[t].bytes_written);
>       |                      ~~^
>       |                        |
>       |                        long int
>       |                      %lld
> 
> Fixed with the following patch, no need to resend, I'll fix several
> other similar issues and put the result in a tmp.perf/thread branch
> while I review/test it.

Did it: with the fix it builds in all containers. Now to test it and
review patch by patch one more time.

[acme@quaco perf]$ git push acme.korg perf/core:tmp.perf/threaded
Enumerating objects: 134, done.
Counting objects: 100% (134/134), done.
Delta compression using up to 8 threads
Compressing objects: 100% (55/55), done.
Writing objects: 100% (108/108), 23.64 KiB | 4.73 MiB/s, done.
Total 108 (delta 105), reused 55 (delta 53), pack-reused 0
remote: Resolving deltas: 100% (105/105), completed with 25 local objects.
remote: Recorded in the transparency log
remote:  manifest: updated /pub/scm/linux/kernel/git/acme/linux.git
remote: Done in 0.06s
remote: Notifying frontends: dfw ams sin
To gitolite.kernel.org:/pub/scm/linux/kernel/git/acme/linux.git
 * [new branch]                        perf/core -> tmp.perf/threaded
[acme@quaco perf]$

- Arnaldo

> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 0f8488d12f446b84..d19d0639c3f1abc0 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -2114,10 +2114,10 @@ static int record__stop_threads(struct record *rec)
>  		pr_debug("threads[%d]: samples=%lld, wakes=%ld, ", thread_data[t].tid,
>  			 thread_data[t].samples, thread_data[t].waking);
>  		if (thread_data[t].bytes_transferred && thread_data[t].bytes_compressed)
> -			pr_debug("trasferred=%ld, compressed=%ld\n",
> +			pr_debug("transferred=%" PRIu64 ", compressed=%" PRIu64 "\n",
>  				 thread_data[t].bytes_transferred, thread_data[t].bytes_compressed);
>  		else
> -			pr_debug("written=%ld\n", thread_data[t].bytes_written);
> +			pr_debug("written=%" PRIu64 "\n", thread_data[t].bytes_written);
>  	}
>  
>  	return 0;

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-17 18:34 ` [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks Alexey Bayduraev
@ 2022-01-31 21:00   ` Arnaldo Carvalho de Melo
  2022-01-31 21:16     ` Arnaldo Carvalho de Melo
  2022-01-31 22:03     ` Arnaldo Carvalho de Melo
  2022-04-04 22:25   ` Ian Rogers
  1 sibling, 2 replies; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 21:00 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 17, 2022 at 09:34:21PM +0300, Alexey Bayduraev escreveu:
> Introduce affinity and mmap thread masks. Thread affinity mask
> defines CPUs that a thread is allowed to run on. Thread maps
> mask defines mmap data buffers the thread serves to stream
> profiling data from.
> 
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>

Some simplifications I added here while reviewing this patchkit:

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 41998f2140cd5119..53b88c8600624237 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -2213,35 +2213,33 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
 
 static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
 {
-	mask->nbits = nr_bits;
 	mask->bits = bitmap_zalloc(mask->nbits);
 	if (!mask->bits)
 		return -ENOMEM;
 
+	mask->nbits = nr_bits;
 	return 0;
 }
 
 static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
 {
 	bitmap_free(mask->bits);
+	mask->bits = NULL;
 	mask->nbits = 0;
 }
 
 static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
 {
-	int ret;
+	int ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
 
-	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
 	if (ret) {
 		mask->affinity.bits = NULL;
 		return ret;
 	}
 
 	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
-	if (ret) {
+	if (ret)
 		record__mmap_cpu_mask_free(&mask->maps);
-		mask->maps.bits = NULL;
-	}
 
 	return ret;
 }
@@ -2733,18 +2731,14 @@ struct option *record_options = __record_options;
 
 static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
 {
-	int c;
-
-	for (c = 0; c < cpus->nr; c++)
+	for (int c = 0; c < cpus->nr; c++)
 		set_bit(cpus->map[c].cpu, mask->bits);
 }
 
 static void record__free_thread_masks(struct record *rec, int nr_threads)
 {
-	int t;
-
 	if (rec->thread_masks)
-		for (t = 0; t < nr_threads; t++)
+		for (int t = 0; t < nr_threads; t++)
 			record__thread_mask_free(&rec->thread_masks[t]);
 
 	zfree(&rec->thread_masks);
@@ -2752,7 +2746,7 @@ static void record__free_thread_masks(struct record *rec, int nr_threads)
 
 static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
 {
-	int t, ret;
+	int ret;
 
 	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
 	if (!rec->thread_masks) {
@@ -2760,7 +2754,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
 		return -ENOMEM;
 	}
 
-	for (t = 0; t < nr_threads; t++) {
+	for (int t = 0; t < nr_threads; t++) {
 		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
 		if (ret) {
 			pr_err("Failed to allocate thread masks[%d]\n", t);
@@ -2778,9 +2772,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
 
 static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
 {
-	int ret;
-
-	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
+	int ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
 	if (ret)
 		return ret;
 


> ---
>  tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 123 insertions(+)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index bb716c953d02..41998f2140cd 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -87,6 +87,11 @@ struct switch_output {
>  	int		 cur_file;
>  };
>  
> +struct thread_mask {
> +	struct mmap_cpu_mask	maps;
> +	struct mmap_cpu_mask	affinity;
> +};
> +
>  struct record {
>  	struct perf_tool	tool;
>  	struct record_opts	opts;
> @@ -112,6 +117,8 @@ struct record {
>  	struct mmap_cpu_mask	affinity_mask;
>  	unsigned long		output_max_size;	/* = 0: unlimited */
>  	struct perf_debuginfod	debuginfod;
> +	int			nr_threads;
> +	struct thread_mask	*thread_masks;
>  };
>  
>  static volatile int done;
> @@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>  	return 0;
>  }
>  
> +static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
> +{
> +	mask->nbits = nr_bits;
> +	mask->bits = bitmap_zalloc(mask->nbits);
> +	if (!mask->bits)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
> +{
> +	bitmap_free(mask->bits);
> +	mask->nbits = 0;
> +}
> +
> +static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
> +{
> +	int ret;
> +
> +	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> +	if (ret) {
> +		mask->affinity.bits = NULL;
> +		return ret;
> +	}
> +
> +	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> +	if (ret) {
> +		record__mmap_cpu_mask_free(&mask->maps);
> +		mask->maps.bits = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void record__thread_mask_free(struct thread_mask *mask)
> +{
> +	record__mmap_cpu_mask_free(&mask->maps);
> +	record__mmap_cpu_mask_free(&mask->affinity);
> +}
> +
>  static int parse_output_max_size(const struct option *opt,
>  				 const char *str, int unset)
>  {
> @@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
>  
>  struct option *record_options = __record_options;
>  
> +static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
> +{
> +	int c;
> +
> +	for (c = 0; c < cpus->nr; c++)
> +		set_bit(cpus->map[c].cpu, mask->bits);
> +}
> +
> +static void record__free_thread_masks(struct record *rec, int nr_threads)
> +{
> +	int t;
> +
> +	if (rec->thread_masks)
> +		for (t = 0; t < nr_threads; t++)
> +			record__thread_mask_free(&rec->thread_masks[t]);
> +
> +	zfree(&rec->thread_masks);
> +}
> +
> +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> +{
> +	int t, ret;
> +
> +	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> +	if (!rec->thread_masks) {
> +		pr_err("Failed to allocate thread masks\n");
> +		return -ENOMEM;
> +	}
> +
> +	for (t = 0; t < nr_threads; t++) {
> +		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> +		if (ret) {
> +			pr_err("Failed to allocate thread masks[%d]\n", t);
> +			goto out_free;
> +		}
> +	}
> +
> +	return 0;
> +
> +out_free:
> +	record__free_thread_masks(rec, nr_threads);
> +
> +	return ret;
> +}
> +
> +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> +	int ret;
> +
> +	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> +	if (ret)
> +		return ret;
> +
> +	record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
> +
> +	rec->nr_threads = 1;
> +
> +	return 0;
> +}
> +
> +static int record__init_thread_masks(struct record *rec)
> +{
> +	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
> +
> +	return record__init_thread_default_masks(rec, cpus);
> +}
> +
>  int cmd_record(int argc, const char **argv)
>  {
>  	int err;
> @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
>  		goto out;
>  	}
>  
> +	err = record__init_thread_masks(rec);
> +	if (err) {
> +		pr_err("Failed to initialize parallel data streaming masks\n");
> +		goto out;
> +	}
> +
>  	if (rec->opts.nr_cblocks > nr_cblocks_max)
>  		rec->opts.nr_cblocks = nr_cblocks_max;
>  	pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
> @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
>  	symbol__exit();
>  	auxtrace_record__free(rec->itr);
>  out_opts:
> +	record__free_thread_masks(rec, rec->nr_threads);
> +	rec->nr_threads = 0;
>  	evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
>  	return err;
>  }
> -- 
> 2.19.0

-- 

- Arnaldo

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-31 21:00   ` Arnaldo Carvalho de Melo
@ 2022-01-31 21:16     ` Arnaldo Carvalho de Melo
  2022-02-01 11:46       ` Bayduraev, Alexey V
  2022-01-31 22:03     ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 21:16 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 31, 2022 at 06:00:31PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jan 17, 2022 at 09:34:21PM +0300, Alexey Bayduraev escreveu:
> > Introduce affinity and mmap thread masks. Thread affinity mask
> > defines CPUs that a thread is allowed to run on. Thread maps
> > mask defines mmap data buffers the thread serves to stream
> > profiling data from.
> > 
> > Acked-by: Andi Kleen <ak@linux.intel.com>
> > Acked-by: Namhyung Kim <namhyung@gmail.com>
> > Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> > Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> > Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> 
> Some simplifications I added here while reviewing this patchkit:

But then, why allocate these even without using them? I.e. the init
should be left for when we are sure that we'll actually use this, i.e.
when the user asks for parallel mode.

We already have lots of needless initializations, reading of files that
may not be needed, so we should avoid doing things till we really know
that we'll use those allocations, readings, etc.

Anyway, continuing to review, will leave what I have in a separate
branch so that we can continue from there.

- Arnaldo
 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 41998f2140cd5119..53b88c8600624237 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -2213,35 +2213,33 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>  
>  static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
>  {
> -	mask->nbits = nr_bits;
>  	mask->bits = bitmap_zalloc(mask->nbits);
>  	if (!mask->bits)
>  		return -ENOMEM;
>  
> +	mask->nbits = nr_bits;
>  	return 0;
>  }
>  
>  static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
>  {
>  	bitmap_free(mask->bits);
> +	mask->bits = NULL;
>  	mask->nbits = 0;
>  }
>  
>  static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
>  {
> -	int ret;
> +	int ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>  
> -	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>  	if (ret) {
>  		mask->affinity.bits = NULL;
>  		return ret;
>  	}
>  
>  	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> -	if (ret) {
> +	if (ret)
>  		record__mmap_cpu_mask_free(&mask->maps);
> -		mask->maps.bits = NULL;
> -	}
>  
>  	return ret;
>  }
> @@ -2733,18 +2731,14 @@ struct option *record_options = __record_options;
>  
>  static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
>  {
> -	int c;
> -
> -	for (c = 0; c < cpus->nr; c++)
> +	for (int c = 0; c < cpus->nr; c++)
>  		set_bit(cpus->map[c].cpu, mask->bits);
>  }
>  
>  static void record__free_thread_masks(struct record *rec, int nr_threads)
>  {
> -	int t;
> -
>  	if (rec->thread_masks)
> -		for (t = 0; t < nr_threads; t++)
> +		for (int t = 0; t < nr_threads; t++)
>  			record__thread_mask_free(&rec->thread_masks[t]);
>  
>  	zfree(&rec->thread_masks);
> @@ -2752,7 +2746,7 @@ static void record__free_thread_masks(struct record *rec, int nr_threads)
>  
>  static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>  {
> -	int t, ret;
> +	int ret;
>  
>  	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
>  	if (!rec->thread_masks) {
> @@ -2760,7 +2754,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>  		return -ENOMEM;
>  	}
>  
> -	for (t = 0; t < nr_threads; t++) {
> +	for (int t = 0; t < nr_threads; t++) {
>  		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
>  		if (ret) {
>  			pr_err("Failed to allocate thread masks[%d]\n", t);
> @@ -2778,9 +2772,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>  
>  static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>  {
> -	int ret;
> -
> -	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> +	int ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
>  	if (ret)
>  		return ret;
>  
> 
> 
> > ---
> >  tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 123 insertions(+)
> > 
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index bb716c953d02..41998f2140cd 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -87,6 +87,11 @@ struct switch_output {
> >  	int		 cur_file;
> >  };
> >  
> > +struct thread_mask {
> > +	struct mmap_cpu_mask	maps;
> > +	struct mmap_cpu_mask	affinity;
> > +};
> > +
> >  struct record {
> >  	struct perf_tool	tool;
> >  	struct record_opts	opts;
> > @@ -112,6 +117,8 @@ struct record {
> >  	struct mmap_cpu_mask	affinity_mask;
> >  	unsigned long		output_max_size;	/* = 0: unlimited */
> >  	struct perf_debuginfod	debuginfod;
> > +	int			nr_threads;
> > +	struct thread_mask	*thread_masks;
> >  };
> >  
> >  static volatile int done;
> > @@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
> >  	return 0;
> >  }
> >  
> > +static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
> > +{
> > +	mask->nbits = nr_bits;
> > +	mask->bits = bitmap_zalloc(mask->nbits);
> > +	if (!mask->bits)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> > +static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
> > +{
> > +	bitmap_free(mask->bits);
> > +	mask->nbits = 0;
> > +}
> > +
> > +static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
> > +{
> > +	int ret;
> > +
> > +	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> > +	if (ret) {
> > +		mask->affinity.bits = NULL;
> > +		return ret;
> > +	}
> > +
> > +	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> > +	if (ret) {
> > +		record__mmap_cpu_mask_free(&mask->maps);
> > +		mask->maps.bits = NULL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void record__thread_mask_free(struct thread_mask *mask)
> > +{
> > +	record__mmap_cpu_mask_free(&mask->maps);
> > +	record__mmap_cpu_mask_free(&mask->affinity);
> > +}
> > +
> >  static int parse_output_max_size(const struct option *opt,
> >  				 const char *str, int unset)
> >  {
> > @@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
> >  
> >  struct option *record_options = __record_options;
> >  
> > +static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
> > +{
> > +	int c;
> > +
> > +	for (c = 0; c < cpus->nr; c++)
> > +		set_bit(cpus->map[c].cpu, mask->bits);
> > +}
> > +
> > +static void record__free_thread_masks(struct record *rec, int nr_threads)
> > +{
> > +	int t;
> > +
> > +	if (rec->thread_masks)
> > +		for (t = 0; t < nr_threads; t++)
> > +			record__thread_mask_free(&rec->thread_masks[t]);
> > +
> > +	zfree(&rec->thread_masks);
> > +}
> > +
> > +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> > +{
> > +	int t, ret;
> > +
> > +	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> > +	if (!rec->thread_masks) {
> > +		pr_err("Failed to allocate thread masks\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	for (t = 0; t < nr_threads; t++) {
> > +		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> > +		if (ret) {
> > +			pr_err("Failed to allocate thread masks[%d]\n", t);
> > +			goto out_free;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +
> > +out_free:
> > +	record__free_thread_masks(rec, nr_threads);
> > +
> > +	return ret;
> > +}
> > +
> > +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> > +{
> > +	int ret;
> > +
> > +	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> > +	if (ret)
> > +		return ret;
> > +
> > +	record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
> > +
> > +	rec->nr_threads = 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static int record__init_thread_masks(struct record *rec)
> > +{
> > +	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
> > +
> > +	return record__init_thread_default_masks(rec, cpus);
> > +}
> > +
> >  int cmd_record(int argc, const char **argv)
> >  {
> >  	int err;
> > @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
> >  		goto out;
> >  	}
> >  
> > +	err = record__init_thread_masks(rec);
> > +	if (err) {
> > +		pr_err("Failed to initialize parallel data streaming masks\n");
> > +		goto out;
> > +	}
> > +
> >  	if (rec->opts.nr_cblocks > nr_cblocks_max)
> >  		rec->opts.nr_cblocks = nr_cblocks_max;
> >  	pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
> > @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
> >  	symbol__exit();
> >  	auxtrace_record__free(rec->itr);
> >  out_opts:
> > +	record__free_thread_masks(rec, rec->nr_threads);
> > +	rec->nr_threads = 0;
> >  	evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
> >  	return err;
> >  }
> > -- 
> > 2.19.0
> 
> -- 
> 
> - Arnaldo

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 03/16] perf record: Introduce thread specific data array
  2022-01-17 18:34 ` [PATCH v13 03/16] perf record: Introduce thread specific data array Alexey Bayduraev
@ 2022-01-31 21:39   ` Arnaldo Carvalho de Melo
  2022-01-31 22:21     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 21:39 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 17, 2022 at 09:34:23PM +0300, Alexey Bayduraev escreveu:
> Introduce thread specific data object and array of such objects
> to store and manage thread local data. Implement functions to
> allocate, initialize, finalize and release thread specific data.
> 
> The thread-local maps and overwrite_maps arrays keep pointers to
> the mmap buffer objects a given thread serves, selected by its maps
> thread mask. The thread-local pollfd array keeps the event fds
> connected to those mmap buffers, again per the maps thread mask.
> 
> Thread control commands are delivered via thread local comm pipes
> and ctlfd_pos fd. External control commands (--control option)
> are delivered via evlist ctlfd_pos fd and handled by the main
> tool thread.
> 
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>

Some changes to reduce the patch size; I have them in my local tree and
will publish them later.

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1645b40540b870aa..a8c7118a95c6a3fa 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -924,7 +924,7 @@ static void record__thread_data_close_pipes(struct record_thread *thread_data)
 
 static int record__thread_data_init_maps(struct record_thread *thread_data, struct evlist *evlist)
 {
-	int m, tm, nr_mmaps = evlist->core.nr_mmaps;
+	int nr_mmaps = evlist->core.nr_mmaps;
 	struct mmap *mmap = evlist->mmap;
 	struct mmap *overwrite_mmap = evlist->overwrite_mmap;
 	struct perf_cpu_map *cpus = evlist->core.cpus;
@@ -946,7 +946,7 @@ static int record__thread_data_init_maps(struct record_thread *thread_data, stru
 	pr_debug2("thread_data[%p]: nr_mmaps=%d, maps=%p, ow_maps=%p\n", thread_data,
 		 thread_data->nr_mmaps, thread_data->maps, thread_data->overwrite_maps);
 
-	for (m = 0, tm = 0; m < nr_mmaps && tm < thread_data->nr_mmaps; m++) {
+	for (int m = 0, tm = 0; m < nr_mmaps && tm < thread_data->nr_mmaps; m++) {
 		if (test_bit(cpus->map[m].cpu, thread_data->mask->maps.bits)) {
 			if (thread_data->maps) {
 				thread_data->maps[tm] = &mmap[m];
@@ -967,21 +967,18 @@ static int record__thread_data_init_maps(struct record_thread *thread_data, stru
 
 static int record__thread_data_init_pollfd(struct record_thread *thread_data, struct evlist *evlist)
 {
-	int f, tm, pos;
-	struct mmap *map, *overwrite_map;
-
 	fdarray__init(&thread_data->pollfd, 64);
 
-	for (tm = 0; tm < thread_data->nr_mmaps; tm++) {
-		map = thread_data->maps ? thread_data->maps[tm] : NULL;
-		overwrite_map = thread_data->overwrite_maps ?
-				thread_data->overwrite_maps[tm] : NULL;
+	for (int tm = 0; tm < thread_data->nr_mmaps; tm++) {
+		struct mmap *map = thread_data->maps ? thread_data->maps[tm] : NULL,
+			    *overwrite_map = thread_data->overwrite_maps ?
+						thread_data->overwrite_maps[tm] : NULL;
 
-		for (f = 0; f < evlist->core.pollfd.nr; f++) {
+		for (int f = 0; f < evlist->core.pollfd.nr; f++) {
 			void *ptr = evlist->core.pollfd.priv[f].ptr;
 
 			if ((map && ptr == map) || (overwrite_map && ptr == overwrite_map)) {
-				pos = fdarray__dup_entry_from(&thread_data->pollfd, f,
+				int pos = fdarray__dup_entry_from(&thread_data->pollfd, f,
 							      &evlist->core.pollfd);
 				if (pos < 0)
 					return pos;
@@ -996,13 +993,12 @@ static int record__thread_data_init_pollfd(struct record_thread *thread_data, st
 
 static void record__free_thread_data(struct record *rec)
 {
-	int t;
 	struct record_thread *thread_data = rec->thread_data;
 
 	if (thread_data == NULL)
 		return;
 
-	for (t = 0; t < rec->nr_threads; t++) {
+	for (int t = 0; t < rec->nr_threads; t++) {
 		record__thread_data_close_pipes(&thread_data[t]);
 		zfree(&thread_data[t].maps);
 		zfree(&thread_data[t].overwrite_maps);
@@ -1014,20 +1010,18 @@ static void record__free_thread_data(struct record *rec)
 
 static int record__alloc_thread_data(struct record *rec, struct evlist *evlist)
 {
-	int t, ret;
-	struct record_thread *thread_data;
+	struct record_thread *thread_data = rec->thread_data = zalloc(rec->nr_threads * sizeof(*(rec->thread_data)));
+	int ret;
 
-	rec->thread_data = zalloc(rec->nr_threads * sizeof(*(rec->thread_data)));
 	if (!rec->thread_data) {
 		pr_err("Failed to allocate thread data\n");
 		return -ENOMEM;
 	}
-	thread_data = rec->thread_data;
 
-	for (t = 0; t < rec->nr_threads; t++)
+	for (int t = 0; t < rec->nr_threads; t++)
 		record__thread_data_init_pipes(&thread_data[t]);
 
-	for (t = 0; t < rec->nr_threads; t++) {
+	for (int t = 0; t < rec->nr_threads; t++) {
 		thread_data[t].rec = rec;
 		thread_data[t].mask = &rec->thread_masks[t];
 		ret = record__thread_data_init_maps(&thread_data[t], evlist);

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 05/16] perf record: Introduce thread local variable
  2022-01-17 18:34 ` [PATCH v13 05/16] perf record: Introduce thread local variable Alexey Bayduraev
@ 2022-01-31 21:42   ` Arnaldo Carvalho de Melo
  2022-01-31 21:45   ` Arnaldo Carvalho de Melo
  1 sibling, 0 replies; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 21:42 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 17, 2022 at 09:34:25PM +0300, Alexey Bayduraev escreveu:
> Introduce thread local variable and use it for threaded trace streaming.
> Use thread affinity mask instead of record affinity mask in affinity
> modes. Use evlist__ctlfd_update() to propagate control commands from
> thread object to global evlist object to enable evlist__ctlfd_*
> functionality. Move the waking and sample statistics to struct
> record_thread and introduce a record__waking() function to calculate
> the total number of wakeups.
> 
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 140 ++++++++++++++++++++++++------------
>  1 file changed, 94 insertions(+), 46 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 0d4a34c66274..163d261dd293 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -108,8 +108,12 @@ struct record_thread {
>  	struct mmap		**maps;
>  	struct mmap		**overwrite_maps;
>  	struct record		*rec;
> +	unsigned long long	samples;
> +	unsigned long		waking;
>  };
>  
> +static __thread struct record_thread *thread;
> +
>  struct record {
>  	struct perf_tool	tool;
>  	struct record_opts	opts;
> @@ -132,7 +136,6 @@ struct record {
>  	bool			timestamp_boundary;
>  	struct switch_output	switch_output;
>  	unsigned long long	samples;
> -	struct mmap_cpu_mask	affinity_mask;
>  	unsigned long		output_max_size;	/* = 0: unlimited */
>  	struct perf_debuginfod	debuginfod;
>  	int			nr_threads;
> @@ -575,7 +578,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>  		bf   = map->data;
>  	}
>  
> -	rec->samples++;
> +	thread->samples++;
>  	return record__write(rec, map, bf, size);
>  }
>  
> @@ -1315,15 +1318,17 @@ static struct perf_event_header finished_round_event = {
>  static void record__adjust_affinity(struct record *rec, struct mmap *map)
>  {
>  	if (rec->opts.affinity != PERF_AFFINITY_SYS &&
> -	    !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
> -			  rec->affinity_mask.nbits)) {
> -		bitmap_zero(rec->affinity_mask.bits, rec->affinity_mask.nbits);
> -		bitmap_or(rec->affinity_mask.bits, rec->affinity_mask.bits,
> -			  map->affinity_mask.bits, rec->affinity_mask.nbits);
> -		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&rec->affinity_mask),
> -				  (cpu_set_t *)rec->affinity_mask.bits);
> -		if (verbose == 2)
> -			mmap_cpu_mask__scnprintf(&rec->affinity_mask, "thread");
> +	    !bitmap_equal(thread->mask->affinity.bits, map->affinity_mask.bits,
> +			  thread->mask->affinity.nbits)) {
> +		bitmap_zero(thread->mask->affinity.bits, thread->mask->affinity.nbits);
> +		bitmap_or(thread->mask->affinity.bits, thread->mask->affinity.bits,
> +			  map->affinity_mask.bits, thread->mask->affinity.nbits);
> +		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
> +					(cpu_set_t *)thread->mask->affinity.bits);
> +		if (verbose == 2) {
> +			pr_debug("threads[%d]: running on cpu%d: ", thread->tid, sched_getcpu());
> +			mmap_cpu_mask__scnprintf(&thread->mask->affinity, "affinity");
> +		}
>  	}
>  }
>  
> @@ -1364,14 +1369,17 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>  	u64 bytes_written = rec->bytes_written;
>  	int i;
>  	int rc = 0;
> -	struct mmap *maps;
> +	int nr_mmaps;
> +	struct mmap **maps;
>  	int trace_fd = rec->data.file.fd;
>  	off_t off = 0;
>  
>  	if (!evlist)
>  		return 0;
>  
> -	maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
> +	nr_mmaps = thread->nr_mmaps;
> +	maps = overwrite ? thread->overwrite_maps : thread->maps;
> +
>  	if (!maps)
>  		return 0;
>  
> @@ -1381,9 +1389,9 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>  	if (record__aio_enabled(rec))
>  		off = record__aio_get_pos(trace_fd);
>  
> -	for (i = 0; i < evlist->core.nr_mmaps; i++) {
> +	for (i = 0; i < nr_mmaps; i++) {
>  		u64 flush = 0;
> -		struct mmap *map = &maps[i];
> +		struct mmap *map = maps[i];
>  
>  		if (map->core.base) {
>  			record__adjust_affinity(rec, map);
> @@ -1446,6 +1454,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
>  	return record__mmap_read_evlist(rec, rec->evlist, true, synch);
>  }
>  
> +static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
> +					   void *arg __maybe_unused)
> +{
> +	struct perf_mmap *map = fda->priv[fd].ptr;
> +
> +	if (map)
> +		perf_mmap__put(map);

put() operations should accept a NULL pointer, but this one doesn't :-\
I'll keep it like this for now and later make perf_mmap__put() itself
check whether the arg is NULL.

> +}
> +
>  static void record__init_features(struct record *rec)
>  {
>  	struct perf_session *session = rec->session;
> @@ -1869,11 +1886,44 @@ static void record__uniquify_name(struct record *rec)
>  	}
>  }
>  
> +static int record__start_threads(struct record *rec)
> +{
> +	struct record_thread *thread_data = rec->thread_data;
> +
> +	thread = &thread_data[0];
> +
> +	pr_debug("threads[%d]: started on cpu%d\n", thread->tid, sched_getcpu());
> +
> +	return 0;
> +}
> +
> +static int record__stop_threads(struct record *rec)
> +{
> +	int t;
> +	struct record_thread *thread_data = rec->thread_data;
> +
> +	for (t = 0; t < rec->nr_threads; t++)
> +		rec->samples += thread_data[t].samples;
> +
> +	return 0;
> +}
> +
> +static unsigned long record__waking(struct record *rec)
> +{
> +	int t;
> +	unsigned long waking = 0;
> +	struct record_thread *thread_data = rec->thread_data;
> +
> +	for (t = 0; t < rec->nr_threads; t++)
> +		waking += thread_data[t].waking;
> +
> +	return waking;
> +}
> +
>  static int __cmd_record(struct record *rec, int argc, const char **argv)
>  {
>  	int err;
>  	int status = 0;
> -	unsigned long waking = 0;
>  	const bool forks = argc > 0;
>  	struct perf_tool *tool = &rec->tool;
>  	struct record_opts *opts = &rec->opts;
> @@ -1977,7 +2027,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  
>  	if (record__open(rec) != 0) {
>  		err = -1;
> -		goto out_child;
> +		goto out_free_threads;
>  	}
>  	session->header.env.comp_mmap_len = session->evlist->core.mmap_len;
>  
> @@ -1985,7 +2035,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		err = record__kcore_copy(&session->machines.host, data);
>  		if (err) {
>  			pr_err("ERROR: Failed to copy kcore\n");
> -			goto out_child;
> +			goto out_free_threads;
>  		}
>  	}
>  
> @@ -1996,7 +2046,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
>  		pr_err("ERROR: Apply config to BPF failed: %s\n",
>  			 errbuf);
> -		goto out_child;
> +		goto out_free_threads;
>  	}
>  
>  	/*
> @@ -2014,11 +2064,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	if (data->is_pipe) {
>  		err = perf_header__write_pipe(fd);
>  		if (err < 0)
> -			goto out_child;
> +			goto out_free_threads;
>  	} else {
>  		err = perf_session__write_header(session, rec->evlist, fd, false);
>  		if (err < 0)
> -			goto out_child;
> +			goto out_free_threads;
>  	}
>  
>  	err = -1;
> @@ -2026,16 +2076,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	    && !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
>  		pr_err("Couldn't generate buildids. "
>  		       "Use --no-buildid to profile anyway.\n");
> -		goto out_child;
> +		goto out_free_threads;
>  	}
>  
>  	err = record__setup_sb_evlist(rec);
>  	if (err)
> -		goto out_child;
> +		goto out_free_threads;
>  
>  	err = record__synthesize(rec, false);
>  	if (err < 0)
> -		goto out_child;
> +		goto out_free_threads;
>  
>  	if (rec->realtime_prio) {
>  		struct sched_param param;
> @@ -2044,10 +2094,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		if (sched_setscheduler(0, SCHED_FIFO, &param)) {
>  			pr_err("Could not set realtime priority.\n");
>  			err = -1;
> -			goto out_child;
> +			goto out_free_threads;
>  		}
>  	}
>  
> +	if (record__start_threads(rec))
> +		goto out_free_threads;
> +
>  	/*
>  	 * When perf is starting the traced process, all the events
>  	 * (apart from group members) have enable_on_exec=1 set,
> @@ -2118,7 +2171,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	trigger_ready(&switch_output_trigger);
>  	perf_hooks__invoke_record_start();
>  	for (;;) {
> -		unsigned long long hits = rec->samples;
> +		unsigned long long hits = thread->samples;
>  
>  		/*
>  		 * rec->evlist->bkw_mmap_state is possible to be
> @@ -2172,8 +2225,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  
>  			if (!quiet)
>  				fprintf(stderr, "[ perf record: dump data: Woken up %ld times ]\n",
> -					waking);
> -			waking = 0;
> +					record__waking(rec));
> +			thread->waking = 0;
>  			fd = record__switch_output(rec, false);
>  			if (fd < 0) {
>  				pr_err("Failed to switch to new file\n");
> @@ -2187,20 +2240,24 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  				alarm(rec->switch_output.time);
>  		}
>  
> -		if (hits == rec->samples) {
> +		if (hits == thread->samples) {
>  			if (done || draining)
>  				break;
> -			err = evlist__poll(rec->evlist, -1);
> +			err = fdarray__poll(&thread->pollfd, -1);
>  			/*
>  			 * Propagate error, only if there's any. Ignore positive
>  			 * number of returned events and interrupt error.
>  			 */
>  			if (err > 0 || (err < 0 && errno == EINTR))
>  				err = 0;
> -			waking++;
> +			thread->waking++;
>  
> -			if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
> +			if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
> +					    record__thread_munmap_filtered, NULL) == 0)
>  				draining = true;
> +
> +			evlist__ctlfd_update(rec->evlist,
> +				&thread->pollfd.entries[thread->ctlfd_pos]);
>  		}
>  
>  		if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
> @@ -2254,15 +2311,18 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	}
>  
>  	if (!quiet)
> -		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
> +		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n",
> +			record__waking(rec));
>  
>  	if (target__none(&rec->opts.target))
>  		record__synthesize_workload(rec, true);
>  
>  out_child:
> -	evlist__finalize_ctlfd(rec->evlist);
> +	record__stop_threads(rec);
>  	record__mmap_read_all(rec, true);
> +out_free_threads:
>  	record__free_thread_data(rec);
> +	evlist__finalize_ctlfd(rec->evlist);
>  	record__aio_mmap_read_sync(rec);
>  
>  	if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
> @@ -3164,17 +3224,6 @@ int cmd_record(int argc, const char **argv)
>  
>  	symbol__init(NULL);
>  
> -	if (rec->opts.affinity != PERF_AFFINITY_SYS) {
> -		rec->affinity_mask.nbits = cpu__max_cpu().cpu;
> -		rec->affinity_mask.bits = bitmap_zalloc(rec->affinity_mask.nbits);
> -		if (!rec->affinity_mask.bits) {
> -			pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
> -			err = -ENOMEM;
> -			goto out_opts;
> -		}
> -		pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
> -	}
> -
>  	err = record__auxtrace_init(rec);
>  	if (err)
>  		goto out;
> @@ -3323,7 +3372,6 @@ int cmd_record(int argc, const char **argv)
>  
>  	err = __cmd_record(&record, argc, argv);
>  out:
> -	bitmap_free(rec->affinity_mask.bits);
>  	evlist__delete(rec->evlist);
>  	symbol__exit();
>  	auxtrace_record__free(rec->itr);
> -- 
> 2.19.0

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 05/16] perf record: Introduce thread local variable
  2022-01-17 18:34 ` [PATCH v13 05/16] perf record: Introduce thread local variable Alexey Bayduraev
  2022-01-31 21:42   ` Arnaldo Carvalho de Melo
@ 2022-01-31 21:45   ` Arnaldo Carvalho de Melo
  2022-02-01  7:35     ` Bayduraev, Alexey V
  1 sibling, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 21:45 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 17, 2022 at 09:34:25PM +0300, Alexey Bayduraev escreveu:
> Introduce thread local variable and use it for threaded trace streaming.
> Use thread affinity mask instead of record affinity mask in affinity
> modes. Use evlist__ctlfd_update() to propagate control commands from
> thread object to global evlist object to enable evlist__ctlfd_*
> functionality. Move the waking and sample statistics to struct
> record_thread and introduce a record__waking() function to calculate
> the total number of wakeups.
> 
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 140 ++++++++++++++++++++++++------------
>  1 file changed, 94 insertions(+), 46 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 0d4a34c66274..163d261dd293 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -108,8 +108,12 @@ struct record_thread {
>  	struct mmap		**maps;
>  	struct mmap		**overwrite_maps;
>  	struct record		*rec;
> +	unsigned long long	samples;
> +	unsigned long		waking;
>  };
>  
> +static __thread struct record_thread *thread;
> +
>  struct record {
>  	struct perf_tool	tool;
>  	struct record_opts	opts;
> @@ -132,7 +136,6 @@ struct record {
>  	bool			timestamp_boundary;
>  	struct switch_output	switch_output;
>  	unsigned long long	samples;
> -	struct mmap_cpu_mask	affinity_mask;
>  	unsigned long		output_max_size;	/* = 0: unlimited */
>  	struct perf_debuginfod	debuginfod;
>  	int			nr_threads;
> @@ -575,7 +578,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>  		bf   = map->data;
>  	}
>  
> -	rec->samples++;
> +	thread->samples++;
>  	return record__write(rec, map, bf, size);
>  }
>  
> @@ -1315,15 +1318,17 @@ static struct perf_event_header finished_round_event = {
>  static void record__adjust_affinity(struct record *rec, struct mmap *map)
>  {
>  	if (rec->opts.affinity != PERF_AFFINITY_SYS &&
> -	    !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
> -			  rec->affinity_mask.nbits)) {
> -		bitmap_zero(rec->affinity_mask.bits, rec->affinity_mask.nbits);
> -		bitmap_or(rec->affinity_mask.bits, rec->affinity_mask.bits,
> -			  map->affinity_mask.bits, rec->affinity_mask.nbits);
> -		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&rec->affinity_mask),
> -				  (cpu_set_t *)rec->affinity_mask.bits);
> -		if (verbose == 2)
> -			mmap_cpu_mask__scnprintf(&rec->affinity_mask, "thread");
> +	    !bitmap_equal(thread->mask->affinity.bits, map->affinity_mask.bits,
> +			  thread->mask->affinity.nbits)) {
> +		bitmap_zero(thread->mask->affinity.bits, thread->mask->affinity.nbits);
> +		bitmap_or(thread->mask->affinity.bits, thread->mask->affinity.bits,
> +			  map->affinity_mask.bits, thread->mask->affinity.nbits);
> +		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread->mask->affinity),
> +					(cpu_set_t *)thread->mask->affinity.bits);
> +		if (verbose == 2) {
> +			pr_debug("threads[%d]: running on cpu%d: ", thread->tid, sched_getcpu());
> +			mmap_cpu_mask__scnprintf(&thread->mask->affinity, "affinity");
> +		}
>  	}
>  }
>  
> @@ -1364,14 +1369,17 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>  	u64 bytes_written = rec->bytes_written;
>  	int i;
>  	int rc = 0;
> -	struct mmap *maps;
> +	int nr_mmaps;
> +	struct mmap **maps;
>  	int trace_fd = rec->data.file.fd;
>  	off_t off = 0;
>  
>  	if (!evlist)
>  		return 0;
>  
> -	maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
> +	nr_mmaps = thread->nr_mmaps;
> +	maps = overwrite ? thread->overwrite_maps : thread->maps;
> +
>  	if (!maps)
>  		return 0;
>  
> @@ -1381,9 +1389,9 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
>  	if (record__aio_enabled(rec))
>  		off = record__aio_get_pos(trace_fd);
>  
> -	for (i = 0; i < evlist->core.nr_mmaps; i++) {
> +	for (i = 0; i < nr_mmaps; i++) {
>  		u64 flush = 0;
> -		struct mmap *map = &maps[i];
> +		struct mmap *map = maps[i];
>  
>  		if (map->core.base) {
>  			record__adjust_affinity(rec, map);
> @@ -1446,6 +1454,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
>  	return record__mmap_read_evlist(rec, rec->evlist, true, synch);
>  }
>  
> +static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
> +					   void *arg __maybe_unused)
> +{
> +	struct perf_mmap *map = fda->priv[fd].ptr;
> +
> +	if (map)
> +		perf_mmap__put(map);
> +}
> +
>  static void record__init_features(struct record *rec)
>  {
>  	struct perf_session *session = rec->session;
> @@ -1869,11 +1886,44 @@ static void record__uniquify_name(struct record *rec)
>  	}
>  }
>  
> +static int record__start_threads(struct record *rec)
> +{
> +	struct record_thread *thread_data = rec->thread_data;
> +
> +	thread = &thread_data[0];
> +
> +	pr_debug("threads[%d]: started on cpu%d\n", thread->tid, sched_getcpu());
> +
> +	return 0;
> +}
> +
> +static int record__stop_threads(struct record *rec)
> +{
> +	int t;
> +	struct record_thread *thread_data = rec->thread_data;
> +
> +	for (t = 0; t < rec->nr_threads; t++)
> +		rec->samples += thread_data[t].samples;
> +
> +	return 0;
> +}
> +
> +static unsigned long record__waking(struct record *rec)
> +{
> +	int t;
> +	unsigned long waking = 0;
> +	struct record_thread *thread_data = rec->thread_data;
> +
> +	for (t = 0; t < rec->nr_threads; t++)
> +		waking += thread_data[t].waking;
> +
> +	return waking;
> +}
> +
>  static int __cmd_record(struct record *rec, int argc, const char **argv)
>  {
>  	int err;
>  	int status = 0;
> -	unsigned long waking = 0;
>  	const bool forks = argc > 0;
>  	struct perf_tool *tool = &rec->tool;
>  	struct record_opts *opts = &rec->opts;
> @@ -1977,7 +2027,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  
>  	if (record__open(rec) != 0) {
>  		err = -1;
> -		goto out_child;
> +		goto out_free_threads;
>  	}
>  	session->header.env.comp_mmap_len = session->evlist->core.mmap_len;
>  
> @@ -1985,7 +2035,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		err = record__kcore_copy(&session->machines.host, data);
>  		if (err) {
>  			pr_err("ERROR: Failed to copy kcore\n");
> -			goto out_child;
> +			goto out_free_threads;
>  		}
>  	}
>  
> @@ -1996,7 +2046,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
>  		pr_err("ERROR: Apply config to BPF failed: %s\n",
>  			 errbuf);
> -		goto out_child;
> +		goto out_free_threads;
>  	}
>  
>  	/*
> @@ -2014,11 +2064,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	if (data->is_pipe) {
>  		err = perf_header__write_pipe(fd);
>  		if (err < 0)
> -			goto out_child;
> +			goto out_free_threads;
>  	} else {
>  		err = perf_session__write_header(session, rec->evlist, fd, false);
>  		if (err < 0)
> -			goto out_child;
> +			goto out_free_threads;
>  	}
>  
>  	err = -1;
> @@ -2026,16 +2076,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	    && !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
>  		pr_err("Couldn't generate buildids. "
>  		       "Use --no-buildid to profile anyway.\n");
> -		goto out_child;
> +		goto out_free_threads;
>  	}
>  
>  	err = record__setup_sb_evlist(rec);
>  	if (err)
> -		goto out_child;
> +		goto out_free_threads;
>  
>  	err = record__synthesize(rec, false);
>  	if (err < 0)
> -		goto out_child;
> +		goto out_free_threads;
>  
>  	if (rec->realtime_prio) {
>  		struct sched_param param;
> @@ -2044,10 +2094,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		if (sched_setscheduler(0, SCHED_FIFO, &param)) {
>  			pr_err("Could not set realtime priority.\n");
>  			err = -1;
> -			goto out_child;
> +			goto out_free_threads;
>  		}
>  	}
>  
> +	if (record__start_threads(rec))
> +		goto out_free_threads;
> +
>  	/*
>  	 * When perf is starting the traced process, all the events
>  	 * (apart from group members) have enable_on_exec=1 set,
> @@ -2118,7 +2171,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	trigger_ready(&switch_output_trigger);
>  	perf_hooks__invoke_record_start();
>  	for (;;) {
> -		unsigned long long hits = rec->samples;
> +		unsigned long long hits = thread->samples;
>  
>  		/*
>  		 * rec->evlist->bkw_mmap_state is possible to be
> @@ -2172,8 +2225,8 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  
>  			if (!quiet)
>  				fprintf(stderr, "[ perf record: dump data: Woken up %ld times ]\n",
> -					waking);
> -			waking = 0;
> +					record__waking(rec));
> +			thread->waking = 0;
>  			fd = record__switch_output(rec, false);
>  			if (fd < 0) {
>  				pr_err("Failed to switch to new file\n");
> @@ -2187,20 +2240,24 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  				alarm(rec->switch_output.time);
>  		}
>  
> -		if (hits == rec->samples) {
> +		if (hits == thread->samples) {
>  			if (done || draining)
>  				break;
> -			err = evlist__poll(rec->evlist, -1);
> +			err = fdarray__poll(&thread->pollfd, -1);
>  			/*
>  			 * Propagate error, only if there's any. Ignore positive
>  			 * number of returned events and interrupt error.
>  			 */
>  			if (err > 0 || (err < 0 && errno == EINTR))
>  				err = 0;
> -			waking++;
> +			thread->waking++;
>  
> -			if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
> +			if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
> +					    record__thread_munmap_filtered, NULL) == 0)
>  				draining = true;
> +
> +			evlist__ctlfd_update(rec->evlist,
> +				&thread->pollfd.entries[thread->ctlfd_pos]);
>  		}
>  
>  		if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
> @@ -2254,15 +2311,18 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  	}
>  
>  	if (!quiet)
> -		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
> +		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n",
> +			record__waking(rec));
>  
>  	if (target__none(&rec->opts.target))
>  		record__synthesize_workload(rec, true);
>  
>  out_child:
> -	evlist__finalize_ctlfd(rec->evlist);
> +	record__stop_threads(rec);
>  	record__mmap_read_all(rec, true);
> +out_free_threads:
>  	record__free_thread_data(rec);
> +	evlist__finalize_ctlfd(rec->evlist);

You changed the calling order, moving evlist__finalize_ctlfd() to after
record__mmap_read_all(). Is that ok? And if so, shouldn't that be in a
separate patch?

- Arnaldo

>  	record__aio_mmap_read_sync(rec);
>  
>  	if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
> @@ -3164,17 +3224,6 @@ int cmd_record(int argc, const char **argv)
>  
>  	symbol__init(NULL);
>  
> -	if (rec->opts.affinity != PERF_AFFINITY_SYS) {
> -		rec->affinity_mask.nbits = cpu__max_cpu().cpu;
> -		rec->affinity_mask.bits = bitmap_zalloc(rec->affinity_mask.nbits);
> -		if (!rec->affinity_mask.bits) {
> -			pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
> -			err = -ENOMEM;
> -			goto out_opts;
> -		}
> -		pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
> -	}
> -
>  	err = record__auxtrace_init(rec);
>  	if (err)
>  		goto out;
> @@ -3323,7 +3372,6 @@ int cmd_record(int argc, const char **argv)
>  
>  	err = __cmd_record(&record, argc, argv);
>  out:
> -	bitmap_free(rec->affinity_mask.bits);
>  	evlist__delete(rec->evlist);
>  	symbol__exit();
>  	auxtrace_record__free(rec->itr);
> -- 
> 2.19.0

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object
  2022-01-17 18:34 ` [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object Alexey Bayduraev
@ 2022-01-31 21:56   ` Arnaldo Carvalho de Melo
  2022-02-01  8:08     ` Bayduraev, Alexey V
  0 siblings, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 21:56 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 17, 2022 at 09:34:30PM +0300, Alexey Bayduraev escreveu:
> Introduce compressor object into mmap object so it could be used to
> pack the data stream from the corresponding kernel data buffer.
> Initialize and make use of the introduced per mmap compressor.
> 
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 18 +++++++++++-------
>  tools/perf/util/mmap.c      | 10 ++++++++++
>  tools/perf/util/mmap.h      |  2 ++
>  3 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 50981bbc98bb..7d0338b5a0e3 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -246,8 +246,8 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
>  
>  static int record__aio_enabled(struct record *rec);
>  static int record__comp_enabled(struct record *rec);
> -static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
> -			    void *src, size_t src_size);
> +static size_t zstd_compress(struct perf_session *session, struct mmap *map,
> +			    void *dst, size_t dst_size, void *src, size_t src_size);
>  
>  #ifdef HAVE_AIO_SUPPORT
>  static int record__aio_write(struct aiocb *cblock, int trace_fd,
> @@ -381,7 +381,7 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
>  	 */
>  
>  	if (record__comp_enabled(aio->rec)) {
> -		size = zstd_compress(aio->rec->session, aio->data + aio->size,
> +		size = zstd_compress(aio->rec->session, NULL, aio->data + aio->size,
>  				     mmap__mmap_len(map) - aio->size,
>  				     buf, size);
>  	} else {
> @@ -608,7 +608,7 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>  	struct record *rec = to;
>  
>  	if (record__comp_enabled(rec)) {
> -		size = zstd_compress(rec->session, map->data, mmap__mmap_len(map), bf, size);
> +		size = zstd_compress(rec->session, map, map->data, mmap__mmap_len(map), bf, size);
>  		bf   = map->data;
>  	}
>  
> @@ -1394,13 +1394,17 @@ static size_t process_comp_header(void *record, size_t increment)
>  	return size;
>  }
>  
> -static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
> -			    void *src, size_t src_size)
> +static size_t zstd_compress(struct perf_session *session, struct mmap *map,
> +			    void *dst, size_t dst_size, void *src, size_t src_size)
>  {
>  	size_t compressed;
>  	size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct perf_record_compressed) - 1;
> +	struct zstd_data *zstd_data = &session->zstd_data;
>  
> -	compressed = zstd_compress_stream_to_records(&session->zstd_data, dst, dst_size, src, src_size,
> +	if (map && map->file)
> +		zstd_data = &map->zstd_data;
> +
> +	compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
>  						     max_record_size, process_comp_header);
>  
>  	session->bytes_transferred += src_size;
> diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
> index 12261ed8c15b..8bf97d9b8424 100644
> --- a/tools/perf/util/mmap.c
> +++ b/tools/perf/util/mmap.c
> @@ -230,6 +230,10 @@ void mmap__munmap(struct mmap *map)
>  {
>  	bitmap_free(map->affinity_mask.bits);
>  
> +#ifndef PYTHON_PERF
> +	zstd_fini(&map->zstd_data);
> +#endif
> +

Exposing this build detail in the main source code seems ugly; casual
readers will scratch their heads trying to figure it out. So we should
either have this behind some macro in a header file that hides these
deps, or add a comment stating why this is needed.

>  	perf_mmap__aio_munmap(map);
>  	if (map->data != NULL) {
>  		munmap(map->data, mmap__mmap_len(map));
> @@ -292,6 +296,12 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, struct perf_cpu
>  	map->core.flush = mp->flush;
>  
>  	map->comp_level = mp->comp_level;
> +#ifndef PYTHON_PERF
> +	if (zstd_init(&map->zstd_data, map->comp_level)) {
> +		pr_debug2("failed to init mmap commpressor, error %d\n", errno);
> +		return -1;
> +	}
> +#endif
>  
>  	if (map->comp_level && !perf_mmap__aio_enabled(map)) {
>  		map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
> diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
> index 62f38d7977bb..cd8b0777473b 100644
> --- a/tools/perf/util/mmap.h
> +++ b/tools/perf/util/mmap.h
> @@ -15,6 +15,7 @@
>  #endif
>  #include "auxtrace.h"
>  #include "event.h"
> +#include "util/compress.h"
>  
>  struct aiocb;
>  
> @@ -46,6 +47,7 @@ struct mmap {
>  	void		*data;
>  	int		comp_level;
>  	struct perf_data_file *file;
> +	struct zstd_data      zstd_data;
>  };
>  
>  struct mmap_params {
> -- 
> 2.19.0

-- 

- Arnaldo


* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-31 21:00   ` Arnaldo Carvalho de Melo
  2022-01-31 21:16     ` Arnaldo Carvalho de Melo
@ 2022-01-31 22:03     ` Arnaldo Carvalho de Melo
  2022-01-31 22:04       ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 22:03 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 31, 2022 at 06:00:31PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jan 17, 2022 at 09:34:21PM +0300, Alexey Bayduraev escreveu:
> > Introduce affinity and mmap thread masks. Thread affinity mask
> > defines CPUs that a thread is allowed to run on. Thread maps
> > mask defines mmap data buffers the thread serves to stream
> > profiling data from.
> > 
> > Acked-by: Andi Kleen <ak@linux.intel.com>
> > Acked-by: Namhyung Kim <namhyung@gmail.com>
> > Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> > Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> > Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> 
> Some simplifications I added here while reviewing this patchkit:
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 41998f2140cd5119..53b88c8600624237 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -2213,35 +2213,33 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>  
>  static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
>  {
> -	mask->nbits = nr_bits;
>  	mask->bits = bitmap_zalloc(mask->nbits);
>  	if (!mask->bits)
>  		return -ENOMEM;
>  
> +	mask->nbits = nr_bits;
>  	return 0;


Interesting: building it at this point in the patchkit didn't uncover
the bug I introduced; only later, when this gets used, did I get the
compiler error and apply this on top:

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 53b88c8600624237..6b0e506df20c002a 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -2213,7 +2213,7 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
 
 static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
 {
-	mask->bits = bitmap_zalloc(mask->nbits);
+	mask->bits = bitmap_zalloc(nbits);
 	if (!mask->bits)
 		return -ENOMEM;
 

>  }
>  
>  static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
>  {
>  	bitmap_free(mask->bits);
> +	mask->bits = NULL;
>  	mask->nbits = 0;
>  }
>  
>  static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
>  {
> -	int ret;
> +	int ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>  
> -	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>  	if (ret) {
>  		mask->affinity.bits = NULL;
>  		return ret;
>  	}
>  
>  	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> -	if (ret) {
> +	if (ret)
>  		record__mmap_cpu_mask_free(&mask->maps);
> -		mask->maps.bits = NULL;
> -	}
>  
>  	return ret;
>  }
> @@ -2733,18 +2731,14 @@ struct option *record_options = __record_options;
>  
>  static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
>  {
> -	int c;
> -
> -	for (c = 0; c < cpus->nr; c++)
> +	for (int c = 0; c < cpus->nr; c++)
>  		set_bit(cpus->map[c].cpu, mask->bits);
>  }
>  
>  static void record__free_thread_masks(struct record *rec, int nr_threads)
>  {
> -	int t;
> -
>  	if (rec->thread_masks)
> -		for (t = 0; t < nr_threads; t++)
> +		for (int t = 0; t < nr_threads; t++)
>  			record__thread_mask_free(&rec->thread_masks[t]);
>  
>  	zfree(&rec->thread_masks);
> @@ -2752,7 +2746,7 @@ static void record__free_thread_masks(struct record *rec, int nr_threads)
>  
>  static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>  {
> -	int t, ret;
> +	int ret;
>  
>  	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
>  	if (!rec->thread_masks) {
> @@ -2760,7 +2754,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>  		return -ENOMEM;
>  	}
>  
> -	for (t = 0; t < nr_threads; t++) {
> +	for (int t = 0; t < nr_threads; t++) {
>  		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
>  		if (ret) {
>  			pr_err("Failed to allocate thread masks[%d]\n", t);
> @@ -2778,9 +2772,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>  
>  static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>  {
> -	int ret;
> -
> -	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> +	int ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
>  	if (ret)
>  		return ret;
>  
> 
> 
> > ---
> >  tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 123 insertions(+)
> > 
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index bb716c953d02..41998f2140cd 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -87,6 +87,11 @@ struct switch_output {
> >  	int		 cur_file;
> >  };
> >  
> > +struct thread_mask {
> > +	struct mmap_cpu_mask	maps;
> > +	struct mmap_cpu_mask	affinity;
> > +};
> > +
> >  struct record {
> >  	struct perf_tool	tool;
> >  	struct record_opts	opts;
> > @@ -112,6 +117,8 @@ struct record {
> >  	struct mmap_cpu_mask	affinity_mask;
> >  	unsigned long		output_max_size;	/* = 0: unlimited */
> >  	struct perf_debuginfod	debuginfod;
> > +	int			nr_threads;
> > +	struct thread_mask	*thread_masks;
> >  };
> >  
> >  static volatile int done;
> > @@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
> >  	return 0;
> >  }
> >  
> > +static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
> > +{
> > +	mask->nbits = nr_bits;
> > +	mask->bits = bitmap_zalloc(mask->nbits);
> > +	if (!mask->bits)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +
> > +static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
> > +{
> > +	bitmap_free(mask->bits);
> > +	mask->nbits = 0;
> > +}
> > +
> > +static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
> > +{
> > +	int ret;
> > +
> > +	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> > +	if (ret) {
> > +		mask->affinity.bits = NULL;
> > +		return ret;
> > +	}
> > +
> > +	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> > +	if (ret) {
> > +		record__mmap_cpu_mask_free(&mask->maps);
> > +		mask->maps.bits = NULL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void record__thread_mask_free(struct thread_mask *mask)
> > +{
> > +	record__mmap_cpu_mask_free(&mask->maps);
> > +	record__mmap_cpu_mask_free(&mask->affinity);
> > +}
> > +
> >  static int parse_output_max_size(const struct option *opt,
> >  				 const char *str, int unset)
> >  {
> > @@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
> >  
> >  struct option *record_options = __record_options;
> >  
> > +static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
> > +{
> > +	int c;
> > +
> > +	for (c = 0; c < cpus->nr; c++)
> > +		set_bit(cpus->map[c].cpu, mask->bits);
> > +}
> > +
> > +static void record__free_thread_masks(struct record *rec, int nr_threads)
> > +{
> > +	int t;
> > +
> > +	if (rec->thread_masks)
> > +		for (t = 0; t < nr_threads; t++)
> > +			record__thread_mask_free(&rec->thread_masks[t]);
> > +
> > +	zfree(&rec->thread_masks);
> > +}
> > +
> > +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> > +{
> > +	int t, ret;
> > +
> > +	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> > +	if (!rec->thread_masks) {
> > +		pr_err("Failed to allocate thread masks\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	for (t = 0; t < nr_threads; t++) {
> > +		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> > +		if (ret) {
> > +			pr_err("Failed to allocate thread masks[%d]\n", t);
> > +			goto out_free;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +
> > +out_free:
> > +	record__free_thread_masks(rec, nr_threads);
> > +
> > +	return ret;
> > +}
> > +
> > +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> > +{
> > +	int ret;
> > +
> > +	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> > +	if (ret)
> > +		return ret;
> > +
> > +	record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
> > +
> > +	rec->nr_threads = 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static int record__init_thread_masks(struct record *rec)
> > +{
> > +	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
> > +
> > +	return record__init_thread_default_masks(rec, cpus);
> > +}
> > +
> >  int cmd_record(int argc, const char **argv)
> >  {
> >  	int err;
> > @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
> >  		goto out;
> >  	}
> >  
> > +	err = record__init_thread_masks(rec);
> > +	if (err) {
> > +		pr_err("Failed to initialize parallel data streaming masks\n");
> > +		goto out;
> > +	}
> > +
> >  	if (rec->opts.nr_cblocks > nr_cblocks_max)
> >  		rec->opts.nr_cblocks = nr_cblocks_max;
> >  	pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
> > @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
> >  	symbol__exit();
> >  	auxtrace_record__free(rec->itr);
> >  out_opts:
> > +	record__free_thread_masks(rec, rec->nr_threads);
> > +	rec->nr_threads = 0;
> >  	evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
> >  	return err;
> >  }
> > -- 
> > 2.19.0
> 
> -- 
> 
> - Arnaldo

-- 

- Arnaldo


* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-31 22:03     ` Arnaldo Carvalho de Melo
@ 2022-01-31 22:04       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 22:04 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 31, 2022 at 07:03:12PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jan 31, 2022 at 06:00:31PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Mon, Jan 17, 2022 at 09:34:21PM +0300, Alexey Bayduraev escreveu:
> > > Introduce affinity and mmap thread masks. Thread affinity mask
> > > defines CPUs that a thread is allowed to run on. Thread maps
> > > mask defines mmap data buffers the thread serves to stream
> > > profiling data from.
> > > 
> > > Acked-by: Andi Kleen <ak@linux.intel.com>
> > > Acked-by: Namhyung Kim <namhyung@gmail.com>
> > > Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> > > Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> > > Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> > 
> > Some simplifications I added here while reviewing this patchkit:
> > 
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index 41998f2140cd5119..53b88c8600624237 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -2213,35 +2213,33 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
> >  
> >  static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
> >  {
> > -	mask->nbits = nr_bits;
> >  	mask->bits = bitmap_zalloc(mask->nbits);
> >  	if (!mask->bits)
> >  		return -ENOMEM;
> >  
> > +	mask->nbits = nr_bits;
> >  	return 0;
> 
> 
> Interesting, building it at this point in the patchkit didn't uncover
> the bug I introduced, only later when this gets used I got the compiler
> error and applied this on top:
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 53b88c8600624237..6b0e506df20c002a 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -2213,7 +2213,7 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>  
>  static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
>  {
> -	mask->bits = bitmap_zalloc(mask->nbits);
> +	mask->bits = bitmap_zalloc(nbits);

Make that nr_bits, sigh :-\

>  	if (!mask->bits)
>  		return -ENOMEM;
>  
> 
> >  }
> >  
> >  static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
> >  {
> >  	bitmap_free(mask->bits);
> > +	mask->bits = NULL;
> >  	mask->nbits = 0;
> >  }
> >  
> >  static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
> >  {
> > -	int ret;
> > +	int ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> >  
> > -	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> >  	if (ret) {
> >  		mask->affinity.bits = NULL;
> >  		return ret;
> >  	}
> >  
> >  	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> > -	if (ret) {
> > +	if (ret)
> >  		record__mmap_cpu_mask_free(&mask->maps);
> > -		mask->maps.bits = NULL;
> > -	}
> >  
> >  	return ret;
> >  }
> > @@ -2733,18 +2731,14 @@ struct option *record_options = __record_options;
> >  
> >  static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
> >  {
> > -	int c;
> > -
> > -	for (c = 0; c < cpus->nr; c++)
> > +	for (int c = 0; c < cpus->nr; c++)
> >  		set_bit(cpus->map[c].cpu, mask->bits);
> >  }
> >  
> >  static void record__free_thread_masks(struct record *rec, int nr_threads)
> >  {
> > -	int t;
> > -
> >  	if (rec->thread_masks)
> > -		for (t = 0; t < nr_threads; t++)
> > +		for (int t = 0; t < nr_threads; t++)
> >  			record__thread_mask_free(&rec->thread_masks[t]);
> >  
> >  	zfree(&rec->thread_masks);
> > @@ -2752,7 +2746,7 @@ static void record__free_thread_masks(struct record *rec, int nr_threads)
> >  
> >  static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> >  {
> > -	int t, ret;
> > +	int ret;
> >  
> >  	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> >  	if (!rec->thread_masks) {
> > @@ -2760,7 +2754,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
> >  		return -ENOMEM;
> >  	}
> >  
> > -	for (t = 0; t < nr_threads; t++) {
> > +	for (int t = 0; t < nr_threads; t++) {
> >  		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> >  		if (ret) {
> >  			pr_err("Failed to allocate thread masks[%d]\n", t);
> > @@ -2778,9 +2772,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
> >  
> >  static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> >  {
> > -	int ret;
> > -
> > -	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> > +	int ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> >  	if (ret)
> >  		return ret;
> >  
> > 
> > 
> > > ---
> > >  tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 123 insertions(+)
> > > 
> > > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > > index bb716c953d02..41998f2140cd 100644
> > > --- a/tools/perf/builtin-record.c
> > > +++ b/tools/perf/builtin-record.c
> > > @@ -87,6 +87,11 @@ struct switch_output {
> > >  	int		 cur_file;
> > >  };
> > >  
> > > +struct thread_mask {
> > > +	struct mmap_cpu_mask	maps;
> > > +	struct mmap_cpu_mask	affinity;
> > > +};
> > > +
> > >  struct record {
> > >  	struct perf_tool	tool;
> > >  	struct record_opts	opts;
> > > @@ -112,6 +117,8 @@ struct record {
> > >  	struct mmap_cpu_mask	affinity_mask;
> > >  	unsigned long		output_max_size;	/* = 0: unlimited */
> > >  	struct perf_debuginfod	debuginfod;
> > > +	int			nr_threads;
> > > +	struct thread_mask	*thread_masks;
> > >  };
> > >  
> > >  static volatile int done;
> > > @@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
> > >  	return 0;
> > >  }
> > >  
> > > +static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
> > > +{
> > > +	mask->nbits = nr_bits;
> > > +	mask->bits = bitmap_zalloc(mask->nbits);
> > > +	if (!mask->bits)
> > > +		return -ENOMEM;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
> > > +{
> > > +	bitmap_free(mask->bits);
> > > +	mask->nbits = 0;
> > > +}
> > > +
> > > +static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> > > +	if (ret) {
> > > +		mask->affinity.bits = NULL;
> > > +		return ret;
> > > +	}
> > > +
> > > +	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> > > +	if (ret) {
> > > +		record__mmap_cpu_mask_free(&mask->maps);
> > > +		mask->maps.bits = NULL;
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static void record__thread_mask_free(struct thread_mask *mask)
> > > +{
> > > +	record__mmap_cpu_mask_free(&mask->maps);
> > > +	record__mmap_cpu_mask_free(&mask->affinity);
> > > +}
> > > +
> > >  static int parse_output_max_size(const struct option *opt,
> > >  				 const char *str, int unset)
> > >  {
> > > @@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
> > >  
> > >  struct option *record_options = __record_options;
> > >  
> > > +static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
> > > +{
> > > +	int c;
> > > +
> > > +	for (c = 0; c < cpus->nr; c++)
> > > +		set_bit(cpus->map[c].cpu, mask->bits);
> > > +}
> > > +
> > > +static void record__free_thread_masks(struct record *rec, int nr_threads)
> > > +{
> > > +	int t;
> > > +
> > > +	if (rec->thread_masks)
> > > +		for (t = 0; t < nr_threads; t++)
> > > +			record__thread_mask_free(&rec->thread_masks[t]);
> > > +
> > > +	zfree(&rec->thread_masks);
> > > +}
> > > +
> > > +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> > > +{
> > > +	int t, ret;
> > > +
> > > +	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> > > +	if (!rec->thread_masks) {
> > > +		pr_err("Failed to allocate thread masks\n");
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	for (t = 0; t < nr_threads; t++) {
> > > +		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> > > +		if (ret) {
> > > +			pr_err("Failed to allocate thread masks[%d]\n", t);
> > > +			goto out_free;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +out_free:
> > > +	record__free_thread_masks(rec, nr_threads);
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
> > > +
> > > +	rec->nr_threads = 1;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int record__init_thread_masks(struct record *rec)
> > > +{
> > > +	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
> > > +
> > > +	return record__init_thread_default_masks(rec, cpus);
> > > +}
> > > +
> > >  int cmd_record(int argc, const char **argv)
> > >  {
> > >  	int err;
> > > @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
> > >  		goto out;
> > >  	}
> > >  
> > > +	err = record__init_thread_masks(rec);
> > > +	if (err) {
> > > +		pr_err("Failed to initialize parallel data streaming masks\n");
> > > +		goto out;
> > > +	}
> > > +
> > >  	if (rec->opts.nr_cblocks > nr_cblocks_max)
> > >  		rec->opts.nr_cblocks = nr_cblocks_max;
> > >  	pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
> > > @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
> > >  	symbol__exit();
> > >  	auxtrace_record__free(rec->itr);
> > >  out_opts:
> > > +	record__free_thread_masks(rec, rec->nr_threads);
> > > +	rec->nr_threads = 0;
> > >  	evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
> > >  	return err;
> > >  }
> > > -- 
> > > 2.19.0
> > 
> > -- 
> > 
> > - Arnaldo
> 
> -- 
> 
> - Arnaldo

-- 

- Arnaldo


* Re: [PATCH v13 03/16] perf record: Introduce thread specific data array
  2022-01-31 21:39   ` Arnaldo Carvalho de Melo
@ 2022-01-31 22:21     ` Arnaldo Carvalho de Melo
  2022-02-11 16:51       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-01-31 22:21 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 31, 2022 at 06:39:39PM -0300, Arnaldo Carvalho de Melo escreveu:
> Some changes to reduce patch size, I have them in my local tree, will
> publish later.

It's in perf/threaded at:

git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git

Will continue tomorrow, testing it and checking the speedups on my
5950x. I think the things I found so far can be fixed in follow-up
patches, to make progress and have this merged sooner.

I'll try and add committer notes with the test for some 'perf bench'
workload without/with parallel recording, something I missed in your
patch descriptions.

- Arnaldo


* Re: [PATCH v13 05/16] perf record: Introduce thread local variable
  2022-01-31 21:45   ` Arnaldo Carvalho de Melo
@ 2022-02-01  7:35     ` Bayduraev, Alexey V
  0 siblings, 0 replies; 38+ messages in thread
From: Bayduraev, Alexey V @ 2022-02-01  7:35 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

On 01.02.2022 0:45, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jan 17, 2022 at 09:34:25PM +0300, Alexey Bayduraev escreveu:
>> Introduce thread local variable and use it for threaded trace streaming.
>> Use thread affinity mask instead of record affinity mask in affinity
>> modes. Use evlist__ctlfd_update() to propagate control commands from
>> thread object to global evlist object to enable evlist__ctlfd_*
>> functionality. Move waking and sample statistic to struct record_thread
>> and introduce record__waking function to calculate the total number of
>> wakes.

SNIP

>>  	if (record__open(rec) != 0) {
>>  		err = -1;
>> -		goto out_child;
>> +		goto out_free_threads;
>>  	}

SNIP

>>  
>>  out_child:
>> -	evlist__finalize_ctlfd(rec->evlist);
>> +	record__stop_threads(rec);
>>  	record__mmap_read_all(rec, true);
>> +out_free_threads:
>>  	record__free_thread_data(rec);
>> +	evlist__finalize_ctlfd(rec->evlist);
> 
> You changed the calling order, moving evlist__finalize_ctlfd to after
> record__mmap_read_all, is that ok? And if so, should be in a separate
> patch, right?

This is necessary because record__mmap_read_all() must come right after
record__stop_threads() to prevent data loss, but we must deinitialize
ctlfd after out_free_threads, as it was initialized in record__open().

record__mmap_read_all() looks independent of evlist__finalize_ctlfd(),
but I think any deinitialization in the evlist would be safer after
record__mmap_read_all().

Probably adding such notes to this patch will be enough.

Regards,
Alexey

> 
> - Arnaldo
> 
>>  	record__aio_mmap_read_sync(rec);
>>  
>>  	if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
>> @@ -3164,17 +3224,6 @@ int cmd_record(int argc, const char **argv)
>>  
>>  	symbol__init(NULL);
>>  
>> -	if (rec->opts.affinity != PERF_AFFINITY_SYS) {
>> -		rec->affinity_mask.nbits = cpu__max_cpu().cpu;
>> -		rec->affinity_mask.bits = bitmap_zalloc(rec->affinity_mask.nbits);
>> -		if (!rec->affinity_mask.bits) {
>> -			pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
>> -			err = -ENOMEM;
>> -			goto out_opts;
>> -		}
>> -		pr_debug2("thread mask[%zd]: empty\n", rec->affinity_mask.nbits);
>> -	}
>> -
>>  	err = record__auxtrace_init(rec);
>>  	if (err)
>>  		goto out;
>> @@ -3323,7 +3372,6 @@ int cmd_record(int argc, const char **argv)
>>  
>>  	err = __cmd_record(&record, argc, argv);
>>  out:
>> -	bitmap_free(rec->affinity_mask.bits);
>>  	evlist__delete(rec->evlist);
>>  	symbol__exit();
>>  	auxtrace_record__free(rec->itr);
>> -- 
>> 2.19.0
> 


* Re: [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object
  2022-01-31 21:56   ` Arnaldo Carvalho de Melo
@ 2022-02-01  8:08     ` Bayduraev, Alexey V
  0 siblings, 0 replies; 38+ messages in thread
From: Bayduraev, Alexey V @ 2022-02-01  8:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

On 01.02.2022 0:56, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jan 17, 2022 at 09:34:30PM +0300, Alexey Bayduraev escreveu:
>> Introduce compressor object into mmap object so it could be used to
>> pack the data stream from the corresponding kernel data buffer.
>> Initialize and make use of the introduced per mmap compressor.

SNIP

>> --- a/tools/perf/util/mmap.c
>> +++ b/tools/perf/util/mmap.c
>> @@ -230,6 +230,10 @@ void mmap__munmap(struct mmap *map)
>>  {
>>  	bitmap_free(map->affinity_mask.bits);
>>  
>> +#ifndef PYTHON_PERF
>> +	zstd_fini(&map->zstd_data);
>> +#endif
>> +
> 
> Exposing this build detail in the main source code seems ugly, casual
> readers will scratch their heads trying to figure this out, so either
> we should have this behind some macro that hides these deps on a header
> file or add a comment stating why this is needed.

This is a quick fix for the fact that the perf.so library for Python is
not compiled with -lzstd, but it still includes mmap.c. Probably adding
a stub macro to mmap.h would be better.

Regards,
Alexey

> 	
> 
>>  	perf_mmap__aio_munmap(map);
>>  	if (map->data != NULL) {
>>  		munmap(map->data, mmap__mmap_len(map));
>> @@ -292,6 +296,12 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, struct perf_cpu
>>  	map->core.flush = mp->flush;
>>  
>>  	map->comp_level = mp->comp_level;
>> +#ifndef PYTHON_PERF
>> +	if (zstd_init(&map->zstd_data, map->comp_level)) {
>> +	pr_debug2("failed to init mmap compressor, error %d\n", errno);
>> +		return -1;
>> +	}
>> +#endif
>>  
>>  	if (map->comp_level && !perf_mmap__aio_enabled(map)) {
>>  		map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
>> diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
>> index 62f38d7977bb..cd8b0777473b 100644
>> --- a/tools/perf/util/mmap.h
>> +++ b/tools/perf/util/mmap.h
>> @@ -15,6 +15,7 @@
>>  #endif
>>  #include "auxtrace.h"
>>  #include "event.h"
>> +#include "util/compress.h"
>>  
>>  struct aiocb;
>>  
>> @@ -46,6 +47,7 @@ struct mmap {
>>  	void		*data;
>>  	int		comp_level;
>>  	struct perf_data_file *file;
>> +	struct zstd_data      zstd_data;
>>  };
>>  
>>  struct mmap_params {
>> -- 
>> 2.19.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-31 21:16     ` Arnaldo Carvalho de Melo
@ 2022-02-01 11:46       ` Bayduraev, Alexey V
  0 siblings, 0 replies; 38+ messages in thread
From: Bayduraev, Alexey V @ 2022-02-01 11:46 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

On 01.02.2022 0:16, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jan 31, 2022 at 06:00:31PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Mon, Jan 17, 2022 at 09:34:21PM +0300, Alexey Bayduraev escreveu:
>>> Introduce affinity and mmap thread masks. Thread affinity mask
>>> defines CPUs that a thread is allowed to run on. Thread maps
>>> mask defines mmap data buffers the thread serves to stream
>>> profiling data from.
>>>
>>> Acked-by: Andi Kleen <ak@linux.intel.com>
>>> Acked-by: Namhyung Kim <namhyung@gmail.com>
>>> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
>>> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
>>> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
>>
>> Some simplifications I added here while reviewing this patchkit:
> 
> But then, why allocate these even without using them? I.e. the init
> should be left for when we are sure that we'll actually use this, i.e.
> when the user asks for parallel mode.
> 
> We already have lots of needless initializations, reading of files that
> may not be needed, so we should avoid doing things till we really know
> that we'll use those allocations, readings, etc.
> 
> Anyway, continuing to review; I'll leave what I have in a separate
> branch so that we can continue from there.

In the current design, we assume that without the --threads option
nr_threads==1 and we allocate rec->thread_masks and rec->thread_data as
arrays of size 1. We also move some variables from "struct record" to
rec->thread_data and use thread_data[0] (through the "thread" pointer)
in the main thread instead of "struct record". This simplifies the later
implementation.

With another approach we could assume nr_threads==0 and use only the
necessary "struct record" variables, but this would add many
if (record__threads_enabled()) checks.

Regards,
Alexey

> 
> - Arnaldo
>  
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 41998f2140cd5119..53b88c8600624237 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -2213,35 +2213,33 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>>  
>>  static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
>>  {
>> -	mask->nbits = nr_bits;
>>  	mask->bits = bitmap_zalloc(mask->nbits);
>>  	if (!mask->bits)
>>  		return -ENOMEM;
>>  
>> +	mask->nbits = nr_bits;
>>  	return 0;
>>  }
>>  
>>  static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
>>  {
>>  	bitmap_free(mask->bits);
>> +	mask->bits = NULL;
>>  	mask->nbits = 0;
>>  }
>>  
>>  static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
>>  {
>> -	int ret;
>> +	int ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>>  
>> -	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>>  	if (ret) {
>>  		mask->affinity.bits = NULL;
>>  		return ret;
>>  	}
>>  
>>  	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
>> -	if (ret) {
>> +	if (ret)
>>  		record__mmap_cpu_mask_free(&mask->maps);
>> -		mask->maps.bits = NULL;
>> -	}
>>  
>>  	return ret;
>>  }
>> @@ -2733,18 +2731,14 @@ struct option *record_options = __record_options;
>>  
>>  static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
>>  {
>> -	int c;
>> -
>> -	for (c = 0; c < cpus->nr; c++)
>> +	for (int c = 0; c < cpus->nr; c++)
>>  		set_bit(cpus->map[c].cpu, mask->bits);
>>  }
>>  
>>  static void record__free_thread_masks(struct record *rec, int nr_threads)
>>  {
>> -	int t;
>> -
>>  	if (rec->thread_masks)
>> -		for (t = 0; t < nr_threads; t++)
>> +		for (int t = 0; t < nr_threads; t++)
>>  			record__thread_mask_free(&rec->thread_masks[t]);
>>  
>>  	zfree(&rec->thread_masks);
>> @@ -2752,7 +2746,7 @@ static void record__free_thread_masks(struct record *rec, int nr_threads)
>>  
>>  static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>>  {
>> -	int t, ret;
>> +	int ret;
>>  
>>  	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
>>  	if (!rec->thread_masks) {
>> @@ -2760,7 +2754,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>>  		return -ENOMEM;
>>  	}
>>  
>> -	for (t = 0; t < nr_threads; t++) {
>> +	for (int t = 0; t < nr_threads; t++) {
>>  		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
>>  		if (ret) {
>>  			pr_err("Failed to allocate thread masks[%d]\n", t);
>> @@ -2778,9 +2772,7 @@ static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr
>>  
>>  static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>>  {
>> -	int ret;
>> -
>> -	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
>> +	int ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
>>  	if (ret)
>>  		return ret;
>>  
>>
>>
>>> ---
>>>  tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 123 insertions(+)
>>>
>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>> index bb716c953d02..41998f2140cd 100644
>>> --- a/tools/perf/builtin-record.c
>>> +++ b/tools/perf/builtin-record.c
>>> @@ -87,6 +87,11 @@ struct switch_output {
>>>  	int		 cur_file;
>>>  };
>>>  
>>> +struct thread_mask {
>>> +	struct mmap_cpu_mask	maps;
>>> +	struct mmap_cpu_mask	affinity;
>>> +};
>>> +
>>>  struct record {
>>>  	struct perf_tool	tool;
>>>  	struct record_opts	opts;
>>> @@ -112,6 +117,8 @@ struct record {
>>>  	struct mmap_cpu_mask	affinity_mask;
>>>  	unsigned long		output_max_size;	/* = 0: unlimited */
>>>  	struct perf_debuginfod	debuginfod;
>>> +	int			nr_threads;
>>> +	struct thread_mask	*thread_masks;
>>>  };
>>>  
>>>  static volatile int done;
>>> @@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>>>  	return 0;
>>>  }
>>>  
>>> +static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
>>> +{
>>> +	mask->nbits = nr_bits;
>>> +	mask->bits = bitmap_zalloc(mask->nbits);
>>> +	if (!mask->bits)
>>> +		return -ENOMEM;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
>>> +{
>>> +	bitmap_free(mask->bits);
>>> +	mask->nbits = 0;
>>> +}
>>> +
>>> +static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
>>> +	if (ret) {
>>> +		mask->affinity.bits = NULL;
>>> +		return ret;
>>> +	}
>>> +
>>> +	ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
>>> +	if (ret) {
>>> +		record__mmap_cpu_mask_free(&mask->maps);
>>> +		mask->maps.bits = NULL;
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static void record__thread_mask_free(struct thread_mask *mask)
>>> +{
>>> +	record__mmap_cpu_mask_free(&mask->maps);
>>> +	record__mmap_cpu_mask_free(&mask->affinity);
>>> +}
>>> +
>>>  static int parse_output_max_size(const struct option *opt,
>>>  				 const char *str, int unset)
>>>  {
>>> @@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
>>>  
>>>  struct option *record_options = __record_options;
>>>  
>>> +static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
>>> +{
>>> +	int c;
>>> +
>>> +	for (c = 0; c < cpus->nr; c++)
>>> +		set_bit(cpus->map[c].cpu, mask->bits);
>>> +}
>>> +
>>> +static void record__free_thread_masks(struct record *rec, int nr_threads)
>>> +{
>>> +	int t;
>>> +
>>> +	if (rec->thread_masks)
>>> +		for (t = 0; t < nr_threads; t++)
>>> +			record__thread_mask_free(&rec->thread_masks[t]);
>>> +
>>> +	zfree(&rec->thread_masks);
>>> +}
>>> +
>>> +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>>> +{
>>> +	int t, ret;
>>> +
>>> +	rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
>>> +	if (!rec->thread_masks) {
>>> +		pr_err("Failed to allocate thread masks\n");
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	for (t = 0; t < nr_threads; t++) {
>>> +		ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
>>> +		if (ret) {
>>> +			pr_err("Failed to allocate thread masks[%d]\n", t);
>>> +			goto out_free;
>>> +		}
>>> +	}
>>> +
>>> +	return 0;
>>> +
>>> +out_free:
>>> +	record__free_thread_masks(rec, nr_threads);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
>>> +
>>> +	rec->nr_threads = 1;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int record__init_thread_masks(struct record *rec)
>>> +{
>>> +	struct perf_cpu_map *cpus = rec->evlist->core.cpus;
>>> +
>>> +	return record__init_thread_default_masks(rec, cpus);
>>> +}
>>> +
>>>  int cmd_record(int argc, const char **argv)
>>>  {
>>>  	int err;
>>> @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
>>>  		goto out;
>>>  	}
>>>  
>>> +	err = record__init_thread_masks(rec);
>>> +	if (err) {
>>> +		pr_err("Failed to initialize parallel data streaming masks\n");
>>> +		goto out;
>>> +	}
>>> +
>>>  	if (rec->opts.nr_cblocks > nr_cblocks_max)
>>>  		rec->opts.nr_cblocks = nr_cblocks_max;
>>>  	pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
>>> @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
>>>  	symbol__exit();
>>>  	auxtrace_record__free(rec->itr);
>>>  out_opts:
>>> +	record__free_thread_masks(rec, rec->nr_threads);
>>> +	rec->nr_threads = 0;
>>>  	evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
>>>  	return err;
>>>  }
>>> -- 
>>> 2.19.0
>>
>> -- 
>>
>> - Arnaldo
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 03/16] perf record: Introduce thread specific data array
  2022-01-31 22:21     ` Arnaldo Carvalho de Melo
@ 2022-02-11 16:51       ` Arnaldo Carvalho de Melo
  2022-02-11 16:52         ` Arnaldo Carvalho de Melo
  2022-02-11 19:34         ` Alexei Budankov
  0 siblings, 2 replies; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-02-11 16:51 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Mon, Jan 31, 2022 at 07:21:11PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jan 31, 2022 at 06:39:39PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Some changes to reduce patch size, I have them in my local tree, will
> > publish later.
> 
> It's in perf/threaded at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git
> 
> Will continue tomorrow, testing it and checking the speedups on my
> 5950x, I think the things I found so far can be fixed in follow up
> patches, to make progress and have this merged sooner.
> 
> I'll try and add committer notes with the test for some 'perf bench'
> workload without/with parallel recording, something I missed in your
> patch descriptions.

Didn't manage to do that, but my considerations are minor at this point
and plenty of informed people acked, reviewed, tested, so I'm not going
to be the one to prevent this from going upstream.

If we find problems (oh well), we'll fix it and progress.

Thank you, Alexei Budankov, Jiri, Namhyung and Riccardo for working on
making perf scale at the record phase for so long,

I'm pushing this to perf/core, that should get into 5.18.

Also, as a heads up, I'll change 'perf/core' to 'perf/next', to align
with the kool kids out there,

Thanks,

- Arnaldo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 03/16] perf record: Introduce thread specific data array
  2022-02-11 16:51       ` Arnaldo Carvalho de Melo
@ 2022-02-11 16:52         ` Arnaldo Carvalho de Melo
  2022-02-11 19:34         ` Alexei Budankov
  1 sibling, 0 replies; 38+ messages in thread
From: Arnaldo Carvalho de Melo @ 2022-02-11 16:52 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Alexei Budankov, Riccardo Mancini

Em Fri, Feb 11, 2022 at 01:51:16PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Mon, Jan 31, 2022 at 07:21:11PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Mon, Jan 31, 2022 at 06:39:39PM -0300, Arnaldo Carvalho de Melo escreveu:
> > > Some changes to reduce patch size, I have them in my local tree, will
> > > publish later.
> > 
> > It's in perf/threaded at:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git
> > 
> > Will continue tomorrow, testing it and checking the speedups on my
> > 5950x, I think the things I found so far can be fixed in follow up
> > patches, to make progress and have this merged sooner.
> > 
> > I'll try and add committer notes with the test for some 'perf bench'
> > workload without/with parallel recording, something I missed in your
> > patch descriptions.
> 
> Didn't manage to do that, but my considerations are minor at this point
> and plenty of informed people acked, reviewed, tested, so I'm not going
> to be the one to prevent this from going upstream.
> 
> If we find problems (oh well), we'll fix it and progress.
> 
> Thank you, Alexei Budankov, Jiri, Namhyung and Riccardo for working on
> making perf scale at the record phase for so long,
> 
> I'm pushing this to perf/core, that should get into 5.18.
> 
> Also, as a heads up, I'll change 'perf/core' to 'perf/next', to align
> with the kool kids out there,

Something I forgot to add: the current codebase, with this patchset,
passes 'perf test' and 'make -C tools/perf build-test' and also all the
container build tests on a myriad of distros.

- Arnaldo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 03/16] perf record: Introduce thread specific data array
  2022-02-11 16:51       ` Arnaldo Carvalho de Melo
  2022-02-11 16:52         ` Arnaldo Carvalho de Melo
@ 2022-02-11 19:34         ` Alexei Budankov
  1 sibling, 0 replies; 38+ messages in thread
From: Alexei Budankov @ 2022-02-11 19:34 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Alexey Bayduraev
  Cc: Jiri Olsa, Namhyung Kim, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Andi Kleen, Adrian Hunter,
	Alexander Antonov, Riccardo Mancini


On 11.02.2022 19:51, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jan 31, 2022 at 07:21:11PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Mon, Jan 31, 2022 at 06:39:39PM -0300, Arnaldo Carvalho de Melo escreveu:
>>> Some changes to reduce patch size, I have them in my local tree, will
>>> publish later.
>>
>> Its in perf/threaded at:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git
>>
>> Will continue tomorrow, testing it and checking the speedups on my
>> 5950x, I think the things I found so far can be fixed in follow up
>> patches, to make progress and have this merged sooner.
>>
>> I'll try and add committer notes with the test for some 'perf bench'
>> workload without/with parallel recording, something I missed in your
>> patch descriptions.
> 
> Didn't manage to do that, but my considerations are minor at this point
> and plenty of informed people acked, reviewed, tested, so I'm not going
> to be the one to prevent this from going upstream.
> 
> If we find problems (oh well), we'll fix it and progress.
> 
> Thank you, Alexei Budankov, Jiri, Namhyung and Riccardo for working on
> making perf scale at the record phase for so long,
> 
> I'm pushing this to perf/core, that should get into 5.18.

Please also accept my congratulations to
Alexey Bayduraev, Jiri, Namhyung and Riccardo Mancini
who made this happen.
Great work. Thanks!

Regards,
Alexei

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-01-17 18:34 ` [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks Alexey Bayduraev
  2022-01-31 21:00   ` Arnaldo Carvalho de Melo
@ 2022-04-04 22:25   ` Ian Rogers
  2022-04-05 16:21     ` Bayduraev, Alexey V
  1 sibling, 1 reply; 38+ messages in thread
From: Ian Rogers @ 2022-04-04 22:25 UTC (permalink / raw)
  To: Alexey Bayduraev
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Andi Kleen, Adrian Hunter, Alexander Antonov, Alexei Budankov,
	Riccardo Mancini

On Mon, Jan 17, 2022 at 10:38 AM Alexey Bayduraev
<alexey.v.bayduraev@linux.intel.com> wrote:
>
> Introduce affinity and mmap thread masks. Thread affinity mask
> defines CPUs that a thread is allowed to run on. Thread maps
> mask defines mmap data buffers the thread serves to stream
> profiling data from.
>
> Acked-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Namhyung Kim <namhyung@gmail.com>
> Reviewed-by: Riccardo Mancini <rickyman7@gmail.com>
> Tested-by: Riccardo Mancini <rickyman7@gmail.com>
> Signed-off-by: Alexey Bayduraev <alexey.v.bayduraev@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 123 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 123 insertions(+)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index bb716c953d02..41998f2140cd 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -87,6 +87,11 @@ struct switch_output {
>         int              cur_file;
>  };
>
> +struct thread_mask {
> +       struct mmap_cpu_mask    maps;
> +       struct mmap_cpu_mask    affinity;
> +};
> +
>  struct record {
>         struct perf_tool        tool;
>         struct record_opts      opts;
> @@ -112,6 +117,8 @@ struct record {
>         struct mmap_cpu_mask    affinity_mask;
>         unsigned long           output_max_size;        /* = 0: unlimited */
>         struct perf_debuginfod  debuginfod;
> +       int                     nr_threads;
> +       struct thread_mask      *thread_masks;
>  };
>
>  static volatile int done;
> @@ -2204,6 +2211,47 @@ static int record__parse_affinity(const struct option *opt, const char *str, int
>         return 0;
>  }
>
> +static int record__mmap_cpu_mask_alloc(struct mmap_cpu_mask *mask, int nr_bits)
> +{
> +       mask->nbits = nr_bits;
> +       mask->bits = bitmap_zalloc(mask->nbits);
> +       if (!mask->bits)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +static void record__mmap_cpu_mask_free(struct mmap_cpu_mask *mask)
> +{
> +       bitmap_free(mask->bits);
> +       mask->nbits = 0;
> +}
> +
> +static int record__thread_mask_alloc(struct thread_mask *mask, int nr_bits)
> +{
> +       int ret;
> +
> +       ret = record__mmap_cpu_mask_alloc(&mask->maps, nr_bits);
> +       if (ret) {
> +               mask->affinity.bits = NULL;
> +               return ret;
> +       }
> +
> +       ret = record__mmap_cpu_mask_alloc(&mask->affinity, nr_bits);
> +       if (ret) {
> +               record__mmap_cpu_mask_free(&mask->maps);
> +               mask->maps.bits = NULL;
> +       }
> +
> +       return ret;
> +}
> +
> +static void record__thread_mask_free(struct thread_mask *mask)
> +{
> +       record__mmap_cpu_mask_free(&mask->maps);
> +       record__mmap_cpu_mask_free(&mask->affinity);
> +}
> +
>  static int parse_output_max_size(const struct option *opt,
>                                  const char *str, int unset)
>  {
> @@ -2683,6 +2731,73 @@ static struct option __record_options[] = {
>
>  struct option *record_options = __record_options;
>
> +static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_cpu_map *cpus)
> +{
> +       int c;
> +       for (c = 0; c < cpus->nr; c++)
> +               set_bit(cpus->map[c].cpu, mask->bits);
> +}
> +

In per-thread mode it is possible that cpus is the dummy CPU map here.
This means that the cpu below has the value -1 and setting bit -1
actually has the effect of setting bit 63. Here is a reproduction
based on the acme/perf/core branch:

```
$ make STATIC=1 DEBUG=1 EXTRA_CFLAGS='-fno-omit-frame-pointer
-fsanitize=undefined -fno-sanitize-recover'
$ perf record -o /tmp/perf.data  --per-thread true
tools/include/asm-generic/bitops/atomic.h:10:36: runtime error: shift
exponent -1 is negative
$ UBSAN_OPTIONS=abort_on_error=1 gdb --args perf record -o
/tmp/perf.data --per-thread true
(gdb) r
tools/include/asm-generic/bitops/atomic.h:10:36: runtime error: shift
exponent -1 is negative
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007ffff71d2546 in __GI_abort () at abort.c:79
#2  0x00007ffff640db9f in __sanitizer::Abort () at
../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:151
#3  0x00007ffff6418efc in __sanitizer::Die () at
../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
#4  0x00007ffff63fd99e in
__ubsan::__ubsan_handle_shift_out_of_bounds_abort (Data=<optimized
out>, LHS=<optimized out>,
    RHS=<optimized out>) at
../../../../src/libsanitizer/ubsan/ubsan_handlers.cpp:378
#5  0x0000555555c54405 in set_bit (nr=-1, addr=0x555556ecd0a0)
    at tools/include/asm-generic/bitops/atomic.h:10
#6  0x0000555555c6ddaf in record__mmap_cpu_mask_init
(mask=0x555556ecd070, cpus=0x555556ecd050) at builtin-record.c:3333
#7  0x0000555555c7044c in record__init_thread_default_masks
(rec=0x55555681b100 <record>, cpus=0x555556ecd050) at
builtin-record.c:3668
#8  0x0000555555c705b3 in record__init_thread_masks
(rec=0x55555681b100 <record>) at builtin-record.c:3681
#9  0x0000555555c7297a in cmd_record (argc=1, argv=0x7fffffffdcc0) at
builtin-record.c:3976
#10 0x0000555555e06d41 in run_builtin (p=0x555556827538
<commands+216>, argc=5, argv=0x7fffffffdcc0) at perf.c:313
#11 0x0000555555e07253 in handle_internal_command (argc=5,
argv=0x7fffffffdcc0) at perf.c:365
#12 0x0000555555e07508 in run_argv (argcp=0x7fffffffdb0c,
argv=0x7fffffffdb00) at perf.c:409
#13 0x0000555555e07b32 in main (argc=5, argv=0x7fffffffdcc0) at perf.c:539
```

Not setting the mask->bits when the cpu map is dummy causes no data to
be written. Setting bit 0 in mask->bits causes a segv. Setting bit 63
works, but it feels like more invariants are broken in the code.

Here is a workaround patch (not a good one):

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba74fab02e62..62727b676f98 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3329,6 +3329,11 @@ static void record__mmap_cpu_mask_init(struct
mmap_cpu_mask *mask, struct perf_c
 {
        int c;

+       if (cpu_map__is_dummy(cpus)) {
+               set_bit(63, mask->bits);
+               return;
+       }
+
        for (c = 0; c < cpus->nr; c++)
                set_bit(cpus->map[c].cpu, mask->bits);
 }

Alexey, what should the expected behavior be with per-thread mmaps?

Thanks,
Ian

> +static void record__free_thread_masks(struct record *rec, int nr_threads)
> +{
> +       int t;
> +
> +       if (rec->thread_masks)
> +               for (t = 0; t < nr_threads; t++)
> +                       record__thread_mask_free(&rec->thread_masks[t]);
> +
> +       zfree(&rec->thread_masks);
> +}
> +
> +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> +{
> +       int t, ret;
> +
> +       rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> +       if (!rec->thread_masks) {
> +               pr_err("Failed to allocate thread masks\n");
> +               return -ENOMEM;
> +       }
> +
> +       for (t = 0; t < nr_threads; t++) {
> +               ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> +               if (ret) {
> +                       pr_err("Failed to allocate thread masks[%d]\n", t);
> +                       goto out_free;
> +               }
> +       }
> +
> +       return 0;
> +
> +out_free:
> +       record__free_thread_masks(rec, nr_threads);
> +
> +       return ret;
> +}
> +
> +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> +{
> +       int ret;
> +
> +       ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> +       if (ret)
> +               return ret;
> +
> +       record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
> +
> +       rec->nr_threads = 1;
> +
> +       return 0;
> +}
> +
> +static int record__init_thread_masks(struct record *rec)
> +{
> +       struct perf_cpu_map *cpus = rec->evlist->core.cpus;
> +
> +       return record__init_thread_default_masks(rec, cpus);
> +}
> +
>  int cmd_record(int argc, const char **argv)
>  {
>         int err;
> @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
>                 goto out;
>         }
>
> +       err = record__init_thread_masks(rec);
> +       if (err) {
> +               pr_err("Failed to initialize parallel data streaming masks\n");
> +               goto out;
> +       }
> +
>         if (rec->opts.nr_cblocks > nr_cblocks_max)
>                 rec->opts.nr_cblocks = nr_cblocks_max;
>         pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
> @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
>         symbol__exit();
>         auxtrace_record__free(rec->itr);
>  out_opts:
> +       record__free_thread_masks(rec, rec->nr_threads);
> +       rec->nr_threads = 0;
>         evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
>         return err;
>  }
> --
> 2.19.0
>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-04-04 22:25   ` Ian Rogers
@ 2022-04-05 16:21     ` Bayduraev, Alexey V
  2022-04-06 16:46       ` Ian Rogers
  0 siblings, 1 reply; 38+ messages in thread
From: Bayduraev, Alexey V @ 2022-04-05 16:21 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Andi Kleen, Adrian Hunter, Alexander Antonov, Alexei Budankov,
	Riccardo Mancini

On 05.04.2022 1:25, Ian Rogers wrote:
> On Mon, Jan 17, 2022 at 10:38 AM Alexey Bayduraev
> <alexey.v.bayduraev@linux.intel.com> wrote:
>>
>> Introduce affinity and mmap thread masks. Thread affinity mask

<SNIP>

> 
> In per-thread mode it is possible that cpus is the dummy CPU map here.
> This means that the cpu below has the value -1 and setting bit -1
> actually has the effect of setting bit 63. Here is a reproduction
> based on the acme/perf/core branch:
> 
> ```
> $ make STATIC=1 DEBUG=1 EXTRA_CFLAGS='-fno-omit-frame-pointer
> -fsanitize=undefined -fno-sanitize-recover'
> $ perf record -o /tmp/perf.data  --per-thread true
> tools/include/asm-generic/bitops/atomic.h:10:36: runtime error: shift
> exponent -1 is negative
> $ UBSAN_OPTIONS=abort_on_error=1 gdb --args perf record -o
> /tmp/perf.data --per-thread true
> (gdb) r
> tools/include/asm-generic/bitops/atomic.h:10:36: runtime error: shift
> exponent -1 is negative
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
> #1  0x00007ffff71d2546 in __GI_abort () at abort.c:79
> #2  0x00007ffff640db9f in __sanitizer::Abort () at
> ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:151
> #3  0x00007ffff6418efc in __sanitizer::Die () at
> ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
> #4  0x00007ffff63fd99e in
> __ubsan::__ubsan_handle_shift_out_of_bounds_abort (Data=<optimized
> out>, LHS=<optimized out>,
>     RHS=<optimized out>) at
> ../../../../src/libsanitizer/ubsan/ubsan_handlers.cpp:378
> #5  0x0000555555c54405 in set_bit (nr=-1, addr=0x555556ecd0a0)
>     at tools/include/asm-generic/bitops/atomic.h:10
> #6  0x0000555555c6ddaf in record__mmap_cpu_mask_init
> (mask=0x555556ecd070, cpus=0x555556ecd050) at builtin-record.c:3333
> #7  0x0000555555c7044c in record__init_thread_default_masks
> (rec=0x55555681b100 <record>, cpus=0x555556ecd050) at
> builtin-record.c:3668
> #8  0x0000555555c705b3 in record__init_thread_masks
> (rec=0x55555681b100 <record>) at builtin-record.c:3681
> #9  0x0000555555c7297a in cmd_record (argc=1, argv=0x7fffffffdcc0) at
> builtin-record.c:3976
> #10 0x0000555555e06d41 in run_builtin (p=0x555556827538
> <commands+216>, argc=5, argv=0x7fffffffdcc0) at perf.c:313
> #11 0x0000555555e07253 in handle_internal_command (argc=5,
> argv=0x7fffffffdcc0) at perf.c:365
> #12 0x0000555555e07508 in run_argv (argcp=0x7fffffffdb0c,
> argv=0x7fffffffdb00) at perf.c:409
> #13 0x0000555555e07b32 in main (argc=5, argv=0x7fffffffdcc0) at perf.c:539
> ```
> 
> Not setting the mask->bits when the cpu map is dummy causes no data to
> be written. Setting bit 0 in mask->bits causes a segv. Setting bit 63
> works, but it feels like more invariants are broken in the code.
> 
> Here is a workaround patch (not a good one):
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index ba74fab02e62..62727b676f98 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -3329,6 +3329,11 @@ static void record__mmap_cpu_mask_init(struct
> mmap_cpu_mask *mask, struct perf_c
>  {
>         int c;
> 
> +       if (cpu_map__is_dummy(cpus)) {
> +               set_bit(63, mask->bits);
> +               return;
> +       }
> +
>         for (c = 0; c < cpus->nr; c++)
>                 set_bit(cpus->map[c].cpu, mask->bits);
>  }
> 
> Alexey, what should the expected behavior be with per-thread mmaps?
> 
> Thanks,
> Ian

Thanks a lot,

In the per-thread mmap case we should initialize thread_data[0]->maps[i]
with evlist->mmap[i]. It looks like this was missed by this patchset.

Your patch works because it triggers test_bit() in record__thread_data_init_maps(),
so the thread_data maps get correctly initialized.

However, it would be better to ignore thread_data->masks in record__mmap_cpu_mask_init()
and set up the thread_data maps explicitly for the per-thread case.

Also, to prevent more runtime crashes, --per-thread and --threads 
options should be mutually exclusive.

I will prepare a fix for this issue soon.

Regards,
Alexey

> 
>> +static void record__free_thread_masks(struct record *rec, int nr_threads)
>> +{
>> +       int t;
>> +
>> +       if (rec->thread_masks)
>> +               for (t = 0; t < nr_threads; t++)
>> +                       record__thread_mask_free(&rec->thread_masks[t]);
>> +
>> +       zfree(&rec->thread_masks);
>> +}
>> +
>> +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
>> +{
>> +       int t, ret;
>> +
>> +       rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
>> +       if (!rec->thread_masks) {
>> +               pr_err("Failed to allocate thread masks\n");
>> +               return -ENOMEM;
>> +       }
>> +
>> +       for (t = 0; t < nr_threads; t++) {
>> +               ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
>> +               if (ret) {
>> +                       pr_err("Failed to allocate thread masks[%d]\n", t);
>> +                       goto out_free;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +
>> +out_free:
>> +       record__free_thread_masks(rec, nr_threads);
>> +
>> +       return ret;
>> +}
>> +
>> +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
>> +{
>> +       int ret;
>> +
>> +       ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
>> +       if (ret)
>> +               return ret;
>> +
>> +       record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
>> +
>> +       rec->nr_threads = 1;
>> +
>> +       return 0;
>> +}
>> +
>> +static int record__init_thread_masks(struct record *rec)
>> +{
>> +       struct perf_cpu_map *cpus = rec->evlist->core.cpus;
>> +
>> +       return record__init_thread_default_masks(rec, cpus);
>> +}
>> +
>>  int cmd_record(int argc, const char **argv)
>>  {
>>         int err;
>> @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
>>                 goto out;
>>         }
>>
>> +       err = record__init_thread_masks(rec);
>> +       if (err) {
>> +               pr_err("Failed to initialize parallel data streaming masks\n");
>> +               goto out;
>> +       }
>> +
>>         if (rec->opts.nr_cblocks > nr_cblocks_max)
>>                 rec->opts.nr_cblocks = nr_cblocks_max;
>>         pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
>> @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
>>         symbol__exit();
>>         auxtrace_record__free(rec->itr);
>>  out_opts:
>> +       record__free_thread_masks(rec, rec->nr_threads);
>> +       rec->nr_threads = 0;
>>         evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
>>         return err;
>>  }
>> --
>> 2.19.0
>>


* Re: [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks
  2022-04-05 16:21     ` Bayduraev, Alexey V
@ 2022-04-06 16:46       ` Ian Rogers
  0 siblings, 0 replies; 38+ messages in thread
From: Ian Rogers @ 2022-04-06 16:46 UTC (permalink / raw)
  To: Bayduraev, Alexey V
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Namhyung Kim,
	Alexander Shishkin, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Andi Kleen, Adrian Hunter, Alexander Antonov, Alexei Budankov,
	Riccardo Mancini

On Tue, Apr 5, 2022 at 9:21 AM Bayduraev, Alexey V
<alexey.v.bayduraev@linux.intel.com> wrote:
>
> On 05.04.2022 1:25, Ian Rogers wrote:
> > On Mon, Jan 17, 2022 at 10:38 AM Alexey Bayduraev
> > <alexey.v.bayduraev@linux.intel.com> wrote:
> >>
> >> Introduce affinity and mmap thread masks. Thread affinity mask
>
> <SNIP>
>
> >
> > In per-thread mode it is possible that cpus is the dummy CPU map here.
> > This means that the cpu below has the value -1 and setting bit -1
> > actually has the effect of setting bit 63. Here is a reproduction
> > based on the acme/perf/core branch:
> >
> > ```
> > $ make STATIC=1 DEBUG=1 EXTRA_CFLAGS='-fno-omit-frame-pointer
> > -fsanitize=undefined -fno-sanitize-recover'
> > $ perf record -o /tmp/perf.data  --per-thread true
> > tools/include/asm-generic/bitops/atomic.h:10:36: runtime error: shift
> > exponent -1 is negative
> > $ UBSAN_OPTIONS=abort_on_error=1 gdb --args perf record -o
> > /tmp/perf.data --per-thread true
> > (gdb) r
> > tools/include/asm-generic/bitops/atomic.h:10:36: runtime error: shift
> > exponent -1 is negative
> > (gdb) bt
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
> > #1  0x00007ffff71d2546 in __GI_abort () at abort.c:79
> > #2  0x00007ffff640db9f in __sanitizer::Abort () at
> > ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:151
> > #3  0x00007ffff6418efc in __sanitizer::Die () at
> > ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
> > #4  0x00007ffff63fd99e in
> > __ubsan::__ubsan_handle_shift_out_of_bounds_abort (Data=<optimized
> > out>, LHS=<optimized out>,
> >     RHS=<optimized out>) at
> > ../../../../src/libsanitizer/ubsan/ubsan_handlers.cpp:378
> > #5  0x0000555555c54405 in set_bit (nr=-1, addr=0x555556ecd0a0)
> >     at tools/include/asm-generic/bitops/atomic.h:10
> > #6  0x0000555555c6ddaf in record__mmap_cpu_mask_init
> > (mask=0x555556ecd070, cpus=0x555556ecd050) at builtin-record.c:3333
> > #7  0x0000555555c7044c in record__init_thread_default_masks
> > (rec=0x55555681b100 <record>, cpus=0x555556ecd050) at
> > builtin-record.c:3668
> > #8  0x0000555555c705b3 in record__init_thread_masks
> > (rec=0x55555681b100 <record>) at builtin-record.c:3681
> > #9  0x0000555555c7297a in cmd_record (argc=1, argv=0x7fffffffdcc0) at
> > builtin-record.c:3976
> > #10 0x0000555555e06d41 in run_builtin (p=0x555556827538
> > <commands+216>, argc=5, argv=0x7fffffffdcc0) at perf.c:313
> > #11 0x0000555555e07253 in handle_internal_command (argc=5,
> > argv=0x7fffffffdcc0) at perf.c:365
> > #12 0x0000555555e07508 in run_argv (argcp=0x7fffffffdb0c,
> > argv=0x7fffffffdb00) at perf.c:409
> > #13 0x0000555555e07b32 in main (argc=5, argv=0x7fffffffdcc0) at perf.c:539
> > ```
> >
> > Not setting mask->bits at all when the cpu map is dummy causes no data
> > to be written, while setting bit 0 causes a segv. Setting bit 63 works,
> > but it feels like more invariants are broken in the code.
> >
> > Here is a workaround patch, though not a proper fix:
> >
> > diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> > index ba74fab02e62..62727b676f98 100644
> > --- a/tools/perf/builtin-record.c
> > +++ b/tools/perf/builtin-record.c
> > @@ -3329,6 +3329,11 @@ static void record__mmap_cpu_mask_init(struct mmap_cpu_mask *mask, struct perf_c
> >  {
> >         int c;
> >
> > +       if (cpu_map__is_dummy(cpus)) {
> > +               set_bit(63, mask->bits);
> > +               return;
> > +       }
> > +
> >         for (c = 0; c < cpus->nr; c++)
> >                 set_bit(cpus->map[c].cpu, mask->bits);
> >  }
> >
> > Alexey, what should the expected behavior be with per-thread mmaps?
> >
> > Thanks,
> > Ian
>
> Thanks a lot,
>
> In the per-thread mmap case we should initialize thread_data[0]->maps[i] with
> evlist->mmap[i]. It looks like this was missed in this patchset.
>
> Your patch works because it triggers test_bit() in record__thread_data_init_maps(),
> so the thread_data maps get correctly initialized.
>
> However, it's better to ignore thread_data->masks in record__mmap_cpu_mask_init()
> and set up the thread_data maps explicitly for the per-thread case.
>
> Also, to prevent more runtime crashes, --per-thread and --threads
> options should be mutually exclusive.
>
> I will prepare a fix for this issue soon.

Hi Alexey,

sorry to nag (I'm being nagged myself): is there an ETA on the fix? Would
it be pragmatic to roll back for now?

Thanks,
Ian

> Regards,
> Alexey
>
> >
> >> +static void record__free_thread_masks(struct record *rec, int nr_threads)
> >> +{
> >> +       int t;
> >> +
> >> +       if (rec->thread_masks)
> >> +               for (t = 0; t < nr_threads; t++)
> >> +                       record__thread_mask_free(&rec->thread_masks[t]);
> >> +
> >> +       zfree(&rec->thread_masks);
> >> +}
> >> +
> >> +static int record__alloc_thread_masks(struct record *rec, int nr_threads, int nr_bits)
> >> +{
> >> +       int t, ret;
> >> +
> >> +       rec->thread_masks = zalloc(nr_threads * sizeof(*(rec->thread_masks)));
> >> +       if (!rec->thread_masks) {
> >> +               pr_err("Failed to allocate thread masks\n");
> >> +               return -ENOMEM;
> >> +       }
> >> +
> >> +       for (t = 0; t < nr_threads; t++) {
> >> +               ret = record__thread_mask_alloc(&rec->thread_masks[t], nr_bits);
> >> +               if (ret) {
> >> +                       pr_err("Failed to allocate thread masks[%d]\n", t);
> >> +                       goto out_free;
> >> +               }
> >> +       }
> >> +
> >> +       return 0;
> >> +
> >> +out_free:
> >> +       record__free_thread_masks(rec, nr_threads);
> >> +
> >> +       return ret;
> >> +}
> >> +
> >> +static int record__init_thread_default_masks(struct record *rec, struct perf_cpu_map *cpus)
> >> +{
> >> +       int ret;
> >> +
> >> +       ret = record__alloc_thread_masks(rec, 1, cpu__max_cpu().cpu);
> >> +       if (ret)
> >> +               return ret;
> >> +
> >> +       record__mmap_cpu_mask_init(&rec->thread_masks->maps, cpus);
> >> +
> >> +       rec->nr_threads = 1;
> >> +
> >> +       return 0;
> >> +}
> >> +
> >> +static int record__init_thread_masks(struct record *rec)
> >> +{
> >> +       struct perf_cpu_map *cpus = rec->evlist->core.cpus;
> >> +
> >> +       return record__init_thread_default_masks(rec, cpus);
> >> +}
> >> +
> >>  int cmd_record(int argc, const char **argv)
> >>  {
> >>         int err;
> >> @@ -2948,6 +3063,12 @@ int cmd_record(int argc, const char **argv)
> >>                 goto out;
> >>         }
> >>
> >> +       err = record__init_thread_masks(rec);
> >> +       if (err) {
> >> +               pr_err("Failed to initialize parallel data streaming masks\n");
> >> +               goto out;
> >> +       }
> >> +
> >>         if (rec->opts.nr_cblocks > nr_cblocks_max)
> >>                 rec->opts.nr_cblocks = nr_cblocks_max;
> >>         pr_debug("nr_cblocks: %d\n", rec->opts.nr_cblocks);
> >> @@ -2966,6 +3087,8 @@ int cmd_record(int argc, const char **argv)
> >>         symbol__exit();
> >>         auxtrace_record__free(rec->itr);
> >>  out_opts:
> >> +       record__free_thread_masks(rec, rec->nr_threads);
> >> +       rec->nr_threads = 0;
> >>         evlist__close_control(rec->opts.ctl_fd, rec->opts.ctl_fd_ack, &rec->opts.ctl_fd_close);
> >>         return err;
> >>  }
> >> --
> >> 2.19.0
> >>


end of thread, other threads:[~2022-04-06 18:09 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-17 18:34 [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 01/16] perf record: Introduce thread affinity and mmap masks Alexey Bayduraev
2022-01-31 21:00   ` Arnaldo Carvalho de Melo
2022-01-31 21:16     ` Arnaldo Carvalho de Melo
2022-02-01 11:46       ` Bayduraev, Alexey V
2022-01-31 22:03     ` Arnaldo Carvalho de Melo
2022-01-31 22:04       ` Arnaldo Carvalho de Melo
2022-04-04 22:25   ` Ian Rogers
2022-04-05 16:21     ` Bayduraev, Alexey V
2022-04-06 16:46       ` Ian Rogers
2022-01-17 18:34 ` [PATCH v13 02/16] tools lib: Introduce fdarray duplicate function Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 03/16] perf record: Introduce thread specific data array Alexey Bayduraev
2022-01-31 21:39   ` Arnaldo Carvalho de Melo
2022-01-31 22:21     ` Arnaldo Carvalho de Melo
2022-02-11 16:51       ` Arnaldo Carvalho de Melo
2022-02-11 16:52         ` Arnaldo Carvalho de Melo
2022-02-11 19:34         ` Alexei Budankov
2022-01-17 18:34 ` [PATCH v13 04/16] perf record: Introduce function to propagate control commands Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 05/16] perf record: Introduce thread local variable Alexey Bayduraev
2022-01-31 21:42   ` Arnaldo Carvalho de Melo
2022-01-31 21:45   ` Arnaldo Carvalho de Melo
2022-02-01  7:35     ` Bayduraev, Alexey V
2022-01-17 18:34 ` [PATCH v13 06/16] perf record: Stop threads in the end of trace streaming Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 07/16] perf record: Start threads in the beginning " Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 08/16] perf record: Introduce data file at mmap buffer object Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 09/16] perf record: Introduce bytes written stats Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 10/16] perf record: Introduce compressor at mmap buffer object Alexey Bayduraev
2022-01-31 21:56   ` Arnaldo Carvalho de Melo
2022-02-01  8:08     ` Bayduraev, Alexey V
2022-01-17 18:34 ` [PATCH v13 11/16] perf record: Introduce data transferred and compressed stats Alexey Bayduraev
2022-01-24 15:28   ` Arnaldo Carvalho de Melo
2022-01-24 16:39     ` Arnaldo Carvalho de Melo
2022-01-17 18:34 ` [PATCH v13 12/16] perf record: Introduce --threads command line option Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 13/16] perf record: Extend " Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 14/16] perf record: Implement compatibility checks Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 15/16] perf session: Load data directory files for analysis Alexey Bayduraev
2022-01-17 18:34 ` [PATCH v13 16/16] perf report: Output data file name in raw trace dump Alexey Bayduraev
2022-01-24 15:45 ` [PATCH v13 00/16] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
