linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
@ 2020-10-21 15:52 Alexey Budankov
  2020-10-21 15:56 ` [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump Alexey Budankov
                   ` (15 more replies)
  0 siblings, 16 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 15:52 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Changes in v2:
- explicitly added credit tags to patches 6/15 and 15/15,
  additionally to cites [1], [2]
- updated description of 3/15 to explicitly mention the reason
  to open data directories in read access mode (e.g. for perf report)
- implemented fix for compilation error of 2/15
- explicitly elaborated on found issues to be resolved for
  threaded AUX trace capture

v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/

Patch set provides threaded trace streaming for base perf record
operation. Provided streaming mode (--threads) mitigates profiling
data losses and resolves scalability issues of serial and asynchronous
(--aio) trace streaming modes on multicore server systems. The patch
set is based on the prototype [1], [2] and the most closely relates
to mode 3) "mode that creates thread for every monitored memory map".

The threaded mode executes one-to-one mapping of trace streaming threads
to mapped data buffers and streaming into per-CPU trace files located
at data directory. The data buffers and threads are affined to NUMA
nodes and monitored CPUs according to system topology. --cpu option
can be used to specify exact CPUs to be monitored.

Basic analysis of data directories is provided for perf report mode.
Raw dump (-D) and aggregated reports are available for data directories,
still with no memory consumption optimizations. However data directories
collected with --compression-level option enabled can be analyzed with
little less memory because trace files are unmaped from tool process
memory after loading collected data.

Provided streaming mode is available with Zstd compression/decompression
(--compression-level) and handling of external commands (--control).
AUX area tracing, related and derived modes like --snapshot or
--aux-sample are not enabled. --switch-output, --switch-output-event, 
--switch-max-files and --timestamp-filename options are not enabled.
Threaded trace streaming is not enabled for pipe mode. Initial intent
to enable AUX area tracing faced the need to define some optimal way
to store index data in data directory, thus left aside. --switch-output-*
and --timestamp-filename use cases are not clear for data directories.

Asynchronous(--aio) trace streaming and affinity (--affinity) modes
are mutually exclusive to threaded streaming mode.

See testing results and validation examples below.

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/

---
Alexey Budankov (15):
  perf session: introduce trace file path to be shown in raw trace dump
  perf report: output trace file name in raw trace dump
  perf data: open data directory in read access mode
  perf session: move reader object definition to header file
  perf session: introduce decompressor into trace reader object
  perf session: load data directory into tool process memory
  perf record: introduce trace file, compressor and stats in mmap object
  perf record: write trace data into mmap trace files
  perf record: introduce thread specific objects for trace streaming
  perf record: manage thread specific data array
  perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
  perf record: introduce thread local variable for trace streaming
  perf record: stop threads in the end of trace streaming
  perf record: start threads in the beginning of trace streaming
  perf record: introduce --threads command line option

 tools/perf/Documentation/perf-record.txt |   7 +
 tools/perf/builtin-inject.c              |   3 +-
 tools/perf/builtin-record.c              | 514 +++++++++++++++++++++--
 tools/perf/util/data.c                   |   4 +
 tools/perf/util/evlist.c                 |  16 +
 tools/perf/util/evlist.h                 |   1 +
 tools/perf/util/mmap.c                   |   6 +
 tools/perf/util/mmap.h                   |   5 +
 tools/perf/util/ordered-events.h         |   1 +
 tools/perf/util/record.h                 |   1 +
 tools/perf/util/session.c                | 150 ++++---
 tools/perf/util/session.h                |  28 ++
 tools/perf/util/tool.h                   |   3 +-
 13 files changed, 637 insertions(+), 102 deletions(-)

---
Testing results:

 $ perf test
 1: vmlinux symtab matches kallsyms                                 : Skip
 2: Detect openat syscall event                                     : Ok
 3: Detect openat syscall event on all cpus                         : Ok
 4: Read samples using the mmap interface                           : Ok
 5: Test data source output                                         : Ok
 6: Parse event definition strings                                  : Ok
 7: Simple expression parser                                        : Ok
 8: PERF_RECORD_* events & perf_sample fields                       : Ok
 9: Parse perf pmu format                                           : Ok
10: PMU events                                                      :
10.1: PMU event table sanity                                        : Ok
10.2: PMU event map aliases                                         : Ok
10.3: Parsing of PMU event table metrics                            : Skip (some metrics failed)
10.4: Parsing of PMU event table metrics with fake PMUs             : Ok
11: DSO data read                                                   : Ok
12: DSO data cache                                                  : Ok
13: DSO data reopen                                                 : FAILED!
14: Roundtrip evsel->name                                           : Ok
15: Parse sched tracepoints fields                                  : Ok
16: syscalls:sys_enter_openat event fields                          : Ok
17: Setup struct perf_event_attr                                    : Ok
18: Match and link multiple hists                                   : Ok
19: 'import perf' in python                                         : FAILED!
20: Breakpoint overflow signal handler                              : Ok
21: Breakpoint overflow sampling                                    : Ok
22: Breakpoint accounting                                           : Ok
23: Watchpoint                                                      :
23.1: Read Only Watchpoint                                          : Skip
23.2: Write Only Watchpoint                                         : Ok
23.3: Read / Write Watchpoint                                       : Ok
23.4: Modify Watchpoint                                             : Ok
24: Number of exit events of a simple workload                      : Ok
25: Software clock events period values                             : Ok
26: Object code reading                                             : Ok
27: Sample parsing                                                  : Ok
28: Use a dummy software event to keep tracking                     : Ok
29: Parse with no sample_id_all bit set                             : Ok
30: Filter hist entries                                             : Ok
31: Lookup mmap thread                                              : Ok
32: Share thread maps                                               : Ok
33: Sort output of hist entries                                     : Ok
34: Cumulate child hist entries                                     : Ok
35: Track with sched_switch                                         : Ok
36: Filter fds with revents mask in a fdarray                       : Ok
37: Add fd to a fdarray, making it autogrow                         : Ok
38: kmod_path__parse                                                : Ok
39: Thread map                                                      : Ok
40: LLVM search and compile                                         :
40.1: Basic BPF llvm compile                                        : Skip
40.2: kbuild searching                                              : Skip
40.3: Compile source for BPF prologue generation                    : Skip
40.4: Compile source for BPF relocation                             : Skip
41: Session topology                                                : Ok
42: BPF filter                                                      :
42.1: Basic BPF filtering                                           : Skip
42.2: BPF pinning                                                   : Skip
42.3: BPF prologue generation                                       : Skip
42.4: BPF relocation checker                                        : Skip
43: Synthesize thread map                                           : Ok
44: Remove thread map                                               : Ok
45: Synthesize cpu map                                              : Ok
46: Synthesize stat config                                          : Ok
47: Synthesize stat                                                 : Ok
48: Synthesize stat round                                           : Ok
49: Synthesize attr update                                          : Ok
50: Event times                                                     : Ok
51: Read backward ring buffer                                       : Ok
52: Print cpu map                                                   : Ok
53: Merge cpu map                                                   : Ok
54: Probe SDT events                                                : Ok
55: is_printable_array                                              : Ok
56: Print bitmap                                                    : Ok
57: perf hooks                                                      : Ok
58: builtin clang support                                           : Skip (not compiled in)
59: unit_number__scnprintf                                          : Ok
60: mem2node                                                        : Ok
61: time utils                                                      : Ok
62: Test jit_write_elf                                              : Ok
63: Test libpfm4 support                                            : Skip (not compiled in)
64: Test api io                                                     : Ok
65: maps__merge_in                                                  : Ok
66: Demangle Java                                                   : Ok
67: Parse and process metrics                                       : Ok
68: PE file support                                                 : Ok
69: Event expansion for cgroups                                     : Ok
70: x86 rdpmc                                                       : Ok
71: Convert perf time to TSC                                        : Ok
72: DWARF unwind                                                    : Ok
73: x86 instruction decoder - new instructions                      : Ok
74: Intel PT packet decoder                                         : Ok
75: x86 bp modify                                                   : Ok
76: Check open filename arg using perf trace + vfs_getname          : FAILED!
77: Add vfs_getname probe to get syscall args filenames             : FAILED!
78: probe libc's inet_pton & backtrace it with ping                 : Ok
79: Use vfs_getname probe to get syscall args filenames             : FAILED!
80: Zstd perf.data compression/decompression                        : Ok
81: Check Arm CoreSight trace data recording and synthesized samples: Skip

Validation:

  $ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
  Addr of buf1 = 0x7f8af24ae010
  Offs of buf1 = 0x7f8af24ae180
  Addr of buf2 = 0x7f8af04ad010
  Offs of buf2 = 0x7f8af04ad1c0
  Addr of buf3 = 0x7f8aee4ac010
  Offs of buf3 = 0x7f8aee4ac100
  Addr of buf4 = 0x7f8aec4ab010
  Offs of buf4 = 0x7f8aec4ab140
  Threads #: 8 Pthreads
  Matrix size: 2048
  Using multiply kernel: multiply1
  Execution time = 24.579 seconds
  [ perf record: Woken up 972 times to write data ]
  [ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ]

  $ perf report --header --stats
  # ========
  # captured on    : Wed Oct  7 19:07:50 2020
  # header version : 1
  # data offset    : 504
  # data size      : 30056
  # feat offset    : 30560
  # hostname : nntvtune39
  # os release : 5.8.8-100.fc31.x86_64
  # perf version : 4.13.rc5.gdf19b1347b82
  # arch : x86_64
  # nrcpus online : 8
  # nrcpus avail : 8
  # cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  # cpuid : GenuineIntel,6,94,3
  # total memory : 16119484 kB
  # cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench  /matrix/linux/matrix.gcc.g.O3 
  # event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|>
  # event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty>
  # sibling sockets : 0-7
  # sibling dies    : 0-7
  # sibling threads : 0,4
  # sibling threads : 1,5
  # sibling threads : 2,6
  # sibling threads : 3,7
  # CPU 0: Core ID 0, Die ID 0, Socket ID 0
  # CPU 1: Core ID 1, Die ID 0, Socket ID 0
  # CPU 2: Core ID 2, Die ID 0, Socket ID 0
  # CPU 3: Core ID 3, Die ID 0, Socket ID 0
  # CPU 4: Core ID 0, Die ID 0, Socket ID 0
  # CPU 5: Core ID 1, Die ID 0, Socket ID 0
  # CPU 6: Core ID 2, Die ID 0, Socket ID 0
  # CPU 7: Core ID 3, Die ID 0, Socket ID 0
  # node0 meminfo  : total = 16119484 kB, free = 933088 kB
  # node0 cpu list : 0-7
  # pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac>
  # CPU cache info:
  #  L1 Data                 32K [0,4]
  #  L1 Instruction          32K [0,4]
  #  L1 Data                 32K [1,5]
  #  L1 Instruction          32K [1,5]
  #  L1 Data                 32K [1,5]
  #  L1 Instruction          32K [1,5]
  #  L1 Data                 32K [2,6]
  #  L1 Instruction          32K [2,6]
  #  L1 Data                 32K [3,7]
  #  L1 Instruction          32K [3,7]
  #  L2 Unified             256K [0,4]
  #  L2 Unified             256K [1,5]
  #  L2 Unified             256K [2,6]
  #  L2 Unified             256K [3,7]
  #  L3 Unified            8192K [0-7]
  # time of first sample : 0.000000
  # time of last sample : 0.000000
  # sample duration :      0.000 ms
  # memory nodes (nr 1, block size 0x8000000):
  #    0 [16G]: 0-23,32-135
  # directory data version : 1
  # bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66
  # bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66
  # bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373
  # bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66
  # bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66
  # bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132
  # bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62
  # bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62
  # bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66
  # bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66
  # bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552
  # bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66
  # bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66
  # compressed : Zstd, level = 3, ratio = 6
  # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
  # missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA 
  # ========
  #
  Aggregated stats:
           TOTAL events:    1457107
            MMAP events:         98
            LOST events:          0
            COMM events:          2
            EXIT events:          6
        THROTTLE events:          0
      UNTHROTTLE events:          0
            FORK events:          8
            READ events:          0
          SAMPLE events:    1456388
           MMAP2 events:          7
             AUX events:          0
    ITRACE_START events:          0
    LOST_SAMPLES events:          0
          SWITCH events:          0
 SWITCH_CPU_WIDE events:          0
      NAMESPACES events:          0
         KSYMBOL events:         13
       BPF_EVENT events:         13
          CGROUP events:        166
       TEXT_POKE events:          0
            ATTR events:          0
      EVENT_TYPE events:          0
    TRACING_DATA events:          0
        BUILD_ID events:          0
  FINISHED_ROUND events:          0
        ID_INDEX events:          0
   AUXTRACE_INFO events:          0
        AUXTRACE events:          0
  AUXTRACE_ERROR events:          0
      THREAD_MAP events:          1
         CPU_MAP events:          1
     STAT_CONFIG events:          0
            STAT events:          0
      STAT_ROUND events:          0
    EVENT_UPDATE events:          0
       TIME_CONV events:          1
         FEATURE events:          0
      COMPRESSED events:        403

  $ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
  Addr of buf1 = 0x7f15e8b89010
  Offs of buf1 = 0x7f15e8b89180
  Addr of buf2 = 0x7f15e6b88010
  Offs of buf2 = 0x7f15e6b881c0
  Addr of buf3 = 0x7f15e4b87010
  Offs of buf3 = 0x7f15e4b87100
  Addr of buf4 = 0x7f15e2b86010
  Offs of buf4 = 0x7f15e2b86140
  Threads #: 8 Pthreads
  Matrix size: 2048
  Using multiply kernel: multiply1
  Execution time = 24.281 seconds
  [ perf record: Woken up 6923 times to write data ]
  [ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ]

  $ perf report --stdio
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 708K of event 'cycles'
  # Event count (approx.): 277718834163
  #
  # Children      Self  Command          Shared Object             Symbol                                                  
  # ........  ........  ...............  ........................  ........................................................
  #
    99.97%     0.00%  matrix.gcc.g.O3  libc-2.30.so              [.] __GI___clone (inlined)
            |
            ---__GI___clone (inlined)
               |          
                --99.97%--start_thread
                          |          
                           --99.97%--ThreadFunction
                                     multiply1
                                     |          
                                      --0.96%--asm_sysvec_apic_timer_interrupt
                                                |          
                                                 --0.70%--subflow_syn_recv_sock
                                                           |          
                                                            --0.60%--__sysvec_apic_timer_interrupt
                                                                      |          
                                                                       --0.57%--hrtimer_interrupt

    99.97%     0.00%  matrix.gcc.g.O3  libpthread-2.30.so        [.] start_thread
            |
            ---start_thread
               |          
                --99.97%--ThreadFunction
                          |          
                           --99.97%--multiply1
                                     |          
                                      --0.96%--asm_sysvec_apic_timer_interrupt
                                                |          
                                                 --0.70%--subflow_syn_recv_sock
                                                           |          
                                                            --0.60%--__sysvec_apic_timer_interrupt
                                                                      |          
                                                                       --0.57%--hrtimer_interrupt

    99.97%     0.00%  matrix.gcc.g.O3  matrix.gcc.g.O3           [.] ThreadFunction
            |
            ---ThreadFunction
               |          
                --99.97%--multiply1
                          |          
                           --0.96%--asm_sysvec_apic_timer_interrupt
                                     |          
                                      --0.70%--subflow_syn_recv_sock
                                                |          
                                                 --0.60%--__sysvec_apic_timer_interrupt
                                                           |          
                                                            --0.57%--hrtimer_interrupt

    99.97%    98.68%  matrix.gcc.g.O3  matrix.gcc.g.O3           [.] multiply1
            |          
            |--98.68%--__GI___clone (inlined)
            |          start_thread
            |          ThreadFunction
            |          multiply1
            |          
             --1.28%--multiply1
                       |          
                        --0.96%--asm_sysvec_apic_timer_interrupt
                                  |          
                                   --0.70%--subflow_syn_recv_sock
                                             |          
                                              --0.60%--__sysvec_apic_timer_interrupt
                                                        |          
                                                         --0.57%--hrtimer_interrupt

     0.97%     0.05%  matrix.gcc.g.O3  [kernel.kallsyms]         [k] asm_sysvec_apic_timer_interrupt
            |          
             --0.91%--asm_sysvec_apic_timer_interrupt
                       |          
                        --0.70%--subflow_syn_recv_sock


-- 
2.24.1


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
@ 2020-10-21 15:56 ` Alexey Budankov
  2020-10-22  4:28   ` Namhyung Kim
  2020-10-21 15:57 ` [PATCH v2 02/15] perf report: output trace file name " Alexey Budankov
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 15:56 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Extend reader, ordered_event and decomp objects to contain path
of a trace file being displayed.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/ordered-events.h | 1 +
 tools/perf/util/session.c        | 2 ++
 tools/perf/util/session.h        | 1 +
 3 files changed, 4 insertions(+)

diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
index 75345946c4b9..42c9764c6b5b 100644
--- a/tools/perf/util/ordered-events.h
+++ b/tools/perf/util/ordered-events.h
@@ -9,6 +9,7 @@ struct perf_sample;
 struct ordered_event {
 	u64			timestamp;
 	u64			file_offset;
+	const char		*file_path;
 	union perf_event	*event;
 	struct list_head	list;
 };
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 7a5f03764702..4478ddae485b 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2119,6 +2119,7 @@ typedef s64 (*reader_cb_t)(struct perf_session *session,
 
 struct reader {
 	int		 fd;
+	const char	 *path;
 	u64		 data_size;
 	u64		 data_offset;
 	reader_cb_t	 process;
@@ -2241,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
 		.data_size	= session->header.data_size,
 		.data_offset	= session->header.data_offset,
 		.process	= process_simple,
+		.path		= session->data->file.path,
 	};
 	struct ordered_events *oe = &session->ordered_events;
 	struct perf_tool *tool = session->tool;
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index f76480166d38..378ffc3e2809 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -46,6 +46,7 @@ struct perf_session {
 struct decomp {
 	struct decomp *next;
 	u64 file_pos;
+	const char *file_path;
 	size_t mmap_len;
 	u64 head;
 	size_t size;
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 02/15] perf report: output trace file name in raw trace dump
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
  2020-10-21 15:56 ` [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump Alexey Budankov
@ 2020-10-21 15:57 ` Alexey Budankov
  2020-10-22  4:29   ` Namhyung Kim
  2020-10-21 15:57 ` [PATCH v2 03/15] perf data: open data directory in read access mode Alexey Budankov
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 15:57 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Print path and name of a trace file into raw dump (-D)
<file_offset>@<path/file>. Print offset of PERF_RECORD_COMPRESSED
record instead of zero for decompressed records:
  0x2226a@perf.data [0x30]: event: 9
or
  0x15cc36@perf.data/data.7 [0x30]: event: 9

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-inject.c |  3 +-
 tools/perf/util/session.c   | 75 ++++++++++++++++++++++---------------
 tools/perf/util/tool.h      |  3 +-
 3 files changed, 48 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 452a75fe68e5..037f8a98220c 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -106,7 +106,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,
 
 static int perf_event__repipe_op4_synth(struct perf_session *session,
 					union perf_event *event,
-					u64 data __maybe_unused)
+					u64 data __maybe_unused,
+					const char *str __maybe_unused)
 {
 	return perf_event__repipe_synth(session->tool, event);
 }
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 4478ddae485b..6f09d506b2f6 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -36,7 +36,8 @@
 
 #ifdef HAVE_ZSTD_SUPPORT
 static int perf_session__process_compressed_event(struct perf_session *session,
-						  union perf_event *event, u64 file_offset)
+						  union perf_event *event, u64 file_offset,
+						  const char *file_path)
 {
 	void *src;
 	size_t decomp_size, src_size;
@@ -58,6 +59,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
 	}
 
 	decomp->file_pos = file_offset;
+	decomp->file_path = file_path;
 	decomp->mmap_len = mmap_len;
 	decomp->head = 0;
 
@@ -98,7 +100,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
 static int perf_session__deliver_event(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_tool *tool,
-				       u64 file_offset);
+				       u64 file_offset,
+				       const char *file_path);
 
 static int perf_session__open(struct perf_session *session)
 {
@@ -180,7 +183,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
 						    ordered_events);
 
 	return perf_session__deliver_event(session, event->event,
-					   session->tool, event->file_offset);
+					   session->tool, event->file_offset,
+					   event->file_path);
 }
 
 struct perf_session *perf_session__new(struct perf_data *data,
@@ -452,7 +456,8 @@ static int process_stat_round_stub(struct perf_session *perf_session __maybe_unu
 
 static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
 						       union perf_event *event __maybe_unused,
-						       u64 file_offset __maybe_unused)
+						       u64 file_offset __maybe_unused,
+						       const char *file_path __maybe_unused)
 {
        dump_printf(": unhandled!\n");
        return 0;
@@ -1227,13 +1232,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
 }
 
 static void dump_event(struct evlist *evlist, union perf_event *event,
-		       u64 file_offset, struct perf_sample *sample)
+		       u64 file_offset, struct perf_sample *sample,
+		       const char *file_path)
 {
 	if (!dump_trace)
 		return;
 
-	printf("\n%#" PRIx64 " [%#x]: event: %d\n",
-	       file_offset, event->header.size, event->header.type);
+	printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
+	       file_offset, file_path, event->header.size, event->header.type);
 
 	trace_event(event);
 	if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
@@ -1424,12 +1430,13 @@ static int machines__deliver_event(struct machines *machines,
 				   struct evlist *evlist,
 				   union perf_event *event,
 				   struct perf_sample *sample,
-				   struct perf_tool *tool, u64 file_offset)
+				   struct perf_tool *tool, u64 file_offset,
+				   const char *file_path)
 {
 	struct evsel *evsel;
 	struct machine *machine;
 
-	dump_event(evlist, event, file_offset, sample);
+	dump_event(evlist, event, file_offset, sample, file_path);
 
 	evsel = perf_evlist__id2evsel(evlist, sample->id);
 
@@ -1506,7 +1513,8 @@ static int machines__deliver_event(struct machines *machines,
 static int perf_session__deliver_event(struct perf_session *session,
 				       union perf_event *event,
 				       struct perf_tool *tool,
-				       u64 file_offset)
+				       u64 file_offset,
+				       const char *file_path)
 {
 	struct perf_sample sample;
 	int ret;
@@ -1524,7 +1532,7 @@ static int perf_session__deliver_event(struct perf_session *session,
 		return 0;
 
 	ret = machines__deliver_event(&session->machines, session->evlist,
-				      event, &sample, tool, file_offset);
+				      event, &sample, tool, file_offset, file_path);
 
 	if (dump_trace && sample.aux_sample.size)
 		auxtrace__dump_auxtrace_sample(session, &sample);
@@ -1534,7 +1542,8 @@ static int perf_session__deliver_event(struct perf_session *session,
 
 static s64 perf_session__process_user_event(struct perf_session *session,
 					    union perf_event *event,
-					    u64 file_offset)
+					    u64 file_offset,
+					    const char *file_path)
 {
 	struct ordered_events *oe = &session->ordered_events;
 	struct perf_tool *tool = session->tool;
@@ -1544,7 +1553,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 
 	if (event->header.type != PERF_RECORD_COMPRESSED ||
 	    tool->compressed == perf_session__process_compressed_event_stub)
-		dump_event(session->evlist, event, file_offset, &sample);
+		dump_event(session->evlist, event, file_offset, &sample, file_path);
 
 	/* These events are processed right away */
 	switch (event->header.type) {
@@ -1603,9 +1612,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
 	case PERF_RECORD_HEADER_FEATURE:
 		return tool->feature(session, event);
 	case PERF_RECORD_COMPRESSED:
-		err = tool->compressed(session, event, file_offset);
+		err = tool->compressed(session, event, file_offset, file_path);
 		if (err)
-			dump_event(session->evlist, event, file_offset, &sample);
+			dump_event(session->evlist, event, file_offset, &sample, file_path);
 		return err;
 	default:
 		return -EINVAL;
@@ -1622,9 +1631,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
 	events_stats__inc(&evlist->stats, event->header.type);
 
 	if (event->header.type >= PERF_RECORD_USER_TYPE_START)
-		return perf_session__process_user_event(session, event, 0);
+		return perf_session__process_user_event(session, event, 0, NULL);
 
-	return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
+	return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
 }
 
 static void event_swap(union perf_event *event, bool sample_id_all)
@@ -1720,7 +1729,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
 }
 
 static s64 perf_session__process_event(struct perf_session *session,
-				       union perf_event *event, u64 file_offset)
+				       union perf_event *event, u64 file_offset,
+				       const char *file_path)
 {
 	struct evlist *evlist = session->evlist;
 	struct perf_tool *tool = session->tool;
@@ -1735,7 +1745,7 @@ static s64 perf_session__process_event(struct perf_session *session,
 	events_stats__inc(&evlist->stats, event->header.type);
 
 	if (event->header.type >= PERF_RECORD_USER_TYPE_START)
-		return perf_session__process_user_event(session, event, file_offset);
+		return perf_session__process_user_event(session, event, file_offset, file_path);
 
 	if (tool->ordered_events) {
 		u64 timestamp = -1ULL;
@@ -1749,7 +1759,7 @@ static s64 perf_session__process_event(struct perf_session *session,
 			return ret;
 	}
 
-	return perf_session__deliver_event(session, event, tool, file_offset);
+	return perf_session__deliver_event(session, event, tool, file_offset, file_path);
 }
 
 void perf_event_header__bswap(struct perf_event_header *hdr)
@@ -1987,7 +1997,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
 		}
 	}
 
-	if ((skip = perf_session__process_event(session, event, head)) < 0) {
+	skip = perf_session__process_event(session, event, head, "pipe");
+	if (skip < 0) {
 		pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
 		       head, event->header.size, event->header.type);
 		err = -EINVAL;
@@ -2068,7 +2079,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
 static int __perf_session__process_decomp_events(struct perf_session *session)
 {
 	s64 skip;
-	u64 size, file_pos = 0;
+	u64 size;
 	struct decomp *decomp = session->decomp_last;
 
 	if (!decomp)
@@ -2082,9 +2093,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
 			break;
 
 		size = event->header.size;
-
-		if (size < sizeof(struct perf_event_header) ||
-		    (skip = perf_session__process_event(session, event, file_pos)) < 0) {
+		skip = perf_session__process_event(session, event, decomp->file_pos,
+						   decomp->file_path);
+		if (size < sizeof(struct perf_event_header) || skip < 0) {
 			pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
 				decomp->file_pos + decomp->head, event->header.size, event->header.type);
 			return -EINVAL;
@@ -2115,7 +2126,8 @@ struct reader;
 
 typedef s64 (*reader_cb_t)(struct perf_session *session,
 			   union perf_event *event,
-			   u64 file_offset);
+			   u64 file_offset,
+			   const char *file_path);
 
 struct reader {
 	int		 fd;
@@ -2198,9 +2210,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
 	skip = -EINVAL;
 
 	if (size < sizeof(struct perf_event_header) ||
-	    (skip = rd->process(session, event, file_pos)) < 0) {
-		pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
-		       file_offset + head, event->header.size,
+	    (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
+		pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
+		       file_offset + head, rd->path, event->header.size,
 		       event->header.type, strerror(-skip));
 		err = skip;
 		goto out;
@@ -2230,9 +2242,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,
 
 static s64 process_simple(struct perf_session *session,
 			  union perf_event *event,
-			  u64 file_offset)
+			  u64 file_offset,
+			  const char *file_path)
 {
-	return perf_session__process_event(session, event, file_offset);
+	return perf_session__process_event(session, event, file_offset, file_path);
 }
 
 static int __perf_session__process_events(struct perf_session *session)
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index bbbc0dcd461f..c966531d3eca 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
 
 typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
 typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
-typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
+typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
+			 const char *str);
 
 typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
 			struct ordered_events *oe);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
  2020-10-21 15:56 ` [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump Alexey Budankov
  2020-10-21 15:57 ` [PATCH v2 02/15] perf report: output trace file name " Alexey Budankov
@ 2020-10-21 15:57 ` Alexey Budankov
  2020-10-22  4:31   ` Namhyung Kim
  2020-10-24 15:43   ` Jiri Olsa
  2020-10-21 15:59 ` [PATCH v2 04/15] perf session: move reader object definition to header file Alexey Budankov
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 15:57 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Open files located at trace data directory in case read access
mode is requested. File are opened and its fds assigned to
perf_data dir files especially for loading data directories
content in perf report mode.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/data.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
index c47aa34fdc0a..6ad61ac6ba67 100644
--- a/tools/perf/util/data.c
+++ b/tools/perf/util/data.c
@@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
 		return -1;
 
 	ret = open_file(data);
+	if (!ret && perf_data__is_dir(data)) {
+		if (perf_data__is_read(data))
+			ret = perf_data__open_dir(data);
+	}
 
 	/* Cleanup whatever we managed to create so far. */
 	if (ret && perf_data__is_write(data))
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 04/15] perf session: move reader object definition to header file
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (2 preceding siblings ...)
  2020-10-21 15:57 ` [PATCH v2 03/15] perf data: open data directory in read access mode Alexey Budankov
@ 2020-10-21 15:59 ` Alexey Budankov
  2020-10-22  4:31   ` Namhyung Kim
  2020-10-24 15:44   ` Jiri Olsa
  2020-10-21 16:00 ` [PATCH v2 05/15] perf session: introduce decompressor into trace reader object Alexey Budankov
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 15:59 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Move definition of reader to session header file to be shared
among different source files. Introduce reference to active
reader object from session object.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/session.c | 27 ---------------------------
 tools/perf/util/session.h | 25 +++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6f09d506b2f6..911b2dbcd0ac 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
 	return 0;
 }
 
-/*
- * On 64bit we can mmap the data file in one go. No need for tiny mmap
- * slices. On 32bit we use 32MB.
- */
-#if BITS_PER_LONG == 64
-#define MMAP_SIZE ULLONG_MAX
-#define NUM_MMAPS 1
-#else
-#define MMAP_SIZE (32 * 1024 * 1024ULL)
-#define NUM_MMAPS 128
-#endif
-
-struct reader;
-
-typedef s64 (*reader_cb_t)(struct perf_session *session,
-			   union perf_event *event,
-			   u64 file_offset,
-			   const char *file_path);
-
-struct reader {
-	int		 fd;
-	const char	 *path;
-	u64		 data_size;
-	u64		 data_offset;
-	reader_cb_t	 process;
-};
-
 static int
 reader__process_events(struct reader *rd, struct perf_session *session,
 		       struct ui_progress *prog)
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 378ffc3e2809..abdb8518a81f 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -20,6 +20,30 @@ struct thread;
 struct auxtrace;
 struct itrace_synth_opts;
 
+/*
+ * On 64bit we can mmap the data file in one go. No need for tiny mmap
+ * slices. On 32bit we use 32MB.
+ */
+#if BITS_PER_LONG == 64
+#define MMAP_SIZE ULLONG_MAX
+#define NUM_MMAPS 1
+#else
+#define MMAP_SIZE (32 * 1024 * 1024ULL)
+#define NUM_MMAPS 128
+#endif
+
+typedef s64 (*reader_cb_t)(struct perf_session *session,
+			   union perf_event *event,
+			   u64 file_offset, const char *file_path);
+
+struct reader {
+	int		 fd;
+	const char	 *path;
+	u64		 data_size;
+	u64		 data_offset;
+	reader_cb_t	 process;
+};
+
 struct perf_session {
 	struct perf_header	header;
 	struct machines		machines;
@@ -41,6 +65,7 @@ struct perf_session {
 	struct zstd_data	zstd_data;
 	struct decomp		*decomp;
 	struct decomp		*decomp_last;
+	struct reader		*reader;
 };
 
 struct decomp {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 05/15] perf session: introduce decompressor into trace reader object
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (3 preceding siblings ...)
  2020-10-21 15:59 ` [PATCH v2 04/15] perf session: move reader object definition to header file Alexey Budankov
@ 2020-10-21 16:00 ` Alexey Budankov
  2020-10-22  4:36   ` Namhyung Kim
  2020-10-24 15:44   ` Jiri Olsa
  2020-10-21 16:01 ` [PATCH v2 06/15] perf session: load data directory into tool process memory Alexey Budankov
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:00 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Introduce decompressor to trace reader object so that decompression
could be executed on per trace file basis separately for every
trace file located in trace directory.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/session.c | 4 +++-
 tools/perf/util/session.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 911b2dbcd0ac..6afc670fdf0c 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -44,6 +44,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
 	u64 decomp_last_rem = 0;
 	size_t mmap_len, decomp_len = session->header.env.comp_mmap_len;
 	struct decomp *decomp, *decomp_last = session->decomp_last;
+	struct zstd_data *zstd_data = session->reader ?
+		&(session->reader->zstd_data) : &(session->zstd_data);
 
 	if (decomp_last) {
 		decomp_last_rem = decomp_last->size - decomp_last->head;
@@ -71,7 +73,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
 	src = (void *)event + sizeof(struct perf_record_compressed);
 	src_size = event->pack.header.size - sizeof(struct perf_record_compressed);
 
-	decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
+	decomp_size = zstd_decompress_stream(zstd_data, src, src_size,
 				&(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
 	if (!decomp_size) {
 		munmap(decomp, mmap_len);
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index abdb8518a81f..4fc9ccdf7970 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -42,6 +42,7 @@ struct reader {
 	u64		 data_size;
 	u64		 data_offset;
 	reader_cb_t	 process;
+	struct zstd_data zstd_data;
 };
 
 struct perf_session {
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (4 preceding siblings ...)
  2020-10-21 16:00 ` [PATCH v2 05/15] perf session: introduce decompressor into trace reader object Alexey Budankov
@ 2020-10-21 16:01 ` Alexey Budankov
  2020-10-24 15:43   ` Jiri Olsa
  2020-10-21 16:01 ` [PATCH v2 07/15] perf record: introduce trace file, compressor and stats in mmap object Alexey Budankov
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:01 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Read trace files located in data directory into tool process memory.
Basic analysis support of data directories is provided for report
mode. Raw dump (-D) and aggregated reports are available for data
directories, still with no memory consumption optimizations. However
data directories collected with --compression-level option enabled
can be analyzed with little less memory because trace files are
unmaped from tool process memory after loading collected data.
The implementation is based on the prototype [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++
 tools/perf/util/session.h |  1 +
 2 files changed, 49 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 6afc670fdf0c..0752eec19813 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -2212,6 +2212,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
 		goto more;
 
 out:
+	if (rd->unmap_file) {
+		int i;
+
+		for (i = 0; i < NUM_MMAPS; i++) {
+			if (mmaps[i]) {
+				munmap(mmaps[i], mmap_size);
+				mmaps[i] = NULL;
+			}
+		}
+	}
+
 	return err;
 }
 
@@ -2231,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
 		.data_offset	= session->header.data_offset,
 		.process	= process_simple,
 		.path		= session->data->file.path,
+		.unmap_file	= false,
 	};
 	struct ordered_events *oe = &session->ordered_events;
 	struct perf_tool *tool = session->tool;
@@ -2247,6 +2259,42 @@ static int __perf_session__process_events(struct perf_session *session)
 	err = reader__process_events(&rd, session, &prog);
 	if (err)
 		goto out_err;
+
+	if (perf_data__is_dir(session->data)) {
+		int i, nr = session->data->dir.nr;
+		struct reader file_rd[nr];
+		u64 total_size = perf_data__size(session->data);
+
+		total_size -= session->data->file.size;
+		ui_progress__init_size(&prog, total_size, "Sorting events...");
+
+		memset(&file_rd, 0, nr * sizeof(file_rd[0]));
+
+		for (i = 0; i < nr ; i++) {
+			struct perf_data_file *file;
+
+			file = &session->data->dir.files[i];
+			file_rd[i] = (struct reader) {
+				.fd             = file->fd,
+				.path           = file->path,
+				.data_size      = file->size,
+				.data_offset    = 0,
+				.process	= process_simple,
+			};
+			file_rd[i].unmap_file = perf_header__has_feat(&session->header,
+								      HEADER_COMPRESSED);
+			session->reader = &file_rd[i];
+
+			if (zstd_init(&(file_rd[i].zstd_data), 0))
+				goto out_err;
+			err = reader__process_events(&file_rd[i], session, &prog);
+			zstd_fini(&(file_rd[i].zstd_data));
+			session->reader = NULL;
+			if (err)
+				goto out_err;
+		}
+	}
+
 	/* do the final flush for ordered samples */
 	err = ordered_events__flush(oe, OE_FLUSH__FINAL);
 	if (err)
diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 4fc9ccdf7970..d428f3eaf7fd 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -43,6 +43,7 @@ struct reader {
 	u64		 data_offset;
 	reader_cb_t	 process;
 	struct zstd_data zstd_data;
+	bool		 unmap_file;
 };
 
 struct perf_session {
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 07/15] perf record: introduce trace file, compressor and stats in mmap object
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (5 preceding siblings ...)
  2020-10-21 16:01 ` [PATCH v2 06/15] perf session: load data directory into tool process memory Alexey Budankov
@ 2020-10-21 16:01 ` Alexey Budankov
  2020-10-21 16:02 ` [PATCH v2 08/15] perf record: write trace data into mmap trace files Alexey Budankov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:01 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Introduce trace file and compressor objects into mmap object so
they could be used to process and store data stream from the
corresponding kernel data buffer. Introduce bytes_transferred
and bytes_compressed stats so they would capture statistics for
the related data buffer transfers. Make use of the introduced
per mmap file, compressor and stats when they are initialized
and available.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 47 +++++++++++++++++++++++++++----------
 tools/perf/util/mmap.c      |  6 +++++
 tools/perf/util/mmap.h      |  5 ++++
 3 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index adf311d15d3d..619aaee11231 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -150,11 +150,17 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
 {
 	struct perf_data_file *file = &rec->session->data->file;
 
+	if (map && map->file)
+		file = map->file;
+
 	if (perf_data_file__write(file, bf, size) < 0) {
 		pr_err("failed to write perf data, error: %m\n");
 		return -1;
 	}
 
+	if (file != &rec->session->data->file)
+		return 0;
+
 	rec->bytes_written += size;
 
 	if (record__output_max_size_exceeded(rec) && !done) {
@@ -172,8 +178,8 @@ static int record__write(struct record *rec, struct mmap *map __maybe_unused,
 
 static int record__aio_enabled(struct record *rec);
 static int record__comp_enabled(struct record *rec);
-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
-			    void *src, size_t src_size);
+static size_t zstd_compress(struct zstd_data *data,
+			    void *dst, size_t dst_size, void *src, size_t src_size);
 
 #ifdef HAVE_AIO_SUPPORT
 static int record__aio_write(struct aiocb *cblock, int trace_fd,
@@ -291,6 +297,7 @@ struct record_aio {
 static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size)
 {
 	struct record_aio *aio = to;
+	size_t compressed = size;
 
 	/*
 	 * map->core.base data pointed by buf is copied into free map->aio.data[] buffer
@@ -307,9 +314,14 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
 	 */
 
 	if (record__comp_enabled(aio->rec)) {
-		size = zstd_compress(aio->rec->session, aio->data + aio->size,
-				     mmap__mmap_len(map) - aio->size,
+		struct zstd_data *zstd_data = &(aio->rec->session->zstd_data);
+
+		compressed = zstd_compress(zstd_data,
+				     aio->data + aio->size, mmap__mmap_len(map) - aio->size,
 				     buf, size);
+
+		aio->rec->session->bytes_transferred += size;
+		aio->rec->session->bytes_compressed  += compressed;
 	} else {
 		memcpy(aio->data + aio->size, buf, size);
 	}
@@ -328,7 +340,7 @@ static int record__aio_pushfn(struct mmap *map, void *to, void *buf, size_t size
 		perf_mmap__get(&map->core);
 	}
 
-	aio->size += size;
+	aio->size += compressed;
 
 	return size;
 }
@@ -532,14 +544,28 @@ static int process_locked_synthesized_event(struct perf_tool *tool,
 static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
 {
 	struct record *rec = to;
+	size_t compressed = size;
 
 	if (record__comp_enabled(rec)) {
-		size = zstd_compress(rec->session, map->data, mmap__mmap_len(map), bf, size);
+		struct zstd_data *zstd_data = &rec->session->zstd_data;
+
+		if (map->file)
+			zstd_data = &map->zstd_data;
+
+		compressed = zstd_compress(zstd_data, map->data, mmap__mmap_len(map), bf, size);
 		bf   = map->data;
+
+		if (map->file) {
+			map->bytes_transferred += size;
+			map->bytes_compressed  += compressed;
+		} else {
+			rec->session->bytes_transferred += size;
+			rec->session->bytes_compressed  += compressed;
+		}
 	}
 
 	rec->samples++;
-	return record__write(rec, map, bf, size);
+	return record__write(rec, map, bf, compressed);
 }
 
 static volatile int signr = -1;
@@ -1079,18 +1105,15 @@ static size_t process_comp_header(void *record, size_t increment)
 	return size;
 }
 
-static size_t zstd_compress(struct perf_session *session, void *dst, size_t dst_size,
+static size_t zstd_compress(struct zstd_data *zstd_data, void *dst, size_t dst_size,
 			    void *src, size_t src_size)
 {
 	size_t compressed;
 	size_t max_record_size = PERF_SAMPLE_MAX_SIZE - sizeof(struct perf_record_compressed) - 1;
 
-	compressed = zstd_compress_stream_to_records(&session->zstd_data, dst, dst_size, src, src_size,
+	compressed = zstd_compress_stream_to_records(zstd_data, dst, dst_size, src, src_size,
 						     max_record_size, process_comp_header);
 
-	session->bytes_transferred += src_size;
-	session->bytes_compressed  += compressed;
-
 	return compressed;
 }
 
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index ab7108d22428..a2c5e4237592 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -230,6 +230,8 @@ void mmap__munmap(struct mmap *map)
 {
 	bitmap_free(map->affinity_mask.bits);
 
+	zstd_fini(&map->zstd_data);
+
 	perf_mmap__aio_munmap(map);
 	if (map->data != NULL) {
 		munmap(map->data, mmap__mmap_len(map));
@@ -291,6 +293,10 @@ int mmap__mmap(struct mmap *map, struct mmap_params *mp, int fd, int cpu)
 	map->core.flush = mp->flush;
 
 	map->comp_level = mp->comp_level;
+	if (zstd_init(&map->zstd_data, map->comp_level)) {
+		pr_debug2("failed to init mmap commpressor, error %d\n", errno);
+		return -1;
+	}
 
 	if (map->comp_level && !perf_mmap__aio_enabled(map)) {
 		map->data = mmap(NULL, mmap__mmap_len(map), PROT_READ|PROT_WRITE,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 9d5f589f02ae..d98b0bdedeac 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -13,6 +13,7 @@
 #endif
 #include "auxtrace.h"
 #include "event.h"
+#include "util/compress.h"
 
 struct aiocb;
 
@@ -43,6 +44,10 @@ struct mmap {
 	struct mmap_cpu_mask	affinity_mask;
 	void		*data;
 	int		comp_level;
+	struct perf_data_file *file;
+	struct zstd_data      zstd_data;
+	u64		      bytes_transferred;
+	u64		      bytes_compressed;
 };
 
 struct mmap_params {
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 08/15] perf record: write trace data into mmap trace files
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (6 preceding siblings ...)
  2020-10-21 16:01 ` [PATCH v2 07/15] perf record: introduce trace file, compressor and stats in mmap object Alexey Budankov
@ 2020-10-21 16:02 ` Alexey Budankov
  2020-10-24 15:44   ` Jiri Olsa
  2020-10-21 16:03 ` [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming Alexey Budankov
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:02 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Write trace buffer data into per mmap trace files located
at data directory. Streaming thread adjusts its affinity
according to mask of the buffer being processed.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 44 ++++++++++++++++++++++++++++++++-----
 tools/perf/util/record.h    |  1 +
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 619aaee11231..ba26d75c51d6 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -120,6 +120,11 @@ static const char *affinity_tags[PERF_AFFINITY_MAX] = {
 	"SYS", "NODE", "CPU"
 };
 
+static int record__threads_enabled(struct record *rec)
+{
+	return rec->opts.threads;
+}
+
 static bool switch_output_signal(struct record *rec)
 {
 	return rec->switch_output.signal &&
@@ -894,6 +899,20 @@ static int record__mmap_evlist(struct record *rec,
 				return -EINVAL;
 		}
 	}
+
+	if (record__threads_enabled(rec)) {
+		int i, ret, nr = evlist->core.nr_mmaps;
+		struct mmap *mmaps = rec->opts.overwrite ?
+				     evlist->overwrite_mmap : evlist->mmap;
+
+		ret = perf_data__create_dir(&rec->data, evlist->core.nr_mmaps);
+		if (ret)
+			return ret;
+
+		for (i = 0; i < nr; i++)
+			mmaps[i].file = &rec->data.dir.files[i];
+	}
+
 	return 0;
 }
 
@@ -1184,8 +1203,12 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	/*
 	 * Mark the round finished in case we wrote
 	 * at least one event.
+	 *
+	 * No need for round events in directory mode,
+	 * because per-cpu maps and files have data
+	 * sorted by kernel.
 	 */
-	if (bytes_written != rec->bytes_written)
+	if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
 		rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
 
 	if (overwrite)
@@ -1231,7 +1254,9 @@ static void record__init_features(struct record *rec)
 	if (!rec->opts.use_clockid)
 		perf_header__clear_feat(&session->header, HEADER_CLOCK_DATA);
 
-	perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+	if (!record__threads_enabled(rec))
+		perf_header__clear_feat(&session->header, HEADER_DIR_FORMAT);
+
 	if (!record__comp_enabled(rec))
 		perf_header__clear_feat(&session->header, HEADER_COMPRESSED);
 
@@ -1242,15 +1267,21 @@ static void
 record__finish_output(struct record *rec)
 {
 	struct perf_data *data = &rec->data;
-	int fd = perf_data__fd(data);
+	int i, fd = perf_data__fd(data);
 
 	if (data->is_pipe)
 		return;
 
 	rec->session->header.data_size += rec->bytes_written;
 	data->file.size = lseek(perf_data__fd(data), 0, SEEK_CUR);
+	if (record__threads_enabled(rec)) {
+		for (i = 0; i < data->dir.nr; i++)
+			data->dir.files[i].size = lseek(data->dir.files[i].fd, 0, SEEK_CUR);
+	}
 
 	if (!rec->no_buildid) {
+		/* this will be recalculated during process_buildids() */
+		rec->samples = 0;
 		process_buildids(rec);
 
 		if (rec->buildid_all)
@@ -2041,8 +2072,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		status = err;
 
 	record__synthesize(rec, true);
-	/* this will be recalculated during process_buildids() */
-	rec->samples = 0;
 
 	if (!err) {
 		if (!rec->timestamp_filename) {
@@ -2680,9 +2709,12 @@ int cmd_record(int argc, const char **argv)
 
 	}
 
-	if (rec->opts.kcore)
+	if (rec->opts.kcore || record__threads_enabled(rec))
 		rec->data.is_dir = true;
 
+	if (record__threads_enabled(rec))
+		rec->opts.affinity = PERF_AFFINITY_CPU;
+
 	if (rec->opts.comp_level != 0) {
 		pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
 		rec->no_buildid = true;
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index 266760ac9143..aeda3cdaa3e9 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -74,6 +74,7 @@ struct record_opts {
 	int	      ctl_fd;
 	int	      ctl_fd_ack;
 	bool	      ctl_fd_close;
+	bool	      threads;
 };
 
 extern const char * const *record_usage;
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (7 preceding siblings ...)
  2020-10-21 16:02 ` [PATCH v2 08/15] perf record: write trace data into mmap trace files Alexey Budankov
@ 2020-10-21 16:03 ` Alexey Budankov
  2020-10-24 15:44   ` Jiri Olsa
  2020-10-21 16:04 ` [PATCH v2 10/15] perf record: manage thread specific data array Alexey Budankov
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:03 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Introduce thread local data object and its array to be used for
threaded trace streaming.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ba26d75c51d6..8e512096a060 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -85,11 +85,29 @@ struct switch_output {
 	int		 cur_file;
 };
 
+struct thread_data {
+	pid_t		   tid;
+	struct {
+		int	   msg[2];
+		int	   ack[2];
+	} comm;
+	struct fdarray	   pollfd;
+	int		   ctlfd_pos;
+	struct mmap	   *maps;
+	int		   nr_mmaps;
+	struct record	   *rec;
+	unsigned long long samples;
+	unsigned long	   waking;
+	u64		   bytes_written;
+};
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
 	u64			bytes_written;
 	struct perf_data	data;
+	struct thread_data	*thread_data;
+	int			nr_thread_data;
 	struct auxtrace_record	*itr;
 	struct evlist	*evlist;
 	struct perf_session	*session;
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 10/15] perf record: manage thread specific data array
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (8 preceding siblings ...)
  2020-10-21 16:03 ` [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming Alexey Budankov
@ 2020-10-21 16:04 ` Alexey Budankov
  2020-10-24 15:44   ` Jiri Olsa
  2020-10-21 16:06 ` [PATCH v2 11/15] perf evlist: introduce evlist__ctlfd_update() to update ctl fd status Alexey Budankov
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:04 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Provide allocation, initialization, finalization and releasing of
thread specific objects at thread specific data array. Allocate
thread specific object for every data buffer making one-to-one
relation between data buffer and a thread processing the buffer.
Deliver event fd related signals to thread's pollfd object.
Deliver thread control commands to ctlfd_pos fd of thread 1+.
Deliver tool external control commands via ctlfd_pos fd of thread 0.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 8e512096a060..89cb8e913fb3 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
 	return kcore_copy(from_dir, kcore_dir);
 }
 
+static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
+				     struct fdarray *evlist_pollfd, int ctlfd_pos)
+{
+	int i, j, k, nr_thread_data;
+	struct thread_data *thread_data;
+
+	rec->nr_thread_data = nr_thread_data = nr_mmaps;
+	rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
+	if (!thread_data) {
+		pr_err("Failed to allocate thread data\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < nr_thread_data; i++) {
+		short revents;
+		int pos, fd;
+
+		thread_data[i].tid = -1;
+
+		if (pipe(thread_data[i].comm.msg) ||
+		    pipe(thread_data[i].comm.ack)) {
+			pr_err("Failed to create thread comm pipes, errno %d\n", errno);
+			return -ENOMEM;
+		}
+
+		thread_data[i].maps = &mmaps[i];
+		thread_data[i].nr_mmaps = 1;
+
+		thread_data[i].rec = rec;
+
+		fdarray__init(&(thread_data[i].pollfd), 64);
+
+		for (j = 0; j < thread_data[i].nr_mmaps; j++) {
+			struct mmap *map = &(thread_data[i].maps[j]);
+
+			for (k = 0; k < evlist_pollfd->nr; k++) {
+				if (evlist_pollfd->priv[k].ptr != map)
+					continue;
+
+				fd = evlist_pollfd->entries[k].fd;
+				revents = evlist_pollfd->entries[k].events;
+				pos = fdarray__add(&(thread_data[i].pollfd),
+						fd, revents | POLLERR | POLLHUP,
+						fdarray_flag__default);
+				if (pos >= 0)
+					thread_data[i].pollfd.priv[pos].ptr = map;
+				else
+					return -ENOMEM;
+			}
+		}
+
+		if (i) {
+			fd = thread_data[i].comm.msg[0];
+			revents = POLLIN | POLLERR | POLLHUP;
+		} else {
+			if (ctlfd_pos == -1)
+				continue;
+			fd = evlist_pollfd->entries[ctlfd_pos].fd;
+			revents = evlist_pollfd->entries[ctlfd_pos].events;
+		}
+		thread_data[i].ctlfd_pos =
+				fdarray__add(&(thread_data[i].pollfd),
+					     fd, revents, fdarray_flag__nonfilterable);
+		if (thread_data[i].ctlfd_pos < 0)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int record__free_thread_data(struct record *rec)
+{
+	int i;
+
+	if (rec->thread_data) {
+		for (i = 0; i < rec->nr_thread_data; i++) {
+			close(rec->thread_data[i].comm.msg[0]);
+			close(rec->thread_data[i].comm.msg[1]);
+			close(rec->thread_data[i].comm.ack[0]);
+			close(rec->thread_data[i].comm.ack[1]);
+			fdarray__exit(&(rec->thread_data[i].pollfd));
+		}
+		zfree(&rec->thread_data);
+	}
+
+	return 0;
+}
+
 static int record__mmap_evlist(struct record *rec,
 			       struct evlist *evlist)
 {
@@ -918,6 +1006,9 @@ static int record__mmap_evlist(struct record *rec,
 		}
 	}
 
+	if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
+		return -1;
+
 	if (record__threads_enabled(rec)) {
 		int i, ret, nr = evlist->core.nr_mmaps;
 		struct mmap *mmaps = rec->opts.overwrite ?
@@ -929,6 +1020,12 @@ static int record__mmap_evlist(struct record *rec,
 
 		for (i = 0; i < nr; i++)
 			mmaps[i].file = &rec->data.dir.files[i];
+
+		ret = record__alloc_thread_data(rec, mmaps, nr,
+						&evlist->core.pollfd,
+						evlist->ctl_fd.pos);
+		if (ret)
+			return ret;
 	}
 
 	return 0;
@@ -1910,9 +2007,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		perf_evlist__start_workload(rec->evlist);
 	}
 
-	if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
-		goto out_child;
-
 	if (opts->initial_delay) {
 		pr_info(EVLIST_DISABLED_MSG);
 		if (opts->initial_delay > 0) {
@@ -2063,6 +2157,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		record__synthesize_workload(rec, true);
 
 out_child:
+	record__free_thread_data(rec);
 	evlist__finalize_ctlfd(rec->evlist);
 	record__mmap_read_all(rec, true);
 	record__aio_mmap_read_sync(rec);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 11/15] perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (9 preceding siblings ...)
  2020-10-21 16:04 ` [PATCH v2 10/15] perf record: manage thread specific data array Alexey Budankov
@ 2020-10-21 16:06 ` Alexey Budankov
  2020-10-21 16:07 ` [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming Alexey Budankov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:06 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Introduce evlist__ctlfd_update() to update ctl fd poll status
in evlist pollfd array using other pollfd object.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/util/evlist.c | 16 ++++++++++++++++
 tools/perf/util/evlist.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 8bdf3d2c907c..758a4896fedd 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1970,6 +1970,22 @@ int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd)
 	return err;
 }
 
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update)
+{
+	int ctlfd_pos = evlist->ctl_fd.pos;
+	struct pollfd *entries = evlist->core.pollfd.entries;
+
+	if (!evlist__ctlfd_initialized(evlist))
+		return 0;
+
+	if (entries[ctlfd_pos].fd != update->fd ||
+	    entries[ctlfd_pos].events != update->events)
+		return -1;
+
+	entries[ctlfd_pos].revents = update->revents;
+	return 0;
+}
+
 struct evsel *evlist__find_evsel(struct evlist *evlist, int idx)
 {
 	struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index e1a450322bc5..9b73d6ccf066 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -380,6 +380,7 @@ void evlist__close_control(int ctl_fd, int ctl_fd_ack, bool *ctl_fd_close);
 int evlist__initialize_ctlfd(struct evlist *evlist, int ctl_fd, int ctl_fd_ack);
 int evlist__finalize_ctlfd(struct evlist *evlist);
 bool evlist__ctlfd_initialized(struct evlist *evlist);
+int evlist__ctlfd_update(struct evlist *evlist, struct pollfd *update);
 int evlist__ctlfd_process(struct evlist *evlist, enum evlist_ctl_cmd *cmd);
 int evlist__ctlfd_ack(struct evlist *evlist);
 
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (10 preceding siblings ...)
  2020-10-21 16:06 ` [PATCH v2 11/15] perf evlist: introduce evlist__ctlfd_update() to update ctl fd status Alexey Budankov
@ 2020-10-21 16:07 ` Alexey Budankov
  2020-10-24 15:43   ` Jiri Olsa
  2020-10-21 16:08 ` [PATCH v2 13/15] perf record: stop threads in the end of " Alexey Budankov
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:07 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Introduce thread local variable and use it for threaded trace streaming.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 89cb8e913fb3..3b7e9026f25b 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -101,6 +101,8 @@ struct thread_data {
 	u64		   bytes_written;
 };
 
+static __thread struct thread_data *thread;
+
 struct record {
 	struct perf_tool	tool;
 	struct record_opts	opts;
@@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
 		}
 	}
 
-	rec->samples++;
+	if (thread)
+		thread->samples++;
+	else
+		rec->samples++;
+
 	return record__write(rec, map, bf, compressed);
 }
 
@@ -1258,6 +1264,7 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	int i;
 	int rc = 0;
 	struct mmap *maps;
+	int nr_mmaps;
 	int trace_fd = rec->data.file.fd;
 	off_t off = 0;
 
@@ -1265,6 +1272,14 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 		return 0;
 
 	maps = overwrite ? evlist->overwrite_mmap : evlist->mmap;
+	nr_mmaps = evlist->core.nr_mmaps;
+
+	if (thread) {
+		bytes_written = thread->bytes_written;
+		maps = thread->maps;
+		nr_mmaps = thread->nr_mmaps;
+	}
+
 	if (!maps)
 		return 0;
 
@@ -1274,7 +1289,7 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	if (record__aio_enabled(rec))
 		off = record__aio_get_pos(trace_fd);
 
-	for (i = 0; i < evlist->core.nr_mmaps; i++) {
+	for (i = 0; i < nr_mmaps; i++) {
 		u64 flush = 0;
 		struct mmap *map = &maps[i];
 
@@ -1323,7 +1338,7 @@ static int record__mmap_read_evlist(struct record *rec, struct evlist *evlist,
 	 * because per-cpu maps and files have data
 	 * sorted by kernel.
 	 */
-	if (!record__threads_enabled(rec) && bytes_written != rec->bytes_written)
+	if (!thread && bytes_written != rec->bytes_written)
 		rc = record__write(rec, NULL, &finished_round_event, sizeof(finished_round_event));
 
 	if (overwrite)
@@ -1343,6 +1358,15 @@ static int record__mmap_read_all(struct record *rec, bool synch)
 	return record__mmap_read_evlist(rec, rec->evlist, true, synch);
 }
 
+static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
+					   void *arg __maybe_unused)
+{
+	struct perf_mmap *map = fda->priv[fd].ptr;
+
+	if (map)
+		perf_mmap__put(map);
+}
+
 static void record__init_features(struct record *rec)
 {
 	struct perf_session *session = rec->session;
@@ -2020,7 +2044,12 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	trigger_ready(&switch_output_trigger);
 	perf_hooks__invoke_record_start();
 	for (;;) {
-		unsigned long long hits = rec->samples;
+		unsigned long long hits0, hits1;
+
+		if (thread)
+			hits0 = thread->samples;
+		else
+			hits0 = rec->samples;
 
 		/*
 		 * rec->evlist->bkw_mmap_state is possible to be
@@ -2089,20 +2118,44 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 				alarm(rec->switch_output.time);
 		}
 
-		if (hits == rec->samples) {
+		if (thread)
+			hits1 = thread->samples;
+		else
+			hits1 = rec->samples;
+
+		if (hits0 == hits1) {
 			if (done || draining)
 				break;
-			err = evlist__poll(rec->evlist, -1);
+
+			if (thread)
+				err = fdarray__poll(&thread->pollfd, -1);
+			else
+				err = evlist__poll(rec->evlist, -1);
 			/*
 			 * Propagate error, only if there's any. Ignore positive
 			 * number of returned events and interrupt error.
 			 */
 			if (err > 0 || (err < 0 && errno == EINTR))
 				err = 0;
-			waking++;
 
-			if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
-				draining = true;
+			if (thread) {
+				thread->waking++;
+				if (thread->ctlfd_pos != -1) {
+					evlist__ctlfd_update(rec->evlist,
+						&(thread->pollfd.entries[thread->ctlfd_pos]));
+				}
+			} else {
+				waking++;
+			}
+
+			if (thread) {
+				if (fdarray__filter(&thread->pollfd, POLLERR | POLLHUP,
+						    record__thread_munmap_filtered, NULL) == 0)
+					draining = true;
+			} else {
+				if (evlist__filter_pollfd(rec->evlist, POLLERR | POLLHUP) == 0)
+					draining = true;
+			}
 		}
 
 		if (evlist__ctlfd_process(rec->evlist, &cmd) > 0) {
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 13/15] perf record: stop threads in the end of trace streaming
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (11 preceding siblings ...)
  2020-10-21 16:07 ` [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming Alexey Budankov
@ 2020-10-21 16:08 ` Alexey Budankov
  2020-10-21 16:10 ` [PATCH v2 14/15] perf record: start threads in the beginning " Alexey Budankov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Close write fd of comm.msg pipe to signal thread to terminate
and receive THREAD_MSG__READY confirmation on termination.
Accumulate thread stats into global stats to be correctly
calculated and displayed in perf tool output.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 64 ++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 3b7e9026f25b..a15642656066 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -85,6 +85,16 @@ struct switch_output {
 	int		 cur_file;
 };
 
+enum thread_msg {
+	THREAD_MSG__UNSUPPORTED = 0,
+	THREAD_MSG__READY,
+	THREAD_MSG__MAX,
+};
+
+static const char *thread_msg_tags[THREAD_MSG__MAX] = {
+	"UNSUPPORTED", "READY"
+};
+
 struct thread_data {
 	pid_t		   tid;
 	struct {
@@ -1796,6 +1806,50 @@ static void hit_auxtrace_snapshot_trigger(struct record *rec)
 	}
 }
 
+static int record__terminate_thread(struct thread_data *thread_data)
+{
+	int res;
+	enum thread_msg ack = THREAD_MSG__UNSUPPORTED;
+	pid_t tid = thread_data->tid;
+
+	close(thread_data->comm.msg[1]);
+	res = read(thread_data->comm.ack[0], &ack, sizeof(ack));
+	if (res != -1)
+		pr_debug("threads: %d -> %s\n", tid, thread_msg_tags[ack]);
+	else
+		pr_err("threads: failed to recv msg=%s from %d\n",
+		       thread_msg_tags[ack], tid);
+
+	return 0;
+}
+
+static int record__stop_threads(struct record *rec, unsigned long *waking)
+{
+	int i, j, nr_thread_data = rec->nr_thread_data;
+	struct thread_data *thread_data = rec->thread_data;
+
+	if (!record__threads_enabled(rec))
+		return 0;
+
+	for (i = 1; i < nr_thread_data; i++)
+		record__terminate_thread(&thread_data[i]);
+
+	for (i = 0; i < nr_thread_data; i++) {
+		pr_debug("threads: %d : samples %lld, wakes %ld\n",
+			 thread_data[i].tid, thread_data[i].samples,
+			 thread_data[i].waking);
+
+		rec->samples += thread_data[i].samples;
+		*waking      += thread_data[i].waking;
+		for (j = 0; j < thread_data[i].nr_mmaps; j++) {
+			rec->session->bytes_transferred += thread_data[i].maps[j].bytes_transferred;
+			rec->session->bytes_compressed  += thread_data[i].maps[j].bytes_compressed;
+		}
+	}
+
+	return 0;
+}
+
 static int __cmd_record(struct record *rec, int argc, const char **argv)
 {
 	int err;
@@ -1903,7 +1957,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 
 	if (record__open(rec) != 0) {
 		err = -1;
-		goto out_child;
+		goto out_free_threads;
 	}
 	session->header.env.comp_mmap_len = session->evlist->core.mmap_len;
 
@@ -2203,18 +2257,20 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		goto out_child;
 	}
 
-	if (!quiet)
-		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
-
 	if (target__none(&rec->opts.target))
 		record__synthesize_workload(rec, true);
 
 out_child:
+	record__stop_threads(rec, &waking);
+out_free_threads:
 	record__free_thread_data(rec);
 	evlist__finalize_ctlfd(rec->evlist);
 	record__mmap_read_all(rec, true);
 	record__aio_mmap_read_sync(rec);
 
+	if (!quiet)
+		fprintf(stderr, "[ perf record: Woken up %ld times to write data ]\n", waking);
+
 	if (rec->session->bytes_transferred && rec->session->bytes_compressed) {
 		ratio = (float)rec->session->bytes_transferred/(float)rec->session->bytes_compressed;
 		session->header.env.comp_ratio = ratio + 0.5;
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 14/15] perf record: start threads in the beginning of trace streaming
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (12 preceding siblings ...)
  2020-10-21 16:08 ` [PATCH v2 13/15] perf record: stop threads in the end of " Alexey Budankov
@ 2020-10-21 16:10 ` Alexey Budankov
  2020-10-24 15:44   ` Jiri Olsa
  2020-10-21 16:10 ` [PATCH v2 15/15] perf record: introduce --threads command line option Alexey Budankov
  2020-10-24 15:43 ` [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
  15 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Start threads in detached state because its management is possible
via messaging. Block signals prior the threads start so only main
tool thread would be notified on external async signals during data
collection. Streaming threads connect one-to-one to mapped data
buffers and write into per-CPU trace files located at data directory.
Data buffers and threads are affined to local NUMA nodes and monitored
CPUs according to system topology. --cpu option can be used to specify
CPUs to be monitored.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
 1 file changed, 120 insertions(+), 8 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index a15642656066..1d41e996a994 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -56,6 +56,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <unistd.h>
+#include <sys/syscall.h>
 #include <sched.h>
 #include <signal.h>
 #ifdef HAVE_EVENTFD_SUPPORT
@@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
 		perf_mmap__put(map);
 }
 
+static void *record__thread(void *arg)
+{
+	enum thread_msg msg = THREAD_MSG__READY;
+	bool terminate = false;
+	struct fdarray *pollfd;
+	int err, ctlfd_pos;
+
+	thread = arg;
+	thread->tid = syscall(SYS_gettid);
+
+	err = write(thread->comm.ack[1], &msg, sizeof(msg));
+	if (err == -1)
+		pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
+
+	pollfd = &(thread->pollfd);
+	ctlfd_pos = thread->ctlfd_pos;
+
+	for (;;) {
+		unsigned long long hits = thread->samples;
+
+		if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
+			break;
+
+		if (hits == thread->samples) {
+
+			err = fdarray__poll(pollfd, -1);
+			/*
+			 * Propagate error, only if there's any. Ignore positive
+			 * number of returned events and interrupt error.
+			 */
+			if (err > 0 || (err < 0 && errno == EINTR))
+				err = 0;
+			thread->waking++;
+
+			if (fdarray__filter(pollfd, POLLERR | POLLHUP,
+					    record__thread_munmap_filtered, NULL) == 0)
+				break;
+		}
+
+		if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
+			terminate = true;
+			close(thread->comm.msg[0]);
+			pollfd->entries[ctlfd_pos].fd = -1;
+			pollfd->entries[ctlfd_pos].events = 0;
+		}
+
+		pollfd->entries[ctlfd_pos].revents = 0;
+	}
+
+	err = write(thread->comm.ack[1], &msg, sizeof(msg));
+	if (err == -1)
+		pr_err("threads: %d failed to notify on termination. Error %m", thread->tid);
+
+	return NULL;
+}
+
 static void record__init_features(struct record *rec)
 {
 	struct perf_session *session = rec->session;
@@ -1823,6 +1880,58 @@ static int record__terminate_thread(struct thread_data *thread_data)
 	return 0;
 }
 
+static int record__start_threads(struct record *rec)
+{
+	int i, j, ret = 0;
+	sigset_t full, mask;
+	pthread_t handle;
+	pthread_attr_t attrs;
+	int nr_thread_data = rec->nr_thread_data;
+	struct thread_data *thread_data = rec->thread_data;
+
+	if (!record__threads_enabled(rec))
+		return 0;
+
+	sigfillset(&full);
+	if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+		pr_err("Failed to block signals on threads start. Error: %m\n");
+		return -1;
+	}
+
+	pthread_attr_init(&attrs);
+	pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+	for (i = 1; i < nr_thread_data; i++) {
+		int res = 0;
+		enum thread_msg msg = THREAD_MSG__UNSUPPORTED;
+
+		if (pthread_create(&handle, &attrs, record__thread, &thread_data[i])) {
+			for (j = 1; j < i; j++)
+				record__terminate_thread(&thread_data[i]);
+			pr_err("Failed to start threads. Error: %m\n");
+			ret = -1;
+			goto out_err;
+		}
+
+		res = read(thread_data[i].comm.ack[0], &msg, sizeof(msg));
+		if (res > 0)
+			pr_debug("threads: %d -> %s\n", rec->thread_data[i].tid,
+				 thread_msg_tags[msg]);
+	}
+
+	thread = &thread_data[0];
+	thread->tid = syscall(SYS_gettid);
+	pr_debug("threads: %d -> %s\n", thread->tid, thread_msg_tags[THREAD_MSG__READY]);
+
+out_err:
+	if (sigprocmask(SIG_SETMASK, &mask, NULL)) {
+		pr_err("Failed to unblock signals on threads start. Error: %m\n");
+		ret = -1;
+	}
+
+	return ret;
+}
+
 static int record__stop_threads(struct record *rec, unsigned long *waking)
 {
 	int i, j, nr_thread_data = rec->nr_thread_data;
@@ -1965,7 +2074,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		err = record__kcore_copy(&session->machines.host, data);
 		if (err) {
 			pr_err("ERROR: Failed to copy kcore\n");
-			goto out_child;
+			goto out_free_threads;
 		}
 	}
 
@@ -1976,7 +2085,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		bpf__strerror_apply_obj_config(err, errbuf, sizeof(errbuf));
 		pr_err("ERROR: Apply config to BPF failed: %s\n",
 			 errbuf);
-		goto out_child;
+		goto out_free_threads;
 	}
 
 	/*
@@ -1994,11 +2103,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	if (data->is_pipe) {
 		err = perf_header__write_pipe(fd);
 		if (err < 0)
-			goto out_child;
+			goto out_free_threads;
 	} else {
 		err = perf_session__write_header(session, rec->evlist, fd, false);
 		if (err < 0)
-			goto out_child;
+			goto out_free_threads;
 	}
 
 	err = -1;
@@ -2006,16 +2115,16 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 	    && !perf_header__has_feat(&session->header, HEADER_BUILD_ID)) {
 		pr_err("Couldn't generate buildids. "
 		       "Use --no-buildid to profile anyway.\n");
-		goto out_child;
+		goto out_free_threads;
 	}
 
 	err = record__setup_sb_evlist(rec);
 	if (err)
-		goto out_child;
+		goto out_free_threads;
 
 	err = record__synthesize(rec, false);
 	if (err < 0)
-		goto out_child;
+		goto out_free_threads;
 
 	if (rec->realtime_prio) {
 		struct sched_param param;
@@ -2024,10 +2133,13 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		if (sched_setscheduler(0, SCHED_FIFO, &param)) {
 			pr_err("Could not set realtime priority.\n");
 			err = -1;
-			goto out_child;
+			goto out_free_threads;
 		}
 	}
 
+	if (record__start_threads(rec))
+		goto out_free_threads;
+
 	/*
 	 * When perf is starting the traced process, all the events
 	 * (apart from group members) have enable_on_exec=1 set,
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 15/15] perf record: introduce --threads command line option
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (13 preceding siblings ...)
  2020-10-21 16:10 ` [PATCH v2 14/15] perf record: start threads in the beginning " Alexey Budankov
@ 2020-10-21 16:10 ` Alexey Budankov
  2020-10-24 15:43 ` [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
  15 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-21 16:10 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Namhyung Kim, Alexander Shishkin, Adrian Hunter, Andi Kleen,
	Peter Zijlstra, Ingo Molnar, linux-kernel


Provide --threads option in perf record command line interface.
Threaded streaming mode mitigates profiling data losses and
resolves scalability issues of serial and asynchronous (--aio) trace
streaming modes on multicore server systems. The implementation is
based on the prototype [1], [2] and the most closely relates to mode 3)
"mode that creates thread for every monitored memory map".

Threaded streaming mode is available with Zstd compression/decompression
(--compression-level) and handling of external commands (--control).
AUX area tracing, related and derived modes like --snapshot or
--aux-sample are not enabled. --switch-output, --switch-output-event,
--switch-max-files and --timestamp-filename options are not enabled.
Threaded trace streaming is not enabled for pipe mode. Initial intent
to enable AUX area tracing faced the need to define some optimal way
to store index data at data directory thus left aside. --switch-output-*
and --timestamp-filename use cases are not yet clear for data directories.

Asynchronous (--aio) trace streaming and affinity (--affinity) modes
are mutually exclusive to the exposed threaded streaming mode.

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/

Suggested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
---
 tools/perf/Documentation/perf-record.txt |  7 ++++
 tools/perf/builtin-record.c              | 45 +++++++++++++++++++++---
 2 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 768888b9326a..d8fa387da973 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -671,6 +671,13 @@ Example of bash shell script to enable and disable events during measurements:
  wait -n ${perf_pid}
  exit $?
 
+--threads::
+Write collected trace data into per-CPU trace files using parallel threads.
+List of monitored CPUs can be configured by a mask provided via --cpu option.
+Trace writing threads correspond one-to-one to mapped data buffers. Threads
+and buffers are affined to monitored CPUs and NUMA nodes according to system
+topology. Threaded trace streaming mode is mutually exclusive to asynchronous
+trace streaming (--aio) and affinity (--affinity) modes.
 
 SEE ALSO
 --------
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 1d41e996a994..575b0b595081 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -780,6 +780,12 @@ static int record__auxtrace_init(struct record *rec)
 {
 	int err;
 
+	if ((rec->opts.auxtrace_snapshot_opts || rec->opts.auxtrace_sample_opts)
+	    && record__threads_enabled(rec)) {
+		pr_err("AUX area tracing options are not available in threaded streaming mode.\n");
+		return -EINVAL;
+	}
+
 	if (!rec->itr) {
 		rec->itr = auxtrace_record__init(rec->evlist, &err);
 		if (err)
@@ -2008,6 +2014,11 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
 		return PTR_ERR(session);
 	}
 
+	if (record__threads_enabled(rec) && perf_data__is_pipe(&rec->data)) {
+		pr_err("Threaded trace streaming is not available in pipe mode.\n");
+		return -1;
+	}
+
 	fd = perf_data__fd(data);
 	rec->session = session;
 
@@ -2680,12 +2691,22 @@ static int switch_output_setup(struct record *rec)
 	 * --switch-output=signal, as we'll send a SIGUSR2 from the side band
 	 *  thread to its parent.
 	 */
-	if (rec->switch_output_event_set)
+	if (rec->switch_output_event_set) {
+		if (record__threads_enabled(rec)) {
+			pr_warning("WARNING: --switch-output-event option is not available in threaded streaming mode.\n");
+			return 0;
+		}
 		goto do_signal;
+	}
 
 	if (!s->set)
 		return 0;
 
+	if (record__threads_enabled(rec)) {
+		pr_warning("WARNING: --switch-output option is not available in threaded streaming mode.\n");
+		return 0;
+	}
+
 	if (!strcmp(s->str, "signal")) {
 do_signal:
 		s->signal = true;
@@ -2964,8 +2985,8 @@ static struct option __record_options[] = {
 		     "Set affinity mask of trace reading thread to NUMA node cpu mask or cpu of processed mmap buffer",
 		     record__parse_affinity),
 #ifdef HAVE_ZSTD_SUPPORT
-	OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default,
-			    "n", "Compressed records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
+	OPT_CALLBACK_OPTARG('z', "compression-level", &record.opts, &comp_level_default, "n",
+			    "Compress records using specified level (default: 1 - fastest compression, 22 - greatest compression)",
 			    record__parse_comp_level),
 #endif
 	OPT_CALLBACK(0, "max-size", &record.output_max_size,
@@ -2984,6 +3005,8 @@ static struct option __record_options[] = {
 		     "\t\t\t  Optionally send control command completion ('ack\\n') to ack-fd descriptor.\n"
 		     "\t\t\t  Alternatively, ctl-fifo / ack-fifo will be opened and used as ctl-fd / ack-fd.",
 		      parse_control_option),
+	OPT_BOOLEAN(0, "threads", &record.opts.threads,
+		    "write collected trace data into per-CPU trace files using parallel threads"),
 	OPT_END()
 };
 
@@ -3046,8 +3069,17 @@ int cmd_record(int argc, const char **argv)
 	if (rec->opts.kcore || record__threads_enabled(rec))
 		rec->data.is_dir = true;
 
-	if (record__threads_enabled(rec))
+	if (record__threads_enabled(rec)) {
+		if (rec->opts.affinity != PERF_AFFINITY_SYS) {
+			pr_err("--affinity option is mutually exclusive to threaded streaming mode.\n");
+			goto out_opts;
+		}
 		rec->opts.affinity = PERF_AFFINITY_CPU;
+		if (record__aio_enabled(rec)) {
+			pr_err("Asynchronous streaming mode (--aio) is mutually exclusive to threaded streaming mode.\n");
+			goto out_opts;
+		}
+	}
 
 	if (rec->opts.comp_level != 0) {
 		pr_debug("Compression enabled, disabling build id collection at the end of the session.\n");
@@ -3082,6 +3114,11 @@ int cmd_record(int argc, const char **argv)
 		}
 	}
 
+	if (rec->timestamp_filename && record__threads_enabled(rec)) {
+		rec->timestamp_filename = false;
+		pr_warning("WARNING: --timestamp-filename option is not available in threaded streaming mode.\n");
+	}
+
 	/*
 	 * Allow aliases to facilitate the lookup of symbols for address
 	 * filters. Refer to auxtrace_parse_filters().
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump
  2020-10-21 15:56 ` [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump Alexey Budankov
@ 2020-10-22  4:28   ` Namhyung Kim
  0 siblings, 0 replies; 64+ messages in thread
From: Namhyung Kim @ 2020-10-22  4:28 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

Hi,

On Thu, Oct 22, 2020 at 12:56 AM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> Extend reader, ordered_event and decomp objects to contain path
> of a trace file being displayed.
>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks
Namhyung

> ---
>  tools/perf/util/ordered-events.h | 1 +
>  tools/perf/util/session.c        | 2 ++
>  tools/perf/util/session.h        | 1 +
>  3 files changed, 4 insertions(+)
>
> diff --git a/tools/perf/util/ordered-events.h b/tools/perf/util/ordered-events.h
> index 75345946c4b9..42c9764c6b5b 100644
> --- a/tools/perf/util/ordered-events.h
> +++ b/tools/perf/util/ordered-events.h
> @@ -9,6 +9,7 @@ struct perf_sample;
>  struct ordered_event {
>         u64                     timestamp;
>         u64                     file_offset;
> +       const char              *file_path;
>         union perf_event        *event;
>         struct list_head        list;
>  };
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 7a5f03764702..4478ddae485b 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2119,6 +2119,7 @@ typedef s64 (*reader_cb_t)(struct perf_session *session,
>
>  struct reader {
>         int              fd;
> +       const char       *path;
>         u64              data_size;
>         u64              data_offset;
>         reader_cb_t      process;
> @@ -2241,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
>                 .data_size      = session->header.data_size,
>                 .data_offset    = session->header.data_offset,
>                 .process        = process_simple,
> +               .path           = session->data->file.path,
>         };
>         struct ordered_events *oe = &session->ordered_events;
>         struct perf_tool *tool = session->tool;
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index f76480166d38..378ffc3e2809 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -46,6 +46,7 @@ struct perf_session {
>  struct decomp {
>         struct decomp *next;
>         u64 file_pos;
> +       const char *file_path;
>         size_t mmap_len;
>         u64 head;
>         size_t size;
> --
> 2.24.1
>
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 02/15] perf report: output trace file name in raw trace dump
  2020-10-21 15:57 ` [PATCH v2 02/15] perf report: output trace file name " Alexey Budankov
@ 2020-10-22  4:29   ` Namhyung Kim
  0 siblings, 0 replies; 64+ messages in thread
From: Namhyung Kim @ 2020-10-22  4:29 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Thu, Oct 22, 2020 at 12:57 AM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> Print path and name of a trace file into raw dump (-D)
> <file_offset>@<path/file>. Print offset of PERF_RECORD_COMPRESSED
> record instead of zero for decompressed records:
>   0x2226a@perf.data [0x30]: event: 9
> or
>   0x15cc36@perf.data/data.7 [0x30]: event: 9
>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks
Namhyung

> ---
>  tools/perf/builtin-inject.c |  3 +-
>  tools/perf/util/session.c   | 75 ++++++++++++++++++++++---------------
>  tools/perf/util/tool.h      |  3 +-
>  3 files changed, 48 insertions(+), 33 deletions(-)
>
> diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
> index 452a75fe68e5..037f8a98220c 100644
> --- a/tools/perf/builtin-inject.c
> +++ b/tools/perf/builtin-inject.c
> @@ -106,7 +106,8 @@ static int perf_event__repipe_op2_synth(struct perf_session *session,
>
>  static int perf_event__repipe_op4_synth(struct perf_session *session,
>                                         union perf_event *event,
> -                                       u64 data __maybe_unused)
> +                                       u64 data __maybe_unused,
> +                                       const char *str __maybe_unused)
>  {
>         return perf_event__repipe_synth(session->tool, event);
>  }
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 4478ddae485b..6f09d506b2f6 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -36,7 +36,8 @@
>
>  #ifdef HAVE_ZSTD_SUPPORT
>  static int perf_session__process_compressed_event(struct perf_session *session,
> -                                                 union perf_event *event, u64 file_offset)
> +                                                 union perf_event *event, u64 file_offset,
> +                                                 const char *file_path)
>  {
>         void *src;
>         size_t decomp_size, src_size;
> @@ -58,6 +59,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>         }
>
>         decomp->file_pos = file_offset;
> +       decomp->file_path = file_path;
>         decomp->mmap_len = mmap_len;
>         decomp->head = 0;
>
> @@ -98,7 +100,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>  static int perf_session__deliver_event(struct perf_session *session,
>                                        union perf_event *event,
>                                        struct perf_tool *tool,
> -                                      u64 file_offset);
> +                                      u64 file_offset,
> +                                      const char *file_path);
>
>  static int perf_session__open(struct perf_session *session)
>  {
> @@ -180,7 +183,8 @@ static int ordered_events__deliver_event(struct ordered_events *oe,
>                                                     ordered_events);
>
>         return perf_session__deliver_event(session, event->event,
> -                                          session->tool, event->file_offset);
> +                                          session->tool, event->file_offset,
> +                                          event->file_path);
>  }
>
>  struct perf_session *perf_session__new(struct perf_data *data,
> @@ -452,7 +456,8 @@ static int process_stat_round_stub(struct perf_session *perf_session __maybe_unu
>
>  static int perf_session__process_compressed_event_stub(struct perf_session *session __maybe_unused,
>                                                        union perf_event *event __maybe_unused,
> -                                                      u64 file_offset __maybe_unused)
> +                                                      u64 file_offset __maybe_unused,
> +                                                      const char *file_path __maybe_unused)
>  {
>         dump_printf(": unhandled!\n");
>         return 0;
> @@ -1227,13 +1232,14 @@ static void sample_read__printf(struct perf_sample *sample, u64 read_format)
>  }
>
>  static void dump_event(struct evlist *evlist, union perf_event *event,
> -                      u64 file_offset, struct perf_sample *sample)
> +                      u64 file_offset, struct perf_sample *sample,
> +                      const char *file_path)
>  {
>         if (!dump_trace)
>                 return;
>
> -       printf("\n%#" PRIx64 " [%#x]: event: %d\n",
> -              file_offset, event->header.size, event->header.type);
> +       printf("\n%#" PRIx64 "@%s [%#x]: event: %d\n",
> +              file_offset, file_path, event->header.size, event->header.type);
>
>         trace_event(event);
>         if (event->header.type == PERF_RECORD_SAMPLE && evlist->trace_event_sample_raw)
> @@ -1424,12 +1430,13 @@ static int machines__deliver_event(struct machines *machines,
>                                    struct evlist *evlist,
>                                    union perf_event *event,
>                                    struct perf_sample *sample,
> -                                  struct perf_tool *tool, u64 file_offset)
> +                                  struct perf_tool *tool, u64 file_offset,
> +                                  const char *file_path)
>  {
>         struct evsel *evsel;
>         struct machine *machine;
>
> -       dump_event(evlist, event, file_offset, sample);
> +       dump_event(evlist, event, file_offset, sample, file_path);
>
>         evsel = perf_evlist__id2evsel(evlist, sample->id);
>
> @@ -1506,7 +1513,8 @@ static int machines__deliver_event(struct machines *machines,
>  static int perf_session__deliver_event(struct perf_session *session,
>                                        union perf_event *event,
>                                        struct perf_tool *tool,
> -                                      u64 file_offset)
> +                                      u64 file_offset,
> +                                      const char *file_path)
>  {
>         struct perf_sample sample;
>         int ret;
> @@ -1524,7 +1532,7 @@ static int perf_session__deliver_event(struct perf_session *session,
>                 return 0;
>
>         ret = machines__deliver_event(&session->machines, session->evlist,
> -                                     event, &sample, tool, file_offset);
> +                                     event, &sample, tool, file_offset, file_path);
>
>         if (dump_trace && sample.aux_sample.size)
>                 auxtrace__dump_auxtrace_sample(session, &sample);
> @@ -1534,7 +1542,8 @@ static int perf_session__deliver_event(struct perf_session *session,
>
>  static s64 perf_session__process_user_event(struct perf_session *session,
>                                             union perf_event *event,
> -                                           u64 file_offset)
> +                                           u64 file_offset,
> +                                           const char *file_path)
>  {
>         struct ordered_events *oe = &session->ordered_events;
>         struct perf_tool *tool = session->tool;
> @@ -1544,7 +1553,7 @@ static s64 perf_session__process_user_event(struct perf_session *session,
>
>         if (event->header.type != PERF_RECORD_COMPRESSED ||
>             tool->compressed == perf_session__process_compressed_event_stub)
> -               dump_event(session->evlist, event, file_offset, &sample);
> +               dump_event(session->evlist, event, file_offset, &sample, file_path);
>
>         /* These events are processed right away */
>         switch (event->header.type) {
> @@ -1603,9 +1612,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
>         case PERF_RECORD_HEADER_FEATURE:
>                 return tool->feature(session, event);
>         case PERF_RECORD_COMPRESSED:
> -               err = tool->compressed(session, event, file_offset);
> +               err = tool->compressed(session, event, file_offset, file_path);
>                 if (err)
> -                       dump_event(session->evlist, event, file_offset, &sample);
> +                       dump_event(session->evlist, event, file_offset, &sample, file_path);
>                 return err;
>         default:
>                 return -EINVAL;
> @@ -1622,9 +1631,9 @@ int perf_session__deliver_synth_event(struct perf_session *session,
>         events_stats__inc(&evlist->stats, event->header.type);
>
>         if (event->header.type >= PERF_RECORD_USER_TYPE_START)
> -               return perf_session__process_user_event(session, event, 0);
> +               return perf_session__process_user_event(session, event, 0, NULL);
>
> -       return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
> +       return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0, NULL);
>  }
>
>  static void event_swap(union perf_event *event, bool sample_id_all)
> @@ -1720,7 +1729,8 @@ int perf_session__peek_events(struct perf_session *session, u64 offset,
>  }
>
>  static s64 perf_session__process_event(struct perf_session *session,
> -                                      union perf_event *event, u64 file_offset)
> +                                      union perf_event *event, u64 file_offset,
> +                                      const char *file_path)
>  {
>         struct evlist *evlist = session->evlist;
>         struct perf_tool *tool = session->tool;
> @@ -1735,7 +1745,7 @@ static s64 perf_session__process_event(struct perf_session *session,
>         events_stats__inc(&evlist->stats, event->header.type);
>
>         if (event->header.type >= PERF_RECORD_USER_TYPE_START)
> -               return perf_session__process_user_event(session, event, file_offset);
> +               return perf_session__process_user_event(session, event, file_offset, file_path);
>
>         if (tool->ordered_events) {
>                 u64 timestamp = -1ULL;
> @@ -1749,7 +1759,7 @@ static s64 perf_session__process_event(struct perf_session *session,
>                         return ret;
>         }
>
> -       return perf_session__deliver_event(session, event, tool, file_offset);
> +       return perf_session__deliver_event(session, event, tool, file_offset, file_path);
>  }
>
>  void perf_event_header__bswap(struct perf_event_header *hdr)
> @@ -1987,7 +1997,8 @@ static int __perf_session__process_pipe_events(struct perf_session *session)
>                 }
>         }
>
> -       if ((skip = perf_session__process_event(session, event, head)) < 0) {
> +       skip = perf_session__process_event(session, event, head, "pipe");
> +       if (skip < 0) {
>                 pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
>                        head, event->header.size, event->header.type);
>                 err = -EINVAL;
> @@ -2068,7 +2079,7 @@ fetch_decomp_event(u64 head, size_t mmap_size, char *buf, bool needs_swap)
>  static int __perf_session__process_decomp_events(struct perf_session *session)
>  {
>         s64 skip;
> -       u64 size, file_pos = 0;
> +       u64 size;
>         struct decomp *decomp = session->decomp_last;
>
>         if (!decomp)
> @@ -2082,9 +2093,9 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
>                         break;
>
>                 size = event->header.size;
> -
> -               if (size < sizeof(struct perf_event_header) ||
> -                   (skip = perf_session__process_event(session, event, file_pos)) < 0) {
> +               skip = perf_session__process_event(session, event, decomp->file_pos,
> +                                                  decomp->file_path);
> +               if (size < sizeof(struct perf_event_header) || skip < 0) {
>                         pr_err("%#" PRIx64 " [%#x]: failed to process type: %d\n",
>                                 decomp->file_pos + decomp->head, event->header.size, event->header.type);
>                         return -EINVAL;
> @@ -2115,7 +2126,8 @@ struct reader;
>
>  typedef s64 (*reader_cb_t)(struct perf_session *session,
>                            union perf_event *event,
> -                          u64 file_offset);
> +                          u64 file_offset,
> +                          const char *file_path);
>
>  struct reader {
>         int              fd;
> @@ -2198,9 +2210,9 @@ reader__process_events(struct reader *rd, struct perf_session *session,
>         skip = -EINVAL;
>
>         if (size < sizeof(struct perf_event_header) ||
> -           (skip = rd->process(session, event, file_pos)) < 0) {
> -               pr_err("%#" PRIx64 " [%#x]: failed to process type: %d [%s]\n",
> -                      file_offset + head, event->header.size,
> +           (skip = rd->process(session, event, file_pos, rd->path)) < 0) {
> +               pr_err("%#" PRIx64 " [%s] [%#x]: failed to process type: %d [%s]\n",
> +                      file_offset + head, rd->path, event->header.size,
>                        event->header.type, strerror(-skip));
>                 err = skip;
>                 goto out;
> @@ -2230,9 +2242,10 @@ reader__process_events(struct reader *rd, struct perf_session *session,
>
>  static s64 process_simple(struct perf_session *session,
>                           union perf_event *event,
> -                         u64 file_offset)
> +                         u64 file_offset,
> +                         const char *file_path)
>  {
> -       return perf_session__process_event(session, event, file_offset);
> +       return perf_session__process_event(session, event, file_offset, file_path);
>  }
>
>  static int __perf_session__process_events(struct perf_session *session)
> diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
> index bbbc0dcd461f..c966531d3eca 100644
> --- a/tools/perf/util/tool.h
> +++ b/tools/perf/util/tool.h
> @@ -28,7 +28,8 @@ typedef int (*event_attr_op)(struct perf_tool *tool,
>
>  typedef int (*event_op2)(struct perf_session *session, union perf_event *event);
>  typedef s64 (*event_op3)(struct perf_session *session, union perf_event *event);
> -typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data);
> +typedef int (*event_op4)(struct perf_session *session, union perf_event *event, u64 data,
> +                        const char *str);
>
>  typedef int (*event_oe)(struct perf_tool *tool, union perf_event *event,
>                         struct ordered_events *oe);
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-21 15:57 ` [PATCH v2 03/15] perf data: open data directory in read access mode Alexey Budankov
@ 2020-10-22  4:31   ` Namhyung Kim
  2020-10-22  7:50     ` Alexey Budankov
  2020-10-24 15:43   ` Jiri Olsa
  1 sibling, 1 reply; 64+ messages in thread
From: Namhyung Kim @ 2020-10-22  4:31 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Thu, Oct 22, 2020 at 12:58 AM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> Open files located at trace data directory in case read access
> mode is requested. File are opened and its fds assigned to
> perf_data dir files especially for loading data directories
> content in perf report mode.
>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/util/data.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index c47aa34fdc0a..6ad61ac6ba67 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
>                 return -1;
>
>         ret = open_file(data);
> +       if (!ret && perf_data__is_dir(data)) {

I think this __is_dir() check is unnecessary since it's checked
from the caller side already.

Thanks
Namhyung


> +               if (perf_data__is_read(data))
> +                       ret = perf_data__open_dir(data);
> +       }
>
>         /* Cleanup whatever we managed to create so far. */
>         if (ret && perf_data__is_write(data))
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 04/15] perf session: move reader object definition to header file
  2020-10-21 15:59 ` [PATCH v2 04/15] perf session: move reader object definition to header file Alexey Budankov
@ 2020-10-22  4:31   ` Namhyung Kim
  2020-10-24 15:44   ` Jiri Olsa
  1 sibling, 0 replies; 64+ messages in thread
From: Namhyung Kim @ 2020-10-22  4:31 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Thu, Oct 22, 2020 at 12:59 AM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> Move definition of reader to session header file to be shared
> among different source files. Introduce reference to active
> reader object from session object.
>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks
Namhyung

> ---
>  tools/perf/util/session.c | 27 ---------------------------
>  tools/perf/util/session.h | 25 +++++++++++++++++++++++++
>  2 files changed, 25 insertions(+), 27 deletions(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 6f09d506b2f6..911b2dbcd0ac 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
>         return 0;
>  }
>
> -/*
> - * On 64bit we can mmap the data file in one go. No need for tiny mmap
> - * slices. On 32bit we use 32MB.
> - */
> -#if BITS_PER_LONG == 64
> -#define MMAP_SIZE ULLONG_MAX
> -#define NUM_MMAPS 1
> -#else
> -#define MMAP_SIZE (32 * 1024 * 1024ULL)
> -#define NUM_MMAPS 128
> -#endif
> -
> -struct reader;
> -
> -typedef s64 (*reader_cb_t)(struct perf_session *session,
> -                          union perf_event *event,
> -                          u64 file_offset,
> -                          const char *file_path);
> -
> -struct reader {
> -       int              fd;
> -       const char       *path;
> -       u64              data_size;
> -       u64              data_offset;
> -       reader_cb_t      process;
> -};
> -
>  static int
>  reader__process_events(struct reader *rd, struct perf_session *session,
>                        struct ui_progress *prog)
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 378ffc3e2809..abdb8518a81f 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -20,6 +20,30 @@ struct thread;
>  struct auxtrace;
>  struct itrace_synth_opts;
>
> +/*
> + * On 64bit we can mmap the data file in one go. No need for tiny mmap
> + * slices. On 32bit we use 32MB.
> + */
> +#if BITS_PER_LONG == 64
> +#define MMAP_SIZE ULLONG_MAX
> +#define NUM_MMAPS 1
> +#else
> +#define MMAP_SIZE (32 * 1024 * 1024ULL)
> +#define NUM_MMAPS 128
> +#endif
> +
> +typedef s64 (*reader_cb_t)(struct perf_session *session,
> +                          union perf_event *event,
> +                          u64 file_offset, const char *file_path);
> +
> +struct reader {
> +       int              fd;
> +       const char       *path;
> +       u64              data_size;
> +       u64              data_offset;
> +       reader_cb_t      process;
> +};
> +
>  struct perf_session {
>         struct perf_header      header;
>         struct machines         machines;
> @@ -41,6 +65,7 @@ struct perf_session {
>         struct zstd_data        zstd_data;
>         struct decomp           *decomp;
>         struct decomp           *decomp_last;
> +       struct reader           *reader;
>  };
>
>  struct decomp {
> --
> 2.24.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/15] perf session: introduce decompressor into trace reader object
  2020-10-21 16:00 ` [PATCH v2 05/15] perf session: introduce decompressor into trace reader object Alexey Budankov
@ 2020-10-22  4:36   ` Namhyung Kim
  2020-10-22  7:20     ` Alexey Budankov
  2020-10-24 15:44   ` Jiri Olsa
  1 sibling, 1 reply; 64+ messages in thread
From: Namhyung Kim @ 2020-10-22  4:36 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Thu, Oct 22, 2020 at 1:00 AM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> Introduce decompressor to trace reader object so that decompression
> could be executed on per trace file basis separately for every
> trace file located in trace directory.

I'm slightly uncomfortable with the word 'trace' here as it's
used for other cases like perf trace and ftrace.  Maybe we can
change it to 'profile' and/or 'data file'?

Thanks
Namhyung

>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/util/session.c | 4 +++-
>  tools/perf/util/session.h | 1 +
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 911b2dbcd0ac..6afc670fdf0c 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -44,6 +44,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>         u64 decomp_last_rem = 0;
>         size_t mmap_len, decomp_len = session->header.env.comp_mmap_len;
>         struct decomp *decomp, *decomp_last = session->decomp_last;
> +       struct zstd_data *zstd_data = session->reader ?
> +               &(session->reader->zstd_data) : &(session->zstd_data);
>
>         if (decomp_last) {
>                 decomp_last_rem = decomp_last->size - decomp_last->head;
> @@ -71,7 +73,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>         src = (void *)event + sizeof(struct perf_record_compressed);
>         src_size = event->pack.header.size - sizeof(struct perf_record_compressed);
>
> -       decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
> +       decomp_size = zstd_decompress_stream(zstd_data, src, src_size,
>                                 &(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
>         if (!decomp_size) {
>                 munmap(decomp, mmap_len);
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index abdb8518a81f..4fc9ccdf7970 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -42,6 +42,7 @@ struct reader {
>         u64              data_size;
>         u64              data_offset;
>         reader_cb_t      process;
> +       struct zstd_data zstd_data;
>  };
>
>  struct perf_session {
> --
> 2.24.1
>
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/15] perf session: introduce decompressor into trace reader object
  2020-10-22  4:36   ` Namhyung Kim
@ 2020-10-22  7:20     ` Alexey Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-22  7:20 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 22.10.2020 7:36, Namhyung Kim wrote:
> On Thu, Oct 22, 2020 at 1:00 AM Alexey Budankov
> <alexey.budankov@linux.intel.com> wrote:
>>
>>
>> Introduce decompressor to trace reader object so that decompression
>> could be executed on per trace file basis separately for every
>> trace file located in trace directory.
> 
> I'm slightly uncomfortable with the word 'trace' here as it's
> used for other cases like perf trace and ftrace.  Maybe we can
> change it to 'profile' and/or 'data file'?

Let's use "data file" and "data directory" terms then.

Thanks,
Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-22  4:31   ` Namhyung Kim
@ 2020-10-22  7:50     ` Alexey Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-22  7:50 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 22.10.2020 7:31, Namhyung Kim wrote:
> On Thu, Oct 22, 2020 at 12:58 AM Alexey Budankov
> <alexey.budankov@linux.intel.com> wrote:
>>
>>
>> Open files located at trace data directory in case read access
>> mode is requested. File are opened and its fds assigned to
>> perf_data dir files especially for loading data directories
>> content in perf report mode.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/util/data.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
>> index c47aa34fdc0a..6ad61ac6ba67 100644
>> --- a/tools/perf/util/data.c
>> +++ b/tools/perf/util/data.c
>> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
>>                 return -1;
>>
>>         ret = open_file(data);
>> +       if (!ret && perf_data__is_dir(data)) {
> 
> I think this __is_dir() check is unnecessary since it's checked
> from the caller side already.

Corrected in v3. Thanks!

Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
  2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
                   ` (14 preceding siblings ...)
  2020-10-21 16:10 ` [PATCH v2 15/15] perf record: introduce --threads command line option Alexey Budankov
@ 2020-10-24 15:43 ` Jiri Olsa
  2020-10-26 17:59   ` Alexey Budankov
  15 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:43 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> 
> Changes in v2:
> - explicitly added credit tags to patches 6/15 and 15/15,
>   additionally to cites [1], [2]
> - updated description of 3/15 to explicitly mention the reason
>   to open data directories in read access mode (e.g. for perf report)
> - implemented fix for compilation error of 2/15
> - explicitly elaborated on found issues to be resolved for
>   threaded AUX trace capture
> 
> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
> 
> Patch set provides threaded trace streaming for base perf record
> operation. Provided streaming mode (--threads) mitigates profiling
> data losses and resolves scalability issues of serial and asynchronous
> (--aio) trace streaming modes on multicore server systems. The patch
> set is based on the prototype [1], [2] and the most closely relates
> to mode 3) "mode that creates thread for every monitored memory map".

so what I liked about the previous code was that you could
configure how the threads would be created

default --threads options created thread for each cpu like
in your change:

  $ perf record -v --threads ...
  ...
  thread 0 monitor: 0 allowed: 0
  thread 1 monitor: 1 allowed: 1
  thread 2 monitor: 2 allowed: 2
  thread 3 monitor: 3 allowed: 3
  thread 4 monitor: 4 allowed: 4
  thread 5 monitor: 5 allowed: 5
  thread 6 monitor: 6 allowed: 6
  thread 7 monitor: 7 allowed: 7


then numa based:

  $ perf record -v --threads=numa ...
  ...
  thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
  thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23


socket based:

  $ perf record -v --threads=socket ...
  ...
  thread 0 monitor: 0-7 allowed: 0-7


core based:

  $ perf record -v --threads=core ...
  ...
  thread 0 monitor: 0,4 allowed: 0,4
  thread 1 monitor: 1,5 allowed: 1,5
  thread 2 monitor: 2,6 allowed: 2,6
  thread 3 monitor: 3,7 allowed: 3,7


and user configurable:

  $ perf record -v  --threads=0-3/0:4-7/4 ...
  ...
  threads: 0. monitor 0-3, allowed 0
  threads: 1. monitor 4-7, allowed 4


so this way you could easily pin threads to cpu/core/socket/numa,
or to some other cpu of your choice, because this will be always
game of try and check where I'm not getting LOST events and not
creating 1000 threads

 perf record: Add support for threads numa option value
 perf record: Add support for threads socket option value
 perf record: Add support for threads core option value
 perf record: Add support for threads user option value

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-21 15:57 ` [PATCH v2 03/15] perf data: open data directory in read access mode Alexey Budankov
  2020-10-22  4:31   ` Namhyung Kim
@ 2020-10-24 15:43   ` Jiri Olsa
  2020-10-26 17:47     ` Alexey Budankov
  1 sibling, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:43 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 06:57:53PM +0300, Alexey Budankov wrote:
> 
> Open files located at trace data directory in case read access
> mode is requested. File are opened and its fds assigned to
> perf_data dir files especially for loading data directories
> content in perf report mode.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/util/data.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> index c47aa34fdc0a..6ad61ac6ba67 100644
> --- a/tools/perf/util/data.c
> +++ b/tools/perf/util/data.c
> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
>  		return -1;
>  
>  	ret = open_file(data);
> +	if (!ret && perf_data__is_dir(data)) {
> +		if (perf_data__is_read(data))
> +			ret = perf_data__open_dir(data);
> +	}

perf_data__open_dir is also called from perf_session__new
is it called twice?

jirka

>  
>  	/* Cleanup whatever we managed to create so far. */
>  	if (ret && perf_data__is_write(data))
> -- 
> 2.24.1
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-21 16:01 ` [PATCH v2 06/15] perf session: load data directory into tool process memory Alexey Budankov
@ 2020-10-24 15:43   ` Jiri Olsa
  2020-10-27  7:37     ` Alexey Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:43 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
> 
> Read trace files located in data directory into tool process memory.
> Basic analysis support of data directories is provided for report
> mode. Raw dump (-D) and aggregated reports are available for data
> directories, still with no memory consumption optimizations. However
> data directories collected with --compression-level option enabled
> can be analyzed with little less memory because trace files are
> unmaped from tool process memory after loading collected data.
> The implementation is based on the prototype [1], [2].
> 
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
> 
> Suggested-by: Jiri Olsa <jolsa@kernel.org>

very loosely ;-) so there was a reason for all that reader refactoring,
so we could have __perf_session__process_dir_events function:

  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e

when reporting the threaded record data on really big servers,
you will run out of memory, so you need to read and flush all
the files together by smaller pieces

IMO we need to have this change before we allow threaded record

jirka


> Suggested-by: Namhyung Kim <namhyung@kernel.org>
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/util/session.c | 48 +++++++++++++++++++++++++++++++++++++++
>  tools/perf/util/session.h |  1 +
>  2 files changed, 49 insertions(+)
> 
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 6afc670fdf0c..0752eec19813 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2212,6 +2212,17 @@ reader__process_events(struct reader *rd, struct perf_session *session,
>  		goto more;
>  
>  out:
> +	if (rd->unmap_file) {
> +		int i;
> +
> +		for (i = 0; i < NUM_MMAPS; i++) {
> +			if (mmaps[i]) {
> +				munmap(mmaps[i], mmap_size);
> +				mmaps[i] = NULL;
> +			}
> +		}
> +	}
> +
>  	return err;
>  }
>  
> @@ -2231,6 +2242,7 @@ static int __perf_session__process_events(struct perf_session *session)
>  		.data_offset	= session->header.data_offset,
>  		.process	= process_simple,
>  		.path		= session->data->file.path,
> +		.unmap_file	= false,
>  	};
>  	struct ordered_events *oe = &session->ordered_events;
>  	struct perf_tool *tool = session->tool;
> @@ -2247,6 +2259,42 @@ static int __perf_session__process_events(struct perf_session *session)
>  	err = reader__process_events(&rd, session, &prog);
>  	if (err)
>  		goto out_err;
> +
> +	if (perf_data__is_dir(session->data)) {
> +		int i, nr = session->data->dir.nr;
> +		struct reader file_rd[nr];
> +		u64 total_size = perf_data__size(session->data);
> +
> +		total_size -= session->data->file.size;
> +		ui_progress__init_size(&prog, total_size, "Sorting events...");
> +
> +		memset(&file_rd, 0, nr * sizeof(file_rd[0]));
> +
> +		for (i = 0; i < nr ; i++) {
> +			struct perf_data_file *file;
> +
> +			file = &session->data->dir.files[i];
> +			file_rd[i] = (struct reader) {
> +				.fd             = file->fd,
> +				.path           = file->path,
> +				.data_size      = file->size,
> +				.data_offset    = 0,
> +				.process	= process_simple,
> +			};
> +			file_rd[i].unmap_file = perf_header__has_feat(&session->header,
> +								      HEADER_COMPRESSED);
> +			session->reader = &file_rd[i];
> +
> +			if (zstd_init(&(file_rd[i].zstd_data), 0))
> +				goto out_err;
> +			err = reader__process_events(&file_rd[i], session, &prog);
> +			zstd_fini(&(file_rd[i].zstd_data));
> +			session->reader = NULL;
> +			if (err)
> +				goto out_err;
> +		}
> +	}
> +
>  	/* do the final flush for ordered samples */
>  	err = ordered_events__flush(oe, OE_FLUSH__FINAL);
>  	if (err)
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 4fc9ccdf7970..d428f3eaf7fd 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -43,6 +43,7 @@ struct reader {
>  	u64		 data_offset;
>  	reader_cb_t	 process;
>  	struct zstd_data zstd_data;
> +	bool		 unmap_file;
>  };
>  
>  struct perf_session {
> -- 
> 2.24.1
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-21 16:07 ` [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming Alexey Budankov
@ 2020-10-24 15:43   ` Jiri Olsa
  2020-10-26  8:21     ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:43 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:07:00PM +0300, Alexey Budankov wrote:
> 
> Introduce thread local variable and use it for threaded trace streaming.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
>  1 file changed, 62 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 89cb8e913fb3..3b7e9026f25b 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -101,6 +101,8 @@ struct thread_data {
>  	u64		   bytes_written;
>  };
>  
> +static __thread struct thread_data *thread;
> +
>  struct record {
>  	struct perf_tool	tool;
>  	struct record_opts	opts;
> @@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>  		}
>  	}
>  
> -	rec->samples++;
> +	if (thread)
> +		thread->samples++;
> +	else
> +		rec->samples++;

this is really wrong, let's keep just single samples counter
ditto for all the other places in this patch

SNIP

> -		if (hits == rec->samples) {
> +		if (thread)
> +			hits1 = thread->samples;
> +		else
> +			hits1 = rec->samples;
> +
> +		if (hits0 == hits1) {
>  			if (done || draining)
>  				break;
> -			err = evlist__poll(rec->evlist, -1);
> +
> +			if (thread)
> +				err = fdarray__poll(&thread->pollfd, -1);
> +			else
> +				err = evlist__poll(rec->evlist, -1);

same here, why do we have the __thread struct then?

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/15] perf record: manage thread specific data array
  2020-10-21 16:04 ` [PATCH v2 10/15] perf record: manage thread specific data array Alexey Budankov
@ 2020-10-24 15:44   ` Jiri Olsa
  2020-10-26  8:39     ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:44 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:04:26PM +0300, Alexey Budankov wrote:
> 
> Provide allocation, initialization, finalization and releasing of
> thread specific objects at thread specific data array. Allocate
> thread specific object for every data buffer making one-to-one
> relation between data buffer and a thread processing the buffer.
> Deliver event fd related signals to thread's pollfd object.
> Deliver thread control commands to ctlfd_pos fd of thread 1+.
> Deliver tool external control commands via ctlfd_pos fd of thread 0.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 98 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 8e512096a060..89cb8e913fb3 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
>  	return kcore_copy(from_dir, kcore_dir);
>  }
>  
> +static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
> +				     struct fdarray *evlist_pollfd, int ctlfd_pos)
> +{
> +	int i, j, k, nr_thread_data;
> +	struct thread_data *thread_data;
> +
> +	rec->nr_thread_data = nr_thread_data = nr_mmaps;
> +	rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
> +	if (!thread_data) {
> +		pr_err("Failed to allocate thread data\n");
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < nr_thread_data; i++) {
> +		short revents;
> +		int pos, fd;
> +
> +		thread_data[i].tid = -1;
> +
> +		if (pipe(thread_data[i].comm.msg) ||
> +		    pipe(thread_data[i].comm.ack)) {
> +			pr_err("Failed to create thread comm pipes, errno %d\n", errno);
> +			return -ENOMEM;
> +		}

the original code was using state flag and pthread_cond,
which I think is more readable

https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=a7da527ff8be69572c6d17525c03c6fe394503c8
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=eb85ce4da64a885fdb6c77cfc5bd71312fe02e2a

> +
> +		thread_data[i].maps = &mmaps[i];
> +		thread_data[i].nr_mmaps = 1;
> +
> +		thread_data[i].rec = rec;
> +
> +		fdarray__init(&(thread_data[i].pollfd), 64);
> +
> +		for (j = 0; j < thread_data[i].nr_mmaps; j++) {
> +			struct mmap *map = &(thread_data[i].maps[j]);
> +
> +			for (k = 0; k < evlist_pollfd->nr; k++) {
> +				if (evlist_pollfd->priv[k].ptr != map)
> +					continue;
> +
> +				fd = evlist_pollfd->entries[k].fd;
> +				revents = evlist_pollfd->entries[k].events;
> +				pos = fdarray__add(&(thread_data[i].pollfd),
> +						fd, revents | POLLERR | POLLHUP,
> +						fdarray_flag__default);
> +				if (pos >= 0)
> +					thread_data[i].pollfd.priv[pos].ptr = map;
> +				else
> +					return -ENOMEM;

I added function for that:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=8aa6e68a7471b9d25af1a9eebfa9321433366a17

jirka

> +			}
> +		}
> +
> +		if (i) {
> +			fd = thread_data[i].comm.msg[0];
> +			revents = POLLIN | POLLERR | POLLHUP;
> +		} else {
> +			if (ctlfd_pos == -1)
> +				continue;
> +			fd = evlist_pollfd->entries[ctlfd_pos].fd;
> +			revents = evlist_pollfd->entries[ctlfd_pos].events;
> +		}
> +		thread_data[i].ctlfd_pos =
> +				fdarray__add(&(thread_data[i].pollfd),
> +					     fd, revents, fdarray_flag__nonfilterable);
> +		if (thread_data[i].ctlfd_pos < 0)
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +static int record__free_thread_data(struct record *rec)
> +{
> +	int i;
> +
> +	if (rec->thread_data) {
> +		for (i = 0; i < rec->nr_thread_data; i++) {
> +			close(rec->thread_data[i].comm.msg[0]);
> +			close(rec->thread_data[i].comm.msg[1]);
> +			close(rec->thread_data[i].comm.ack[0]);
> +			close(rec->thread_data[i].comm.ack[1]);
> +			fdarray__exit(&(rec->thread_data[i].pollfd));
> +		}
> +		zfree(&rec->thread_data);
> +	}
> +
> +	return 0;
> +}
> +
>  static int record__mmap_evlist(struct record *rec,
>  			       struct evlist *evlist)
>  {
> @@ -918,6 +1006,9 @@ static int record__mmap_evlist(struct record *rec,
>  		}
>  	}
>  
> +	if (evlist__initialize_ctlfd(evlist, opts->ctl_fd, opts->ctl_fd_ack))
> +		return -1;
> +
>  	if (record__threads_enabled(rec)) {
>  		int i, ret, nr = evlist->core.nr_mmaps;
>  		struct mmap *mmaps = rec->opts.overwrite ?
> @@ -929,6 +1020,12 @@ static int record__mmap_evlist(struct record *rec,
>  
>  		for (i = 0; i < nr; i++)
>  			mmaps[i].file = &rec->data.dir.files[i];
> +
> +		ret = record__alloc_thread_data(rec, mmaps, nr,
> +						&evlist->core.pollfd,
> +						evlist->ctl_fd.pos);
> +		if (ret)
> +			return ret;
>  	}
>  
>  	return 0;
> @@ -1910,9 +2007,6 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		perf_evlist__start_workload(rec->evlist);
>  	}
>  
> -	if (evlist__initialize_ctlfd(rec->evlist, opts->ctl_fd, opts->ctl_fd_ack))
> -		goto out_child;
> -
>  	if (opts->initial_delay) {
>  		pr_info(EVLIST_DISABLED_MSG);
>  		if (opts->initial_delay > 0) {
> @@ -2063,6 +2157,7 @@ static int __cmd_record(struct record *rec, int argc, const char **argv)
>  		record__synthesize_workload(rec, true);
>  
>  out_child:
> +	record__free_thread_data(rec);
>  	evlist__finalize_ctlfd(rec->evlist);
>  	record__mmap_read_all(rec, true);
>  	record__aio_mmap_read_sync(rec);
> -- 
> 2.24.1
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 14/15] perf record: start threads in the beginning of trace streaming
  2020-10-21 16:10 ` [PATCH v2 14/15] perf record: start threads in the beginning " Alexey Budankov
@ 2020-10-24 15:44   ` Jiri Olsa
  2020-10-26  8:39     ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:44 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:10:09PM +0300, Alexey Budankov wrote:
> 
> Start threads in detached state because its management is possible
> via messaging. Block signals prior the threads start so only main
> tool thread would be notified on external async signals during data
> collection. Streaming threads connect one-to-one to mapped data
> buffers and write into per-CPU trace files located at data directory.
> Data buffers and threads are affined to local NUMA nodes and monitored
> CPUs according to system topology. --cpu option can be used to specify
> CPUs to be monitored.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
>  1 file changed, 120 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index a15642656066..1d41e996a994 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -56,6 +56,7 @@
>  #include <poll.h>
>  #include <pthread.h>
>  #include <unistd.h>
> +#include <sys/syscall.h>
>  #include <sched.h>
>  #include <signal.h>
>  #ifdef HAVE_EVENTFD_SUPPORT
> @@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
>  		perf_mmap__put(map);
>  }
>  
> +static void *record__thread(void *arg)
> +{
> +	enum thread_msg msg = THREAD_MSG__READY;
> +	bool terminate = false;
> +	struct fdarray *pollfd;
> +	int err, ctlfd_pos;
> +
> +	thread = arg;
> +	thread->tid = syscall(SYS_gettid);
> +
> +	err = write(thread->comm.ack[1], &msg, sizeof(msg));
> +	if (err == -1)
> +		pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
> +
> +	pollfd = &(thread->pollfd);

I don't think braces are necessary in here

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/15] perf record: write trace data into mmap trace files
  2020-10-21 16:02 ` [PATCH v2 08/15] perf record: write trace data into mmap trace files Alexey Budankov
@ 2020-10-24 15:44   ` Jiri Olsa
  2020-10-26  8:52     ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:44 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:02:56PM +0300, Alexey Budankov wrote:

SNIP

>  
>  	record__synthesize(rec, true);
> -	/* this will be recalculated during process_buildids() */
> -	rec->samples = 0;
>  
>  	if (!err) {
>  		if (!rec->timestamp_filename) {
> @@ -2680,9 +2709,12 @@ int cmd_record(int argc, const char **argv)
>  
>  	}
>  
> -	if (rec->opts.kcore)
> +	if (rec->opts.kcore || record__threads_enabled(rec))
>  		rec->data.is_dir = true;
>  
> +	if (record__threads_enabled(rec))
> +		rec->opts.affinity = PERF_AFFINITY_CPU;

so all the threads will pin to cpu and back before reading?
it makes sense for one thread, but why not pin every thread
at the start?

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming
  2020-10-21 16:03 ` [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming Alexey Budankov
@ 2020-10-24 15:44   ` Jiri Olsa
  2020-10-26  8:53     ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:44 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:03:48PM +0300, Alexey Budankov wrote:
> 
> Introduce thread local data object and its array to be used for
> threaded trace streaming.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/builtin-record.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index ba26d75c51d6..8e512096a060 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -85,11 +85,29 @@ struct switch_output {
>  	int		 cur_file;
>  };
>  
> +struct thread_data {
> +	pid_t		   tid;
> +	struct {
> +		int	   msg[2];
> +		int	   ack[2];
> +	} comm;
> +	struct fdarray	   pollfd;
> +	int		   ctlfd_pos;
> +	struct mmap	   *maps;
> +	int		   nr_mmaps;
> +	struct record	   *rec;
> +	unsigned long long samples;
> +	unsigned long	   waking;
> +	u64		   bytes_written;
> +};

please merge the struct with the code that's using it

jirka

> +
>  struct record {
>  	struct perf_tool	tool;
>  	struct record_opts	opts;
>  	u64			bytes_written;
>  	struct perf_data	data;
> +	struct thread_data	*thread_data;
> +	int			nr_thread_data;
>  	struct auxtrace_record	*itr;
>  	struct evlist	*evlist;
>  	struct perf_session	*session;
> -- 
> 2.24.1
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/15] perf session: introduce decompressor into trace reader object
  2020-10-21 16:00 ` [PATCH v2 05/15] perf session: introduce decompressor into trace reader object Alexey Budankov
  2020-10-22  4:36   ` Namhyung Kim
@ 2020-10-24 15:44   ` Jiri Olsa
  2020-10-26  8:54     ` Alexei Budankov
  1 sibling, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:44 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 07:00:30PM +0300, Alexey Budankov wrote:
> 
> Introduce decompressor to trace reader object so that decompression
> could be executed on per trace file basis separately for every
> trace file located in trace directory.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/util/session.c | 4 +++-
>  tools/perf/util/session.h | 1 +
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 911b2dbcd0ac..6afc670fdf0c 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -44,6 +44,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>  	u64 decomp_last_rem = 0;
>  	size_t mmap_len, decomp_len = session->header.env.comp_mmap_len;
>  	struct decomp *decomp, *decomp_last = session->decomp_last;
> +	struct zstd_data *zstd_data = session->reader ?
> +		&(session->reader->zstd_data) : &(session->zstd_data);

I don't think we're using braces in these cases, they are not necessary

jirka

>  
>  	if (decomp_last) {
>  		decomp_last_rem = decomp_last->size - decomp_last->head;
> @@ -71,7 +73,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>  	src = (void *)event + sizeof(struct perf_record_compressed);
>  	src_size = event->pack.header.size - sizeof(struct perf_record_compressed);
>  
> -	decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
> +	decomp_size = zstd_decompress_stream(zstd_data, src, src_size,
>  				&(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
>  	if (!decomp_size) {
>  		munmap(decomp, mmap_len);
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index abdb8518a81f..4fc9ccdf7970 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -42,6 +42,7 @@ struct reader {
>  	u64		 data_size;
>  	u64		 data_offset;
>  	reader_cb_t	 process;
> +	struct zstd_data zstd_data;
>  };
>  
>  struct perf_session {
> -- 
> 2.24.1
> 
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 04/15] perf session: move reader object definition to header file
  2020-10-21 15:59 ` [PATCH v2 04/15] perf session: move reader object definition to header file Alexey Budankov
  2020-10-22  4:31   ` Namhyung Kim
@ 2020-10-24 15:44   ` Jiri Olsa
  2020-10-26 17:50     ` Alexey Budankov
  1 sibling, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-24 15:44 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 21, 2020 at 06:59:48PM +0300, Alexey Budankov wrote:
> 
> Move definition of reader to session header file to be shared
> among different source files. Introduce reference to active
> reader object from session object.
> 
> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> ---
>  tools/perf/util/session.c | 27 ---------------------------
>  tools/perf/util/session.h | 25 +++++++++++++++++++++++++
>  2 files changed, 25 insertions(+), 27 deletions(-)
> 
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 6f09d506b2f6..911b2dbcd0ac 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
>  	return 0;
>  }
>  
> -/*
> - * On 64bit we can mmap the data file in one go. No need for tiny mmap
> - * slices. On 32bit we use 32MB.
> - */
> -#if BITS_PER_LONG == 64
> -#define MMAP_SIZE ULLONG_MAX
> -#define NUM_MMAPS 1
> -#else
> -#define MMAP_SIZE (32 * 1024 * 1024ULL)
> -#define NUM_MMAPS 128
> -#endif
> -
> -struct reader;
> -
> -typedef s64 (*reader_cb_t)(struct perf_session *session,
> -			   union perf_event *event,
> -			   u64 file_offset,
> -			   const char *file_path);
> -
> -struct reader {
> -	int		 fd;
> -	const char	 *path;
> -	u64		 data_size;
> -	u64		 data_offset;
> -	reader_cb_t	 process;
> -};
> -
>  static int
>  reader__process_events(struct reader *rd, struct perf_session *session,
>  		       struct ui_progress *prog)
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 378ffc3e2809..abdb8518a81f 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -20,6 +20,30 @@ struct thread;
>  struct auxtrace;
>  struct itrace_synth_opts;
>  
> +/*
> + * On 64bit we can mmap the data file in one go. No need for tiny mmap
> + * slices. On 32bit we use 32MB.
> + */
> +#if BITS_PER_LONG == 64
> +#define MMAP_SIZE ULLONG_MAX
> +#define NUM_MMAPS 1
> +#else
> +#define MMAP_SIZE (32 * 1024 * 1024ULL)
> +#define NUM_MMAPS 128
> +#endif
> +
> +typedef s64 (*reader_cb_t)(struct perf_session *session,
> +			   union perf_event *event,
> +			   u64 file_offset, const char *file_path);
> +
> +struct reader {
> +	int		 fd;
> +	const char	 *path;
> +	u64		 data_size;
> +	u64		 data_offset;
> +	reader_cb_t	 process;
> +};

I wasn't able to find where is this used ouf of session.c ?


> +
>  struct perf_session {
>  	struct perf_header	header;
>  	struct machines		machines;
> @@ -41,6 +65,7 @@ struct perf_session {
>  	struct zstd_data	zstd_data;
>  	struct decomp		*decomp;
>  	struct decomp		*decomp_last;
> +	struct reader		*reader;

please define it in the patch where it's actualy used

thanks,
jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-24 15:43   ` Jiri Olsa
@ 2020-10-26  8:21     ` Alexei Budankov
  2020-10-26 10:34       ` Jiri Olsa
  0 siblings, 1 reply; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26  8:21 UTC (permalink / raw)
  To: Jiri Olsa, Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel, budankov.lore


On 24.10.2020 18:43, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:07:00PM +0300, Alexey Budankov wrote:
>>
>> Introduce thread local variable and use it for threaded trace streaming.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
>>  1 file changed, 62 insertions(+), 9 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 89cb8e913fb3..3b7e9026f25b 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -101,6 +101,8 @@ struct thread_data {
>>  	u64		   bytes_written;
>>  };
>>  
>> +static __thread struct thread_data *thread;
>> +
>>  struct record {
>>  	struct perf_tool	tool;
>>  	struct record_opts	opts;
>> @@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>>  		}
>>  	}
>>  
>> -	rec->samples++;
>> +	if (thread)
>> +		thread->samples++;
>> +	else
>> +		rec->samples++;
> 
> this is really wrong, let's keep just single samples counter
> ditto for all the other places in this patch

This does look like data parallelism [1] which is very true for
threaded trace streaming so your prototype design looks optimal.

For this specific place incrementing global counter in memory is
less performant and faces scalability limitations as a number of
cores grow.

Not sure why you have changed your mind.

Alexei

[1] https://en.wikipedia.org/wiki/Data_parallelism#:~:text=Data%20parallelism%20is%20parallelization%20across,on%20each%20element%20in%20parallel.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 10/15] perf record: manage thread specific data array
  2020-10-24 15:44   ` Jiri Olsa
@ 2020-10-26  8:39     ` Alexei Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26  8:39 UTC (permalink / raw)
  To: Jiri Olsa, Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:04:26PM +0300, Alexey Budankov wrote:
>>
>> Provide allocation, initialization, finalization and releasing of
>> thread specific objects at thread specific data array. Allocate
>> thread specific object for every data buffer making one-to-one
>> relation between data buffer and a thread processing the buffer.
>> Deliver event fd related signals to thread's pollfd object.
>> Deliver thread control commands to ctlfd_pos fd of thread 1+.
>> Deliver tool external control commands via ctlfd_pos fd of thread 0.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/builtin-record.c | 101 ++++++++++++++++++++++++++++++++++--
>>  1 file changed, 98 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index 8e512096a060..89cb8e913fb3 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -884,6 +884,94 @@ static int record__kcore_copy(struct machine *machine, struct perf_data *data)
>>  	return kcore_copy(from_dir, kcore_dir);
>>  }
>>  
>> +static int record__alloc_thread_data(struct record *rec, struct mmap *mmaps, int nr_mmaps,
>> +				     struct fdarray *evlist_pollfd, int ctlfd_pos)
>> +{
>> +	int i, j, k, nr_thread_data;
>> +	struct thread_data *thread_data;
>> +
>> +	rec->nr_thread_data = nr_thread_data = nr_mmaps;
>> +	rec->thread_data = thread_data = zalloc(rec->nr_thread_data * sizeof(*(rec->thread_data)));
>> +	if (!thread_data) {
>> +		pr_err("Failed to allocate thread data\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	for (i = 0; i < nr_thread_data; i++) {
>> +		short revents;
>> +		int pos, fd;
>> +
>> +		thread_data[i].tid = -1;
>> +
>> +		if (pipe(thread_data[i].comm.msg) ||
>> +		    pipe(thread_data[i].comm.ack)) {
>> +			pr_err("Failed to create thread comm pipes, errno %d\n", errno);
>> +			return -ENOMEM;
>> +		}
> 
> the original code was using state flag and pthread_cond,
> which I think is more readable
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=a7da527ff8be69572c6d17525c03c6fe394503c8
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=eb85ce4da64a885fdb6c77cfc5bd71312fe02e2a

Yes, right, but messaging scales better and that critical section
to just increment global counter in memory looked as over complication.

> 
>> +
>> +		thread_data[i].maps = &mmaps[i];
>> +		thread_data[i].nr_mmaps = 1;
>> +
>> +		thread_data[i].rec = rec;
>> +
>> +		fdarray__init(&(thread_data[i].pollfd), 64);
>> +
>> +		for (j = 0; j < thread_data[i].nr_mmaps; j++) {
>> +			struct mmap *map = &(thread_data[i].maps[j]);
>> +
>> +			for (k = 0; k < evlist_pollfd->nr; k++) {
>> +				if (evlist_pollfd->priv[k].ptr != map)
>> +					continue;
>> +
>> +				fd = evlist_pollfd->entries[k].fd;
>> +				revents = evlist_pollfd->entries[k].events;
>> +				pos = fdarray__add(&(thread_data[i].pollfd),
>> +						fd, revents | POLLERR | POLLHUP,
>> +						fdarray_flag__default);
>> +				if (pos >= 0)
>> +					thread_data[i].pollfd.priv[pos].ptr = map;
>> +				else
>> +					return -ENOMEM;
> 
> I added function for that:
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=8aa6e68a7471b9d25af1a9eebfa9321433366a17

Ok, but it is not clone operation it is more like ordinary adding of
fd into another pollfd object. It could be wrapped into a function
e.g. fdarray__dup(pollfd_dst, fd_src, fd_revents)

Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 14/15] perf record: start threads in the beginning of trace streaming
  2020-10-24 15:44   ` Jiri Olsa
@ 2020-10-26  8:39     ` Alexei Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26  8:39 UTC (permalink / raw)
  To: Jiri Olsa, Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel, budankov.lore


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:10:09PM +0300, Alexey Budankov wrote:
>>
>> Start threads in detached state because its management is possible
>> via messaging. Block signals prior the threads start so only main
>> tool thread would be notified on external async signals during data
>> collection. Streaming threads connect one-to-one to mapped data
>> buffers and write into per-CPU trace files located at data directory.
>> Data buffers and threads are affined to local NUMA nodes and monitored
>> CPUs according to system topology. --cpu option can be used to specify
>> CPUs to be monitored.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/builtin-record.c | 128 +++++++++++++++++++++++++++++++++---
>>  1 file changed, 120 insertions(+), 8 deletions(-)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index a15642656066..1d41e996a994 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -56,6 +56,7 @@
>>  #include <poll.h>
>>  #include <pthread.h>
>>  #include <unistd.h>
>> +#include <sys/syscall.h>
>>  #include <sched.h>
>>  #include <signal.h>
>>  #ifdef HAVE_EVENTFD_SUPPORT
>> @@ -1377,6 +1378,62 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
>>  		perf_mmap__put(map);
>>  }
>>  
>> +static void *record__thread(void *arg)
>> +{
>> +	enum thread_msg msg = THREAD_MSG__READY;
>> +	bool terminate = false;
>> +	struct fdarray *pollfd;
>> +	int err, ctlfd_pos;
>> +
>> +	thread = arg;
>> +	thread->tid = syscall(SYS_gettid);
>> +
>> +	err = write(thread->comm.ack[1], &msg, sizeof(msg));
>> +	if (err == -1)
>> +		pr_err("threads: %d failed to notify on start. Error %m", thread->tid);
>> +
>> +	pollfd = &(thread->pollfd);
> 
> I don't think braces are necessary in here

Corrected in v3.

Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/15] perf record: write trace data into mmap trace files
  2020-10-24 15:44   ` Jiri Olsa
@ 2020-10-26  8:52     ` Alexei Budankov
  2020-10-26 10:32       ` Jiri Olsa
  0 siblings, 1 reply; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26  8:52 UTC (permalink / raw)
  To: Jiri Olsa, Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel, budankov.lore


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:02:56PM +0300, Alexey Budankov wrote:
> 
> SNIP
> 
>>  
>>  	record__synthesize(rec, true);
>> -	/* this will be recalculated during process_buildids() */
>> -	rec->samples = 0;
>>  
>>  	if (!err) {
>>  		if (!rec->timestamp_filename) {
>> @@ -2680,9 +2709,12 @@ int cmd_record(int argc, const char **argv)
>>  
>>  	}
>>  
>> -	if (rec->opts.kcore)
>> +	if (rec->opts.kcore || record__threads_enabled(rec))
>>  		rec->data.is_dir = true;
>>  
>> +	if (record__threads_enabled(rec))
>> +		rec->opts.affinity = PERF_AFFINITY_CPU;
> 
> so all the threads will pin to cpu and back before reading?

No, they will not back. Thread mask compares to mmap mask before
read and the thread migrates if masks don't match. This happens
once on the first mmap read. So explicit pinning can be avoided.

Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming
  2020-10-24 15:44   ` Jiri Olsa
@ 2020-10-26  8:53     ` Alexei Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26  8:53 UTC (permalink / raw)
  To: Jiri Olsa, Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel



On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:03:48PM +0300, Alexey Budankov wrote:
>>
>> Introduce thread local data object and its array to be used for
>> threaded trace streaming.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/builtin-record.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>> index ba26d75c51d6..8e512096a060 100644
>> --- a/tools/perf/builtin-record.c
>> +++ b/tools/perf/builtin-record.c
>> @@ -85,11 +85,29 @@ struct switch_output {
>>  	int		 cur_file;
>>  };
>>  
>> +struct thread_data {
>> +	pid_t		   tid;
>> +	struct {
>> +		int	   msg[2];
>> +		int	   ack[2];
>> +	} comm;
>> +	struct fdarray	   pollfd;
>> +	int		   ctlfd_pos;
>> +	struct mmap	   *maps;
>> +	int		   nr_mmaps;
>> +	struct record	   *rec;
>> +	unsigned long long samples;
>> +	unsigned long	   waking;
>> +	u64		   bytes_written;
>> +};
> 
> please merge the struct with the code that's using it

Corrected in v3.

Thanks,
Alexei

> 
> jirka
> 
>> +
>>  struct record {
>>  	struct perf_tool	tool;
>>  	struct record_opts	opts;
>>  	u64			bytes_written;
>>  	struct perf_data	data;
>> +	struct thread_data	*thread_data;
>> +	int			nr_thread_data;
>>  	struct auxtrace_record	*itr;
>>  	struct evlist	*evlist;
>>  	struct perf_session	*session;
>> -- 
>> 2.24.1
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/15] perf session: introduce decompressor into trace reader object
  2020-10-24 15:44   ` Jiri Olsa
@ 2020-10-26  8:54     ` Alexei Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26  8:54 UTC (permalink / raw)
  To: Jiri Olsa, Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:00:30PM +0300, Alexey Budankov wrote:
>>
>> Introduce decompressor to trace reader object so that decompression
>> could be executed on per trace file basis separately for every
>> trace file located in trace directory.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/util/session.c | 4 +++-
>>  tools/perf/util/session.h | 1 +
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 911b2dbcd0ac..6afc670fdf0c 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -44,6 +44,8 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>>  	u64 decomp_last_rem = 0;
>>  	size_t mmap_len, decomp_len = session->header.env.comp_mmap_len;
>>  	struct decomp *decomp, *decomp_last = session->decomp_last;
>> +	struct zstd_data *zstd_data = session->reader ?
>> +		&(session->reader->zstd_data) : &(session->zstd_data);
> 
> I don't think we're using braces in these cases, they are not necessary

Corrected in v3.

Thanks,
Alexei

> 
> jirka
> 
>>  
>>  	if (decomp_last) {
>>  		decomp_last_rem = decomp_last->size - decomp_last->head;
>> @@ -71,7 +73,7 @@ static int perf_session__process_compressed_event(struct perf_session *session,
>>  	src = (void *)event + sizeof(struct perf_record_compressed);
>>  	src_size = event->pack.header.size - sizeof(struct perf_record_compressed);
>>  
>> -	decomp_size = zstd_decompress_stream(&(session->zstd_data), src, src_size,
>> +	decomp_size = zstd_decompress_stream(zstd_data, src, src_size,
>>  				&(decomp->data[decomp_last_rem]), decomp_len - decomp_last_rem);
>>  	if (!decomp_size) {
>>  		munmap(decomp, mmap_len);
>> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
>> index abdb8518a81f..4fc9ccdf7970 100644
>> --- a/tools/perf/util/session.h
>> +++ b/tools/perf/util/session.h
>> @@ -42,6 +42,7 @@ struct reader {
>>  	u64		 data_size;
>>  	u64		 data_offset;
>>  	reader_cb_t	 process;
>> +	struct zstd_data zstd_data;
>>  };
>>  
>>  struct perf_session {
>> -- 
>> 2.24.1
>>
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/15] perf record: write trace data into mmap trace files
  2020-10-26  8:52     ` Alexei Budankov
@ 2020-10-26 10:32       ` Jiri Olsa
  2020-10-26 14:04         ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-26 10:32 UTC (permalink / raw)
  To: Alexei Budankov
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Namhyung Kim,
	Alexander Shishkin, Adrian Hunter, Andi Kleen, Peter Zijlstra,
	Ingo Molnar, linux-kernel

On Mon, Oct 26, 2020 at 11:52:21AM +0300, Alexei Budankov wrote:
> 
> On 24.10.2020 18:44, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 07:02:56PM +0300, Alexey Budankov wrote:
> > 
> > SNIP
> > 
> >>  
> >>  	record__synthesize(rec, true);
> >> -	/* this will be recalculated during process_buildids() */
> >> -	rec->samples = 0;
> >>  
> >>  	if (!err) {
> >>  		if (!rec->timestamp_filename) {
> >> @@ -2680,9 +2709,12 @@ int cmd_record(int argc, const char **argv)
> >>  
> >>  	}
> >>  
> >> -	if (rec->opts.kcore)
> >> +	if (rec->opts.kcore || record__threads_enabled(rec))
> >>  		rec->data.is_dir = true;
> >>  
> >> +	if (record__threads_enabled(rec))
> >> +		rec->opts.affinity = PERF_AFFINITY_CPU;
> > 
> > so all the threads will pin to cpu and back before reading?
> 
> No, they will not back. Thread mask compares to mmap mask before
> read and the thread migrates if masks don't match. This happens
> once on the first mmap read. So explicit pinning can be avoided.

hum, is that right? the check in record__adjust_affinity
is checking global 'rec->affinity_mask', at lest I assume
it's still global ;-)

        if (rec->opts.affinity != PERF_AFFINITY_SYS &&
            !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
                          rec->affinity_mask.nbits)) {

I think this can never be equal if you have more than one map

when I check on sched_setaffinity syscalls:

  # perf trace -e syscalls:sys_enter_sched_setaffinity

while running record --threads, I see sched_setaffinity
calls all the time

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-26  8:21     ` Alexei Budankov
@ 2020-10-26 10:34       ` Jiri Olsa
  2020-10-26 14:11         ` Alexei Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-26 10:34 UTC (permalink / raw)
  To: Alexei Budankov
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Namhyung Kim,
	Alexander Shishkin, Adrian Hunter, Andi Kleen, Peter Zijlstra,
	Ingo Molnar, linux-kernel

On Mon, Oct 26, 2020 at 11:21:28AM +0300, Alexei Budankov wrote:
> 
> On 24.10.2020 18:43, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 07:07:00PM +0300, Alexey Budankov wrote:
> >>
> >> Introduce thread local variable and use it for threaded trace streaming.
> >>
> >> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> >> ---
> >>  tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 62 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> >> index 89cb8e913fb3..3b7e9026f25b 100644
> >> --- a/tools/perf/builtin-record.c
> >> +++ b/tools/perf/builtin-record.c
> >> @@ -101,6 +101,8 @@ struct thread_data {
> >>  	u64		   bytes_written;
> >>  };
> >>  
> >> +static __thread struct thread_data *thread;
> >> +
> >>  struct record {
> >>  	struct perf_tool	tool;
> >>  	struct record_opts	opts;
> >> @@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
> >>  		}
> >>  	}
> >>  
> >> -	rec->samples++;
> >> +	if (thread)
> >> +		thread->samples++;
> >> +	else
> >> +		rec->samples++;
> > 
> > this is really wrong, let's keep just single samples counter
> > ditto for all the other places in this patch
> 
> This does look like data parallelism [1] which is very true for
> threaded trace streaming so your prototype design looks optimal.
> 
> For this specific place incrementing global counter in memory is
> less performant and faces scalability limitations as a number of
> cores grow.
> 
> Not sure why you have changed your mind.

I'm not sure I follow.. what I'm complaining about is to have
'samples' stat variable in separate locations for --threads
and --no-threads mode

jirka

> 
> Alexei
> 
> [1] https://en.wikipedia.org/wiki/Data_parallelism#:~:text=Data%20parallelism%20is%20parallelization%20across,on%20each%20element%20in%20parallel.
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 08/15] perf record: write trace data into mmap trace files
  2020-10-26 10:32       ` Jiri Olsa
@ 2020-10-26 14:04         ` Alexei Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26 14:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Namhyung Kim,
	Alexander Shishkin, Adrian Hunter, Andi Kleen, Peter Zijlstra,
	Ingo Molnar, linux-kernel, budankov.lore


On 26.10.2020 13:32, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 11:52:21AM +0300, Alexei Budankov wrote:
>>
>> On 24.10.2020 18:44, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 07:02:56PM +0300, Alexey Budankov wrote:
>>>
>>> SNIP
>>>
>>>>  
>>>>  	record__synthesize(rec, true);
>>>> -	/* this will be recalculated during process_buildids() */
>>>> -	rec->samples = 0;
>>>>  
>>>>  	if (!err) {
>>>>  		if (!rec->timestamp_filename) {
>>>> @@ -2680,9 +2709,12 @@ int cmd_record(int argc, const char **argv)
>>>>  
>>>>  	}
>>>>  
>>>> -	if (rec->opts.kcore)
>>>> +	if (rec->opts.kcore || record__threads_enabled(rec))
>>>>  		rec->data.is_dir = true;
>>>>  
>>>> +	if (record__threads_enabled(rec))
>>>> +		rec->opts.affinity = PERF_AFFINITY_CPU;
>>>
>>> so all the threads will pin to cpu and back before reading?
>>
>> No, they will not back. Thread mask compares to mmap mask before
>> read and the thread migrates if masks don't match. This happens
>> once on the first mmap read. So explicit pinning can be avoided.
> 
> hum, is that right? the check in record__adjust_affinity
> is checking global 'rec->affinity_mask', at lest I assume
> it's still global ;-)

Yes, rec->affinity_mask should also be per-thread. Good catch. Thanks!

Alexei

> 
>         if (rec->opts.affinity != PERF_AFFINITY_SYS &&
>             !bitmap_equal(rec->affinity_mask.bits, map->affinity_mask.bits,
>                           rec->affinity_mask.nbits)) {
> 
> I think this can never be equal if you have more than one map
> 
> when I check on sched_setaffinity syscalls:
> 
>   # perf trace -e syscalls:sys_enter_sched_setaffinity
> 
> while running record --threads, I see sched_setaffinity
> calls all the time
> 
> jirka
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-26 10:34       ` Jiri Olsa
@ 2020-10-26 14:11         ` Alexei Budankov
  2020-10-27 12:01           ` Jiri Olsa
  0 siblings, 1 reply; 64+ messages in thread
From: Alexei Budankov @ 2020-10-26 14:11 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Namhyung Kim,
	Alexander Shishkin, Adrian Hunter, Andi Kleen, Peter Zijlstra,
	Ingo Molnar, linux-kernel, budankov.lore


On 26.10.2020 13:34, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 11:21:28AM +0300, Alexei Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 07:07:00PM +0300, Alexey Budankov wrote:
>>>>
>>>> Introduce thread local variable and use it for threaded trace streaming.
>>>>
>>>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>>>> ---
>>>>  tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
>>>>  1 file changed, 62 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>>> index 89cb8e913fb3..3b7e9026f25b 100644
>>>> --- a/tools/perf/builtin-record.c
>>>> +++ b/tools/perf/builtin-record.c
>>>> @@ -101,6 +101,8 @@ struct thread_data {
>>>>  	u64		   bytes_written;
>>>>  };
>>>>  
>>>> +static __thread struct thread_data *thread;
>>>> +
>>>>  struct record {
>>>>  	struct perf_tool	tool;
>>>>  	struct record_opts	opts;
>>>> @@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>>>>  		}
>>>>  	}
>>>>  
>>>> -	rec->samples++;
>>>> +	if (thread)
>>>> +		thread->samples++;
>>>> +	else
>>>> +		rec->samples++;
>>>
>>> this is really wrong, let's keep just single samples counter
>>> ditto for all the other places in this patch
>>
>> This does look like data parallelism [1] which is very true for
>> threaded trace streaming so your prototype design looks optimal.
>>
>> For this specific place incrementing global counter in memory is
>> less performant and faces scalability limitations as a number of
>> cores grow.
>>
>> Not sure why you have changed your mind.
> 
> I'm not sure I follow.. what I'm complaining about is to have
> 'samples' stat variable in separate locations for --threads
> and --no-threads mode

It is optimal to have samples variable as per thread one
and then sum up the total in the end of data collection.

Single global variable design has scalability and performance
drawbacks.

Why do you complain about per thread variable in this case?
It looks like ideally fits these specific needs.

Alexei

> 
> jirka
> 
>>
>> Alexei
>>
>> [1] https://en.wikipedia.org/wiki/Data_parallelism#:~:text=Data%20parallelism%20is%20parallelization%20across,on%20each%20element%20in%20parallel.
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-24 15:43   ` Jiri Olsa
@ 2020-10-26 17:47     ` Alexey Budankov
  2020-10-27 11:59       ` Jiri Olsa
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-26 17:47 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 24.10.2020 18:43, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 06:57:53PM +0300, Alexey Budankov wrote:
>>
>> Open files located at trace data directory in case read access
>> mode is requested. File are opened and its fds assigned to
>> perf_data dir files especially for loading data directories
>> content in perf report mode.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/util/data.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
>> index c47aa34fdc0a..6ad61ac6ba67 100644
>> --- a/tools/perf/util/data.c
>> +++ b/tools/perf/util/data.c
>> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
>>  		return -1;
>>  
>>  	ret = open_file(data);
>> +	if (!ret && perf_data__is_dir(data)) {
>> +		if (perf_data__is_read(data))
>> +			ret = perf_data__open_dir(data);
>> +	}
> 
> perf_data__open_dir is also called from perf_session__new
> is it called twice?

It is not called twice. It is in different branches.
This one is for write and the other one is for read.

Alexei

> 
> jirka
> 
>>  
>>  	/* Cleanup whatever we managed to create so far. */
>>  	if (ret && perf_data__is_write(data))
>> -- 
>> 2.24.1
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 04/15] perf session: move reader object definition to header file
  2020-10-24 15:44   ` Jiri Olsa
@ 2020-10-26 17:50     ` Alexey Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-26 17:50 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 24.10.2020 18:44, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 06:59:48PM +0300, Alexey Budankov wrote:
>>
>> Move definition of reader to session header file to be shared
>> among different source files. Introduce reference to active
>> reader object from session object.
>>
>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>> ---
>>  tools/perf/util/session.c | 27 ---------------------------
>>  tools/perf/util/session.h | 25 +++++++++++++++++++++++++
>>  2 files changed, 25 insertions(+), 27 deletions(-)
>>
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 6f09d506b2f6..911b2dbcd0ac 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -2110,33 +2110,6 @@ static int __perf_session__process_decomp_events(struct perf_session *session)
>>  	return 0;
>>  }
>>  
>> -/*
>> - * On 64bit we can mmap the data file in one go. No need for tiny mmap
>> - * slices. On 32bit we use 32MB.
>> - */
>> -#if BITS_PER_LONG == 64
>> -#define MMAP_SIZE ULLONG_MAX
>> -#define NUM_MMAPS 1
>> -#else
>> -#define MMAP_SIZE (32 * 1024 * 1024ULL)
>> -#define NUM_MMAPS 128
>> -#endif
>> -
>> -struct reader;
>> -
>> -typedef s64 (*reader_cb_t)(struct perf_session *session,
>> -			   union perf_event *event,
>> -			   u64 file_offset,
>> -			   const char *file_path);
>> -
>> -struct reader {
>> -	int		 fd;
>> -	const char	 *path;
>> -	u64		 data_size;
>> -	u64		 data_offset;
>> -	reader_cb_t	 process;
>> -};
>> -
>>  static int
>>  reader__process_events(struct reader *rd, struct perf_session *session,
>>  		       struct ui_progress *prog)
>> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
>> index 378ffc3e2809..abdb8518a81f 100644
>> --- a/tools/perf/util/session.h
>> +++ b/tools/perf/util/session.h
>> @@ -20,6 +20,30 @@ struct thread;
>>  struct auxtrace;
>>  struct itrace_synth_opts;
>>  
>> +/*
>> + * On 64bit we can mmap the data file in one go. No need for tiny mmap
>> + * slices. On 32bit we use 32MB.
>> + */
>> +#if BITS_PER_LONG == 64
>> +#define MMAP_SIZE ULLONG_MAX
>> +#define NUM_MMAPS 1
>> +#else
>> +#define MMAP_SIZE (32 * 1024 * 1024ULL)
>> +#define NUM_MMAPS 128
>> +#endif
>> +
>> +typedef s64 (*reader_cb_t)(struct perf_session *session,
>> +			   union perf_event *event,
>> +			   u64 file_offset, const char *file_path);
>> +
>> +struct reader {
>> +	int		 fd;
>> +	const char	 *path;
>> +	u64		 data_size;
>> +	u64		 data_offset;
>> +	reader_cb_t	 process;
>> +};
> 
> I wasn't able to find where is this used ouf of session.c ?

I will double check and adjust appropriately.

> 
> 
>> +
>>  struct perf_session {
>>  	struct perf_header	header;
>>  	struct machines		machines;
>> @@ -41,6 +65,7 @@ struct perf_session {
>>  	struct zstd_data	zstd_data;
>>  	struct decomp		*decomp;
>>  	struct decomp		*decomp_last;
>> +	struct reader		*reader;
> 
> please define it in the patch where it's actualy used

Corrected in v3.

Alexei

> 
> thanks,
> jirka
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
  2020-10-24 15:43 ` [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
@ 2020-10-26 17:59   ` Alexey Budankov
  2020-10-27 12:10     ` Jiri Olsa
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-26 17:59 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 24.10.2020 18:43, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>>
>> Changes in v2:
>> - explicitly added credit tags to patches 6/15 and 15/15,
>>   additionally to cites [1], [2]
>> - updated description of 3/15 to explicitly mention the reason
>>   to open data directories in read access mode (e.g. for perf report)
>> - implemented fix for compilation error of 2/15
>> - explicitly elaborated on found issues to be resolved for
>>   threaded AUX trace capture
>>
>> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
>>
>> Patch set provides threaded trace streaming for base perf record
>> operation. Provided streaming mode (--threads) mitigates profiling
>> data losses and resolves scalability issues of serial and asynchronous
>> (--aio) trace streaming modes on multicore server systems. The patch
>> set is based on the prototype [1], [2] and the most closely relates
>> to mode 3) "mode that creates thread for every monitored memory map".
> 
> so what I liked about the previous code was that you could
> configure how the threads would be created
> 
> default --threads options created thread for each cpu like
> in your change:
> 
>   $ perf record -v --threads ...
>   ...
>   thread 0 monitor: 0 allowed: 0
>   thread 1 monitor: 1 allowed: 1
>   thread 2 monitor: 2 allowed: 2
>   thread 3 monitor: 3 allowed: 3
>   thread 4 monitor: 4 allowed: 4
>   thread 5 monitor: 5 allowed: 5
>   thread 6 monitor: 6 allowed: 6
>   thread 7 monitor: 7 allowed: 7

Yes, it is configurable in the prototype. Even though this patch set
doesn't implement that parameters for --thread option, just because
VTune doesn't have use cases for that yet, it has still been designed
and implemented with that possible extension in mind so it could then
be easily added on top of it.

> 
> 
> then numa based:
> 
>   $ perf record -v --threads=numa ...
>   ...
>   thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
>   thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
> 
> 
> socket based:
> 
>   $ perf record -v --threads=socket ...
>   ...
>   thread 0 monitor: 0-7 allowed: 0-7
> 
> 
> core based:
> 
>   $ perf record -v --threads=core ...
>   ...
>   thread 0 monitor: 0,4 allowed: 0,4
>   thread 1 monitor: 1,5 allowed: 1,5
>   thread 2 monitor: 2,6 allowed: 2,6
>   thread 3 monitor: 3,7 allowed: 3,7
> 
> 
> and user configurable:
> 
>   $ perf record -v  --threads=0-3/0:4-7/4 ...
>   ...
>   threads: 0. monitor 0-3, allowed 0
>   threads: 1. monitor 4-7, allowed 4
> 
> 
> so this way you could easily pin threads to cpu/core/socket/numa,
> or to some other cpu of your choice, because this will be always
> game of try and check where I'm not getting LOST events and not
> creating 1000 threads
> 
>  perf record: Add support for threads numa option value
>  perf record: Add support for threads socket option value
>  perf record: Add support for threads core option value
>  perf record: Add support for threads user option value

Makes sense.

Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-24 15:43   ` Jiri Olsa
@ 2020-10-27  7:37     ` Alexey Budankov
  2020-10-27 12:21       ` Jiri Olsa
  0 siblings, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27  7:37 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 24.10.2020 18:43, Jiri Olsa wrote:
> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>>
>> Read trace files located in data directory into tool process memory.
>> Basic analysis support of data directories is provided for report
>> mode. Raw dump (-D) and aggregated reports are available for data
>> directories, still with no memory consumption optimizations. However
>> data directories collected with --compression-level option enabled
>> can be analyzed with little less memory because trace files are
>> unmaped from tool process memory after loading collected data.
>> The implementation is based on the prototype [1], [2].
>>
>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
>>
>> Suggested-by: Jiri Olsa <jolsa@kernel.org>
> 
> very loosely ;-) so there was a reason for all that reader refactoring,
> so we could have __perf_session__process_dir_events function:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e

Nonetheless. All that are necessary parts to make threaded data streaming
and analysis eventually merged into the mainline as joint Perf developers
community effort.

> 
> when reporting the threaded record data on really big servers,
> you will run out of memory, so you need to read and flush all
> the files together by smaller pieces

Yes, handling all that _big_ data after collection to make it
helpful for analysis of performance issues is the other part
of this story so that possible OOM should be somehow avoided.

> 
> IMO we need to have this change before we allow threaded record

There are use cases of perf tool as a data provider, btw VTune is not
the only one of them, and for those use cases threaded trace streaming
lets its users get to their data that the users just were loosing before.
This is huge difference and whole new level of support for such users.
Post-process scripting around perf (e.g. Python based) will benefit
from threaded trace streaming. Pipe mode can be extended to stream into
open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
VTune-like tools can get performance data, load it into a (relational)
DB files and provide analysis. And all that uses perf tool at its core.

I agree perf report OOM issue can exist on really-big servers but data 
directories support for report mode for not-so-big servers and desktops
is already enabled with this smaller change. Also really-big-servers
come with really-big amount of memory and collection could possibly be
limited to only interesting phases of execution so the issue could likely
be avoided. At the same time threaded trace streaming could clarify on
real use cases that are blocked by perf report OOM issue and that would
clarify on exact required solution. So perf report OOM issue shouldn't
be the showstopper for upstream of threaded trace streaming.

Alexei


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-26 17:47     ` Alexey Budankov
@ 2020-10-27 11:59       ` Jiri Olsa
  2020-10-27 14:44         ` Alexey Budankov
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-27 11:59 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Mon, Oct 26, 2020 at 08:47:06PM +0300, Alexey Budankov wrote:
> 
> On 24.10.2020 18:43, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 06:57:53PM +0300, Alexey Budankov wrote:
> >>
> >> Open files located at trace data directory in case read access
> >> mode is requested. File are opened and its fds assigned to
> >> perf_data dir files especially for loading data directories
> >> content in perf report mode.
> >>
> >> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> >> ---
> >>  tools/perf/util/data.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
> >> index c47aa34fdc0a..6ad61ac6ba67 100644
> >> --- a/tools/perf/util/data.c
> >> +++ b/tools/perf/util/data.c
> >> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
> >>  		return -1;
> >>  
> >>  	ret = open_file(data);
> >> +	if (!ret && perf_data__is_dir(data)) {
> >> +		if (perf_data__is_read(data))
> >> +			ret = perf_data__open_dir(data);
> >> +	}
> > 
> > perf_data__open_dir is also called from perf_session__new
> > is it called twice?
> 
> It is not called twice. It is in different branches.
> This one is for write and the other one is for read.

hum, is that right?

	# ./perf record --threads 
	^C[ perf record: Woken up 15 times to write data ]
	[ perf record: Captured and wrote 1.421 MB perf.data (515 samples) ]

	# gdb ./perf
	GNU gdb (GDB) Fedora 9.1-6.fc32

	(gdb) b perf_data__open_dir
	Breakpoint 1 at 0x5b4753: file util/data.c, line 72.

	(gdb) r --no-pager report --stdio
	Starting program: /home/jolsa/kernel/linux-perf/tools/perf/perf --no-pager report --stdio

	Breakpoint 1, perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
	72      {
	(gdb) bt
	#0  perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
	#1  0x00000000005b538d in open_dir (data=0x7fffffffb0e0) at util/data.c:326
	#2  0x00000000005b546d in perf_data__open (data=0x7fffffffb0e0) at util/data.c:351
	#3  0x00000000005627e8 in perf_session__new (data=0x7fffffffb0e0, repipe=false, tool=0x7fffffffb220) at util/session.c:210
	#4  0x000000000045a572 in cmd_report (argc=0, argv=0x7fffffffd7a8) at builtin-report.c:1372
	#5  0x00000000004f49ec in run_builtin (p=0xaadab0 <commands+240>, argc=2, argv=0x7fffffffd7a8) at perf.c:312
	#6  0x00000000004f4c59 in handle_internal_command (argc=2, argv=0x7fffffffd7a8) at perf.c:364
	#7  0x00000000004f4da0 in run_argv (argcp=0x7fffffffd5ec, argv=0x7fffffffd5e0) at perf.c:408
	#8  0x00000000004f516c in main (argc=2, argv=0x7fffffffd7a8) at perf.c:538

	(gdb) c
	Continuing.

	Breakpoint 1, perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
	72      {
	(gdb) bt
	#0  perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
	#1  0x0000000000562883 in perf_session__new (data=0x7fffffffb0e0, repipe=false, tool=0x7fffffffb220) at util/session.c:234
	#2  0x000000000045a572 in cmd_report (argc=0, argv=0x7fffffffd7a8) at builtin-report.c:1372
	#3  0x00000000004f49ec in run_builtin (p=0xaadab0 <commands+240>, argc=2, argv=0x7fffffffd7a8) at perf.c:312
	#4  0x00000000004f4c59 in handle_internal_command (argc=2, argv=0x7fffffffd7a8) at perf.c:364
	#5  0x00000000004f4da0 in run_argv (argcp=0x7fffffffd5ec, argv=0x7fffffffd5e0) at perf.c:408
	#6  0x00000000004f516c in main (argc=2, argv=0x7fffffffd7a8) at perf.c:538


AFAICS the second (current) call to perf_data__open_dir will
do the job, because the call you added still does not see
directory with proper version and will bail out on call to
  perf_data__is_single_file

perf_session__open call will read headers and update dir version
so the current perf_data__open_dir will open the directory

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-26 14:11         ` Alexei Budankov
@ 2020-10-27 12:01           ` Jiri Olsa
  2020-10-27 14:16             ` Alexey Budankov
  2020-10-27 15:58             ` Alexey Budankov
  0 siblings, 2 replies; 64+ messages in thread
From: Jiri Olsa @ 2020-10-27 12:01 UTC (permalink / raw)
  To: Alexei Budankov
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Namhyung Kim,
	Alexander Shishkin, Adrian Hunter, Andi Kleen, Peter Zijlstra,
	Ingo Molnar, linux-kernel

On Mon, Oct 26, 2020 at 05:11:30PM +0300, Alexei Budankov wrote:
> 
> On 26.10.2020 13:34, Jiri Olsa wrote:
> > On Mon, Oct 26, 2020 at 11:21:28AM +0300, Alexei Budankov wrote:
> >>
> >> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>> On Wed, Oct 21, 2020 at 07:07:00PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> Introduce thread local variable and use it for threaded trace streaming.
> >>>>
> >>>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
> >>>> ---
> >>>>  tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
> >>>>  1 file changed, 62 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> >>>> index 89cb8e913fb3..3b7e9026f25b 100644
> >>>> --- a/tools/perf/builtin-record.c
> >>>> +++ b/tools/perf/builtin-record.c
> >>>> @@ -101,6 +101,8 @@ struct thread_data {
> >>>>  	u64		   bytes_written;
> >>>>  };
> >>>>  
> >>>> +static __thread struct thread_data *thread;
> >>>> +
> >>>>  struct record {
> >>>>  	struct perf_tool	tool;
> >>>>  	struct record_opts	opts;
> >>>> @@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> -	rec->samples++;
> >>>> +	if (thread)
> >>>> +		thread->samples++;
> >>>> +	else
> >>>> +		rec->samples++;
> >>>
> >>> this is really wrong, let's keep just single samples counter
> >>> ditto for all the other places in this patch
> >>
> >> This does look like data parallelism [1] which is very true for
> >> threaded trace streaming so your prototype design looks optimal.
> >>
> >> For this specific place incrementing global counter in memory is
> >> less performant and faces scalability limitations as a number of
> >> cores grow.
> >>
> >> Not sure why you have changed your mind.
> > 
> > I'm not sure I follow.. what I'm complaining about is to have
> > 'samples' stat variable in separate locations for --threads
> > and --no-threads mode
> 
> It is optimal to have samples variable as per thread one
> and then sum up the total in the end of data collection.
> 
> Single global variable design has scalability and performance
> drawbacks.
> 
> Why do you complain about per thread variable in this case?
> It looks like ideally fits these specific needs.

I think there's misunderstanding.. I think we should move
samples to per thread 'thread' object and have just one
copy of that.. and do not increase separate variables for
thread and non-thread cases

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
  2020-10-26 17:59   ` Alexey Budankov
@ 2020-10-27 12:10     ` Jiri Olsa
  2020-10-27 14:26       ` Alexey Budankov
  2020-10-27 16:01       ` Alexey Budankov
  0 siblings, 2 replies; 64+ messages in thread
From: Jiri Olsa @ 2020-10-27 12:10 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
> 
> On 24.10.2020 18:43, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> >>
> >> Changes in v2:
> >> - explicitly added credit tags to patches 6/15 and 15/15,
> >>   additionally to cites [1], [2]
> >> - updated description of 3/15 to explicitly mention the reason
> >>   to open data directories in read access mode (e.g. for perf report)
> >> - implemented fix for compilation error of 2/15
> >> - explicitly elaborated on found issues to be resolved for
> >>   threaded AUX trace capture
> >>
> >> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
> >>
> >> Patch set provides threaded trace streaming for base perf record
> >> operation. Provided streaming mode (--threads) mitigates profiling
> >> data losses and resolves scalability issues of serial and asynchronous
> >> (--aio) trace streaming modes on multicore server systems. The patch
> >> set is based on the prototype [1], [2] and the most closely relates
> >> to mode 3) "mode that creates thread for every monitored memory map".
> > 
> > so what I liked about the previous code was that you could
> > configure how the threads would be created
> > 
> > default --threads options created thread for each cpu like
> > in your change:
> > 
> >   $ perf record -v --threads ...
> >   ...
> >   thread 0 monitor: 0 allowed: 0
> >   thread 1 monitor: 1 allowed: 1
> >   thread 2 monitor: 2 allowed: 2
> >   thread 3 monitor: 3 allowed: 3
> >   thread 4 monitor: 4 allowed: 4
> >   thread 5 monitor: 5 allowed: 5
> >   thread 6 monitor: 6 allowed: 6
> >   thread 7 monitor: 7 allowed: 7
> 
> Yes, it is configurable in the prototype. Even though this patch set
> doesn't implement that parameters for --thread option, just because
> VTune doesn't have use cases for that yet, it has still been designed
> and implemented with that possible extension in mind so it could then
> be easily added on top of it.

I'm not sure about vtune extensions, but if we are going to
have --threads option I believe we should make it configurable
at least to the extend descibed below

jirka

> 
> > 
> > 
> > then numa based:
> > 
> >   $ perf record -v --threads=numa ...
> >   ...
> >   thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
> >   thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
> > 
> > 
> > socket based:
> > 
> >   $ perf record -v --threads=socket ...
> >   ...
> >   thread 0 monitor: 0-7 allowed: 0-7
> > 
> > 
> > core based:
> > 
> >   $ perf record -v --threads=core ...
> >   ...
> >   thread 0 monitor: 0,4 allowed: 0,4
> >   thread 1 monitor: 1,5 allowed: 1,5
> >   thread 2 monitor: 2,6 allowed: 2,6
> >   thread 3 monitor: 3,7 allowed: 3,7
> > 
> > 
> > and user configurable:
> > 
> >   $ perf record -v  --threads=0-3/0:4-7/4 ...
> >   ...
> >   threads: 0. monitor 0-3, allowed 0
> >   threads: 1. monitor 4-7, allowed 4
> > 
> > 
> > so this way you could easily pin threads to cpu/core/socket/numa,
> > or to some other cpu of your choice, because this will be always
> > game of try and check where I'm not getting LOST events and not
> > creating 1000 threads
> > 
> >  perf record: Add support for threads numa option value
> >  perf record: Add support for threads socket option value
> >  perf record: Add support for threads core option value
> >  perf record: Add support for threads user option value
> 
> Makes sense.
> 
> Alexei
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-27  7:37     ` Alexey Budankov
@ 2020-10-27 12:21       ` Jiri Olsa
  2020-10-27 14:43         ` Alexey Budankov
  2020-10-27 15:04         ` Alexey Budankov
  0 siblings, 2 replies; 64+ messages in thread
From: Jiri Olsa @ 2020-10-27 12:21 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> 
> On 24.10.2020 18:43, Jiri Olsa wrote:
> > On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
> >>
> >> Read trace files located in data directory into tool process memory.
> >> Basic analysis support of data directories is provided for report
> >> mode. Raw dump (-D) and aggregated reports are available for data
> >> directories, still with no memory consumption optimizations. However
> >> data directories collected with --compression-level option enabled
> >> can be analyzed with little less memory because trace files are
> >> unmaped from tool process memory after loading collected data.
> >> The implementation is based on the prototype [1], [2].
> >>
> >> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> >> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
> >>
> >> Suggested-by: Jiri Olsa <jolsa@kernel.org>
> > 
> > very loosely ;-) so there was a reason for all that reader refactoring,
> > so we could have __perf_session__process_dir_events function:
> > 
> >   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
> 
> Nonetheless. All that are necessary parts to make threaded data streaming
> and analysis eventually merged into the mainline as joint Perf developers
> community effort.
> 
> > 
> > when reporting the threaded record data on really big servers,
> > you will run out of memory, so you need to read and flush all
> > the files together by smaller pieces
> 
> Yes, handling all that _big_ data after collection to make it
> helpful for analysis of performance issues is the other part
> of this story so that possible OOM should be somehow avoided.
> 
> > 
> > IMO we need to have this change before we allow threaded record
> 
> There are use cases of perf tool as a data provider, btw VTune is not
> the only one of them, and for those use cases threaded trace streaming
> lets its users get to their data that the users just were loosing before.
> This is huge difference and whole new level of support for such users.
> Post-process scripting around perf (e.g. Python based) will benefit
> from threaded trace streaming. Pipe mode can be extended to stream into
> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
> VTune-like tools can get performance data, load it into a (relational)
> DB files and provide analysis. And all that uses perf tool at its core.
> 
> I agree perf report OOM issue can exist on really-big servers but data 
> directories support for report mode for not-so-big servers and desktops
> is already enabled with this smaller change. Also really-big-servers
> come with really-big amount of memory and collection could possibly be
> limited to only interesting phases of execution so the issue could likely
> be avoided. At the same time threaded trace streaming could clarify on
> real use cases that are blocked by perf report OOM issue and that would
> clarify on exact required solution. So perf report OOM issue shouldn't
> be the showstopper for upstream of threaded trace streaming.

so the short answer is no, right? ;-) 

I understand all the excuses, but from my point of view we are
adding another pain point (and there's already few ;-) ) that
will make perf (even more) not user friendly

if we allow really friendly way to create huge data, we should
do our best to be able to process it as best as we can

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-27 12:01           ` Jiri Olsa
@ 2020-10-27 14:16             ` Alexey Budankov
  2020-10-27 15:58             ` Alexey Budankov
  1 sibling, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 14:16 UTC (permalink / raw)
  To: Jiri Olsa, Alexei Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 15:01, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 05:11:30PM +0300, Alexei Budankov wrote:
<SNIP>
>> Why do you complain about per thread variable in this case?
>> It looks like ideally fits these specific needs.
> 
> I think there's misunderstanding.. I think we should move
> samples to per thread 'thread' object and have just one
> copy of that.. and do not increase separate variables for
> thread and non-thread cases

Aw, I see. Using the same __thread object by main thread in
serial and threaded modes. That makes sense.
I will correct in v3.

Alexei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
  2020-10-27 12:10     ` Jiri Olsa
@ 2020-10-27 14:26       ` Alexey Budankov
  2020-10-27 16:01       ` Alexey Budankov
  1 sibling, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 14:26 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 15:10, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>>>>
>>>> Changes in v2:
>>>> - explicitly added credit tags to patches 6/15 and 15/15,
>>>>   additionally to cites [1], [2]
>>>> - updated description of 3/15 to explicitly mention the reason
>>>>   to open data directories in read access mode (e.g. for perf report)
>>>> - implemented fix for compilation error of 2/15
>>>> - explicitly elaborated on found issues to be resolved for
>>>>   threaded AUX trace capture
>>>>
>>>> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
>>>>
>>>> Patch set provides threaded trace streaming for base perf record
>>>> operation. Provided streaming mode (--threads) mitigates profiling
>>>> data losses and resolves scalability issues of serial and asynchronous
>>>> (--aio) trace streaming modes on multicore server systems. The patch
>>>> set is based on the prototype [1], [2] and the most closely relates
>>>> to mode 3) "mode that creates thread for every monitored memory map".
>>>
>>> so what I liked about the previous code was that you could
>>> configure how the threads would be created
>>>
>>> default --threads options created thread for each cpu like
>>> in your change:
>>>
>>>   $ perf record -v --threads ...
>>>   ...
>>>   thread 0 monitor: 0 allowed: 0
>>>   thread 1 monitor: 1 allowed: 1
>>>   thread 2 monitor: 2 allowed: 2
>>>   thread 3 monitor: 3 allowed: 3
>>>   thread 4 monitor: 4 allowed: 4
>>>   thread 5 monitor: 5 allowed: 5
>>>   thread 6 monitor: 6 allowed: 6
>>>   thread 7 monitor: 7 allowed: 7
>>
>> Yes, it is configurable in the prototype. Even though this patch set
>> doesn't implement that parameters for --thread option, just because
>> VTune doesn't have use cases for that yet, it has still been designed
>> and implemented with that possible extension in mind so it could then
>> be easily added on top of it.
> 
> I'm not sure about vtune extensions, but if we are going to
> have --threads option I believe we should make it configurable
> at least to the extend descibed below

vtune employs --threads mode only and there are no use cases
observed so far beyond this mode. Do you have such use cases?

Alexei

> 
> jirka
> 
>>
>>>
>>>
>>> then numa based:
>>>
>>>   $ perf record -v --threads=numa ...
>>>   ...
>>>   thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
>>>   thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
>>>
>>>
>>> socket based:
>>>
>>>   $ perf record -v --threads=socket ...
>>>   ...
>>>   thread 0 monitor: 0-7 allowed: 0-7
>>>
>>>
>>> core based:
>>>
>>>   $ perf record -v --threads=core ...
>>>   ...
>>>   thread 0 monitor: 0,4 allowed: 0,4
>>>   thread 1 monitor: 1,5 allowed: 1,5
>>>   thread 2 monitor: 2,6 allowed: 2,6
>>>   thread 3 monitor: 3,7 allowed: 3,7
>>>
>>>
>>> and user configurable:
>>>
>>>   $ perf record -v  --threads=0-3/0:4-7/4 ...
>>>   ...
>>>   threads: 0. monitor 0-3, allowed 0
>>>   threads: 1. monitor 4-7, allowed 4
>>>
>>>
>>> so this way you could easily pin threads to cpu/core/socket/numa,
>>> or to some other cpu of your choice, because this will be always
>>> game of try and check where I'm not getting LOST events and not
>>> creating 1000 threads
>>>
>>>  perf record: Add support for threads numa option value
>>>  perf record: Add support for threads socket option value
>>>  perf record: Add support for threads core option value
>>>  perf record: Add support for threads user option value
>>
>> Makes sense.
>>
>> Alexei
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-27 12:21       ` Jiri Olsa
@ 2020-10-27 14:43         ` Alexey Budankov
  2020-10-28  7:22           ` Namhyung Kim
  2020-10-28 15:36           ` Jiri Olsa
  2020-10-27 15:04         ` Alexey Budankov
  1 sibling, 2 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 14:43 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 15:21, Jiri Olsa wrote:
> On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>>>>
>>>> Read trace files located in data directory into tool process memory.
>>>> Basic analysis support of data directories is provided for report
>>>> mode. Raw dump (-D) and aggregated reports are available for data
>>>> directories, still with no memory consumption optimizations. However
>>>> data directories collected with --compression-level option enabled
>>>> can be analyzed with little less memory because trace files are
>>>> unmaped from tool process memory after loading collected data.
>>>> The implementation is based on the prototype [1], [2].
>>>>
>>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>>>> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
>>>>
>>>> Suggested-by: Jiri Olsa <jolsa@kernel.org>
>>>
>>> very loosely ;-) so there was a reason for all that reader refactoring,
>>> so we could have __perf_session__process_dir_events function:
>>>
>>>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
>>
>> Nonetheless. All that are necessary parts to make threaded data streaming
>> and analysis eventually merged into the mainline as joint Perf developers
>> community effort.
>>
>>>
>>> when reporting the threaded record data on really big servers,
>>> you will run out of memory, so you need to read and flush all
>>> the files together by smaller pieces
>>
>> Yes, handling all that _big_ data after collection to make it
>> helpful for analysis of performance issues is the other part
>> of this story so that possible OOM should be somehow avoided.
>>
>>>
>>> IMO we need to have this change before we allow threaded record
>>
>> There are use cases of perf tool as a data provider, btw VTune is not
>> the only one of them, and for those use cases threaded trace streaming
>> lets its users get to their data that the users just were loosing before.
>> This is huge difference and whole new level of support for such users.
>> Post-process scripting around perf (e.g. Python based) will benefit
>> from threaded trace streaming. Pipe mode can be extended to stream into
>> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
>> VTune-like tools can get performance data, load it into a (relational)
>> DB files and provide analysis. And all that uses perf tool at its core.
>>
>> I agree perf report OOM issue can exist on really-big servers but data 
>> directories support for report mode for not-so-big servers and desktops
>> is already enabled with this smaller change. Also really-big-servers
>> come with really-big amount of memory and collection could possibly be
>> limited to only interesting phases of execution so the issue could likely
>> be avoided. At the same time threaded trace streaming could clarify on
>> real use cases that are blocked by perf report OOM issue and that would
>> clarify on exact required solution. So perf report OOM issue shouldn't
>> be the showstopper for upstream of threaded trace streaming.
> 
> so the short answer is no, right? ;-) 

Answer to what question? Resolve OOM in perf report for data directories?
I don't see a simple solution for that. The next issue after OOM is resolved
is a very long processing of data directories. And again there is no simple
solution for that as well. But it still need progress in order to be resolved
eventually.

> 
> I understand all the excuses, but from my point of view we are
> adding another pain point (and there's already few ;-) ) that
> will make perf (even more) not user friendly

I would not name it a paint point but instead a growth opportunity. 
Now --threads can't be and is not enabled by default. When a user
asks --threads the tool can print warning in advance about lots of
data and possible perf report OOM limitation so confusion and
disappointment for users of perf report can be avoided in advance.

> 
> if we allow really friendly way to create huge data, we should
> do our best to be able to process it as best as we can

It is just little to no more friendly as it is already now.
Everyone can grab patches apply and get threaded streaming.
Inclusion into mainline will standardize solution to build
and evolve upon and this is necessary step towards complete
support of data directories in perf tool suite.

Alexei

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/15] perf data: open data directory in read access mode
  2020-10-27 11:59       ` Jiri Olsa
@ 2020-10-27 14:44         ` Alexey Budankov
  0 siblings, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 14:44 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 14:59, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 08:47:06PM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 06:57:53PM +0300, Alexey Budankov wrote:
>>>>
>>>> Open files located at trace data directory in case read access
>>>> mode is requested. File are opened and its fds assigned to
>>>> perf_data dir files especially for loading data directories
>>>> content in perf report mode.
>>>>
>>>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>>>> ---
>>>>  tools/perf/util/data.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/tools/perf/util/data.c b/tools/perf/util/data.c
>>>> index c47aa34fdc0a..6ad61ac6ba67 100644
>>>> --- a/tools/perf/util/data.c
>>>> +++ b/tools/perf/util/data.c
>>>> @@ -321,6 +321,10 @@ static int open_dir(struct perf_data *data)
>>>>  		return -1;
>>>>  
>>>>  	ret = open_file(data);
>>>> +	if (!ret && perf_data__is_dir(data)) {
>>>> +		if (perf_data__is_read(data))
>>>> +			ret = perf_data__open_dir(data);
>>>> +	}
>>>
>>> perf_data__open_dir is also called from perf_session__new
>>> is it called twice?
>>
>> It is not called twice. It is in different branches.
>> This one is for write and the other one is for read.
> 
> hum, is that right?
> 
> 	# ./perf record --threads 
> 	^C[ perf record: Woken up 15 times to write data ]
> 	[ perf record: Captured and wrote 1.421 MB perf.data (515 samples) ]
> 
> 	# gdb ./perf
> 	GNU gdb (GDB) Fedora 9.1-6.fc32
> 
> 	(gdb) b perf_data__open_dir
> 	Breakpoint 1 at 0x5b4753: file util/data.c, line 72.
> 
> 	(gdb) r --no-pager report --stdio
> 	Starting program: /home/jolsa/kernel/linux-perf/tools/perf/perf --no-pager report --stdio
> 
> 	Breakpoint 1, perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
> 	72      {
> 	(gdb) bt
> 	#0  perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
> 	#1  0x00000000005b538d in open_dir (data=0x7fffffffb0e0) at util/data.c:326
> 	#2  0x00000000005b546d in perf_data__open (data=0x7fffffffb0e0) at util/data.c:351
> 	#3  0x00000000005627e8 in perf_session__new (data=0x7fffffffb0e0, repipe=false, tool=0x7fffffffb220) at util/session.c:210
> 	#4  0x000000000045a572 in cmd_report (argc=0, argv=0x7fffffffd7a8) at builtin-report.c:1372
> 	#5  0x00000000004f49ec in run_builtin (p=0xaadab0 <commands+240>, argc=2, argv=0x7fffffffd7a8) at perf.c:312
> 	#6  0x00000000004f4c59 in handle_internal_command (argc=2, argv=0x7fffffffd7a8) at perf.c:364
> 	#7  0x00000000004f4da0 in run_argv (argcp=0x7fffffffd5ec, argv=0x7fffffffd5e0) at perf.c:408
> 	#8  0x00000000004f516c in main (argc=2, argv=0x7fffffffd7a8) at perf.c:538
> 
> 	(gdb) c
> 	Continuing.
> 
> 	Breakpoint 1, perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
> 	72      {
> 	(gdb) bt
> 	#0  perf_data__open_dir (data=0x7fffffffb0e0) at util/data.c:72
> 	#1  0x0000000000562883 in perf_session__new (data=0x7fffffffb0e0, repipe=false, tool=0x7fffffffb220) at util/session.c:234
> 	#2  0x000000000045a572 in cmd_report (argc=0, argv=0x7fffffffd7a8) at builtin-report.c:1372
> 	#3  0x00000000004f49ec in run_builtin (p=0xaadab0 <commands+240>, argc=2, argv=0x7fffffffd7a8) at perf.c:312
> 	#4  0x00000000004f4c59 in handle_internal_command (argc=2, argv=0x7fffffffd7a8) at perf.c:364
> 	#5  0x00000000004f4da0 in run_argv (argcp=0x7fffffffd5ec, argv=0x7fffffffd5e0) at perf.c:408
> 	#6  0x00000000004f516c in main (argc=2, argv=0x7fffffffd7a8) at perf.c:538
> 
> 
> AFAICS the second (current) call to perf_data__open_dir will
> do the job, because the call you added still does not see
> directory with proper version and will bail out on call to
>   perf_data__is_single_file
> 
> perf_session__open call will read headers and update dir version
> so the current perf_data__open_dir will open the directory

Tested once again. Now it looks like this patch is redundant
since dir is already opened by the existing code.
Thanks for pointing this out!

Alexei

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-27 12:21       ` Jiri Olsa
  2020-10-27 14:43         ` Alexey Budankov
@ 2020-10-27 15:04         ` Alexey Budankov
  1 sibling, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 15:04 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 15:21, Jiri Olsa wrote:
> On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
>>>>
>>>> Read trace files located in data directory into tool process memory.
>>>> Basic analysis support of data directories is provided for report
>>>> mode. Raw dump (-D) and aggregated reports are available for data
>>>> directories, still with no memory consumption optimizations. However
>>>> data directories collected with --compression-level option enabled
>>>> can be analyzed with little less memory because trace files are
>>>> unmaped from tool process memory after loading collected data.
>>>> The implementation is based on the prototype [1], [2].
>>>>
>>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
>>>> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
>>>>
>>>> Suggested-by: Jiri Olsa <jolsa@kernel.org>
>>>
>>> very loosely ;-) so there was a reason for all that reader refactoring,
>>> so we could have __perf_session__process_dir_events function:
>>>
>>>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
>>
>> Nonetheless. All that are necessary parts to make threaded data streaming
>> and analysis eventually merged into the mainline as joint Perf developers
>> community effort.
>>
>>>
>>> when reporting the threaded record data on really big servers,
>>> you will run out of memory, so you need to read and flush all
>>> the files together by smaller pieces
>>
>> Yes, handling all that _big_ data after collection to make it
>> helpful for analysis of performance issues is the other part
>> of this story so that possible OOM should be somehow avoided.
>>
>>>
>>> IMO we need to have this change before we allow threaded record
>>
>> There are use cases of perf tool as a data provider, btw VTune is not
>> the only one of them, and for those use cases threaded trace streaming
>> lets its users get to their data that the users just were loosing before.
>> This is huge difference and whole new level of support for such users.
>> Post-process scripting around perf (e.g. Python based) will benefit
>> from threaded trace streaming. Pipe mode can be extended to stream into
>> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
>> VTune-like tools can get performance data, load it into a (relational)
>> DB files and provide analysis. And all that uses perf tool at its core.
>>
>> I agree perf report OOM issue can exist on really-big servers but data 
>> directories support for report mode for not-so-big servers and desktops
>> is already enabled with this smaller change. Also really-big-servers
>> come with really-big amount of memory and collection could possibly be
>> limited to only interesting phases of execution so the issue could likely
>> be avoided. At the same time threaded trace streaming could clarify on
>> real use cases that are blocked by perf report OOM issue and that would
>> clarify on exact required solution. So perf report OOM issue shouldn't
>> be the showstopper for upstream of threaded trace streaming.
> 
> so the short answer is no, right? ;-) 
> 
> I understand all the excuses, but from my point of view we are
> adding another pain point (and there's already few ;-) ) that

BTW what are those a few pain points that are 'unfriendly' in perf?
Possibly users could be warned about the points in advance to avoid
confusion and disappointment by the fact.

Alexei

> will make perf (even more) not user friendly
> 
> if we allow really friendly way to create huge data, we should
> do our best to be able to process it as best as we can
> 
> jirka
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming
  2020-10-27 12:01           ` Jiri Olsa
  2020-10-27 14:16             ` Alexey Budankov
@ 2020-10-27 15:58             ` Alexey Budankov
  1 sibling, 0 replies; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 15:58 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 15:01, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 05:11:30PM +0300, Alexei Budankov wrote:
>>
>> On 26.10.2020 13:34, Jiri Olsa wrote:
>>> On Mon, Oct 26, 2020 at 11:21:28AM +0300, Alexei Budankov wrote:
>>>>
>>>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>>>> On Wed, Oct 21, 2020 at 07:07:00PM +0300, Alexey Budankov wrote:
>>>>>>
>>>>>> Introduce thread local variable and use it for threaded trace streaming.
>>>>>>
>>>>>> Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
>>>>>> ---
>>>>>>  tools/perf/builtin-record.c | 71 ++++++++++++++++++++++++++++++++-----
>>>>>>  1 file changed, 62 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
>>>>>> index 89cb8e913fb3..3b7e9026f25b 100644
>>>>>> --- a/tools/perf/builtin-record.c
>>>>>> +++ b/tools/perf/builtin-record.c
>>>>>> @@ -101,6 +101,8 @@ struct thread_data {
>>>>>>  	u64		   bytes_written;
>>>>>>  };
>>>>>>  
>>>>>> +static __thread struct thread_data *thread;
>>>>>> +
>>>>>>  struct record {
>>>>>>  	struct perf_tool	tool;
>>>>>>  	struct record_opts	opts;
>>>>>> @@ -587,7 +589,11 @@ static int record__pushfn(struct mmap *map, void *to, void *bf, size_t size)
>>>>>>  		}
>>>>>>  	}
>>>>>>  
>>>>>> -	rec->samples++;
>>>>>> +	if (thread)
>>>>>> +		thread->samples++;
>>>>>> +	else
>>>>>> +		rec->samples++;
>>>>>
>>>>> this is really wrong, let's keep just single samples counter
>>>>> ditto for all the other places in this patch
>>>>
>>>> This does look like data parallelism [1] which is very true for
>>>> threaded trace streaming so your prototype design looks optimal.
>>>>
>>>> For this specific place incrementing global counter in memory is
>>>> less performant and faces scalability limitations as a number of
>>>> cores grow.
>>>>
>>>> Not sure why you have changed your mind.
>>>
>>> I'm not sure I follow.. what I'm complaining about is to have
>>> 'samples' stat variable in separate locations for --threads
>>> and --no-threads mode
>>
>> It is optimal to have samples variable as per thread one
>> and then sum up the total in the end of data collection.
>>
>> Single global variable design has scalability and performance
>> drawbacks.
>>
>> Why do you complain about per thread variable in this case?
>> It looks like ideally fits these specific needs.
> 
> I think there's misunderstanding.. I think we should move
> samples to per thread 'thread' object and have just one
> copy of that.. and do not increase separate variables for
> thread and non-thread cases

Aw, I see. Using the same __thread object by main thread in
serial and threaded modes. That makes sense.
I will try in v3.

Alexei

> 
> jirka
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
  2020-10-27 12:10     ` Jiri Olsa
  2020-10-27 14:26       ` Alexey Budankov
@ 2020-10-27 16:01       ` Alexey Budankov
  2020-10-28  7:08         ` Namhyung Kim
  1 sibling, 1 reply; 64+ messages in thread
From: Alexey Budankov @ 2020-10-27 16:01 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel


On 27.10.2020 15:10, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
>>
>> On 24.10.2020 18:43, Jiri Olsa wrote:
>>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
>>>>
>>>> Changes in v2:
>>>> - explicitly added credit tags to patches 6/15 and 15/15,
>>>>   additionally to cites [1], [2]
>>>> - updated description of 3/15 to explicitly mention the reason
>>>>   to open data directories in read access mode (e.g. for perf report)
>>>> - implemented fix for compilation error of 2/15
>>>> - explicitly elaborated on found issues to be resolved for
>>>>   threaded AUX trace capture
>>>>
>>>> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
>>>>
>>>> Patch set provides threaded trace streaming for base perf record
>>>> operation. Provided streaming mode (--threads) mitigates profiling
>>>> data losses and resolves scalability issues of serial and asynchronous
>>>> (--aio) trace streaming modes on multicore server systems. The patch
>>>> set is based on the prototype [1], [2] and the most closely relates
>>>> to mode 3) "mode that creates thread for every monitored memory map".
>>>
>>> so what I liked about the previous code was that you could
>>> configure how the threads would be created
>>>
>>> default --threads options created thread for each cpu like
>>> in your change:
>>>
>>>   $ perf record -v --threads ...
>>>   ...
>>>   thread 0 monitor: 0 allowed: 0
>>>   thread 1 monitor: 1 allowed: 1
>>>   thread 2 monitor: 2 allowed: 2
>>>   thread 3 monitor: 3 allowed: 3
>>>   thread 4 monitor: 4 allowed: 4
>>>   thread 5 monitor: 5 allowed: 5
>>>   thread 6 monitor: 6 allowed: 6
>>>   thread 7 monitor: 7 allowed: 7
>>
>> Yes, it is configurable in the prototype. Even though this patch set
>> doesn't implement that parameters for --thread option, just because
>> VTune doesn't have use cases for that yet, it has still been designed
>> and implemented with that possible extension in mind so it could then
>> be easily added on top of it.
> 
> I'm not sure about vtune extensions, but if we are going to
> have --threads option I believe we should make it configurable
> at least to the extend descibed below

It employs --threads mode only and there are no use cases
observed so far beyond this mode. Do you have or see such
use cases?

Alexei

> 
> jirka
> 
>>
>>>
>>>
>>> then numa based:
>>>
>>>   $ perf record -v --threads=numa ...
>>>   ...
>>>   thread 0 monitor: 0-5,12-17 allowed: 0-5,12-17
>>>   thread 1 monitor: 6-11,18-23 allowed: 6-11,18-23
>>>
>>>
>>> socket based:
>>>
>>>   $ perf record -v --threads=socket ...
>>>   ...
>>>   thread 0 monitor: 0-7 allowed: 0-7
>>>
>>>
>>> core based:
>>>
>>>   $ perf record -v --threads=core ...
>>>   ...
>>>   thread 0 monitor: 0,4 allowed: 0,4
>>>   thread 1 monitor: 1,5 allowed: 1,5
>>>   thread 2 monitor: 2,6 allowed: 2,6
>>>   thread 3 monitor: 3,7 allowed: 3,7
>>>
>>>
>>> and user configurable:
>>>
>>>   $ perf record -v  --threads=0-3/0:4-7/4 ...
>>>   ...
>>>   threads: 0. monitor 0-3, allowed 0
>>>   threads: 1. monitor 4-7, allowed 4
>>>
>>>
>>> so this way you could easily pin threads to cpu/core/socket/numa,
>>> or to some other cpu of your choice, because this will be always
>>> game of try and check where I'm not getting LOST events and not
>>> creating 1000 threads
>>>
>>>  perf record: Add support for threads numa option value
>>>  perf record: Add support for threads socket option value
>>>  perf record: Add support for threads core option value
>>>  perf record: Add support for threads user option value
>>
>> Makes sense.
>>
>> Alexei
>>
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
  2020-10-27 16:01       ` Alexey Budankov
@ 2020-10-28  7:08         ` Namhyung Kim
       [not found]           ` <b6150d2f-04a6-9204-59ac-c31c8697c630@linux.intel.com>
  0 siblings, 1 reply; 64+ messages in thread
From: Namhyung Kim @ 2020-10-28  7:08 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

Hi,

On Wed, Oct 28, 2020 at 1:02 AM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> On 27.10.2020 15:10, Jiri Olsa wrote:
> > On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
> >>
> >> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> Changes in v2:
> >>>> - explicitly added credit tags to patches 6/15 and 15/15,
> >>>>   additionally to cites [1], [2]
> >>>> - updated description of 3/15 to explicitly mention the reason
> >>>>   to open data directories in read access mode (e.g. for perf report)
> >>>> - implemented fix for compilation error of 2/15
> >>>> - explicitly elaborated on found issues to be resolved for
> >>>>   threaded AUX trace capture
> >>>>
> >>>> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
> >>>>
> >>>> Patch set provides threaded trace streaming for base perf record
> >>>> operation. Provided streaming mode (--threads) mitigates profiling
> >>>> data losses and resolves scalability issues of serial and asynchronous
> >>>> (--aio) trace streaming modes on multicore server systems. The patch
> >>>> set is based on the prototype [1], [2] and the most closely relates
> >>>> to mode 3) "mode that creates thread for every monitored memory map".
> >>>
> >>> so what I liked about the previous code was that you could
> >>> configure how the threads would be created
> >>>
> >>> default --threads options created thread for each cpu like
> >>> in your change:
> >>>
> >>>   $ perf record -v --threads ...
> >>>   ...
> >>>   thread 0 monitor: 0 allowed: 0
> >>>   thread 1 monitor: 1 allowed: 1
> >>>   thread 2 monitor: 2 allowed: 2
> >>>   thread 3 monitor: 3 allowed: 3
> >>>   thread 4 monitor: 4 allowed: 4
> >>>   thread 5 monitor: 5 allowed: 5
> >>>   thread 6 monitor: 6 allowed: 6
> >>>   thread 7 monitor: 7 allowed: 7
> >>
> >> Yes, it is configurable in the prototype. Even though this patch set
> >> doesn't implement that parameters for --thread option, just because
> >> VTune doesn't have use cases for that yet, it has still been designed
> >> and implemented with that possible extension in mind so it could then
> >> be easily added on top of it.
> >
> > I'm not sure about vtune extensions, but if we are going to
> > have --threads option I believe we should make it configurable
> > at least to the extend descibed below
>
> It employs --threads mode only and there are no use cases
> observed so far beyond this mode. Do you have or see such
> use cases?

I don't know about vtune and other users, but it's an important
feature for better performance so I agree with Jiri's opinion to
make it flexible for the system requirement.

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-27 14:43         ` Alexey Budankov
@ 2020-10-28  7:22           ` Namhyung Kim
  2020-10-28 15:39             ` Jiri Olsa
  2020-10-28 15:36           ` Jiri Olsa
  1 sibling, 1 reply; 64+ messages in thread
From: Namhyung Kim @ 2020-10-28  7:22 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Tue, Oct 27, 2020 at 11:43 PM Alexey Budankov
<alexey.budankov@linux.intel.com> wrote:
>
>
> On 27.10.2020 15:21, Jiri Olsa wrote:
> > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> >> I agree perf report OOM issue can exist on really-big servers but data
> >> directories support for report mode for not-so-big servers and desktops
> >> is already enabled with this smaller change. Also really-big-servers
> >> come with really-big amount of memory and collection could possibly be
> >> limited to only interesting phases of execution so the issue could likely
> >> be avoided. At the same time threaded trace streaming could clarify on
> >> real use cases that are blocked by perf report OOM issue and that would
> >> clarify on exact required solution. So perf report OOM issue shouldn't
> >> be the showstopper for upstream of threaded trace streaming.
> >
> > so the short answer is no, right? ;-)
>
> Answer to what question? Resolve OOM in perf report for data directories?
> I don't see a simple solution for that. The next issue after OOM is resolved
> is a very long processing of data directories. And again there is no simple
> solution for that as well. But it still need progress in order to be resolved
> eventually.

I think we should find a better way than just adding all events to the
ordered events queue in memory then processing them one by one.

Separating tracking events (FORK/MMAP/...) might be the first step.

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation
       [not found]           ` <b6150d2f-04a6-9204-59ac-c31c8697c630@linux.intel.com>
@ 2020-10-28 15:32             ` Jiri Olsa
  0 siblings, 0 replies; 64+ messages in thread
From: Jiri Olsa @ 2020-10-28 15:32 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Namhyung Kim, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 28, 2020 at 10:35:08AM +0300, Alexey Budankov wrote:
> 
> On 28.10.2020 10:08, Namhyung Kim wrote:
> > Hi,
> > 
> > On Wed, Oct 28, 2020 at 1:02 AM Alexey Budankov
> > <alexey.budankov@linux.intel.com> wrote:
> >>
> >>
> >> On 27.10.2020 15:10, Jiri Olsa wrote:
> >>> On Mon, Oct 26, 2020 at 08:59:01PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>>>> On Wed, Oct 21, 2020 at 06:52:43PM +0300, Alexey Budankov wrote:
> >>>>>>
> >>>>>> Changes in v2:
> >>>>>> - explicitly added credit tags to patches 6/15 and 15/15,
> >>>>>>   additionally to cites [1], [2]
> >>>>>> - updated description of 3/15 to explicitly mention the reason
> >>>>>>   to open data directories in read access mode (e.g. for perf report)
> >>>>>> - implemented fix for compilation error of 2/15
> >>>>>> - explicitly elaborated on found issues to be resolved for
> >>>>>>   threaded AUX trace capture
> >>>>>>
> >>>>>> v1: https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com/
> >>>>>>
> >>>>>> Patch set provides threaded trace streaming for base perf record
> >>>>>> operation. Provided streaming mode (--threads) mitigates profiling
> >>>>>> data losses and resolves scalability issues of serial and asynchronous
> >>>>>> (--aio) trace streaming modes on multicore server systems. The patch
> >>>>>> set is based on the prototype [1], [2] and the most closely relates
> >>>>>> to mode 3) "mode that creates thread for every monitored memory map".
> >>>>>
> >>>>> so what I liked about the previous code was that you could
> >>>>> configure how the threads would be created
> >>>>>
> >>>>> default --threads options created thread for each cpu like
> >>>>> in your change:
> >>>>>
> >>>>>   $ perf record -v --threads ...
> >>>>>   ...
> >>>>>   thread 0 monitor: 0 allowed: 0
> >>>>>   thread 1 monitor: 1 allowed: 1
> >>>>>   thread 2 monitor: 2 allowed: 2
> >>>>>   thread 3 monitor: 3 allowed: 3
> >>>>>   thread 4 monitor: 4 allowed: 4
> >>>>>   thread 5 monitor: 5 allowed: 5
> >>>>>   thread 6 monitor: 6 allowed: 6
> >>>>>   thread 7 monitor: 7 allowed: 7
> >>>>
> >>>> Yes, it is configurable in the prototype. Even though this patch set
> >>>> doesn't implement that parameters for --thread option, just because
> >>>> VTune doesn't have use cases for that yet, it has still been designed
> >>>> and implemented with that possible extension in mind so it could then
> >>>> be easily added on top of it.
> >>>
> >>> I'm not sure about vtune extensions, but if we are going to
> >>> have --threads option I believe we should make it configurable
> >>> at least to the extend descibed below
> >>
> >> It employs --threads mode only and there are no use cases
> >> observed so far beyond this mode. Do you have or see such
> >> use cases?
> > 
> > I don't know about vtune and other users, but it's an important
> > feature for better performance so I agree with Jiri's opinion to
> > make it flexible for the system requirement.
> 
> For sure, vtune is not the only one for this threaded streaming
> and it should be well suited for perf tool use cases equally.
> And for perf it would be beneficial to document some examples in
> perf-record.txt as a part of this configuration implementation.
> I am not just aware of such examples and that is why I am asking
> you guys.

I saw no LOST events on big servers for some tests with --threads=numa
option, so there was no reason to spawn 200+ threads

jirka


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-27 14:43         ` Alexey Budankov
  2020-10-28  7:22           ` Namhyung Kim
@ 2020-10-28 15:36           ` Jiri Olsa
  1 sibling, 0 replies; 64+ messages in thread
From: Jiri Olsa @ 2020-10-28 15:36 UTC (permalink / raw)
  To: Alexey Budankov
  Cc: Arnaldo Carvalho de Melo, Namhyung Kim, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Tue, Oct 27, 2020 at 05:43:20PM +0300, Alexey Budankov wrote:
> 
> On 27.10.2020 15:21, Jiri Olsa wrote:
> > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> >>
> >> On 24.10.2020 18:43, Jiri Olsa wrote:
> >>> On Wed, Oct 21, 2020 at 07:01:19PM +0300, Alexey Budankov wrote:
> >>>>
> >>>> Read trace files located in data directory into tool process memory.
> >>>> Basic analysis support of data directories is provided for report
> >>>> mode. Raw dump (-D) and aggregated reports are available for data
> >>>> directories, still with no memory consumption optimizations. However
> >>>> data directories collected with --compression-level option enabled
> >>>> can be analyzed with little less memory because trace files are
> >>>> unmaped from tool process memory after loading collected data.
> >>>> The implementation is based on the prototype [1], [2].
> >>>>
> >>>> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> >>>> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/
> >>>>
> >>>> Suggested-by: Jiri Olsa <jolsa@kernel.org>
> >>>
> >>> very loosely ;-) so there was a reason for all that reader refactoring,
> >>> so we could have __perf_session__process_dir_events function:
> >>>
> >>>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e
> >>
> >> Nonetheless. All that are necessary parts to make threaded data streaming
> >> and analysis eventually merged into the mainline as joint Perf developers
> >> community effort.
> >>
> >>>
> >>> when reporting the threaded record data on really big servers,
> >>> you will run out of memory, so you need to read and flush all
> >>> the files together by smaller pieces
> >>
> >> Yes, handling all that _big_ data after collection to make it
> >> helpful for analysis of performance issues is the other part
> >> of this story so that possible OOM should be somehow avoided.
> >>
> >>>
> >>> IMO we need to have this change before we allow threaded record
> >>
> >> There are use cases of perf tool as a data provider, btw VTune is not
> >> the only one of them, and for those use cases threaded trace streaming
> >> lets its users get to their data that the users just were loosing before.
> >> This is huge difference and whole new level of support for such users.
> >> Post-process scripting around perf (e.g. Python based) will benefit
> >> from threaded trace streaming. Pipe mode can be extended to stream into
> >> open and passed fds using threads (e.g. perf record -o -fd:13,14,15,16).
> >> VTune-like tools can get performance data, load it into a (relational)
> >> DB files and provide analysis. And all that uses perf tool at its core.
> >>
> >> I agree perf report OOM issue can exist on really-big servers but data 
> >> directories support for report mode for not-so-big servers and desktops
> >> is already enabled with this smaller change. Also really-big-servers
> >> come with really-big amount of memory and collection could possibly be
> >> limited to only interesting phases of execution so the issue could likely
> >> be avoided. At the same time threaded trace streaming could clarify on
> >> real use cases that are blocked by perf report OOM issue and that would
> >> clarify on exact required solution. So perf report OOM issue shouldn't
> >> be the showstopper for upstream of threaded trace streaming.
> > 
> > so the short answer is no, right? ;-) 
> 
> Answer to what question? Resolve OOM in perf report for data directories?
> I don't see a simple solution for that. The next issue after OOM is resolved
> is a very long processing of data directories. And again there is no simple
> solution for that as well. But it still need progress in order to be resolved
> eventually.

it's right here:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/commit/?h=perf/record_threads&id=308aa7cff1fed335401cfc02c7bac1a4644af68e

jirka

> 
> > 
> > I understand all the excuses, but from my point of view we are
> > adding another pain point (and there's already few ;-) ) that
> > will make perf (even more) not user friendly
> 
> I would not name it a paint point but instead a growth opportunity. 
> Now --threads can't be and is not enabled by default. When a user
> asks --threads the tool can print warning in advance about lots of
> data and possible perf report OOM limitation so confusion and
> disappointment for users of perf report can be avoided in advance.
> 
> > 
> > if we allow really friendly way to create huge data, we should
> > do our best to be able to process it as best as we can
> 
> It is just little to no more friendly as it is already now.
> Everyone can grab patches apply and get threaded streaming.
> Inclusion into mainline will standardize solution to build
> and evolve upon and this is necessary step towards complete
> support of data directories in perf tool suite.
> 
> Alexei
> 
> > 
> > jirka
> > 
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-28  7:22           ` Namhyung Kim
@ 2020-10-28 15:39             ` Jiri Olsa
  2020-10-29 11:00               ` Namhyung Kim
  0 siblings, 1 reply; 64+ messages in thread
From: Jiri Olsa @ 2020-10-28 15:39 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

On Wed, Oct 28, 2020 at 04:22:49PM +0900, Namhyung Kim wrote:
> On Tue, Oct 27, 2020 at 11:43 PM Alexey Budankov
> <alexey.budankov@linux.intel.com> wrote:
> >
> >
> > On 27.10.2020 15:21, Jiri Olsa wrote:
> > > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> > >> I agree perf report OOM issue can exist on really-big servers but data
> > >> directories support for report mode for not-so-big servers and desktops
> > >> is already enabled with this smaller change. Also really-big-servers
> > >> come with really-big amount of memory and collection could possibly be
> > >> limited to only interesting phases of execution so the issue could likely
> > >> be avoided. At the same time threaded trace streaming could clarify on
> > >> real use cases that are blocked by perf report OOM issue and that would
> > >> clarify on exact required solution. So perf report OOM issue shouldn't
> > >> be the showstopper for upstream of threaded trace streaming.
> > >
> > > so the short answer is no, right? ;-)
> >
> > Answer to what question? Resolve OOM in perf report for data directories?
> > I don't see a simple solution for that. The next issue after OOM is resolved
> > is a very long processing of data directories. And again there is no simple
> > solution for that as well. But it still need progress in order to be resolved
> > eventually.
> 
> I think we should find a better way than just adding all events to the
> ordered events queue in memory then processing them one by one.
> 
> Separating tracking events (FORK/MMAP/...) might be the first step.

I recall seeing this change before for threaded perf report,
maybe even from you, right? ;-)

jirka

> 
> Thanks
> Namhyung
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/15] perf session: load data directory into tool process memory
  2020-10-28 15:39             ` Jiri Olsa
@ 2020-10-29 11:00               ` Namhyung Kim
  0 siblings, 0 replies; 64+ messages in thread
From: Namhyung Kim @ 2020-10-29 11:00 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Alexey Budankov, Arnaldo Carvalho de Melo, Alexander Shishkin,
	Adrian Hunter, Andi Kleen, Peter Zijlstra, Ingo Molnar,
	linux-kernel

Hi Jiri,

On Thu, Oct 29, 2020 at 12:40 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Wed, Oct 28, 2020 at 04:22:49PM +0900, Namhyung Kim wrote:
> > On Tue, Oct 27, 2020 at 11:43 PM Alexey Budankov
> > <alexey.budankov@linux.intel.com> wrote:
> > >
> > >
> > > On 27.10.2020 15:21, Jiri Olsa wrote:
> > > > On Tue, Oct 27, 2020 at 10:37:58AM +0300, Alexey Budankov wrote:
> > > >> I agree perf report OOM issue can exist on really-big servers but data
> > > >> directories support for report mode for not-so-big servers and desktops
> > > >> is already enabled with this smaller change. Also really-big-servers
> > > >> come with really-big amount of memory and collection could possibly be
> > > >> limited to only interesting phases of execution so the issue could likely
> > > >> be avoided. At the same time threaded trace streaming could clarify on
> > > >> real use cases that are blocked by perf report OOM issue and that would
> > > >> clarify on exact required solution. So perf report OOM issue shouldn't
> > > >> be the showstopper for upstream of threaded trace streaming.
> > > >
> > > > so the short answer is no, right? ;-)
> > >
> > > Answer to what question? Resolve OOM in perf report for data directories?
> > > I don't see a simple solution for that. The next issue after OOM is resolved
> > > is a very long processing of data directories. And again there is no simple
> > > solution for that as well. But it still need progress in order to be resolved
> > > eventually.
> >
> > I think we should find a better way than just adding all events to the
> > ordered events queue in memory then processing them one by one.
> >
> > Separating tracking events (FORK/MMAP/...) might be the first step.
>
> I recall seeing this change before for threaded perf report,
> maybe even from you, right? ;-)

Yes, I did it a couple of years ago.  The last version is here:

https://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git/log/?h=perf/threaded-v5

Thanks
Namhyung

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2020-10-29 11:10 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-21 15:52 [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Alexey Budankov
2020-10-21 15:56 ` [PATCH v2 01/15] perf session: introduce trace file path to be shown in raw trace dump Alexey Budankov
2020-10-22  4:28   ` Namhyung Kim
2020-10-21 15:57 ` [PATCH v2 02/15] perf report: output trace file name " Alexey Budankov
2020-10-22  4:29   ` Namhyung Kim
2020-10-21 15:57 ` [PATCH v2 03/15] perf data: open data directory in read access mode Alexey Budankov
2020-10-22  4:31   ` Namhyung Kim
2020-10-22  7:50     ` Alexey Budankov
2020-10-24 15:43   ` Jiri Olsa
2020-10-26 17:47     ` Alexey Budankov
2020-10-27 11:59       ` Jiri Olsa
2020-10-27 14:44         ` Alexey Budankov
2020-10-21 15:59 ` [PATCH v2 04/15] perf session: move reader object definition to header file Alexey Budankov
2020-10-22  4:31   ` Namhyung Kim
2020-10-24 15:44   ` Jiri Olsa
2020-10-26 17:50     ` Alexey Budankov
2020-10-21 16:00 ` [PATCH v2 05/15] perf session: introduce decompressor into trace reader object Alexey Budankov
2020-10-22  4:36   ` Namhyung Kim
2020-10-22  7:20     ` Alexey Budankov
2020-10-24 15:44   ` Jiri Olsa
2020-10-26  8:54     ` Alexei Budankov
2020-10-21 16:01 ` [PATCH v2 06/15] perf session: load data directory into tool process memory Alexey Budankov
2020-10-24 15:43   ` Jiri Olsa
2020-10-27  7:37     ` Alexey Budankov
2020-10-27 12:21       ` Jiri Olsa
2020-10-27 14:43         ` Alexey Budankov
2020-10-28  7:22           ` Namhyung Kim
2020-10-28 15:39             ` Jiri Olsa
2020-10-29 11:00               ` Namhyung Kim
2020-10-28 15:36           ` Jiri Olsa
2020-10-27 15:04         ` Alexey Budankov
2020-10-21 16:01 ` [PATCH v2 07/15] perf record: introduce trace file, compressor and stats in mmap object Alexey Budankov
2020-10-21 16:02 ` [PATCH v2 08/15] perf record: write trace data into mmap trace files Alexey Budankov
2020-10-24 15:44   ` Jiri Olsa
2020-10-26  8:52     ` Alexei Budankov
2020-10-26 10:32       ` Jiri Olsa
2020-10-26 14:04         ` Alexei Budankov
2020-10-21 16:03 ` [PATCH v2 09/15] perf record: introduce thread specific objects for trace streaming Alexey Budankov
2020-10-24 15:44   ` Jiri Olsa
2020-10-26  8:53     ` Alexei Budankov
2020-10-21 16:04 ` [PATCH v2 10/15] perf record: manage thread specific data array Alexey Budankov
2020-10-24 15:44   ` Jiri Olsa
2020-10-26  8:39     ` Alexei Budankov
2020-10-21 16:06 ` [PATCH v2 11/15] perf evlist: introduce evlist__ctlfd_update() to update ctl fd status Alexey Budankov
2020-10-21 16:07 ` [PATCH v2 12/15] perf record: introduce thread local variable for trace streaming Alexey Budankov
2020-10-24 15:43   ` Jiri Olsa
2020-10-26  8:21     ` Alexei Budankov
2020-10-26 10:34       ` Jiri Olsa
2020-10-26 14:11         ` Alexei Budankov
2020-10-27 12:01           ` Jiri Olsa
2020-10-27 14:16             ` Alexey Budankov
2020-10-27 15:58             ` Alexey Budankov
2020-10-21 16:08 ` [PATCH v2 13/15] perf record: stop threads in the end of " Alexey Budankov
2020-10-21 16:10 ` [PATCH v2 14/15] perf record: start threads in the beginning " Alexey Budankov
2020-10-24 15:44   ` Jiri Olsa
2020-10-26  8:39     ` Alexei Budankov
2020-10-21 16:10 ` [PATCH v2 15/15] perf record: introduce --threads command line option Alexey Budankov
2020-10-24 15:43 ` [PATCH v2 00/15] Introduce threaded trace streaming for basic perf record operation Jiri Olsa
2020-10-26 17:59   ` Alexey Budankov
2020-10-27 12:10     ` Jiri Olsa
2020-10-27 14:26       ` Alexey Budankov
2020-10-27 16:01       ` Alexey Budankov
2020-10-28  7:08         ` Namhyung Kim
     [not found]           ` <b6150d2f-04a6-9204-59ac-c31c8697c630@linux.intel.com>
2020-10-28 15:32             ` Jiri Olsa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).