From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A22EC433DF for ; Tue, 13 Oct 2020 16:20:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CB26E2528C for ; Tue, 13 Oct 2020 16:20:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1602606012; bh=P8Upht7muhFDTiJCzsw5OF+xsS2QQQtIsPLSJdp+aos=; h=Date:From:To:Cc:Subject:References:In-Reply-To:List-ID:From; b=0pdpHtyCQWJmDjzo/c6xuKacBwELbp3m2Mb8xkbUg34FVWSRLpIr0jhBY9wl06YdR 45H3JnZSH7zzI7jaKxad9MbeQRSfaqAKo1L9B3oTnvE+DII8J0ga+CLKPzL+mK+Iij zjLKGzczVcLn4m/a0AmNc+hsPHJhL6TFopeUcf0w= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731703AbgJMQUK (ORCPT ); Tue, 13 Oct 2020 12:20:10 -0400 Received: from mail.kernel.org ([198.145.29.99]:34100 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726482AbgJMQUK (ORCPT ); Tue, 13 Oct 2020 12:20:10 -0400 Received: from quaco.ghostprotocols.net (unknown [179.97.37.151]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 70F612528B; Tue, 13 Oct 2020 16:20:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1602606008; bh=P8Upht7muhFDTiJCzsw5OF+xsS2QQQtIsPLSJdp+aos=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=Qv76B5tKxi7/ZBM8fOdn/OJjLvjPzoOZw491wK5AAdaLEvvQD5aLsiMXW2JeSUPaI c/xNCJSsBxBT7bqKFYMJHmyQpVs1orAJXvXaks2wsChU1R+uIPaL007K1TcKSCUQLh qz5+dHUMccLTCG6eP4440tfQXfod08SJDAQEgy8E= Received: by quaco.ghostprotocols.net (Postfix, from userid 1000) id 30C7A4047F; Tue, 13 Oct 2020 13:20:06 -0300 (-03) Date: Tue, 13 Oct 2020 13:20:06 -0300 From: Arnaldo Carvalho de Melo To: Alexey Budankov Cc: Jiri Olsa , Namhyung Kim , Alexander Shishkin , Andi Kleen , Adrian Hunter , Peter Zijlstra , Ingo Molnar , linux-kernel Subject: Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation Message-ID: <20201013162006.GJ1063281@kernel.org> References: <810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <810f3a69-0004-9dff-a911-b7ff97220ae0@linux.intel.com> X-Url: http://acmel.wordpress.com Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Em Mon, Oct 12, 2020 at 11:50:29AM +0300, Alexey Budankov escreveu: > > Patch set provides threaded trace streaming for base perf record > operation. Provided streaming mode (--threads) mitigates profiling > data losses and resolves scalability issues of serial and asynchronous > (--aio) trace streaming modes on multicore server systems. The patch > set is based on the prototype [1], [2] and the most closely relates > to mode 3) "mode that creates thread for every monitored memory map". > > The threaded mode executes one-to-one mapping of trace streaming threads > to mapped data buffers and streaming into per-CPU trace files located > at data directory. The data buffers and threads are affined to NUMA > nodes and monitored CPUs according to system topology. --cpu option > can be used to specify exact CPUs to be monitored. > > Basic analysis of data directories is provided for perf report mode. > Raw dump (-D) and aggregated reports are available for data directories, > still with no memory consumption optimizations. However data directories > collected with --compression-level option enabled can be analyzed with > little less memory because trace files are unmaped from tool process > memory after loading collected data. > > Provided streaming mode is available with Zstd compression/decompression > (--compression-level) and handling of external commands (--control). > AUX area tracing, related and derived modes like --snapshot or > --aux-sample are not enabled. --switch-output, --switch-output-event, > --switch-max-files and --timestamp-filename options are not enabled. Would be interesting to spell out what are the difficulties to have those options working with this threaded mode, as I expect that once this is all reviewed and tested we should switch to it by default, no? - Arnaldo > Threaded trace streaming is not enabled for pipe mode. Asynchronous > (--aio) trace streaming and affinity (--affinity) modes are mutually > exclusive to threaded streaming mode. > > See testing results and validation examples below. > > [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads > [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@kernel.org/ > > --- > Alexey Budankov (15): > perf session: introduce trace file path to be shown in raw trace dump > perf report: output trace file name in raw trace dump > perf data: open data directory in read access mode > perf session: move reader object definition to header file > perf session: introduce decompressor into trace reader object > perf session: load data directory into tool process memory > perf record: introduce trace file, compressor and stats in mmap object > perf record: write trace data into mmap trace files > perf record: introduce thread specific objects for trace streaming > perf record: manage thread specific data array > perf evlist: introduce evlist__ctlfd_update() to update ctl fd status > perf record: introduce thread local variable for trace streaming > perf record: stop threads in the end of trace streaming > perf record: start threads in the beginning of trace streaming > perf record: introduce --threads command line option > > tools/perf/Documentation/perf-record.txt | 7 + > tools/perf/builtin-record.c | 514 +++++++++++++++++++++-- > tools/perf/util/data.c | 4 + > tools/perf/util/evlist.c | 16 + > tools/perf/util/evlist.h | 1 + > tools/perf/util/mmap.c | 6 + > tools/perf/util/mmap.h | 5 + > tools/perf/util/ordered-events.h | 1 + > tools/perf/util/record.h | 1 + > tools/perf/util/session.c | 150 ++++--- > tools/perf/util/session.h | 28 ++ > tools/perf/util/tool.h | 3 +- > 12 files changed, 635 insertions(+), 101 deletions(-) > > --- > Testing results: > > $ perf test > 1: vmlinux symtab matches kallsyms : Skip > 2: Detect openat syscall event : Ok > 3: Detect openat syscall event on all cpus : Ok > 4: Read samples using the mmap interface : Ok > 5: Test data source output : Ok > 6: Parse event definition strings : Ok > 7: Simple expression parser : Ok > 8: PERF_RECORD_* events & perf_sample fields : Ok > 9: Parse perf pmu format : Ok > 10: PMU events : > 10.1: PMU event table sanity : Ok > 10.2: PMU event map aliases : Ok > 10.3: Parsing of PMU event table metrics : Skip (some metrics failed) > 10.4: Parsing of PMU event table metrics with fake PMUs : Ok > 11: DSO data read : Ok > 12: DSO data cache : Ok > 13: DSO data reopen : FAILED! > 14: Roundtrip evsel->name : Ok > 15: Parse sched tracepoints fields : Ok > 16: syscalls:sys_enter_openat event fields : Ok > 17: Setup struct perf_event_attr : Ok > 18: Match and link multiple hists : Ok > 19: 'import perf' in python : FAILED! > 20: Breakpoint overflow signal handler : Ok > 21: Breakpoint overflow sampling : Ok > 22: Breakpoint accounting : Ok > 23: Watchpoint : > 23.1: Read Only Watchpoint : Skip > 23.2: Write Only Watchpoint : Ok > 23.3: Read / Write Watchpoint : Ok > 23.4: Modify Watchpoint : Ok > 24: Number of exit events of a simple workload : Ok > 25: Software clock events period values : Ok > 26: Object code reading : Ok > 27: Sample parsing : Ok > 28: Use a dummy software event to keep tracking : Ok > 29: Parse with no sample_id_all bit set : Ok > 30: Filter hist entries : Ok > 31: Lookup mmap thread : Ok > 32: Share thread maps : Ok > 33: Sort output of hist entries : Ok > 34: Cumulate child hist entries : Ok > 35: Track with sched_switch : Ok > 36: Filter fds with revents mask in a fdarray : Ok > 37: Add fd to a fdarray, making it autogrow : Ok > 38: kmod_path__parse : Ok > 39: Thread map : Ok > 40: LLVM search and compile : > 40.1: Basic BPF llvm compile : Skip > 40.2: kbuild searching : Skip > 40.3: Compile source for BPF prologue generation : Skip > 40.4: Compile source for BPF relocation : Skip > 41: Session topology : Ok > 42: BPF filter : > 42.1: Basic BPF filtering : Skip > 42.2: BPF pinning : Skip > 42.3: BPF prologue generation : Skip > 42.4: BPF relocation checker : Skip > 43: Synthesize thread map : Ok > 44: Remove thread map : Ok > 45: Synthesize cpu map : Ok > 46: Synthesize stat config : Ok > 47: Synthesize stat : Ok > 48: Synthesize stat round : Ok > 49: Synthesize attr update : Ok > 50: Event times : Ok > 51: Read backward ring buffer : Ok > 52: Print cpu map : Ok > 53: Merge cpu map : Ok > 54: Probe SDT events : Ok > 55: is_printable_array : Ok > 56: Print bitmap : Ok > 57: perf hooks : Ok > 58: builtin clang support : Skip (not compiled in) > 59: unit_number__scnprintf : Ok > 60: mem2node : Ok > 61: time utils : Ok > 62: Test jit_write_elf : Ok > 63: Test libpfm4 support : Skip (not compiled in) > 64: Test api io : Ok > 65: maps__merge_in : Ok > 66: Demangle Java : Ok > 67: Parse and process metrics : Ok > 68: PE file support : Ok > 69: Event expansion for cgroups : Ok > 70: x86 rdpmc : Ok > 71: Convert perf time to TSC : Ok > 72: DWARF unwind : Ok > 73: x86 instruction decoder - new instructions : Ok > 74: Intel PT packet decoder : Ok > 75: x86 bp modify : Ok > 76: Check open filename arg using perf trace + vfs_getname : FAILED! > 77: Add vfs_getname probe to get syscall args filenames : FAILED! > 78: probe libc's inet_pton & backtrace it with ping : Ok > 79: Use vfs_getname probe to get syscall args filenames : FAILED! > 80: Zstd perf.data compression/decompression : Ok > 81: Check Arm CoreSight trace data recording and synthesized samples: Skip > > Validation: > > $ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3 > Addr of buf1 = 0x7f8af24ae010 > Offs of buf1 = 0x7f8af24ae180 > Addr of buf2 = 0x7f8af04ad010 > Offs of buf2 = 0x7f8af04ad1c0 > Addr of buf3 = 0x7f8aee4ac010 > Offs of buf3 = 0x7f8aee4ac100 > Addr of buf4 = 0x7f8aec4ab010 > Offs of buf4 = 0x7f8aec4ab140 > Threads #: 8 Pthreads > Matrix size: 2048 > Using multiply kernel: multiply1 > Execution time = 24.579 seconds > [ perf record: Woken up 972 times to write data ] > [ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ] > > $ perf report --header --stats > # ======== > # captured on : Wed Oct 7 19:07:50 2020 > # header version : 1 > # data offset : 504 > # data size : 30056 > # feat offset : 30560 > # hostname : nntvtune39 > # os release : 5.8.8-100.fc31.x86_64 > # perf version : 4.13.rc5.gdf19b1347b82 > # arch : x86_64 > # nrcpus online : 8 > # nrcpus avail : 8 > # cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz > # cpuid : GenuineIntel,6,94,3 > # total memory : 16119484 kB > # cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench /matrix/linux/matrix.gcc.g.O3 > # event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|> > # event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty> > # sibling sockets : 0-7 > # sibling dies : 0-7 > # sibling threads : 0,4 > # sibling threads : 1,5 > # sibling threads : 2,6 > # sibling threads : 3,7 > # CPU 0: Core ID 0, Die ID 0, Socket ID 0 > # CPU 1: Core ID 1, Die ID 0, Socket ID 0 > # CPU 2: Core ID 2, Die ID 0, Socket ID 0 > # CPU 3: Core ID 3, Die ID 0, Socket ID 0 > # CPU 4: Core ID 0, Die ID 0, Socket ID 0 > # CPU 5: Core ID 1, Die ID 0, Socket ID 0 > # CPU 6: Core ID 2, Die ID 0, Socket ID 0 > # CPU 7: Core ID 3, Die ID 0, Socket ID 0 > # node0 meminfo : total = 16119484 kB, free = 933088 kB > # node0 cpu list : 0-7 > # pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac> > # CPU cache info: > # L1 Data 32K [0,4] > # L1 Instruction 32K [0,4] > # L1 Data 32K [1,5] > # L1 Instruction 32K [1,5] > # L1 Data 32K [1,5] > # L1 Instruction 32K [1,5] > # L1 Data 32K [2,6] > # L1 Instruction 32K [2,6] > # L1 Data 32K [3,7] > # L1 Instruction 32K [3,7] > # L2 Unified 256K [0,4] > # L2 Unified 256K [1,5] > # L2 Unified 256K [2,6] > # L2 Unified 256K [3,7] > # L3 Unified 8192K [0-7] > # time of first sample : 0.000000 > # time of last sample : 0.000000 > # sample duration : 0.000 ms > # memory nodes (nr 1, block size 0x8000000): > # 0 [16G]: 0-23,32-135 > # directory data version : 1 > # bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66 > # bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66 > # bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373 > # bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66 > # bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66 > # bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132 > # bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62 > # bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62 > # bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66 > # bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66 > # bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552 > # bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66 > # bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66 > # compressed : Zstd, level = 3, ratio = 6 > # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake > # missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA > # ======== > # > Aggregated stats: > TOTAL events: 1457107 > MMAP events: 98 > LOST events: 0 > COMM events: 2 > EXIT events: 6 > THROTTLE events: 0 > UNTHROTTLE events: 0 > FORK events: 8 > READ events: 0 > SAMPLE events: 1456388 > MMAP2 events: 7 > AUX events: 0 > ITRACE_START events: 0 > LOST_SAMPLES events: 0 > SWITCH events: 0 > SWITCH_CPU_WIDE events: 0 > NAMESPACES events: 0 > KSYMBOL events: 13 > BPF_EVENT events: 13 > CGROUP events: 166 > TEXT_POKE events: 0 > ATTR events: 0 > EVENT_TYPE events: 0 > TRACING_DATA events: 0 > BUILD_ID events: 0 > FINISHED_ROUND events: 0 > ID_INDEX events: 0 > AUXTRACE_INFO events: 0 > AUXTRACE events: 0 > AUXTRACE_ERROR events: 0 > THREAD_MAP events: 1 > CPU_MAP events: 1 > STAT_CONFIG events: 0 > STAT events: 0 > STAT_ROUND events: 0 > EVENT_UPDATE events: 0 > TIME_CONV events: 1 > FEATURE events: 0 > COMPRESSED events: 403 > > $ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3 > Addr of buf1 = 0x7f15e8b89010 > Offs of buf1 = 0x7f15e8b89180 > Addr of buf2 = 0x7f15e6b88010 > Offs of buf2 = 0x7f15e6b881c0 > Addr of buf3 = 0x7f15e4b87010 > Offs of buf3 = 0x7f15e4b87100 > Addr of buf4 = 0x7f15e2b86010 > Offs of buf4 = 0x7f15e2b86140 > Threads #: 8 Pthreads > Matrix size: 2048 > Using multiply kernel: multiply1 > Execution time = 24.281 seconds > [ perf record: Woken up 6923 times to write data ] > [ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ] > > $ perf report --stdio > # To display the perf.data header info, please use --header/--header-only options. > # > # > # Total Lost Samples: 0 > # > # Samples: 708K of event 'cycles' > # Event count (approx.): 277718834163 > # > # Children Self Command Shared Object Symbol > # ........ ........ ............... ........................ ........................................................ > # > 99.97% 0.00% matrix.gcc.g.O3 libc-2.30.so [.] __GI___clone (inlined) > | > ---__GI___clone (inlined) > | > --99.97%--start_thread > | > --99.97%--ThreadFunction > multiply1 > | > --0.96%--asm_sysvec_apic_timer_interrupt > | > --0.70%--subflow_syn_recv_sock > | > --0.60%--__sysvec_apic_timer_interrupt > | > --0.57%--hrtimer_interrupt > > 99.97% 0.00% matrix.gcc.g.O3 libpthread-2.30.so [.] start_thread > | > ---start_thread > | > --99.97%--ThreadFunction > | > --99.97%--multiply1 > | > --0.96%--asm_sysvec_apic_timer_interrupt > | > --0.70%--subflow_syn_recv_sock > | > --0.60%--__sysvec_apic_timer_interrupt > | > --0.57%--hrtimer_interrupt > > 99.97% 0.00% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] ThreadFunction > | > ---ThreadFunction > | > --99.97%--multiply1 > | > --0.96%--asm_sysvec_apic_timer_interrupt > | > --0.70%--subflow_syn_recv_sock > | > --0.60%--__sysvec_apic_timer_interrupt > | > --0.57%--hrtimer_interrupt > > 99.97% 98.68% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] multiply1 > | > |--98.68%--__GI___clone (inlined) > | start_thread > | ThreadFunction > | multiply1 > | > --1.28%--multiply1 > | > --0.96%--asm_sysvec_apic_timer_interrupt > | > --0.70%--subflow_syn_recv_sock > | > --0.60%--__sysvec_apic_timer_interrupt > | > --0.57%--hrtimer_interrupt > > 0.97% 0.05% matrix.gcc.g.O3 [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt > | > --0.91%--asm_sysvec_apic_timer_interrupt > | > --0.70%--subflow_syn_recv_sock > > > -- > 2.24.1 > -- - Arnaldo