Date: Thu, 9 Sep 2021 17:22:25 -0300
From: Arnaldo Carvalho de Melo
To: Ian Rogers
Cc: Milian Wolff, linux-perf-users@vger.kernel.org
Subject: Re: Reducing impact of IO/CPU overload during `perf record` - bad handling of excessive yield?

Em Thu, Sep 09, 2021 at 12:28:10PM -0700, Ian Rogers escreveu:
> On Thu, Sep 9, 2021 at 7:14 AM Milian Wolff wrote:
> >
> > Hey there!
> >
> > I'm trying to profile an application that suffers from overcommit and
> > excessive calls to `yield` (via OMP). Generally, it is putting a lot of
> > strain on the system. Sadly, I cannot make `perf record` work without
> > dropping (a lot of) chunks / events.
> >
> > Usually I'm using this command line:
> >
> > ```
> > perf record --call-graph dwarf -z -m 16M ...
> > ```
> >
> > But this is not enough in this case. What other alternatives are there?
> > Is there a `perf record` mode which never drops events, at the cost of
> > maybe stalling the profiled application?

There was discussion about such a mode, for use in a 'perf trace' that
works like 'strace', i.e. one that doesn't lose records but stalls the
profiled/traced application, but I don't think that progressed.
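
In the meantime, as a hedged sketch (all the pieces are already in this
thread; the 4096 below is just the example size Ian mentions further down,
not a tuned recommendation), the usual knobs are a bigger mmap buffer (-m),
compressed output (-z) and a smaller per-sample DWARF stack dump:

```
# sketch: same command as above, but asking for 4096-byte stack dumps
# instead of the default 8kB; -z keeps compression, -m 16M keeps the
# enlarged mmap buffer
perf record --call-graph dwarf,4096 -z -m 16M -- ./a.out
```

Whether that is enough depends on how much of the slowdown comes from the
stack dumps versus the PMU enable/disable on context switch shown in your
call graph below.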

> > Generally though, looking at the (partial) data I get from perf, it seems
> > like it does not cope well in the face of excessive calls to `yield`.
> > Take this example code here:
> >
> > ```
> > #include <algorithm>
> > #include <thread>
> > #include <vector>
> >
> > void yieldALot()
> > {
> >     for (int i = 0; i < 1000000; ++i) {
> >         std::this_thread::yield();
> >     }
> > }
> >
> > int main()
> > {
> >     std::vector<std::thread> threads(std::thread::hardware_concurrency() * 2);
> >     std::generate(threads.begin(), threads.end(), []() {
> >         return std::thread(&yieldALot);
> >     });
> >
> >     for (auto &thread : threads)
> >         thread.join();
> >
> >     return 0;
> > }
> > ```
> >
> > When I compile this with `g++ test.cpp -pthread -O2 -g` and run it on my
> > system with an AMD Ryzen 9 3900X with 12 cores and 24 SMT threads, then I
> > observe the following behavior:
> >
> > ```
> > perf stat -r 5 ./a.out
> >
> >  Performance counter stats for './a.out' (5 runs):
> >
> >       18,570.71 msec task-clock               #  19.598 CPUs utilized            ( +- 1.27% )
> >       2,717,136      context-switches         # 145.447 K/sec                    ( +- 3.26% )
> >              55      cpu-migrations           #   2.944 /sec                     ( +- 3.79% )
> >             222      page-faults              #  11.884 /sec                     ( +- 0.49% )
> >  61,223,010,902      cycles                   #   3.277 GHz                      ( +- 0.54% )  (83.33%)
> >   1,259,821,397      stalled-cycles-frontend  #   2.07% frontend cycles idle     ( +- 1.54% )  (83.16%)
> >   9,001,394,333      stalled-cycles-backend   #  14.81% backend cycles idle      ( +- 0.79% )  (83.25%)
> >  65,746,848,695      instructions             #   1.08  insn per cycle
> >                                               #   0.14  stalled cycles per insn  ( +- 0.26% )  (83.40%)
> >  14,474,862,454      branches                 # 774.830 M/sec                    ( +- 0.36% )  (83.49%)
> >     114,518,545      branch-misses            #   0.79% of all branches          ( +- 0.71% )  (83.67%)
> >
> >         0.94761 +- 0.00935 seconds time elapsed  ( +- 0.99% )
> > ```
> >
> > Now compare this to what happens when I enable perf recording:
> >
> > ```
> >  Performance counter stats for 'perf record --call-graph dwarf -z ./a.out' (5 runs):
> >
> >       19,928.28 msec task-clock               #   9.405 CPUs utilized            ( +- 2.74% )
> >       3,026,081      context-switches         # 142.538 K/sec                    ( +- 7.48% )
> >             120      cpu-migrations           #   5.652 /sec                     ( +- 3.18% )
> >           5,509      page-faults              # 259.491 /sec                     ( +- 0.01% )
> >  65,393,592,529      cycles                   #   3.080 GHz                      ( +- 1.81% )  (71.81%)
> >   1,445,203,862      stalled-cycles-frontend  #   2.13% frontend cycles idle     ( +- 2.01% )  (71.60%)
> >   9,701,321,165      stalled-cycles-backend   #  14.29% backend cycles idle      ( +- 1.28% )  (71.62%)
> >  69,967,062,730      instructions             #   1.03  insn per cycle
> >                                               #   0.14  stalled cycles per insn  ( +- 0.74% )  (71.83%)
> >  15,376,981,968      branches                 # 724.304 M/sec                    ( +- 0.83% )  (72.00%)
> >     127,934,194      branch-misses            #   0.81% of all branches          ( +- 1.29% )  (72.07%)
> >
> >        2.118881 +- 0.000496 seconds time elapsed  ( +- 0.02% )
> > ```
> >
> > Note how the CPU utilization drops dramatically, by roughly a factor of 2x,
> > which correlates with the increase in overall runtime.
> >
> > From the data obtained here I see:
> >
> > ```
> > 98.49%  0.00%  a.out  libc-2.33.so  [.] __GI___sched_yield (inlined)
> >  |
> >  ---__GI___sched_yield (inlined)
> >     |
> >     |--89.93%--entry_SYSCALL_64_after_hwframe
> >     |   |
> >     |    --89.32%--do_syscall_64
> >     |       |
> >     |       |--71.70%--__ia32_sys_sched_yield
> >     |       |   |
> >     |       |   |--64.36%--schedule
> >     |       |   |   |
> >     |       |   |    --63.42%--__schedule
> >     |       |   |       |
> >     |       |   |       |--32.56%--__perf_event_task_sched_out
> >     |       |   |       |   |
> >     |       |   |       |    --32.07%--amd_pmu_disable_all
> >     |       |   |       |       |
> >     |       |   |       |       |--23.54%--x86_pmu_disable_all
> >     |       |   |       |       |   |
> >     |       |   |       |       |   |--12.81%--native_write_msr
> >     |       |   |       |       |   |
> >     |       |   |       |       |   |--8.24%--native_read_msr
> >     |       |   |       |       |   |
> >     |       |   |       |       |    --1.25%--amd_pmu_addr_offset
> >     |       |   |       |       |
> >     |       |   |       |        --8.19%--amd_pmu_wait_on_overflow
> >     |       |   |       |           |
> >     |       |   |       |           |--5.00%--native_read_msr
> >     |       |   |       |           |
> >     |       |   |       |           |--2.20%--delay_halt
> >     |       |   |       |           |   |
> >     |       |   |       |           |    --1.99%--delay_halt_mwaitx
> >     |       |   |       |           |
> >     |       |   |       |            --0.55%--amd_pmu_addr_offset
> > ```
> >
> > Is this a deficiency with the AMD perf subsystem? Or is this a generic
> > issue with perf? I understand that it has to enable/disable the PMU when
> > it's switching tasks, but potentially there are some ways to optimize
> > this behavior?
>
> Hi Milian,
>
> By using dwarf call graphs your samples are writing a dump of the
> stack into the perf event ring buffer that will be processed in the
> userspace perf command. The default stack dump size is 8kb and you can
> lower it - for example with "--call-graph dwarf,4096". I suspect that
> most of the overhead you are seeing is from these stack dumps. There
> is a more complete description in the man page:
> https://man7.org/linux/man-pages/man1/perf-record.1.html
>
> Which is always worth improving:
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/Documentation/perf-record.txt?h=perf/core
>
> Thanks,
> Ian
>
> > Thanks
> >
> > --
> > Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> > KDAB (Deutschland) GmbH, a KDAB Group company
> > Tel: +49-30-521325470
> > KDAB - The Qt, C++ and OpenGL Experts

--

- Arnaldo