Date: Mon, 1 May 2023 18:15:15 -0400
From: Steven Rostedt
To: Indu Bhagat
Cc: linux-toolchains@vger.kernel.org, daandemeyer@meta.com, andrii@kernel.org, kris.van.hees@oracle.com, elena.zannoni@oracle.com, nick.alcock@oracle.com
Subject: Re: [POC 0/5] SFrame based stack tracer for user space in the kernel
Message-ID: <20230501181515.098acdce@gandalf.local.home>
In-Reply-To: <20230501200410.3973453-1-indu.bhagat@oracle.com>
References: <20230501200410.3973453-1-indu.bhagat@oracle.com>
X-Mailing-List: linux-toolchains@vger.kernel.org

On Mon, 1 May 2023 13:04:05 -0700 Indu Bhagat
wrote:

> Hello,
>

Hi Indu,

This is really great!

I think we should include LKML in this as well. And possibly even
linux-trace-kernel@vger.kernel.org.

> This patch set is a Proof of Concept implementation for an SFrame-based
> stack tracer for user space in the kernel. Some of you had expressed interest
> in exploring this earlier; hopefully, this POC helps discuss the design and
> take it forward.
>
> Motivation
> ==========
> Generating stack traces is vital for all profiling, tracing and debugging
> tools. In context of generating stack traces for user space, frame-pointer
> based unwinding works, but has its issues ([1],[2]). EH_Frame based
> unwinding seems undesirable for kernel's unwinding needs ([3],[4]).
> In general, EH_Frame based unwinding is undesirable in applications that need
> fast, real-time stack tracers (e.g., profilers), because of the overhead of
> interpreting and executing DWARF opcodes to calculate the relevant stack
> offsets.
>
> SFrame (Simple Frame) stack trace format is designed to address these concerns.
> With this POC, we would like to see how to use SFrame as a viable alternative
> for user space stack tracing needs in the kernel.
>
> [1] https://lwn.net/Articles/919940/
> [2] https://pagure.io/fesco/issue/2817
> [3] https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/OOJDAKTJB5WGMOZRXTUX7FTPFBF3H7WE/#NXRMNKD4B23HX7U5ICMKFRZO6Z3VXQXL
> [4] https://lkml.org/lkml/2012/2/10/356
>
> What is SFrame format
> =====================
>
> SFrame is the "Simple Frame" stack trace format. The format is documented as
> part of the binutils documentation at https://sourceware.org/binutils/docs.
>
> Starting with binutils 2.40, the GNU assembler (as) can generate SFrame stack
> trace data based on the CFI directives found in the source assembly. This is
> achieved by using the --gsframe command line option when invoking the
> assembler.
> This option plays the same role as the existing --gdwarf-[2345]
> options, only this time referring to SFrame. The resulting stack tracing
> information is stored in a new segment of its own with type PT_GNU_SFRAME,
> containing a section named '.sframe'.
>
> Also starting with binutils 2.40, the GNU linker (ld) knows how to merge
> sections containing SFrame stack trace info.
>
> SFrame based user space stack tracer POC
> ========================================
> These patches implement a POC for an SFrame based user space stack tracer (for
> x86) in the kernel. The purpose of this code is to serve as a reference,
> initiate discussions, and perhaps serve as a starting point for a viable
> implementation of an SFrame based stack tracer. Please keep in mind that my
> familiarity with kernel code/processes/conventions is still limited ;-).
>
> High-level Design in this POC
> =============================
> Kconfig adds two config options for userspace unwinding
> - config USER_UNWINDER_SFRAME to enable the SFrame userspace unwinder
> - config USER_UNWINDER_FRAME_POINTER to enable the Frame Pointer userspace
>   unwinder
>
> If CONFIG_USER_UNWINDER_SFRAME is set, the task_struct keeps a reference to
> the sframe_state object for the task.
>
> For long running user programs, it makes sense to cache the sframe_state
> in the task and be able to simply do a quick do_sframe_unwind() at every
> unwind request. Caching the sframe_state also means keeping the .sframe
> pages (for the prog and its DSOs) pinned. The task's sframe_state is
> kmalloc'ed and initialized in load_elf_binary, when the task is close to
> beginning execution. The (open) issue with this design, however, remains that
> we need to detect when additional DSOs are brought in at run-time by the
> application.
>
> The detection (and resolution) of stale sframe_state is not implemented in this
> POC. As such, the POC at this time is fit only for applications that are
> statically linked.
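
To make the cached design quoted above concrete, here is a minimal user-space
sketch in C. It is not code from the patch set: only the names sframe_state and
.sframe come from the cover letter, while the struct layout and the helpers
sframe_state_add() and sframe_state_find() are hypothetical. It only models the
lookup step, assuming the cache is a small table of text ranges filled in once
at exec time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SECTIONS 8

/* One mapped text range plus the .sframe data describing it.
 * Hypothetical layout, chosen only for this illustration. */
struct sframe_sec {
    uintptr_t text_start;
    uintptr_t text_end;      /* exclusive */
    const void *sframe;      /* stand-in for the pinned .sframe pages */
};

/* Hypothetical per-task cache; the POC keeps a reference to an
 * object of this name in task_struct. */
struct sframe_state {
    struct sframe_sec secs[MAX_SECTIONS];
    size_t nr_secs;
};

/* Record the .sframe data of the program or one of its DSOs.
 * In the quoted design this happens once, around exec time. */
static bool sframe_state_add(struct sframe_state *st, uintptr_t start,
                             uintptr_t end, const void *sframe)
{
    if (st->nr_secs == MAX_SECTIONS || start >= end)
        return false;
    st->secs[st->nr_secs++] = (struct sframe_sec){ start, end, sframe };
    return true;
}

/* Find the cached section covering pc. NULL means the cache has no
 * data for that address, e.g. a DSO mapped after the cache was built. */
static const struct sframe_sec *
sframe_state_find(const struct sframe_state *st, uintptr_t pc)
{
    for (size_t i = 0; i < st->nr_secs; i++) {
        if (pc >= st->secs[i].text_start && pc < st->secs[i].text_end)
            return &st->secs[i];
    }
    return NULL;
}
```

A program counter inside a library loaded after the table was filled makes
sframe_state_find() return NULL, which is the stale-state case the cover letter
lists as an open issue.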

So my thought on this was not to pin the sframe, but simply to note that it
exists. When perf/bpf/ftrace wants a user space stack trace, it will ask
for one (I plan on adding an interface around this process, as it will also
handle the case where the sframe is not available).

As the user stack trace will not change while the task is in the kernel,
the unwind does not need to be performed at the moment it is requested.
Instead, the requester could register a callback, and then on exiting back to
user space (via the ptrace path), the kernel would do the sframe lookup and
pass the user space stack trace to the perf/bpf/ftrace handlers.

In this location, we can allow for the sframe to be faulted in, as it will
be in a context where it can safely take a fault (and schedule out!). It would
be no different than any part of the ELF file faulting in, and it can be
swapped back out under memory pressure.

I'll go ahead and play with this code.

Thanks again, this is really helpful.

-- Steve
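
The deferred model described in this reply can be illustrated with a small
user-space sketch in C. Nothing here is an existing kernel interface: struct
task, request_user_stacktrace(), task_return_to_user() and count_frames() are
hypothetical names, and the actual sframe lookup is replaced by a precomputed
array of frames. The sketch only shows the control flow: requests are recorded
while "in the kernel", and a single unwind serves all of them on the way back
to user space.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CALLBACKS 4

/* Hypothetical handler type: receives the user stack trace once it
 * becomes available. */
typedef void (*ustack_cb)(const unsigned long *trace, size_t nr, void *data);

struct pending_unwind {
    ustack_cb cb;
    void *data;
};

/* Hypothetical per-task state; not a real kernel structure. */
struct task {
    struct pending_unwind pending[MAX_CALLBACKS];
    size_t nr_pending;
    const unsigned long *user_frames; /* stand-in for an sframe walk result */
    size_t nr_user_frames;
    unsigned int unwinds_done;
};

/* Called from a perf/bpf/ftrace-like context: only records the request,
 * so no user memory is touched here. */
static bool request_user_stacktrace(struct task *t, ustack_cb cb, void *data)
{
    if (t->nr_pending == MAX_CALLBACKS)
        return false;
    t->pending[t->nr_pending++] = (struct pending_unwind){ cb, data };
    return true;
}

/* Called once on the way back to user space, where taking a fault and
 * sleeping would be allowed. A real implementation would do the sframe
 * lookup here; this sketch just hands out the precomputed frames. */
static void task_return_to_user(struct task *t)
{
    if (t->nr_pending == 0)
        return;
    t->unwinds_done++;
    for (size_t i = 0; i < t->nr_pending; i++)
        t->pending[i].cb(t->user_frames, t->nr_user_frames,
                         t->pending[i].data);
    t->nr_pending = 0;
}

/* Example handler: adds the number of delivered frames to a counter. */
static void count_frames(const unsigned long *trace, size_t nr, void *data)
{
    (void)trace;
    *(size_t *)data += nr;
}
```

If two consumers call request_user_stacktrace() during the same kernel entry,
task_return_to_user() performs one unwind and invokes both callbacks with the
same trace, since the user stack cannot change in between.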