From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.6 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1875EC31E45 for ; Fri, 14 Jun 2019 00:55:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D775421537 for ; Fri, 14 Jun 2019 00:55:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Am2rlOOm" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725812AbfFNAzu (ORCPT ); Thu, 13 Jun 2019 20:55:50 -0400 Received: from mail-lf1-f65.google.com ([209.85.167.65]:38149 "EHLO mail-lf1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725765AbfFNAzu (ORCPT ); Thu, 13 Jun 2019 20:55:50 -0400 Received: by mail-lf1-f65.google.com with SMTP id b11so492491lfa.5; Thu, 13 Jun 2019 17:55:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=ryaPCZIxAcKEqlnkJURjQarpGjK1d3kFhw+bV3KFP9w=; b=Am2rlOOmSe5TZvPubrB9t5WLp99W7cGT3AFwKkLm752jiH3pKsM7TuXaV/gVBtbI0I Lvt/jAO2ZklUO1AuoGu4BSbAG3VQGC0Bs/G/Nudkn/xfVMLjTPELMRMyOX731rgWMoqJ 9Gh6wD1+eyS6aCNiN3tdJ1UyKEUtUfWNfMC3eDMhMJsX44+v5DWx7n2a931/3UcXjJs5 QbANRRVfLLaJzAf3JXE4YvgTn0D3VYoGcLZh3KqPm6iMoJOfV0OFEXA94WXq3noAf1yq iFlBs6bFtTBu3UnsjO3GXwg/ug1ViobK/hpmMhcfcroIa30HDFexgv6ZYKKPwspta92N VdvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=ryaPCZIxAcKEqlnkJURjQarpGjK1d3kFhw+bV3KFP9w=; b=DhP81wer8GY2AoeNzp0w3CV4Q04uUa0DWyPXb1srpOHIx3Ib/zE/GVDlzsz/J8bkqE GSViQVBYm0F27uuSXJs7yj6Uxny3WG4W+PQ3FNvm4j827EhxRgchJoTescpXhegWo0dr EA+/SMzLvphSAhNfMPV/RwxBraSeewXjf9v8lzNbo/zwRoN0DymXKUPUi83XgpRz/HZo XDQOvBtw9ZLRUug4+SF0DIjkxJSUTx1jVLQ0DnqI8Z8qonFvdi8oLpneFh73RCdLVKpM v2W3feCLGxc7AFIkJzZx/kbqllA1W5kPvMmJsmbU0xJaGYJAWU29lTPru1R8MjBYcg0e /SIA== X-Gm-Message-State: APjAAAUEs+RAWOZGMdbf1ZKdJKOCWLlQND/Qxb+1pi+lbZ0lDBWf9M7k Sy3CSH9ZMYHSU0CF6gODDR9S7xGZnmJIJOPhO+M= X-Google-Smtp-Source: APXvYqx9lJ1Egqv+7WXAkIM2x8BJIXO6TIeTHbKoqF+cwbEj1hpPLT6iy6bgD8dz8+Z+cP0erI7MZ1fSdo/A3fdlVgE= X-Received: by 2002:a19:ab1a:: with SMTP id u26mr8730266lfe.6.1560473747165; Thu, 13 Jun 2019 17:55:47 -0700 (PDT) MIME-Version: 1.0 References: <20190611215304.28831-1-mmullins@fb.com> <4aa26670-75b8-118d-68ca-56719af44204@iogearbox.net> <9c77657414993332987ca79d4081c4d71cc48d66.camel@fb.com> In-Reply-To: <9c77657414993332987ca79d4081c4d71cc48d66.camel@fb.com> From: Alexei Starovoitov Date: Thu, 13 Jun 2019 17:55:35 -0700 Message-ID: Subject: Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data To: Matt Mullins Cc: "daniel@iogearbox.net" , "andrii.nakryiko@gmail.com" , Song Liu , "linux-kernel@vger.kernel.org" , "bpf@vger.kernel.org" , "rostedt@goodmis.org" , "ast@kernel.org" , Andrew Hall , "mingo@redhat.com" , "netdev@vger.kernel.org" , Martin Lau , Yonghong Song Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Thu, Jun 13, 2019 at 5:52 PM Matt Mullins wrote: > > On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote: > > On 06/12/2019 07:00 AM, Andrii Nakryiko wrote: > > > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins wrote: > > > > > > > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CP= U, as > > > > they do not increment bpf_prog_active while executing. > > > > > > > > This enables three levels of nesting, to support > > > > - a kprobe or raw tp or perf event, > > > > - another one of the above that irq context happens to call, and > > > > - another one in nmi context > > > > (at most one of which may be a kprobe or perf event). > > > > > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sam= ple_data") > > > > Generally, looks good to me. Two things below: > > > > Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce B= PF_RAW_TRACEPOINT") > > instead of the one you currently have? > > Ah, yeah, that's probably more reasonable; I haven't managed to come up > with a scenario where one could hit this without raw tracepoints. I'll > fix up the nits that've accumulated since v2. > > > One more question / clarification: we have __bpf_trace_run() vs trace_c= all_bpf(). > > > > Only raw tracepoints can be nested since the rest has the bpf_prog_acti= ve per-CPU > > counter via trace_call_bpf() and would bail out otherwise, iiuc. And ra= w ones use > > the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce BPF_RAW_TR= ACEPOINT"). > > > > 1) I tried to recall and find a rationale for mentioned trace_call_bpf(= ) split in > > the c4f6699dfcb8 log, but couldn't find any. Is the raison d'=C3=AAtre = purely because of > > performance overhead (and desire to not miss events as a result of nest= ing)? (This > > also means we're not protected by bpf_prog_active in all the map ops, o= f course.) > > 2) Wouldn't this also mean that we only need to fix the raw tp programs= via > > get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this dupli= cation for > > the rest which relies upon trace_call_bpf()? I'm probably missing somet= hing, but > > given they have separate pt_regs there, how could they be affected then= ? > > For the pt_regs, you're correct: I only used get/put_raw_tp_regs for > the _raw_tp variants. However, consider the following nesting: > > trace_nest_level raw_tp_nest_level > (kprobe) bpf_perf_event_output 1 0 > (raw_tp) bpf_perf_event_output_raw_tp 2 1 > (raw_tp) bpf_get_stackid_raw_tp 2 2 > > I need to increment a nest level (and ideally increment it only once) > between the kprobe and the first raw_tp, because they would otherwise > share the struct perf_sample_data. But I also need to increment a nest > level between the two raw_tps, since they share the pt_regs -- I can't > use trace_nest_level for everything because it's not used by > get_stackid, and I can't use raw_tp_nest_level for everything because > it's not incremented by kprobes. > > If raw tracepoints were to bump bpf_prog_active, then I could get away > with just using that count in these callsites -- I'm reluctant to do > that, though, since it would prevent kprobes from ever running inside a > raw_tp. I'd like to retain the ability to (e.g.) > trace.py -K htab_map_update_elem > and get some stack traces from at least within raw tracepoints. > > That said, as I wrote up this example, bpf_trace_nest_level seems to be > wildly misnamed; I should name those after the structure they're > protecting... I still don't get what's wrong with the previous approach. Didn't I manage to convince both of you that perf_sample_data inside bpf_perf_event_array doesn't have any issue that Daniel brought up? I think this refcnting approach is inferior.