From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 25 Jan 2019 15:42:43 -0800
From: Alexei Starovoitov
To: Peter Zijlstra
Cc: Alexei Starovoitov, davem@davemloft.net, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org, kernel-team@fb.com,
	mingo@redhat.com, will.deacon@arm.com, Paul McKenney, jannh@google.com
Subject: Re: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock
Message-ID: <20190125234241.soomtkrgp2i7m7ul@ast-mbp.dhcp.thefacebook.com>
References: <20190124041403.2100609-1-ast@kernel.org>
	<20190124041403.2100609-2-ast@kernel.org>
	<20190124180109.GA27771@hirez.programming.kicks-ass.net>
	<20190124235857.xyb5xx2ufr6x5mbt@ast-mbp.dhcp.thefacebook.com>
	<20190125091057.GK17749@hirez.programming.kicks-ass.net>
In-Reply-To:
	<20190125091057.GK17749@hirez.programming.kicks-ass.net>
User-Agent: NeoMutt/20180223

On Fri, Jan 25, 2019 at 10:10:57AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 24, 2019 at 03:58:59PM -0800, Alexei Starovoitov wrote:
> > On Thu, Jan 24, 2019 at 07:01:09PM +0100, Peter Zijlstra wrote:
> > >
> > > Thanks for having kernel/locking people on Cc...
> > >
> > > On Wed, Jan 23, 2019 at 08:13:55PM -0800, Alexei Starovoitov wrote:
> > >
> > > > Implementation details:
> > > > - on !SMP bpf_spin_lock() becomes nop
> > >
> > > Because no BPF program is preemptible? I don't see any assertions or
> > > even a comment that says this code is non-preemptible.
> > >
> > > AFAICT some of the BPF_RUN_PROG things are under rcu_read_lock() only,
> > > which is not sufficient.
> >
> > nope. all bpf prog types disable preemption. That is must have for all
> > sorts of things to work properly.
> > If there is a prog type that doing rcu_read_lock only it's a serious bug.
> > About a year or so ago we audited everything specifically to make
> > sure everything disables preemption before calling bpf progs.
> > I'm pretty sure nothing crept in in the mean time.
>
> Do we want something like (the completely untested) below to avoid
> having to manually audit this over and over?
>
> ---
>  include/linux/filter.h |  2 +-
>  include/linux/kernel.h |  9 +++++++--
>  kernel/sched/core.c    | 28 ++++++++++++++++++++++++++++
>  3 files changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d531d4250bff..4ab51e78da36 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -513,7 +513,7 @@ struct sk_filter {
> 	struct bpf_prog	*prog;
> };
>
> -#define BPF_PROG_RUN(filter, ctx) (*(filter)->bpf_func)(ctx, (filter)->insnsi)
> +#define BPF_PROG_RUN(filter, ctx) ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); })

That looks reasonable and I intend to apply this patch to bpf-next after
testing. Can you pls reply with a sob?

With Daniel we did a bit of git archaeology into how we came to this
missing preempt_disable on the send side of socket filters...
The main reason is that classic BPF was preemptable from the beginning
and has had the SKF_AD_CPU feature for 10+ years. Meaning that classic
BPF programs can read the current cpu as a hint even though they are
preemptable. raw_smp_processor_id was used to avoid triggering a false
warning. When eBPF came along we kept it as-is during the
classic->extended conversion. Back then there were no per-cpu maps.
When per-cpu maps were introduced we missed the preemptable sendmsg
case. This cant_sleep() check should help us in the future.

The easiest fix is to add preempt_disable/enable for socket filters.
There is a concern that such a fix would make classic BPF
non-preemptable, and classic BPF can be quite cpu expensive. For
example, it's allowed to have 4k classic instructions that do
'load SKF_AD_PAY_OFFSET', which will call the heavy flow_dissector
4k times. We could do preempt_disable for extended only; then the
BPF_PROG_RUN macro and cant_sleep would need to be adjusted
accordingly. My preference would be to do preempt_disable for both
classic and extended, since classic is only legacy uapi at this point.
Nothing on the kernel side makes it special and I'd like to keep it
that way. Also on the receive side classic runs in bh, so 4k
flow_dissector calls in classic have to be dealt with anyway.

> > nmi checks for bpf_prog_active==0. See bpf_overflow_handler.
>
> yuck yuck yuck.. That's horrific :-( That means the whole BPF crud is
> unreliable and events can go randomly missing.

bpf_prog_active is the mechanism to work around non-reentrant pieces
of the kernel. Before it we couldn't do this:

sudo funccount.py *spin_lock*
Tracing 5 functions for "*spin_lock*"... Hit Ctrl-C to end.
^C
FUNC                        COUNT
queued_spin_lock_slowpath     682
_raw_spin_lock_bh             997
_raw_spin_lock_irq          10586
_raw_spin_lock_irqsave      90666
_raw_spin_lock             318300
Detaching...

Now we can. This power comes with the 'horrific' caveat that two
tracing progs cannot execute on the same cpu at the same time.
It's not a pretty fix for the reentrancy issue. If anyone has better
ideas, I'm all ears.

> What about the progs that run from SoftIRQ ? Since that bpf_prog_active
> thing isn't inside BPF_PROG_RUN() what is to stop say:
>
>   reuseport_select_sock()
>     ...
>       BPF_PROG_RUN()
>         bpf_spin_lock()
>
>       ...
>         BPF_PROG_RUN()
>           bpf_spin_lock() // forever more
>
> Unless you stick that bpf_prog_active stuff inside BPF_PROG_RUN itself,
> I don't see how you can fundamentally avoid this happening (now or in
> the future).

The BPF_PROG_RUN macro is used as a template for another macro:

  ret = BPF_PROG_RUN_ARRAY(&bpf_prog_array, ctx, BPF_PROG_RUN);

see kernel/trace/bpf_trace.c. Doing bpf_prog_active for every prog in
the array is too expensive. But your point above is valid. We don't use
bpf_prog_active for networking progs, since we allow one level of
nesting due to the classic SKF_AD_PAY_OFFSET legacy. Also we allow
tracing progs to nest with networking progs. People are using this
actively.
Typically it's not an issue, since in networking there is no arbitrary
nesting (unlike kprobe/nmi in tracing). But for bpf_spin_lock it can
be, since the same map can be shared by networking and tracing progs,
and the above deadlock would be possible: the first BPF_PROG_RUN would
be from a networking prog, then a kprobe+bpf's BPF_PROG_RUN would
access the same map with bpf_spin_lock.

So for now I'm going to allow bpf_spin_lock in networking progs only,
since there is no arbitrary nesting there. Once we figure out the
safety concerns for kprobe/tracepoint progs we can enable bpf_spin_lock
there too. NMI bpf progs will never have bpf_spin_lock.

re: memory model.. will reply in the other thread.