Date: Fri, 18 Oct 2019 13:28:21 +0200 (CEST)
From: Thomas Gleixner
To: Alexei Starovoitov
cc: David Miller, Sebastian Sewior, Daniel Borkmann, bpf,
    Alexei Starovoitov, Martin KaFai Lau, Song Liu, Yonghong Song,
    Peter Zijlstra, Clark Williams
Subject: Re: [PATCH] BPF: Disable on PREEMPT_RT
In-Reply-To: <20191018055222.cwx5dmj6pppqzcpc@ast-mbp>
References: <20191017090500.ienqyium2phkxpdo@linutronix.de>
    <20191017145358.GA26267@pc-63.home>
    <20191017154021.ndza4la3hntk4d4o@linutronix.de>
    <20191017.132548.2120028117307856274.davem@davemloft.net>
    <20191018055222.cwx5dmj6pppqzcpc@ast-mbp>
X-Mailing-List: bpf@vger.kernel.org

Alexei,

On Thu, 17 Oct 2019, Alexei Starovoitov wrote:
> On Fri, Oct 18, 2019 at 02:22:40AM +0200, Thomas Gleixner wrote:
> >
> > But that also means any code which explicitly disables preemption or
> > interrupts without taking a spin/rw lock can trigger the following issues:
> >
> >  - Calling into code which requires to be preemptible/sleepable on RT
> >    results in a might sleep splat.
> >
> >  - Has in RT terms potentially unbound or undesired runtime length without
> >    any chance for the scheduler to control it.
>
> Much appreciate the explanation. Few more questions:
> There is a ton of kernel code that does preempt_disable()
> and proceeds to do per-cpu things. How is it handled in RT?

There are not really tons of it, at least not tons which actually hurt.
Most of those sections are extremely small or actually required even on
RT (e.g. scheduler, lock internals ...)

> Are you saying that every preempt_disable has to be paired with some lock?
> I don't think it's a practical requirement to fulfill, so I probably
> misunderstood something.

See above. The ones RT cares about are:

 - Long and potentially unbounded preempt/interrupt disabled sections

 - Preempt disabled sections which call into code which might sleep
   under RT due to the magic 'sleeping' spin/rw_locks which we use
   as substitution.

> In BPF we disable preemption because of per-cpu maps and per-cpu data structures
> that are shared between bpf program execution and kernel execution.
>
> BPF doesn't call into code that might sleep.

Sure, not if you look at it from the mainline perspective. RT changes the
picture there: due to forced interrupt/soft interrupt threading and the
lock substitution, 'innocent' code becomes sleepable.
That's especially true for the memory allocators, which are required to
be called with preemption enabled on RT. But again, most GFP_ATOMIC
allocations happen from within spin/rwlock held sections, which are
already made preemptible by RT magically. The ones which were inside of
contexts which are atomic even on RT have been moved out of the atomic
sections already (except for the BPF ones).

The problem with standalone preempt_disable() and local_irq_disable() is
that the protection scope is not defined. These two are basically per CPU
big kernel locks. We all know how well the BKL semantics worked :)

One of the mechanisms RT uses is to substitute standalone
preempt_disable() and local_irq_disable() invocations which are not
related to a lock operation with so-called local_locks. We haven't
submitted the local_lock code yet, but that might be a way out. The way
it works is simple:

  DEFINE_LOCAL_LOCK(this_scope);

in the code:

-       preempt_disable();
+       local_lock(this_scope);

and all kinds of variants local_lock_bh/irq/irqsave(). You get the idea.

On a non RT enabled build these primitives just resolve to
preempt_disable(), local_bh_disable(), local_irq_disable() and
local_irq_save().

On RT the local lock is actually a per CPU lock which allows nesting, i.e.

      preempt_disable();
      ...
      local_irq_disable();

becomes

      local_lock(this_scope);
      ...
      local_lock_irq(this_scope);

The local lock is a 'sleeping' spinlock on RT (PI support) and, like any
other RT substituted lock, it also ensures that the task cannot be
migrated while it is held, which makes per cpu assumptions work - the
kernel has lots of them. :)

That works as long as the scope is well defined and clear. It does not
work when preempt_disable() or any of the other scopeless protections is
used to protect random (unidentifiable) code against each other, which
means the protection has the dreaded per CPU BKL semantics, i.e.
undefined.
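
To make the non-RT mapping above concrete, here is a minimal userspace
sketch in plain C. It is only a model under stated assumptions, not the
local_lock code from the RT tree: preempt_count, irqs_off, the stub
bodies of preempt_disable()/local_irq_disable(), struct local_lock_model,
stats_scope and the two stats_*() helpers are all hypothetical names made
up for illustration. The macros only mirror what the mail says about a
non-RT build, and the RT behaviour (per CPU sleeping lock, migration
disable) is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; the kernel tracks this per CPU. */
int preempt_count;
bool irqs_off;

static void preempt_disable(void)   { preempt_count++; }
static void preempt_enable(void)    { preempt_count--; }
static void local_irq_disable(void) { irqs_off = true; }
static void local_irq_enable(void)  { irqs_off = false; }

/* Hypothetical layout: in this model the lock object carries only a name,
 * which is what gives the protected region an identifiable scope. */
struct local_lock_model { const char *name; };
#define DEFINE_LOCAL_LOCK(n) static struct local_lock_model n = { #n }

/* Non-RT mapping as described in the mail: the scope argument is ignored
 * and the primitives resolve to the plain disable/enable calls. */
#define local_lock(l)       do { (void)(l); preempt_disable(); } while (0)
#define local_unlock(l)     do { (void)(l); preempt_enable(); } while (0)
#define local_lock_irq(l)   do { (void)(l); local_irq_disable(); } while (0)
#define local_unlock_irq(l) do { (void)(l); local_irq_enable(); } while (0)

DEFINE_LOCAL_LOCK(stats_scope);
long stats_counter;                 /* models one CPU's per-CPU counter */

/* Was: preempt_disable(); stats_counter++; preempt_enable();
 * Returns the preempt depth observed inside the protected section. */
int stats_inc(void)
{
	int depth;

	local_lock(stats_scope);
	stats_counter++;
	depth = preempt_count;
	local_unlock(stats_scope);
	return depth;
}

/* Nested form from the mail: local_lock() ... local_lock_irq().
 * Returns true if both protections were in effect during the reset. */
bool stats_reset_nested(void)
{
	bool both_held;

	local_lock(stats_scope);
	local_lock_irq(stats_scope);
	stats_counter = 0;
	both_held = irqs_off && preempt_count == 1;
	local_unlock_irq(stats_scope);
	local_unlock(stats_scope);
	return both_held;
}
```

The only point of the sketch is that the conversion costs nothing on a
non-RT build while every protected section now names what it protects.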

One nice thing about local_lock, even aside from RT, is that it annotates
the code in terms of protection scope, which also gives you lockdep
coverage. We have already found a few bugs that way in the past, where
data was protected with preempt_disable() when the code was introduced
and later access from interrupt code was added without anyone noticing
for years....

> BPF also doesn't have unbound runtime.
> So two above issues are actually non-issues.

That'd be nice :)

Anyway, we'll have a look whether this can be solved with local locks,
which would be nice, but that still does not solve the issue with the
non_owner release of the rwsem.

Thanks,

        tglx