From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E61A1C433F5 for ; Tue, 3 May 2022 21:57:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243372AbiECWBS (ORCPT ); Tue, 3 May 2022 18:01:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44746 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236064AbiECWBQ (ORCPT ); Tue, 3 May 2022 18:01:16 -0400 Received: from mail-oi1-x22f.google.com (mail-oi1-x22f.google.com [IPv6:2607:f8b0:4864:20::22f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3BE5542499 for ; Tue, 3 May 2022 14:57:43 -0700 (PDT) Received: by mail-oi1-x22f.google.com with SMTP id q8so19165737oif.13 for ; Tue, 03 May 2022 14:57:43 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=czLZHU7MG6H1spBLHXY5jaZbzV59tXkMuN+Ck8Z7wc4=; b=hhWjsHnmVba6J3iaRJaO09nZL4LxjIq8C9W91LOfNC4XgTIwgTHymewopsSR7xhrbE wJfMPDofFT9Orm5X1Xmnq6csdeG70AmAjQHsJ6oidSvnu+c5NqT0zTpfwvGt6c7imJaY Yx4JIofQGY7s+yKOtcqFuF1p2zpjcELmsl0YE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=czLZHU7MG6H1spBLHXY5jaZbzV59tXkMuN+Ck8Z7wc4=; b=TbJBk1+MEC8kn7m4L7cD69vB13BdWgVmLvGtjAA4WAgkTZI0Mfiqle8YNO8bGiKBFA sIvJdRCAUrRJoip+OaM3sc6XGEFph5hrM5K5mWQM36J//heSOmGWJ8BVNNeHymLjt4k5 ZaJZSf+8zx0bPefAQip5osLeVpn+9wXwr/IaB5anPlU5latuqk83o6OHO7rvaJiHseJ9 BZuoROiHiKejjxZfl8EJtWx5wMCYLh6/39aXalDL51g7zH5xg9uWXALV1o0LKKbaDh96 psp5t9ulswM3bAnesYmJ6ufwsLzkbWiNkCixB4lamnNkYOrz00MfG4nqSs9zSLZ6yR5y +cfQ== X-Gm-Message-State: AOAM532sZFsJGWkE/JoTIA1eO4oS/gBTvzgEqTUSvaEvbZ6v9zLuqeve 0pO9WHP6i1LasSBvqoD0gQwn0imrQvcQvQ== X-Google-Smtp-Source: ABdhPJwg0/UbiNCsxVioPwVOQOf2Ih7EYHaaka0N5GF37BIcnoOEBKA/58rpbwtxk7TlvGhARR86Pg== X-Received: by 2002:a05:6808:150e:b0:323:292a:f32c with SMTP id u14-20020a056808150e00b00323292af32cmr2709836oiw.125.1651615062474; Tue, 03 May 2022 14:57:42 -0700 (PDT) Received: from mail-oi1-f169.google.com (mail-oi1-f169.google.com. [209.85.167.169]) by smtp.gmail.com with ESMTPSA id q23-20020a4adc57000000b0035eb4e5a6c7sm5347629oov.29.2022.05.03.14.57.42 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 03 May 2022 14:57:42 -0700 (PDT) Received: by mail-oi1-f169.google.com with SMTP id e189so19573652oia.8 for ; Tue, 03 May 2022 14:57:42 -0700 (PDT) X-Received: by 2002:a05:6808:219a:b0:325:93fc:e0fd with SMTP id be26-20020a056808219a00b0032593fce0fdmr2775646oib.241.1651615054192; Tue, 03 May 2022 14:57:34 -0700 (PDT) MIME-Version: 1.0 References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> In-Reply-To: From: Evan Green Date: Tue, 3 May 2022 14:56:58 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path To: "Guilherme G. Piccoli" Cc: Andrew Morton , bhe@redhat.com, pmladek@suse.com, kexec@lists.infradead.org, LKML , bcm-kernel-feedback-list@broadcom.com, linuxppc-dev@lists.ozlabs.org, linux-alpha@vger.kernel.org, linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org, linux-leds@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, Linux PM , linux-remoteproc@vger.kernel.org, linux-s390@vger.kernel.org, linux-tegra@vger.kernel.org, linux-um@lists.infradead.org, linux-xtensa@linux-xtensa.org, netdev@vger.kernel.org, openipmi-developer@lists.sourceforge.net, rcu@vger.kernel.org, sparclinux@vger.kernel.org, xen-devel@lists.xenproject.org, x86@kernel.org, kernel-dev@igalia.com, kernel@gpiccoli.net, halves@canonical.com, fabiomirmar@gmail.com, alejandro.j.jimenez@oracle.com, Andy Shevchenko , Arnd Bergmann , Borislav Petkov , Jonathan Corbet , d.hatayama@jp.fujitsu.com, dave.hansen@linux.intel.com, dyoung@redhat.com, feng.tang@intel.com, Greg Kroah-Hartman , mikelley@microsoft.com, hidehiro.kawai.ez@hitachi.com, jgross@suse.com, john.ogness@linutronix.de, Kees Cook , luto@kernel.org, mhiramat@kernel.org, mingo@redhat.com, paulmck@kernel.org, peterz@infradead.org, rostedt@goodmis.org, senozhatsky@chromium.org, Alan Stern , Thomas Gleixner , vgoyal@redhat.com, vkuznets@redhat.com, Will Deacon , Ard Biesheuvel , David Gow , Julius Werner Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-remoteproc@vger.kernel.org Hi Guilherme, On Tue, May 3, 2022 at 12:12 PM Guilherme G. Piccoli wrote: > > On 03/05/2022 15:03, Evan Green wrote: > > [...] > > gsmi_shutdown_reason() is a common function called in other scenarios > > as well, like reboot and thermal trip, where it may still make sense > > to wait to acquire a spinlock. Maybe we should add a parameter to > > gsmi_shutdown_reason() so that you can get your change on panic, but > > we don't convert other callbacks into try-fail scenarios causing us to > > miss logs. > > > > Hi Evan, thanks for your feedback, much appreciated! > What I've done in other cases like this was to have a helper checking > the spinlock in the panic notifier - if we can acquire that, go ahead > but if not, bail out. For a proper example of an implementation, check > patch 13 of the series: > https://lore.kernel.org/lkml/20220427224924.592546-14-gpiccoli@igalia.com/ . > > Do you agree with that, or prefer really a parameter in > gsmi_shutdown_reason() ? I'll follow your choice =) I'm fine with either, thanks for the link. Mostly I want to make sure other paths to gsmi_shutdown_reason() aren't also converted to a try. > > > > Though thinking more about it, is this really a Good Change (TM)? The > > spinlock itself already disables interrupts, meaning the only case > > where this change makes a difference is if the panic happens from > > within the function that grabbed the spinlock (in which case the > > callback is also likely to panic), or in an NMI that panics within > > that window. The downside of this change is that if one core was > > politely working through an event with the lock held, and another core > > panics, we now might lose the panic log, even though it probably would > > have gone through fine assuming the other core has a chance to > > continue. > > My feeling is that this is a good change, indeed - a lot of places are > getting changed like this, in this series. > > Reasoning: the problem with your example is that, by default, secondary > CPUs are disabled in the panic path, through an IPI mechanism. IPIs take > precedence and interrupt the work in these CPUs, effectively > interrupting the "polite work" with the lock held heh The IPI can only interrupt a CPU with irqs disabled if the IPI is an NMI. I haven't looked before to see if we use NMI IPIs to corral the other CPUs on panic. On x86, I grepped my way down to native_stop_other_cpus(), which looks like it does a normal IPI, waits 1 second, then does an NMI IPI. So, if a secondary CPU has the lock held, on x86 it has roughly 1s to finish what it's doing and re-enable interrupts before smp_send_stop() brings the NMI hammer down. I think this should be more than enough time for the secondary CPU to get out and release the lock. So then it makes sense to me that you're fixing cases where we panicked with the lock held, or hung with the lock held. Given the 1 second grace period x86 gives us, I'm on board, as that helps mitigate the risk that we bailed out early with the try and should have spun a bit longer instead. Thanks. -Evan > > Then, such CPU is put to sleep and we finally reach the panic notifier > hereby discussed, in the main CPU. If the other CPU was shut-off *with > the lock held*, it's never finishing such work, so the lock is never to > be released. Conclusion: the spinlock can't be acquired, hence we broke > the machine (which is already broken, given it's panic) in the path of > this notifier. > This should be really rare, but..possible. So I think we should protect > against this scenario. > > We can grab others' feedback if you prefer, and of course you have the > rights to refuse this change in the gsmi code, but from my > point-of-view, I don't see any advantage in just assume the risk, > specially since the change is very very simple. > > Cheers, > > > Guilherme From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.ozlabs.org (lists.ozlabs.org [112.213.38.117]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5EF49C433F5 for ; Tue, 3 May 2022 22:23:18 +0000 (UTC) Received: from boromir.ozlabs.org (localhost [IPv6:::1]) by lists.ozlabs.org (Postfix) with ESMTP id 4KtDxc69Zsz3cFb for ; Wed, 4 May 2022 08:23:16 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=chromium.org header.i=@chromium.org header.a=rsa-sha256 header.s=google header.b=hhWjsHnm; dkim-atps=neutral Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=chromium.org (client-ip=2607:f8b0:4864:20::22b; helo=mail-oi1-x22b.google.com; envelope-from=evgreen@chromium.org; receiver=) Authentication-Results: lists.ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=chromium.org header.i=@chromium.org header.a=rsa-sha256 header.s=google header.b=hhWjsHnm; dkim-atps=neutral Received: from mail-oi1-x22b.google.com (mail-oi1-x22b.google.com [IPv6:2607:f8b0:4864:20::22b]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4KtDYH0rVKz2xD4 for ; Wed, 4 May 2022 08:05:36 +1000 (AEST) Received: by mail-oi1-x22b.google.com with SMTP id v65so19582959oig.10 for ; Tue, 03 May 2022 15:05:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=czLZHU7MG6H1spBLHXY5jaZbzV59tXkMuN+Ck8Z7wc4=; b=hhWjsHnmVba6J3iaRJaO09nZL4LxjIq8C9W91LOfNC4XgTIwgTHymewopsSR7xhrbE wJfMPDofFT9Orm5X1Xmnq6csdeG70AmAjQHsJ6oidSvnu+c5NqT0zTpfwvGt6c7imJaY Yx4JIofQGY7s+yKOtcqFuF1p2zpjcELmsl0YE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=czLZHU7MG6H1spBLHXY5jaZbzV59tXkMuN+Ck8Z7wc4=; b=HMXfugkA8NVUBCU3hCxLWntQj60/R9FI0B81P+I9TYM8a4YyEZUpCmDA/RdAnlSrUB xEqY8r4xoHtKEwqYbnwkaO/tkVGPoDxsg2qoxuKDuB4FxhXWYe7wCCZb3nhWSb2jmJA7 KhNCS9VhlcHQgHUdTVwNsyUiVvVIQHweFaTfwGAwdNLsD0xWsG6+KAxh3ZpS21AnhgfF wAakgFtMJDhQI6JSd1kKDhsmxXR8cOc0JXNijmJNHy3rhM+oEUWv7Pg3OkV0fOHXMrvO 3xEPxp4jnx3tcDWCh7DERrFN4eHNtvJUYUnPrOxe97C+MyD37HO7dA//HDLR+1Afjauq 8bng== X-Gm-Message-State: AOAM530+KNIkbPRU3QiIeLU6QJ1TBygE6KDff6bwwHUT2/5tD40fN84z BRzMNLZgogkOKRSoGKMzsQXs2eHj+3KrIqX8 X-Google-Smtp-Source: ABdhPJy/OHB0Ug9mbJYkl0pxPs03QP2lhvAsvVIZLFCCRCFgP8lezmRhCikvNZ71LyI0cT3+HVFVtg== X-Received: by 2002:a05:6808:1825:b0:326:3e14:f156 with SMTP id bh37-20020a056808182500b003263e14f156mr1547837oib.144.1651615531953; Tue, 03 May 2022 15:05:31 -0700 (PDT) Received: from mail-oi1-f176.google.com (mail-oi1-f176.google.com. [209.85.167.176]) by smtp.gmail.com with ESMTPSA id l39-20020a0568302b2700b0060603221256sm4461651otv.38.2022.05.03.15.05.31 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 03 May 2022 15:05:31 -0700 (PDT) Received: by mail-oi1-f176.google.com with SMTP id z8so19631786oix.3 for ; Tue, 03 May 2022 15:05:31 -0700 (PDT) X-Received: by 2002:a05:6808:219a:b0:325:93fc:e0fd with SMTP id be26-20020a056808219a00b0032593fce0fdmr2775646oib.241.1651615054192; Tue, 03 May 2022 14:57:34 -0700 (PDT) MIME-Version: 1.0 References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> In-Reply-To: From: Evan Green Date: Tue, 3 May 2022 14:56:58 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path To: "Guilherme G. Piccoli" Content-Type: text/plain; charset="UTF-8" X-Mailman-Approved-At: Wed, 04 May 2022 08:18:47 +1000 X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: linux-hyperv@vger.kernel.org, halves@canonical.com, David Gow , linux-xtensa@linux-xtensa.org, peterz@infradead.org, alejandro.j.jimenez@oracle.com, linux-remoteproc@vger.kernel.org, feng.tang@intel.com, linux-mips@vger.kernel.org, hidehiro.kawai.ez@hitachi.com, sparclinux@vger.kernel.org, Will Deacon , Thomas Gleixner , linux-leds@vger.kernel.org, linux-s390@vger.kernel.org, mikelley@microsoft.com, john.ogness@linutronix.de, bhe@redhat.com, Jonathan Corbet , paulmck@kernel.org, fabiomirmar@gmail.com, x86@kernel.org, Ard Biesheuvel , mingo@redhat.com, bcm-kernel-feedback-list@broadcom.com, xen-devel@lists.xenproject.org, dyoung@redhat.com, vgoyal@redhat.com, pmladek@suse.com, dave.hansen@linux.intel.com, Kees Cook , Arnd Bergmann , Linux PM , linux-um@lists.infradead.org, rostedt@goodmis.org, rcu@vger.kernel.org, Greg Kroah-Hartman , Borislav Petkov , luto@kernel.org, linux-tegra@vger.kernel.org, openipmi-developer@lists.sourceforge.net, Andy Shevchenko , vkuznets@redhat.com, linux-edac@vger.kernel.org, jgross@suse.com, linux-parisc@vger.kernel.org, netdev@vger.kernel.org, kernel@gpiccoli.net, kexec@lists.infradead.org, LKML , Alan Stern , senozhatsky@chromium.org, d.hatayama@jp.fujitsu.com, mhiramat@kernel.org, kernel-dev@igalia.com, linux-alpha@vger.kernel.org, Julius Werner , Andrew Morton , linuxppc-dev@lists.ozlabs.org Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" Hi Guilherme, On Tue, May 3, 2022 at 12:12 PM Guilherme G. Piccoli wrote: > > On 03/05/2022 15:03, Evan Green wrote: > > [...] > > gsmi_shutdown_reason() is a common function called in other scenarios > > as well, like reboot and thermal trip, where it may still make sense > > to wait to acquire a spinlock. Maybe we should add a parameter to > > gsmi_shutdown_reason() so that you can get your change on panic, but > > we don't convert other callbacks into try-fail scenarios causing us to > > miss logs. > > > > Hi Evan, thanks for your feedback, much appreciated! > What I've done in other cases like this was to have a helper checking > the spinlock in the panic notifier - if we can acquire that, go ahead > but if not, bail out. For a proper example of an implementation, check > patch 13 of the series: > https://lore.kernel.org/lkml/20220427224924.592546-14-gpiccoli@igalia.com/ . > > Do you agree with that, or prefer really a parameter in > gsmi_shutdown_reason() ? I'll follow your choice =) I'm fine with either, thanks for the link. Mostly I want to make sure other paths to gsmi_shutdown_reason() aren't also converted to a try. > > > > Though thinking more about it, is this really a Good Change (TM)? The > > spinlock itself already disables interrupts, meaning the only case > > where this change makes a difference is if the panic happens from > > within the function that grabbed the spinlock (in which case the > > callback is also likely to panic), or in an NMI that panics within > > that window. The downside of this change is that if one core was > > politely working through an event with the lock held, and another core > > panics, we now might lose the panic log, even though it probably would > > have gone through fine assuming the other core has a chance to > > continue. > > My feeling is that this is a good change, indeed - a lot of places are > getting changed like this, in this series. > > Reasoning: the problem with your example is that, by default, secondary > CPUs are disabled in the panic path, through an IPI mechanism. IPIs take > precedence and interrupt the work in these CPUs, effectively > interrupting the "polite work" with the lock held heh The IPI can only interrupt a CPU with irqs disabled if the IPI is an NMI. I haven't looked before to see if we use NMI IPIs to corral the other CPUs on panic. On x86, I grepped my way down to native_stop_other_cpus(), which looks like it does a normal IPI, waits 1 second, then does an NMI IPI. So, if a secondary CPU has the lock held, on x86 it has roughly 1s to finish what it's doing and re-enable interrupts before smp_send_stop() brings the NMI hammer down. I think this should be more than enough time for the secondary CPU to get out and release the lock. So then it makes sense to me that you're fixing cases where we panicked with the lock held, or hung with the lock held. Given the 1 second grace period x86 gives us, I'm on board, as that helps mitigate the risk that we bailed out early with the try and should have spun a bit longer instead. Thanks. -Evan > > Then, such CPU is put to sleep and we finally reach the panic notifier > hereby discussed, in the main CPU. If the other CPU was shut-off *with > the lock held*, it's never finishing such work, so the lock is never to > be released. Conclusion: the spinlock can't be acquired, hence we broke > the machine (which is already broken, given it's panic) in the path of > this notifier. > This should be really rare, but..possible. So I think we should protect > against this scenario. > > We can grab others' feedback if you prefer, and of course you have the > rights to refuse this change in the gsmi code, but from my > point-of-view, I don't see any advantage in just assume the risk, > specially since the change is very very simple. > > Cheers, > > > Guilherme From mboxrd@z Thu Jan 1 00:00:00 1970 From: Evan Green Date: Tue, 3 May 2022 14:56:58 -0700 Subject: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path In-Reply-To: References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> Message-ID: List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: kexec@lists.infradead.org Hi Guilherme, On Tue, May 3, 2022 at 12:12 PM Guilherme G. Piccoli wrote: > > On 03/05/2022 15:03, Evan Green wrote: > > [...] > > gsmi_shutdown_reason() is a common function called in other scenarios > > as well, like reboot and thermal trip, where it may still make sense > > to wait to acquire a spinlock. Maybe we should add a parameter to > > gsmi_shutdown_reason() so that you can get your change on panic, but > > we don't convert other callbacks into try-fail scenarios causing us to > > miss logs. > > > > Hi Evan, thanks for your feedback, much appreciated! > What I've done in other cases like this was to have a helper checking > the spinlock in the panic notifier - if we can acquire that, go ahead > but if not, bail out. For a proper example of an implementation, check > patch 13 of the series: > https://lore.kernel.org/lkml/20220427224924.592546-14-gpiccoli at igalia.com/ . > > Do you agree with that, or prefer really a parameter in > gsmi_shutdown_reason() ? I'll follow your choice =) I'm fine with either, thanks for the link. Mostly I want to make sure other paths to gsmi_shutdown_reason() aren't also converted to a try. > > > > Though thinking more about it, is this really a Good Change (TM)? The > > spinlock itself already disables interrupts, meaning the only case > > where this change makes a difference is if the panic happens from > > within the function that grabbed the spinlock (in which case the > > callback is also likely to panic), or in an NMI that panics within > > that window. The downside of this change is that if one core was > > politely working through an event with the lock held, and another core > > panics, we now might lose the panic log, even though it probably would > > have gone through fine assuming the other core has a chance to > > continue. > > My feeling is that this is a good change, indeed - a lot of places are > getting changed like this, in this series. > > Reasoning: the problem with your example is that, by default, secondary > CPUs are disabled in the panic path, through an IPI mechanism. IPIs take > precedence and interrupt the work in these CPUs, effectively > interrupting the "polite work" with the lock held heh The IPI can only interrupt a CPU with irqs disabled if the IPI is an NMI. I haven't looked before to see if we use NMI IPIs to corral the other CPUs on panic. On x86, I grepped my way down to native_stop_other_cpus(), which looks like it does a normal IPI, waits 1 second, then does an NMI IPI. So, if a secondary CPU has the lock held, on x86 it has roughly 1s to finish what it's doing and re-enable interrupts before smp_send_stop() brings the NMI hammer down. I think this should be more than enough time for the secondary CPU to get out and release the lock. So then it makes sense to me that you're fixing cases where we panicked with the lock held, or hung with the lock held. Given the 1 second grace period x86 gives us, I'm on board, as that helps mitigate the risk that we bailed out early with the try and should have spun a bit longer instead. Thanks. -Evan > > Then, such CPU is put to sleep and we finally reach the panic notifier > hereby discussed, in the main CPU. If the other CPU was shut-off *with > the lock held*, it's never finishing such work, so the lock is never to > be released. Conclusion: the spinlock can't be acquired, hence we broke > the machine (which is already broken, given it's panic) in the path of > this notifier. > This should be really rare, but..possible. So I think we should protect > against this scenario. > > We can grab others' feedback if you prefer, and of course you have the > rights to refuse this change in the gsmi code, but from my > point-of-view, I don't see any advantage in just assume the risk, > specially since the change is very very simple. > > Cheers, > > > Guilherme From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ot1-x342.google.com ([2607:f8b0:4864:20::342]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1nm0WX-007tUi-JE for linux-um@lists.infradead.org; Tue, 03 May 2022 21:57:39 +0000 Received: by mail-ot1-x342.google.com with SMTP id k25-20020a056830169900b00605f215e55dso9177561otr.13 for ; Tue, 03 May 2022 14:57:35 -0700 (PDT) Received: from mail-oi1-f181.google.com (mail-oi1-f181.google.com. [209.85.167.181]) by smtp.gmail.com with ESMTPSA id n10-20020a9d6f0a000000b0060603221264sm4469469otq.52.2022.05.03.14.57.34 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 03 May 2022 14:57:34 -0700 (PDT) Received: by mail-oi1-f181.google.com with SMTP id l203so19654254oif.0 for ; Tue, 03 May 2022 14:57:34 -0700 (PDT) MIME-Version: 1.0 References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> In-Reply-To: From: Evan Green Date: Tue, 3 May 2022 14:56:58 -0700 Message-ID: Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-um" Errors-To: linux-um-bounces+geert=linux-m68k.org@lists.infradead.org To: "Guilherme G. Piccoli" Cc: Andrew Morton , bhe@redhat.com, pmladek@suse.com, kexec@lists.infradead.org, LKML , bcm-kernel-feedback-list@broadcom.com, linuxppc-dev@lists.ozlabs.org, linux-alpha@vger.kernel.org, linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org, linux-leds@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, Linux PM , linux-remoteproc@vger.kernel.org, linux-s390@vger.kernel.org, linux-tegra@vger.kernel.org, linux-um@lists.infradead.org, linux-xtensa@linux-xtensa.org, netdev@vger.kernel.org, openipmi-developer@lists.sourceforge.net, rcu@vger.kernel.org, sparclinux@vger.kernel.org, xen-devel@lists.xenproject.org, x86@kernel.org, kernel-dev@igalia.com, kernel@gpiccoli.net, halves@canonical.com, fabiomirmar@gmail.com, alejandro.j.jimenez@oracle.com, Andy Shevchenko , Arnd Bergmann , Borislav Petkov , Jonathan Corbet , d.hatayama@jp.fujitsu.com, dave.hansen@linux.intel.com, dyoung@redhat.com, feng.tang@intel.com, Greg Kroah-Hartman , mikelley@microsoft.com, hidehiro.kawai.ez@hitachi.com, jgross@suse.com, john.ogness@linutronix.de, Kees Cook , luto@kernel.org, mhiramat@kernel.org, mingo@redhat.com, paulmck@kernel.org, peterz@infradead.org, rostedt@goodmis.org, senozhatsky@chromium.org, Alan Stern , Thomas Gleixner , vgoyal@redhat.com, vkuznets@redhat.com, Will Deacon , Ard Biesheuvel , David Gow , Julius Werner Hi Guilherme, On Tue, May 3, 2022 at 12:12 PM Guilherme G. Piccoli wrote: > > On 03/05/2022 15:03, Evan Green wrote: > > [...] > > gsmi_shutdown_reason() is a common function called in other scenarios > > as well, like reboot and thermal trip, where it may still make sense > > to wait to acquire a spinlock. Maybe we should add a parameter to > > gsmi_shutdown_reason() so that you can get your change on panic, but > > we don't convert other callbacks into try-fail scenarios causing us to > > miss logs. > > > > Hi Evan, thanks for your feedback, much appreciated! > What I've done in other cases like this was to have a helper checking > the spinlock in the panic notifier - if we can acquire that, go ahead > but if not, bail out. For a proper example of an implementation, check > patch 13 of the series: > https://lore.kernel.org/lkml/20220427224924.592546-14-gpiccoli@igalia.com/ . > > Do you agree with that, or prefer really a parameter in > gsmi_shutdown_reason() ? I'll follow your choice =) I'm fine with either, thanks for the link. Mostly I want to make sure other paths to gsmi_shutdown_reason() aren't also converted to a try. > > > > Though thinking more about it, is this really a Good Change (TM)? The > > spinlock itself already disables interrupts, meaning the only case > > where this change makes a difference is if the panic happens from > > within the function that grabbed the spinlock (in which case the > > callback is also likely to panic), or in an NMI that panics within > > that window. The downside of this change is that if one core was > > politely working through an event with the lock held, and another core > > panics, we now might lose the panic log, even though it probably would > > have gone through fine assuming the other core has a chance to > > continue. > > My feeling is that this is a good change, indeed - a lot of places are > getting changed like this, in this series. > > Reasoning: the problem with your example is that, by default, secondary > CPUs are disabled in the panic path, through an IPI mechanism. IPIs take > precedence and interrupt the work in these CPUs, effectively > interrupting the "polite work" with the lock held heh The IPI can only interrupt a CPU with irqs disabled if the IPI is an NMI. I haven't looked before to see if we use NMI IPIs to corral the other CPUs on panic. On x86, I grepped my way down to native_stop_other_cpus(), which looks like it does a normal IPI, waits 1 second, then does an NMI IPI. So, if a secondary CPU has the lock held, on x86 it has roughly 1s to finish what it's doing and re-enable interrupts before smp_send_stop() brings the NMI hammer down. I think this should be more than enough time for the secondary CPU to get out and release the lock. So then it makes sense to me that you're fixing cases where we panicked with the lock held, or hung with the lock held. Given the 1 second grace period x86 gives us, I'm on board, as that helps mitigate the risk that we bailed out early with the try and should have spun a bit longer instead. Thanks. -Evan > > Then, such CPU is put to sleep and we finally reach the panic notifier > hereby discussed, in the main CPU. If the other CPU was shut-off *with > the lock held*, it's never finishing such work, so the lock is never to > be released. Conclusion: the spinlock can't be acquired, hence we broke > the machine (which is already broken, given it's panic) in the path of > this notifier. > This should be really rare, but..possible. So I think we should protect > against this scenario. > > We can grab others' feedback if you prefer, and of course you have the > rights to refuse this change in the gsmi code, but from my > point-of-view, I don't see any advantage in just assume the risk, > specially since the change is very very simple. > > Cheers, > > > Guilherme _______________________________________________ linux-um mailing list linux-um@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-um From mboxrd@z Thu Jan 1 00:00:00 1970 From: Evan Green Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path Date: Tue, 3 May 2022 14:56:58 -0700 Message-ID: References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> Mime-Version: 1.0 Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=czLZHU7MG6H1spBLHXY5jaZbzV59tXkMuN+Ck8Z7wc4=; b=hhWjsHnmVba6J3iaRJaO09nZL4LxjIq8C9W91LOfNC4XgTIwgTHymewopsSR7xhrbE wJfMPDofFT9Orm5X1Xmnq6csdeG70AmAjQHsJ6oidSvnu+c5NqT0zTpfwvGt6c7imJaY Yx4JIofQGY7s+yKOtcqFuF1p2zpjcELmsl0YE= In-Reply-To: List-ID: Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: "Guilherme G. Piccoli" Cc: Andrew Morton , bhe-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, pmladek-IBi9RG/b67k@public.gmane.org, kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org, LKML , bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w@public.gmane.org, linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org, linux-alpha-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-hyperv-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-leds-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mips-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-parisc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux PM , linux-remoteproc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-um-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org, linux-xtensa-PjhNF2WwrV/0Sa2dR60CXw@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, openipmi-developer-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, rcu-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, sparclinux-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b@public.gmane.org, x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, kernel-dev-wEGTBA9jqPzQT0dZR+AlfA@public.gmane.org, kernel-WeLdAqEWwDvk1uMJSBkQmQ@public.gmane.org, halv Hi Guilherme, On Tue, May 3, 2022 at 12:12 PM Guilherme G. Piccoli wrote: > > On 03/05/2022 15:03, Evan Green wrote: > > [...] > > gsmi_shutdown_reason() is a common function called in other scenarios > > as well, like reboot and thermal trip, where it may still make sense > > to wait to acquire a spinlock. Maybe we should add a parameter to > > gsmi_shutdown_reason() so that you can get your change on panic, but > > we don't convert other callbacks into try-fail scenarios causing us to > > miss logs. > > > > Hi Evan, thanks for your feedback, much appreciated! > What I've done in other cases like this was to have a helper checking > the spinlock in the panic notifier - if we can acquire that, go ahead > but if not, bail out. For a proper example of an implementation, check > patch 13 of the series: > https://lore.kernel.org/lkml/20220427224924.592546-14-gpiccoli-wEGTBA9jqPzQT0dZR+AlfA@public.gmane.org/ . > > Do you agree with that, or prefer really a parameter in > gsmi_shutdown_reason() ? I'll follow your choice =) I'm fine with either, thanks for the link. Mostly I want to make sure other paths to gsmi_shutdown_reason() aren't also converted to a try. > > > > Though thinking more about it, is this really a Good Change (TM)? The > > spinlock itself already disables interrupts, meaning the only case > > where this change makes a difference is if the panic happens from > > within the function that grabbed the spinlock (in which case the > > callback is also likely to panic), or in an NMI that panics within > > that window. The downside of this change is that if one core was > > politely working through an event with the lock held, and another core > > panics, we now might lose the panic log, even though it probably would > > have gone through fine assuming the other core has a chance to > > continue. > > My feeling is that this is a good change, indeed - a lot of places are > getting changed like this, in this series. > > Reasoning: the problem with your example is that, by default, secondary > CPUs are disabled in the panic path, through an IPI mechanism. IPIs take > precedence and interrupt the work in these CPUs, effectively > interrupting the "polite work" with the lock held heh The IPI can only interrupt a CPU with irqs disabled if the IPI is an NMI. I haven't looked before to see if we use NMI IPIs to corral the other CPUs on panic. On x86, I grepped my way down to native_stop_other_cpus(), which looks like it does a normal IPI, waits 1 second, then does an NMI IPI. So, if a secondary CPU has the lock held, on x86 it has roughly 1s to finish what it's doing and re-enable interrupts before smp_send_stop() brings the NMI hammer down. I think this should be more than enough time for the secondary CPU to get out and release the lock. So then it makes sense to me that you're fixing cases where we panicked with the lock held, or hung with the lock held. Given the 1 second grace period x86 gives us, I'm on board, as that helps mitigate the risk that we bailed out early with the try and should have spun a bit longer instead. Thanks. -Evan > > Then, such CPU is put to sleep and we finally reach the panic notifier > hereby discussed, in the main CPU. If the other CPU was shut-off *with > the lock held*, it's never finishing such work, so the lock is never to > be released. Conclusion: the spinlock can't be acquired, hence we broke > the machine (which is already broken, given it's panic) in the path of > this notifier. > This should be really rare, but..possible. So I think we should protect > against this scenario. > > We can grab others' feedback if you prefer, and of course you have the > rights to refuse this change in the gsmi code, but from my > point-of-view, I don't see any advantage in just assume the risk, > specially since the change is very very simple. > > Cheers, > > > Guilherme