From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 49D26C43217 for ; Wed, 4 May 2022 12:46:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1349953AbiEDMuK (ORCPT ); Wed, 4 May 2022 08:50:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35938 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235732AbiEDMuI (ORCPT ); Wed, 4 May 2022 08:50:08 -0400 Received: from fanzine2.igalia.com (fanzine.igalia.com [178.60.130.6]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 447452CE37; Wed, 4 May 2022 05:46:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:From: References:Cc:To:Subject:MIME-Version:Date:Message-ID:Sender:Reply-To: Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender: Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=UDxTzr/LbFPHIhZF1U36kLmGFtpbBGYhm6tD7CfQBjM=; b=KDrxDW8Mnr1qToRh0yGaDv25Fu TWpgQRScRF3dRjjU3pakegMZLqUHmycIdjATA6DbY60uPfya2zQOAwJ0e5erFJc5BJn7o3eV3sQ41 Tox4yunn4/jA8a3IjyUJT/sI3KcoYtb2p2FNUwcMM3f1g/ex9W0DrY0odKixjZjudGtrtbA3G1JPw m7Nny7DVsQFLh+HpvU/q7Yc8/k1hRGwJeSWIAUBYmqnc5ZdThCMVxiKgVNOsyAMWZdrNHmIgXn40c w6AF2HzVr+D+OvvIusg4XzuOUWHoF2AYp44ZOOo2CAb1BDljEp1qf8qc23J8WoJ9sqhtbxFJxmFVB 8aeUFBag==; Received: from [179.113.53.197] (helo=[192.168.1.60]) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_128_GCM:128) (Exim) id 1nmEOH-0003Ke-Oi; Wed, 04 May 2022 14:46:02 +0200 Message-ID: <9581851d-6c61-a2ef-a3c4-6e2ce05eab12@igalia.com> Date: Wed, 4 May 2022 09:45:31 -0300 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0 Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path Content-Language: en-US To: Evan Green Cc: Andrew Morton , bhe@redhat.com, pmladek@suse.com, kexec@lists.infradead.org, LKML , bcm-kernel-feedback-list@broadcom.com, linuxppc-dev@lists.ozlabs.org, linux-alpha@vger.kernel.org, linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org, linux-leds@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, Linux PM , linux-remoteproc@vger.kernel.org, linux-s390@vger.kernel.org, linux-tegra@vger.kernel.org, linux-um@lists.infradead.org, linux-xtensa@linux-xtensa.org, netdev@vger.kernel.org, openipmi-developer@lists.sourceforge.net, rcu@vger.kernel.org, sparclinux@vger.kernel.org, xen-devel@lists.xenproject.org, x86@kernel.org, kernel-dev@igalia.com, kernel@gpiccoli.net, halves@canonical.com, fabiomirmar@gmail.com, alejandro.j.jimenez@oracle.com, Andy Shevchenko , Arnd Bergmann , Borislav Petkov , Jonathan Corbet , d.hatayama@jp.fujitsu.com, dave.hansen@linux.intel.com, dyoung@redhat.com, feng.tang@intel.com, Greg Kroah-Hartman , mikelley@microsoft.com, hidehiro.kawai.ez@hitachi.com, jgross@suse.com, john.ogness@linutronix.de, Kees Cook , luto@kernel.org, mhiramat@kernel.org, mingo@redhat.com, paulmck@kernel.org, peterz@infradead.org, rostedt@goodmis.org, senozhatsky@chromium.org, Alan Stern , Thomas Gleixner , vgoyal@redhat.com, vkuznets@redhat.com, Will Deacon , Ard Biesheuvel , David Gow , Julius Werner References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> From: "Guilherme G. Piccoli" In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-remoteproc@vger.kernel.org On 03/05/2022 18:56, Evan Green wrote: > Hi Guilherme, > [...] >> Do you agree with that, or prefer really a parameter in >> gsmi_shutdown_reason() ? I'll follow your choice =) > > I'm fine with either, thanks for the link. Mostly I want to make sure > other paths to gsmi_shutdown_reason() aren't also converted to a try. Hi Evan, thanks for the prompt response! So, I'll proceed like I did in s390, for consistency. > [...] >> Reasoning: the problem with your example is that, by default, secondary >> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take >> precedence and interrupt the work in these CPUs, effectively >> interrupting the "polite work" with the lock held heh > > The IPI can only interrupt a CPU with irqs disabled if the IPI is an > NMI. I haven't looked before to see if we use NMI IPIs to corral the > other CPUs on panic. On x86, I grepped my way down to > native_stop_other_cpus(), which looks like it does a normal IPI, waits > 1 second, then does an NMI IPI. So, if a secondary CPU has the lock > held, on x86 it has roughly 1s to finish what it's doing and re-enable > interrupts before smp_send_stop() brings the NMI hammer down. I think > this should be more than enough time for the secondary CPU to get out > and release the lock. > > So then it makes sense to me that you're fixing cases where we > panicked with the lock held, or hung with the lock held. Given the 1 > second grace period x86 gives us, I'm on board, as that helps mitigate > the risk that we bailed out early with the try and should have spun a > bit longer instead. Thanks. > > -Evan Well, in the old path without "crash_kexec_post_notifiers", we indeed end-up relying on native_stop_other_cpus() for x86 as you said, and the "1s rule" makes sense. But after this series (or even before, if the kernel parameter "crash_kexec_post_notifiers" was used) the function used to stop CPUs in the panic path is crash_smp_send_stop(), and the call chain is like: Main CPU: crash_smp_send_stop() --kdump_nmi_shootdown_cpus() ----nmi_shootdown_cpus() Then, in each CPU (except the main one, running panic() path), we execute kdump_nmi_callback() in NMI context. So, we seem to indeed interrupt any context (even with IRQs disabled), increasing the likelihood of the potential lockups due to stopped CPUs holding the locks heheh Thanks again for the good discussion, let me know if anything I'm saying doesn't make sense - this crash path is a bit convoluted, specially in x86, I might have understood something wrongly =) Cheers, Guilherme From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.ozlabs.org (lists.ozlabs.org [112.213.38.117]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B5B73C433EF for ; Wed, 4 May 2022 20:18:36 +0000 (UTC) Received: from boromir.ozlabs.org (localhost [IPv6:::1]) by lists.ozlabs.org (Postfix) with ESMTP id 4Ktp7H32t8z3brL for ; Thu, 5 May 2022 06:18:35 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=igalia.com header.i=@igalia.com header.a=rsa-sha256 header.s=20170329 header.b=KDrxDW8M; dkim-atps=neutral Authentication-Results: lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=igalia.com (client-ip=178.60.130.6; helo=fanzine2.igalia.com; envelope-from=gpiccoli@igalia.com; receiver=) Authentication-Results: lists.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=igalia.com header.i=@igalia.com header.a=rsa-sha256 header.s=20170329 header.b=KDrxDW8M; dkim-atps=neutral Received: from fanzine2.igalia.com (fanzine.igalia.com [178.60.130.6]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4Ktc5q3rMlz3bWM for ; Wed, 4 May 2022 22:46:37 +1000 (AEST) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:From: References:Cc:To:Subject:MIME-Version:Date:Message-ID:Sender:Reply-To: Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender: Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=UDxTzr/LbFPHIhZF1U36kLmGFtpbBGYhm6tD7CfQBjM=; b=KDrxDW8Mnr1qToRh0yGaDv25Fu TWpgQRScRF3dRjjU3pakegMZLqUHmycIdjATA6DbY60uPfya2zQOAwJ0e5erFJc5BJn7o3eV3sQ41 Tox4yunn4/jA8a3IjyUJT/sI3KcoYtb2p2FNUwcMM3f1g/ex9W0DrY0odKixjZjudGtrtbA3G1JPw m7Nny7DVsQFLh+HpvU/q7Yc8/k1hRGwJeSWIAUBYmqnc5ZdThCMVxiKgVNOsyAMWZdrNHmIgXn40c w6AF2HzVr+D+OvvIusg4XzuOUWHoF2AYp44ZOOo2CAb1BDljEp1qf8qc23J8WoJ9sqhtbxFJxmFVB 8aeUFBag==; Received: from [179.113.53.197] (helo=[192.168.1.60]) by fanzine2.igalia.com with esmtpsa (Cipher TLS1.3:ECDHE_X25519__RSA_PSS_RSAE_SHA256__AES_128_GCM:128) (Exim) id 1nmEOH-0003Ke-Oi; Wed, 04 May 2022 14:46:02 +0200 Message-ID: <9581851d-6c61-a2ef-a3c4-6e2ce05eab12@igalia.com> Date: Wed, 4 May 2022 09:45:31 -0300 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.0 Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path Content-Language: en-US To: Evan Green References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> From: "Guilherme G. Piccoli" In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Mailman-Approved-At: Thu, 05 May 2022 06:17:57 +1000 X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: linux-hyperv@vger.kernel.org, halves@canonical.com, David Gow , linux-xtensa@linux-xtensa.org, peterz@infradead.org, alejandro.j.jimenez@oracle.com, linux-remoteproc@vger.kernel.org, feng.tang@intel.com, linux-mips@vger.kernel.org, hidehiro.kawai.ez@hitachi.com, sparclinux@vger.kernel.org, Will Deacon , Thomas Gleixner , linux-leds@vger.kernel.org, linux-s390@vger.kernel.org, mikelley@microsoft.com, john.ogness@linutronix.de, bhe@redhat.com, Jonathan Corbet , paulmck@kernel.org, fabiomirmar@gmail.com, x86@kernel.org, Ard Biesheuvel , mingo@redhat.com, bcm-kernel-feedback-list@broadcom.com, xen-devel@lists.xenproject.org, dyoung@redhat.com, vgoyal@redhat.com, pmladek@suse.com, dave.hansen@linux.intel.com, Kees Cook , Arnd Bergmann , Linux PM , linux-um@lists.infradead.org, rostedt@goodmis.org, rcu@vger.kernel.org, Greg Kroah-Hartman , Borislav Petkov , luto@kernel.org, linux-tegra@vger.kernel.org, openipmi-developer@lists.sourceforge.net, Andy Shevchenko , vkuznets@redhat.com, linux-edac@vger.kernel.org, jgross@suse.com, linux-parisc@vger.kernel.org, netdev@vger.kernel.org, kernel@gpiccoli.net, kexec@lists.infradead.org, LKML , Alan Stern , senozhatsky@chromium.org, d.hatayama@jp.fujitsu.com, mhiramat@kernel.org, kernel-dev@igalia.com, linux-alpha@vger.kernel.org, Julius Werner , Andrew Morton , linuxppc-dev@lists.ozlabs.org Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" On 03/05/2022 18:56, Evan Green wrote: > Hi Guilherme, > [...] >> Do you agree with that, or prefer really a parameter in >> gsmi_shutdown_reason() ? I'll follow your choice =) > > I'm fine with either, thanks for the link. Mostly I want to make sure > other paths to gsmi_shutdown_reason() aren't also converted to a try. Hi Evan, thanks for the prompt response! So, I'll proceed like I did in s390, for consistency. > [...] >> Reasoning: the problem with your example is that, by default, secondary >> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take >> precedence and interrupt the work in these CPUs, effectively >> interrupting the "polite work" with the lock held heh > > The IPI can only interrupt a CPU with irqs disabled if the IPI is an > NMI. I haven't looked before to see if we use NMI IPIs to corral the > other CPUs on panic. On x86, I grepped my way down to > native_stop_other_cpus(), which looks like it does a normal IPI, waits > 1 second, then does an NMI IPI. So, if a secondary CPU has the lock > held, on x86 it has roughly 1s to finish what it's doing and re-enable > interrupts before smp_send_stop() brings the NMI hammer down. I think > this should be more than enough time for the secondary CPU to get out > and release the lock. > > So then it makes sense to me that you're fixing cases where we > panicked with the lock held, or hung with the lock held. Given the 1 > second grace period x86 gives us, I'm on board, as that helps mitigate > the risk that we bailed out early with the try and should have spun a > bit longer instead. Thanks. > > -Evan Well, in the old path without "crash_kexec_post_notifiers", we indeed end-up relying on native_stop_other_cpus() for x86 as you said, and the "1s rule" makes sense. But after this series (or even before, if the kernel parameter "crash_kexec_post_notifiers" was used) the function used to stop CPUs in the panic path is crash_smp_send_stop(), and the call chain is like: Main CPU: crash_smp_send_stop() --kdump_nmi_shootdown_cpus() ----nmi_shootdown_cpus() Then, in each CPU (except the main one, running panic() path), we execute kdump_nmi_callback() in NMI context. So, we seem to indeed interrupt any context (even with IRQs disabled), increasing the likelihood of the potential lockups due to stopped CPUs holding the locks heheh Thanks again for the good discussion, let me know if anything I'm saying doesn't make sense - this crash path is a bit convoluted, specially in x86, I might have understood something wrongly =) Cheers, Guilherme From mboxrd@z Thu Jan 1 00:00:00 1970 From: Guilherme G. Piccoli Date: Wed, 4 May 2022 09:45:31 -0300 Subject: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path In-Reply-To: References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> Message-ID: <9581851d-6c61-a2ef-a3c4-6e2ce05eab12@igalia.com> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: kexec@lists.infradead.org On 03/05/2022 18:56, Evan Green wrote: > Hi Guilherme, > [...] >> Do you agree with that, or prefer really a parameter in >> gsmi_shutdown_reason() ? I'll follow your choice =) > > I'm fine with either, thanks for the link. Mostly I want to make sure > other paths to gsmi_shutdown_reason() aren't also converted to a try. Hi Evan, thanks for the prompt response! So, I'll proceed like I did in s390, for consistency. > [...] >> Reasoning: the problem with your example is that, by default, secondary >> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take >> precedence and interrupt the work in these CPUs, effectively >> interrupting the "polite work" with the lock held heh > > The IPI can only interrupt a CPU with irqs disabled if the IPI is an > NMI. I haven't looked before to see if we use NMI IPIs to corral the > other CPUs on panic. On x86, I grepped my way down to > native_stop_other_cpus(), which looks like it does a normal IPI, waits > 1 second, then does an NMI IPI. So, if a secondary CPU has the lock > held, on x86 it has roughly 1s to finish what it's doing and re-enable > interrupts before smp_send_stop() brings the NMI hammer down. I think > this should be more than enough time for the secondary CPU to get out > and release the lock. > > So then it makes sense to me that you're fixing cases where we > panicked with the lock held, or hung with the lock held. Given the 1 > second grace period x86 gives us, I'm on board, as that helps mitigate > the risk that we bailed out early with the try and should have spun a > bit longer instead. Thanks. > > -Evan Well, in the old path without "crash_kexec_post_notifiers", we indeed end-up relying on native_stop_other_cpus() for x86 as you said, and the "1s rule" makes sense. But after this series (or even before, if the kernel parameter "crash_kexec_post_notifiers" was used) the function used to stop CPUs in the panic path is crash_smp_send_stop(), and the call chain is like: Main CPU: crash_smp_send_stop() --kdump_nmi_shootdown_cpus() ----nmi_shootdown_cpus() Then, in each CPU (except the main one, running panic() path), we execute kdump_nmi_callback() in NMI context. So, we seem to indeed interrupt any context (even with IRQs disabled), increasing the likelihood of the potential lockups due to stopped CPUs holding the locks heheh Thanks again for the good discussion, let me know if anything I'm saying doesn't make sense - this crash path is a bit convoluted, specially in x86, I might have understood something wrongly =) Cheers, Guilherme From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <9581851d-6c61-a2ef-a3c4-6e2ce05eab12@igalia.com> Date: Wed, 4 May 2022 09:45:31 -0300 MIME-Version: 1.0 Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path Content-Language: en-US References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> From: "Guilherme G. Piccoli" In-Reply-To: List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-um" Errors-To: linux-um-bounces+geert=linux-m68k.org@lists.infradead.org To: Evan Green Cc: Andrew Morton , bhe@redhat.com, pmladek@suse.com, kexec@lists.infradead.org, LKML , bcm-kernel-feedback-list@broadcom.com, linuxppc-dev@lists.ozlabs.org, linux-alpha@vger.kernel.org, linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org, linux-leds@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, Linux PM , linux-remoteproc@vger.kernel.org, linux-s390@vger.kernel.org, linux-tegra@vger.kernel.org, linux-um@lists.infradead.org, linux-xtensa@linux-xtensa.org, netdev@vger.kernel.org, openipmi-developer@lists.sourceforge.net, rcu@vger.kernel.org, sparclinux@vger.kernel.org, xen-devel@lists.xenproject.org, x86@kernel.org, kernel-dev@igalia.com, kernel@gpiccoli.net, halves@canonical.com, fabiomirmar@gmail.com, alejandro.j.jimenez@oracle.com, Andy Shevchenko , Arnd Bergmann , Borislav Petkov , Jonathan Corbet , d.hatayama@jp.fujitsu.com, dave.hansen@linux.intel.com, dyoung@redhat.com, feng.tang@intel.com, Greg Kroah-Hartman , mikelley@microsoft.com, hidehiro.kawai.ez@hitachi.com, jgross@suse.com, john.ogness@linutronix.de, Kees Cook , luto@kernel.org, mhiramat@kernel.org, mingo@redhat.com, paulmck@kernel.org, peterz@infradead.org, rostedt@goodmis.org, senozhatsky@chromium.org, Alan Stern , Thomas Gleixner , vgoyal@redhat.com, vkuznets@redhat.com, Will Deacon , Ard Biesheuvel , David Gow , Julius Werner On 03/05/2022 18:56, Evan Green wrote: > Hi Guilherme, > [...] >> Do you agree with that, or prefer really a parameter in >> gsmi_shutdown_reason() ? I'll follow your choice =) > > I'm fine with either, thanks for the link. Mostly I want to make sure > other paths to gsmi_shutdown_reason() aren't also converted to a try. Hi Evan, thanks for the prompt response! So, I'll proceed like I did in s390, for consistency. > [...] >> Reasoning: the problem with your example is that, by default, secondary >> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take >> precedence and interrupt the work in these CPUs, effectively >> interrupting the "polite work" with the lock held heh > > The IPI can only interrupt a CPU with irqs disabled if the IPI is an > NMI. I haven't looked before to see if we use NMI IPIs to corral the > other CPUs on panic. On x86, I grepped my way down to > native_stop_other_cpus(), which looks like it does a normal IPI, waits > 1 second, then does an NMI IPI. So, if a secondary CPU has the lock > held, on x86 it has roughly 1s to finish what it's doing and re-enable > interrupts before smp_send_stop() brings the NMI hammer down. I think > this should be more than enough time for the secondary CPU to get out > and release the lock. > > So then it makes sense to me that you're fixing cases where we > panicked with the lock held, or hung with the lock held. Given the 1 > second grace period x86 gives us, I'm on board, as that helps mitigate > the risk that we bailed out early with the try and should have spun a > bit longer instead. Thanks. > > -Evan Well, in the old path without "crash_kexec_post_notifiers", we indeed end-up relying on native_stop_other_cpus() for x86 as you said, and the "1s rule" makes sense. But after this series (or even before, if the kernel parameter "crash_kexec_post_notifiers" was used) the function used to stop CPUs in the panic path is crash_smp_send_stop(), and the call chain is like: Main CPU: crash_smp_send_stop() --kdump_nmi_shootdown_cpus() ----nmi_shootdown_cpus() Then, in each CPU (except the main one, running panic() path), we execute kdump_nmi_callback() in NMI context. So, we seem to indeed interrupt any context (even with IRQs disabled), increasing the likelihood of the potential lockups due to stopped CPUs holding the locks heheh Thanks again for the good discussion, let me know if anything I'm saying doesn't make sense - this crash path is a bit convoluted, specially in x86, I might have understood something wrongly =) Cheers, Guilherme _______________________________________________ linux-um mailing list linux-um@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-um From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Guilherme G. Piccoli" Subject: Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path Date: Wed, 4 May 2022 09:45:31 -0300 Message-ID: <9581851d-6c61-a2ef-a3c4-6e2ce05eab12@igalia.com> References: <20220427224924.592546-1-gpiccoli@igalia.com> <20220427224924.592546-5-gpiccoli@igalia.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=igalia.com; s=20170329; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:From: References:Cc:To:Subject:MIME-Version:Date:Message-ID:Sender:Reply-To: Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender: Resent-To:Resent-Cc:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe: List-Subscribe:List-Post:List-Owner:List-Archive; bh=UDxTzr/LbFPHIhZF1U36kLmGFtpbBGYhm6tD7CfQBjM=; b=KDrxDW8Mnr1qToRh0yGaDv25Fu TWpgQRScRF3dRjjU3pakegMZLqUHmycIdjATA6DbY60uPfya2zQOAwJ0e5erFJc5BJn7o3eV3sQ41 Tox4yunn4/jA8a3IjyUJT/sI3KcoYtb2p2FNUwcMM3f1g/ex9W0DrY0odKixjZjudGtrtbA3G1JPw m7Nny7DVsQFLh+HpvU/q7Yc8/k1hRGwJeSWIAUBYmqnc5ZdThCMVxiKgVNOsyAMWZdrNHmIgXn40c w6AF2HzVr+D+OvvIusg4XzuOUWHoF2AYp44ZOOo2CAb1BDljEp1qf8qc23J8WoJ9sqhtbxFJxmFVB 8aeUFBa Content-Language: en-US In-Reply-To: List-ID: Content-Type: text/plain; charset="us-ascii" To: Evan Green Cc: Andrew Morton , bhe-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, pmladek-IBi9RG/b67k@public.gmane.org, kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org, LKML , bcm-kernel-feedback-list-dY08KVG/lbpWk0Htik3J/w@public.gmane.org, linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org, linux-alpha-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-edac-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-hyperv-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-leds-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mips-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-parisc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux PM , linux-remoteproc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-um-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org, linux-xtensa-PjhNF2WwrV/0Sa2dR60CXw@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, openipmi-developer-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, rcu-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, sparclinux-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b@public.gmane.org, x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, kernel-dev-wEGTBA9jqPzQT0dZR+AlfA@public.gmane.org, kernel-WeLdAqEWwDvk1uMJSBkQmQ@public.gmane.org, halv On 03/05/2022 18:56, Evan Green wrote: > Hi Guilherme, > [...] >> Do you agree with that, or prefer really a parameter in >> gsmi_shutdown_reason() ? I'll follow your choice =) > > I'm fine with either, thanks for the link. Mostly I want to make sure > other paths to gsmi_shutdown_reason() aren't also converted to a try. Hi Evan, thanks for the prompt response! So, I'll proceed like I did in s390, for consistency. > [...] >> Reasoning: the problem with your example is that, by default, secondary >> CPUs are disabled in the panic path, through an IPI mechanism. IPIs take >> precedence and interrupt the work in these CPUs, effectively >> interrupting the "polite work" with the lock held heh > > The IPI can only interrupt a CPU with irqs disabled if the IPI is an > NMI. I haven't looked before to see if we use NMI IPIs to corral the > other CPUs on panic. On x86, I grepped my way down to > native_stop_other_cpus(), which looks like it does a normal IPI, waits > 1 second, then does an NMI IPI. So, if a secondary CPU has the lock > held, on x86 it has roughly 1s to finish what it's doing and re-enable > interrupts before smp_send_stop() brings the NMI hammer down. I think > this should be more than enough time for the secondary CPU to get out > and release the lock. > > So then it makes sense to me that you're fixing cases where we > panicked with the lock held, or hung with the lock held. Given the 1 > second grace period x86 gives us, I'm on board, as that helps mitigate > the risk that we bailed out early with the try and should have spun a > bit longer instead. Thanks. > > -Evan Well, in the old path without "crash_kexec_post_notifiers", we indeed end-up relying on native_stop_other_cpus() for x86 as you said, and the "1s rule" makes sense. But after this series (or even before, if the kernel parameter "crash_kexec_post_notifiers" was used) the function used to stop CPUs in the panic path is crash_smp_send_stop(), and the call chain is like: Main CPU: crash_smp_send_stop() --kdump_nmi_shootdown_cpus() ----nmi_shootdown_cpus() Then, in each CPU (except the main one, running panic() path), we execute kdump_nmi_callback() in NMI context. So, we seem to indeed interrupt any context (even with IRQs disabled), increasing the likelihood of the potential lockups due to stopped CPUs holding the locks heheh Thanks again for the good discussion, let me know if anything I'm saying doesn't make sense - this crash path is a bit convoluted, specially in x86, I might have understood something wrongly =) Cheers, Guilherme