From: James Morse <james.morse@arm.com>
To: 乱石 <zhangliguang@linux.alibaba.com>
Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
    Tony Luck <tony.luck@intel.com>, linux-arm-kernel@lists.infradead.org,
    Borislav Petkov <bp@alien8.de>, Len Brown <lenb@kernel.org>,
    "Rafael J. Wysocki" <rafael@kernel.org>, huangming@linux.alibaba.com
Subject: Re: [PATCH V2] ACPI / APEI: restore interrupt before panic in sdei flow
Date: Mon, 18 Oct 2021 18:21:29 +0100
Message-ID: <ddefdfea-a8ca-8735-0f81-8b2748587da5@arm.com>
In-Reply-To: <f8e73ed7-f45f-0f5d-9055-486fb83dcd82@linux.alibaba.com>

Hi Liguang,

On 14/10/2021 15:18, 乱石 wrote:
> On 2021/10/14 1:44, James Morse wrote:
>> On 12/10/2021 15:29, Liguang Zhang wrote:
>>> When the HEST ACPI table configures the Hardware Error Notification type as
>>> Software Delegated Exception (0x0B) for RAS events, OS RAS interacts with
>>> ATF via the SDEI mechanism. On a firmware-first system, the OS is notified by
>>> an ATF SDEI call.
>>> If a fatal RAS error occurred, panic was called in sdei_asm_handle()
>>> without ehf_deactivate_priority being executed, which leaves interrupts masked.

>> So far the story is:
>> Firmware generated an SDEI event (a kind of software NMI) because of a firmware
>> interrupt, but it hasn't completely handled the interrupt.
>>
>>
>>> If interrupts are masked, the system halts in the kdump flow like this:
>>>
>>> arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for cmdq
>>> arm-smmu-v3 arm-smmu-v3.3.auto: allocated 32768 entries for evtq
>>> arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for priq
>>> arm-smmu-v3 arm-smmu-v3.3.auto: SMMU currently enabled! Resetting...

>> How and why do firmware interrupts affect the IOMMU?

[...]

>> Could you debug why firmware interrupts being active prevent the SMMU from being reset? As
>> far as I can tell, those should be totally independent.

> If ehf_deactivate_priority() was not executed, the pmr_el1 register was not restored to >0x80,
> which leaves non-secure interrupts masked. arm_smmu_device_probe() eventually called
> usleep_range(), which is based on hrtimer. Because non-secure timer interrupts were masked,
> usleep_range() would not respond.

Aha! So nothing to do with the SMMU at all.
Your firmware has 'disabled' the interrupt by moving the CPU's priority mask so that no
interrupts at all can be taken.

I still think this is best fixed in firmware. Papering over the problem here is not enough,
as the handler may encounter memory corruption, take an exception, and panic() from some
other part of the kernel. It's RAS - we know something has gone wrong before we get to this
point. The OS needs to be able to call panic() at any point in time.

Your firmware should not deny the normal-world interrupts like this. Please either complete
the interrupt handling before calling into the normal world, or disable it if you need the
interrupt to not fire again. If the device that triggers the interrupt doesn't have a
disable, there are hardware registers in the GIC to do this.

(I don't know how TFA works here, it may be a bug in the upstream code)


Thanks,

James
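To make the failure mode discussed above concrete, here is a toy model of the priority-mask behaviour. It is a sketch only, not kernel or TF-A code. It relies on the GIC rule that a numerically lower value means a higher priority and that an interrupt is signalled only if its priority is strictly higher than the CPU's priority mask (PMR). The function names and the specific values (a timer interrupt at priority 0xA0, a mask left at 0x80, a restored mask of 0xF0) are assumptions chosen for illustration; only the 0x80 threshold comes from the thread.

```python
def is_signalled(irq_priority: int, pmr: int) -> bool:
    """GIC-style priority masking: lower value = higher priority.

    An interrupt reaches the CPU only if its priority is strictly
    higher (numerically lower) than the priority mask.
    """
    return irq_priority < pmr


# Hypothetical values, for illustration only.
NS_TIMER_PRIORITY = 0xA0     # assumed priority of the non-secure timer interrupt
PMR_LEFT_BY_FIRMWARE = 0x80  # mask never raised back above 0x80
PMR_RESTORED = 0xF0          # assumed mask once the priority is deactivated


def sleep_completes(pmr: int) -> bool:
    """A timer-based sleep can only finish if the timer interrupt is taken."""
    return is_signalled(NS_TIMER_PRIORITY, pmr)


if __name__ == "__main__":
    print("mask left by firmware:", sleep_completes(PMR_LEFT_BY_FIRMWARE))
    print("mask restored:        ", sleep_completes(PMR_RESTORED))
```

With the mask stuck at 0x80 the timer interrupt is never signalled, so any sleep that waits on it cannot complete, regardless of which driver happens to call it first. That is why the hang shows up in the SMMU probe without being caused by the SMMU.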