From: James Morse <james.morse@arm.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Rafael Wysocki <rjw@rjwysocki.net>,
	Tony Luck <tony.luck@intel.com>, Xie XiuQi <xiexiuqi@huawei.com>,
	linux-mm@kvack.org, Marc Zyngier <marc.zyngier@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Punit Agrawal <punit.agrawal@arm.com>,
	Will Deacon <will.deacon@arm.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Dongjiu Geng <gengdongjiu@huawei.com>,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	kvmarm@lists.cs.columbia.edu,
	Christoffer Dall <christoffer.dall@linaro.org>,
	Len Brown <lenb@kernel.org>
Subject: Re: [PATCH 02/11] ACPI / APEI: Generalise the estatus queue's add/remove and notify code
Date: Wed, 28 Mar 2018 17:30:55 +0100
Message-ID: <d15ba145-479c-ffde-14ad-ab7170d0f06e@arm.com>
In-Reply-To: <20180327172510.GB32184@pd.tnic>

Hi Borislav,

On 27/03/18 18:25, Borislav Petkov wrote:
> On Mon, Mar 19, 2018 at 02:29:13PM +0000, James Morse wrote:
>> I don't think the die_lock really helps here, do we really want to wait for a
>> remote CPU to finish printing an OOPs about user-space's bad memory accesses,
>> before we bring the machine down due to this system-wide fatal RAS error? The
>> presence of firmware-first means we know this error, and any other oops are
>> unrelated.
> 
> Hmm, now that you put it this way...

>> I'd like to leave this under the x86-ifdef for now. For arm64 it would be an
>> APEI specific arch hook to stop the arch code from printing some messages,
> 
> ... I'm thinking we should ignore the whole serializing of oopses and
> really dump that hw error ASAP. If it really is a fatal error, our main
> and only goal is to get it out as fast as possible so that it has the
> highest chance to appear on some screen or logging facility and thus the
> system can be serviced successfully.
> 
> And the other oopses have lower prio.

> Hmmm?

Yes, I agree. With firmware-first, an error that the firmware takes first and
then reports to us by NMI, causing us to panic(), must be a higher priority than
another oops.

I'll add a patch[0] to v3 making this argument and removing the #ifdef'd
oops_begin().


Thanks,

James


[0]
-----------------%<-----------------
    ACPI / APEI: don't wait to serialise with oops messages when panic()ing

    oops_begin() exists to group printk() messages with the oops message
    printed by die(). To reach this caller we know that platform firmware
    took this error first, then notified the OS via NMI with a 'panic'
    severity.

    Don't wait for another CPU to release the die-lock before we can
    panic(). Our only goal is to print this fatal error and panic().

    This code is always called in_nmi(), and since 42a0bb3f7138 ("printk/nmi:
    generic solution for safe printk in NMI"), it has been safe to call
    printk() from this context. Messages are batched in a per-cpu buffer
    and printed via irq-work, or via a callback from panic().

    Signed-off-by: James Morse <james.morse@arm.com>

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 22f6ea5b9ad5..f348e6540960 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -34,7 +34,6 @@
 #include <linux/interrupt.h>
 #include <linux/timer.h>
 #include <linux/cper.h>
-#include <linux/kdebug.h>
 #include <linux/platform_device.h>
 #include <linux/mutex.h>
 #include <linux/ratelimit.h>
@@ -736,9 +735,6 @@ static int _in_nmi_notify_one(struct ghes *ghes)

        sev = ghes_severity(ghes->estatus->error_severity);
        if (sev >= GHES_SEV_PANIC) {
-#ifdef CONFIG_X86
-               oops_begin();
-#endif
                ghes_print_queued_estatus();
                __ghes_panic(ghes);
        }
-----------------%<-----------------
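
To illustrate the batching that the commit message relies on, here is a rough
user-space toy model. It is not the kernel's printk code: the names
model_nmi_printk(), model_flush_all(), NR_CPUS and NMI_BUF_SIZE are all
hypothetical and chosen only for this sketch. The point it shows is that a
message written in NMI context only touches that CPU's own buffer, and the
text reaches the shared output later, when something (irq-work or panic() in
the real kernel) flushes the buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS      4    /* hypothetical CPU count for this model */
#define NMI_BUF_SIZE 256  /* hypothetical per-CPU buffer size */

/* Toy stand-in for a per-CPU buffer that holds NMI-context messages. */
struct nmi_buf {
	char data[NMI_BUF_SIZE];
	size_t len;
};

static struct nmi_buf nmi_bufs[NR_CPUS];

/*
 * Model of printing from NMI context: append to this CPU's buffer only,
 * never touch the shared output and never wait for another CPU.
 * Returns the number of bytes stored (the message is truncated if full).
 */
static size_t model_nmi_printk(int cpu, const char *msg)
{
	struct nmi_buf *b;
	size_t room, n;

	assert(cpu >= 0 && cpu < NR_CPUS);
	b = &nmi_bufs[cpu];
	room = NMI_BUF_SIZE - 1 - b->len;
	n = strlen(msg);
	if (n > room)
		n = room;
	memcpy(b->data + b->len, msg, n);
	b->len += n;
	b->data[b->len] = '\0';
	return n;
}

/*
 * Model of the later flush: copy every CPU's batched text into 'out',
 * in CPU order, and empty the buffers. Returns the bytes written,
 * not counting the terminating NUL.
 */
static size_t model_flush_all(char *out, size_t out_size)
{
	size_t used = 0;
	int cpu;

	if (out_size == 0)
		return 0;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct nmi_buf *b = &nmi_bufs[cpu];
		size_t n = b->len;

		if (n > out_size - 1 - used)
			n = out_size - 1 - used;
		memcpy(out + used, b->data, n);
		used += n;
		b->len = 0;
		b->data[0] = '\0';
	}
	out[used] = '\0';
	return used;
}
```

In this model the writer never takes a lock shared with another CPU, which is
the property that makes it reasonable to drop the oops_begin() wait on the
panic path.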
