linux-kernel.vger.kernel.org archive mirror
* [PATCH] ACPI/APEI: Clear GHES block_status before panic()
@ 2018-12-19 16:50 David Arcari
  2018-12-20 18:28 ` Tyler Baicar
  2018-12-20 19:24 ` Borislav Petkov
  0 siblings, 2 replies; 6+ messages in thread
From: David Arcari @ 2018-12-19 16:50 UTC (permalink / raw)
  To: Linux ACPI
  Cc: Lenny Szubowicz, David Arcari, Rafael J. Wysocki, Len Brown,
	Tony Luck, Borislav Petkov, Eric W. Biederman, Alexandru Gagniuc,
	linux-kernel

From: Lenny Szubowicz <lszubowi@redhat.com>

In __ghes_panic() clear the block status in the APEI generic
error status block for that generic hardware error source before
calling panic() to prevent a second panic() in the crash kernel
for exactly the same fatal error.

Otherwise ghes_probe(), running in the crash kernel, would see
an unhandled error in the APEI generic error status block and
panic again, thereby precluding any crash dump.

Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
Signed-off-by: David Arcari <darcari@redhat.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Cc: linux-kernel@vger.kernel.org
---
 drivers/acpi/apei/ghes.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 02c6fd9..f008ba7 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
 {
 	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
 
+	ghes_clear_estatus(ghes);
+
 	/* reboot to log the error! */
 	if (!panic_timeout)
 		panic_timeout = ghes_panic_timeout;
-- 
1.8.3.1
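
The failure mode in the commit message can be sketched in user space. This is a toy model, not kernel code: it assumes the error status block survives the kexec into the crash kernel, and every name in it (`sim_estatus`, `sim_first_kernel_panic`, `sim_crash_kernel_probe`, `sim_second_panic`) is hypothetical. The real logic lives in `__ghes_panic()`, `ghes_clear_estatus()` and `ghes_probe()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the firmware-owned generic error status block. */
struct sim_estatus {
	uint32_t block_status;	/* non-zero: an error record is pending */
};

/* Hypothetical model of the first kernel handling a fatal error. */
static void sim_first_kernel_panic(struct sim_estatus *fw, bool clear_first)
{
	assert(fw);
	if (clear_first)
		fw->block_status = 0;	/* the step the patch adds */
	/* panic() and the jump into the crash kernel would follow here */
}

/* Hypothetical model of the crash kernel's probe; true means it panics. */
static bool sim_crash_kernel_probe(const struct sim_estatus *fw)
{
	assert(fw);
	return fw->block_status != 0;
}

/* Returns true if the crash kernel would panic on the stale record. */
static bool sim_second_panic(bool clear_first)
{
	struct sim_estatus fw = { .block_status = 1 };

	sim_first_kernel_panic(&fw, clear_first);
	return sim_crash_kernel_probe(&fw);
}
```

In this model `sim_second_panic(false)` reports a second panic, while `sim_second_panic(true)` does not, which is the behavior the patch aims for.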



* Re: [PATCH] ACPI/APEI: Clear GHES block_status before panic()
  2018-12-19 16:50 [PATCH] ACPI/APEI: Clear GHES block_status before panic() David Arcari
@ 2018-12-20 18:28 ` Tyler Baicar
  2018-12-20 19:24 ` Borislav Petkov
  1 sibling, 0 replies; 6+ messages in thread
From: Tyler Baicar @ 2018-12-20 18:28 UTC (permalink / raw)
  To: David Arcari
  Cc: Linux ACPI, Lenny Szubowicz, Rafael J. Wysocki, Len Brown,
	Tony Luck, Borislav Petkov, Eric W. Biederman, Alexandru Gagniuc,
	Linux Kernel Mailing List

On Wed, Dec 19, 2018 at 1:33 PM David Arcari <darcari@redhat.com> wrote:
>
> From: Lenny Szubowicz <lszubowi@redhat.com>
>
> In __ghes_panic() clear the block status in the APEI generic
> error status block for that generic hardware error source before
> calling panic() to prevent a second panic() in the crash kernel
> for exactly the same fatal error.
>
> Otherwise ghes_probe(), running in the crash kernel, would see
> an unhandled error in the APEI generic error status block and
> panic again, thereby precluding any crash dump.
>
> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
> Signed-off-by: David Arcari <darcari@redhat.com>

Good catch!

Tested-by: Tyler Baicar <baicar.tyler@gmail.com>


* Re: [PATCH] ACPI/APEI: Clear GHES block_status before panic()
  2018-12-19 16:50 [PATCH] ACPI/APEI: Clear GHES block_status before panic() David Arcari
  2018-12-20 18:28 ` Tyler Baicar
@ 2018-12-20 19:24 ` Borislav Petkov
  2018-12-21 11:17   ` Rafael J. Wysocki
  1 sibling, 1 reply; 6+ messages in thread
From: Borislav Petkov @ 2018-12-20 19:24 UTC (permalink / raw)
  To: David Arcari
  Cc: Linux ACPI, Lenny Szubowicz, Rafael J. Wysocki, Len Brown,
	Tony Luck, Eric W. Biederman, Alexandru Gagniuc, linux-kernel,
	James Morse

+ James.

On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
> From: Lenny Szubowicz <lszubowi@redhat.com>
> 
> In __ghes_panic() clear the block status in the APEI generic
> error status block for that generic hardware error source before
> calling panic() to prevent a second panic() in the crash kernel
> for exactly the same fatal error.
> 
> Otherwise ghes_probe(), running in the crash kernel, would see
> an unhandled error in the APEI generic error status block and
> panic again, thereby precluding any crash dump.
> 
> Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
> Signed-off-by: David Arcari <darcari@redhat.com>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> Cc: linux-kernel@vger.kernel.org
> ---
>  drivers/acpi/apei/ghes.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 02c6fd9..f008ba7 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
>  {
>  	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>  
> +	ghes_clear_estatus(ghes);
> +
>  	/* reboot to log the error! */
>  	if (!panic_timeout)
>  		panic_timeout = ghes_panic_timeout;
> -- 

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] ACPI/APEI: Clear GHES block_status before panic()
  2018-12-20 19:24 ` Borislav Petkov
@ 2018-12-21 11:17   ` Rafael J. Wysocki
  2018-12-21 18:52     ` James Morse
  0 siblings, 1 reply; 6+ messages in thread
From: Rafael J. Wysocki @ 2018-12-21 11:17 UTC (permalink / raw)
  To: Borislav Petkov, David Arcari
  Cc: Linux ACPI, Lenny Szubowicz, Len Brown, Tony Luck,
	Eric W. Biederman, Alexandru Gagniuc, linux-kernel, James Morse

On Thursday, December 20, 2018 8:24:47 PM CET Borislav Petkov wrote:
> + James.
> 
> On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
> > From: Lenny Szubowicz <lszubowi@redhat.com>
> > 
> > In __ghes_panic() clear the block status in the APEI generic
> > error status block for that generic hardware error source before
> > calling panic() to prevent a second panic() in the crash kernel
> > for exactly the same fatal error.
> > 
> > Otherwise ghes_probe(), running in the crash kernel, would see
> > an unhandled error in the APEI generic error status block and
> > panic again, thereby precluding any crash dump.
> > 
> > Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
> > Signed-off-by: David Arcari <darcari@redhat.com>
> > Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> > Cc: Len Brown <lenb@kernel.org>
> > Cc: Tony Luck <tony.luck@intel.com>
> > Cc: Borislav Petkov <bp@alien8.de>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> > Cc: linux-kernel@vger.kernel.org
> > ---
> >  drivers/acpi/apei/ghes.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 02c6fd9..f008ba7 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
> >  {
> >  	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
> >  
> > +	ghes_clear_estatus(ghes);
> > +
> >  	/* reboot to log the error! */
> >  	if (!panic_timeout)
> >  		panic_timeout = ghes_panic_timeout;
> 
> Acked-by: Borislav Petkov <bp@suse.de>

Patch applied, thanks!



* Re: [PATCH] ACPI/APEI: Clear GHES block_status before panic()
  2018-12-21 11:17   ` Rafael J. Wysocki
@ 2018-12-21 18:52     ` James Morse
  2018-12-21 18:59       ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: James Morse @ 2018-12-21 18:52 UTC (permalink / raw)
  To: Borislav Petkov, David Arcari
  Cc: Rafael J. Wysocki, Linux ACPI, Lenny Szubowicz, Len Brown,
	Tony Luck, Eric W. Biederman, Alexandru Gagniuc, linux-kernel

On 21/12/2018 11:17, Rafael J. Wysocki wrote:
> On Thursday, December 20, 2018 8:24:47 PM CET Borislav Petkov wrote:
>> + James.

Thanks,

>> On Wed, Dec 19, 2018 at 11:50:52AM -0500, David Arcari wrote:
>>> From: Lenny Szubowicz <lszubowi@redhat.com>
>>>
>>> In __ghes_panic() clear the block status in the APEI generic
>>> error status block for that generic hardware error source before
>>> calling panic() to prevent a second panic() in the crash kernel
>>> for exactly the same fatal error.
>>>
>>> Otherwise ghes_probe(), running in the crash kernel, would see
>>> an unhandled error in the APEI generic error status block and
>>> panic again, thereby precluding any crash dump.

I bet that was fun to watch!


>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index 02c6fd9..f008ba7 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -691,6 +691,8 @@ static void __ghes_panic(struct ghes *ghes)
>>>  {
>>>  	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
>>>  
>>> +	ghes_clear_estatus(ghes);
>>> +
>>>  	/* reboot to log the error! */
>>>  	if (!panic_timeout)
>>>  		panic_timeout = ghes_panic_timeout;
>>
>> Acked-by: Borislav Petkov <bp@suse.de>
> 
> Patch applied, thanks!

Great!

Do we need to ghes_ack_error() too?

With the location cleared the new kernel will never find the records, and
firmware can never re-use that location because it wasn't ack'd. The upshot is
RAS records can't be generated for the kdump kernel. The ACPI spec talks about
use of the memory, so I don't think it's fair for it to use this to disarm a
watchdog.

I think we can live with this as the kdump kernel isn't going to handle RAS
errors for the bulk of memory anyway.


Thanks,

James
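
The concern about clearing without acknowledging can be illustrated with a toy model of the handshake. It is only a sketch under stated assumptions: firmware is assumed to refuse to reuse the status block until the OS has acknowledged the previous record, and all names (`sim_ghes_v2`, `sim_fw_report`, `sim_os_consume`, `sim_fw_can_report_again`) are hypothetical rather than kernel or firmware interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one error source shared by firmware and OS. */
struct sim_ghes_v2 {
	uint32_t block_status;	/* non-zero: a record is waiting for the OS */
	bool acked;		/* firmware may reuse the block only when true */
};

/* Firmware side: post a record only if the previous one was acknowledged. */
static bool sim_fw_report(struct sim_ghes_v2 *g)
{
	assert(g);
	if (!g->acked)
		return false;
	g->block_status = 1;
	g->acked = false;
	return true;
}

/* OS side: clear the status and, optionally, acknowledge the record. */
static void sim_os_consume(struct sim_ghes_v2 *g, bool ack)
{
	assert(g);
	g->block_status = 0;
	if (ack)
		g->acked = true;
}

/* Can firmware post a second record after the OS handled the first? */
static bool sim_fw_can_report_again(bool ack)
{
	struct sim_ghes_v2 g = { .block_status = 0, .acked = true };

	if (!sim_fw_report(&g))
		return false;
	sim_os_consume(&g, ack);
	return sim_fw_report(&g);
}
```

Under these assumptions, clearing alone (`sim_fw_can_report_again(false)`) leaves firmware unable to post a new record, whereas clearing plus acknowledging (`sim_fw_can_report_again(true)`) lets it report again.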


* Re: [PATCH] ACPI/APEI: Clear GHES block_status before panic()
  2018-12-21 18:52     ` James Morse
@ 2018-12-21 18:59       ` Borislav Petkov
  0 siblings, 0 replies; 6+ messages in thread
From: Borislav Petkov @ 2018-12-21 18:59 UTC (permalink / raw)
  To: James Morse
  Cc: David Arcari, Rafael J. Wysocki, Linux ACPI, Lenny Szubowicz,
	Len Brown, Tony Luck, Eric W. Biederman, Alexandru Gagniuc,
	linux-kernel

On Fri, Dec 21, 2018 at 06:52:20PM +0000, James Morse wrote:
> Do we need to ghes_ack_error() too?

That's GHES v2 AFAICT.

> With the location cleared the new kernel will never find the records, and
> firmware can never re-use that location because it wasn't ack'd. The upshot is
> RAS records can't be generated for the kdump kernel. The ACPI spec talks about
> use of the memory, so I don't think it's fair for it to use this to disarm a
> watchdog.
> 
> I think we can live with this as the kdump kernel isn't going to handle RAS
> errors for the bulk of memory anyway.

Usually, handling hw errors is always better than not but the second
kernel can't do anything better in that respect than the first, right?
If it panics, it panics - no matter the kernel. Generally.

Therefore I think the role of the second kernel should be to be as
resilient as possible to hw errors - like, not even see them :-) - dump
the memory of the first kernel as quickly as possible and reboot for
analysis.

IMHO, of course.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


