linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-02 16:37 [PATCH] mce: fix missing stack-dumping in mce_panic() Miaohe Lin
@ 2022-12-02 14:44 ` Luck, Tony
  2022-12-03  2:19   ` Miaohe Lin
  2022-12-10  2:28 ` Miaohe Lin
  1 sibling, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2022-12-02 14:44 UTC (permalink / raw)
  To: Miaohe Lin, bp, tglx, mingo, dave.hansen
  Cc: x86, hpa, linux-edac, linux-kernel

> When machine check exception occurs, there is no stack-dumping now in
> mce_panic(). It's because bust_spinlocks(1) is called prematurely so
> oops_in_progress will be >= 2 when trying to call dump_stack() in
> panic(). Thus dump_stack() won't be called as this is considered as
> nested stack-dumping.

I had an earlier patch series to just dump from "interesting" machine checks
(I think the interesting ones are when the kernel hit poison in code that hadn't
been tagged in the extable as recoverable)

https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

Discussion on that fizzled out.

Thanks for tracking down why panic() didn't provide a stack dump.

I'm still of the opinion that stack dumps from machine checks aren't
generally useful. But I'd rather have extra stack dumps than no stack
dumps at all.

-Tony

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] mce: fix missing stack-dumping in mce_panic()
@ 2022-12-02 16:37 Miaohe Lin
  2022-12-02 14:44 ` Luck, Tony
  2022-12-10  2:28 ` Miaohe Lin
  0 siblings, 2 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-12-02 16:37 UTC (permalink / raw)
  To: tony.luck, bp, tglx, mingo, dave.hansen
  Cc: x86, hpa, linux-edac, linux-kernel, linmiaohe

When machine check exception occurs, there is no stack-dumping now in
mce_panic(). It's because bust_spinlocks(1) is called prematurely so
oops_in_progress will be >= 2 when trying to call dump_stack() in
panic(). Thus dump_stack() won't be called as this is considered as
nested stack-dumping.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 arch/x86/kernel/cpu/mce/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c8ec5c71712..c40dad1a6749 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -254,7 +254,6 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 			wait_for_panic();
 		barrier();
 
-		bust_spinlocks(1);
 		console_verbose();
 	} else {
 		/* Don't log too much for fake panic */
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-02 14:44 ` Luck, Tony
@ 2022-12-03  2:19   ` Miaohe Lin
  2022-12-03  4:22     ` Tony Luck
  2022-12-29 11:54     ` Borislav Petkov
  0 siblings, 2 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-12-03  2:19 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, hpa, linux-edac, linux-kernel, bp, tglx, mingo, dave.hansen

On 2022/12/2 22:44, Luck, Tony wrote:
>> When machine check exception occurs, there is no stack-dumping now in
>> mce_panic(). It's because bust_spinlocks(1) is called prematurely so
>> oops_in_progress will be >= 2 when trying to call dump_stack() in
>> panic(). Thus dump_stack() won't be called as this is considered as
>> nested stack-dumping.
> 

Many thanks for your quick reply. :)

> I had an earlier patch series to just dump from "interesting" machine checks
> (I think the interesting ones are when the kernel hit poison in code that hadn't
> been tagged in the extable as recoverable)
> 
> https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

[1]

> 
> Discussion on that fizzled out.
> 
> Thanks for tracking down why panic() didn't provide a stack dump.
> 
> I'm still of the opinion that stack dumps from machine checks aren't
> generally useful. But I'd rather have extra stack dumps than no stack
> dumps at all.

As you mentioned in [1]:
"""
In order to ease the hunt for additional code flows where machine check
errors can be recovered it is useful to know, for example, why the
kernel was copying a page.
"""

So I think it's better to have at least one stack dumps. Also what the commit
6e6f0a1f0fa6 ("panic: don't print redundant backtraces on oops") and commit
026ee1f66aaa ("panic: fix stack dump print on direct call to panic()") want
to do is avoiding nested stack-dumping to have the original oops data being
scrolled away on a 80x50 screen but to have *at least one backtraces*. So
this patch acts more like a BUGFIX to ensure having at least one backtraces
in mce_panic(). What's your thought, Luck?

Thanks,
Miaohe Lin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-03  2:19   ` Miaohe Lin
@ 2022-12-03  4:22     ` Tony Luck
  2022-12-03  6:48       ` Miaohe Lin
  2022-12-29 11:54     ` Borislav Petkov
  1 sibling, 1 reply; 19+ messages in thread
From: Tony Luck @ 2022-12-03  4:22 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: x86, hpa, linux-edac, linux-kernel, bp, tglx, mingo, dave.hansen

On Sat, Dec 03, 2022 at 10:19:32AM +0800, Miaohe Lin wrote:
> So I think it's better to have at least one stack dumps. Also what the commit
> 6e6f0a1f0fa6 ("panic: don't print redundant backtraces on oops") and commit
> 026ee1f66aaa ("panic: fix stack dump print on direct call to panic()") want
> to do is avoiding nested stack-dumping to have the original oops data being
> scrolled away on a 80x50 screen but to have *at least one backtraces*. So
> this patch acts more like a BUGFIX to ensure having at least one backtraces
> in mce_panic(). What's your thought, Luck?

I tried out your patch with the ras-tools test:

# ./einj_mem_uc -f copyout

which currently causes a panic from the "recoverable" machine check.

Your patch worked fine:

[  112.457735] stack backtrace:
[  112.457736] CPU: 124 PID: 3401 Comm: einj_mem_uc Not tainted 6.1.0-rc7+ #41
[  112.457738] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0154.R04.1804231104 04/23/2018
[  112.457739] Call Trace:
[  112.457740]  <#MC>
[  112.457742]  dump_stack_lvl+0x5a/0x78
[  112.457746]  dump_stack+0x10/0x16
[  112.457748]  print_usage_bug.part.0+0x1ad/0x1c4
[  112.457755]  lock_acquire.cold+0x16/0x47
[  112.457759]  ? down_trylock+0x14/0x40
[  112.457762]  ? panic+0x180/0x2b9
[  112.457766]  _raw_spin_lock_irqsave+0x4e/0x70
[  112.457768]  ? down_trylock+0x14/0x40
[  112.457771]  down_trylock+0x14/0x40
[  112.457772]  ? panic+0x180/0x2b9
[  112.457775]  __down_trylock_console_sem+0x34/0xc0
[  112.457778]  console_unblank+0x1d/0x90
[  112.457781]  panic+0x180/0x2b9
[  112.457788]  mce_panic+0x118/0x1e0
[  112.457794]  do_machine_check+0x79a/0x890
[  112.457800]  ? copy_user_enhanced_fast_string+0xa/0x50
[  112.457810]  exc_machine_check+0x76/0xb0
[  112.457813]  asm_exc_machine_check+0x1a/0x40
[  112.457816] RIP: 0010:copy_user_enhanced_fast_string+0xa/0x50
[  112.457819] Code: d1 f3 a4 31 c0 0f 01 ca c3 cc cc cc cc 8d 0c ca 89 ca eb 2c 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 01 cb 83 fa 40 72 48 89 d1 <f3> a4 31 c0 0f 01 ca c3 cc cc cc cc 89 ca eb 06 66 0f 1f 44 00 00
[  112.457820] RSP: 0018:ffffb140f789bbd8 EFLAGS: 00050206
[  112.457822] RAX: 0000000000001000 RBX: ffffb140f789be58 RCX: 0000000000000c00
[  112.457824] RDX: 0000000000001000 RSI: ffff9133eb3f0400 RDI: 000055c1a36986c0
[  112.457825] RBP: ffffb140f789bc68 R08: 00000000f789be01 R09: 0000000000001000
[  112.457827] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000001000
[  112.457828] R13: ffff9133eb3f0000 R14: 0000000000001000 R15: 0000000000000000
[  112.457837]  </#MC>
[  112.457838]  <TASK>
[  112.457838]  ? _copy_to_iter+0xc3/0x6f0
[  112.457843]  ? filemap_get_pages+0x9b/0x670
[  112.457851]  copy_page_to_iter+0x7c/0x1f0
[  112.457854]  ? find_held_lock+0x31/0x90
[  112.457858]  filemap_read+0x1ec/0x390
[  112.457865]  ? __fsnotify_parent+0x10f/0x310
[  112.457867]  ? aa_file_perm+0x1ab/0x610
[  112.457875]  generic_file_read_iter+0xf4/0x170
[  112.457879]  ext4_file_read_iter+0x5b/0x1e0
[  112.457881]  ? security_file_permission+0x4e/0x60
[  112.457886]  vfs_read+0x208/0x2e0
[  112.457895]  ksys_read+0x6d/0xf0
[  112.457900]  __x64_sys_read+0x19/0x20
[  112.457902]  do_syscall_64+0x38/0x90
[  112.457906]  entry_SYSCALL_64_after_hwframe+0x63/0xcd


Tested-by: Tony Luck <tony.luck@intel.com>

-Tony

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-03  4:22     ` Tony Luck
@ 2022-12-03  6:48       ` Miaohe Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-12-03  6:48 UTC (permalink / raw)
  To: Tony Luck
  Cc: x86, hpa, linux-edac, linux-kernel, bp, tglx, mingo, dave.hansen

On 2022/12/3 12:22, Tony Luck wrote:
> On Sat, Dec 03, 2022 at 10:19:32AM +0800, Miaohe Lin wrote:
>> So I think it's better to have at least one stack dumps. Also what the commit
>> 6e6f0a1f0fa6 ("panic: don't print redundant backtraces on oops") and commit
>> 026ee1f66aaa ("panic: fix stack dump print on direct call to panic()") want
>> to do is avoiding nested stack-dumping to have the original oops data being
>> scrolled away on a 80x50 screen but to have *at least one backtraces*. So
>> this patch acts more like a BUGFIX to ensure having at least one backtraces
>> in mce_panic(). What's your thought, Luck?
> 
> I tried out your patch with the ras-tools test:

ras-tool is really convenient. :)

> 
> # ./einj_mem_uc -f copyout
> 
> which currently causes a panic from the "recoverable" machine check.
> 
> Your patch worked fine:
> 
> [  112.457735] stack backtrace:
> [  112.457736] CPU: 124 PID: 3401 Comm: einj_mem_uc Not tainted 6.1.0-rc7+ #41
> [  112.457738] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS PLYDCRB1.86B.0154.R04.1804231104 04/23/2018
> [  112.457739] Call Trace:
> [  112.457740]  <#MC>
> [  112.457742]  dump_stack_lvl+0x5a/0x78
> [  112.457746]  dump_stack+0x10/0x16
> [  112.457748]  print_usage_bug.part.0+0x1ad/0x1c4
> [  112.457755]  lock_acquire.cold+0x16/0x47
> [  112.457759]  ? down_trylock+0x14/0x40
> [  112.457762]  ? panic+0x180/0x2b9
> [  112.457766]  _raw_spin_lock_irqsave+0x4e/0x70
> [  112.457768]  ? down_trylock+0x14/0x40
> [  112.457771]  down_trylock+0x14/0x40
> [  112.457772]  ? panic+0x180/0x2b9
> [  112.457775]  __down_trylock_console_sem+0x34/0xc0
> [  112.457778]  console_unblank+0x1d/0x90
> [  112.457781]  panic+0x180/0x2b9
> [  112.457788]  mce_panic+0x118/0x1e0
> [  112.457794]  do_machine_check+0x79a/0x890
> [  112.457800]  ? copy_user_enhanced_fast_string+0xa/0x50
> [  112.457810]  exc_machine_check+0x76/0xb0
> [  112.457813]  asm_exc_machine_check+0x1a/0x40
> [  112.457816] RIP: 0010:copy_user_enhanced_fast_string+0xa/0x50
> [  112.457819] Code: d1 f3 a4 31 c0 0f 01 ca c3 cc cc cc cc 8d 0c ca 89 ca eb 2c 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 01 cb 83 fa 40 72 48 89 d1 <f3> a4 31 c0 0f 01 ca c3 cc cc cc cc 89 ca eb 06 66 0f 1f 44 00 00
> [  112.457820] RSP: 0018:ffffb140f789bbd8 EFLAGS: 00050206
> [  112.457822] RAX: 0000000000001000 RBX: ffffb140f789be58 RCX: 0000000000000c00
> [  112.457824] RDX: 0000000000001000 RSI: ffff9133eb3f0400 RDI: 000055c1a36986c0
> [  112.457825] RBP: ffffb140f789bc68 R08: 00000000f789be01 R09: 0000000000001000
> [  112.457827] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000001000
> [  112.457828] R13: ffff9133eb3f0000 R14: 0000000000001000 R15: 0000000000000000
> [  112.457837]  </#MC>
> [  112.457838]  <TASK>
> [  112.457838]  ? _copy_to_iter+0xc3/0x6f0
> [  112.457843]  ? filemap_get_pages+0x9b/0x670
> [  112.457851]  copy_page_to_iter+0x7c/0x1f0
> [  112.457854]  ? find_held_lock+0x31/0x90
> [  112.457858]  filemap_read+0x1ec/0x390
> [  112.457865]  ? __fsnotify_parent+0x10f/0x310
> [  112.457867]  ? aa_file_perm+0x1ab/0x610
> [  112.457875]  generic_file_read_iter+0xf4/0x170
> [  112.457879]  ext4_file_read_iter+0x5b/0x1e0
> [  112.457881]  ? security_file_permission+0x4e/0x60
> [  112.457886]  vfs_read+0x208/0x2e0
> [  112.457895]  ksys_read+0x6d/0xf0
> [  112.457900]  __x64_sys_read+0x19/0x20
> [  112.457902]  do_syscall_64+0x38/0x90
> [  112.457906]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> 
> Tested-by: Tony Luck <tony.luck@intel.com>
> 
> -Tony

Many thanks for your test, Tony!

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-02 16:37 [PATCH] mce: fix missing stack-dumping in mce_panic() Miaohe Lin
  2022-12-02 14:44 ` Luck, Tony
@ 2022-12-10  2:28 ` Miaohe Lin
  2022-12-28 12:33   ` Miaohe Lin
  1 sibling, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-12-10  2:28 UTC (permalink / raw)
  To: bp, tglx, mingo, dave.hansen
  Cc: x86, hpa, linux-edac, linux-kernel, Tony Luck

On 2022/12/3 0:37, Miaohe Lin wrote:
> When machine check exception occurs, there is no stack-dumping now in
> mce_panic(). It's because bust_spinlocks(1) is called prematurely so
> oops_in_progress will be >= 2 when trying to call dump_stack() in
> panic(). Thus dump_stack() won't be called as this is considered as
> nested stack-dumping.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Friendly ping. ;)

> ---
>  arch/x86/kernel/cpu/mce/core.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2c8ec5c71712..c40dad1a6749 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -254,7 +254,6 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  			wait_for_panic();
>  		barrier();
>  
> -		bust_spinlocks(1);
>  		console_verbose();
>  	} else {
>  		/* Don't log too much for fake panic */
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-10  2:28 ` Miaohe Lin
@ 2022-12-28 12:33   ` Miaohe Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-12-28 12:33 UTC (permalink / raw)
  To: bp, tglx, mingo, dave.hansen, hpa
  Cc: x86, linux-edac, linux-kernel, Tony Luck

On 2022/12/10 10:28, Miaohe Lin wrote:
> On 2022/12/3 0:37, Miaohe Lin wrote:
>> When machine check exception occurs, there is no stack-dumping now in
>> mce_panic(). It's because bust_spinlocks(1) is called prematurely so
>> oops_in_progress will be >= 2 when trying to call dump_stack() in
>> panic(). Thus dump_stack() won't be called as this is considered as
>> nested stack-dumping.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> Friendly ping. ;)

Friendly ping after busy merge window. :)

> 
>> ---
>>  arch/x86/kernel/cpu/mce/core.c | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 2c8ec5c71712..c40dad1a6749 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -254,7 +254,6 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>>  			wait_for_panic();
>>  		barrier();
>>  
>> -		bust_spinlocks(1);
>>  		console_verbose();
>>  	} else {
>>  		/* Don't log too much for fake panic */
>>
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-03  2:19   ` Miaohe Lin
  2022-12-03  4:22     ` Tony Luck
@ 2022-12-29 11:54     ` Borislav Petkov
  2022-12-29 12:33       ` Miaohe Lin
  1 sibling, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2022-12-29 11:54 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Luck, Tony, x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen

On Sat, Dec 03, 2022 at 10:19:32AM +0800, Miaohe Lin wrote:
> So I think it's better to have at least one stack dumps. Also what the commit
> 6e6f0a1f0fa6 ("panic: don't print redundant backtraces on oops") and commit
> 026ee1f66aaa ("panic: fix stack dump print on direct call to panic()") want
> to do is avoiding nested stack-dumping to have the original oops data being
> scrolled away on a 80x50 screen but to have *at least one backtraces*. So
> this patch acts more like a BUGFIX to ensure having at least one backtraces
> in mce_panic().

So which commit broke this?

One of the two above or

004429956b48 ("handle recursive calls to bust_spinlocks()")

or

d896a940ef4f ("x86, mce: remove oops_begin() use in 64bit machine check")

or...?

By looking at their dates, they're pretty much too old so that this can
go to *all* stable kernels.

Hmmm.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-29 11:54     ` Borislav Petkov
@ 2022-12-29 12:33       ` Miaohe Lin
  2022-12-29 12:51         ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-12-29 12:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen

On 2022/12/29 19:54, Borislav Petkov wrote:
> On Sat, Dec 03, 2022 at 10:19:32AM +0800, Miaohe Lin wrote:
>> So I think it's better to have at least one stack dumps. Also what the commit
>> 6e6f0a1f0fa6 ("panic: don't print redundant backtraces on oops") and commit
>> 026ee1f66aaa ("panic: fix stack dump print on direct call to panic()") want
>> to do is avoiding nested stack-dumping to have the original oops data being
>> scrolled away on a 80x50 screen but to have *at least one backtraces*. So
>> this patch acts more like a BUGFIX to ensure having at least one backtraces
>> in mce_panic().
> 

Many thanks for your reply. :)

> So which commit broke this?
> 

I think it should be Fixes: 6e6f0a1f0fa6 ("panic: don't print redundant backtraces on oops")
as there's missing stack-dumping in mce_panic() since then. Should I resend the patch with
above Fixes tag attached?

> One of the two above or
> 
> 004429956b48 ("handle recursive calls to bust_spinlocks()")
> 
> or
> 
> d896a940ef4f ("x86, mce: remove oops_begin() use in 64bit machine check")
> 
> or...?
> 
> By looking at their dates, they're pretty much too old so that this can
> go to *all* stable kernels.

Yes, it should be a really old problem. And it seems it's unnoticed yet because mce should be a
really rare event and corresponding dump-stacking might not be that attractive at that time.

Thanks!
Miaohe Lin


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-29 12:33       ` Miaohe Lin
@ 2022-12-29 12:51         ` Borislav Petkov
  2022-12-30  1:56           ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2022-12-29 12:51 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Luck, Tony, x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen

On Thu, Dec 29, 2022 at 08:33:55PM +0800, Miaohe Lin wrote:
> I think

You think or you know?

I'm assuming reverting 6e6f0a1f0fa6 shows the MCE panic?

I guess the original issue that commit was fixing is to save that
redundant oops message but Tony seems to want to see it now and I'm not
sure how much we care about 80x50 screens nowadays... :-)

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-29 12:51         ` Borislav Petkov
@ 2022-12-30  1:56           ` Miaohe Lin
  2023-01-03 21:12             ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-12-30  1:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen

On 2022/12/29 20:51, Borislav Petkov wrote:
> On Thu, Dec 29, 2022 at 08:33:55PM +0800, Miaohe Lin wrote:
>> I think
> 
> You think or you know?

The problem code logic exists since that time.

> 
> I'm assuming reverting 6e6f0a1f0fa6 shows the MCE panic?

Yes, that's right.

> 
> I guess the original issue that commit was fixing is to save that
> redundant oops message but Tony seems to want to see it now and I'm not
> sure how much we care about 80x50 screens nowadays... :-)

80x50 screens should be really uncommon now. But avoiding redundant oops message
should always save our time. ;)

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2022-12-30  1:56           ` Miaohe Lin
@ 2023-01-03 21:12             ` Luck, Tony
  2023-01-06  1:57               ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2023-01-03 21:12 UTC (permalink / raw)
  To: Miaohe Lin, Borislav Petkov
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen

>> I guess the original issue that commit was fixing is to save that
>> redundant oops message but Tony seems to want to see it now and I'm not
>> sure how much we care about 80x50 screens nowadays... :-)

I want a stack dump for the specific case of a recoverable machine check caused by
poison consumption in kernel code that doesn't have an extable[] entry for a recovery
path. That's a potential candidate for future kernel change to make that recoverable
(if the code path seems common enough to warrant the churn), and there is some
plausible way for s/w to "recover").

For most other machine checks the dump is very likely useless. E.g. some CPU core stalled
so that the system generates a broadcast machine check because instructions are not
being retired on that CPU core. In this case the machine check "monarch" is almost certainly
some innocent bystander that was executing normally. Stack dump from that CPU is going
to tell you nothing about the machine check.

-Tony



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-03 21:12             ` Luck, Tony
@ 2023-01-06  1:57               ` Miaohe Lin
  2023-01-06 17:42                 ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2023-01-06  1:57 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

On 2023/1/4 5:12, Luck, Tony wrote:
>>> I guess the original issue that commit was fixing is to save that
>>> redundant oops message but Tony seems to want to see it now and I'm not
>>> sure how much we care about 80x50 screens nowadays... :-)
> 

Many thanks for your thought. :)

> I want a stack dump for the specific case of a recoverable machine check caused by
> poison consumption in kernel code that doesn't have an extable[] entry for a recovery
> path. That's a potential candidate for future kernel change to make that recoverable

Sure, in this case a stack dump will be really helpful. We can gather these stack dumps
and try to make the most frequent machine check scene recoverable.

> (if the code path seems common enough to warrant the churn), and there is some
> plausible way for s/w to "recover").
> 
> For most other machine checks the dump is very likely useless. E.g. some CPU core stalled
> so that the system generates a broadcast machine check because instructions are not
> being retired on that CPU core. In this case the machine check "monarch" is almost certainly
> some innocent bystander that was executing normally. Stack dump from that CPU is going
> to tell you nothing about the machine check.

Agree. A stack dump won't be helpful in this case. But I tend to keep the stack dump in case
it would be helpful and also make mce_panic() dumps the stack as expected. What do you think?

Thanks,
Miaohe Lin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-06  1:57               ` Miaohe Lin
@ 2023-01-06 17:42                 ` Luck, Tony
  2023-01-07  2:27                   ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2023-01-06 17:42 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

> Agree. A stack dump won't be helpful in this case. But I tend to keep the stack dump in case
> it would be helpful and also make mce_panic() dumps the stack as expected. What do you think?

My #1 preference is to just get stack dumps from machine checks where the cause is known
to be poison consumption by the kernel.

Patches to do this were posted in September 2022

https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

Boris has applied part 1 (which was a cleanup). But discussion on part 2
centered around why mce_panic() wasn't giving a full OOPs dump. That's
now understood.


If I can't get my first preference ... then a stack dump for every machine check
is better than no stack dumps at all. So a patch that enables the OOPs dump
would be a good second choice.

I don't want to stick with the current behavior of no stack dumps.

-Tony

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-06 17:42                 ` Luck, Tony
@ 2023-01-07  2:27                   ` Miaohe Lin
  2023-01-08 17:54                     ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2023-01-07  2:27 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

On 2023/1/7 1:42, Luck, Tony wrote:
>> Agree. A stack dump won't be helpful in this case. But I tend to keep the stack dump in case
>> it would be helpful and also make mce_panic() dumps the stack as expected. What do you think?
> 
> My #1 preference is to just get stack dumps from machine checks where the cause is known
> to be poison consumption by the kernel.
> 
> Patches to do this were posted in September 2022
> 
> https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

[1]

> 
> Boris has applied part 1 (which was a cleanup). But discussion on part 2
> centered around why mce_panic() wasn't giving a full OOPs dump. That's
> now understood.
> 

Many thanks for your valuable information. [1] uses MCE_PANIC_STACKDUMP_SEVERITY to
pick the preferred ones and do a stack dump manually. But now stack dumps is done
via panic(), it seems there's no easy way to achieve #1 preference.

> 
> If I can't get my first preference ... then a stack dump for every machine check
> is better than no stack dumps at all. So a patch that enables the OOPs dump
> would be a good second choice.
> 
> I don't want to stick with the current behavior of no stack dumps.

As a first step, let's change the current behavior of no stack dumps first? And then,
we could try to achieve #1 preference if possible.

Thanks,
Miaohe Lin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-07  2:27                   ` Miaohe Lin
@ 2023-01-08 17:54                     ` Luck, Tony
  2023-01-09  2:16                       ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2023-01-08 17:54 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

> As a first step, let's change the current behavior of no stack dumps first? And then,
> we could try to achieve #1 preference if possible.

Ok. Repost the patch that fixes things for an oops dump for every mce_panic().

Maybe then I can look at some way to suppress the oops dump for the "useless"
cases? Or perhaps at least add an extra message that stack dumps from arbitrary
machine checks are rarely useful.

-Tony

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-08 17:54                     ` Luck, Tony
@ 2023-01-09  2:16                       ` Miaohe Lin
  2023-01-09  4:45                         ` Luck, Tony
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2023-01-09  2:16 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

On 2023/1/9 1:54, Luck, Tony wrote:
>> As a first step, let's change the current behavior of no stack dumps first? And then,
>> we could try to achieve #1 preference if possible.
> 
> Ok. Repost the patch that fixes things for an oops dump for every mce_panic().

Do you mean make every call site of mce_panic() does a successful oops dump? I think
this patch will have achieved that. Or do I miss-read your comment?

> 
> Maybe then I can look at some way to suppress the oops dump for the "useless"
> cases? Or perhaps at least add an extra message that stack dumps from arbitrary
> machine checks are rarely useful.

IMHO the user will always expect a stack dump when system panic. I'm afraid that if these
"useless" cases are eliminated, it will cause user confused and trying to find what eats
the stack dump?

Thanks,
Miaohe Lin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-09  2:16                       ` Miaohe Lin
@ 2023-01-09  4:45                         ` Luck, Tony
  2023-01-09  6:54                           ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Luck, Tony @ 2023-01-09  4:45 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

>> Maybe then I can look at some way to suppress the oops dump for the "useless"
>> cases? Or perhaps at least add an extra message that stack dumps from arbitrary
>> machine checks are rarely useful.
>
> IMHO the user will always expect a stack dump when system panic. I'm afraid that if these
> "useless" cases are eliminated, it will cause user confused and trying to find what eats
> the stack dump?

What about the "extra message" option?  Always print the full oops. but for ones from
mce_panic() arrange somehow to add an extra bit of helpful text. Something like this:

1) For poison consumption in kernel code:

"The reason for this panic is consumption of poisoned data by the kernel"

2) For the general machine check case:

"The reason for this panic is a hardware detected error reported as a machine check.
 Note that the stack and register dumps are likely useless to diagnose the root cause
 of this panic."

-Tony

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mce: fix missing stack-dumping in mce_panic()
  2023-01-09  4:45                         ` Luck, Tony
@ 2023-01-09  6:54                           ` Miaohe Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2023-01-09  6:54 UTC (permalink / raw)
  To: Luck, Tony
  Cc: x86, hpa, linux-edac, linux-kernel, tglx, mingo, dave.hansen,
	Borislav Petkov

On 2023/1/9 12:45, Luck, Tony wrote:
>>> Maybe then I can look at some way to suppress the oops dump for the "useless"
>>> cases? Or perhaps at least add an extra message that stack dumps from arbitrary
>>> machine checks are rarely useful.
>>
>> IMHO the user will always expect a stack dump when system panic. I'm afraid that if these
>> "useless" cases are eliminated, it will cause user confused and trying to find what eats
>> the stack dump?
> 
> What about the "extra message" option?  Always print the full oops. but for ones from
> mce_panic() arrange somehow to add an extra bit of helpful text. Something like this:
> 
> 1) For poison consumption in kernel code:
> 
> "The reason for this panic is consumption of poisoned data by the kernel"
> 
> 2) For the general machine check case:
> 
> "The reason for this panic is a hardware detected error reported as a machine check.
>  Note that the stack and register dumps are likely useless to diagnose the root cause
>  of this panic."

It sounds like a good idea to me. Many thanks for your thought. And it's fine to squash my patch
to your provided thought. :)

Thanks,
Miaohe Lin

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-01-09  6:55 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-02 16:37 [PATCH] mce: fix missing stack-dumping in mce_panic() Miaohe Lin
2022-12-02 14:44 ` Luck, Tony
2022-12-03  2:19   ` Miaohe Lin
2022-12-03  4:22     ` Tony Luck
2022-12-03  6:48       ` Miaohe Lin
2022-12-29 11:54     ` Borislav Petkov
2022-12-29 12:33       ` Miaohe Lin
2022-12-29 12:51         ` Borislav Petkov
2022-12-30  1:56           ` Miaohe Lin
2023-01-03 21:12             ` Luck, Tony
2023-01-06  1:57               ` Miaohe Lin
2023-01-06 17:42                 ` Luck, Tony
2023-01-07  2:27                   ` Miaohe Lin
2023-01-08 17:54                     ` Luck, Tony
2023-01-09  2:16                       ` Miaohe Lin
2023-01-09  4:45                         ` Luck, Tony
2023-01-09  6:54                           ` Miaohe Lin
2022-12-10  2:28 ` Miaohe Lin
2022-12-28 12:33   ` Miaohe Lin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).