linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
@ 2021-11-10 19:01 Hari Bathini
  2021-11-11  6:14 ` Michael Ellerman
  0 siblings, 1 reply; 5+ messages in thread
From: Hari Bathini @ 2021-11-10 19:01 UTC (permalink / raw)
  To: mpe, linuxppc-dev; +Cc: Hari Bathini, sourabhjain, mahesh, npiggin

In panic path, fadump is triggered via a panic notifier function.
Before calling panic notifier functions, smp_send_stop() gets called,
which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
("powerpc: stop_this_cpu: remove the cpu from the online map.") and
again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
started marking CPUs as offline while stopping them. So, if a kernel
has either of the above commits, vmcore captured with fadump via panic
path would show only the panic'ing CPU as online. Sample output of
crash-utility with such vmcore:

  # crash vmlinux vmcore
  ...
        KERNEL: vmlinux
      DUMPFILE: vmcore  [PARTIAL DUMP]
          CPUS: 1
          DATE: Wed Nov 10 09:56:34 EST 2021
        UPTIME: 00:00:42
  LOAD AVERAGE: 2.27, 0.69, 0.24
         TASKS: 183
      NODENAME: XXXXXXXXX
       RELEASE: 5.15.0+
       VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
       MACHINE: ppc64le  (2500 Mhz)
        MEMORY: 8 GB
         PANIC: "Kernel panic - not syncing: sysrq triggered crash"
           PID: 3394
       COMMAND: "bash"
          TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
           CPU: 1
         STATE: TASK_RUNNING (PANIC)

  crash> p -x __cpu_online_mask
  __cpu_online_mask = $1 = {
    bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
  }
  crash>
  crash>
  crash> p -x __cpu_active_mask
  __cpu_active_mask = $2 = {
    bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
  }
  crash>

While this has been the case since fadump was introduced, the issue
was not identified for two probable reasons:

  - In general, the bulk of the vmcores analyzed were from crash
    due to exception.

  - The above did change since commit 8341f2f222d7 ("sysrq: Use
    panic() to force a crash") started using panic() instead of
    deferencing NULL pointer to force a kernel crash. But then
    commit de6e5d38417e ("powerpc: smp_send_stop do not offline
    stopped CPUs") stopped marking CPUs as offline till kernel
    commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
    reverted that change.

To avoid vmcore from showing only one CPU as online in panic path,
skip marking non panic'ing CPUs as offline while stopping them
during fadump crash.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
---
 arch/powerpc/kernel/smp.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index c23ee842c4c3..20555d5d5966 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -61,6 +61,7 @@
 #include <asm/cpu_has_feature.h>
 #include <asm/ftrace.h>
 #include <asm/kup.h>
+#include <asm/fadump.h>
 
 #ifdef DEBUG
 #include <asm/udbg.h>
@@ -626,7 +627,8 @@ static void nmi_stop_this_cpu(struct pt_regs *regs)
 	/*
 	 * IRQs are already hard disabled by the smp_handle_nmi_ipi.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	if (!(oops_in_progress && should_fadump_crash()))
+		set_cpu_online(smp_processor_id(), false);
 
 	spin_begin();
 	while (1)
@@ -650,7 +652,8 @@ static void stop_this_cpu(void *dummy)
 	 * to know other CPUs are offline before it breaks locks to flush
 	 * printk buffers, in case we panic()ed while holding the lock.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	if (!(oops_in_progress && should_fadump_crash()))
+		set_cpu_online(smp_processor_id(), false);
 
 	spin_begin();
 	while (1)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
  2021-11-10 19:01 [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic Hari Bathini
@ 2021-11-11  6:14 ` Michael Ellerman
  2021-11-11 12:06   ` Hari Bathini
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Ellerman @ 2021-11-11  6:14 UTC (permalink / raw)
  To: Hari Bathini, linuxppc-dev; +Cc: Hari Bathini, sourabhjain, mahesh, npiggin

Hari Bathini <hbathini@linux.ibm.com> writes:
> In panic path, fadump is triggered via a panic notifier function.
> Before calling panic notifier functions, smp_send_stop() gets called,
> which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
> ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
> again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
> started marking CPUs as offline while stopping them. So, if a kernel
> has either of the above commits, vmcore captured with fadump via panic
> path would show only the panic'ing CPU as online. Sample output of
> crash-utility with such vmcore:
>
>   # crash vmlinux vmcore
>   ...
>         KERNEL: vmlinux
>       DUMPFILE: vmcore  [PARTIAL DUMP]
>           CPUS: 1
>           DATE: Wed Nov 10 09:56:34 EST 2021
>         UPTIME: 00:00:42
>   LOAD AVERAGE: 2.27, 0.69, 0.24
>          TASKS: 183
>       NODENAME: XXXXXXXXX
>        RELEASE: 5.15.0+
>        VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
>        MACHINE: ppc64le  (2500 Mhz)
>         MEMORY: 8 GB
>          PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>            PID: 3394
>        COMMAND: "bash"
>           TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
>            CPU: 1
>          STATE: TASK_RUNNING (PANIC)
>
>   crash> p -x __cpu_online_mask
>   __cpu_online_mask = $1 = {
>     bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>   }
>   crash>
>   crash>
>   crash> p -x __cpu_active_mask
>   __cpu_active_mask = $2 = {
>     bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>   }
>   crash>
>
> While this has been the case since fadump was introduced, the issue
> was not identified for two probable reasons:
>
>   - In general, the bulk of the vmcores analyzed were from crash
>     due to exception.
>
>   - The above did change since commit 8341f2f222d7 ("sysrq: Use
>     panic() to force a crash") started using panic() instead of
>     deferencing NULL pointer to force a kernel crash. But then
>     commit de6e5d38417e ("powerpc: smp_send_stop do not offline
>     stopped CPUs") stopped marking CPUs as offline till kernel
>     commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>     reverted that change.
>
> To avoid vmcore from showing only one CPU as online in panic path,
> skip marking non panic'ing CPUs as offline while stopping them
> during fadump crash.

Is this really worth the added complexity/bug surface?

Why does it matter if the vmcore shows only one CPU online?


> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index c23ee842c4c3..20555d5d5966 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -61,6 +61,7 @@
>  #include <asm/cpu_has_feature.h>
>  #include <asm/ftrace.h>
>  #include <asm/kup.h>
> +#include <asm/fadump.h>
>  
>  #ifdef DEBUG
>  #include <asm/udbg.h>
> @@ -626,7 +627,8 @@ static void nmi_stop_this_cpu(struct pt_regs *regs)
>  	/*
>  	 * IRQs are already hard disabled by the smp_handle_nmi_ipi.
>  	 */
> -	set_cpu_online(smp_processor_id(), false);
> +	if (!(oops_in_progress && should_fadump_crash()))
> +		set_cpu_online(smp_processor_id(), false);
>  
>  	spin_begin();
>  	while (1)
> @@ -650,7 +652,8 @@ static void stop_this_cpu(void *dummy)
>  	 * to know other CPUs are offline before it breaks locks to flush
>  	 * printk buffers, in case we panic()ed while holding the lock.
>  	 */
> -	set_cpu_online(smp_processor_id(), false);
> +	if (!(oops_in_progress && should_fadump_crash()))
> +		set_cpu_online(smp_processor_id(), false);

The comment talks about printk_safe_flush_on_panic(), and this change
would presumably break that. Except that printk_safe_flush_on_panic() no
longer exists.

So do we need to set the CPU online here at all?

ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
now that printk_safe_flush_on_panic() no longer exists?

cheers

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
  2021-11-11  6:14 ` Michael Ellerman
@ 2021-11-11 12:06   ` Hari Bathini
  2021-11-11 13:20     ` Nicholas Piggin
  0 siblings, 1 reply; 5+ messages in thread
From: Hari Bathini @ 2021-11-11 12:06 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: npiggin, mahesh, sourabhjain



On 11/11/21 11:44 am, Michael Ellerman wrote:
> Hari Bathini <hbathini@linux.ibm.com> writes:
>> In panic path, fadump is triggered via a panic notifier function.
>> Before calling panic notifier functions, smp_send_stop() gets called,
>> which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
>> ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
>> again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>> started marking CPUs as offline while stopping them. So, if a kernel
>> has either of the above commits, vmcore captured with fadump via panic
>> path would show only the panic'ing CPU as online. Sample output of
>> crash-utility with such vmcore:
>>
>>    # crash vmlinux vmcore
>>    ...
>>          KERNEL: vmlinux
>>        DUMPFILE: vmcore  [PARTIAL DUMP]
>>            CPUS: 1
>>            DATE: Wed Nov 10 09:56:34 EST 2021
>>          UPTIME: 00:00:42
>>    LOAD AVERAGE: 2.27, 0.69, 0.24
>>           TASKS: 183
>>        NODENAME: XXXXXXXXX
>>         RELEASE: 5.15.0+
>>         VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
>>         MACHINE: ppc64le  (2500 Mhz)
>>          MEMORY: 8 GB
>>           PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>             PID: 3394
>>         COMMAND: "bash"
>>            TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
>>             CPU: 1
>>           STATE: TASK_RUNNING (PANIC)
>>
>>    crash> p -x __cpu_online_mask
>>    __cpu_online_mask = $1 = {
>>      bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>    }
>>    crash>
>>    crash>
>>    crash> p -x __cpu_active_mask
>>    __cpu_active_mask = $2 = {
>>      bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>    }
>>    crash>
>>
>> While this has been the case since fadump was introduced, the issue
>> was not identified for two probable reasons:
>>
>>    - In general, the bulk of the vmcores analyzed were from crash
>>      due to exception.
>>
>>    - The above did change since commit 8341f2f222d7 ("sysrq: Use
>>      panic() to force a crash") started using panic() instead of
>>      deferencing NULL pointer to force a kernel crash. But then
>>      commit de6e5d38417e ("powerpc: smp_send_stop do not offline
>>      stopped CPUs") stopped marking CPUs as offline till kernel
>>      commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>      reverted that change.
>>
>> To avoid vmcore from showing only one CPU as online in panic path,
>> skip marking non panic'ing CPUs as offline while stopping them
>> during fadump crash.
> 
> Is this really worth the added complexity/bug surface?
> 
> Why does it matter if the vmcore shows only one CPU online?

We lose out on backtrace/register data of non-crashing CPUs as they
are explicitly marked offline.

Actually, the state of CPU resources is explicitly changed after the
panic though the aim of vmcore is to capture the system state at the
time of panic...

Alternatively, how about moving crash_fadump() call from panic notifier
into panic() function explicitly, just like __crash_kexec() - before the
smp_send_stop() call, so as to remove dependency with smp_send_stop()
implementation altogether...

> 
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index c23ee842c4c3..20555d5d5966 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -61,6 +61,7 @@
>>   #include <asm/cpu_has_feature.h>
>>   #include <asm/ftrace.h>
>>   #include <asm/kup.h>
>> +#include <asm/fadump.h>
>>   
>>   #ifdef DEBUG
>>   #include <asm/udbg.h>
>> @@ -626,7 +627,8 @@ static void nmi_stop_this_cpu(struct pt_regs *regs)
>>   	/*
>>   	 * IRQs are already hard disabled by the smp_handle_nmi_ipi.
>>   	 */
>> -	set_cpu_online(smp_processor_id(), false);
>> +	if (!(oops_in_progress && should_fadump_crash()))
>> +		set_cpu_online(smp_processor_id(), false);
>>   
>>   	spin_begin();
>>   	while (1)
>> @@ -650,7 +652,8 @@ static void stop_this_cpu(void *dummy)
>>   	 * to know other CPUs are offline before it breaks locks to flush
>>   	 * printk buffers, in case we panic()ed while holding the lock.
>>   	 */
>> -	set_cpu_online(smp_processor_id(), false);
>> +	if (!(oops_in_progress && should_fadump_crash()))
>> +		set_cpu_online(smp_processor_id(), false);
> 
> The comment talks about printk_safe_flush_on_panic(), and this change
> would presumably break that. Except that printk_safe_flush_on_panic() no
> longer exists.
> 
> So do we need to set the CPU online here at all?
> 
> ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
> now that printk_safe_flush_on_panic() no longer exists?

Yeah, sounds like the logical thing to do but I guess, Nick would be in
a better position to answer this..

Thanks,
Hari

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
  2021-11-11 12:06   ` Hari Bathini
@ 2021-11-11 13:20     ` Nicholas Piggin
  2021-11-17  6:09       ` Hari Bathini
  0 siblings, 1 reply; 5+ messages in thread
From: Nicholas Piggin @ 2021-11-11 13:20 UTC (permalink / raw)
  To: Hari Bathini, linuxppc-dev, Michael Ellerman; +Cc: mahesh, sourabhjain

Excerpts from Hari Bathini's message of November 11, 2021 10:06 pm:
> 
> 
> On 11/11/21 11:44 am, Michael Ellerman wrote:
>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>> In panic path, fadump is triggered via a panic notifier function.
>>> Before calling panic notifier functions, smp_send_stop() gets called,
>>> which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
>>> ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
>>> again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>> started marking CPUs as offline while stopping them. So, if a kernel
>>> has either of the above commits, vmcore captured with fadump via panic
>>> path would show only the panic'ing CPU as online. Sample output of
>>> crash-utility with such vmcore:
>>>
>>>    # crash vmlinux vmcore
>>>    ...
>>>          KERNEL: vmlinux
>>>        DUMPFILE: vmcore  [PARTIAL DUMP]
>>>            CPUS: 1
>>>            DATE: Wed Nov 10 09:56:34 EST 2021
>>>          UPTIME: 00:00:42
>>>    LOAD AVERAGE: 2.27, 0.69, 0.24
>>>           TASKS: 183
>>>        NODENAME: XXXXXXXXX
>>>         RELEASE: 5.15.0+
>>>         VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
>>>         MACHINE: ppc64le  (2500 Mhz)
>>>          MEMORY: 8 GB
>>>           PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>>             PID: 3394
>>>         COMMAND: "bash"
>>>            TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
>>>             CPU: 1
>>>           STATE: TASK_RUNNING (PANIC)
>>>
>>>    crash> p -x __cpu_online_mask
>>>    __cpu_online_mask = $1 = {
>>>      bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>>    }
>>>    crash>
>>>    crash>
>>>    crash> p -x __cpu_active_mask
>>>    __cpu_active_mask = $2 = {
>>>      bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>>    }
>>>    crash>
>>>
>>> While this has been the case since fadump was introduced, the issue
>>> was not identified for two probable reasons:
>>>
>>>    - In general, the bulk of the vmcores analyzed were from crash
>>>      due to exception.
>>>
>>>    - The above did change since commit 8341f2f222d7 ("sysrq: Use
>>>      panic() to force a crash") started using panic() instead of
>>>      deferencing NULL pointer to force a kernel crash. But then
>>>      commit de6e5d38417e ("powerpc: smp_send_stop do not offline
>>>      stopped CPUs") stopped marking CPUs as offline till kernel
>>>      commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>>      reverted that change.
>>>
>>> To avoid vmcore from showing only one CPU as online in panic path,
>>> skip marking non panic'ing CPUs as offline while stopping them
>>> during fadump crash.
>> 
>> Is this really worth the added complexity/bug surface?
>> 
>> Why does it matter if the vmcore shows only one CPU online?
> 
> We lose out on backtrace/register data of non-crashing CPUs as they
> are explicitly marked offline.
> 
> Actually, the state of CPU resources is explicitly changed after the
> panic though the aim of vmcore is to capture the system state at the
> time of panic...
> 
> Alternatively, how about moving crash_fadump() call from panic notifier
> into panic() function explicitly, just like __crash_kexec() - before the
> smp_send_stop() call, so as to remove dependency with smp_send_stop()
> implementation altogether...

Does the crash dump code snapshot the CPUs with send_debuggr_break? I
can't remember the exact flow. But if it sends NMI IPIs to all other
CPUs then maybe it could be moved before smp_send_stop, good idea.

> 
>> 
>>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>>> index c23ee842c4c3..20555d5d5966 100644
>>> --- a/arch/powerpc/kernel/smp.c
>>> +++ b/arch/powerpc/kernel/smp.c
>>> @@ -61,6 +61,7 @@
>>>   #include <asm/cpu_has_feature.h>
>>>   #include <asm/ftrace.h>
>>>   #include <asm/kup.h>
>>> +#include <asm/fadump.h>
>>>   
>>>   #ifdef DEBUG
>>>   #include <asm/udbg.h>
>>> @@ -626,7 +627,8 @@ static void nmi_stop_this_cpu(struct pt_regs *regs)
>>>   	/*
>>>   	 * IRQs are already hard disabled by the smp_handle_nmi_ipi.
>>>   	 */
>>> -	set_cpu_online(smp_processor_id(), false);
>>> +	if (!(oops_in_progress && should_fadump_crash()))
>>> +		set_cpu_online(smp_processor_id(), false);
>>>   
>>>   	spin_begin();
>>>   	while (1)
>>> @@ -650,7 +652,8 @@ static void stop_this_cpu(void *dummy)
>>>   	 * to know other CPUs are offline before it breaks locks to flush
>>>   	 * printk buffers, in case we panic()ed while holding the lock.
>>>   	 */
>>> -	set_cpu_online(smp_processor_id(), false);
>>> +	if (!(oops_in_progress && should_fadump_crash()))
>>> +		set_cpu_online(smp_processor_id(), false);
>> 
>> The comment talks about printk_safe_flush_on_panic(), and this change
>> would presumably break that. Except that printk_safe_flush_on_panic() no
>> longer exists.
>> 
>> So do we need to set the CPU online here at all?
>> 
>> ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>> now that printk_safe_flush_on_panic() no longer exists?
> 
> Yeah, sounds like the logical thing to do but I guess, Nick would be in
> a better position to answer this..

Maybe we could look at reverting it, it would be nice. But I think it 
would be good to consider moving crash_fadump as well. There is at
least one issue with removal of some of that flushing -

https://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git/commit/?h=rework/printk_safe-removal&id=5d5e4522a7f404d1a96fd6c703989d32a9c9568d

I haven't really looked at or tested panic path, it's notoriously
buggy in general so who knows if we have to add workarounds. So even
if we revert it now, retaining the option to change our minds again
and mark CPUs offline as x86 does would be a good thing.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic
  2021-11-11 13:20     ` Nicholas Piggin
@ 2021-11-17  6:09       ` Hari Bathini
  0 siblings, 0 replies; 5+ messages in thread
From: Hari Bathini @ 2021-11-17  6:09 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev, Michael Ellerman; +Cc: mahesh, sourabhjain



On 11/11/21 6:50 pm, Nicholas Piggin wrote:
> Excerpts from Hari Bathini's message of November 11, 2021 10:06 pm:
>>
>>
>> On 11/11/21 11:44 am, Michael Ellerman wrote:
>>> Hari Bathini <hbathini@linux.ibm.com> writes:
>>>> In panic path, fadump is triggered via a panic notifier function.
>>>> Before calling panic notifier functions, smp_send_stop() gets called,
>>>> which stops all CPUs except the panic'ing CPU. Commit 8389b37dffdc
>>>> ("powerpc: stop_this_cpu: remove the cpu from the online map.") and
>>>> again commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>>> started marking CPUs as offline while stopping them. So, if a kernel
>>>> has either of the above commits, vmcore captured with fadump via panic
>>>> path would show only the panic'ing CPU as online. Sample output of
>>>> crash-utility with such vmcore:
>>>>
>>>>     # crash vmlinux vmcore
>>>>     ...
>>>>           KERNEL: vmlinux
>>>>         DUMPFILE: vmcore  [PARTIAL DUMP]
>>>>             CPUS: 1
>>>>             DATE: Wed Nov 10 09:56:34 EST 2021
>>>>           UPTIME: 00:00:42
>>>>     LOAD AVERAGE: 2.27, 0.69, 0.24
>>>>            TASKS: 183
>>>>         NODENAME: XXXXXXXXX
>>>>          RELEASE: 5.15.0+
>>>>          VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
>>>>          MACHINE: ppc64le  (2500 Mhz)
>>>>           MEMORY: 8 GB
>>>>            PANIC: "Kernel panic - not syncing: sysrq triggered crash"
>>>>              PID: 3394
>>>>          COMMAND: "bash"
>>>>             TASK: c0000000150a5f80  [THREAD_INFO: c0000000150a5f80]
>>>>              CPU: 1
>>>>            STATE: TASK_RUNNING (PANIC)
>>>>
>>>>     crash> p -x __cpu_online_mask
>>>>     __cpu_online_mask = $1 = {
>>>>       bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>>>     }
>>>>     crash>
>>>>     crash>
>>>>     crash> p -x __cpu_active_mask
>>>>     __cpu_active_mask = $2 = {
>>>>       bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
>>>>     }
>>>>     crash>
>>>>
>>>> While this has been the case since fadump was introduced, the issue
>>>> was not identified for two probable reasons:
>>>>
>>>>     - In general, the bulk of the vmcores analyzed were from crash
>>>>       due to exception.
>>>>
>>>>     - The above did change since commit 8341f2f222d7 ("sysrq: Use
>>>>       panic() to force a crash") started using panic() instead of
>>>>       deferencing NULL pointer to force a kernel crash. But then
>>>>       commit de6e5d38417e ("powerpc: smp_send_stop do not offline
>>>>       stopped CPUs") stopped marking CPUs as offline till kernel
>>>>       commit bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>>>       reverted that change.
>>>>
>>>> To avoid vmcore from showing only one CPU as online in panic path,
>>>> skip marking non panic'ing CPUs as offline while stopping them
>>>> during fadump crash.
>>>
>>> Is this really worth the added complexity/bug surface?
>>>
>>> Why does it matter if the vmcore shows only one CPU online?
>>
>> We lose out on backtrace/register data of non-crashing CPUs as they
>> are explicitly marked offline.
>>
>> Actually, the state of CPU resources is explicitly changed after the
>> panic though the aim of vmcore is to capture the system state at the
>> time of panic...
>>
>> Alternatively, how about moving crash_fadump() call from panic notifier
>> into panic() function explicitly, just like __crash_kexec() - before the
>> smp_send_stop() call, so as to remove dependency with smp_send_stop()
>> implementation altogether...
> 
> Does the crash dump code snapshot the CPUs with send_debuggr_break? I
> can't remember the exact flow. But if it sends NMI IPIs to all other
> CPUs then maybe it could be moved before smp_send_stop, good idea.

Except for crashing CPU, snapshot of register data for all other CPUs is
collected by f/w on ibm,os-term rtas call.


>>> The comment talks about printk_safe_flush_on_panic(), and this change
>>> would presumably break that. Except that printk_safe_flush_on_panic() no
>>> longer exists.
>>>
>>> So do we need to set the CPU online here at all?
>>>
>>> ie. could we revert bab26238bbd4 ("powerpc: Offline CPU in stop_this_cpu()")
>>> now that printk_safe_flush_on_panic() no longer exists?
>>
>> Yeah, sounds like the logical thing to do but I guess, Nick would be in
>> a better position to answer this..
> 
> Maybe we could look at reverting it, it would be nice. But I think it
> would be good to consider moving crash_fadump as well. There is at

OK, I will try moving crash_fadump() instead and post the patch..

Thanks
Hari

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-11-17  6:10 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-10 19:01 [PATCH] ppc64/fadump: fix inaccurate CPU state info in vmcore generated with panic Hari Bathini
2021-11-11  6:14 ` Michael Ellerman
2021-11-11 12:06   ` Hari Bathini
2021-11-11 13:20     ` Nicholas Piggin
2021-11-17  6:09       ` Hari Bathini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).