xen-devel.lists.xenproject.org archive mirror
* [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
@ 2019-10-28 17:30 Stonehouse, Robert
  2019-10-29  9:56 ` Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Stonehouse, Robert @ 2019-10-28 17:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Durrant, Paul, Stonehouse, Robert, Elnikety, Eslam, Jan Beulich

This is a heads-up as I have observed that the following commit (backported onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail. 
========
commit c719519a4183d0630121f6abeba420f49dbc3229
Author: Jan Beulich <jbeulich@suse.com>
AuthorDate: Fri Jul 5 10:32:41 2019 +0200
Commit: Jan Beulich <jbeulich@suse.com>
CommitDate: Fri Jul 5 10:32:41 2019 +0200

x86/SMP: don't try to stop already stopped CPUs
    
    In particular with an enabled IOMMU (but not really limited to this
    case), trying to invoke fixup_irqs() after having already done
    disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
========

The test runs "echo c > /proc/sysrq-trigger" in dom0, after which the loaded crash kernel shows no sign of starting. This is the end of the Xen console output:
========
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
<machine hangs here then reboots via the BIOS after 5 seconds>
========
Expected behaviour is that the kdump kernel loads immediately and then performs the crash dump.

I'm sorry that I have not yet had time to check if this affects vanilla stable-4.11 or master. I just wanted to be certain that you don't have the same issue.


Reverting one hunk via the following commit fixes things for me (this is an experiment and not at all a proposed fix):
========
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
 void smp_send_stop(void)
 {
     unsigned int cpu = smp_processor_id();
+    
+    local_irq_disable();
+    fixup_irqs(cpumask_of(cpu), 0);
+    local_irq_enable();
 
     if ( num_online_cpus() > 1 )
     {
         int timeout = 10;
 
-        local_irq_disable();
-        fixup_irqs(cpumask_of(cpu), 0);
-        local_irq_enable();
-
         smp_call_function(stop_this_cpu, NULL, 0);
 
         /* Wait 10ms for all other CPUs to go offline. */
========

Regards
Rob

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
  2019-10-28 17:30 [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure Stonehouse, Robert
@ 2019-10-29  9:56 ` Jan Beulich
  2019-10-29 10:46 ` Dietmar Hahn
  2019-10-29 11:29 ` Sergey Dyasli
  2 siblings, 0 replies; 5+ messages in thread
From: Jan Beulich @ 2019-10-29  9:56 UTC (permalink / raw)
  To: Stonehouse, Robert; +Cc: xen-devel, Durrant, Paul, Elnikety, Eslam

On 28.10.2019 18:30, Stonehouse, Robert wrote:
> Reverting one hunk via the following commit fixes things for me (this is an experiment and not at all a proposed fix)
> ========
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -303,15 +303,15 @@ static void stop_this_cpu(void *dummy)
>  void smp_send_stop(void)
>  {
>      unsigned int cpu = smp_processor_id();
> +    
> +    local_irq_disable();
> +    fixup_irqs(cpumask_of(cpu), 0);
> +    local_irq_enable();
> 
>      if ( num_online_cpus() > 1 )
>      {
>          int timeout = 10;
>  
> -        local_irq_disable();
> -        fixup_irqs(cpumask_of(cpu), 0);
> -        local_irq_enable();

Are you saying we get here the first time only when num_online_cpus()
already returns 1 (but there are actually multiple CPUs, i.e. affinity
changes are actually needed)? If so - why?

Jan


* Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
  2019-10-28 17:30 [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure Stonehouse, Robert
  2019-10-29  9:56 ` Jan Beulich
@ 2019-10-29 10:46 ` Dietmar Hahn
  2019-10-29 11:29 ` Sergey Dyasli
  2 siblings, 0 replies; 5+ messages in thread
From: Dietmar Hahn @ 2019-10-29 10:46 UTC (permalink / raw)
  To: xen-devel; +Cc: Durrant, Paul, Stonehouse, Robert, Elnikety, Eslam, Jan Beulich


Hi,

On Monday, 28 October 2019, 18:30:12 CET, Stonehouse, Robert wrote:
> This is a heads-up as I have observed that the following commit (backported onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail. 
> ========
> commit c719519a4183d0630121f6abeba420f49dbc3229
> Author: Jan Beulich <jbeulich@suse.com>
> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
> Commit: Jan Beulich <jbeulich@suse.com>
> CommitDate: Fri Jul 5 10:32:41 2019 +0200
> 
> x86/SMP: don't try to stop already stopped CPUs
>     
>     In particular with an enabled IOMMU (but not really limited to this
>     case), trying to invoke fixup_irqs() after having already done
>     disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
> ========
> 
> The test was performing "echo c > /proc/sysrq-trigger" in dom0 and the loaded crash kernel fails to show any signs of starting. This is the end of the Xen console ...
> ========
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> (XEN) Resetting with ACPI MEMORY or I/O RESET_REG.
> <machine hangs here then reboots via the BIOS after 5 seconds>
> ========
> Expected behaviour is that the kdump kernel immediately loads and then performs the crash dump

I can confirm this behavior, but with the Xen version (4.11.0_08-1) from
SuSE SLES12 SP4, which doesn't contain the said commit
c719519a4183d0630121f6abeba420f49dbc3229. But I can see this only on
systems with newer Intel CPUs like "Intel(R) Xeon(R) Gold 6242 CPU".






* Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
  2019-10-28 17:30 [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure Stonehouse, Robert
  2019-10-29  9:56 ` Jan Beulich
  2019-10-29 10:46 ` Dietmar Hahn
@ 2019-10-29 11:29 ` Sergey Dyasli
  2019-10-29 12:01   ` Jan Beulich
  2 siblings, 1 reply; 5+ messages in thread
From: Sergey Dyasli @ 2019-10-29 11:29 UTC (permalink / raw)
  To: Stonehouse, Robert, xen-devel; +Cc: Durrant, Paul, Elnikety, Eslam, Jan Beulich

On 28/10/2019 17:30, Stonehouse, Robert wrote:
> This is a heads-up as I have observed that the following commit (backported onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail. 
> ========
> commit c719519a4183d0630121f6abeba420f49dbc3229
> Author: Jan Beulich <jbeulich@suse.com>
> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
> Commit: Jan Beulich <jbeulich@suse.com>
> CommitDate: Fri Jul 5 10:32:41 2019 +0200
> 
> x86/SMP: don't try to stop already stopped CPUs
>     
>     In particular with an enabled IOMMU (but not really limited to this
>     case), trying to invoke fixup_irqs() after having already done
>     disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
> ========

This was already fixed in staging by "x86/crash: fix kexec transition breakage":

	https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=f56813f3470c5b4987963c3c41e4fe16b95c5a3f

Looks like it needs inclusion into 4.11 branch.

--
Thanks,
Sergey


* Re: [Xen-devel] [stable-4.11] Heads-up: c719519 (x86/SMP: don't try to stop already stopped CPUs) causes 100% kexec/kdump failure
  2019-10-29 11:29 ` Sergey Dyasli
@ 2019-10-29 12:01   ` Jan Beulich
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Beulich @ 2019-10-29 12:01 UTC (permalink / raw)
  To: Sergey Dyasli, Stonehouse, Robert
  Cc: xen-devel, Durrant, Paul, Elnikety, Eslam

On 29.10.2019 12:29, Sergey Dyasli wrote:
> On 28/10/2019 17:30, Stonehouse, Robert wrote:
>> This is a heads-up as I have observed that the following commit (backported onto an Amazon 4.11 tree) causes kexec (and hence kdump) to fail. 
>> ========
>> commit c719519a4183d0630121f6abeba420f49dbc3229
>> Author: Jan Beulich <jbeulich@suse.com>
>> AuthorDate: Fri Jul 5 10:32:41 2019 +0200
>> Commit: Jan Beulich <jbeulich@suse.com>
>> CommitDate: Fri Jul 5 10:32:41 2019 +0200
>>
>> x86/SMP: don't try to stop already stopped CPUs
>>     
>>     In particular with an enabled IOMMU (but not really limited to this
>>     case), trying to invoke fixup_irqs() after having already done
>>     disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
>> ========
> 
> This was already fixed in staging by "x86/crash: fix kexec transition breakage":
> 
> 	https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=f56813f3470c5b4987963c3c41e4fe16b95c5a3f
> 
> Looks like it needs inclusion into 4.11 branch.

Hmm, in principle I did fish out this one and a few more for
backporting. But it looks like I've applied them to the 4.12
branch only. Thanks for noticing!

Jan
