* [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
@ 2020-01-18 18:59 Peter.Kurfer
  2020-01-20  9:46 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-18 18:59 UTC (permalink / raw)
  To: xen-devel

Hi,

I was advised to also raise this on the devel mailing list: the error message in question was apparently added in kernel 4.20, and since kernels that recent are not yet widely adopted, it is unlikely that another user has already encountered a similar problem.

Original message (see also here: https://lists.xenproject.org/archives/html/xen-users/2020-01/msg00013.html )

I'm running Xen 4.11.2 on Fedora 30 with Kernel versions 5.4.7 and 5.4.10 on multiple HP servers.

The workflow I'm trying to achieve looks like the following:

- a VM is resumed from a snapshot with a Python script using the libvirt API
- it is running for a few minutes,
- it gets paused and finally destroyed for testing purposes
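
A minimal sketch of one such cycle, for illustration only and not the actual script. It assumes that "resumed from a snapshot" means restoring a saved state file via `virConnect.restore()`; the function name `run_cycle`, the path, the domain name and the connection URI are placeholders:

```python
import time


def run_cycle(conn, saved_state_path, domain_name, run_seconds, sleep=time.sleep):
    """Restore a saved domain, let it run for a while, then pause and destroy it.

    `conn` is expected to behave like a libvirt.virConnect object.
    """
    # Resume the guest from its saved state file (assumption: save/restore,
    # not the snapshot API, is what the workflow uses).
    conn.restore(saved_state_path)
    dom = conn.lookupByName(domain_name)

    # Let the guest run for the requested time.
    sleep(run_seconds)

    # Pause the guest, then tear it down without a clean shutdown.
    dom.suspend()
    dom.destroy()


# Hypothetical usage on a Xen host with the libvirt Python bindings installed:
#   import libvirt
#   conn = libvirt.open("xen:///system")
#   run_cycle(conn, "/var/lib/libvirt/save/test-vm.save", "test-vm", 300)
```

Passing the connection in as a parameter keeps the cycle easy to exercise against a stub when narrowing down which step precedes the fault.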

At some point - it doesn't seem to be deterministic, because sometimes it happens directly after boot and sometimes after multiple hours - a huge stack trace starting with an error in `arch/x86/xen/multicalls.c` appears in the kernel logs, ending with the message 'Fixing recursive fault but reboot is needed!'.

After some time the system freezes completely and has to be hard reset because it is no longer possible to log in via SSH.
The freeze is also not deterministic, but there are no other critical errors in the logs, so the two seem to be related.

Because the full stack trace is roughly 370 lines long, I have put it in a GitHub Gist:

https://gist.github.com/baez90/135c3985cbb6fd4b4204269fb384221a

I'm a little confused as to what else to try and I have no idea what the problem might be.

Any hints/ideas/proposals?

Kind regards and thanks in advance 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-18 18:59 [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c Peter.Kurfer
@ 2020-01-20  9:46 ` Jan Beulich
  2020-01-20 12:09   ` Peter.Kurfer
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2020-01-20  9:46 UTC (permalink / raw)
  To: Peter.Kurfer; +Cc: xen-devel

On 18.01.2020 19:59, Peter.Kurfer@gdata.de wrote:
> Hi,
> 
> I was advised to bump this also to the devel mailing list, because the mentioned error message was apparently added in Kernel 4.20 (and upwards) and this kernel version is not broadly adopted already and therefore it is unlikely that another user encountered a similar problem already.
> 
> Original message (see also here: https://lists.xenproject.org/archives/html/xen-users/2020-01/msg00013.html )
> 
> I'm running Xen 4.11.2 on Fedora 30 with Kernel versions 5.4.7 and 5.4.10 on multiple HP servers.
> 
> The workflow I'm trying to achieve looks like the following:
> 
> - a VM is resumed from a snapshot with a Python script using the libvirt API
> - it is running for a few minutes,
> - it gets paused and finally destroyed for testing purposes
> 
> At some point - it doesn't seem to be deterministic because sometimes it  happens directly after the boot and sometimes after multiple hours - a  huge stacktrace starting with an error in `arch/x86/xen/multicalls.c`  can be found in the kernel logs which ends with the message 'Fixing recursive fault but reboot is needed!'.
> 
> After some time the system completely freezes and needs to be hard reset because it is not possible any more to log in via SSH.
> The freeze is also not deterministic but there are no other critical errors in the logs, so it seems somehow to be related.
> 
> Because the full stacktrace has round about 370 lines I attached it as a GitHub Gist:
> 
> https://gist.github.com/baez90/135c3985cbb6fd4b4204269fb384221a
> 
> I'm a little confused as to what else to try and I have no idea what the problem might be.
> 
> Any hints/ideas/proposals?

A debug hypervisor would most likely spit out a log message for every
individual failure. Seeing these messages may help diagnose what's
wrong. Knowing more about what exactly triggers this may also help, but
judging from your report that may be difficult to isolate. Of course all
of this is applicable only if no-one has already found an explanation
(and then perhaps also a fix) for this.

Jan

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-20  9:46 ` Jan Beulich
@ 2020-01-20 12:09   ` Peter.Kurfer
  2020-01-20 12:13     ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-20 12:09 UTC (permalink / raw)
  To: jbeulich; +Cc: xen-devel

I will enable debug logs on two hosts today to see if I can correlate the aforementioned error message with some debug logs.
Anything I should consider to ensure that everything required is included?

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-20 12:09   ` Peter.Kurfer
@ 2020-01-20 12:13     ` Jan Beulich
  2020-01-29  8:29       ` Peter.Kurfer
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2020-01-20 12:13 UTC (permalink / raw)
  To: Peter.Kurfer; +Cc: xen-devel

On 20.01.2020 13:09, Peter.Kurfer@gdata.de wrote:
> I will enable debug logs on two hosts today to see if I can correlate the aforementioned error message with some debug logs.
> Anything I should consider to ensure that everything required is included?

"loglvl=all guest_loglvl=all" should be part of your Xen command line.

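After rebooting, one way to double-check that both options actually reached the hypervisor is to look at the `xen_commandline` line in the output of `xl info`. A minimal Python sketch, assuming that output is passed in as text; the helper names are made up for illustration:

```python
def xen_cmdline_from_xl_info(xl_info_text):
    """Return the value of the 'xen_commandline' field, or '' if it is absent."""
    for line in xl_info_text.splitlines():
        # Fields are printed as "key : value"; split on the first colon only.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_commandline":
            return value.strip()
    return ""


def missing_options(cmdline, required=("loglvl=all", "guest_loglvl=all")):
    """Return the required options that are not present on the command line."""
    present = set(cmdline.split())
    return [opt for opt in required if opt not in present]


# Hypothetical usage in dom0:
#   import subprocess
#   text = subprocess.run(["xl", "info"], capture_output=True, text=True).stdout
#   print(missing_options(xen_cmdline_from_xl_info(text)))
```

An empty list means both options are present.
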
Jan

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-20 12:13     ` Jan Beulich
@ 2020-01-29  8:29       ` Peter.Kurfer
  2020-01-29  8:59         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-29  8:29 UTC (permalink / raw)
  To: jbeulich; +Cc: xen-devel

As requested I configured one host with:

> loglvl=all guest_loglvl=all

and collected one day of logs via serial interface:

https://drive.google.com/drive/folders/1sQvyNH0Sz28tUeVRZl9mowhB0Htd8ZpO?usp=sharing

Searching for "error" or "multicalls.c" turns up some stack traces that might be interesting.

As far as I know the ACPI errors in the context of IPMI can be ignored.

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-29  8:29       ` Peter.Kurfer
@ 2020-01-29  8:59         ` Jan Beulich
  2020-01-29 13:52           ` Peter.Kurfer
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2020-01-29  8:59 UTC (permalink / raw)
  To: Peter.Kurfer; +Cc: xen-devel

On 29.01.2020 09:29, Peter.Kurfer@gdata.de wrote:
> As requested I configured one host with:
> 
>> loglvl=all guest_loglvl=all
> 
> and collected one day of logs via serial interface:
> 
> https://drive.google.com/drive/folders/1sQvyNH0Sz28tUeVRZl9mowhB0Htd8ZpO?usp=sharing
> 
> searching for "error" or "multicalls.c" leads to some stacktraces that might be interesting.

Right, but the bad news is that there are no helpful hypervisor
messages at all. Sadly this is partly my fault, because I should
have asked you to do this log collection with a debug hypervisor.
Most of the possibly interesting messages would appear only there.

In any event, problems start quite a bit earlier, and typically
it's the first instance of a problem that is the most helpful to
analyze, as later ones may be cascade issues. The first sign of
problems is an overlapping

[14991.827762] BUG: unable to handle page fault for address: ffff888ae2eb6bd8

and

[14991.828172] WARNING: CPU: 5 PID: 2585 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x194/0x1c0

on CPUs 8 and 5.
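
To pick out that first instance from a long capture mechanically, something along these lines may help. It is only a sketch: it assumes the dom0 kernel lines in the serial log carry the usual `[seconds.micros]` prefix, and the function name and the 10 ms grouping window are arbitrary choices:

```python
import re

# Kernel log line with a timestamp followed by "BUG:" or "WARNING:".
# search() is used so that any console prefix before the timestamp is tolerated.
PROBLEM_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(BUG|WARNING):\s*(.*)")


def first_problems(lines, window=0.01):
    """Return the earliest BUG/WARNING and any others within `window` seconds of it."""
    hits = []
    for line in lines:
        match = PROBLEM_RE.search(line)
        if match:
            hits.append((float(match.group(1)), match.group(2), match.group(3).strip()))
    if not hits:
        return []
    hits.sort(key=lambda hit: hit[0])
    start = hits[0][0]
    # Keep only the reports overlapping with the very first one.
    return [hit for hit in hits if hit[0] - start <= window]


# Hypothetical usage with a captured serial log:
#   with open("serial.log", errors="replace") as fh:
#       for ts, kind, text in first_problems(fh):
#           print(ts, kind, text)
```

Later reports are dropped on purpose, since they may only be follow-on effects of the first one.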

> As far as I know the ACPI errors in the context of IPMI can be ignored.

Looks like it, yes, at least for the purposes here. What I wouldn't
rule out as a possible cause of problems is the significant number
of temperature-related messages. What I also find at least curious
(but possibly just because I know too little of the respective
aspects of modern kernels) are the recurring __text_poke() instances
in the stack traces. Assuming these are to be expected in the first
place, there might be a race here which is either Xen-specific or
simply has a much better chance of hitting (larger window?) when
running on Xen. But I'm afraid this will need looking into (or at
least commenting on) by a kernel person.

Jan

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-29  8:59         ` Jan Beulich
@ 2020-01-29 13:52           ` Peter.Kurfer
  0 siblings, 0 replies; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-29 13:52 UTC (permalink / raw)
  To: jbeulich; +Cc: xen-devel

> Right, but the bad news is that there are no helpful hypervisor
> messages at all. Sadly this is partly my fault, because I should
> have asked you to do this log collection with a debug hypervisor.
> Most of the possibly interesting messages would appear only there.

> In any event, problems start quite a bit earlier, and typically
> it's the first instance of a problem that is the most helpful to
> analyze, as later ones may be cascade issues. The first sign of
> problems is an overlapping

To be honest, I was already wondering why there were so few log messages. While I had already found the CMDLINE_XEN options for debug logging, I have not yet found any documentation on how to build a debug hypervisor, and it took me some time to work around the fact that I don't have physical access to the server to attach an actual serial cable.

I will try to compile Xen with debug enabled and collect more logs afterwards.
Anything to be aware of?

