* [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
@ 2020-01-18 18:59 Peter.Kurfer
  2020-01-20  9:46 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-18 18:59 UTC
  To: xen-devel

Hi,

I was advised to also post this to the devel mailing list: the error message in question was apparently added in kernel 4.20, and kernels from that version upwards are not yet widely adopted, so it is unlikely that another user has already encountered a similar problem.

Original message (see also here: https://lists.xenproject.org/archives/html/xen-users/2020-01/msg00013.html )

I'm running Xen 4.11.2 on Fedora 30 with kernel versions 5.4.7 and 5.4.10 on multiple HP servers.

The workflow I'm trying to achieve looks like this:

- a VM is resumed from a snapshot with a Python script using the libvirt API,
- it runs for a few minutes,
- it gets paused and finally destroyed for testing purposes.

At some point a huge stack trace starting with an error in `arch/x86/xen/multicalls.c` shows up in the kernel logs, ending with the message 'Fixing recursive fault but reboot is needed!'. This does not seem to be deterministic: sometimes it happens directly after boot and sometimes after multiple hours.

After some time the system freezes completely and needs a hard reset, because it is no longer possible to log in via SSH.
The freeze is also not deterministic, but there are no other critical errors in the logs, so the two seem to be related.

Because the full stack trace is about 370 lines long, I attached it as a GitHub Gist:

https://gist.github.com/baez90/135c3985cbb6fd4b4204269fb384221a

I'm a little confused as to what else to try and I have no idea what the problem might be.

Any hints/ideas/proposals?

Kind regards and thanks in advance

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
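
For readers who want to reproduce the workflow described in the report, the following is only a minimal sketch and not the reporter's actual script. It assumes that the "snapshot" is a saved-state file that libvirt can restore with `virConnect.restore()`; the function name `run_vm_cycle`, the domain name, and the file path are hypothetical. The connection object is passed in, so the logic can be tried without a Xen host.

```python
import time


def run_vm_cycle(conn, domain_name, saved_state_path, run_seconds, sleep=time.sleep):
    """Restore a VM, let it run for a while, then pause and destroy it.

    `conn` is expected to behave like a libvirt.virConnect object.
    """
    # Assumption: the "snapshot" is a saved-state file (as written by a
    # domain save), which virConnect.restore() brings back to life.
    conn.restore(saved_state_path)
    dom = conn.lookupByName(domain_name)
    try:
        sleep(run_seconds)  # the guest runs for a few minutes
        dom.suspend()       # pause the guest
    finally:
        dom.destroy()       # hard-stop the guest, even if pausing failed


# Hypothetical usage on a Xen host with libvirt-python installed:
#   import libvirt
#   conn = libvirt.open("xen:///system")
#   run_vm_cycle(conn, "test-vm", "/var/lib/libvirt/save/test-vm.save", 300)
```

Running such a cycle in a loop, and noting the time of each restore, pause, and destroy, may help correlate the kernel messages with a specific step.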

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-18 18:59 [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c Peter.Kurfer
@ 2020-01-20  9:46 ` Jan Beulich
  2020-01-20 12:09   ` Peter.Kurfer
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2020-01-20  9:46 UTC
  To: Peter.Kurfer; +Cc: xen-devel

On 18.01.2020 19:59, Peter.Kurfer@gdata.de wrote:
> Hi,
>
> I was advised to bump this also to the devel mailing list, because the mentioned error message was apparently added in Kernel 4.20 (and upwards) and this kernel version is not broadly adopted already and therefore it is unlikely that another user encountered a similar problem already.
>
> Original message (see also here: https://lists.xenproject.org/archives/html/xen-users/2020-01/msg00013.html )
>
> I'm running Xen 4.11.2 on Fedora 30 with Kernel versions 5.4.7 and 5.4.10 on multiple HP servers.
>
> The workflow I'm trying to achieve looks like the following:
>
> - a VM is resumed from a snapshot with a Python script using the libvirt API
> - it is running for a few minutes,
> - it gets paused and finally destroyed for testing purposes
>
> At some point - it doesn't seem to be deterministic because sometimes it happens directly after the boot and sometimes after multiple hours - a huge stacktrace starting with an error in `arch/x86/xen/multicalls.c` can be found in the kernel logs which ends with the message 'Fixing recursive fault but reboot is needed!'.
>
> After some time the system completely freezes and needs to be hard reset because it is not possible any more to login via SSH.
> The freeze is also not deterministic but there are no other critical errors in the logs, so it seems somehow to be related.
>
> Because the full stacktrace has roughly 370 lines I attached it as a GitHub Gist:
>
> https://gist.github.com/baez90/135c3985cbb6fd4b4204269fb384221a
>
> I'm a little confused as to what else to try and I have no idea what the problem might be.
>
> Any hints/ideas/proposals?

A debug hypervisor would most likely emit a log message for every
individual failure. Seeing these messages may help diagnose what's
wrong. Knowing more about what exactly triggers this may also help,
but judging from your report that may be difficult to isolate.

Of course all of this applies only if no-one has already found
an explanation (and then perhaps also a fix) for this.

Jan

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-20  9:46 ` Jan Beulich
@ 2020-01-20 12:09   ` Peter.Kurfer
  2020-01-20 12:13     ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-20 12:09 UTC
  To: jbeulich; +Cc: xen-devel

I will enable debug logs on two hosts today to see if I can correlate the aforementioned error message with some debug logs.
Anything I should consider to ensure that everything required is included?

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-20 12:09   ` Peter.Kurfer
@ 2020-01-20 12:13     ` Jan Beulich
  2020-01-29  8:29       ` Peter.Kurfer
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2020-01-20 12:13 UTC
  To: Peter.Kurfer; +Cc: xen-devel

On 20.01.2020 13:09, Peter.Kurfer@gdata.de wrote:
> I will enable debug logs on two hosts today to see if I can correlate the aforementioned error message with some debug logs.
> Anything I should consider to ensure that everything required is included?

"loglvl=all guest_loglvl=all" should be part of your Xen command line.

Jan
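
To double-check that both options actually made it onto the hypervisor command line after a reboot, a small check along these lines could be used. It is a sketch only: the helper name `missing_xen_options` is hypothetical, and it assumes the command line is available as a plain string (for example, copied from the `xen_commandline` field that `xl info` reports).

```python
# Options suggested in the reply above.
REQUIRED_OPTIONS = ("loglvl=all", "guest_loglvl=all")


def missing_xen_options(cmdline, required=REQUIRED_OPTIONS):
    """Return the required options that are absent from a Xen command line."""
    present = set(cmdline.split())  # options are whitespace-separated tokens
    return [opt for opt in required if opt not in present]


# Example with a made-up command line that lacks guest_loglvl=all:
print(missing_xen_options("dom0_mem=4096M loglvl=all"))
```

An empty list means both options are present; comparing whole tokens avoids mistaking `guest_loglvl=all` for `loglvl=all`.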

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-20 12:13     ` Jan Beulich
@ 2020-01-29  8:29       ` Peter.Kurfer
  2020-01-29  8:59         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-29  8:29 UTC
  To: jbeulich; +Cc: xen-devel

As requested I configured one host with:

> loglvl=all guest_loglvl=all

and collected one day of logs via the serial interface:

https://drive.google.com/drive/folders/1sQvyNH0Sz28tUeVRZl9mowhB0Htd8ZpO?usp=sharing

Searching for "error" or "multicalls.c" leads to some stack traces that might be interesting.

As far as I know, the ACPI errors in the context of IPMI can be ignored.

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-29  8:29       ` Peter.Kurfer
@ 2020-01-29  8:59         ` Jan Beulich
  2020-01-29 13:52           ` Peter.Kurfer
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2020-01-29  8:59 UTC
  To: Peter.Kurfer; +Cc: xen-devel

On 29.01.2020 09:29, Peter.Kurfer@gdata.de wrote:
> As requested I configured one host with:
>
>> loglvl=all guest_loglvl=all
>
> and collected one day of logs via serial interface:
>
> https://drive.google.com/drive/folders/1sQvyNH0Sz28tUeVRZl9mowhB0Htd8ZpO?usp=sharing
>
> searching for "error" or "multicalls.c" leads to some stacktraces that might be interesting.

Right, but the bad news is that there are no helpful hypervisor
messages at all. Sadly this is partly my fault, because I should
have asked you to do this log collection with a debug hypervisor.
Most of the possibly interesting messages would appear only there.
In any event, problems start quite a bit earlier, and typically
it's the first instance of a problem that is the most helpful to
analyze, as later ones may be cascade issues. The first sign of
problems is an overlapping

[14991.827762] BUG: unable to handle page fault for address: ffff888ae2eb6bd8

and

[14991.828172] WARNING: CPU: 5 PID: 2585 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x194/0x1c0

on CPUs 8 and 5.

> As far as I know the ACPI errors in the context of IPMI can be ignored.

Looks like so, yes, at least for the purposes here. What I wouldn't
rule out as a possible reason for problems is the significant
amount of temperature related messages.

What I also find at least curious (but possibly just because I
know too little of the respective aspects of modern kernels) are
the recurring __text_poke() instances on the stack traces.
Assuming these are to be expected in the first place, there might
be a race here which is either Xen-specific or simply has a much
better chance of hitting (larger window?) when running on Xen.
But I'm afraid this will need looking into (or at least commenting
on) by a kernel person.

Jan
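
The advice to start from the first instance of a problem can be applied mechanically to a long serial log. The following is a rough sketch under stated assumptions: kernel lines carry a `[seconds.microseconds]` timestamp followed by `BUG:` or `WARNING:`, and events within 10 ms of the earliest one count as "overlapping". The function name `first_problems` and the window size are hypothetical choices, not anything prescribed in this thread.

```python
import re

# Matches e.g. "[14991.827762] BUG: unable to handle page fault ..."
EVENT_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(BUG|WARNING):\s*(.*)")


def first_problems(lines, window=0.01):
    """Return the earliest BUG/WARNING event plus any within `window` seconds of it.

    Each result is a (timestamp, kind, message) tuple, sorted by timestamp.
    """
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events.append((float(match.group(1)), match.group(2), match.group(3).strip()))
    if not events:
        return []
    events.sort(key=lambda event: event[0])
    start = events[0][0]
    # Keep only the events close enough to the first one to be part of it.
    return [event for event in events if event[0] - start <= window]


sample = [
    "[14991.827762] BUG: unable to handle page fault for address: ffff888ae2eb6bd8",
    "[14991.828172] WARNING: CPU: 5 PID: 2585 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x194/0x1c0",
]
for timestamp, kind, message in first_problems(sample):
    print(timestamp, kind, message)
```

Note that kernel timestamps restart at every boot, so a log spanning several boots would need to be split per boot before running such a scan.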

* Re: [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c
  2020-01-29  8:59         ` Jan Beulich
@ 2020-01-29 13:52           ` Peter.Kurfer
  0 siblings, 0 replies; 7+ messages in thread
From: Peter.Kurfer @ 2020-01-29 13:52 UTC
  To: jbeulich; +Cc: xen-devel

> Right, but the bad news is that there are no helpful hypervisor
> messages at all. Sadly this is partly my fault, because I should
> have asked you to do this log collection with a debug hypervisor.
> Most of the possibly interesting messages would appear only there.
> In any event, problems start quite a bit earlier, and typically
> it's the first instance of a problem that is the most helpful to
> analyze, as later ones may be cascade issues. The first sign of
> problems is an overlapping

To be honest, I was already wondering why there were so few logs. While I had found the CMDLINE_XEN options for debug logs, I have not yet found any documentation on how to build a debug hypervisor, and it took me some time to work around the fact that I don't have physical access to the server to attach an actual serial cable.

I will try to compile Xen with debug enabled and collect more logs afterwards. Anything to be aware of?

From: Jan Beulich <jbeulich@suse.com>
Sent: Wednesday, 29 January 2020 09:59
To: Kurfer, Peter
Cc: xen-devel@lists.xenproject.org
Subject: Re: Host freezing after "fixing" recursive fault starting in multicalls.c

On 29.01.2020 09:29, Peter.Kurfer@gdata.de wrote:
> As requested I configured one host with:
>
>> loglvl=all guest_loglvl=all
>
> and collected one day of logs via serial interface:
>
> https://drive.google.com/drive/folders/1sQvyNH0Sz28tUeVRZl9mowhB0Htd8ZpO?usp=sharing
>
> searching for "error" or "multicalls.c" leads to some stacktraces that might be interesting.

Right, but the bad news is that there are no helpful hypervisor messages at all.
Sadly this is partly my fault, because I should have asked you to do this log collection with a debug hypervisor.
Most of the possibly interesting messages would appear only there.
In any event, problems start quite a bit earlier, and typically it's the first instance of a problem that is the most helpful to analyze, as later ones may be cascade issues. The first sign of problems is an overlapping

[14991.827762] BUG: unable to handle page fault for address: ffff888ae2eb6bd8

and

[14991.828172] WARNING: CPU: 5 PID: 2585 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x194/0x1c0

on CPUs 8 and 5.

> As far as I know the ACPI errors in the context of IPMI can be ignored.

Looks like so, yes, at least for the purposes here. What I wouldn't rule out as a possible reason for problems is the significant amount of temperature related messages.

What I also find at least curious (but possibly just because I know too little of the respective aspects of modern kernels) are the recurring __text_poke() instances on the stack traces.
Assuming these are to be expected in the first place, there might be a race here which is either Xen-specific or simply has a much better chance of hitting (larger window?) when running on Xen.
But I'm afraid this will need looking into (or at least commenting on) by a kernel person.

Jan

end of thread

Thread overview: 7+ messages
2020-01-18 18:59 [Xen-devel] Host freezing after "fixing" recursive fault starting in multicalls.c Peter.Kurfer
2020-01-20  9:46 ` Jan Beulich
2020-01-20 12:09   ` Peter.Kurfer
2020-01-20 12:13     ` Jan Beulich
2020-01-29  8:29       ` Peter.Kurfer
2020-01-29  8:59         ` Jan Beulich
2020-01-29 13:52           ` Peter.Kurfer