From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Cooper
Subject: Re: HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2
Date: Fri, 13 Nov 2015 10:09:00 +0000
Message-ID: <5645B6BC.6030603@citrix.com>
References: <5643E68C.8090406@web2web.at>
 <564499B002000078000B43EE@prv-mh.provo.novell.com>
 <56448D9B.4090007@citrix.com>
 <5644A248.1060505@web2web.at>
 <5644C1CD.3020202@citrix.com>
 <56451A2B.9090706@web2web.at>
 <56459E5F02000078000B4944@prv-mh.provo.novell.com>
Mime-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------------030205070703040602050506"
In-Reply-To: <56459E5F02000078000B4944@prv-mh.provo.novell.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Atom2
Cc: Jan Beulich, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

--------------030205070703040602050506
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit

On 13/11/15 07:25, Jan Beulich wrote:
>>>> On 13.11.15 at 00:00, wrote:
>> Am 12.11.15 um 17:43 schrieb Andrew Cooper:
>>> On 12/11/15 14:29, Atom2 wrote:
>>>> Hi Andrew,
>>>> thanks for your reply. Answers are inline further down.
>>>>
>>>> Am 12.11.15 um 14:01 schrieb Andrew Cooper:
>>>>> On 12/11/15 12:52, Jan Beulich wrote:
>>>>>>>>> On 12.11.15 at 02:08, wrote:
>>>>>>> After the upgrade HVM domUs appear to no longer work - regardless
>>>>>>> of the
>>>>>>> dom0 kernel (tested with both 3.18.9 and 4.1.7 as the dom0 kernel); PV
>>>>>>> domUs, however, work just fine as before on both dom0 kernels.
>>>>>>>
>>>>>>> xl dmesg shows the following information after the first crashed HVM
>>>>>>> domU which is started as part of the machine booting up:
>>>>>>> [...]
>>>>>>> (XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest
>>>>>>> state (0).
>>>>>>> (XEN) ************* VMCS Area **************
>>>>>>> (XEN) *** Guest State ***
>>>>>>> (XEN) CR0: actual=0x0000000000000039, shadow=0x0000000000000011, gh_mask=ffffffffffffffff
>>>>>>> (XEN) CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffffff
>>>>>>> (XEN) CR3: actual=0x0000000000800000, target_count=0
>>>>>>> (XEN) target0=0000000000000000, target1=0000000000000000
>>>>>>> (XEN) target2=0000000000000000, target3=0000000000000000
>>>>>>> (XEN) RSP = 0x0000000000006fdc (0x0000000000006fdc) RIP = 0x0000000100000000 (0x0000000100000000)
>>>>>> Other than RIP looking odd for a guest still in non-paged protected
>>>>>> mode I can't seem to spot anything wrong with guest state.
>>>>> odd? That will be the source of the failure.
>>>>>
>>>>> Out of long mode, the upper 32 bits of %rip should all be zero, and it
>>>>> should not be possible to set any of them.
>>>>>
>>>>> I suspect that the guest has exited for emulation, and there has been a
>>>>> bad update to %rip. The alternative (which I hope is not the case) is
>>>>> that there is a hardware erratum which allows the guest to accidentally
>>>>> get itself into this condition.
>>>>>
>>>>> Are you able to rerun with a debug build of the hypervisor?
>>>> [snip]
>>>> Another question is whether prior to enabling the debug USE flag it
>>>> might make sense to re-compile with gcc-4.8.5 (please see my previous
>>>> list reply) to rule out any compiler-related issues. Jan, Andrew -
>>>> what are your thoughts?
>>> First of all, check whether the compiler makes a difference on 4.5.2
>> Hi Andrew,
>> I changed the compiler and there was no change for the better:
>> Unfortunately the HVM domU is still crashing with a similar error
>> message as soon as it is being started.
>>> If both compiles result in a guest crashing in that manner, test a debug
>>> Xen to see if any assertions/errors are encountered just before the
>>> guest crashes.
>>>
>> As the compiler did not make any difference, I enabled the debug USE
>> flag, re-compiled (using gcc-4.9.3), and rebooted using a serial console
>> to capture output. Unfortunately I did not get very far and things
>> became even stranger: This time the system did not even finish the boot
>> process, but rather hard-stopped pretty early with a message reading
>> "Panic on CPU 3: DOUBLE FAULT -- system shutdown". The captured logfile
>> is attached as "serial log.txt".
>>
>> As this happened immediately after the CPU microcode update, I thought
>> there might be a connection and disabled the microcode update. After the
>> next reboot it seemed as if the boot process got a bit further as
>> evidenced by a few more lines in the log file (those between lines 136
>> and 197 in the second log file named "serial log no ucode.txt"), but in
>> the end it finished off with an identical error message (only the CPU #
>> was different this time, but that number seems to change between boots
>> anyways).
>>
>> I hope that makes some sense to you.
> Not really, other than now even more suspecting bad hardware or
> something fundamentally wrong with your build. Did you retry with
> a freshly built 4.5.1? Could you alternatively try with a known good
> build of 4.5.2 (e.g. from osstest)?

Agreed. Double faults indicate that the exception handling entry points
are not set up in an appropriate state. Something is definitely wrong
with either the compiled binary or the hardware.

Several questions and lines of investigation:

Is this straight Xen 4.5.1 and 4.5.2, or does Gentoo have its own
patches on top?

On repeated attempts, are the details of the double fault identical
(other than the CPU, i.e. always do_IRQ+0x15), or do they move around?

In future, can you boot with console_timestamps=boot on the Xen command
line? This will put Linux-style timestamps on log messages.

Can you also compile in the attached patch?
I haven't quite got it suitable for inclusion upstream yet, but it will
also dump the instruction stream under the fault.

Finally, can you disassemble the xen-syms which results from the debug
build and paste the start of do_IRQ? (i.e. `gdb xen-syms` and then
`disass do_IRQ`)

~Andrew

--------------030205070703040602050506
Content-Type: text/x-diff;
 name="0001-x86-traps-Dump-instruction-stream-in-show_execution_.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename*0="0001-x86-traps-Dump-instruction-stream-in-show_execution_.pa";
 filename*1="tch"

From a0afc573ca47abe7f900a404d366599e5da93391 Mon Sep 17 00:00:00 2001
From: Andrew Cooper
Date: Tue, 14 Jul 2015 16:58:49 +0100
Subject: [PATCH] x86/traps: Dump instruction stream in show_execution_state()

For first pass triage of crashes, it is useful to have the instruction
stream present, especially now that Xen binary patches itself.

A sample output now looks like:

(XEN) ----[ Xen-4.6-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[] default_idle+0x76/0x7b
(XEN) RFLAGS: 0000000000000246 CONTEXT: hypervisor
(XEN) rax: ffff82d080331030 rbx: ffff83007fce8000 rcx: 0000000000000000
(XEN) rdx: 0000000000000000 rsi: ffff82d080331b98 rdi: 0000000000000000
(XEN) rbp: ffff83007fcefef0 rsp: ffff83007fcefef0 r8: ffff83007faf8118
(XEN) r9: 00000009983e89fd r10: 00000009983e89fd r11: 0000000000000246
(XEN) r12: ffff83007fd61000 r13: 00000000ffffffff r14: ffff83007fad9000
(XEN) r15: ffff83007fae3000 cr0: 000000008005003b cr4: 00000000000026e0
(XEN) cr3: 000000007fc9b000 cr2: 00007f70976b3fed
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen code around e008:ffff82d0801607e4 (default_idle+0x76/0x7b):
(XEN)  83 3c 10 00 75 04 fb f4 01 fb 5d c3 55 48 89 e5 3b 3d 0d 50 12 00 72
(XEN) Xen stack trace from rsp=ffff83007fcefef0:
(XEN)    ffff83007fceff10 ffff82d080160e08 ffff82d08012c40a ffff83007faf9000
(XEN)    ffff83007fcefdd8 ffffffff81a01fd8 ffff88002f07d4c0 ffffffff81a01fd8
(XEN)    0000000000000000 ffffffff81a01e58 ffffffff81a01fd8 0000000000000246
(XEN)    00000000ffff0052 0000000000000000 0000000000000000 0000000000000000
(XEN)    ffffffff810013aa 0000000000000001 00000000deadbeef 00000000deadbeef
(XEN)    0000010000000000 ffffffff810013aa 000000000000e033 0000000000000246
(XEN)    ffffffff81a01e40 000000000000e02b 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 ffff83007faf9000
(XEN)    0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [] default_idle+0x76/0x7b
(XEN)    [] idle_loop+0x51/0x6e
(XEN)

Signed-off-by: Andrew Cooper
CC: Jan Beulich
---
Currently limited to just hypervisor context, but it could be extended
to vcpus as well.
---
 xen/arch/x86/traps.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index e21fb78..658e0d7 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -114,6 +114,31 @@ boolean_param("ler", opt_ler);
 #define stack_words_per_line 4
 #define ESP_BEFORE_EXCEPTION(regs) ((unsigned long *)regs->rsp)
 
+static void show_code(const struct cpu_user_regs *regs)
+{
+    char insns[24];
+    unsigned int i, not_copied;
+    void *__user start_ip = (void *)regs->rip - 8;
+
+    if ( guest_mode(regs) )
+        return;
+
+    not_copied = __copy_from_user(insns, start_ip, ARRAY_SIZE(insns));
+
+    printk("Xen code around %04x:%p (%ps)%s:\n",
+           regs->cs, _p(regs->rip), _p(regs->rip),
+           !!not_copied ? " [fault on access]" : "");
+
+    for ( i = 0; i < ARRAY_SIZE(insns) - not_copied; ++i )
+    {
+        if ( (unsigned long)(start_ip + i) == regs->rip )
+            printk(" <%02x>", (unsigned char)insns[i]);
+        else
+            printk(" %02x", (unsigned char)insns[i]);
+    }
+    printk("\n");
+}
+
 static void show_guest_stack(struct vcpu *v, const struct cpu_user_regs *regs)
 {
     int i;
@@ -417,6 +442,7 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
 void show_execution_state(const struct cpu_user_regs *regs)
 {
     show_registers(regs);
+    show_code(regs);
     show_stack(regs);
 }
 
-- 
2.1.4

--------------030205070703040602050506
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--------------030205070703040602050506--