* [xen-unstable test] 115471: regressions - FAIL
From: osstest service owner @ 2017-11-02 8:19 UTC
To: xen-devel, osstest-admin

flight 115471 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/115471/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemuu-ws16-amd64   17 guest-stop  fail REGR. vs. 114644
 test-amd64-amd64-xl-qemuu-ws16-amd64  17 guest-stop  fail REGR. vs. 114644
 test-amd64-amd64-xl-qemut-ws16-amd64  17 guest-stop  fail REGR. vs. 114644

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-xl                    6 xen-install                fail in 115401 pass in 115471
 test-amd64-amd64-xl-qemuu-ws16-amd64  15 guest-saverestore.2        fail in 115401 pass in 115471
 test-armhf-armhf-xl-vhd               15 guest-start/debian.repeat  fail in 115401 pass in 115471
 test-amd64-amd64-xl-qcow2             19 guest-start/debian.repeat  fail in 115401 pass in 115471
 test-amd64-i386-libvirt-qcow2         17 guest-start/debian.repeat  fail pass in 115401
 test-amd64-amd64-libvirt-vhd          17 guest-start/debian.repeat  fail pass in 115450

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-xsm          14 saverestore-support-check  fail like 114644
 test-amd64-amd64-xl-qemuu-win7-amd64  17 guest-stop                 fail like 114644
 test-amd64-i386-xl-qemuu-win7-amd64   17 guest-stop                 fail like 114644
 test-amd64-amd64-xl-qemut-win7-amd64  17 guest-stop                 fail like 114644
 test-amd64-i386-xl-qemut-win7-amd64   17 guest-stop                 fail like 114644
 test-armhf-armhf-libvirt-raw          13 saverestore-support-check  fail like 114644
 test-armhf-armhf-libvirt              14 saverestore-support-check  fail like 114644
 test-amd64-amd64-xl-pvhv2-intel       12 guest-start                fail never pass
 test-amd64-amd64-xl-pvhv2-amd         12 guest-start                fail never pass
 test-amd64-amd64-libvirt-xsm          13 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt              13 migrate-support-check      fail never pass
 test-amd64-i386-libvirt-xsm           13 migrate-support-check      fail never pass
 test-amd64-i386-libvirt               13 migrate-support-check      fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qcow2         12 migrate-support-check      fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd          12 migrate-support-check      fail never pass
 test-amd64-amd64-qemuu-nested-amd     17 debian-hvm-install/l1/l2   fail never pass
 test-armhf-armhf-xl-rtds              13 migrate-support-check      fail never pass
 test-armhf-armhf-xl-rtds              14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-cubietruck        13 migrate-support-check      fail never pass
 test-armhf-armhf-xl-cubietruck        14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl                   13 migrate-support-check      fail never pass
 test-armhf-armhf-xl                   14 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-xsm          13 migrate-support-check      fail never pass
 test-armhf-armhf-libvirt-raw          12 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd               12 migrate-support-check      fail never pass
 test-armhf-armhf-xl-vhd               13 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-arndale           13 migrate-support-check      fail never pass
 test-armhf-armhf-xl-arndale           14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-xsm               13 migrate-support-check      fail never pass
 test-armhf-armhf-xl-xsm               14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu         13 migrate-support-check      fail never pass
 test-armhf-armhf-xl-multivcpu         14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit2           13 migrate-support-check      fail never pass
 test-armhf-armhf-xl-credit2           14 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt              13 migrate-support-check      fail never pass
 test-amd64-i386-xl-qemut-ws16-amd64   17 guest-stop                 fail never pass
 test-amd64-i386-xl-qemut-win10-i386   10 windows-install            fail never pass
 test-amd64-i386-xl-qemuu-win10-i386   10 windows-install            fail never pass
 test-amd64-amd64-xl-qemuu-win10-i386  10 windows-install            fail never pass
 test-amd64-amd64-xl-qemut-win10-i386  10 windows-install            fail never pass

version targeted for testing:
 xen                  bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d
baseline version:
 xen                  24fb44e971a62b345c7b6ca3c03b454a1e150abe

Last test of basis   114644  2017-10-17 10:49:11 Z   15 days
Failing since        114670  2017-10-18 05:03:38 Z   15 days   24 attempts
Testing same since   115314  2017-10-28 05:53:13 Z    5 days   10 attempts

------------------------------------------------------------
People who touched revisions under test:
  Andre Przywara <andre.przywara@linaro.org>
  Andrew Cooper <andrew.cooper3@citrix.com>
  Anthony PERARD <anthony.perard@citrix.com>
  Bhupinder Thakur <bhupinder.thakur@linaro.org>
  Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Chao Gao <chao.gao@intel.com>
  David Esler <drumandstrum@gmail.com>
  George Dunlap <george.dunlap@citrix.com>
  Ian Jackson <Ian.Jackson@eu.citrix.com>
  Jan Beulich <jbeulich@suse.com>
  Juergen Gross <jgross@suse.com>
  Julien Grall <julien.grall@linaro.org>
  Roger Pau Monne <roger.pau@citrix.com>
  Roger Pau Monné <roger.pau@citrix.com>
  Ross Lagerwall <ross.lagerwall@citrix.com>
  Stefano Stabellini <sstabellini@kernel.org>
  Tim Deegan <tim@xen.org>
  Wei Liu <wei.liu2@citrix.com>

jobs:
 build-amd64-xsm                                              pass
 build-armhf-xsm                                              pass
 build-i386-xsm                                               pass
 build-amd64-xtf                                              pass
 build-amd64                                                  pass
 build-armhf                                                  pass
 build-i386                                                   pass
 build-amd64-libvirt                                          pass
 build-armhf-libvirt                                          pass
 build-i386-libvirt                                           pass
 build-amd64-prev                                             pass
 build-i386-prev                                              pass
 build-amd64-pvops                                            pass
 build-armhf-pvops                                            pass
 build-i386-pvops                                             pass
 build-amd64-rumprun                                          pass
 build-i386-rumprun                                           pass
 test-xtf-amd64-amd64-1                                       pass
 test-xtf-amd64-amd64-2                                       pass
 test-xtf-amd64-amd64-3                                       pass
 test-xtf-amd64-amd64-4                                       pass
 test-xtf-amd64-amd64-5                                       pass
 test-amd64-amd64-xl                                          pass
 test-armhf-armhf-xl                                          pass
 test-amd64-i386-xl                                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm                pass
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm                 pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm           pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm            pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm                pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm                 pass
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm        pass
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm         pass
 test-amd64-amd64-libvirt-xsm                                 pass
 test-armhf-armhf-libvirt-xsm                                 pass
 test-amd64-i386-libvirt-xsm                                  pass
 test-amd64-amd64-xl-xsm                                      pass
 test-armhf-armhf-xl-xsm                                      pass
 test-amd64-i386-xl-xsm                                       pass
 test-amd64-amd64-qemuu-nested-amd                            fail
 test-amd64-amd64-xl-pvhv2-amd                                fail
 test-amd64-i386-qemut-rhel6hvm-amd                           pass
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass
 test-amd64-i386-freebsd10-amd64                              pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass
 test-amd64-amd64-rumprun-amd64                               pass
 test-amd64-amd64-xl-qemut-win7-amd64                         fail
 test-amd64-i386-xl-qemut-win7-amd64                          fail
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail
 test-amd64-i386-xl-qemuu-win7-amd64                          fail
 test-amd64-amd64-xl-qemut-ws16-amd64                         fail
 test-amd64-i386-xl-qemut-ws16-amd64                          fail
 test-amd64-amd64-xl-qemuu-ws16-amd64                         fail
 test-amd64-i386-xl-qemuu-ws16-amd64                          fail
 test-armhf-armhf-xl-arndale                                  pass
 test-amd64-amd64-xl-credit2                                  pass
 test-armhf-armhf-xl-credit2                                  pass
 test-armhf-armhf-xl-cubietruck                               pass
 test-amd64-amd64-examine                                     pass
 test-armhf-armhf-examine                                     pass
 test-amd64-i386-examine                                      pass
 test-amd64-i386-freebsd10-i386                               pass
 test-amd64-i386-rumprun-i386                                 pass
 test-amd64-amd64-xl-qemut-win10-i386                         fail
 test-amd64-i386-xl-qemut-win10-i386                          fail
 test-amd64-amd64-xl-qemuu-win10-i386                         fail
 test-amd64-i386-xl-qemuu-win10-i386                          fail
 test-amd64-amd64-qemuu-nested-intel                          pass
 test-amd64-amd64-xl-pvhv2-intel                              fail
 test-amd64-i386-qemut-rhel6hvm-intel                         pass
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass
 test-amd64-amd64-libvirt                                     pass
 test-armhf-armhf-libvirt                                     pass
 test-amd64-i386-libvirt                                      pass
 test-amd64-amd64-livepatch                                   pass
 test-amd64-i386-livepatch                                    pass
 test-amd64-amd64-migrupgrade                                 pass
 test-amd64-i386-migrupgrade                                  pass
 test-amd64-amd64-xl-multivcpu                                pass
 test-armhf-armhf-xl-multivcpu                                pass
 test-amd64-amd64-pair                                        pass
 test-amd64-i386-pair                                         pass
 test-amd64-amd64-libvirt-pair                                pass
 test-amd64-i386-libvirt-pair                                 pass
 test-amd64-amd64-amd64-pvgrub                                pass
 test-amd64-amd64-i386-pvgrub                                 pass
 test-amd64-amd64-pygrub                                      pass
 test-amd64-i386-libvirt-qcow2                                fail
 test-amd64-amd64-xl-qcow2                                    pass
 test-armhf-armhf-libvirt-raw                                 pass
 test-amd64-i386-xl-raw                                       pass
 test-amd64-amd64-xl-rtds                                     pass
 test-armhf-armhf-xl-rtds                                     pass
 test-amd64-amd64-libvirt-vhd                                 fail
 test-armhf-armhf-xl-vhd                                      pass

------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
 http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
 http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
 http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

Not pushing.

(No revision log; it would be 686 lines long.)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* Commit moratorium to staging
From: Julien Grall @ 2017-10-31 10:49 UTC
To: committers, xen-devel; +Cc: Lars Kurth, Roger Pau Monné

Hi all,

Master lags 15 days behind staging due to tests failing reliably on
some of the hardware in osstest (see [1]).

At the moment a force push is not feasible, because the same tests pass
on different hardware (see [2]).

Please avoid committing any more patches unless they fix a test failure
in osstest. The tree will be re-opened once we get a push.

Cheers,

[1] https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg03351.html
[2] https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg02932.html

--
Julien Grall
* Re: Commit moratorium to staging
From: Roger Pau Monné @ 2017-10-31 16:52 UTC
To: Julien Grall; +Cc: xen-devel, Paul Durrant, committers, Lars Kurth

On Tue, Oct 31, 2017 at 10:49:35AM +0000, Julien Grall wrote:
> Hi all,
>
> Master lags 15 days behind staging due to tests failing reliably on some of
> the hardware in osstest (see [1]).
>
> At the moment a force push is not feasible because the same tests passes on
> different hardware (see [2]).

I've been looking into this, and I'm afraid I don't yet have a cause
for those issues. I'm going to post what I've found so far; maybe
someone is able to spot something I'm missing.

Since I assumed this was somehow related to the ACPI PM1A_STS/EN blocks
(which is how the power button event gets notified to the OS), I've
added the following instrumentation to the pmtimer.c code:

diff --git a/xen/arch/x86/hvm/pmtimer.c b/xen/arch/x86/hvm/pmtimer.c
index 435647ff1e..051fc46df8 100644
--- a/xen/arch/x86/hvm/pmtimer.c
+++ b/xen/arch/x86/hvm/pmtimer.c
@@ -61,9 +61,15 @@ static void pmt_update_sci(PMTState *s)
     ASSERT(spin_is_locked(&s->lock));

     if ( acpi->pm1a_en & acpi->pm1a_sts & SCI_MASK )
+    {
+        printk("asserting SCI IRQ\n");
         hvm_isa_irq_assert(s->vcpu->domain, SCI_IRQ, NULL);
+    }
     else
+    {
+        printk("de-asserting SCI IRQ\n");
         hvm_isa_irq_deassert(s->vcpu->domain, SCI_IRQ);
+    }
 }

 void hvm_acpi_power_button(struct domain *d)
@@ -73,6 +79,7 @@ void hvm_acpi_power_button(struct domain *d)
     if ( !has_vpm(d) )
         return;

+    printk("hvm_acpi_power_button for d%d\n", d->domain_id);
     spin_lock(&s->lock);
     d->arch.hvm_domain.acpi.pm1a_sts |= PWRBTN_STS;
     pmt_update_sci(s);
@@ -86,6 +93,7 @@ void hvm_acpi_sleep_button(struct domain *d)
     if ( !has_vpm(d) )
         return;

+    printk("hvm_acpi_sleep_button for d%d\n", d->domain_id);
     spin_lock(&s->lock);
     d->arch.hvm_domain.acpi.pm1a_sts |= PWRBTN_STS;
     pmt_update_sci(s);
@@ -170,6 +178,7 @@ static int handle_evt_io(

     if ( dir == IOREQ_WRITE )
     {
+        printk("write PM1a addr: %#x val: %#x\n", addr, *val);
         /* Handle this I/O one byte at a time */
         for ( i = bytes, data = *val;
               i > 0;
@@ -197,6 +206,8 @@ static int handle_evt_io(
                          bytes, *val, port);
             }
         }
+        printk("result pm1a_sts: %#x pm1a_en: %#x\n",
+               acpi->pm1a_sts, acpi->pm1a_en);
         /* Fix up the SCI state to match the new register state */
         pmt_update_sci(s);
     }

I've then rerun the failing test, and this is what I got in the failure
case (i.e. Windows ignoring the power event):

(XEN) hvm_acpi_power_button for d14
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x1
(XEN) result pm1a_sts: 0x100 pm1a_en: 0x320
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x100
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ

Strangely enough, the second time I tried the same command
(xl shutdown -wF ...) on the same guest, it succeeded and Windows shut
down without issues; this is the log in that case:

(XEN) hvm_acpi_power_button for d14
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x1
(XEN) result pm1a_sts: 0x100 pm1a_en: 0x320
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x100
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x8000
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ

I have to admit I have no idea why Windows clears the STS power bit and
then completely ignores it on certain occasions.

I'm also afraid I have no idea how to debug Windows in order to know
why this event is acknowledged but ignored.

I've also tried to reproduce the same with a Debian guest, by doing the
same amount of save/restores and migrations, and finally issuing
xl trigger <guest> power, but Debian has always worked fine and shut
down.

Any comments are welcome.

Roger.
* Re: Commit moratorium to staging
From: Wei Liu @ 2017-11-01 10:48 UTC
To: Roger Pau Monné
Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel

On Tue, Oct 31, 2017 at 04:52:37PM +0000, Roger Pau Monné wrote:
>
> I have to admit I have no idea why Windows clears the STS power bit
> and then completely ignores it on certain occasions.
>
> I'm also afraid I have no idea how to debug Windows in order to know
> why this event is acknowledged but ignored.
>
> I've also tried to reproduce the same with a Debian guest, by doing
> the same amount of save/restores and migrations, and finally issuing a
> xl trigger <guest> power, but Debian has always worked fine and
> shut down.
>
> Any comments are welcome.

After googling around, some articles suggest Windows can ignore ACPI
events under certain circumstances. Is it worth checking the Windows
event log to see if an event is received but ignored for reason X?

For Windows Server 2012:
https://serverfault.com/questions/534042/windows-2012-how-to-make-power-button-work-in-every-cases

Can't find anything for Windows Server 2016.
* Re: Commit moratorium to staging
From: Paul Durrant @ 2017-11-01 11:00 UTC
To: Roger Pau Monne
Cc: xen-devel, Julien Grall, committers, Wei Liu, Lars Kurth

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 01 November 2017 10:48
> To: Roger Pau Monne <roger.pau@citrix.com>
> Cc: Julien Grall <julien.grall@linaro.org>; committers@xenproject.org;
> xen-devel <xen-devel@lists.xenproject.org>; Lars Kurth
> <lars.kurth@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Wei Liu
> <wei.liu2@citrix.com>
> Subject: Re: Commit moratorium to staging
>
> On Tue, Oct 31, 2017 at 04:52:37PM +0000, Roger Pau Monné wrote:
> >
> > I have to admit I have no idea why Windows clears the STS power bit
> > and then completely ignores it on certain occasions.
> >
> > I'm also afraid I have no idea how to debug Windows in order to know
> > why this event is acknowledged but ignored.
> >
> > I've also tried to reproduce the same with a Debian guest, by doing
> > the same amount of save/restores and migrations, and finally issuing a
> > xl trigger <guest> power, but Debian has always worked fine and
> > shut down.
> >
> > Any comments are welcome.
>
> After googling around, some articles suggest Windows can ignore ACPI
> events under certain circumstances. Is it worth checking in the Windows
> event log to see if an event is received but ignored for reason X?

Dumping the event logs would definitely be a useful thing to do.

>
> For Windows Server 2012:
> https://serverfault.com/questions/534042/windows-2012-how-to-make-
> power-button-work-in-every-cases
>
> Can't find anything for Windows Server 2016.

No, I couldn't either. I did find
https://ethertubes.com/unattended-acpi-shutdown-of-windows-server/ too,
which seems to have some potentially useful suggestions.

Paul
* Re: Commit moratorium to staging
From: Ian Jackson @ 2017-11-01 14:07 UTC
To: Julien Grall, Roger Pau Monne
Cc: committers, Lars Kurth, Paul Durrant, Wei Liu, xen-devel

So, investigations (mostly by Roger, and also a bit of archaeology in
the osstest db by me) have determined:

 * This bug is 100% reproducible on affected hosts. The repro is to
   boot the Windows guest, save/restore it, then migrate it, then shut
   down. (This is from an IRL conversation with Roger and may not be
   100% accurate. Roger, please correct me.)

 * Affected hosts differ from unaffected hosts according to cpuid.
   Roger has repro'd the bug on an unaffected host by masking out
   certain cpuid bits. There are 6 implicated bits and he is working
   to narrow that down.

 * It seems likely that this is therefore a real bug, maybe in Xen,
   and perhaps one that should indeed be a release blocker.

 * But this is not a regression between master and staging. It affects
   many osstest branches apparently equally.

 * This test is, effectively, new: before the osstest change
   "HostDiskRoot: bump to 20G", these jobs would always fail earlier
   and the affected step would not be run.

 * The passes we got on various osstest branches before were just
   because those branches hadn't tested on an affected host yet. As
   branches test different hosts, they will stick on affected hosts.

ISTM that this situation would therefore justify a force push. We have
established that this bug is very unlikely to be anything to do with
the commits currently blocked by the failing pushes.

Furthermore, the test is not intermittent, so a force push will be
effective in the following sense: we would only get a "spurious" pass,
resulting in the relevant osstest branch becoming stuck again, if a
future test was unlucky and got an unaffected host. That will happen
infrequently enough.

So unless anyone objects (and for xen.git#master, with Julien's
permission), I intend to force push all affected osstest branches when
the test report shows the only blockage is ws16 and/or win10 tests
failing the "guest-stop" step.

Opinions?

Ian.
* Re: Commit moratorium to staging
From: Julien Grall @ 2017-11-01 14:59 UTC
To: Ian Jackson, Roger Pau Monne
Cc: committers, Lars Kurth, Paul Durrant, Wei Liu, xen-devel

Hi Ian,

Thank you for the detailed e-mail.

On 11/01/2017 02:07 PM, Ian Jackson wrote:
> Furthermore, the test is not intermittent, so a force push will be
> effective in the following sense: we would only get a "spurious" pass,
> resulting in the relevant osstest branch becoming stuck again, if a
> future test was unlucky and got an unaffected host. That will happen
> infrequently enough.

I am not entirely sure I understand this paragraph. Are you saying
that osstest will not get stuck if we get a "spurious" pass on some
hardware in the future? Or will we need another force push?

> So unless anyone objects (and for xen.git#master, with Julien's
> permission), I intend to force push all affected osstest branches when
> the test report shows the only blockage is ws16 and/or win10 tests
> failing the "guest-stop" step.

This is not only blocking xen.git#master but also blocking other trees:
 - linux-linus
 - linux-4.9

Cheers,

--
Julien Grall
* Re: Commit moratorium to staging
From: Ian Jackson @ 2017-11-01 16:54 UTC
To: Julien Grall
Cc: Lars Kurth, Wei Liu, Paul Durrant, committers, xen-devel, Roger Pau Monne

Julien Grall writes ("Re: Commit moratorium to staging"):
> Hi Ian,
>
> Thank you for the detailed e-mail.
>
> On 11/01/2017 02:07 PM, Ian Jackson wrote:
> > Furthermore, the test is not intermittent, so a force push will be
> > effective in the following sense: we would only get a "spurious" pass,
> > resulting in the relevant osstest branch becoming stuck again, if a
> > future test was unlucky and got an unaffected host. That will happen
> > infrequently enough.
...
> I am not entirely sure to understand this paragraph. Are you saying that
> osstest will not get stuck if we get a "spurious" pass on some hardware
> in the future? Or will we need another force push?

osstest *would* get stuck *if* we got such a spurious push. However,
because osstest likes to retest failing tests on the same host as they
failed on previously, such spurious passes are fairly unlikely.

I say "likes to". The allocation system uses a set of heuristics to
calculate a score for each possible host. The score takes into account
both when the host will be available to this job, and information like
"did the most recent run of this test, on this host, pass or fail". So
I can't make guarantees, but the amount of manual work to force push
stuck branches will be tolerable.

Ian.
* Re: Commit moratorium to staging
From: Julien Grall @ 2017-11-01 17:00 UTC
To: Ian Jackson
Cc: Lars Kurth, Wei Liu, Paul Durrant, committers, xen-devel, Roger Pau Monne

Hi Ian,

On 11/01/2017 04:54 PM, Ian Jackson wrote:
> Julien Grall writes ("Re: Commit moratorium to staging"):
> > I am not entirely sure to understand this paragraph. Are you saying that
> > osstest will not get stuck if we get a "spurious" pass on some hardware
> > in the future? Or will we need another force push?
>
> osstest *would* get stuck *if* we got such a spurious push. However,
> because osstest likes to retest failing tests on the same host as they
> failed on previously, such spurious passes are fairly unlikely.
>
> I say "likes to". The allocation system uses a set of heuristics to
> calculate a score for each possible host. The score takes into
> account both when the host will be available to this job, and
> information like "did the most recent run of this test, on this host,
> pass or fail". So I can't make guarantees but the amount of manual
> work to force push stuck branches will be tolerable.

Thank you for the explanation. I agree with the force push to unblock
master (and the other trees I mentioned). However, it would still be
nice to find the root cause of this bug and fix it.

Cheers,

--
Julien Grall
* Re: Commit moratorium to staging [and 1 more messages]
From: Ian Jackson @ 2017-11-02 13:27 UTC
To: Julien Grall
Cc: Lars Kurth, xen-devel, Wei Liu, Paul Durrant, committers, Roger Pau Monne

Julien Grall writes ("Re: Commit moratorium to staging"):
> Thank you for the explanation. I agree with the force push to unblock
> master (and other tree I mentioned).

I will force push all the affected trees, but in a reactive way,
because I base each force push on a test report; so it won't be right
away for all of them.

osstest service owner writes ("[xen-unstable test] 115471: regressions - FAIL"):
> flight 115471 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/115471/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
> test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop   fail REGR. vs. 114644
> test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop  fail REGR. vs. 114644
> test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop  fail REGR. vs. 114644

The above are justifiable as discussed, leaving no blockers.

> version targeted for testing:
> xen bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d

So I have force pushed that.

Ian.
* Re: Commit moratorium to staging [and 1 more messages]
From: Julien Grall @ 2017-11-02 13:33 UTC
To: Ian Jackson
Cc: Lars Kurth, xen-devel, Wei Liu, Paul Durrant, committers, Roger Pau Monne

Hi Ian,

On 02/11/17 13:27, Ian Jackson wrote:
> Julien Grall writes ("Re: Commit moratorium to staging"):
> > Thank you for the explanation. I agree with the force push to unblock
> > master (and other tree I mentioned).
>
> I will force push all the affected trees, but in a reactive way
> because I base each force push on a test report - so it won't be right
> away for all of them.
>
> osstest service owner writes ("[xen-unstable test] 115471: regressions - FAIL"):
> > flight 115471 xen-unstable real [real]
> > http://logs.test-lab.xenproject.org/osstest/logs/115471/
> >
> > Regressions :-(
> >
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> > test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop   fail REGR. vs. 114644
> > test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop  fail REGR. vs. 114644
> > test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop  fail REGR. vs. 114644
>
> The above are justifiable as discussed, leaving no blockers.
>
> > version targeted for testing:
> > xen bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d
>
> So I have forced pushed that.

Thank you! With that, the tree is re-opened. I will go through my
backlog of Xen 4.10 patches and check whether they are suitable.

Cheers,

--
Julien Grall
* Re: Commit moratorium to staging 2017-11-01 14:07 ` Ian Jackson 2017-11-01 14:59 ` Julien Grall @ 2017-11-01 16:17 ` Roger Pau Monné 2017-11-02 9:15 ` Roger Pau Monné 2017-11-02 11:19 ` George Dunlap 2 siblings, 1 reply; 27+ messages in thread From: Roger Pau Monné @ 2017-11-01 16:17 UTC (permalink / raw) To: Ian Jackson Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: > So, investigations (mostly by Roger, and also a bit of archaeology in > the osstest db by me) have determined: > > * This bug is 100% reproducible on affected hosts. The repro is > to boot the Windows guest, save/restore it, then migrate it, > then shut down. (This is from an IRL conversation with Roger and > may not be 100% accurate. Roger, please correct me.) Yes, that's correct AFAICT. The affected hosts works fine if windows is booted and then shut down (without save/restore or migrations involved). > * Affected hosts differ from unaffected hosts according to cpuid. > Roger has repro'd the bug on an unaffected host by masking out > certain cpuid bits. There are 6 implicated bits and he is working > to narrow that down. I'm currently trying to narrow this down and make sure the above is accurate. > * It seems likely that this is therefore a real bug. Maybe in Xen and > perhaps indeed one that should indeed be a release blocker. > > * But this is not a regresson between master and staging. It affects > many osstest branches apparently equally. > > * This test is, effectively, new: before the osstest change > "HostDiskRoot: bump to 20G", these jobs would always fail earlier > and the affected step would not be run. > > * The passes we got on various osstest branches before were just > because those branches hadn't tested on an affected host yet. As > branches test different hosts, they will stick on affected hosts. > > ISTM that this situation would therefore justify a force push. 
We > have established that this bug is very unlikely to be anything to do > with the commits currently blocked by the failing pushes. I agree, this is a bug that's always been present (at least in the tested branches). It's triggered now because the Windows tests have made further progress. > Furthermore, the test is not intermittent, so a force push will be > effective in the following sense: we would only get a "spurious" pass, > resulting in the relevant osstest branch becoming stuck again, if a > future test was unlucky and got an unaffected host. That will happen > infrequently enough. > > So unless anyone objects (and for xen.git#master, with Julien's > permission), I intend to force push all affected osstest branches when > the test report shows the only blockage is ws16 and/or win10 tests > failing the "guest-stop" step. > > Opinions ? I agree that a force push is justified. This bug is going to be quite annoying if osstest decides to test on non-affected hosts, because then we will get sporadic success flights. Thanks, Roger.
* Re: Commit moratorium to staging 2017-11-01 16:17 ` Commit moratorium to staging Roger Pau Monné @ 2017-11-02 9:15 ` Roger Pau Monné 2017-11-02 9:20 ` Paul Durrant 0 siblings, 1 reply; 27+ messages in thread From: Roger Pau Monné @ 2017-11-02 9:15 UTC (permalink / raw) To: Roger Pau Monné Cc: Lars Kurth, Wei Liu, Julien Grall, Ian Jackson, Paul Durrant, committers, xen-devel On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote: > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: > > * Affected hosts differ from unaffected hosts according to cpuid. > > Roger has repro'd the bug on an unaffected host by masking out > > certain cpuid bits. There are 6 implicated bits and he is working > > to narrow that down. > > I'm currently trying to narrow this down and make sure the above is > accurate. So I was wrong about this; I guess I've run the tests on the wrong host. Even when masking the different cpuid bits in the guest, the tests still succeed. AFAICT the tests fail or succeed reliably depending on the host hardware. I don't really have many ideas about what to do next, but I think it would be useful to create a manual osstest flight that runs the ws16 job on all the different hosts in the colo. I would also capture the normal information that Xen collects after each test (xl info, /proc/cpuinfo, serial logs...). Is there anything else not captured by ts-logs-capture that would be interesting in order to help debug the issue? Regards, Roger.
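The extra capture Roger suggests could be sketched roughly as follows. The output directory and file names are placeholders (not osstest's real layout), serial logs are host-specific and left out, and `XL=echo` makes it a dry run that still produces the files; use `XL=xl` in dom0 on a real test host.

```shell
# Hypothetical per-host capture script; OUT and the file names are
# illustrative only. XL=echo dry-runs the xl calls.
XL="${XL:-echo}"
OUT="${OUT:-/tmp/host-capture}"

mkdir -p "$OUT"
$XL info > "$OUT/xl-info.txt"            # host/hypervisor summary
$XL dmesg > "$OUT/xl-dmesg.txt"          # Xen console ring buffer
cat /proc/cpuinfo > "$OUT/cpuinfo.txt"   # dom0's view of the CPU
echo "captured $(ls "$OUT" | wc -l) files in $OUT"
```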
* Re: Commit moratorium to staging 2017-11-02 9:15 ` Roger Pau Monné @ 2017-11-02 9:20 ` Paul Durrant 2017-11-02 9:42 ` Roger Pau Monné 0 siblings, 1 reply; 27+ messages in thread From: Paul Durrant @ 2017-11-02 9:20 UTC (permalink / raw) To: Roger Pau Monne Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson > -----Original Message----- > From: Roger Pau Monne > Sent: 02 November 2017 09:15 > To: Roger Pau Monne <roger.pau@citrix.com> > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>; > committers@xenproject.org; xen-devel <xen-devel@lists.xenproject.org> > Subject: Re: [Xen-devel] Commit moratorium to staging > > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote: > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: > > > * Affected hosts differ from unaffected hosts according to cpuid. > > > Roger has repro'd the bug on an unaffected host by masking out > > > certain cpuid bits. There are 6 implicated bits and he is working > > > to narrow that down. > > > > I'm currently trying to narrow this down and make sure the above is > > accurate. > > So I was wrong with this, I guess I've run the tests on the wrong > host. Even when masking the different cpuid bits in the guest the > tests still succeeds. > > AFAICT the test fail or succeed reliably depending on the host > hardware. I don't really have many ideas about what to do next, but I > think it would be useful to create a manual osstest flight that runs > the win16 job in all the different hosts in the colo. I would also > capture the normal information that Xen collects after each test (xl > info, /proc/cpuid, serial logs...). > > Is there anything else not captured by ts-logs-capture that would be > interesting in order to help debug the issue? 
Does the shutdown reliably complete prior to migrate and then only fail intermittently after a localhost migrate? It might be useful to know what cpuid info is seen by the guest before and after migrate. Another datapoint... does the shutdown fail if you insert a delay of a couple of minutes between the migrate and the shutdown? Paul > > Regards, Roger. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Commit moratorium to staging 2017-11-02 9:20 ` Paul Durrant @ 2017-11-02 9:42 ` Roger Pau Monné 2017-11-02 9:55 ` Paul Durrant 2017-11-02 12:05 ` Ian Jackson 0 siblings, 2 replies; 27+ messages in thread From: Roger Pau Monné @ 2017-11-02 9:42 UTC (permalink / raw) To: Paul Durrant Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote: > > -----Original Message----- > > From: Roger Pau Monne > > Sent: 02 November 2017 09:15 > > To: Roger Pau Monne <roger.pau@citrix.com> > > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth > > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall > > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>; > > committers@xenproject.org; xen-devel <xen-devel@lists.xenproject.org> > > Subject: Re: [Xen-devel] Commit moratorium to staging > > > > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote: > > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: > > > > * Affected hosts differ from unaffected hosts according to cpuid. > > > > Roger has repro'd the bug on an unaffected host by masking out > > > > certain cpuid bits. There are 6 implicated bits and he is working > > > > to narrow that down. > > > > > > I'm currently trying to narrow this down and make sure the above is > > > accurate. > > > > So I was wrong with this, I guess I've run the tests on the wrong > > host. Even when masking the different cpuid bits in the guest the > > tests still succeeds. > > > > AFAICT the test fail or succeed reliably depending on the host > > hardware. I don't really have many ideas about what to do next, but I > > think it would be useful to create a manual osstest flight that runs > > the win16 job in all the different hosts in the colo. I would also > > capture the normal information that Xen collects after each test (xl > > info, /proc/cpuid, serial logs...). 
> > > > Is there anything else not captured by ts-logs-capture that would be > > interesting in order to help debug the issue? > > Does the shutdown reliably complete prior to migrate and then only fail intermittently after a localhost migrate? AFAICT yes, but it can also be added to the test in order to be sure. > It might be useful to know what cpuid info is seen by the guest before and after migrate. Is there any way to get that from Windows in an automatic way? If not I could test that with a Debian guest. In fact it might even be a good thing for Linux-based guests to be added to the regular migration tests in order to make sure cpuid bits don't change across migrations. > Another datapoint... does the shutdown fail if you insert a delay of a couple of minutes between the migrate and the shutdown? Sometimes, after a variable number of calls to xl shutdown ... the guest usually ends up shutting down. Roger.
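One way to answer Paul's question from inside a Linux guest is to snapshot what the CPUID instruction reports before and after the migration and diff the two. This sketch assumes the `cpuid` utility (Debian package `cpuid`); the `/proc/cpuinfo` fallback only reflects what the kernel detected at boot, so it is a much weaker check.

```shell
# Snapshot the CPU features visible to the guest. "cpuid -r" re-executes
# the CPUID instruction now; the /proc/cpuinfo fallback is boot-time data.
snapshot() {
    if command -v cpuid >/dev/null 2>&1; then
        cpuid -r > "$1"
    else
        grep '^flags' /proc/cpuinfo | sort -u > "$1"
    fi
}

snapshot /tmp/cpuid-before
# ... the host performs "xl migrate <guest> localhost" at this point ...
snapshot /tmp/cpuid-after

if diff -u /tmp/cpuid-before /tmp/cpuid-after; then
    echo "no CPUID change visible to the guest"
fi
```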
* Re: Commit moratorium to staging 2017-11-02 9:42 ` Roger Pau Monné @ 2017-11-02 9:55 ` Paul Durrant 2017-11-03 14:14 ` Roger Pau Monné 2017-11-02 12:05 ` Ian Jackson 1 sibling, 1 reply; 27+ messages in thread From: Paul Durrant @ 2017-11-02 9:55 UTC (permalink / raw) To: Roger Pau Monne Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson > -----Original Message----- > From: Roger Pau Monne > Sent: 02 November 2017 09:42 > To: Paul Durrant <Paul.Durrant@citrix.com> > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall > <julien.grall@linaro.org>; committers@xenproject.org; xen-devel <xen- > devel@lists.xenproject.org> > Subject: Re: [Xen-devel] Commit moratorium to staging > > On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote: > > > -----Original Message----- > > > From: Roger Pau Monne > > > Sent: 02 November 2017 09:15 > > > To: Roger Pau Monne <roger.pau@citrix.com> > > > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth > > > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall > > > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>; > > > committers@xenproject.org; xen-devel <xen- > devel@lists.xenproject.org> > > > Subject: Re: [Xen-devel] Commit moratorium to staging > > > > > > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote: > > > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: > > > > > * Affected hosts differ from unaffected hosts according to cpuid. > > > > > Roger has repro'd the bug on an unaffected host by masking out > > > > > certain cpuid bits. There are 6 implicated bits and he is working > > > > > to narrow that down. > > > > > > > > I'm currently trying to narrow this down and make sure the above is > > > > accurate. > > > > > > So I was wrong with this, I guess I've run the tests on the wrong > > > host. 
Even when masking the different cpuid bits in the guest the > > > tests still succeeds. > > > > > > AFAICT the test fail or succeed reliably depending on the host > > > hardware. I don't really have many ideas about what to do next, but I > > > think it would be useful to create a manual osstest flight that runs > > > the win16 job in all the different hosts in the colo. I would also > > > capture the normal information that Xen collects after each test (xl > > > info, /proc/cpuid, serial logs...). > > > > > > Is there anything else not captured by ts-logs-capture that would be > > > interesting in order to help debug the issue? > > > > Does the shutdown reliably complete prior to migrate and then only fail > > intermittently after a localhost migrate? > > > > AFAICT yes, but it can also be added to the test in order to be sure. > > > > > It might be useful to know what cpuid info is seen by the guest before and > > after migrate. > > > > Is there anyway to get that from windows in an automatic way? If not I > > could test that with a Debian guest. In fact it might even be a good > > thing for Linux based guest to be added to the regular migration tests > > in order to make sure cpuid bits don't change across migrations. > > I found this for Windows: https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe It can generate a text or HTML report as well as being run interactively. But you may get more mileage from using a Debian HVM guest. I guess it may also be useful if we can get a scan of available MSRs and content before and after migrate too. > > Another datapoint... does the shutdown fail if you insert a delay of a couple > of minutes between the migrate and the shutdown? > > Sometimes, after a variable number of calls to xl shutdown ... the > guest usually ends up shutting down. > Hmm. I wonder whether the guest is actually healthy after the migrate.
One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can. In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad). Paul > Roger. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Commit moratorium to staging 2017-11-02 9:55 ` Paul Durrant @ 2017-11-03 14:14 ` Roger Pau Monné 2017-11-03 14:52 ` George Dunlap 0 siblings, 1 reply; 27+ messages in thread From: Roger Pau Monné @ 2017-11-03 14:14 UTC (permalink / raw) To: Paul Durrant Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote: > > -----Original Message----- > > From: Roger Pau Monne > > Sent: 02 November 2017 09:42 > > To: Paul Durrant <Paul.Durrant@citrix.com> > > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth > > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall > > <julien.grall@linaro.org>; committers@xenproject.org; xen-devel <xen- > > devel@lists.xenproject.org> > > Subject: Re: [Xen-devel] Commit moratorium to staging > > > > On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote: > > > > -----Original Message----- > > > > From: Roger Pau Monne > > > > Sent: 02 November 2017 09:15 > > > > To: Roger Pau Monne <roger.pau@citrix.com> > > > > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth > > > > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall > > > > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>; > > > > committers@xenproject.org; xen-devel <xen- > > devel@lists.xenproject.org> > > > > Subject: Re: [Xen-devel] Commit moratorium to staging > > > > > > > > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote: > > > > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: > > > > > > * Affected hosts differ from unaffected hosts according to cpuid. > > > > > > Roger has repro'd the bug on an unaffected host by masking out > > > > > > certain cpuid bits. There are 6 implicated bits and he is working > > > > > > to narrow that down. > > > > > > > > > > I'm currently trying to narrow this down and make sure the above is > > > > > accurate. 
> > > > > > > > So I was wrong with this, I guess I've run the tests on the wrong > > > > host. Even when masking the different cpuid bits in the guest the > > > > tests still succeeds. > > > > > > > > AFAICT the test fail or succeed reliably depending on the host > > > > hardware. I don't really have many ideas about what to do next, but I > > > > think it would be useful to create a manual osstest flight that runs > > > > the win16 job in all the different hosts in the colo. I would also > > > > capture the normal information that Xen collects after each test (xl > > > > info, /proc/cpuid, serial logs...). > > > > > > > > Is there anything else not captured by ts-logs-capture that would be > > > > interesting in order to help debug the issue? > > > > > > Does the shutdown reliably complete prior to migrate and then only fail > > intermittently after a localhost migrate? > > > > AFAICT yes, but it can also be added to the test in order to be sure. > > > > > It might be useful to know what cpuid info is seen by the guest before and > > after migrate. > > > > Is there anyway to get that from windows in an automatic way? If not I > > could test that with a Debian guest. In fact it might even be a good > > thing for Linux based guest to be added to the regular migration tests > > in order to make sure cpuid bits don't change across migrations. > > > > I found this for windows: > > https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe > > It can generate a text or html report as well as being run interactively. But you may get more mileage from using a debian HVM guest. I guess it may also be useful is we can get a scan of available MSRs and content before and after migrate too. > > > > Another datapoint... does the shutdown fail if you insert a delay of a couple > > of minutes between the migrate and the shutdown? > > > > Sometimes, after a variable number of calls to xl shutdown ... the > > guest usually ends up shutting down. > > > > Hmm. 
I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can. > In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad). Since I've been looking into this for a couple of days and found no solution, I'm going to write down what I've found so far: - The issue only affects Windows guests. - It only manifests itself when doing live migration; non-live migration and save/restore work fine. - It affects all x86 hardware; the number of migrations needed to trigger it seems to depend on the hardware, but doing 20 migrations reliably triggers it on all the hardware I've tested. - After a variable number of `xl shutdown -wF ...` calls the guest will eventually acknowledge the event and shut down. Roger.
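Roger's findings boil down to a short reproducer, sketched here as a dry run: the domain name is a placeholder, and `XL=echo` prints each command instead of executing it (set `XL=xl` on a real host with a Windows guest).

```shell
# 20 localhost live migrations, then repeated shutdown attempts.
XL="${XL:-echo}"
GUEST="${GUEST:-ws16-guest}"   # placeholder domain name

m=0
while [ "$m" -lt 20 ]; do
    $XL migrate "$GUEST" localhost
    m=$((m + 1))
done

# -w waits for the domain to die; -F falls back to an ACPI power event
# if the guest has no PV shutdown control (Windows without PV drivers).
tries=0
until $XL shutdown -wF "$GUEST"; do
    tries=$((tries + 1))
done
echo "shutdown acknowledged after $tries retries ($m migrations)"
```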
* Re: Commit moratorium to staging 2017-11-03 14:14 ` Roger Pau Monné @ 2017-11-03 14:52 ` George Dunlap 2017-11-03 17:57 ` George Dunlap 0 siblings, 1 reply; 27+ messages in thread From: George Dunlap @ 2017-11-03 14:52 UTC (permalink / raw) To: Roger Pau Monné, Paul Durrant Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson On 11/03/2017 02:14 PM, Roger Pau Monné wrote: > On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote: >>> -----Original Message----- >>> From: Roger Pau Monne >>> Sent: 02 November 2017 09:42 >>> To: Paul Durrant <Paul.Durrant@citrix.com> >>> Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth >>> <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall >>> <julien.grall@linaro.org>; committers@xenproject.org; xen-devel <xen- >>> devel@lists.xenproject.org> >>> Subject: Re: [Xen-devel] Commit moratorium to staging >>> >>> On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote: >>>>> -----Original Message----- >>>>> From: Roger Pau Monne >>>>> Sent: 02 November 2017 09:15 >>>>> To: Roger Pau Monne <roger.pau@citrix.com> >>>>> Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth >>>>> <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall >>>>> <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>; >>>>> committers@xenproject.org; xen-devel <xen- >>> devel@lists.xenproject.org> >>>>> Subject: Re: [Xen-devel] Commit moratorium to staging >>>>> >>>>> On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote: >>>>>> On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote: >>>>>>> * Affected hosts differ from unaffected hosts according to cpuid. >>>>>>> Roger has repro'd the bug on an unaffected host by masking out >>>>>>> certain cpuid bits. There are 6 implicated bits and he is working >>>>>>> to narrow that down. >>>>>> >>>>>> I'm currently trying to narrow this down and make sure the above is >>>>>> accurate. 
>>>>> >>>>> So I was wrong with this, I guess I've run the tests on the wrong >>>>> host. Even when masking the different cpuid bits in the guest the >>>>> tests still succeeds. >>>>> >>>>> AFAICT the test fail or succeed reliably depending on the host >>>>> hardware. I don't really have many ideas about what to do next, but I >>>>> think it would be useful to create a manual osstest flight that runs >>>>> the win16 job in all the different hosts in the colo. I would also >>>>> capture the normal information that Xen collects after each test (xl >>>>> info, /proc/cpuid, serial logs...). >>>>> >>>>> Is there anything else not captured by ts-logs-capture that would be >>>>> interesting in order to help debug the issue? >>>> >>>> Does the shutdown reliably complete prior to migrate and then only fail >>> intermittently after a localhost migrate? >>> >>> AFAICT yes, but it can also be added to the test in order to be sure. >>> >>>> It might be useful to know what cpuid info is seen by the guest before and >>> after migrate. >>> >>> Is there anyway to get that from windows in an automatic way? If not I >>> could test that with a Debian guest. In fact it might even be a good >>> thing for Linux based guest to be added to the regular migration tests >>> in order to make sure cpuid bits don't change across migrations. >>> >> >> I found this for windows: >> >> https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe >> >> It can generate a text or html report as well as being run interactively. But you may get more mileage from using a debian HVM guest. I guess it may also be useful is we can get a scan of available MSRs and content before and after migrate too. >> >>>> Another datapoint... does the shutdown fail if you insert a delay of a couple >>> of minutes between the migrate and the shutdown? >>> >>> Sometimes, after a variable number of calls to xl shutdown ... the >>> guest usually ends up shutting down. >>> >> >> Hmm. 
I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can. >> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad). > > Since I've been looking into this for a couple of days, and found no > solution I'm going to write what I've found so far: > > - The issue only affects Windows guests. > - It only manifests itself when doing live migration, non-live > migration or save/resume work fine. > - It affects all x86 hardware, the amount of migrations in order to > trigger it seems to depend on the hardware, but doing 20 migrations > reliably triggers it on all the hardware I've tested. Not good. You said that Windows reported that the login process failed somehow? Is it possible something bad is happening, like sending spurious page faults to the guest in logdirty mode? I wonder if we could reproduce something like it on Linux -- set a build going and start localhost migrating; a spurious page fault is likely to cause the build to fail. -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Commit moratorium to staging 2017-11-03 14:52 ` George Dunlap @ 2017-11-03 17:57 ` George Dunlap 2017-11-03 18:29 ` Roger Pau Monné 2017-11-03 18:47 ` Ian Jackson 0 siblings, 2 replies; 27+ messages in thread From: George Dunlap @ 2017-11-03 17:57 UTC (permalink / raw) To: Roger Pau Monné, Paul Durrant Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson On 11/03/2017 02:52 PM, George Dunlap wrote: > On 11/03/2017 02:14 PM, Roger Pau Monné wrote: >> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote: >>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can. >>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad). >> >> Since I've been looking into this for a couple of days, and found no >> solution I'm going to write what I've found so far: >> >> - The issue only affects Windows guests. >> - It only manifests itself when doing live migration, non-live >> migration or save/resume work fine. >> - It affects all x86 hardware, the amount of migrations in order to >> trigger it seems to depend on the hardware, but doing 20 migrations >> reliably triggers it on all the hardware I've tested. > > Not good. > > You said that Windows reported that the login process failed somehow? > > Is it possible something bad is happening, like sending spurious page > faults to the guest in logdirty mode? 
> > I wonder if we could reproduce something like it on Linux -- set a build > going and start localhost migrating; a spurious page fault is likely to > cause the build to fail. Well, with a looping xen-build going on in the guest, I've done 40 local migrates with no problems yet. But Roger -- is this on emulated devices only, no PV drivers? That might be something worth looking at. -George _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Commit moratorium to staging 2017-11-03 17:57 ` George Dunlap @ 2017-11-03 18:29 ` Roger Pau Monné 2017-11-03 18:35 ` Juergen Gross 2017-11-03 18:47 ` Ian Jackson 1 sibling, 1 reply; 27+ messages in thread From: Roger Pau Monné @ 2017-11-03 18:29 UTC (permalink / raw) To: George Dunlap Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel, Ian Jackson On Fri, Nov 03, 2017 at 05:57:52PM +0000, George Dunlap wrote: > On 11/03/2017 02:52 PM, George Dunlap wrote: > > On 11/03/2017 02:14 PM, Roger Pau Monné wrote: > >> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote: > >>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can. > >>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad). > >> > >> Since I've been looking into this for a couple of days, and found no > >> solution I'm going to write what I've found so far: > >> > >> - The issue only affects Windows guests. > >> - It only manifests itself when doing live migration, non-live > >> migration or save/resume work fine. > >> - It affects all x86 hardware, the amount of migrations in order to > >> trigger it seems to depend on the hardware, but doing 20 migrations > >> reliably triggers it on all the hardware I've tested. > > > > Not good. > > > > You said that Windows reported that the login process failed somehow? 
> > > > Is it possible something bad is happening, like sending spurious page > > faults to the guest in logdirty mode? > > > > I wonder if we could reproduce something like it on Linux -- set a build > > going and start localhost migrating; a spurious page fault is likely to > > cause the build to fail. > > Well, with a looping xen-build going on in the guest, I've done 40 local > migrates with no problems yet. > > But Roger -- is this on emulated devices only, no PV drivers? > > That might be something worth looking at. Yes, Windows doesn't have PV drivers. But save/restore and non-live migration seem fine, so it doesn't look to be related to devices, but rather to log-dirty or some other aspect of live migration. Or maybe it's something indeed related to emulated devices that's more easily triggered by live migration. I'm also thinking it would be helpful to do x20 save/restore, shutdown, create, x20 migrations and shutdown. That would help us identify problems related to save/restore and live migration more easily. Roger.
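The sequence Roger proposes could be scripted along these lines. Domain name, config path, and save-file path are all placeholders, and `XL=echo` keeps it a dry run that merely prints the commands.

```shell
# x20 save/restore, then shutdown + create, then x20 localhost migrations.
XL="${XL:-echo}"
GUEST="${GUEST:-win-guest}"            # placeholder domain name
CFG="${CFG:-/etc/xen/win-guest.cfg}"   # placeholder config path
STATE="${STATE:-/tmp/guest.save}"

saves=0
while [ "$saves" -lt 20 ]; do
    $XL save "$GUEST" "$STATE" && $XL restore "$STATE"
    saves=$((saves + 1))
done
$XL shutdown -wF "$GUEST"
$XL create "$CFG"

migrations=0
while [ "$migrations" -lt 20 ]; do
    $XL migrate "$GUEST" localhost
    migrations=$((migrations + 1))
done
$XL shutdown -wF "$GUEST"
echo "$saves save/restore cycles, $migrations migrations"
```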
* Re: Commit moratorium to staging 2017-11-03 18:29 ` Roger Pau Monné @ 2017-11-03 18:35 ` Juergen Gross 2017-11-06 18:25 ` George Dunlap 0 siblings, 1 reply; 27+ messages in thread From: Juergen Gross @ 2017-11-03 18:35 UTC (permalink / raw) To: Roger Pau Monné, George Dunlap Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, Ian Jackson, xen-devel On 03/11/17 19:29, Roger Pau Monné wrote: > On Fri, Nov 03, 2017 at 05:57:52PM +0000, George Dunlap wrote: >> On 11/03/2017 02:52 PM, George Dunlap wrote: >>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote: >>>> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote: >>>>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can. >>>>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad). >>>> >>>> Since I've been looking into this for a couple of days, and found no >>>> solution I'm going to write what I've found so far: >>>> >>>> - The issue only affects Windows guests. >>>> - It only manifests itself when doing live migration, non-live >>>> migration or save/resume work fine. >>>> - It affects all x86 hardware, the amount of migrations in order to >>>> trigger it seems to depend on the hardware, but doing 20 migrations >>>> reliably triggers it on all the hardware I've tested. >>> >>> Not good. >>> >>> You said that Windows reported that the login process failed somehow? 
>>> >>> Is it possible something bad is happening, like sending spurious page >>> faults to the guest in logdirty mode? >>> >>> I wonder if we could reproduce something like it on Linux -- set a build >>> going and start localhost migrating; a spurious page fault is likely to >>> cause the build to fail. >> >> Well, with a looping xen-build going on in the guest, I've done 40 local >> migrates with no problems yet. >> >> But Roger -- is this on emulated devices only, no PV drivers? >> >> That might be something worth looking at. > > Yes, windows doesn't have PV devices. But save/restore and non-live > migration seems fine, so it doesn't look to be related to devices, but > rather to log-dirty or some other aspect of live-migration. log-dirty for read-I/Os of emulated devices? Juergen _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Commit moratorium to staging
  2017-11-03 18:35 ` Juergen Gross
@ 2017-11-06 18:25 ` George Dunlap
  0 siblings, 0 replies; 27+ messages in thread
From: George Dunlap @ 2017-11-06 18:25 UTC (permalink / raw)
To: Juergen Gross, Roger Pau Monné
Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers,
    Ian Jackson, xen-devel

On 11/03/2017 06:35 PM, Juergen Gross wrote:
> On 03/11/17 19:29, Roger Pau Monné wrote:
>> On Fri, Nov 03, 2017 at 05:57:52PM +0000, George Dunlap wrote:
>>> On 11/03/2017 02:52 PM, George Dunlap wrote:
>>>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>>>>> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
>>>>>> Hmm. I wonder whether the guest is actually healthy after the
>>>>>> migrate. One could imagine a situation where the storage device
>>>>>> model (IDE in our case, I guess) gets stuck in some way but
>>>>>> recovers after a timeout in the guest storage stack. Thus, if you
>>>>>> happen to try to shut down while it is still stuck, Windows
>>>>>> starts trying to shut down but can't; try after the timeout,
>>>>>> though, and it can.
>>>>>> In the past we made attempts to support Windows without PV
>>>>>> drivers in XenServer, but xenrt would never reliably pass VM
>>>>>> lifecycle tests using emulated devices. That was with qemu-trad,
>>>>>> but I wonder whether upstream qemu is actually any better,
>>>>>> particularly when using older device models such as IDE and
>>>>>> RTL8139 (which are probably largely unmodified from trad).
>>>>>
>>>>> Since I've been looking into this for a couple of days and found
>>>>> no solution, I'm going to write up what I've found so far:
>>>>>
>>>>> - The issue only affects Windows guests.
>>>>> - It only manifests itself when doing live migration; non-live
>>>>>   migration and save/resume work fine.
>>>>> - It affects all x86 hardware. The number of migrations needed to
>>>>>   trigger it seems to depend on the hardware, but doing 20
>>>>>   migrations reliably triggers it on all the hardware I've tested.
>>>>
>>>> Not good.
>>>>
>>>> You said that Windows reported that the login process failed somehow?
>>>>
>>>> Is it possible something bad is happening, like sending spurious page
>>>> faults to the guest in logdirty mode?
>>>>
>>>> I wonder if we could reproduce something like it on Linux -- set a build
>>>> going and start localhost migrating; a spurious page fault is likely to
>>>> cause the build to fail.
>>>
>>> Well, with a looping xen-build going on in the guest, I've done 40 local
>>> migrates with no problems yet.
>>>
>>> But Roger -- is this on emulated devices only, no PV drivers?
>>>
>>> That might be something worth looking at.
>>
>> Yes, Windows doesn't have PV devices. But save/restore and non-live
>> migration seem fine, so it doesn't look to be related to devices, but
>> rather to log-dirty or some other aspect of live migration.
>
> log-dirty for read I/Os of emulated devices?

FWIW I booted a Linux guest with "xen_nopv" on the command line, gave it
256 MiB of RAM, and then ran a Xen build on it in a loop (see command
below). Then I started migrating it in a loop. After an hour or two it
had done 146 local migrations and 46 builds of Xen (swapping onto an
emulated disk is pretty slow), without any issues.

Build command:
# while make -j 3 xen ; do git clean -ffdx ; done

I'm shutting down the VM and I'll leave it running overnight.

-George
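George's test guest — a Linux HVM domain with 256 MiB of RAM and `xen_nopv` on the kernel command line, so that only emulated devices are used — would correspond roughly to an xl domain config like the sketch below. The name, disk image path, bridge, and device model are assumptions for illustration, not the config George actually used.

```shell
#!/bin/sh
# Hypothetical xl guest config for a PV-driver-free Linux HVM guest.
# The paths and names are illustrative only; "builder" may be spelled
# "type" on newer xl versions.
cat > /tmp/linux-nopv.cfg <<'EOF'
builder = "hvm"
name    = "linux-nopv"
memory  = 256
vcpus   = 3
# Emulated IDE disk and RTL8139 NIC, the older device models Paul mentions.
disk = [ "file:/var/lib/xen/images/linux-nopv.img,hda,w" ]
vif  = [ "bridge=xenbr0,model=rtl8139" ]
# The guest kernel is booted with "xen_nopv" appended to its command
# line, which makes Linux ignore the Xen PV optimizations.
EOF
echo "wrote /tmp/linux-nopv.cfg"
```

The guest would then be started with something like `xl create /tmp/linux-nopv.cfg` before running the build loop inside it.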
* Re: Commit moratorium to staging
  2017-11-03 17:57 ` George Dunlap
  2017-11-03 18:29 ` Roger Pau Monné
@ 2017-11-03 18:47 ` Ian Jackson
  1 sibling, 0 replies; 27+ messages in thread
From: Ian Jackson @ 2017-11-03 18:47 UTC (permalink / raw)
To: George Dunlap
Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers,
    xen-devel, Roger Pau Monné

George Dunlap writes ("Re: [Xen-devel] Commit moratorium to staging"):
> Well, with a looping xen-build going on in the guest, I've done 40 local
> migrates with no problems yet.
>
> But Roger -- is this on emulated devices only, no PV drivers?

Yes. None of our Windows tests have PV drivers.

Ian.
* Re: Commit moratorium to staging
  2017-11-02  9:42 ` Roger Pau Monné
  2017-11-02  9:55 ` Paul Durrant
@ 2017-11-02 12:05 ` Ian Jackson
  1 sibling, 0 replies; 27+ messages in thread
From: Ian Jackson @ 2017-11-02 12:05 UTC (permalink / raw)
To: Roger Pau Monné
Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel

Roger Pau Monné writes ("Re: [Xen-devel] Commit moratorium to staging"):
> Is there any way to get that from Windows automatically? If not, I
> could test that with a Debian guest. In fact it might even be a good
> thing for Linux-based guests to be added to the regular migration
> tests, in order to make sure CPUID bits don't change across migrations.

We do migrations of all the guests in osstest (apart from on ARM, where
the guests don't support it, and some special cases like rumpkernel and
xtf domains).

Ian.
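Roger's idea of checking that CPUID bits stay stable across a migration could be done inside a Linux guest with nothing more than `/proc/cpuinfo`; a rough sketch (the file paths are arbitrary, and the migration in the middle would be driven from dom0):

```shell
#!/bin/sh
# Inside the guest: snapshot the visible CPU feature flags, let dom0
# migrate the guest, then snapshot again and compare.
grep '^flags' /proc/cpuinfo | sort -u > /tmp/flags-before

# ... at this point dom0 would run: xl migrate <guest> localhost ...

grep '^flags' /proc/cpuinfo | sort -u > /tmp/flags-after

# Any difference here would mean the guest-visible CPUID changed
# across the migration.
diff /tmp/flags-before /tmp/flags-after && echo "flags unchanged"
```

A more thorough check could dump raw CPUID leaves instead of the kernel's interpreted flags, but even this cheap version would catch gross feature-bit changes in an automated test.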
* Re: Commit moratorium to staging
  2017-11-01 14:07 ` Ian Jackson
  2017-11-01 14:59 ` Julien Grall
  2017-11-01 16:17 ` Commit moratorium to staging Roger Pau Monné
@ 2017-11-02 11:19 ` George Dunlap
  2 siblings, 0 replies; 27+ messages in thread
From: George Dunlap @ 2017-11-02 11:19 UTC (permalink / raw)
To: Ian Jackson, Julien Grall, Roger Pau Monne
Cc: committers, Lars Kurth, Paul Durrant, Wei Liu, xen-devel

On 11/01/2017 02:07 PM, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
>
> * This bug is 100% reproducible on affected hosts. The repro is
>   to boot the Windows guest, save/restore it, then migrate it,
>   then shut down. (This is from an IRL conversation with Roger and
>   may not be 100% accurate. Roger, please correct me.)

I presume that when you say 'migrate' you mean localhost migration?

Are the results different if you:
- only save/restore *or* migrate it?
- save/restore twice, or migrate twice, rather than save/restore +
  migrate?

Going through the save/restore path suggests that there's something
about the domain that's being set up one way on initial creation and
another way on restoring/receiving from a migration: i.e., something
not being saved and restored properly. An alternate explanation would
be a 'hitch' somewhere in the 're-attach' driver code.

-George
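George's proposed variants map onto `xl` command sequences roughly as follows. This is a dry-run sketch — the commands are echoed rather than executed, and the guest name and state-file path are hypothetical; on a real dom0 you would set `XL=xl` and a real domain name.

```shell
#!/bin/sh
# The lifecycle variants George asks about, as xl command sequences.
XL=${XL:-echo}               # dry run by default; set XL=xl on a Xen host
GUEST=${GUEST:-win-guest}    # hypothetical guest name
STATE=/tmp/guest.save

# Variant 1: save/restore only.
$XL save "$GUEST" "$STATE" && $XL restore "$STATE"

# Variant 2: live localhost migrate only.
$XL migrate "$GUEST" localhost

# Variant 3: save/restore twice.
for n in 1 2; do
    $XL save "$GUEST" "$STATE" && $XL restore "$STATE"
done

# Variant 4: migrate twice.
for n in 1 2; do
    $XL migrate "$GUEST" localhost
done

echo "all variants issued"
```

Comparing which of these sequences leave the guest unable to shut down cleanly would help isolate whether the bug lives in the save/restore path itself or only in the live (log-dirty) phase of migration.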
* Commit moratorium to staging
@ 2017-05-15 17:21 Julien Grall
  2017-05-16 16:01 ` Ian Jackson
  0 siblings, 1 reply; 27+ messages in thread
From: Julien Grall @ 2017-05-15 17:21 UTC (permalink / raw)
To: xen-devel; +Cc: lars.kurth, committers

Committers,

It looks like osstest is a bit behind because of the ARM64 boxes (they
are fully loaded) and XP testing (the XP tests have now been removed,
see [1]).

I'd like to cut the next rc when staging == master, so please stop
committing today. Ian force-pushed osstest today, so hopefully we can
get a push tomorrow.

Cheers,

[1] https://lists.xenproject.org/archives/html/xen-devel/2017-05/msg00425.html

--
Julien Grall
* Re: Commit moratorium to staging
  2017-05-15 17:21 Julien Grall
@ 2017-05-16 16:01 ` Ian Jackson
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Jackson @ 2017-05-16 16:01 UTC (permalink / raw)
To: Julien Grall; +Cc: xen-devel, committers, lars.kurth

Julien Grall writes ("Commit moratorium to staging"):
> It looks like osstest is a bit behind because of ARM64 boxes (they are
> fully loaded) and XP testing (they have now been removed, see [1]).
>
> I'd like to cut the next rc when staging == master, so please stop
> committing today.

I force-pushed xen#master earlier and there is no longer any need for
this moratorium. Of course any commits to staging still need RM
approval from Julien.

Thanks,
Ian.
End of thread, other threads: [~2017-11-06 18:30 UTC | newest]

Thread overview: 27+ messages
2017-11-02  8:19 [xen-unstable test] 115471: regressions - FAIL  osstest service owner
2017-10-31 10:49 ` Commit moratorium to staging  Julien Grall
2017-10-31 16:52   ` Roger Pau Monné
2017-11-01 10:48     ` Wei Liu
2017-11-01 11:00       ` Paul Durrant
2017-11-01 14:07         ` Ian Jackson
2017-11-01 14:59           ` Julien Grall
2017-11-01 16:54             ` Ian Jackson
2017-11-01 17:00               ` Julien Grall
2017-11-02 13:27                 ` Commit moratorium to staging [and 1 more messages]  Ian Jackson
2017-11-02 13:33                   ` Julien Grall
2017-11-01 16:17           ` Commit moratorium to staging  Roger Pau Monné
2017-11-02  9:15             ` Roger Pau Monné
2017-11-02  9:20               ` Paul Durrant
2017-11-02  9:42                 ` Roger Pau Monné
2017-11-02  9:55                   ` Paul Durrant
2017-11-03 14:14                     ` Roger Pau Monné
2017-11-03 14:52                       ` George Dunlap
2017-11-03 17:57                         ` George Dunlap
2017-11-03 18:29                           ` Roger Pau Monné
2017-11-03 18:35                             ` Juergen Gross
2017-11-06 18:25                               ` George Dunlap
2017-11-03 18:47                           ` Ian Jackson
2017-11-02 12:05                   ` Ian Jackson
2017-11-02 11:19         ` George Dunlap

-- strict thread matches above, loose matches on Subject: below --
2017-05-15 17:21 Julien Grall
2017-05-16 16:01 ` Ian Jackson