* Commit moratorium to staging
@ 2017-10-31 10:49 ` Julien Grall
  2017-10-31 16:52   ` Roger Pau Monné
  0 siblings, 1 reply; 27+ messages in thread
From: Julien Grall @ 2017-10-31 10:49 UTC (permalink / raw)
  To: committers, xen-devel; +Cc: Lars Kurth, Roger Pau Monné

Hi all,

Master lags 15 days behind staging due to tests failing reliably on some 
of the hardware in osstest (see [1]).

At the moment a force push is not feasible because the same tests pass 
on different hardware (see [2]).

Please avoid committing any more patches unless they fix a test 
failure in osstest.

The tree will be re-opened once we get a push.

Cheers,

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg03351.html
[2] 
https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg02932.html

-- 
Julien Grall

* Re: Commit moratorium to staging
  2017-10-31 10:49 ` Commit moratorium to staging Julien Grall
@ 2017-10-31 16:52   ` Roger Pau Monné
  2017-11-01 10:48     ` Wei Liu
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2017-10-31 16:52 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel, Paul Durrant, committers, Lars Kurth

On Tue, Oct 31, 2017 at 10:49:35AM +0000, Julien Grall wrote:
> Hi all,
> 
> Master lags 15 days behind staging due to tests failing reliably on some of
> the hardware in osstest (see [1]).
> 
> At the moment a force push is not feasible because the same tests passes on
> different hardware (see [2]).

I've been looking into this, and I'm afraid I don't yet have a cause
for those issues. I'm going to post what I've found so far, in case
someone is able to spot something I'm missing.

Since I assumed this was somehow related to the ACPI PM1A_STS/EN
blocks (which is how the power button event gets notified to the OS),
I've added the following instrumentation to the pmtimer.c code:

diff --git a/xen/arch/x86/hvm/pmtimer.c b/xen/arch/x86/hvm/pmtimer.c
index 435647ff1e..051fc46df8 100644
--- a/xen/arch/x86/hvm/pmtimer.c
+++ b/xen/arch/x86/hvm/pmtimer.c
@@ -61,9 +61,15 @@ static void pmt_update_sci(PMTState *s)
     ASSERT(spin_is_locked(&s->lock));
 
     if ( acpi->pm1a_en & acpi->pm1a_sts & SCI_MASK )
+    {
+        printk("asserting SCI IRQ\n");
         hvm_isa_irq_assert(s->vcpu->domain, SCI_IRQ, NULL);
+    }
     else
+    {
+        printk("de-asserting SCI IRQ\n");
         hvm_isa_irq_deassert(s->vcpu->domain, SCI_IRQ);
+    }
 }
 
 void hvm_acpi_power_button(struct domain *d)
@@ -73,6 +79,7 @@ void hvm_acpi_power_button(struct domain *d)
     if ( !has_vpm(d) )
         return;
 
+    printk("hvm_acpi_power_button for d%d\n", d->domain_id);
     spin_lock(&s->lock);
     d->arch.hvm_domain.acpi.pm1a_sts |= PWRBTN_STS;
     pmt_update_sci(s);
@@ -86,6 +93,7 @@ void hvm_acpi_sleep_button(struct domain *d)
     if ( !has_vpm(d) )
         return;
 
+    printk("hvm_acpi_sleep_button for d%d\n", d->domain_id);
     spin_lock(&s->lock);
     d->arch.hvm_domain.acpi.pm1a_sts |= PWRBTN_STS;
     pmt_update_sci(s);
@@ -170,6 +178,7 @@ static int handle_evt_io(
 
     if ( dir == IOREQ_WRITE )
     {
+        printk("write PM1a addr: %#x val: %#x\n", addr, *val);
         /* Handle this I/O one byte at a time */
         for ( i = bytes, data = *val;
               i > 0;
@@ -197,6 +206,8 @@ static int handle_evt_io(
                          bytes, *val, port);
             }
         }
+        printk("result pm1a_sts: %#x pm1a_en: %#x\n",
+              acpi->pm1a_sts, acpi->pm1a_en);
         /* Fix up the SCI state to match the new register state */
         pmt_update_sci(s);
     }

I've then rerun the failing test, and this is what I got in the
failure case (i.e. Windows ignoring the power event):

(XEN) hvm_acpi_power_button for d14
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x1
(XEN) result pm1a_sts: 0x100 pm1a_en: 0x320
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x100
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ

Strangely enough, the second time I tried the same command (xl
shutdown -wF ...) on the same guest, it succeeded and Windows shut down
without issues. This is the log in that case:

(XEN) hvm_acpi_power_button for d14
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x1
(XEN) result pm1a_sts: 0x100 pm1a_en: 0x320
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x100
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x8000
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ

I have to admit I have no idea why Windows clears the STS power bit
and then completely ignores it on certain occasions.
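
For reference, this is how I'm decoding the values in the logs above;
the bit positions are the standard ACPI PM1 event register layout (the
defines below are just for illustration, not copied from the Xen
sources):

/* PM1 event/status register bits (per the ACPI spec). */
#define TMR_STS     (1u << 0)   /* 0x0001: PM timer overflow    */
#define GBL_STS     (1u << 5)   /* 0x0020: global lock release  */
#define PWRBTN_STS  (1u << 8)   /* 0x0100: power button pressed */
#define SLPBTN_STS  (1u << 9)   /* 0x0200: sleep button pressed */
#define WAK_STS     (1u << 15)  /* 0x8000: wake status          */

/*
 * So pm1a_en == 0x320 means global, power-button and sleep-button
 * events are enabled, pm1a_sts == 0x100 means a power-button event is
 * pending, and the guest's write of 0x100 to the status register
 * (addr 0) is the usual "clear the pending event" acknowledgement.
 */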

I'm also afraid I have no idea how to debug Windows in order to know
why this event is acknowledged but ignored.

I've also tried to reproduce the same with a Debian guest, by doing
the same amount of save/restores and migrations, and finally issuing a
xl trigger <guest> power, but Debian has always worked fine and
shut down.

Any comments are welcome.

Roger.

* Re: Commit moratorium to staging
  2017-10-31 16:52   ` Roger Pau Monné
@ 2017-11-01 10:48     ` Wei Liu
  2017-11-01 11:00       ` Paul Durrant
  0 siblings, 1 reply; 27+ messages in thread
From: Wei Liu @ 2017-11-01 10:48 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel

On Tue, Oct 31, 2017 at 04:52:37PM +0000, Roger Pau Monné wrote:
> 
> I have to admit I have no idea why Windows clears the STS power bit
> and then completely ignores it on certain occasions.
> 
> I'm also afraid I have no idea how to debug Windows in order to know
> why this event is acknowledged but ignored.
> 
> I've also tried to reproduce the same with a Debian guest, by doing
> the same amount of save/restores and migrations, and finally issuing a
> xl trigger <guest> power, but Debian has always worked fine and
> shut down.
> 
> Any comments are welcome.

After googling around, some articles suggest Windows can ignore ACPI
events under certain circumstances. Is it worth checking in the Windows
event log to see if an event is received but ignored for reason X?
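
Something along these lines, run inside the guest after a failed
shutdown attempt, might show whether the power-button event is logged
at all (an untested sketch; the provider filter may need adjusting):

wevtutil qe System /c:20 /rd:true /f:text /q:"*[System[Provider[@Name='Microsoft-Windows-Kernel-Power']]]"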

For Windows Server 2012:
https://serverfault.com/questions/534042/windows-2012-how-to-make-power-button-work-in-every-cases

Can't find anything for Windows Server 2016.

* Re: Commit moratorium to staging
  2017-11-01 10:48     ` Wei Liu
@ 2017-11-01 11:00       ` Paul Durrant
  2017-11-01 14:07         ` Ian Jackson
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Durrant @ 2017-11-01 11:00 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, Julien Grall, committers, Wei Liu, Lars Kurth

> -----Original Message-----
> From: Wei Liu [mailto:wei.liu2@citrix.com]
> Sent: 01 November 2017 10:48
> To: Roger Pau Monne <roger.pau@citrix.com>
> Cc: Julien Grall <julien.grall@linaro.org>; committers@xenproject.org; xen-
> devel <xen-devel@lists.xenproject.org>; Lars Kurth <lars.kurth@citrix.com>;
> Paul Durrant <Paul.Durrant@citrix.com>; Wei Liu <wei.liu2@citrix.com>
> Subject: Re: Commit moratorium to staging
> 
> On Tue, Oct 31, 2017 at 04:52:37PM +0000, Roger Pau Monné wrote:
> >
> > I have to admit I have no idea why Windows clears the STS power bit
> > and then completely ignores it on certain occasions.
> >
> > I'm also afraid I have no idea how to debug Windows in order to know
> > why this event is acknowledged but ignored.
> >
> > I've also tried to reproduce the same with a Debian guest, by doing
> > the same amount of save/restores and migrations, and finally issuing a
> > xl trigger <guest> power, but Debian has always worked fine and
> > shut down.
> >
> > Any comments are welcome.
> 
> After googling around, some articles suggest Windows can ignore ACPI
> events under certain circumstances. Is it worth checking in the Windows
> event log to see if an event is received but ignored for reason X?

Dumping the event logs would definitely be a useful thing to do.

> 
> For Windows Server 2012:
> https://serverfault.com/questions/534042/windows-2012-how-to-make-
> power-button-work-in-every-cases
> 
> Can't find anything for Windows Server 2016.

No, I couldn't either. I did find https://ethertubes.com/unattended-acpi-shutdown-of-windows-server/ as well, which seems to have some potentially useful suggestions.

  Paul

* Re: Commit moratorium to staging
  2017-11-01 11:00       ` Paul Durrant
@ 2017-11-01 14:07         ` Ian Jackson
  2017-11-01 14:59           ` Julien Grall
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Ian Jackson @ 2017-11-01 14:07 UTC (permalink / raw)
  To: Julien Grall, Roger Pau Monne
  Cc: committers, Lars Kurth, Paul Durrant, Wei Liu, xen-devel

So, investigations (mostly by Roger, and also a bit of archaeology in
the osstest db by me) have determined:

* This bug is 100% reproducible on affected hosts.  The repro is
  to boot the Windows guest, save/restore it, then migrate it,
  then shut down.  (This is from an IRL conversation with Roger and
  may not be 100% accurate.  Roger, please correct me.)

* Affected hosts differ from unaffected hosts according to cpuid.
  Roger has repro'd the bug on an unaffected host by masking out
  certain cpuid bits.  There are 6 implicated bits and he is working
  to narrow that down.

* It seems likely that this is therefore a real bug.  Maybe in Xen, and
  perhaps one that should indeed be a release blocker.

* But this is not a regression between master and staging.  It affects
  many osstest branches apparently equally.

* This test is, effectively, new: before the osstest change
  "HostDiskRoot: bump to 20G", these jobs would always fail earlier
  and the affected step would not be run.

* The passes we got on various osstest branches before were just
  because those branches hadn't tested on an affected host yet.  As
  branches get tested on different hosts, they will get stuck on
  affected hosts.

ISTM that this situation would therefore justify a force push.  We
have established that this bug is very unlikely to have anything to do
with the commits currently blocked by the failing pushes.

Furthermore, the test is not intermittent, so a force push will be
effective in the following sense: we would only get a "spurious" pass,
resulting in the relevant osstest branch becoming stuck again, if a
future test was unlucky and got an unaffected host.  That will happen
infrequently enough.

So unless anyone objects (and for xen.git#master, with Julien's
permission), I intend to force push all affected osstest branches when
the test report shows the only blockage is ws16 and/or win10 tests
failing the "guest-stop" step.

Opinions ?

Ian.

* Re: Commit moratorium to staging
  2017-11-01 14:07         ` Ian Jackson
@ 2017-11-01 14:59           ` Julien Grall
  2017-11-01 16:54             ` Ian Jackson
  2017-11-01 16:17           ` Commit moratorium to staging Roger Pau Monné
  2017-11-02 11:19           ` George Dunlap
  2 siblings, 1 reply; 27+ messages in thread
From: Julien Grall @ 2017-11-01 14:59 UTC (permalink / raw)
  To: Ian Jackson, Roger Pau Monne
  Cc: committers, Lars Kurth, Paul Durrant, Wei Liu, xen-devel

Hi Ian,

Thank you for the detailed e-mail.

On 11/01/2017 02:07 PM, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
> 
> * This bug is 100% reproducible on affected hosts.  The repro is
>    to boot the Windows guest, save/restore it, then migrate it,
>    then shut down.  (This is from an IRL conversation with Roger and
>    may not be 100% accurate.  Roger, please correct me.)
> 
> * Affected hosts differ from unaffected hosts according to cpuid.
>    Roger has repro'd the bug on an unaffected host by masking out
>    certain cpuid bits.  There are 6 implicated bits and he is working
>    to narrow that down.
> 
> * It seems likely that this is therefore a real bug.  Maybe in Xen and
>    perhaps indeed one that should indeed be a release blocker.
> 
> * But this is not a regresson between master and staging.  It affects
>    many osstest branches apparently equally.
> 
> * This test is, effectively, new: before the osstest change
>    "HostDiskRoot: bump to 20G", these jobs would always fail earlier
>    and the affected step would not be run.
> 
> * The passes we got on various osstest branches before were just
>    because those branches hadn't tested on an affected host yet.  As
>    branches test different hosts, they will stick on affected hosts.
> 
> ISTM that this situation would therefore justify a force push.  We
> have established that this bug is very unlikely to be anything to do
> with the commits currently blocked by the failing pushes.
> 
> Furthermore, the test is not intermittent, so a force push will be
> effective in the following sense: we would only get a "spurious" pass,
> resulting in the relevant osstest branch becoming stuck again, if a
> future test was unlucky and got an unaffected host.  That will happen
> infrequently enough.
I am not entirely sure I understand this paragraph. Are you saying that 
osstest will not get stuck if we get a "spurious" pass on some hardware
in the future? Or will we need another force push?

> 
> So unless anyone objects (and for xen.git#master, with Julien's
> permission), I intend to force push all affected osstest branches when
> the test report shows the only blockage is ws16 and/or win10 tests
> failing the "guest-stop" step.

This is not only blocking xen.git#master but also blocking other trees:
	- linux-linus
	- linux-4.9

Cheers,

-- 
Julien Grall

* Re: Commit moratorium to staging
  2017-11-01 14:07         ` Ian Jackson
  2017-11-01 14:59           ` Julien Grall
@ 2017-11-01 16:17           ` Roger Pau Monné
  2017-11-02  9:15             ` Roger Pau Monné
  2017-11-02 11:19           ` George Dunlap
  2 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2017-11-01 16:17 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel

On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
> 
> * This bug is 100% reproducible on affected hosts.  The repro is
>   to boot the Windows guest, save/restore it, then migrate it,
>   then shut down.  (This is from an IRL conversation with Roger and
>   may not be 100% accurate.  Roger, please correct me.)

Yes, that's correct AFAICT. The affected hosts work fine if Windows
is booted and then shut down (without save/restore or migrations
involved).
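
In xl terms the repro sequence is roughly the following (just a
sketch; the config file, domain name and save file are placeholders):

xl create ws16.cfg
xl save ws16-guest /tmp/ws16.save
xl restore /tmp/ws16.save
xl migrate ws16-guest localhost
xl shutdown -wF ws16-guest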

> * Affected hosts differ from unaffected hosts according to cpuid.
>   Roger has repro'd the bug on an unaffected host by masking out
>   certain cpuid bits.  There are 6 implicated bits and he is working
>   to narrow that down.

I'm currently trying to narrow this down and make sure the above is
accurate.

> * It seems likely that this is therefore a real bug.  Maybe in Xen and
>   perhaps indeed one that should indeed be a release blocker.
> 
> * But this is not a regresson between master and staging.  It affects
>   many osstest branches apparently equally.
> 
> * This test is, effectively, new: before the osstest change
>   "HostDiskRoot: bump to 20G", these jobs would always fail earlier
>   and the affected step would not be run.
> 
> * The passes we got on various osstest branches before were just
>   because those branches hadn't tested on an affected host yet.  As
>   branches test different hosts, they will stick on affected hosts.
> 
> ISTM that this situation would therefore justify a force push.  We
> have established that this bug is very unlikely to be anything to do
> with the commits currently blocked by the failing pushes.

I agree, this is a bug that's always been present (at least in the
tested branches). It's triggered now because the Windows tests
have made further progress.

> Furthermore, the test is not intermittent, so a force push will be
> effective in the following sense: we would only get a "spurious" pass,
> resulting in the relevant osstest branch becoming stuck again, if a
> future test was unlucky and got an unaffected host.  That will happen
> infrequently enough.
> 
> So unless anyone objects (and for xen.git#master, with Julien's
> permission), I intend to force push all affected osstest branches when
> the test report shows the only blockage is ws16 and/or win10 tests
> failing the "guest-stop" step.
> 
> Opinions ?

I agree that a force push is justified. This bug is going to be quite
annoying if osstest decides to test on non-affected hosts, because
then we will get sporadic passing flights.

Thanks, Roger.

* Re: Commit moratorium to staging
  2017-11-01 14:59           ` Julien Grall
@ 2017-11-01 16:54             ` Ian Jackson
  2017-11-01 17:00               ` Julien Grall
  0 siblings, 1 reply; 27+ messages in thread
From: Ian Jackson @ 2017-11-01 16:54 UTC (permalink / raw)
  To: Julien Grall
  Cc: Lars Kurth, Wei Liu, Paul Durrant, committers, xen-devel,
	Roger Pau Monne

Julien Grall writes ("Re: Commit moratorium to staging"):
> Hi Ian,
> 
> Thank you for the detailed e-mail.
> 
> On 11/01/2017 02:07 PM, Ian Jackson wrote:
> > Furthermore, the test is not intermittent, so a force push will be
> > effective in the following sense: we would only get a "spurious" pass,
> > resulting in the relevant osstest branch becoming stuck again, if a
> > future test was unlucky and got an unaffected host.  That will happen
> > infrequently enough.
...
> I am not entirely sure to understand this paragraph. Are you saying that 
> osstest will not get stuck if we get a "spurious" pass on some hardware
> in the future? Or will we need another force push?

osstest *would* get stuck *if* we got such a spurious push.  However,
because osstest likes to retest failing tests on the same host as they
failed on previously, such spurious passes are fairly unlikely.

I say "likes to".  The allocation system uses a set of heuristics to
calculate a score for each possible host.  The score takes into
account both when the host will be available to this job, and
information like "did the most recent run of this test, on this host,
pass or fail".  So I can't make guarantees but the amount of manual
work to force push stuck branches will be tolerable.

Ian.

* Re: Commit moratorium to staging
  2017-11-01 16:54             ` Ian Jackson
@ 2017-11-01 17:00               ` Julien Grall
  2017-11-02 13:27                 ` Commit moratorium to staging [and 1 more messages] Ian Jackson
  0 siblings, 1 reply; 27+ messages in thread
From: Julien Grall @ 2017-11-01 17:00 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Lars Kurth, Wei Liu, Paul Durrant, committers, xen-devel,
	Roger Pau Monne

Hi Ian,

On 11/01/2017 04:54 PM, Ian Jackson wrote:
> Julien Grall writes ("Re: Commit moratorium to staging"):
>> Hi Ian,
>>
>> Thank you for the detailed e-mail.
>>
>> On 11/01/2017 02:07 PM, Ian Jackson wrote:
>>> Furthermore, the test is not intermittent, so a force push will be
>>> effective in the following sense: we would only get a "spurious" pass,
>>> resulting in the relevant osstest branch becoming stuck again, if a
>>> future test was unlucky and got an unaffected host.  That will happen
>>> infrequently enough.
> ...
>> I am not entirely sure to understand this paragraph. Are you saying that
>> osstest will not get stuck if we get a "spurious" pass on some hardware
>> in the future? Or will we need another force push?
> 
> osstest *would* get stuck *if* we got such a spurious push.  However,
> because osstest likes to retest failing tests on the same host as they
> failed on previously, such spurious passes are fairly unlikely.
> 
> I say "likes to".  The allocation system uses a set of heuristics to
> calculate a score for each possible host.  The score takes into
> account both when the host will be available to this job, and
> information like "did the most recent run of this test, on this host,
> pass or fail".  So I can't make guarantees but the amount of manual
> work to force push stuck branches will be tolerable.

Thank you for the explanation. I agree with the force push to unblock 
master (and the other trees I mentioned).

However, it would still be nice to find the root cause of this bug and 
fix it.

Cheers,

-- 
Julien Grall

* [xen-unstable test] 115471: regressions - FAIL
@ 2017-11-02  8:19 osstest service owner
  2017-10-31 10:49 ` Commit moratorium to staging Julien Grall
  0 siblings, 1 reply; 27+ messages in thread
From: osstest service owner @ 2017-11-02  8:19 UTC (permalink / raw)
  To: xen-devel, osstest-admin

flight 115471 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/115471/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop        fail REGR. vs. 114644
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop       fail REGR. vs. 114644
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop       fail REGR. vs. 114644

Tests which are failing intermittently (not blocking):
 test-armhf-armhf-xl           6 xen-install      fail in 115401 pass in 115471
 test-amd64-amd64-xl-qemuu-ws16-amd64 15 guest-saverestore.2 fail in 115401 pass in 115471
 test-armhf-armhf-xl-vhd 15 guest-start/debian.repeat fail in 115401 pass in 115471
 test-amd64-amd64-xl-qcow2 19 guest-start/debian.repeat fail in 115401 pass in 115471
 test-amd64-i386-libvirt-qcow2 17 guest-start/debian.repeat fail pass in 115401
 test-amd64-amd64-libvirt-vhd 17 guest-start/debian.repeat  fail pass in 115450

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-xsm 14 saverestore-support-check    fail  like 114644
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop            fail like 114644
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop             fail like 114644
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop            fail like 114644
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop             fail like 114644
 test-armhf-armhf-libvirt-raw 13 saverestore-support-check    fail  like 114644
 test-armhf-armhf-libvirt     14 saverestore-support-check    fail  like 114644
 test-amd64-amd64-xl-pvhv2-intel 12 guest-start                 fail never pass
 test-amd64-amd64-xl-pvhv2-amd 12 guest-start                  fail  never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt     13 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt-xsm  13 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt      13 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qcow2 12 migrate-support-check        fail  never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-check        fail   never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass
 test-armhf-armhf-xl-rtds     13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-rtds     14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-check        fail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-check    fail never pass
 test-armhf-armhf-xl          13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl          14 saverestore-support-check    fail   never pass
 test-armhf-armhf-libvirt-xsm 13 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-vhd      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-vhd      13 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-arndale  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-xsm      13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-xsm      14 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-check        fail  never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-check    fail  never pass
 test-armhf-armhf-xl-credit2  13 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-credit2  14 saverestore-support-check    fail   never pass
 test-armhf-armhf-libvirt     13 migrate-support-check        fail   never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop              fail never pass
 test-amd64-i386-xl-qemut-win10-i386 10 windows-install         fail never pass
 test-amd64-i386-xl-qemuu-win10-i386 10 windows-install         fail never pass
 test-amd64-amd64-xl-qemuu-win10-i386 10 windows-install        fail never pass
 test-amd64-amd64-xl-qemut-win10-i386 10 windows-install        fail never pass

version targeted for testing:
 xen                  bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d
baseline version:
 xen                  24fb44e971a62b345c7b6ca3c03b454a1e150abe

Last test of basis   114644  2017-10-17 10:49:11 Z   15 days
Failing since        114670  2017-10-18 05:03:38 Z   15 days   24 attempts
Testing same since   115314  2017-10-28 05:53:13 Z    5 days   10 attempts

------------------------------------------------------------
People who touched revisions under test:
  Andre Przywara <andre.przywara@linaro.org>
  Andrew Cooper <andrew.cooper3@citrix.com>
  Anthony PERARD <anthony.perard@citrix.com>
  Bhupinder Thakur <bhupinder.thakur@linaro.org>
  Boris Ostrovsky <boris.ostrovsky@oracle.com>
  Chao Gao <chao.gao@intel.com>
  David Esler <drumandstrum@gmail.com>
  George Dunlap <george.dunlap@citrix.com>
  Ian Jackson <Ian.Jackson@eu.citrix.com>
  Jan Beulich <jbeulich@suse.com>
  Juergen Gross <jgross@suse.com>
  Julien Grall <julien.grall@linaro.org>
  Roger Pau Monne <roger.pau@citrix.com>
  Roger Pau Monné <roger.pau@citrix.com>
  Ross Lagerwall <ross.lagerwall@citrix.com>
  Stefano Stabellini <sstabellini@kernel.org>
  Tim Deegan <tim@xen.org>
  Wei Liu <wei.liu2@citrix.com>

jobs:
 build-amd64-xsm                                              pass    
 build-armhf-xsm                                              pass    
 build-i386-xsm                                               pass    
 build-amd64-xtf                                              pass    
 build-amd64                                                  pass    
 build-armhf                                                  pass    
 build-i386                                                   pass    
 build-amd64-libvirt                                          pass    
 build-armhf-libvirt                                          pass    
 build-i386-libvirt                                           pass    
 build-amd64-prev                                             pass    
 build-i386-prev                                              pass    
 build-amd64-pvops                                            pass    
 build-armhf-pvops                                            pass    
 build-i386-pvops                                             pass    
 build-amd64-rumprun                                          pass    
 build-i386-rumprun                                           pass    
 test-xtf-amd64-amd64-1                                       pass    
 test-xtf-amd64-amd64-2                                       pass    
 test-xtf-amd64-amd64-3                                       pass    
 test-xtf-amd64-amd64-4                                       pass    
 test-xtf-amd64-amd64-5                                       pass    
 test-amd64-amd64-xl                                          pass    
 test-armhf-armhf-xl                                          pass    
 test-amd64-i386-xl                                           pass    
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm                pass    
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm                 pass    
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm           pass    
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm            pass    
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm                pass    
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm                 pass    
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm        pass    
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm         pass    
 test-amd64-amd64-libvirt-xsm                                 pass    
 test-armhf-armhf-libvirt-xsm                                 pass    
 test-amd64-i386-libvirt-xsm                                  pass    
 test-amd64-amd64-xl-xsm                                      pass    
 test-armhf-armhf-xl-xsm                                      pass    
 test-amd64-i386-xl-xsm                                       pass    
 test-amd64-amd64-qemuu-nested-amd                            fail    
 test-amd64-amd64-xl-pvhv2-amd                                fail    
 test-amd64-i386-qemut-rhel6hvm-amd                           pass    
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass    
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass    
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass    
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass    
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass    
 test-amd64-i386-freebsd10-amd64                              pass    
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass    
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass    
 test-amd64-amd64-rumprun-amd64                               pass    
 test-amd64-amd64-xl-qemut-win7-amd64                         fail    
 test-amd64-i386-xl-qemut-win7-amd64                          fail    
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail    
 test-amd64-i386-xl-qemuu-win7-amd64                          fail    
 test-amd64-amd64-xl-qemut-ws16-amd64                         fail    
 test-amd64-i386-xl-qemut-ws16-amd64                          fail    
 test-amd64-amd64-xl-qemuu-ws16-amd64                         fail    
 test-amd64-i386-xl-qemuu-ws16-amd64                          fail    
 test-armhf-armhf-xl-arndale                                  pass    
 test-amd64-amd64-xl-credit2                                  pass    
 test-armhf-armhf-xl-credit2                                  pass    
 test-armhf-armhf-xl-cubietruck                               pass    
 test-amd64-amd64-examine                                     pass    
 test-armhf-armhf-examine                                     pass    
 test-amd64-i386-examine                                      pass    
 test-amd64-i386-freebsd10-i386                               pass    
 test-amd64-i386-rumprun-i386                                 pass    
 test-amd64-amd64-xl-qemut-win10-i386                         fail    
 test-amd64-i386-xl-qemut-win10-i386                          fail    
 test-amd64-amd64-xl-qemuu-win10-i386                         fail    
 test-amd64-i386-xl-qemuu-win10-i386                          fail    
 test-amd64-amd64-qemuu-nested-intel                          pass    
 test-amd64-amd64-xl-pvhv2-intel                              fail    
 test-amd64-i386-qemut-rhel6hvm-intel                         pass    
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass    
 test-amd64-amd64-libvirt                                     pass    
 test-armhf-armhf-libvirt                                     pass    
 test-amd64-i386-libvirt                                      pass    
 test-amd64-amd64-livepatch                                   pass    
 test-amd64-i386-livepatch                                    pass    
 test-amd64-amd64-migrupgrade                                 pass    
 test-amd64-i386-migrupgrade                                  pass    
 test-amd64-amd64-xl-multivcpu                                pass    
 test-armhf-armhf-xl-multivcpu                                pass    
 test-amd64-amd64-pair                                        pass    
 test-amd64-i386-pair                                         pass    
 test-amd64-amd64-libvirt-pair                                pass    
 test-amd64-i386-libvirt-pair                                 pass    
 test-amd64-amd64-amd64-pvgrub                                pass    
 test-amd64-amd64-i386-pvgrub                                 pass    
 test-amd64-amd64-pygrub                                      pass    
 test-amd64-i386-libvirt-qcow2                                fail    
 test-amd64-amd64-xl-qcow2                                    pass    
 test-armhf-armhf-libvirt-raw                                 pass    
 test-amd64-i386-xl-raw                                       pass    
 test-amd64-amd64-xl-rtds                                     pass    
 test-armhf-armhf-xl-rtds                                     pass    
 test-amd64-amd64-libvirt-vhd                                 fail    
 test-armhf-armhf-xl-vhd                                      pass    


------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
    http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
    http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 686 lines long.)


* Re: Commit moratorium to staging
  2017-11-01 16:17           ` Commit moratorium to staging Roger Pau Monné
@ 2017-11-02  9:15             ` Roger Pau Monné
  2017-11-02  9:20               ` Paul Durrant
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2017-11-02  9:15 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Lars Kurth, Wei Liu, Julien Grall, Ian Jackson, Paul Durrant,
	committers, xen-devel

On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote:
> On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
> > * Affected hosts differ from unaffected hosts according to cpuid.
> >   Roger has repro'd the bug on an unaffected host by masking out
> >   certain cpuid bits.  There are 6 implicated bits and he is working
> >   to narrow that down.
> 
> I'm currently trying to narrow this down and make sure the above is
> accurate.

So I was wrong with this; I guess I ran the tests on the wrong
host. Even when masking the different cpuid bits in the guest the
tests still succeed.

AFAICT the test fails or succeeds reliably depending on the host
hardware. I don't really have many ideas about what to do next, but I
think it would be useful to create a manual osstest flight that runs
the win16 job on all the different hosts in the colo. I would also
capture the normal information that Xen collects after each test (xl
info, /proc/cpuinfo, serial logs...).

Is there anything else not captured by ts-logs-capture that would be
interesting in order to help debug the issue?

Regards, Roger.

* Re: Commit moratorium to staging
  2017-11-02  9:15             ` Roger Pau Monné
@ 2017-11-02  9:20               ` Paul Durrant
  2017-11-02  9:42                 ` Roger Pau Monné
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Durrant @ 2017-11-02  9:20 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson

> -----Original Message-----
> From: Roger Pau Monne
> Sent: 02 November 2017 09:15
> To: Roger Pau Monne <roger.pau@citrix.com>
> Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
> <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
> <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>;
> committers@xenproject.org; xen-devel <xen-devel@lists.xenproject.org>
> Subject: Re: [Xen-devel] Commit moratorium to staging
> 
> On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote:
> > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
> > > * Affected hosts differ from unaffected hosts according to cpuid.
> > >   Roger has repro'd the bug on an unaffected host by masking out
> > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > >   to narrow that down.
> >
> > I'm currently trying to narrow this down and make sure the above is
> > accurate.
> 
> So I was wrong with this, I guess I've run the tests on the wrong
> host. Even when masking the different cpuid bits in the guest the
> tests still succeeds.
> 
> AFAICT the test fail or succeed reliably depending on the host
> hardware. I don't really have many ideas about what to do next, but I
> think it would be useful to create a manual osstest flight that runs
> the win16 job in all the different hosts in the colo. I would also
> capture the normal information that Xen collects after each test (xl
> info, /proc/cpuid, serial logs...).
> 
> Is there anything else not captured by ts-logs-capture that would be
> interesting in order to help debug the issue?

Does the shutdown reliably complete prior to migrate and then only fail intermittently after a localhost migrate? It might be useful to know what cpuid info is seen by the guest before and after migrate. Another datapoint... does the shutdown fail if you insert a delay of a couple of minutes between the migrate and the shutdown?

  Paul

> 
> Regards, Roger.

* Re: Commit moratorium to staging
  2017-11-02  9:20               ` Paul Durrant
@ 2017-11-02  9:42                 ` Roger Pau Monné
  2017-11-02  9:55                   ` Paul Durrant
  2017-11-02 12:05                   ` Ian Jackson
  0 siblings, 2 replies; 27+ messages in thread
From: Roger Pau Monné @ 2017-11-02  9:42 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson

On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne
> > Sent: 02 November 2017 09:15
> > To: Roger Pau Monne <roger.pau@citrix.com>
> > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
> > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
> > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>;
> > committers@xenproject.org; xen-devel <xen-devel@lists.xenproject.org>
> > Subject: Re: [Xen-devel] Commit moratorium to staging
> > 
> > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote:
> > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
> > > > * Affected hosts differ from unaffected hosts according to cpuid.
> > > >   Roger has repro'd the bug on an unaffected host by masking out
> > > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > > >   to narrow that down.
> > >
> > > I'm currently trying to narrow this down and make sure the above is
> > > accurate.
> > 
> > So I was wrong with this, I guess I've run the tests on the wrong
> > host. Even when masking the different cpuid bits in the guest the
> > tests still succeeds.
> > 
> > AFAICT the test fail or succeed reliably depending on the host
> > hardware. I don't really have many ideas about what to do next, but I
> > think it would be useful to create a manual osstest flight that runs
> > the win16 job in all the different hosts in the colo. I would also
> > capture the normal information that Xen collects after each test (xl
> > info, /proc/cpuid, serial logs...).
> > 
> > Is there anything else not captured by ts-logs-capture that would be
> > interesting in order to help debug the issue?
> 
> Does the shutdown reliably complete prior to migrate and then only fail intermittently after a localhost migrate?

AFAICT yes, but it can also be added to the test in order to be sure.

> It might be useful to know what cpuid info is seen by the guest before and after migrate.

Is there any way to get that from Windows automatically? If not I
could test that with a Debian guest. In fact it might even be a good
thing for a Linux-based guest to be added to the regular migration
tests in order to make sure cpuid bits don't change across migrations.
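
For a Debian guest, something like the following quick sketch (mine,
not part of osstest) could be run before and after the migration and
the output diffed:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    /* A few representative leaves; extend the list as needed. */
    unsigned int leaves[] = { 0x0, 0x1, 0x80000001 };
    unsigned int i, eax, ebx, ecx, edx;

    for ( i = 0; i < sizeof(leaves) / sizeof(leaves[0]); i++ )
    {
        if ( !__get_cpuid(leaves[i], &eax, &ebx, &ecx, &edx) )
            continue;
        printf("leaf %#010x: eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
               leaves[i], eax, ebx, ecx, edx);
    }

    return 0;
}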

> Another datapoint... does the shutdown fail if you insert a delay of a couple of minutes between the migrate and the shutdown?

Sometimes, after a variable number of calls to xl shutdown ... the
guest usually ends up shutting down.

Roger.

* Re: Commit moratorium to staging
  2017-11-02  9:42                 ` Roger Pau Monné
@ 2017-11-02  9:55                   ` Paul Durrant
  2017-11-03 14:14                     ` Roger Pau Monné
  2017-11-02 12:05                   ` Ian Jackson
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Durrant @ 2017-11-02  9:55 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson

> -----Original Message-----
> From: Roger Pau Monne
> Sent: 02 November 2017 09:42
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
> <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
> <julien.grall@linaro.org>; committers@xenproject.org; xen-devel <xen-
> devel@lists.xenproject.org>
> Subject: Re: [Xen-devel] Commit moratorium to staging
> 
> On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Roger Pau Monne
> > > Sent: 02 November 2017 09:15
> > > To: Roger Pau Monne <roger.pau@citrix.com>
> > > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
> > > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
> > > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>;
> > > committers@xenproject.org; xen-devel <xen-
> devel@lists.xenproject.org>
> > > Subject: Re: [Xen-devel] Commit moratorium to staging
> > >
> > > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote:
> > > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
> > > > > * Affected hosts differ from unaffected hosts according to cpuid.
> > > > >   Roger has repro'd the bug on an unaffected host by masking out
> > > > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > > > >   to narrow that down.
> > > >
> > > > I'm currently trying to narrow this down and make sure the above is
> > > > accurate.
> > >
> > > So I was wrong with this, I guess I've run the tests on the wrong
> > > host. Even when masking the different cpuid bits in the guest the
> > > tests still succeeds.
> > >
> > > AFAICT the test fail or succeed reliably depending on the host
> > > hardware. I don't really have many ideas about what to do next, but I
> > > think it would be useful to create a manual osstest flight that runs
> > > the win16 job in all the different hosts in the colo. I would also
> > > capture the normal information that Xen collects after each test (xl
> > > info, /proc/cpuid, serial logs...).
> > >
> > > Is there anything else not captured by ts-logs-capture that would be
> > > interesting in order to help debug the issue?
> >
> > Does the shutdown reliably complete prior to migrate and then only fail
> intermittently after a localhost migrate?
> 
> AFAICT yes, but it can also be added to the test in order to be sure.
> 
> > It might be useful to know what cpuid info is seen by the guest before and
> after migrate.
> 
> Is there anyway to get that from windows in an automatic way? If not I
> could test that with a Debian guest. In fact it might even be a good
> thing for Linux based guest to be added to the regular migration tests
> in order to make sure cpuid bits don't change across migrations.
> 

I found this for Windows:

https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe

It can generate a text or html report as well as being run interactively. But you may get more mileage from using a Debian HVM guest. I guess it may also be useful if we could get a scan of available MSRs and their contents before and after the migrate too.
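
In a Linux HVM guest, msr-tools could give a quick before/after dump
to diff (a sketch; the MSR list is only an example and some reads may
be refused):

modprobe msr
# read a few MSRs on vCPU 0; repeat after the migration and diff
rdmsr -p 0 -x 0x1b          # IA32_APIC_BASE
rdmsr -p 0 -x 0x3a          # IA32_FEATURE_CONTROL
rdmsr -p 0 -x 0xc0000080    # EFER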

> > Another datapoint... does the shutdown fail if you insert a delay of a couple
> of minutes between the migrate and the shutdown?
> 
> Sometimes, after a variable number of calls to xl shutdown ... the
> guest usually ends up shutting down.
> 

Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case, I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try to shut down while it is still stuck, Windows starts trying to shut down but can't; try after the timeout, though, and it can.
In the past we did make attempts to support Windows without PV drivers in XenServer, but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better, particularly when using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).

  Paul

> Roger.

* Re: Commit moratorium to staging
  2017-11-01 14:07         ` Ian Jackson
  2017-11-01 14:59           ` Julien Grall
  2017-11-01 16:17           ` Commit moratorium to staging Roger Pau Monné
@ 2017-11-02 11:19           ` George Dunlap
  2 siblings, 0 replies; 27+ messages in thread
From: George Dunlap @ 2017-11-02 11:19 UTC (permalink / raw)
  To: Ian Jackson, Julien Grall, Roger Pau Monne
  Cc: committers, Lars Kurth, Paul Durrant, Wei Liu, xen-devel

On 11/01/2017 02:07 PM, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
> 
> * This bug is 100% reproducible on affected hosts.  The repro is
>   to boot the Windows guest, save/restore it, then migrate it,
>   then shut down.  (This is from an IRL conversation with Roger and
>   may not be 100% accurate.  Roger, please correct me.)

I presume when you say 'migrate' you mean localhost migration?

Are the results different if you:
- only save/restore *or* migrate it?
- save/restore twice or migrate twice, rather than save/restore + migrate?

Going through the save/restore path suggests that there's something
about the domain that's being set up one way on initial creation and
another way on restoring/receiving from a migration: i.e., something
not being saved and restored properly.

An alternate explanation would be a 'hitch' somewhere in the 're-attach'
driver code.

 -George

* Re: Commit moratorium to staging
  2017-11-02  9:42                 ` Roger Pau Monné
  2017-11-02  9:55                   ` Paul Durrant
@ 2017-11-02 12:05                   ` Ian Jackson
  1 sibling, 0 replies; 27+ messages in thread
From: Ian Jackson @ 2017-11-02 12:05 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers, xen-devel

Roger Pau Monné writes ("Re: [Xen-devel] Commit moratorium to staging"):
> Is there anyway to get that from windows in an automatic way? If not I
> could test that with a Debian guest. In fact it might even be a good
> thing for Linux based guest to be added to the regular migration tests
> in order to make sure cpuid bits don't change across migrations.

We do migrations of all the guests in osstest (apart from on ARM,
where the guests don't support it, and some special cases like
rumpkernel and xtf domains).

Ian.

* Re: Commit moratorium to staging [and 1 more messages]
  2017-11-01 17:00               ` Julien Grall
@ 2017-11-02 13:27                 ` Ian Jackson
  2017-11-02 13:33                   ` Julien Grall
  0 siblings, 1 reply; 27+ messages in thread
From: Ian Jackson @ 2017-11-02 13:27 UTC (permalink / raw)
  To: Julien Grall
  Cc: Lars Kurth, xen-devel, Wei Liu, Paul Durrant, committers,
	xen-devel, Roger Pau Monne

Julien Grall writes ("Re: Commit moratorium to staging"):
> Thank you for the explanation. I agree with the force push to unblock 
> master (and other tree I mentioned).

I will force push all the affected trees, but in a reactive way
because I base each force push on a test report - so it won't be right
away for all of them.

osstest service owner writes ("[xen-unstable test] 115471: regressions - FAIL"):
> flight 115471 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/115471/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop        fail REGR. vs. 114644
>  test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop       fail REGR. vs. 114644
>  test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop       fail REGR. vs. 114644

The above are justifiable as discussed, leaving no blockers.

> version targeted for testing:
>  xen                  bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d

So I have force pushed that.

Ian.

* Re: Commit moratorium to staging [and 1 more messages]
  2017-11-02 13:27                 ` Commit moratorium to staging [and 1 more messages] Ian Jackson
@ 2017-11-02 13:33                   ` Julien Grall
  0 siblings, 0 replies; 27+ messages in thread
From: Julien Grall @ 2017-11-02 13:33 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Lars Kurth, xen-devel, Wei Liu, Paul Durrant, committers,
	xen-devel, Roger Pau Monne

Hi Ian,

On 02/11/17 13:27, Ian Jackson wrote:
> Julien Grall writes ("Re: Commit moratorium to staging"):
>> Thank you for the explanation. I agree with the force push to unblock
>> master (and other tree I mentioned).
> 
> I will force push all the affected trees, but in a reactive way
> because I base each force push on a test report - so it won't be right
> away for all of them.
> 
> osstest service owner writes ("[xen-unstable test] 115471: regressions - FAIL"):
>> flight 115471 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/115471/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>   test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop        fail REGR. vs. 114644
>>   test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop       fail REGR. vs. 114644
>>   test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop       fail REGR. vs. 114644
> 
> The above are justifiable as discussed, leaving no blockers.
> 
>> version targeted for testing:
>>   xen                  bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d
> 
> So I have forced pushed that.

Thank you! With that, the tree is re-opened. I will go through my 
backlog of Xen 4.10 patches and have a look at whether they are suitable.

Cheers,


-- 
Julien Grall

* Re: Commit moratorium to staging
  2017-11-02  9:55                   ` Paul Durrant
@ 2017-11-03 14:14                     ` Roger Pau Monné
  2017-11-03 14:52                       ` George Dunlap
  0 siblings, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2017-11-03 14:14 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson

On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne
> > Sent: 02 November 2017 09:42
> > To: Paul Durrant <Paul.Durrant@citrix.com>
> > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
> > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
> > <julien.grall@linaro.org>; committers@xenproject.org; xen-devel <xen-
> > devel@lists.xenproject.org>
> > Subject: Re: [Xen-devel] Commit moratorium to staging
> > 
> > On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: Roger Pau Monne
> > > > Sent: 02 November 2017 09:15
> > > > To: Roger Pau Monne <roger.pau@citrix.com>
> > > > Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
> > > > <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
> > > > <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>;
> > > > committers@xenproject.org; xen-devel <xen-
> > devel@lists.xenproject.org>
> > > > Subject: Re: [Xen-devel] Commit moratorium to staging
> > > >
> > > > On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote:
> > > > > On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
> > > > > > * Affected hosts differ from unaffected hosts according to cpuid.
> > > > > >   Roger has repro'd the bug on an unaffected host by masking out
> > > > > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > > > > >   to narrow that down.
> > > > >
> > > > > I'm currently trying to narrow this down and make sure the above is
> > > > > accurate.
> > > >
> > > > So I was wrong with this, I guess I've run the tests on the wrong
> > > > host. Even when masking the different cpuid bits in the guest the
> > > > tests still succeeds.
> > > >
> > > > AFAICT the test fail or succeed reliably depending on the host
> > > > hardware. I don't really have many ideas about what to do next, but I
> > > > think it would be useful to create a manual osstest flight that runs
> > > > the win16 job in all the different hosts in the colo. I would also
> > > > capture the normal information that Xen collects after each test (xl
> > > > info, /proc/cpuid, serial logs...).
> > > >
> > > > Is there anything else not captured by ts-logs-capture that would be
> > > > interesting in order to help debug the issue?
> > >
> > > Does the shutdown reliably complete prior to migrate and then only fail
> > intermittently after a localhost migrate?
> > 
> > AFAICT yes, but it can also be added to the test in order to be sure.
> > 
> > > It might be useful to know what cpuid info is seen by the guest before and
> > after migrate.
> > 
> > Is there anyway to get that from windows in an automatic way? If not I
> > could test that with a Debian guest. In fact it might even be a good
> > thing for Linux based guest to be added to the regular migration tests
> > in order to make sure cpuid bits don't change across migrations.
> > 
> 
> I found this for windows:
> 
> https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe
> 
> It can generate a text or html report as well as being run interactively. But you may get more mileage from using a debian HVM guest. I guess it may also be useful is we can get a scan of available MSRs and content before and after migrate too.
> 
> > > Another datapoint... does the shutdown fail if you insert a delay of a couple
> > of minutes between the migrate and the shutdown?
> > 
> > Sometimes, after a variable number of calls to xl shutdown ... the
> > guest usually ends up shutting down.
> > 
> 
> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can.
> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).

Since I've been looking into this for a couple of days and have found
no solution, I'm going to write down what I've found so far (a
reproduction sketch follows below):

 - The issue only affects Windows guests.
 - It only manifests itself when doing live migration; non-live
   migration and save/resume work fine.
 - It affects all x86 hardware; the number of migrations needed to
   trigger it seems to depend on the hardware, but doing 20 migrations
   reliably triggers it on all the hardware I've tested.
 - After a variable number of `xl shutdown -wF ...` invocations the
   guest will eventually acknowledge the event and shut down.
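
For reference, the reproduction sketch mentioned above (the guest name
and iteration count are illustrative, not the exact osstest sequence):

  # Repeatedly live-migrate the guest to the same host, then ask it to
  # shut down.
  for i in $(seq 1 20); do
      xl migrate win16 localhost
  done
  # -w waits for completion, -F falls back to an ACPI power event.
  xl shutdown -wF win16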

Roger.


* Re: Commit moratorium to staging
  2017-11-03 14:14                     ` Roger Pau Monné
@ 2017-11-03 14:52                       ` George Dunlap
  2017-11-03 17:57                         ` George Dunlap
  0 siblings, 1 reply; 27+ messages in thread
From: George Dunlap @ 2017-11-03 14:52 UTC (permalink / raw)
  To: Roger Pau Monné, Paul Durrant
  Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson

On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Roger Pau Monne
>>> Sent: 02 November 2017 09:42
>>> To: Paul Durrant <Paul.Durrant@citrix.com>
>>> Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
>>> <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
>>> <julien.grall@linaro.org>; committers@xenproject.org; xen-devel <xen-
>>> devel@lists.xenproject.org>
>>> Subject: Re: [Xen-devel] Commit moratorium to staging
>>>
>>> On Thu, Nov 02, 2017 at 09:20:10AM +0000, Paul Durrant wrote:
>>>>> -----Original Message-----
>>>>> From: Roger Pau Monne
>>>>> Sent: 02 November 2017 09:15
>>>>> To: Roger Pau Monne <roger.pau@citrix.com>
>>>>> Cc: Ian Jackson <Ian.Jackson@citrix.com>; Lars Kurth
>>>>> <lars.kurth@citrix.com>; Wei Liu <wei.liu2@citrix.com>; Julien Grall
>>>>> <julien.grall@linaro.org>; Paul Durrant <Paul.Durrant@citrix.com>;
>>>>> committers@xenproject.org; xen-devel <xen-
>>> devel@lists.xenproject.org>
>>>>> Subject: Re: [Xen-devel] Commit moratorium to staging
>>>>>
>>>>> On Wed, Nov 01, 2017 at 04:17:10PM +0000, Roger Pau Monné wrote:
>>>>>> On Wed, Nov 01, 2017 at 02:07:48PM +0000, Ian Jackson wrote:
>>>>>>> * Affected hosts differ from unaffected hosts according to cpuid.
>>>>>>>   Roger has repro'd the bug on an unaffected host by masking out
>>>>>>>   certain cpuid bits.  There are 6 implicated bits and he is working
>>>>>>>   to narrow that down.
>>>>>>
>>>>>> I'm currently trying to narrow this down and make sure the above is
>>>>>> accurate.
>>>>>
>>>>> So I was wrong with this, I guess I've run the tests on the wrong
>>>>> host. Even when masking the different cpuid bits in the guest the
>>>>> tests still succeeds.
>>>>>
>>>>> AFAICT the test fail or succeed reliably depending on the host
>>>>> hardware. I don't really have many ideas about what to do next, but I
>>>>> think it would be useful to create a manual osstest flight that runs
>>>>> the win16 job in all the different hosts in the colo. I would also
>>>>> capture the normal information that Xen collects after each test (xl
>>>>> info, /proc/cpuid, serial logs...).
>>>>>
>>>>> Is there anything else not captured by ts-logs-capture that would be
>>>>> interesting in order to help debug the issue?
>>>>
>>>> Does the shutdown reliably complete prior to migrate and then only fail
>>> intermittently after a localhost migrate?
>>>
>>> AFAICT yes, but it can also be added to the test in order to be sure.
>>>
>>>> It might be useful to know what cpuid info is seen by the guest before and
>>> after migrate.
>>>
>>> Is there anyway to get that from windows in an automatic way? If not I
>>> could test that with a Debian guest. In fact it might even be a good
>>> thing for Linux based guest to be added to the regular migration tests
>>> in order to make sure cpuid bits don't change across migrations.
>>>
>>
>> I found this for windows:
>>
>> https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe
>>
>> It can generate a text or html report as well as being run interactively. But you may get more mileage from using a debian HVM guest. I guess it may also be useful is we can get a scan of available MSRs and content before and after migrate too.
>>
>>>> Another datapoint... does the shutdown fail if you insert a delay of a couple
>>> of minutes between the migrate and the shutdown?
>>>
>>> Sometimes, after a variable number of calls to xl shutdown ... the
>>> guest usually ends up shutting down.
>>>
>>
>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can.
>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).
> 
> Since I've been looking into this for a couple of days, and found no
> solution I'm going to write what I've found so far:
> 
>  - The issue only affects Windows guests.
>  - It only manifests itself when doing live migration, non-live
>    migration or save/resume work fine.
>  - It affects all x86 hardware, the amount of migrations in order to
>    trigger it seems to depend on the hardware, but doing 20 migrations
>    reliably triggers it on all the hardware I've tested.

Not good.

You said that Windows reported that the login process failed somehow?

Is it possible something bad is happening, like sending spurious page
faults to the guest in logdirty mode?

I wonder if we could reproduce something like it on Linux -- set a build
going and start localhost migrating; a spurious page fault is likely to
cause the build to fail.
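
Something along these lines would exercise that (just a sketch; the
guest name, ssh access and build tree are assumptions, not taken from
any existing test):

  # Keep a Xen build looping inside an HVM Linux guest while
  # live-migrating it on the host.
  ssh root@linux-hvm \
      'cd /root/xen && while make -j 3 xen; do git clean -ffdx; done' &
  for i in $(seq 1 40); do
      xl migrate linux-hvm localhost || { echo "migration $i failed"; break; }
  done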

 -George


* Re: Commit moratorium to staging
  2017-11-03 14:52                       ` George Dunlap
@ 2017-11-03 17:57                         ` George Dunlap
  2017-11-03 18:29                           ` Roger Pau Monné
  2017-11-03 18:47                           ` Ian Jackson
  0 siblings, 2 replies; 27+ messages in thread
From: George Dunlap @ 2017-11-03 17:57 UTC (permalink / raw)
  To: Roger Pau Monné, Paul Durrant
  Cc: Lars Kurth, Wei Liu, Julien Grall, committers, xen-devel, Ian Jackson

On 11/03/2017 02:52 PM, George Dunlap wrote:
> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
>>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can.
>>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).
>>
>> Since I've been looking into this for a couple of days, and found no
>> solution I'm going to write what I've found so far:
>>
>>  - The issue only affects Windows guests.
>>  - It only manifests itself when doing live migration, non-live
>>    migration or save/resume work fine.
>>  - It affects all x86 hardware, the amount of migrations in order to
>>    trigger it seems to depend on the hardware, but doing 20 migrations
>>    reliably triggers it on all the hardware I've tested.
> 
> Not good.
> 
> You said that Windows reported that the login process failed somehow?
> 
> Is it possible something bad is happening, like sending spurious page
> faults to the guest in logdirty mode?
> 
> I wonder if we could reproduce something like it on Linux -- set a build
> going and start localhost migrating; a spurious page fault is likely to
> cause the build to fail.

Well, with a looping xen-build going on in the guest, I've done 40 local
migrates with no problems yet.

But Roger -- is this on emulated devices only, no PV drivers?

That might be something worth looking at.

 -George


* Re: Commit moratorium to staging
  2017-11-03 17:57                         ` George Dunlap
@ 2017-11-03 18:29                           ` Roger Pau Monné
  2017-11-03 18:35                             ` Juergen Gross
  2017-11-03 18:47                           ` Ian Jackson
  1 sibling, 1 reply; 27+ messages in thread
From: Roger Pau Monné @ 2017-11-03 18:29 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers,
	xen-devel, Ian Jackson

On Fri, Nov 03, 2017 at 05:57:52PM +0000, George Dunlap wrote:
> On 11/03/2017 02:52 PM, George Dunlap wrote:
> > On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
> >> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
> >>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can.
> >>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).
> >>
> >> Since I've been looking into this for a couple of days, and found no
> >> solution I'm going to write what I've found so far:
> >>
> >>  - The issue only affects Windows guests.
> >>  - It only manifests itself when doing live migration, non-live
> >>    migration or save/resume work fine.
> >>  - It affects all x86 hardware, the amount of migrations in order to
> >>    trigger it seems to depend on the hardware, but doing 20 migrations
> >>    reliably triggers it on all the hardware I've tested.
> > 
> > Not good.
> > 
> > You said that Windows reported that the login process failed somehow?
> > 
> > Is it possible something bad is happening, like sending spurious page
> > faults to the guest in logdirty mode?
> > 
> > I wonder if we could reproduce something like it on Linux -- set a build
> > going and start localhost migrating; a spurious page fault is likely to
> > cause the build to fail.
> 
> Well, with a looping xen-build going on in the guest, I've done 40 local
> migrates with no problems yet.
> 
> But Roger -- is this on emulated devices only, no PV drivers?
> 
> That might be something worth looking at.

Yes, Windows doesn't have PV devices. But save/restore and non-live
migration seem fine, so it doesn't look to be related to devices, but
rather to log-dirty or some other aspect of live migration.

Or maybe it is indeed something related to emulated devices that is
simply easier to trigger during live migration.

I'm also thinking it would be helpful to do a run of 20 save/restore
cycles, a shutdown, a fresh create, then 20 live migrations and a
final shutdown. That would make it easier to tell apart problems in
save/restore from problems specific to live migration.
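
Concretely, such a run might look like this (a sketch only; the guest
name and paths are illustrative):

  # 20 save/restore cycles, shutdown, fresh create, then 20 live
  # migrations and another shutdown.
  for i in $(seq 1 20); do
      xl save win16 /tmp/win16.chkpt
      xl restore /tmp/win16.chkpt
  done
  xl shutdown -wF win16
  xl create /etc/xen/win16.cfg
  for i in $(seq 1 20); do
      xl migrate win16 localhost
  done
  xl shutdown -wF win16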

Roger.


* Re: Commit moratorium to staging
  2017-11-03 18:29                           ` Roger Pau Monné
@ 2017-11-03 18:35                             ` Juergen Gross
  2017-11-06 18:25                               ` George Dunlap
  0 siblings, 1 reply; 27+ messages in thread
From: Juergen Gross @ 2017-11-03 18:35 UTC (permalink / raw)
  To: Roger Pau Monné, George Dunlap
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers,
	Ian Jackson, xen-devel

On 03/11/17 19:29, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 05:57:52PM +0000, George Dunlap wrote:
>> On 11/03/2017 02:52 PM, George Dunlap wrote:
>>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>>>> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
>>>>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can.
>>>>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).
>>>>
>>>> Since I've been looking into this for a couple of days, and found no
>>>> solution I'm going to write what I've found so far:
>>>>
>>>>  - The issue only affects Windows guests.
>>>>  - It only manifests itself when doing live migration, non-live
>>>>    migration or save/resume work fine.
>>>>  - It affects all x86 hardware, the amount of migrations in order to
>>>>    trigger it seems to depend on the hardware, but doing 20 migrations
>>>>    reliably triggers it on all the hardware I've tested.
>>>
>>> Not good.
>>>
>>> You said that Windows reported that the login process failed somehow?
>>>
>>> Is it possible something bad is happening, like sending spurious page
>>> faults to the guest in logdirty mode?
>>>
>>> I wonder if we could reproduce something like it on Linux -- set a build
>>> going and start localhost migrating; a spurious page fault is likely to
>>> cause the build to fail.
>>
>> Well, with a looping xen-build going on in the guest, I've done 40 local
>> migrates with no problems yet.
>>
>> But Roger -- is this on emulated devices only, no PV drivers?
>>
>> That might be something worth looking at.
> 
> Yes, windows doesn't have PV devices. But save/restore and non-live
> migration seems fine, so it doesn't look to be related to devices, but
> rather to log-dirty or some other aspect of live-migration.

log-dirty for read-I/Os of emulated devices?


Juergen


* Re: Commit moratorium to staging
  2017-11-03 17:57                         ` George Dunlap
  2017-11-03 18:29                           ` Roger Pau Monné
@ 2017-11-03 18:47                           ` Ian Jackson
  1 sibling, 0 replies; 27+ messages in thread
From: Ian Jackson @ 2017-11-03 18:47 UTC (permalink / raw)
  To: George Dunlap
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers,
	xen-devel, Roger Pau Monné

George Dunlap writes ("Re: [Xen-devel] Commit moratorium to staging"):
> Well, with a looping xen-build going on in the guest, I've done 40 local
> migrates with no problems yet.
> 
> But Roger -- is this on emulated devices only, no PV drivers?

Yes.  None of our Windows tests have PV drivers.

Ian.


* Re: Commit moratorium to staging
  2017-11-03 18:35                             ` Juergen Gross
@ 2017-11-06 18:25                               ` George Dunlap
  0 siblings, 0 replies; 27+ messages in thread
From: George Dunlap @ 2017-11-06 18:25 UTC (permalink / raw)
  To: Juergen Gross, Roger Pau Monné
  Cc: Lars Kurth, Wei Liu, Julien Grall, Paul Durrant, committers,
	Ian Jackson, xen-devel

On 11/03/2017 06:35 PM, Juergen Gross wrote:
> On 03/11/17 19:29, Roger Pau Monné wrote:
>> On Fri, Nov 03, 2017 at 05:57:52PM +0000, George Dunlap wrote:
>>> On 11/03/2017 02:52 PM, George Dunlap wrote:
>>>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>>>>> On Thu, Nov 02, 2017 at 09:55:11AM +0000, Paul Durrant wrote:
>>>>>> Hmm. I wonder whether the guest is actually healthy after the migrate. One could imagine a situation where the storage device model (IDE in our case I guess) gets stuck in some way but recovers after a timeout in the guest storage stack. Thus, if you happen to try shut down while it is still stuck Windows starts trying to shut down but can't. Try after the timeout though and it can.
>>>>>> In the past we did make attempts to support Windows without PV drivers in XenServer but xenrt would never reliably pass VM lifecycle tests using emulated devices. That was with qemu trad, but I wonder whether upstream qemu is actually any better particularly if using older device models such as IDE and RTL8139 (which are probably largely unmodified from trad).
>>>>>
>>>>> Since I've been looking into this for a couple of days, and found no
>>>>> solution I'm going to write what I've found so far:
>>>>>
>>>>>  - The issue only affects Windows guests.
>>>>>  - It only manifests itself when doing live migration, non-live
>>>>>    migration or save/resume work fine.
>>>>>  - It affects all x86 hardware, the amount of migrations in order to
>>>>>    trigger it seems to depend on the hardware, but doing 20 migrations
>>>>>    reliably triggers it on all the hardware I've tested.
>>>>
>>>> Not good.
>>>>
>>>> You said that Windows reported that the login process failed somehow?
>>>>
>>>> Is it possible something bad is happening, like sending spurious page
>>>> faults to the guest in logdirty mode?
>>>>
>>>> I wonder if we could reproduce something like it on Linux -- set a build
>>>> going and start localhost migrating; a spurious page fault is likely to
>>>> cause the build to fail.
>>>
>>> Well, with a looping xen-build going on in the guest, I've done 40 local
>>> migrates with no problems yet.
>>>
>>> But Roger -- is this on emulated devices only, no PV drivers?
>>>
>>> That might be something worth looking at.
>>
>> Yes, windows doesn't have PV devices. But save/restore and non-live
>> migration seems fine, so it doesn't look to be related to devices, but
>> rather to log-dirty or some other aspect of live-migration.
> 
> log-dirty for read-I/Os of emulated devices?

FWIW I booted a Linux guest with "xen_nopv" on the command-line, gave it
256 MiB of RAM, and then ran a Xen build on it in a loop (see command
below).

Then I started migrating it in a loop.

After an hour or two it had done 146 local migrations and 46 builds of
Xen (swapping onto an emulated disk is pretty slow), without any issues.

Build command:

# while make -j 3 xen ; do git clean -ffdx ; done

I'm shutting down the VM and I'll leave it running overnight.

 -George


* Re: Commit moratorium to staging
  2017-05-15 17:21 Julien Grall
@ 2017-05-16 16:01 ` Ian Jackson
  0 siblings, 0 replies; 27+ messages in thread
From: Ian Jackson @ 2017-05-16 16:01 UTC (permalink / raw)
  To: Julien Grall; +Cc: xen-devel, committers, lars.kurth

Julien Grall writes ("Commit moratorium to staging"):
> It looks like osstest is a bit behind because of ARM64 boxes (they are 
> fully loaded) and XP testing (they now have been removed see [1]).
> 
> I'd like to cut the next rc when staging == master, so please stop 
> committing today.

I force pushed xen#master earlier and there is no longer any need for
this moratorium.

Of course any commits to staging still need RM approval from Julien.

Thanks,
Ian.


* Commit moratorium to staging
@ 2017-05-15 17:21 Julien Grall
  2017-05-16 16:01 ` Ian Jackson
  0 siblings, 1 reply; 27+ messages in thread
From: Julien Grall @ 2017-05-15 17:21 UTC (permalink / raw)
  To: xen-devel; +Cc: lars.kurth, committers

Committers,

It looks like osstest is a bit behind because of the ARM64 boxes (they 
are fully loaded) and XP testing (the XP tests have now been removed, 
see [1]).

I'd like to cut the next rc when staging == master, so please stop 
committing today.

Ian force pushed osstest today, so hopefully we can get a push tomorrow.

Cheers,

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2017-05/msg00425.html

-- 
Julien Grall


Thread overview: 27+ messages
2017-11-02  8:19 [xen-unstable test] 115471: regressions - FAIL osstest service owner
2017-10-31 10:49 ` Commit moratorium to staging Julien Grall
2017-10-31 16:52   ` Roger Pau Monné
2017-11-01 10:48     ` Wei Liu
2017-11-01 11:00       ` Paul Durrant
2017-11-01 14:07         ` Ian Jackson
2017-11-01 14:59           ` Julien Grall
2017-11-01 16:54             ` Ian Jackson
2017-11-01 17:00               ` Julien Grall
2017-11-02 13:27                 ` Commit moratorium to staging [and 1 more messages] Ian Jackson
2017-11-02 13:33                   ` Julien Grall
2017-11-01 16:17           ` Commit moratorium to staging Roger Pau Monné
2017-11-02  9:15             ` Roger Pau Monné
2017-11-02  9:20               ` Paul Durrant
2017-11-02  9:42                 ` Roger Pau Monné
2017-11-02  9:55                   ` Paul Durrant
2017-11-03 14:14                     ` Roger Pau Monné
2017-11-03 14:52                       ` George Dunlap
2017-11-03 17:57                         ` George Dunlap
2017-11-03 18:29                           ` Roger Pau Monné
2017-11-03 18:35                             ` Juergen Gross
2017-11-06 18:25                               ` George Dunlap
2017-11-03 18:47                           ` Ian Jackson
2017-11-02 12:05                   ` Ian Jackson
2017-11-02 11:19           ` George Dunlap
  -- strict thread matches above, loose matches on Subject: below --
2017-05-15 17:21 Julien Grall
2017-05-16 16:01 ` Ian Jackson
