* [xen-unstable test] 57852: regressions - FAIL
@ 2015-06-04 12:01 osstest service user
From: osstest service user @ 2015-06-04 12:01 UTC (permalink / raw)
  To: xen-devel; +Cc: ian.jackson

flight 57852 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/57852/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-qemuu-win7-amd64  9 windows-install   fail REGR. vs. 57419

Regressions which are regarded as allowable (not blocking):
 test-amd64-amd64-libvirt-xsm 11 guest-start               fail REGR. vs. 57419
 test-amd64-i386-libvirt      11 guest-start                  fail   like 57419
 test-amd64-i386-libvirt-xsm  11 guest-start                  fail   like 57419
 test-amd64-amd64-libvirt     11 guest-start                  fail   like 57419
 test-amd64-amd64-rumpuserxen-amd64 15 rumpuserxen-demo-xenstorels/xenstorels.repeat fail like 57419
 test-amd64-amd64-xl-qemut-win7-amd64 16 guest-stop             fail like 57419
 test-armhf-armhf-libvirt-xsm 11 guest-start                  fail   like 57419
 test-amd64-i386-xl-qemuu-win7-amd64 16 guest-stop              fail like 57419

Tests which did not succeed, but are not blocking:
 test-amd64-i386-xl-xsm       14 guest-localmigrate           fail   never pass
 test-amd64-amd64-xl-pvh-amd  11 guest-start                  fail   never pass
 test-amd64-amd64-xl-xsm      14 guest-localmigrate           fail   never pass
 test-amd64-amd64-xl-pvh-intel 11 guest-start                  fail  never pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 16 guest-stop              fail never pass
 test-armhf-armhf-libvirt     12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-xsm      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-cubietruck 12 migrate-support-check        fail never pass
 test-armhf-armhf-xl-multivcpu 12 migrate-support-check        fail  never pass
 test-armhf-armhf-xl          12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-sedf     12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-sedf-pin 12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-credit2  12 migrate-support-check        fail   never pass

version targeted for testing:
 xen                  fed56ba0e69b251d0222ef0785cd1c1838f9e51d
baseline version:
 xen                  d6b6bd8374ac30597495d457829ce7ad6e8b7016

------------------------------------------------------------
People who touched revisions under test:
  Andrew Cooper <andrew.cooper3@citrix.com>
  Dario Faggioli <dario.faggioli@citrix.com>
  George Dunlap <george.dunlap@eu.citrix.com>
  Ian Campbell <ian.campbell@citrix.com>
  Jan Beulich <jbeulich@suse.com>
  Kevin Tian <kevin.tian@intel.com>
  Roger Pau Monné <roger.pau@citrix.com>
  Ross Lagerwall <ross.lagerwall@citrix.com>
  Tim Deegan <tim@xen.org>
  Vitaly Kuznetsov <vkuznets@redhat.com>
  Yang Hongyang <yanghy@cn.fujitsu.com>
------------------------------------------------------------

jobs:
 build-amd64-xsm                                              pass
 build-armhf-xsm                                              pass
 build-i386-xsm                                               pass
 build-amd64                                                  pass
 build-armhf                                                  pass
 build-i386                                                   pass
 build-amd64-libvirt                                          pass
 build-armhf-libvirt                                          pass
 build-i386-libvirt                                           pass
 build-amd64-oldkern                                          pass
 build-i386-oldkern                                           pass
 build-amd64-pvops                                            pass
 build-armhf-pvops                                            pass
 build-i386-pvops                                             pass
 build-amd64-rumpuserxen                                      pass
 build-i386-rumpuserxen                                       pass
 test-amd64-amd64-xl                                          pass
 test-armhf-armhf-xl                                          pass
 test-amd64-i386-xl                                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm                fail
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm                 fail
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm                fail
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm                 fail
 test-amd64-amd64-libvirt-xsm                                 fail
 test-armhf-armhf-libvirt-xsm                                 fail
 test-amd64-i386-libvirt-xsm                                  fail
 test-amd64-amd64-xl-xsm                                      fail
 test-armhf-armhf-xl-xsm                                      pass
 test-amd64-i386-xl-xsm                                       fail
 test-amd64-amd64-xl-pvh-amd                                  fail
 test-amd64-i386-qemut-rhel6hvm-amd                           pass
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass
 test-amd64-i386-freebsd10-amd64                              pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass
 test-amd64-amd64-rumpuserxen-amd64                           fail
 test-amd64-amd64-xl-qemut-win7-amd64                         fail
 test-amd64-i386-xl-qemut-win7-amd64                          fail
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail
 test-amd64-i386-xl-qemuu-win7-amd64                          fail
 test-armhf-armhf-xl-arndale                                  pass
 test-amd64-amd64-xl-credit2                                  pass
 test-armhf-armhf-xl-credit2                                  pass
 test-armhf-armhf-xl-cubietruck                               pass
 test-amd64-i386-freebsd10-i386                               pass
 test-amd64-i386-rumpuserxen-i386                             pass
 test-amd64-amd64-xl-pvh-intel                                fail
 test-amd64-i386-qemut-rhel6hvm-intel                         pass
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass
 test-amd64-amd64-libvirt                                     fail
 test-armhf-armhf-libvirt                                     pass
 test-amd64-i386-libvirt                                      fail
 test-amd64-amd64-xl-multivcpu                                pass
 test-armhf-armhf-xl-multivcpu                                pass
 test-amd64-amd64-pair                                        pass
 test-amd64-i386-pair                                         pass
 test-amd64-amd64-xl-sedf-pin                                 pass
 test-armhf-armhf-xl-sedf-pin                                 pass
 test-amd64-amd64-xl-sedf                                     pass
 test-armhf-armhf-xl-sedf                                     pass
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1                     pass
 test-amd64-i386-xl-qemuu-winxpsp3-vcpus1                     pass
 test-amd64-amd64-xl-qemut-winxpsp3                           pass
 test-amd64-i386-xl-qemut-winxpsp3                            pass
 test-amd64-amd64-xl-qemuu-winxpsp3                           pass
 test-amd64-i386-xl-qemuu-winxpsp3                            pass


------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
    http://logs.test-lab.xenproject.org/osstest/logs

Test harness code can be found at
    http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

------------------------------------------------------------
commit fed56ba0e69b251d0222ef0785cd1c1838f9e51d
Author: Jan Beulich <jbeulich@suse.com>
Date:   Tue Jun 2 13:45:03 2015 +0200

    unmodified-drivers: tolerate IRQF_DISABLED being undefined

    It's being removed in Linux 4.1.

    Signed-off-by: Jan Beulich <jbeulich@suse.com>
    Acked-by: Ian Campbell <ian.campbell@citrix.com>
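
    The tolerance can be sketched as a compatibility shim (a minimal
    sketch; where the real patch places the shim may differ, and the
    flag value 0 is an assumption):

    ```c
    #include <stdio.h>

    /* If the kernel headers no longer define IRQF_DISABLED (it was
     * removed in Linux 4.1), define it to 0 so existing request_irq()
     * call sites keep compiling unchanged. */
    #ifndef IRQF_DISABLED
    #define IRQF_DISABLED 0
    #endif

    int main(void)
    {
        printf("%d\n", IRQF_DISABLED);
        return 0;
    }
    ```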

commit 8a753b3f1cf5e4714974196df9517849bf174324
Author: Ross Lagerwall <ross.lagerwall@citrix.com>
Date:   Tue Jun 2 13:44:24 2015 +0200

    efi: fix allocation problems if ExitBootServices() fails

    If calling ExitBootServices() fails, the required memory map size may
    have increased. When initially allocating the memory map, allocate a
    slightly larger buffer (by an arbitrary 8 entries) to fix this.

    The ARM code path was already allocating a larger buffer than required,
    so this moves the code to be common for all architectures.

    This was seen on the following machine when using the iscsidxe UEFI
    driver. The machine would consistently fail the first call to
    ExitBootServices().
    System Information
            Manufacturer: Supermicro
            Product Name: X10SLE-F/HF
    BIOS Information
            Vendor: American Megatrends Inc.
            Version: 2.00
            Release Date: 04/24/2014

    Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
    Acked-by: Jan Beulich <jbeulich@suse.com>
    Reviewed-by: Roy Franz <roy.franz@linaro.org>
    Acked-by: Ian Campbell <ian.campbell@citrix.com>
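
    The over-allocation idea can be sketched as follows (function and
    constant names are hypothetical; only the 8-extra-entries padding
    comes from the commit message):

    ```c
    #include <stdio.h>
    #include <stddef.h>

    /* When the firmware reports that the memory map needs `map_size`
     * bytes, reserve room for EXTRA_ENTRIES more descriptors so a
     * retry after a failed ExitBootServices() still fits without
     * reallocating (allocation may no longer be possible then). */
    #define EXTRA_ENTRIES 8

    static size_t padded_map_size(size_t map_size, size_t desc_size)
    {
        return map_size + EXTRA_ENTRIES * desc_size;
    }

    int main(void)
    {
        /* e.g. firmware asks for 40 descriptors of 48 bytes each */
        size_t need = padded_map_size(40 * 48, 48);
        printf("%zu\n", need);
        return 0;
    }
    ```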

commit 376bbbabbda607d2039b8f839f15ff02721597d2
Author: Dario Faggioli <dario.faggioli@citrix.com>
Date:   Tue Jun 2 13:43:15 2015 +0200

    sched_rt: print useful affinity info when dumping

    In fact, printing the cpupool's CPU online mask
    for each vCPU is just redundant, as that is the
    same for all the vCPUs of all the domains in the
    same cpupool, while hard affinity is already part
    of the output of dumping domains info.

    Instead, print the intersection between hard
    affinity and online CPUs, which is --in case of this
    scheduler-- the effective affinity always used for
    the vCPUs.

    This change also takes the chance to add a scratch
    cpumask area, to avoid having to either put one
    (more) cpumask_t on the stack, or dynamically
    allocate it within the dumping routine. (The former
    being bad because hypervisor stack size is limited,
    the latter because dynamic allocations can fail, if
    the hypervisor was built for a large enough number
    of CPUs.) We allocate such a scratch area, for all pCPUs,
    when the first instance of the RTDS scheduler is
    activated; in order not to lose track of it or leak it
    if other instances are activated in new cpupools,
    and to free it when the last instance is deactivated,
    we (sort of) refcount it.

    Such a scratch area can be used to kill most of the
    cpumasks{_var}_t local variables in other functions
    in the file, but that is *NOT* done in this change.

    Finally, convert the file to use keyhandler scratch,
    instead of open coded string buffers.

    Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
    Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
    Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
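
    The refcounted scratch area can be modelled in miniature (all
    names, types and sizes below are illustrative stand-ins, not the
    real Xen code):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* The first RTDS instance allocates one scratch mask per pCPU,
     * later instances just bump the count, and the storage is freed
     * only when the last instance goes away. */
    #define NR_PCPUS 4
    typedef unsigned long cpumask_t;

    static cpumask_t *scratch_mask;   /* one slot per pCPU */
    static unsigned int scratch_refs; /* live RTDS instances */

    static void rt_init(void)
    {
        if (scratch_refs++ == 0)
            scratch_mask = calloc(NR_PCPUS, sizeof(*scratch_mask));
    }

    static void rt_deinit(void)
    {
        if (--scratch_refs == 0) {
            free(scratch_mask);
            scratch_mask = NULL;
        }
    }

    int main(void)
    {
        rt_init();                    /* first cpupool: allocates */
        rt_init();                    /* second cpupool: reuses */
        assert(scratch_mask != NULL);
        rt_deinit();                  /* storage survives... */
        assert(scratch_mask != NULL);
        rt_deinit();                  /* ...until the last instance */
        assert(scratch_mask == NULL);
        puts("ok");
        return 0;
    }
    ```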

commit e758ed14f390342513405dd766e874934573e6cb
Author: Andrew Cooper <andrew.cooper3@citrix.com>
Date:   Mon Jun 1 12:00:18 2015 +0200

    docs: clarification to terms used in hypervisor memory management

    Memory management is hard[citation needed].  Furthermore, it isn't helped by
    the inconsistent use of terms through the code, or that some terms have
    changed meaning over time.

    Describe the currently-used terms in a more practical fashion, so new code has
    a concrete reference.

    Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
    Acked-by: Tim Deegan <tim@xen.org>

commit 591e1e357c29589e9d6121d8faadc4f4d3b9013e
Author: Ross Lagerwall <ross.lagerwall@citrix.com>
Date:   Mon Jun 1 11:59:14 2015 +0200

    x86: don't crash when mapping a page using EFI runtime page tables

    When an interrupt is received during an EFI runtime service call, Xen
    may call map_domain_page() while using the EFI runtime page tables.
    This fails because, although the EFI runtime page tables are a
    copy of the idle domain's page tables, current points at a different
    domain's vCPU.

    To fix this, return NULL from mapcache_current_vcpu() when using the EFI
    runtime page tables which is treated equivalently to running in an idle
    vCPU.

    This issue can be reproduced by repeatedly calling GetVariable() from
    dom0 while using VT-d, since VT-d frequently maps a page from interrupt
    context.

    Example call trace:
    [<ffff82d0801615dc>] __find_next_zero_bit+0x28/0x60
    [<ffff82d08016a10e>] map_domain_page+0x4c6/0x4eb
    [<ffff82d080156ae6>] map_vtd_domain_page+0xd/0xf
    [<ffff82d08015533a>] msi_msg_read_remap_rte+0xe3/0x1d8
    [<ffff82d08014e516>] iommu_read_msi_from_ire+0x31/0x34
    [<ffff82d08016ff6c>] set_msi_affinity+0x134/0x17a
    [<ffff82d0801737b5>] move_masked_irq+0x5c/0x98
    [<ffff82d080173816>] move_native_irq+0x25/0x36
    [<ffff82d08016ffcb>] ack_nonmaskable_msi_irq+0x19/0x20
    [<ffff82d08016ffdb>] ack_maskable_msi_irq+0x9/0x37
    [<ffff82d080173e8b>] do_IRQ+0x251/0x635
    [<ffff82d080234502>] common_interrupt+0x62/0x70
    [<00000000df7ed2be>] 00000000df7ed2be

    Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
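
    A toy model of the fix (all names are hypothetical stand-ins for
    the Xen internals, reduced to the control flow the commit
    describes):

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* While the EFI runtime page tables are active, `current` points
     * at whatever vCPU happened to be running, which does not match
     * the page tables in use. The mapcache must therefore behave as
     * if no guest vCPU were current: return NULL so the caller takes
     * the idle-vCPU path. */
    struct vcpu { int id; };

    static bool efi_rs_using_pgtables; /* set around EFI runtime calls */
    static struct vcpu *current_vcpu;

    static struct vcpu *mapcache_current_vcpu(void)
    {
        if (efi_rs_using_pgtables)
            return NULL; /* treat as running on an idle vCPU */
        return current_vcpu;
    }

    int main(void)
    {
        struct vcpu v = { .id = 1 };
        current_vcpu = &v;
        assert(mapcache_current_vcpu() == &v);
        efi_rs_using_pgtables = true;  /* interrupt during runtime call */
        assert(mapcache_current_vcpu() == NULL);
        puts("ok");
        return 0;
    }
    ```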

commit 47ec25a3c8cdd7a057af0a05e8e00257ef950437
Merge: 088e9b2 818e376
Author: Ian Campbell <ian.campbell@citrix.com>
Date:   Fri May 29 13:22:31 2015 +0100

    Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging

commit 088e9b2796bd1f9ebe4fda800275cc689677b699
Author: Yang Hongyang <yanghy@cn.fujitsu.com>
Date:   Mon May 18 15:03:56 2015 +0800

    libxc/restore: implement Remus checkpointed restore

    With Remus, the restore flow should be:
    a first full migration stream -> { periodic restore streams }

    Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
    Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
    CC: Ian Campbell <Ian.Campbell@citrix.com>
    CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
    CC: Wei Liu <wei.liu2@citrix.com>
    Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
    Acked-by: Ian Campbell <ian.campbell@citrix.com>

commit a25e4e96fc95150f5c58d069de1b204aa6487ed8
Author: Yang Hongyang <yanghy@cn.fujitsu.com>
Date:   Mon May 18 15:03:55 2015 +0800

    libxc/save: implement Remus checkpointed save

    With Remus, the save flow should be:
    live migration -> { periodic saves (checkpointed saves) }

    Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
    CC: Ian Campbell <Ian.Campbell@citrix.com>
    CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
    CC: Wei Liu <wei.liu2@citrix.com>
    CC: Andrew Cooper <andrew.cooper3@citrix.com>
    Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
    Acked-by: Ian Campbell <ian.campbell@citrix.com>

commit cfa955591caea5d7ec505cdcbf4442f2d6e889e1
Author: Yang Hongyang <yanghy@cn.fujitsu.com>
Date:   Mon May 18 15:03:54 2015 +0800

    libxc/save: refactor of send_domain_memory_live()

    Split send_domain_memory_live() into three helper functions:
      - send_memory_live()        does the actual live send
      - suspend_and_send_dirty()  suspends the guest and sends dirty pages
      - send_memory_verify()
    The motivation is that when we send a checkpointed stream, we
    skip the actual live part.

    Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
    CC: Ian Campbell <Ian.Campbell@citrix.com>
    CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
    CC: Wei Liu <wei.liu2@citrix.com>
    CC: Andrew Cooper <andrew.cooper3@citrix.com>
    Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
    Acked-by: Ian Campbell <ian.campbell@citrix.com>
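
    The split can be sketched like this (the helper names come from
    the commit message; the stub bodies and the `checkpointed` flag
    are illustrative assumptions of mine):

    ```c
    #include <stdbool.h>
    #include <stdio.h>

    /* Stubs standing in for the real libxc helpers. */
    static int send_memory_live(void)       { puts("live");   return 0; }
    static int suspend_and_send_dirty(void) { puts("dirty");  return 0; }
    static int send_memory_verify(void)     { puts("verify"); return 0; }

    /* A checkpointed (Remus) stream skips the iterative live phase
     * and goes straight to suspend-and-send-dirty. */
    static int send_domain_memory_live(bool checkpointed)
    {
        int rc = 0;
        if (!checkpointed)
            rc = send_memory_live();   /* iterative pre-copy rounds */
        if (!rc)
            rc = suspend_and_send_dirty();
        if (!rc)
            rc = send_memory_verify();
        return rc;
    }

    int main(void)
    {
        return send_domain_memory_live(true);
    }
    ```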

commit 818e376d3b17845d39735517650224c64c9e0078
Author: Jan Beulich <jbeulich@suse.com>
Date:   Thu May 28 12:07:33 2015 +0200

    Revert "use ticket locks for spin locks"

    This reverts commit 45fcc4568c5162b00fb3907fb158af82dd484a3d, as it
    introduces yet-to-be-explained issues on ARM.

commit 02cdd81aa0a88007addc788c6cf93e2f1cb1a314
Author: Jan Beulich <jbeulich@suse.com>
Date:   Thu May 28 12:06:47 2015 +0200

    Revert "spinlock: fix build with older GCC"

    This reverts commit 1037e33c88bb0e1fe530c164f242df17030102e1 as its
    prereq commit 45fcc4568c is about to be reverted.

commit 814ca12647f06b023f4aac8eae837ba9b417acc7
Author: Jan Beulich <jbeulich@suse.com>
Date:   Thu May 28 11:59:34 2015 +0200

    Revert "x86,arm: remove asm/spinlock.h from all architectures"

    This reverts commit e62e49e6d5d4e8d22f3df0b75443ede65a812435 as
    its prerequisite 45fcc4568c is going to be reverted.

commit cf6b3ccf28faee01a078311fcfe671148c81ea75
Author: Roger Pau Monné <roger.pau@citrix.com>
Date:   Thu May 28 10:56:08 2015 +0200

    x86/pvh: disable posted interrupts

    Enabling posted interrupts requires the virtual interrupt delivery feature,
    which is disabled for PVH guests, so make sure posted interrupts are also
    disabled or else vmlaunch will fail.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Reported-and-Tested-by: Lars Eggert <lars@netapp.com>
    Acked-by: Kevin Tian <kevin.tian@intel.com>
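
    The dependency can be illustrated with a toy control mask (the bit
    positions are made up for illustration and are not the real VMX
    encoding):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Posted interrupts depend on virtual interrupt delivery, so
     * whenever the latter is cleared (as it is for PVH guests) the
     * former must be cleared too, or vmlaunch fails its consistency
     * checks. */
    #define VIRT_INTR_DELIVERY  (1u << 0)
    #define POSTED_INTERRUPTS   (1u << 1)

    static unsigned int sanitize_ctls(unsigned int ctls)
    {
        if (!(ctls & VIRT_INTR_DELIVERY))
            ctls &= ~POSTED_INTERRUPTS; /* dependent feature goes too */
        return ctls;
    }

    int main(void)
    {
        /* PVH case: no virtual interrupt delivery requested. */
        unsigned int pvh = sanitize_ctls(POSTED_INTERRUPTS);
        assert(!(pvh & POSTED_INTERRUPTS));
        puts("ok");
        return 0;
    }
    ```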

commit d4d39de054a6f6c5a474aee62999a8ea7c2fd180
Author: Vitaly Kuznetsov <vkuznets@redhat.com>
Date:   Thu May 28 10:55:43 2015 +0200

    public: fix xen_domctl_monitor_op_t definition

    It seems xen_domctl_monitor_op_t was supposed to be a typedef for
    struct xen_domctl_monitor_op and not the non-existent xen_domctl__op.

    Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
(qemu changes not included)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-04 12:01 [xen-unstable test] 57852: regressions - FAIL osstest service user
@ 2015-06-05  8:45 ` Ian Campbell
  2015-06-05  9:00   ` Jan Beulich
From: Ian Campbell @ 2015-06-05  8:45 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: xen-devel, ian.jackson

On Thu, 2015-06-04 at 12:01 +0000, osstest service user wrote:
> flight 57852 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/57852/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-amd64-xl-qemuu-win7-amd64  9 windows-install   fail REGR. vs. 57419

Is anyone looking into this?

It seems to have been intermittent for a long time, but the probability
of failure seems to have increased significantly some time around flight
52633 (see [0]). Before that it failed <5% of the time; since then it
looks to be closer to 45-50%. 5% could be put down to infrastructure or
guest flakiness, but 50% seems more like something on the Xen (or qemu,
etc.) side.

The bisector is taking a look[1], but TBH given a ~50% pass rate I think
it is unlikely to get anywhere (I suspect this isn't its first attempt
at this either; I'm pretty sure I saw a failed attempt on an earlier range).

Taking 50370 as a rough baseline (4 consecutive passes before the first
of the more frequent failures) gives a range of
b6e7fbadbda4..5c44b5cf352e, which covers quite a few commits. It's
noteworthy, though, that qemuu didn't change during the interval
50370..52633 (again, from [0]).

None of the VNC snapshots look interesting (just the Windows login
screen), and neither do any of the logs.

Ian.

[0] http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.xen-unstable.html
[1] http://logs.test-lab.xenproject.org/osstest/results/bisect.xen-unstable.test-amd64-amd64-xl-qemuu-win7-amd64.windows-install.html

> 
> Regressions which are regarded as allowable (not blocking):
>  test-amd64-amd64-libvirt-xsm 11 guest-start               fail REGR. vs. 57419
>  test-amd64-i386-libvirt      11 guest-start                  fail   like 57419
>  test-amd64-i386-libvirt-xsm  11 guest-start                  fail   like 57419
>  test-amd64-amd64-libvirt     11 guest-start                  fail   like 57419
>  test-amd64-amd64-rumpuserxen-amd64 15 rumpuserxen-demo-xenstorels/xenstorels.repeat fail like 57419
>  test-amd64-amd64-xl-qemut-win7-amd64 16 guest-stop             fail like 57419
>  test-armhf-armhf-libvirt-xsm 11 guest-start                  fail   like 57419
>  test-amd64-i386-xl-qemuu-win7-amd64 16 guest-stop              fail like 57419
> 
> Tests which did not succeed, but are not blocking:
>  test-amd64-i386-xl-xsm       14 guest-localmigrate           fail   never pass
>  test-amd64-amd64-xl-pvh-amd  11 guest-start                  fail   never pass
>  test-amd64-amd64-xl-xsm      14 guest-localmigrate           fail   never pass
>  test-amd64-amd64-xl-pvh-intel 11 guest-start                  fail  never pass
>  test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
>  test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
>  test-amd64-i386-xl-qemut-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
>  test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm 12 guest-localmigrate fail never pass
>  test-amd64-i386-xl-qemut-win7-amd64 16 guest-stop              fail never pass
>  test-armhf-armhf-libvirt     12 migrate-support-check        fail   never pass
>  test-armhf-armhf-xl-xsm      12 migrate-support-check        fail   never pass
>  test-armhf-armhf-xl-arndale  12 migrate-support-check        fail   never pass
>  test-armhf-armhf-xl-cubietruck 12 migrate-support-check        fail never pass
>  test-armhf-armhf-xl-multivcpu 12 migrate-support-check        fail  never pass
>  test-armhf-armhf-xl          12 migrate-support-check        fail   never pass
>  test-armhf-armhf-xl-sedf     12 migrate-support-check        fail   never pass
>  test-armhf-armhf-xl-sedf-pin 12 migrate-support-check        fail   never pass
>  test-armhf-armhf-xl-credit2  12 migrate-support-check        fail   never pass
> 
> version targeted for testing:
>  xen                  fed56ba0e69b251d0222ef0785cd1c1838f9e51d
> baseline version:
>  xen                  d6b6bd8374ac30597495d457829ce7ad6e8b7016
> 
> ------------------------------------------------------------
> People who touched revisions under test:
>   Andrew Cooper <andrew.cooper3@citrix.com>
>   Dario Faggioli <dario.faggioli@citrix.com>
>   George Dunlap <george.dunlap@eu.citrix.com>
>   Ian Campbell <ian.campbell@citrix.com>
>   Jan Beulich <jbeulich@suse.com>
>   Kevin Tian <kevin.tian@intel.com>
>   Roger Pau Monné <roger.pau@citrix.com>
>   Ross Lagerwall <ross.lagerwall@citrix.com>
>   Tim Deegan <tim@xen.org>
>   Vitaly Kuznetsov <vkuznets@redhat.com>
>   Yang Hongyang <yanghy@cn.fujitsu.com>
> ------------------------------------------------------------
> 
> jobs:
>  build-amd64-xsm                                              pass    
>  build-armhf-xsm                                              pass    
>  build-i386-xsm                                               pass    
>  build-amd64                                                  pass    
>  build-armhf                                                  pass    
>  build-i386                                                   pass    
>  build-amd64-libvirt                                          pass    
>  build-armhf-libvirt                                          pass    
>  build-i386-libvirt                                           pass    
>  build-amd64-oldkern                                          pass    
>  build-i386-oldkern                                           pass    
>  build-amd64-pvops                                            pass    
>  build-armhf-pvops                                            pass    
>  build-i386-pvops                                             pass    
>  build-amd64-rumpuserxen                                      pass    
>  build-i386-rumpuserxen                                       pass    
>  test-amd64-amd64-xl                                          pass    
>  test-armhf-armhf-xl                                          pass    
>  test-amd64-i386-xl                                           pass    
>  test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm                fail    
>  test-amd64-i386-xl-qemut-debianhvm-amd64-xsm                 fail    
>  test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm                fail    
>  test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm                 fail    
>  test-amd64-amd64-libvirt-xsm                                 fail    
>  test-armhf-armhf-libvirt-xsm                                 fail    
>  test-amd64-i386-libvirt-xsm                                  fail    
>  test-amd64-amd64-xl-xsm                                      fail    
>  test-armhf-armhf-xl-xsm                                      pass    
>  test-amd64-i386-xl-xsm                                       fail    
>  test-amd64-amd64-xl-pvh-amd                                  fail    
>  test-amd64-i386-qemut-rhel6hvm-amd                           pass    
>  test-amd64-i386-qemuu-rhel6hvm-amd                           pass    
>  test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass    
>  test-amd64-i386-xl-qemut-debianhvm-amd64                     pass    
>  test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass    
>  test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass    
>  test-amd64-i386-freebsd10-amd64                              pass    
>  test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass    
>  test-amd64-i386-xl-qemuu-ovmf-amd64                          pass    
>  test-amd64-amd64-rumpuserxen-amd64                           fail    
>  test-amd64-amd64-xl-qemut-win7-amd64                         fail    
>  test-amd64-i386-xl-qemut-win7-amd64                          fail    
>  test-amd64-amd64-xl-qemuu-win7-amd64                         fail    
>  test-amd64-i386-xl-qemuu-win7-amd64                          fail    
>  test-armhf-armhf-xl-arndale                                  pass    
>  test-amd64-amd64-xl-credit2                                  pass    
>  test-armhf-armhf-xl-credit2                                  pass    
>  test-armhf-armhf-xl-cubietruck                               pass    
>  test-amd64-i386-freebsd10-i386                               pass    
>  test-amd64-i386-rumpuserxen-i386                             pass    
>  test-amd64-amd64-xl-pvh-intel                                fail    
>  test-amd64-i386-qemut-rhel6hvm-intel                         pass    
>  test-amd64-i386-qemuu-rhel6hvm-intel                         pass    
>  test-amd64-amd64-libvirt                                     fail    
>  test-armhf-armhf-libvirt                                     pass    
>  test-amd64-i386-libvirt                                      fail    
>  test-amd64-amd64-xl-multivcpu                                pass    
>  test-armhf-armhf-xl-multivcpu                                pass    
>  test-amd64-amd64-pair                                        pass    
>  test-amd64-i386-pair                                         pass    
>  test-amd64-amd64-xl-sedf-pin                                 pass    
>  test-armhf-armhf-xl-sedf-pin                                 pass    
>  test-amd64-amd64-xl-sedf                                     pass    
>  test-armhf-armhf-xl-sedf                                     pass    
>  test-amd64-i386-xl-qemut-winxpsp3-vcpus1                     pass    
>  test-amd64-i386-xl-qemuu-winxpsp3-vcpus1                     pass    
>  test-amd64-amd64-xl-qemut-winxpsp3                           pass    
>  test-amd64-i386-xl-qemut-winxpsp3                            pass    
>  test-amd64-amd64-xl-qemuu-winxpsp3                           pass    
>  test-amd64-i386-xl-qemuu-winxpsp3                            pass    
> 
> 
> ------------------------------------------------------------
> sg-report-flight on osstest.test-lab.xenproject.org
> logs: /home/logs/logs
> images: /home/logs/images
> 
> Logs, config files, etc. are available at
>     http://logs.test-lab.xenproject.org/osstest/logs
> 
> Test harness code can be found at
>     http://xenbits.xen.org/gitweb?p=osstest.git;a=summary
> 
> 
> Not pushing.
> 
> ------------------------------------------------------------
> commit fed56ba0e69b251d0222ef0785cd1c1838f9e51d
> Author: Jan Beulich <jbeulich@suse.com>
> Date:   Tue Jun 2 13:45:03 2015 +0200
> 
>     unmodified-drivers: tolerate IRQF_DISABLED being undefined
>     
>     It's being removed in Linux 4.1.
>     
>     Signed-off-by: Jan Beulich <jbeulich@suse.com>
>     Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> commit 8a753b3f1cf5e4714974196df9517849bf174324
> Author: Ross Lagerwall <ross.lagerwall@citrix.com>
> Date:   Tue Jun 2 13:44:24 2015 +0200
> 
>     efi: fix allocation problems if ExitBootServices() fails
>     
>     If calling ExitBootServices() fails, the required memory map size may
>     have increased. When initially allocating the memory map, allocate a
>     slightly larger buffer (by an arbitrary 8 entries) to fix this.
>     
>     The ARM code path was already allocating a larger buffer than required,
>     so this moves the code to be common for all architectures.
>     
>     This was seen on the following machine when using the iscsidxe UEFI
>     driver. The machine would consistently fail the first call to
>     ExitBootServices().
>     System Information
>             Manufacturer: Supermicro
>             Product Name: X10SLE-F/HF
>     BIOS Information
>             Vendor: American Megatrends Inc.
>             Version: 2.00
>             Release Date: 04/24/2014
>     
>     Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
>     Acked-by: Jan Beulich <jbeulich@suse.com>
>     Reviewed-by: Roy Franz <roy.franz@linaro.org>
>     Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> commit 376bbbabbda607d2039b8f839f15ff02721597d2
> Author: Dario Faggioli <dario.faggioli@citrix.com>
> Date:   Tue Jun 2 13:43:15 2015 +0200
> 
>     sched_rt: print useful affinity info when dumping
>     
>     In fact, printing the cpupool's CPU online mask
>     for each vCPU is just redundant, as that is the
>     same for all the vCPUs of all the domains in the
>     same cpupool, while hard affinity is already part
>     of the output of dumping domains info.
>     
>     Instead, print the intersection between hard
>     affinity and online CPUs, which is --in case of this
>     scheduler-- the effective affinity always used for
>     the vCPUs.
>     
>     This change also takes the chance to add a scratch
>     cpumask area, to avoid having to either put one
>     (more) cpumask_t on the stack, or dynamically
>     allocate it within the dumping routine. (The former
>     being bad because hypervisor stack size is limited,
>     the latter because dynamic allocations can fail, if
>     the hypervisor was built for a large enough number
>     of CPUs.) We allocate such scratch area, for all pCPUs,
>     when the first instance of the RTDS scheduler is
>     activated and, in order not to lose track/leak it
>     if other instances are activated in new cpupools,
>     and when the last instance is deactivated, we (sort
>     of) refcount it.
>     
>     Such scratch area can be used to kill most of the
>     cpumasks{_var}_t local variables in other functions
>     in the file, but that is *NOT* done in this change.
>     
>     Finally, convert the file to use keyhandler scratch,
>     instead of open coded string buffers.
>     
>     Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>     Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
>     Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
> commit e758ed14f390342513405dd766e874934573e6cb
> Author: Andrew Cooper <andrew.cooper3@citrix.com>
> Date:   Mon Jun 1 12:00:18 2015 +0200
> 
>     docs: clarification to terms used in hypervisor memory management
>     
>     Memory management is hard[citation needed].  Furthermore, it isn't helped by
>     the inconsistent use of terms through the code, or that some terms have
>     changed meaning over time.
>     
>     Describe the currently-used terms in a more practical fashion, so new code has
>     a concrete reference.
>     
>     Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>     Acked-by: Tim Deegan <tim@xen.org>
> 
> commit 591e1e357c29589e9d6121d8faadc4f4d3b9013e
> Author: Ross Lagerwall <ross.lagerwall@citrix.com>
> Date:   Mon Jun 1 11:59:14 2015 +0200
> 
>     x86: don't crash when mapping a page using EFI runtime page tables
>     
>     When an interrupt is received during an EFI runtime service call, Xen
>     may call map_domain_page() while using the EFI runtime page tables.
>     This fails because, although the EFI runtime page tables are a
>     copy of the idle domain's page tables, current points at a different
>     domain's vCPU.
>     
>     To fix this, return NULL from mapcache_current_vcpu() when using the EFI
>     runtime page tables which is treated equivalently to running in an idle
>     vCPU.
>     
>     This issue can be reproduced by repeatedly calling GetVariable() from
>     dom0 while using VT-d, since VT-d frequently maps a page from interrupt
>     context.
>     
>     Example call trace:
>     [<ffff82d0801615dc>] __find_next_zero_bit+0x28/0x60
>     [<ffff82d08016a10e>] map_domain_page+0x4c6/0x4eb
>     [<ffff82d080156ae6>] map_vtd_domain_page+0xd/0xf
>     [<ffff82d08015533a>] msi_msg_read_remap_rte+0xe3/0x1d8
>     [<ffff82d08014e516>] iommu_read_msi_from_ire+0x31/0x34
>     [<ffff82d08016ff6c>] set_msi_affinity+0x134/0x17a
>     [<ffff82d0801737b5>] move_masked_irq+0x5c/0x98
>     [<ffff82d080173816>] move_native_irq+0x25/0x36
>     [<ffff82d08016ffcb>] ack_nonmaskable_msi_irq+0x19/0x20
>     [<ffff82d08016ffdb>] ack_maskable_msi_irq+0x9/0x37
>     [<ffff82d080173e8b>] do_IRQ+0x251/0x635
>     [<ffff82d080234502>] common_interrupt+0x62/0x70
>     [<00000000df7ed2be>] 00000000df7ed2be
>     
>     Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
> 
> commit 47ec25a3c8cdd7a057af0a05e8e00257ef950437
> Merge: 088e9b2 818e376
> Author: Ian Campbell <ian.campbell@citrix.com>
> Date:   Fri May 29 13:22:31 2015 +0100
> 
>     Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging
> 
> commit 088e9b2796bd1f9ebe4fda800275cc689677b699
> Author: Yang Hongyang <yanghy@cn.fujitsu.com>
> Date:   Mon May 18 15:03:56 2015 +0800
> 
>     libxc/restore: implement Remus checkpointed restore
>     
>     With Remus, the restore flow should be:
>     the first full migration stream -> { periodically restore stream }
>     
>     Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>     Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>     CC: Ian Campbell <Ian.Campbell@citrix.com>
>     CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
>     CC: Wei Liu <wei.liu2@citrix.com>
>     Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>     Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> commit a25e4e96fc95150f5c58d069de1b204aa6487ed8
> Author: Yang Hongyang <yanghy@cn.fujitsu.com>
> Date:   Mon May 18 15:03:55 2015 +0800
> 
>     libxc/save: implement Remus checkpointed save
>     
>     With Remus, the save flow should be:
>     live migration->{ periodically save(checkpointed save) }
>     
>     Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>     CC: Ian Campbell <Ian.Campbell@citrix.com>
>     CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
>     CC: Wei Liu <wei.liu2@citrix.com>
>     CC: Andrew Cooper <andrew.cooper3@citrix.com>
>     Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>     Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> commit cfa955591caea5d7ec505cdcbf4442f2d6e889e1
> Author: Yang Hongyang <yanghy@cn.fujitsu.com>
> Date:   Mon May 18 15:03:54 2015 +0800
> 
>     libxc/save: refactor of send_domain_memory_live()
>     
>     Split send_domain_memory_live() into three helper functions:
>       - send_memory_live()  does the actual live send
>       - suspend_and_send_dirty() suspends the guest and sends dirty pages
>       - send_memory_verify()
>     The motivation for this is that when we send a checkpointed stream, we
>     will skip the actual live part.
>     
>     Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>     CC: Ian Campbell <Ian.Campbell@citrix.com>
>     CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
>     CC: Wei Liu <wei.liu2@citrix.com>
>     CC: Andrew Cooper <andrew.cooper3@citrix.com>
>     Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
>     Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> commit 818e376d3b17845d39735517650224c64c9e0078
> Author: Jan Beulich <jbeulich@suse.com>
> Date:   Thu May 28 12:07:33 2015 +0200
> 
>     Revert "use ticket locks for spin locks"
>     
>     This reverts commit 45fcc4568c5162b00fb3907fb158af82dd484a3d as it
>     introduces yet to be explained issues on ARM.
> 
> commit 02cdd81aa0a88007addc788c6cf93e2f1cb1a314
> Author: Jan Beulich <jbeulich@suse.com>
> Date:   Thu May 28 12:06:47 2015 +0200
> 
>     Revert "spinlock: fix build with older GCC"
>     
>     This reverts commit 1037e33c88bb0e1fe530c164f242df17030102e1 as its
>     prereq commit 45fcc4568c is about to be reverted.
> 
> commit 814ca12647f06b023f4aac8eae837ba9b417acc7
> Author: Jan Beulich <jbeulich@suse.com>
> Date:   Thu May 28 11:59:34 2015 +0200
> 
>     Revert "x86,arm: remove asm/spinlock.h from all architectures"
>     
>     This reverts commit e62e49e6d5d4e8d22f3df0b75443ede65a812435 as
>     its prerequisite 45fcc4568c is going to be reverted.
> 
> commit cf6b3ccf28faee01a078311fcfe671148c81ea75
> Author: Roger Pau Monné <roger.pau@citrix.com>
> Date:   Thu May 28 10:56:08 2015 +0200
> 
>     x86/pvh: disable posted interrupts
>     
>     Enabling posted interrupts requires the virtual interrupt delivery feature,
>     which is disabled for PVH guests, so make sure posted interrupts are also
>     disabled or else vmlaunch will fail.
>     
>     Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>     Reported-and-Tested-by: Lars Eggert <lars@netapp.com>
>     Acked-by: Kevin Tian <kevin.tian@intel.com>
> 
> commit d4d39de054a6f6c5a474aee62999a8ea7c2fd180
> Author: Vitaly Kuznetsov <vkuznets@redhat.com>
> Date:   Thu May 28 10:55:43 2015 +0200
> 
>     public: fix xen_domctl_monitor_op_t definition
>     
>     It seems xen_domctl_monitor_op_t was supposed to be a typedef for
>     struct xen_domctl_monitor_op and not the non-existent xen_domctl__op.
>     
>     Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> (qemu changes not included)
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-05  8:45 ` Ian Campbell
@ 2015-06-05  9:00   ` Jan Beulich
  2015-06-05  9:07     ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-05  9:00 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 05.06.15 at 10:45, <ian.campbell@citrix.com> wrote:
> On Thu, 2015-06-04 at 12:01 +0000, osstest service user wrote:
>> flight 57852 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/57852/ 
>> 
>> Regressions :-(
>> 
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-amd64-amd64-xl-qemuu-win7-amd64  9 windows-install   fail REGR. vs. 57419
> 
> Is anyone looking into this?

Not actively, to be honest.

> It seems to have been intermittent for a long time but the probability
> of failure seems to have increased significantly some time around flight
> 52633 (see [0]). Before that it failed <5% of the time and since then it
> looks to be closer to 45-50%. 5% could be put down to infrastructure or
> guest flakiness, 50% seems more like something on the Xen (or qemu etc)
> side.
> 
> The bisector is taking a look[1] but TBH given a 50% pass rate I think
> it is unlikely to get anywhere (I suspect this isn't its first attempt
> at this either, pretty sure I saw a failed attempt on an earlier range).
> 
> Taking 50370 as a rough baseline (4 consecutive passes before the first
> of the more frequent failures) gives a range of
> b6e7fbadbda4..5c44b5cf352e which is quite a few. It's noteworthy though
> that qemuu didn't change during the interval 50370..52633 (again, from
> [0]).
> 
> None of the vnc snapshots look interesting, just the windows login
> screen. Neither do any of the logs look interesting.

Which is the main reason for it being difficult to look into without
seeing it oneself. Two things are possibly noteworthy: This again
is an issue only ever seen with qemuu (just like the migration issue
on the stable branches), and the other day there was a report of
posted interrupts causing spurious hangs, which raises the question
whether the increased failure rate was perhaps due to the new
osstest host system pool having got extended at around that time.
(As noted in a reply to that report, this possible issue can't be an
explanation for the issue on the stable trees, as 4.3 doesn't support
posted interrupts yet.)

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-05  9:00   ` Jan Beulich
@ 2015-06-05  9:07     ` Ian Campbell
  2015-06-05  9:18       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-05  9:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Fri, 2015-06-05 at 10:00 +0100, Jan Beulich wrote:
> >>> On 05.06.15 at 10:45, <ian.campbell@citrix.com> wrote:
> > On Thu, 2015-06-04 at 12:01 +0000, osstest service user wrote:
> >> flight 57852 xen-unstable real [real]
> >> http://logs.test-lab.xenproject.org/osstest/logs/57852/ 
> >> 
> >> Regressions :-(
> >> 
> >> Tests which did not succeed and are blocking,
> >> including tests which could not be run:
> >>  test-amd64-amd64-xl-qemuu-win7-amd64  9 windows-install   fail REGR. vs. 57419
> > 
> > Is anyone looking into this?
> 
> Not actively, to be honest.
> 
> > It seems to have been intermittent for a long time but the probability
> > of failure seems to have increased significantly some time around flight
> > 52633 (see [0]). Before that it failed <5% of the time and since then it
> > looks to be closer to 45-50%. 5% could be put down to infrastructure or
> > guest flakiness, 50% seems more like something on the Xen (or qemu etc)
> > side.
> > 
> > The bisector is taking a look[1] but TBH given a 50% pass rate I think
> > it is unlikely to get anywhere (I suspect this isn't its first attempt
> > at this either, pretty sure I saw a failed attempt on an earlier range).
> > 
> > Taking 50370 as a rough baseline (4 consecutive passes before the first
> > of the more frequent failures) gives a range of
> > b6e7fbadbda4..5c44b5cf352e which is quite a few. It's noteworthy though
> > that qemuu didn't change during the interval 50370..52633 (again, from
> > [0]).
> > 
> > None of the vnc snapshots look interesting, just the windows login
> > screen. Neither do any of the logs look interesting.
> 
> Which is the main reason for it being difficult to look into without
> seeing it oneself. Two things are possibly noteworthy: This again
> is an issue only ever seen with qemuu (just like the migration issue
> on the stable branches), and the other day there was a report of
> posted interrupts causing spurious hangs, which raises the question
> whether the increased failure rate was perhaps due to the new
> osstest host system pool having got extended at around that time.
> (As noted in a reply to that report, this possible issue can't be an
> explanation for the issue on the stable trees, as 4.3 doesn't support
> posted interrupts yet.)

From
http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.xen-4.3-testing.html
it doesn't seem like 4.3-testing is suffering from the higher incidence
of windows-install failures, just the background noise which unstable
had prior to 52633.

From:
http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.xen-4.4-testing.html
http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.xen-4.5-testing.html
it looks like none of the stable branches suffer from the install issue.
I'd be inclined to discount any possible link with the migration issue
based on that.

WRT the move to the colo, flights in 5xxxx are in the new one, while
3xxxx are in the old one,
http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.xen-unstable.html
shows that things seemed ok for 8 consecutive runs after the move
(ignoring blockages).

Ian.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-05  9:07     ` Ian Campbell
@ 2015-06-05  9:18       ` Jan Beulich
  2015-06-05 10:48         ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-05  9:18 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 05.06.15 at 11:07, <ian.campbell@citrix.com> wrote:
> From:
> http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> -xl-qemuu-win7-amd64.xen-4.4-testing.html
> http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> -xl-qemuu-win7-amd64.xen-4.5-testing.html
> it looks like none of the stable branches suffer from the install issue.
> I'd be inclined to discount any possible link with the migration issue
> based on that.

Generally I would agree, but it strikes me as extremely odd that
(a) stable trees face only the migration issue, while unstable only
faces the install one,
(b) a tree as old as 4.3 (receiving only security updates) developed
this migration issue (I went into more detail on this in a reply to flight
57474's report).

> WRT the move to the colo, flights in 5xxxx are in the new one, while
> 3xxxx are in the old one,
> http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> -xl-qemuu-win7-amd64.xen-unstable.html
> shows that things seemed ok for 8 consecutive runs after the move
> (ignoring blockages).

And when it went live, all systems being in use now got immediately
deployed?

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-05  9:18       ` Jan Beulich
@ 2015-06-05 10:48         ` Ian Campbell
  2015-06-05 16:46           ` Ian Campbell
  2015-06-08  8:07           ` Jan Beulich
  0 siblings, 2 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-05 10:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Fri, 2015-06-05 at 10:18 +0100, Jan Beulich wrote:
> >>> On 05.06.15 at 11:07, <ian.campbell@citrix.com> wrote:
> > From:
> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> > -xl-qemuu-win7-amd64.xen-4.4-testing.html
> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> > -xl-qemuu-win7-amd64.xen-4.5-testing.html
> > it looks like none of the stable branches suffer from the install issue.
> > I'd be inclined to discount any possible link with the migration issue
> > based on that.
> 
> Generally I would agree, but it strikes me as extremely odd that
> (a) stable trees face only the migration issue, while unstable only
> faces the install one,

http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.xen-unstable.html shows some migration failures too (in a batch though, not spread out).

Wouldn't the migration issue be potentially blocked by the install one?

> (b) a tree as old as 4.3 (receiving only security updates) developed
> this migration issue (I went into more detail on this in a reply to flight
> 57474's report).
> 
> > WRT the move to the colo, flights in 5xxxx are in the new one, while
> > 3xxxx are in the old one,
> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> > -xl-qemuu-win7-amd64.xen-unstable.html
> > shows that things seemed ok for 8 consecutive runs after the move
> > (ignoring blockages).
> 
> And when it went live, all systems being in use now got immediately
> deployed?

All the flights in the new colo seem to have been on fiano[01].

But having looked at the page again the early success was all on fiano0
while the later failures were all on fiano1.

fiano[01] are supposedly identical hardware.

This might be simply explained by osstest's stickiness for jobs on hosts
where they are failing. I'll run a few adhoc jobs on fiano0 using 57852
as a template so we can see if that's the case.

Ian.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-05 10:48         ` Ian Campbell
@ 2015-06-05 16:46           ` Ian Campbell
  2015-06-08  8:07           ` Jan Beulich
  1 sibling, 0 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-05 16:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Fri, 2015-06-05 at 11:48 +0100, Ian Campbell wrote:
> All the flights in the new colo seem to have been on fiano[01].
> 
> But having looked at the page again the early success was all on fiano0
> while the later failures were all on fiano1.
> 
> fiano[01] are supposedly identical hardware.
> 
> This might be simply explained by osstest's stickiness for jobs on hosts
> where they are failing. I'll run a few adhoc jobs on fiano0 using 57852
> as a template so we can see if that's the case.

http://logs.test-lab.xenproject.org/osstest/logs/57940/
http://logs.test-lab.xenproject.org/osstest/logs/57945/
http://logs.test-lab.xenproject.org/osstest/logs/57953/

All ran in fiano0 and passed the install phase (they failed shutdown,
but that's a different story). They were using the exact same binaries
every time, the ones from flight 57852 which failed on fiano1.

So we may have a host specific issue on just 1 or a pair of hosts, which
is certainly annoying!

I'm going to run 3 on fiano1 to confirm that it still fails there.

Then I'm going to run 3 more on each to make extra sure...

Ian.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-05 10:48         ` Ian Campbell
  2015-06-05 16:46           ` Ian Campbell
@ 2015-06-08  8:07           ` Jan Beulich
  2015-06-08  8:53             ` Ian Campbell
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-08  8:07 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 05.06.15 at 12:48, <ian.campbell@citrix.com> wrote:
> On Fri, 2015-06-05 at 10:18 +0100, Jan Beulich wrote:
>> >>> On 05.06.15 at 11:07, <ian.campbell@citrix.com> wrote:
>> > WRT the move to the colo, flights in 5xxxx are in the new one, while
>> > 3xxxx are in the old one,
>> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
>> > -xl-qemuu-win7-amd64.xen-unstable.html
>> > shows that things seemed ok for 8 consecutive runs after the move
>> > (ignoring blockages).
>> 
>> And when it went live, all systems being in use now got immediately
>> deployed?
> 
> All the flights in the new colo seem to have been on fiano[01].

So are there just two hosts to run all x86 tests on? I thought one
of the purposes of the switch was to have a wider pool of test
systems...

> But having looked at the page again the early success was all on fiano0
> while the later failures were all on fiano1.

But that's for the unstable install failures only as it looks. At the
example of flight 57955 (testing 4.2) a local migration failure was
observed on fiano0. Which would seem to support your earlier
assumption that the install and migration issues are likely unrelated
(yet their coincidence still strikes me as odd).

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  8:07           ` Jan Beulich
@ 2015-06-08  8:53             ` Ian Campbell
  2015-06-08  9:15               ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-08  8:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Mon, 2015-06-08 at 09:07 +0100, Jan Beulich wrote:
> >>> On 05.06.15 at 12:48, <ian.campbell@citrix.com> wrote:
> > On Fri, 2015-06-05 at 10:18 +0100, Jan Beulich wrote:
> >> >>> On 05.06.15 at 11:07, <ian.campbell@citrix.com> wrote:
> >> > WRT the move to the colo, flights in 5xxxx are in the new one, while
> >> > 3xxxx are in the old one,
> >> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64 
> >> > -xl-qemuu-win7-amd64.xen-unstable.html
> >> > shows that things seemed ok for 8 consecutive runs after the move
> >> > (ignoring blockages).
> >> 
> >> And when it went live, all systems being in use now got immediately
> >> deployed?
> > 
> > All the flights in the new colo seem to have been on fiano[01].
> 
> So are there just two hosts to run all x86 tests on? I thought one
> of the purposes of the switch was to have a wider pool of test
> systems...

There are about a dozen, but when a test is failing osstest will have a
preference for the host on which it failed last time (i.e. failures
become sticky to the host), in order to catch host specific failures I
think.

I think it was just coincidence that the first group of runs which
passed were on fiano0, although perhaps the pool was smaller then since
the colo was in the process of being commissioned.

The stickiness does make it a bit harder to know if a failure is host
specific though, since you often don't get results for other systems.
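
Purely as illustration (this is a toy sketch, not osstest's actual scheduler
code; the function names and data shapes are hypothetical), the "sticky
failure" preference described above amounts to something like:

```python
def pick_host(pool, last_results):
    """Toy sketch of a 'sticky failure' host preference.

    pool: list of host names available for scheduling.
    last_results: dict mapping test name -> (host, passed) for the most
    recent run. If a test last failed, prefer the same host so that a
    host-specific problem keeps reproducing on it; otherwise the
    scheduler is free to choose (here, trivially, the first host).
    """
    def choose(test):
        prev = last_results.get(test)
        if prev is not None:
            host, passed = prev
            if not passed and host in pool:
                return host  # failure is sticky to the host it occurred on
        return pool[0]  # free choice stand-in
    return choose

choose = pick_host(["fiano0", "fiano1"],
                   {"win7-install": ("fiano1", False)})
# The previously failing test stays pinned to fiano1; others are not pinned.
```

The side effect noted above falls out directly: once a test is pinned to
its failing host, you stop gathering pass/fail data for it on other hosts.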

> > But having looked at the page again the early success was all on fiano0
> > while the later failures were all on fiano1.
> 
> But that's for the unstable install failures only as it looks. At the
> example of flight 57955 (testing 4.2) a local migration failure was
> observed on fiano0. Which would seem to support your earlier
> assumption that the install and migration issues are likely unrelated
> (yet their coincidence still strikes me as odd).

http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.html has the cross branch history for this test case. With one exception (on chardonay0, in a linux-next test) all the fails were on fiano[01] and they were all on branches which would use xen-unstable as the Xen version (xen-unstable itself and linux-* + qemu-mainline which both use the current xen.git#master as their Xen).

I've got some adhoc results over the weekend, all can be found at
http://logs.test-lab.xenproject.org/osstest/logs/<NNNNN>/test-amd64-amd64-xl-qemuu-win7-amd64/info.html for flight <NNNNN>. All of them are using the binaries from 57852.

I messed up my first command line and ran them all on fiano0 by mistake,
so there are more results than I was planning for.

Flight	Host	Failed at	Install step duration
57940	fiano0	ts-guest-stop	1483
57945	fiano0	ts-guest-stop	1640
57953	fiano0	ts-guest-stop	1473
57958	fiano0	ts-guest-stop	1472
57962	fiano0	windows-install	7512
57973	fiano0	windows-install	7693
57080	fiano0	ts-guest-stop	1534
57986	fiano0	windows-install	7203
57933	fiano0	ts-guest-stop	1529
57997	fiano0	ts-guest-stop	1494
58004	fiano0	ts-guest-stop	1492

58011	fiano1	ts-guest-stop	1408
58012	fiano1	ts-guest-stop	1529
58017	fiano1	ts-guest-stop	1466
58023	fiano1	ts-guest-stop	1624
58028	fiano1	windows-install	7208
58038	fiano1	ts-guest-stop	1479
58043	fiano1	ts-guest-stop	1493

58053	fiano0	windows-install	7439
58062	fiano0	windows-install	1916
58063	fiano0	windows-install	1477

58067	fiano1	ts-guest-stop	1453
58071	fiano1	ts-guest-stop	1550
58077	fiano1	windows-install	7156

That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
differs from the apparent xen-unstable failure rate. But I wouldn't take
this as evidence that the two systems differ significantly, despite how
the unstable results looked at first glance.

On successful install the test step takes 1450-1650s, with one outlier
at 1916. The failures take 7000-7500s (the test case timeout is 7000s, so
with slop that fits). So on success it takes <30 mins, and on failure it
has been given nearly 2 hours.
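
For the record, the per-host tallies can be re-derived mechanically from the
table (pass/fail transcribed by hand, flight numbers omitted; "fail" here
means the windows-install step failed):

```python
# (host, install_failed) per ad-hoc flight, in table order.
runs = [
    ("fiano0", False), ("fiano0", False), ("fiano0", False), ("fiano0", False),
    ("fiano0", True),  ("fiano0", True),  ("fiano0", False), ("fiano0", True),
    ("fiano0", False), ("fiano0", False), ("fiano0", False),
    ("fiano1", False), ("fiano1", False), ("fiano1", False), ("fiano1", False),
    ("fiano1", True),  ("fiano1", False), ("fiano1", False),
    ("fiano0", True),  ("fiano0", True),  ("fiano0", True),
    ("fiano1", False), ("fiano1", False), ("fiano1", True),
]

for host in ("fiano0", "fiano1"):
    total = sum(1 for h, _ in runs if h == host)
    fails = sum(1 for h, f in runs if h == host and f)
    print(f"{host}: {fails}/{total} ({100 * fails / total:.0f}%)")
# fiano0: 6/14 (43%)
# fiano1: 2/10 (20%)
```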

Ian.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  8:53             ` Ian Campbell
@ 2015-06-08  9:15               ` Jan Beulich
  2015-06-08  9:27                 ` Ian Campbell
  2015-06-08 10:10                 ` Ian Campbell
  0 siblings, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-08  9:15 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
> That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
> > differs from the apparent xen-unstable failure rate. But I wouldn't take
> this as evidence that the two systems differ significantly, despite how
> the unstable results looked at first glance.

So we can basically rule out just one of the hosts being the culprit;
it's either both or our software. Considering that (again at the
example of the recent 4.2 flight) the guest is apparently waiting for
a timer (or other) interrupt (on a HLT instruction), this is very likely
interrupt delivery related, yet (as said before, albeit wrongly for
4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
to disable it), so it can't be that (alone).

Looking at the hardware - are fiano[01], in terms of CPU and
chipset, perhaps the newest or oldest in the pool? (I'm trying to
make myself a picture of what debugging options we have.)

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  9:15               ` Jan Beulich
@ 2015-06-08  9:27                 ` Ian Campbell
  2015-06-08 10:17                   ` Jan Beulich
                                     ` (2 more replies)
  2015-06-08 10:10                 ` Ian Campbell
  1 sibling, 3 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-08  9:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
> >>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
> > That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
> > > differs from the apparent xen-unstable failure rate. But I wouldn't take
> > this as evidence that the two systems differ significantly, despite how
> > the unstable results looked at first glance.
> 
> So we can basically rule out just one of the hosts being the culprit;
> it's either both or our software. Considering that (again at the
> example of the recent 4.2 flight) the guest is apparently waiting for
> a timer (or other) interrupt (on a HLT instruction), this is very likely
> interrupt delivery related, yet (as said before, albeit wrongly for
> > 4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
> to disable it), so it can't be that (alone).
> 
> Looking at the hardware - are fiano[01], in terms of CPU and
> chipset, perhaps the newest or oldest in the pool? (I'm trying to
> make myself a picture of what debugging options we have.)

I don't know much about the hardware in the pool other than what can be
gathered from the serial and dmesg logs.

http://logs.test-lab.xenproject.org/osstest/logs/58028/test-amd64-amd64-xl-qemuu-win7-amd64/info.html

From the serial log and this:

Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest
Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing
Jun  6 12:09:27.121180 (XEN) HVM: ASIDs enabled.
Jun  6 12:09:27.121235 (XEN) HVM: VMX enabled
Jun  6 12:09:27.121267 (XEN) HVM: Hardware Assisted Paging (HAP) detected
Jun  6 12:09:27.129069 (XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB

I guess they are pretty new?

Ian.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  9:15               ` Jan Beulich
  2015-06-08  9:27                 ` Ian Campbell
@ 2015-06-08 10:10                 ` Ian Campbell
  1 sibling, 0 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-08 10:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
> (I'm trying to make myself a picture of what debugging options we
> have.)

In the meantime I've kicked off an adhoc job using no-apicv as suggested
by Andy (on IRC last week, IIRC). Assuming that my tweak takes effect in
practice I'll run a bunch of those to hopefully come up with a
significant result.

Ian.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  9:27                 ` Ian Campbell
@ 2015-06-08 10:17                   ` Jan Beulich
  2015-06-08 14:43                     ` Ian Jackson
  2015-06-08 12:16                   ` Ian Campbell
  2015-06-08 13:50                   ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-08 10:17 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 08.06.15 at 11:27, <ian.campbell@citrix.com> wrote:
> I don't know much about the hardware in the pool other than what can be
> gathered from the serial and dmesg logs.

Right - this is useful for learning details of an individual system, but
isn't really helpful when one wants to compare all the kinds of systems
in the pool.

> From the serial log and this:
> 
> Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
> Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
> Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
> Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
> Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
> Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
> Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
> Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest
> Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
> Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
> Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing
> Jun  6 12:09:27.121180 (XEN) HVM: ASIDs enabled.
> Jun  6 12:09:27.121235 (XEN) HVM: VMX enabled
> Jun  6 12:09:27.121267 (XEN) HVM: Hardware Assisted Paging (HAP) detected
> Jun  6 12:09:27.129069 (XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
> 
> I guess they are pretty new?

Looks like it, yes.

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  9:27                 ` Ian Campbell
  2015-06-08 10:17                   ` Jan Beulich
@ 2015-06-08 12:16                   ` Ian Campbell
  2015-06-08 12:19                     ` Andrew Cooper
                                       ` (2 more replies)
  2015-06-08 13:50                   ` Konrad Rzeszutek Wilk
  2 siblings, 3 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-08 12:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Mon, 2015-06-08 at 10:27 +0100, Ian Campbell wrote:
> On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
> > >>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
> > > That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
> > > differs from the apparent xen-unstable failure rate. But I wouldn't take
> > > this as evidence that the two systems differ significantly, despite how
> > > the unstable results looked at first glance.
> > 
> > So we can basically rule out just one of the hosts being the culprit;
> > it's either both or our software. Considering that (again at the
> > example of the recent 4.2 flight) the guest is apparently waiting for
> > a timer (or other) interrupt (on a HLT instruction), this is very likely
> > interrupt delivery related, yet (as said before, albeit wrongly for
> > 4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
> > to disable it), so it can't be that (alone).
> > 
> > Looking at the hardware - are fiano[01], in terms of CPU and
> > chipset, perhaps the newest or oldest in the pool? (I'm trying to
> > make myself a picture of what debugging options we have.)
> 
> I don't know much about the hardware in the pool other than what can be
> gathered from the serial and dmesg logs.
> 
> http://logs.test-lab.xenproject.org/osstest/logs/58028/test-amd64-amd64-xl-qemuu-win7-amd64/info.html
> 
> From the serial log and this:
> 
> Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
> Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
> Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
> Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
> Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
> Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
> Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
> Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest

Running with no-apicv seems to have disabled these three:

> Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
> Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
> Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing

Is that expected?

The adhoc run passed, but that's not statistically significant.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 12:16                   ` Ian Campbell
@ 2015-06-08 12:19                     ` Andrew Cooper
  2015-06-08 12:24                     ` Jan Beulich
  2015-06-09  8:26                     ` Ian Campbell
  2 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2015-06-08 12:19 UTC (permalink / raw)
  To: Ian Campbell, Jan Beulich; +Cc: xen-devel, ian.jackson

On 08/06/15 13:16, Ian Campbell wrote:
> On Mon, 2015-06-08 at 10:27 +0100, Ian Campbell wrote:
>> On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
>>>>>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
>>>> That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
>>>> differs from the apparent xen-unstable failure rate. But I wouldn't take
>>>> this as evidence that the two systems differ significantly, despite how
>>>> the unstable results looked at first glance.
>>> So we can basically rule out just one of the hosts being the culprit;
>>> it's either both or our software. Considering that (again at the
>>> example of the recent 4.2 flight) the guest is apparently waiting for
>>> a timer (or other) interrupt (on a HLT instruction), this is very likely
>>> interrupt delivery related, yet (as said before, albeit wrongly for
>>> 4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
>>> to disable it), so it can't be that (alone).
>>>
>>> Looking at the hardware - are fiano[01], in terms of CPU and
>>> chipset, perhaps the newest or oldest in the pool? (I'm trying to
>>> make myself a picture of what debugging options we have.)
>> I don't know much about the hardware in the pool other than what can be
>> gathered from the serial and dmesg logs.
>>
>> http://logs.test-lab.xenproject.org/osstest/logs/58028/test-amd64-amd64-xl-qemuu-win7-amd64/info.html
>>
>> From the serial log and this:
>>
>> Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
>> Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
>> Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
>> Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
>> Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
>> Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
>> Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
>> Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest
> Running with no-apicv seems to have disabled these three:
>
>> Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
>> Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
>> Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing
> Is that expected?

Yes - the first is APICV itself, and the other two are dependent features.

~Andrew

>
> The adhoc run passed, but that's not statistically significant.
>
> Ian.
>


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 12:16                   ` Ian Campbell
  2015-06-08 12:19                     ` Andrew Cooper
@ 2015-06-08 12:24                     ` Jan Beulich
  2015-06-09  8:26                     ` Ian Campbell
  2 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-08 12:24 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 08.06.15 at 14:16, <ian.campbell@citrix.com> wrote:
> On Mon, 2015-06-08 at 10:27 +0100, Ian Campbell wrote:
>> On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
>> > >>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
>> > > That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
>> > > differs from the apparent xen-unstable failure rate. But I wouldn't take
>> > > this as evidence that the two systems differ significantly, despite how
>> > > the unstable results looked at first glance.
>> > 
>> > So we can basically rule out just one of the hosts being the culprit;
>> > it's either both or our software. Considering that (again at the
>> > example of the recent 4.2 flight) the guest is apparently waiting for
>> > a timer (or other) interrupt (on a HLT instruction), this is very likely
>> > interrupt delivery related, yet (as said before, albeit wrongly for
>> > 4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
>> > to disable it), so it can't be that (alone).
>> > 
>> > Looking at the hardware - are fiano[01], in terms of CPU and
>> > chipset, perhaps the newest or oldest in the pool? (I'm trying to
>> > make myself a picture of what debugging options we have.)
>> 
>> I don't know much about the hardware in the pool other than what can be
>> gathered from the serial and dmesg logs.
>> 
>> 
>> http://logs.test-lab.xenproject.org/osstest/logs/58028/test-amd64-amd64-xl-qemuu-win7-amd64/info.html
>> 
>> From the serial log and this:
>> 
>> Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
>> Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
>> Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
>> Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
>> Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
>> Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
>> Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
>> Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest
> 
> Running with no-apicv seems to have disabled these three:
> 
>> Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
>> Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
>> Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing
> 
> Is that expected?

I think so, based on

        if ( (_vmx_cpu_based_exec_control & CPU_BASED_TPR_SHADOW) &&
             opt_apicv_enabled )
            opt |= SECONDARY_EXEC_APIC_REGISTER_VIRT |
                   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
                   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;

and

    if ( !(_vmx_secondary_exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY)
          || !(_vmx_vmexit_control & VM_EXIT_ACK_INTR_ON_EXIT) )
        _vmx_pin_based_exec_control  &= ~ PIN_BASED_POSTED_INTERRUPT;

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08  9:27                 ` Ian Campbell
  2015-06-08 10:17                   ` Jan Beulich
  2015-06-08 12:16                   ` Ian Campbell
@ 2015-06-08 13:50                   ` Konrad Rzeszutek Wilk
  2015-06-08 14:02                     ` Ian Campbell
  2015-06-08 14:47                     ` Ian Jackson
  2 siblings, 2 replies; 40+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-08 13:50 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, Jan Beulich, xen-devel

On Mon, Jun 08, 2015 at 10:27:32AM +0100, Ian Campbell wrote:
> On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
> > >>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
> > > That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
> > > differs from the apparent xen-unstable failure rate. But I wouldn't take
> > > this as evidence that the two systems differ significantly, despite how
> > > the unstable results looked at first glance.
> > 
> > So we can basically rule out just one of the hosts being the culprit;
> > it's either both or our software. Considering that (again at the
> > example of the recent 4.2 flight) the guest is apparently waiting for
> > a timer (or other) interrupt (on a HLT instruction), this is very likely
> > interrupt delivery related, yet (as said before, albeit wrongly for
> > > 4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
> > to disable it), so it can't be that (alone).
> > 
> > Looking at the hardware - are fiano[01], in terms of CPU and
> > chipset, perhaps the newest or oldest in the pool? (I'm trying to
> > make myself a picture of what debugging options we have.)
> 
> I don't know much about the hardware in the pool other than what can be
> gathered from the serial and dmesg logs.
> 
> http://logs.test-lab.xenproject.org/osstest/logs/58028/test-amd64-amd64-xl-qemuu-win7-amd64/info.html
> 
> From the serial log and this:
> 
> Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
> Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
> Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
> Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
> Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
> Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
> Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
> Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest
> Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
> Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
> Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing
> Jun  6 12:09:27.121180 (XEN) HVM: ASIDs enabled.
> Jun  6 12:09:27.121235 (XEN) HVM: VMX enabled
> Jun  6 12:09:27.121267 (XEN) HVM: Hardware Assisted Paging (HAP) detected
> Jun  6 12:09:27.129069 (XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
> 
> I guess they are pretty new?

Could it be a missing microcode update? I don't know if the OSSTest does
the ucode=scan or updates the microcode later?

> 
> Ian.
> 


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 13:50                   ` Konrad Rzeszutek Wilk
@ 2015-06-08 14:02                     ` Ian Campbell
  2015-06-08 14:47                     ` Ian Jackson
  1 sibling, 0 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-08 14:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Andrew Cooper, ian.jackson, Jan Beulich, xen-devel

On Mon, 2015-06-08 at 09:50 -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 08, 2015 at 10:27:32AM +0100, Ian Campbell wrote:
> > On Mon, 2015-06-08 at 10:15 +0100, Jan Beulich wrote:
> > > >>> On 08.06.15 at 10:53, <ian.campbell@citrix.com> wrote:
> > > > That's 6/14 (43%) failure rate on fiano0 and 2/10 (20%) on fiano1. Which
> > > > differs from the apparent xen-unstable failure rate. But I wouldn't take
> > > > this as evidence that the two systems differ significantly, despite how
> > > > the unstable results looked at first glance.
> > > 
> > > So we can basically rule out just one of the hosts being the culprit;
> > > it's either both or our software. Considering that (again at the
> > > example of the recent 4.2 flight) the guest is apparently waiting for
> > > a timer (or other) interrupt (on a HLT instruction), this is very likely
> > > interrupt delivery related, yet (as said before, albeit wrongly for
> > > 4.3) 4.2 doesn't have APICV support yet (4.3 only lacks the option
> > > to disable it), so it can't be that (alone).
> > > 
> > > Looking at the hardware - are fiano[01], in terms of CPU and
> > > chipset, perhaps the newest or oldest in the pool? (I'm trying to
> > > make myself a picture of what debugging options we have.)
> > 
> > I don't know much about the hardware in the pool other than what can be
> > gathered from the serial and dmesg logs.
> > 
> > http://logs.test-lab.xenproject.org/osstest/logs/58028/test-amd64-amd64-xl-qemuu-win7-amd64/info.html
> > 
> > >From the serial log and this:
> > 
> > Jun  6 12:09:27.089020 (XEN) VMX: Supported advanced features:
> > Jun  6 12:09:27.089052 (XEN)  - APIC MMIO access virtualisation
> > Jun  6 12:09:27.097051 (XEN)  - APIC TPR shadow
> > Jun  6 12:09:27.097088 (XEN)  - Extended Page Tables (EPT)
> > Jun  6 12:09:27.097118 (XEN)  - Virtual-Processor Identifiers (VPID)
> > Jun  6 12:09:27.105066 (XEN)  - Virtual NMI
> > Jun  6 12:09:27.105100 (XEN)  - MSR direct-access bitmap
> > Jun  6 12:09:27.105130 (XEN)  - Unrestricted Guest
> > Jun  6 12:09:27.113269 (XEN)  - APIC Register Virtualization
> > Jun  6 12:09:27.113290 (XEN)  - Virtual Interrupt Delivery
> > Jun  6 12:09:27.113328 (XEN)  - Posted Interrupt Processing
> > Jun  6 12:09:27.121180 (XEN) HVM: ASIDs enabled.
> > Jun  6 12:09:27.121235 (XEN) HVM: VMX enabled
> > Jun  6 12:09:27.121267 (XEN) HVM: Hardware Assisted Paging (HAP) detected
> > Jun  6 12:09:27.129069 (XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
> > 
> > I guess they are pretty new?
> 
> Could it be a missing microcode update? I don't know if the OSSTest does
> the ucode=scan or updates the microcode later?

I rather suspect it doesn't do microcode updates at all. (It probably
should.)

Is there some reason to expect APICV (or something else) would cause
these failures if microcode wasn't up to date?

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 10:17                   ` Jan Beulich
@ 2015-06-08 14:43                     ` Ian Jackson
  0 siblings, 0 replies; 40+ messages in thread
From: Ian Jackson @ 2015-06-08 14:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Ian Campbell, xen-devel

Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 57852: regressions - FAIL"):
> On 08.06.15 at 11:27, <ian.campbell@citrix.com> wrote:
> > I don't know much about the hardware in the pool other than what can be
> > gathered from the serial and dmesg logs.
> 
> Right - this is useful for learning details of an individual system, but
> isn't really helpful when wanting to compare all system kinds that are
> in the pool.

The other information we have is from the procurement exercise.

Summary spreadsheet:

http://xenbits.xen.org/gitweb/?p=people/iwj/colo-for-testing.git;a=blob;f=selections.ods;h=82134d5bc2c441a0b23006edc33a8ad80aae71e3;hb=master

Contract:

http://xenbits.xen.org/gitweb/?p=people/iwj/colo-for-testing.git;a=blob;f=PURCHASE+AND+SALE+AGREEMENT.doc;h=a87814184c1a8fe45ec1992548cfac088113f3d7;hb=master

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 13:50                   ` Konrad Rzeszutek Wilk
  2015-06-08 14:02                     ` Ian Campbell
@ 2015-06-08 14:47                     ` Ian Jackson
  2015-06-08 15:21                       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 40+ messages in thread
From: Ian Jackson @ 2015-06-08 14:47 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Andrew Cooper, Ian Campbell, Jan Beulich, xen-devel

Konrad Rzeszutek Wilk writes ("Re: [Xen-devel] [xen-unstable test] 57852: regressions - FAIL"):
> > Could it be a missing microcode update? I don't know if the OSSTest does
> the ucode=scan or updates the microcode later?

I think osstest's machines don't get microcode updates.  I'm no expert
on x86 microcode, but my understanding is:

Microcode updates (i) have to be loaded dynamically at boot time and
(ii) are regarded as a non-free package by Debian.

We could arrange to install the non-free microcode package.  I haven't
looked into it but I would expect that to automatically arrange to
load the microcode for native boots.  It IMO ought to do the same for
non-native boots but I wouldn't rely on that being the case.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 14:47                     ` Ian Jackson
@ 2015-06-08 15:21                       ` Konrad Rzeszutek Wilk
  2015-06-08 15:29                         ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-08 15:21 UTC (permalink / raw)
  To: Ian Jackson; +Cc: Andrew Cooper, Ian Campbell, Jan Beulich, xen-devel

On Mon, Jun 08, 2015 at 03:47:22PM +0100, Ian Jackson wrote:
> Konrad Rzeszutek Wilk writes ("Re: [Xen-devel] [xen-unstable test] 57852: regressions - FAIL"):
> > > Could it be a missing microcode update? I don't know if the OSSTest does
> > the ucode=scan or updates the microcode later?
> 
> I think osstest's machines don't get microcode updates.  I'm no expert
> on x86 microcode, but my understanding is:
> 
> Microcode updates (i) have to be loaded dynamically at boot time and
> (ii) are regarded as a non-free package by Debian.
> 
> We could arrange to install the non-free microcode package.  I haven't
> looked into it but I would expect that to automatically arrange to
> load the microcode for native boots.  It IMO ought to do the same for
> non-native boots but I wouldn't rely on that being the case.

If Debian is using dracut it just requires adding in /etc/dracut.conf
early_microcode=yes

and from there on any regenerated initramfs will have the required
microcode. Then 'ucode=scan' needs to be added on the Xen command line.

But this is a shot in the dark - this microcode update might have nothing
to do with these failures.
> 
> Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 15:21                       ` Konrad Rzeszutek Wilk
@ 2015-06-08 15:29                         ` Ian Campbell
  0 siblings, 0 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-08 15:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Andrew Cooper, Ian Jackson, Jan Beulich, xen-devel

On Mon, 2015-06-08 at 11:21 -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 08, 2015 at 03:47:22PM +0100, Ian Jackson wrote:
> > Konrad Rzeszutek Wilk writes ("Re: [Xen-devel] [xen-unstable test] 57852: regressions - FAIL"):
> > > Could it be a missing microcode update? I don't know if the OSSTest does
> > > the ucode=scan or updates the microcode later?
> > 
> > I think osstest's machines don't get microcode updates.  I'm no expert
> > on x86 microcode, but my understanding is:
> > 
> > Microcode updates (i) have to be loaded dynamically at boot time and
> > (ii) are regarded as a non-free package by Debian.
> > 
> > We could arrange to install the non-free microcode package.  I haven't
> > looked into it but I would expect that to automatically arrange to
> > load the microcode for native boots.  It IMO ought to do the same for
> > non-native boots but I wouldn't rely on that being the case.
> 
> If Debian is using dracut

It doesn't, it uses initramfs-tools.

I'm not sure about Wheezy, but from Jessie onwards installing the
microcode packages adds hooks which make initramfs-tools do the right
thing.

>  it just requires adding in /etc/dracut.conf
> early_microcode=yes
> 
> and from there on any regenerated initramfs will have the required
> microcode. Then 'ucode=scan' needs to be added on the Xen command line.

This is still needed with initramfs-tools.

There's also Debian bug #785187 (discussed on xen-devel) which stops
ucode=scan from working, for a reason I've not had a chance to dig into
yet.
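[For reference, on a Debian dom0 using initramfs-tools the early-microcode setup might look roughly like the following. This is an editorial sketch under assumptions - intel-microcode is the usual Debian package name and GRUB_CMDLINE_XEN_DEFAULT the usual grub hook variable; it is not something osstest does today:]

```shell
# Assumed setup sketch -- not what osstest currently does.
# intel-microcode (non-free) ships an initramfs-tools hook that
# prepends an early-microcode cpio archive to the generated initrd.
apt-get install intel-microcode
update-initramfs -u
# Then ask Xen to scan the initrd for the microcode blob, e.g. in
# /etc/default/grub:
#   GRUB_CMDLINE_XEN_DEFAULT="ucode=scan"
update-grub
```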


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-08 12:16                   ` Ian Campbell
  2015-06-08 12:19                     ` Andrew Cooper
  2015-06-08 12:24                     ` Jan Beulich
@ 2015-06-09  8:26                     ` Ian Campbell
  2015-06-09  9:29                       ` Jan Beulich
  2 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-09  8:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Mon, 2015-06-08 at 13:16 +0100, Ian Campbell wrote:

> The adhoc run passed, but that's not statistically significant.

I ran a bunch more in this no-apicv configuration, the logs are at
http://logs.test-lab.xenproject.org/osstest/logs/<NNNN>:

Flight	Host	Failed at
58190	fiano0	ts-guest-stop
58198	fiano0	ts-guest-stop
58203	fiano0	ts-windows-install
58208	fiano0	ts-guest-stop
58210	fiano1	ts-guest-stop
58214	fiano1	ts-guest-stop
58217	fiano1	ts-guest-stop

I think that's not sufficient data to draw a conclusion, since there was
always a small background failure rate. I'm going to run another half
dozen (3 on each).
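[Editorial aside: the earlier 6/14 (fiano0) vs 2/10 (fiano1) failure counts can be sanity-checked with a one-sided Fisher exact test. This is a sketch using only the Python standard library, not an osstest step; it supports the view that the two hosts' rates don't differ significantly:]

```python
# One-sided Fisher exact test on the 2x2 table of runs:
#             failed  passed
#   fiano0       6       8      (14 runs)
#   fiano1       2       8      (10 runs)
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(first host shows >= a failures, with all margins fixed)."""
    n = a + b + c + d            # 24 runs in total
    r1 = a + b                   # 14 runs on the first host
    k = a + c                    # 8 failures in total
    return sum(comb(r1, x) * comb(n - r1, k - x)
               for x in range(a, min(r1, k) + 1)) / comb(n, k)

p = fisher_one_sided(6, 8, 2, 8)
print(round(p, 3))  # ~0.234, well above 0.05: no significant difference
```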

NB the build artefacts from 57852 got garbage collected, so 58190
rebuilt them all (from the same versions) and those were used for the
other flights. I think this is unlikely to have made any difference.

I'm also going to run some on another pair of hosts without no-apicv. I
chose elbling[01] since according to
http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.html it has been used a few times on various branches and hasn't so far failed a ts-windows-install. It looks to have the same set of advanced features as fiano*:

Jun  8 05:24:51.033042 (XEN) VMX: Supported advanced features:
Jun  8 05:24:51.041023 (XEN)  - APIC MMIO access virtualisation
Jun  8 05:24:51.049018 (XEN)  - APIC TPR shadow
Jun  8 05:24:51.049050 (XEN)  - Extended Page Tables (EPT)
Jun  8 05:24:51.049079 (XEN)  - Virtual-Processor Identifiers (VPID)
Jun  8 05:24:51.057032 (XEN)  - Virtual NMI
Jun  8 05:24:51.057063 (XEN)  - MSR direct-access bitmap
Jun  8 05:24:51.065034 (XEN)  - Unrestricted Guest
Jun  8 05:24:51.065067 (XEN)  - APIC Register Virtualization
Jun  8 05:24:51.073034 (XEN)  - Virtual Interrupt Delivery
Jun  8 05:24:51.073073 (XEN)  - Posted Interrupt Processing
Jun  8 05:24:51.073101 (XEN) HVM: ASIDs enabled.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-09  8:26                     ` Ian Campbell
@ 2015-06-09  9:29                       ` Jan Beulich
  2015-06-10  8:50                         ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-09  9:29 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 09.06.15 at 10:26, <ian.campbell@citrix.com> wrote:
> On Mon, 2015-06-08 at 13:16 +0100, Ian Campbell wrote:
> 
>> The adhoc run passed, but that's not statistically significant.
> 
> I ran a bunch more in this no-apicv configuration, the logs are at
> http://logs.test-lab.xenproject.org/osstest/logs/<NNNN>:
> 
> Flight	Host	Failed at
> 58190	fiano0	ts-guest-stop
> 58198	fiano0	ts-guest-stop
> 58203	fiano0	ts-windows-install
> 58208	fiano0	ts-guest-stop
> 58210	fiano1	ts-guest-stop
> 58214	fiano1	ts-guest-stop
> 58217	fiano1	ts-guest-stop
> 
> I think that's not sufficient data to draw a conclusion, since there was
> always a small background failure rate. I'm going to run another half
> dozen (3 on each).

At least the one failure is following the patterns of previous ones
(ping timing out and guest sitting with both of its vCPU-s on HLT,
and the last VM entry having delivered a timer interrupt). Without
knowing _when_ that last timer interrupt got injected and whether
other interrupts are occurring for the guest as necessary, that
again doesn't mean much.

> I'm also going to run some on another pair of hosts without no-apicv. I
> chose elbling[01] since according to
> http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.html it has been used a few times on various branches
> and hasn't so far failed a ts-windows-install. It looks to have the same set 
> of advanced features as fiano*:

That's a good idea, thanks.

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-09  9:29                       ` Jan Beulich
@ 2015-06-10  8:50                         ` Ian Campbell
  2015-06-10  9:36                           ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-10  8:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Tue, 2015-06-09 at 10:29 +0100, Jan Beulich wrote:
> >>> On 09.06.15 at 10:26, <ian.campbell@citrix.com> wrote:
> > On Mon, 2015-06-08 at 13:16 +0100, Ian Campbell wrote:
> > 
> >> The adhoc run passed, but that's not statistically significant.
> > 
> > I ran a bunch more in this no-apicv configuration, the logs are at
> > http://logs.test-lab.xenproject.org/osstest/logs/<NNNN>:
> > 
> > Flight	Host	Failed at
> > 58190	fiano0	ts-guest-stop
> > 58198	fiano0	ts-guest-stop
> > 58203	fiano0	ts-windows-install
> > 58208	fiano0	ts-guest-stop
> > 58210	fiano1	ts-guest-stop
> > 58214	fiano1	ts-guest-stop
> > 58217	fiano1	ts-guest-stop
> > 
> > I think that's not sufficient data to draw a conclusion, since there was
> > always a small background failure rate. I'm going to run another half
> > dozen (3 on each).
> 
> At least the one failure is following the patterns of previous ones
> (ping timing out and guest sitting with both of its vCPU-s on HLT,
> and the last VM entry having delivered a timer interrupt). Without
> knowing _when_ that last timer interrupt got injected and whether
> other interrupts are occurring for the guest as necessary, that
> again doesn't mean much.

58243	fiano0	ts-guest-stop
58251	fiano0	ts-guest-stop
58258	fiano0	ts-guest-stop
58266	fiano1	ts-guest-stop
58279	fiano1	ts-guest-stop
58282	fiano1	ts-guest-stop

> > I'm also going to run some on another pair of hosts without no-apicv. I
> > chose elbling[01] since according to
> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.html it has been used a few times on various branches
> > and hasn't so far failed a ts-windows-install. It looks to have the same set 
> > of advanced features as fiano*:
> 
> That's a good idea, thanks.

58244	elbling0	ts-guest-stop
58250	elbling0	ts-guest-stop
58256	elbling0	ts-guest-stop
58261	elbling1	ts-guest-stop
58269	elbling1	ts-guest-stop
58274	elbling1	ts-guest-stop

So it is looking awfully like a host specific issue with apicv on
fiano*.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10  8:50                         ` Ian Campbell
@ 2015-06-10  9:36                           ` Jan Beulich
  2015-06-10 11:01                             ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-10  9:36 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 10.06.15 at 10:50, <ian.campbell@citrix.com> wrote:
> On Tue, 2015-06-09 at 10:29 +0100, Jan Beulich wrote:
>> >>> On 09.06.15 at 10:26, <ian.campbell@citrix.com> wrote:
>> > On Mon, 2015-06-08 at 13:16 +0100, Ian Campbell wrote:
>> > 
>> >> The adhoc run passed, but that's not statistically significant.
>> > 
>> > I ran a bunch more in this no-apicv configuration, the logs are at
>> > http://logs.test-lab.xenproject.org/osstest/logs/<NNNN>:
>> > 
>> > Flight	Host	Failed at
>> > 58190	fiano0	ts-guest-stop
>> > 58198	fiano0	ts-guest-stop
>> > 58203	fiano0	ts-windows-install
>> > 58208	fiano0	ts-guest-stop
>> > 58210	fiano1	ts-guest-stop
>> > 58214	fiano1	ts-guest-stop
>> > 58217	fiano1	ts-guest-stop
>> > 
>> > I think that's not sufficient data to draw a conclusion, since there was
>> > always a small background failure rate. I'm going to run another half
>> > dozen (3 on each).
>> 
>> At least the one failure is following the patterns of previous ones
>> (ping timing out and guest sitting with both of its vCPU-s on HLT,
>> and the last VM entry having delivered a timer interrupt). Without
>> knowing _when_ that last timer interrupt got injected and whether
>> other interrupts are occurring for the guest as necessary, that
>> again doesn't mean much.
> 
> 58243	fiano0	ts-guest-stop
> 58251	fiano0	ts-guest-stop
> 58258	fiano0	ts-guest-stop
> 58266	fiano1	ts-guest-stop
> 58279	fiano1	ts-guest-stop
> 58282	fiano1	ts-guest-stop
> 
>> > I'm also going to run some on another pair of hosts without no-apicv. I
>> > chose elbling[01] since according to
>> > 
>> > http://logs.test-lab.xenproject.org/osstest/results/history.test-amd64-amd64-xl-qemuu-win7-amd64.html it has been used a few times on various branches
>> > and hasn't so far failed a ts-windows-install. It looks to have the same set
>> > of advanced features as fiano*:
>> 
>> That's a good idea, thanks.
> 
> 58244	elbling0	ts-guest-stop
> 58250	elbling0	ts-guest-stop
> 58256	elbling0	ts-guest-stop
> 58261	elbling1	ts-guest-stop
> 58269	elbling1	ts-guest-stop
> 58274	elbling1	ts-guest-stop
> 
> So it is looking awfully like a host-specific issue with apicv on
> fiano*.

Indeed. Leaving us with the slight hope that there is a microcode
update available that's newer than what the BIOS of those boxes
loads. Could we perhaps afford un-blessing the two systems for
the time being? And maybe get Intel involved if there's no ucode
update available that helps?

Jan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10  9:36                           ` Jan Beulich
@ 2015-06-10 11:01                             ` Ian Campbell
  2015-06-10 11:48                               ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-10 11:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Wed, 2015-06-10 at 10:36 +0100, Jan Beulich wrote:
> Indeed. Leaving us with the slight hope that there is a microcode
> update available that's newer than what the BIOS of those boxes
> loads. Could we perhaps afford un-blessing the two systems for
> the time being? And maybe get Intel involved if there's no ucode
> update available that helps?

Arranging to do microcode updates looks like it is going to be a bit
non-trivial from the osstest side. Is there any reason to think it would
help other than just hoping it will?

Can't we get Intel involved right away?

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 11:01                             ` Ian Campbell
@ 2015-06-10 11:48                               ` Jan Beulich
  2015-06-10 12:56                                 ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-10 11:48 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 10.06.15 at 13:01, <ian.campbell@citrix.com> wrote:
> On Wed, 2015-06-10 at 10:36 +0100, Jan Beulich wrote:
>> Indeed. Leaving us with the slight hope that there is a microcode
>> update available that's newer than what the BIOS of those boxes
>> loads. Could we perhaps afford un-blessing the two systems for
>> the time being? And maybe get Intel involved if there's no ucode
>> update available that helps?
> 
> Arranging to do microcode updates looks like it is going to be a bit
> non-trivial from the osstest side. Is there any reason to think it would
> help other than just hoping it will?

It's really hope, not much more. But I guess you could at least check
what microcode the box has in use - if there's nothing newer available,
then trying to get microcode updating working isn't of immediate
importance anymore (but of course it would still be nice to have in
place).

> Can't we get Intel involved right away?

Sure we can; I just generally prefer not to bother people with
problems they already solved, but maybe that's the wrong approach
in a case like this.

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 11:48                               ` Jan Beulich
@ 2015-06-10 12:56                                 ` Ian Campbell
  2015-06-10 13:23                                   ` Jan Beulich
                                                     ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-10 12:56 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Wed, 2015-06-10 at 12:48 +0100, Jan Beulich wrote:
> >>> On 10.06.15 at 13:01, <ian.campbell@citrix.com> wrote:
> > On Wed, 2015-06-10 at 10:36 +0100, Jan Beulich wrote:
> >> Indeed. Leaving us with the slight hope that there is a microcode
> >> update available that's newer than what the BIOS of those boxes
> >> loads. Could we perhaps afford un-blessing the two systems for
> >> the time being? And maybe get Intel involved if there's no ucode
> >> update available that helps?
> > 
> > Arranging to do microcode updates looks like it is going to be a bit
> > non-trivial from the osstest side. Is there any reason to think it would
> > help other than just hoping it will?
> 
> It's really hope, not much more.

OK. I think this is something which is worth doing but I'm going to
treat it more like a feature request than a bug fix in terms of
prioritising it.

>  But I guess you could at least check
> what microcode the box has in use - if there's nothing newer available,
> then trying to get microcode updating working isn't of immediate
> importance anymore (but of course it would still be nice to have in
> place).

I logged into fiano1 while it was running under Xen:

cpuinfo contains (just the first processor for brevity):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2403 v2 @ 1.80GHz
stepping	: 4
microcode	: 0x416
cpu MHz		: 1800.041
cache size	: 10240 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc nonstop_tsc eagerfpu pni pclmulqdq monitor est ssse3 sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor arat epb xsaveopt pln pts dtherm fsgsbase erms
bogomips	: 3600.08
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

I'll hold onto the machine for another hour (until 1500 BST) if you want
to know anything else (otherwise I'll have to relock it, which would mean
waiting for a test to finish).

> > Can't we get Intel involved right away?
> 
> Sure we can; I just generally prefer not to bother people with
> problems they already solved, but maybe that's the wrong approach
> in a case like this.

Is the list of errata fixed by a given ucode update public? If not then
I think we've done sufficient due diligence that we should feel ok to
ask, even if the answer turns out to be fixed in microcode.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 12:56                                 ` Ian Campbell
@ 2015-06-10 13:23                                   ` Jan Beulich
  2015-06-10 13:45                                   ` Jan Beulich
  2015-06-10 14:34                                   ` Ian Campbell
  2 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-10 13:23 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 10.06.15 at 14:56, <ian.campbell@citrix.com> wrote:
> On Wed, 2015-06-10 at 12:48 +0100, Jan Beulich wrote:
>>  But I guess you could at least check
>> what microcode the box has in use - if there's nothing newer available,
>> then trying to get microcode updating working isn't of immediate
>> importance anymore (but of course it would still be nice to have in
>> place).
> 
> I logged into fiano1 while it was running under Xen:
> 
> cpuinfo contains (just the first processor for brevity):
> 
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 62
> model name	: Intel(R) Xeon(R) CPU E5-2403 v2 @ 1.80GHz
> stepping	: 4
> microcode	: 0x416

Peeking into the microcode files I have lying around, 0x428 ought to
be available for that family+model+stepping.
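
For completeness, the CPUID signature those three cpuinfo fields map to
(and the name the split per-CPU microcode files use) can be computed like
this; the simplified formula below only handles family 6 parts, where the
extended-family field is zero:

```shell
# Derive the CPUID signature and the split-file name from the cpuinfo
# quoted above.  Simplification: the extended-family field is ignored
# (it is 0 for family 6 CPUs like these).
family=6; model=62; stepping=4
sig=$(( ((model >> 4) << 16) | (family << 8) | ((model & 0xf) << 4) | stepping ))
sig_hex=$(printf '0x%x' "$sig")                                  # 0x306e4
fname=$(printf '%02x-%02x-%02x' "$family" "$model" "$stepping")  # 06-3e-04
echo "$sig_hex $fname"
```

So a 0x428 revision for these boxes would be the blob carrying signature
0x306e4 (file 06-3e-04 in split form).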

>> > Can't we get Intel involved right away?
>> 
>> Sure we can; I just generally prefer not to bother people with
>> problems they already solved, but maybe that's the wrong approach
>> in a case like this.
> 
> Is the list of errata fixed by a given ucode update public? If not then
> I think we've done sufficient due diligence that we should feel ok to
> ask, even if the answer turns out to be fixed in microcode.

No, Intel isn't doing as good a job as AMD in that regard.

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 12:56                                 ` Ian Campbell
  2015-06-10 13:23                                   ` Jan Beulich
@ 2015-06-10 13:45                                   ` Jan Beulich
  2015-06-10 14:08                                     ` Ian Campbell
  2015-06-10 14:34                                   ` Ian Campbell
  2 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-10 13:45 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 10.06.15 at 14:56, <ian.campbell@citrix.com> wrote:
> On Wed, 2015-06-10 at 12:48 +0100, Jan Beulich wrote:
>> Sure we can; I just generally prefer not to bother people with
>> problems they already solved, but maybe that's the wrong approach
>> in a case like this.
> 
> Is the list of errata fixed by a given ucode update public? If not then
> I think we've done sufficient due diligence that we should feel ok to
> ask, even if the answer turns out to be fixed in microcode.

So I went through the errata list for that specific model; the only
one really concerning seems to be CA135 ("A MOV to CR3 When
EPT is Enabled May Lead to an Unexpected Page Fault or an
Incorrect Page Translation"). But while it would affect us, it would
quite likely make the guest crash instead of idling or being hung.

So if we're going to approach Intel with this - will you or should I?

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 13:45                                   ` Jan Beulich
@ 2015-06-10 14:08                                     ` Ian Campbell
  2015-06-11  7:02                                       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-10 14:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Wed, 2015-06-10 at 14:45 +0100, Jan Beulich wrote:
> >>> On 10.06.15 at 14:56, <ian.campbell@citrix.com> wrote:
> > On Wed, 2015-06-10 at 12:48 +0100, Jan Beulich wrote:
> >> Sure we can; I just generally prefer not to bother people with
> >> problems they already solved, but maybe that's the wrong approach
> >> in a case like this.
> > 
> > Is the list of errata fixed by a given ucode update public? If not then
> > I think we've done sufficient due diligence that we should feel ok to
> > ask, even if the answer turns out to be fixed in microcode.
> 
> So I went through the errata list for that specific model; the only
> one really concerning seems to be CA135 ("A MOV to CR3 When
> EPT is Enabled May Lead to an Unexpected Page Fault or an
> Incorrect Page Translation"). But while it would affect us, it would
> quite likely make the guest crash instead of idling or being hung.

Yes, sounds like it.

> So if we're going to approach Intel with this - will you or should I?

I think it'd be best coming from you.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 12:56                                 ` Ian Campbell
  2015-06-10 13:23                                   ` Jan Beulich
  2015-06-10 13:45                                   ` Jan Beulich
@ 2015-06-10 14:34                                   ` Ian Campbell
  2015-06-10 15:59                                     ` Jan Beulich
  2 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-10 14:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Wed, 2015-06-10 at 13:56 +0100, Ian Campbell wrote:

> > > Arranging to do microcode updates looks like it is going to be a bit
> > > non-trivial from the osstest side.
> 
> OK. I think this is something which is worth doing

So for AMD I think things are pretty clear: cat
linux-firmware.git/amd-ucode/*.bin into
kernel/x86/microcode/AuthenticAMD.bin inside microcode.cpio.

For Intel I'm less sure, I've got microcode-20150121.tgz containing
microcode.dat. Is that just to be placed at
kernel/x86/microcode/GenuineIntel.bin and done, or is there some
processing needed?

I've got a thing called iucode-tool in my hand from a Debian package if
I need it.

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 14:34                                   ` Ian Campbell
@ 2015-06-10 15:59                                     ` Jan Beulich
  2015-06-10 16:18                                       ` Don Slutz
  2015-06-10 18:00                                       ` Ian Campbell
  0 siblings, 2 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-10 15:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 10.06.15 at 16:34, <ian.campbell@citrix.com> wrote:
> For Intel I'm less sure, I've got microcode-20150121.tgz containing
> microcode.dat. Is that just to be placed at
> kernel/x86/microcode/GenuineIntel.bin and done, or is there some
> processing needed?

The full blob (albeit usually named microcode.bin; microcode.dat
ordinarily is a text file) can be used if so desired, but there's also a
tool to split it into more fine grained chunks.

> I've got a thing called iucode-tool in my hand from a Debian package if
> I need it.

Or maybe there are multiple different tools - the one I know about
is commonly named intel-microcode2ucode taking microcode.dat as
input and producing microcode.bin as well as many individual
<family>-<model>-<stepping> blobs.

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 15:59                                     ` Jan Beulich
@ 2015-06-10 16:18                                       ` Don Slutz
  2015-06-10 18:00                                       ` Ian Campbell
  1 sibling, 0 replies; 40+ messages in thread
From: Don Slutz @ 2015-06-10 16:18 UTC (permalink / raw)
  To: Jan Beulich, Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel



On 06/10/15 11:59, Jan Beulich wrote:
>>>> On 10.06.15 at 16:34, <ian.campbell@citrix.com> wrote:
>> For Intel I'm less sure, I've got microcode-20150121.tgz containing
>> microcode.dat. Is that just to be placed at
>> kernel/x86/microcode/GenuineIntel.bin and done, or is there some
>> processing needed?
> 
> The full blob (albeit usually named microcode.bin; microcode.dat
> ordinarily is a text file) can be used if so desired, but there's also a
> tool to split it into more fine grained chunks.
> 
>> I've got a thing called iucode-tool in my hand from a Debian package if
>> I need it.
> 
> Or maybe there are multiple different tools - the one I know about
> is commonly named intel-microcode2ucode taking microcode.dat as
> input and producing microcode.bin as well as many individual
> <family>-<model>-<stepping> blobs.
> 

Well, my version did not produce microcode.bin.  Based on:

...
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Xeon(R) CPU E31265L @ 2.40GHz
stepping        : 7
...

and grub2 (not sure about ucode=scan):

1) intel-microcode2ucode microcode.dat
2a) cp intel-ucode/06-2a-07 /boot/microcode.bin
or
2b) cat intel-ucode/* >/boot/microcode.bin
3) Make sure "ucode=-1" is in GRUB_CMDLINE_XEN
4) /sbin/grub2-mkconfig -o /boot/grub2/grub.cfg

And you see microcode loaded on the serial console.

   -Don Slutz

> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 15:59                                     ` Jan Beulich
  2015-06-10 16:18                                       ` Don Slutz
@ 2015-06-10 18:00                                       ` Ian Campbell
  1 sibling, 0 replies; 40+ messages in thread
From: Ian Campbell @ 2015-06-10 18:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Wed, 2015-06-10 at 16:59 +0100, Jan Beulich wrote:
> >>> On 10.06.15 at 16:34, <ian.campbell@citrix.com> wrote:
> > For Intel I'm less sure, I've got microcode-20150121.tgz containing
> > microcode.dat. Is that just to be placed at
> > kernel/x86/microcode/GenuineIntel.bin and done, or is there some
> > processing needed?
> 
> The full blob (albeit usually named microcode.bin; microcode.dat
> ordinarily is a text file) can be used if so desired, but there's also a
> tool to split it into more fine grained chunks.
> 
> > I've got a thing called iucode-tool in my hand from a Debian package if
> > I need it.
> 
> Or maybe there are multiple different tools - the one I know about
> is commonly named intel-microcode2ucode taking microcode.dat as
> input and producing microcode.bin as well as many individual
> <family>-<model>-<stepping> blobs.

Not sure if they are the same tool, but I seem to have managed to get
iucode-tool to take my microcode.dat and produce a suitable binary file.
I'm testing the integration with osstest now.
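
Going purely from the iucode_tool(8) manpage (so untested here; Debian
installs the binary as iucode_tool even though the package is named
iucode-tool), the conversion looks roughly like:

```shell
# Untested sketch per the iucode_tool(8) manpage.  microcode.dat is
# Intel's text-format release file; 0x306e4 is the signature matching
# the fiano* CPUs discussed earlier in the thread.
iucode_tool -t d microcode.dat -s 0x306e4 --write-to=GenuineIntel.bin
# or split every update out, one file per signature:
#   iucode_tool -t d microcode.dat --write-firmware=intel-ucode/
```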

Thanks,

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-10 14:08                                     ` Ian Campbell
@ 2015-06-11  7:02                                       ` Jan Beulich
  2015-06-11  8:45                                         ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2015-06-11  7:02 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 10.06.15 at 16:08, <ian.campbell@citrix.com> wrote:
> On Wed, 2015-06-10 at 14:45 +0100, Jan Beulich wrote:
>> So if we're going to approach Intel with this - will you or should I?
> 
> I think it'd be best coming from you.

I have just sent it off; in putting together the technical details it
became clear that elbling* indeed are at a newer microcode level,
so I think this at least slightly raises the chances of an update to
help fiano* (if so I of course wonder why the vendor hasn't made
a suitable BIOS update available yet).

Jan


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-11  7:02                                       ` Jan Beulich
@ 2015-06-11  8:45                                         ` Ian Campbell
  2015-06-15  8:57                                           ` Ian Campbell
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-11  8:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Thu, 2015-06-11 at 08:02 +0100, Jan Beulich wrote:
> >>> On 10.06.15 at 16:08, <ian.campbell@citrix.com> wrote:
> > On Wed, 2015-06-10 at 14:45 +0100, Jan Beulich wrote:
> >> So if we're going to approach Intel with this - will you or should I?
> > 
> > I think it'd be best coming from you.
> 
> I have just sent it off; in putting together the technical details it
> became clear that elbling* indeed are at a newer microcode level,
> so I think this at least slightly raises the chances of an update to
> help fiano* (if so I of course wonder why the vendor hasn't made
> a suitable BIOS update available yet).

It's possible that there is one which we've not applied...


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-11  8:45                                         ` Ian Campbell
@ 2015-06-15  8:57                                           ` Ian Campbell
  2015-06-15  9:03                                             ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Ian Campbell @ 2015-06-15  8:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, ian.jackson, xen-devel

On Thu, 2015-06-11 at 09:45 +0100, Ian Campbell wrote:
> On Thu, 2015-06-11 at 08:02 +0100, Jan Beulich wrote:
> > >>> On 10.06.15 at 16:08, <ian.campbell@citrix.com> wrote:
> > > On Wed, 2015-06-10 at 14:45 +0100, Jan Beulich wrote:
> > >> So if we're going to approach Intel with this - will you or should I?
> > > 
> > > I think it'd be best coming from you.
> > 
> > I have just sent it off; in putting together the technical details it
> > became clear that elbling* indeed are at a newer microcode level,
> > so I think this at least slightly raises the chances of an update to
> > help fiano* (if so I of course wonder why the vendor hasn't made
> > a suitable BIOS update available yet).
> 
> It's possible that there is one which we've not applied...

I've now run a bunch of adhoc runs with the microcode update in place
(from 0x416 to 0x428 on these particular machines):

58468	fiano0	guest-stop
58479	fiano0	guest-stop
58485	fiano0	windows-install
58494	fiano0	guest-stop
58499	fiano1	guest-stop
58509	fiano1	windows-install
58516	fiano1	guest-stop
58527*	fiano0	guest-stop
58531	fiano0	guest-stop
58534	fiano0	guest-stop
58537	fiano0	guest-stop
58538	fiano1	guest-stop
58544	fiano1	guest-stop
58547	fiano1	guest-stop
58550	fiano0	guest-stop
58555	fiano0	guest-stop
58557	fiano0	guest-stop
58560	fiano1	guest-stop
58563	fiano1	guest-stop
58565	fiano1	windows-install

(*) rebuilt binaries because previous build was gc'd, same versions as
before.

So that's a 3/20 = 15% failure rate (fiano0: 1/11 = 9%; fiano1: 2/9 = 22%),
which is better than the ~50% seen at the start of this thread, so I think
the ucode update is worth applying (and it would have been the right thing
to do regardless).

I do think a 15-20% failure rate might be worthy of further
investigation by Intel too, since the failure rate with no-apicv was
1/13 = 7% (fiano0: 1/7=14%, fiano1: 0/6=0%), although those numbers are
less significant due to fewer runs.
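
For reference, the percentages above are straight proportions, truncated;
a throwaway helper reproduces them:

```shell
# Throwaway helper reproducing the failure-rate figures quoted above
# (percentages truncated, matching the numbers in the text).
rate() { awk -v f="$1" -v n="$2" 'BEGIN { printf "%d/%d = %d%%\n", f, n, int(100 * f / n) }'; }
rate 3 20    # both fiano hosts with the ucode update
rate 1 11    # fiano0
rate 2 9     # fiano1
rate 1 13    # earlier no-apicv runs
```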

Ian.


* Re: [xen-unstable test] 57852: regressions - FAIL
  2015-06-15  8:57                                           ` Ian Campbell
@ 2015-06-15  9:03                                             ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2015-06-15  9:03 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Andrew Cooper, ian.jackson, xen-devel

>>> On 15.06.15 at 10:57, <ian.campbell@citrix.com> wrote:
> So 3/20 = 15% failure rate (fiano0: 1/11=9%; fiano1: 2/9=22%). Which is
> better than the ~50% seen at the start of this thread, so it is worth
> applying the ucode update I think (and it would have been regardless the
> right thing to do), 
> 
> I do think a 15-20% failure rate might be worthy of further
> investigation by Intel too, since the failure rate with no-apicv was
> 1/13 = 7% (fiano0: 1/7=14%, fiano1: 0/6=0%), although those numbers are
> less significant due to fewer runs.

I fully agree; even the remaining 7% should be looked into, provided
it can be reproduced by them on sufficiently similar hardware.

Jan


end of thread, other threads:[~2015-06-15  9:03 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-04 12:01 [xen-unstable test] 57852: regressions - FAIL osstest service user
2015-06-05  8:45 ` Ian Campbell
2015-06-05  9:00   ` Jan Beulich
2015-06-05  9:07     ` Ian Campbell
2015-06-05  9:18       ` Jan Beulich
2015-06-05 10:48         ` Ian Campbell
2015-06-05 16:46           ` Ian Campbell
2015-06-08  8:07           ` Jan Beulich
2015-06-08  8:53             ` Ian Campbell
2015-06-08  9:15               ` Jan Beulich
2015-06-08  9:27                 ` Ian Campbell
2015-06-08 10:17                   ` Jan Beulich
2015-06-08 14:43                     ` Ian Jackson
2015-06-08 12:16                   ` Ian Campbell
2015-06-08 12:19                     ` Andrew Cooper
2015-06-08 12:24                     ` Jan Beulich
2015-06-09  8:26                     ` Ian Campbell
2015-06-09  9:29                       ` Jan Beulich
2015-06-10  8:50                         ` Ian Campbell
2015-06-10  9:36                           ` Jan Beulich
2015-06-10 11:01                             ` Ian Campbell
2015-06-10 11:48                               ` Jan Beulich
2015-06-10 12:56                                 ` Ian Campbell
2015-06-10 13:23                                   ` Jan Beulich
2015-06-10 13:45                                   ` Jan Beulich
2015-06-10 14:08                                     ` Ian Campbell
2015-06-11  7:02                                       ` Jan Beulich
2015-06-11  8:45                                         ` Ian Campbell
2015-06-15  8:57                                           ` Ian Campbell
2015-06-15  9:03                                             ` Jan Beulich
2015-06-10 14:34                                   ` Ian Campbell
2015-06-10 15:59                                     ` Jan Beulich
2015-06-10 16:18                                       ` Don Slutz
2015-06-10 18:00                                       ` Ian Campbell
2015-06-08 13:50                   ` Konrad Rzeszutek Wilk
2015-06-08 14:02                     ` Ian Campbell
2015-06-08 14:47                     ` Ian Jackson
2015-06-08 15:21                       ` Konrad Rzeszutek Wilk
2015-06-08 15:29                         ` Ian Campbell
2015-06-08 10:10                 ` Ian Campbell
