* [xen-unstable test] 113807: regressions - FAIL
@ 2017-09-25 9:46 osstest service owner
2017-09-25 14:07 ` Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL] Dario Faggioli
0 siblings, 1 reply; 10+ messages in thread
From: osstest service owner @ 2017-09-25 9:46 UTC (permalink / raw)
To: xen-devel, osstest-admin
flight 113807 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/113807/
Regressions :-(
Tests which did not succeed and are blocking,
including tests which could not be run:
test-amd64-i386-xl-qemuu-win7-amd64 18 guest-start/win.repeat fail in 113791 REGR. vs. 113387
Tests which are failing intermittently (not blocking):
test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop fail in 113791 pass in 113807
test-armhf-armhf-xl-credit2 16 guest-start/debian.repeat fail in 113791 pass in 113807
test-amd64-amd64-rumprun-amd64 17 rumprun-demo-xenstorels/xenstorels.repeat fail in 113800 pass in 113807
test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop fail pass in 113791
test-amd64-amd64-xl-qemuu-win7-amd64 16 guest-localmigrate/x10 fail pass in 113800
test-amd64-i386-xl-qemut-win7-amd64 18 guest-start/win.repeat fail pass in 113800
Tests which did not succeed, but are not blocking:
test-amd64-amd64-xl-qemut-win7-amd64 18 guest-start/win.repeat fail in 113791 blocked in 113387
test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail in 113800 blocked in 113387
test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop fail in 113800 like 113387
test-amd64-amd64-xl-qemut-win7-amd64 16 guest-localmigrate/x10 fail like 113387
test-armhf-armhf-libvirt 14 saverestore-support-check fail like 113387
test-armhf-armhf-libvirt-xsm 14 saverestore-support-check fail like 113387
test-amd64-amd64-xl-rtds 10 debian-install fail like 113387
test-armhf-armhf-libvirt-raw 13 saverestore-support-check fail like 113387
test-amd64-amd64-xl-qemut-ws16-amd64 10 windows-install fail never pass
test-amd64-amd64-xl-qemuu-ws16-amd64 10 windows-install fail never pass
test-amd64-i386-libvirt-xsm 13 migrate-support-check fail never pass
test-amd64-amd64-libvirt 13 migrate-support-check fail never pass
test-amd64-amd64-libvirt-xsm 13 migrate-support-check fail never pass
test-amd64-i386-libvirt 13 migrate-support-check fail never pass
test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check fail never pass
test-amd64-amd64-libvirt-vhd 12 migrate-support-check fail never pass
test-amd64-i386-xl-qemuu-ws16-amd64 13 guest-saverestore fail never pass
test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2 fail never pass
test-armhf-armhf-xl 13 migrate-support-check fail never pass
test-armhf-armhf-xl 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-xsm 13 migrate-support-check fail never pass
test-armhf-armhf-xl-xsm 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-cubietruck 13 migrate-support-check fail never pass
test-armhf-armhf-xl-cubietruck 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-credit2 13 migrate-support-check fail never pass
test-armhf-armhf-xl-credit2 14 saverestore-support-check fail never pass
test-amd64-i386-xl-qemut-ws16-amd64 13 guest-saverestore fail never pass
test-armhf-armhf-xl-rtds 13 migrate-support-check fail never pass
test-armhf-armhf-xl-rtds 14 saverestore-support-check fail never pass
test-armhf-armhf-xl-multivcpu 13 migrate-support-check fail never pass
test-armhf-armhf-xl-multivcpu 14 saverestore-support-check fail never pass
test-amd64-i386-libvirt-qcow2 12 migrate-support-check fail never pass
test-armhf-armhf-xl-vhd 12 migrate-support-check fail never pass
test-armhf-armhf-xl-vhd 13 saverestore-support-check fail never pass
test-armhf-armhf-xl-arndale 13 migrate-support-check fail never pass
test-armhf-armhf-xl-arndale 14 saverestore-support-check fail never pass
test-armhf-armhf-libvirt 13 migrate-support-check fail never pass
test-armhf-armhf-libvirt-xsm 13 migrate-support-check fail never pass
test-armhf-armhf-libvirt-raw 12 migrate-support-check fail never pass
test-amd64-i386-xl-qemut-win10-i386 10 windows-install fail never pass
test-amd64-i386-xl-qemuu-win10-i386 10 windows-install fail never pass
test-amd64-amd64-xl-qemuu-win10-i386 10 windows-install fail never pass
test-amd64-amd64-xl-qemut-win10-i386 10 windows-install fail never pass
version targeted for testing:
xen 7ff9661b904a3af618dc2a2b8cdec46be6930308
baseline version:
xen 16b1414de91b5a82a0996c67f6db3af7d7e32873
Last test of basis 113387 2017-09-12 23:20:09 Z 12 days
Failing since 113430 2017-09-14 01:24:48 Z 11 days 23 attempts
Testing same since 113760 2017-09-23 03:14:13 Z 2 days 6 attempts
------------------------------------------------------------
People who touched revisions under test:
Alexandru Isaila <aisaila@bitdefender.com>
Andrew Cooper <andrew.cooper3@citrix.com>
Bhupinder Thakur <bhupinder.thakur@linaro.org>
Boris Ostrovsky <boris.ostrovsky@oracle.com>
Daniel De Graaf <dgdegra@tycho.nsa.gov>
Dario Faggioli <dario.faggioli@citrix.com>
George Dunlap <george.dunlap@citrix.com>
Haozhong Zhang <haozhong.zhang@intel.com>
Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich <jbeulich@suse.com>
Juergen Gross <jgross@suse.com>
Julien Grall <julien.grall@arm.com>
Kevin Tian <kevin.tian@intel.com>
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Meng Xu <mengxu@cis.upenn.edu>
Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Petre Pircalabu <ppircalabu@bitdefender.com>
Razvan Cojocaru <rcojocaru@bitdefender.com>
Sergey Dyasli <sergey.dyasli@citrix.com>
Stefano Stabellini <sstabellini@kernel.org>
Tamas K Lengyel <tamas@tklengyel.com>
Tim Deegan <tim@xen.org>
Wei Liu <wei.liu2@citrix.com>
Yi Sun <yi.y.sun@linux.intel.com>
jobs:
build-amd64-xsm pass
build-armhf-xsm pass
build-i386-xsm pass
build-amd64-xtf pass
build-amd64 pass
build-armhf pass
build-i386 pass
build-amd64-libvirt pass
build-armhf-libvirt pass
build-i386-libvirt pass
build-amd64-prev pass
build-i386-prev pass
build-amd64-pvops pass
build-armhf-pvops pass
build-i386-pvops pass
build-amd64-rumprun pass
build-i386-rumprun pass
test-xtf-amd64-amd64-1 pass
test-xtf-amd64-amd64-2 pass
test-xtf-amd64-amd64-3 pass
test-xtf-amd64-amd64-4 pass
test-xtf-amd64-amd64-5 pass
test-amd64-amd64-xl pass
test-armhf-armhf-xl pass
test-amd64-i386-xl pass
test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm pass
test-amd64-i386-xl-qemut-debianhvm-amd64-xsm pass
test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm pass
test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm pass
test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm pass
test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm pass
test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm pass
test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm pass
test-amd64-amd64-libvirt-xsm pass
test-armhf-armhf-libvirt-xsm pass
test-amd64-i386-libvirt-xsm pass
test-amd64-amd64-xl-xsm pass
test-armhf-armhf-xl-xsm pass
test-amd64-i386-xl-xsm pass
test-amd64-amd64-qemuu-nested-amd fail
test-amd64-amd64-xl-pvh-amd pass
test-amd64-i386-qemut-rhel6hvm-amd pass
test-amd64-i386-qemuu-rhel6hvm-amd pass
test-amd64-amd64-xl-qemut-debianhvm-amd64 pass
test-amd64-i386-xl-qemut-debianhvm-amd64 pass
test-amd64-amd64-xl-qemuu-debianhvm-amd64 pass
test-amd64-i386-xl-qemuu-debianhvm-amd64 pass
test-amd64-i386-freebsd10-amd64 pass
test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
test-amd64-i386-xl-qemuu-ovmf-amd64 pass
test-amd64-amd64-rumprun-amd64 pass
test-amd64-amd64-xl-qemut-win7-amd64 fail
test-amd64-i386-xl-qemut-win7-amd64 fail
test-amd64-amd64-xl-qemuu-win7-amd64 fail
test-amd64-i386-xl-qemuu-win7-amd64 fail
test-amd64-amd64-xl-qemut-ws16-amd64 fail
test-amd64-i386-xl-qemut-ws16-amd64 fail
test-amd64-amd64-xl-qemuu-ws16-amd64 fail
test-amd64-i386-xl-qemuu-ws16-amd64 fail
test-armhf-armhf-xl-arndale pass
test-amd64-amd64-xl-credit2 pass
test-armhf-armhf-xl-credit2 pass
test-armhf-armhf-xl-cubietruck pass
test-amd64-amd64-examine pass
test-armhf-armhf-examine pass
test-amd64-i386-examine pass
test-amd64-i386-freebsd10-i386 pass
test-amd64-i386-rumprun-i386 pass
test-amd64-amd64-xl-qemut-win10-i386 fail
test-amd64-i386-xl-qemut-win10-i386 fail
test-amd64-amd64-xl-qemuu-win10-i386 fail
test-amd64-i386-xl-qemuu-win10-i386 fail
test-amd64-amd64-qemuu-nested-intel pass
test-amd64-amd64-xl-pvh-intel pass
test-amd64-i386-qemut-rhel6hvm-intel pass
test-amd64-i386-qemuu-rhel6hvm-intel pass
test-amd64-amd64-libvirt pass
test-armhf-armhf-libvirt pass
test-amd64-i386-libvirt pass
test-amd64-amd64-livepatch pass
test-amd64-i386-livepatch pass
test-amd64-amd64-migrupgrade pass
test-amd64-i386-migrupgrade pass
test-amd64-amd64-xl-multivcpu pass
test-armhf-armhf-xl-multivcpu pass
test-amd64-amd64-pair pass
test-amd64-i386-pair pass
test-amd64-amd64-libvirt-pair pass
test-amd64-i386-libvirt-pair pass
test-amd64-amd64-amd64-pvgrub pass
test-amd64-amd64-i386-pvgrub pass
test-amd64-amd64-pygrub pass
test-amd64-i386-libvirt-qcow2 pass
test-amd64-amd64-xl-qcow2 pass
test-armhf-armhf-libvirt-raw pass
test-amd64-i386-xl-raw pass
test-amd64-amd64-xl-rtds fail
test-armhf-armhf-xl-rtds pass
test-amd64-amd64-libvirt-vhd pass
test-armhf-armhf-xl-vhd pass
------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images
Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs
Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master
Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary
Not pushing.
(No revision log; it would be 1495 lines long.)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-25 9:46 [xen-unstable test] 113807: regressions - FAIL osstest service owner
@ 2017-09-25 14:07 ` Dario Faggioli
2017-09-25 16:23 ` Julien Grall
0 siblings, 1 reply; 10+ messages in thread
From: Dario Faggioli @ 2017-09-25 14:07 UTC (permalink / raw)
To: osstest service owner, xen-devel; +Cc: Julien Grall, Stefano Stabellini
[-- Attachment #1.1: Type: text/plain, Size: 1894 bytes --]
Hey,
On Mon, 2017-09-25 at 09:46 +0000, osstest service owner wrote:
> flight 113807 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/113807/
>
So, triggered by this:
> Tests which are failing intermittently (not blocking):
> test-armhf-armhf-xl-credit2 16 guest-start/debian.repeat fail in 113791 pass in 113807
>
I went and had a look, and discovered that, from time to time, we do
indeed fail to create a guest on ARM with Credit2.
Looking here:
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-credit2/xen-unstable
It seems to be happening only on the cubietrucks, but in a non-linear
and non-deterministic fashion. E.g., 113791 failed on metzinger, which
is fine in 113800; 113611 and 113618 failed on baroque, which is fine
in 113638.
I don't see much in the logs, TBH, but both `xl vcpu-list' and the 'r'
debug key seem to suggest that vCPU 0 is running, while the other vCPUs
have never run... like it was an issue with secondary (v)CPU bringup.
It indeed shows up with Credit2, as if it were _specific_ to it, but I'm
not 100% sure. In fact, it indeed seems to never show up here:
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl/xen-unstable
but it looks like it may have shown up in 112460 (but we don't have the
logs any longer):
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-cubietruck/xen-unstable
So... ARM people? Does this ring any bell? Is this something known, or
easy to explain? What can I do to help?
Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-25 14:07 ` Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL] Dario Faggioli
@ 2017-09-25 16:23 ` Julien Grall
2017-09-25 17:29 ` Dario Faggioli
2017-09-26 7:33 ` Dario Faggioli
0 siblings, 2 replies; 10+ messages in thread
From: Julien Grall @ 2017-09-25 16:23 UTC (permalink / raw)
To: Dario Faggioli, osstest service owner, xen-devel; +Cc: Stefano Stabellini
On 09/25/2017 03:07 PM, Dario Faggioli wrote:
> Hey,
Hi Dario,
>
> On Mon, 2017-09-25 at 09:46 +0000, osstest service owner wrote:
>> flight 113807 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/113807/
>>
> So, triggered by this:
>
>> Tests which are failing intermittently (not blocking):
>> test-armhf-armhf-xl-credit2 16 guest-start/debian.repeat fail in 113791 pass in 113807
>>
> I went and had a look, and discovered that, from time to time, we do
> indeed fail to create a guest on ARM with Credit2.
>
> Looking here:
> http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-credit2/xen-unstable
>
> It seems to be happening only on the cubietrucks, but in a non-linear
> and non-deterministic fashion. E.g., 113791 failed on metzinger, which
> is fine on 113800; 113611 and 113618 failed on baroque, which is fine
> on 113638.
>
> I don't see much in the logs, TBH, but both `xl vcpu-list' and the 'r'
> debug key seem to suggest that vCPU 0 is running, while the other vCPUs
> have never run... like it was an issue with secondary (v)CPU bringup.
>
> It indeed shows up with Credit2, as if it were _specific_ to it, but I'm
> not 100% sure. In fact, it indeed seems to never show up here:
> http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl/xen-unstable
>
> but it looks like it may have shown up in 112460 (but we don't have the
> logs any longer):
> http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-cubietruck/xen-unstable
>
> So... ARM people? Does this ring any bell? Is this something known, or
> easy to explain? What can I do for help?
It definitely rings a bell: I saw a similar trace in July, and I have
been working on a potential fix since then.
Most of the time when guest-start/debian.repeat fails, vCPU 0 is in a
data/prefetch abort state. My guess is a latent cache bug that Credit2
appears to expose.
Indeed, the arm32 kernel uses set/way cache flush instructions at
boot time. They are used to clean each level of cache, one by one, on
each CPU.
At the moment, Xen does not trap those instructions. As you know,
caches may not be private to a given physical processor, so if you
happen to migrate the vCPU to another physical CPU, you may hit stale data.
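To make the hazard concrete, here is a deliberately simplified Python toy model (not Xen code; all names here are made up for illustration): a set/way operation only acts on the cache of the CPU it executes on, so dirty lines left on the previous pCPU are never written back.

```python
# Toy model of per-pCPU write-back caches; purely illustrative.
memory = {"page": "old"}       # backing RAM
caches = {0: {}, 1: {}}        # dirty lines held privately by each pCPU

def guest_write(pcpu, addr, val):
    # Write-back cache: the new value stays dirty in the local cache.
    caches[pcpu][addr] = val

def setway_flush(pcpu):
    # Set/way clean: writes back *only* the local pCPU's cache.
    memory.update(caches[pcpu])
    caches[pcpu].clear()

guest_write(0, "page", "new")  # vCPU writes while running on pCPU 0
# ... the scheduler migrates the vCPU to pCPU 1 ...
setway_flush(1)                # its set/way flush now cleans the wrong cache
assert memory["page"] == "old" # pCPU 0's dirty line never reached memory
```

In the real system the fix is for the hypervisor to trap the operation and clean by virtual address across the whole P2M, which is why it is so costly.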
This means we have to trap and emulate set/way instructions. Per the ARM
ARM, and also from experience, emulating them is non-trivial.
Thankfully, people are trying to get rid of those instructions. For
instance, arm64 Linux does not use them anymore. Sadly, the arm32 Linux
maintainers do not want to remove them... They are also used by EDK2 at
the moment.
The solution is to go through the P2M and clean & invalidate every page
one by one. This process is really, really slow, given that Xen on Arm
always populates the whole P2M at guest creation.
So I have been working for the past 2 months to add PoD support on Arm.
I have a proof of concept that boots a guest and properly handles
set/way cache instructions.
I am still cleaning up my work and hopefully can post a couple of series
soon. This is not targeting Xen 4.10, and I am not even sure it would fix
the problem here. But that's my best guess.
Cheers,
--
Julien Grall
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-25 16:23 ` Julien Grall
@ 2017-09-25 17:29 ` Dario Faggioli
2017-09-26 7:33 ` Dario Faggioli
1 sibling, 0 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-09-25 17:29 UTC (permalink / raw)
To: Julien Grall, osstest service owner, xen-devel; +Cc: Stefano Stabellini
On Mon, 2017-09-25 at 17:23 +0100, Julien Grall wrote:
> On 09/25/2017 03:07 PM, Dario Faggioli wrote:
> > Hey,
>
> Hi Dario,
>
Hi!
> > I don't see much in the logs, TBH, but both `xl vcpu-list' and the 'r'
> > debug key seem to suggest that vCPU 0 is running, while the other vCPUs
> > have never run... like it was an issue with secondary (v)CPU bringup.
> >
> It definitely rings a bell: I saw a similar trace in July, and I have
> been working on a potential fix since then.
>
> Most of the time when guest-start/debian.repeat fails, vCPU 0 is in a
> data/prefetch abort state. My guess is a latent cache bug that Credit2
> appears to expose.
>
> Indeed, the arm32 kernel is using set/way cache flush instruction at
> boot time. They are used to clean one by one each level of caches on
> each CPUs.
>
> At the moment, Xen does not trap those instructions. As you know,
> caches may not be private to a given physical processor, so if you
> happen to migrate the vCPU to another physical CPU, you may hit stale
> data.
>
Ah, yes, I remember "hearing" you talking about this. We've also talked
about it a bit together... I just wasn't recognising it as what's
biting us here.
> I am still cleaning up my work and hopefully can post a couple of
> series soon. This is not targeting Xen 4.10, and I am not even sure it
> would fix the problem here. But that's my best guess.
>
Well, yes, now that you mention it, it indeed sounds plausible.
So, I was mainly curious about whether it was something affecting, or
directly caused by, Credit2, or something that Credit2 can help
diagnose, reproduce and fix.
Since we already have a candidate, and you're already working on the
(difficult! :-( ) fix, well, let's see, once you have it, whether it
actually cures the problem.
We'll jump back on it if it does not.
Thanks and regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-25 16:23 ` Julien Grall
2017-09-25 17:29 ` Dario Faggioli
@ 2017-09-26 7:33 ` Dario Faggioli
2017-09-26 17:28 ` Julien Grall
1 sibling, 1 reply; 10+ messages in thread
From: Dario Faggioli @ 2017-09-26 7:33 UTC (permalink / raw)
To: Julien Grall, osstest service owner, xen-devel; +Cc: Stefano Stabellini
On Mon, 2017-09-25 at 17:23 +0100, Julien Grall wrote:
> On 09/25/2017 03:07 PM, Dario Faggioli wrote:
> > I don't see much in the logs, TBH, but both `xl vcpu-list' and the 'r'
> > debug key seem to suggest that vCPU 0 is running, while the other vCPUs
> > have never run... like it was an issue with secondary (v)CPU bringup.
> >
> > It indeed shows up with Credit2, as if it were _specific_ to it, but
> > I'm not 100% sure. In fact, it indeed seems to never show up here:
> > http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl/xen-unstable
> >
> Most of the time when guest-start/debian.repeat fails, vCPU 0 is in a
> data/prefetch abort state. My guess is a latent cache bug that Credit2
> appears to expose.
>
So, forgive my ARM ignorance, but how do you tell that the vCPU(s)
is(are) in that particular state?
I'm asking because I now wonder whether this same issue could also be
the cause of these other failures, which we see from time to time:
flight 113816 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/113816/
[...]
Tests which did not succeed, but are not blocking:
test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail blocked in 113387
Here's the logs:
http://logs.test-lab.xenproject.org/osstest/logs/113816/test-armhf-armhf-xl-rtds/info.html
Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-26 7:33 ` Dario Faggioli
@ 2017-09-26 17:28 ` Julien Grall
2017-09-26 20:51 ` Dario Faggioli
0 siblings, 1 reply; 10+ messages in thread
From: Julien Grall @ 2017-09-26 17:28 UTC (permalink / raw)
To: Dario Faggioli, osstest service owner, xen-devel; +Cc: Stefano Stabellini
Hi Dario,
On 09/26/2017 08:33 AM, Dario Faggioli wrote:
> On Mon, 2017-09-25 at 17:23 +0100, Julien Grall wrote:
>> On 09/25/2017 03:07 PM, Dario Faggioli wrote:
>>> I don't see much in the logs, TBH, but both `xl vcpu-list' and the 'r'
>>> debug key seem to suggest that vCPU 0 is running, while the other vCPUs
>>> have never run... like it was an issue with secondary (v)CPU bringup.
>>>
>>> It indeed shows up with Credit2, as if it were _specific_ to it, but
>>> I'm not 100% sure. In fact, it indeed seems to never show up here:
>>> http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl/xen-unstable
>>>
>> Most of the time when guest-start/debian.repeat fails, vCPU 0 is in a
>> data/prefetch abort state. My guess is a latent cache bug that Credit2
>> appears to expose.
>>
> So, forgive my ARM ignorance, but how do you tell that the vCPU(s)
> is(are) in that particular state?
I was looking at the guest state dump:
Sep 24 15:10:43.275221 (XEN) *** Dumping CPU1 guest state (d3v0): ***
Sep 24 15:10:43.279352 (XEN) ----[ Xen-4.10-unstable arm32 debug=y Not tainted ]----
Sep 24 15:10:43.285242 (XEN) CPU: 1
Sep 24 15:10:43.286597 (XEN) PC: 0000000c
Sep 24 15:10:43.288743 (XEN) CPSR: 800001d7 MODE:32-bit Guest ABT
Sep 24 15:10:43.292741 (XEN) R0: 00400000 R1: ffffffff R2: 48c24000 R3: 80000000
Sep 24 15:10:43.298241 (XEN) R4: 410aa758 R5: 410aacf8 R6: 00000080 R7: c2c2c2c2
Sep 24 15:10:43.303850 (XEN) R8: 40000000 R9: 410fc074 R10:40b7923c R11:10101105 R12:ffffffff
Sep 24 15:10:43.310457 (XEN) USR: SP: 00000000 LR: 00000000
Sep 24 15:10:43.313714 (XEN) SVC: SP: 4199fb70 LR: 40208060 SPSR:400001d3
Sep 24 15:10:43.318334 (XEN) ABT: SP: 00000000 LR: 0000000c SPSR:800001d7
Sep 24 15:10:43.322863 (XEN) UND: SP: 00000000 LR: 00000000 SPSR:00000000
Sep 24 15:10:43.327361 (XEN) IRQ: SP: 00000000 LR: 00000000 SPSR:00000000
Sep 24 15:10:43.331855 (XEN) FIQ: SP: 00000000 LR: c1318ae4 SPSR:00000000
Sep 24 15:10:43.336349 (XEN) FIQ: R8: 00000000 R9: 00000000 R10:00000000 R11:00000000 R12:00000000
"MODE:..." is the current mode of the vCPU. In that case ABT means it receive an abort (e.g data/prefetch abort).
There are other mode such as:
- USR : User mode
- SVC : Kernel mode
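For reference, the mode sits in the low five bits of the CPSR/SPSR. A quick, illustrative Python check of the values in the dump above, using the AArch32 mode encodings from the ARMv7-A architecture manual:

```python
# AArch32 mode encodings (CPSR/SPSR bits [4:0]), per the ARMv7-A ARM.
ARM32_MODES = {0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
               0x16: "MON", 0x17: "ABT", 0x1a: "HYP", 0x1b: "UND", 0x1f: "SYS"}

def cpsr_mode(cpsr):
    return ARM32_MODES[cpsr & 0x1f]

assert cpsr_mode(0x800001d7) == "ABT"  # CPSR in the d3v0 dump: Abort mode
assert cpsr_mode(0x400001d3) == "SVC"  # the banked SVC SPSR in the same dump
```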
>
> I'm asking because I now wonder whether this same issue could also be
> the cause of these other failures, which we see from time to time:
>
> flight 113816 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/113816/
>
> [...]
>
> Tests which did not succeed, but are not blocking:
> test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail blocked in 113387
>
> Here's the logs:
> http://logs.test-lab.xenproject.org/osstest/logs/113816/test-armhf-armhf-xl-rtds/info.html
It does not seem to be similar: in the Credit2 case the kernel is stuck
at very early boot. Here it seems to be running (there are grants set up).
This seems to be confirmed by the guest console log: I can see the
prompt. Interestingly, when the guest job fails, it has been waiting for
a long time on disk and hvc0, although it does not time out.
I am actually quite surprised that we start a 4-vCPU guest on a 2-pCPU
platform: the total number of vCPUs is 6 (2 dom0 + 4 domU). The
processors are not the greatest for testing, so I was wondering whether
we end up with too many vCPUs running on the platform, making the test
unreliable.
Cheers,
--
Julien Grall
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-26 17:28 ` Julien Grall
@ 2017-09-26 20:51 ` Dario Faggioli
2017-09-27 23:51 ` Julien Grall
0 siblings, 1 reply; 10+ messages in thread
From: Dario Faggioli @ 2017-09-26 20:51 UTC (permalink / raw)
To: Julien Grall, osstest service owner, xen-devel
Cc: Meng Xu, Stefano Stabellini
On Tue, 2017-09-26 at 18:28 +0100, Julien Grall wrote:
> On 09/26/2017 08:33 AM, Dario Faggioli wrote:
> > >
> > Here's the logs:
> > http://logs.test-lab.xenproject.org/osstest/logs/113816/test-armhf-armhf-xl-rtds/info.html
>
> It does not seem to be similar: in the Credit2 case the kernel is
> stuck at very early boot. Here it seems to be running (there are
> grants set up).
>
Yes, I agree, it's not totally similar.
> This seems to be confirmed by the guest console log: I can see the
> prompt. Interestingly, when the guest job fails, it has been waiting
> for a long time on disk and hvc0, although it does not time out.
>
Ah, I see what you mean, I found it in the guest console log.
> I am actually quite surprised that we start a 4-vCPU guest on a 2-pCPU
> platform: the total number of vCPUs is 6 (2 dom0 + 4 domU). The
> processors are not the greatest for testing, so I was wondering whether
> we end up with too many vCPUs running on the platform, making the test
> unreliable.
>
Well, doing that, with this scheduler, is certainly *not* the best
recipe for determinism and reliability.
In fact, RTDS is a non-work-conserving scheduler. This means that (with
default parameters) each vCPU gets at most 40% CPU time, even if there
are idle cycles.
With 6 vCPUs, there's a total demand of 240% of CPU time, and with 2
pCPUs there's at most 200% available, which means we're in overload
(well, at least that's the case if/when all the vCPUs try to execute
for their guaranteed 40%).
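Spelled out (using the default RTDS reservation of a 4ms budget every 10ms period, and integer percentages to keep the arithmetic exact):

```python
BUDGET_NS, PERIOD_NS = 4_000_000, 10_000_000  # default RTDS budget/period (ns)

# RTDS is not work-conserving: each vCPU is capped at budget/period of
# one pCPU's time, even when there are idle cycles.
per_vcpu_pct = 100 * BUDGET_NS // PERIOD_NS   # 40% per vCPU
total_demand_pct = 6 * per_vcpu_pct           # 2 dom0 + 4 domU vCPUs
capacity_pct = 2 * 100                        # 2 pCPUs

assert per_vcpu_pct == 40
assert total_demand_pct == 240                # 240% demanded...
assert total_demand_pct > capacity_pct        # ...200% available: overload
```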
Things *should really not* explode (as in Xen crashes) if that
happens; actually, from a scheduler perspective, it should really not
be too big of a deal (especially if the overload is transient, as I
guess it should be in this case). However, it's entirely possible that
some specific vCPUs failing to be scheduled for a certain amount of
time causes something _inside_ the guest to time out, or get stuck or
wedged, which may be what happens here.
I'm adding Meng to Cc, to see what he thinks about this situation.
Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-26 20:51 ` Dario Faggioli
@ 2017-09-27 23:51 ` Julien Grall
2017-09-27 23:52 ` Julien Grall
0 siblings, 1 reply; 10+ messages in thread
From: Julien Grall @ 2017-09-27 23:51 UTC (permalink / raw)
To: Dario Faggioli, osstest service owner, xen-devel
Cc: Meng Xu, Stefano Stabellini
Hi Dario,
On 09/26/2017 09:51 PM, Dario Faggioli wrote:
> On Tue, 2017-09-26 at 18:28 +0100, Julien Grall wrote:
>> On 09/26/2017 08:33 AM, Dario Faggioli wrote:
>>>>
>>> Here's the logs:
>>> http://logs.test-lab.xenproject.org/osstest/logs/113816/test-armhf-armhf-xl-rtds/info.html
>>
>> It does not seem to be similar: in the Credit2 case the kernel is
>> stuck at very early boot. Here it seems to be running (there are
>> grants set up).
>>
> Yes, I agree, it's not totally similar.
>
>> This seems to be confirmed by the guest console log: I can see the
>> prompt. Interestingly, when the guest job fails, it has been waiting
>> for a long time on disk and hvc0, although it does not time out.
>>
> Ah, I see what you mean, I found it in the guest console log.
>
>> I am actually quite surprised that we start a 4-vCPU guest on a 2-pCPU
>> platform: the total number of vCPUs is 6 (2 dom0 + 4 domU). The
>> processors are not the greatest for testing, so I was wondering whether
>> we end up with too many vCPUs running on the platform, making the test
>> unreliable.
>>
>>
> Well, doing that, with this scheduler, is certainly *not* the best
> recipe for determinism and reliability.
>
> In fact, RTDS is a non-work conserving scheduler. This means that (with
> default parameters) each vCPU gets at most 40% CPU time, even if there
> are idle cycles.
>
> With 6 vCPUs, there's a total demand of 240% of CPU time, and with 2
> pCPUs there's at most 200% available, which means we're in overload
> (well, at least that's the case if/when all the vCPUs try to execute
> for their guaranteed 40%).
>
> Things *should really not* explode (as in Xen crashes) if that
> happens; actually, from a scheduler perspective, it should really not
> be too big of a deal (especially if the overload is transient, as I
> guess it should be in this case). However, it's entirely possible that
> some specific vCPUs failing to be scheduled for a certain amount of
> time causes something _inside_ the guest to time out, or get stuck or
> wedged, which may be what happens here.
Looking at the logs, I don't see any Xen crash, and it seems to
be responsive.
I don't know much about the scheduler and how to interpret the logs:
Sep 25 22:43:21.495119 (XEN) Domain info:
Sep 25 22:43:21.503073 (XEN) domain: 0
Sep 25 22:43:21.503100 (XEN) [ 0.0 ] cpu 0, (10000000, 4000000), cur_b=3895333 cur_d=1611120000000 last_start=1611116505875
Sep 25 22:43:21.511080 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
Sep 25 22:43:21.519082 (XEN) [ 0.1 ] cpu 1, (10000000, 4000000), cur_b=3946375 cur_d=1611130000000 last_start=1611126446583
Sep 25 22:43:21.527023 (XEN) onQ=0 runnable=1 flags=0 effective hard_affinity=0-1
Sep 25 22:43:21.535063 (XEN) domain: 5
Sep 25 22:43:21.535089 (XEN) [ 5.0 ] cpu 0, (10000000, 4000000), cur_b=3953875 cur_d=1611120000000 last_start=1611110106041
Sep 25 22:43:21.543073 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
Sep 25 22:43:21.551078 (XEN) [ 5.1 ] cpu 1, (10000000, 4000000), cur_b=3938167 cur_d=1611140000000 last_start=1611130169791
Sep 25 22:43:21.559063 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
Sep 25 22:43:21.559096 (XEN) [ 5.2 ] cpu 1, (10000000, 4000000), cur_b=3952500 cur_d=1611140000000 last_start=1611130107958
Sep 25 22:43:21.575067 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
Sep 25 22:43:21.575101 (XEN) [ 5.3 ] cpu 0, (10000000, 4000000), cur_b=3951875 cur_d=1611120000000 last_start=1611110154166
Sep 25 22:43:21.583196 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
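For what it's worth, the (10000000, 4000000) pairs in the dump above look like each vCPU's RTDS (period, budget), apparently in nanoseconds: a 4 ms budget every 10 ms period, i.e. the default 40% reservation Dario mentioned. A small illustrative sketch (not osstest or Xen code) of the resulting total demand:

```python
# Illustrative sketch: interpret the RTDS (period, budget) pairs from
# the dump above (values look like nanoseconds: 10 ms period, 4 ms budget).

def utilization(period_ns, budget_ns):
    """Fraction of CPU time one RTDS vCPU may consume per period."""
    return budget_ns / period_ns

# 6 vCPUs in total (2 for dom0, 4 for the domU), all with defaults.
vcpus = [(10_000_000, 4_000_000)] * 6
demand = sum(utilization(p, b) for p, b in vcpus)

print(f"total demand: {demand:.0%}, capacity of 2 pCPUs: 200%")
# total demand: 240%, capacity of 2 pCPUs: 200%  -> overload
```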
Also, it seems to fail fairly reliably, so it might be possible
to set up a reproducer.
Cheers,
--
Julien Grall
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-27 23:51 ` Julien Grall
@ 2017-09-27 23:52 ` Julien Grall
2017-09-28 9:38 ` Dario Faggioli
0 siblings, 1 reply; 10+ messages in thread
From: Julien Grall @ 2017-09-27 23:52 UTC (permalink / raw)
To: Dario Faggioli, osstest service owner, xen-devel
Cc: Meng Xu, Stefano Stabellini
On 09/28/2017 12:51 AM, Julien Grall wrote:
> Hi Dario,
>
> On 09/26/2017 09:51 PM, Dario Faggioli wrote:
>> On Tue, 2017-09-26 at 18:28 +0100, Julien Grall wrote:
>>> On 09/26/2017 08:33 AM, Dario Faggioli wrote:
>>>>>
>>>> Here's the logs:
>>>> http://logs.test-lab.xenproject.org/osstest/logs/113816/test-armhf-
>>>> armhf-xl-rtds/info.html
>>>
>>> It does not seem to be similar: in the credit2 case the kernel is
>>> stuck at very early boot.
>>> Here it seems to be running (there are grants set up).
>>>
>> Yes, I agree, it's not totally similar.
>>
>>> This seems to be confirmed by the guest console log: I can see the
>>> prompt. Interestingly, when the guest job fails, it has been waiting
>>> for a long time on disk and hvc0, although it does not time out.
>>>
>> Ah, I see what you mean, I found it in the guest console log.
>>
>>> I am actually quite surprised that we start a 4-vCPU guest on a
>>> 2-pCPU platform. The total number of vCPUs is 6 (2 for DOM0 + 4 for
>>> the DOMU), and the processors are not the greatest for testing. So I
>>> was wondering whether we end up with too many vCPUs running on the
>>> platform, making the test unreliable?
>>>
>> Well, doing that, with this scheduler, is certainly *not* the best
>> recipe for determinism and reliability.
>>
>> In fact, RTDS is a non-work-conserving scheduler. This means that
>> (with default parameters) each vCPU gets at most 40% CPU time, even
>> if there are idle cycles.
>>
>> With 6 vCPUs, there's a total demand of 240% of CPU time, and with 2
>> pCPUs there's at most 200% available, which means we're in overload
>> (well, at least that's the case if/when all the vCPUs try to execute
>> for their guaranteed 40%).
>>
>> Things *should really not* explode (as in, Xen crashes) if that
>> happens; actually, from a scheduler's perspective, it should really
>> not be too big of a deal (especially if the overload is transient, as
>> I guess it is in this case). However, it's entirely possible that
>> some specific vCPU failing to be scheduled for a certain amount of
>> time causes something _inside_ the guest to time out, or get stuck or
>> wedged, which may be what happens here.
>
> Looking at the logs, I don't see any Xen crash, and Xen seems to
> be responsive.
I forgot to add that I don't see any timeout on the guest console,
but I can notice slowdowns (waiting for some PV devices).
>
> I don't know much about the scheduler or how to interpret these logs:
>
> Sep 25 22:43:21.495119 (XEN) Domain info:
> Sep 25 22:43:21.503073 (XEN) domain: 0
> Sep 25 22:43:21.503100 (XEN) [ 0.0 ] cpu 0, (10000000, 4000000), cur_b=3895333 cur_d=1611120000000 last_start=1611116505875
> Sep 25 22:43:21.511080 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
> Sep 25 22:43:21.519082 (XEN) [ 0.1 ] cpu 1, (10000000, 4000000), cur_b=3946375 cur_d=1611130000000 last_start=1611126446583
> Sep 25 22:43:21.527023 (XEN) onQ=0 runnable=1 flags=0 effective hard_affinity=0-1
> Sep 25 22:43:21.535063 (XEN) domain: 5
> Sep 25 22:43:21.535089 (XEN) [ 5.0 ] cpu 0, (10000000, 4000000), cur_b=3953875 cur_d=1611120000000 last_start=1611110106041
> Sep 25 22:43:21.543073 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
> Sep 25 22:43:21.551078 (XEN) [ 5.1 ] cpu 1, (10000000, 4000000), cur_b=3938167 cur_d=1611140000000 last_start=1611130169791
> Sep 25 22:43:21.559063 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
> Sep 25 22:43:21.559096 (XEN) [ 5.2 ] cpu 1, (10000000, 4000000), cur_b=3952500 cur_d=1611140000000 last_start=1611130107958
> Sep 25 22:43:21.575067 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
> Sep 25 22:43:21.575101 (XEN) [ 5.3 ] cpu 0, (10000000, 4000000), cur_b=3951875 cur_d=1611120000000 last_start=1611110154166
> Sep 25 22:43:21.583196 (XEN) onQ=0 runnable=0 flags=0 effective hard_affinity=0-1
>
> Also, it seems to fail fairly reliably, so it might be possible
> to set up a reproducer.
>
> Cheers,
>
--
Julien Grall
* Re: Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL]
2017-09-27 23:52 ` Julien Grall
@ 2017-09-28 9:38 ` Dario Faggioli
0 siblings, 0 replies; 10+ messages in thread
From: Dario Faggioli @ 2017-09-28 9:38 UTC (permalink / raw)
To: Julien Grall, osstest service owner, xen-devel
Cc: Stefano Stabellini, Meng Xu
On Thu, 2017-09-28 at 00:52 +0100, Julien Grall wrote:
> On 09/28/2017 12:51 AM, Julien Grall wrote:
> > > Things *should really not* explode (as in, Xen crashes) if that
> > > happens; actually, from a scheduler's perspective, it should
> > > really not be too big of a deal (especially if the overload is
> > > transient, as I guess it is in this case). However, it's entirely
> > > possible that some specific vCPU failing to be scheduled for a
> > > certain amount of time causes something _inside_ the guest to
> > > time out, or get stuck or wedged, which may be what happens here.
> >
> > Looking at the logs, I don't see any Xen crash, and Xen seems to
> > be responsive.
>
> I forgot to add that I don't see any timeout on the guest console,
> but I can notice slowdowns (waiting for some PV devices).
>
Exactly! And in fact, I'm saying that, even if nothing breaks, maybe
there are intervals during which -- due to the combination of the
overload, the non-work-conserving nature of the scheduler, and the
fact that these CPUs are slow -- Dom0 is slow in dealing with the
backends, to the point that OSSTest times out.
Then, after the "load spike", everything goes back to normal: the
system is responsive, and the logs (like the runqueue dump you posted)
depict a normal, semi-idle system.
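If a reproducer does get set up, one quick check of this theory would be to keep the total reservation within the two pCPUs' capacity (fewer guest vCPUs, or smaller budgets via `xl sched-rtds`). Here is a hedged sketch of that simple utilization-bound check; note that Xen itself does not reject overloaded configurations, and the helper name is just illustrative:

```python
# Hedged sketch: a simple utilization-bound check for RTDS reservations.
# Xen does not enforce this bound itself; this only flags configurations
# where, as in this thread, total demand exceeds pCPU capacity.

def fits(vcpus, n_pcpus):
    """True if the summed budget/period utilizations fit on n_pcpus."""
    return sum(budget / period for period, budget in vcpus) <= n_pcpus

default = (10_000_000, 4_000_000)   # 10 ms period, 4 ms budget (40%)

print(fits([default] * 6, 2))       # 2 dom0 + 4 domU vCPUs -> False
print(fits([default] * 4, 2))       # e.g. a 2-vCPU guest instead -> True
```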
Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
end of thread, other threads:[~2017-09-28 9:38 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-25 9:46 [xen-unstable test] 113807: regressions - FAIL osstest service owner
2017-09-25 14:07 ` Guest start issue on ARM (maybe related to Credit2) [Was: Re: [xen-unstable test] 113807: regressions - FAIL] Dario Faggioli
2017-09-25 16:23 ` Julien Grall
2017-09-25 17:29 ` Dario Faggioli
2017-09-26 7:33 ` Dario Faggioli
2017-09-26 17:28 ` Julien Grall
2017-09-26 20:51 ` Dario Faggioli
2017-09-27 23:51 ` Julien Grall
2017-09-27 23:52 ` Julien Grall
2017-09-28 9:38 ` Dario Faggioli