On Tue, 2015-07-14 at 10:37 +0100, Ian Campbell wrote:
> On Tue, 2015-07-14 at 11:25 +0200, Dario Faggioli wrote:
> > On Tue, 2015-07-14 at 08:55 +0100, Ian Campbell wrote:
> > > It'll be hard to say until this change gets through the Xen push gate
> > > and that version gets used for other branches (linux testing, libvirt,
> > > ovmf, osstest's own gate etc).
> > >
> > Indeed. My opinion is that no, it is not.
> >
> > My understanding of the data Anthony provided is that, under some
> > (difficult to track/analyze/reproduce/etc.) load conditions, the Linux IO
> > and VM subsystems suffer from high latency, delaying QEMU startup.
> >
> > In the merlot* cases, the system is completely idle, apart from the
> > failing creation/migration operation.
> >
> > So, no, I don't think that would be the fix we need for that
> > situation.
>
> Even if it is not the correct fix, it seems like in some situations the
> increase in timeout has improved things, hence it is an "answer", as Jan
> asked (his quote marks).
>
Sure! And that's why I find this weird/interesting...

> > > At the moment it looks like it has helped with some but not all of the
> > > issues.
> > >
> > > These:
> > >
> > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html
> > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html
> > >
> > Can I ask why (I mean, e.g., comparing what with what) you're saying it
> > seems to have helped?
>
> There seemed (unscientifically) to be fewer of the libvirt-related
> guest-start failures.
>
And you mean by only looking at xen-unstable lines, don't you?

If yes, looking at merlot0, I've found the below.

Old timeout, failing:
http://logs.test-lab.xenproject.org/osstest/logs/59105/test-amd64-amd64-libvirt-xsm/info.html

New timeout, success:
http://logs.test-lab.xenproject.org/osstest/logs/59311/test-amd64-amd64-libvirt/info.html

And, looking at how long QEMU took to start up, that would be:

  13:44:32 - 13:43:42

i.e., just a bit less than 1 min!

So, yes, it looks like this change is actually going to help in this
case. What I'm missing is how it is possible that, on an idle system,
spawning the device model takes that long.

As said, in Anthony's OpenStack case, the system was quite busy... not
that there can't be a bug (somewhere, perhaps in Linux) in that case too,
but here it looks even weirder to me.

Could it be the NUMA misconfiguration? Well, if so, I'm not sure how...

Dario

--
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
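
A quick sketch of the startup-time arithmetic above, in Python, for
reference; the 10s and 60s timeout values below are assumptions used
only for illustration, not figures taken from this message.

#!/usr/bin/env python3
# Sketch: subtract the two timestamps quoted in the message and compare
# the result against a device-model start timeout.  The 10s/60s values
# are assumed for illustration; they are not taken from the message.
from datetime import datetime

def delta_seconds(start, end):
    """Seconds elapsed between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()

qemu_startup = delta_seconds("13:43:42", "13:44:32")
print(f"QEMU startup took {qemu_startup:.0f}s")   # -> 50s, just under 1 min

for label, timeout in (("assumed old timeout", 10), ("assumed new timeout", 60)):
    verdict = "passes" if qemu_startup < timeout else "times out"
    print(f"{label} ({timeout}s): device model spawn {verdict}")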