* [PATCH] libxl: Increase device model startup timeout to 1min. @ 2015-06-26 11:57 Anthony PERARD 2015-06-26 16:41 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Anthony PERARD @ 2015-06-26 11:57 UTC (permalink / raw) To: xen-devel Cc: Anthony PERARD, Wei Liu, Ian Jackson, Ian Campbell, Stefano Stabellini On a busy host, QEMU may take more than 10s to start. Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> --- tools/libxl/libxl_internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index e96d6b5..3a604f7 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -85,7 +85,7 @@ #define LIBXL_INIT_TIMEOUT 10 #define LIBXL_DESTROY_TIMEOUT 10 #define LIBXL_HOTPLUG_TIMEOUT 10 -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 #define LIBXL_STUBDOM_START_TIMEOUT 30 #define LIBXL_QEMU_BODGE_TIMEOUT 2 #define LIBXL_XENCONSOLE_LIMIT 1048576 -- Anthony PERARD ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-26 11:57 [PATCH] libxl: Increase device model startup timeout to 1min Anthony PERARD @ 2015-06-26 16:41 ` Ian Campbell 2015-06-29 14:23 ` Anthony PERARD 0 siblings, 1 reply; 28+ messages in thread From: Ian Campbell @ 2015-06-26 16:41 UTC (permalink / raw) To: Anthony PERARD; +Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel On Fri, 2015-06-26 at 12:57 +0100, Anthony PERARD wrote: > On a busy host, QEMU may take more than 10s to start. Please show your working here so that in 2 years when we want to know why this value was chosen we don't have to go scrobbling around in the ML archives looking for the data you gathered. > > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> > --- > tools/libxl/libxl_internal.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > index e96d6b5..3a604f7 100644 > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -85,7 +85,7 @@ > #define LIBXL_INIT_TIMEOUT 10 > #define LIBXL_DESTROY_TIMEOUT 10 > #define LIBXL_HOTPLUG_TIMEOUT 10 > -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 > +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 > #define LIBXL_STUBDOM_START_TIMEOUT 30 > #define LIBXL_QEMU_BODGE_TIMEOUT 2 > #define LIBXL_XENCONSOLE_LIMIT 1048576 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-26 16:41 ` Ian Campbell @ 2015-06-29 14:23 ` Anthony PERARD 2015-06-29 14:51 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Anthony PERARD @ 2015-06-29 14:23 UTC (permalink / raw) To: Ian Campbell; +Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel On Fri, Jun 26, 2015 at 05:41:07PM +0100, Ian Campbell wrote: > On Fri, 2015-06-26 at 12:57 +0100, Anthony PERARD wrote: > > On a busy host, QEMU may take more than 10s to start. > > Please show your working here so that in 2 years when we want to know > why this value was chosen we don't have to go scrobbling around in the > ML archives looking for the data you gathered. OK. What about: While running the whole stack of OpenStack (install via devstack on Ubuntu) on a single machine, sometime the device model fail to start on time. When this happen, a strace of QEMU shows that it still loading its libraries. You can find such strace on: http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg03549.html What shows the strace is that between a mmap() syscall and the next syscall, 0.5s might have pass. Normaly, on this same machine it takes 0.32 second on average for QEMU to start, while running Tempest (which is the OpenStack test suite). Those QEMU are runned as backend for a PV guest. > > > > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> > > --- > > tools/libxl/libxl_internal.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > > index e96d6b5..3a604f7 100644 > > --- a/tools/libxl/libxl_internal.h > > +++ b/tools/libxl/libxl_internal.h > > @@ -85,7 +85,7 @@ > > #define LIBXL_INIT_TIMEOUT 10 > > #define LIBXL_DESTROY_TIMEOUT 10 > > #define LIBXL_HOTPLUG_TIMEOUT 10 > > -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 > > +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 > > #define LIBXL_STUBDOM_START_TIMEOUT 30 > > #define LIBXL_QEMU_BODGE_TIMEOUT 2 > > #define LIBXL_XENCONSOLE_LIMIT 1048576 > > -- Anthony PERARD ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-29 14:23 ` Anthony PERARD @ 2015-06-29 14:51 ` Ian Campbell 2015-06-29 16:09 ` Anthony PERARD 0 siblings, 1 reply; 28+ messages in thread From: Ian Campbell @ 2015-06-29 14:51 UTC (permalink / raw) To: Anthony PERARD; +Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel On Mon, 2015-06-29 at 15:23 +0100, Anthony PERARD wrote: > On Fri, Jun 26, 2015 at 05:41:07PM +0100, Ian Campbell wrote: > > On Fri, 2015-06-26 at 12:57 +0100, Anthony PERARD wrote: > > > On a busy host, QEMU may take more than 10s to start. > > > > Please show your working here so that in 2 years when we want to know > > why this value was chosen we don't have to go scrobbling around in the > > ML archives looking for the data you gathered. > > OK. What about: I'm afraid none of that really explains why QEMU taking >10s is reasonable/expected under the circumstances or why 60s is a good choice now. Nor does it really answer Ian's question in <21901.33163.547929.321814@mariner.uk.xensource.com> I think. Ian. > > While running the whole stack of OpenStack (install via devstack on Ubuntu) > on a single machine, sometime the device model fail to start on time. When > this happen, a strace of QEMU shows that it still loading its libraries. > > You can find such strace on: > http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg03549.html > > What shows the strace is that between a mmap() syscall and the next > syscall, 0.5s might have pass. > > Normaly, on this same machine it takes 0.32 second on average for QEMU to > start, while running Tempest (which is the OpenStack test suite). Those > QEMU are runned as backend for a PV guest. > > > > > > > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> > > > --- > > > tools/libxl/libxl_internal.h | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > > > index e96d6b5..3a604f7 100644 > > > --- a/tools/libxl/libxl_internal.h > > > +++ b/tools/libxl/libxl_internal.h > > > @@ -85,7 +85,7 @@ > > > #define LIBXL_INIT_TIMEOUT 10 > > > #define LIBXL_DESTROY_TIMEOUT 10 > > > #define LIBXL_HOTPLUG_TIMEOUT 10 > > > -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 > > > +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 > > > #define LIBXL_STUBDOM_START_TIMEOUT 30 > > > #define LIBXL_QEMU_BODGE_TIMEOUT 2 > > > #define LIBXL_XENCONSOLE_LIMIT 1048576 > > > > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-29 14:51 ` Ian Campbell @ 2015-06-29 16:09 ` Anthony PERARD 2015-06-30 3:08 ` Dario Faggioli 2015-06-30 14:04 ` Ian Jackson 0 siblings, 2 replies; 28+ messages in thread From: Anthony PERARD @ 2015-06-29 16:09 UTC (permalink / raw) To: Ian Campbell; +Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel On Mon, Jun 29, 2015 at 03:51:57PM +0100, Ian Campbell wrote: > On Mon, 2015-06-29 at 15:23 +0100, Anthony PERARD wrote: > > On Fri, Jun 26, 2015 at 05:41:07PM +0100, Ian Campbell wrote: > > > On Fri, 2015-06-26 at 12:57 +0100, Anthony PERARD wrote: > > > > On a busy host, QEMU may take more than 10s to start. > > > > > > Please show your working here so that in 2 years when we want to know > > > why this value was chosen we don't have to go scrobbling around in the > > > ML archives looking for the data you gathered. > > > > OK. What about: > > I'm afraid none of that really explains why QEMU taking >10s is > reasonable/expected under the circumstances or why 60s is a good choice > now. > > Nor does it really answer Ian's question in > <21901.33163.547929.321814@mariner.uk.xensource.com> I think. I only know what happen, not why it happen. How could I investigate why qemu is taking so long after an mmap() syscall? I guest at that time, it is copying the library into memory. -- Anthony PERARD ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-29 16:09 ` Anthony PERARD @ 2015-06-30 3:08 ` Dario Faggioli 2015-06-30 14:04 ` Ian Jackson 1 sibling, 0 replies; 28+ messages in thread From: Dario Faggioli @ 2015-06-30 3:08 UTC (permalink / raw) To: Anthony PERARD Cc: Ian Jackson, xen-devel, Wei Liu, Ian Campbell, Stefano Stabellini [-- Attachment #1.1: Type: text/plain, Size: 1309 bytes --] On Mon, 2015-06-29 at 17:09 +0100, Anthony PERARD wrote: > On Mon, Jun 29, 2015 at 03:51:57PM +0100, Ian Campbell wrote: > > On Mon, 2015-06-29 at 15:23 +0100, Anthony PERARD wrote: > > > OK. What about: > > > > I'm afraid none of that really explains why QEMU taking >10s is > > reasonable/expected under the circumstances or why 60s is a good choice > > now. > > > > Nor does it really answer Ian's question in > > <21901.33163.547929.321814@mariner.uk.xensource.com> I think. > > I only know what happen, not why it happen. > > How could I investigate why qemu is taking so long after an mmap() syscall? > I guest at that time, it is copying the library into memory. > Is there a lot of CPU / memory contention in Dom0 (assuming that QEMU runs in dom0 in this case)? If yes, is it plausible for that to be the cause? If yes, can we show, or monitor, or trace it somehow? That would be a first step toward trying to figure out what specific characteristic of the contended scenario causes the issue... Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-29 16:09 ` Anthony PERARD 2015-06-30 3:08 ` Dario Faggioli @ 2015-06-30 14:04 ` Ian Jackson 2015-06-30 15:58 ` Stefano Stabellini 1 sibling, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-06-30 14:04 UTC (permalink / raw) To: Anthony PERARD; +Cc: Stefano Stabellini, Wei Liu, Ian Campbell, xen-devel Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > On Mon, Jun 29, 2015 at 03:51:57PM +0100, Ian Campbell wrote: > > Nor does it really answer Ian's question in > > <21901.33163.547929.321814@mariner.uk.xensource.com> I think. > > I only know what happen, not why it happen. > > How could I investigate why qemu is taking so long after an mmap() syscall? > I guest at that time, it is copying the library into memory. Is the machine very busy at this time ? If this is a stress test of some kind then I guess there are three possible views: * The number and nature of parallel operations done in the stress test is unreasonable for the provided hardware: => the timeout is fine * The number and nature of parallel operations is reasonable for the provided hardware, and should not result in the toolstack domain being overloaded to the point where ld.so on qemu takes so long => the timeout is correct and there is an underlying bug (perhaps in Linux) * The number and nature of parallel operations is reasonable for the hardware, but might easily result in long delays in ld.so assembling qemu - that is, people expect to deploy Xen and then cause their toolstack domain to be massively `overloaded' and very slow => the timeout is too short In the third case we probably need a general-purpose facility for adjusting timeouts in general. People who do not expect to overload their toolstack domain should not be made to await massive timeout values designed for people who do. Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-30 14:04 ` Ian Jackson @ 2015-06-30 15:58 ` Stefano Stabellini 2015-06-30 16:40 ` Ian Jackson 0 siblings, 1 reply; 28+ messages in thread From: Stefano Stabellini @ 2015-06-30 15:58 UTC (permalink / raw) To: Ian Jackson Cc: Anthony PERARD, Stefano Stabellini, Wei Liu, Ian Campbell, xen-devel On Tue, 30 Jun 2015, Ian Jackson wrote: > Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > > On Mon, Jun 29, 2015 at 03:51:57PM +0100, Ian Campbell wrote: > > > Nor does it really answer Ian's question in > > > <21901.33163.547929.321814@mariner.uk.xensource.com> I think. > > > > I only know what happen, not why it happen. > > > > How could I investigate why qemu is taking so long after an mmap() syscall? > > I guest at that time, it is copying the library into memory. > > Is the machine very busy at this time ? > > If this is a stress test of some kind then I guess there are three > possible views: > > * The number and nature of parallel operations done in the stress > test is unreasonable for the provided hardware: > => the timeout is fine I don't know if it is our place to make this call. Should we really be deciding what is considered "reasonable"? I think not. Defining what is reasonable and policies that match it is not a route I think we should take in libxl. > * The number and nature of parallel operations is reasonable for the > provided hardware, and should not result in the toolstack domain > being overloaded to the point where ld.so on qemu takes so long > => the timeout is correct and there is an underlying bug > (perhaps in Linux) > > * The number and nature of parallel operations is reasonable for > the hardware, but might easily result in long delays in ld.so > assembling qemu - that is, people expect to deploy Xen and then > cause their toolstack domain to be massively `overloaded' and > very slow > => the timeout is too short > > In the third case we probably need a general-purpose facility for > adjusting timeouts in general. People who do not expect to overload > their toolstack domain should not be made to await massive timeout > values designed for people who do. But they are not actually made to await any timeouts: the timeout is only to check whether QEMU started. Users are only made to await timeouts in case of errors. Is that what you meant? In the normal case, we could have a timeout of 1h and users wouldn't tell the difference. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-30 15:58 ` Stefano Stabellini @ 2015-06-30 16:40 ` Ian Jackson 2015-07-01 15:03 ` Stefano Stabellini 0 siblings, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-06-30 16:40 UTC (permalink / raw) To: Stefano Stabellini; +Cc: Anthony PERARD, Wei Liu, Ian Campbell, xen-devel Stefano Stabellini writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > On Tue, 30 Jun 2015, Ian Jackson wrote: > > Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > > In the third case we probably need a general-purpose facility for > > adjusting timeouts in general. People who do not expect to overload > > their toolstack domain should not be made to await massive timeout > > values designed for people who do. > > But they are not actually made to await any timeouts: the timeout is > only to check whether QEMU started. Users are only made to await > timeouts in case of errors. Is that what you meant? > In the normal case, we could have a timeout of 1h and users wouldn't > tell the difference. Sadly errors are far not uncommon. Making someone wait 1h for an inevitable error message would be quite unreasonable. > > * The number and nature of parallel operations done in the stress > > test is unreasonable for the provided hardware: > > => the timeout is fine > > I don't know if it is our place to make this call. Should we really be > deciding what is considered "reasonable"? I think not. Defining what is > reasonable and policies that match it is not a route I think we should > take in libxl. Nevertheless if we are defining timeouts we are implicitly setting some parameters which imply that certain configurations are unreasonable. Hopefully all such configurations are absurd. If what you mean is that our bounds of `reasonable' should be very wide, then I agree. If anyone could reasonably expect it to work, then that is fine. Certainly we should refrain fromk subjective judgements. Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-06-30 16:40 ` Ian Jackson @ 2015-07-01 15:03 ` Stefano Stabellini 2015-07-02 11:11 ` Anthony PERARD 0 siblings, 1 reply; 28+ messages in thread From: Stefano Stabellini @ 2015-07-01 15:03 UTC (permalink / raw) To: Ian Jackson Cc: Anthony PERARD, xen-devel, Wei Liu, Ian Campbell, Stefano Stabellini On Tue, 30 Jun 2015, Ian Jackson wrote: > > > * The number and nature of parallel operations done in the stress > > > test is unreasonable for the provided hardware: > > > => the timeout is fine > > > > I don't know if it is our place to make this call. Should we really be > > deciding what is considered "reasonable"? I think not. Defining what is > > reasonable and policies that match it is not a route I think we should > > take in libxl. > > Nevertheless if we are defining timeouts we are implicitly setting > some parameters which imply that certain configurations are > unreasonable. Hopefully all such configurations are absurd. > > If what you mean is that our bounds of `reasonable' should be very > wide, then I agree. If anyone could reasonably expect it to work, > then that is fine. Certainly we should refrain fromk subjective > judgements. OK. How do you measure reasonable for this case? What I actually mean to ask is how do you suggest we proceed on this problem? Of course it would be nice if we knew exactly why this is happening, but the issue only happens once every 2-3 tempest runs, each of them takes about 1 hour. Tempest executes about 1300 tests for each run, some of them in parallel. We haven't taken the time to read all the tests run by tempest so we don't know exactly what they do. We don't really know the environment that causes the failure. Reading all the tests is not an option. We could try adding more tracing to the system, but given the type of error, if we do we are not likely to reproduce the error at all, or maybe reproduce something different. Given the state of things, I suggest we make sure that increasing the timeout actually fixes/works-around the problem. I would also like to see some empirical measurements that tell us by how much we should increase the timeout. Is 1 minute actually enough? I would not go as far as asking to figure out what the real cause of the problem is, because there is no way to estimate how long is going to take or even how to do that. And in the meantime we still have spurious failures in the OpenStack CI-loop. Alternatively we could carry the work around in the Xen package we build for the OpenStack CI-loop, leaving xen-unstable "unfixed", but I think that would be less desirable. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-07-01 15:03 ` Stefano Stabellini @ 2015-07-02 11:11 ` Anthony PERARD 2015-07-02 12:38 ` Ian Jackson 0 siblings, 1 reply; 28+ messages in thread From: Anthony PERARD @ 2015-07-02 11:11 UTC (permalink / raw) To: Stefano Stabellini; +Cc: Wei Liu, Ian Jackson, Ian Campbell, xen-devel On Wed, Jul 01, 2015 at 04:03:55PM +0100, Stefano Stabellini wrote: > On Tue, 30 Jun 2015, Ian Jackson wrote: > > > > * The number and nature of parallel operations done in the stress > > > > test is unreasonable for the provided hardware: > > > > => the timeout is fine > > > > > > I don't know if it is our place to make this call. Should we really be > > > deciding what is considered "reasonable"? I think not. Defining what is > > > reasonable and policies that match it is not a route I think we should > > > take in libxl. > > > > Nevertheless if we are defining timeouts we are implicitly setting > > some parameters which imply that certain configurations are > > unreasonable. Hopefully all such configurations are absurd. > > > > If what you mean is that our bounds of `reasonable' should be very > > wide, then I agree. If anyone could reasonably expect it to work, > > then that is fine. Certainly we should refrain fromk subjective > > judgements. > > OK. How do you measure reasonable for this case? > > What I actually mean to ask is how do you suggest we proceed on this > problem? > > Of course it would be nice if we knew exactly why this is happening, but > the issue only happens once every 2-3 tempest runs, each of them takes > about 1 hour. Tempest executes about 1300 tests for each run, some > of them in parallel. We haven't taken the time to read all the tests run > by tempest so we don't know exactly what they do. > > We don't really know the environment that causes the failure. Reading > all the tests is not an option. We could try adding more tracing to the > system, but given the type of error, if we do we are not likely to > reproduce the error at all, or maybe reproduce something different. > > > Given the state of things, I suggest we make sure that increasing the > timeout actually fixes/works-around the problem. I would also like to > see some empirical measurements that tell us by how much we should > increase the timeout. Is 1 minute actually enough? I have tested an increase timeout this night. And here are the result. The machine is a AMD Opteron(tm) Processor 4284, with 8G of RAM and 8 pCPU. It's running Ubuntu 14.04, with Xen 4.4. On top of that, OpenStack have been deployed via devstack on a single. The test is to run Tempest with --concurrency=4. There are 4 tests runned in parallel, but they don't necessarly start a VM. When they do, it's a PV with 64MB and 1 vCPU and sometime with double amount of RAM. The stats: Tempest run: 22 Tempest run time for each run: ~3000s Tempest number of test: 1143 after 22 run of tempest: QEMU start: 3352 number of run that took more than 2s: 20 number of run that took more than 9s: 6 maximum start time: 10.973713s I have gathered the QEMU start time by having strace running for each of them. I have then look at the time it took from the first syscall execve('qemu') until the syscall where QEMU respond on its QMP socket (libxl have acknoledge that QEMU is running at that time). -- Anthony PERARD ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-07-02 11:11 ` Anthony PERARD @ 2015-07-02 12:38 ` Ian Jackson 2015-07-03 11:21 ` Anthony PERARD 0 siblings, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-07-02 12:38 UTC (permalink / raw) To: Anthony PERARD; +Cc: xen-devel, Wei Liu, Ian Campbell, Stefano Stabellini Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > I have tested an increase timeout this night. And here are the result. > > The machine is a AMD Opteron(tm) Processor 4284, with 8G of RAM and 8 pCPU. > It's running Ubuntu 14.04, with Xen 4.4. On top of that, OpenStack have > been deployed via devstack on a single. > > The test is to run Tempest with --concurrency=4. There are 4 tests runned > in parallel, but they don't necessarly start a VM. When they do, it's a PV > with 64MB and 1 vCPU and sometime with double amount of RAM. > > The stats: > Tempest run: 22 > Tempest run time for each run: ~3000s > Tempest number of test: 1143 > after 22 run of tempest: > QEMU start: 3352 > number of run that took more than 2s: 20 > number of run that took more than 9s: 6 > maximum start time: 10.973713s > > I have gathered the QEMU start time by having strace running for each of > them. I have then look at the time it took from the first syscall > execve('qemu') until the syscall where QEMU respond on its QMP socket > (libxl have acknoledge that QEMU is running at that time). Thanks for this information. So from what you say it appears that we are running at most 4 copies of libxl and qemu in parallel, along with at most 4 VMs ? And out of 3352 qemu starts, we have <= 2s 3332 >2s <= 9s 14 > 9s 6 ? Do you have any information about the maximum system load in general ? dom0 load, vcpu overcommit, etc. ? Do you know what the guests are doing ? I'm starting to think that this might be a real bug but that the bug might be "Linux's I/O subsystem sometimes produces appalling latency under load" (which is hardly news). Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-07-02 12:38 ` Ian Jackson @ 2015-07-03 11:21 ` Anthony PERARD 2015-07-03 11:30 ` Ian Jackson 0 siblings, 1 reply; 28+ messages in thread From: Anthony PERARD @ 2015-07-03 11:21 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, Wei Liu, Ian Campbell, Stefano Stabellini On Thu, Jul 02, 2015 at 01:38:37PM +0100, Ian Jackson wrote: > Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > > I have tested an increase timeout this night. And here are the result. > > > > The machine is a AMD Opteron(tm) Processor 4284, with 8G of RAM and 8 pCPU. > > It's running Ubuntu 14.04, with Xen 4.4. On top of that, OpenStack have > > been deployed via devstack on a single. > > > > The test is to run Tempest with --concurrency=4. There are 4 tests runned > > in parallel, but they don't necessarly start a VM. When they do, it's a PV > > with 64MB and 1 vCPU and sometime with double amount of RAM. > > > > The stats: > > Tempest run: 22 > > Tempest run time for each run: ~3000s > > Tempest number of test: 1143 > > after 22 run of tempest: > > QEMU start: 3352 > > number of run that took more than 2s: 20 > > number of run that took more than 9s: 6 > > maximum start time: 10.973713s > > > > I have gathered the QEMU start time by having strace running for each of > > them. I have then look at the time it took from the first syscall > > execve('qemu') until the syscall where QEMU respond on its QMP socket > > (libxl have acknoledge that QEMU is running at that time). > > Thanks for this information. > > So from what you say it appears that we are running at most 4 copies > of libxl and qemu in parallel, along with at most 4 VMs ? Some test do run 3 VM, so at most it would be 12 VM. I think that still only one copy of libxl since everythings goes through libvirt, right? > And out of 3352 qemu starts, we have > <= 2s 3332 > >2s <= 9s 14 > > 9s 6 > ? > > Do you have any information about the maximum system load in general ? > dom0 load, vcpu overcommit, etc. ? Do you know what the guests are > doing ? I've got some stat collected via `dstat`. During the tests, the load average can go from 5 to 8 in dom0. The CPUs in dom0 can spend quite a lot of time waiting, it's not rare to see beetween 35 and 60% of time waiting (total of all 8 cpu). As for the disk, dstat reveal that there is often between 50M and 120M per second of total disk write. The guests usully does not do much, beside booting. It run a small system call CirrOS. > I'm starting to think that this might be a real bug but that the bug > might be "Linux's I/O subsystem sometimes produces appalling latency > under load" (which is hardly news). I guess the straces support this, here are few quote from different strace: 04:11:50.602067 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) <0.000036> 04:11:50.602166 open("/usr/lib/x86_64-linux-gnu/libidn.so.11", O_RDONLY|O_CLOEXEC) = 3 <0.000039> 04:11:50.602263 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300+\0\0\0\0\0\0"..., 832) = 832 <0.000035> 04:11:50.602359 fstat(3, {st_mode=S_IFREG|0644, st_size=207128, ...}) = 0 <0.000033> 04:11:50.602453 mmap(NULL, 2302208, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f845b9f8000 <0.000036> 04:11:50.602546 mprotect(0x7f845ba29000, 2097152, PROT_NONE) = 0 <0.000037> 04:11:50.602639 mmap(0x7f845bc29000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x31000) = 0x7f845bc29000 <0.000038> 04:11:51.257654 close(3) = 0 <0.000042> 04:11:51.257859 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) <0.000078> 18:08:37.635570 open("/usr/lib/x86_64-linux-gnu/sasl2/libcrammd5.so", O_RDONLY|O_CLOEXEC) = 28 <0.000050> 18:08:37.635681 read(28, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\20\0\0\0\0\0\0"..., 832) = 832 <1.011536> 18:08:38.647329 fstat(28, {st_mode=S_IFREG|0644, st_size=22592, ...}) = 0 <0.000042> 18:08:38.647499 mmap(NULL, 2117872, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 28, 0) = 0x7f6c29128000 <0.000048> 18:08:38.647613 mprotect(0x7f6c2912c000, 2097152, PROT_NONE) = 0 <0.000049> 18:08:38.647721 mmap(0x7f6c2932c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 28, 0x4000) = 0x7f6c2932c000 <0.000043> 18:08:38.647862 close(28) = 0 <0.000035> The first quote is a pattern I'm seeing very often on slow dm start, where it take a long time between the mmap and the next syscall. On the second quote, read() is to blame, it took 1s. I guess even the first quote imply there is going to be I/O after the mmap call, isn't it? -- Anthony PERARD ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-07-03 11:21 ` Anthony PERARD @ 2015-07-03 11:30 ` Ian Jackson 2015-07-07 13:44 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-07-03 11:30 UTC (permalink / raw) To: Anthony PERARD; +Cc: xen-devel, Wei Liu, Ian Campbell, Stefano Stabellini Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > On Thu, Jul 02, 2015 at 01:38:37PM +0100, Ian Jackson wrote: > > I'm starting to think that this might be a real bug but that the bug > > might be "Linux's I/O subsystem sometimes produces appalling latency > > under load" (which is hardly news). > > I guess the straces support this, here are few quote from different strace: ... > 04:11:50.602639 mmap(0x7f845bc29000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x31000) = 0x7f845bc29000 <0.000038> > 04:11:51.257654 close(3) = 0 <0.000042> ... > The first quote is a pattern I'm seeing very often on slow dm start, where > it take a long time between the mmap and the next syscall. On the second > quote, read() is to blame, it took 1s. > > I guess even the first quote imply there is going to be I/O after the mmap > call, isn't it? It's very likely, yes. The code after mmap will probably start reading the pages just mapped. Thanks for this investigation. I am now convinced that this is indeed the bug "Linux's I/O subsystem sometimes produces appalling latency under load". That bug has existed for at least a decade and seems unlikely to be fixed any time soon. Certainly, fixing it is beyond our scope. So papering over this with an increase in the timeout is probably proper. I'm tempted to suggest increasing the timeout only on Linux. Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-07-03 11:30 ` Ian Jackson @ 2015-07-07 13:44 ` Ian Campbell 2015-07-07 13:47 ` Ian Jackson 0 siblings, 1 reply; 28+ messages in thread From: Ian Campbell @ 2015-07-07 13:44 UTC (permalink / raw) To: Ian Jackson; +Cc: Anthony PERARD, xen-devel, Wei Liu, Stefano Stabellini On Fri, 2015-07-03 at 12:30 +0100, Ian Jackson wrote: > So papering over this with an increase in the timeout is probably > proper. Yes :-/ > I'm tempted to suggest increasing the timeout only on Linux. How tempted? Or shall I apply this patch? Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] libxl: Increase device model startup timeout to 1min. 2015-07-07 13:44 ` Ian Campbell @ 2015-07-07 13:47 ` Ian Jackson 2015-07-07 14:37 ` [PATCH V2] " Anthony PERARD 0 siblings, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-07-07 13:47 UTC (permalink / raw) To: Ian Campbell; +Cc: Anthony PERARD, xen-devel, Wei Liu, Stefano Stabellini Ian Campbell writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."): > On Fri, 2015-07-03 at 12:30 +0100, Ian Jackson wrote: > > So papering over this with an increase in the timeout is probably > > proper. > > Yes :-/ > > > I'm tempted to suggest increasing the timeout only on Linux. > > How tempted? Or shall I apply this patch? I guess I don't want to insist on a pile of #ifdefery. But can we have a resubmission with a comment that makes it clear that this is due to a Linux-specific bug ? The comment could steal some of my words from earlier in this thread. Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH V2] libxl: Increase device model startup timeout to 1min. 2015-07-07 13:47 ` Ian Jackson @ 2015-07-07 14:37 ` Anthony PERARD 2015-07-07 14:41 ` Ian Jackson 0 siblings, 1 reply; 28+ messages in thread From: Anthony PERARD @ 2015-07-07 14:37 UTC (permalink / raw) To: xen-devel Cc: Anthony PERARD, Wei Liu, Ian Jackson, Ian Campbell, Stefano Stabellini On a busy host, QEMU may take more than 10s to load and start. This is likely due to a bug in Linux where the I/O subsystem sometime produce high latency under load and result in QEMU taking a long time to load every single dynamic libraries. Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> --- tools/libxl/libxl_internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index e96d6b5..3a604f7 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -85,7 +85,7 @@ #define LIBXL_INIT_TIMEOUT 10 #define LIBXL_DESTROY_TIMEOUT 10 #define LIBXL_HOTPLUG_TIMEOUT 10 -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 #define LIBXL_STUBDOM_START_TIMEOUT 30 #define LIBXL_QEMU_BODGE_TIMEOUT 2 #define LIBXL_XENCONSOLE_LIMIT 1048576 -- Anthony PERARD ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH V2] libxl: Increase device model startup timeout to 1min. 2015-07-07 14:37 ` [PATCH V2] " Anthony PERARD @ 2015-07-07 14:41 ` Ian Jackson 2015-07-07 15:09 ` [PATCH V3] " Anthony PERARD 0 siblings, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-07-07 14:41 UTC (permalink / raw) To: Anthony PERARD; +Cc: Stefano Stabellini, Wei Liu, Ian Campbell, xen-devel Anthony PERARD writes ("[PATCH V2] libxl: Increase device model startup timeout to 1min."): > On a busy host, QEMU may take more than 10s to load and start. > > This is likely due to a bug in Linux where the I/O subsystem sometime > produce high latency under load and result in QEMU taking a long time to > load every single dynamic libraries. Please > #define LIBXL_HOTPLUG_TIMEOUT 10 > -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 > +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 > #define LIBXL_STUBDOM_START_TIMEOUT 30 can that information be in a comment in the code, as well as in a commit message ? Thanks, Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-07 14:41 ` Ian Jackson @ 2015-07-07 15:09 ` Anthony PERARD 2015-07-07 15:14 ` Ian Jackson 0 siblings, 1 reply; 28+ messages in thread From: Anthony PERARD @ 2015-07-07 15:09 UTC (permalink / raw) To: xen-devel Cc: Anthony PERARD, Wei Liu, Ian Jackson, Ian Campbell, Stefano Stabellini On a busy host, QEMU may take more than 10s to load and start. This is likely due to a bug in Linux where the I/O subsystem sometime produce high latency under load and result in QEMU taking a long time to load every single dynamic libraries. Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> --- tools/libxl/libxl_internal.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index d52589e..d6f87c1 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -85,7 +85,9 @@ #define LIBXL_INIT_TIMEOUT 10 #define LIBXL_DESTROY_TIMEOUT 10 #define LIBXL_HOTPLUG_TIMEOUT 10 -#define LIBXL_DEVICE_MODEL_START_TIMEOUT 10 +/* QEMU may be slow to load and start due to a bug in Linux where the I/O + * subsystem sometime produce high latency under load. */ +#define LIBXL_DEVICE_MODEL_START_TIMEOUT 60 #define LIBXL_STUBDOM_START_TIMEOUT 30 #define LIBXL_QEMU_BODGE_TIMEOUT 2 #define LIBXL_XENCONSOLE_LIMIT 1048576 -- Anthony PERARD ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-07 15:09 ` [PATCH V3] " Anthony PERARD @ 2015-07-07 15:14 ` Ian Jackson 2015-07-07 15:41 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Ian Jackson @ 2015-07-07 15:14 UTC (permalink / raw) To: Anthony PERARD; +Cc: Stefano Stabellini, Wei Liu, Ian Campbell, xen-devel Anthony PERARD writes ("[PATCH V3] libxl: Increase device model startup timeout to 1min."): > On a busy host, QEMU may take more than 10s to load and start. > > This is likely due to a bug in Linux where the I/O subsystem sometime > produce high latency under load and result in QEMU taking a long time to > load every single dynamic libraries. Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-07 15:14 ` Ian Jackson @ 2015-07-07 15:41 ` Ian Campbell 2015-07-14 6:17 ` Jan Beulich 0 siblings, 1 reply; 28+ messages in thread From: Ian Campbell @ 2015-07-07 15:41 UTC (permalink / raw) To: Ian Jackson; +Cc: Anthony PERARD, Stefano Stabellini, Wei Liu, xen-devel On Tue, 2015-07-07 at 16:14 +0100, Ian Jackson wrote: > Anthony PERARD writes ("[PATCH V3] libxl: Increase device model startup timeout to 1min."): > > On a busy host, QEMU may take more than 10s to load and start. > > > > This is likely due to a bug in Linux where the I/O subsystem sometime > > produce high latency under load and result in QEMU taking a long time to > > load every single dynamic libraries. > > Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-07 15:41 ` Ian Campbell @ 2015-07-14 6:17 ` Jan Beulich 2015-07-14 7:55 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Jan Beulich @ 2015-07-14 6:17 UTC (permalink / raw) To: Anthony PERARD, Ian Campbell Cc: Ian Jackson, xen-devel, Wei Liu, Stefano Stabellini >>> On 07.07.15 at 17:41, <ian.campbell@citrix.com> wrote: > On Tue, 2015-07-07 at 16:14 +0100, Ian Jackson wrote: >> Anthony PERARD writes ("[PATCH V3] libxl: Increase device model startup > timeout to 1min."): >> > On a busy host, QEMU may take more than 10s to load and start. >> > >> > This is likely due to a bug in Linux where the I/O subsystem sometime >> > produce high latency under load and result in QEMU taking a long time to >> > load every single dynamic libraries. >> >> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> > > Applied. So is this the "answer" to "Problems with merlot* AMD Opteron 6376 systems"? Jan ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-14 6:17 ` Jan Beulich @ 2015-07-14 7:55 ` Ian Campbell 2015-07-14 9:25 ` Dario Faggioli 0 siblings, 1 reply; 28+ messages in thread From: Ian Campbell @ 2015-07-14 7:55 UTC (permalink / raw) To: Jan Beulich Cc: Anthony PERARD, Ian Jackson, xen-devel, Wei Liu, Stefano Stabellini On Tue, 2015-07-14 at 07:17 +0100, Jan Beulich wrote: > >>> On 07.07.15 at 17:41, <ian.campbell@citrix.com> wrote: > > On Tue, 2015-07-07 at 16:14 +0100, Ian Jackson wrote: > >> Anthony PERARD writes ("[PATCH V3] libxl: Increase device model startup > > timeout to 1min."): > >> > On a busy host, QEMU may take more than 10s to load and start. > >> > > >> > This is likely due to a bug in Linux where the I/O subsystem sometime > >> > produce high latency under load and result in QEMU taking a long time to > >> > load every single dynamic libraries. > >> > >> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> > > > > Applied. > > So is this the "answer" to "Problems with merlot* AMD Opteron 6376 > systems"? It'll be hard to say until this change gets through the Xen push gate and that version gets used for other branches (linux testing, libvirt, ovmf, osstest's own gate etc). At the moment it looks like it has helped with some but not all of the issues. These: http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html still don't look brilliant, even if you only pay attention to the xen-unstable lines. Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-14 7:55 ` Ian Campbell @ 2015-07-14 9:25 ` Dario Faggioli 2015-07-14 9:37 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Dario Faggioli @ 2015-07-14 9:25 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich, Anthony PERARD [-- Attachment #1.1: Type: text/plain, Size: 2086 bytes --] On Tue, 2015-07-14 at 08:55 +0100, Ian Campbell wrote: > On Tue, 2015-07-14 at 07:17 +0100, Jan Beulich wrote: > > >>> On 07.07.15 at 17:41, <ian.campbell@citrix.com> wrote: > > > On Tue, 2015-07-07 at 16:14 +0100, Ian Jackson wrote: > > >> Anthony PERARD writes ("[PATCH V3] libxl: Increase device model startup > > > timeout to 1min."): > > >> > On a busy host, QEMU may take more than 10s to load and start. > > >> > > > >> > This is likely due to a bug in Linux where the I/O subsystem sometime > > >> > produce high latency under load and result in QEMU taking a long time to > > >> > load every single dynamic libraries. > > >> > > >> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> > > > > > > Applied. > > > > So is this the "answer" to "Problems with merlot* AMD Opteron 6376 > > systems"? > > It'll be hard to say until this change gets through the Xen push gate > and that version gets used for other branches (linux testing, libvirt, > ovmf, osstest's own gate etc). > Indeed. My opinion is that no, it is not. My understanding of the data Anthony provided is that, under some (difficult to track/analyze/reproduce/etc) load conditions, the Linux IO and VM subsystem suffer from high latency, delaying QEMU startup. In the merlot* cases, the system is completely idle, apart from the failing creation/migration operation. So, no, I don't think that would not be the fix we need for that situation. > At the moment it looks like it has helped with some but not all of the > issues. > > These: > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html > http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html > Can I ask why (I mean, e.g., comparing what with what) you're saying it seems to have helped? Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-14 9:25 ` Dario Faggioli @ 2015-07-14 9:37 ` Ian Campbell 2015-07-14 10:52 ` Dario Faggioli 0 siblings, 1 reply; 28+ messages in thread From: Ian Campbell @ 2015-07-14 9:37 UTC (permalink / raw) To: Dario Faggioli Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich, Anthony PERARD On Tue, 2015-07-14 at 11:25 +0200, Dario Faggioli wrote: > On Tue, 2015-07-14 at 08:55 +0100, Ian Campbell wrote: > > On Tue, 2015-07-14 at 07:17 +0100, Jan Beulich wrote: > > > >>> On 07.07.15 at 17:41, <ian.campbell@citrix.com> wrote: > > > > On Tue, 2015-07-07 at 16:14 +0100, Ian Jackson wrote: > > > >> Anthony PERARD writes ("[PATCH V3] libxl: Increase device model startup > > > > timeout to 1min."): > > > >> > On a busy host, QEMU may take more than 10s to load and start. > > > >> > > > > >> > This is likely due to a bug in Linux where the I/O subsystem sometime > > > >> > produce high latency under load and result in QEMU taking a long time to > > > >> > load every single dynamic libraries. > > > >> > > > >> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> > > > > > > > > Applied. > > > > > > So is this the "answer" to "Problems with merlot* AMD Opteron 6376 > > > systems"? > > > > It'll be hard to say until this change gets through the Xen push gate > > and that version gets used for other branches (linux testing, libvirt, > > ovmf, osstest's own gate etc). > > > Indeed. My opinion is that no, it is not. > > My understanding of the data Anthony provided is that, under some > (difficult to track/analyze/reproduce/etc) load conditions, the Linux IO > and VM subsystem suffer from high latency, delaying QEMU startup. > > In the merlot* cases, the system is completely idle, apart from the > failing creation/migration operation. > > So, no, I don't think that would not be the fix we need for that > situation. Even if it is not the correct fix it seems like in some situations the increase in timeout has improved things, hence it is an "answer" as Jan asked (his quote marks). > > At the moment it looks like it has helped with some but not all of the > > issues. > > > > These: > > > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html > > > Can I ask why (I mean, e.g., comparing what with what) you're saying it > seems to have helped? There seemed (unscientifically) to be fewer of the libvirt related guest-start failures. Ian. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-14 9:37 ` Ian Campbell @ 2015-07-14 10:52 ` Dario Faggioli 2015-07-14 14:23 ` Dario Faggioli 0 siblings, 1 reply; 28+ messages in thread From: Dario Faggioli @ 2015-07-14 10:52 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich, Anthony PERARD [-- Attachment #1.1: Type: text/plain, Size: 2788 bytes --] On Tue, 2015-07-14 at 10:37 +0100, Ian Campbell wrote: > On Tue, 2015-07-14 at 11:25 +0200, Dario Faggioli wrote: > > On Tue, 2015-07-14 at 08:55 +0100, Ian Campbell wrote: > > > It'll be hard to say until this change gets through the Xen push gate > > > and that version gets used for other branches (linux testing, libvirt, > > > ovmf, osstest's own gate etc). > > > > > Indeed. My opinion is that no, it is not. > > > > My understanding of the data Anthony provided is that, under some > > (difficult to track/analyze/reproduce/etc) load conditions, the Linux IO > > and VM subsystem suffer from high latency, delaying QEMU startup. > > > > In the merlot* cases, the system is completely idle, apart from the > > failing creation/migration operation. > > > > So, no, I don't think that would not be the fix we need for that > > situation. > > Even if it is not the correct fix it seems like in some situations the > increase in timeout has improved things, hence it is an "answer" as Jan > asked (his quote marks). > Sure! And that's why I find this weird/interesting... > > > At the moment it looks like it has helped with some but not all of the > > > issues. > > > > > > These: > > > > > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html > > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html > > > > > Can I ask why (I mean, e.g., comparing what with what) you're saying it > > seems to have helped? > > There seemed (unscientifically) to be fewer of the libvirt related > guest-start failures. > And you mean by only looking at xen-unstable lines, don't you? If yes, looking at merlot0, I've found the below. Old timeout, failing: http://logs.test-lab.xenproject.org/osstest/logs/59105/test-amd64-amd64-libvirt-xsm/info.html New timeout, success: http://logs.test-lab.xenproject.org/osstest/logs/59311/test-amd64-amd64-libvirt/info.html And, looking at how long QEMU did take to start up that would be: 13:44:32 - 13:43:42 i.e., just a bit less than 1min! So, yes, it looks that this change is actually going to help in this case. What I'm missing is how it is possible that, on an idle system, DM spawning takes that long. As said, in Anthony's OpenStack case, the system was quite busy... not that it can't be a bug (somewhere, perhaps in Linux) in that case too, but here, it looks even more weird to me. May it be the NUMA misconfiguration? Well, if yes, I'm not sure how... Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-14 10:52 ` Dario Faggioli @ 2015-07-14 14:23 ` Dario Faggioli 2015-07-14 14:48 ` Ian Campbell 0 siblings, 1 reply; 28+ messages in thread From: Dario Faggioli @ 2015-07-14 14:23 UTC (permalink / raw) To: Ian Campbell Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich, Anthony PERARD [-- Attachment #1.1: Type: text/plain, Size: 2444 bytes --] On Tue, 2015-07-14 at 12:52 +0200, Dario Faggioli wrote: > On Tue, 2015-07-14 at 10:37 +0100, Ian Campbell wrote: > > On Tue, 2015-07-14 at 11:25 +0200, Dario Faggioli wrote: > > > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot0.html > > > > http://logs.test-lab.xenproject.org/osstest/results/host/merlot1.html > > > > > > > Can I ask why (I mean, e.g., comparing what with what) you're saying it > > > seems to have helped? > > > > There seemed (unscientifically) to be fewer of the libvirt related > > guest-start failures. > > > Old timeout, failing: > http://logs.test-lab.xenproject.org/osstest/logs/59105/test-amd64-amd64-libvirt-xsm/info.html > > New timeout, success: > http://logs.test-lab.xenproject.org/osstest/logs/59311/test-amd64-amd64-libvirt/info.html > > And, looking at how long QEMU did take to start up that would be: > > 13:44:32 - 13:43:42 > > i.e., just a bit less than 1min! > > So, yes, it looks that this change is actually going to help in this > case. > > What I'm missing is how it is possible that, on an idle system, DM > spawning takes that long. As said, in Anthony's OpenStack case, the > system was quite busy... not that it can't be a bug (somewhere, perhaps > in Linux) in that case too, but here, it looks even more weird to me. > > May it be the NUMA misconfiguration? Well, if yes, I'm not sure how... > BTW, Ian, perhaps it's not that urgent, but how cumbersome would it be to setup a few ad-hoc flights, like you did here: http://logs.test-lab.xenproject.org/osstest/logs/58853/ when also doing what Anthony was doing for stracing QEMU, as he described in here: http://bugs.xenproject.org/xen/mid/%3C20150624133948.GJ3393@perard.uk.xensource.com%3E I'm not sure what we could find, nor whether we should expect the strace to be similar to Anthony's. Perhaps not, as the system here is idle, and perhaps something would highlight or hint at where to look in this specific case... Or maybe none of the above, but it may be worth a try. If there is anything I can do for helping setting this up, just let me know (here or on IRC). Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #1.2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 181 bytes --] [-- Attachment #2: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH V3] libxl: Increase device model startup timeout to 1min. 2015-07-14 14:23 ` Dario Faggioli @ 2015-07-14 14:48 ` Ian Campbell 0 siblings, 0 replies; 28+ messages in thread From: Ian Campbell @ 2015-07-14 14:48 UTC (permalink / raw) To: Dario Faggioli Cc: Wei Liu, Stefano Stabellini, Ian Jackson, xen-devel, Jan Beulich, Anthony PERARD On Tue, 2015-07-14 at 16:23 +0200, Dario Faggioli wrote: > BTW, Ian, perhaps it's not that urgent, but how cumbersome would it be > to setup a few ad-hoc flights, like you did here: > > http://logs.test-lab.xenproject.org/osstest/logs/58853/ > > when also doing what Anthony was doing for stracing QEMU, "Fairly cumbersome", someone would have to develop and test some code in osstest to allow the switching in of a device model wrapper and for collecting the results afterwards. Those previous adhoc flights were simple because they just involved rerunning existing jobs in the same configuration and with the same harness code. I'm afraid that for a "to see what we see" experiment the effort I have outweighs the predicted information gain for this. If someone were to provide a tested and working osstest patch I could probably throw it through a few runs with minimal effort though. > as he > described in here: > > http://bugs.xenproject.org/xen/mid/%3C20150624133948.GJ3393@perard.uk.xensource.com%3E > > I'm not sure what we could find, nor whether we should expect the strace > to be similar to Anthony's. Perhaps not, as the system here is idle, and > perhaps something would highlight or hint at where to look in this > specific case... Or maybe none of the above, but it may be worth a try. > > If there is anything I can do for helping setting this up, just let me > know (here or on IRC). > > Regards, > Dario ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2015-07-14 14:48 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-06-26 11:57 [PATCH] libxl: Increase device model startup timeout to 1min Anthony PERARD 2015-06-26 16:41 ` Ian Campbell 2015-06-29 14:23 ` Anthony PERARD 2015-06-29 14:51 ` Ian Campbell 2015-06-29 16:09 ` Anthony PERARD 2015-06-30 3:08 ` Dario Faggioli 2015-06-30 14:04 ` Ian Jackson 2015-06-30 15:58 ` Stefano Stabellini 2015-06-30 16:40 ` Ian Jackson 2015-07-01 15:03 ` Stefano Stabellini 2015-07-02 11:11 ` Anthony PERARD 2015-07-02 12:38 ` Ian Jackson 2015-07-03 11:21 ` Anthony PERARD 2015-07-03 11:30 ` Ian Jackson 2015-07-07 13:44 ` Ian Campbell 2015-07-07 13:47 ` Ian Jackson 2015-07-07 14:37 ` [PATCH V2] " Anthony PERARD 2015-07-07 14:41 ` Ian Jackson 2015-07-07 15:09 ` [PATCH V3] " Anthony PERARD 2015-07-07 15:14 ` Ian Jackson 2015-07-07 15:41 ` Ian Campbell 2015-07-14 6:17 ` Jan Beulich 2015-07-14 7:55 ` Ian Campbell 2015-07-14 9:25 ` Dario Faggioli 2015-07-14 9:37 ` Ian Campbell 2015-07-14 10:52 ` Dario Faggioli 2015-07-14 14:23 ` Dario Faggioli 2015-07-14 14:48 ` Ian Campbell
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).