* [PATCH v2 0/2] xen: cpupool (small) improvement and (latent) bug fix
  From: Dario Faggioli @ 2016-07-14 16:17 UTC
  To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich

Hi everyone,

Version 2, with only patch 1 changed to follow Andrew's suggestion of directly moving the cpupool_*_domain() functions into the sched_*_domain() functions. That required changing the signature of sched_init_domain(), but I think the final state of the code is indeed more robust.

Thanks and Regards,
Dario
---
Dario Faggioli (2):
      xen: fix a (latent) cpupool-related race during domain destroy
      xen: cpupool: small optimization when moving between pools

 xen/common/cpupool.c    |  3 +++
 xen/common/domain.c     |  7 +------
 xen/common/schedule.c   | 13 ++++++++++++-
 xen/include/xen/sched.h |  2 +-
 4 files changed, 17 insertions(+), 8 deletions(-)

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Dario Faggioli @ 2016-07-14 16:18 UTC
  To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich

So, during domain destruction, we do:
 cpupool_rm_domain()    [ in domain_destroy() ]
 sched_destroy_domain() [ in complete_domain_destroy() ]

Therefore, there's a window during which, from the scheduler's point of view, a domain still sits outside of any cpupool.

In fact, cpupool_rm_domain() does d->cpupool=NULL, and we don't allow that to hold true for anything but the idle domain (and there are, in fact, ASSERT()s and BUG_ON()s to that effect).

Currently, we never really check d->cpupool during the window, but that does not mean the race is not there. For instance, Credit2 at some point (during load balancing) iterates on the list of domains, and if we add logic that needs to check d->cpupool, and any one of them has had cpupool_rm_domain() called on itself already... Boom!

(In fact, calling __vcpu_has_soft_affinity() from inside balance_load() makes `xl shutdown <domid>' reliably crash, and this is how I discovered this.)

On the other hand, cpupool_rm_domain() "only" does cpupool related bookkeeping, and there's no harm postponing it a little bit.

Also, considering that, during domain initialization, we do:
 cpupool_add_domain()
 sched_init_domain()

it makes sense for the destruction path to look like the opposite of it, i.e.:
 sched_destroy_domain()
 cpupool_rm_domain()

And hence that's what this patch does.

Actually, for better robustness, what we really do is move both cpupool_add_domain() and cpupool_rm_domain() inside sched_init_domain() and sched_destroy_domain(), respectively (and also add a couple of ASSERT()-s).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes from v1:
 * followed Andrew's suggestion of moving the cpupool functions inside the sched functions;
 * reworded the changelog.
---
 xen/common/domain.c     |  7 +------
 xen/common/schedule.c   | 13 ++++++++++++-
 xen/include/xen/sched.h |  2 +-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 42c07ee..339ee56 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -379,10 +379,7 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
         goto fail;
     init_status |= INIT_arch;
 
-    if ( (err = cpupool_add_domain(d, poolid)) != 0 )
-        goto fail;
-
-    if ( (err = sched_init_domain(d)) != 0 )
+    if ( (err = sched_init_domain(d, poolid)) != 0 )
         goto fail;
 
     if ( (err = late_hwdom_init(d)) != 0 )
@@ -868,8 +865,6 @@ void domain_destroy(struct domain *d)
 
     TRACE_1D(TRC_DOM0_DOM_REM, d->domain_id);
 
-    cpupool_rm_domain(d);
-
     /* Delete from task list and task hashtable. */
     spin_lock(&domlist_update_lock);
     pd = &domain_list;
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 7ac12d3..852f840 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -379,8 +379,15 @@ void sched_destroy_vcpu(struct vcpu *v)
     SCHED_OP(VCPU2OP(v), free_vdata, v->sched_priv);
 }
 
-int sched_init_domain(struct domain *d)
+int sched_init_domain(struct domain *d, int poolid)
 {
+    int ret;
+
+    ASSERT(d->cpupool == NULL);
+
+    if ( (ret = cpupool_add_domain(d, poolid)) )
+        return ret;
+
     SCHED_STAT_CRANK(dom_init);
     TRACE_1D(TRC_SCHED_DOM_ADD, d->domain_id);
     return SCHED_OP(DOM2OP(d), init_domain, d);
@@ -388,9 +395,13 @@ int sched_init_domain(struct domain *d)
 
 void sched_destroy_domain(struct domain *d)
 {
+    ASSERT(d->cpupool != NULL || is_idle_domain(d));
+
     SCHED_STAT_CRANK(dom_destroy);
     TRACE_1D(TRC_SCHED_DOM_REM, d->domain_id);
     SCHED_OP(DOM2OP(d), destroy_domain, d);
+
+    cpupool_rm_domain(d);
 }
 
 void vcpu_sleep_nosync(struct vcpu *v)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 46c82e7..888bc19 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -637,7 +637,7 @@ void noreturn asm_domain_crash_synchronous(unsigned long addr);
 void scheduler_init(void);
 int sched_init_vcpu(struct vcpu *v, unsigned int processor);
 void sched_destroy_vcpu(struct vcpu *v);
-int sched_init_domain(struct domain *d);
+int sched_init_domain(struct domain *d, int poolid);
 void sched_destroy_domain(struct domain *d);
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
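To make the failure mode concrete, here is a minimal, self-contained sketch (simplified stand-in types and names, not actual Xen code) of the window the patch closes: once the old ordering has run cpupool_rm_domain(), any scheduler-side code that still consults d->cpupool — as a soft-affinity check in Credit2's balance_load() would — finds a NULL pointer.

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified stand-ins for the real Xen structures (assumption). */
struct cpupool { unsigned long cpu_valid; };
struct domain  { struct cpupool *cpupool; };

static struct cpupool pool0 = { .cpu_valid = 0xfful };

/* Old ordering: domain_destroy() clears the pool pointer early. */
static void cpupool_rm_domain(struct domain *d)
{
    d->cpupool = NULL;
}

/*
 * Stand-in for scheduler code (e.g. a load-balancing pass) that runs
 * before complete_domain_destroy() and consults the domain's pool.
 * In the window after cpupool_rm_domain(), d->cpupool is NULL, and an
 * unconditional dereference here is the crash described above.
 */
static unsigned long domain_pool_cpus(const struct domain *d)
{
    if ( d->cpupool == NULL )
    {
        fprintf(stderr, "scheduler saw a domain outside any pool!\n");
        return 0;
    }
    return d->cpupool->cpu_valid;
}

int main(void)
{
    struct domain d = { .cpupool = &pool0 };

    cpupool_rm_domain(&d);                 /* domain_destroy() */
    /* ... grace period; the scheduler can still see the domain ... */
    printf("cpus: %#lx\n", domain_pool_cpus(&d));  /* hits the window */
    return 0;
}
```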
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Andrew Cooper @ 2016-07-14 17:08 UTC
  To: Dario Faggioli, xen-devel; +Cc: Juergen Gross, George Dunlap, Jan Beulich

On 14/07/16 17:18, Dario Faggioli wrote:
> So, during domain destruction, we do:
>  cpupool_rm_domain()    [ in domain_destroy() ]
>  sched_destroy_domain() [ in complete_domain_destroy() ]
>
> Therefore, there's a window during which, from the scheduler's point of view, a domain still sits outside of any cpupool.
>
> In fact, cpupool_rm_domain() does d->cpupool=NULL, and we don't allow that to hold true for anything but the idle domain (and there are, in fact, ASSERT()s and BUG_ON()s to that effect).
>
> Currently, we never really check d->cpupool during the window, but that does not mean the race is not there. For instance, Credit2 at some point (during load balancing) iterates on the list of domains, and if we add logic that needs to check d->cpupool, and any one of them has had cpupool_rm_domain() called on itself already... Boom!
>
> (In fact, calling __vcpu_has_soft_affinity() from inside balance_load() makes `xl shutdown <domid>' reliably crash, and this is how I discovered this.)
>
> On the other hand, cpupool_rm_domain() "only" does cpupool related bookkeeping, and there's no harm postponing it a little bit.
>
> Also, considering that, during domain initialization, we do:
>  cpupool_add_domain()
>  sched_init_domain()
>
> it makes sense for the destruction path to look like the opposite of it, i.e.:
>  sched_destroy_domain()
>  cpupool_rm_domain()
>
> And hence that's what this patch does.
>
> Actually, for better robustness, what we really do is move both cpupool_add_domain() and cpupool_rm_domain() inside sched_init_domain() and sched_destroy_domain(), respectively (and also add a couple of ASSERT()-s).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This definitely looks better, although I will leave the in-depth review to people more familiar with the scheduler internals.
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Juergen Gross @ 2016-07-15 9:38 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 14/07/16 18:18, Dario Faggioli wrote:
> So, during domain destruction, we do:
>  cpupool_rm_domain()    [ in domain_destroy() ]
>  sched_destroy_domain() [ in complete_domain_destroy() ]
>
> Therefore, there's a window during which, from the scheduler's point of view, a domain still sits outside of any cpupool.
>
> In fact, cpupool_rm_domain() does d->cpupool=NULL, and we don't allow that to hold true for anything but the idle domain (and there are, in fact, ASSERT()s and BUG_ON()s to that effect).
>
> Currently, we never really check d->cpupool during the window, but that does not mean the race is not there. For instance, Credit2 at some point (during load balancing) iterates on the list of domains, and if we add logic that needs to check d->cpupool, and any one of them has had cpupool_rm_domain() called on itself already... Boom!
>
> (In fact, calling __vcpu_has_soft_affinity() from inside balance_load() makes `xl shutdown <domid>' reliably crash, and this is how I discovered this.)
>
> On the other hand, cpupool_rm_domain() "only" does cpupool related bookkeeping, and there's no harm postponing it a little bit.
>
> Also, considering that, during domain initialization, we do:
>  cpupool_add_domain()
>  sched_init_domain()
>
> it makes sense for the destruction path to look like the opposite of it, i.e.:
>  sched_destroy_domain()
>  cpupool_rm_domain()
>
> And hence that's what this patch does.
>
> Actually, for better robustness, what we really do is move both cpupool_add_domain() and cpupool_rm_domain() inside sched_init_domain() and sched_destroy_domain(), respectively (and also add a couple of ASSERT()-s).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Hmm, are you aware of commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a, which explicitly moved cpupool_rm_domain() to the place you are now removing it from again? Please verify that the scenario mentioned in the description of that commit still works with your patch.

Juergen
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Dario Faggioli @ 2016-07-15 10:14 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On Fri, 2016-07-15 at 11:38 +0200, Juergen Gross wrote:
> Hmm, are you aware of commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a, which explicitly moved cpupool_rm_domain() to the place you are now removing it from again? Please verify that the scenario mentioned in the description of that commit still works with your patch.

Sorry, but I only partly see the problem.

In particular, I'm probably not fully understanding, from that commit changelog, what set of operations/commands I should run to check whether or not I have reintroduced the issue.

What I did so far is as follows:

root@Zhaman:~# xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0               12    credit       y          1
Pool-credit           4    credit       y          1
root@Zhaman:~# xl list -c
Name            ID   Mem VCPUs      State   Time(s)      Cpupool
Domain-0         0  1019    16     r-----      34.5       Pool-0
vm1              1  4096     4     -b----       9.7       Pool-credit
root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
Some cpus may have not or only partially been removed from 'Pool-credit'.
If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
root@Zhaman:~# xl cpupool-list -c
Name           CPU list
Pool-0         0,1,2,3,4,5,10,11,12,13,14,15
Pool-credit    9
root@Zhaman:~# xl shutdown vm1
Shutting down domain 1
root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
root@Zhaman:~# xl cpupool-list -c
Name           CPU list
Pool-0         0,1,2,3,4,5,10,11,12,13,14,15
Pool-credit

If (with vm1 still in Pool-credit), I do this, it indeed fails:

root@Zhaman:~# xl shutdown vm1 & xl cpupool-cpu-remove Pool-credit all
[1] 3275
Shutting down domain 2
libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
Some cpus may have not or only partially been removed from 'Pool-credit'.
If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
[1]+  Done          xl shutdown vm1
root@Zhaman:~# xl cpupool-list -c
Name           CPU list
Pool-0         0,1,2,3,4,5,10,11,12,13,14,15
Pool-credit    9

But that does not look too strange to me, as it's entirely possible that the domain has not been moved yet when we try to remove the last cpu. And in fact, after the domain has properly shut down:

root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
root@Zhaman:~# xl cpupool-list
Name               CPUs   Sched     Active   Domain count
Pool-0               12    credit       y          1
Pool-credit           0    credit       y          0

And in fact, looking at the code introduced by that commit, the important part, to me, seems to be the moving of the domain to cpupool0, which is indeed the right thing to do. OTOH, what I am seeing and fixing happens (well, could happen) all the time, even when the domain being shut down is already in cpupool0, and (as you say yourself in your changelog) there is no such issue as removing the last cpu of cpupool0.

What am I missing?

Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Juergen Gross @ 2016-07-15 10:36 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 15/07/16 12:14, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 11:38 +0200, Juergen Gross wrote:
>> Hmm, are you aware of commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a, which explicitly moved cpupool_rm_domain() to the place you are now removing it from again? Please verify that the scenario mentioned in the description of that commit still works with your patch.
>
> Sorry, but I only partly see the problem.
>
> In particular, I'm probably not fully understanding, from that commit changelog, what set of operations/commands I should run to check whether or not I have reintroduced the issue.

You need to create a domain in a cpupool and destroy it again while some dom0 process still is holding a reference to it (resulting in a zombie domain). Then try to destroy the cpupool.

> What I did so far is as follows:
>
> root@Zhaman:~# xl cpupool-list
> Name               CPUs   Sched     Active   Domain count
> Pool-0               12    credit       y          1
> Pool-credit           4    credit       y          1
> root@Zhaman:~# xl list -c
> Name            ID   Mem VCPUs      State   Time(s)      Cpupool
> Domain-0         0  1019    16     r-----      34.5       Pool-0
> vm1              1  4096     4     -b----       9.7       Pool-credit
> root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
> libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
> Some cpus may have not or only partially been removed from 'Pool-credit'.
> If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
> root@Zhaman:~# xl cpupool-list -c
> Name           CPU list
> Pool-0         0,1,2,3,4,5,10,11,12,13,14,15
> Pool-credit    9
> root@Zhaman:~# xl shutdown vm1
> Shutting down domain 1
> root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
> root@Zhaman:~# xl cpupool-list -c
> Name           CPU list
> Pool-0         0,1,2,3,4,5,10,11,12,13,14,15
> Pool-credit
>
> If (with vm1 still in Pool-credit), I do this, it indeed fails:
>
> root@Zhaman:~# xl shutdown vm1 & xl cpupool-cpu-remove Pool-credit all
> [1] 3275
> Shutting down domain 2
> libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
> Some cpus may have not or only partially been removed from 'Pool-credit'.
> If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
> [1]+  Done          xl shutdown vm1
> root@Zhaman:~# xl cpupool-list -c
> Name           CPU list
> Pool-0         0,1,2,3,4,5,10,11,12,13,14,15
> Pool-credit    9
>
> But that does not look too strange to me, as it's entirely possible that the domain has not been moved yet when we try to remove the last cpu. And in fact, after the domain has properly shut down:
>
> root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
> root@Zhaman:~# xl cpupool-list
> Name               CPUs   Sched     Active   Domain count
> Pool-0               12    credit       y          1
> Pool-credit           0    credit       y          0
>
> And in fact, looking at the code introduced by that commit, the important part, to me, seems to be the moving of the domain to cpupool0, which is indeed the right thing to do. OTOH, what I am seeing and fixing happens (well, could happen) all the time, even when the domain being shut down is already in cpupool0, and (as you say yourself in your changelog) there is no such issue as removing the last cpu of cpupool0.
>
> What am I missing?

The domain being a zombie domain might change the picture. Moving it to cpupool0 was failing before my patch and it might do so again with your patch applied.

Juergen
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Dario Faggioli @ 2016-07-15 11:52 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
> On 15/07/16 12:14, Dario Faggioli wrote:
> > In particular, I'm probably not fully understanding, from that commit changelog, what set of operations/commands I should run to check whether or not I have reintroduced the issue.
> You need to create a domain in a cpupool and destroy it again while some dom0 process still is holding a reference to it (resulting in a zombie domain). Then try to destroy the cpupool.

Ah, I see. I hadn't gathered from anywhere that it needed to be a zombie domain.

> > What am I missing?
> The domain being a zombie domain might change the picture. Moving it to cpupool0 was failing before my patch and it might do so again with your patch applied.

Mmmm... I don't immediately see why moving a zombie domain fails either, but I guess I'll have to try.

But then, correct me if I'm wrong, the situation is like this:
 - right now there's a (potential) race between the destruction of a domain's scheduling data and its removal from a cpupool;
 - with my patch, the race goes away, but we risk not being able to destroy a cpupool with a zombie domain in it.

I don't think we want to be in either of these two situations. :-( The race has never triggered so far, but I already have patches for Credit2 (finishing implementing soft affinity for it) that make it a real thing.

I think I can work around that specific case by doing something like the below:

diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index bc0e794..d91fd70 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -193,12 +193,9 @@ struct cpupool
 
 static inline cpumask_t* cpupool_domain_cpumask(struct domain *d)
 {
-    /*
-     * d->cpupool is NULL only for the idle domain, and no one should
-     * be interested in calling this for the idle domain.
-     */
-    ASSERT(d->cpupool != NULL);
-    return d->cpupool->cpu_valid;
+    /* No one should be interested in calling this for the idle domain! */
+    ASSERT(!is_idle_domain(d));
+    return d->cpupool ? d->cpupool->cpu_valid : cpupool0->cpu_valid;
 }
 
 #endif /* __XEN_SCHED_IF_H__ */

But even if that were acceptable (what do you think?), I still don't like having the race there lurking. :-/

Therefore, I still think this patch is correct, but I'm up for investigating further and finding a way to solve the "zombie in cpupool" issue as well.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
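As an illustration only (a hypothetical helper, not part of the posted diff or of the Xen tree), a scheduler-side caller of the accessor sketched above would then tolerate a domain that has already gone through cpupool_rm_domain(), falling back to cpupool0's CPUs instead of dereferencing a NULL pool:

```c
/*
 * Hypothetical fragment, assuming the NULL-tolerant
 * cpupool_domain_cpumask() from the diff above is in place: safe to
 * call even inside the destroy window, since it never dereferences a
 * NULL d->cpupool.
 */
static inline int domain_may_run_on(const struct domain *d, unsigned int cpu)
{
    return cpumask_test_cpu(cpu, cpupool_domain_cpumask(d));
}
```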
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Juergen Gross @ 2016-07-15 12:52 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 15/07/16 13:52, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
>> On 15/07/16 12:14, Dario Faggioli wrote:
>>> In particular, I'm probably not fully understanding, from that commit changelog, what set of operations/commands I should run to check whether or not I have reintroduced the issue.
>> You need to create a domain in a cpupool and destroy it again while some dom0 process still is holding a reference to it (resulting in a zombie domain). Then try to destroy the cpupool.
>
> Ah, I see. I hadn't gathered from anywhere that it needed to be a zombie domain.
>
>>> What am I missing?
>> The domain being a zombie domain might change the picture. Moving it to cpupool0 was failing before my patch and it might do so again with your patch applied.
>
> Mmmm... I don't immediately see why moving a zombie domain fails either, but I guess I'll have to try.

Searching through the history I found commit 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4, which might have removed the problematic condition (cpupool->n_dom being non-zero while d->cpupool was NULL already).

> But then, correct me if I'm wrong, the situation is like this:
>  - right now there's a (potential) race between the destruction of a domain's scheduling data and its removal from a cpupool;
>  - with my patch, the race goes away, but we risk not being able to destroy a cpupool with a zombie domain in it.

This one has been observed. I do remember the following critical cases:
- removing a cpupool with a zombie domain
- shutting down the system with a domain in a cpupool
- not sure about the combination of both cases (shutting down with a zombie domain in a cpupool): is this even possible without another bug in the hypervisor or dom0?

> Therefore, I still think this patch is correct, but I'm up for investigating further and finding a way to solve the "zombie in cpupool" issue as well.

I'm not saying your patch is wrong. I just wanted to give you a hint about the history of the stuff you are changing. :-)

If it is working I'd really prefer it over the current situation.

Juergen
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Dario Faggioli @ 2016-07-15 14:23 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
> On 15/07/16 13:52, Dario Faggioli wrote:
> > On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
> > > On 15/07/16 12:14, Dario Faggioli wrote:
> > > > In particular, I'm probably not fully understanding, from that commit changelog, what set of operations/commands I should run to check whether or not I have reintroduced the issue.
> > > You need to create a domain in a cpupool and destroy it again while some dom0 process still is holding a reference to it (resulting in a zombie domain). Then try to destroy the cpupool.
> >
> > Ah, I see. I hadn't gathered from anywhere that it needed to be a zombie domain.
> >
> > > > What am I missing?
> > > The domain being a zombie domain might change the picture. Moving it to cpupool0 was failing before my patch and it might do so again with your patch applied.
> >
> > Mmmm... I don't immediately see why moving a zombie domain fails either, but I guess I'll have to try.
> Searching through the history I found commit 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4, which might have removed the problematic condition (cpupool->n_dom being non-zero while d->cpupool was NULL already).

Yeah.. And there's also this:

/*
 * unassign a specific cpu from a cpupool
 * we must be sure not to run on the cpu to be unassigned! to achieve this
 * the main functionality is performed via continue_hypercall_on_cpu on a
 * specific cpu.
 * if the cpu to be removed is the last one of the cpupool no active domain
 * must be bound to the cpupool. dying domains are moved to cpupool0 as they
 * might be zombies.
 * possible failures:
 * - last cpu and still active domains in cpupool
 * - cpu just being unplugged
 */
static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
{
    int work_cpu;
    int ret;
    struct domain *d;

    cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n",
                    c->cpupool_id, cpu);
    [...]
    for_each_domain_in_cpupool(d, c)
    {
        if ( !d->is_dying )
        {
            ret = -EBUSY;
            break;
        }
        ret = cpupool_move_domain_locked(d, cpupool0);
        if ( ret )
            break;
    }
    [...]

So it really looks like it ought to be possible to move zombies to cpupool0 these days. :-)

> > Therefore, I still think this patch is correct, but I'm up for investigating further and finding a way to solve the "zombie in cpupool" issue as well.
> I'm not saying your patch is wrong. I just wanted to give you a hint about the history of the stuff you are changing. :-)

Sure, and that's much appreciated! :-)

> If it is working I'd really prefer it over the current situation.

Right. I've got to leave now. But I'll produce some zombies on Monday, and will see if they move. :-D

Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Dario Faggioli @ 2016-07-18 14:03 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
> > On 15/07/16 13:52, Dario Faggioli wrote:
> > > Therefore, I still think this patch is correct, but I'm up for investigating further and finding a way to solve the "zombie in cpupool" issue as well.
> > I'm not saying your patch is wrong. I just wanted to give you a hint about the history of the stuff you are changing. :-)
>
> Sure, and that's much appreciated! :-)
>
> > If it is working I'd really prefer it over the current situation.
>
> Right. I've got to leave now. But I'll produce some zombies on Monday, and will see if they move. :-D

Here you go:

http://pastebin.com/r93TGCZU

Is that "vm1 -b-" --> "(null) ---" domain zombie enough?

If yes, as you can see, it moves to cpupool0 while being destroyed, and I can destroy the pool without problems.

I also shutdown the system with the domain still there (in cpupool0), and it works.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Juergen Gross @ 2016-07-18 14:09 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 18/07/16 16:03, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
>>> On 15/07/16 13:52, Dario Faggioli wrote:
>>>> Therefore, I still think this patch is correct, but I'm up for investigating further and finding a way to solve the "zombie in cpupool" issue as well.
>>> I'm not saying your patch is wrong. I just wanted to give you a hint about the history of the stuff you are changing. :-)
>>
>> Sure, and that's much appreciated! :-)
>>
>>> If it is working I'd really prefer it over the current situation.
>>
>> Right. I've got to leave now. But I'll produce some zombies on Monday, and will see if they move. :-D
>
> Here you go:
>
> http://pastebin.com/r93TGCZU
>
> Is that "vm1 -b-" --> "(null) ---" domain zombie enough?

Yes, seems to be okay for a zombie.

> If yes, as you can see, it moves to cpupool0 while being destroyed, and I can destroy the pool without problems.

Great.

> I also shutdown the system with the domain still there (in cpupool0), and it works.

Fine. You can add

Acked-by: Juergen Gross <jgross@suse.com>

for this patch then.

Thanks,
Juergen
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: Dario Faggioli @ 2016-07-28 17:29 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote:
> Acked-by: Juergen Gross <jgross@suse.com>
>
> for this patch then.

George,

Ping about this series. It's not terribly urgent, but it should be easy enough, so I guess there is a chance that you can have a quick look.

If you can't, sorry for the noise, I'll re-ping you in a bit. :-)

Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: George Dunlap @ 2016-08-03 11:54 UTC
  To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jul 28, 2016 at 6:29 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote:
>> Acked-by: Juergen Gross <jgross@suse.com>
>>
>> for this patch then.
>
> George,
>
> Ping about this series.

Dario,

Somehow I can only find patch 1/2 in either Google or my Citrix mailbox. What's the title of the 2nd patch?

 -George
* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  From: George Dunlap @ 2016-08-03 12:27 UTC
  To: Juergen Gross, Dario Faggioli, xen-devel; Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 18/07/16 15:09, Juergen Gross wrote:
> On 18/07/16 16:03, Dario Faggioli wrote:
>> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
>>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
>>>> On 15/07/16 13:52, Dario Faggioli wrote:
>>>>> Therefore, I still think this patch is correct, but I'm up for investigating further and finding a way to solve the "zombie in cpupool" issue as well.
>>>> I'm not saying your patch is wrong. I just wanted to give you a hint about the history of the stuff you are changing. :-)
>>>
>>> Sure, and that's much appreciated! :-)
>>>
>>>> If it is working I'd really prefer it over the current situation.
>>>
>>> Right. I've got to leave now. But I'll produce some zombies on Monday, and will see if they move. :-D
>>
>> Here you go:
>>
>> http://pastebin.com/r93TGCZU
>>
>> Is that "vm1 -b-" --> "(null) ---" domain zombie enough?
>
> Yes, seems to be okay for a zombie.
>
>> If yes, as you can see, it moves to cpupool0 while being destroyed, and I can destroy the pool without problems.
>
> Great.
>
>> I also shutdown the system with the domain still there (in cpupool0), and it works.
>
> Fine. You can add
>
> Acked-by: Juergen Gross <jgross@suse.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

And queued.

 -George
* [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools
  From: Dario Faggioli @ 2016-07-14 16:18 UTC
  To: xen-devel; +Cc: Juergen Gross

If the domain is already where we want it to go, there's not much to do indeed.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Juergen Gross <jgross@suse.com>
---
 xen/common/cpupool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 5dacc61..9998394 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -232,6 +232,9 @@ static int cpupool_move_domain_locked(struct domain *d, struct cpupool *c)
 {
     int ret;
 
+    if ( unlikely(d->cpupool == c) )
+        return 0;
+
     d->cpupool->n_dom--;
     ret = sched_move_domain(d, c);
     if ( ret )
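As a self-contained sketch of what the new early exit skips (stand-in types; the failure/rollback tail is a plausible reconstruction of the function's surrounding context, not shown in full in the diff above): without the check, a "move" to the pool the domain already lives in would still touch the n_dom counters and go through sched_move_domain().

```c
#include <stdio.h>

struct cpupool { int n_dom; };
struct domain  { struct cpupool *cpupool; };

/* Stand-in for the real vCPU/scheduler migration work. */
static int sched_move_domain(struct domain *d, struct cpupool *c)
{
    d->cpupool = c;
    return 0;
}

static int cpupool_move_domain_locked(struct domain *d, struct cpupool *c)
{
    int ret;

    if ( d->cpupool == c )       /* the new early exit: nothing to do */
        return 0;

    d->cpupool->n_dom--;
    ret = sched_move_domain(d, c);
    if ( ret )
        d->cpupool->n_dom++;     /* failed: restore the old pool's count */
    else
        c->n_dom++;              /* moved: account the domain to 'c' */

    return ret;
}

int main(void)
{
    struct cpupool a = { .n_dom = 1 }, b = { .n_dom = 0 };
    struct domain d = { .cpupool = &a };

    cpupool_move_domain_locked(&d, &a);  /* no-op: hits the early exit */
    cpupool_move_domain_locked(&d, &b);  /* real move */
    printf("a.n_dom=%d b.n_dom=%d\n", a.n_dom, b.n_dom);  /* 0 and 1 */
    return 0;
}
```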
* Re: [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools
  From: Juergen Gross @ 2016-07-15 9:39 UTC
  To: Dario Faggioli, xen-devel

On 14/07/16 18:18, Dario Faggioli wrote:
> If the domain is already where we want it to go, there's not much to do indeed.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Juergen Gross <jgross@suse.com>

Juergen