xen-devel.lists.xenproject.org archive mirror
* [PATCH v2 0/2] xen: cpupool (small) improvement and (latent) bug fix
@ 2016-07-14 16:17 Dario Faggioli
  2016-07-14 16:18 ` [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy Dario Faggioli
  2016-07-14 16:18 ` [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools Dario Faggioli
  0 siblings, 2 replies; 16+ messages in thread
From: Dario Faggioli @ 2016-07-14 16:17 UTC
  To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich

Hi everyone,

Version 2, with only patch 1 changed to follow Andrew's suggestion of directly
moving the cpupool_*_domain() functions into the sched_*_domain() functions.

That required changing the signature of sched_init_domain(), but I think the
final state of the code is indeed more robust.

Thanks and Regards,
Dario
---
Dario Faggioli (2):
      xen: fix a (latent) cpupool-related race during domain destroy
      xen: cpupool: small optimization when moving between pools

 xen/common/cpupool.c    |    3 +++
 xen/common/domain.c     |    7 +------
 xen/common/schedule.c   |   13 ++++++++++++-
 xen/include/xen/sched.h |    2 +-
 4 files changed, 17 insertions(+), 8 deletions(-)
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-14 16:17 [PATCH v2 0/2] xen: cpupool (small) improvement and (latent) bug fix Dario Faggioli
@ 2016-07-14 16:18 ` Dario Faggioli
  2016-07-14 17:08   ` Andrew Cooper
  2016-07-15  9:38   ` Juergen Gross
  2016-07-14 16:18 ` [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools Dario Faggioli
  1 sibling, 2 replies; 16+ messages in thread
From: Dario Faggioli @ 2016-07-14 16:18 UTC
  To: xen-devel; +Cc: Juergen Gross, Andrew Cooper, George Dunlap, Jan Beulich

So, during domain destruction, we do:
 cpupool_rm_domain()    [ in domain_destroy() ]
 sched_destroy_domain() [ in complete_domain_destroy() ]

Therefore, there's a window during which, from the
scheduler's point of view, a domain sits outside
of any cpupool.

In fact, cpupool_rm_domain() sets d->cpupool=NULL,
and we don't allow that to be true for anything
but the idle domain (and there are, in fact, ASSERT()s
and BUG_ON()s to that effect).

Currently, we never really check d->cpupool during the
window, but that does not mean the race is not there.
For instance, Credit2 at some point (during load balancing)
iterates over the list of domains, and if we add logic that
needs to check d->cpupool, and any one of them has had
cpupool_rm_domain() called on it already... Boom!

(In fact, calling __vcpu_has_soft_affinity() from inside
balance_load() makes `xl shutdown <domid>' reliably
crash, and this is how I discovered this.)
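
To make the hazard concrete, here is a minimal sketch of the
kind of access that would blow up during the window (hypothetical
code, not an excerpt from Credit2; consider_for_balancing() is a
made-up helper):

 unsigned int cpu = smp_processor_id();
 struct domain *d;

 for_each_domain ( d )
 {
     /*
      * If cpupool_rm_domain() has already run for d, then
      * d->cpupool is NULL here, and the dereference crashes.
      */
     if ( cpumask_test_cpu(cpu, d->cpupool->cpu_valid) )
         consider_for_balancing(d);
 }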

On the other hand, cpupool_rm_domain() "only" does
cpupool-related bookkeeping, and there's no harm in
postponing it a little bit.

Also, considering that, during domain initialization,
we do:
 cpupool_add_domain()
 sched_init_domain()

It makes sense for the destruction path to look like
the opposite of it, i.e.:
 sched_destroy_domain()
 cpupool_rm_domain()

And hence that's what this patch does.

Actually, for better robustness, what we really do is
move both cpupool_add_domain() and cpupool_rm_domain()
inside sched_init_domain() and sched_destroy_domain(),
respectively (and also add a couple of ASSERT()s).
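
The resulting call order, as per the diff below, is:

 domain_create():
   sched_init_domain()
     cpupool_add_domain()
     SCHED_OP(DOM2OP(d), init_domain, d)

 complete_domain_destroy():
   sched_destroy_domain()
     SCHED_OP(DOM2OP(d), destroy_domain, d)
     cpupool_rm_domain()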

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes from v1:
 * followed Andrew's suggestion of moving the cpupool functions inside the
   sched functions;
 * reworded the changelog.
---
 xen/common/domain.c     |    7 +------
 xen/common/schedule.c   |   13 ++++++++++++-
 xen/include/xen/sched.h |    2 +-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 42c07ee..339ee56 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -379,10 +379,7 @@ struct domain *domain_create(domid_t domid, unsigned int domcr_flags,
         goto fail;
     init_status |= INIT_arch;
 
-    if ( (err = cpupool_add_domain(d, poolid)) != 0 )
-        goto fail;
-
-    if ( (err = sched_init_domain(d)) != 0 )
+    if ( (err = sched_init_domain(d, poolid)) != 0 )
         goto fail;
 
     if ( (err = late_hwdom_init(d)) != 0 )
@@ -868,8 +865,6 @@ void domain_destroy(struct domain *d)
 
     TRACE_1D(TRC_DOM0_DOM_REM, d->domain_id);
 
-    cpupool_rm_domain(d);
-
     /* Delete from task list and task hashtable. */
     spin_lock(&domlist_update_lock);
     pd = &domain_list;
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 7ac12d3..852f840 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -379,8 +379,15 @@ void sched_destroy_vcpu(struct vcpu *v)
     SCHED_OP(VCPU2OP(v), free_vdata, v->sched_priv);
 }
 
-int sched_init_domain(struct domain *d)
+int sched_init_domain(struct domain *d, int poolid)
 {
+    int ret;
+
+    ASSERT(d->cpupool == NULL);
+
+    if ( (ret = cpupool_add_domain(d, poolid)) )
+        return ret;
+
     SCHED_STAT_CRANK(dom_init);
     TRACE_1D(TRC_SCHED_DOM_ADD, d->domain_id);
     return SCHED_OP(DOM2OP(d), init_domain, d);
@@ -388,9 +395,13 @@ int sched_init_domain(struct domain *d)
 
 void sched_destroy_domain(struct domain *d)
 {
+    ASSERT(d->cpupool != NULL || is_idle_domain(d));
+
     SCHED_STAT_CRANK(dom_destroy);
     TRACE_1D(TRC_SCHED_DOM_REM, d->domain_id);
     SCHED_OP(DOM2OP(d), destroy_domain, d);
+
+    cpupool_rm_domain(d);
 }
 
 void vcpu_sleep_nosync(struct vcpu *v)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 46c82e7..888bc19 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -637,7 +637,7 @@ void noreturn asm_domain_crash_synchronous(unsigned long addr);
 void scheduler_init(void);
 int  sched_init_vcpu(struct vcpu *v, unsigned int processor);
 void sched_destroy_vcpu(struct vcpu *v);
-int  sched_init_domain(struct domain *d);
+int  sched_init_domain(struct domain *d, int poolid);
 void sched_destroy_domain(struct domain *d);
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);



* [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools
  2016-07-14 16:17 [PATCH v2 0/2] xen: cpupool (small) improvement and (latent) bug fix Dario Faggioli
  2016-07-14 16:18 ` [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy Dario Faggioli
@ 2016-07-14 16:18 ` Dario Faggioli
  2016-07-15  9:39   ` Juergen Gross
  1 sibling, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2016-07-14 16:18 UTC
  To: xen-devel; +Cc: Juergen Gross

If the domain is already where we want it to go,
there's not much to do indeed.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Cc: Juergen Gross <jgross@suse.com>
---
 xen/common/cpupool.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
index 5dacc61..9998394 100644
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -232,6 +232,9 @@ static int cpupool_move_domain_locked(struct domain *d, struct cpupool *c)
 {
     int ret;
 
+    if ( unlikely(d->cpupool == c) )
+        return 0;
+
     d->cpupool->n_dom--;
     ret = sched_move_domain(d, c);
     if ( ret )



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-14 16:18 ` [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy Dario Faggioli
@ 2016-07-14 17:08   ` Andrew Cooper
  2016-07-15  9:38   ` Juergen Gross
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Cooper @ 2016-07-14 17:08 UTC
  To: Dario Faggioli, xen-devel; +Cc: Juergen Gross, George Dunlap, Jan Beulich

On 14/07/16 17:18, Dario Faggioli wrote:
> So, during domain destruction, we do:
>  cpupool_rm_domain()    [ in domain_destroy() ]
>  sched_destroy_domain() [ in complete_domain_destroy() ]
>
> Therefore, there's a window during which, from the
> scheduler's point of view, a domain sits outside
> of any cpupool.
>
> In fact, cpupool_rm_domain() sets d->cpupool=NULL,
> and we don't allow that to be true for anything
> but the idle domain (and there are, in fact, ASSERT()s
> and BUG_ON()s to that effect).
>
> Currently, we never really check d->cpupool during the
> window, but that does not mean the race is not there.
> For instance, Credit2 at some point (during load balancing)
> iterates over the list of domains, and if we add logic that
> needs to check d->cpupool, and any one of them has had
> cpupool_rm_domain() called on it already... Boom!
>
> (In fact, calling __vcpu_has_soft_affinity() from inside
> balance_load() makes `xl shutdown <domid>' reliably
> crash, and this is how I discovered this.)
>
> On the other hand, cpupool_rm_domain() "only" does
> cpupool-related bookkeeping, and there's no harm in
> postponing it a little bit.
>
> Also, considering that, during domain initialization,
> we do:
>  cpupool_add_domain()
>  sched_init_domain()
>
> It makes sense for the destruction path to look like
> the opposite of it, i.e.:
>  sched_destroy_domain()
>  cpupool_rm_domain()
>
> And hence that's what this patch does.
>
> Actually, for better robustness, what we really do is
> move both cpupool_add_domain() and cpupool_rm_domain()
> inside sched_init_domain() and sched_destroy_domain(),
> respectively (and also add a couple of ASSERT()s).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

This definitely looks better, although I will leave the in-depth review
to people more familiar with the scheduler internals.


* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-14 16:18 ` [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy Dario Faggioli
  2016-07-14 17:08   ` Andrew Cooper
@ 2016-07-15  9:38   ` Juergen Gross
  2016-07-15 10:14     ` Dario Faggioli
  1 sibling, 1 reply; 16+ messages in thread
From: Juergen Gross @ 2016-07-15  9:38 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 14/07/16 18:18, Dario Faggioli wrote:
> So, during domain destruction, we do:
>  cpupool_rm_domain()    [ in domain_destroy() ]
>  sched_destroy_domain() [ in complete_domain_destroy() ]
> 
> Therefore, there's a window during which, from the
> scheduler's point of view, a domain sits outside
> of any cpupool.
>
> In fact, cpupool_rm_domain() sets d->cpupool=NULL,
> and we don't allow that to be true for anything
> but the idle domain (and there are, in fact, ASSERT()s
> and BUG_ON()s to that effect).
>
> Currently, we never really check d->cpupool during the
> window, but that does not mean the race is not there.
> For instance, Credit2 at some point (during load balancing)
> iterates over the list of domains, and if we add logic that
> needs to check d->cpupool, and any one of them has had
> cpupool_rm_domain() called on it already... Boom!
> 
> (In fact, calling __vcpu_has_soft_affinity() from inside
> balance_load() makes `xl shutdown <domid>' reliably
> crash, and this is how I discovered this.)
> 
> On the other hand, cpupool_rm_domain() "only" does
> cpupool-related bookkeeping, and there's no harm in
> postponing it a little bit.
> 
> Also, considering that, during domain initialization,
> we do:
>  cpupool_add_domain()
>  sched_init_domain()
> 
> It makes sense for the destruction path to look like
> the opposite of it, i.e.:
>  sched_destroy_domain()
>  cpupool_rm_domain()
> 
> And hence that's what this patch does.
> 
> Actually, for better robustness, what we really do is
> move both cpupool_add_domain() and cpupool_rm_domain()
> inside sched_init_domain() and sched_destroy_domain(),
> respectively (and also add a couple of ASSERT()s).
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Hmm, are you aware of commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a
which explicitly moved cpupool_rm_domain() to the place from which you are
removing it again? Please verify that the scenario mentioned in the
description of that commit is still working with your patch.


Juergen


* Re: [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools
  2016-07-14 16:18 ` [PATCH v2 2/2] xen: cpupool: small optimization when moving between pools Dario Faggioli
@ 2016-07-15  9:39   ` Juergen Gross
  0 siblings, 0 replies; 16+ messages in thread
From: Juergen Gross @ 2016-07-15  9:39 UTC
  To: Dario Faggioli, xen-devel

On 14/07/16 18:18, Dario Faggioli wrote:
> If the domain is already where we want it to go,
> there's not much to do indeed.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen


* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-15  9:38   ` Juergen Gross
@ 2016-07-15 10:14     ` Dario Faggioli
  2016-07-15 10:36       ` Juergen Gross
  0 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2016-07-15 10:14 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich


On Fri, 2016-07-15 at 11:38 +0200, Juergen Gross wrote:
> Hmm, are you aware of commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a
> which explicitly moved cpupool_rm_domain() to the place from which you are
> removing it again? Please verify that the scenario mentioned in the
> description of that commit is still working with your patch.
> 
Sorry, but I only partly see the problem.

In particular, I'm probably not fully understanding, from that commit
changelog, what set of operations/commands I should run to check
whether or not I reintroduced the issue.

What I did so far is as follows:

root@Zhaman:~# xl cpupool-list 
Name               CPUs   Sched     Active   Domain count
Pool-0              12    credit       y          1
Pool-credit          4    credit       y          1
root@Zhaman:~# xl list -c
Name                                        ID   Mem VCPUs	State	Time(s)         Cpupool
Domain-0                                     0  1019    16     r-----      34.5          Pool-0
vm1                                          1  4096     4     -b----       9.7     Pool-credit
root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
Some cpus may have not or only partially been removed from 'Pool-credit'.
If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
root@Zhaman:~# xl cpupool-list -c
Name               CPU list
Pool-0             0,1,2,3,4,5,10,11,12,13,14,15
Pool-credit        9
root@Zhaman:~# xl shutdown vm1
Shutting down domain 1
root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
root@Zhaman:~# xl cpupool-list -c
Name               CPU list
Pool-0             0,1,2,3,4,5,10,11,12,13,14,15
Pool-credit

If (with vm1 still in Pool-credit), I do this, it indeed fails:

root@Zhaman:~# xl shutdown vm1 & xl cpupool-cpu-remove Pool-credit all
[1] 3275
Shutting down domain 2
libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
Some cpus may have not or only partially been removed from 'Pool-credit'.
If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
[1]+  Done                    xl shutdown vm1
root@Zhaman:~# xl cpupool-list -c
Name               CPU list
Pool-0             0,1,2,3,4,5,10,11,12,13,14,15
Pool-credit        9

But that does not look too strange to me, as it's entirely possible
that the domain has not been moved yet, when we try to remove the last
cpu. And in fact, after the domain has properly shutdown:

root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
root@Zhaman:~# xl cpupool-list 
Name               CPUs   Sched     Active   Domain count
Pool-0              12    credit       y          1
Pool-credit          0    credit       y          0

And in fact, looking at the code introduced by that commit, the
important part, to me, seems to be the moving of the domain to
cpupool0, which is indeed the right thing to do. OTOH, what I am seeing
and fixing happens (well, could happen) all the time, even when the
domain being shut down is already in cpupool0, and (as you say yourself
in your changelog) there is no such issue as removing the last cpu of
cpupool0.

What am I missing?

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-15 10:14     ` Dario Faggioli
@ 2016-07-15 10:36       ` Juergen Gross
  2016-07-15 11:52         ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: Juergen Gross @ 2016-07-15 10:36 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 15/07/16 12:14, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 11:38 +0200, Juergen Gross wrote:
>> Hmm, are you aware of commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a
>> which explicitly moved cpupool_rm_domain() to the place from which you are
>> removing it again? Please verify that the scenario mentioned in the
>> description of that commit is still working with your patch.
>>
> Sorry, but I only partly see the problem.
> 
> In particular, I'm probably not fully understanding, from that commit
> changelog, what set of operations/commands I should run to check
> whether or not I reintroduced the issue.

You need to create a domain in a cpupool and destroy it again while
some dom0 process still is holding a reference to it (resulting in a
zombie domain). Then try to destroy the cpupool.
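
Something along these lines (an illustrative sketch only: the file
names are made up, and how exactly the dom0 reference is held is
environment specific, e.g. a process keeping some of the guest's
memory mapped):

 xl cpupool-create pool-test.cfg        # hypothetical cfg defining Pool-test
 xl create vm1.cfg 'pool="Pool-test"'
 # ... have a dom0 process hold a reference to vm1 here ...
 xl destroy vm1                         # vm1 lingers as a zombie
 xl cpupool-destroy Pool-test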

> 
> What I did so far is as follows:
> 
> root@Zhaman:~# xl cpupool-list 
> Name               CPUs   Sched     Active   Domain count
> Pool-0              12    credit       y          1
> Pool-credit          4    credit       y          1
> root@Zhaman:~# xl list -c
> Name                                        ID   Mem VCPUs	State	Time(s)         Cpupool
> Domain-0                                     0  1019    16     r-----      34.5          Pool-0
> vm1                                          1  4096     4     -b----       9.7     Pool-credit
> root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
> libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
> Some cpus may have not or only partially been removed from 'Pool-credit'.
> If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
> root@Zhaman:~# xl cpupool-list -c
> Name               CPU list
> Pool-0             0,1,2,3,4,5,10,11,12,13,14,15
> Pool-credit        9
> root@Zhaman:~# xl shutdown vm1
> Shutting down domain 1
> root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
> root@Zhaman:~# xl cpupool-list -c
> Name               CPU list
> Pool-0             0,1,2,3,4,5,10,11,12,13,14,15
> Pool-credit
> 
> If (with vm1 still in Pool-credit), I do this, it indeed fails:
> 
> root@Zhaman:~# xl shutdown vm1 & xl cpupool-cpu-remove Pool-credit all
> [1] 3275
> Shutting down domain 2
> libxl: error: libxl.c:6998:libxl_cpupool_cpuremove: Error removing cpu 9 from cpupool: Device or resource busy
> Some cpus may have not or only partially been removed from 'Pool-credit'.
> If a cpu can't be added to another cpupool, add it to 'Pool-credit' again and retry.
> [1]+  Done                    xl shutdown vm1
> root@Zhaman:~# xl cpupool-list -c
> Name               CPU list
> Pool-0             0,1,2,3,4,5,10,11,12,13,14,15
> Pool-credit        9
> 
> But that does not look too strange to me, as it's entirely possible
> that the domain has not been moved yet, when we try to remove the last
> cpu. And in fact, after the domain has properly shutdown:
> 
> root@Zhaman:~# xl cpupool-cpu-remove Pool-credit all
> root@Zhaman:~# xl cpupool-list 
> Name               CPUs   Sched     Active   Domain count
> Pool-0              12    credit       y          1
> Pool-credit          0    credit       y          0
> 
> And in fact, looking at the code introduced by that commit, the
> important part, to me, seems to be the moving of the domain to
> cpupool0, which is indeed the right thing to do. OTOH, what I am seeing
> and fixing happens (well, could happen) all the time, even when the
> domain being shut down is already in cpupool0, and (as you say yourself
> in your changelog) there is no such issue as removing the last cpu of
> cpupool0.
> 
> What am I missing?

The domain being a zombie domain might change the picture. Moving it to
cpupool0 was failing before my patch and it might do so again with your
patch applied.


Juergen


* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-15 10:36       ` Juergen Gross
@ 2016-07-15 11:52         ` Dario Faggioli
  2016-07-15 12:52           ` Juergen Gross
  0 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2016-07-15 11:52 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich


On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
> On 15/07/16 12:14, Dario Faggioli wrote:
> > In particular, I'm probably not fully understanding, from that commit
> > changelog, what set of operations/commands I should run to check
> > whether or not I reintroduced the issue.
> You need to create a domain in a cpupool and destroy it again while
> some dom0 process still is holding a reference to it (resulting in a
> zombie domain). Then try to destroy the cpupool.
> 
Ah, I see. I wasn't getting the fact that it needed to be a zombie
domain from anywhere.

> > What am I missing?
> The domain being a zombie domain might change the picture. Moving it
> to cpupool0 was failing before my patch and it might do so again
> with your patch applied.
> 
Mmmm... I don't immediately see the reason why moving a zombie domain
fails either, but I guess I'll have to try.

But then, correct me if I'm wrong, the situation is like this:
 - right now there's a (potential) race between domain's scheduling 
   data destruction and domain removal from a cpupool;
 - with my patch, the race goes away, but we risk not being able to 
   destroy a cpupool with a zombie domain in it.

I don't think we want to be in either of these two situations. :-(

The race has never triggered so far, but I already have patches for
Credit2 (finishing the implementation of soft affinity for it) that
make it a real thing.

I think I can work around that specific case by doing something like
the below:

diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index bc0e794..d91fd70 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -193,12 +193,9 @@ struct cpupool
 
 static inline cpumask_t* cpupool_domain_cpumask(struct domain *d)
 {
-    /*
-     * d->cpupool is NULL only for the idle domain, and no one should
-     * be interested in calling this for the idle domain.
-     */
-    ASSERT(d->cpupool != NULL);
-    return d->cpupool->cpu_valid;
+    /* No one should be interested in calling this for the idle domain! */
+    ASSERT(!is_idle_domain(d));
+    return d->cpupool ? d->cpupool->cpu_valid : cpupool0->cpu_valid;
 }
 
 #endif /* __XEN_SCHED_IF_H__ */

But even if that were acceptable (what do you think?), I still
don't like having the race lurking there. :-/

Therefore, I still think this patch is correct, but I'm up for
investigating further and finding a way to solve the "zombie in
cpupool" issue as well.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-15 11:52         ` Dario Faggioli
@ 2016-07-15 12:52           ` Juergen Gross
  2016-07-15 14:23             ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: Juergen Gross @ 2016-07-15 12:52 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 15/07/16 13:52, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
>> On 15/07/16 12:14, Dario Faggioli wrote:
>>> In particular, I'm probably not fully understanding, from that commit
>>> changelog, what set of operations/commands I should run to check
>>> whether or not I reintroduced the issue.
>> You need to create a domain in a cpupool and destroy it again while
>> some dom0 process still is holding a reference to it (resulting in a
>> zombie domain). Then try to destroy the cpupool.
>>
> Ah, I see. I wasn't getting the fact that it needed to be a zombie
> domain from anywhere.
> 
>>> What am I missing?
>> The domain being a zombie domain might change the picture. Moving it
>> to cpupool0 was failing before my patch and it might do so again
>> with your patch applied.
>>
> Mmmm... I don't immediately see the reason why moving a zombie domain
> fails either, but I guess I'll have to try.

Searching through the history I found commit
934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4 which might have removed the
problematic condition (cpupool->n_dom being non-zero while d->cpupool
was NULL already).

> But then, correct me if I'm wrong, the situation is like this:
>  - right now there's a (potential) race between domain's scheduling 
>    data destruction and domain removal from a cpupool;
>  - with my patch, the race goes away, but we risk not being able to 
>    destroy a cpupool with a zombie domain in it.

This one has been observed.

I do remember the following critical cases:

- removing a cpupool with a zombie domain
- shutting down the system with a domain in a cpupool
- not sure about the combination of both cases (shutting down with
  zombie domain in a cpupool): is this even possible without another
  bug in the hypervisor or dom0?

> Therefore, I still think this patch is correct, but I'm up for
> investigating further and finding a way to solve the "zombie in
> cpupool" issue as well.

I'm not saying your patch is wrong. I just wanted to give you a hint
about the history of the stuff you are changing. :-)

If it is working I'd really prefer it over the current situation.


Juergen


* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-15 12:52           ` Juergen Gross
@ 2016-07-15 14:23             ` Dario Faggioli
  2016-07-18 14:03               ` Dario Faggioli
  0 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2016-07-15 14:23 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich


On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
> On 15/07/16 13:52, Dario Faggioli wrote:
> > 
> > On Fri, 2016-07-15 at 12:36 +0200, Juergen Gross wrote:
> > > 
> > > On 15/07/16 12:14, Dario Faggioli wrote:
> > > > 
> > > > In particular, I'm probably not fully understanding, from that
> > > > commit changelog, what set of operations/commands I should run
> > > > to check whether or not I reintroduced the issue.
> > > You need to create a domain in a cpupool and destroy it again
> > > while some dom0 process still is holding a reference to it
> > > (resulting in a zombie domain). Then try to destroy the cpupool.
> > > 
> > Ah, I see. I wasn't getting the fact that it needed to be a zombie
> > domain from anywhere.
> > 
> > > 
> > > > 
> > > > What am I missing?
> > > The domain being a zombie domain might change the picture. Moving
> > > it to cpupool0 was failing before my patch and it might do so
> > > again with your patch applied.
> > > 
> > Mmmm... I don't immediately see the reason why moving a zombie
> > domain fails either, but I guess I'll have to try.
> Searching through the history I found commit
> 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4 which might have removed the
> problematic condition (cpupool->n_dom being non-zero while d->cpupool
> was NULL already).
> 
Yeah... And there's also this:

/*
 * unassign a specific cpu from a cpupool
 * we must be sure not to run on the cpu to be unassigned! to achieve this
 * the main functionality is performed via continue_hypercall_on_cpu on a
 * specific cpu.
 * if the cpu to be removed is the last one of the cpupool no active domain
 * must be bound to the cpupool. dying domains are moved to cpupool0 as they
 * might be zombies.
 * possible failures:
 * - last cpu and still active domains in cpupool
 * - cpu just being unplugged
 */
static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
{
    int work_cpu;
    int ret;
    struct domain *d;

    cpupool_dprintk("cpupool_unassign_cpu(pool=%d,cpu=%d)\n",
                    c->cpupool_id, cpu);
    [...]
        for_each_domain_in_cpupool(d, c)
        {
            if ( !d->is_dying )
            {
                ret = -EBUSY;
                break;
            }
            ret = cpupool_move_domain_locked(d, cpupool0);
            if ( ret )
                break;
        }
    [...]

So it really looks like it ought to be possible to move zombies to
cpupool0 these days. :-)

> > Therefore, I still think this patch is correct, but I'm up for
> > investigating further and finding a way to solve the "zombie in
> > cpupool" issue as well.
> I'm not saying your patch is wrong. I just wanted to give you a hint
> about the history of the stuff you are changing. :-)
> 
Sure, and that's much appreciated! :-)

> If it is working I'd really prefer it over the current situation.
> 
Right. I've got to leave now. But I'll produce some zombies on Monday,
and will see if they move. :-D

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-15 14:23             ` Dario Faggioli
@ 2016-07-18 14:03               ` Dario Faggioli
  2016-07-18 14:09                 ` Juergen Gross
  0 siblings, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2016-07-18 14:03 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich


On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
> > On 15/07/16 13:52, Dario Faggioli wrote:
> > > Therefore, I still think this patch is correct, but I'm up for
> > > investigating further and finding a way to solve the "zombie in
> > > cpupool" issue as well.
> > I'm not saying your patch is wrong. I just wanted to give you a
> > hint about the history of the stuff you are changing. :-)
> > 
> Sure, and that's much appreciated! :-)
> 
> > If it is working I'd really prefer it over the current situation.
> > 
> Right. I've got to leave now. But I'll produce some zombies on
> Monday, and will see if they move. :-D
> 
Here you go:

http://pastebin.com/r93TGCZU

Is that "vm1 -b-" --> "(null) ---" domain zombie enough?

If yes, as you can see, it moves to cpupool0 while being destroyed, and
I can destroy the pool without problems.

I also shutdown the system with the domain still there (in cpupool0),
and it works.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-18 14:03               ` Dario Faggioli
@ 2016-07-18 14:09                 ` Juergen Gross
  2016-07-28 17:29                   ` Dario Faggioli
  2016-08-03 12:27                   ` George Dunlap
  0 siblings, 2 replies; 16+ messages in thread
From: Juergen Gross @ 2016-07-18 14:09 UTC
  To: Dario Faggioli, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 18/07/16 16:03, Dario Faggioli wrote:
> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
>>> On 15/07/16 13:52, Dario Faggioli wrote:
>>>> Therefore, I still think this patch is correct, but I'm up for
>>>> investigating further and finding a way to solve the "zombie in
>>>> cpupool" issue as well.
>>> I'm not saying your patch is wrong. I just wanted to give you a
>>> hint about the history of the stuff you are changing. :-)
>>>
>> Sure, and that's much appreciated! :-)
>>
>>> If it is working I'd really prefer it over the current situation.
>>>
>> Right. I've got to leave now. But I'll produce some zombies on
>> Monday, and will see if they move. :-D
>>
> Here you go:
> 
> http://pastebin.com/r93TGCZU
> 
> Is that "vm1 -b-" --> "(null) ---" domain zombie enough?

Yes, seems to be okay for a zombie.

> If yes, as you can see, it moves to cpupool0 while being destroyed, and
> I can destroy the pool without problems.

Great.

> I also shutdown the system with the domain still there (in cpupool0),
> and it works.

Fine. You can add

Acked-by: Juergen Gross <jgross@suse.com>

for this patch then.


Thanks, Juergen



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-18 14:09                 ` Juergen Gross
@ 2016-07-28 17:29                   ` Dario Faggioli
  2016-08-03 11:54                     ` George Dunlap
  2016-08-03 12:27                   ` George Dunlap
  1 sibling, 1 reply; 16+ messages in thread
From: Dario Faggioli @ 2016-07-28 17:29 UTC
  To: Juergen Gross, xen-devel; +Cc: George Dunlap, Andrew Cooper, Jan Beulich


On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote:
> Acked-by: Juergen Gross <jgross@suse.com>
> 
> for this patch then.
> 
George,

Ping about this series.

It's not terribly urgent, but it should be easy enough, so I guess
there is a chance that you can have a quick look.

If you can't, sorry for the noise, I'll re-ping you in a bit. :-)

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-28 17:29                   ` Dario Faggioli
@ 2016-08-03 11:54                     ` George Dunlap
  0 siblings, 0 replies; 16+ messages in thread
From: George Dunlap @ 2016-08-03 11:54 UTC
  To: Dario Faggioli; +Cc: Juergen Gross, xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jul 28, 2016 at 6:29 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On Mon, 2016-07-18 at 16:09 +0200, Juergen Gross wrote:
>> Acked-by: Juergen Gross <jgross@suse.com>
>>
>> for this patch then.
>>
> George,
>
> Ping about this series.

Dario,

Somehow I can only find patch 1/2 in either Google or my Citrix
mailbox.  What's the title of the 2nd patch?

 -George


* Re: [PATCH v2 1/2] xen: fix a (latent) cpupool-related race during domain destroy
  2016-07-18 14:09                 ` Juergen Gross
  2016-07-28 17:29                   ` Dario Faggioli
@ 2016-08-03 12:27                   ` George Dunlap
  1 sibling, 0 replies; 16+ messages in thread
From: George Dunlap @ 2016-08-03 12:27 UTC
  To: Juergen Gross, Dario Faggioli, xen-devel
  Cc: George Dunlap, Andrew Cooper, Jan Beulich

On 18/07/16 15:09, Juergen Gross wrote:
> On 18/07/16 16:03, Dario Faggioli wrote:
>> On Fri, 2016-07-15 at 16:23 +0200, Dario Faggioli wrote:
>>> On Fri, 2016-07-15 at 14:52 +0200, Juergen Gross wrote:
>>>> On 15/07/16 13:52, Dario Faggioli wrote:
>>>>> Therefore, I still think this patch is correct, but I'm up for
>>>>> investigating further and finding a way to solve the "zombie in
>>>>> cpupool" issue as well.
>>>> I'm not saying your patch is wrong. I just wanted to give you a
>>>> hint about the history of the stuff you are changing. :-)
>>>>
>>> Sure, and that's much appreciated! :-)
>>>
>>>> If it is working I'd really prefer it over the current situation.
>>>>
>>> Right. I've got to leave now. But I'll produce some zombies on
>>> Monday, and will see if they move. :-D
>>>
>> Here you go:
>>
>> http://pastebin.com/r93TGCZU
>>
>> Is that "vm1 -b-" --> "(null) ---" domain zombie enough?
> 
> Yes, seems to be okay for a zombie.
> 
>> If yes, as you can see, it moves to cpupool0 while being destroyed, and
>> I can destroy the pool without problems.
> 
> Great.
> 
>> I also shutdown the system with the domain still there (in cpupool0),
>> and it works.
> 
> Fine. You can add
> 
> Acked-by: Juergen Gross <jgross@suse.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

And queued.

 -George


