On Mon, 2015-03-09 at 11:45 +0000, George Dunlap wrote:
> On 03/09/2015 07:11 AM, Justin Weaver wrote:
> > I don't think there's any way the mask can be empty after the
> > cpumask_and with the cpu_hard_affinity and the VCPU2ONLINE. In
> > schedule.c:vcpu_set_hard_affinity, if the intersection of the new hard
> > mask and VCPU2ONLINE is empty, the function returns -EINVAL. I took
> > into account all the discussion here and modified the function for v3.
> 
> What about this:
> 
>  * Create a cpu pool with cpus 0 and 1.  online_cpus is now [0,1].
>  * Set a hard affinity of [1].  This succeeds.
> 
So, I decided to try this:

# xl cpupool-list -c
Name               CPU list
Pool-0             0,1

# xl list -c
Name                                        ID   Mem VCPUs      State   Time(s) Cpupool
Domain-0                                     0   507     4     r-----      19.9 Pool0
test-pv                                      1  2000     8     -b----      19.2 Pool0

# xl vcpu-list test-pv
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
test-pv                              1     0    1   -b-       5.5  1 / 1
test-pv                              1     1    1   -b-       2.4  1 / 1
test-pv                              1     2    1   -b-       2.9  1 / 1
test-pv                              1     3    1   -b-       1.9  1 / 1
test-pv                              1     4    1   -b-       1.0  1 / 1
test-pv                              1     5    1   -b-       2.0  1 / 1
test-pv                              1     6    1   -b-       1.2  1 / 1
test-pv                              1     7    1   -b-       2.4  1 / 1

# xl cpupool-cpu-remove Pool0 1
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion '!cpumask_empty(dom_cpumask)' failed at domain.c:460
(XEN) ****************************************

i.e., surprise surprise, there's already an assert guarding this... and
it triggers!! :-(

It seems to have been added in v4 of the per-vcpu soft affinity work:
http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=commit;h=4d4e999637f38e0bbd4318ad8e0c92fd1e580430

So, we must have had this discussion before! :-)

I've done a bit of archeology: the ASSERT() in
domain_update_node_affinity() was introduced by me (upon request) while
working on implementing per-vcpu soft affinity. Then, cs 93be8285 is
what causes the assert to trigger. The timing doesn't seem to match,
but that's because soft affinity was put on hold when it was decided
not to include it in 4.4, and I probably forgot to retest with a use
case similar to the above when resubmitting it! :-(

A patch to fix things is attached to this email, for convenience (I'll
submit it properly, with changelog and everything, right away).

After seeing this, I'm even more convinced that
!empty(online && hard_affinity) is really something we want and, as we
ASSERT() it in domain.c, the same should be done in sched_credit2.c.
OTOH, if we don't want to ASSERT() in the specific scheduler code, then
I'd remove the one in domain.c too (and, perhaps, just use online as a
fallback), or things would look inconsistent.

Of course, this can also be seen as proof of George's point, that this
is something not obvious and not easy to catch. Still, I think that if
we say we support hard vcpu affinity (a.k.a. pinning), we should not
allow vcpus to run outside their hard affinity, and the code must
reflect this.

> * Move cpu 1 to a different cpupool.
> 
> After a quick look I don't see anything that updates the hard affinity
> when cpus are removed from pools.
> 
cpupool_unassign_cpu() calls cpupool_unassign_cpu_helper(), which calls
cpu_disable_scheduler(); the latter, if it finds that
empty(online && hard_affinity) == true, resets hard_affinity to "all".
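For reference, here's a rough sketch of how that fallback in
cpu_disable_scheduler() looks (written from memory, so names and
details are approximate, not the exact schedule.c code):

    cpumask_t online_affinity;
    struct vcpu *v;

    for_each_vcpu ( d, v )
    {
        /* Which cpus, still online in the vcpu's pool, can it use? */
        cpumask_and(&online_affinity, v->cpu_hard_affinity, &online);
        if ( cpumask_empty(&online_affinity) &&
             cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
        {
            /* None left: give up on the pinning and log it. */
            printk("Breaking affinity for d%dv%d\n",
                   v->domain->domain_id, v->vcpu_id);
            cpumask_setall(v->cpu_hard_affinity);
        }
    }

(where `d' is the domain being walked, `online' is the mask of cpus
remaining in its cpupool, and `cpu' is the one being removed.)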
Note that this is reported in the log, as confirmation that it really
is something exceptional:

(XEN) Breaking affinity for d1v0
(XEN) Breaking affinity for d1v1
(XEN) Breaking affinity for d1v2
(XEN) Breaking affinity for d1v3
(XEN) Breaking affinity for d1v4
(XEN) Breaking affinity for d1v5
(XEN) Breaking affinity for d1v6
(XEN) Breaking affinity for d1v7

And also note that, as a consequence of fiddling with cpupools, the
affinity has been altered by Xen (i.e., vcpus still always run within
their hard affinity masks):

# xl vcpu-pin 1 all 1 1
# xl cpupool-cpu-remove Pool-0 1
# xl cpupool-list -c
Name               CPU list
Pool-0             0,2,3,4,5,6,7,8,9,10,11,12,13,14,15

# xl vcpu-list test-pv
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
test-pv                              1     0    2   -b-       6.1  all / 1
test-pv                              1     1    6   -b-       1.6  all / 1
test-pv                              1     2    4   -b-       1.6  all / 1
test-pv                              1     3    5   -b-       3.1  all / 1
test-pv                              1     4    4   -b-       0.8  all / 1
test-pv                              1     5    7   -b-       3.0  all / 1
test-pv                              1     6    3   -b-       1.1  all / 1
test-pv                              1     7    3   -b-       0.7  all / 1

That's what drove all the reasoning when changing
domain_update_node_affinity(), and what led to that ASSERT(), during
the soft affinity work.

The reason why the ASSERT() triggers, as can be seen in the patch, is
that, because of the cited changeset, domain_update_node_affinity() is
called too early.

> And, even if it does *now*, it's possible that something might be
> changed in the future that would forget it; this ASSERT() isn't exactly
> next to that code.
> 
But it would help us catch the bug at some point... As proved above! :-D

BTW, I've already written and submitted an OSSTest test involving
cpupools (we don't cover them at all at the moment). I'll see about
adding this kind of thing to it.

> So it seems to me like handling that case makes the software
> more robust for little cost.
> 
Yep, I also agree that it at least won't cost much, in terms of runtime
overhead.

Regards,
Dario