From: Juergen Gross <juergen.gross@ts.fujitsu.com>
To: Stephan Diestelhorst <stephan.diestelhorst@amd.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>,
	"Przywara, Andre" <Andre.Przywara@amd.com>,
	Keir Fraser <keir@xen.org>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Wed, 02 Feb 2011 16:14:25 +0100
Message-ID: <4D4974D1.1080503@ts.fujitsu.com>
In-Reply-To: <201102021539.06664.stephan.diestelhorst@amd.com>

On 02/02/11 15:39, Stephan Diestelhorst wrote:
> Hi folks,
>    long time no see. :-)
>
> On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
>> I asked Stephan Diestelhorst for help, and after I convinced him that
>> removing credit and making SEDF the default again is not an option, he
>> worked together with me on that ;-) Many thanks for that!
>> We haven't come to a final solution but could gather some debug data.
>> I will simply dump some data here, maybe somebody has got a clue. We
>> will work further on this tomorrow.
>
> Andre and I have been looking through this further, in particular sanity
> checking the invariant
>
> prv->weight >= sdom->weight * sdom->active_vcpu_count
>
> each time someone tweaks the active vcpu count. This happens only in
> __csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
> to observe the broken invariant when splitting cpupools.
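
For illustration, the invariant above amounts to the following check, which
would have to hold in __csched_vcpu_acct_start and
__csched_vcpu_acct_stop_locked. This is only a sketch with simplified
stand-in structures, not the real csched_private/csched_dom definitions:

  #include <assert.h>

  /* Simplified stand-ins for the credit scheduler's bookkeeping. */
  struct csched_private { int weight; };                  /* per pool */
  struct csched_dom     { int weight; int active_vcpu_count; };

  /* Must hold whenever active_vcpu_count is tweaked. */
  void check_weight_invariant(const struct csched_private *prv,
                              const struct csched_dom *sdom)
  {
      assert(prv->weight >= sdom->weight * sdom->active_vcpu_count);
  }
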
>
> We have the following theory of what happens:
> * some vcpus of a particular domain are currently in the process of
>    being moved to the new pool

The only _vcpus_ moved between pools are the idle vcpus, and those never
contribute to accounting in the credit scheduler.

We move _pcpus_ only (well, moving a domain between pools moves its vcpus
as well, but the domain is paused while that happens).
On the pcpu being moved the idle vcpu should be running. Obviously you have
found a scenario where this isn't true. I have no idea how that can happen,
as vcpus other than the idle ones are considered for scheduling only if
their pcpu is valid in the cpupool, and the pcpu is set valid only after the
BUG_ON you triggered in your tests.

>
> * some are still left on the old pool (vcpus_old) and some are already
>    in the new pool (vcpus_new)
>
> * we now have vcpus_old->sdom == vcpus_new->sdom and following from this
>    * vcpus_old->sdom->weight == vcpus_new->sdom->weight
>    * vcpus_old->sdom->active_vcpu_count == vcpus_new->sdom->active_vcpu_count
>
> * active_vcpu_count thus does not represent the separation of the
>    actual vcpus (may be the sum, only the old or new ones, does not
>    matter)
>
> * however, sched_old != sched_new, and thus
>    * sched_old->prv != sched_new->prv
>    * sched_old->prv->weight != sched_new->prv->weight
>
> * the prv->weight field hence sees the incremental move of VCPUs
>    (through modifications in *acct_start and *acct_stop_locked)
>
> * if at any point in this half-way migration the scheduler runs
>    csched_acct, it checks the wrong active_vcpu_count (a toy model of
>    this situation is sketched below)
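
To make the theory above concrete, here is a small self-contained toy model
(plain user-space C, not Xen code, and only one possible interleaving) of
exactly that situation: one sdom shared by both pools, one accounting
structure per pool, and the invariant checked half-way through the move:

  #include <stdio.h>

  struct prv  { int weight; };                        /* one per pool        */
  struct sdom { int weight; int active_vcpu_count; }; /* shared by the pools */

  static int invariant_holds(const struct prv *p, const struct sdom *d)
  {
      return p->weight >= d->weight * d->active_vcpu_count;
  }

  int main(void)
  {
      struct sdom dom      = { .weight = 256, .active_vcpu_count = 4 };
      struct prv  old_pool = { .weight = 256 * 4 }; /* all vcpus start here */
      struct prv  new_pool = { .weight = 0 };

      /* Move two of the four vcpus: each *acct_stop/*acct_start pair shifts
       * weight between the two per-pool structures, while in this
       * interleaving the shared active_vcpu_count still says 4. */
      for (int moved = 0; moved < 2; moved++) {
          old_pool.weight -= dom.weight;
          new_pool.weight += dom.weight;
      }

      /* csched_acct on either pool now pairs its partial weight with the
       * full, shared active_vcpu_count. */
      printf("old pool: %s\n", invariant_holds(&old_pool, &dom) ? "ok" : "BROKEN");
      printf("new pool: %s\n", invariant_holds(&new_pool, &dom) ? "ok" : "BROKEN");
      return 0;
  }

Compiled with a C99 compiler this prints "BROKEN" for both pools, which is
the state csched_acct can observe mid-migration.
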
>
> Workarounds / fixes (none tried):
> * disable scheduler accounting while half-way migrating a domain
>    (dom->pool_migrating flag checked in csched_acct; a sketch follows
>    after this list)
> * temporarily split the sdom structures while migrating to account for
>    transient split of vcpus
> * synchronously disable all vcpus, migrate and then re-enable
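
The first of those workarounds could look roughly like the sketch below.
Note that dom->pool_migrating does not exist today; the flag and the
simplified structures are assumptions purely for illustration, and the
check itself would go into csched_acct, as suggested above:

  /* Hypothetical sketch of workaround 1: ignore a domain that is half-way
   * through a pool migration.  pool_migrating is an invented flag. */
  struct toy_domain     { int pool_migrating; };
  struct toy_csched_dom { struct toy_domain *dom; };

  int skip_accounting(const struct toy_csched_dom *sdom)
  {
      /* While set, csched_acct would skip this sdom, so a per-pool
       * prv->weight is never compared against an active_vcpu_count that
       * currently spans two pools. */
      return sdom->dom->pool_migrating;
  }
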
>
> Caveats:
> * prv->lock does not guarantee mutual exclusion between instances of the
>    (same) scheduler in different pools
>
> <rant>
> The general locking policy versus the state of the comments is a nightmare.
> I know that we have some advanced data-structure folks here, but
> intuitively reasoning about when specific things are atomic and
> mutually excluded is a pain in the scheduler / cpupool code; see the
> issue with the separate prv->locks above.
>
> E.g. the cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay
> (a schematic sketch follows after the rant):
> * cpupool_unassign_cpu unlocks cpupool_lock
> * sets up the continuation calling cpupool_unassign_cpu_helper
> * cpupool_unassign_cpu_helper locks cpupool_lock
> * intuitively one would think that both should see a consistent
>    snapshot, and hence dropping the lock in the middle is a bad idea
> * also communicating continuation-local state through global variables
>    mandates that only a single global continuation can be pending
>
> * reading cpu outside of the lock protection in
>    cpupool_unassign_cpu_helper also smells
> </rant>
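
Purely to illustrate the pattern being criticised (this is not the actual
cpupool code, and every name below is invented), the shape is roughly: the
first function takes and drops the lock, stashes its state in globals, and
schedules a continuation; the helper later re-takes the lock and reads those
globals, forming a second, separate critical section:

  #include <stdio.h>

  /* Schematic model only: continuation-local state passed via globals. */
  static int locked;                /* stands in for cpupool_lock          */
  static int pending_cpu = -1;      /* only ONE such operation can pend    */

  static void unassign_cpu(int cpu)
  {
      locked = 1;                   /* spin_lock(&cpupool_lock)            */
      pending_cpu = cpu;            /* state for the continuation          */
      locked = 0;                   /* unlocked before the work finishes;  *
                                     * anything may happen in between      */
      /* ... continuation scheduled here ... */
  }

  static void unassign_cpu_helper(void)
  {
      locked = 1;                   /* re-lock: a NEW critical section     */
      printf("finishing removal of cpu %d\n", pending_cpu);
      pending_cpu = -1;
      locked = 0;
  }

  int main(void)
  {
      unassign_cpu(3);
      unassign_cpu_helper();        /* in Xen this runs as a continuation  */
      return 0;
  }
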
>
> Despite the rant, it is amazing to see the ability to move running
> things around through this remote continuation trick! In my (ancient)
> balancer experiments I added hypervisor threads just to side-step
> this issue...

I think the easiest way to solve the problem would be to move the cpu to the
new pool in a tasklet. This is possible now, because tasklets are always
executed in the idle vcpus.
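
A rough sketch of that idea, assuming Xen's tasklet interface
(tasklet_init()/tasklet_schedule_on_cpu()); finish_cpu_move() is an invented
placeholder for whatever would actually complete the move:

  #include <xen/tasklet.h>

  /* Sketch only: finish moving a pcpu into its new pool from a tasklet,
   * which is guaranteed to run in the idle vcpu on that pcpu. */
  static struct tasklet cpu_move_tasklet;

  static void finish_cpu_move(unsigned long cpu)    /* hypothetical helper */
  {
      /* ... take cpupool_lock, mark the cpu valid in its new pool, ... */
  }

  static void start_cpu_move(unsigned int cpu)
  {
      tasklet_init(&cpu_move_tasklet, finish_cpu_move, cpu);
      /* Run the completion on the target cpu itself, i.e. in its idle vcpu. */
      tasklet_schedule_on_cpu(&cpu_move_tasklet, cpu);
  }
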

OTOH I'd like to understand what is wrong with my current approach...


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
