From: Juergen Gross
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Wed, 02 Feb 2011 16:14:25 +0100
Message-ID: <4D4974D1.1080503@ts.fujitsu.com>
References: <4D41FD3A.5090506@amd.com> <4D483599.1060807@amd.com> <201102021539.06664.stephan.diestelhorst@amd.com>
In-Reply-To: <201102021539.06664.stephan.diestelhorst@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
To: Stephan Diestelhorst
Cc: George Dunlap, "Przywara, Andre", Keir Fraser, "xen-devel@lists.xensource.com", Ian Jackson
List-Id: xen-devel@lists.xenproject.org

On 02/02/11 15:39, Stephan Diestelhorst wrote:
> Hi folks,
> long time no see. :-)
>
> On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
>> I asked Stephan Diestelhorst for help, and after I convinced him that
>> removing credit and making SEDF the default again is not an option, he
>> worked together with me on that ;-) Many thanks for that!
>> We haven't come to a final solution yet, but we could gather some debug
>> data. I will simply dump some data here; maybe somebody has got a clue.
>> We will work further on this tomorrow.
>
> Andre and I have been looking through this further, in particular sanity
> checking the invariant
>
>     prv->weight >= sdom->weight * sdom->active_vcpu_count
>
> each time someone tweaks the active vcpu count. This happens only in
> __csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
> to observe the broken invariant when splitting cpupools.
>
> We have the following theory of what happens:
> * some vcpus of a particular domain are currently in the process of
>   being moved to the new pool

The only _vcpus_ to be moved between pools are the idle vcpus, and those
never contribute to accounting in the credit scheduler. We are moving
_pcpus_ only (well, moving a domain between pools actually moves vcpus
as well, but then the domain is paused).

On the pcpu to be moved, the idle vcpu should be running. Obviously you
have found a scenario where this isn't true. I have no idea how this
could happen, as vcpus other than the idle vcpus are taken into account
for scheduling only if the pcpu is valid in the cpupool, and the pcpu is
set valid after the BUG_ON you have triggered in your tests.
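To make the relationship concrete, here is a minimal, compilable sketch of
the accounting under discussion. The types and helper names (prv_sketch,
sdom_sketch, acct_start, acct_stop) are simplified stand-ins modeled on
sched_credit.c, not the actual Xen code: each cpupool has its own
scheduler-private data ("prv"), while all vcpus of a domain share one
per-scheduler domain block ("sdom"), and csched_acct() relies on the
invariant quoted above.

/* Simplified stand-ins, not the real Xen structures. */
#include <assert.h>
#include <stdio.h>

struct prv_sketch {
    int weight;               /* sum of weights of all active vcpus in this pool */
};

struct sdom_sketch {
    int weight;               /* per-domain scheduling weight */
    int active_vcpu_count;    /* vcpus of this domain currently accounted active */
};

/* Modeled on __csched_vcpu_acct_start(): one vcpu of the domain becomes active. */
static void acct_start(struct prv_sketch *prv, struct sdom_sketch *sdom)
{
    sdom->active_vcpu_count++;
    prv->weight += sdom->weight;
}

/* Modeled on __csched_vcpu_acct_stop_locked(): one vcpu becomes idle again. */
static void acct_stop(struct prv_sketch *prv, struct sdom_sketch *sdom)
{
    sdom->active_vcpu_count--;
    prv->weight -= sdom->weight;
}

int main(void)
{
    struct prv_sketch prv = { 0 };
    struct sdom_sketch sdom = { .weight = 256, .active_vcpu_count = 0 };

    /* Two vcpus become active, both accounted against the same prv. */
    acct_start(&prv, &sdom);
    acct_start(&prv, &sdom);
    assert(prv.weight >= sdom.weight * sdom.active_vcpu_count);

    /* One vcpu goes idle; the paired decrement keeps the invariant. */
    acct_stop(&prv, &sdom);
    assert(prv.weight >= sdom.weight * sdom.active_vcpu_count);

    printf("invariant holds: prv->weight=%d, weight*count=%d\n",
           prv.weight, sdom.weight * sdom.active_vcpu_count);
    return 0;
}

As long as every active vcpu of the domain is accounted against the same
prv instance, the start/stop pairs keep the invariant intact; the quoted
theory below is about what happens when that assumption breaks.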
>
> * some are still left on the old pool (vcpus_old) and some are already
>   in the new pool (vcpus_new)
>
> * we now have vcpus_old->sdom = vcpus_new->sdom, and following from this:
>   * vcpus_old->sdom->weight = vcpus_new->sdom->weight
>   * vcpus_old->sdom->active_vcpu_count = vcpus_new->sdom->active_vcpu_count
>
> * active_vcpu_count thus does not represent the separation of the
>   actual vcpus (it may be the sum, or only the old or new ones; it does
>   not matter)
>
> * however, sched_old != sched_new, and thus:
>   * sched_old->prv != sched_new->prv
>   * sched_old->prv->weight != sched_new->prv->weight
>
> * the prv->weight field hence sees the incremental move of vcpus
>   (through modifications in *acct_start and *acct_stop_locked)
>
> * if at any point in this half-way migration the scheduler wants to
>   run csched_acct, it erroneously checks the wrong active_vcpu_count
>
> Workarounds / fixes (none tried):
> * disable scheduler accounting while a domain is half-way migrated
>   (a dom->pool_migrating flag, checked in csched_acct)
> * temporarily split the sdom structures while migrating, to account for
>   the transient split of vcpus
> * synchronously disable all vcpus, migrate, and then re-enable them
>
> Caveats:
> * prv->lock does not guarantee mutual exclusion between (same)
>   schedulers of different pools
>
> The general locking policy vs. the comment situation is a nightmare.
> I know that we have some advanced data-structure folks here, but
> intuitively reasoning about when specific things are atomic and
> mutually excluded is a pain in the scheduler / cpupool code; see the
> issue with the separate prv->locks above.
>
> E.g. the interplay of cpupool_unassign_cpu and cpupool_unassign_cpu_helper:
> * cpupool_unassign_cpu unlocks cpupool_lock
> * it sets up the continuation calling cpupool_unassign_cpu_helper
> * cpupool_unassign_cpu_helper locks cpupool_lock
> * intuitively, one would think that both should see a consistent
>   snapshot, and hence releasing the lock in the middle is a bad idea
> * also, communicating continuation-local state through global variables
>   mandates that only a single global continuation can be pending
>
> * reading cpu outside of the lock protection in
>   cpupool_unassign_cpu_helper also smells
>
> Despite the rant, it is amazing to see the ability to move running
> things around through this remote continuation trick! In my (ancient)
> balancer experiments I added hypervisor threads just for side-stepping
> this issue.

I think the easiest way to solve the problem would be to move the cpu to
the new pool in a tasklet. This is possible now, because tasklets are
always executed in the idle vcpus.

OTOH I'd like to understand what is wrong with my current approach...


Juergen

-- 
Juergen Gross                    Principal Developer Operating Systems
TSP ES&S SWE OS6                      Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions             e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                          Internet: ts.fujitsu.com
D-80807 Muenchen               Company details: ts.fujitsu.com/imprint.html
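For completeness, a second sketch (same simplified stand-in types as in the
earlier sketch, not actual Xen code) illustrates the half-way migration
scenario described in the quoted theory: the domain's single sdom counts
active vcpus from both pools, while each pool's prv only sees its own
weight increments, so the invariant check fails against either prv.

#include <stdio.h>

struct prv_sketch  { int weight; };                       /* per-cpupool scheduler data */
struct sdom_sketch { int weight; int active_vcpu_count; };/* shared per-domain data */

/* Modeled on __csched_vcpu_acct_start(): account one active vcpu. */
static void acct_start(struct prv_sketch *prv, struct sdom_sketch *sdom)
{
    sdom->active_vcpu_count++;
    prv->weight += sdom->weight;
}

static int invariant_holds(struct prv_sketch *prv, struct sdom_sketch *sdom)
{
    return prv->weight >= sdom->weight * sdom->active_vcpu_count;
}

int main(void)
{
    struct prv_sketch prv_old = { 0 }, prv_new = { 0 };
    struct sdom_sketch sdom = { .weight = 256, .active_vcpu_count = 0 };

    /* The "half-way migrated" state: one vcpu is accounted in each pool's
     * scheduler instance, but both increments hit the same shared sdom. */
    acct_start(&prv_old, &sdom);   /* vcpu still in the old pool */
    acct_start(&prv_new, &sdom);   /* vcpu already in the new pool */

    /* sdom counts both vcpus, while each prv only saw one weight increment,
     * so csched_acct() in either pool would trip over the invariant. */
    printf("old pool invariant: %s\n",
           invariant_holds(&prv_old, &sdom) ? "holds" : "BROKEN");
    printf("new pool invariant: %s\n",
           invariant_holds(&prv_new, &sdom) ? "holds" : "BROKEN");
    return 0;
}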