From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephan Diestelhorst Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split Date: Wed, 2 Feb 2011 09:39:06 -0500 Message-ID: <201102021539.06664.stephan.diestelhorst@amd.com> References: <4D41FD3A.5090506@amd.com> <4D483599.1060807@amd.com> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <4D483599.1060807@amd.com> Content-Language: de-DE List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: "Przywara, Andre" Cc: George Dunlap , Keir Fraser , Juergen Gross , "xen-devel@lists.xensource.com" , Ian Jackson List-Id: xen-devel@lists.xenproject.org Hi folks, long time no see. :-) On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote: > I asked Stephan Diestelhorst for help and after I convinced him that > removing credit and making SEDF the default again is not an option he > worked together with me on that ;-) Many thanks for that! > We haven't come to a final solution but could gather some debug data. > I will simply dump some data here, maybe somebody has got a clue. We > will work further on this tomorrow. Andre and I have been looking through this further, in particular sanity checking the invariant prv->weight >= sdom->weight * sdom->active_vcpu_count each time someone tweaks the active vcpu count. This happens only in __csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed to observe the broken invariant when splitting cpupoools. We have the following theory of what happens: * some vcpus of a particular domain are currently in the process of being moved to the new pool * some are still left on the old pool (vcpus_old) and some are already in the new pool (vcpus_new) * we now have vcpus_old->sdom = vcpus_new->sdom and following from this * vcpus_old->sdom->weight = vcpus_new->sdom->weight * vcpus_old->sdom->active_vcpu_count = vcpus_new->sdom->active_vcpu_count * active_vcpu_count thus does not represent the separation of the actual vpcus (may be the sum, only the old or new ones, does not matter) * however, sched_old != sched_new, and thus * sched_old->prv != sched_new->prv * sched_old->prv->weight != sched_new->prv->weight * the prv->weight field hence sees the incremental move of VCPUs (through modifications in *acct_start and *acct_stop_locked) * if at any point in this half-way migration, the scheduler wants to csched_acct, it erroneously checks the wrong active_vcpu_count Workarounds / fixes (none tried): * disable scheduler accounting while half-way migrating a domain (dom->pool_migrating flag and then checking in csched_acct) * temporarily split the sdom structures while migrating to account for transient split of vcpus * synchronously disable all vcpus, migrate and then re-enable Caveats: * prv->lock does not guarantee mutual exclusion between (same) schedulers of different pools The general locking policy vs the comment situation is a nightmare. I know that we have some advanced data-structure folks here, but intuitively reasoning about when specific things are atomic and mutually excluded is a pain in the scheduler / cpupool code, see the issue with the separate prv->locks above. E.g. cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay: * cpupool_unassign_cpu unlocks cpupool_lock * sets up the continuation calling cpupool_unassign_cpu_helper * cpupool_unassign_cpu_helper locks cpupool_lock * while intuitively, one would think that both should see a consistent snapshot and hence freeing the lock in the middle is a bad idea * also communicating continuation-local state through global variables mandates that only a single global continuation can be pending * reading cpu outside of the lock protection in cpupool_unassign_cpu_helper also smells Despite the rant, it is amazing to see the ability to move running things around through this remote continuation trick! In my (ancient) balancer experiments I added hypervisor-threads just for side- stepping this issue.. Stephan -- Stephan Diestelhorst, AMD Operating System Research Center stephan.diestelhorst@amd.com Tel. +49 (0)351 448 356 719 Advanced Micro Devices GmbH Einsteinring 24 85609 Aschheim Germany Geschaeftsfuehrer: Alberto Bozzo u. Andrew Bowd Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551