On Thu, 2016-08-11 at 14:39 +0100, Andrew Cooper wrote:
> On 11/08/16 14:24, George Dunlap wrote:
> > On 11/08/16 12:35, Andrew Cooper wrote:
> > > The actual cause is _csched_cpu_pick() falling over LIST_POISON,
> > > which happened to occur at the same time as a domain was shutting
> > > down.  The instruction in question is `mov 0x10(%rax),%rax` which
> > > looks like reverse list traversal.
> >
> > Thanks for the report.
> >
> > Could you use line2addr or objdump -dl to get a better idea where
> > the #GP is happening?
>
> addr2line -e xen-syms-4.7.0-xs127493 ffff82d08012944f
> /obj/RPM_BUILD_DIRECTORY/xen-4.7.0/xen/common/sched_credit.c:775
> (discriminator 1)
>
> It will be IS_RUNQ_IDLE() which is the problem.
>
Ok, that does one step of list traversing (the runq). What I didn't
understand from your report is what crashed when.

IS_RUNQ_IDLE() was introduced a while back, and nothing like this has
ever been caught so far. George's patch makes _csched_cpu_pick() be
called during insert_vcpu()-->csched_vcpu_insert() which, in 4.7, is
called:
 1) during domain (well, vcpu) creation,
 2) when a domain is moved among cpupools

AFAICR, during domain destruction we basically move the domain to
cpupool0 and, without a patch that I sent recently, that is always done
as a full-fledged cpupool movement, even if the domain is _already_ in
cpupool0. So, even if you are not using cpupools, since you mention
domain shutdown, we are probably looking at 2).

But this is what I'm not sure I got right... Do you have enough info to
tell precisely when the crash manifests? Is it indeed during a domain
shutdown, or was it during a domain creation (sched_init_vcpu() is in
the stack trace... although I've read it's a non-debug one)? And is it
a 'regular' domain or dom0 that is shutting down/coming up?

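To make the failure mode under discussion concrete, here is a minimal
standalone sketch of list poisoning and of a one-step "is this runq
idle?" check. It is not the Xen code: the poison values, struct
sketch_vcpu, list_del_poison(), link_is_poisoned() and runq_is_idle()
are all hypothetical names made up for illustration, and the real
macro in sched_credit.c may well differ. The point is only that an
element removed from a list keeps deliberately invalid links, so
anything that later follows them faults instead of reading junk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical poison values (assumption, not Xen's): invalid addresses
 * stored in the links of a removed element. */
#define SKETCH_POISON1 ((struct list_head *)(uintptr_t)0x100)
#define SKETCH_POISON2 ((struct list_head *)(uintptr_t)0x200)

/* Hypothetical stand-in for a scheduler's per-vcpu data. */
struct sketch_vcpu {
    struct list_head runq_elem;  /* link in a per-cpu run queue */
    bool is_idle;
};

#define elem_to_vcpu(p) \
    ((struct sketch_vcpu *)((char *)(p) - \
                            offsetof(struct sketch_vcpu, runq_elem)))

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *e, struct list_head *h)
{
    e->prev = h->prev;
    e->next = h;
    h->prev->next = e;
    h->prev = e;
}

static bool link_is_poisoned(const struct list_head *e)
{
    return e->next == SKETCH_POISON1 || e->prev == SKETCH_POISON2;
}

/* Unlink e, then poison its own links so stale users are caught. */
static void list_del_poison(struct list_head *e)
{
    assert(!link_is_poisoned(e));  /* deleting twice would be a bug */
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = SKETCH_POISON1;
    e->prev = SKETCH_POISON2;
}

/* One step of traversal: the queue counts as idle if it is empty or if
 * its first element is an idle vcpu.  If runq->next were a poisoned or
 * stale pointer, the is_idle load is where the fault would happen. */
static bool runq_is_idle(const struct list_head *runq)
{
    return runq->next == runq || elem_to_vcpu(runq->next)->is_idle;
}
```

Note that the sketch only ever compares against the poison values and
never dereferences them, so it can be exercised safely; in a real
hypervisor the dereference is what turns into the fault.
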
The idea behind IS_RUNQ_IDLE() is that we need to know whether there is
someone in the runq of a cpu or not, to correctly initialize --and
hence avoid biasing-- some load balancing calculations. I've never
liked the idea (let alone the code), but it's necessary (or, at least,
I don't see a sensible alternative).

The questions I'm asking above have the aim of figuring out what the
status of the runq could be, and why adding a call to csched_cpu_pick()
from insert_vcpu() is making things explode...

Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)