From: Juergen Gross
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Wed, 16 Feb 2011 15:11:04 +0100
Message-ID: <4D5BDAF8.50800@ts.fujitsu.com>
References: <4D41FD3A.5090506@amd.com> <4D4A43B7.5040707@ts.fujitsu.com> <4D4A72D8.3020502@ts.fujitsu.com> <4D4C08B6.30600@amd.com> <4D4FE7E2.9070605@amd.com> <4D4FF452.6060508@ts.fujitsu.com> <4D50D80F.9000007@ts.fujitsu.com> <4D517051.10402@amd.com> <4D529BD9.5050200@amd.com> <4D52A2CD.9090507@ts.fujitsu.com> <4D5388DF.8040900@ts.fujitsu.com> <4D53AF27.7030909@amd.com> <4D53F3BC.4070807@amd.com> <4D54D478.9000402@ts.fujitsu.com> <4D54E79E.3000800@amd.com> <4D5A29C0.4050702@ts.fujitsu.com> <4D5B9D2B.107@ts.fujitsu.com>
To: George Dunlap
Cc: Andre Przywara, "xen-devel@lists.xensource.com", "Diestelhorst, Stephan"

On 02/16/11 14:54, George Dunlap wrote:
> Andre (and Juergen), can you try again with the attached patch?
>
> What the patch basically does is try to make "cpu_disable_scheduler()"
> do what it seems to say it does. :-)  Namely, the various
> scheduler-related interrupts (both per-cpu ticks and the master tick)
> are part of the scheduler, so disable them before doing anything, and
> don't enable them until the cpu is really ready to go again.
>
> To be precise:
> * cpu_disable_scheduler() disables ticks
> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>   and does it after inserting the idle vcpu
> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>   stop tickers
>   + Call tick_{resume,suspend} in cpu_{up,down}, respectively

I tried this before :-) It didn't work for Andre, but maybe there were
some bits missing.

> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>   as well.
>
> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
> on one pcpu), I can perform thousands of operations successfully.

Nice. I'll try later. At the moment I'm testing another patch (attached
for review, if you like). I think I've identified two possible races.


Juergen

>
> (NB this is not ready for application yet, I just wanted to check to
> see if it fixes Andre's problem)
>
> -George
>
> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross wrote:
>> Okay, I have some more data.
>>
>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>> test for weight inconsistencies. To reduce race possibilities I've added
>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>> cpu to be moved.
>>
>> Here is the result:
>>
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>> (XEN) cpu 4, weight 0, prv ffff831002ad5e40, dom 0:
>> (XEN)   sdom->weight: 256, sdom->active_vcpu_count: 1
>> (XEN) Xen BUG at sched_credit.c:570
>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
>> (XEN) CPU:    4
>> (XEN) RIP:    e008:[] csched_tick+0x186/0x37f
>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
>> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
>> (XEN) Xen call trace:
>> (XEN)    [] csched_tick+0x186/0x37f
>> (XEN)    [] execute_timer+0x4e/0x6c
>> (XEN)    [] timer_softirq_action+0xf6/0x239
>> (XEN)    [] __do_softirq+0x88/0x99
>> (XEN)    [] do_softirq+0x6a/0x7a
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 4:
>> (XEN) Xen BUG at sched_credit.c:570
>> (XEN) ****************************************
>>
>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The BUG_ON
>> triggered in csched_acct() is a logical result of this.
>>
>> How this can happen I don't know yet.
>> Anyone any idea? I'll keep searching...
>>
>>
>> Juergen
>>
>> On 02/15/11 08:22, Juergen Gross wrote:
>>>
>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>
>>>> The good news is, I've managed to reproduce this on my local test
>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>> attached script. It's time to go home now, but I should be able to
>>>> dig something up tomorrow.
>>>>
>>>> To use the script:
>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>> * You can modify elements by adding "arg=val" as arguments.
>>>> * Arguments are:
>>>>   + dryrun={true,false} Do the work, but don't actually execute any xl
>>>>     commands. Default false.
>>>>   + left: Number of commands to execute. Default 10.
>>>>   + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>     8 cpus).
>>>>   + verbose={true,false} Print what you're doing. Default is true.
>>>>
>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>> this case, libxl will print an error. If the script gets an error
>>>> under that condition, it will ignore it; under any other condition, it
>>>> will print diagnostic information.
>>>>
>>>> What finally crashed it for me was this command:
>>>> # ./cpupool-test.sh verbose=false left=1000
>>>
>>> Nice!
>>> With your script I finally managed to get the error, too. On my box
>>> (2 sockets with 6 cores each) I had to use
>>>
>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>
>>> to trigger it.
>>> Looking for more data now...
>>>
>>>
>>> Juergen
>>>
>>>>
>>>> -George
>>>>
>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara wrote:
>>>>>
>>>>> Juergen Gross wrote:
>>>>>>
>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>
>>>>>>> Andre Przywara wrote:
>>>>>>>>
>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>
>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>
>>>>>>>>>> Andre, George,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What seems to be interesting: I think the problem always occurred
>>>>>>>>>> when a new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>
>>>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>>>> too bad. I think somehow the master_ticker of the new cpupool is
>>>>>>>>>> becoming active before the scheduler is really initialized
>>>>>>>>>> properly. This could happen if enough time is spent between
>>>>>>>>>> alloc_pdata for the cpu to be moved and the critical section in
>>>>>>>>>> schedule_cpu_switch().
>>>>>>>>>>
>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>> scheduler is ready for them.
>>>>>>>>>>
>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>> suspend_ticker as well? I still see potential problems for
>>>>>>>>>> entering deep C-States. I think I'll prepare a patch which will
>>>>>>>>>> keep the master_ticker active for the C-State case and migrate
>>>>>>>>>> it for the schedule_cpu_switch() case.
>>>>>>>>>
>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>> without any problems.
>>>>>>>>> Andre, could you give it a try?
>>>>>>>>
>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>> sure I booted the right kernel. Sorry.
>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>> sounded very appealing, actually that was suspicious to me from
>>>>>>>> the beginning.
>>>>>>>>
>>>>>>>> I will add some code to dump the state of all cpupools to the BUG_ON
>>>>>>>> to see in which situation we are when the bug triggers.
>>>>>>>
>>>>>>> OK, here is a first try of this: the patch iterates over all CPU pools
>>>>>>> and outputs some data if the BUG_ON
>>>>>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>>>>>>> triggers:
>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>> ....
>>>>>>> The masks look proper (6 cores per node); the bug triggers when the
>>>>>>> first CPU is about to be(?) inserted.
>>>>>>
>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>> I'll try to reproduce the problem on a larger machine here (24 cores,
>>>>>> 4 numa nodes).
>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>> are you running, and do you have any additional patches in use?
>>>>>
>>>>> The grub lines:
>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>
>>>>> All of my experiments use c/s 22858 as a base.
>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>> G34), you should add the following patch (removing the line):
>>>>> --- a/xen/arch/x86/traps.c
>>>>> +++ b/xen/arch/x86/traps.c
>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>          __clear_bit(X86_FEATURE_SKINIT % 32,&c);
>>>>>          __clear_bit(X86_FEATURE_WDT % 32,&c);
>>>>>          __clear_bit(X86_FEATURE_LWP % 32,&c);
>>>>> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32,&c);
>>>>>          __clear_bit(X86_FEATURE_TOPOEXT % 32,&c);
>>>>>          break;
>>>>>      case 5: /* MONITOR/MWAIT */
>>>>>
>>>>> This is not necessary (in fact it reverts my patch c/s 22815), but it
>>>>> raises the probability of triggering the bug, probably because it
>>>>> increases the pressure on the Dom0 scheduler. If you cannot trigger it
>>>>> with Dom0, try to create a guest with many VCPUs and squeeze it into a
>>>>> small CPU-pool.
>>>>>
>>>>> Good luck ;-)
>>>>> Andre.
>>>>>
>>>>> --
>>>>> Andre Przywara
>>>>> AMD-OSRC (Dresden)
>>>>> Tel: x29712
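
For what it's worth, below is a toy model (plain C, nothing taken from the
hypervisor; the struct and function names are invented for the example) of the
weight bookkeeping behind the BUG_ONs quoted above. It only shows why a pool
whose private data says "weight 0" cannot legally have an active sdom with
weight 256 - i.e. why the (sdom->weight * sdom->active_vcpu_count) >
weight_left check must fire - not how the vcpu ends up on the wrong pool in
the first place.

/* Toy model of the credit scheduler's per-pool weight accounting.
 * All names (toy_sdom, toy_private, vcpu_activate, acct) are made up;
 * only the checked invariant is taken from the mails above. */
#include <assert.h>
#include <stdio.h>

struct toy_sdom {                 /* per-domain accounting data */
    int weight;
    int active_vcpu_count;
};

struct toy_private {              /* per-cpupool private data */
    int weight;                   /* sum of weight * active_vcpu_count
                                     over all active sdoms of this pool */
    struct toy_sdom *active[8];
    int nr_active;
};

/* A vcpu becomes active: the sdom joins the pool's active list and the
 * pool's weight total is updated - both on the SAME pool's private data. */
static void vcpu_activate(struct toy_private *prv, struct toy_sdom *sdom)
{
    if ( sdom->active_vcpu_count++ == 0 )
        prv->active[prv->nr_active++] = sdom;
    prv->weight += sdom->weight;
}

/* Accounting pass: walk the active sdoms and check, like the BUG_ON
 * condition quoted above, that no sdom claims more weight than the pool
 * has recorded in total. */
static void acct(struct toy_private *prv)
{
    int weight_left = prv->weight;

    for ( int i = 0; i < prv->nr_active; i++ )
    {
        struct toy_sdom *sdom = prv->active[i];
        assert( sdom->weight * sdom->active_vcpu_count <= weight_left );
        weight_left -= sdom->weight * sdom->active_vcpu_count;
        printf("dom with weight %d ok, weight_left now %d\n",
               sdom->weight, weight_left);
    }
}

int main(void)
{
    struct toy_private pool0 = { 0 }, pool1 = { 0 };
    struct toy_sdom dom0 = { .weight = 256 };

    /* Consistent case: activation and weight total hit the same pool. */
    vcpu_activate(&pool0, &dom0);
    acct(&pool0);

    /* Inconsistent case, as in the log: dom0 appears on pool 1's active
     * list while pool 1's weight total is still 0 ("cpu 4, weight 0, ...
     * sdom->weight: 256, sdom->active_vcpu_count: 1") - the assert fires,
     * mirroring the Xen BUG at sched_credit.c:570. */
    pool1.active[pool1.nr_active++] = &dom0;
    acct(&pool1);
    return 0;
}

Built with any C compiler, the first acct() passes and the second one aborts
on the assert - the same inconsistency the hypervisor turns into a panic.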

--
Juergen Gross                    Principal Developer Operating Systems
TSP ES&S SWE OS6                 Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions     e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                    Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

Content-Type: text/x-patch; name="cpupool-race.patch"
Content-Disposition: attachment; filename="cpupool-race.patch"

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {
diff -r 72470de157ce xen/common/schedule.c
--- a/xen/common/schedule.c Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/schedule.c Wed Feb 16 15:09:54 2011 +0100
@@ -395,7 +395,28 @@ static void vcpu_migrate(struct vcpu *v)
     unsigned long flags;
     int old_cpu, new_cpu;
 
-    vcpu_schedule_lock_irqsave(v, flags);
+    for (;;)
+    {
+        vcpu_schedule_lock_irqsave(v, flags);
+
+        /* Select new CPU. */
+        old_cpu = v->processor;
+        new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
+
+        if ( new_cpu == old_cpu )
+            break;
+
+        if ( !pcpu_schedule_trylock(new_cpu) )
+        {
+            vcpu_schedule_unlock_irqrestore(v, flags);
+            continue;
+        }
+        if ( cpu_isset(new_cpu, v->domain->cpupool->cpu_valid) )
+            break;
+
+        pcpu_schedule_unlock(new_cpu);
+        vcpu_schedule_unlock_irqrestore(v, flags);
+    }
 
     /*
      * NB. Check of v->running happens /after/ setting migration flag
@@ -405,13 +426,12 @@ static void vcpu_migrate(struct vcpu *v)
     if ( v->is_running ||
          !test_and_clear_bit(_VPF_migrating, &v->pause_flags) )
     {
+        if ( old_cpu != new_cpu )
+            pcpu_schedule_unlock(new_cpu);
+
         vcpu_schedule_unlock_irqrestore(v, flags);
         return;
     }
-
-    /* Select new CPU. */
-    old_cpu = v->processor;
-    new_cpu = SCHED_OP(VCPU2OP(v), pick_cpu, v);
 
     /*
      * Transfer urgency status to new CPU before switching CPUs, as once
@@ -424,9 +444,13 @@ static void vcpu_migrate(struct vcpu *v)
         atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
     }
 
-    /* Switch to new CPU, then unlock old CPU. This is safe because
+    /* Switch to new CPU, then unlock new and old CPU. This is safe because
      * the lock pointer cant' change while the current lock is held. */
     v->processor = new_cpu;
+
+    if ( old_cpu != new_cpu )
+        pcpu_schedule_unlock(new_cpu);
+
     spin_unlock_irqrestore(
         per_cpu(schedule_data, old_cpu).schedule_lock, flags);
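
For illustration only (not part of the patch above): a stand-alone user-space
sketch of the locking pattern the vcpu_migrate() hunk uses - take the vcpu's
own lock, then only try-lock the target cpu's schedule lock, and drop
everything and retry if that fails, so two migrations heading in opposite
directions cannot deadlock. pthread mutexes stand in for Xen's per-vcpu and
per-pcpu schedule locks; all names (fake_vcpu, fake_pcpu, pick_cpu, migrate)
are invented, and the cpupool validity re-check of the real patch is left out.

/* Hypothetical demo of the trylock-and-retry lock ordering; not Xen code. */
#include <pthread.h>
#include <stdio.h>

struct fake_pcpu {
    pthread_mutex_t schedule_lock;    /* stands in for the pcpu schedule lock */
};

struct fake_vcpu {
    pthread_mutex_t lock;             /* stands in for the vcpu schedule lock */
    int processor;                    /* pcpu the vcpu currently runs on */
};

static struct fake_pcpu pcpus[4];

/* Pick a target pcpu; in Xen this is the scheduler's pick_cpu hook. */
static int pick_cpu(struct fake_vcpu *v)
{
    return (v->processor + 1) % 4;
}

static void migrate(struct fake_vcpu *v)
{
    int old_cpu = 0, new_cpu = 0;

    for ( ;; )
    {
        pthread_mutex_lock(&v->lock);

        old_cpu = v->processor;
        new_cpu = pick_cpu(v);

        if ( new_cpu == old_cpu )
            break;                    /* nothing to move, vcpu lock held */

        /* Never block on the second lock while holding the first one:
         * on contention drop the vcpu lock and start over. */
        if ( pthread_mutex_trylock(&pcpus[new_cpu].schedule_lock) != 0 )
        {
            pthread_mutex_unlock(&v->lock);
            continue;
        }
        break;                        /* both locks held */
    }

    if ( new_cpu != old_cpu )
    {
        v->processor = new_cpu;       /* the actual "migration" */
        pthread_mutex_unlock(&pcpus[new_cpu].schedule_lock);
    }
    pthread_mutex_unlock(&v->lock);
}

int main(void)
{
    static struct fake_vcpu v = { PTHREAD_MUTEX_INITIALIZER, 0 };

    for ( int i = 0; i < 4; i++ )
        pthread_mutex_init(&pcpus[i].schedule_lock, NULL);

    migrate(&v);
    printf("vcpu now on pcpu %d\n", v.processor);
    return 0;
}

Compiled with e.g. "gcc -pthread", this just performs one migration; the point
is only the ordering: the second lock is never waited for while the first one
is held, which is the same shape as the for (;;) loop the patch adds.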