From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from e2.ny.us.ibm.com (e2.ny.us.ibm.com [32.97.182.142])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "e2.ny.us.ibm.com", Issuer "Equifax" (verified OK))
	by ozlabs.org (Postfix) with ESMTPS id 2EEEAB716C
	for ; Thu, 2 Sep 2010 16:04:37 +1000 (EST)
Received: from d01relay03.pok.ibm.com (d01relay03.pok.ibm.com [9.56.227.235])
	by e2.ny.us.ibm.com (8.14.4/8.13.1) with ESMTP id o825o4Dp023835
	for ; Thu, 2 Sep 2010 01:50:04 -0400
Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215])
	by d01relay03.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o8264WQA307574
	for ; Thu, 2 Sep 2010 02:04:32 -0400
Received: from d01av01.pok.ibm.com (loopback [127.0.0.1])
	by d01av01.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id o8264Vmg004035
	for ; Thu, 2 Sep 2010 02:04:32 -0400
Message-ID: <4C7F3E6C.3010905@us.ibm.com>
Date: Wed, 01 Sep 2010 23:04:28 -0700
From: Darren Hart
MIME-Version: 1.0
To: Steven Rostedt
Subject: Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with
 CONFIG_PREEMPT on pseries
References: <4C488CCD.60004@us.ibm.com> <20100819155824.GD2690@in.ibm.com>
 <4C7CAB72.2050305@us.ibm.com> <1283320481.32679.32.camel@concordia>
 <4C7E6CCC.8090700@us.ibm.com> <4C7E9FC1.60004@us.ibm.com>
 <1283371161.2356.53.camel@gandalf.stny.rr.com> <4C7EBAA8.7030601@us.ibm.com>
 <11579.1283389322@neuling.org> <1283400367.2356.69.camel@gandalf.stny.rr.com>
In-Reply-To: <1283400367.2356.69.camel@gandalf.stny.rr.com>
Content-Type: text/plain; charset=ISO-8859-15
Cc: Stephen Rothwell , Michael Neuling , Gautham R Shenoy , Josh Triplett ,
 linuxppc-dev@ozlabs.org, Will Schmidt , Paul Mackerras , Ankita Garg ,
 Thomas Gleixner
List-Id: Linux on PowerPC Developers Mail List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,

On 09/01/2010 09:06 PM, Steven Rostedt wrote:
> On Thu, 2010-09-02 at 11:02 +1000, Michael Neuling wrote:
>
>> We need to call
>> smp_startup_cpu on boot when the cpus are still in FW.
>> smp_startup_cpu does this for us on boot.
>>
>> I'm wondering if we just need to move the test down a bit to make sure
>> the preempt_count is set. I've not been following this thread, but
>> maybe this might work?
>
> Egad no! Setting the preempt_count to zero _is_ the bug. I think Darren
> even said that adding the exit prevented the bug (although now he's
> hitting a hard lockup someplace else). The original code he was using
> did not have the condition to return for kexec. It was just a
> coincidence that this code helped in bringing a CPU back online.
>
>>
>> Untested patch below...
>>
>> Mikey
>>
>> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
>> index 0317cce..3afaba4 100644
>> --- a/arch/powerpc/platforms/pseries/smp.c
>> +++ b/arch/powerpc/platforms/pseries/smp.c
>> @@ -104,18 +104,18 @@ static inline int __devinit smp_startup_cpu(unsigned int lcpu)
>>
>>  	pcpu = get_hard_smp_processor_id(lcpu);
>>
>> -	/* Check to see if the CPU out of FW already for kexec */
>> -	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
>> -		cpumask_set_cpu(lcpu, of_spin_mask);
>> -		return 1;
>> -	}
>> -
>>  	/* Fixup atomic count: it exited inside IRQ handler. */
>>  	task_thread_info(paca[lcpu].__current)->preempt_count = 0;
>
> We DON'T want to do the above. It's nasty! This is one CPU's task
> touching an intimate part of another CPU's task. It's the equivalent of
> me putting my hand down your wife's blouse. It's offensive, and rude.
>
> OK, if the CPU was never online, then you can do what you want. But what
> we see is that this fails on CPU hotplug. You stop a CPU, and it goes
> into this cede_processor() call. When you wake it up, suddenly the task
> on that woken CPU has its preempt count fscked up. This was really
> really hard to debug. We thought it was stack corruption or something.
> But it ended up being that this code has one CPU touching the breasts
> of another CPU. This code is a pervert!
>
> What the trace clearly showed was that we take down a CPU, and in doing
> so, the code on that CPU set the preempt count to 1, and it expected to
> have it as 1 when it returned. But the code that kick-started this CPU
> back to life (bringing the CPU back online) set the preempt count on
> the task of that CPU to 0, and screwed everything up.

Right, what Steve said. This patch resolved the problem, but it did so
inadvertently while trying to fix kexec. I'm wondering if the
offline/online code needs to be updated to never call smp_startup_cpu().
Or perhaps this is the right fix, and the comment needs to be cleaned up
so that it isn't only for kexec.

With this in place, we no longer see the preempt_count dropping below
zero. However, if I offline/online a CPU about 246 times I hit the
opposite problem, a preempt_count() overflow. There appears to be a
missing preempt_enable() somewhere in the offline/online paths.

On 2.6.33.7 with CONFIG_PREEMPT=y I run the following test:

# cat offon_loop.sh
#!/bin/sh
for i in `seq 100`; do
	echo "Iteration $i"
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
done

And see the overflow:

Iteration 238
Iteration 239
Iteration 240

Message from syslogd@igoort1 at Sep 1 17:36:45 ...
kernel:------------[ cut here ]------------

Iteration 241
Iteration 242
Iteration 243
Iteration 244
Iteration 245

With the following on the console. This is the:

	/*
	 * Spinlock count overflowing soon?
	 */
	DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >=
				PREEMPT_MASK - 10);

test. Max preempt count is 256. Somewhere we are now MISSING a
sub_preempt_count() or preempt_enable().
Badness at kernel/sched.c:5372
NIP: c000000000694f14 LR: c000000000694ef8 CTR: c00000000005b0a0
REGS: c00000008e776ae0 TRAP: 0700   Not tainted  (2.6.33.7-dirty)
MSR: 8000000000021032   CR: 28000082  XER: 00000000
TASK = c00000010e6fc040[0] 'swapper' THREAD: c00000008e774000 CPU: 1
GPR00: 0000000000000000 c00000008e776d60 c000000000af2ab8 0000000000000001
GPR04: c00000008e776fb8 0000000000000004 0000000000000001 ffffffffff676980
GPR08: 0000000000000400 c000000000bcd930 0000000000000001 c000000000b2d360
GPR12: 0000000000000002 c000000000b0f680 0000000000000000 000000000f394acc
GPR16: 0000000000000000 0000000000000000 0000000000000000 c00000008e777880
GPR20: c000000000e05160 c00000008e777870 7fffffffffffffff c000000000b0f480
GPR24: c0000000009da400 0000000000000002 0000000000000000 0000000000000001
GPR28: c00000008e776fb8 0000000000000001 c000000000a75bf0 c00000008e776d60
NIP [c000000000694f14] .add_preempt_count+0xc0/0xe0
LR [c000000000694ef8] .add_preempt_count+0xa4/0xe0
Call Trace:
[c00000008e776d60] [c00000008e776e00] 0xc00000008e776e00 (unreliable)
[c00000008e776df0] [c00000000005b0d8] .probe_hcall_entry+0x38/0x94
[c00000008e776e80] [c00000000004ef70] .__trace_hcall_entry+0x80/0xd4
[c00000008e776f10] [c00000000004f96c] .plpar_hcall_norets+0x50/0xd0
[c00000008e776f80] [c000000000055044] .smp_xics_message_pass+0x110/0x244
[c00000008e777030] [c000000000034824] .smp_send_reschedule+0x5c/0x78
[c00000008e7770c0] [c00000000006a6bc] .resched_task+0xb4/0xd8
[c00000008e777150] [c00000000006a840] .check_preempt_curr_idle+0x2c/0x44
[c00000008e7771e0] [c00000000007a868] .try_to_wake_up+0x460/0x548
[c00000008e7772b0] [c00000000007a98c] .default_wake_function+0x3c/0x54
[c00000008e777350] [c0000000000ab3b0] .autoremove_wake_function+0x44/0x84
[c00000008e777400] [c00000000006449c] .__wake_up_common+0x7c/0xf4
[c00000008e7774c0] [c00000000006a5d8] .__wake_up+0x60/0x90
[c00000008e777570] [c0000000000810dc] .printk_tick+0x84/0xa8
[c00000008e777600] [c000000000095c90] .update_process_times+0x64/0xa4
[c00000008e7776a0] [c0000000000bdcec] .tick_sched_timer+0xd0/0x120
[c00000008e777750] [c0000000000afe7c] .__run_hrtimer+0x1a0/0x29c
[c00000008e777800] [c0000000000b02a4] .hrtimer_interrupt+0x124/0x278
[c00000008e777900] [c00000000002ea90] .timer_interrupt+0x1dc/0x2e4
[c00000008e7779a0] [c000000000003700] decrementer_common+0x100/0x180
--- Exception: 901 at .raw_local_irq_restore+0x48/0x54
    LR = .cpu_idle+0x12c/0x208
[c00000008e777c90] [c000000000b21130] ppc_md+0x0/0x240 (unreliable)
[c00000008e777cd0] [c000000000016a60] .cpu_idle+0x120/0x208
[c00000008e777d70] [c00000000069dbec] .start_secondary+0x394/0x3d4
[c00000008e777e30] [c000000000008278] .start_secondary_resume+0x10/0x14
Instruction dump:
409e0034 78630620 2ba300f4 40fd0028 4bd0b711 60000000 2fa30000 419e0018
e93e8880 80090000 2f800000 409e0008 <0fe00000> 383f0090 e8010010 7c0803a6
BUG: scheduling while atomic: swapper/0/0x000000f0
Modules linked in: autofs4 binfmt_misc dm_mirror dm_region_hash dm_log [last unloaded: scsi_wait_scan]
Call Trace:
[c00000008e7779a0] [c000000000014280] .show_stack+0xd8/0x218 (unreliable)
[c00000008e777a80] [c0000000006980b0] .dump_stack+0x28/0x3c
[c00000008e777b00] [c000000000071954] .__schedule_bug+0x94/0xb8
[c00000008e777b90] [c00000000068db40] .schedule+0xf8/0xbf4
[c00000008e777cd0] [c000000000016b34] .cpu_idle+0x1f4/0x208
[c00000008e777d70] [c00000000069dbec] .start_secondary+0x394/0x3d4
[c00000008e777e30] [c000000000008278] .start_secondary_resume+0x10/0x14

I'll spend tomorrow collecting traces and trying to see who's groping
who this time...

--
Darren

>
> -- Steve
>
>
>>
>> 	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
>> 		goto out;
>>
>> +	/* Check to see if the CPU out of FW already for kexec */
>> +	if (smp_query_cpu_stopped(pcpu) == QCSS_NOT_STOPPED){
>> +		cpumask_set_cpu(lcpu, of_spin_mask);
>> +		return 1;
>> +	}
>> +
>> 	/*
>> 	 * If the RTAS start-cpu token does not exist then presume the
>> 	 * cpu is already spinning.
>>
>
>

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team