linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
@ 2008-03-02 18:42 Yi Yang
  2008-03-03 11:54 ` Dmitry Adamushko
  2008-03-03 15:31 ` Gautham R Shenoy
  0 siblings, 2 replies; 54+ messages in thread
From: Yi Yang @ 2008-03-02 18:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel

When I run a cpu hotplug stress test on a system with 16 logical cpus,
some processes become deadlocked; this can be reproduced on 2.6.25-rc2
and 2.6.25-rc3.

The kernel thread [watchdog/1] isn't reaped by its parent when cpu #1
is set offline, so [kstopmachine] sleeps forever.

After investigating, debugging and analyzing it, I think this is a
scheduler-specific issue. I noticed someone submitted a patch that
mentioned the kthread_should_stop() calling rule:

"In order to safely use kthread_stop() for kthread, there is a requirement
on how its main loop has to be organized. Namely, the sequence of
events that lead to kthread being blocked (schedule()) has to be
ordered as follows:

    set_current_state(TASK_INTERRUPTIBLE);
    if (kthread_should_stop()) break;
    schedule() or similar.

set_current_state() implies a full memory barrier. kthread_stop()
has a matching barrier right after an update of kthread_stop_info.k
and before kthread's wakeup."

I don't think this is a good programming paradigm, because it leads to
subtle bugs very easily; why can't we provide a simple wrapper for it?

This issue looks like such a case, but I tried changing the code to
follow this rule and the issue is still there.

Why isn't the kernel thread [watchdog/1] reaped by its parent? Its state
is TASK_RUNNING with high priority ("R<" means this), so why isn't it done?

Has anyone met such a problem before? Any thoughts?


Here is my stress test script; you may try it on SMP platforms (the more
cpus there are, the easier it is to reproduce).

########################################################################

####### stresscpus.sh #######

#!/bin/sh

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
do
        ./bugexposure.sh $i &
done


######## bugexposure.sh ###########

#!/bin/bash

main(){

    typeset -i CPU=$1
    ./task.sh > /dev/null&
    PID=$!

    if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then
        echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
    fi

    MASK=$((1<<${CPU}))

    taskset -p ${MASK} ${PID} > /dev/null 2>&1

    echo "0" > /sys/devices/system/cpu/cpu${CPU}/online

    echo "1" > /sys/devices/system/cpu/cpu${CPU}/online

    disown $PID
    kill -9 $PID > /dev/null 2>&1

    #echo "PASS\n"

}

typeset -i TEST_CPU=$1
while true
do
        main $TEST_CPU
done


###### task.sh ######
#!/bin/bash

while :
do
  NOOP=1
done

########################################################################

My tracing and analysis info is below:

# ps aux | grep watchdog
root         5  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/0]
root         8  0.0  0.0      0     0 ?        R<   22:26   0:00 [watchdog/1]
root        11  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/2]
root        14  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/3]
root        17  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/4]
root        20  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/5]
root        23  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/6]
root        26  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/7]
root        29  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/8]
root        32  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/9]
root        35  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/10]
root        38  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/11]
root        41  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/12]
root        44  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/13]
root        50  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/15]
root      5641  0.0  0.0  61144   708 pts/0    R+   22:58   0:00 grep watchdog
#

# ps aux | grep ksoftirqd
root         4  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/0]
root        10  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/2]
root        13  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/3]
root        16  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/4]
root        19  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/5]
root        22  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/6]
root        25  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/7]
root        28  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/8]
root        31  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/9]
root        34  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/10]
root        37  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/11]
root        40  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/12]
root        43  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/13]
root        49  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/15]
root      5647  0.0  0.0  61144   712 pts/0    R+   23:00   0:00 grep ksoftirqd
#

#ps aux
...
root      5554  0.0  0.0      0     0 ?        S<   22:42   0:00 [kstopmachine]
root      5556  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5557  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5558  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5559  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5560  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5561  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5562  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5563  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5564  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5565  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5566  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5567  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5568  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
root      5569  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
...


[watchdog/1] isn't terminated, so [kstopmachine] waits for it forever;
as a result, "echo 0 > /sys/devices/system/cpu/cpu1/online" hangs, the
hotplug lock stays held by it, every other contender for that lock waits
forever, and so on...

Here is some softlockup detection output:

process 5471 (task.sh) no longer affine to cpu1
INFO: task group_balance:51 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
group_balance D ffff810001011580     0    51      2
 ffff81027e97fdc0 0000000000000046 ffff81027e97fdd0 ffffffff80472b15
 0000000000000000 ffff81027e97d280 ffff81027c8bd300 ffff81027e97d5d0
 0000000080569da0 ffff81027e97d5d0 000000007e97d280 000000010009d366
Call Trace:
 [<ffffffff80472b15>] thread_return+0x3d/0xa2
 [<ffffffff8023cce5>] lock_timer_base+0x26/0x4b
 [<ffffffff8047338b>] __mutex_lock_slowpath+0x5c/0x93
 [<ffffffff80473221>] mutex_lock+0x1a/0x1e
 [<ffffffff8024f8fd>] get_online_cpus+0x27/0x3e
 [<ffffffff8022fd04>] load_balance_monitor+0x60/0x375
 [<ffffffff8022fca4>] load_balance_monitor+0x0/0x375
 [<ffffffff80245ff6>] kthread+0x47/0x75
 [<ffffffff802301f0>] schedule_tail+0x28/0x5c
 [<ffffffff8020cc78>] child_rip+0xa/0x12
 [<ffffffff80245faf>] kthread+0x0/0x75
 [<ffffffff8020cc6e>] child_rip+0x0/0x12

INFO: task bugexposure.sh:5462 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bugexposure.s D ffff8100010e1580     0  5462      1
 ffff81026a981c58 0000000000000082 ffff81026a981c68 ffffffff80472b15
 00000000000004d6 ffff81027c83c920 ffff81027e8bd680 ffff81027c83cc70
 0000000d7e8bc4c0 ffff81027c83cc70 0000000d00000002 000000010009d809
Call Trace:
 [<ffffffff80472b15>] thread_return+0x3d/0xa2
 [<ffffffff80314b25>] __next_cpu+0x19/0x28
 [<ffffffff8022a5d4>] enqueue_rt_entity+0x4e/0xb1
 [<ffffffff80472f34>] schedule_timeout+0x1e/0xad
 [<ffffffff802280bf>] enqueue_task+0x4d/0x58
 [<ffffffff80472ccf>] wait_for_common+0xd5/0x118
 [<ffffffff8022c31b>] default_wake_function+0x0/0xe
 [<ffffffff80245c93>] kthread_stop+0x58/0x79
 [<ffffffff8046fdcf>] cpu_callback+0x189/0x197
 [<ffffffff8046f59e>] cpu_callback+0x1cf/0x274
 [<ffffffff8026c07e>] writeback_set_ratelimit+0x19/0x68
 [<ffffffff8047658b>] notifier_call_chain+0x29/0x4c
 [<ffffffff8024f74d>] _cpu_down+0x1d6/0x2aa
 [<ffffffff8024f845>] cpu_down+0x24/0x31
 [<ffffffff8038fcf4>] store_online+0x29/0x67
 [<ffffffff802d23e1>] sysfs_write_file+0xd2/0x110
 [<ffffffff8028db49>] vfs_write+0xad/0x136
 [<ffffffff8028e086>] sys_write+0x45/0x6e
 [<ffffffff8020bfd9>] tracesys+0xdc/0xe1




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-02 18:42 [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline Yi Yang
@ 2008-03-03 11:54 ` Dmitry Adamushko
  2008-03-03 11:56   ` Ingo Molnar
  2008-03-03 15:31 ` Gautham R Shenoy
  1 sibling, 1 reply; 54+ messages in thread
From: Dmitry Adamushko @ 2008-03-03 11:54 UTC (permalink / raw)
  To: yi.y.yang; +Cc: Ingo Molnar, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1999 bytes --]

On 02/03/2008, Yi Yang <yi.y.yang@intel.com> wrote:
> [ ... ]
>
>  Why isn't the kernel thread [watchdog/1] reaped by its parent?
> its state
>  is TASK_RUNNING with high priority (R< means this), why it isn't done?
>
>  Anyone ever met such a problem? Your thought?
>

iirc, Andrew had the same issue.

'watchdog's are supposed to be stopped with kthread_stop() from
softlockup.c :: cpu_callback() :: case CPU_DEAD.

The 'R' state of 'watchdog' is strange indeed.

would you please conduct a test with the patch [1] below and provide
me with the additional output it generates?

(non-white-space-damaged versions are enclosed)

with your current set-up and also with
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"

Then please try also the patch [2] (on top of [1]). With this 'magic'
patch applied, the issue 'seemed' to disappear on Andrew's set-up...

Thanks in advance,


[1]

--- softlockup-prev.c   2008-03-03 12:35:01.000000000 +0100
+++ softlockup.c        2008-03-03 12:38:01.000000000 +0100
@@ -237,6 +237,8 @@ static int watchdog(void *__bind_cpu)

        }

+       printk(KERN_WARN "-> watchdog(cpu: %d) is done\n", this_cpu);
+
        return 0;
 }

@@ -249,6 +251,9 @@ cpu_callback(struct notifier_block *nfb,
        int hotcpu = (unsigned long)hcpu;
        struct task_struct *p;

+       printk(KERN_WARN "-> cpu_callback(cpu: %d, action: %lu,
check_cpu: %d)\n",
+               hotcpu, action, check_cpu);
+
        switch (action) {
        case CPU_UP_PREPARE:
        case CPU_UP_PREPARE_FROZEN:


[2]

--- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c        2008-03-03 12:39:02.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
        case CPU_DEAD_FROZEN:
                p = per_cpu(watchdog_task, hotcpu);
                per_cpu(watchdog_task, hotcpu) = NULL;
+               mlseep(1);
                kthread_stop(p);
                break;
 #endif /* CONFIG_HOTPLUG_CPU */


-- 
Best regards,
Dmitry Adamushko

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: softlockup-debug-1.patch --]
[-- Type: text/x-patch; name=softlockup-debug-1.patch, Size: 567 bytes --]

--- softlockup-prev.c	2008-03-03 12:35:01.000000000 +0100
+++ softlockup.c	2008-03-03 12:38:01.000000000 +0100
@@ -237,6 +237,8 @@ static int watchdog(void *__bind_cpu)
 
 	}
 
+	printk(KERN_WARN "-> watchdog(cpu: %d) is done\n", this_cpu);
+
 	return 0;
 }
 
@@ -249,6 +251,9 @@ cpu_callback(struct notifier_block *nfb,
 	int hotcpu = (unsigned long)hcpu;
 	struct task_struct *p;
 
+	printk(KERN_WARN "-> cpu_callback(cpu: %d, action: %lu, check_cpu: %d)\n",
+		hotcpu, action, check_cpu);
+
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: softlockup-debug-2.patch --]
[-- Type: text/x-patch; name=softlockup-debug-2.patch, Size: 356 bytes --]

--- softlockup-prev-2.c	2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c	2008-03-03 12:39:02.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
 	case CPU_DEAD_FROZEN:
 		p = per_cpu(watchdog_task, hotcpu);
 		per_cpu(watchdog_task, hotcpu) = NULL;
+		mlseep(1);
 		kthread_stop(p);
 		break;
 #endif /* CONFIG_HOTPLUG_CPU */

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-03 11:54 ` Dmitry Adamushko
@ 2008-03-03 11:56   ` Ingo Molnar
  2008-03-03 12:02     ` Dmitry Adamushko
  0 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2008-03-03 11:56 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: yi.y.yang, akpm, linux-kernel


* Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:

>                 per_cpu(watchdog_task, hotcpu) = NULL;
> +               mlseep(1);

that wont build very well ...

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-03 11:56   ` Ingo Molnar
@ 2008-03-03 12:02     ` Dmitry Adamushko
  2008-03-03 14:53       ` Yi Yang
  0 siblings, 1 reply; 54+ messages in thread
From: Dmitry Adamushko @ 2008-03-03 12:02 UTC (permalink / raw)
  To: yi.y.yang; +Cc: Ingo Molnar, akpm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 893 bytes --]

On 03/03/2008, Ingo Molnar <mingo@elte.hu> wrote:
>
>  * Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:
>
>  >                 per_cpu(watchdog_task, hotcpu) = NULL;
>  > +               mlseep(1);
>
>
> that wont build very well ...

yeah, I forgot to mention that it's not even compile-tested :-/
I re-created it from scratch instead of looking for the original one.

please, this one (again, not compile-tested)

--- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c        2008-03-03 13:00:20.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
        case CPU_DEAD_FROZEN:
                p = per_cpu(watchdog_task, hotcpu);
                per_cpu(watchdog_task, hotcpu) = NULL;
+               msleep(1);
                kthread_stop(p);
                break;
 #endif /* CONFIG_HOTPLUG_CPU */


-- 
Best regards,
Dmitry Adamushko

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: softlockup-debug-2.patch --]
[-- Type: text/x-patch; name=softlockup-debug-2.patch, Size: 356 bytes --]

--- softlockup-prev-2.c	2008-03-03 12:38:36.000000000 +0100
+++ softlockup.c	2008-03-03 13:00:20.000000000 +0100
@@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
 	case CPU_DEAD_FROZEN:
 		p = per_cpu(watchdog_task, hotcpu);
 		per_cpu(watchdog_task, hotcpu) = NULL;
+		msleep(1);
 		kthread_stop(p);
 		break;
 #endif /* CONFIG_HOTPLUG_CPU */

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-03 15:31 ` Gautham R Shenoy
@ 2008-03-03 14:45   ` Yi Yang
  2008-03-04  5:26     ` Gautham R Shenoy
  0 siblings, 1 reply; 54+ messages in thread
From: Yi Yang @ 2008-03-03 14:45 UTC (permalink / raw)
  To: ego; +Cc: Ingo Molnar, akpm, linux-kernel, Oleg Nesterov, Rafael J. Wysocki

On Mon, 2008-03-03 at 21:01 +0530, Gautham R Shenoy wrote:
> > This issue looks like such a case, but I tried changing the code to
> > follow this rule and the issue is still there.
> > 
> > Why isn't the kernel thread [watchdog/1] reaped by its parent? Its state
> > is TASK_RUNNING with high priority ("R<" means this), so why isn't it done?
> > 
> > Has anyone met such a problem before? Any thoughts?
> 
> Hi Yi,
> 
> This is indeed strange. I am able to reproduce this problem on my 4-way
> box. From what I see in the past two runs, we're waiting in the
> cpu-hotplug callback path for the watchdog/1 thread to stop.
> 
> During cpu-offline, once the cpu goes offline, in migration_call(),
> we migrate any tasks associated with the offline cpu
> to some other cpu. This also means breaking affinity for tasks which were
> affined to the cpu which went down. So watchdog/1 has been migrated to
> some other cpu.
No, [watchdog/1] is just for CPU #1; if CPU #1 has gone offline, it
should be killed rather than migrated to another CPU, because every
other CPU already has such a kthread.

Maybe migration_call was doing such a bad thing. :-)
> 
> However, it remains in the R< state and has not executed the
> kthread_should_stop() check.
> 
> I'm trying to probe further by inserting a few more printk's in there.
> 
> Will post the findings in a couple of hours.
> 
> Thanks for reporting the problem.
> 
> Regards
> gautham.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-03 12:02     ` Dmitry Adamushko
@ 2008-03-03 14:53       ` Yi Yang
  2008-03-03 17:37         ` Yi Yang
  0 siblings, 1 reply; 54+ messages in thread
From: Yi Yang @ 2008-03-03 14:53 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, akpm, linux-kernel

On Mon, 2008-03-03 at 13:02 +0100, Dmitry Adamushko wrote:
> On 03/03/2008, Ingo Molnar <mingo@elte.hu> wrote:
> >
> >  * Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:
> >
> >  >                 per_cpu(watchdog_task, hotcpu) = NULL;
> >  > +               mlseep(1);
> >
> >
> > that wont build very well ...
> 
> yeah, I forgot to mention that it's not even compile-tested :-/
> I re-created it from scratch instead of looking for the original one.
> 
> please, this one (again, not compile-tested)
> 
> --- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
> +++ softlockup.c        2008-03-03 13:00:20.000000000 +0100
> @@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
>         case CPU_DEAD_FROZEN:
>                 p = per_cpu(watchdog_task, hotcpu);
>                 per_cpu(watchdog_task, hotcpu) = NULL;
> +               msleep(1);
>                 kthread_stop(p);
>                 break;
>  #endif /* CONFIG_HOTPLUG_CPU */

I don't think it can fix this issue; it only gives the scheduler one
more chance to run. I think there is another potentially very serious
issue inside the scheduler, the locking, or something else we don't know yet.

Maybe migration is a suspect, as Gautham mentioned.
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-02 18:42 [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline Yi Yang
  2008-03-03 11:54 ` Dmitry Adamushko
@ 2008-03-03 15:31 ` Gautham R Shenoy
  2008-03-03 14:45   ` Yi Yang
  1 sibling, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-03 15:31 UTC (permalink / raw)
  To: Yi Yang; +Cc: Ingo Molnar, akpm, linux-kernel, Oleg Nesterov, Rafael J. Wysocki

On Mon, Mar 03, 2008 at 02:42:09AM +0800, Yi Yang wrote:
> When I run a cpu hotplug stress test on a system with 16 logical cpus,
> some processes become deadlocked; this can be reproduced on 2.6.25-rc2
> and 2.6.25-rc3.
> 
> The kernel thread [watchdog/1] isn't reaped by its parent when cpu #1
> is set offline, so [kstopmachine] sleeps forever.
> 
> After investigating, debugging and analyzing it, I think this is a
> scheduler-specific issue. I noticed someone submitted a patch that
> mentioned the kthread_should_stop() calling rule:
> 
> "In order to safely use kthread_stop() for kthread, there is a requirement
> on how its main loop has to be organized. Namely, the sequence of
> events that lead to kthread being blocked (schedule()) has to be
> ordered as follows:
> 
>     set_current_state(TASK_INTERRUPTIBLE);
>     if (kthread_should_stop()) break;
>     schedule() or similar.
> 
> set_current_state() implies a full memory barrier. kthread_stop()
> has a matching barrier right after an update of kthread_stop_info.k
> and before kthread's wakeup."
> 
> I don't think this is a good programming paradigm, because it leads to
> subtle bugs very easily; why can't we provide a simple wrapper for it?
> 
> This issue looks like such a case, but I tried changing the code to
> follow this rule and the issue is still there.
> 
> Why isn't the kernel thread [watchdog/1] reaped by its parent? Its state
> is TASK_RUNNING with high priority ("R<" means this), so why isn't it done?
> 
> Has anyone met such a problem before? Any thoughts?

Hi Yi,

This is indeed strange. I am able to reproduce this problem on my 4-way
box. From what I see in the past two runs, we're waiting in the
cpu-hotplug callback path for the watchdog/1 thread to stop.

During cpu-offline, once the cpu goes offline, in migration_call(),
we migrate any tasks associated with the offline cpu
to some other cpu. This also means breaking affinity for tasks which were
affined to the cpu which went down. So watchdog/1 has been migrated to
some other cpu.

However, it remains in the R< state and has not executed the
kthread_should_stop() check.

I'm trying to probe further by inserting a few more printk's in there.

Will post the findings in a couple of hours.

Thanks for reporting the problem.

Regards
gautham.


> 
> 
> Here is my stress test script; you may try it on SMP platforms (the more
> cpus there are, the easier it is to reproduce).
> 
> ########################################################################
> 
> ####### stresscpus.sh #######
> 
> #!/bin/sh
> 
> for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> do
>         ./bugexposure.sh $i &
> done
> 
> 
> ######## bugexposure.sh ###########
> 
> #!/bin/bash
> 
> main(){
> 
>     typeset -i CPU=$1
>     ./task.sh > /dev/null&
>     PID=$!
> 
>     if [ `cat /sys/devices/system/cpu/cpu${CPU}/online` = "0" ]; then
>         echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
>     fi
> 
>     MASK=$((1<<${CPU}))
> 
>     `taskset -p ${MASK} ${PID} > /dev/null 2>&1`
> 
>     echo "0" > /sys/devices/system/cpu/cpu${CPU}/online
> 
>     echo "1" > /sys/devices/system/cpu/cpu${CPU}/online
> 
>     disown $PID
>     kill -9 $PID > /dev/null 2>&1
> 
>     #echo "PASS\n"
> 
> }
> 
> typeset -i TEST_CPU=$1
> while true
> do
>         main $TEST_CPU
> done
> 
> 
> ###### task.sh ######
> #!/bin/bash
> 
> while :
> do
>   NOOP=1
> done
> 
> ########################################################################
> 
> My tracing and analysis info is below:
> 
> # ps aux | grep watchdog
> root         5  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/0]
> root         8  0.0  0.0      0     0 ?        R<   22:26   0:00 [watchdog/1]
> root        11  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/2]
> root        14  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/3]
> root        17  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/4]
> root        20  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/5]
> root        23  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/6]
> root        26  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/7]
> root        29  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/8]
> root        32  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/9]
> root        35  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/10]
> root        38  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/11]
> root        41  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/12]
> root        44  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/13]
> root        50  0.0  0.0      0     0 ?        S<   22:26   0:00 [watchdog/15]
> root      5641  0.0  0.0  61144   708 pts/0    R+   22:58   0:00 grep watchdog
> #
> 
> # ps aux | grep ksoftirqd
> root         4  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/0]
> root        10  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/2]
> root        13  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/3]
> root        16  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/4]
> root        19  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/5]
> root        22  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/6]
> root        25  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/7]
> root        28  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/8]
> root        31  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/9]
> root        34  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/10]
> root        37  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/11]
> root        40  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/12]
> root        43  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/13]
> root        49  0.0  0.0      0     0 ?        S<   22:26   0:00 [ksoftirqd/15]
> root      5647  0.0  0.0  61144   712 pts/0    R+   23:00   0:00 grep ksoftirqd
> #
> 
> #ps aux
> ...
> root      5554  0.0  0.0      0     0 ?        S<   22:42   0:00 [kstopmachine]
> root      5556  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5557  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5558  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5559  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5560  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5561  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5562  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5563  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5564  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5565  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5566  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5567  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5568  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> root      5569  0.0  0.0      0     0 ?        Z<   22:42   0:00 [kstopmachine] <defunct>
> ...
> 
> 
> [watchdog/1] isn't terminated, so [kstopmachine] waits for it forever;
> as a result, "echo 0 > /sys/devices/system/cpu/cpu1/online" hangs, the
> hotplug lock stays held by it, every other contender for that lock waits
> forever, and so on...
> 
> These are some softlock detection output:
> 
> process 5471 (task.sh) no longer affine to cpu1
> INFO: task group_balance:51 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> group_balance D ffff810001011580     0    51      2
>  ffff81027e97fdc0 0000000000000046 ffff81027e97fdd0 ffffffff80472b15
>  0000000000000000 ffff81027e97d280 ffff81027c8bd300 ffff81027e97d5d0
>  0000000080569da0 ffff81027e97d5d0 000000007e97d280 000000010009d366
> Call Trace:
>  [<ffffffff80472b15>] thread_return+0x3d/0xa2
>  [<ffffffff8023cce5>] lock_timer_base+0x26/0x4b
>  [<ffffffff8047338b>] __mutex_lock_slowpath+0x5c/0x93
>  [<ffffffff80473221>] mutex_lock+0x1a/0x1e
>  [<ffffffff8024f8fd>] get_online_cpus+0x27/0x3e
>  [<ffffffff8022fd04>] load_balance_monitor+0x60/0x375
>  [<ffffffff8022fca4>] load_balance_monitor+0x0/0x375
>  [<ffffffff80245ff6>] kthread+0x47/0x75
>  [<ffffffff802301f0>] schedule_tail+0x28/0x5c
>  [<ffffffff8020cc78>] child_rip+0xa/0x12
>  [<ffffffff80245faf>] kthread+0x0/0x75
>  [<ffffffff8020cc6e>] child_rip+0x0/0x12
> 
> INFO: task bugexposure.sh:5462 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> bugexposure.s D ffff8100010e1580     0  5462      1
>  ffff81026a981c58 0000000000000082 ffff81026a981c68 ffffffff80472b15
>  00000000000004d6 ffff81027c83c920 ffff81027e8bd680 ffff81027c83cc70
>  0000000d7e8bc4c0 ffff81027c83cc70 0000000d00000002 000000010009d809
> Call Trace:
>  [<ffffffff80472b15>] thread_return+0x3d/0xa2
>  [<ffffffff80314b25>] __next_cpu+0x19/0x28
>  [<ffffffff8022a5d4>] enqueue_rt_entity+0x4e/0xb1
>  [<ffffffff80472f34>] schedule_timeout+0x1e/0xad
>  [<ffffffff802280bf>] enqueue_task+0x4d/0x58
>  [<ffffffff80472ccf>] wait_for_common+0xd5/0x118
>  [<ffffffff8022c31b>] default_wake_function+0x0/0xe
>  [<ffffffff80245c93>] kthread_stop+0x58/0x79
>  [<ffffffff8046fdcf>] cpu_callback+0x189/0x197
>  [<ffffffff8046f59e>] cpu_callback+0x1cf/0x274
>  [<ffffffff8026c07e>] writeback_set_ratelimit+0x19/0x68
>  [<ffffffff8047658b>] notifier_call_chain+0x29/0x4c
>  [<ffffffff8024f74d>] _cpu_down+0x1d6/0x2aa
>  [<ffffffff8024f845>] cpu_down+0x24/0x31
>  [<ffffffff8038fcf4>] store_online+0x29/0x67
>  [<ffffffff802d23e1>] sysfs_write_file+0xd2/0x110
>  [<ffffffff8028db49>] vfs_write+0xad/0x136
>  [<ffffffff8028e086>] sys_write+0x45/0x6e
>  [<ffffffff8020bfd9>] tracesys+0xdc/0xe1
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-03 14:53       ` Yi Yang
@ 2008-03-03 17:37         ` Yi Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Yi Yang @ 2008-03-03 17:37 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, akpm, linux-kernel

On Mon, 2008-03-03 at 22:53 +0800, Yi Yang wrote:
> On Mon, 2008-03-03 at 13:02 +0100, Dmitry Adamushko wrote:
> > On 03/03/2008, Ingo Molnar <mingo@elte.hu> wrote:
> > >
> > >  * Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:
> > >
> > >  >                 per_cpu(watchdog_task, hotcpu) = NULL;
> > >  > +               mlseep(1);
> > >
> > >
> > > that wont build very well ...
> > 
> > yeah, I forgot to mention that it's not even compile-tested :-/
> > I re-created it from scratch instead of looking for the original one.
> > 
> > please, this one (again, not compile-tested)
> > 
> > --- softlockup-prev-2.c 2008-03-03 12:38:36.000000000 +0100
> > +++ softlockup.c        2008-03-03 13:00:20.000000000 +0100
> > @@ -294,6 +294,7 @@ cpu_callback(struct notifier_block *nfb,
> >         case CPU_DEAD_FROZEN:
> >                 p = per_cpu(watchdog_task, hotcpu);
> >                 per_cpu(watchdog_task, hotcpu) = NULL;
> > +               msleep(1);
> >                 kthread_stop(p);
> >                 break;
> >  #endif /* CONFIG_HOTPLUG_CPU */
> 
> I don't think it can fix this issue; it only gives the scheduler one
> more chance to run. I think there is another potentially very serious
> issue inside the scheduler, the locking, or something else we don't know yet.
> 
> Maybe migration is a suspect, as Gautham mentioned.

That issue is still there after the above patch is applied.

I found that [watchdog/#] is indeed migrated to another cpu because
migration_call() is called before cpu_callback(); I think this is very
likely the real root cause.

I suggest we develop a new notifier infrastructure in which a caller
can specify that it will kthread_stop() a cpu-bound kthread, so that
such notifier callbacks are executed before the other callbacks.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-04  9:09       ` Gautham R Shenoy
@ 2008-03-03 21:56         ` Yi Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Yi Yang @ 2008-03-03 21:56 UTC (permalink / raw)
  To: ego
  Cc: Ingo Molnar, akpm, linux-kernel, Oleg Nesterov,
	Rafael J. Wysocki, Thomas Gleixner

> This is the hung_task_timeout message after a couple of cpu-offlines.
> 
> This is on 2.6.25-rc3.
> 
> INFO: task bash:4467 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>        f3701dd0 00000046 f796aac0 f796aac0 f796abf8 cc434b80 00000000 f41ee940 
>        0180b046 0000026e 00000016 00000000 00000008 f796b080 f796aac0 00000002 
>        7fffffff 7fffffff f3701e1c f3701df8 c04e033a f3701e1c f3701dec c0139dec 
> Call Trace:
>  [<c04e033a>] schedule_timeout+0x16/0x8b
>  [<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
>  [<c04e01c9>] wait_for_common+0xcf/0x12e
>  [<c011a3f0>] ? default_wake_function+0x0/0xd
>  [<c04e02aa>] wait_for_completion+0x12/0x14
>  [<c012ccbb>] flush_cpu_workqueue+0x50/0x66
>  [<c012cd28>] ? wq_barrier_func+0x0/0xd
>  [<c012cd14>] cleanup_workqueue_thread+0x43/0x57
>  [<c04c6f87>] workqueue_cpu_callback+0x8e/0xbd
>  [<c04e3975>] notifier_call_chain+0x2b/0x4a
>  [<c0132e9d>] __raw_notifier_call_chain+0xe/0x10
>  [<c0132eab>] raw_notifier_call_chain+0xc/0xe
>  [<c013e054>] _cpu_down+0x150/0x1ec
>  [<c013e133>] cpu_down+0x23/0x30
>  [<c02e3897>] store_online+0x27/0x5a
>  [<c02e3870>] ? store_online+0x0/0x5a
>  [<c02e09d8>] sysdev_store+0x20/0x25
>  [<c0196d2d>] sysfs_write_file+0xad/0xdf
>  [<c0196c80>] ? sysfs_write_file+0x0/0xdf
>  [<c0163da9>] vfs_write+0x8c/0x108
>  [<c0164333>] sys_write+0x3b/0x60
>  [<c01049da>] sysenter_past_esp+0x5f/0xa5
>  =======================
> 3 locks held by bash/4467:
>  #0:  (&buffer->mutex){--..}, at: [<c0196ca5>] sysfs_write_file+0x25/0xdf
>  #1:  (cpu_add_remove_lock){--..}, at: [<c013e10e>] cpu_maps_update_begin+0xf/0x11
>  #2:  (cpu_hotplug_lock){----}, at: [<c013df5b>] _cpu_down+0x57/0x1ec
> 
> So it's not just a not reaping of watchdog thread issue.
> 
> I doubt it's due to some locking dependency since we have lockdep checks
> in the workqueue code before we flush the cpu_workqueue.
You may run "echo 1 > /proc/sys/kernel/sysrq" and then
"echo t > /proc/sysrq-trigger", then check the dmesg output; it gives
[watchdog/#]'s call stack, which shows where the thread currently is.

On my machine, that indicated [watchdog/1] is calling
sched_setscheduler. I suspect it is being killed before it has been
started and woken up, which may result in some synchronization issues.

> 
> --
> Thanks and Regards
> gautham


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-03 14:45   ` Yi Yang
@ 2008-03-04  5:26     ` Gautham R Shenoy
  2008-03-04  9:09       ` Gautham R Shenoy
  2008-03-04 15:01       ` Oleg Nesterov
  0 siblings, 2 replies; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-04  5:26 UTC (permalink / raw)
  To: Yi Yang
  Cc: Ingo Molnar, akpm, linux-kernel, Oleg Nesterov,
	Rafael J. Wysocki, Thomas Gleixner

On Mon, Mar 03, 2008 at 10:45:04PM +0800, Yi Yang wrote:
> On Mon, 2008-03-03 at 21:01 +0530, Gautham R Shenoy wrote:
> > > This issue seems such one, but i tried to change it to follow this rule but
> > > the issue is still there.
> > > 
> > > Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
> > > is TASK_RUNNING with high priority (R< means this), why it isn't done?
> > > 
> > > Anyone ever met such a problem? Your thought?
> > 
> > Hi Yi,
> > 
> > This is indeed strange. I am able to reproduce this problem on my 4-way
> > box. From what I see in the past two runs, we're waiting in the
> > cpu-hotplug callback path for the watchdog/1 thread to stop.
> > 
> > During cpu-offline, once the cpu goes offline, in the migration_call(), 
> > we migrate any tasks associated with the offline cpus
> > to some other cpu. This also mean breaking affinity for tasks which were
> > affined to the cpu which went down. So watchdog/1 has been migrated to
> > some other cpu.
> No, [watchdog/1] is just for CPU #1, if CPU #1 has been offline, it
> should be killed but not migrated to other CPU because other CPU has
> such a kthread.

Yes, it is killed once it gets a chance to run *after* the cpu goes
offline. The moment it runs on some other cpu, it will see
kthread_should_stop() return true, because in the cpu-hotplug callback
path we've issued a kthread_stop() on watchdog/1.

Again, we can argue that we could issue the kthread_stop()
in CPU_DOWN_PREPARE rather than in CPU_DEAD, and restart
the thread in CPU_DOWN_FAILED if the cpu-hotplug operation fails.

> 
> Maybe migration_call was doing such a bad thing. :-)

Nope, from what I see migration_call is not having any problems. It is
behaving the way it is supposed to behave :)

The other observation I noted was the WARN_ON_ONCE() in hrtick() [1]
that I am consistently hitting after the first cpu goes offline.

So at times, the callback thread is blocked on kthread_stop(k) in
softlockup.c, while at other times it was blocked in
cleanup_workqueue_threads() in workqueue.c.

This was with the debug patch[2]

Not sure if this is linked to the problem that Yi has pointed out,
but it looks like a regression. I'll see if this can be reproduced on
2.6.24, 2.6.25-rc1 and 2.6.25-rc2.

[1] The WARN_ON_ONCE() trace.

------------[ cut here ]------------
WARNING: at kernel/sched.c:1007 hrtick+0x32/0x6a()
Modules linked in: dock
Pid: 4451, comm: bash Not tainted 2.6.25-rc3 #26
 [<c011f6c8>] warn_on_slowpath+0x41/0x51
 [<c013a0dd>] ? trace_hardirqs_on+0xd3/0x111
 [<c04e43a3>] ? _spin_unlock_irqrestore+0x42/0x58
 [<c02767e7>] ? blk_run_queue+0x64/0x68
 [<c033ae6e>] ? scsi_run_queue+0x18d/0x195
 [<c027fd7b>] ? kobject_put+0x14/0x16
 [<c02e1c3f>] ? put_device+0x11/0x13
 [<c013af7b>] ? __lock_acquire+0xaae/0xaf6
 [<c01320bf>] ? __run_hrtimer+0x35/0x70
 [<c0119a8a>] hrtick+0x32/0x6a
 [<c0119a58>] ? hrtick+0x0/0x6a
 [<c01320c3>] __run_hrtimer+0x39/0x70
 [<c01328e8>] hrtimer_interrupt+0xed/0x156
 [<c0112db9>] smp_apic_timer_interrupt+0x6c/0x7f
 [<c010568b>] apic_timer_interrupt+0x33/0x38
 [<c01202ed>] ? vprintk+0x2d0/0x328
 [<c027fe47>] ? kobject_release+0x4b/0x50
 [<c027fdfc>] ? kobject_release+0x0/0x50
 [<c04dea24>] ? cpuid_class_cpu_callback+0x0/0x50
 [<c0280931>] ? kref_put+0x39/0x44
 [<c027fd7b>] ? kobject_put+0x14/0x16
 [<c02e1c3f>] ? put_device+0x11/0x13
 [<c014e2e3>] ? cpu_swap_callback+0x0/0x3d
 [<c012035a>] printk+0x15/0x17
 [<c04e61d0>] notifier_call_chain+0x40/0x9b
 [<c04e2d25>] ? mutex_unlock+0x8/0xa
 [<c0143c29>] ? __stop_machine_run+0x8c/0x95
 [<c013e1b3>] ? take_cpu_down+0x0/0x27
 [<c01331e8>] __raw_notifier_call_chain+0xe/0x10
 [<c01331f6>] raw_notifier_call_chain+0xc/0xe
 [<c013e37e>] _cpu_down+0x1a4/0x269
 [<c013e466>] cpu_down+0x23/0x30
 [<c02e58e7>] store_online+0x27/0x5a
 [<c02e58c0>] ? store_online+0x0/0x5a
 [<c02e2a9c>] sysdev_store+0x20/0x25
 [<c0197e65>] sysfs_write_file+0xad/0xdf
 [<c0197db8>] ? sysfs_write_file+0x0/0xdf
 [<c0165099>] vfs_write+0x8c/0x108
 [<c0165623>] sys_write+0x3b/0x60
 [<c0104b12>] sysenter_past_esp+0x5f/0xa5
 =======================
---[ end trace 22cbd9e369049151 ]---



[2] The debug patch
----->

Index: linux-2.6.25-rc3/kernel/cpu.c
===================================================================
--- linux-2.6.25-rc3.orig/kernel/cpu.c
+++ linux-2.6.25-rc3/kernel/cpu.c
@@ -18,7 +18,7 @@
 /* Serializes the updates to cpu_online_map, cpu_present_map */
 static DEFINE_MUTEX(cpu_add_remove_lock);
 
-static __cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);
+__cpuinitdata RAW_NOTIFIER_HEAD(cpu_chain);
 
 /* If set, cpu_up and cpu_down will return -EBUSY and do nothing.
  * Should always be manipulated under cpu_add_remove_lock
@@ -207,11 +207,14 @@ static int _cpu_down(unsigned int cpu, i
 	if (!cpu_online(cpu))
 		return -EINVAL;
 
+	printk("[HOTPLUG] calling cpu_hotplug_begin\n");
 	cpu_hotplug_begin();
+	printk("[HOTPLUG] calling CPU_DOWN_PREPARE\n");
 	err = __raw_notifier_call_chain(&cpu_chain, CPU_DOWN_PREPARE | mod,
 					hcpu, -1, &nr_calls);
 	if (err == NOTIFY_BAD) {
 		nr_calls--;
+		printk("[HOTPLUG] calling CPU_DOWN_FAILED\n");
 		__raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
 					  hcpu, nr_calls, NULL);
 		printk("%s: attempt to take down CPU %u failed\n",
@@ -226,10 +229,12 @@ static int _cpu_down(unsigned int cpu, i
 	cpu_clear(cpu, tmp);
 	set_cpus_allowed(current, tmp);
 
+	printk("[HOTPLUG] calling stop_machine_run()\n");
 	p = __stop_machine_run(take_cpu_down, &tcd_param, cpu);
 
 	if (IS_ERR(p) || cpu_online(cpu)) {
 		/* CPU didn't die: tell everyone.  Can't complain. */
+		printk("[HOTPLUG] calling CPU_DOWN_FAILED\n");
 		if (raw_notifier_call_chain(&cpu_chain, CPU_DOWN_FAILED | mod,
 					    hcpu) == NOTIFY_BAD)
 			BUG();
@@ -241,13 +246,16 @@ static int _cpu_down(unsigned int cpu, i
 		goto out_thread;
 	}
 
+	printk("[HOTPLUG] waiting for idle_cpu()\n");
 	/* Wait for it to sleep (leaving idle task). */
 	while (!idle_cpu(cpu))
 		yield();
 
+	printk("[HOTPLUG] calling __cpu_die()\n");
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
 
+	printk("[HOTPLUG] calling CPU_DEAD\n");
 	/* CPU is completely dead: tell everyone.  Too late to complain. */
 	if (raw_notifier_call_chain(&cpu_chain, CPU_DEAD | mod,
 				    hcpu) == NOTIFY_BAD)
@@ -256,11 +264,14 @@ static int _cpu_down(unsigned int cpu, i
 	check_for_tasks(cpu);
 
 out_thread:
+	printk("[HOTPLUG] calling kthread_stop_machine\n");
 	err = kthread_stop(p);
 out_allowed:
 	set_cpus_allowed(current, old_allowed);
 out_release:
+	printk("[HOTPLUG] calling cpu_hotplug_done()\n");
 	cpu_hotplug_done();
+	printk("[HOTPLUG] returning from _cpu_down()\n");
 	return err;
 }
 
Index: linux-2.6.25-rc3/kernel/notifier.c
===================================================================
--- linux-2.6.25-rc3.orig/kernel/notifier.c
+++ linux-2.6.25-rc3/kernel/notifier.c
@@ -5,7 +5,7 @@
 #include <linux/rcupdate.h>
 #include <linux/vmalloc.h>
 #include <linux/reboot.h>
-
+#include <linux/kallsyms.h>
 /*
  *	Notifier list for kernel code which wants to be called
  *	at shutdown. This is used to stop any idling DMA operations
@@ -44,6 +44,7 @@ static int notifier_chain_unregister(str
 	return -ENOENT;
 }
 
+extern struct raw_notifier_head cpu_chain;
 /**
  * notifier_call_chain - Informs the registered notifiers about an event.
  *	@nl:		Pointer to head of the blocking notifier chain
@@ -62,12 +63,21 @@ static int __kprobes notifier_call_chain
 {
 	int ret = NOTIFY_DONE;
 	struct notifier_block *nb, *next_nb;
+	char name_buf[100];
 
 	nb = rcu_dereference(*nl);
 
 	while (nb && nr_to_call) {
 		next_nb = rcu_dereference(nb->next);
+		if (nl == &cpu_chain.head) {
+			sprint_symbol(name_buf, (unsigned long)nb->notifier_call);
+			printk("[HOTPLUG] calling callback:%s\n", name_buf);
+		}
 		ret = nb->notifier_call(nb, val, v);
+		if (nl == &cpu_chain.head) {
+			sprint_symbol(name_buf, (unsigned long)nb->notifier_call);
+			printk("[HOTPLUG] returned from callback:%s\n", name_buf);
+		}
 
 		if (nr_calls)
 			(*nr_calls)++;



> > 
> > However, it remains in R< state and has not executed the
> > kthread_should_stop() instruction.
> > 
> > I'm trying to probe further by inserting a few more printk's in there.
> > 
> > Will post the findings in a couple of hours.
> > 
> > Thanks for reporting the problem.
> > 
> > Regards
> > gautham.
> 

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-04  5:26     ` Gautham R Shenoy
@ 2008-03-04  9:09       ` Gautham R Shenoy
  2008-03-03 21:56         ` Yi Yang
  2008-03-04 15:01       ` Oleg Nesterov
  1 sibling, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-04  9:09 UTC (permalink / raw)
  To: Yi Yang
  Cc: Ingo Molnar, akpm, linux-kernel, Oleg Nesterov,
	Rafael J. Wysocki, Thomas Gleixner

On Tue, Mar 04, 2008 at 10:56:13AM +0530, Gautham R Shenoy wrote:
> On Mon, Mar 03, 2008 at 10:45:04PM +0800, Yi Yang wrote:
> > On Mon, 2008-03-03 at 21:01 +0530, Gautham R Shenoy wrote:
> > > > This issue seems such one, but i tried to change it to follow this rule but
> > > > the issue is still there.
> > > > 
> > > > Why isn't the kernel thread [watchdog/1] reaped by its parent? its state
> > > > is TASK_RUNNING with high priority (R< means this), why it isn't done?
> > > > 
> > > > Anyone ever met such a problem? Your thought?
> > > 
> > > Hi Yi,
> > > 
> > > This is indeed strange. I am able to reproduce this problem on my 4-way
> > > box. From what I see in the past two runs, we're waiting in the
> > > cpu-hotplug callback path for the watchdog/1 thread to stop.
> > > 
> > > During cpu-offline, once the cpu goes offline, in the migration_call(), 
> > > we migrate any tasks associated with the offline cpus
> > > to some other cpu. This also mean breaking affinity for tasks which were
> > > affined to the cpu which went down. So watchdog/1 has been migrated to
> > > some other cpu.
> > No, [watchdog/1] is just for CPU #1, if CPU #1 has been offline, it
> > should be killed but not migrated to other CPU because other CPU has
> > such a kthread.
> 
> Yes, it is killed once it gets a chance to run *after* cpu goes offline.
> The moment it runs on some other cpu, it will see the kthread_should_stop()
> because in the cpu-hotplug callback path we've issues a 
> kthread_stop(watchdog/1)
> 
> Again, we can argue that we could issue a kthread_stop() 
> in CPU_DOWN_PREPARE, rather than in CPU_DEAD and restart 
> it in CPU_DOWN_FAILED if the cpu-hotplug operation does fail.
> 
> > 
> > Maybe migration_call was doing such a bad thing. :-)
> 
> Nope, from what I see migration call is not having any problems. It is
> behaving the way it is supposed to behave :)
> 
> The other observation I noted was the WARN_ON_ONCE() in hrtick() [1]
> that I am consistently hitting after the first cpu goes offline.
> 
> So at times, the callback thread is blocked on kthread_stop(k) in
> softlockup.c, while other time, it was blocked in
> cleanup_workqueue_threads() in workqueue.c. 

This is the hung_task_timeout message after a couple of cpu-offlines.

This is on 2.6.25-rc3.

INFO: task bash:4467 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       f3701dd0 00000046 f796aac0 f796aac0 f796abf8 cc434b80 00000000 f41ee940 
       0180b046 0000026e 00000016 00000000 00000008 f796b080 f796aac0 00000002 
       7fffffff 7fffffff f3701e1c f3701df8 c04e033a f3701e1c f3701dec c0139dec 
Call Trace:
 [<c04e033a>] schedule_timeout+0x16/0x8b
 [<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
 [<c04e01c9>] wait_for_common+0xcf/0x12e
 [<c011a3f0>] ? default_wake_function+0x0/0xd
 [<c04e02aa>] wait_for_completion+0x12/0x14
 [<c012ccbb>] flush_cpu_workqueue+0x50/0x66
 [<c012cd28>] ? wq_barrier_func+0x0/0xd
 [<c012cd14>] cleanup_workqueue_thread+0x43/0x57
 [<c04c6f87>] workqueue_cpu_callback+0x8e/0xbd
 [<c04e3975>] notifier_call_chain+0x2b/0x4a
 [<c0132e9d>] __raw_notifier_call_chain+0xe/0x10
 [<c0132eab>] raw_notifier_call_chain+0xc/0xe
 [<c013e054>] _cpu_down+0x150/0x1ec
 [<c013e133>] cpu_down+0x23/0x30
 [<c02e3897>] store_online+0x27/0x5a
 [<c02e3870>] ? store_online+0x0/0x5a
 [<c02e09d8>] sysdev_store+0x20/0x25
 [<c0196d2d>] sysfs_write_file+0xad/0xdf
 [<c0196c80>] ? sysfs_write_file+0x0/0xdf
 [<c0163da9>] vfs_write+0x8c/0x108
 [<c0164333>] sys_write+0x3b/0x60
 [<c01049da>] sysenter_past_esp+0x5f/0xa5
 =======================
3 locks held by bash/4467:
 #0:  (&buffer->mutex){--..}, at: [<c0196ca5>] sysfs_write_file+0x25/0xdf
 #1:  (cpu_add_remove_lock){--..}, at: [<c013e10e>] cpu_maps_update_begin+0xf/0x11
 #2:  (cpu_hotplug_lock){----}, at: [<c013df5b>] _cpu_down+0x57/0x1ec

So it's not just an issue of the watchdog thread not being reaped.

I doubt it's due to some locking dependency since we have lockdep checks
in the workqueue code before we flush the cpu_workqueue.

--
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-04 15:01       ` Oleg Nesterov
@ 2008-03-04 14:37         ` Yi Yang
  2008-03-06 20:05           ` Yi Yang
  2008-03-05 10:05         ` Gautham R Shenoy
  2008-03-06 13:44         ` Gautham R Shenoy
  2 siblings, 1 reply; 54+ messages in thread
From: Yi Yang @ 2008-03-04 14:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Gautham R Shenoy, Ingo Molnar, akpm, linux-kernel,
	Rafael J. Wysocki, Thomas Gleixner

On Tue, 2008-03-04 at 18:01 +0300, Oleg Nesterov wrote:
> On 03/04, Gautham R Shenoy wrote:
> >
> > So at times, the callback thread is blocked on kthread_stop(k) in
> > softlockup.c, while other time, it was blocked in
> > cleanup_workqueue_threads() in workqueue.c. 
> 
> From another message:
> >
> > However, it remains in R< state
> 
> What about cwq->thread? Was it TASK_RUNNING too?
> 
> Perhaps, for some reason the task can't get CPU after migrating from
> the now dead CPU.
> 
> I can't reproduce this problem on my one cpu P4-ht, perhaps you can
> try something like the untested/uncompiled patch below?
You need a multiprocessor system to reproduce it; 4 CPUs or more is
best.

At the beginning, I thought soft lockup detection
(CONFIG_DETECT_SOFTLOCKUP) was the culprit because it adds a lot of
code to detect lock dependencies and deadlocks, but the issue is still
there after I disabled soft lockup detection; it can still be
reproduced, it just takes much longer.

At this time, group_balance and kblockd are in "R<" state.

[root@harwich-rh ~]# ps aux | grep kstop
root     15466  0.0  0.0      0     0 ?        S<   01:03   0:00 [kstopmachine]
root     15467  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15468  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15469  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15470  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15471  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15472  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15473  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15474  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15475  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     15476  0.0  0.0      0     0 ?        Z<   01:03   0:00 [kstopmachine] <defunct>
root     17155  0.0  0.0  61148   708 pts/3    R+   14:47   0:00 grep kstop
[root@harwich-rh ~]# ps aux | grep "R<"
root        35  0.0  0.0      0     0 ?        R<   Mar13   0:00 [group_balance]
root       182  0.1  0.0      0     0 ?        R<   Mar13   1:09 [kblockd/0]
root     17157  0.0  0.0  61148   704 pts/3    R+   14:47   0:00 grep R<
[root@harwich-rh ~]#



> Oleg.
> 
> --- include/linux/sched.h	2008-02-15 16:59:17.000000000 +0300
> +++ include/linux/sched.h	2008-03-04 17:44:53.136738605 +0300
> @@ -1121,6 +1121,7 @@ struct task_struct {
>  /* hung task detection */
>  	unsigned long last_switch_timestamp;
>  	unsigned long last_switch_count;
> +	unsigned long xxx;
>  #endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
> --- kernel/fork.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/fork.c	2008-03-04 17:45:14.773033839 +0300
> @@ -1097,6 +1097,7 @@ static struct task_struct *copy_process(
>  #ifdef CONFIG_DETECT_SOFTLOCKUP
>  	p->last_switch_count = 0;
>  	p->last_switch_timestamp = 0;
> +	p->xxx = 0;
>  #endif
>  
>  #ifdef CONFIG_TASK_XACCT
> --- kernel/sched.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/sched.c	2008-03-04 17:48:42.308798646 +0300
> @@ -1291,6 +1291,7 @@ static void enqueue_task(struct rq *rq, 
>  	sched_info_queued(p);
>  	p->sched_class->enqueue_task(rq, p, wakeup);
>  	p->se.on_rq = 1;
> +	p->xxx = jiffies | 1;
>  }
>  
>  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> @@ -3944,6 +3945,8 @@ need_resched_nonpreemptible:
>  	preempt_enable_no_resched();
>  	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
>  		goto need_resched;
> +
> +	current->xxx = 0;
>  }
>  EXPORT_SYMBOL(schedule);
>  
> --- kernel/softlockup.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/softlockup.c	2008-03-04 17:49:05.584414763 +0300
> @@ -174,6 +174,27 @@ static void check_hung_task(struct task_
>  	touch_nmi_watchdog();
>  }
>  
> +static void check_running_task(struct task_struct *t, unsigned long now)
> +{
> +	if (!sysctl_hung_task_timeout_secs)
> +		return;
> +
> +	if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs))
> +		return;
> +
> +	printk(KERN_ERR "INFO: task %s:%d can't get CPU for more than "
> +			"%ld seconds.\n", t->comm, t->pid,
> +			sysctl_hung_task_timeout_secs);
> +
> +	if (!cpus_intersects(t->cpus_allowed, cpu_online_map))
> +		printk(KERN_ERR "bad ->cpus_allowed\n");
> +	if (!cpu_online(task_cpu(t)))
> +		printk(KERN_ERR "bad ->cpu\n");
> +
> +	sched_show_task(t);
> +	touch_nmi_watchdog();
> +}
> +
>  /*
>   * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
>   * a really long time (120 seconds). If that happens, print out
> @@ -183,6 +204,7 @@ static void check_hung_uninterruptible_t
>  {
>  	int max_count = sysctl_hung_task_check_count;
>  	unsigned long now = get_timestamp(this_cpu);
> +	unsigned long jiff = jiffies;
>  	struct task_struct *g, *t;
>  
>  	/*
> @@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
>  	if ((tainted & TAINT_DIE) || did_panic)
>  		return;
>  
> -	read_lock(&tasklist_lock);
> +	rcu_read_lock();
>  	do_each_thread(g, t) {
>  		if (!--max_count)
>  			goto unlock;
>  		if (t->state & TASK_UNINTERRUPTIBLE)
>  			check_hung_task(t, now);
> +		if (!t->xxx)
> +			check_running_task(t, jiff);
>  	} while_each_thread(g, t);
>   unlock:
> -	read_unlock(&tasklist_lock);
> +	rcu_read_unlock();
>  }
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-04  5:26     ` Gautham R Shenoy
  2008-03-04  9:09       ` Gautham R Shenoy
@ 2008-03-04 15:01       ` Oleg Nesterov
  2008-03-04 14:37         ` Yi Yang
                           ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Oleg Nesterov @ 2008-03-04 15:01 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On 03/04, Gautham R Shenoy wrote:
>
> So at times, the callback thread is blocked on kthread_stop(k) in
> softlockup.c, while other time, it was blocked in
> cleanup_workqueue_threads() in workqueue.c. 

From another message:
>
> However, it remains in R< state

What about cwq->thread? Was it TASK_RUNNING too?

Perhaps, for some reason the task can't get CPU after migrating from
the now dead CPU.

I can't reproduce this problem on my one cpu P4-ht, perhaps you can
try something like the untested/uncompiled patch below?

Oleg.

--- include/linux/sched.h	2008-02-15 16:59:17.000000000 +0300
+++ include/linux/sched.h	2008-03-04 17:44:53.136738605 +0300
@@ -1121,6 +1121,7 @@ struct task_struct {
 /* hung task detection */
 	unsigned long last_switch_timestamp;
 	unsigned long last_switch_count;
+	unsigned long xxx;
 #endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
--- kernel/fork.c	2008-02-15 16:59:17.000000000 +0300
+++ kernel/fork.c	2008-03-04 17:45:14.773033839 +0300
@@ -1097,6 +1097,7 @@ static struct task_struct *copy_process(
 #ifdef CONFIG_DETECT_SOFTLOCKUP
 	p->last_switch_count = 0;
 	p->last_switch_timestamp = 0;
+	p->xxx = 0;
 #endif
 
 #ifdef CONFIG_TASK_XACCT
--- kernel/sched.c	2008-02-15 16:59:17.000000000 +0300
+++ kernel/sched.c	2008-03-04 17:48:42.308798646 +0300
@@ -1291,6 +1291,7 @@ static void enqueue_task(struct rq *rq, 
 	sched_info_queued(p);
 	p->sched_class->enqueue_task(rq, p, wakeup);
 	p->se.on_rq = 1;
+	p->xxx = jiffies | 1;
 }
 
 static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
@@ -3944,6 +3945,8 @@ need_resched_nonpreemptible:
 	preempt_enable_no_resched();
 	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
 		goto need_resched;
+
+	current->xxx = 0;
 }
 EXPORT_SYMBOL(schedule);
 
--- kernel/softlockup.c	2008-02-15 16:59:17.000000000 +0300
+++ kernel/softlockup.c	2008-03-04 17:49:05.584414763 +0300
@@ -174,6 +174,27 @@ static void check_hung_task(struct task_
 	touch_nmi_watchdog();
 }
 
+static void check_running_task(struct task_struct *t, unsigned long now)
+{
+	if (!sysctl_hung_task_timeout_secs)
+		return;
+
+	if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs))
+		return;
+
+	printk(KERN_ERR "INFO: task %s:%d can't get CPU for more than "
+			"%ld seconds.\n", t->comm, t->pid,
+			sysctl_hung_task_timeout_secs);
+
+	if (!cpus_intersects(t->cpus_allowed, cpu_online_map))
+		printk(KERN_ERR "bad ->cpus_allowed\n");
+	if (!cpu_online(task_cpu(t)))
+		printk(KERN_ERR "bad ->cpu\n");
+
+	sched_show_task(t);
+	touch_nmi_watchdog();
+}
+
 /*
  * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
  * a really long time (120 seconds). If that happens, print out
@@ -183,6 +204,7 @@ static void check_hung_uninterruptible_t
 {
 	int max_count = sysctl_hung_task_check_count;
 	unsigned long now = get_timestamp(this_cpu);
+	unsigned long jiff = jiffies;
 	struct task_struct *g, *t;
 
 	/*
@@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
 	if ((tainted & TAINT_DIE) || did_panic)
 		return;
 
-	read_lock(&tasklist_lock);
+	rcu_read_lock();
 	do_each_thread(g, t) {
 		if (!--max_count)
 			goto unlock;
 		if (t->state & TASK_UNINTERRUPTIBLE)
 			check_hung_task(t, now);
+		if (!t->xxx)
+			check_running_task(t, jiff);
 	} while_each_thread(g, t);
  unlock:
-	read_unlock(&tasklist_lock);
+	rcu_read_unlock();
 }
 
 /*


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-04 15:01       ` Oleg Nesterov
  2008-03-04 14:37         ` Yi Yang
@ 2008-03-05 10:05         ` Gautham R Shenoy
  2008-03-05 13:53           ` Oleg Nesterov
  2008-03-06 13:44         ` Gautham R Shenoy
  2 siblings, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-05 10:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> On 03/04, Gautham R Shenoy wrote:
> >
> > So at times, the callback thread is blocked on kthread_stop(k) in
> > softlockup.c, while other time, it was blocked in
> > cleanup_workqueue_threads() in workqueue.c. 
> 
> From another message:
> >
> > However, it remains in R< state
> 
> What about cwq->thread? Was it TASK_RUNNING too?

No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
executed was the wait_for_completion() in flush_cpu_workqueue().

> 
> Perhaps, for some reason the task can't get CPU after migrating from
> the now dead CPU.
> 
> I can't reproduce this problem on my one cpu P4-ht, perhaps you can
> try something like the untested/uncompiled patch below?
> 
> Oleg.
> 
> --- include/linux/sched.h	2008-02-15 16:59:17.000000000 +0300
> +++ include/linux/sched.h	2008-03-04 17:44:53.136738605 +0300
> @@ -1121,6 +1121,7 @@ struct task_struct {
>  /* hung task detection */
>  	unsigned long last_switch_timestamp;
>  	unsigned long last_switch_count;
> +	unsigned long xxx;
>  #endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
> --- kernel/fork.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/fork.c	2008-03-04 17:45:14.773033839 +0300
> @@ -1097,6 +1097,7 @@ static struct task_struct *copy_process(
>  #ifdef CONFIG_DETECT_SOFTLOCKUP
>  	p->last_switch_count = 0;
>  	p->last_switch_timestamp = 0;
> +	p->xxx = 0;
>  #endif
> 
>  #ifdef CONFIG_TASK_XACCT
> --- kernel/sched.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/sched.c	2008-03-04 17:48:42.308798646 +0300
> @@ -1291,6 +1291,7 @@ static void enqueue_task(struct rq *rq, 
>  	sched_info_queued(p);
>  	p->sched_class->enqueue_task(rq, p, wakeup);
>  	p->se.on_rq = 1;
> +	p->xxx = jiffies | 1;
>  }
> 
>  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> @@ -3944,6 +3945,8 @@ need_resched_nonpreemptible:
>  	preempt_enable_no_resched();
>  	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
>  		goto need_resched;
> +
> +	current->xxx = 0;
>  }
>  EXPORT_SYMBOL(schedule);
> 
> --- kernel/softlockup.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/softlockup.c	2008-03-04 17:49:05.584414763 +0300
> @@ -174,6 +174,27 @@ static void check_hung_task(struct task_
>  	touch_nmi_watchdog();
>  }
> 
> +static void check_running_task(struct task_struct *t, unsigned long now)
> +{
> +	if (!sysctl_hung_task_timeout_secs)
> +		return;
> +
> +	if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs))
> +		return;
> +
> +	printk(KERN_ERR "INFO: task %s:%d can't get CPU for more than "
> +			"%ld seconds.\n", t->comm, t->pid,
> +			sysctl_hung_task_timeout_secs);
> +
> +	if (!cpus_intersects(t->cpus_allowed, cpu_online_map))
> +		printk(KERN_ERR "bad ->cpus_allowed\n");
> +	if (!cpu_online(task_cpu(t)))
> +		printk(KERN_ERR "bad ->cpu\n");
> +
> +	sched_show_task(t);
> +	touch_nmi_watchdog();
> +}
> +
>  /*
>   * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
>   * a really long time (120 seconds). If that happens, print out
> @@ -183,6 +204,7 @@ static void check_hung_uninterruptible_t
>  {
>  	int max_count = sysctl_hung_task_check_count;
>  	unsigned long now = get_timestamp(this_cpu);
> +	unsigned long jiff = jiffies;
>  	struct task_struct *g, *t;
> 
>  	/*
> @@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
>  	if ((tainted & TAINT_DIE) || did_panic)
>  		return;
> 
> -	read_lock(&tasklist_lock);
> +	rcu_read_lock();
>  	do_each_thread(g, t) {
>  		if (!--max_count)
>  			goto unlock;
>  		if (t->state & TASK_UNINTERRUPTIBLE)
>  			check_hung_task(t, now);
> +		if (!t->xxx)
> +			check_running_task(t, jiff);
>  	} while_each_thread(g, t);
>   unlock:
> -	read_unlock(&tasklist_lock);
> +	rcu_read_unlock();
>  }
> 
>  /*

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline
  2008-03-05 10:05         ` Gautham R Shenoy
@ 2008-03-05 13:53           ` Oleg Nesterov
  2008-03-06 11:15             ` Gautham R Shenoy
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2008-03-05 13:53 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On 03/05, Gautham R Shenoy wrote:
>
> On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > On 03/04, Gautham R Shenoy wrote:
> > >
> > > So at times, the callback thread is blocked on kthread_stop(k) in
> > > softlockup.c, while other time, it was blocked in
> > > cleanup_workqueue_threads() in workqueue.c. 
> > 
> > From another message:
> > >
> > > However, it remains in R< state
> > 
> > What about cwq->thread? Was it TASK_RUNNING too?
> 
> No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
> executed was the wait_for_completion in flush_cpu_workqueue()

You misunderstood ;) The task doing _cpu_down() hangs in flush_cpu_workqueue(),
because cwq->thread doesn't want to die. But what was cwq->thread->state?

Could you try the patch? Yi constantly sees the tasks in R< state.
I am starting to suspect some scheduling issue.

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-05 13:53           ` Oleg Nesterov
@ 2008-03-06 11:15             ` Gautham R Shenoy
  2008-03-06 12:22               ` Gautham R Shenoy
  0 siblings, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-06 11:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On Wed, Mar 05, 2008 at 04:53:14PM +0300, Oleg Nesterov wrote:
> On 03/05, Gautham R Shenoy wrote:
> >
> > On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > > On 03/04, Gautham R Shenoy wrote:
> > > >
> > > > So at times, the callback thread is blocked on kthread_stop(k) in
> > > > softlockup.c, while other time, it was blocked in
> > > > cleanup_workqueue_threads() in workqueue.c. 
> > > 
> > > From another message:
> > > >
> > > > However, it remains in R< state
> > > 
> > > What about cwq->thread? Was it TASK_RUNNING too?
> > 
> > No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
> > executed was the wait_for_completion in flush_cpu_workqueue()
> 
> You misunderstood ;) The task doing _cpu_down() hangs in flush_cpu_workqueue(),
> because cwq->thread doesn't want to die. But what was cwq->thread->state?
> 

Yup, sorry for that stupid reply.

> Could you try the patch? Yi constantly sees the tasks in R< state.
> I am starting to suspect some scheduling issue.

Me too. With your patch applied there were quite a few tasks in the
running state which didn't get the cpu for more than 120 seconds.

It could be a scheduler problem which CPU-Hotplug is able to catch more
easily. Will try a git bisect.


> 
> Oleg.

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-06 11:15             ` Gautham R Shenoy
@ 2008-03-06 12:22               ` Gautham R Shenoy
  0 siblings, 0 replies; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-06 12:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On Thu, Mar 06, 2008 at 04:45:25PM +0530, Gautham R Shenoy wrote:
> On Wed, Mar 05, 2008 at 04:53:14PM +0300, Oleg Nesterov wrote:
> > On 03/05, Gautham R Shenoy wrote:
> > >
> > > On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > > > On 03/04, Gautham R Shenoy wrote:
> > > > >
> > > > > So at times, the callback thread is blocked on kthread_stop(k) in
> > > > > softlockup.c, while other time, it was blocked in
> > > > > cleanup_workqueue_threads() in workqueue.c. 
> > > > 
> > > > From another message:
> > > > >
> > > > > However, it remains in R< state
> > > > 
> > > > What about cwq->thread? Was it TASK_RUNNING too?
> > > 
> > > No, it was in TASK_UNINTERRUPTIBLE state. The last thing it ever
> > > executed was the wait_for_completion in flush_cpu_workqueue()
> > 
> > You misunderstood ;) The task doing _cpu_down() hangs in flush_cpu_workqueue(),
> > because cwq->thread doesn't want to die. But what was cwq->thread->state?
> > 
> 
> Yup, sorry for that stupid reply.
> 
> > Could you try the patch? Yi constantly sees the tasks in R< state.
> > I am starting to suspect some scheduling issue.
> 
> Me too. With your patch applied there were quite a few tasks in the
> running state which didn't get the cpu for more than 120 seconds.
> 
> It could be a scheduler problem which CPU-Hotplug is able to catch more
> easily. Will try a git bisect.

BTW, with your patch, I get the "task can't get CPU for more than
120 seconds" message even while running a kernbench on 2.6.25-rc2.

> 
> 
> > 
> > Oleg.
> 
> -- 
> Thanks and Regards
> gautham

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-04 15:01       ` Oleg Nesterov
  2008-03-04 14:37         ` Yi Yang
  2008-03-05 10:05         ` Gautham R Shenoy
@ 2008-03-06 13:44         ` Gautham R Shenoy
  2008-03-07  2:54           ` Oleg Nesterov
  2 siblings, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-06 13:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> On 03/04, Gautham R Shenoy wrote:
> >
> > So at times, the callback thread is blocked on kthread_stop(k) in
> > softlockup.c, while other time, it was blocked in
> > cleanup_workqueue_threads() in workqueue.c. 
> 
> From another message:
> >
> > However, it remains in R< state
> 
> What about cwq->thread? Was it TASK_RUNNING too?
> 
> Perhaps, for some reason the task can't get CPU after migrating from
> the now dead CPU.
> 
> I can't reproduce this problem on my one cpu P4-ht, perhaps you can
> try something like the untested/uncompiled patch below?
> 
> Oleg.
> 
> --- include/linux/sched.h	2008-02-15 16:59:17.000000000 +0300
> +++ include/linux/sched.h	2008-03-04 17:44:53.136738605 +0300
> @@ -1121,6 +1121,7 @@ struct task_struct {
>  /* hung task detection */
>  	unsigned long last_switch_timestamp;
>  	unsigned long last_switch_count;
> +	unsigned long xxx;
>  #endif
>  /* CPU-specific state of this task */
>  	struct thread_struct thread;
> --- kernel/fork.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/fork.c	2008-03-04 17:45:14.773033839 +0300
> @@ -1097,6 +1097,7 @@ static struct task_struct *copy_process(
>  #ifdef CONFIG_DETECT_SOFTLOCKUP
>  	p->last_switch_count = 0;
>  	p->last_switch_timestamp = 0;
> +	p->xxx = 0;
>  #endif
> 
>  #ifdef CONFIG_TASK_XACCT
> --- kernel/sched.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/sched.c	2008-03-04 17:48:42.308798646 +0300
> @@ -1291,6 +1291,7 @@ static void enqueue_task(struct rq *rq, 
>  	sched_info_queued(p);
>  	p->sched_class->enqueue_task(rq, p, wakeup);
>  	p->se.on_rq = 1;
> +	p->xxx = jiffies | 1;
>  }
> 
>  static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep)
> @@ -3944,6 +3945,8 @@ need_resched_nonpreemptible:
>  	preempt_enable_no_resched();
>  	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
>  		goto need_resched;
> +
> +	current->xxx = 0;
>  }
>  EXPORT_SYMBOL(schedule);
> 
> --- kernel/softlockup.c	2008-02-15 16:59:17.000000000 +0300
> +++ kernel/softlockup.c	2008-03-04 17:49:05.584414763 +0300
> @@ -174,6 +174,27 @@ static void check_hung_task(struct task_
>  	touch_nmi_watchdog();
>  }
> 
> +static void check_running_task(struct task_struct *t, unsigned long now)
> +{
> +	if (!sysctl_hung_task_timeout_secs)
> +		return;
> +

This function gets called only when t->xxx == 0,
so the if below doesn't mean much, does it? :)

> +	if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs))
> +		return;
> +
> +	printk(KERN_ERR "INFO: task %s:%d can't get CPU for more than "
> +			"%ld seconds.\n", t->comm, t->pid,
> +			sysctl_hung_task_timeout_secs);
> +
> +	if (!cpus_intersects(t->cpus_allowed, cpu_online_map))
> +		printk(KERN_ERR "bad ->cpus_allowed\n");
> +	if (!cpu_online(task_cpu(t)))
> +		printk(KERN_ERR "bad ->cpu\n");
> +
> +	sched_show_task(t);
> +	touch_nmi_watchdog();
> +}
> +
>  /*
>   * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
>   * a really long time (120 seconds). If that happens, print out
> @@ -183,6 +204,7 @@ static void check_hung_uninterruptible_t
>  {
>  	int max_count = sysctl_hung_task_check_count;
>  	unsigned long now = get_timestamp(this_cpu);
> +	unsigned long jiff = jiffies;
>  	struct task_struct *g, *t;
> 
>  	/*
> @@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
>  	if ((tainted & TAINT_DIE) || did_panic)
>  		return;
> 
> -	read_lock(&tasklist_lock);
> +	rcu_read_lock();
>  	do_each_thread(g, t) {
>  		if (!--max_count)
>  			goto unlock;
>  		if (t->state & TASK_UNINTERRUPTIBLE)
>  			check_hung_task(t, now);
> +		if (!t->xxx)
> +			check_running_task(t, jiff);
>  	} while_each_thread(g, t);
>   unlock:
> -	read_unlock(&tasklist_lock);
> +	rcu_read_unlock();
>  }
> 
>  /*

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-04 14:37         ` Yi Yang
@ 2008-03-06 20:05           ` Yi Yang
  0 siblings, 0 replies; 54+ messages in thread
From: Yi Yang @ 2008-03-06 20:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Gautham R Shenoy, Ingo Molnar, akpm, linux-kernel,
	Rafael J. Wysocki, Thomas Gleixner

I found that the priority of some processes is always 0, and they are
highly related to this kind of deadlock.

What does priority 0 mean? When and in which case is a task's priority
changed to 0? When can its priority be restored to a higher value?

A simple script to find processes with priority 0:

===================================================================
#!/bin/sh

for file in `ls /proc/*/sched`
do
        prio=`grep prio $file | awk '{print $3;}'`
        if [ "X$prio" == "X0" ] ; then
                echo "`head -n 1 $file | awk '{print $1}'`: $prio"
        fi
done
==================================================================

[root@harwich-rh ~]# ./lsprio
migration/3: 0
migration/4: 0
migration/5: 0
migration/6: 0
migration/7: 0
migration/8: 0
migration/9: 0
migration/10: 0
migration/11: 0
migration/12: 0
migration/0: 0
migration/13: 0
migration/14: 0
migration/15: 0
grep: /proc/5244/sched: No such file or directory
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
kstopmachine: 0
watchdog/1: 0
migration/2: 0

[root@harwich-rh ~]# ps aux | grep "R<"
root         8  0.0  0.0      0     0 ?        R<   19:29   0:00
[watchdog/1]
root      6987  0.0  0.0  61144   708 pts/2    R+   20:16   0:00 grep R<
[root@harwich-rh ~]#




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 10:51               ` Gautham R Shenoy
@ 2008-03-06 23:20                 ` Yi Yang
  2008-03-07 13:02                 ` Dmitry Adamushko
  1 sibling, 0 replies; 54+ messages in thread
From: Yi Yang @ 2008-03-06 23:20 UTC (permalink / raw)
  To: ego
  Cc: Oleg Nesterov, Ingo Molnar, akpm, linux-kernel,
	Rafael J. Wysocki, Thomas Gleixner

On Fri, 2008-03-07 at 16:21 +0530, Gautham R Shenoy wrote:
> On Fri, Mar 07, 2008 at 02:40:49PM +0530, Gautham R Shenoy wrote:
> > 
> > > 
> > > Just to be sure, there were no "bad ->cpu..." messages, yes?
> > 
> > Hopefully should be able to catch them now. If yes, it's a problem in
> > the way we do migration after cpu-hotplug as Yi suggested in an earlier
> > mail.
> > 
> > http://lkml.org/lkml/2008/3/6/437
> > 
> > This mail from akpm says the same thing.
> 
> Yup! There are quite a few "bad ->cpu" messages.
> All of them only for watchdog/1.
> 
> What surprises me is the fact that the first of the task-hung messages comes
> after 136 successful cpu-hotplug attempts.
> 
> To answer Andrew's question, migration_call() *should* ideally be the
> first notifier called, because it's registered with the priority 10.
> 
> Unless some change has recently gone into the notifier_call_chain
> subsystem that no longer honours the registration priority,
> migration_call() will be called first, which means the task affinity for
> the watchdog thread is broken and hence we shouldn't see the message
> below.
> 
> I'll probe to see if there's a change in the call order that's causing
> this hang.
I have traced all the cpu callback functions and verified that
migration_call is called first. I have also tried setting the notifier
priority of cpu_callback in kernel/softlockup.c to the maximum int to see
what happens; the result is the same.

> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> INFO: task watchdog/1:10958 can't get CPU for more than 120 seconds.
> bad ->cpu
>        f524ffac 00000046 c011befc f5276aa0 f5276bd8 c443db80 00000001 f6130b80 
>        f524ff94 00000246 c07130f4 03d28000 00000000 f524ff9c c01438ad f524ffac 
>        00000001 c0143a1f 00000000 f524ffd0 c0143a66 00000001 c0119fea 00000000 
> Call Trace:
>  [<c011befc>] ? cpu_clock+0x4e/0x59
>  [<c01438ad>] ? get_timestamp+0x8/0x11
>  [<c0143a1f>] ? watchdog+0x0/0x239
>  [<c0143a66>] watchdog+0x47/0x239
>  [<c0119fea>] ? complete+0x34/0x3e
>  [<c0143a1f>] ? watchdog+0x0/0x239
>  [<c012f797>] kthread+0x3b/0x64
>  [<c012f75c>] ? kthread+0x0/0x64
>  [<c01056eb>] kernel_thread_helper+0x7/0x10
>  =======================
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> 
> 
> >
> > >
> > > Oleg.
> > 
> > -- 
> > Thanks and Regards
> > gautham
> 


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-06 13:44         ` Gautham R Shenoy
@ 2008-03-07  2:54           ` Oleg Nesterov
  2008-03-07  9:10             ` Gautham R Shenoy
  0 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2008-03-07  2:54 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On 03/06, Gautham R Shenoy wrote:
>
> On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > +static void check_running_task(struct task_struct *t, unsigned long now)
> > +{
> > +	if (!sysctl_hung_task_timeout_secs)
> > +		return;
> > +
> 
> This function gets called only when t->xxx == 0,
> so the if below doesn't mean much, does it? :)
> 
> > +	if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs))
> > +		return;
> .......
>
> > @@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
> >  	if ((tainted & TAINT_DIE) || did_panic)
> >  		return;
> > 
> > -	read_lock(&tasklist_lock);
> > +	rcu_read_lock();
> >  	do_each_thread(g, t) {
> >  		if (!--max_count)
> >  			goto unlock;
> >  		if (t->state & TASK_UNINTERRUPTIBLE)
> >  			check_hung_task(t, now);
> > +		if (!t->xxx)
> > +			check_running_task(t, jiff);

Of course, the check above should be

		if (t->xxx)
			check_running_task(t, jiff);

Thanks!

>From another message,
>
> Me too. With your patch applied there were quite a few tasks in the
> running state which didn't get the cpu for more than 120 seconds.

(I assume you fixed the patch before using it ;)

Just to be sure, there were no "bad ->cpu..." messages, yes?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07  2:54           ` Oleg Nesterov
@ 2008-03-07  9:10             ` Gautham R Shenoy
  2008-03-07 10:51               ` Gautham R Shenoy
  0 siblings, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-07  9:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On Fri, Mar 07, 2008 at 05:54:51AM +0300, Oleg Nesterov wrote:
> On 03/06, Gautham R Shenoy wrote:
> >
> > On Tue, Mar 04, 2008 at 06:01:07PM +0300, Oleg Nesterov wrote:
> > > +static void check_running_task(struct task_struct *t, unsigned long now)
> > > +{
> > > +	if (!sysctl_hung_task_timeout_secs)
> > > +		return;
> > > +
> > 
> > This function gets called only when t->xxx == 0,
> > so the if below doesn't mean much, does it? :)
> > 
> > > +	if (time_before(now, t->xxx + HZ * sysctl_hung_task_timeout_secs))
> > > +		return;
> > .......
> >
> > > @@ -192,15 +214,17 @@ static void check_hung_uninterruptible_t
> > >  	if ((tainted & TAINT_DIE) || did_panic)
> > >  		return;
> > > 
> > > -	read_lock(&tasklist_lock);
> > > +	rcu_read_lock();
> > >  	do_each_thread(g, t) {
> > >  		if (!--max_count)
> > >  			goto unlock;
> > >  		if (t->state & TASK_UNINTERRUPTIBLE)
> > >  			check_hung_task(t, now);
> > > +		if (!t->xxx)
> > > +			check_running_task(t, jiff);
> 
> Of course, the check above should be
> 
> 		if (t->xxx)
> 			check_running_task(t, jiff);
> 
> Thanks!
> 
> From another message,
> >
> > Me too. With your patch applied there were quite a few tasks in the
> > running state which didn't get the cpu for more than 120 seconds.
> 
> (I assume you fixed the patch before using it ;)
No! On the contrary, I fixed the patch because I found this behaviour a bit
odd. Couldn't run the tests again as it was a tad late.

> 
> Just to be sure, there were no "bad ->cpu..." messages, yes?

Hopefully should be able to catch them now. If yes, it's a problem in
the way we do migration after cpu-hotplug as Yi suggested in an earlier
mail.

http://lkml.org/lkml/2008/3/6/437

This mail from akpm says the same thing.

> 
> Oleg.

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07  9:10             ` Gautham R Shenoy
@ 2008-03-07 10:51               ` Gautham R Shenoy
  2008-03-06 23:20                 ` Yi Yang
  2008-03-07 13:02                 ` Dmitry Adamushko
  0 siblings, 2 replies; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-07 10:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Yi Yang, Ingo Molnar, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

On Fri, Mar 07, 2008 at 02:40:49PM +0530, Gautham R Shenoy wrote:
> 
> > 
> > Just to be sure, there were no "bad ->cpu..." messages, yes?
> 
> Hopefully should be able to catch them now. If yes, it's a problem in
> the way we do migration after cpu-hotplug as Yi suggested in an earlier
> mail.
> 
> http://lkml.org/lkml/2008/3/6/437
> 
> This mail from akpm says the same thing.

Yup! There are quite a few "bad ->cpu" messages.
All of them only for watchdog/1.

What surprises me is the fact that the first of the task-hung messages comes
after 136 successful cpu-hotplug attempts.

To answer Andrew's question, migration_call() *should* ideally be the
first notifier called, because it's registered with the priority 10.

Unless some change has recently gone into the notifier_call_chain
subsystem that no longer honours the registration priority,
migration_call() will be called first, which means the task affinity for
the watchdog thread is broken and hence we shouldn't see the message
below.

I'll probe to see if there's a change in the call order that's causing
this hang.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: task watchdog/1:10958 can't get CPU for more than 120 seconds.
bad ->cpu
       f524ffac 00000046 c011befc f5276aa0 f5276bd8 c443db80 00000001 f6130b80 
       f524ff94 00000246 c07130f4 03d28000 00000000 f524ff9c c01438ad f524ffac 
       00000001 c0143a1f 00000000 f524ffd0 c0143a66 00000001 c0119fea 00000000 
Call Trace:
 [<c011befc>] ? cpu_clock+0x4e/0x59
 [<c01438ad>] ? get_timestamp+0x8/0x11
 [<c0143a1f>] ? watchdog+0x0/0x239
 [<c0143a66>] watchdog+0x47/0x239
 [<c0119fea>] ? complete+0x34/0x3e
 [<c0143a1f>] ? watchdog+0x0/0x239
 [<c012f797>] kthread+0x3b/0x64
 [<c012f75c>] ? kthread+0x0/0x64
 [<c01056eb>] kernel_thread_helper+0x7/0x10
 =======================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



>
> >
> > Oleg.
> 
> -- 
> Thanks and Regards
> gautham

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 10:51               ` Gautham R Shenoy
  2008-03-06 23:20                 ` Yi Yang
@ 2008-03-07 13:02                 ` Dmitry Adamushko
  2008-03-07 13:55                   ` Gautham R Shenoy
  2008-03-07 20:18                   ` [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked " Andrew Morton
  1 sibling, 2 replies; 54+ messages in thread
From: Dmitry Adamushko @ 2008-03-07 13:02 UTC (permalink / raw)
  To: ego, Ingo Molnar
  Cc: Oleg Nesterov, Yi Yang, akpm, linux-kernel, Rafael J. Wysocki,
	Thomas Gleixner

Hi,

'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
move RT tasks between cpus anymore and there is a special mechanism in
sched_rt.c instead (I think, it's .25 material).

So I wonder, whether __migrate_task() is still capable of properly
moving a RT task to another CPU (e.g. for the case when it's in
TASK_RUNNING state) without breaking something in the rt migration
mechanism (or whatever else) that would leave us with a runqueue in
the 'inconsistent' state...
(I've taken a quick look at the relevant code so can't confirm it yet)

maybe it'd be faster if somebody could do a quick test now with the
following line commented out in kernel/softlockup.c :: watchdog()

-         sched_setscheduler(current, SCHED_FIFO, &param);


-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 13:02                 ` Dmitry Adamushko
@ 2008-03-07 13:55                   ` Gautham R Shenoy
  2008-03-07 15:50                     ` Gautham R Shenoy
  2008-03-07 20:18                   ` [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked " Andrew Morton
  1 sibling, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-07 13:55 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Ingo Molnar, Oleg Nesterov, Yi Yang, akpm, linux-kernel,
	Rafael J. Wysocki, Thomas Gleixner

On Fri, Mar 07, 2008 at 02:02:20PM +0100, Dmitry Adamushko wrote:
> Hi,
> 
> 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> move RT tasks between cpus anymore and there is a special mechanism in
> sched_rt.c instead (I think, it's .25 material).
> 
> So I wonder, whether __migrate_task() is still capable of properly
> moving a RT task to another CPU (e.g. for the case when it's in
> TASK_RUNNING state) without breaking something in the rt migration
> mechanism (or whatever else) that would leave us with a runqueue in
> the 'inconsistent' state...
> (I've taken a quick look at the relevant code so can't confirm it yet)
> 
> maybe it'd be faster if somebody could do a quick test now with the
> following line commented out in kernel/softlockup.c :: watchdog()
> 
> -         sched_setscheduler(current, SCHED_FIFO, &param);

Commenting out that line seems to work. Passed 500 iterations of
cpu-hotplug without any problems.

> 
> 
> -- 
> Best regards,
> Dmitry Adamushko

-- 
Thanks and Regards
gautham

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 13:55                   ` Gautham R Shenoy
@ 2008-03-07 15:50                     ` Gautham R Shenoy
  2008-03-07 19:14                       ` [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked " Suresh Siddha
  0 siblings, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-07 15:50 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Ingo Molnar, Oleg Nesterov, Yi Yang, akpm, linux-kernel,
	Rafael J. Wysocki, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 4897 bytes --]

On Fri, Mar 07, 2008 at 07:25:37PM +0530, Gautham R Shenoy wrote:
> On Fri, Mar 07, 2008 at 02:02:20PM +0100, Dmitry Adamushko wrote:
> > Hi,
> > 
> > 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> > move RT tasks between cpus anymore and there is a special mechanism in
> > sched_rt.c instead (I think, it's .25 material).
> > 
> > So I wonder, whether __migrate_task() is still capable of properly
> > moving a RT task to another CPU (e.g. for the case when it's in
> > TASK_RUNNING state) without breaking something in the rt migration
> > mechanism (or whatever else) that would leave us with a runqueue in
> > the 'inconsistent' state...
> > (I've taken a quick look at the relevant code so can't confirm it yet)
> > 
> > maybe it'd be faster if somebody could do a quick test now with the
> > following line commented out in kernel/softlockup.c :: watchdog()
> > 
> > -         sched_setscheduler(current, SCHED_FIFO, &param);
> 
> Commenting out that line seems to work. Passed 500 iterations of
> cpu-hotplug without any problems.

This seems to unearth another problem. After some 850 successful
cpu-hotplug iterations I got this message.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------[ cut here ]------------
BUG: spinlock recursion on CPU#2, kstopmachine/32521
BUG: spinlock lockup on CPU#1, kstopmachine/32517, cc43db80
Pid: 32517, comm: kstopmachine Not tainted 2.6.25-rc3 #44
 [<c0284ebb>] <0>BUG: spinlock lockup on CPU#0, kstopmachine/32520, cc43db80
_raw_spin_lock+0xd5/0xf9
 [<c04e1a33>] <0>BUG: spinlock lockup on CPU#3, kstopmachine/32522, cc43db80
_spin_lock+0x20/0x28
Pid: 32522, comm: kstopmachine Not tainted 2.6.25-rc3 #44
 [<c04dfa41>] ?  [<c0284ebb>] schedule+0xb0/0x5ab
 [<c04dfa41>] schedule+0xb0/0x5ab
_raw_spin_lock+0xd5/0xf9
 [<c04e1d58>] ?  [<c04e1a33>] _spin_unlock_irqrestore+0x36/0x3c
 [<c0119fdc>] ? _spin_lock+0x20/0x28
 [<c0119806>] ? complete+0x34/0x3e
 [<c0143646>] double_lock_balance+0x3a/0x57
 [<c0119806>] do_stop+0xd4/0xfe
double_lock_balance+0x3a/0x57
 [<c0143572>] ?  [<c0119af9>] do_stop+0x0/0xfe
pull_rt_task+0x79/0x164
 [<c012f77f>]  [<c011ca9d>] kthread+0x3b/0x64
pre_schedule_rt+0x18/0x1a
 [<c012f744>] ?  [<c04dfaa4>] kthread+0x0/0x64
schedule+0x113/0x5ab
 [<c01056eb>]  [<c04e1923>] ? kernel_thread_helper+0x7/0x10
_write_unlock_irq+0x22/0x26
 =======================
 [<c0139dec>] ? trace_hardirqs_on+0xe9/0x111
 [<c012262a>] do_exit+0x5a6/0x5aa
 [<c013007b>] ? sample_to_timespec+0x16/0x35
 [<c0143670>] ? stopmachine+0x0/0x98
 [<c01056f1>] kernel_thread_helper+0xd/0x10
 =======================
BUG: spinlock lockup on CPU#2, kstopmachine/32521, c065eb00
Pid: 32521, comm: kstopmachine Not tainted 2.6.25-rc3 #44
 [<c0284ebb>] _raw_spin_lock+0xd5/0xf9
 [<c04e1cb5>] _spin_lock_irqsave+0x29/0x32
 [<c011a03f>] ? __wake_up+0x15/0x3b
 [<c011a03f>] __wake_up+0x15/0x3b
 [<c011fa68>] wake_up_klogd+0x2e/0x31
 [<c011fc1e>] release_console_sem+0x1b3/0x1bb
 [<c012002c>] vprintk+0x2a8/0x301
 [<c013ac74>] ? __lock_acquire+0xaae/0xaf6
 [<c012009a>] printk+0x15/0x17
 [<c0284c6f>] spin_bug+0x47/0xbd
 [<c0284e1b>] _raw_spin_lock+0x35/0xf9
 [<c04e1a33>] _spin_lock+0x20/0x28
 [<c011a09b>] ? task_rq_lock+0x36/0x5d
 [<c011a09b>] task_rq_lock+0x36/0x5d
 [<c011a338>] try_to_wake_up+0x19/0xd1
 [<c011a3fb>] default_wake_function+0xb/0xd
 [<c012f850>] autoremove_wake_function+0xf/0x33
 [<c0117ec9>] __wake_up_common+0x2f/0x5a
 [<c011a052>] __wake_up+0x28/0x3b
 [<c011fa68>] wake_up_klogd+0x2e/0x31
 [<c011fc1e>] release_console_sem+0x1b3/0x1bb
 [<c012002c>] vprintk+0x2a8/0x301
 [<c04e196c>] ? _read_unlock+0x1d/0x20
 [<c0441a43>] ? sock_def_readable+0x61/0x69
 [<c048dde8>] ? unix_dgram_sendmsg+0x399/0x410
 [<c012009a>] printk+0x15/0x17
 [<c011f40d>] warn_on_slowpath+0x2a/0x51
 [<c0130066>] ? sample_to_timespec+0x1/0x35
 [<c013ac74>] ? __lock_acquire+0xaae/0xaf6
 [<c011a446>] ? resched_cpu+0x2c/0x6f
 [<c0111691>] native_smp_send_reschedule+0x22/0x3f
 [<c01186c7>] __resched_task+0x5f/0x63
 [<c011a479>] resched_cpu+0x5f/0x6f
 [<c011bb6d>] scheduler_tick+0x214/0x28f
 [<c012737b>] update_process_times+0x3d/0x49
 [<c013751d>] tick_sched_timer+0x6e/0xa6
 [<c01374af>] ? tick_sched_timer+0x0/0xa6
 [<c0131e8c>] __run_hrtimer+0x39/0x70
 [<c0132668>] hrtimer_interrupt+0xeb/0x154
 [<c01127bc>] smp_apic_timer_interrupt+0x6c/0x80
 [<c0143670>] ? stopmachine+0x0/0x98
 [<c0105553>] apic_timer_interrupt+0x33/0x38
 [<c0143670>] ? stopmachine+0x0/0x98
 [<c013007b>] ? sample_to_timespec+0x16/0x35
 [<c0143701>] ? stopmachine+0x91/0x98
 [<c01056eb>] kernel_thread_helper+0x7/0x10
 =======================

BUG: spinlock lockup on CPU#0, kstopmachine/32520, c065eb00

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.config attached.


> 
> > 
> > 
> > -- 
> > Best regards,
> > Dmitry Adamushko
> 
> -- 
> Thanks and Regards
> gautham

-- 
Thanks and Regards
gautham

[-- Attachment #2: config.gz --]
[-- Type: application/octet-stream, Size: 9691 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 15:50                     ` Gautham R Shenoy
@ 2008-03-07 19:14                       ` Suresh Siddha
  0 siblings, 0 replies; 54+ messages in thread
From: Suresh Siddha @ 2008-03-07 19:14 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Dmitry Adamushko, Ingo Molnar, Oleg Nesterov, Yang, Yi Y, akpm,
	linux-kernel, Rafael J. Wysocki, Thomas Gleixner

On Fri, Mar 07, 2008 at 07:50:02AM -0800, Gautham R Shenoy wrote:
> On Fri, Mar 07, 2008 at 07:25:37PM +0530, Gautham R Shenoy wrote:
> > On Fri, Mar 07, 2008 at 02:02:20PM +0100, Dmitry Adamushko wrote:
> > > Hi,
> > >
> > > 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> > > move RT tasks between cpus anymore and there is a special mechanism in
> > > sched_rt.c instead (I think, it's .25 material).
> > >
> > > So I wonder, whether __migrate_task() is still capable of properly
> > > moving a RT task to another CPU (e.g. for the case when it's in
> > > TASK_RUNNING state) without breaking something in the rt migration
> > > mechanism (or whatever else) that would leave us with a runqueue in
> > > the 'inconsistent' state...
> > > (I've taken a quick look at the relevant code so can't confirm it yet)
> > >
> > > maybe it'd be faster if somebody could do a quick test now with the
> > > following line commented out in kernel/softlockup.c :: watchdog()
> > >
> > > -         sched_setscheduler(current, SCHED_FIFO, &param);
> >
> > Commenting out that line seems to work. Passed 500 iterations of
> > cpu-hotplug without any problems.
> 
> This seems to unearth another problem. After some 850 successful
> cpu-hotplug iterations I got this message.
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ------------[ cut here ]------------
> BUG: spinlock recursion on CPU#2, kstopmachine/32521
> BUG: spinlock lockup on CPU#1, kstopmachine/32517, cc43db80
> Pid: 32517, comm: kstopmachine Not tainted 2.6.25-rc3 #44
>  [<c0284ebb>] <0>BUG: spinlock lockup on CPU#0, kstopmachine/32520,
> cc43db80
> _raw_spin_lock+0xd5/0xf9
>  [<c04e1a33>] <0>BUG: spinlock lockup on CPU#3, kstopmachine/32522,
> cc43db80
> _spin_lock+0x20/0x28
> Pid: 32522, comm: kstopmachine Not tainted 2.6.25-rc3 #44
>  [<c04dfa41>] ?  [<c0284ebb>] schedule+0xb0/0x5ab
>  [<c04dfa41>] schedule+0xb0/0x5ab
> _raw_spin_lock+0xd5/0xf9
>  [<c04e1d58>] ?  [<c04e1a33>] _spin_unlock_irqrestore+0x36/0x3c
>  [<c0119fdc>] ? _spin_lock+0x20/0x28
>  [<c0119806>] ? complete+0x34/0x3e
>  [<c0143646>] double_lock_balance+0x3a/0x57
>  [<c0119806>] do_stop+0xd4/0xfe

Well, there is another

	sched_setscheduler(p, SCHED_FIFO, &param);

in kernel/stop_machine.c

Perhaps we need to remove this as well and try?

thanks,
suresh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 13:02                 ` Dmitry Adamushko
  2008-03-07 13:55                   ` Gautham R Shenoy
@ 2008-03-07 20:18                   ` Andrew Morton
  2008-03-07 21:36                     ` Rafael J. Wysocki
  1 sibling, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2008-03-07 20:18 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: ego, mingo, oleg, yi.y.yang, linux-kernel, rjw, tglx

On Fri, 7 Mar 2008 14:02:20 +0100
"Dmitry Adamushko" <dmitry.adamushko@gmail.com> wrote:

> Hi,
> 
> 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> move RT tasks between cpus anymore and there is a special mechanism in
> sched_rt.c instead (I think, it's .25 material).
> 
> So I wonder, whether __migrate_task() is still capable of properly
> moving a RT task to another CPU (e.g. for the case when it's in
> TASK_RUNNING state) without breaking something in the rt migration
> mechanism (or whatever else) that would leave us with a runqueue in
> the 'inconsistent' state...
> (I've taken a quick look at the relevant code so can't confirm it yet)
> 
> maybe it'd be faster if somebody could do a quick test now with the
> following line commented out in kernel/softlockup.c :: watchdog()
> 
> -         sched_setscheduler(current, SCHED_FIFO, &param);
> 

Yup, thanks.  This:


 kernel/softirq.c      |    2 +-
 kernel/softlockup.c   |    2 +-
 kernel/stop_machine.c |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff -puN kernel/softlockup.c~a kernel/softlockup.c
--- a/kernel/softlockup.c~a
+++ a/kernel/softlockup.c
@@ -211,7 +211,7 @@ static int watchdog(void *__bind_cpu)
 	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 	int this_cpu = (long)__bind_cpu;
 
-	sched_setscheduler(current, SCHED_FIFO, &param);
+//	sched_setscheduler(current, SCHED_FIFO, &param);
 
 	/* initialize timestamp */
 	touch_softlockup_watchdog();
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -188,7 +188,7 @@ struct task_struct *__stop_machine_run(i
 		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
 
 		/* One high-prio thread per cpu.  We'll do this one. */
-		sched_setscheduler(p, SCHED_FIFO, &param);
+//		sched_setscheduler(p, SCHED_FIFO, &param);
 		kthread_bind(p, cpu);
 		wake_up_process(p);
 		wait_for_completion(&smdata.done);
diff -puN kernel/softirq.c~a kernel/softirq.c
--- a/kernel/softirq.c~a
+++ a/kernel/softirq.c
@@ -622,7 +622,7 @@ static int __cpuinit cpu_callback(struct
 
 		p = per_cpu(ksoftirqd, hotcpu);
 		per_cpu(ksoftirqd, hotcpu) = NULL;
-		sched_setscheduler(p, SCHED_FIFO, &param);
+//		sched_setscheduler(p, SCHED_FIFO, &param);
 		kthread_stop(p);
 		takeover_tasklets(hotcpu);
 		break;
_

fixes the wont-power-off regression.

But 2.6.24 runs the watchdog threads SCHED_FIFO too.  Are you saying that
it's the migration code which changed? 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 20:18                   ` [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked " Andrew Morton
@ 2008-03-07 21:36                     ` Rafael J. Wysocki
  2008-03-07 23:01                       ` Suresh Siddha
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2008-03-07 21:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dmitry Adamushko, ego, mingo, oleg, yi.y.yang, linux-kernel, tglx

On Friday, 7 of March 2008, Andrew Morton wrote:
> On Fri, 7 Mar 2008 14:02:20 +0100
> "Dmitry Adamushko" <dmitry.adamushko@gmail.com> wrote:
> 
> > Hi,
> > 
> > 'watchdog' is of SCHED_FIFO class. The standard load-balancer doesn't
> > move RT tasks between cpus anymore and there is a special mechanism in
> > sched_rt.c instead (I think, it's .25 material).
> > 
> > So I wonder, whether __migrate_task() is still capable of properly
> > moving a RT task to another CPU (e.g. for the case when it's in
> > TASK_RUNNING state) without breaking something in the rt migration
> > mechanism (or whatever else) that would leave us with a runqueue in
> > the 'inconsistent' state...
> > (I've taken a quick look at the relevant code so can't confirm it yet)
> > 
> > maybe it'd be faster if somebody could do a quick test now with the
> > following line commented out in kernel/softlockup.c :: watchdog()
> > 
> > -         sched_setscheduler(current, SCHED_FIFO, &param);
> > 
> 
> Yup, thanks.  This:
> 
> 
>  kernel/softirq.c      |    2 +-
>  kernel/softlockup.c   |    2 +-
>  kernel/stop_machine.c |    2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff -puN kernel/softlockup.c~a kernel/softlockup.c
> --- a/kernel/softlockup.c~a
> +++ a/kernel/softlockup.c
> @@ -211,7 +211,7 @@ static int watchdog(void *__bind_cpu)
>  	struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
>  	int this_cpu = (long)__bind_cpu;
>  
> -	sched_setscheduler(current, SCHED_FIFO, &param);
> +//	sched_setscheduler(current, SCHED_FIFO, &param);
>  
>  	/* initialize timestamp */
>  	touch_softlockup_watchdog();
> diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
> --- a/kernel/stop_machine.c~a
> +++ a/kernel/stop_machine.c
> @@ -188,7 +188,7 @@ struct task_struct *__stop_machine_run(i
>  		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
>  
>  		/* One high-prio thread per cpu.  We'll do this one. */
> -		sched_setscheduler(p, SCHED_FIFO, &param);
> +//		sched_setscheduler(p, SCHED_FIFO, &param);
>  		kthread_bind(p, cpu);
>  		wake_up_process(p);
>  		wait_for_completion(&smdata.done);
> diff -puN kernel/softirq.c~a kernel/softirq.c
> --- a/kernel/softirq.c~a
> +++ a/kernel/softirq.c
> @@ -622,7 +622,7 @@ static int __cpuinit cpu_callback(struct
>  
>  		p = per_cpu(ksoftirqd, hotcpu);
>  		per_cpu(ksoftirqd, hotcpu) = NULL;
> -		sched_setscheduler(p, SCHED_FIFO, &param);
> +//		sched_setscheduler(p, SCHED_FIFO, &param);
>  		kthread_stop(p);
>  		takeover_tasklets(hotcpu);
>  		break;
> _
> 
> fixes the wont-power-off regression.
> 
> But 2.6.24 runs the watchdog threads SCHED_FIFO too.  Are you saying that
> it's the migration code which changed? 

Well, that would be a problem for suspend/hibernation if there were SCHED_FIFO
non-freezable tasks bound to the nonboot CPUs.  I'm not aware of any, but ...

Also, it should be possible to just offline one or more CPUs using the sysfs
interface at any time; what happens if there are any RT tasks bound to those
CPUs at that time?  That would be the $subject problem, IMHO.

Whoever made the change that left __migrate_task() unable to migrate RT
tasks should also have fixed CPU hotplug.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 21:36                     ` Rafael J. Wysocki
@ 2008-03-07 23:01                       ` Suresh Siddha
  2008-03-07 23:29                         ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Suresh Siddha @ 2008-03-07 23:01 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Dmitry Adamushko, ego, mingo, oleg, yi.y.yang,
	linux-kernel, tglx

On Fri, Mar 07, 2008 at 10:36:25PM +0100, Rafael J. Wysocki wrote:
> On Friday, 7 of March 2008, Andrew Morton wrote:
> > -		sched_setscheduler(p, SCHED_FIFO, &param);
> > +//		sched_setscheduler(p, SCHED_FIFO, &param);
> >  		kthread_stop(p);
> >  		takeover_tasklets(hotcpu);
> >  		break;
> > _
> > 
> > fixes the wont-power-off regression.
> > 
> > But 2.6.24 runs the watchdog threads SCHED_FIFO too.  Are you saying that
> > it's the migration code which changed? 
> 
> Well, that would be a problem for suspend/hibernation if there were SCHED_FIFO
> non-freezable tasks bound to the nonboot CPUs.  I'm not aware of any, but ...
> 
> Also, it should be possible to just offline one or more CPUs using the sysfs
> interface at any time and what happens if there are any RT tasks bound to these
> CPUs at that time?  That would be the $subject problem, IMHO.
> 
> Anyone who made the change affecting __migrate_task() so it's unable to migrate
> RT tasks any more should have fixed the CPU hotplug as well.

No, it's not an issue with __migrate_task(). The appended patch fixes my
issue. Recent RT wakeup-balance code changes exposed a bug in the
migration_call() code path.

Andrew, please check whether the appended patch fixes your power-off problem as well.

thanks,
suresh
---

Handle the `CPU_DOWN_PREPARE_FROZEN' case in migration_call().

Without this, we don't clear the CPU going down from the root domain's
"online" mask. This was causing RT tasks to be woken up on already-dead
CPUs, leading to system hangs during standby, shutdown, etc.

For example, on my system, this is the failing sequence:

kthread_stop() // coming from the cpu_callback's
    wake_up_process()
	sched_class->select_task_rq(); //select_task_rq_rt
	    find_lowest_rq
		find_lowest_cpus
	    	    cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);

In my case task->cpus_allowed is set to cpu_possible_map, and because of
this bug rd->online still shows the dead CPU. As a result, find_lowest_rq()
returns an offlined CPU, the RT task gets woken up on a DEAD CPU, and
various hangs follow.

This issue doesn't happen with normal tasks because select_task_rq_fair()
chooses between only two CPUs (the CPU waking up the task and the task's
last-run CPU, task_cpu()); the kernel hotplug code makes sure both of
these always represent online CPUs.

Why didn't it show up in 2.6.24? Because the new wakeup code uses more
complex CPU-selection logic in select_task_rq_rt(), which exposed this
bug in migration_call().

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..60550d8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 
 	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
 		/* Update our root-domain */
 		rq = cpu_rq(cpu);
 		spin_lock_irqsave(&rq->lock, flags);

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 23:01                       ` Suresh Siddha
@ 2008-03-07 23:29                         ` Andrew Morton
  2008-03-07 23:43                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2008-03-07 23:29 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: rjw, dmitry.adamushko, ego, mingo, oleg, yi.y.yang, linux-kernel, tglx

On Fri, 7 Mar 2008 15:01:26 -0800
Suresh Siddha <suresh.b.siddha@intel.com> wrote:

> 
> Andrew, Please check if the appended patch fixes your power-off problem aswell.
> ...
>
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  		break;
>  
>  	case CPU_DOWN_PREPARE:
> +	case CPU_DOWN_PREPARE_FROZEN:
>  		/* Update our root-domain */
>  		rq = cpu_rq(cpu);
>  		spin_lock_irqsave(&rq->lock, flags);

No, it does not.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 23:29                         ` Andrew Morton
@ 2008-03-07 23:43                           ` Rafael J. Wysocki
  2008-03-08  1:50                             ` Suresh Siddha
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2008-03-07 23:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Suresh Siddha, dmitry.adamushko, ego, mingo, oleg, yi.y.yang,
	linux-kernel, tglx

On Saturday, 8 of March 2008, Andrew Morton wrote:
> On Fri, 7 Mar 2008 15:01:26 -0800
> Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> 
> > 
> > Andrew, Please check if the appended patch fixes your power-off problem aswell.
> > ...
> >
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> >  		break;
> >  
> >  	case CPU_DOWN_PREPARE:
> > +	case CPU_DOWN_PREPARE_FROZEN:
> >  		/* Update our root-domain */
> >  		rq = cpu_rq(cpu);
> >  		spin_lock_irqsave(&rq->lock, flags);
> 
> No, it does not.

Well, this is a bug nevertheless.

In fact, the frozen bit should only be set during hibernation/suspend and not
during shutdown ...

So, it looks like disable_nonboot_cpus() needs an extra parameter.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-07 23:43                           ` Rafael J. Wysocki
@ 2008-03-08  1:50                             ` Suresh Siddha
  2008-03-08  2:09                               ` Andrew Morton
  2008-03-08  5:10                               ` [PATCH] adjust root-domain->online span in response to hotplug event Gregory Haskins
  0 siblings, 2 replies; 54+ messages in thread
From: Suresh Siddha @ 2008-03-08  1:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Andrew Morton, Suresh Siddha, dmitry.adamushko, ego, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Sat, Mar 08, 2008 at 12:43:15AM +0100, Rafael J. Wysocki wrote:
> On Saturday, 8 of March 2008, Andrew Morton wrote:
> > On Fri, 7 Mar 2008 15:01:26 -0800
> > Suresh Siddha <suresh.b.siddha@intel.com> wrote:
> > 
> > > 
> > > Andrew, Please check if the appended patch fixes your power-off problem aswell.
> > > ...
> > >
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > >  		break;
> > >  
> > >  	case CPU_DOWN_PREPARE:
> > > +	case CPU_DOWN_PREPARE_FROZEN:
> > >  		/* Update our root-domain */
> > >  		rq = cpu_rq(cpu);
> > >  		spin_lock_irqsave(&rq->lock, flags);
> > 
> > No, it does not.
> 
> Well, this is a bug nevertheless.
> 

Well, my previous root-cause analysis needs some small corrections.

During the notifier call chain for CPU_DOWN (at least until
'update_sched_domains' is called), all the CPUs are attached to
'def_root_domain', whose online mask still contains the offline CPU.

This is because, during CPU_DOWN_PREPARE, migration_call() first clears
the CPU from root_domain->online, and later in the same DOWN_PREPARE call
chain detach_destroy_domains() attaches the CPUs to def_root_domain based
on cpu_online_map (which still has the about-to-die CPU set).

So essentially, during the notifier call chain of CPU_DOWN (at least
before 'update_sched_domains' is called), anyone doing RT process wakeups
(for example kthread_stop()) can still end up waking a task on the dead
CPU.

Andrew, can you please try one more patch (appended) to see if it helps?

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 0a6d2e5..8cb707c 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -597,7 +597,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 	int       count       = 0;
 	int       cpu;
 
-	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
+	cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
 
 	/*
 	 * Scan each rq for the lowest prio.

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline
  2008-03-08  1:50                             ` Suresh Siddha
@ 2008-03-08  2:09                               ` Andrew Morton
  2008-03-08  5:10                               ` [PATCH] adjust root-domain->online span in response to hotplug event Gregory Haskins
  1 sibling, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2008-03-08  2:09 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Rafael J. Wysocki, dmitry.adamushko, ego, mingo, oleg, yi.y.yang,
	linux-kernel, tglx

On Fri, 7 Mar 2008 17:50:45 -0800 Suresh Siddha <suresh.b.siddha@intel.com> wrote:

> Andrew, Can you please try one more patch(appended) to see if it helps?
> 

Not until Monday, sorry - the offending machine is at work and (due to
this bug) cannot be woken via LAN.

> ---
> 
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 0a6d2e5..8cb707c 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -597,7 +597,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
>  	int       count       = 0;
>  	int       cpu;
>  
> -	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
> +	cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH] adjust root-domain->online span in response to hotplug event
  2008-03-08  1:50                             ` Suresh Siddha
  2008-03-08  2:09                               ` Andrew Morton
@ 2008-03-08  5:10                               ` Gregory Haskins
  2008-03-08  8:41                                 ` Ingo Molnar
                                                   ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Gregory Haskins @ 2008-03-08  5:10 UTC (permalink / raw)
  To: suresh.b.siddha
  Cc: rjw, akpm, dmitry.adamushko, ego, mingo, oleg, yi.y.yang,
	linux-kernel, tglx, ghaskins

Suresh Siddha wrote:
 > On Sat, Mar 08, 2008 at 12:43:15AM +0100, Rafael J. Wysocki wrote:
 >> On Saturday, 8 of March 2008, Andrew Morton wrote:
 >>> On Fri, 7 Mar 2008 15:01:26 -0800
 >>> Suresh Siddha <suresh.b.siddha@intel.com> wrote:
 >>>
 >>>> Andrew, Please check if the appended patch fixes your power-off problem aswell.
 >>>> ...
 >>>>
 >>>> --- a/kernel/sched.c
 >>>> +++ b/kernel/sched.c
 >>>> @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 >>>>  		break;
 >>>>
 >>>>  	case CPU_DOWN_PREPARE:
 >>>> +	case CPU_DOWN_PREPARE_FROZEN:
 >>>>  		/* Update our root-domain */
 >>>>  		rq = cpu_rq(cpu);
 >>>>  		spin_lock_irqsave(&rq->lock, flags);
 >>> No, it does not.
 >> Well, this is a bug nevertheless.
 >>
 >
 > Well, my previous root cause needs some small changes.
 >
 > During the notifier call chain for CPU_DOWN(till 'update_sched_domains'
 > is called atleast), all the cpu's are attached to 'def_root_domain', for whom
 > online mask still has the offline cpu.
 >
 > This is because, during CPU_DOWN_PREPARE, migration_call() first clears
 > the root_domain->online, and later during the DOWN_PREPARE call chain
 > detach_destroy_domains() attach to def_root_domain with cpu_online_map(which
 > still has the just about to die 'cpu' set).
 >
 > So essentially, during the notifier call chain of CPU_DOWN (before
 > 'update_sched_domains' is called atleast), any one doing RT process
 > wakeup's (for example: kthread_stop()) can still end up on the dead cpu.
 >
 > Andrew, Can you please try one more patch(appended) to see if it helps?
 >
 > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
 > ---
 >
 > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
 > index 0a6d2e5..8cb707c 100644
 > --- a/kernel/sched_rt.c
 > +++ b/kernel/sched_rt.c
 > @@ -597,7 +597,7 @@ static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask)
 >  	int       count       = 0;
 >  	int       cpu;
 >
 > -	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
 > +	cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
 >
 >  	/*

Hi Suresh,
   Unfortunately, this patch will introduce its own set of bugs. 
However, your analysis was spot-on.  I think I see the problem now.  It 
was introduced when I put a hack in to "fix" s2ram problems in -mm as a 
result of the new root-domain logic.  I think the following patch will 
fix both issues:

(I verified that I could take a cpu offline/online, but I don't have an
s2ram compatible machine handy.  Andrew, I believe you could reproduce
the s2ram problem a few months ago when that issue popped up.  So if you
could, please verify that s2ram also works with this patch applied, in
addition to the hotplug problem.)

Regards,
-Greg

-------------------------------------------------

adjust root-domain->online span in response to hotplug event

We currently set the root-domain online span automatically when the domain
is added to the cpu if the cpu is already a member of cpu_online_map.
This was done as a hack/bug-fix for s2ram, but it also causes a problem
with hotplug CPU_DOWN transitioning.  The right way to fix the original
problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
which is already too late.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |   18 +++++++-----------
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..b02e4fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5813,6 +5813,13 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		/* Must be high prio: stop_machine expects to yield to it. */
 		rq = task_rq_lock(p, &flags);
 		__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
+
+		/* Update our root-domain */
+		if (rq->rd) {
+			BUG_ON(!cpu_isset(cpu, rq->rd->span));
+			cpu_set(cpu, rq->rd->online);
+		}
+
 		task_rq_unlock(rq, &flags);
 		cpu_rq(cpu)->migration_thread = p;
 		break;
@@ -5821,15 +5828,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	case CPU_ONLINE_FROZEN:
 		/* Strictly unnecessary, as first user will wake it. */
 		wake_up_process(cpu_rq(cpu)->migration_thread);
-
-		/* Update our root-domain */
-		rq = cpu_rq(cpu);
-		spin_lock_irqsave(&rq->lock, flags);
-		if (rq->rd) {
-			BUG_ON(!cpu_isset(cpu, rq->rd->span));
-			cpu_set(cpu, rq->rd->online);
-		}
-		spin_unlock_irqrestore(&rq->lock, flags);
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -6105,8 +6103,6 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	rq->rd = rd;
 
 	cpu_set(rq->cpu, rd->span);
-	if (cpu_isset(rq->cpu, cpu_online_map))
-		cpu_set(rq->cpu, rd->online);
 
 	for (class = sched_class_highest; class; class = class->next) {
 		if (class->join_domain)


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] adjust root-domain->online span in response to hotplug event
  2008-03-08  5:10                               ` [PATCH] adjust root-domain->online span in response to hotplug event Gregory Haskins
@ 2008-03-08  8:41                                 ` Ingo Molnar
  2008-03-08 17:50                                   ` [PATCH] adjust root-domain->online span in response to hotplugevent Gregory Haskins
  2008-03-09  2:35                                 ` [PATCH] adjust root-domain->online span in response to hotplug event Suresh Siddha
  2008-03-10  8:14                                 ` Gautham R Shenoy
  2 siblings, 1 reply; 54+ messages in thread
From: Ingo Molnar @ 2008-03-08  8:41 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, rjw, akpm, dmitry.adamushko, ego, oleg,
	yi.y.yang, linux-kernel, tglx


* Gregory Haskins <ghaskins@novell.com> wrote:

>    Unfortunately, this patch will introduce its own set of bugs. 
> However, your analysis was spot-on.  I think I see the problem now.  
> It was introduced when I put a hack in to "fix" s2ram problems in -mm 
> as a result of the new root-domain logic.  I think the following patch 
> will fix both issues:
> 
> (I verified that I could take a cpu offline/online, but I don't have 
> an s2ram compatible machine handy.  Andrew, I believe you could 
> reproduce the s2ram problem a few months ago when that issue popped 
> up.  So if you could, please verify that s2ram also works with this 
> patch applied, in addition to the hotplug problem.

thanks Gregory, i've queued up your fix. If it passes all tests over the 
day i'll send it to Linus in about 12 hours.

	Ingo

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] adjust root-domain->online span in response to hotplugevent
  2008-03-08  8:41                                 ` Ingo Molnar
@ 2008-03-08 17:50                                   ` Gregory Haskins
  2008-03-09  0:31                                     ` Dmitry Adamushko
  0 siblings, 1 reply; 54+ messages in thread
From: Gregory Haskins @ 2008-03-08 17:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dmitry.adamushko, ego, suresh.b.siddha, yi.y.yang, tglx, akpm,
	rjw, oleg, linux-kernel

>>> On Sat, Mar 8, 2008 at  3:41 AM, in message <20080308084118.GA24552@elte.hu>,
Ingo Molnar <mingo@elte.hu> wrote: 

> * Gregory Haskins <ghaskins@novell.com> wrote:
> 
>>    Unfortunately, this patch will introduce its own set of bugs. 
>> However, your analysis was spot-on.  I think I see the problem now.  
>> It was introduced when I put a hack in to "fix" s2ram problems in -mm 
>> as a result of the new root-domain logic.  I think the following patch 
>> will fix both issues:
>> 
>> (I verified that I could take a cpu offline/online, but I don't have 
>> an s2ram compatible machine handy.  Andrew, I believe you could 
>> reproduce the s2ram problem a few months ago when that issue popped 
>> up.  So if you could, please verify that s2ram also works with this 
>> patch applied, in addition to the hotplug problem.
> 
> thanks Gregory, i've queued up your fix. If it passes all tests over the 
> day i'll send it to Linus in about 12 hours.
> 


After thinking about it some more, I am not sure if I got this fix quite right. The first two hunks are technically fine and should ultimately go in.  The last hunk is questionable.

Ultimately, I believe the root cause of these reported issues is that cpu_online_map and rd->online can get out of sync.  I fear that while the current patch may fix the hotplug/s2ram case, it potentially breaks the disjoint cpuset topology change case.  I will run a few cpuset tests ASAP to confirm, though I may not have a chance until Monday.

What I probably need is to tie the code that sets/clears cpu_online_map to the root-domain rd->online map somehow.  However, a cursory cscope inspection has failed to reveal the location that this cpu_online_map manipulation occurs.  If anyone knows where cpu_online_map is actually updated, please let me know.  Otherwise, let me think about this some more and get back to you.

In the meantime, cpus_and'ing "task->cpus_allowed & cpu_online_map & rd->online" in the place that Suresh identified should technically fix the hotplug issue without adding more bugs, at the expense of requiring two calls to cpus_and().  The expense should be more or less negligible on mainstream kernels, but may hurt our big-iron (e.g. SGI) brethren.  Hopefully I will come up with a real fix shortly, before it becomes a real issue.

Regards,
-Greg

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH] adjust root-domain->online span in response to hotplugevent
  2008-03-08 17:50                                   ` [PATCH] adjust root-domain->online span in response to hotplugevent Gregory Haskins
@ 2008-03-09  0:31                                     ` Dmitry Adamushko
  2008-03-10 14:12                                       ` Gregory Haskins
  0 siblings, 1 reply; 54+ messages in thread
From: Dmitry Adamushko @ 2008-03-09  0:31 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ingo Molnar, ego, suresh.b.siddha, yi.y.yang, tglx, akpm, rjw,
	oleg, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3069 bytes --]

On 08/03/2008, Gregory Haskins <ghaskins@novell.com> wrote:
> >>> On Sat, Mar 8, 2008 at  3:41 AM, in message <20080308084118.GA24552@elte.hu>,
>  Ingo Molnar <mingo@elte.hu> wrote:
>
>  > * Gregory Haskins <ghaskins@novell.com> wrote:
>  >
>  >>    Unfortunately, this patch will introduce its own set of bugs.
>  >> However, your analysis was spot-on.  I think I see the problem now.
>  >> It was introduced when I put a hack in to "fix" s2ram problems in -mm
>  >> as a result of the new root-domain logic.  I think the following patch
>  >> will fix both issues:
>  >>
>  >> (I verified that I could take a cpu offline/online, but I don't have
>  >> an s2ram compatible machine handy.  Andrew, I believe you could
>  >> reproduce the s2ram problem a few months ago when that issue popped
>  >> up.  So if you could, please verify that s2ram also works with this
>  >> patch applied, in addition to the hotplug problem.
>  >
>  > thanks Gregory, i've queued up your fix. If it passes all tests over the
>  > day i'll send it to Linus in about 12 hours.
>  >
>
>
>
> After thinking about it some more, I am not sure if I got this fix quite right. The first two hunks are technically fine and should ultimately go in.  The last hunk is questionable.
>
>  Ultimately, I believe the root cause of these reported issues is that cpu_online_map and rd->online can get out of sync.  I fear that while the current patch may fix the hotplug/s2ram case, it potentially breaks the disjoint cpuset topology change case.  I will run a few cpuset tests ASAP to confirm, though I may not have a chance until Monday.
>
>  What I probably need is to tie the code that sets/clears cpu_online_map to the root-domain rd->online map somehow.  However, a cursory cscope inspection has failed to reveal the location that this cpu_online_map manipulation occurs.  If anyone knows where cpu_online_map is actually updated, please let me know.

I guess it's in arch-specific code, e.g. for x86-64:

down:

kernel/cpu.c :: _cpu_down -> take_cpu_down() -> __cpu_disable() [
arch/x86/kernel/smpboot_64.c ] -> cpu_clear(cpu, cpu_online_map)

up:

kernel/cpu.c :: _cpu_up() -> __cpu_up [ arch/x86/kernel/smpboot_64.c ]
-> do_boot_cpu() -> start_secondary() [ runs on a to-be-online cpu ]
-> cpu_set(smp_processor_id(), cpu_online_map)

there are cpu-notification events available before a cpu gets removed
from cpu_online_map and after it gets added, so I guess (a first guess;
I'll also look at the code) they should be the sync point.

I have a small patch to record some events during cpu_down/up. I have a
trace but will analyze it tomorrow.
[ the patch is enclosed (err.. it also includes Gregory's fix... I
just forgot to commit it previously :-/) ]

E.g. it should print a message if a task is placed on the 'offline'
cpu. Note that the 'cpu' is removed from cpu_online_map before
migrate_live_tasks() runs, so if any task remains on the offline cpu,
it was likely placed there after migrate_live_tasks().


>  Regards,
>  -Greg
>

-- 
Best regards,
Dmitry Adamushko

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: debug-stop_machine-2.patch --]
[-- Type: text/x-patch; name=debug-stop_machine-2.patch, Size: 7008 bytes --]

diff --git a/arch/x86/kernel/smpboot_64.c b/arch/x86/kernel/smpboot_64.c
index 0880f2c..a8332c6 100644
--- a/arch/x86/kernel/smpboot_64.c
+++ b/arch/x86/kernel/smpboot_64.c
@@ -381,6 +381,8 @@ void __cpuinit start_secondary(void)
 
 	setup_secondary_clock();
 
+	printk("*** -> online cpu %i\n", smp_processor_id());
+
 	cpu_idle();
 }
 
@@ -1063,6 +1065,9 @@ int __cpu_disable(void)
 	spin_lock(&vector_lock);
 	/* It's now safe to remove this processor from the online map */
 	cpu_clear(cpu, cpu_online_map);
+
+	printk(KERN_ERR "*** <- down cpu %i\n", cpu);
+
 	spin_unlock(&vector_lock);
 	remove_cpu_from_maps();
 	fixup_irqs(cpu_online_map);
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2eff3f6..d374c27 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -198,6 +198,8 @@ static int take_cpu_down(void *_param)
 	return 0;
 }
 
+extern void wake_up_watchdog(int cpu);
+
 /* Requires cpu_add_remove_lock to be held */
 static int _cpu_down(unsigned int cpu, int tasks_frozen)
 {
@@ -255,6 +257,8 @@ static int _cpu_down(unsigned int cpu, int tasks_frozen)
 	while (!idle_cpu(cpu))
 		yield();
 
+	wake_up_watchdog(cpu);
+
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..6087161 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1787,7 +1787,7 @@ static int sched_balance_self(int cpu, int flag)
  */
 static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
 {
-	int cpu, orig_cpu, this_cpu, success = 0;
+	int cpu = -1, orig_cpu, this_cpu, success = 0, offline = 0;
 	unsigned long flags;
 	long old_state;
 	struct rq *rq;
@@ -1851,6 +1851,9 @@ out_activate:
 		schedstat_inc(p, se.nr_wakeups_local);
 	else
 		schedstat_inc(p, se.nr_wakeups_remote);
+
+	offline = cpu_is_offline(cpu);
+
 	update_rq_clock(rq);
 	activate_task(rq, p, 1);
 	check_preempt_curr(rq, p);
@@ -1865,9 +1868,36 @@ out_running:
 out:
 	task_rq_unlock(rq, &flags);
 
+	if (cpu != -1 && offline)
+		printk(KERN_ERR "!!! try_to_wake_up(%s) on offline cpu %i\n",
+			p->comm, cpu);
+
 	return success;
 }
 
+extern struct task_struct * get_watchdog_task(int cpu);
+
+void wake_up_watchdog(int cpu)
+{
+	unsigned long flags;
+	struct rq *rq;
+	struct task_struct *p = get_watchdog_task(cpu);
+	int ret = 0;
+
+	rq = task_rq_lock(p, &flags);
+
+	if (p->state != TASK_RUNNING && !p->se.on_rq) {
+		ret = 1;
+		update_rq_clock(rq);
+		activate_task(rq, p, 1);
+		p->state = TASK_RUNNING;
+	}
+
+	task_rq_unlock(rq, &flags);
+
+	printk(KERN_ERR "watchdog(cpu: %i) -> wake up (%i)\n", cpu, ret);
+}
+
 int wake_up_process(struct task_struct *p)
 {
 	return try_to_wake_up(p, TASK_ALL, 0);
@@ -5345,7 +5375,7 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed);
 static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 {
 	struct rq *rq_dest, *rq_src;
-	int ret = 0, on_rq;
+	int ret = 0, on_rq = -1;
 
 	if (unlikely(cpu_is_offline(dest_cpu)))
 		return ret;
@@ -5373,6 +5403,11 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 	ret = 1;
 out:
 	double_rq_unlock(rq_src, rq_dest);
+
+	if (cpu_is_offline(src_cpu))
+		printk("__migrate_task( %s, on_rq: %i ) from %i to %i -> %i\n",
+			p->comm, on_rq, src_cpu, dest_cpu, ret);
+
 	return ret;
 }
 
@@ -5802,6 +5837,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	unsigned long flags;
 	struct rq *rq;
 
+	printk(KERN_ERR "-> migration_call(cpu: %i, action: %lu)\n", cpu, action);
+
 	switch (action) {
 
 	case CPU_UP_PREPARE:
@@ -5813,23 +5850,25 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		/* Must be high prio: stop_machine expects to yield to it. */
 		rq = task_rq_lock(p, &flags);
 		__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
+
+		/* Update our root-domain */
+		if (rq->rd) {
+			BUG_ON(!cpu_isset(cpu, rq->rd->span));
+			cpu_set(cpu, rq->rd->online);
+		}
+
 		task_rq_unlock(rq, &flags);
 		cpu_rq(cpu)->migration_thread = p;
+
+		printk(KERN_ERR "--> migration_call() : cpu(%i - %p) online\n",
+			cpu, rq->rd);
+
 		break;
 
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		/* Strictly unnecessary, as first user will wake it. */
 		wake_up_process(cpu_rq(cpu)->migration_thread);
-
-		/* Update our root-domain */
-		rq = cpu_rq(cpu);
-		spin_lock_irqsave(&rq->lock, flags);
-		if (rq->rd) {
-			BUG_ON(!cpu_isset(cpu, rq->rd->span));
-			cpu_set(cpu, rq->rd->online);
-		}
-		spin_unlock_irqrestore(&rq->lock, flags);
 		break;
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -5890,9 +5929,14 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 			cpu_clear(cpu, rq->rd->online);
 		}
 		spin_unlock_irqrestore(&rq->lock, flags);
+
+		printk(KERN_ERR "--> migration_call() : cpu(%i - %p) offline\n",
+			cpu, rq->rd);
 		break;
 #endif
 	}
+	printk(KERN_ERR "<- migration_call(cpu: %i, action: %lu)\n", cpu, action);
+
 	return NOTIFY_OK;
 }
 
@@ -6083,6 +6127,7 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 {
 	unsigned long flags;
 	const struct sched_class *class;
+	int clear = 0;
 
 	spin_lock_irqsave(&rq->lock, flags);
 
@@ -6097,6 +6142,8 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 		cpu_clear(rq->cpu, old_rd->span);
 		cpu_clear(rq->cpu, old_rd->online);
 
+		clear = 1;
+
 		if (atomic_dec_and_test(&old_rd->refcount))
 			kfree(old_rd);
 	}
@@ -6105,8 +6152,6 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	rq->rd = rd;
 
 	cpu_set(rq->cpu, rd->span);
-	if (cpu_isset(rq->cpu, cpu_online_map))
-		cpu_set(rq->cpu, rd->online);
 
 	for (class = sched_class_highest; class; class = class->next) {
 		if (class->join_domain)
@@ -6114,6 +6159,8 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	}
 
 	spin_unlock_irqrestore(&rq->lock, flags);
+
+	printk(KERN_ERR "* rq_attach_root(cpu: %i) -- clear (%i)\n", rq->cpu, clear);
 }
 
 static void init_rootdomain(struct root_domain *rd)
diff --git a/kernel/softlockup.c b/kernel/softlockup.c
index 01b6522..4f478c2 100644
--- a/kernel/softlockup.c
+++ b/kernel/softlockup.c
@@ -27,6 +27,11 @@ static DEFINE_PER_CPU(struct task_struct *, watchdog_task);
 static int __read_mostly did_panic;
 unsigned long __read_mostly softlockup_thresh = 60;
 
+struct task_struct * get_watchdog_task(int cpu)
+{
+	return per_cpu(watchdog_task, cpu);
+}
+
 static int
 softlock_panic(struct notifier_block *this, unsigned long event, void *ptr)
 {
@@ -250,6 +255,8 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	int hotcpu = (unsigned long)hcpu;
 	struct task_struct *p;
 
+	printk(KERN_ERR "--> cpu_callback(cpu: %i, action: %lu)\n", hotcpu, action);
+
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
@@ -294,6 +301,9 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 #endif /* CONFIG_HOTPLUG_CPU */
 	}
+
+	printk(KERN_ERR "<-- cpu_callback(cpu: %i, action: %lu)\n", hotcpu, action);
+
 	return NOTIFY_OK;
 }
 

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH] adjust root-domain->online span in response to hotplug event
  2008-03-08  5:10                               ` [PATCH] adjust root-domain->online span in response to hotplug event Gregory Haskins
  2008-03-08  8:41                                 ` Ingo Molnar
@ 2008-03-09  2:35                                 ` Suresh Siddha
  2008-03-10 12:41                                   ` Gregory Haskins
  2008-03-10  8:14                                 ` Gautham R Shenoy
  2 siblings, 1 reply; 54+ messages in thread
From: Suresh Siddha @ 2008-03-09  2:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, rjw, akpm, dmitry.adamushko, ego, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:
> Suresh Siddha wrote:
>  >
>  > -	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
>  > +	cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
> 
> Hi Suresh,
>    Unfortunately, this patch will introduce its own set of bugs. 

Is that because of the missing get/put_online_cpus()?

> However, your analysis was spot-on.  I think I see the problem now.  It 
> was introduced when I put a hack in to "fix" s2ram problems in -mm as a 
> result of the new root-domain logic.  I think the following patch will 
> fix both issues:

BTW, what is the use of the per-root-domain online map when we have the
global cpu_online_map? Each domain's span & cpu_online_map should give
domain_online_map anyhow.

And isn't it buggy if someone accesses rd->online without get/put_online_map()?

thanks,
suresh


* Re: [PATCH] adjust root-domain->online span in response to hotplug event
  2008-03-08  5:10                               ` [PATCH] adjust root-domain->online span in response to hotplug event Gregory Haskins
  2008-03-08  8:41                                 ` Ingo Molnar
  2008-03-09  2:35                                 ` [PATCH] adjust root-domain->online span in response to hotplug event Suresh Siddha
@ 2008-03-10  8:14                                 ` Gautham R Shenoy
  2008-03-10 13:13                                   ` [PATCH] cpu-hotplug: Register update_sched_domains() notifier with higher prio Gautham R Shenoy
  2008-03-10 13:39                                   ` [PATCH] keep rd->online and cpu_online_map in sync Gregory Haskins
  2 siblings, 2 replies; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-10  8:14 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, rjw, akpm, dmitry.adamushko, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:
> Suresh Siddha wrote:
>  > On Sat, Mar 08, 2008 at 12:43:15AM +0100, Rafael J. Wysocki wrote:
>  >> On Saturday, 8 of March 2008, Andrew Morton wrote:
>  >>> On Fri, 7 Mar 2008 15:01:26 -0800
>  >>> Suresh Siddha <suresh.b.siddha@intel.com> wrote:
>  >>>
>  >>>> Andrew, Please check if the appended patch fixes your power-off 
> problem aswell.
>  >>>> ...
>  >>>>
>  >>>> --- a/kernel/sched.c
>  >>>> +++ b/kernel/sched.c
>  >>>> @@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, 
> unsigned long action, void *hcpu)
>  >>>>  		break;
>  >>>>
>  >>>>  	case CPU_DOWN_PREPARE:
>  >>>> +	case CPU_DOWN_PREPARE_FROZEN:
>  >>>>  		/* Update our root-domain */
>  >>>>  		rq = cpu_rq(cpu);
>  >>>>  		spin_lock_irqsave(&rq->lock, flags);
>  >>> No, it does not.
>  >> Well, this is a bug nevertheless.
>  >>
>  >
>  > Well, my previous root cause needs some small changes.
>  >
>  > During the notifier call chain for CPU_DOWN(till 'update_sched_domains'
>  > is called atleast), all the cpu's are attached to 'def_root_domain', 
> for whom
>  > online mask still has the offline cpu.
>  >
>  > This is because, during CPU_DOWN_PREPARE, migration_call() first clears
>  > the root_domain->online, and later during the DOWN_PREPARE call chain
>  > detach_destroy_domains() attach to def_root_domain with 
> cpu_online_map(which
>  > still has the just about to die 'cpu' set).
>  >
>  > So essentially, during the notifier call chain of CPU_DOWN (before
>  > 'update_sched_domains' is called atleast), any one doing RT process
>  > wakeup's (for example: kthread_stop()) can still end up on the dead cpu.
>  >
>  > Andrew, Can you please try one more patch(appended) to see if it helps?
>  >
>  > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
>  > ---
>  >
>  > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
>  > index 0a6d2e5..8cb707c 100644
>  > --- a/kernel/sched_rt.c
>  > +++ b/kernel/sched_rt.c
>  > @@ -597,7 +597,7 @@ static int find_lowest_cpus(struct task_struct 
> *task, cpumask_t *lowest_mask)
>  >  	int       count       = 0;
>  >  	int       cpu;
>  >
>  > -	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
>  > +	cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
>  >
>  >  	/*
> 
> Hi Suresh,
>    Unfortunately, this patch will introduce its own set of bugs. 
> However, your analysis was spot-on.  I think I see the problem now.  It 
> was introduced when I put a hack in to "fix" s2ram problems in -mm as a 
> result of the new root-domain logic.  I think the following patch will 
> fix both issues:
> 
> (I verified that I could take a cpu offline/online, but I don't have an
> s2ram compatible machine handy.  Andrew, I believe you could reproduce
> the s2ram problem a few months ago when that issue popped up.  So if you
> could, please verify that s2ram also works with this patch applied, in
> addition to the hotplug problem.
> 
> Regards,
> -Greg
> 
> -------------------------------------------------
> 
> adjust root-domain->online span in response to hotplug event
> 
> We currently set the root-domain online span automatically when the domain
> is added to the cpu if the cpu is already a member of cpu_online_map.
> This was done as a hack/bug-fix for s2ram, but it also causes a problem
> with hotplug CPU_DOWN transitioning.  The right way to fix the original
> problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
> which is already too late.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  kernel/sched.c |   18 +++++++-----------
>  1 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 52b9867..b02e4fc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5813,6 +5813,13 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  		/* Must be high prio: stop_machine expects to yield to it. */
>  		rq = task_rq_lock(p, &flags);
>  		__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
> +
> +		/* Update our root-domain */
> +		if (rq->rd) {
> +			BUG_ON(!cpu_isset(cpu, rq->rd->span));
> +			cpu_set(cpu, rq->rd->online);
> +		}

Hi Greg,

Suppose someone issues a wakeup at some point between CPU_UP_PREPARE
and CPU_ONLINE, then isn't there a possibility that the task could
be woken up on the cpu which has not yet come online ?
Because at this point in find_lowest_cpus()

cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);

the cpu which has not yet come online is set in both rd->online map
and the task->cpus_allowed.

I wonder if assigning a priority to the update_sched_domains() notifier
so that it's called immediately after migration_call() would solve the
problem.


> +
>  		task_rq_unlock(rq, &flags);
>  		cpu_rq(cpu)->migration_thread = p;
>  		break;
> @@ -5821,15 +5828,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  	case CPU_ONLINE_FROZEN:
>  		/* Strictly unnecessary, as first user will wake it. */
>  		wake_up_process(cpu_rq(cpu)->migration_thread);
> -
> -		/* Update our root-domain */
> -		rq = cpu_rq(cpu);
> -		spin_lock_irqsave(&rq->lock, flags);
> -		if (rq->rd) {
> -			BUG_ON(!cpu_isset(cpu, rq->rd->span));
> -			cpu_set(cpu, rq->rd->online);
> -		}
> -		spin_unlock_irqrestore(&rq->lock, flags);
>  		break;
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -6105,8 +6103,6 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
>  	rq->rd = rd;
> 
>  	cpu_set(rq->cpu, rd->span);
> -	if (cpu_isset(rq->cpu, cpu_online_map))
> -		cpu_set(rq->cpu, rd->online);
> 
>  	for (class = sched_class_highest; class; class = class->next) {
>  		if (class->join_domain)

-- 
Thanks and Regards
gautham


* Re: [PATCH] adjust root-domain->online span in response to hotplug event
  2008-03-09  2:35                                 ` [PATCH] adjust root-domain->online span in response to hotplug event Suresh Siddha
@ 2008-03-10 12:41                                   ` Gregory Haskins
  0 siblings, 0 replies; 54+ messages in thread
From: Gregory Haskins @ 2008-03-10 12:41 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: mingo, dmitry.adamushko, ego, yi.y.yang, tglx, akpm, rjw, oleg,
	linux-kernel

>>> On Sat, Mar 8, 2008 at  9:35 PM, in message
<20080309023515.GC15909@linux-os.sc.intel.com>, Suresh Siddha
<suresh.b.siddha@intel.com> wrote: 
> On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:
>> Suresh Siddha wrote:
>>  >
>>  > -	cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
>>  > +	cpus_and(*lowest_mask, task->cpus_allowed, cpu_online_map);
>> 
>> Hi Suresh,
>>    Unfortunately, this patch will introduce its own set of bugs. 
> 
> Is that because of the missing get/put_online_cpus() ?

Actually, I was referring to the problem of potentially including "out of domain" CPUs in the search, but that is a good point too.  I'll have to think about whether a get/put is needed here.


> 
>> However, your analysis was spot-on.  I think I see the problem now.  It 
>> was introduced when I put a hack in to "fix" s2ram problems in -mm as a 
>> result of the new root-domain logic.  I think the following patch will 
>> fix both issues:
> 
> BTW, what is the use of the per-root-domain online map when we have the
> global cpu_online_map? Each domain's span & cpu_online_map should give
> domain_online_map anyhow.


That's exactly right.  rd->online is a cached version of "rd->span & cpu_online_map".  It simply saves us from having to do an extra cpus_and() at run-time (which can be expensive as NR_CPUS grows large).

> 
> And isn't it buggy if someone accesses rd->online without
> get/put_online_map()?

See above: I'm not sure, but you may be right about that.

Regards,
-Greg



* [PATCH] cpu-hotplug: Register update_sched_domains() notifier with higher prio
  2008-03-10  8:14                                 ` Gautham R Shenoy
@ 2008-03-10 13:13                                   ` Gautham R Shenoy
  2008-03-10 22:25                                     ` Andrew Morton
  2008-03-10 13:39                                   ` [PATCH] keep rd->online and cpu_online_map in sync Gregory Haskins
  1 sibling, 1 reply; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-10 13:13 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, rjw, akpm, dmitry.adamushko, mingo, oleg,
	yi.y.yang, linux-kernel, tglx, Srivatsa Vaddagiri

cpu-hotplug: Register update_sched_domains() notifier with higher prio
From: Gautham R Shenoy <ego@in.ibm.com>

The current -rt wake_up logic uses the rq->rd->online_map which is
updated on cpu-hotplug events by update_sched_domains(). Currently
update_sched_domains() is registered with priority 0. It is
preferable that update_sched_domains() be the first notifier called on
CPU_DEAD or CPU_ONLINE events in order to ensure that the
rq->rd->online_map is in sync with the cpu_online_map.

This way on a CPU_DOWN_PREPARE,
we would destroy the sched_domains and reattach all the rq's to the
def_root_domain. This will be followed by a CPU_DOWN_PREPARE in
migration_call() which will clear the bit of the cpu going offline
from the corresponding rq.
This will ensure that no task will be woken up on the cpu which is
going to be offlined between CPU_DOWN_PREPARE and CPU_DEAD.

Ditto for the CPU_ONLINE path.

Tested on a 4-way Intel Xeon Box with cpu-hotplug tests running in parallel
with kernbench.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
---

 kernel/sched.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 69ecb1a..374c46f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7079,8 +7079,16 @@ void __init sched_init_smp(void)
 	if (cpus_empty(non_isolated_cpus))
 		cpu_set(smp_processor_id(), non_isolated_cpus);
 	put_online_cpus();
-	/* XXX: Theoretical race here - CPU may be hotplugged now */
-	hotcpu_notifier(update_sched_domains, 0);
+	/*
+	 * XXX: Theoretical race here - CPU may be hotplugged now
+	 *
+	 * We register the notifier with priority 11, which means that
+	 * update_sched_domains() will be called just before migration_call().
+	 *
+	 * This is necessary to ensure that the rt wake up logic works fine
+	 * and the rq->rd->online_map remains in sync with the cpu_online_map.
+	 */
+	hotcpu_notifier(update_sched_domains, 11);
 
 	/* Move init over to a non-isolated CPU */
 	if (set_cpus_allowed(current, non_isolated_cpus) < 0)
-- 
Thanks and Regards
gautham


* [PATCH] keep rd->online and cpu_online_map in sync
  2008-03-10  8:14                                 ` Gautham R Shenoy
  2008-03-10 13:13                                   ` [PATCH] cpu-hotplug: Register update_sched_domains() notifier with higher prio Gautham R Shenoy
@ 2008-03-10 13:39                                   ` Gregory Haskins
  2008-03-10 14:21                                     ` Gautham R Shenoy
  2008-03-10 18:12                                     ` Suresh Siddha
  1 sibling, 2 replies; 54+ messages in thread
From: Gregory Haskins @ 2008-03-10 13:39 UTC (permalink / raw)
  To: ego
  Cc: suresh.b.siddha, rjw, akpm, dmitry.adamushko, ego, mingo, oleg,
	yi.y.yang, linux-kernel, tglx, ghaskins

>>> On Mon, Mar 10, 2008 at  4:14 AM, in message
<20080310081425.GA11031@in.ibm.com>, Gautham R Shenoy <ego@in.ibm.com> wrote:

> On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:

[ snip ]

>> @@ -5813,6 +5813,13 @@ migration_call(struct notifier_block *nfb, unsigned 
> long action, void *hcpu)
>>  		/* Must be high prio: stop_machine expects to yield to it. */
>>  		rq = task_rq_lock(p, &flags);
>>  		__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
>> +
>> +		/* Update our root-domain */
>> +		if (rq->rd) {
>> +			BUG_ON(!cpu_isset(cpu, rq->rd->span));
>> +			cpu_set(cpu, rq->rd->online);
>> +		}
> 
> Hi Greg,
> 
> Suppose someone issues a wakeup at some point between CPU_UP_PREPARE
> and CPU_ONLINE, then isn't there a possibility that the task could
> be woken up on the cpu which has not yet come online ?
> Because at this point in find_lowest_cpus()
> 
> cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
> 
> the cpu which has not yet come online is set in both rd->online map
> and the task->cpus_allowed.

Hi Gautham,
   This is a good point.  I think I got it backwards (or rather, I had the ordering more correct before the patch): I really want to keep the set() on ONLINE as I had it originally.  It's the clear() that is misplaced, as DOWN_PREPARE is too early in the chain.  I probably should have used DYING to defer the clear to the point right before cpu_online_map is updated.  Thanks to Dmitry for highlighting the smpboot path, which helped me find the proper notifier symbol.

> 
> I wonder if assigning a priority to the update_sched_domains() notifier
> so that it's called immediately after migration_call() would solve the
> problem.

Possibly (and I just saw your patch).  But note that migration_call was previously declared as "should run first", so I am not sure that moving update_sched_domains() above it will not have a ripple effect in some other area.  If that is ok, then I agree your solution solves the ordering problem, albeit in a bit heavy-handed manner.  Otherwise, s/CPU_DOWN_PREPARE/CPU_DYING in migration_call() will fix the issue as well, I believe.  This will push the update of rd->online to be much more tightly coupled with updates to cpu_online_map.

Ingo/Andrew: Please drop the earlier fix I submitted.  I don't think it is correct on several fronts.  Please try the one below.  As before, I have tested that I can offline/online CPUs, but I don't have s2ram capability here.

Regards,
-Greg

---------------------------------------------
keep rd->online and cpu_online_map in sync

It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map.  This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING instead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..a616fa1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		spin_unlock_irq(&rq->lock);
 		break;
 
-	case CPU_DOWN_PREPARE:
+	case CPU_DYING:
 		/* Update our root-domain */
 		rq = cpu_rq(cpu);
 		spin_lock_irqsave(&rq->lock, flags);



* Re: [PATCH] adjust root-domain->online span in response to hotplug event
  2008-03-09  0:31                                     ` Dmitry Adamushko
@ 2008-03-10 14:12                                       ` Gregory Haskins
  0 siblings, 0 replies; 54+ messages in thread
From: Gregory Haskins @ 2008-03-10 14:12 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Gregory Haskins, Ingo Molnar, ego, suresh.b.siddha, yi.y.yang,
	tglx, akpm, rjw, oleg, linux-kernel

Dmitry Adamushko wrote:
> On 08/03/2008, Gregory Haskins <ghaskins@novell.com> wrote:
> 
> I guess, it's in arch-specific code. e.g. for x86-64,
> 
> down:
> 
> kernel/cpu.c :: _cpu_down -> take_cpu_down() -> __cpu_disable() [
> arch/x86/kernel/smpboot_64.c ] -> cpu_clear(cpu, cpu_online_map)
> 
> up:
> 
> kernel/cpu.c :: _cpu_up() -> __cpu_up [ arch/x86/kernel/smpboot_64.c ]
> -> do_boot_cpu() -> start_secondary() [ runs on a to-be-online cpu ]
> -> cpu_set(smp_processor_id(), cpu_online_map)

Thanks Dmitry!  Your tip helped me to figure out what the real solution 
should be.

Regards,
-Greg


* Re: [PATCH] keep rd->online and cpu_online_map in sync
  2008-03-10 13:39                                   ` [PATCH] keep rd->online and cpu_online_map in sync Gregory Haskins
@ 2008-03-10 14:21                                     ` Gautham R Shenoy
  2008-03-10 18:12                                     ` Suresh Siddha
  1 sibling, 0 replies; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-10 14:21 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, rjw, akpm, dmitry.adamushko, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Mon, Mar 10, 2008 at 09:39:34AM -0400, Gregory Haskins wrote:
> >>> On Mon, Mar 10, 2008 at  4:14 AM, in message
> <20080310081425.GA11031@in.ibm.com>, Gautham R Shenoy <ego@in.ibm.com> wrote:
> 
> > On Sat, Mar 08, 2008 at 12:10:15AM -0500, Gregory Haskins wrote:
> 
> [ snip ]
> 
> >> @@ -5813,6 +5813,13 @@ migration_call(struct notifier_block *nfb, unsigned 
> > long action, void *hcpu)
> >>  		/* Must be high prio: stop_machine expects to yield to it. */
> >>  		rq = task_rq_lock(p, &flags);
> >>  		__setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1);
> >> +
> >> +		/* Update our root-domain */
> >> +		if (rq->rd) {
> >> +			BUG_ON(!cpu_isset(cpu, rq->rd->span));
> >> +			cpu_set(cpu, rq->rd->online);
> >> +		}
> > 
> > Hi Greg,
> > 
> > Suppose someone issues a wakeup at some point between CPU_UP_PREPARE
> > and CPU_ONLINE, then isn't there a possibility that the task could
> > be woken up on the cpu which has not yet come online ?
> > Because at this point in find_lowest_cpus()
> > 
> > cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);
> > 
> > the cpu which has not yet come online is set in both rd->online map
> > and the task->cpus_allowed.
> 
> Hi Gautham,
>    This is a good point.  I think I got it backwards (or rather, I had the ordering more correct before the patch): I really want to keep the set() on ONLINE as I had it originally.  It's the clear() that is misplaced, as DOWN_PREPARE is too early in the chain.  I probably should have used DYING to defer the clear to the point right before cpu_online_map is updated.  Thanks to Dmitry for highlighting the smpboot path, which helped me find the proper notifier symbol.

I think CPU_DYING is a more apt point to clear the bit.
Vatsa had in fact suggested that we do it in cpu_disable(), but since we
have an event, we might as well make use of it.
> 
> > 
> > I wonder if assigning a priority to the update_sched_domains() notifier
> > so that it's called immediately after migration_call() would solve the
> > problem.
> 
> Possibly (and I just saw your patch).  But note that migration_call was previously declared as "should run first", so I am not sure that moving update_sched_domains() above it will not have a ripple effect in some other area.  If that is ok, then I agree your solution solves the ordering problem, albeit in a bit heavy-handed manner.  Otherwise, s/CPU_DOWN_PREPARE/CPU_DYING in migration_call() will fix the issue as well, I believe.  This will push the update of rd->online to be much more tightly coupled with updates to cpu_online_map.

The reason why migration_call() was declared as "should run first" was
to ensure that all the migration happens before the other subsystems run
their CPU_DEAD events. This is important, else when the subsystems issue
a kthread_stop(subsystem_thread) in their CPU_DEAD case, the
subsystem_thread might still be affined to the offlined cpu, thus
resulting in a deadlock.

That requirement is taken care of in the patch I had sent since
migration_call() will be called immediately after update_sched_domains().

> 
> Ingo/Andrew: Please drop the earlier fix I submitted.  I don't think it is correct on several fronts.  Please try the one below.  As before, I have tested that I can offline/online CPUs, but I dont have s2ram capability here.
> 
> Regards,
> -Greg
> 
> ---------------------------------------------
> keep rd->online and cpu_online_map in sync
> 
> It is possible to allow the root-domain cache of online cpus to
> become out of sync with the global cpu_online_map.  This is because we
> currently trigger removal of cpus too early in the notifier chain.
> Other DOWN_PREPARE handlers may in fact run and reconfigure the
> root-domain topology, thereby stomping on our own offline handling.
> 
> The end result is that rd->online may become out of sync with
> cpu_online_map, which results in potential task misrouting.
> 
> So change the offline handling to be more tightly coupled with the
> global offline process by triggering on CPU_DYING instead of
> CPU_DOWN_PREPARE.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  kernel/sched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 52b9867..a616fa1 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  		spin_unlock_irq(&rq->lock);
>  		break;
> 
> -	case CPU_DOWN_PREPARE:
> +	case CPU_DYING:
>  		/* Update our root-domain */
>  		rq = cpu_rq(cpu);
>  		spin_lock_irqsave(&rq->lock, flags);

-- 
Thanks and Regards
gautham


* Re: [PATCH] keep rd->online and cpu_online_map in sync
  2008-03-10 13:39                                   ` [PATCH] keep rd->online and cpu_online_map in sync Gregory Haskins
  2008-03-10 14:21                                     ` Gautham R Shenoy
@ 2008-03-10 18:12                                     ` Suresh Siddha
  2008-03-10 22:03                                       ` Rafael J. Wysocki
  1 sibling, 1 reply; 54+ messages in thread
From: Suresh Siddha @ 2008-03-10 18:12 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: ego, suresh.b.siddha, rjw, akpm, dmitry.adamushko, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Mon, Mar 10, 2008 at 09:39:34AM -0400, Gregory Haskins wrote:
> keep rd->online and cpu_online_map in sync
> 
> It is possible to allow the root-domain cache of online cpus to
> become out of sync with the global cpu_online_map.  This is because we
> currently trigger removal of cpus too early in the notifier chain.
> Other DOWN_PREPARE handlers may in fact run and reconfigure the
> root-domain topology, thereby stomping on our own offline handling.
> 
> The end result is that rd->online may become out of sync with
> cpu_online_map, which results in potential task misrouting.
> 
> So change the offline handling to be more tightly coupled with the
> global offline process by triggering on CPU_DYING instead of
> CPU_DOWN_PREPARE.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
> 
>  kernel/sched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 52b9867..a616fa1 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  		spin_unlock_irq(&rq->lock);
>  		break;
>  
> -	case CPU_DOWN_PREPARE:
> +	case CPU_DYING:

Don't we need to take care of CPU_DYING_FROZEN as well?

>  		/* Update our root-domain */
>  		rq = cpu_rq(cpu);
>  		spin_lock_irqsave(&rq->lock, flags);
> 
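For context, the _FROZEN action codes are the same hotplug events delivered during a suspend/resume cycle; the kernel composes them by OR-ing CPU_TASKS_FROZEN into the base action, which is why a handler either needs both case labels or can mask the bit off before dispatching. A minimal userspace sketch of that relationship (constant values mirror the kernel headers of this era, but treat them as illustrative):

```c
/* Values mirror include/linux/notifier.h circa 2.6.25 (illustrative). */
#define CPU_DOWN_PREPARE   0x0005
#define CPU_DYING          0x0008
#define CPU_TASKS_FROZEN   0x0010
#define CPU_DYING_FROZEN   (CPU_DYING | CPU_TASKS_FROZEN)

/* Alternative to listing both case labels: strip the FROZEN bit so a
 * single case covers both the runtime and the suspend/resume variant. */
static unsigned long strip_frozen(unsigned long action)
{
	return action & ~CPU_TASKS_FROZEN;
}
```

With this helper, `switch (strip_frozen(action))` needs only a `case CPU_DYING:` label to handle both events.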


* [PATCH v2] keep rd->online and cpu_online_map in sync
  2008-03-10 22:10                                           ` Suresh Siddha
@ 2008-03-10 21:59                                             ` Gregory Haskins
  2008-03-10 23:36                                               ` Andrew Morton
  0 siblings, 1 reply; 54+ messages in thread
From: Gregory Haskins @ 2008-03-10 21:59 UTC (permalink / raw)
  To: suresh.b.siddha
  Cc: ego, rjw, akpm, dmitry.adamushko, ego, mingo, oleg, yi.y.yang,
	linux-kernel, tglx, ghaskins

>>> On Mon, Mar 10, 2008 at  6:10 PM, in message
<20080310221014.GB27329@linux-os.sc.intel.com>, Suresh Siddha
<suresh.b.siddha@intel.com> wrote: 
> On Mon, Mar 10, 2008 at 04:00:28PM -0600, Gregory Haskins wrote:
>> >>> On Mon, Mar 10, 2008 at  6:03 PM, in message 
> <200803102303.28660.rjw@sisk.pl>,
>> "Rafael J. Wysocki" <rjw@sisk.pl> wrote: 
>> > On Monday, 10 of March 2008, Suresh Siddha wrote:
>> >> >  
>> >> > -	case CPU_DOWN_PREPARE:
>> >> > +	case CPU_DYING:
>> >> 
>> >> Don't we need to take care of CPU_DYING_FROZEN as well?
>> > 
>> > Well, I'd say we do.
>> 
>> Should I add that to the patch as well then?
> 
> Yes please.

Here is v2 with the suggested improvement

-Greg

------------------------
keep rd->online and cpu_online_map in sync

It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map.  This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING instead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/sched.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..1cb53fb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		spin_unlock_irq(&rq->lock);
 		break;
 
-	case CPU_DOWN_PREPARE:
+	case CPU_DYING:
+	case CPU_DYING_FROZEN:
 		/* Update our root-domain */
 		rq = cpu_rq(cpu);
 		spin_lock_irqsave(&rq->lock, flags);
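To sketch what this case maintains (names and the bitmask representation below are illustrative, not the kernel's real cpumask types): rd->online is a cached mask of the root domain's online CPUs, and moving the update to CPU_DYING clears the dying CPU's bit at the last possible moment, so RT task routing never consults a mask that still names a CPU on its way out.

```c
/* Toy model of the root-domain bookkeeping: a cpumask is modeled as a
 * plain bitmask (bit i == CPU i). rd_cpu_detach() is what the
 * CPU_DYING branch conceptually does for the dying CPU. */
typedef unsigned long cpumask_t;

struct root_domain {
	cpumask_t span;    /* CPUs this root domain covers */
	cpumask_t online;  /* cached subset that is currently online */
};

static void rd_cpu_detach(struct root_domain *rd, int cpu)
{
	if (rd->span & (1UL << cpu))          /* cpu belongs to this domain */
		rd->online &= ~(1UL << cpu);  /* mirror the cpu_online_map update */
}
```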



* Re: [PATCH] keep rd->online and cpu_online_map in sync
  2008-03-10 22:03                                       ` Rafael J. Wysocki
@ 2008-03-10 22:00                                         ` Gregory Haskins
  2008-03-10 22:10                                           ` Suresh Siddha
  0 siblings, 1 reply; 54+ messages in thread
From: Gregory Haskins @ 2008-03-10 22:00 UTC (permalink / raw)
  To: Suresh Siddha, Rafael J. Wysocki
  Cc: mingo, dmitry.adamushko, ego, yi.y.yang, tglx, akpm, oleg, linux-kernel

>>> On Mon, Mar 10, 2008 at  6:03 PM, in message <200803102303.28660.rjw@sisk.pl>,
"Rafael J. Wysocki" <rjw@sisk.pl> wrote: 
> On Monday, 10 of March 2008, Suresh Siddha wrote:
>> On Mon, Mar 10, 2008 at 09:39:34AM -0400, Gregory Haskins wrote:
>> > keep rd->online and cpu_online_map in sync
>> > 
>> > It is possible to allow the root-domain cache of online cpus to
>> > become out of sync with the global cpu_online_map.  This is because we
>> > currently trigger removal of cpus too early in the notifier chain.
>> > Other DOWN_PREPARE handlers may in fact run and reconfigure the
>> > root-domain topology, thereby stomping on our own offline handling.
>> > 
>> > The end result is that rd->online may become out of sync with
>> > cpu_online_map, which results in potential task misrouting.
>> > 
>> > So change the offline handling to be more tightly coupled with the
>> > global offline process by triggering on CPU_DYING instead of
>> > CPU_DOWN_PREPARE.
>> > 
>> > Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> > ---
>> > 
>> >  kernel/sched.c |    2 +-
>> >  1 files changed, 1 insertions(+), 1 deletions(-)
>> > 
>> > diff --git a/kernel/sched.c b/kernel/sched.c
>> > index 52b9867..a616fa1 100644
>> > --- a/kernel/sched.c
>> > +++ b/kernel/sched.c
>> > @@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned 
> long action, void *hcpu)
>> >  		spin_unlock_irq(&rq->lock);
>> >  		break;
>> >  
>> > -	case CPU_DOWN_PREPARE:
>> > +	case CPU_DYING:
>> 
>> Don't we need to take care of CPU_DYING_FROZEN as well?
> 
> Well, I'd say we do.

Should I add that to the patch as well then?

> 
>> >  		/* Update our root-domain */
>> >  		rq = cpu_rq(cpu);
>> >  		spin_lock_irqsave(&rq->lock, flags);
>> > 




* Re: [PATCH] keep rd->online and cpu_online_map in sync
  2008-03-10 18:12                                     ` Suresh Siddha
@ 2008-03-10 22:03                                       ` Rafael J. Wysocki
  2008-03-10 22:00                                         ` Gregory Haskins
  0 siblings, 1 reply; 54+ messages in thread
From: Rafael J. Wysocki @ 2008-03-10 22:03 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Gregory Haskins, ego, akpm, dmitry.adamushko, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Monday, 10 of March 2008, Suresh Siddha wrote:
> On Mon, Mar 10, 2008 at 09:39:34AM -0400, Gregory Haskins wrote:
> > keep rd->online and cpu_online_map in sync
> > 
> > It is possible to allow the root-domain cache of online cpus to
> > become out of sync with the global cpu_online_map.  This is because we
> > currently trigger removal of cpus too early in the notifier chain.
> > Other DOWN_PREPARE handlers may in fact run and reconfigure the
> > root-domain topology, thereby stomping on our own offline handling.
> > 
> > The end result is that rd->online may become out of sync with
> > cpu_online_map, which results in potential task misrouting.
> > 
> > So change the offline handling to be more tightly coupled with the
> > global offline process by triggering on CPU_DYING instead of
> > CPU_DOWN_PREPARE.
> > 
> > Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> > ---
> > 
> >  kernel/sched.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 52b9867..a616fa1 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5881,7 +5881,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> >  		spin_unlock_irq(&rq->lock);
> >  		break;
> >  
> > -	case CPU_DOWN_PREPARE:
> > +	case CPU_DYING:
> 
> Don't we need to take care of CPU_DYING_FROZEN as well?

Well, I'd say we do.

> >  		/* Update our root-domain */
> >  		rq = cpu_rq(cpu);
> >  		spin_lock_irqsave(&rq->lock, flags);
> > 


* Re: [PATCH] keep rd->online and cpu_online_map in sync
  2008-03-10 22:00                                         ` Gregory Haskins
@ 2008-03-10 22:10                                           ` Suresh Siddha
  2008-03-10 21:59                                             ` [PATCH v2] " Gregory Haskins
  0 siblings, 1 reply; 54+ messages in thread
From: Suresh Siddha @ 2008-03-10 22:10 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Suresh Siddha, Rafael J. Wysocki, mingo, dmitry.adamushko, ego,
	yi.y.yang, tglx, akpm, oleg, linux-kernel

On Mon, Mar 10, 2008 at 04:00:28PM -0600, Gregory Haskins wrote:
> >>> On Mon, Mar 10, 2008 at  6:03 PM, in message <200803102303.28660.rjw@sisk.pl>,
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote: 
> > On Monday, 10 of March 2008, Suresh Siddha wrote:
> >> >  
> >> > -	case CPU_DOWN_PREPARE:
> >> > +	case CPU_DYING:
> >> 
> >> Don't we need to take care of CPU_DYING_FROZEN as well?
> > 
> > Well, I'd say we do.
> 
> Should I add that to the patch as well then?

Yes please.


* Re: [PATCH] cpu-hotplug: Register update_sched_domains() notifier with higher prio
  2008-03-10 13:13                                   ` [PATCH] cpu-hotplug: Register update_sched_domains() notifier with higher prio Gautham R Shenoy
@ 2008-03-10 22:25                                     ` Andrew Morton
  0 siblings, 0 replies; 54+ messages in thread
From: Andrew Morton @ 2008-03-10 22:25 UTC (permalink / raw)
  To: ego
  Cc: Gregory Haskins, suresh.b.siddha, rjw, dmitry.adamushko, mingo,
	oleg, yi.y.yang, linux-kernel, tglx, Srivatsa Vaddagiri

On Mon, 10 Mar 2008 18:43:55 +0530 Gautham R Shenoy <ego@in.ibm.com> wrote:

> cpu-hotplug: Register update_sched_domains() notifier with higher prio
> From: Gautham R Shenoy <ego@in.ibm.com>
> 
> The current -rt wake_up logic uses the rq->rd->online_map which is
> updated on cpu-hotplug events by update_sched_domains(). Currently
> update_sched_domains() is registered with a priority 0. It is
> preferable that update_sched_domains() be the first notifier called on
> a CPU_DEAD or CPU_ONLINE events inorder to ensure that the
> rq->rd->online_map is in sync with the cpu_online_map.
> 
> This way on a CPU_DOWN_PREPARE,
> we would destroy the sched_domains and reattach all the rq's to the
> def_root_domain. This will be followed by a CPU_DOWN_PREPARE in
> migration_call() which will clear the bit of the cpu going offline
> from the corresponding rq.
> This will ensure that no task will be woken up on the cpu which is
> going to be offlined between CPU_DOWN_PREPARE and CPU_DEAD.
> 
> Ditto for the CPU_ONLINE path.
> 
> Tested on a 4-way Intel Xeon Box with cpu-hotplug tests running in parallel
> with kernbench.
> 
> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
> ---
> 
>  kernel/sched.c |   12 ++++++++++--
>  1 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 69ecb1a..374c46f 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -7079,8 +7079,16 @@ void __init sched_init_smp(void)
>  	if (cpus_empty(non_isolated_cpus))
>  		cpu_set(smp_processor_id(), non_isolated_cpus);
>  	put_online_cpus();
> -	/* XXX: Theoretical race here - CPU may be hotplugged now */
> -	hotcpu_notifier(update_sched_domains, 0);
> +	/*
> +	 * XXX: Theoretical race here - CPU may be hotplugged now
> +	 *
> +	 * We register the notifier with priority 11, which means that
> +	 * update_sched_domains() will be called just before migration_call().
> +	 *
> +	 * This is necessary to ensure that the rt wake up logic works fine
> +	 * and the rq->rd->online_map remains in sync with the cpu_online_map.
> +	 */
> +	hotcpu_notifier(update_sched_domains, 11);
>  
>  	/* Move init over to a non-isolated CPU */
>  	if (set_cpus_allowed(current, non_isolated_cpus) < 0)

This fixes my poweroff-doesn't-work post-2.6.24 regression.
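The reason priority 11 places update_sched_domains() just before migration_call() is that notifier chains invoke callbacks in descending priority order, and the migration notifier is registered at priority 10 in this era. A small userspace model of that ordering (all names here are illustrative stand-ins, not the kernel's actual implementation):

```c
#include <stddef.h>

/* Minimal model of a notifier chain: callbacks are kept sorted by
 * descending priority, so a priority-11 notifier always runs before a
 * priority-10 one on the same event, regardless of registration order. */
struct notifier_block {
	int (*notifier_call)(unsigned long action);
	int priority;
	struct notifier_block *next;
};

static void chain_register(struct notifier_block **head, struct notifier_block *nb)
{
	/* Walk past entries with priority >= ours, then splice in. */
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

static void chain_call(struct notifier_block *head, unsigned long action)
{
	for (; head; head = head->next)
		head->notifier_call(action);
}

/* Stand-ins that record the order in which they were invoked. */
static int call_order[2], n_calls;

static int fake_update_sched_domains(unsigned long action)
{
	(void)action;
	call_order[n_calls++] = 11;
	return 0;
}

static int fake_migration_call(unsigned long action)
{
	(void)action;
	call_order[n_calls++] = 10;
	return 0;
}
```

Even if the priority-10 notifier is registered first, the priority-11 callback is still invoked first on each event.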


* Re: [PATCH v2] keep rd->online and cpu_online_map in sync
  2008-03-10 21:59                                             ` [PATCH v2] " Gregory Haskins
@ 2008-03-10 23:36                                               ` Andrew Morton
  2008-03-11  1:34                                                 ` Suresh Siddha
  0 siblings, 1 reply; 54+ messages in thread
From: Andrew Morton @ 2008-03-10 23:36 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: suresh.b.siddha, ego, rjw, dmitry.adamushko, mingo, oleg,
	yi.y.yang, linux-kernel, tglx

On Mon, 10 Mar 2008 17:59:11 -0400 Gregory Haskins <ghaskins@novell.com> wrote:

> >>> On Mon, Mar 10, 2008 at  6:10 PM, in message
> <20080310221014.GB27329@linux-os.sc.intel.com>, Suresh Siddha
> <suresh.b.siddha@intel.com> wrote: 
> > On Mon, Mar 10, 2008 at 04:00:28PM -0600, Gregory Haskins wrote:
> >> >>> On Mon, Mar 10, 2008 at  6:03 PM, in message 
> > <200803102303.28660.rjw@sisk.pl>,
> >> "Rafael J. Wysocki" <rjw@sisk.pl> wrote: 
> >> > On Monday, 10 of March 2008, Suresh Siddha wrote:
> >> >> >  
> >> >> > -	case CPU_DOWN_PREPARE:
> >> >> > +	case CPU_DYING:
> >> >> 
> >> >> Don't we need to take care of CPU_DYING_FROZEN as well?
> >> > 
> >> > Well, I'd say we do.
> >> 
> >> Should I add that to the patch as well then?
> > 
> > Yes please.
> 
> Here is v2 with the suggested improvement
> 
> -Greg
> 
> ------------------------
> keep rd->online and cpu_online_map in sync
> 
> It is possible to allow the root-domain cache of online cpus to
> become out of sync with the global cpu_online_map.  This is because we
> currently trigger removal of cpus too early in the notifier chain.
> Other DOWN_PREPARE handlers may in fact run and reconfigure the
> root-domain topology, thereby stomping on our own offline handling.
> 
> The end result is that rd->online may become out of sync with
> cpu_online_map, which results in potential task misrouting.
> 
> So change the offline handling to be more tightly coupled with the
> global offline process by triggering on CPU_DYING instead of
> CPU_DOWN_PREPARE.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> Cc: Gautham R Shenoy <ego@in.ibm.com>
> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  kernel/sched.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 52b9867..1cb53fb 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  		spin_unlock_irq(&rq->lock);
>  		break;
>  
> -	case CPU_DOWN_PREPARE:
> +	case CPU_DYING:
> +	case CPU_DYING_FROZEN:
>  		/* Update our root-domain */
>  		rq = cpu_rq(cpu);
>  		spin_lock_irqsave(&rq->lock, flags);

Does this make
cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio.patch
(below) obsolete, or do we want both?  

--- a/kernel/sched.c~cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio
+++ a/kernel/sched.c
@@ -7096,8 +7096,16 @@ void __init sched_init_smp(void)
 	if (cpus_empty(non_isolated_cpus))
 		cpu_set(smp_processor_id(), non_isolated_cpus);
 	put_online_cpus();
-	/* XXX: Theoretical race here - CPU may be hotplugged now */
-	hotcpu_notifier(update_sched_domains, 0);
+	/*
+	 * XXX: Theoretical race here - CPU may be hotplugged now
+	 *
+	 * We register the notifier with priority 11, which means that
+	 * update_sched_domains() will be called just before migration_call().
+	 *
+	 * This is necessary to ensure that the rt wake up logic works fine
+	 * and the rq->rd->online_map remains in sync with the cpu_online_map.
+	 */
+	hotcpu_notifier(update_sched_domains, 11);
 
 	/* Move init over to a non-isolated CPU */
 	if (set_cpus_allowed(current, non_isolated_cpus) < 0)
_



* Re: [PATCH v2] keep rd->online and cpu_online_map in sync
  2008-03-10 23:36                                               ` Andrew Morton
@ 2008-03-11  1:34                                                 ` Suresh Siddha
  2008-03-11  4:39                                                   ` Gautham R Shenoy
  0 siblings, 1 reply; 54+ messages in thread
From: Suresh Siddha @ 2008-03-11  1:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gregory Haskins, suresh.b.siddha, ego, rjw, dmitry.adamushko,
	mingo, oleg, yi.y.yang, linux-kernel, tglx

On Mon, Mar 10, 2008 at 04:36:13PM -0700, Andrew Morton wrote:
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> >  		spin_unlock_irq(&rq->lock);
> >  		break;
> >  
> > -	case CPU_DOWN_PREPARE:
> > +	case CPU_DYING:
> > +	case CPU_DYING_FROZEN:
> >  		/* Update our root-domain */
> >  		rq = cpu_rq(cpu);
> >  		spin_lock_irqsave(&rq->lock, flags);
> 
> Does this make
> cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio.patch
> (below) obsolete,

Yes. I would like to Ack the
keep-rd-online-and-cpu_online_map-in-sync.patch, as it is cleaner.


> or do we want both?  

No. I don't think so. Gautham, do you agree?

thanks,
suresh

> 
> --- a/kernel/sched.c~cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio
> +++ a/kernel/sched.c
> @@ -7096,8 +7096,16 @@ void __init sched_init_smp(void)
>  	if (cpus_empty(non_isolated_cpus))
>  		cpu_set(smp_processor_id(), non_isolated_cpus);
>  	put_online_cpus();
> -	/* XXX: Theoretical race here - CPU may be hotplugged now */
> -	hotcpu_notifier(update_sched_domains, 0);
> +	/*
> +	 * XXX: Theoretical race here - CPU may be hotplugged now
> +	 *
> +	 * We register the notifier with priority 11, which means that
> +	 * update_sched_domains() will be called just before migration_call().
> +	 *
> +	 * This is necessary to ensure that the rt wake up logic works fine
> +	 * and the rq->rd->online_map remains in sync with the cpu_online_map.
> +	 */
> +	hotcpu_notifier(update_sched_domains, 11);
>  
>  	/* Move init over to a non-isolated CPU */
>  	if (set_cpus_allowed(current, non_isolated_cpus) < 0)
> _
> 


* Re: [PATCH v2] keep rd->online and cpu_online_map in sync
  2008-03-11  1:34                                                 ` Suresh Siddha
@ 2008-03-11  4:39                                                   ` Gautham R Shenoy
  0 siblings, 0 replies; 54+ messages in thread
From: Gautham R Shenoy @ 2008-03-11  4:39 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Andrew Morton, Gregory Haskins, rjw, dmitry.adamushko, mingo,
	oleg, yi.y.yang, linux-kernel, tglx

On Mon, Mar 10, 2008 at 06:34:56PM -0700, Suresh Siddha wrote:
> On Mon, Mar 10, 2008 at 04:36:13PM -0700, Andrew Morton wrote:
> > > --- a/kernel/sched.c
> > > +++ b/kernel/sched.c
> > > @@ -5881,7 +5881,8 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
> > >  		spin_unlock_irq(&rq->lock);
> > >  		break;
> > >  
> > > -	case CPU_DOWN_PREPARE:
> > > +	case CPU_DYING:
> > > +	case CPU_DYING_FROZEN:
> > >  		/* Update our root-domain */
> > >  		rq = cpu_rq(cpu);
> > >  		spin_lock_irqsave(&rq->lock, flags);
> > 
> > Does this make
> > cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio.patch
> > (below) obsolete,
> 
> Yes. I would like to Ack the
> keep-rd-online-and-cpu_online_map-in-sync.patch, as it is cleaner.
> 
> 
> > or do we want both?  
> 
> No. I don't think so. Gautham, do you agree?

Yup, I agree! Greg's patch is better since it clears the
bit *just* before we take the cpu down. That way we won't mess with the
ordering.

But I would still prefer that we update the sched_domains early on 
after cpu-hotplug rather than doing it after the subsystems have run 
their notifiers.

Will send a separate patch for that one.
> 
> thanks,
> suresh
> 
> > 
> > --- a/kernel/sched.c~cpu-hotplug-register-update_sched_domains-notifier-with-higher-prio
> > +++ a/kernel/sched.c
> > @@ -7096,8 +7096,16 @@ void __init sched_init_smp(void)
> >  	if (cpus_empty(non_isolated_cpus))
> >  		cpu_set(smp_processor_id(), non_isolated_cpus);
> >  	put_online_cpus();
> > -	/* XXX: Theoretical race here - CPU may be hotplugged now */
> > -	hotcpu_notifier(update_sched_domains, 0);
> > +	/*
> > +	 * XXX: Theoretical race here - CPU may be hotplugged now
> > +	 *
> > +	 * We register the notifier with priority 11, which means that
> > +	 * update_sched_domains() will be called just before migration_call().
> > +	 *
> > +	 * This is necessary to ensure that the rt wake up logic works fine
> > +	 * and the rq->rd->online_map remains in sync with the cpu_online_map.
> > +	 */
> > +	hotcpu_notifier(update_sched_domains, 11);
> >  
> >  	/* Move init over to a non-isolated CPU */
> >  	if (set_cpus_allowed(current, non_isolated_cpus) < 0)
> > _
> > 

-- 
Thanks and Regards
gautham


end of thread, other threads:[~2008-03-11  4:39 UTC | newest]

Thread overview: 54+ messages
2008-03-02 18:42 [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked when cpu is set to offline Yi Yang
2008-03-03 11:54 ` Dmitry Adamushko
2008-03-03 11:56   ` Ingo Molnar
2008-03-03 12:02     ` Dmitry Adamushko
2008-03-03 14:53       ` Yi Yang
2008-03-03 17:37         ` Yi Yang
2008-03-03 15:31 ` Gautham R Shenoy
2008-03-03 14:45   ` Yi Yang
2008-03-04  5:26     ` Gautham R Shenoy
2008-03-04  9:09       ` Gautham R Shenoy
2008-03-03 21:56         ` Yi Yang
2008-03-04 15:01       ` Oleg Nesterov
2008-03-04 14:37         ` Yi Yang
2008-03-06 20:05           ` Yi Yang
2008-03-05 10:05         ` Gautham R Shenoy
2008-03-05 13:53           ` Oleg Nesterov
2008-03-06 11:15             ` Gautham R Shenoy
2008-03-06 12:22               ` Gautham R Shenoy
2008-03-06 13:44         ` Gautham R Shenoy
2008-03-07  2:54           ` Oleg Nesterov
2008-03-07  9:10             ` Gautham R Shenoy
2008-03-07 10:51               ` Gautham R Shenoy
2008-03-06 23:20                 ` Yi Yang
2008-03-07 13:02                 ` Dmitry Adamushko
2008-03-07 13:55                   ` Gautham R Shenoy
2008-03-07 15:50                     ` Gautham R Shenoy
2008-03-07 19:14                       ` [BUG 2.6.25-rc3] scheduler/hotplug: some processes aredealocked " Suresh Siddha
2008-03-07 20:18                   ` [BUG 2.6.25-rc3] scheduler/hotplug: some processes are dealocked " Andrew Morton
2008-03-07 21:36                     ` Rafael J. Wysocki
2008-03-07 23:01                       ` Suresh Siddha
2008-03-07 23:29                         ` Andrew Morton
2008-03-07 23:43                           ` Rafael J. Wysocki
2008-03-08  1:50                             ` Suresh Siddha
2008-03-08  2:09                               ` Andrew Morton
2008-03-08  5:10                               ` [PATCH] adjust root-domain->online span in response to hotplug event Gregory Haskins
2008-03-08  8:41                                 ` Ingo Molnar
2008-03-08 17:50                                   ` [PATCH] adjust root-domain->online span in response to hotplugevent Gregory Haskins
2008-03-09  0:31                                     ` Dmitry Adamushko
2008-03-10 14:12                                       ` Gregory Haskins
2008-03-09  2:35                                 ` [PATCH] adjust root-domain->online span in response to hotplug event Suresh Siddha
2008-03-10 12:41                                   ` Gregory Haskins
2008-03-10  8:14                                 ` Gautham R Shenoy
2008-03-10 13:13                                   ` [PATCH] cpu-hotplug: Register update_sched_domains() notifier with higher prio Gautham R Shenoy
2008-03-10 22:25                                     ` Andrew Morton
2008-03-10 13:39                                   ` [PATCH] keep rd->online and cpu_online_map in sync Gregory Haskins
2008-03-10 14:21                                     ` Gautham R Shenoy
2008-03-10 18:12                                     ` Suresh Siddha
2008-03-10 22:03                                       ` Rafael J. Wysocki
2008-03-10 22:00                                         ` Gregory Haskins
2008-03-10 22:10                                           ` Suresh Siddha
2008-03-10 21:59                                             ` [PATCH v2] " Gregory Haskins
2008-03-10 23:36                                               ` Andrew Morton
2008-03-11  1:34                                                 ` Suresh Siddha
2008-03-11  4:39                                                   ` Gautham R Shenoy
