2.4.7p6 hang

linux-kernel.vger.kernel.org archive mirror

* 2.4.7p6 hang
@ 2001-07-11  8:49 Klaus Dittrich
  2001-07-11 12:56 ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: Klaus Dittrich @ 2001-07-11  8:49 UTC (permalink / raw)
  To: linux-kernel

Kernel: 2.4.7p5 or 2.4.7p6
System: PII-SMP, BX-Chipset

The kernel boots up to the message 

..
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039

and then stops.

I currently use 2.4.7p3 without problems. 

I am not on the kernel mailing-list.

-- 
Best regards
Klaus Dittrich

e-mail: kladit@t-online.de


* Re: 2.4.7p6 hang
  2001-07-11  8:49 2.4.7p6 hang Klaus Dittrich
@ 2001-07-11 12:56 ` Trond Myklebust
  2001-07-11 13:38   ` Andrew Morton
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Trond Myklebust @ 2001-07-11 12:56 UTC (permalink / raw)
  To: Klaus Dittrich, Linus Torvalds; +Cc: linux-kernel

>>>>> " " == Klaus Dittrich <kladit@t-online.de> writes:

     > Kernel: 2.4.7p5 or 2.4.7p6 System: PII-SMP, BX-Chipset

     > The kernel boots up to the message

     > ..  Linux NET4.0 for Linux 2.4 Based upon Swansea University
     > Computer Society NET3.039

     > and then stops.

     > I currently use 2.4.7p3 without problems.

I have the same problem on my setup. To me, it looks like the loop in
spawn_ksoftirqd() is suffering from some sort of atomicity problem.

I managed to band-aid over the problem by replacing the loop with a
semaphore which the child releases once it has been initialized (as per
the appended patch).

Linus?

Cheers,
  Trond

--- linux-2.4.7-smp/kernel/softirq.c.orig	Wed Jul 11 10:31:50 2001
+++ linux-2.4.7-smp/kernel/softirq.c	Wed Jul 11 14:43:03 2001
@@ -371,6 +371,8 @@
 	}
 }
 
+static DECLARE_MUTEX_LOCKED(ksoftirqd_start);
+
 static int ksoftirqd(void * __bind_cpu)
 {
 	int bind_cpu = *(int *) __bind_cpu;
@@ -391,6 +393,7 @@
 	mb();
 
 	ksoftirqd_task(cpu) = current;
+	up(&ksoftirqd_start);
 
 	for (;;) {
 		if (!softirq_pending(cpu))
@@ -416,12 +419,8 @@
 		if (kernel_thread(ksoftirqd, (void *) &cpu,
 				  CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
 			printk("spawn_ksoftirqd() failed for cpu %d\n", cpu);
-		else {
-			while (!ksoftirqd_task(cpu_logical_map(cpu))) {
-				current->policy |= SCHED_YIELD;
-				schedule();
-			}
-		}
+		else
+			down(&ksoftirqd_start);
 	}
 
 	return 0;
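
For readers who want to try the handshake from the patch above outside the
kernel, here is a minimal userspace sketch. It assumes POSIX threads and
POSIX semaphores in place of kernel_thread() and the kernel semaphore; all
toy_* names and the value of TOY_NCPUS are made up for illustration and do
not appear in the kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define TOY_NCPUS 4                     /* hypothetical CPU count */

static sem_t toy_started;               /* starts at 0, like DECLARE_MUTEX_LOCKED */
static int toy_task_ready[TOY_NCPUS];   /* stand-in for ksoftirqd_task(cpu) */

static void *toy_worker(void *arg)
{
	int cpu = *(int *)arg;          /* copy the argument before signalling */

	assert(cpu >= 0 && cpu < TOY_NCPUS);
	toy_task_ready[cpu] = 1;
	sem_post(&toy_started);         /* counterpart of up(&ksoftirqd_start) */
	return NULL;
}

/* Spawn one worker per CPU; return how many were ready at the moment the
 * spawner was released, or -1 if a library call failed. */
static int toy_spawn_all(void)
{
	pthread_t tid;
	int cpu, ready = 0;

	if (sem_init(&toy_started, 0, 0) != 0)
		return -1;
	for (cpu = 0; cpu < TOY_NCPUS; cpu++) {
		toy_task_ready[cpu] = 0;
		if (pthread_create(&tid, NULL, toy_worker, &cpu) != 0)
			return -1;
		if (sem_wait(&toy_started) != 0)  /* counterpart of down() */
			return -1;
		ready += toy_task_ready[cpu];
		pthread_join(tid, NULL);
	}
	sem_destroy(&toy_started);
	return ready;
}
```

Because the spawner sleeps in sem_wait() until the worker has posted, it is
safe to pass &cpu to the worker, just as the kernel code passes (void *) &cpu.
Build with -pthread where the toolchain requires it.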


* Re: 2.4.7p6 hang
  2001-07-11 12:56 ` Trond Myklebust
@ 2001-07-11 13:38   ` Andrew Morton
  2001-07-11 14:22   ` Trond Myklebust
  2001-07-11 15:49   ` Andrea Arcangeli
  2 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2001-07-11 13:38 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Klaus Dittrich, Linus Torvalds, linux-kernel

Trond Myklebust wrote:
> 
> ...
> I have the same problem on my setup. To me, it looks like the loop in
> spawn_ksoftirqd() is suffering from some sort of atomicity problem.

Does a `set_current_state(TASK_RUNNING);' in spawn_ksoftirqd()
fix it?  If so we have a rogue initcall...

-


* Re: 2.4.7p6 hang
  2001-07-11 12:56 ` Trond Myklebust
  2001-07-11 13:38   ` Andrew Morton
@ 2001-07-11 14:22   ` Trond Myklebust
  2001-07-11 15:58     ` Andrea Arcangeli
  2001-07-11 16:30     ` Trond Myklebust
  2001-07-11 15:49   ` Andrea Arcangeli
  2 siblings, 2 replies; 17+ messages in thread
From: Trond Myklebust @ 2001-07-11 14:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Klaus Dittrich, Linus Torvalds, linux-kernel

>>>>> " " == Andrew Morton <andrewm@uow.edu.au> writes:

     > Trond Myklebust wrote:
    >>
    >> ...  I have the same problem on my setup. To me, it looks like
    >> the loop in spawn_ksoftirqd() is suffering from some sort of
    >> atomicity problem.

     > Does a `set_current_state(TASK_RUNNING);' in spawn_ksoftirqd()
     > fix it?  If so we have a rogue initcall...

Nope. The same thing happens as before.

A couple of debugging statements show that ksoftirqd_CPU0 gets created
fine, and that ksoftirqd_task(0) is indeed getting set correctly
before we loop in spawn_ksoftirqd().
After this the second call to kernel_thread() succeeds, but
ksoftirqd() itself never gets called before the hang occurs.

Cheers,
   Trond


* Re: 2.4.7p6 hang
  2001-07-11 12:56 ` Trond Myklebust
  2001-07-11 13:38   ` Andrew Morton
  2001-07-11 14:22   ` Trond Myklebust
@ 2001-07-11 15:49   ` Andrea Arcangeli
  2 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2001-07-11 15:49 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Klaus Dittrich, Linus Torvalds, linux-kernel

On Wed, Jul 11, 2001 at 02:56:43PM +0200, Trond Myklebust wrote:
> >>>>> " " == Klaus Dittrich <kladit@t-online.de> writes:
> 
>      > Kernel: 2.4.7p5 or 2.4.7p6 System: PII-SMP, BX-Chipset
> 
>      > The kernel boots up to the message
> 
>      > ..  Linux NET4.0 for Linux 2.4 Based upon Swansea University
>      > Computer Society NET3.039
> 
>      > and then stops.
> 
>      > I currently use 2.4.7p3 without problems.
> 
> I have the same problem on my setup. To me, it looks like the loop in
> spawn_ksoftirqd() is suffering from some sort of atomicity problem.

can you reproduce with 2.4.7pre5aa1?

Andrea


* Re: 2.4.7p6 hang
  2001-07-11 14:22   ` Trond Myklebust
@ 2001-07-11 15:58     ` Andrea Arcangeli
  2001-07-11 17:19       ` Mike Kravetz
                         ` (2 more replies)
  2001-07-11 16:30     ` Trond Myklebust
  1 sibling, 3 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2001-07-11 15:58 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Klaus Dittrich, Linus Torvalds, linux-kernel

On Wed, Jul 11, 2001 at 04:22:04PM +0200, Trond Myklebust wrote:
> >>>>> " " == Andrew Morton <andrewm@uow.edu.au> writes:
> 
>      > Trond Myklebust wrote:
>     >>
>     >> ...  I have the same problem on my setup. To me, it looks like
>     >> the loop in spawn_ksoftirqd() is suffering from some sort of
>     >> atomicity problem.
> 
>      > Does a `set_current_state(TASK_RUNNING);' in spawn_ksoftirqd()
>      > fix it?  If so we have a rogue initcall...
> 
> Nope. The same thing happens as before.
> 
> A couple of debugging statements show that ksoftirqd_CPU0 gets created
> fine, and that ksoftirqd_task(0) is indeed getting set correctly
> before we loop in spawn_ksoftirqd().
> After this the second call to kernel_thread() succeeds, but
> ksoftirqd() itself never gets called before the hang occurs.

ksoftirqd is quite scheduler intensive, and while its startup is
correct (no need of any change there), it tends to trigger scheduler
bugs (one of those bugs was just fixed in pre5). The reason I have never
seen the deadlock is that I also fixed this other scheduler bug in my tree:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.7pre5aa1/00_sched-yield-1

I forgot to submit this one, but here it is now for easy merging:

--- 2.4.4aa3/kernel/sched.c.~1~	Sun Apr 29 17:37:05 2001
+++ 2.4.4aa3/kernel/sched.c	Tue May  1 16:39:42 2001
@@ -674,8 +674,10 @@
 #endif
 	spin_unlock_irq(&runqueue_lock);
 
-	if (prev == next)
+	if (prev == next) {
+		current->policy &= ~SCHED_YIELD;
 		goto same_process;
+	}
 
 #ifdef CONFIG_SMP
  	/*


Andrea


* Re: 2.4.7p6 hang
  2001-07-11 14:22   ` Trond Myklebust
  2001-07-11 15:58     ` Andrea Arcangeli
@ 2001-07-11 16:30     ` Trond Myklebust
  2001-07-11 16:53       ` Andrea Arcangeli
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2001-07-11 16:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Trond Myklebust, Andrew Morton, Klaus Dittrich, Linus Torvalds,
	linux-kernel

>>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:

     > ksoftirqd is quite scheduler intensive, and while its startup
     > is correct (no need of any change there), it tends to trigger
     > scheduler bugs (one of those bugs was just fixed in pre5). The
     > reason I have never seen the deadlock is that I also fixed this
     > other scheduler bug in my tree:

     > --- 2.4.4aa3/kernel/sched.c.~1~	Sun Apr 29 17:37:05 2001
     > +++ 2.4.4aa3/kernel/sched.c	Tue May  1 16:39:42 2001
     > @@ -674,8 +674,10 @@
     >  #endif
     >  	spin_unlock_irq(&runqueue_lock);
     >  
     > -	if (prev == next)
     > +	if (prev == next) {
     > +		current->policy &= ~SCHED_YIELD;
     >  		goto same_process;
     > +	}
     >  
     >  #ifdef CONFIG_SMP
     >   	/*

I no longer see the hang with this patch, but I'm not sure I
understand why it works.
Does the above mean that the hang is occurring because spawn_ksoftirqd
is yielding back to itself? If so, the semaphore trick seems more
robust, as it causes a proper sleep until it's safe to wake up.

Cheers,
  Trond


* Re: 2.4.7p6 hang
  2001-07-11 16:30     ` Trond Myklebust
@ 2001-07-11 16:53       ` Andrea Arcangeli
  0 siblings, 0 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2001-07-11 16:53 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew Morton, Klaus Dittrich, Linus Torvalds, linux-kernel

On Wed, Jul 11, 2001 at 06:30:43PM +0200, Trond Myklebust wrote:
> >>>>> " " == Andrea Arcangeli <andrea@suse.de> writes:
> 
>      > ksoftirqd is quite scheduler intensive, and while its startup
>      > is correct (no need of any change there), it tends to trigger
>      > scheduler bugs (one of those bugs was just fixed in pre5). The
>      > reason I have never seen the deadlock is that I also fixed this
>      > other scheduler bug in my tree:
> 
>      > --- 2.4.4aa3/kernel/sched.c.~1~	Sun Apr 29 17:37:05 2001
>      > +++ 2.4.4aa3/kernel/sched.c	Tue May  1 16:39:42 2001
>      > @@ -674,8 +674,10 @@
>      >  #endif
>      >  	spin_unlock_irq(&runqueue_lock);
>      >  
>      > -	if (prev == next)
>      > +	if (prev == next) {
>      > +		current->policy &= ~SCHED_YIELD;
>      >  		goto same_process;
>      > +	}
>      >  
>      >  #ifdef CONFIG_SMP
>      >   	/*
> 
> I no longer see the hang with this patch, but I'm not sure I
> understand why it works.

I do. It's very subtle and it comes down to the fork and scheduler
details.

> Does the above mean that the hang is occurring because spawn_ksoftirqd
> is yielding back to itself? If so, the semaphore trick seems more

No, that's a generic bug.

> robust, as it causes a proper sleep until it's safe to wake up.

A semaphore is definitely not more robust than the current code; if
anything, it hides the fact that sched_yield is broken in the scheduler.
There is no need to change it and waste some static RAM on a semaphore
for no good reason.

The bug is that SCHED_YIELD must always be cleared at the time of a
fork(), or the child may never get scheduled. Only tasks running in-cpu
are allowed to have SCHED_YIELD set.

Another way to cure the deadlock could be to clear SCHED_YIELD in the child,
so that you could even do something as silly as:

	current->policy |= SCHED_YIELD;
	fork()
	schedule()

but the above doesn't make sense, so we can optimize away the clearing of
SCHED_YIELD in the child in fork. And even if you allow the above, you
still need my attached fix for performance reasons: once schedule()
returns, the last sched_yield attempt is over, and the next time we run
schedule() without specifying sched_yield we don't want it to be treated
like a sched_yield again. That was the original reason for the patch, in
fact; I have only now noticed that the bug had very serious implications
for fork. Those implications are not limited to ksoftirqd and also affect
normal userspace forks; it's only that with ksoftirqd banging on the
scheduler it becomes reproducible.

Andrea
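
The inheritance described above can be illustrated with a small toy model.
This is a sketch under stated assumptions, not kernel code: it assumes that
fork copies the parent's policy field into the child, and that schedule()
returns to the same task through the same_process path. The toy_* names and
the TOY_SCHED_YIELD bit value are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SCHED_YIELD 0x10u   /* illustrative flag bit, not taken from the thread */

struct toy_task {
	unsigned policy;
};

/* Model of fork: the child starts as a copy of the parent's fields. */
static struct toy_task toy_fork(const struct toy_task *parent)
{
	return *parent;
}

/* Model of "current->policy |= SCHED_YIELD; schedule();" when no other
 * task is runnable, so schedule() hands the CPU back to the caller.
 * with_fix selects the behaviour of the sched.c patch quoted above. */
static void toy_yield_and_schedule(struct toy_task *cur, bool with_fix)
{
	cur->policy |= TOY_SCHED_YIELD;
	if (with_fix)
		cur->policy &= ~TOY_SCHED_YIELD;  /* cleared on prev == next */
	/* without the fix the stale flag stays set */
}

/* Returns true if a child forked after the yield starts life with
 * SCHED_YIELD already set. */
static bool toy_child_inherits_yield(bool with_fix)
{
	struct toy_task parent = { 0 };
	struct toy_task child;

	toy_yield_and_schedule(&parent, with_fix);
	child = toy_fork(&parent);
	return (child.policy & TOY_SCHED_YIELD) != 0;
}
```

In this model the child carries the flag only when the fix is absent, which
is consistent with the report earlier in the thread that the first ksoftirqd
thread started fine while the second one never ran.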


* Re: 2.4.7p6 hang
  2001-07-11 15:58     ` Andrea Arcangeli
@ 2001-07-11 17:19       ` Mike Kravetz
  2001-07-11 18:33       ` Josh Logan
  2001-07-12  0:17       ` Johan Kullstam
  2 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2001-07-11 17:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Trond Myklebust, Andrew Morton, Klaus Dittrich, Linus Torvalds,
	linux-kernel

On Wed, Jul 11, 2001 at 05:58:09PM +0200, Andrea Arcangeli wrote:
> 
> I forgot to submit this one, but here it is now for easy merging:
> 
> --- 2.4.4aa3/kernel/sched.c.~1~	Sun Apr 29 17:37:05 2001
> +++ 2.4.4aa3/kernel/sched.c	Tue May  1 16:39:42 2001
> @@ -674,8 +674,10 @@
>  #endif
>  	spin_unlock_irq(&runqueue_lock);
>  
> -	if (prev == next)
> +	if (prev == next) {
> +		current->policy &= ~SCHED_YIELD;
>  		goto same_process;
> +	}
>  
>  #ifdef CONFIG_SMP
>   	/*

I would like to second the need for this patch in the 'mainline' kernel.
Not too long ago, I came up with the following scenario caused by this
bug.  The scenario is based on the unmodified 2.4.4 scheduler.

- Task A calls sched_yield(), and the code in sys_sched_yield()
  determines that a yield is in order and sets SCHED_YIELD in
  the task's policy field and need_resched is set for this task.

- When Task A attempts to return to user land, schedule() will
  be called (since need_resched was set).  However, in this case
  schedule() does not find a better task than A to run.  Since
  task A will continue to run, the 'same_process' goto is taken
  in schedule().  Note that __schedule_tail() is not called, so
  the SCHED_YIELD flag remains set in A when it continues to
  execute.

- Task A then performs some operation which causes it to go into
  a non-runnable state (such as calling nanosleep()).  After setting
  the state of Task A to something other than TASK_RUNNING, a call
  to schedule() will be made.  At this time Task A will be removed
  from the runqueue (again note that SCHED_YIELD remains set in A).
  Also, assume that there are no other runnable tasks so the idle
  task is chosen to run next on this CPU.

- Now, after schedule() releases the runqueue lock the timer for
  Task A fires and we call the wake_up code.  This code path will
  eventually call try_to_wake_up() which will set the state of A
  to TASK_RUNNING, add A to the runqueue and call reschedule_idle()
  for A.

- Note that we have not yet cleared the has_cpu field in A.  Hence,
  can_schedule() will never be true for task A.  As a result, we
  will not send an IPI to any other CPU.  In effect, reschedule_idle()
  is a noop.

- Now, we finally call __schedule_tail() for task A.  After clearing
  the SCHED_YIELD and has_cpu flags, we notice that the state of A
  is TASK_RUNNING (it was set by try_to_wake_up()) and take the
  needs_resched goto.

- The needs_resched block of code usually results in a call to
  reschedule_idle for the task.  However, the first line of code
  in this block is:

                /*
                 * Avoid taking the runqueue lock in cases where
                 * no preemption-check is necessery:
                 */
                if ((prev == idle_task(smp_processor_id())) ||
                                                (policy & SCHED_YIELD))
                        goto out_unlock;

  Since the SCHED_YIELD flag was set in A when we entered this routine,
  we will not call reschedule_idle().

In this case, the CPU associated with task A is still idle yet we will
not schedule the task on the CPU.  In addition, it is possible that at
this time ALL CPUs in the system could be idle.  Hence, we would end up
with all CPUs idle while task A is on the runqueue.  Not good!

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center
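
The scenario above can be replayed with a toy model. This is a minimal
sketch that only tracks the SCHED_YIELD flag and the task state; it does not
model runqueues, has_cpu, or IPIs. The toy_* names and the flag value are
hypothetical, and the check in toy_schedule_tail() paraphrases the
needs_resched test quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SCHED_YIELD 0x10u   /* illustrative flag bit */

enum toy_state { TOY_RUNNING, TOY_SLEEPING };

struct toy_task {
	unsigned policy;
	enum toy_state state;
};

/* sys_sched_yield(): mark the task as yielding. */
static void toy_sched_yield(struct toy_task *t)
{
	t->policy |= TOY_SCHED_YIELD;
}

/* schedule() choosing the same task again (prev == next). */
static void toy_schedule_same_process(struct toy_task *t, bool with_fix)
{
	if (with_fix)
		t->policy &= ~TOY_SCHED_YIELD;
	/* unmodified scheduler: the flag is left set */
}

/* __schedule_tail(): returns true when reschedule_idle() would be
 * called for prev. */
static bool toy_schedule_tail(struct toy_task *prev, bool prev_is_idle)
{
	unsigned policy = prev->policy;     /* sampled before clearing */

	prev->policy &= ~TOY_SCHED_YIELD;
	if (prev->state != TOY_RUNNING)
		return false;               /* still asleep, nothing to do */
	if (prev_is_idle || (policy & TOY_SCHED_YIELD))
		return false;               /* the "goto out_unlock" case */
	return true;
}

/* Replays the steps for task A; returns true if A gets a
 * reschedule_idle() after being woken. */
static bool toy_replay(bool with_fix)
{
	struct toy_task a = { 0, TOY_RUNNING };

	toy_sched_yield(&a);
	toy_schedule_same_process(&a, with_fix);
	a.state = TOY_SLEEPING;   /* e.g. nanosleep() */
	a.state = TOY_RUNNING;    /* timer fires before __schedule_tail() runs */
	return toy_schedule_tail(&a, false);
}
```

With the stale flag in place the woken task is skipped; with the flag
cleared on the same_process path it is handed to reschedule_idle().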


* Re: 2.4.7p6 hang
  2001-07-11 15:58     ` Andrea Arcangeli
  2001-07-11 17:19       ` Mike Kravetz
@ 2001-07-11 18:33       ` Josh Logan
  2001-07-11 19:05         ` Andrea Arcangeli
  2001-07-11 19:27         ` David Ford
  2001-07-12  0:17       ` Johan Kullstam
  2 siblings, 2 replies; 17+ messages in thread
From: Josh Logan @ 2001-07-11 18:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Trond Myklebust, Andrew Morton, Klaus Dittrich, Linus Torvalds,
	linux-kernel


I'm having a hang right after the floppy is initialised with pre5 and pre6
(2.4.3 works fine).  I tried this patch, but it did not make any
improvement.  The machine still has SysRq commands available.  Please let
me know what other information you would like to debug this problem.

BTW, I also tried to disable the floppy in the BIOS and got:
...
Floppy OK
task queue still active
<HANG>

							Later, JOSH





* Re: 2.4.7p6 hang
  2001-07-11 18:33       ` Josh Logan
@ 2001-07-11 19:05         ` Andrea Arcangeli
  2001-07-11 19:28           ` Josh Logan
  2001-07-11 19:27         ` David Ford
  1 sibling, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2001-07-11 19:05 UTC (permalink / raw)
  To: Josh Logan
  Cc: Trond Myklebust, Andrew Morton, Klaus Dittrich, Linus Torvalds,
	linux-kernel

On Wed, Jul 11, 2001 at 11:33:40AM -0700, Josh Logan wrote:
> 
> I'm having a hang right after the floppy is initialised with pre5 and pre6
> (2.4.3 works fine).  I tried this patch, but it did not make any

is the problem introduced in pre5? Can you reproduce under 2.4.7pre4?

> improvement.  The machine still has SysRq commands available.  Please let
> me know what other information you would like to debug this problem.

SYSRQ+T

> BTW, I also tried to disable the floppy in the BIOS and got:
> ...
> Floppy OK
> task queue still active
> <HANG>

I'll soon have a look at this message.

Andrea


* Re: 2.4.7p6 hang
  2001-07-11 18:33       ` Josh Logan
  2001-07-11 19:05         ` Andrea Arcangeli
@ 2001-07-11 19:27         ` David Ford
  1 sibling, 0 replies; 17+ messages in thread
From: David Ford @ 2001-07-11 19:27 UTC (permalink / raw)
  To: Josh Logan
  Cc: Andrea Arcangeli, Trond Myklebust, Andrew Morton, Klaus Dittrich,
	Linus Torvalds, linux-kernel

This patch fixes the hang for me.

Thank you,
David





* Re: 2.4.7p6 hang
  2001-07-11 19:05         ` Andrea Arcangeli
@ 2001-07-11 19:28           ` Josh Logan
  2001-07-16 19:16             ` Josh Logan
  0 siblings, 1 reply; 17+ messages in thread
From: Josh Logan @ 2001-07-11 19:28 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Trond Myklebust, Andrew Morton, Klaus Dittrich, Linus Torvalds,
	linux-kernel



On Wed, 11 Jul 2001, Andrea Arcangeli wrote:

> On Wed, Jul 11, 2001 at 11:33:40AM -0700, Josh Logan wrote:
> > 
> > I'm having a hang right after the floppy is initialised with pre5 and pre6
> > (2.4.3 works fine).  I tried this patch, but it did not make any
> 
> is the problem introduced in pre5? Can you reproduce under 2.4.7pre4?

I'll have to go try it...

> 
> > improvement.  The machine still has SysRq commands available.  Please let
> > me know what other information you would like to debug this problem.
> 
> SYSRQ+T

Floppy Drives(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
SysRq: Show State

  task		     PC    stack    pid father child younger older
swapper		D C03EDEC0  4980      1      0     7               (L-TLB)
keventd		S C1234560  6624      2      1             3       (L-TLB)
ksoftirqd_CPU   S C1232000  6468      3      1             4     2 (L-TLB)
kswapd		S C1231FA8  6588      4      1             5     3 (L-TLB)
kreclaimd	S 00000286  6656      5      1             6     4 (L-TLB)
bdflush		S 00000286  6652      6      1             7     5 (L-TLB)
kupdated	S C7F9BFC8  6620      7      1                   6 (L-TLB)

I can add Call Traces if needed, this is done by hand.

> 
> > BTW, I also tried to disable the floppy in the BIOS and got:
> > ...
> > Floppy OK
> > task queue still active
> > <HANG>
> 
> I'll soon have a look at this message.
> 
> Andrea
> 

							Later, JOSH




* Re: 2.4.7p6 hang
  2001-07-11 15:58     ` Andrea Arcangeli
  2001-07-11 17:19       ` Mike Kravetz
  2001-07-11 18:33       ` Josh Logan
@ 2001-07-12  0:17       ` Johan Kullstam
  2 siblings, 0 replies; 17+ messages in thread
From: Johan Kullstam @ 2001-07-12  0:17 UTC (permalink / raw)
  To: linux-kernel

Andrea Arcangeli <andrea@suse.de> writes:

> On Wed, Jul 11, 2001 at 04:22:04PM +0200, Trond Myklebust wrote:
> > >>>>> " " == Andrew Morton <andrewm@uow.edu.au> writes:
> > 
> >      > Trond Myklebust wrote:
> >     >>
> >     >> ...  I have the same problem on my setup. To me, it looks like
> >     >> the loop in spawn_ksoftirqd() is suffering from some sort of
> >     >> atomicity problem.
> > 
> >      > Does a `set_current_state(TASK_RUNNING);' in spawn_ksoftirqd()
> >      > fix it?  If so we have a rogue initcall...
> > 
> > Nope. The same thing happens as before.
> > 
> > A couple of debugging statements show that ksoftirqd_CPU0 gets created
> > fine, and that ksoftirqd_task(0) is indeed getting set correctly
> > before we loop in spawn_ksoftirqd().
> > After this the second call to kernel_thread() succeeds, but
> > ksoftirqd() itself never gets called before the hang occurs.
> 
> ksoftirqd is quite scheduler intensive, and while its startup is
> correct (no need of any change there), it tends to trigger scheduler
> bugs (one of those bugs was just fixed in pre5). The reason I have never
> seen the deadlock is that I also fixed this other scheduler bug in my tree:
> 
> 	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.7pre5aa1/00_sched-yield-1
> 
> I forgot to submit this one, but here it is now for easy merging:
> 
> --- 2.4.4aa3/kernel/sched.c.~1~	Sun Apr 29 17:37:05 2001
> +++ 2.4.4aa3/kernel/sched.c	Tue May  1 16:39:42 2001
> @@ -674,8 +674,10 @@
>  #endif
>  	spin_unlock_irq(&runqueue_lock);
>  
> -	if (prev == next)
> +	if (prev == next) {
> +		current->policy &= ~SCHED_YIELD;
>  		goto same_process;
> +	}
>  
>  #ifdef CONFIG_SMP
>   	/*

thank you.

this patch fixes things for me too.

i was freezing at boot, right after the kernel prints the line
Initializing RT netlink.
with both 2.4.7-pre5 and 2.4.7-pre6.

after applying this to 2.4.7-pre6, things are working fine (afaict
after just a few minutes...).

-- 
J o h a n  K u l l s t a m
[kullstam@ne.mediaone.net]
Don't Fear the Penguin!


* Re: 2.4.7p6 hang
  2001-07-11 19:28           ` Josh Logan
@ 2001-07-16 19:16             ` Josh Logan
  2001-07-16 19:34               ` David Ford
  0 siblings, 1 reply; 17+ messages in thread
From: Josh Logan @ 2001-07-16 19:16 UTC (permalink / raw)
  To: Andrea Arcangeli, alan
  Cc: Trond Myklebust, Andrew Morton, Klaus Dittrich, Linus Torvalds,
	linux-kernel


I just tried 2.4.6-ac5 and I had the same problem.  I'll go try 2.4.7-pre4
next.

							Later, JOSH






* Re: 2.4.7p6 hang
  2001-07-16 19:16             ` Josh Logan
@ 2001-07-16 19:34               ` David Ford
  2001-07-16 21:07                 ` Josh Logan
  0 siblings, 1 reply; 17+ messages in thread
From: David Ford @ 2001-07-16 19:34 UTC (permalink / raw)
  To: Josh Logan
  Cc: Andrea Arcangeli, Trond Myklebust, Andrew Morton, Klaus Dittrich,
	linux-kernel

Chances are that you have TEQL as one of your packet schedulers?

Try the patch Dave M posted this morning, let me fetch it...

--- net/core/dev.c.~1~	Mon Jul  9 22:19:33 2001
+++ net/core/dev.c	Sat Jul 14 17:25:51 2001
@@ -2654,10 +2654,6 @@
 	if (!dev_boot_phase)
 		return 0;
 
-#ifdef CONFIG_NET_SCHED
-	pktsched_init();
-#endif
-
 #ifdef CONFIG_NET_DIVERT
 	dv_init();
 #endif /* CONFIG_NET_DIVERT */
@@ -2771,6 +2767,10 @@
 
 	dst_init();
 	dev_mcast_init();
+
+#ifdef CONFIG_NET_SCHED
+	pktsched_init();
+#endif
 
 	/*
 	 *	Initialise network devices


David





* Re: 2.4.7p6 hang
  2001-07-16 19:34               ` David Ford
@ 2001-07-16 21:07                 ` Josh Logan
  0 siblings, 0 replies; 17+ messages in thread
From: Josh Logan @ 2001-07-16 21:07 UTC (permalink / raw)
  To: David Ford
  Cc: Andrea Arcangeli, Trond Myklebust, Andrew Morton, Klaus Dittrich,
	linux-kernel


Thanks.  With this patch it now boots.  Hope this is part of 2.4.7.

							Later, JOSH




