* [2.4.28-rc1] process stuck in release_task() call
@ 2004-11-09 16:24 Andrey J. Melnikoff (TEMHOTA)
  2004-11-10 18:58 ` Marcelo Tosatti
  0 siblings, 1 reply; 6+ messages in thread
From: Andrey J. Melnikoff (TEMHOTA) @ 2004-11-09 16:24 UTC
  To: linux-kernel

Hello!

With 2.4.28-pre3 and 2.4.28-rc1 I see a strange situation: sendmail
sometimes gets stuck in the release_task() call.

System: Tyan Tiger MPX, dual Athlon MP 2800+ with 1 GB of memory.

--- SysRq-T output ---
ksymoops 2.4.9 on i686 2.4.28-rc1.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.28-rc1/ (default)
     -m /boot/System.map-2.4.28-rc1 (default)

Reading Oops report from the terminal
sendmail      S C012073D     0 15814      1 32701         14365 (NOTLB)
Using defaults from ksymoops -t elf32-i386 -a i386
Call Trace:    [<c012073d>] [<c0106582>] [<c0107717>]
sendmail      Z 00000000     4 30459  15814         30669       (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c011547d>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 30669  15814         30707 30459 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     4 30707  15814         31549 30669 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000  2624 31549  15814         31708 30707 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 31708  15814         32269 31549 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 32269  15814         32352 31708 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000    20 32352  15814         32403 32269 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 32403  15814         32413 32352 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000   624 32413  15814         32468 32403 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 32468  15814         32473 32413 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 32473  15814         32482 32468 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
sendmail      Z 00000000     0 32482  15814         32499 32473 (L-TLB)
Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
..... many sendmail zombies ......

Warning (Oops_read): Code line not seen, dumping what data is available

Proc;  sendmail

>>EIP; c012073d <release_task+1fd/230>   <=====

Trace; c012073d <release_task+1fd/230>
Trace; c0106582 <sys_rt_sigsuspend+122/160>
Trace; c0107717 <system_call+33/38>
Proc;  sendmail

>>EIP; 00000000 Before first symbol

Trace; c0120f53 <exit_notify+103/3c0>
Trace; c0121600 <do_exit+3f0/4e0>
Trace; c011547d <smp_apic_timer_interrupt+12d/130>
Trace; c0121725 <sys_exit+15/20>
Trace; c0107717 <system_call+33/38>
Proc;  sendmail

>>EIP; 00000000 Before first symbol

Trace; c0120f53 <exit_notify+103/3c0>
Trace; c0121600 <do_exit+3f0/4e0>
Trace; c0121725 <sys_exit+15/20>
Trace; c0107717 <system_call+33/38>
Proc;  sendmail

.... same trace with other zombies ......


The disassembly shows a different result: the process is stuck in the free_pages() call:

c0120540 <release_task>:
c0120540:       55                      push   %ebp
....
c0120736:       89 d8                   mov    %ebx,%eax
c0120738:       e8 73 dd 01 00          call   c013e4b0 <free_pages> <= here
c012073d:       83 c4 10                add    $0x10,%esp
c0120740:       5b                      pop    %ebx
c0120741:       5e                      pop    %esi
c0120742:       c9                      leave
....

Any hints?

-- 
 Best regards, TEMHOTA-RIPN aka MJA13-RIPE
 System Administrator. mailto:temnota@kmv.ru



* Re: [2.4.28-rc1] process stuck in release_task() call
  2004-11-09 16:24 [2.4.28-rc1] process stuck in release_task() call Andrey J. Melnikoff (TEMHOTA)
@ 2004-11-10 18:58 ` Marcelo Tosatti
  2004-11-11  8:33   ` Willy Tarreau
  0 siblings, 1 reply; 6+ messages in thread
From: Marcelo Tosatti @ 2004-11-10 18:58 UTC
  To: Andrey J. Melnikoff (TEMHOTA); +Cc: linux-kernel


Hi Andrey,

On Tue, Nov 09, 2004 at 07:24:45PM +0300, Andrey J. Melnikoff (TEMHOTA) wrote:
> Hello!
> 
> With 2.4.28-pre3 and 2.4.28-rc1 i see strange situation - sendmail some
> times get stuck into release_task() call. 
> 
> System - Tyan Tiger MPX, dual Athlon MP 2800+ with 1Gb memory.
> 
> --- SysRq-T output ---
> ksymoops 2.4.9 on i686 2.4.28-rc1.  Options used
>      -V (default)
>      -k /proc/ksyms (default)
>      -l /proc/modules (default)
>      -o /lib/modules/2.4.28-rc1/ (default)
>      -m /boot/System.map-2.4.28-rc1 (default)
> 
> Reading Oops report from the terminal
> sendmail      S C012073D     0 15814      1 32701         14365 (NOTLB)
> Using defaults from ksymoops -t elf32-i386 -a i386
> Call Trace:    [<c012073d>] [<c0106582>] [<c0107717>]
> sendmail      Z 00000000     4 30459  15814         30669       (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c011547d>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 30669  15814         30707 30459 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     4 30707  15814         31549 30669 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000  2624 31549  15814         31708 30707 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 31708  15814         32269 31549 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 32269  15814         32352 31708 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000    20 32352  15814         32403 32269 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 32403  15814         32413 32352 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000   624 32413  15814         32468 32403 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 32468  15814         32473 32413 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 32473  15814         32482 32468 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> sendmail      Z 00000000     0 32482  15814         32499 32473 (L-TLB)
> Call Trace:    [<c0120f53>] [<c0121600>] [<c0121725>] [<c0107717>]
> ..... many sendmail zombies ......
> 
> Warning (Oops_read): Code line not seen, dumping what data is available
> 
> Proc;  sendmail
> 
> >>EIP; c012073d <release_task+1fd/230>   <=====
> 
> Trace; c012073d <release_task+1fd/230>
> Trace; c0106582 <sys_rt_sigsuspend+122/160>
> Trace; c0107717 <system_call+33/38>
> Proc;  sendmail
> 
> >>EIP; 00000000 Before first symbol
> 
> Trace; c0120f53 <exit_notify+103/3c0>
> Trace; c0121600 <do_exit+3f0/4e0>
> Trace; c011547d <smp_apic_timer_interrupt+12d/130>
> Trace; c0121725 <sys_exit+15/20>
> Trace; c0107717 <system_call+33/38>
> Proc;  sendmail
> 
> >>EIP; 00000000 Before first symbol
> 
> Trace; c0120f53 <exit_notify+103/3c0>
> Trace; c0121600 <do_exit+3f0/4e0>
> Trace; c0121725 <sys_exit+15/20>
> Trace; c0107717 <system_call+33/38>
> Proc;  sendmail
> 
> .... same trace with other zombies ......
> 
> 
> disassemble show other result - process stuck into free_pages() call:
> 
> c0120540 <release_task>:
> c0120540:       55                      push   %ebp
> ....
> c0120736:       89 d8                   mov    %ebx,%eax
> c0120738:       e8 73 dd 01 00          call   c013e4b0 <free_pages> <= here

Is this release_task+1fd? Can you send me the full disassembly of release_task?

It can't be blocked here, it's a "call" instruction.

free_pages can't block either. Odd.

Is it reproducible?

My first wild guess (because I haven't got much of a clue really)
would be to revert the mm/page_alloc.c __free_pages() "fastcall"
gcc3.4 change.


* Re: [2.4.28-rc1] process stuck in release_task() call
  2004-11-11  8:33   ` Willy Tarreau
@ 2004-11-11  8:01     ` Marcelo Tosatti
       [not found]       ` <20041112135942.GW24130@kmv.ru>
  2004-11-11 13:37     ` Andrey Melnikoff
  1 sibling, 1 reply; 6+ messages in thread
From: Marcelo Tosatti @ 2004-11-11  8:01 UTC
  To: Willy Tarreau; +Cc: Andrey J. Melnikoff (TEMHOTA), linux-kernel

On Thu, Nov 11, 2004 at 09:33:12AM +0100, Willy Tarreau wrote:
> Hi Marcelo,
> 
> > > >>EIP; c012073d <release_task+1fd/230>   <=====
> (...)
> > > c0120540 <release_task>:
> > > c0120540:       55                      push   %ebp
> > > ....
> > > c0120736:       89 d8                   mov    %ebx,%eax
> > > c0120738:       e8 73 dd 01 00          call   c013e4b0 <free_pages> <= here
> > 
> > is this release_task+1fd?  Can you send me the full disassemble of release_task?
> 
> Yes it is because the next instruction after call will be at c0120738+5 =
> c012073d = release_task+1fd. (the return address on the stack is the
> address of the next instruction after the call).

OK.

> > It can't be blocked here, its a "call" instruction. 
> 
> Seems rather strange indeed ! Perhaps this is not the disassembled function
> of the *running* kernel ? it would be good to disassemble vmlinux and ensure
> that it is exactly the one currently running. I too have already lost lots
> of time searching a wrong bug because I disassembled the wrong kernel, so
> I'm certain it can happen even when we're very careful :-(
> 
> > free_pages can't block either. Odd.  
> 
> Marcelo, I have two questions for my own understanding :
>   - free_pages does spin_lock(&zone->lock) around the while() loop.
>     Considering that someone else could hold the lock (bug, etc...), it
>     could block here. But my feeling is that if such a lock were kept held,
>     the system would be totally frozen because everything which would want
>     to free memory would get stuck (even a process exit). Am I right ?

Right, the system will be totally frozen spinning on the lock.

>   - would it enhance performance a bit to put a bunch of 'unlikely()' in all
>     the ifs which end in BUG(), especially inside the loop ?

Yes, it should generate better code.

Try it and see how the generated code differs from the original without unlikely.

I'm not familiar with the internals of unlikely, however, so I can't
explain how it works in detail... the GCC documentation
should do it.  :)


* Re: [2.4.28-rc1] process stuck in release_task() call
  2004-11-10 18:58 ` Marcelo Tosatti
@ 2004-11-11  8:33   ` Willy Tarreau
  2004-11-11  8:01     ` Marcelo Tosatti
  2004-11-11 13:37     ` Andrey Melnikoff
  0 siblings, 2 replies; 6+ messages in thread
From: Willy Tarreau @ 2004-11-11  8:33 UTC
  To: Marcelo Tosatti; +Cc: Andrey J. Melnikoff (TEMHOTA), linux-kernel

Hi Marcelo,

> > >>EIP; c012073d <release_task+1fd/230>   <=====
(...)
> > c0120540 <release_task>:
> > c0120540:       55                      push   %ebp
> > ....
> > c0120736:       89 d8                   mov    %ebx,%eax
> > c0120738:       e8 73 dd 01 00          call   c013e4b0 <free_pages> <= here
> 
> is this release_task+1fd?  Can you send me the full disassemble of release_task?

Yes, it is, because the next instruction after the call will be at
c0120738+5 = c012073d = release_task+1fd (the return address on the stack
is the address of the next instruction after the call).

> It can't be blocked here, its a "call" instruction. 

Seems rather strange indeed! Perhaps this is not the disassembled function
of the *running* kernel? It would be good to disassemble vmlinux and ensure
that it is exactly the one currently running. I too have already lost lots
of time chasing the wrong bug because I disassembled the wrong kernel, so
I'm certain it can happen even when we're very careful :-(

> free_pages can't block either. Odd.  

Marcelo, I have two questions for my own understanding:
  - free_pages does spin_lock(&zone->lock) around the while() loop.
    Considering that someone else could hold the lock (bug, etc.), it
    could block here. But my feeling is that if such a lock were kept held,
    the system would be totally frozen, because everything that wants
    to free memory would get stuck (even a process exit). Am I right?

  - would it improve performance a bit to put a bunch of 'unlikely()' in all
    the ifs which end in BUG(), especially inside the loop?

Regards,
Willy



* Re: [2.4.28-rc1] process stuck in release_task() call
  2004-11-11  8:33   ` Willy Tarreau
  2004-11-11  8:01     ` Marcelo Tosatti
@ 2004-11-11 13:37     ` Andrey Melnikoff
  1 sibling, 0 replies; 6+ messages in thread
From: Andrey Melnikoff @ 2004-11-11 13:37 UTC
  To: Willy Tarreau, linux-kernel

Hello Willy Tarreau!
 In article <20041111083312.GE783@alpha.home.local> you wrote:

> > > >>EIP; c012073d <release_task+1fd/230>   <=====
> (...)
> > > c0120540 <release_task>:
> > > c0120540:       55                      push   %ebp
> > > ....
> > > c0120736:       89 d8                   mov    %ebx,%eax
> > > c0120738:       e8 73 dd 01 00          call   c013e4b0 <free_pages> <= here
> > 
> > is this release_task+1fd?  Can you send me the full disassemble of release_task?

> Yes it is because the next instruction after call will be at c0120738+5 =
> c012073d = release_task+1fd. (the return address on the stack is the
> address of the next instruction after the call).

> > It can't be blocked here, its a "call" instruction. 

> Seems rather strange indeed ! Perhaps this is not the disassembled function
> of the *running* kernel ? 

This is also strange to me. The stack trace should point into the
__free_pages_ok() address range; only that function works with locks.

> it would be good to disassemble vmlinux and ensure that it is exactly 
> the one currently running. 

This is the code from vmlinux-2.4.28-rc1.

> I too have already lost lots of time searching a wrong bug because I 
> disassembled the wrong kernel, so I'm certain it can happen even when 
> we're very careful :-(

[snip]

PS: Please keep me in CC.


* [RESOLVED] Re: [2.4.28-rc1] process stuck in release_task() call
       [not found]         ` <20041116100639.GA11948@logos.cnet>
@ 2004-11-30 19:46           ` Andrey J. Melnikoff (TEMHOTA)
  0 siblings, 0 replies; 6+ messages in thread
From: Andrey J. Melnikoff (TEMHOTA) @ 2004-11-30 19:46 UTC
  To: Marcelo Tosatti; +Cc: Willy Tarreau, linux-kernel

Hello Marcelo, Willy!
 On Tue, Nov 16, 2004 at 08:06:42AM -0200, Marcelo Tosatti wrote:

> On Fri, Nov 12, 2004 at 04:59:42PM +0300, Andrey J. Melnikoff (TEMHOTA) wrote:
> Andrey,
> 
> I do not have much of a clue of what is going on here.
show_trace() has made a fool of me and I started to ask silly questions :)
 
> Can you try 2.4.27 please?
OK, I tested 2.4.25: same result. But this is entirely a userland problem.

There are two problems:

First, show_trace() gives incorrect traces. It starts unwinding the stack
from the address in `tsk->thread.esp', but it should use the address saved
in `regs->ebp', which would give a more accurate stack trace.

Second, a strange libpthreads problem.
libpthreads always installs its own sa_restorer helper. When the first
signal arrives, it calls the signal handler; if a new signal arrives while
the process is still in the signal handler, libpthreads starts playing with
the rt_sigprocmask() and rt_sigsuspend() syscalls inside its own
sa_restorer helper.
Whoops: an infinite loop inside libpthreads.

-- 
 Best regards, TEMHOTA-RIPN aka MJA13-RIPE
 System Administrator. mailto:temnota@kmv.ru


