linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Serious reproducible 2.4.x kernel hang
@ 2001-02-01 14:14 Chris Evans
  2001-02-01 14:40 ` Malcolm Beattie
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Evans @ 2001-02-01 14:14 UTC (permalink / raw)
  To: linux-kernel


Hi,

I've just managed to reproduce this personally on 2.4.0. I've had a report
that 2.4.1 is also affected. Both myself and the other person who
reproduced this have SMP i686 machines, which may or may not be relevant.

To reproduce, all you need to do is get my vsftpd ftp server:
ftp://ferret.lmh.ox.ac.uk/pub/linux/vsftpd-0.0.9.tar.gz

It runs from inetd. Connect using the Linux command line ftp client, to
localhost, and simply press CTRL-C. If it matters, I'm using RH7.0
software.

After the first iteration of this, I'm left with:
[chris@localhost chris]$ ps auwx | grep ftp
root       713 99.9  0.4  1416  592 ?        SN   22:01  38:17 vsftpd
/etc/vsftpd.conf
nobody     715  0.0  0.0     0    0 ?        ZN   22:01   0:00 [vsftpd
<defunct>]

As you can see, the root process is burning 100% of one of my CPUs. It
_cannot_ be killed with kill -9!

>From Alt-Sysrq-T:
Jan 30 22:01:52 localhost kernel: vsftpd    S 00000000   860   713    670
715
 (NOTLB)
Jan 30 22:01:52 localhost kernel: Call Trace:
[smp_apic_timer_interrupt+240/272] [smp_apic_timer_interrupt+240/272]
[update_process_times+32/160] [smp_apic_timer_interrupt+240/272]
[remove_wait_queue+6/48] [wait_for_packet+273/288]
[skb_recv_datagram+205/240]
Jan 30 22:01:52 localhost kernel:        [unix_dgram_recvmsg+69/256]
[sock_recvmsg+53/176] [sock_read+134/144] [sys_read+150/208]
[system_call+51/56]
Jan 30 22:01:52 localhost kernel: vsftpd    Z C5E07040  1408   715    713
 (L-TLB)
Jan 30 22:01:52 localhost kernel: Call Trace: [do_exit+628/672]
[system_call+51/56]

As we can see, the 100% CPU broken process has got stuck in a blocking
read() on a unix socket.

If I repeat the ftp connect/CTRL-C process again, I get a totally dead
machine.

Hope this is sufficient info. I'll try and write a minimal test case.

Cheers
Chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 14:14 Serious reproducible 2.4.x kernel hang Chris Evans
@ 2001-02-01 14:40 ` Malcolm Beattie
  2001-02-01 15:27   ` Chris Evans
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Malcolm Beattie @ 2001-02-01 14:40 UTC (permalink / raw)
  To: linux-kernel

Chris Evans writes:
> I've just managed to reproduce this personally on 2.4.0. I've had a report
> that 2.4.1 is also affected. Both myself and the other person who
> reproduced this have SMP i686 machines, which may or may not be relevant.
> 
> To reproduce, all you need to do is get my vsftpd ftp server:
> ftp://ferret.lmh.ox.ac.uk/pub/linux/vsftpd-0.0.9.tar.gz

I got this just before lunch too. I was trying out 2.4.1 + zerocopy
(with netfilter configured off, see the sendfile/zerocopy thread for
more details and hardware specs) and tried running vsftpd on the
slower machine instead of the faster machine as before. I connected
to vsftpd with an ftp client and got a
    500 OOPS: chdir
    Login failed.
    421 Service not available, remote server has closed connection
(ftpd's idea of an OOPS; not the kernel's idea of an oops, of course).
That was probably because I hadn't configured the directory properly
but following that the machine hung, in the following way: userland
hung: no more logins, existent xterm processes didn't refresh their
windows on my (remote) display. The machine was still pingable, though.

I configured Magic SysRq into the kernel but hadn't played with it
before so I hadn't enabled it in /proc (D'oh. Next time I'll know.)
As in Chris' case, vzftpd was a zombie (so Foo-ScrollLock told me) and
all other processes were looking OK in R or S state.

Looking at the kernel's EIP every so often to see what was going
showed remove_wait_queue, add_wait_queue, skb_recv_datagram and
wait_for_packet mostly. Random thought: if vsftpd did a sendfile and
then exited, becoming a zombie, could there be a problem with
tearing down a sendfile mapping? I'm off to read some code.

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 14:40 ` Malcolm Beattie
@ 2001-02-01 15:27   ` Chris Evans
  2001-02-01 15:30   ` Malcolm Beattie
  2001-02-03 20:26   ` kees
  2 siblings, 0 replies; 12+ messages in thread
From: Chris Evans @ 2001-02-01 15:27 UTC (permalink / raw)
  To: Malcolm Beattie; +Cc: linux-kernel


On Thu, 1 Feb 2001, Malcolm Beattie wrote:

> Chris Evans writes:
> > I've just managed to reproduce this personally on 2.4.0. I've had a report
> > that 2.4.1 is also affected. Both myself and the other person who
> > reproduced this have SMP i686 machines, which may or may not be relevant.
> >
> > To reproduce, all you need to do is get my vsftpd ftp server:
> > ftp://ferret.lmh.ox.ac.uk/pub/linux/vsftpd-0.0.9.tar.gz
>
> I got this just before lunch too. I was trying out 2.4.1 + zerocopy
> (with netfilter configured off, see the sendfile/zerocopy thread for

[...]

I reproduced with 2.4.1.

> Looking at the kernel's EIP every so often to see what was going
> showed remove_wait_queue, add_wait_queue, skb_recv_datagram and
> wait_for_packet mostly. Random thought: if vsftpd did a sendfile and
> then exited, becoming a zombie, could there be a problem with
> tearing down a sendfile mapping? I'm off to read some code.

I get it simply doing CTRL-C at the ftp logon prompt. No sendfile has been
used at this point. Trying to distill a test case...

Cheers
Chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 14:40 ` Malcolm Beattie
  2001-02-01 15:27   ` Chris Evans
@ 2001-02-01 15:30   ` Malcolm Beattie
  2001-02-01 16:03     ` Chris Evans
  2001-02-03 20:26   ` kees
  2 siblings, 1 reply; 12+ messages in thread
From: Malcolm Beattie @ 2001-02-01 15:30 UTC (permalink / raw)
  To: linux-kernel

Malcolm Beattie writes:
> Chris Evans writes:
> > I've just managed to reproduce this personally on 2.4.0. I've had a report
> > that 2.4.1 is also affected. Both myself and the other person who
> > reproduced this have SMP i686 machines, which may or may not be relevant.
> > 
> > To reproduce, all you need to do is get my vsftpd ftp server:
> > ftp://ferret.lmh.ox.ac.uk/pub/linux/vsftpd-0.0.9.tar.gz
[...]
> As in Chris' case, vzftpd was a zombie (so Foo-ScrollLock told me) and
> all other processes were looking OK in R or S state.

Mapping the addresses from whichever ScrollLock combination produced
the task list to symbols produces the call trace
 do_exit <- do_signal <- tcp_destroy_sock <- inet_ioctl <- signal_return

The inet_ioctl is odd there--vsftpd doesn't explicitly call ioctl
anywhere at all and the next function before it in memory is
inet_shutdown which looks more believable. I have checked I'm looking
at the right System.map but I suppose I may have mis-transcribed the
address when writing it down. vsftpd doesn't make use of signal
handlers except to unset some existing ones and a SIGALRM handler
which I don't think would have triggered. Something like a seg fault
may have caused it (I should have seen an oops if it had happened in
kernel space) perhaps?

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 15:30   ` Malcolm Beattie
@ 2001-02-01 16:03     ` Chris Evans
  2001-02-01 16:52       ` Malcolm Beattie
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Evans @ 2001-02-01 16:03 UTC (permalink / raw)
  To: Malcolm Beattie; +Cc: linux-kernel


On Thu, 1 Feb 2001, Malcolm Beattie wrote:

> Mapping the addresses from whichever ScrollLock combination produced
> the task list to symbols produces the call trace
>  do_exit <- do_signal <- tcp_destroy_sock <- inet_ioctl <- signal_return
>
> The inet_ioctl is odd there--vsftpd doesn't explicitly call ioctl
> anywhere at all and the next function before it in memory is
> inet_shutdown which looks more believable. I have checked I'm looking

Probably, the empty SIGPIPE handler triggered. The response to this is a
lot of shutdown() close() and finally an exit().

The trace you give above looks like the child process trace. I always see
the parent process go nuts. The parent process is almost always blocking
on read() of a unix dgram socket, which it shares with the child. The
child does a shutdown() on this socket just before exit().

Cheers
Chris

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 16:03     ` Chris Evans
@ 2001-02-01 16:52       ` Malcolm Beattie
  2001-02-01 18:28         ` Chris Evans
  0 siblings, 1 reply; 12+ messages in thread
From: Malcolm Beattie @ 2001-02-01 16:52 UTC (permalink / raw)
  To: Chris Evans; +Cc: linux-kernel

Chris Evans writes:
> 
> On Thu, 1 Feb 2001, Malcolm Beattie wrote:
> 
> > Mapping the addresses from whichever ScrollLock combination produced
> > the task list to symbols produces the call trace
> >  do_exit <- do_signal <- tcp_destroy_sock <- inet_ioctl <- signal_return
> >
> > The inet_ioctl is odd there--vsftpd doesn't explicitly call ioctl
> > anywhere at all and the next function before it in memory is
> > inet_shutdown which looks more believable. I have checked I'm looking
> 
> Probably, the empty SIGPIPE handler triggered. The response to this is a
> lot of shutdown() close() and finally an exit().
> 
> The trace you give above looks like the child process trace. I always see
> the parent process go nuts. The parent process is almost always blocking
> on read() of a unix dgram socket, which it shares with the child. The
> child does a shutdown() on this socket just before exit().

We've done some more detective work. I can reproduce the hang too
by quitting the ftp client abruptly (^Z and kill %1 in my case).
Inducing the hang while stracing the daemon shows a recv returning 0
as expected when the socket closes. The daemon then calls "die":

    die(const char* p_text)
    {
      /* Going down hard... */
    #ifdef DIE_DEBUG
      bug(p_text);
    #endif

and DIE_DEBUG is defined. bug() writes an error message and then does
three things:
    shutdown(2) on the sockets
    close(2) on the sockets
    abort()
the last of which libc implements as
    rt_sigprocmask(SIG_UNBLOCK, [SIGABRT])
    kill(getpid(), SIGABRT)

Here's the interesting thing: doing an exit(0) before the shutdowns
and abort gets rid of the hang. The only unusual and potentially
untested thing I could find about the program was that it uses
capset() and prctl(PR_SET_KEEPCAPS). However, replacing the
"retval = capset(...)" call with a dummy "retval = 0" doesn't get
rid of the hang. So it looks as though some combination of
shutdown(2) and SIGABRT is at fault. After the hang the kernel-side
stack trace is always either the one I gave above (and I *did*
write down the address for inet_ioctl correctly; it's definitely
not inet_shutdown) or else
  do_exit <- do_signal <- schedule <- syscall_trace <- signal_return
(with exactly the same addresses as above except for the differing
schedule and syscall_trace ones) which appeared after the hang while
vsftpd was being run under strace.

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Unix Systems Programmer
Oxford University Computing Services
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 16:52       ` Malcolm Beattie
@ 2001-02-01 18:28         ` Chris Evans
  2001-02-01 18:41           ` Doug McNaught
                             ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Chris Evans @ 2001-02-01 18:28 UTC (permalink / raw)
  To: Malcolm Beattie; +Cc: linux-kernel, davem


[cc: davem because of the severity]

On Thu, 1 Feb 2001, Malcolm Beattie wrote:

> rid of the hang. So it looks as though some combination of
> shutdown(2) and SIGABRT is at fault. After the hang the kernel-side

Nope - I've nailed it to a _really_ simple test case. It looks like a
read() on a shutdown() unix dgram socket just kills the kernel. Demo code
below. I wonder if this affects UP or is SMP only?

Malcolm, does the below code reproduce the problem for you?

Cheers
Chris

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>

int
main(int argc, const char* argv[])
{
  int retval;
  int sockets[2];
  char buf[1];

  retval = socketpair(PF_UNIX, SOCK_DGRAM, 0, sockets);
  if (retval != 0)
  {
    perror("socketpair");
    exit(1);
  }
  shutdown(sockets[0], SHUT_RDWR);
  read(sockets[0], buf, 1);
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 18:28         ` Chris Evans
@ 2001-02-01 18:41           ` Doug McNaught
  2001-02-01 18:42           ` rui.sousa
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Doug McNaught @ 2001-02-01 18:41 UTC (permalink / raw)
  To: Chris Evans; +Cc: Malcolm Beattie, linux-kernel, davem

Chris Evans <chris@scary.beasts.org> writes:

> [cc: davem because of the severity]
> 
> On Thu, 1 Feb 2001, Malcolm Beattie wrote:
> 
> > rid of the hang. So it looks as though some combination of
> > shutdown(2) and SIGABRT is at fault. After the hang the kernel-side
> 
> Nope - I've nailed it to a _really_ simple test case. It looks like a
> read() on a shutdown() unix dgram socket just kills the kernel. Demo code
> below. I wonder if this affects UP or is SMP only?

Kills my UP K6-2 dead as a doornail (except for pings, as you say). 

-Doug
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 18:28         ` Chris Evans
  2001-02-01 18:41           ` Doug McNaught
@ 2001-02-01 18:42           ` rui.sousa
  2001-02-02  9:25           ` [Patch]Re: " Prasanna P Subash
  2001-02-02  9:29           ` David S. Miller
  3 siblings, 0 replies; 12+ messages in thread
From: rui.sousa @ 2001-02-01 18:42 UTC (permalink / raw)
  To: Chris Evans; +Cc: Malcolm Beattie, linux-kernel, davem

On Thu, 1 Feb 2001, Chris Evans wrote:

>
> Nope - I've nailed it to a _really_ simple test case. It looks like a
> read() on a shutdown() unix dgram socket just kills the kernel. Demo code
> below. I wonder if this affects UP or is SMP only?

It surely killed my PIII UP machine (running 2.4.1)

Rui Sousa

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Patch]Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 18:28         ` Chris Evans
  2001-02-01 18:41           ` Doug McNaught
  2001-02-01 18:42           ` rui.sousa
@ 2001-02-02  9:25           ` Prasanna P Subash
  2001-02-02  9:29           ` David S. Miller
  3 siblings, 0 replies; 12+ messages in thread
From: Prasanna P Subash @ 2001-02-02  9:25 UTC (permalink / raw)
  To: Chris Evans; +Cc: Malcolm Beattie, linux-kernel, davem


[-- Attachment #1.1: Type: text/plain, Size: 1359 bytes --]

 
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> 
> int
> main(int argc, const char* argv[])
> {
>   int retval;
>   int sockets[2];
>   char buf[1];
> 
>   retval = socketpair(PF_UNIX, SOCK_DGRAM, 0, sockets);
>   if (retval != 0)
>   {
>     perror("socketpair");
>     exit(1);
>   }
>   shutdown(sockets[0], SHUT_RDWR);
>   read(sockets[0], buf, 1);
> }

I tried to debug this issue with the kdb on 2.4.1-pre7.
Here is the stack trace

mcount+0x1f9
wait_for_packet+0x13
skb_recv_datagram+0xbb
unix_dgram_recvmsg+0x53
sock_recvmsg+0x41
sock_read+0x8f
sys_read+0xa4
system_call+0x3c

I looked at the skb_recv_datagram code and noticed that wait_for_packet is not
returning an error, even while trying to read a closed socket.
Anyways here is a patch against 2.4.1 that will fix the issue.
Please feel free to flame me about the patch :)

thanks
-- 
Prasanna Subash   ---   psubash@turbolinux.com   ---     TurboLinux, INC
------------------------------------------------------------------------
Linux, the choice          | Q: How do you keep a moron in suspense? 
of a GNU generation   -o)  | 
Kernel 2.2.16         /\\  | 
on a i686            _\\_v | 
                           | 
------------------------------------------------------------------------

[-- Attachment #1.2: dgram.patch --]
[-- Type: application/x-patch, Size: 597 bytes --]

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Patch]Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 18:28         ` Chris Evans
                             ` (2 preceding siblings ...)
  2001-02-02  9:25           ` [Patch]Re: " Prasanna P Subash
@ 2001-02-02  9:29           ` David S. Miller
  3 siblings, 0 replies; 12+ messages in thread
From: David S. Miller @ 2001-02-02  9:29 UTC (permalink / raw)
  To: Prasanna P Subash; +Cc: Chris Evans, Malcolm Beattie, linux-kernel


Prasanna P Subash writes:
 > I looked at the skb_recv_datagram code and noticed that wait_for_packet is not
 > returning an error, even while trying to read a closed socket.
 > Anyways here is a patch against 2.4.1 that will fix the issue.
 > Please feel free to flame me about the patch :)

Please read the rest of today's postings, Alexey Kuznetsov already
posted the correct fix, which I'm attached below:

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/net/core/datagram.c linux/net/core/datagram.c
--- vanilla/linux/net/core/datagram.c	Sat Nov 11 19:02:40 2000
+++ linux/net/core/datagram.c	Thu Feb  1 17:15:12 2001
@@ -72,19 +73,19 @@
 	/* Socket errors? */
 	error = sock_error(sk);
 	if (error)
-		goto out;
+		goto out_err;
 
 	if (!skb_queue_empty(&sk->receive_queue))
 		goto ready;
 
 	/* Socket shut down? */
 	if (sk->shutdown & RCV_SHUTDOWN)
-		goto out;
+		goto out_noerr;
 
 	/* Sequenced packets can come disconnected. If so we report the problem */
 	error = -ENOTCONN;
 	if(connection_based(sk) && !(sk->state==TCP_ESTABLISHED || sk->state==TCP_LISTEN))
-		goto out;
+		goto out_err;
 
 	/* handle signals */
 	if (signal_pending(current))
@@ -99,11 +100,16 @@
 
 interrupted:
 	error = sock_intr_errno(*timeo_p);
+out_err:
+	*err = error;
 out:
 	current->state = TASK_RUNNING;
 	remove_wait_queue(sk->sleep, &wait);
-	*err = error;
 	return error;
+out_noerr:
+	*err = 0;
+	error = 1;
+	goto out;
 }
 
 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Serious reproducible 2.4.x kernel hang
  2001-02-01 14:40 ` Malcolm Beattie
  2001-02-01 15:27   ` Chris Evans
  2001-02-01 15:30   ` Malcolm Beattie
@ 2001-02-03 20:26   ` kees
  2 siblings, 0 replies; 12+ messages in thread
From: kees @ 2001-02-03 20:26 UTC (permalink / raw)
  To: Malcolm Beattie; +Cc: linux-kernel

Hi,

What is related in /proc w.r.t. sysrq?

Kees

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2001-02-03 20:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-02-01 14:14 Serious reproducible 2.4.x kernel hang Chris Evans
2001-02-01 14:40 ` Malcolm Beattie
2001-02-01 15:27   ` Chris Evans
2001-02-01 15:30   ` Malcolm Beattie
2001-02-01 16:03     ` Chris Evans
2001-02-01 16:52       ` Malcolm Beattie
2001-02-01 18:28         ` Chris Evans
2001-02-01 18:41           ` Doug McNaught
2001-02-01 18:42           ` rui.sousa
2001-02-02  9:25           ` [Patch]Re: " Prasanna P Subash
2001-02-02  9:29           ` David S. Miller
2001-02-03 20:26   ` kees

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).