linux-kernel.vger.kernel.org archive mirror
* waitid breaks telnet
@ 2004-12-01  3:55 Joe Korty
  2004-12-01  4:27 ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Korty @ 2004-12-01  3:55 UTC (permalink / raw)
  To: roland; +Cc: linux-kernel, akpm

[ 2nd send, this one from my home email account...]

telnet no longer works:

     # chkconfig telnet on
     # telnet localhost
     Trying 127.0.0.1...
     Connected to localhost (127.0.0.1).
     Escape character is '^]'.
     Red Hat Enterprise Linux WS release 3 (Taroon Update 2)
     Kernel 2.6.10-rc2 on an i686
     Connection closed by foreign host.

A bsearch placed the bug between 2.6.9-rc1-bk[78], another
bsearch on the changesets showed the problem is caused
by this patch:

     roland@redhat.com[torvalds]|ChangeSet|20040831173525|30767
     [PATCH] waitid system call

My guess is, something about the new wait4(2) wrapper
is causing the telnet daemon to declare success before
its child, /bin/login, exits.

Joe

[PS: this email may not get through, our email servers
changed recently and I have been having problems.  Roland,
please ACK me privately as soon as you see this.  Thanks] 



* Re: waitid breaks telnet
  2004-12-01  3:55 waitid breaks telnet Joe Korty
@ 2004-12-01  4:27 ` Andrew Morton
  2004-12-01 13:32   ` Joe Korty
  2004-12-01 19:20   ` Roland McGrath
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2004-12-01  4:27 UTC (permalink / raw)
  To: Joe Korty; +Cc: roland, linux-kernel

Joe Korty <kortyads@mindspring.com> wrote:
>
> [ 2nd send, this one from my home email account...]
> 
> telnet no longer works:
> 
>      # chkconfig telnet on
>      # telnet localhost
>      Trying 127.0.0.1...
>      Connected to localhost (127.0.0.1).
>      Escape character is '^]'.
>      Red Hat Enterprise Linux WS release 3 (Taroon Update 2)
>      Kernel 2.6.10-rc2 on an i686
>      Connection closed by foreign host.
> 
> A bsearch placed the bug between 2.6.9-rc1-bk[78], another
> bsearch on the changesets showed the problem is caused
> by this patch:
> 
>      roland@redhat.com[torvalds]|ChangeSet|20040831173525|30767
>      [PATCH] waitid system call
> 
> My guess is, something about the new wait4(2) wrapper
> is causing the telnet daemon to declare success before
> its child, /bin/login, exits.

I can reproduce this on 2.6.10-rc2, but it seems to have been fixed in more
recent kernels.  However I cannot think of anything which we did which
would have fixed this.



* Re: waitid breaks telnet
  2004-12-01  4:27 ` Andrew Morton
@ 2004-12-01 13:32   ` Joe Korty
  2004-12-01 19:20   ` Roland McGrath
  1 sibling, 0 replies; 12+ messages in thread
From: Joe Korty @ 2004-12-01 13:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: roland, linux-kernel

At 08:27 PM 11/30/04 -0800, Andrew Morton wrote:
>Joe Korty <kortyads@mindspring.com> wrote:
> >
> > [ 2nd send, this one from my home email account...]
> >
> > telnet no longer works:
> >
> >      # chkconfig telnet on
> >      # telnet localhost
> >      Trying 127.0.0.1...
> >      Connected to localhost (127.0.0.1).
> >      Escape character is '^]'.
> >      Red Hat Enterprise Linux WS release 3 (Taroon Update 2)
> >      Kernel 2.6.10-rc2 on an i686
> >      Connection closed by foreign host.
> >
> > A bsearch placed the bug between 2.6.9-rc1-bk[78], another
> > bsearch on the changesets showed the problem is caused
> > by this patch:
> >
> >      roland@redhat.com[torvalds]|ChangeSet|20040831173525|30767
> >      [PATCH] waitid system call
> >
> > My guess is, something about the new wait4(2) wrapper
> > is causing the telnet daemon to declare success before
> > its child, /bin/login, exits.
>
>I can reproduce this on 2.6.10-rc2, but it seems to have been fixed in more
>recent kernels.  However I cannot think of anything which we did which
>would have fixed this.

I was able to reproduce it with the day-before-yesterday's bitkeeper tree.

My boss sees broken kernels work once in a while.  I myself have
never been able to get a broken kernel to work.  The problem may
be a race.

Joe




* Re: waitid breaks telnet
  2004-12-01  4:27 ` Andrew Morton
  2004-12-01 13:32   ` Joe Korty
@ 2004-12-01 19:20   ` Roland McGrath
  2004-12-01 19:41     ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Roland McGrath @ 2004-12-01 19:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Joe Korty, linux-kernel

I've had no luck reproducing that, so there isn't much I can do.  The last
time someone thought the waitid change broke something random, it was the
perturbation of the compiled code vs the issue that the kernel's assembly
code doesn't follow the same calling conventions the compiler expects.


* Re: waitid breaks telnet
  2004-12-01 19:20   ` Roland McGrath
@ 2004-12-01 19:41     ` Andrew Morton
  2004-12-01 22:30       ` Joe Korty
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2004-12-01 19:41 UTC (permalink / raw)
  To: Roland McGrath; +Cc: kortyads, linux-kernel

Roland McGrath <roland@redhat.com> wrote:
>
> I've had no luck reproducing that, so there isn't much I can do.

Did you try bare 2.6.10-rc2?

>  The last
> time someone thought the waitid change broke something random, it was the
> perturbation of the compiled code vs the issue that the kernel's assembly
> code doesn't follow the same calling conventions the compiler expects.

Could be that, but I was able to reproduce it on 2.6.10-rc2 with
gcc-2.95.4, with which -mregparm is disabled.

Still.  It would be interesting if Joe could retest with CONFIG_REGPARM=n?


* Re: waitid breaks telnet
  2004-12-01 19:41     ` Andrew Morton
@ 2004-12-01 22:30       ` Joe Korty
  2004-12-01 22:49         ` Joe Korty
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Korty @ 2004-12-01 22:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Roland McGrath, kortyads, linux-kernel

On Wed, Dec 01, 2004 at 11:41:41AM -0800, Andrew Morton wrote:
> Roland McGrath <roland@redhat.com> wrote:
> >
> > I've had no luck reproducing that, so there isn't much I can do.
> 
> Did you try bare 2.6.10-rc2?
> 
> >  The last
> > time someone thought the waitid change broke something random, it was the
> > perturbation of the compiled code vs the issue that the kernel's assembly
> > code doesn't follow the same calling conventions the compiler expects.
> 
> Could be that, but I was able to reproduce it on 2.6.10-rc2 with
> gcc-2.95.4, with which -mregparm is disabled.
> 
> Still.  It would be interesting if Joe could retest with CONFIG_REGPARM=n?

CONFIG_REGPARM is not set in any of my kernels (just verified).
Joe


* Re: waitid breaks telnet
  2004-12-01 22:30       ` Joe Korty
@ 2004-12-01 22:49         ` Joe Korty
  2004-12-01 23:22           ` Joe Korty
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Korty @ 2004-12-01 22:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Roland McGrath, linux-kernel

On Wed, Dec 01, 2004 at 05:30:14PM -0500, Joe Korty wrote:
> On Wed, Dec 01, 2004 at 11:41:41AM -0800, Andrew Morton wrote:
> > Roland McGrath <roland@redhat.com> wrote:
> > >
> > > I've had no luck reproducing that, so there isn't much I can do.
> > 
> > Did you try bare 2.6.10-rc2?
> > 
> > >  The last
> > > time someone thought the waitid change broke something random, it was the
> > > perturbation of the compiled code vs the issue that the kernel's assembly
> > > code doesn't follow the same calling conventions the compiler expects.
> > 
> > Could be that, but I was able to reproduce it on 2.6.10-rc2 with
> > gcc-2.95.4, with which -mregparm is disabled.
> > 
> > Still.  It would be interesting if Joe could retest with CONFIG_REGPARM=n?
> 
> CONFIG_REGPARM is not set in any of my kernels (just verified).

More info: I exclusively use CONFIG_SMP and CONFIG_PREEMPT.
If it is a race either or both of these is likely to
be involved.

Joe


* Re: waitid breaks telnet
  2004-12-01 22:49         ` Joe Korty
@ 2004-12-01 23:22           ` Joe Korty
  2004-12-01 23:58             ` Roland McGrath
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Korty @ 2004-12-01 23:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Roland McGrath, linux-kernel

On Wed, Dec 01, 2004 at 05:49:06PM -0500, Joe Korty wrote:
> On Wed, Dec 01, 2004 at 05:30:14PM -0500, Joe Korty wrote:
> > On Wed, Dec 01, 2004 at 11:41:41AM -0800, Andrew Morton wrote:
> > > Roland McGrath <roland@redhat.com> wrote:
> > > >
> > > > I've had no luck reproducing that, so there isn't much I can do.
> > > 
> > > Did you try bare 2.6.10-rc2?
> > > 
> > > >  The last
> > > > time someone thought the waitid change broke something random, it was the
> > > > perturbation of the compiled code vs the issue that the kernel's assembly
> > > > code doesn't follow the same calling conventions the compiler expects.
> > > 
> > > Could be that, but I was able to reproduce it on 2.6.10-rc2 with
> > > gcc-2.95.4, with which -mregparm is disabled.
> > > 
> > > Still.  It would be interesting if Joe could retest with CONFIG_REGPARM=n?
> > 
> > CONFIG_REGPARM is not set in any of my kernels (just verified).
> 
> More info: I exclusively use CONFIG_SMP and CONFIG_PREEMPT.
> If it is a race either or both of these is likely to
> be involved.

Ok, I rebuilt 2.6.9 with CONFIG_PREEMPT=n and telnet failed
the one time I tried it.

Then I built with CONFIG_PREEMPT=n and CONFIG_SMP=n and
the first telnet attempt succeeded.  I then tried six
more telnet attempts, two of those failed and the rest
succeeded.

Since my earlier testing usually was of only 1 (sometimes
2) telnet attempts per boot, they too may have had some
ratio of success/failure other than 100% or 0%.

Joe


* Re: waitid breaks telnet
  2004-12-01 23:22           ` Joe Korty
@ 2004-12-01 23:58             ` Roland McGrath
  2004-12-02 17:54               ` Joe Korty
  0 siblings, 1 reply; 12+ messages in thread
From: Roland McGrath @ 2004-12-01 23:58 UTC (permalink / raw)
  To: joe.korty; +Cc: Andrew Morton, linux-kernel

Since I can only manage so far to see this once in a blue moon, and you can
produce it frequently, it would be helpful if you can diagnose the problem
some.  That is, figure out exactly which wrong result from a wait* call is
at fault.


Thanks,
Roland


* Re: waitid breaks telnet
  2004-12-01 23:58             ` Roland McGrath
@ 2004-12-02 17:54               ` Joe Korty
  2004-12-02 22:39                 ` [PATCH] fix uninitialized variable in waitid(2) Joe Korty
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Korty @ 2004-12-02 17:54 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Andrew Morton, linux-kernel

On Wed, Dec 01, 2004 at 03:58:46PM -0800, Roland McGrath wrote:
> Since I can only manage so far to see this once in a blue moon, and you can
> produce it frequently, it would be helpful if you can diagnose the problem
> some.  That is, figure out exactly which wrong result from a wait* call is
> at fault.

Hi Roland,
I've been playing with this most of the morning and finally got strace attached
to the telnet daemon, but it did me no good: everything works when straced.

My technique was to replace /usr/sbin/in.telnetd with a script that invokes
the original binary under strace:

	# cd /usr/sbin
	# mv in.telnetd in.telnetd.orig
	# cat <<'EOF' >in.telnetd
	#!/bin/sh
	/usr/bin/strace -ff -o /tmp/telnet.log.$$ /usr/sbin/in.telnetd.orig "$@"
	EOF
	# chmod 755 in.telnetd

Earlier this morning I systematically repeated my earlier, haphazard
experiments.  I built three kernels from two sources: the first source
was the pure 2.6.9-rc1-bk7 tree, the second the same tree with the suspect
waitid patch applied.  From these I built various kernels with and without
SMP and PREEMPT and ran at least seven 'telnet' tests on each.  The results:

   kernel       smp preempt | 1 2 3 4 5 6 7 8 9
   ======================================================
   bk7          Y   Y       | g g g g g g g
   bk7+waitid   Y   Y       | F F F F F F F
   bk7+waitid   N   N       | F g F F g g g F g



* [PATCH] fix uninitialized variable in waitid(2)
  2004-12-02 17:54               ` Joe Korty
@ 2004-12-02 22:39                 ` Joe Korty
  2004-12-02 22:51                   ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Joe Korty @ 2004-12-02 22:39 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Andrew Morton, linux-kernel

Specify an initial value for signal_struct's field stop_state
whenever a signal_struct variable is created.

Bug was discovered through the occasional failure of
telnet(1) to connect.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

--- base/kernel/fork.c	2004-12-02 17:18:39.340843441 -0500
+++ new/kernel/fork.c	2004-12-02 17:24:27.085305563 -0500
@@ -733,6 +733,7 @@
 	sig->group_exit_code = 0;
 	sig->group_exit_task = NULL;
 	sig->group_stop_count = 0;
+	sig->stop_state = 0;
 	sig->curr_target = NULL;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);


* Re: [PATCH] fix uninitialized variable in waitid(2)
  2004-12-02 22:39                 ` [PATCH] fix uninitialized variable in waitid(2) Joe Korty
@ 2004-12-02 22:51                   ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2004-12-02 22:51 UTC (permalink / raw)
  To: joe.korty; +Cc: roland, linux-kernel

Joe Korty <joe.korty@ccur.com> wrote:
>
> Specify an initial value for signal_struct's field stop_state
> whenever a signal_struct variable is created.

whew.  Thanks.
