All of lore.kernel.org
 help / color / mirror / Atom feed
* t7006 sometimes hangs in cronjobs on OS X
@ 2011-02-09 14:38 Thomas Rast
  2011-02-09 18:30 ` Ben Walton
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Thomas Rast @ 2011-02-09 14:38 UTC (permalink / raw)
  To: git; +Cc: Jeff King, Jonathan Nieder

Hi *,

I have been running a nightly next&pu smoke tester each on RHEL5 and
OS X.  For quite some time (at least a month), t7006 gets stuck
randomly (i.e., not every night).  [This has been a bit of a problem
because it keeps a lot of processes stuck on the machine and eats into
my ulimit, but that's the case with any stuck process; I'll have to
think about a good solution to that.]

The relevant part of 'ps xwww' is

  65211   ??  SN     0:00.03 /bin/sh t7006-pager.sh
  65798   ??  SN     0:00.04 /usr/bin/perl /Users/trast/git-smoke/t/test-terminal.perl git --no-pager log
  65846   ??  ZN     0:00.00 (git)
  65847   ??  ZN     0:00.00 (perl5.10.0)

Everything runs from a cronjob, so it does not have a terminal.  If I
SIGTERM the first perl process, the script continues and I eventually
get the normal results (upload to smoke.nix.git.is and emailed
report).

The Cc list is a bit of a random guess based on who worked on this;
from the 'git --no-pager log' you can see that the test got stuck in
the 'no pager with --no-pager' test, but that was introduced 10 months
ago.  It may very well have been broken all the time, or also for
other commands.

So if you have any idea what the problem could be or what I should be
looking for, help would be much appreciated...


Some files that may be relevant:

Crontab entry:
0 5 * * * cd ~ && . .bashrc && ./do-the-smoke.sh

---- 8< ---- ~/do-the-smoke.sh ---- 8< ----
#!/bin/bash

cd ~/git-smoke
git fetch -q

doit () {
    branch="$1"
    git checkout origin/"$branch" || exit 1

    make -j12 || exit 1
    cd t
    make clean
    TEST_JOBS=6 nice make smoke
    TEST_JOBS=6 nice make smoke_report SMOKE_COMMENT="Darwin 10.5 $branch" SMOKE_TAGS="$branch, Darwin 10.5, NO_SVN_TESTS, NO_EXPAT"
}

( doit next )
( doit pu )
---- >8 ---- >8 ----

---- 8< ---- ~/git-smoke/config.mak ---- 8< ----
CFLAGS = -Wall -O3 -g
THREADED_DELTA_SEARCH = 1
NO_SVN_TESTS = 1
ASCIIDOC8 = 1
ASCIIDOC_NO_ROFF = 1
GNU_ROFF = 1
BLK_SHA1 = 1
prefix = /home/thomas/.local
NO_EXPAT = 1
NO_GETTEXT = 1
SMOKE_USERNAME = trast
SMOKE_PASSWORD = I'm not telling you :-)
SMOKE_COMMENT = RHEL5.5
---- >8 ---- >8 ----

(I just noticed that the above 'prefix' makes no sense because $HOME
is /Users/trast, but I'm not installing it anyway...)

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 14:38 t7006 sometimes hangs in cronjobs on OS X Thomas Rast
@ 2011-02-09 18:30 ` Ben Walton
  2011-02-09 19:09 ` Jeff King
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Ben Walton @ 2011-02-09 18:30 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Jeff King, Jonathan Nieder

Excerpts from Thomas Rast's message of Wed Feb 09 09:38:46 -0500 2011:

> I have been running a nightly next&pu smoke tester each on RHEL5 and
> OS X.  For quite some time (at least a month), t7006 gets stuck
> randomly (i.e., not every night).  [This has been a bit of a problem
> because it keeps a lot of processes stuck on the machine and eats
> into my ulimit, but that's the case with any stuck process; I'll
> have to think about a good solution to that.]

I'm also seeing this test fail when runing the packaging tools for
OpenCSW.  I haven't had a chance to investigate yet, but manual tests
of the script have run without errors so far.  It's only during the
automated build that it fails.  The automated build is initiated
manually but is run through a Makefile wrapper that will drastically
alter the environment.  Beyond this, I have no further input.  I'm
hoping to spend some time on it in the next few days.

Thanks
-Ben
--
Ben Walton
Systems Programmer - CHASS
University of Toronto
C:416.407.5610 | W:416.978.4302

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 14:38 t7006 sometimes hangs in cronjobs on OS X Thomas Rast
  2011-02-09 18:30 ` Ben Walton
@ 2011-02-09 19:09 ` Jeff King
  2011-02-09 19:16   ` Thomas Rast
  2011-02-13  5:31 ` Jonathan Nieder
  2011-02-17 12:38 ` [PATCH] tests: skip terminal output tests " Jonathan Nieder
  3 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2011-02-09 19:09 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Jonathan Nieder

On Wed, Feb 09, 2011 at 03:38:46PM +0100, Thomas Rast wrote:

> I have been running a nightly next&pu smoke tester each on RHEL5 and
> OS X.  For quite some time (at least a month), t7006 gets stuck
> randomly (i.e., not every night).  [This has been a bit of a problem
> because it keeps a lot of processes stuck on the machine and eats into
> my ulimit, but that's the case with any stuck process; I'll have to
> think about a good solution to that.]
> 
> The relevant part of 'ps xwww' is
> 
>   65211   ??  SN     0:00.03 /bin/sh t7006-pager.sh
>   65798   ??  SN     0:00.04 /usr/bin/perl /Users/trast/git-smoke/t/test-terminal.perl git --no-pager log
>   65846   ??  ZN     0:00.00 (git)
>   65847   ??  ZN     0:00.00 (perl5.10.0)

Hmm. The zombie git process implies to me that git has exited, but for
some reason we are still stuck in the copy-to-terminal loop and haven't
reaped it. But the zombie perl process confuses me.  We fork a second
time so that one process copies stderr and the other one copies stdout.
Is the second perl process the stderr copier, and we are still blocking
on copying stdout for some reason? But then why is the command name
different? Is /usr/bin/perl a wrapper script on your platform?

Have you tried running strace on the surviving perl process to see what
it's doing? Presumably it's just hung on a read() syscall.

Or maybe try stracing the whole thing from the start, which may be more
informative, like:

diff --git a/t/lib-terminal.sh b/t/lib-terminal.sh
index c383b57..f7e6b7f 100644
--- a/t/lib-terminal.sh
+++ b/t/lib-terminal.sh
@@ -13,7 +13,7 @@ test_expect_success 'set up terminal for tests' '
 				echo >&4 "test_terminal: need to declare TTY prerequisite"
 				return 127
 			fi
-			"$PERL_PATH" "$TEST_DIRECTORY"/test-terminal.perl "$@"
+			strace -f "$PERL_PATH" "$TEST_DIRECTORY"/test-terminal.perl "$@"
 		}
 	fi
 '

and giving us the stderr dump (or perhaps redirecting strace output to a
file and saving it)?

>     TEST_JOBS=6 nice make smoke

If you use TEST_JOBS=1, does it still happen? I can't imagine what race
condition would cause this, but there are other tests use
lib-terminal.sh, so perhaps there is some interference there.

-Peff

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 19:09 ` Jeff King
@ 2011-02-09 19:16   ` Thomas Rast
  2011-02-09 20:22     ` Thomas Rast
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Rast @ 2011-02-09 19:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Jonathan Nieder

Jeff King wrote:
> On Wed, Feb 09, 2011 at 03:38:46PM +0100, Thomas Rast wrote:
> 
> > I have been running a nightly next&pu smoke tester each on RHEL5 and
> > OS X.  For quite some time (at least a month), t7006 gets stuck
> > randomly (i.e., not every night).

BTW, I just noticed when re-reading this that it's a bit confusing.
It never hangs on RHEL (i.e. Linux), only on OS X.  (And I only
reported it now because I only just noticed that I had a month's worth
of zombies hanging around. :-)

> >   65211   ??  SN     0:00.03 /bin/sh t7006-pager.sh
> >   65798   ??  SN     0:00.04 /usr/bin/perl /Users/trast/git-smoke/t/test-terminal.perl git --no-pager log
> >   65846   ??  ZN     0:00.00 (git)
> >   65847   ??  ZN     0:00.00 (perl5.10.0)
> 
> Hmm. The zombie git process implies to me that git has exited, but for
> some reason we are still stuck in the copy-to-terminal loop and haven't
> reaped it. But the zombie perl process confuses me.  We fork a second
> time so that one process copies stderr and the other one copies stdout.
> Is the second perl process the stderr copier, and we are still blocking
> on copying stdout for some reason?

How do I find out?

> But then why is the command name
> different? Is /usr/bin/perl a wrapper script on your platform?

No, it's your average OS X binary (i.e., 'file' says it has an image
for every platform etc.).

The command name apparently just changes when it becomes a zombie:
when I run perl -e 'if (fork) {sleep 10;} else {sleep 5;}' they first
share the command line, but as soon as the child's sleep expires its
command line changes to '(perl5.10.0)'.

I'll have to put off the more involved experimentation somewhat :-(

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 19:16   ` Thomas Rast
@ 2011-02-09 20:22     ` Thomas Rast
  2011-02-09 20:50       ` Jeff King
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Rast @ 2011-02-09 20:22 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Jonathan Nieder

Thomas Rast wrote:
> I'll have to put off the more involved experimentation somewhat :-(

Or not.

Anyway, I managed to load the relevant parts of t7006 in an
interactive shell, and used

  i=0
  while test_terminal git --no-pager log >out.$i 2>&1; do
    i=$(($i+1))
    echo $i
  done

to trigger the race.  With

  test_terminal () {
    sudo "$PERL_PATH" "$TEST_DIRECTORY"/test-terminal.perl "$@"
  }

which is the closest to dtruss I can get without actually using it, I
hit the race very quickly, usually in ~5 runs.  But using either one
of

  test_terminal () {
    sudo dtruss "$PERL_PATH" "$TEST_DIRECTORY"/test-terminal.perl "$@"
  }

  test_terminal () {
    sudo dtruss -f "$PERL_PATH" "$TEST_DIRECTORY"/test-terminal.perl "$@"
  }

slows down things so much that it did not hit the race in useful time
(about 100 runs).  I also tried running 'dtruss -n perl' in another
terminal and hoping it will hit a wall eventually, which does run a
bit faster but also got nowhere in about 350 runs.

Running 'dtruss -n git' instead did hit the issue, and I have the
output saved away, but I'm not sure what I'd be looking for.  After
all, the git process becomes a zombie:

  47606 s006  S+     0:00.13 /usr/bin/perl /Users/trast/git/t/test-terminal.perl git --no-pager log
  47607 s006  Z+     0:00.00 (git)
  47608 s006  Z+     0:00.00 (perl5.10.0)

The output for pid 47607 ends in:

  47607/0x1f7fec5:  write_nocancel(0x1, "Author: A U Thor <author@example.com>\n\0", 0x26)                 = 38 0
  47607/0x1f7fec5:  write_nocancel(0x1, "Date:   Thu Apr 7 15:13:13 2005 -0700\n\0", 0x26)                 = 38 0
  47607/0x1f7fec5:  write_nocancel(0x1, "\n\0", 0x1)               = 1 0
  47607/0x1f7fec5:  write_nocancel(0x1, "    initial\n\0", 0xC)            = 12 0
  47607/0x1f7fec5:  fstat64(0x1, 0x7FFF5FBFE8A0, 0x0)              = 0 0
  47607/0x1f7fec5:  fstat64(0x1, 0x7FFF5FBFEF60, 0x0)              = 0 0
  47607/0x1f7fec5:  close_nocancel(0x1)            = 0 0

Which seems rather unsurprising really :-(

Finally, attaching dtruss after hitting the race gives no output when
I finally kill it.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 20:22     ` Thomas Rast
@ 2011-02-09 20:50       ` Jeff King
  2011-02-12  2:03         ` Jonathan Nieder
  0 siblings, 1 reply; 12+ messages in thread
From: Jeff King @ 2011-02-09 20:50 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Jonathan Nieder

On Wed, Feb 09, 2011 at 09:22:19PM +0100, Thomas Rast wrote:

> Thomas Rast wrote:
> > I'll have to put off the more involved experimentation somewhat :-(
> 
> Or not.
> 
> Anyway, I managed to load the relevant parts of t7006 in an
> interactive shell, and used
> 
>   i=0
>   while test_terminal git --no-pager log >out.$i 2>&1; do
>     i=$(($i+1))
>     echo $i
>   done

Great, that is a much smaller bit to work with. I of course can't
replicate it on my Linux boxes, though. :(

> Running 'dtruss -n git' instead did hit the issue, and I have the
> output saved away, but I'm not sure what I'd be looking for.  After
> all, the git process becomes a zombie:
> 
>   47606 s006  S+     0:00.13 /usr/bin/perl /Users/trast/git/t/test-terminal.perl git --no-pager log
>   47607 s006  Z+     0:00.00 (git)
>   47608 s006  Z+     0:00.00 (perl5.10.0)

Yeah, I think the problem is going to be in the perl process, but...

> The output for pid 47607 ends in:
> 
>   47607/0x1f7fec5:  write_nocancel(0x1, "Author: A U Thor <author@example.com>\n\0", 0x26)                 = 38 0
>   47607/0x1f7fec5:  write_nocancel(0x1, "Date:   Thu Apr 7 15:13:13 2005 -0700\n\0", 0x26)                 = 38 0
>   47607/0x1f7fec5:  write_nocancel(0x1, "\n\0", 0x1)               = 1 0
>   47607/0x1f7fec5:  write_nocancel(0x1, "    initial\n\0", 0xC)            = 12 0
>   47607/0x1f7fec5:  fstat64(0x1, 0x7FFF5FBFE8A0, 0x0)              = 0 0
>   47607/0x1f7fec5:  fstat64(0x1, 0x7FFF5FBFEF60, 0x0)              = 0 0
>   47607/0x1f7fec5:  close_nocancel(0x1)            = 0 0
> 
> Which seems rather unsurprising really :-(

It doesn't close(2) also? That could be the problem. I would expect it
also to exit().

> Finally, attaching dtruss after hitting the race gives no output when
> I finally kill it.

Yeah, it is probably stuck in some syscall. You could try instrumenting
your test-terminal.perl like:

diff --git a/t/test-terminal.perl b/t/test-terminal.perl
index ee01eb9..1755017 100755
--- a/t/test-terminal.perl
+++ b/t/test-terminal.perl
@@ -55,11 +55,15 @@ sub copy_stdio {
 	defined $pid or die "fork failed: $!";
 	if (!$pid) {
 		close($out);
+		print STDERR "child: pumping stderr\n";
 		xsendfile(\*STDERR, $err);
+		print STDERR "child: done pumping stderr\n";
 		exit 0;
 	}
 	close($err);
+	print STDERR "parent: pumping stdout\n";
 	xsendfile(\*STDOUT, $out);
+	print STDERR "parent: done pumping stdout\n";
 	finish_child($pid) == 0
 		or exit 1;
 }

and see which of those messages actually gets printed on the stuck case.
My suspicion is that one of the processes is getting stuck in xsendfile.
It may be the fault of File::Copy::copy, which I have seen be flaky
before on non-regular files. We should arguably be using non-buffered
I/O here anyway, so maybe try this:

diff --git a/t/test-terminal.perl b/t/test-terminal.perl
index ee01eb9..0d33da1 100755
--- a/t/test-terminal.perl
+++ b/t/test-terminal.perl
@@ -3,7 +3,6 @@ use 5.008;
 use strict;
 use warnings;
 use IO::Pty;
-use File::Copy;
 
 # Run @$argv in the background with stdio redirected to $out and $err.
 sub start_child {
@@ -46,7 +45,13 @@ sub xsendfile {
 	# from a disconnected terminal will return
 	# EIO (as in AIX 4.x, IRIX, and Linux) or
 	# end-of-file.  Either is fine.
-	copy($in, $out, 4096) or $!{EIO} or die "cannot copy from child: $!";
+	while (1) {
+		my $r = sysread($in, my $buf, 4096);
+		last if $r == 0;
+		last if !defined $r and $!{EIO};
+		die "cannot copy from child: $!" unless defined $r;
+		syswrite($out, $buf);
+	}
 }
 
 sub copy_stdio {

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 20:50       ` Jeff King
@ 2011-02-12  2:03         ` Jonathan Nieder
  2011-02-12  2:05           ` Jonathan Nieder
  2011-02-12  5:12           ` Jeff King
  0 siblings, 2 replies; 12+ messages in thread
From: Jonathan Nieder @ 2011-02-12  2:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, git

Jeff King wrote:

> Yeah, it is probably stuck in some syscall. You could try instrumenting
> your test-terminal.perl like:
>
> diff --git a/t/test-terminal.perl b/t/test-terminal.perl
> index ee01eb9..1755017 100755
> --- a/t/test-terminal.perl
> +++ b/t/test-terminal.perl
> @@ -55,11 +55,15 @@ sub copy_stdio {
>  	defined $pid or die "fork failed: $!";
>  	if (!$pid) {
>  		close($out);
> +		print STDERR "child: pumping stderr\n";
[...]

Nice, I can reproduce this (kernel = Darwin 10.5.0, using Apple's perl
5.10.0).  With

	while ./test-terminal.perl echo hi >out.$i 2>&1
	do
		i=$(($i + 1))
		echo $i
	done

I get:

	$ cat out.28 
	child: pumping stdout
	child: pumping stderr
	hi
	child: done pumping stdout

Instrumenting Copy.pm, I can see that we are stuck in sysread.

	31984: child: pumping stdout
	31984: enter copy loop.
	31984: time to read 4096...
	31986: child: pumping stderr
	31984: read returns with 4
	31984: about to write "hi
	"
	31984: pump 4 - 0, result = 4
	31984: time to read 4096...
	31986: enter copy loop.
	31986: time to read 4096...
	31986: read returns with 0
	31986: about to write ""
	31986: done!
	31986: successful return.
	31986: copy returns with 1 errno=
	31986: child: done pumping stderr

Redirecting stderr by using 'xsendfile("elsewhere", $err);' avoids
trouble.  Asking stderr and stdout to go to different places, like so:

	./test-terminal.perl echo hi >out.$i 2>err.$i

does not avoid trouble.  Sometimes output includes some streams of
null bytes, which makes me suspect something awry in the kernel.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-12  2:03         ` Jonathan Nieder
@ 2011-02-12  2:05           ` Jonathan Nieder
  2011-02-12  5:12           ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Jonathan Nieder @ 2011-02-12  2:05 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, git

Jonathan Nieder wrote:

> Sometimes output includes some streams of
> null bytes, which makes me suspect something awry in the kernel.

(or maybe that last part is a consequence of using stderr both for
standard, unbuffered output and our debugging output) :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-12  2:03         ` Jonathan Nieder
  2011-02-12  2:05           ` Jonathan Nieder
@ 2011-02-12  5:12           ` Jeff King
  2011-02-12  8:37             ` Jonathan Nieder
  1 sibling, 1 reply; 12+ messages in thread
From: Jeff King @ 2011-02-12  5:12 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Thomas Rast, git

On Fri, Feb 11, 2011 at 08:03:21PM -0600, Jonathan Nieder wrote:

> Nice, I can reproduce this (kernel = Darwin 10.5.0, using Apple's perl
> 5.10.0).  With
> 
> 	while ./test-terminal.perl echo hi >out.$i 2>&1
> 	do
> 		i=$(($i + 1))
> 		echo $i
> 	done

Cool, more narrowing it down is nice. I'm not surprised it's a
test-terminal problem, not a git problem, but it's nice to see
confirmation.

> I get:
> 
> 	$ cat out.28 
> 	child: pumping stdout
> 	child: pumping stderr
> 	hi
> 	child: done pumping stdout

Hmm. So here, the stderr process hangs...

> Instrumenting Copy.pm, I can see that we are stuck in sysread.
> 
> 	31984: child: pumping stdout
> 	31984: enter copy loop.
> 	31984: time to read 4096...
> 	31986: child: pumping stderr
> 	31984: read returns with 4
> 	31984: about to write "hi
> 	"
> 	31984: pump 4 - 0, result = 4
> 	31984: time to read 4096...
> 	31986: enter copy loop.
> 	31986: time to read 4096...
> 	31986: read returns with 0
> 	31986: about to write ""
> 	31986: done!
> 	31986: successful return.
> 	31986: copy returns with 1 errno=
> 	31986: child: done pumping stderr

...but here, the stdout process hangs. Weird.

So I'm not even sure our perl is bad. It's just that for some reason the
pty allocated by IO::Pty doesn't give us EOF sometimes. Hmm. Two curious
things I notice:

  1. In your Copy.pm log above, it says read gives it 4 characters. But
     "hi\n" has only 3. CRLF or some other hidden character?

  2. In test-terminal.perl, start_child clones the $out and $err handles
     (which are the IO::Pty slaves). Then it closes $out, but not $err.
     Why not? It shouldn't matter, since the child process ends up
     dying, which should close all descriptors. And if that were the
     problem, I would expect it to always hang on stderr, but your
     examples show the hang on both descriptors.

     Also, should the child be closing the pty masters? Again, it
     shouldn't make a difference, both because the masters should be
     waiting for the slaves to close to get EOF, and because the child
     process dying should close all descriptors.

     But the first could be explained by some differing bsd pty
     semantics that Linux doesn't have. But for the second, I dunno.

I would first try this patch:

diff --git a/t/test-terminal.perl b/t/test-terminal.perl
index ee01eb9..bfec680 100755
--- a/t/test-terminal.perl
+++ b/t/test-terminal.perl
@@ -15,6 +15,7 @@ sub start_child {
 		open STDOUT, ">&", $out;
 		open STDERR, ">&", $err;
 		close $out;
+		close $err;
 		exec(@$argv) or die "cannot exec '$argv->[0]': $!"
 	}
 	return $pid;

and then try this more drastic one:

diff --git a/t/test-terminal.perl b/t/test-terminal.perl
index ee01eb9..cce7de0 100755
--- a/t/test-terminal.perl
+++ b/t/test-terminal.perl
@@ -12,9 +12,12 @@ sub start_child {
 	if (not defined $pid) {
 		die "fork failed: $!"
 	} elsif ($pid == 0) {
-		open STDOUT, ">&", $out;
-		open STDERR, ">&", $err;
+		open STDOUT, ">&", $out->slave;
+		open STDERR, ">&", $err->slave;
+		close $out->slave;
 		close $out;
+		close $err->slave;
+		close $err;
 		exec(@$argv) or die "cannot exec '$argv->[0]': $!"
 	}
 	return $pid;
@@ -69,7 +72,7 @@ if ($#ARGV < 1) {
 }
 my $master_out = new IO::Pty;
 my $master_err = new IO::Pty;
-my $pid = start_child(\@ARGV, $master_out->slave, $master_err->slave);
+my $pid = start_child(\@ARGV, $master_out, $master_err);
 close $master_out->slave;
 close $master_err->slave;
 copy_stdio($master_out, $master_err);

Also, I don't know what kind of support you have for stuff like lsof,
but in theory we should be able to get a hung process, find the open
descriptor for the pty using lsof, match that descriptor with the other
end of the pty, and then see which processes have that pty still open.

> Redirecting stderr by using 'xsendfile("elsewhere", $err);' avoids
> trouble.

That seems doubly weird, since you are changing the _output_, not the
input. But the input is what is causing the hang.

> Sometimes output includes some streams of null bytes, which makes me
> suspect something awry in the kernel.

Yuck.

-Peff

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-12  5:12           ` Jeff King
@ 2011-02-12  8:37             ` Jonathan Nieder
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Nieder @ 2011-02-12  8:37 UTC (permalink / raw)
  To: Jeff King; +Cc: Thomas Rast, git

Jeff King wrote:

>   1. In your Copy.pm log above, it says read gives it 4 characters. But
>      "hi\n" has only 3.

Yes, it's "hi\r\n".

> I would first try this patch:
[...]
> +++ b/t/test-terminal.perl
> @@ -15,6 +15,7 @@ sub start_child {
>  		open STDOUT, ">&", $out;
>  		open STDERR, ">&", $err;
>  		close $out;
> +		close $err;
>  		exec(@$argv) or die "cannot exec '$argv->[0]': $!"

Good idea.  No change, alas (and likewise with the change to close
the pty master in the child).  It seems I have some reading to do.

Jonathan

> and then try this more drastic one:
[...]
> --- a/t/test-terminal.perl
> +++ b/t/test-terminal.perl
> @@ -12,9 +12,12 @@ sub start_child {
[...]
> @@ -69,7 +72,7 @@ if ($#ARGV < 1) {
>  }
>  my $master_out = new IO::Pty;
>  my $master_err = new IO::Pty;
> -my $pid = start_child(\@ARGV, $master_out->slave, $master_err->slave);
> +my $pid = start_child(\@ARGV, $master_out, $master_err);

Runs through ~1000 iterations instead of 100 before hanging.

> Also, I don't know what kind of support you have for stuff like lsof,
> but in theory we should be able to get a hung process, find the open
> descriptor for the pty using lsof, match that descriptor with the other
> end of the pty, and then see which processes have that pty still open.

Trial 1
~~~~~~~
 PID 49145 (which has successfully pumped stdout):

  0	/dev/ttys001
  1 write-only	out.1707
  2 write-only	out.1707
  3	/dev/ptmx	@ offset 4
  5 write-only	debug.log

 PID 49147 (which is stuck in sysread trying to read stderr):

  0	/dev/ttys001
  1 write-only	out.1707
  2 write-only	out.1707
  5 write-only	debug.log
  6	/dev/ptmx	@ offset 0

Trial 2
~~~~~~~
 PID 51091 (which is stuck in sysread trying to read stdout):

  0	/dev/ttys001
  1 write-only	out.2017
  2 write-only	out.2017
  3	/dev/ptmx	@ offset 4
  5 write-only	debug.log

 PID 591093 (which successfully pumped stderr) is a zombie

(echo was a zombie in both cases.)

>> Redirecting stderr by using 'xsendfile("elsewhere", $err);' avoids
>> trouble.
>
> That seems doubly weird, since you are changing the _output_, not the
> input. But the input is what is causing the hang.

False alarm --- after about 2500 iterations it hangs.  Probably just
changed the timing.

>> Sometimes output includes some streams of null bytes, which makes me
>> suspect something awry in the kernel.
>
> Yuck.

Was my mistake --- apparently I was writing files with holes.  Now I
send debug output to a separate file with O_APPEND and it hasn't
happened again.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: t7006 sometimes hangs in cronjobs on OS X
  2011-02-09 14:38 t7006 sometimes hangs in cronjobs on OS X Thomas Rast
  2011-02-09 18:30 ` Ben Walton
  2011-02-09 19:09 ` Jeff King
@ 2011-02-13  5:31 ` Jonathan Nieder
  2011-02-17 12:38 ` [PATCH] tests: skip terminal output tests " Jonathan Nieder
  3 siblings, 0 replies; 12+ messages in thread
From: Jonathan Nieder @ 2011-02-13  5:31 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Jeff King

Thomas Rast wrote:

> So if you have any idea what the problem could be or what I should be
> looking for, help would be much appreciated...

Still no idea, but I've passed the report on.

https://rt.cpan.org/Ticket/Display.html?id=65692

Thanks, both.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] tests: skip terminal output tests on OS X
  2011-02-09 14:38 t7006 sometimes hangs in cronjobs on OS X Thomas Rast
                   ` (2 preceding siblings ...)
  2011-02-13  5:31 ` Jonathan Nieder
@ 2011-02-17 12:38 ` Jonathan Nieder
  3 siblings, 0 replies; 12+ messages in thread
From: Jonathan Nieder @ 2011-02-17 12:38 UTC (permalink / raw)
  To: Thomas Rast; +Cc: git, Jeff King

On Mac OS X 10.5.0, test_terminal gets stuck reading from the pty
master every once in a while.  To reproduce the problem:

 perl -MIO::Pty -MFile::Copy -e '
	for (my $i = 0;; $i++) {
		my $master = new IO::Pty;
		my $slave = $master->slave;
		if (fork == 0) {
			close $master or die "close: $!";
			open STDOUT, ">&", $slave or die "dup2: $!";
			close $slave or die "close: $!";
			exec("echo", "hi", $i) or die "exec: $!";
		}
		close $slave or die "close: $!";
		copy($master, \*STDOUT) or die "copy: $!";
		close $master or die "close: $!";
		wait;
	}
 '

It blocks after 7000 iterations or so in sysread().  The relevant
sysread() call is the second call by the parent, which presumably
executes before the child dies but after the parent has read all
output from there.

Since this is an intermitent problem, the quick check of terminal
support in lib-terminal doesn't catch it.  Skip these tests on the Mac
for now.

Noticed-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Thanks for a reminder.

I'd be interested to hear whether this happens on other BSDs, too.
It's possible that test_terminal should be using a select() loop
to wait for input to be ready or the child to die instead of
waiting on one while the other happens.

 t/lib-terminal.sh |   22 +++++++++++++++++++---
 1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/t/lib-terminal.sh b/t/lib-terminal.sh
index c383b57..58d911d 100644
--- a/t/lib-terminal.sh
+++ b/t/lib-terminal.sh
@@ -1,8 +1,24 @@
 #!/bin/sh
 
-test_expect_success 'set up terminal for tests' '
-	if
-		test_have_prereq PERL &&
+test_expect_success PERL 'set up terminal for tests' '
+	# Reading from the pty master seems to get stuck _sometimes_
+	# on Mac OS X 10.5.0, using Perl 5.10.0 or 5.8.9.
+	#
+	# Reproduction recipe: run
+	#
+	#	i=0
+	#	while ./test-terminal.perl echo hi $i
+	#	do
+	#		: $((i = $i + 1))
+	#	done
+	#
+	# After 2000 iterations or so it hangs.
+	# https://rt.cpan.org/Ticket/Display.html?id=65692
+	#
+	if test "$(uname -s)" = Darwin
+	then
+		:
+	elif
 		"$PERL_PATH" "$TEST_DIRECTORY"/test-terminal.perl \
 			sh -c "test -t 1 && test -t 2"
 	then
-- 
1.7.4.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2011-02-17 12:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-09 14:38 t7006 sometimes hangs in cronjobs on OS X Thomas Rast
2011-02-09 18:30 ` Ben Walton
2011-02-09 19:09 ` Jeff King
2011-02-09 19:16   ` Thomas Rast
2011-02-09 20:22     ` Thomas Rast
2011-02-09 20:50       ` Jeff King
2011-02-12  2:03         ` Jonathan Nieder
2011-02-12  2:05           ` Jonathan Nieder
2011-02-12  5:12           ` Jeff King
2011-02-12  8:37             ` Jonathan Nieder
2011-02-13  5:31 ` Jonathan Nieder
2011-02-17 12:38 ` [PATCH] tests: skip terminal output tests " Jonathan Nieder

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.