dash.vger.kernel.org archive mirror
* Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
@ 2020-02-06 16:12 Martijn Dekker
  2020-02-06 19:29 ` Harald van Dijk
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Martijn Dekker @ 2020-02-06 16:12 UTC
  To: DASH shell mailing list, busybox,
	Bug reports for the GNU Bourne Again SHell, Robert Elz,
	Jilles Tjoelker, Harald van Dijk

This is probably the strangest bug (or maybe pair of bugs) I've run into 
in nearly five years of breaking shells by developing modernish.

I've traced it to an interaction between bash >= 4.2 (i.e.: bash with 
shopt -s lastpipe) and variants of the Almquist shell, at least: dash, 
gwsh, Busybox ash, FreeBSD sh, and NetBSD 9.0rc2 sh.

Symptom: if 'return' is invoked on bash in the last element of a pipe 
executed in the main shell environment, then if you subsequently 'exec' 
an Almquist shell variant so that it has the same PID, its 'wait' 
builtin breaks.

I can consistently reproduce this on Linux, macOS, FreeBSD, NetBSD 
9.0rc2, OpenBSD, and Solaris.

To reproduce this, you need bash >= 4.2, some Almquist shell variant, 
and these two test scripts:

---begin test.bash---
fn() {
	: | return
}
shopt -s lastpipe || exit
fn
exec "${1:-dash}" test.ash
---end test.bash---

---begin test.ash---
echo '*ash-begin'
: &
echo '*ash-middle'
wait "$!"
echo '*ash-end'
---end test.ash---

When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh, 
then test.ash simply waits forever on executing 'wait "$!"'.

$ bash test.bash <some-almquist-shell>
*ash-begin
*ash-middle
(nothing until ^C)

NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org 
and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh 
(on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

$ bash test.bash /bin/sh
*ash-begin
*ash-middle
[1]   Segmentation fault       bash test.bash sh

I don't know if the different NetBSD sh behaviour is because the older 
NetBSD sh doesn't have the bug, or because some factor on the sdf*.org 
systems causes it to not be triggered.

To me, this smells like the use of some uninitialised value on various 
Almquist shells. Tracing that is beyond my expertise though.

Whether this also represents a bug in bash or not, I can't say. But no 
other shell that I've found triggers this, not even ksh93 and zsh, which 
also execute the last element of a pipe in the main shell environment.

- Martijn

-- 
modernish -- harness the shell
https://github.com/modernish/modernish


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-06 16:12 Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait' Martijn Dekker
@ 2020-02-06 19:29 ` Harald van Dijk
  2020-02-07 11:19   ` AW: " Walter Harms
  2020-02-07 16:16   ` Chet Ramey
  2020-02-06 20:43 ` Robert Elz
  2020-02-07  2:41 ` Robert Elz
  2 siblings, 2 replies; 12+ messages in thread
From: Harald van Dijk @ 2020-02-06 19:29 UTC
  To: Martijn Dekker, DASH shell mailing list, busybox,
	Bug reports for the GNU Bourne Again SHell, Robert Elz,
	Jilles Tjoelker

On 06/02/2020 16:12, Martijn Dekker wrote:
> This is probably the strangest bug (or maybe pair of bugs) I've run into 
> in nearly five years of breaking shells by developing modernish.
> 
> I've traced it to an interaction between bash >= 4.2 (i.e.: bash with 
> shopt -s lastpipe) and variants of the Almquist shell, at least: dash, 
> gwsh, Busybox ash, FreeBSD sh, and NetBSD 9.0rc2 sh.
> 
> Symptom: if 'return' is invoked on bash in the last element of a pipe 
> executed in the main shell environment, then if you subsequently 'exec' 
> an Almquist shell variant so that it has the same PID, its 'wait' 
> builtin breaks.
> 
> I can consistently reproduce this on Linux, macOS, FreeBSD, NetBSD 
> 9.0rc2, OpenBSD, and Solaris.
> 
> To reproduce this, you need bash >= 4.2, some Almquist shell variant, 
> and these two test scripts:
> 
> ---begin test.bash---
> fn() {
>      : | return
> }
> shopt -s lastpipe || exit
> fn
> exec "${1:-dash}" test.ash
> ---end test.bash---
> 
> ---begin test.ash---
> echo '*ash-begin'
> : &
> echo '*ash-middle'
> wait "$!"
> echo '*ash-end'
> ---end test.ash---
> 
> When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh, 
> then test.ash simply waits forever on executing 'wait "$!"'.

Nice test. bash leaves the process in a state where SIGCHLD is blocked, 
and the various ash-based shells do not unblock it. Because of that, 
they do not pick up on the fact that the child process has terminated. I 
would consider this a bug both in bash and in the ash-based shells.

Cheers,
Harald van Dijk
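
The diagnosis above can be checked from a script. What follows is a minimal sketch, assuming a Linux system with /proc and SIGCHLD numbered 17 (true on x86 and ARM Linux, not everywhere). The helper names mask_has_signal and sigchld_blocked are hypothetical and not part of any shell.

```shell
# Hypothetical helper: succeed if signal number $2 is set in the
# hexadecimal signal mask $1 (bit signo-1 stands for signal signo).
mask_has_signal() {
	[ $(( (0x$1 >> ($2 - 1)) & 1 )) -eq 1 ]
}

# Hypothetical helper: succeed if SIGCHLD is blocked in process $1.
# Assumes Linux /proc and SIGCHLD == 17; returns 2 if the mask
# cannot be read.
sigchld_blocked() {
	# Keep only the low 32 bits (last 8 hex digits) of the SigBlk field.
	mask=$(awk '/^SigBlk:/ { print substr($2, length($2) - 7) }' "/proc/$1/status")
	[ -n "$mask" ] || return 2
	mask_has_signal "$mask" 17
}
```

Putting something like `sigchld_blocked "$$" && echo 'SIGCHLD is blocked'` at the top of test.ash should show whether the exec'd shell started with SIGCHLD blocked.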


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-06 16:12 Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait' Martijn Dekker
  2020-02-06 19:29 ` Harald van Dijk
@ 2020-02-06 20:43 ` Robert Elz
  2020-02-07  2:41 ` Robert Elz
  2 siblings, 0 replies; 12+ messages in thread
From: Robert Elz @ 2020-02-06 20:43 UTC
  To: Harald van Dijk
  Cc: Martijn Dekker, DASH shell mailing list, busybox,
	Bug reports for the GNU Bourne Again SHell, Jilles Tjoelker

    Date:        Thu, 6 Feb 2020 19:29:41 +0000
    From:        Harald van Dijk <harald@gigawatt.nl>
    Message-ID:  <f8b210f5-dd59-2c7f-05d4-be0a89316d3d@gigawatt.nl>

  | Nice test.

Yes!

  | and the various ash-based shells do not unblock it.

We do now, the fix for that will be in 9.0 when it is released.
("now" as in as of the past half hour...)

  | Because of that, 
  | they do not pick up on the fact that the child process has terminated.

It was actually a race condition; for me it 'worked' about half the time
(it seems to depend on whether the wait happens in the parent before or
after the sub-process exits).

kre

ps: that core dump was an "impossible to happen" condition that this
actually made happen, that will be fixed as well, both by actually now
making it impossible like it was supposed to be (by not blocking or
ignoring SIGCHLD, ever) and by testing for it happening anyway...

The secondary fix for that one is still to be committed after I investigate
some more - I know what happened, just need to make sure what will happen
now if this situation which should never occur ever does happen again.

That the 8.1 NetBSD sh seems to work is more just an artifact of how
it runs the race I believe (or guess) - the wait & process invocation code
has changed a lot in 9 (well, 9.0RC2 for now) which seems to have made the
race a close call, instead of one sided.   But that was not an artifact of
the environment for the test, it happens for me on a real -8(ish) type
system as well.

kre


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-06 16:12 Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait' Martijn Dekker
  2020-02-06 19:29 ` Harald van Dijk
  2020-02-06 20:43 ` Robert Elz
@ 2020-02-07  2:41 ` Robert Elz
  2020-02-08 18:39   ` Harald van Dijk
  2 siblings, 1 reply; 12+ messages in thread
From: Robert Elz @ 2020-02-07  2:41 UTC
  To: Martijn Dekker
  Cc: busybox, Harald van Dijk, DASH shell mailing list,
	Jilles Tjoelker, Bug reports for the GNU Bourne Again SHell

    Date:        Thu, 6 Feb 2020 16:12:06 +0000
    From:        Martijn Dekker <martijn@inlv.org>
    Message-ID:  <10e3756b-5e8f-ba00-df0d-b36c93fa2281@inlv.org>

  | NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org 
  | and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh 
  | (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

I have updated my opinion on that, I think it is "don't have the bug",
though it is possible a blocked SIGCHLD acts differently on NetBSD than
on other systems.   On NetBSD it seems to affect nothing (the shell does
not rely upon receiving SIGCHLD so not getting it is irrelevant) and
the wait code when given an arg (as your script did) would always wait
until that process exited, and return as soon as it did.

None of that is changed in -9 ... but the wait command now has -n, which
also works with a list of pids, and while waiting for any process in its
list to exit, gets told each time a process is reaped (from lower level
code) which job that process was from (new code of mine) so it can see if
the process that completed finished one of the jobs for which it is waiting.
I wasn't expecting to see exiting children that are not the shell's children,
which is what happens here - the
	: | return
creates a child (of bash) to run the ':' command, then the function
returns without waiting for that one.  You then exec the NetBSD shell,
which inherits that child (a child of the same process) but is unaware of
it.   If that one happens to exit while the ash script running on the
NetBSD sh is doing the wait command, core would dump.   (Fix for that is
now in the tree).   If the bash invoked ':' command exited some other time
and was noticed (eg: between commands) as having finished, it would simply
have been ignored.   I saw both happen.

kre
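
The inheritance described above (a child started before exec stays a child of the same PID afterwards) can be observed with a small sketch. It assumes Linux /proc for reading the parent PID; the names inner and inherited_child_demo are hypothetical, and sleep merely stands in for the ':' that bash left behind.

```shell
# Script for the shell that takes over after exec. It compares the parent
# PID of the inherited process (read from /proc, a Linux assumption) with
# its own PID.
inner='
ppid=$(awk "/^PPid:/ { print \$2 }" "/proc/$child/status")
if [ "$ppid" = "$$" ]; then echo inherited; else echo "not inherited"; fi
'
export inner

# Hypothetical demo: the first shell starts a background child and then
# replaces itself with a second shell instance that has the same PID.
inherited_child_demo() {
	sh -c '
		sleep 1 >/dev/null 2>&1 &
		child=$!
		export child
		exec sh -c "$inner"
	'
}
```

Running inherited_child_demo should print "inherited": the kernel still lists the second shell as the parent of the sleep process, although that shell never started it and has no job entry for it.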


* AW: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-06 19:29 ` Harald van Dijk
@ 2020-02-07 11:19   ` Walter Harms
  2020-02-07 14:33     ` Martijn Dekker
  2020-02-07 16:16   ` Chet Ramey
  1 sibling, 1 reply; 12+ messages in thread
From: Walter Harms @ 2020-02-07 11:19 UTC
  To: Harald van Dijk, Martijn Dekker, DASH shell mailing list,
	busybox, Bug reports for the GNU Bourne Again SHell, Robert Elz,
	Jilles Tjoelker

IMHO the bug is on the bash side. ash can assume it gets a "healthy" environment from the caller. You simply cannot fix everything that can possibly go wrong.

Obviously it should not segfault, but as far as I understand, it is BSD ash that does, not busybox ash.

re,
 wh


* Re: AW: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-07 11:19   ` AW: " Walter Harms
@ 2020-02-07 14:33     ` Martijn Dekker
  0 siblings, 0 replies; 12+ messages in thread
From: Martijn Dekker @ 2020-02-07 14:33 UTC
  To: Walter Harms, Harald van Dijk, DASH shell mailing list, busybox,
	Bug reports for the GNU Bourne Again SHell, Robert Elz,
	Jilles Tjoelker

On 07-02-20 at 12:19, Walter Harms wrote:
> IMHO the bug is on the bash side. ash can assume it gets a "healthy"
> environment from the caller. You simply cannot fix everything that
> can possibly go wrong.

That is a rather fallacious argument. Of course you cannot fix 
*everything* that could possibly go wrong. You can certainly fix *this* 
thing, though. I know, because every non-Almquist shell does it.

These days, no program can realistically assume a "healthy" environment. 
Computers have become unimaginably complex machines, built on thousands 
of interdependent abstraction layers, each as fallible as the humans 
that designed and implemented them. So "unhealthy" environments happen 
all the time, due to all sorts of unforeseen causes.

It's well past time to accept that the 1980s are behind us. In 2020, 
systems have to be programmed robustly and defensively.

> Obviously it should not segfault, but as far as I understand, it is
> BSD ash that does, not busybox ash.

True. But instead, it simply gets stuck forever, with no message or 
other indicator of what went wrong. How is that better?

(Going slightly off-topic below...)

Segfaulting is actually a good thing: it's one form of failing reliably. 
And failing reliably is vastly better than what often happens instead, 
especially in shell scripts: subtle breakage, which can take a lot of 
detective work to trace, and in some cases can cause serious damage due 
to the program functioning inconsistently and incorrectly (instead of 
not at all).

Failing reliably is something the shell is ATROCIOUSLY bad at, and it's 
one of the first things modernish aims to fix.

- Martijn

-- 
modernish -- harness the shell
https://github.com/modernish/modernish


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-06 19:29 ` Harald van Dijk
  2020-02-07 11:19   ` AW: " Walter Harms
@ 2020-02-07 16:16   ` Chet Ramey
  1 sibling, 0 replies; 12+ messages in thread
From: Chet Ramey @ 2020-02-07 16:16 UTC
  To: Harald van Dijk, Martijn Dekker, DASH shell mailing list,
	busybox, Bug reports for the GNU Bourne Again SHell, Robert Elz,
	Jilles Tjoelker
  Cc: chet.ramey

On 2/6/20 2:29 PM, Harald van Dijk wrote:
> On 06/02/2020 16:12, Martijn Dekker wrote:

>> When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh,
>> then test.ash simply waits forever on executing 'wait "$!"'.
> 
> Nice test. bash leaves the process in a state where SIGCHLD is blocked, and
> the various ash-based shells do not unblock it. 

Thanks for the investigation. Bash does leave SIGCHLD blocked in this exact
set of circumstances (lastpipe+function+return at end of pipeline+exec).

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
		 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-07  2:41 ` Robert Elz
@ 2020-02-08 18:39   ` Harald van Dijk
  2020-02-09 19:03     ` Jilles Tjoelker
                       ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Harald van Dijk @ 2020-02-08 18:39 UTC
  To: Robert Elz, Martijn Dekker
  Cc: DASH shell mailing list, busybox,
	Bug reports for the GNU Bourne Again SHell, Jilles Tjoelker

On 07/02/2020 02:41, Robert Elz wrote:
>      Date:        Thu, 6 Feb 2020 16:12:06 +0000
>      From:        Martijn Dekker <martijn@inlv.org>
>      Message-ID:  <10e3756b-5e8f-ba00-df0d-b36c93fa2281@inlv.org>
> 
>    | NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
>    | and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
>    | (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:
> 
> I have updated my opinion on that, I think it is "don't have the bug",
> though it is possible a blocked SIGCHLD acts differently on NetBSD than
> on other systems.   On NetBSD it seems to affect nothing (the shell does
> not rely upon receiving SIGCHLD so not getting it is irrelevant) and
> the wait code when given an arg (as your script did) would always wait
> until that process exited, and return as soon as it did.

I think you're right that this isn't SIGCHLD behaving differently on 
NetBSD, it's that NetBSD sh does not have the same problem the other 
ash-based shells do. The problem is with sigsuspend, which in dash looks 
like:

> 		sigblockall(&oldmask);
> 
> 		while (!gotsigchld && !pending_sig)
> 			sigsuspend(&oldmask);
> 
> 		sigclearmask();

<https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/jobs.c?id=f30bd155ccbc3f084bbf03d56f9cc43f4b02af2a#n1170>

This clearly cannot work when oldmask blocks SIGCHLD.

NetBSD sh does not use sigsuspend here, so avoids that problem.

I changed gwsh to call sigclearmask() on shell startup, but plan to 
check whether this loop is really necessary at some later time. It was 
added to dash to fix a race condition, where that race condition was 
apparently introduced by a fix for another race condition. If NetBSD sh 
manages to avoid this pattern, and assuming NetBSD sh is not still 
susceptible to one of those race conditions, the fix for it in the other 
shells would seem to be more complicated than necessary, and simplifying 
things would be good.

Cheers,
Harald van Dijk


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-08 18:39   ` Harald van Dijk
@ 2020-02-09 19:03     ` Jilles Tjoelker
  2020-02-18 16:46     ` Denys Vlasenko
  2020-02-18 18:17     ` Robert Elz
  2 siblings, 0 replies; 12+ messages in thread
From: Jilles Tjoelker @ 2020-02-09 19:03 UTC
  To: Harald van Dijk
  Cc: busybox, Martijn Dekker, DASH shell mailing list, Robert Elz,
	Bug reports for the GNU Bourne Again SHell

On Sat, Feb 08, 2020 at 06:39:38PM +0000, Harald van Dijk wrote:
> On 07/02/2020 02:41, Robert Elz wrote:
> >      Date:        Thu, 6 Feb 2020 16:12:06 +0000
> >      From:        Martijn Dekker <martijn@inlv.org>
> >      Message-ID:  <10e3756b-5e8f-ba00-df0d-b36c93fa2281@inlv.org>

> >    | NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
> >    | and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
> >    | (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

> > I have updated my opinion on that, I think it is "don't have the bug",
> > though it is possible a blocked SIGCHLD acts differently on NetBSD than
> > on other systems.   On NetBSD it seems to affect nothing (the shell does
> > not rely upon receiving SIGCHLD so not getting it is irrelevant) and
> > the wait code when given an arg (as your script did) would always wait
> > until that process exited, and return as soon as it did.

> I think you're right that this isn't SIGCHLD behaving differently on NetBSD,
> it's that NetBSD sh does not have the same problem the other ash-based
> shells do. The problem is with sigsuspend, which in dash looks like:

> > 		sigblockall(&oldmask);
> > 
> > 		while (!gotsigchld && !pending_sig)
> > 			sigsuspend(&oldmask);
> > 
> > 		sigclearmask();

> <https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/jobs.c?id=f30bd155ccbc3f084bbf03d56f9cc43f4b02af2a#n1170>

> This clearly cannot work when oldmask blocks SIGCHLD.

> NetBSD sh does not use sigsuspend here, so avoids that problem.

> I changed gwsh to call sigclearmask() on shell startup, but plan to check
> whether this loop is really necessary at some later time. It was added to
> dash to fix a race condition, where that race condition was apparently
> introduced by a fix for another race condition. If NetBSD sh manages to
> avoid this pattern, and assuming NetBSD sh is not still susceptible to one
> of those race conditions, the fix for it in the other shells would seem to
> be more complicated than necessary, and simplifying things would be good.

I have not tested whether the bug actually happens in NetBSD sh but I
think the complexity is necessary. The problem is that the wait builtin
must wait for either process termination or a signal, and relying on an
[EINTR] error return to abort a blocking waitpid() or similar leaves a
window where a signal could come in, after which the program goes to sleep.

In a script this could look like

trap 'echo cleaning up; exit' TERM
slow_process_1 &
slow_process_2 &
wait

and if a TERM signal comes in just before the wait system call is
invoked, the signal handler sets a flag but the trap is not taken until
a process terminates or another signal comes in.

FreeBSD sh also has a -T flag that causes traps to be taken immediately
while waiting for a process to terminate. This has the same issue with
waiting for process termination or a signal.

There are various solutions here:

* Make sure SIGCHLD is caught, reducing the problem to waiting for
  signals only. This can then be done using sigsuspend() or sigwait().

  Most ash variants that have closed this race window have chosen this
  option.

  The SIGCHLD handler could be installed globally or only for the
  duration of the wait builtin.

* Call longjmp() from the signal handler. The blocking wait will have to
  be changed to waitid() with WNOWAIT so no exit statuses are lost when
  a signal comes in just after waitid() returns.

  Note that ash variants already call longjmp() from a SIGINT signal
  handler in certain situations in interactive mode, so it is not a
  really strange thing to do.

* Use musl's solution for [EINTR] in the context of pthread
  cancellation, checking the saved program counter when a signal
  arrives. Although theoretically portable, it requires writing
  architecture-specific code in practice.

* Use FreeBSD libthr's solution for [EINTR] in the context of pthread
  cancellation, asking the kernel to abort the next blocking system call
  with [EINTR] immediately from the signal handler. This is not portable
  to other kernels.

-- 
Jilles Tjoelker


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-08 18:39   ` Harald van Dijk
  2020-02-09 19:03     ` Jilles Tjoelker
@ 2020-02-18 16:46     ` Denys Vlasenko
  2020-02-18 21:59       ` Harald van Dijk
  2020-02-18 18:17     ` Robert Elz
  2 siblings, 1 reply; 12+ messages in thread
From: Denys Vlasenko @ 2020-02-18 16:46 UTC
  To: Harald van Dijk
  Cc: Robert Elz, Martijn Dekker, busybox, DASH shell mailing list,
	Jilles Tjoelker, Bug reports for the GNU Bourne Again SHell

On Sat, Feb 8, 2020 at 7:41 PM Harald van Dijk <harald@gigawatt.nl> wrote:
> I changed gwsh to call sigclearmask() on shell startup, but plan to
> check whether this loop is really necessary at some later time. It was
> added to dash to fix a race condition, where that race condition was
> apparently introduced by a fix for another race condition.

sigsuspend() is needed to make the "wait" builtin interruptible by signals.

Attempts to use EINTR error return of waitpid() a-la:

               if (got_sigs) { /* handle signals */ }
               got_sigs = 0;
               /* a signal arriving here sets got_sigs, but waitpid() still blocks */
               pid = waitpid(...);  /* without WNOHANG */
               if (pid < 0 && errno == EINTR) { /* handle signals */ }

are racy, since signals can be delivered not only while waitpid() syscall
is in kernel, but also when we are only about to enter the kernel
- and in this case, the shell's sighandler will set the flag variable,
but then we enter the kernel and sleep.

Masking signals doesn't help, since you need to unmask them just before
waitpid() if you want to get EINTR on a signal, hence there is still
a window for the race.

> If NetBSD sh
> manages to avoid this pattern, and assuming NetBSD sh is not still
> susceptible to one of those race conditions

Please let us know what you discovered.


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-08 18:39   ` Harald van Dijk
  2020-02-09 19:03     ` Jilles Tjoelker
  2020-02-18 16:46     ` Denys Vlasenko
@ 2020-02-18 18:17     ` Robert Elz
  2 siblings, 0 replies; 12+ messages in thread
From: Robert Elz @ 2020-02-18 18:17 UTC
  To: Denys Vlasenko
  Cc: Jilles Tjoelker, busybox, DASH shell mailing list,
	Harald van Dijk, Bug reports for the GNU Bourne Again SHell,
	Martijn Dekker

    Date:        Tue, 18 Feb 2020 17:46:23 +0100
    From:        Denys Vlasenko <vda.linux@googlemail.com>
    Message-ID:  <CAK1hOcO_S_T=5SWJ0jpZWxDYwdUFqJisw_nC+JysnQvZ6XUuKw@mail.gmail.com>

  | > If NetBSD sh
  | > manages to avoid this pattern, and assuming NetBSD sh is not still
  | > susceptible to one of those race conditions
  |
  | Please let us know what you discovered.

It is very likely that it is racy as described, though no-one has ever
filed a bug report on it (ie: it hasn't happened to anyone in a way that
they'd complain about it).

I suspect it also isn't a conformance problem - POSIX says very
little about when traps are executed ... really only that they don't
interrupt waiting for a foreground command to complete, and that if a
trap occurs while waiting in the wait command, then that command ends
with an exit status indicating the signal.

What that means is that using traps for anything much more than cleanup
activities isn't really safe (or perhaps, s/safe/sane/) as there's no
guarantee when the trap will actually run.

Given that, losing the race in the situation cited (ie: getting the
signal just before running the waitpid() (or whichever) sys call when
implementing the wait command, and then going ahead and doing the
sys call, hanging until some process terminates (perhaps until a particular
process terminates)) seems fully conformant to me (the signal doesn't
arrive while waiting, so no error return from wait is required).

It isn't nice, and ideally wouldn't happen (and in real life, seems
not to ... the window is quite small after all) but nothing should really
break badly because of it - or at least nothing portable should.

We do now unilaterally reset SIGCHLD to SIG_DFL/unblocked at startup
(SIGCHLD is the one signal we're not required to pass on to exec'd
processes in the same state we received it, so that's OK) so we could
adopt the block, catch SIGCHLD, and sigsuspend() approach if that ever
seemed like a necessary thing to do.

kre

ps: the observed core dump problem is also fixed, that was a related,
but quite different, issue - not connected to SIGCHLD in any way at all.


* Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
  2020-02-18 16:46     ` Denys Vlasenko
@ 2020-02-18 21:59       ` Harald van Dijk
  0 siblings, 0 replies; 12+ messages in thread
From: Harald van Dijk @ 2020-02-18 21:59 UTC
  To: Denys Vlasenko
  Cc: Robert Elz, Martijn Dekker, busybox, DASH shell mailing list,
	Jilles Tjoelker, Bug reports for the GNU Bourne Again SHell

On 18/02/2020 16:46, Denys Vlasenko wrote:
> On Sat, Feb 8, 2020 at 7:41 PM Harald van Dijk <harald@gigawatt.nl> wrote:
>> If NetBSD sh
>> manages to avoid this pattern, and assuming NetBSD sh is not still
>> susceptible to one of those race conditions
> 
> Please let us know what you discovered.

Okay, please take a look. I hope I managed to avoid race conditions in 
the test shell script.

test1.sh:

   i=1
   while test "$i" -lt 100000
   do
     printf "%d\r" "$i"
     "$@" test2.sh
     i=$((i + 1))
   done

test2.sh:

   trap 'kill $!; exit 0' TERM
   { kill $$; exec sleep 1000; } &
   wait $!

To run:

   sh test1.sh $shell

For instance:

   sh test1.sh busybox ash

test1.sh will repeatedly run test2.sh and increment and print a counter 
variable to display progress.

test2.sh will immediately exit, in a complicated way, if all goes well. 
It may sleep for 1000s or fail to clean up its background process if 
something goes wrong.

On my system, I see:

   bash 5.0.11        - sleeps after a while
   bosh 2019-11-11    - sleeps after a while
   busybox 1.31.1 ash - ok
   dash 0.5.10.2      - ok
   dash (current)     - sleeps immediately
   fbsh 12.1          - ok *
   gwsh (current)     - leaves subprocesses
   ksh 93v            - sleeps after a while
   ksh 2020.0.0       - sleeps after a while
   mksh 57            - sleeps after a while
   nbsh (current)     - sleeps after a while *
   pdksh 5.2.14       - leaves subprocesses + sleeps after a while
   posh 0.13.1        - ok
   yash               - ok
   zsh                - sleeps after a while

* Because of the way I was running FreeBSD sh and NetBSD sh on qemu, I 
could not easily check what happens to the subprocesses.

I think that confirms that NetBSD sh does have a problem with a race 
condition, but that many shells have that same problem. It also tells me 
that there is another different problem in my shell that I should look at.

Cheers,
Harald van Dijk
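
Spotting the "sleeps" outcome in the test1.sh loop above relies on watching the counter stall. A possible way to detect it automatically is sketched below, assuming GNU coreutils timeout(1), which exits with status 124 when the deadline expires (other timeout implementations may behave differently). The helper name classify_run is hypothetical.

```shell
# Hypothetical helper: run a command with a deadline of $1 seconds and
# print "sleeps" if the deadline expired, "ok" otherwise.
# Assumes GNU coreutils timeout(1) and its exit status 124 on expiry.
classify_run() {
	deadline=$1
	shift
	status=0
	timeout "$deadline" "$@" || status=$?
	case $status in
	124) echo sleeps ;;
	*)   echo ok ;;
	esac
}
```

In test1.sh, the line `"$@" test2.sh` could then become `classify_run 10 "$@" test2.sh`, with the 10-second deadline being an arbitrary choice.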
