This is probably the strangest bug (or maybe pair of bugs) I've run into
in nearly five years of breaking shells by developing modernish.

I've traced it to an interaction between bash >= 4.2 (i.e.: bash with
shopt -s lastpipe) and variants of the Almquist shell, at least: dash,
gwsh, Busybox ash, FreeBSD sh, and NetBSD 9.0rc2 sh.

Symptom: if 'return' is invoked on bash in the last element of a pipe
executed in the main shell environment, then if you subsequently 'exec'
an Almquist shell variant so that it has the same PID, its 'wait'
builtin breaks.

I can consistently reproduce this on Linux, macOS, FreeBSD, NetBSD
9.0rc2, OpenBSD, and Solaris.

To reproduce this, you need bash >= 4.2, some Almquist shell variant,
and these two test scripts:

---begin test.bash---
fn() {
    : | return
}
shopt -s lastpipe || exit
fn
exec "${1:-dash}" test.ash
---end test.bash---

---begin test.ash---
echo '*ash-begin'
: &
echo '*ash-middle'
wait "$!"
echo '*ash-end'
---end test.ash---

When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh,
then test.ash simply waits forever on executing 'wait "$!"'.

$ bash test.bash <some-almquist-shell>
*ash-begin
*ash-middle
(nothing until ^C)

NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
(on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

$ bash test.bash /bin/sh
*ash-begin
*ash-middle
[1]   Segmentation fault      bash test.bash sh

I don't know if the different NetBSD sh behaviour is because the older
NetBSD sh doesn't have the bug, or because some factor on the sdf*.org
systems causes it to not be triggered.

To me, this smells like the use of some uninitialised value on various
Almquist shells. Tracing that is beyond my expertise though.

Whether this also represents a bug in bash or not, I can't say. But no
other shells trigger this that I've found, not even ksh93 and zsh which
also execute the last element of a pipe in the main shell environment.

- Martijn

--
modernish -- harness the shell
https://github.com/modernish/modernish
On 06/02/2020 16:12, Martijn Dekker wrote:
> This is probably the strangest bug (or maybe pair of bugs) I've run into
> in nearly five years of breaking shells by developing modernish.
>
> I've traced it to an interaction between bash >= 4.2 (i.e.: bash with
> shopt -s lastpipe) and variants of the Almquist shell, at least: dash,
> gwsh, Busybox ash, FreeBSD sh, and NetBSD 9.0rc2 sh.
>
> Symptom: if 'return' is invoked on bash in the last element of a pipe
> executed in the main shell environment, then if you subsequently 'exec'
> an Almquist shell variant so that it has the same PID, its 'wait'
> builtin breaks.
>
> I can consistently reproduce this on Linux, macOS, FreeBSD, NetBSD
> 9.0rc2, OpenBSD, and Solaris.
>
> To reproduce this, you need bash >= 4.2, some Almquist shell variant,
> and these two test scripts:
>
> ---begin test.bash---
> fn() {
> : | return
> }
> shopt -s lastpipe || exit
> fn
> exec "${1:-dash}" test.ash
> ---end test.bash---
>
> ---begin test.ash---
> echo '*ash-begin'
> : &
> echo '*ash-middle'
> wait "$!"
> echo '*ash-end'
> ---end test.ash---
>
> When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh,
> then test.ash simply waits forever on executing 'wait "$!"'.
Nice test. bash leaves the process in a state where SIGCHLD is blocked,
and the various ash-based shells do not unblock it. Because of that,
they do not pick up on the fact that the child process has terminated. I
would consider this a bug both in bash and in the ash-based shells.
Cheers,
Harald van Dijk
Date: Thu, 6 Feb 2020 19:29:41 +0000
From: Harald van Dijk <harald@gigawatt.nl>
Message-ID: <f8b210f5-dd59-2c7f-05d4-be0a89316d3d@gigawatt.nl>

| Nice test.

Yes!

| and the various ash-based shells do not unblock it.

We do now, the fix for that will be in 9.0 when it is released.
("now" as in as of the past half hour...)

| Because of that,
| they do not pick up on the fact that the child process has terminated.

It was actually a race condition, for me it 'worked' about half the time
(seems to depend whether the wait happens in the parent before or after
the sub-process exits).

kre

ps: that core dump was an "impossible to happen" condition that this
actually made happen, that will be fixed as well, both by actually now
making it impossible like it was supposed to be (by not blocking or
ignoring SIGCHLD, ever) and by testing for it happening anyway...
The secondary fix for that one is still to be committed after I
investigate some more - I know what happened, just need to make sure
what will happen now if this situation which should never occur ever
does happen again.

That the 8.1 NetBSD sh seems to work is more just an artifact of how
it runs the race I believe (or guess) - the wait & process invocation
code has changed a lot in 9 (well, 9.0RC2 for now) which seems to have
made the race a close call, instead of one sided. But that was not
an artifact of the environment for the test, it happens for me on a
real -8(ish) type system as well.

kre
Date: Thu, 6 Feb 2020 16:12:06 +0000
From: Martijn Dekker <martijn@inlv.org>
Message-ID: <10e3756b-5e8f-ba00-df0d-b36c93fa2281@inlv.org>

| NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
| and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
| (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

I have updated my opinion on that, I think it is "don't have the bug",
though it is possible a blocked SIGCHLD acts differently on NetBSD than
on other systems. On NetBSD it seems to affect nothing (the shell does
not rely upon receiving SIGCHLD so not getting it is irrelevant) and
the wait code when given an arg (as your script did) would always wait
until that process exited, and return as soon as it did.

None of that is changed in -9 ... but the wait command now has -n, which
also works with a list of pids, and while waiting for any process in its
list to exit, gets told each time a process is reaped (from lower level
code) which job that process was from (new code of mine) so it can see
if the process that completed finished one of the jobs for which it is
waiting.

I wasn't expecting to see exiting children that are not the shell's
children, which is what happens here - the
    : | return
creates a child (of bash) to run the ':' command, then the function
returns without waiting for that one. You then exec the NetBSD shell,
which inherits that child (a child of the same process) but is unaware
of it. If that one happens to exit while the ash script running on the
NetBSD sh is doing the wait command, core would dump. (Fix for that is
now in the tree).

If the bash invoked ':' command exited some other time and was noticed
(eg: between commands) as having finished, it would simply have been
ignored. I saw both happen.

kre
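The inherited-child situation described above can be handled defensively by treating any reaped pid that is not in the job table as a stray to be ignored. The following is a minimal sketch of that idea only; it is not the NetBSD sh fix, and reap_all(), is_known() and the flat pid array standing in for a job table are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if pid is one of the children this process started itself. */
static int is_known(pid_t pid, const pid_t *known, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (known[i] == pid)
            return 1;
    return 0;
}

/* Reap every remaining child. Children listed in known[] are counted
 * in the return value; any other pid (for example one inherited
 * across exec) is counted in *strays and otherwise ignored.
 * Returns -1 on an unexpected waitpid() error. (Hypothetical helper.) */
int reap_all(const pid_t *known, size_t n, int *strays)
{
    int reaped = 0;
    int status;
    pid_t pid;

    assert(strays != NULL);
    *strays = 0;
    for (;;) {
        pid = waitpid(-1, &status, 0);
        if (pid < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, try again */
            /* ECHILD means no children are left to wait for. */
            return errno == ECHILD ? reaped : -1;
        }
        if (is_known(pid, known, n))
            reaped++;
        else
            (*strays)++;                /* not ours: note it, move on */
    }
}
```

The point of the design is that an unknown pid is an expected input rather than an impossible one, so the lookup failing is a normal branch and not an assertion.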
IMHO the bug is on the bash side. ash can assume that it gets a "healthy" environment from the caller. You simply cannot fix everything that can possibly go wrong.
Obviously it should not segfault, but as far as I understand it is BSD ash that does that, not busybox ash.
re,
wh
________________________________________
From: busybox <busybox-bounces@busybox.net> on behalf of Harald van Dijk <harald@gigawatt.nl>
Sent: Thursday, 6 February 2020 20:29
To: Martijn Dekker; DASH shell mailing list; busybox; Bug reports for the GNU Bourne Again SHell; Robert Elz; Jilles Tjoelker
Subject: Re: Bizarre interaction bug involving bash w/ lastpipe + Almquist 'wait'
On 07-02-20 at 12:19, Walter Harms wrote:
> IMHO the bug is on the bash side. ash can assume that it gets a
> "healthy" environment from the caller. You simply cannot fix
> everything that can possibly go wrong.

That is a rather fallacious argument. Of course you cannot fix
*everything* that could possibly go wrong. You can certainly fix *this*
thing, though. I know, because every non-Almquist shell does it.

These days, no program can realistically assume a "healthy" environment.
Computers have become unimaginably complex machines, built on thousands
of interdependent abstraction layers, each as fallible as the humans
that designed and implemented them. So "unhealthy" environments happen
all the time, due to all sorts of unforeseen causes.

It's well past time to accept that the 1980s are behind us. In 2020,
systems have to be programmed robustly and defensively.

> Obviously it should not segfault, but as far as I understand it is
> BSD ash that does that, not busybox ash.

True. But instead, it simply gets stuck forever, with no message or
other indicator of what went wrong. How is that better?

(Going slightly off-topic below...)

Segfaulting is actually a good thing: it's one form of failing reliably.
And failing reliably is vastly better than what often happens instead,
especially in shell scripts: subtle breakage, which can take a lot of
detective work to trace, and in some cases can cause serious damage due
to the program functioning inconsistently and incorrectly (instead of
not at all).

Failing reliably is something the shell is ATROCIOUSLY bad at, and it's
one of the first things modernish aims to fix.

- Martijn

--
modernish -- harness the shell
https://github.com/modernish/modernish
On 2/6/20 2:29 PM, Harald van Dijk wrote:
> On 06/02/2020 16:12, Martijn Dekker wrote:
>> When executing test.bash with dash, gwsh, Busybox ash, or FreeBSD sh,
>> then test.ash simply waits forever on executing 'wait "$!"'.
>
> Nice test. bash leaves the process in a state where SIGCHLD is blocked, and
> the various ash-based shells do not unblock it.

Thanks for the investigation. Bash does leave SIGCHLD blocked in this exact
set of circumstances (lastpipe+function+return at end of pipeline+exec).

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/
On 07/02/2020 02:41, Robert Elz wrote:
> Date: Thu, 6 Feb 2020 16:12:06 +0000
> From: Martijn Dekker <martijn@inlv.org>
> Message-ID: <10e3756b-5e8f-ba00-df0d-b36c93fa2281@inlv.org>
>
> | NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
> | and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
> | (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:
>
> I have updated my opinion on that, I think it is "don't have the bug",
> though it is possible a blocked SIGCHLD acts differently on NetBSD than
> on other systems. On NetBSD it seems to affect nothing (the shell does
> not rely upon receiving SIGCHLD so not getting it is irrelevant) and
> the wait code when given an arg (as your script did) would always wait
> until that process exited, and return as soon as it did.

I think you're right that this isn't SIGCHLD behaving differently on
NetBSD, it's that NetBSD sh does not have the same problem the other
ash-based shells do. The problem is with sigsuspend, which in dash looks
like:

> sigblockall(&oldmask);
>
> while (!gotsigchld && !pending_sig)
>         sigsuspend(&oldmask);
>
> sigclearmask();

<https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/jobs.c?id=f30bd155ccbc3f084bbf03d56f9cc43f4b02af2a#n1170>

This clearly cannot work when oldmask blocks SIGCHLD.

NetBSD sh does not use sigsuspend here, so avoids that problem.

I changed gwsh to call sigclearmask() on shell startup, but plan to
check whether this loop is really necessary at some later time. It was
added to dash to fix a race condition, where that race condition was
apparently introduced by a fix for another race condition. If NetBSD sh
manages to avoid this pattern, and assuming NetBSD sh is not still
susceptible to one of those race conditions, the fix for it in the other
shells would seem to be more complicated than necessary, and simplifying
things would be good.

Cheers,
Harald van Dijk
On Sat, Feb 08, 2020 at 06:39:38PM +0000, Harald van Dijk wrote:
> On 07/02/2020 02:41, Robert Elz wrote:
> > Date: Thu, 6 Feb 2020 16:12:06 +0000
> > From: Martijn Dekker <martijn@inlv.org>
> > Message-ID: <10e3756b-5e8f-ba00-df0d-b36c93fa2281@inlv.org>

> > | NetBSD sh behaves differently. NetBSD 8.1 sh (as installed on sdf.org
> > | and sdf-eu.org) seems to act completely normally, but NetBSD 9.0rc2 sh
> > | (on my VirtualBox test VM) segfaults. Output on NetBSD 9.0rc2:

> > I have updated my opinion on that, I think it is "don't have the bug",
> > though it is possible a blocked SIGCHLD acts differently on NetBSD than
> > on other systems. On NetBSD it seems to affect nothing (the shell does
> > not rely upon receiving SIGCHLD so not getting it is irrelevant) and
> > the wait code when given an arg (as your script did) would always wait
> > until that process exited, and return as soon as it did.

> I think you're right that this isn't SIGCHLD behaving differently on NetBSD,
> it's that NetBSD sh does not have the same problem the other ash-based
> shells do. The problem is with sigsuspend, which in dash looks like:

> > sigblockall(&oldmask);
> >
> > while (!gotsigchld && !pending_sig)
> >         sigsuspend(&oldmask);
> >
> > sigclearmask();

> <https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/jobs.c?id=f30bd155ccbc3f084bbf03d56f9cc43f4b02af2a#n1170>

> This clearly cannot work when oldmask blocks SIGCHLD.

> NetBSD sh does not use sigsuspend here, so avoids that problem.

> I changed gwsh to call sigclearmask() on shell startup, but plan to check
> whether this loop is really necessary at some later time. It was added to
> dash to fix a race condition, where that race condition was apparently
> introduced by a fix for another race condition. If NetBSD sh manages to
> avoid this pattern, and assuming NetBSD sh is not still susceptible to one
> of those race conditions, the fix for it in the other shells would seem to
> be more complicated than necessary, and simplifying things would be good.

I have not tested whether the bug actually happens in NetBSD sh but I
think the complexity is necessary.

The problem is that the wait builtin must wait for either process
termination or a signal, and relying on an [EINTR] error return to abort
a blocking waitpid() or similar leaves a window where a signal could
come in after which the program goes asleep. In a script this could look
like

  trap 'echo cleaning up; exit' TERM
  slow_process_1 &
  slow_process_2 &
  wait

and if a TERM signal comes in just before the wait system call is
invoked, the signal handler sets a flag but the trap is not taken until
a process terminates or another signal comes in.

FreeBSD sh also has a -T flag that causes traps to be taken immediately
while waiting for a process to terminate. This has the same issue with
waiting for process termination or a signal.

There are various solutions here:

* Make sure SIGCHLD is caught, reducing the problem to waiting for
  signals only. This can then be done using sigsuspend() or sigwait().
  Most ash variants that have closed this race window have chosen this
  option. The SIGCHLD handler could be installed globally or only for
  the duration of the wait builtin.

* Call longjmp() from the signal handler. The blocking wait will have to
  be changed to waitid() with WNOWAIT so no exit statuses are lost when
  a signal comes in just after waitid() returns. Note that ash variants
  already call longjmp() from a SIGINT signal handler in certain
  situations in interactive mode, so it is not a really strange thing to
  do.

* Use musl's solution for [EINTR] in the context of pthread
  cancellation, checking the saved program counter when a signal
  arrives. Although theoretically portable, it requires writing
  architecture-specific code in practice.

* Use FreeBSD libthr's solution for [EINTR] in the context of pthread
  cancellation, asking the kernel to abort the next blocking system call
  with [EINTR] immediately from the signal handler. This is not portable
  to other kernels.

--
Jilles Tjoelker
On Sat, Feb 8, 2020 at 7:41 PM Harald van Dijk <harald@gigawatt.nl> wrote:
> I changed gwsh to call sigclearmask() on shell startup, but plan to
> check whether this loop is really necessary at some later time. It was
> added to dash to fix a race condition, where that race condition was
> apparently introduced by a fix for another race condition.

sigsuspend() is needed to make "wait" builtin interruptible by signals.
Attempts to use EINTR error return of waitpid() a-la:

 if (got_sigs) { handle signals }
 got_sigs = 0;
 pid = waitpid(...); /* without WNOHANG */
 if (pid < 0 && errno == EINTR) { handle signals }

are racy, since signals can be delivered not only while waitpid()
syscall is in kernel, but also when we are only about to enter the
kernel - and in this case, the shell's sighandler will set the flag
variable, but then we enter the kernel and sleep.

Masking signals doesn't help, since you need to unmask them just
before waitpid() if you want to get EINTR on a signal, hence there is
still a window for the race.

> If NetBSD sh
> manages to avoid this pattern, and assuming NetBSD sh is not still
> susceptible to one of those race conditions

Please let us know what you discovered.
Date: Tue, 18 Feb 2020 17:46:23 +0100
From: Denys Vlasenko <vda.linux@googlemail.com>
Message-ID: <CAK1hOcO_S_T=5SWJ0jpZWxDYwdUFqJisw_nC+JysnQvZ6XUuKw@mail.gmail.com>

| > If NetBSD sh
| > manages to avoid this pattern, and assuming NetBSD sh is not still
| > susceptible to one of those race conditions
|
| Please let us know what you discovered.

It is very likely that it is racy as described, though no-one has
ever filed a bug report on it (ie: it hasn't happened to anyone in
a way that they'd complain about it).

I suspect it also isn't a conformance problem - POSIX says very little
about when traps are executed ... really only that they don't interrupt
waiting for a foreground command to complete, and that if a trap occurs
while waiting in the wait command, then that command ends with an exit
status indicating the signal.

What that means is that using traps for anything much more than cleanup
activities isn't really safe (or perhaps, s/safe/sane/) as there's no
guarantee when the trap will actually run.

Given that, losing the race in the situation cited (ie: getting the
signal just before running the waitpid() (or whichever) sys call when
implementing the wait command - and then going ahead and doing the sys
call, hanging until some process terminates (perhaps until a particular
process terminates)) seems fully conformant to me (the signal doesn't
arrive while waiting, so no error return from wait is required).

It isn't nice, and ideally wouldn't happen (and in real life, seems not
to ... the window is quite small after all) but nothing should really
break badly because of it - or at least nothing portable should.

We do now unilaterally reset SIGCHLD to SIG_DFL/unblocked at startup
(SIGCHLD is the one signal we're not required to pass on to exec'd
processes in the same state we received it, so that's OK) so we could
adopt the block, catch SIGCHLD, and sigsuspend() approach if that ever
seemed like a necessary thing to do.

kre

ps: the observed core dump problem is also fixed, that was a related,
but quite different, issue - not connected to SIGCHLD in any way at all.
On 18/02/2020 16:46, Denys Vlasenko wrote:
> On Sat, Feb 8, 2020 at 7:41 PM Harald van Dijk <harald@gigawatt.nl> wrote:
>> If NetBSD sh
>> manages to avoid this pattern, and assuming NetBSD sh is not still
>> susceptible to one of those race conditions
>
> Please let us know what you discovered.
Okay, please take a look. I hope I managed to avoid race conditions in
the test shell script.
test1.sh:
i=1
while test "$i" -lt 100000
do
    printf "%d\r" "$i"
    "$@" test2.sh
    i=$((i + 1))
done
test2.sh:
trap 'kill $!; exit 0' TERM
{ kill $$; exec sleep 1000; } &
wait $!
To run:
sh test1.sh $shell
For instance:
sh test1.sh busybox ash
test1.sh will repeatedly run test2.sh and increment and print a counter
variable to display progress.
test2.sh will immediately exit, in a complicated way, if all goes well.
It may sleep for 1000s or fail to clean up its background process if
something goes wrong.
On my system, I see:
bash 5.0.11        - sleeps after a while
bosh 2019-11-11    - sleeps after a while
busybox 1.31.1 ash - ok
dash 0.5.10.2      - ok
dash (current)     - sleeps immediately
fbsh 12.1          - ok *
gwsh (current)     - leaves subprocesses
ksh 93v            - sleeps after a while
ksh 2020.0.0       - sleeps after a while
mksh 57            - sleeps after a while
nbsh (current)     - sleeps after a while *
pdksh 5.2.14       - leaves subprocesses + sleeps after a while
posh 0.13.1        - ok
yash               - ok
zsh                - sleeps after a while
* Because of the way I was running FreeBSD sh and NetBSD sh on qemu, I
could not easily check what happens to the subprocesses.
I think that confirms that NetBSD sh does have a problem with a race
condition, but that many shells have that same problem. It also tells me
that there is another different problem in my shell that I should look at.
Cheers,
Harald van Dijk