* [PATCH] run-command: don't spam trace2_child_exit() @ 2022-04-28 20:58 Josh Steadmon 2022-04-28 21:46 ` Junio C Hamano 2022-06-07 18:21 ` [PATCH v2] " Josh Steadmon 0 siblings, 2 replies; 10+ messages in thread From: Josh Steadmon @ 2022-04-28 20:58 UTC (permalink / raw) To: git In rare cases, wait_or_whine() cannot determine a child process's exit status (and will return -1 in this case). This can cause Git to issue trace2 child_exit events despite the fact that the child is still running. In pathological cases, we've seen > 80 million exit events in our trace logs for a single child process. Fix this by only issuing trace2 events in finish_command_in_signal() if we get a value other than -1 from wait_or_whine(). Signed-off-by: Josh Steadmon <steadmon@google.com> --- run-command.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/run-command.c b/run-command.c index a8501e38ce..e0fe2418a2 100644 --- a/run-command.c +++ b/run-command.c @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) int finish_command_in_signal(struct child_process *cmd) { int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); - trace2_child_exit(cmd, ret); + if (ret != -1) + trace2_child_exit(cmd, ret); return ret; } base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3 -- 2.36.0.464.gb9c8b46e94-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] run-command: don't spam trace2_child_exit() 2022-04-28 20:58 [PATCH] run-command: don't spam trace2_child_exit() Josh Steadmon @ 2022-04-28 21:46 ` Junio C Hamano 2022-05-03 14:59 ` Jeff Hostetler 2022-05-05 19:44 ` Josh Steadmon 2022-06-07 18:21 ` [PATCH v2] " Josh Steadmon 1 sibling, 2 replies; 10+ messages in thread From: Junio C Hamano @ 2022-04-28 21:46 UTC (permalink / raw) To: Josh Steadmon; +Cc: git Josh Steadmon <steadmon@google.com> writes: > In rare cases, wait_or_whine() cannot determine a child process's exit > status (and will return -1 in this case). This can cause Git to issue > trace2 child_exit events despite the fact that the child is still > running. Rather, we do not even know if the child is still running when it happens, right? It is curious what "rare cases" makes the symptom appear. Do we know? The patch looks OK from the "we do not know the child exited in this case, so we shouldn't be reporting the child exit" point of view, of course. Having one event that started a child in the log and then having millions of events that reports the exit of the (same) child is way too broken. With this change, we remove these phoney exit events from the log. Do we know, for such a child process that caused these millions phoney exit events, we got a real exit event at the end? Otherwise, we'd still have a similar problem in the opposite direction, i.e. a child has a start event recorded, many exit event discarded but the log lacks the true exit event for the child, implying that the child is still running because we failed to log its exit? > int finish_command_in_signal(struct child_process *cmd) > { > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > - trace2_child_exit(cmd, ret); > + if (ret != -1) > + trace2_child_exit(cmd, ret); > return ret; > } Will queue; thanks. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] run-command: don't spam trace2_child_exit() 2022-04-28 21:46 ` Junio C Hamano @ 2022-05-03 14:59 ` Jeff Hostetler 2022-05-05 19:58 ` Josh Steadmon 2022-05-05 19:44 ` Josh Steadmon 1 sibling, 1 reply; 10+ messages in thread From: Jeff Hostetler @ 2022-05-03 14:59 UTC (permalink / raw) To: Junio C Hamano, Josh Steadmon; +Cc: git On 4/28/22 5:46 PM, Junio C Hamano wrote: > Josh Steadmon <steadmon@google.com> writes: > >> In rare cases, wait_or_whine() cannot determine a child process's exit >> status (and will return -1 in this case). This can cause Git to issue >> trace2 child_exit events despite the fact that the child is still >> running. I'm curious what is causing the spurious return values. Could you instrument wait_or_whine() and see which of the if/else arms are causing the -1 to be returned? That routine is rather complicated and looks like it has 3 different ways that a -1 could be returned. > > Rather, we do not even know if the child is still running when it > happens, right? It is curious what "rare cases" makes the symptom > appear. Do we know? > > The patch looks OK from the "we do not know the child exited in this > case, so we shouldn't be reporting the child exit" point of view, of > course. Having one event that started a child in the log and then > having millions of events that reports the exit of the (same) child > is way too broken. With this change, we remove these phoney exit > events from the log. > > Do we know, for such a child process that caused these millions > phoney exit events, we got a real exit event at the end? Otherwise, > we'd still have a similar problem in the opposite direction, i.e. a > child has a start event recorded, many exit event discarded but the > log lacks the true exit event for the child, implying that the child > is still running because we failed to log its exit? > >> int finish_command_in_signal(struct child_process *cmd) >> { >> int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); >> - trace2_child_exit(cmd, ret); >> + if (ret != -1) >> + trace2_child_exit(cmd, ret); >> return ret; >> } Since this is only called from pager.c and is used to setup the pager, I have to wonder if you're only getting these spurious events for the pager process or for any of the other random child processes. And whether they are received while the pager is alive and working properly, or when you're trying to quit the pager or when the pager is trying to signal eof. > > Will queue; thanks. > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] run-command: don't spam trace2_child_exit() 2022-05-03 14:59 ` Jeff Hostetler @ 2022-05-05 19:58 ` Josh Steadmon 2022-05-10 20:37 ` Jeff Hostetler 0 siblings, 1 reply; 10+ messages in thread From: Josh Steadmon @ 2022-05-05 19:58 UTC (permalink / raw) To: Jeff Hostetler; +Cc: Junio C Hamano, git On 2022.05.03 10:59, Jeff Hostetler wrote: > > > On 4/28/22 5:46 PM, Junio C Hamano wrote: > > Josh Steadmon <steadmon@google.com> writes: > > > > > In rare cases, wait_or_whine() cannot determine a child process's exit > > > status (and will return -1 in this case). This can cause Git to issue > > > trace2 child_exit events despite the fact that the child is still > > > running. > > I'm curious what is causing the spurious return values. > Could you instrument wait_or_whine() and see which of the > if/else arms are causing the -1 to be returned? > > That routine is rather complicated and looks like it has 3 > different ways that a -1 could be returned. Unfortunately we haven't been able to reproduce the issue, so instrumentation won't clarify the cause unless we randomly hit another instance. > > > > Rather, we do not even know if the child is still running when it > > happens, right? It is curious what "rare cases" makes the symptom > > appear. Do we know? > > > > The patch looks OK from the "we do not know the child exited in this > > case, so we shouldn't be reporting the child exit" point of view, of > > course. Having one event that started a child in the log and then > > having millions of events that reports the exit of the (same) child > > is way too broken. With this change, we remove these phoney exit > > events from the log. > > > > Do we know, for such a child process that caused these millions > > phoney exit events, we got a real exit event at the end? Otherwise, > > we'd still have a similar problem in the opposite direction, i.e. a > > child has a start event recorded, many exit event discarded but the > > log lacks the true exit event for the child, implying that the child > > is still running because we failed to log its exit? > > > > > int finish_command_in_signal(struct child_process *cmd) > > > { > > > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > > > - trace2_child_exit(cmd, ret); > > > + if (ret != -1) > > > + trace2_child_exit(cmd, ret); > > > return ret; > > > } > > Since this is only called from pager.c and is used to setup the > pager, I have to wonder if you're only getting these spurious events > for the pager process or for any of the other random child processes. > > And whether they are received while the pager is alive and working > properly, or when you're trying to quit the pager or when the pager > is trying to signal eof. It was for a pager run by `git diff`. The only other child process in this instance was `git status`. I believe the events were generated while the user was not interacting with the process; IIRC the pager was left running in a terminal overnight and they discovered the next morning that their disk was filled up by trace logs. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] run-command: don't spam trace2_child_exit() 2022-05-05 19:58 ` Josh Steadmon @ 2022-05-10 20:37 ` Jeff Hostetler 2022-06-07 18:45 ` Josh Steadmon 0 siblings, 1 reply; 10+ messages in thread From: Jeff Hostetler @ 2022-05-10 20:37 UTC (permalink / raw) To: Josh Steadmon, Junio C Hamano, git On 5/5/22 3:58 PM, Josh Steadmon wrote: > On 2022.05.03 10:59, Jeff Hostetler wrote: >> >> >> On 4/28/22 5:46 PM, Junio C Hamano wrote: >>> Josh Steadmon <steadmon@google.com> writes: >>> >>>> In rare cases, wait_or_whine() cannot determine a child process's exit >>>> status (and will return -1 in this case). This can cause Git to issue >>>> trace2 child_exit events despite the fact that the child is still >>>> running. >> >> I'm curious what is causing the spurious return values. >> Could you instrument wait_or_whine() and see which of the >> if/else arms are causing the -1 to be returned? >> >> That routine is rather complicated and looks like it has 3 >> different ways that a -1 could be returned. > > Unfortunately we haven't been able to reproduce the issue, so > instrumentation won't clarify the cause unless we randomly hit another > instance. > > >>> >>> Rather, we do not even know if the child is still running when it >>> happens, right? It is curious what "rare cases" makes the symptom >>> appear. Do we know? >>> >>> The patch looks OK from the "we do not know the child exited in this >>> case, so we shouldn't be reporting the child exit" point of view, of >>> course. Having one event that started a child in the log and then >>> having millions of events that reports the exit of the (same) child >>> is way too broken. With this change, we remove these phoney exit >>> events from the log. >>> >>> Do we know, for such a child process that caused these millions >>> phoney exit events, we got a real exit event at the end? Otherwise, >>> we'd still have a similar problem in the opposite direction, i.e. a >>> child has a start event recorded, many exit event discarded but the >>> log lacks the true exit event for the child, implying that the child >>> is still running because we failed to log its exit? >>> >>>> int finish_command_in_signal(struct child_process *cmd) >>>> { >>>> int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); >>>> - trace2_child_exit(cmd, ret); >>>> + if (ret != -1) >>>> + trace2_child_exit(cmd, ret); >>>> return ret; >>>> } >> >> Since this is only called from pager.c and is used to setup the >> pager, I have to wonder if you're only getting these spurious events >> for the pager process or for any of the other random child processes. >> >> And whether they are received while the pager is alive and working >> properly, or when you're trying to quit the pager or when the pager >> is trying to signal eof. > > It was for a pager run by `git diff`. The only other child process in > this instance was `git status`. I believe the events were generated > while the user was not interacting with the process; IIRC the pager was > left running in a terminal overnight and they discovered the next > morning that their disk was filled up by trace logs. > This worries me that you've found a race or infinite loop here and the only symptom is the child exit messages. Suppressing them solves the disk-fill problem, but might mean that there's a lurking infinite loop or something. Can you (privately) send me the part of the log before the infinite loop? And maybe the parent and child processes too? I'd kinda like to understand what's happening here. Jeff ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] run-command: don't spam trace2_child_exit() 2022-05-10 20:37 ` Jeff Hostetler @ 2022-06-07 18:45 ` Josh Steadmon 0 siblings, 0 replies; 10+ messages in thread From: Josh Steadmon @ 2022-06-07 18:45 UTC (permalink / raw) To: Jeff Hostetler; +Cc: Junio C Hamano, git On 2022.05.10 16:37, Jeff Hostetler wrote: > > > On 5/5/22 3:58 PM, Josh Steadmon wrote: > > On 2022.05.03 10:59, Jeff Hostetler wrote: > > > > > > > > > On 4/28/22 5:46 PM, Junio C Hamano wrote: > > > > Josh Steadmon <steadmon@google.com> writes: > > > > > > > > > In rare cases, wait_or_whine() cannot determine a child process's exit > > > > > status (and will return -1 in this case). This can cause Git to issue > > > > > trace2 child_exit events despite the fact that the child is still > > > > > running. > > > > > > I'm curious what is causing the spurious return values. > > > Could you instrument wait_or_whine() and see which of the > > > if/else arms are causing the -1 to be returned? > > > > > > That routine is rather complicated and looks like it has 3 > > > different ways that a -1 could be returned. > > > > Unfortunately we haven't been able to reproduce the issue, so > > instrumentation won't clarify the cause unless we randomly hit another > > instance. > > > > > > > > > > > > Rather, we do not even know if the child is still running when it > > > > happens, right? It is curious what "rare cases" makes the symptom > > > > appear. Do we know? > > > > > > > > The patch looks OK from the "we do not know the child exited in this > > > > case, so we shouldn't be reporting the child exit" point of view, of > > > > course. Having one event that started a child in the log and then > > > > having millions of events that reports the exit of the (same) child > > > > is way too broken. With this change, we remove these phoney exit > > > > events from the log. > > > > > > > > Do we know, for such a child process that caused these millions > > > > phoney exit events, we got a real exit event at the end? Otherwise, > > > > we'd still have a similar problem in the opposite direction, i.e. a > > > > child has a start event recorded, many exit event discarded but the > > > > log lacks the true exit event for the child, implying that the child > > > > is still running because we failed to log its exit? > > > > > > > > > int finish_command_in_signal(struct child_process *cmd) > > > > > { > > > > > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > > > > > - trace2_child_exit(cmd, ret); > > > > > + if (ret != -1) > > > > > + trace2_child_exit(cmd, ret); > > > > > return ret; > > > > > } > > > > > > Since this is only called from pager.c and is used to setup the > > > pager, I have to wonder if you're only getting these spurious events > > > for the pager process or for any of the other random child processes. > > > > > > And whether they are received while the pager is alive and working > > > properly, or when you're trying to quit the pager or when the pager > > > is trying to signal eof. > > > > It was for a pager run by `git diff`. The only other child process in > > this instance was `git status`. I believe the events were generated > > while the user was not interacting with the process; IIRC the pager was > > left running in a terminal overnight and they discovered the next > > morning that their disk was filled up by trace logs. > > > > This worries me that you've found a race or infinite loop here > and the only symptom is the child exit messages. Suppressing > them solves the disk-fill problem, but might mean that there's > a lurking infinite loop or something. > > Can you (privately) send me the part of the log before the > infinite loop? And maybe the parent and child processes too? > I'd kinda like to understand what's happening here. > > Jeff Unfortunately we no longer have the relevant logs. This incident happened in 2020 and the logs have expired. I can share some general details though: The session started on a Friday when the user ran `git diff`. Presumably the user left the pager open after they left for the weekend. On the following Saturday we started getting ~30M child_exit events per day until their disk filled up early Monday morning. The process was running on a VM, and I have reason to believe that the user's SSH connection to the VM was broken at some point on the Saturday that the spammy events started appearing. I don't know whether or not the user was using any sort of session manager / terminal multiplexer such as screen or tmux. I am curious whether or not the broken SSH connection could have created some kind of signal loop that caused the pager to continuously trigger our wait_for_pager_signal() handler. I don't have any useful info beyond that, sorry. Thanks for taking a look though! ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] run-command: don't spam trace2_child_exit() 2022-04-28 21:46 ` Junio C Hamano 2022-05-03 14:59 ` Jeff Hostetler @ 2022-05-05 19:44 ` Josh Steadmon 1 sibling, 0 replies; 10+ messages in thread From: Josh Steadmon @ 2022-05-05 19:44 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On 2022.04.28 14:46, Junio C Hamano wrote: > Josh Steadmon <steadmon@google.com> writes: > > > In rare cases, wait_or_whine() cannot determine a child process's exit > > status (and will return -1 in this case). This can cause Git to issue > > trace2 child_exit events despite the fact that the child is still > > running. > > Rather, we do not even know if the child is still running when it > happens, right? Correct, if you'd like me to clarify the commit message I'll send a V2. > It is curious what "rare cases" makes the symptom > appear. Do we know? Unfortunately, no. The quoted 80 million exit event instance was not reproducible. > The patch looks OK from the "we do not know the child exited in this > case, so we shouldn't be reporting the child exit" point of view, of > course. Having one event that started a child in the log and then > having millions of events that reports the exit of the (same) child > is way too broken. With this change, we remove these phoney exit > events from the log. > > Do we know, for such a child process that caused these millions > phoney exit events, we got a real exit event at the end? We don't know. The trace log filled up the user's disk in this case, so the log was truncated. > Otherwise, > we'd still have a similar problem in the opposite direction, i.e. a > child has a start event recorded, many exit event discarded but the > log lacks the true exit event for the child, implying that the child > is still running because we failed to log its exit? Yes, that is a weakness with this approach. > > int finish_command_in_signal(struct child_process *cmd) > > { > > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > > - trace2_child_exit(cmd, ret); > > + if (ret != -1) > > + trace2_child_exit(cmd, ret); > > return ret; > > } > > Will queue; thanks. Thanks, and sorry for the delayed reply, I've been out sick for a few days. ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v2] run-command: don't spam trace2_child_exit() 2022-04-28 20:58 [PATCH] run-command: don't spam trace2_child_exit() Josh Steadmon 2022-04-28 21:46 ` Junio C Hamano @ 2022-06-07 18:21 ` Josh Steadmon 2022-06-07 22:09 ` Ævar Arnfjörð Bjarmason 2022-06-10 15:31 ` Jeff Hostetler 1 sibling, 2 replies; 10+ messages in thread From: Josh Steadmon @ 2022-06-07 18:21 UTC (permalink / raw) To: git; +Cc: gitster, git In rare cases[1], wait_or_whine() cannot determine a child process's status (and will return -1 in this case). This can cause Git to issue trace2 child_exit events despite the fact that the child may still be running. In pathological cases, we've seen > 80 million exit events in our trace logs for a single child process. Fix this by only issuing trace2 events in finish_command_in_signal() if we get a value other than -1 from wait_or_whine(). This can lead to missing child_exit events in such a case, but that is preferable to duplicating events on a scale that threatens to fill the user's filesystem with invalid trace logs. [1]: This can happen when: * waitpid() returns -1 and errno != EINTR * waitpid() returns an invalid PID * the status set by waitpid() has neither the WIFEXITED() nor WIFSIGNALED() flags Signed-off-by: Josh Steadmon <steadmon@google.com> --- Updated the commit message with more details about when wait_or_whine() can fail. run-command.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/run-command.c b/run-command.c index a8501e38ce..e0fe2418a2 100644 --- a/run-command.c +++ b/run-command.c @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) int finish_command_in_signal(struct child_process *cmd) { int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); - trace2_child_exit(cmd, ret); + if (ret != -1) + trace2_child_exit(cmd, ret); return ret; } base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3 -- 2.36.1.255.ge46751e96f-goog ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH v2] run-command: don't spam trace2_child_exit() 2022-06-07 18:21 ` [PATCH v2] " Josh Steadmon @ 2022-06-07 22:09 ` Ævar Arnfjörð Bjarmason 2022-06-10 15:31 ` Jeff Hostetler 1 sibling, 0 replies; 10+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-06-07 22:09 UTC (permalink / raw) To: Josh Steadmon; +Cc: git, gitster, git On Tue, Jun 07 2022, Josh Steadmon wrote: > In rare cases[1], wait_or_whine() cannot determine a child process's > status (and will return -1 in this case). This can cause Git to issue > trace2 child_exit events despite the fact that the child may still be > running. In pathological cases, we've seen > 80 million exit events in > our trace logs for a single child process. > > Fix this by only issuing trace2 events in finish_command_in_signal() if > we get a value other than -1 from wait_or_whine(). This can lead to > missing child_exit events in such a case, but that is preferable to > duplicating events on a scale that threatens to fill the user's > filesystem with invalid trace logs. > > [1]: This can happen when: > > * waitpid() returns -1 and errno != EINTR > * waitpid() returns an invalid PID > * the status set by waitpid() has neither the WIFEXITED() nor > WIFSIGNALED() flags > > Signed-off-by: Josh Steadmon <steadmon@google.com> > --- > Updated the commit message with more details about when wait_or_whine() > can fail. > > run-command.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/run-command.c b/run-command.c > index a8501e38ce..e0fe2418a2 100644 > --- a/run-command.c > +++ b/run-command.c > @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) > int finish_command_in_signal(struct child_process *cmd) > { > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > - trace2_child_exit(cmd, ret); > + if (ret != -1) > + trace2_child_exit(cmd, ret); > return ret; > } This seems like a legitimate issue, but I really don't think we should sweep this under the rug like this. * Why can't we see if we logged such an event already in common_exit(), if we didn't we should trace2_child_exit() (or similar). I.e. not miss an event, ever. * Should this really be an "exit" event, aren't some of these failed signal events? Per the "should this be an exit event?" question in my related "signal on BUG" series. * We should have tests here, e.g. in t0210 to see the exact events we emit in certain cases, we really should have a test for this. Perhaps we can instrument a simulated failure with some GIT_TEST_* variables? ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2] run-command: don't spam trace2_child_exit() 2022-06-07 18:21 ` [PATCH v2] " Josh Steadmon 2022-06-07 22:09 ` Ævar Arnfjörð Bjarmason @ 2022-06-10 15:31 ` Jeff Hostetler 1 sibling, 0 replies; 10+ messages in thread From: Jeff Hostetler @ 2022-06-10 15:31 UTC (permalink / raw) To: Josh Steadmon, git; +Cc: gitster On 6/7/22 2:21 PM, Josh Steadmon wrote: > In rare cases[1], wait_or_whine() cannot determine a child process's > status (and will return -1 in this case). This can cause Git to issue > trace2 child_exit events despite the fact that the child may still be > running. In pathological cases, we've seen > 80 million exit events in > our trace logs for a single child process. > > Fix this by only issuing trace2 events in finish_command_in_signal() if > we get a value other than -1 from wait_or_whine(). This can lead to > missing child_exit events in such a case, but that is preferable to > duplicating events on a scale that threatens to fill the user's > filesystem with invalid trace logs. > > [1]: This can happen when: > > * waitpid() returns -1 and errno != EINTR > * waitpid() returns an invalid PID > * the status set by waitpid() has neither the WIFEXITED() nor > WIFSIGNALED() flags > > Signed-off-by: Josh Steadmon <steadmon@google.com> > --- > Updated the commit message with more details about when wait_or_whine() > can fail. > > run-command.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/run-command.c b/run-command.c > index a8501e38ce..e0fe2418a2 100644 > --- a/run-command.c > +++ b/run-command.c > @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) > int finish_command_in_signal(struct child_process *cmd) > { > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > - trace2_child_exit(cmd, ret); > + if (ret != -1) > + trace2_child_exit(cmd, ret); > return ret; > } > > > base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3 > I think there are larger bugs lurking here. finish_command_in_signal() is only called from wait_for_pager_signal(), which is loaded up by sigchain_push_common() when the pager is started. There are no loops or other callers here. We sigchain-pop our handler upon first use, so even if we were getting millions of PIPE signals from other chilld processes, we shouldn't be getting more than 1 message here. (Assuming that our pop is working and our call to raise() doesn't cause us to be called again.) The code is behaving as if the "s->n--" at the bottom of sigchain_pop() isn't being performed and our raise() is calling us again. I originally thought it might be due to unbalanced push/pops -- which could happen if the process were in a "push; do something; pop" sequence and the PIPE happened in the middle of that. But I don't think that matters. So I have to wonder if the sigchain code has a bug. The Linux man-page states that signal(2) is not thread-safe and even if it were, the sigchain code here does not have any locking on those global data structures. But that might not be the problem here and it's hard to say without any of the original trace data. Having said all of that I think it would be better to fix the original problem something like this (not tested): static void wait_for_pager_signal(int signo) { + static int handled = 0; + + if (handled) + return; + handled = 1; + close_pager_fds(); finish_command_in_signal(&pager_process); sigchain_pop(signo); raise(signo); } Jeff ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2022-06-10 15:31 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-04-28 20:58 [PATCH] run-command: don't spam trace2_child_exit() Josh Steadmon 2022-04-28 21:46 ` Junio C Hamano 2022-05-03 14:59 ` Jeff Hostetler 2022-05-05 19:58 ` Josh Steadmon 2022-05-10 20:37 ` Jeff Hostetler 2022-06-07 18:45 ` Josh Steadmon 2022-05-05 19:44 ` Josh Steadmon 2022-06-07 18:21 ` [PATCH v2] " Josh Steadmon 2022-06-07 22:09 ` Ævar Arnfjörð Bjarmason 2022-06-10 15:31 ` Jeff Hostetler
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.