* [LTP] [PATCH] test_children_cleanup.sh: Fix race condition
@ 2022-02-14 16:51 Martin Doucha
  2022-02-15  4:30 ` Li Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Martin Doucha @ 2022-02-14 16:51 UTC (permalink / raw)
  To: ltp

Processes can stay alive for a short while even after receiving SIGKILL.
Give the child process in the subprocess cleanup libtest up to 5 seconds to
fully exit before reporting that it was left behind.

Signed-off-by: Martin Doucha <mdoucha@suse.cz>
---
 lib/newlib_tests/test_children_cleanup.sh | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/lib/newlib_tests/test_children_cleanup.sh b/lib/newlib_tests/test_children_cleanup.sh
index 4b4e8b2f0..44505aa51 100755
--- a/lib/newlib_tests/test_children_cleanup.sh
+++ b/lib/newlib_tests/test_children_cleanup.sh
@@ -10,10 +10,19 @@ rm "$TMPFILE"
 if [ "x$CHILD_PID" = "x" ]; then
 	echo "TFAIL: Child process was not created"
 	exit 1
-elif ! kill -s 0 $CHILD_PID &>/dev/null; then
-	echo "TPASS: Child process was cleaned up"
-	exit 0
-else
-	echo "TFAIL: Child process was left behind"
-	exit 1
 fi
+
+# The child process can stay alive for a short while even after receiving
+# SIGKILL, especially if the system is under heavy load. Wait up to 5 seconds
+# for it to fully exit.
+for i in `seq 6`; do
+	if ! [ -e "/proc/$CHILD_PID" ]; then
+		echo "TPASS: Child process was cleaned up"
+		exit 0
+	fi
+
+	sleep 1
+done
+
+echo "TFAIL: Child process was left behind"
+exit 1
-- 
2.34.1
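A minimal sketch of the same retry loop as a reusable POSIX sh function, assuming Linux `/proc`; the name `wait_for_exit` is hypothetical and not part of LTP. The loop in the patch sleeps once more after its sixth and last check. This version stops as soon as the timeout is reached, so TIMEOUT=5 means six checks and at most five one-second sleeps.

```shell
# wait_for_exit PID TIMEOUT
# Hypothetical helper (not part of LTP): poll /proc until PID disappears.
# Returns 0 as soon as /proc/PID is gone, or 1 if it still exists after
# TIMEOUT seconds. A zombie still has a /proc entry, so it counts as present.
wait_for_exit()
{
	# Plain sh has no 'local', so prefixed names avoid clobbering the caller.
	wfe_pid=$1
	wfe_timeout=$2
	wfe_elapsed=0

	while [ -e "/proc/$wfe_pid" ]; do
		# Give up without a final, pointless sleep.
		if [ "$wfe_elapsed" -ge "$wfe_timeout" ]; then
			return 1
		fi
		sleep 1
		wfe_elapsed=$((wfe_elapsed + 1))
	done

	return 0
}
```

Used in the test as `wait_for_exit "$CHILD_PID" 5 && echo "TPASS: Child process was cleaned up"`. Note that `[ -e /proc/PID ]` is also true for a zombie, so on its own this loop still reports an unreaped zombie as left behind.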


-- 
Mailing list info: https://lists.linux.it/listinfo/ltp


* Re: [LTP] [PATCH] test_children_cleanup.sh: Fix race condition
  2022-02-14 16:51 [LTP] [PATCH] test_children_cleanup.sh: Fix race condition Martin Doucha
@ 2022-02-15  4:30 ` Li Wang
  2022-02-15  7:34   ` Petr Vorel
  2022-02-15  9:27   ` Martin Doucha
  0 siblings, 2 replies; 6+ messages in thread
From: Li Wang @ 2022-02-15  4:30 UTC (permalink / raw)
  To: Martin Doucha; +Cc: LTP List



On Tue, Feb 15, 2022 at 12:51 AM Martin Doucha <mdoucha@suse.cz> wrote:

> Processes can stay alive for a short while even after receiving SIGKILL.
> Give the child process in the subprocess cleanup libtest up to 5 seconds to
> fully exit before reporting that it was left behind.


> Signed-off-by: Martin Doucha <mdoucha@suse.cz>
> ---
>  lib/newlib_tests/test_children_cleanup.sh | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/lib/newlib_tests/test_children_cleanup.sh b/lib/newlib_tests/test_children_cleanup.sh
> index 4b4e8b2f0..44505aa51 100755
> --- a/lib/newlib_tests/test_children_cleanup.sh
> +++ b/lib/newlib_tests/test_children_cleanup.sh
> @@ -10,10 +10,19 @@ rm "$TMPFILE"
>  if [ "x$CHILD_PID" = "x" ]; then
>         echo "TFAIL: Child process was not created"
>         exit 1
> -elif ! kill -s 0 $CHILD_PID &>/dev/null; then
> -       echo "TPASS: Child process was cleaned up"
> -       exit 0
> -else
> -       echo "TFAIL: Child process was left behind"
> -       exit 1
>  fi
> +
> +# The child process can stay alive for a short while even after receiving
> +# SIGKILL, especially if the system is under heavy load. Wait up to 5 seconds
> +# for it to fully exit.
>


It doesn't work on all platforms, and we cannot guarantee how long it will
take before PID 1 reaps the zombie process.

Also, I just learned that Docker does not run processes under a special
init process that properly reaps child processes, so the container can
end up with zombie processes that cause all sorts of trouble.

I even tried adding `kill -SIGCHLD 1`, but it did not work as expected.
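A minimal demonstration of why a SIGCHLD nudge does not help, assuming Linux `/proc` and using `exec sleep` as a hypothetical stand-in for a PID 1 that never calls wait(): a zombie only goes away when its parent reaps it, and SIGCHLD merely notifies a parent that is not listening. The script and its temporary file are illustrative only.

```shell
#!/bin/sh
# Hypothetical demo: "exec sleep 10" turns the inner shell into a parent
# that never calls wait(), much like a PID 1 that is not a real init.
tmp=$(mktemp)
sh -c 'sleep 1 & echo $! > "$1"; exec sleep 10' sh "$tmp" &
parent=$!

sleep 3                      # the inner "sleep 1" has exited by now
zombie=$(cat "$tmp")

kill -s CHLD "$parent"       # nudge the parent, as with `kill -SIGCHLD 1`
sleep 1

# Still "Z": SIGCHLD only notifies the parent; it does not reap the child.
state=$(awk '$1 == "State:" { print $2 }' "/proc/$zombie/status")
echo "state of PID $zombie after SIGCHLD: $state"

# Killing the parent hands the zombie to init (or a subreaper); whether it
# then disappears depends on that process actually reaping it.
kill "$parent"
wait "$parent" 2>/dev/null || true
rm -f "$tmp"
```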

See CI jobs:
  https://github.com/wangli5665/ltp/runs/5194270998?check_suite_focus=true

Therefore, I suggest giving my refined patch V2 a chance :).

--- a/lib/newlib_tests/test_children_cleanup.sh
+++ b/lib/newlib_tests/test_children_cleanup.sh
@@ -10,10 +10,16 @@ rm "$TMPFILE"
 if [ "x$CHILD_PID" = "x" ]; then
        echo "TFAIL: Child process was not created"
        exit 1
+elif grep -q "Z (zombie)" /proc/$CHILD_PID/status; then
+       echo "TPASS: Child process is in zombie state"
+       exit 0
 elif ! kill -s 0 $CHILD_PID &>/dev/null; then
        echo "TPASS: Child process was cleaned up"
        exit 0
 else
        echo "TFAIL: Child process was left behind"
+       echo "cat /proc/$CHILD_PID/status"
+       echo "---------------------------"
+       cat /proc/$CHILD_PID/status
        exit 1
 fi
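A minimal sketch of the zombie check as a standalone POSIX sh function, assuming Linux `/proc`; the name `proc_state` is hypothetical. It reads the one-letter code from the `State:` line of `/proc/PID/status` (the line the V2 grep matches as `Z (zombie)`) and treats an unreadable file as an already reaped process, so nothing is printed to stderr when the PID is gone.

```shell
# proc_state PID
# Hypothetical helper: print "gone", "zombie" or "alive" for PID, based on
# the one-letter state code in /proc/PID/status (e.g. "State:	Z (zombie)").
proc_state()
{
	# The status file can vanish at any moment, so read it exactly once and
	# treat a missing file (empty output) as a reaped process.
	ps_state=$(awk '$1 == "State:" { print $2 }' "/proc/$1/status" 2>/dev/null)

	case "$ps_state" in
	""|X)	echo gone ;;	# X (dead): the task is already being released
	Z)	echo zombie ;;
	*)	echo alive ;;	# R, S, D, T, ...: still running or sleeping
	esac
}
```

In the test it could stand in for the grep as `[ "$(proc_state "$CHILD_PID")" = zombie ]`.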


-- 
Regards,
Li Wang


* Re: [LTP] [PATCH] test_children_cleanup.sh: Fix race condition
  2022-02-15  4:30 ` Li Wang
@ 2022-02-15  7:34   ` Petr Vorel
  2022-02-15  9:27   ` Martin Doucha
  1 sibling, 0 replies; 6+ messages in thread
From: Petr Vorel @ 2022-02-15  7:34 UTC (permalink / raw)
  To: Li Wang; +Cc: LTP List

Hi all,

...
> It doesn't work on all platforms, and we cannot guarantee how long it will
> take before PID 1 reaps the zombie process.

> Also, I just learned that Docker does not run processes under a special
> init process that properly reaps child processes, so the container can
> end up with zombie processes that cause all sorts of trouble.

> I even tried adding `kill -SIGCHLD 1`, but it did not work as expected.

> See CI jobs:
>   https://github.com/wangli5665/ltp/runs/5194270998?check_suite_focus=true

> Therefore, I suggest giving my refined patch V2 a chance :).

> --- a/lib/newlib_tests/test_children_cleanup.sh
> +++ b/lib/newlib_tests/test_children_cleanup.sh
> @@ -10,10 +10,16 @@ rm "$TMPFILE"
>  if [ "x$CHILD_PID" = "x" ]; then
>         echo "TFAIL: Child process was not created"
>         exit 1
> +elif grep -q "Z (zombie)" /proc/$CHILD_PID/status; then
> +       echo "TPASS: Child process is in zombie state"
> +       exit 0
>  elif ! kill -s 0 $CHILD_PID &>/dev/null; then
>         echo "TPASS: Child process was cleaned up"
>         exit 0
>  else
>         echo "TFAIL: Child process was left behind"
> +       echo "cat /proc/$CHILD_PID/status"
> +       echo "---------------------------"
> +       cat /proc/$CHILD_PID/status
>         exit 1
>  fi

Li's approach LGTM.

Kind regards,
Petr


* Re: [LTP] [PATCH] test_children_cleanup.sh: Fix race condition
  2022-02-15  4:30 ` Li Wang
  2022-02-15  7:34   ` Petr Vorel
@ 2022-02-15  9:27   ` Martin Doucha
  2022-02-15  9:39     ` Petr Vorel
  1 sibling, 1 reply; 6+ messages in thread
From: Martin Doucha @ 2022-02-15  9:27 UTC (permalink / raw)
  To: Li Wang; +Cc: LTP List

On 15. 02. 22 5:30, Li Wang wrote:
> It doesn't work on all platforms, and we cannot guarantee how long it will
> take before PID 1 reaps the zombie process.
> 
> Also, I just learned that Docker does not run processes under a special
> init process that properly reaps child processes, so the container can
> end up with zombie processes that cause all sorts of trouble.
> 
> I even tried adding `kill -SIGCHLD 1`, but it did not work as expected.
> 
> See CI jobs:
>   https://github.com/wangli5665/ltp/runs/5194270998?check_suite_focus=true
> 
> Therefore, I suggest giving my refined patch V2 a chance :).

When I was testing the libtest yesterday on a moderately stressed
machine, I actually saw the child process still in the R state during
the first state check a couple of times. That's why I've added a loop
with a delay.

On the other hand, I did not see any zombies even after several hundred
tries. But I can add a zombie check to my patch as well.

-- 
Martin Doucha   mdoucha@suse.cz
QA Engineer for Software Maintenance
SUSE LINUX, s.r.o.
CORSO IIa
Krizikova 148/34
186 00 Prague 8
Czech Republic


* Re: [LTP] [PATCH] test_children_cleanup.sh: Fix race condition
  2022-02-15  9:27   ` Martin Doucha
@ 2022-02-15  9:39     ` Petr Vorel
  2022-02-15  9:50       ` Li Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Petr Vorel @ 2022-02-15  9:39 UTC (permalink / raw)
  To: Martin Doucha; +Cc: LTP List

> On 15. 02. 22 5:30, Li Wang wrote:
> > It doesn't work on all platforms, and we cannot guarantee how long it will
> > take before PID 1 reaps the zombie process.

> > Also, I just learned that Docker does not run processes under a special
> > init process that properly reaps child processes, so the container can
> > end up with zombie processes that cause all sorts of trouble.

> > I even tried adding `kill -SIGCHLD 1`, but it did not work as expected.

> > See CI jobs:
> >   https://github.com/wangli5665/ltp/runs/5194270998?check_suite_focus=true

> > Therefore, I suggest giving my refined patch V2 a chance :).

> When I was testing the libtest yesterday on a moderately stressed
> machine, I actually saw the child process still in the R state during
> the first state check a couple of times. That's why I've added a loop
> with a delay.

> On the other hand, I did not see any zombies even after several hundred
> tries. But I can add a zombie check to my patch as well.
I'd be for it. As Li noticed, containers behave really differently
(maybe it'd be faster to debug tests using podman).

Kind regards,
Petr


* Re: [LTP] [PATCH] test_children_cleanup.sh: Fix race condition
  2022-02-15  9:39     ` Petr Vorel
@ 2022-02-15  9:50       ` Li Wang
  0 siblings, 0 replies; 6+ messages in thread
From: Li Wang @ 2022-02-15  9:50 UTC (permalink / raw)
  To: Petr Vorel; +Cc: LTP List



On Tue, Feb 15, 2022 at 5:39 PM Petr Vorel <pvorel@suse.cz> wrote:

> > On 15. 02. 22 5:30, Li Wang wrote:
> > > It doesn't work on all platforms, and we cannot guarantee how long it will
> > > take before PID 1 reaps the zombie process.
>
> > > Also, I just learned that Docker does not run processes under a special
> > > init process that properly reaps child processes, so the container can
> > > end up with zombie processes that cause all sorts of trouble.
>
> > > I even tried adding `kill -SIGCHLD 1`, but it did not work as expected.
>
> > > See CI jobs:
> > >   https://github.com/wangli5665/ltp/runs/5194270998?check_suite_focus=true
>
> > > Therefore, I suggest giving my refined patch V2 a chance :).
>
> > When I was testing the libtest yesterday on a moderately stressed
> > machine, I actually saw the child process still in the R state during
>

That might be caused by asynchronous signal processing when the system is
under pressure.


> > the first state check a couple of times. That's why I've added a loop
> > with a delay.


> > On the other hand, I did not see any zombies even after several hundred
> > tries. But I can add a zombie check to my patch as well.
> I'd be for it. As Li noticed, containers behave really differently
> (maybe it'd be faster to debug tests using podman).
>

Right, we have to consider several scenarios:

1. asynchronous signal processing, so the child is still in the 'R' state
2. the child terminated correctly on a bare-metal system
3. the child is left as a zombie when running in Docker (without a proper
   init (PID 1) process)

Scenario 3 in particular confused me for quite a while.

@Martin, feel free to send a patch that covers all of the above, or I can
help summarize these in the commit description.
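A minimal sketch of a check covering all three scenarios, assuming POSIX sh and Linux `/proc`; `check_child_cleanup` is a hypothetical name, and the TPASS/TFAIL messages mirror the ones in the patches earlier in the thread. It retries once per second for up to TIMEOUT seconds (scenario 1), passes as soon as the PID is gone (scenario 2) or has turned into a zombie (scenario 3), and dumps `/proc/PID/status` on failure as V2 does.

```shell
# check_child_cleanup PID TIMEOUT
# Hypothetical combined check: prints a TPASS/TFAIL line and returns 0 on
# pass or 1 on failure.
check_child_cleanup()
{
	ccc_pid=$1
	ccc_timeout=$2
	ccc_elapsed=0

	while :; do
		ccc_state=$(awk '$1 == "State:" { print $2 }' \
			"/proc/$ccc_pid/status" 2>/dev/null)

		case "$ccc_state" in
		"")
			# Scenario 2: the child exited and has been reaped.
			echo "TPASS: Child process was cleaned up"
			return 0 ;;
		Z)
			# Scenario 3: the child is dead, but its parent (e.g. a PID 1
			# that is not a real init) has not reaped it.
			echo "TPASS: Child process is in zombie state"
			return 0 ;;
		esac

		# Scenario 1: SIGKILL may not have taken effect yet, so retry.
		if [ "$ccc_elapsed" -ge "$ccc_timeout" ]; then
			break
		fi
		sleep 1
		ccc_elapsed=$((ccc_elapsed + 1))
	done

	echo "TFAIL: Child process was left behind"
	cat "/proc/$ccc_pid/status" 2>/dev/null
	return 1
}
```

Treating a zombie as a pass is a design choice here: the child did die, and whether its parent reaps it depends on the environment (scenario 3) rather than on the library's cleanup code.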


-- 
Regards,
Li Wang

