All of lore.kernel.org
 help / color / mirror / Atom feed
* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
@ 2018-06-27 15:22 Cyril Hrubis
  2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Cyril Hrubis @ 2018-06-27 15:22 UTC (permalink / raw)
  To: ltp

If there are any leftover children the main test process will likely be
killed while sleeping in wait(). That is because all child processes are
either waited explicitely by the test code or implicitly by the test
library.

We also send SIGKILL to the whole process group, so if one of the
children continues to live for long enough that very likely means that
it ended up stuck in the kernel.

So if there are any processes left with in the process group we created
once the process group leader i.e. main test process has been waited
for we loop for a short while to give the init daemon chance to reap the
process after it has been reparented and if that does not happen for a
few seconds we declare the process to be stuck in the kernel.

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
CC: Eric Biggers <ebiggers3@gmail.com>
---
 lib/tst_test.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/lib/tst_test.c b/lib/tst_test.c
index 80808854e..329168a24 100644
--- a/lib/tst_test.c
+++ b/lib/tst_test.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2015-2016 Cyril Hrubis <chrubis@suse.cz>
+ * Copyright (c) 2015-2018 Cyril Hrubis <chrubis@suse.cz>
  *
  * This program is free software: you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -1047,6 +1047,21 @@ static int fork_testrun(void)
 	alarm(0);
 	SAFE_SIGNAL(SIGINT, SIG_DFL);
 
+	unsigned int sleep = 100;
+	unsigned int retries = 0;
+
+	while (kill(-test_pid, 0) == 0) {
+
+		usleep(sleep);
+		sleep*=2;
+
+		if (retries++ <= 14)
+			continue;
+
+		tst_res(TFAIL, "Test process child stuck in the kernel!");
+		tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug.");
+	}
+
 	if (WIFEXITED(status) && WEXITSTATUS(status))
 		return WEXITSTATUS(status);
 
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel
  2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis
@ 2018-06-27 15:22 ` Cyril Hrubis
  2018-06-28 13:05   ` Jan Stancek
  2018-06-28  7:10 ` [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Petr Vorel
  2018-06-28  9:41 ` Li Wang
  2 siblings, 1 reply; 8+ messages in thread
From: Cyril Hrubis @ 2018-06-27 15:22 UTC (permalink / raw)
  To: ltp

This commit adds a small helper library to find a process(es) given a
process group ID and dump their stacks.

Example output:

$ ./shmctl05
tst_test.c:1015: INFO: Timeout per run is 0h 00m 10s
Test timeouted, sending SIGKILL!
tst_test.c:1059: TFAIL: Test process child stuck in the kernel!
tst_find_pid.c:90: INFO: Pid 1272 stuck in kernel!
Kernel stacktrace follows:
[<ffffffffa3c12564>] __switch_to_asm+0x34/0x70
[<ffffffffa3c12570>] __switch_to_asm+0x40/0x70
[<ffffffffa3625761>] __switch_to+0x2c1/0x6e0
[<ffffffffa393e194>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa3704802>] acct_collect+0x42/0x1a0
[<ffffffffa367d36a>] do_exit+0x74a/0xaf0
[<ffffffffa3c13d27>] rewind_stack_do_exit+0x17/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
tst_test.c:1061: FAIL: Congratulation, likely test hit a kernel bug.

TODO: The main test process uses signal handler and alarm to call _exit if the
      child process that executes the actuall test timeouts. We need to redesign
      this if we want to dump the stack in that case as well.

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
CC: Jan Stancek <jstancek@redhat.com>
---
 include/tst_dump_stacks.h |  25 +++++++++++
 lib/tst_dump_stacks.c     | 108 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/tst_test.c            |   3 +-
 3 files changed, 135 insertions(+), 1 deletion(-)
 create mode 100644 include/tst_dump_stacks.h
 create mode 100644 lib/tst_dump_stacks.c

diff --git a/include/tst_dump_stacks.h b/include/tst_dump_stacks.h
new file mode 100644
index 000000000..643cc58a8
--- /dev/null
+++ b/include/tst_dump_stacks.h
@@ -0,0 +1,25 @@
+/*
+ * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz>
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef TST_DUMP_STACKS__
+#define TST_DUMP_STACKS__
+
+void tst_dump_stacks_by_pgid(pid_t pgid);
+
+void tst_dump_stack_by_pid(pid_t pid);
+
+#endif /* TST_DUMP_STACKS__ */
diff --git a/lib/tst_dump_stacks.c b/lib/tst_dump_stacks.c
new file mode 100644
index 000000000..aa97c6820
--- /dev/null
+++ b/lib/tst_dump_stacks.c
@@ -0,0 +1,108 @@
+/*
+ * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz>
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <ctype.h>
+#include <stdio.h>
+
+#define TST_NO_DEFAULT_MAIN 1
+#include "tst_test.h"
+
+static void *process_search_init(void)
+{
+	DIR *dir = SAFE_OPENDIR("/proc/");
+
+	return dir;
+}
+
+static int is_number(const char *str)
+{
+	do {
+		if (!isdigit(*str))
+			return 0;
+	} while (*(++str));
+
+	return 1;
+}
+
+static int process_search_pgid_next(void *pid_search, pid_t pgid)
+{
+	struct dirent *ent;
+	DIR *dir = pid_search;
+	char path[1024];
+	int ppgid, pid;
+	FILE *f;
+
+	while ((ent = readdir(dir))) {
+		if (ent->d_type != DT_DIR)
+			continue;
+		if (!is_number(ent->d_name))
+			continue;
+
+		snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name);
+
+		f = fopen(path, "r");
+		if (!f)
+			continue;
+
+		if (fscanf(f, "%i %*s %*c %*i %i", &pid, &ppgid) != 2) {
+			tst_res(TWARN, "fscanf(%s) failed!", ent->d_name);
+			fclose(f);
+			continue;
+		}
+
+		fclose(f);
+
+		if (ppgid == pgid)
+			break;
+	}
+
+	if (ent)
+		return pid;
+
+	closedir(dir);
+	return -1;
+}
+
+void tst_dump_stack_by_pid(pid_t pid)
+{
+	int fd, len;
+	char buf[512];
+	char path[1024];
+
+	tst_res(TINFO, "Pid %i stuck in kernel!", pid);
+
+	fprintf(stderr, "Kernel stacktrace follows:\n");
+	fflush(stderr);
+
+	snprintf(path, sizeof(path), "/proc/%i/stack", pid);
+
+	fd = SAFE_OPEN(path, O_RDONLY);
+
+	while ((len = SAFE_READ(0, fd, buf, sizeof(buf))) > 0)
+		SAFE_WRITE(1, 2, buf, len);
+
+	SAFE_CLOSE(fd);
+}
+
+void tst_dump_stacks_by_pgid(pid_t pgid)
+{
+	void *ps = process_search_init();
+	int pid;
+
+	while ((pid = process_search_pgid_next(ps, pgid)) != -1)
+		tst_dump_stack_by_pid(pid);
+}
diff --git a/lib/tst_test.c b/lib/tst_test.c
index 329168a24..d9476c02c 100644
--- a/lib/tst_test.c
+++ b/lib/tst_test.c
@@ -1058,7 +1058,8 @@ static int fork_testrun(void)
 		if (retries++ <= 14)
 			continue;
 
-		tst_res(TFAIL, "Test process child stuck in the kernel!");
+		tst_res(TFAIL, "Test process child(ren) stuck in the kernel!");
+		tst_dump_stacks_by_pgid(test_pid);
 		tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug.");
 	}
 
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
  2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis
  2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis
@ 2018-06-28  7:10 ` Petr Vorel
  2018-06-28  9:41 ` Li Wang
  2 siblings, 0 replies; 8+ messages in thread
From: Petr Vorel @ 2018-06-28  7:10 UTC (permalink / raw)
  To: ltp

Hi Cyril,

...
> +	while (kill(-test_pid, 0) == 0) {
> +
> +		usleep(sleep);
> +		sleep*=2;
> +
> +		if (retries++ <= 14)
                          ^
Maybe some constant for it?


Kind regards,
Petr

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
  2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis
  2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis
  2018-06-28  7:10 ` [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Petr Vorel
@ 2018-06-28  9:41 ` Li Wang
  2018-06-28 10:08   ` Li Wang
  2018-06-28 10:20   ` Cyril Hrubis
  2 siblings, 2 replies; 8+ messages in thread
From: Li Wang @ 2018-06-28  9:41 UTC (permalink / raw)
  To: ltp

Hi Cyril,

On Wed, Jun 27, 2018 at 11:22 PM, Cyril Hrubis <chrubis@suse.cz> wrote:
> If there are any leftover children the main test process will likely be
> killed while sleeping in wait(). That is because all child processes are
> either waited explicitely by the test code or implicitly by the test
> library.
>
> We also send SIGKILL to the whole process group, so if one of the
> children continues to live for long enough that very likely means that
> it ended up stuck in the kernel.
>
> So if there are any processes left with in the process group we created
> once the process group leader i.e. main test process has been waited
> for we loop for a short while to give the init daemon chance to reap the
> process after it has been reparented and if that does not happen for a
> few seconds we declare the process to be stuck in the kernel.
>
> Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
> CC: Eric Biggers <ebiggers3@gmail.com>
> ---
>  lib/tst_test.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/lib/tst_test.c b/lib/tst_test.c
> index 80808854e..329168a24 100644
> --- a/lib/tst_test.c
> +++ b/lib/tst_test.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright (c) 2015-2016 Cyril Hrubis <chrubis@suse.cz>
> + * Copyright (c) 2015-2018 Cyril Hrubis <chrubis@suse.cz>
>   *
>   * This program is free software: you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
> @@ -1047,6 +1047,21 @@ static int fork_testrun(void)
>         alarm(0);
>         SAFE_SIGNAL(SIGINT, SIG_DFL);
>
> +       unsigned int sleep = 100;
> +       unsigned int retries = 0;
> +
> +       while (kill(-test_pid, 0) == 0) {

I'm a little worried about here, image that, if a process_A(test_pid)
exist to make function kill(-test_pid, 0) return 0 at first time, then
we go into this while loop, but during the sleeping time process_A
exit and system reuse the test_pid to another process_B, we will still
keep looping and very probably make mistake to report TFAIL(with stack
of  process_B dump to ltp user in PATCH 2/2).

> +
> +               usleep(sleep);
> +               sleep*=2;
> +
> +               if (retries++ <= 14)
> +                       continue;
> +
> +               tst_res(TFAIL, "Test process child stuck in the kernel!");
> +               tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug.");
> +       }
> +
>         if (WIFEXITED(status) && WEXITSTATUS(status))
>                 return WEXITSTATUS(status);
>
> --
> 2.13.6
>
>
> --
> Mailing list info: https://lists.linux.it/listinfo/ltp



-- 
Regards,
Li Wang

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
  2018-06-28  9:41 ` Li Wang
@ 2018-06-28 10:08   ` Li Wang
  2018-06-28 10:23     ` Cyril Hrubis
  2018-06-28 10:20   ` Cyril Hrubis
  1 sibling, 1 reply; 8+ messages in thread
From: Li Wang @ 2018-06-28 10:08 UTC (permalink / raw)
  To: ltp

On Thu, Jun 28, 2018 at 5:41 PM, Li Wang <liwang@redhat.com> wrote:
> Hi Cyril,
>
> On Wed, Jun 27, 2018 at 11:22 PM, Cyril Hrubis <chrubis@suse.cz> wrote:
>> If there are any leftover children the main test process will likely be
>> killed while sleeping in wait(). That is because all child processes are
>> either waited explicitely by the test code or implicitly by the test
>> library.
>>
>> We also send SIGKILL to the whole process group, so if one of the
>> children continues to live for long enough that very likely means that
>> it ended up stuck in the kernel.
>>
>> So if there are any processes left with in the process group we created
>> once the process group leader i.e. main test process has been waited
>> for we loop for a short while to give the init daemon chance to reap the
>> process after it has been reparented and if that does not happen for a
>> few seconds we declare the process to be stuck in the kernel.
>>
>> Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
>> CC: Eric Biggers <ebiggers3@gmail.com>
>> ---
>>  lib/tst_test.c | 17 ++++++++++++++++-
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/tst_test.c b/lib/tst_test.c
>> index 80808854e..329168a24 100644
>> --- a/lib/tst_test.c
>> +++ b/lib/tst_test.c
>> @@ -1,5 +1,5 @@
>>  /*
>> - * Copyright (c) 2015-2016 Cyril Hrubis <chrubis@suse.cz>
>> + * Copyright (c) 2015-2018 Cyril Hrubis <chrubis@suse.cz>
>>   *
>>   * This program is free software: you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License as published by
>> @@ -1047,6 +1047,21 @@ static int fork_testrun(void)
>>         alarm(0);
>>         SAFE_SIGNAL(SIGINT, SIG_DFL);
>>
>> +       unsigned int sleep = 100;
>> +       unsigned int retries = 0;
>> +
>> +       while (kill(-test_pid, 0) == 0) {
>
> I'm a little worried about here, image that, if a process_A(test_pid)
> exist to make function kill(-test_pid, 0) return 0 at first time, then
> we go into this while loop, but during the sleeping time process_A
> exit and system reuse the test_pid to another process_B, we will still
> keep looping and very probably make mistake to report TFAIL(with stack
> of  process_B dump to ltp user in PATCH 2/2).

Maybe we could verify the content of '/proc/test_pid/cmdline' in this loop
to make sure test_pid is still using by the process we wanted?

>
>> +
>> +               usleep(sleep);
>> +               sleep*=2;
>> +
>> +               if (retries++ <= 14)
>> +                       continue;
>> +
>> +               tst_res(TFAIL, "Test process child stuck in the kernel!");
>> +               tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug.");
>> +       }
>> +
>>         if (WIFEXITED(status) && WEXITSTATUS(status))
>>                 return WEXITSTATUS(status);
>>
>> --
>> 2.13.6
>>
>>
>> --
>> Mailing list info: https://lists.linux.it/listinfo/ltp
>
>
>
> --
> Regards,
> Li Wang



-- 
Regards,
Li Wang

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
  2018-06-28  9:41 ` Li Wang
  2018-06-28 10:08   ` Li Wang
@ 2018-06-28 10:20   ` Cyril Hrubis
  1 sibling, 0 replies; 8+ messages in thread
From: Cyril Hrubis @ 2018-06-28 10:20 UTC (permalink / raw)
  To: ltp

Hi!
> > +       unsigned int sleep = 100;
> > +       unsigned int retries = 0;
> > +
> > +       while (kill(-test_pid, 0) == 0) {
> 
> I'm a little worried about here, image that, if a process_A(test_pid)
> exist to make function kill(-test_pid, 0) return 0 at first time, then
> we go into this while loop, but during the sleeping time process_A
> exit and system reuse the test_pid to another process_B, we will still
> keep looping and very probably make mistake to report TFAIL(with stack
> of  process_B dump to ltp user in PATCH 2/2).

That is known limitation of UNIX. In practice it's very unlikely that
the pid would be reused in very short timeframe unless there is a fork
bomb running on the system or the system is out of pids, which both
means greater trouble.

Just try to run 'watch /proc/self/stat' and look how fast is the first
number increasing. On an idle system it's increased by a single digit
number every two seconds and even if you run a parallel compilation in
background it takes a long time until we start to reuse recenlty used
pids.

I guess that we can remove the part that doubles the sleep and increase
the number of retries accordingly, that way we would be much more likely
to hit even very short interval when the pid was not allocated.

We can also include various sanity checks, we may examine the process
whoose process group matches the test_pid to some degree. We can for
instance check if the process has been reparented to init i.e. parent
pid == 1 which happens when the parent is killed. However I would like
to avoid anything too complicated since at a point we get to this
situation the kernel has been likely corrupted so all bets are off, the
system is in inconsistent state and the best action to take is to try to
inform the tester that something went wrong.

-- 
Cyril Hrubis
chrubis@suse.cz

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
  2018-06-28 10:08   ` Li Wang
@ 2018-06-28 10:23     ` Cyril Hrubis
  0 siblings, 0 replies; 8+ messages in thread
From: Cyril Hrubis @ 2018-06-28 10:23 UTC (permalink / raw)
  To: ltp

Hi!
> > I'm a little worried about here, image that, if a process_A(test_pid)
> > exist to make function kill(-test_pid, 0) return 0 at first time, then
> > we go into this while loop, but during the sleeping time process_A
> > exit and system reuse the test_pid to another process_B, we will still
> > keep looping and very probably make mistake to report TFAIL(with stack
> > of  process_B dump to ltp user in PATCH 2/2).
> 
> Maybe we could verify the content of '/proc/test_pid/cmdline' in this loop
> to make sure test_pid is still using by the process we wanted?

That unfortunatelly does not work, half of the /proc/$pid/* files block
on reading when this happens as there is a deadlock in the kernel and
the processes that try to read these files end up deadlocked too.

-- 
Cyril Hrubis
chrubis@suse.cz

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel
  2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis
@ 2018-06-28 13:05   ` Jan Stancek
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Stancek @ 2018-06-28 13:05 UTC (permalink / raw)
  To: ltp



----- Original Message -----
> This commit adds a small helper library to find a process(es) given a
> process group ID and dump their stacks.
> 
> Example output:
> 
> $ ./shmctl05
> tst_test.c:1015: INFO: Timeout per run is 0h 00m 10s
> Test timeouted, sending SIGKILL!
> tst_test.c:1059: TFAIL: Test process child stuck in the kernel!
> tst_find_pid.c:90: INFO: Pid 1272 stuck in kernel!
> Kernel stacktrace follows:
> [<ffffffffa3c12564>] __switch_to_asm+0x34/0x70
> [<ffffffffa3c12570>] __switch_to_asm+0x40/0x70
> [<ffffffffa3625761>] __switch_to+0x2c1/0x6e0
> [<ffffffffa393e194>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffffa3704802>] acct_collect+0x42/0x1a0
> [<ffffffffa367d36a>] do_exit+0x74a/0xaf0
> [<ffffffffa3c13d27>] rewind_stack_do_exit+0x17/0x20
> [<ffffffffffffffff>] 0xffffffffffffffff
> tst_test.c:1061: FAIL: Congratulation, likely test hit a kernel bug.
> 
> TODO: The main test process uses signal handler and alarm to call _exit if
> the
>       child process that executes the actuall test timeouts. We need to
>       redesign
>       this if we want to dump the stack in that case as well.

Hi,

What if we dropped _exit() from signal handler, and left all
killing to code added in 1/2 of this series?

Signal handler will only note that we hit timeout:

static void alarm_handler(int sig LTP_ATTRIBUTE_UNUSED)
{
        WRITE_MSG("Test timed out!\n");
        ++timeout_hit;
}

and fork_testrun() will be periodically checking for it:

do {
    usleep(10000);
    ret = SAFE_WAITPID(test_pid, &status, WNOHANG);
} while (ret == 0 || timeout_hit == 0);

// try to kill process group here

> 
> Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
> CC: Jan Stancek <jstancek@redhat.com>
> ---
>  include/tst_dump_stacks.h |  25 +++++++++++
>  lib/tst_dump_stacks.c     | 108
>  ++++++++++++++++++++++++++++++++++++++++++++++
>  lib/tst_test.c            |   3 +-
>  3 files changed, 135 insertions(+), 1 deletion(-)
>  create mode 100644 include/tst_dump_stacks.h
>  create mode 100644 lib/tst_dump_stacks.c
> 
> diff --git a/include/tst_dump_stacks.h b/include/tst_dump_stacks.h
> new file mode 100644
> index 000000000..643cc58a8
> --- /dev/null
> +++ b/include/tst_dump_stacks.h
> @@ -0,0 +1,25 @@
> +/*
> + * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz>
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef TST_DUMP_STACKS__
> +#define TST_DUMP_STACKS__
> +
> +void tst_dump_stacks_by_pgid(pid_t pgid);
> +
> +void tst_dump_stack_by_pid(pid_t pid);
> +
> +#endif /* TST_DUMP_STACKS__ */
> diff --git a/lib/tst_dump_stacks.c b/lib/tst_dump_stacks.c
> new file mode 100644
> index 000000000..aa97c6820
> --- /dev/null
> +++ b/lib/tst_dump_stacks.c
> @@ -0,0 +1,108 @@
> +/*
> + * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz>
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include <ctype.h>
> +#include <stdio.h>
> +
> +#define TST_NO_DEFAULT_MAIN 1
> +#include "tst_test.h"
> +
> +static void *process_search_init(void)
> +{
> +	DIR *dir = SAFE_OPENDIR("/proc/");
> +
> +	return dir;
> +}
> +
> +static int is_number(const char *str)
> +{
> +	do {
> +		if (!isdigit(*str))
> +			return 0;
> +	} while (*(++str));
> +
> +	return 1;
> +}
> +
> +static int process_search_pgid_next(void *pid_search, pid_t pgid)
> +{
> +	struct dirent *ent;
> +	DIR *dir = pid_search;
> +	char path[1024];
> +	int ppgid, pid;
> +	FILE *f;
> +
> +	while ((ent = readdir(dir))) {
> +		if (ent->d_type != DT_DIR)
> +			continue;
> +		if (!is_number(ent->d_name))
> +			continue;
> +
> +		snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name);
> +
> +		f = fopen(path, "r");
> +		if (!f)
> +			continue;
> +
> +		if (fscanf(f, "%i %*s %*c %*i %i", &pid, &ppgid) != 2) {
> +			tst_res(TWARN, "fscanf(%s) failed!", ent->d_name);
> +			fclose(f);
> +			continue;
> +		}
> +
> +		fclose(f);
> +
> +		if (ppgid == pgid)
> +			break;
> +	}
> +
> +	if (ent)
> +		return pid;
> +
> +	closedir(dir);
> +	return -1;
> +}
> +
> +void tst_dump_stack_by_pid(pid_t pid)
> +{
> +	int fd, len;
> +	char buf[512];
> +	char path[1024];
> +
> +	tst_res(TINFO, "Pid %i stuck in kernel!", pid);
> +
> +	fprintf(stderr, "Kernel stacktrace follows:\n");
> +	fflush(stderr);
> +
> +	snprintf(path, sizeof(path), "/proc/%i/stack", pid);
> +
> +	fd = SAFE_OPEN(path, O_RDONLY);
> +
> +	while ((len = SAFE_READ(0, fd, buf, sizeof(buf))) > 0)
> +		SAFE_WRITE(1, 2, buf, len);
> +
> +	SAFE_CLOSE(fd);
> +}
> +
> +void tst_dump_stacks_by_pgid(pid_t pgid)
> +{
> +	void *ps = process_search_init();
> +	int pid;
> +
> +	while ((pid = process_search_pgid_next(ps, pgid)) != -1)
> +		tst_dump_stack_by_pid(pid);
> +}
> diff --git a/lib/tst_test.c b/lib/tst_test.c
> index 329168a24..d9476c02c 100644
> --- a/lib/tst_test.c
> +++ b/lib/tst_test.c
> @@ -1058,7 +1058,8 @@ static int fork_testrun(void)
>  		if (retries++ <= 14)
>  			continue;
>  
> -		tst_res(TFAIL, "Test process child stuck in the kernel!");
> +		tst_res(TFAIL, "Test process child(ren) stuck in the kernel!");
> +		tst_dump_stacks_by_pgid(test_pid);
>  		tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug.");
>  	}

Looks good to me.

Regards,
Jan

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-06-28 13:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis
2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis
2018-06-28 13:05   ` Jan Stancek
2018-06-28  7:10 ` [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Petr Vorel
2018-06-28  9:41 ` Li Wang
2018-06-28 10:08   ` Li Wang
2018-06-28 10:23     ` Cyril Hrubis
2018-06-28 10:20   ` Cyril Hrubis

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.