All of lore.kernel.org
 help / color / mirror / Atom feed
* [LTP] mtest01 parent/child process synchronization issue
@ 2015-11-18 16:51 Jiri Vohanka
  2015-11-23 10:10 ` Jan Stancek
  0 siblings, 1 reply; 5+ messages in thread
From: Jiri Vohanka @ 2015-11-18 16:51 UTC (permalink / raw)
  To: ltp

Hello,

I discovered a problem in mtest01 which causes it to freeze on some occasions.
As I understand it, the mtest01:

Master process:
* the master process starts several child processes
   (each child process is tasked with allocating some amount of memory)
* it waits until all child processes send SIGRTMIN signal or
   until the amount of free memory is decreased by the amount
   that all child tasks together should allocate
* it kills all child processes and exits

Child process:
* allocates certain amount of memory and fills it with some data
   (the '-w' option) so the pages are actually allocated
* when the desired amount of memory is allocated then it sends
   SIGRTMIN signal to the parent process
* it waits in the infinite loop (until parent commits filicide)

The problem that I encountered is that the master process never observes
sufficient decrease in the free memory and one of the child processes is
killed by oom-killer before it sends SIGRTMIN to the master process.
Thus, the master process never terminates.

I added a debug code that:
* Prints 'signal: <child-pid>' when SIGRTMIN is sent from the child process.
* Prints 'sigchld' when the master process receives SIGCHLD.
   (This happens when oom-killer kills the child process.)

I got the following result:

# ./mtest01 -p90 -w
mtest01     0  TINFO  :  Total memory already used on system = 2234816 kbytes
mtest01     0  TINFO  :  Total memory used needed to reach maximum = 9337709 kbytes
mtest01     0  TINFO  :  Filling up 90% of ram which is 7102893 kbytes
mtest01     0  TINFO  :  ... 831520768 bytes allocated and used.
signal: 2318
mtest01     0  TINFO  :  ... 3221225472 bytes allocated and used.
signal: 2316
sigchld

at the same time
# ps -e | grep mtest01
  2315 pts/0    00:00:00 mtest01
  2316 pts/0    00:00:03 mtest01
  2317 pts/0    00:00:04 mtest01 <defunct>
  2318 pts/0    00:00:00 mtest01

and dmesg shows
[ 1482.432478] Out of memory: Kill process 2317 (mtest01) score 251 or sacrifice child

You can see that the master task 2315 receives SIGRTMIN signal from 2318 and 2316
but not from 2317 which is killed by oom-killer, from which it receives SIGCHLD signal.

I think that the oom-killer should not be invoked during mtest01 (there might be
a problem in our kernel), nevertheless the test should handle that situation more
gracefully.

I attached a patch that fixes this issue. It modifies the test such that the master
task also waits for SIGCHLD. The test fails if SIGCHLD is received (I hope that
this is a correct behavior).

The output of the patched test looks like:
<<<test_start>>>
tag=mtest01w stime=1447855660
cmdline="mtest01 -p80 -w"
contacts=""
analysis=exit
<<<test_output>>>
mtest01     0  TINFO  :  Total memory already used on system = 672000 kbytes
mtest01     0  TINFO  :  Total memory used needed to reach maximum = 8300185 kbytes
mtest01     0  TINFO  :  Filling up 80% of ram which is 7628185 kbytes
mtest01     1  TFAIL  :  mtest01.c:292: the child process was killed
<<<execution_status>>>
initiation_status="ok"
duration=7 termination_type=exited termination_id=1 corefile=no
cutime=0 cstime=0
<<<test_end>>>

Regards,
Jiri Vohanka

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Watch-for-SIGCHLD-in-mtest01-in-case-the-child-gets-.patch
Type: text/x-patch
Size: 2460 bytes
Desc: not available
URL: <http://lists.linux.it/pipermail/ltp/attachments/20151118/dfe42bf7/attachment.bin>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [LTP] mtest01 parent/child process synchronization issue
  2015-11-18 16:51 [LTP] mtest01 parent/child process synchronization issue Jiri Vohanka
@ 2015-11-23 10:10 ` Jan Stancek
  2015-11-25 10:47   ` Jiri Vohanka
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Stancek @ 2015-11-23 10:10 UTC (permalink / raw)
  To: ltp

On 11/18/2015 05:51 PM, Jiri Vohanka wrote:
> Hello,
> 
> I discovered a problem in mtest01 which causes it to freeze on some occasions.

Hi,

> diff --git a/testcases/kernel/mem/mtest01/mtest01.c b/testcases/kernel/mem/mtest01/mtest01.c
> index 8c9e81c..06f41d5 100644
> --- a/testcases/kernel/mem/mtest01/mtest01.c
> +++ b/testcases/kernel/mem/mtest01/mtest01.c
> @@ -51,9 +51,12 @@
>  char *TCID = "mtest01";
>  int TST_TOTAL = 1;
>  int pid_count = 0;
> +int oom_count = 0;

both of these should be static

>  
>  void handler(int signo)
>  {
> +	if (signo == SIGCHLD)
> +		oom_count++;
>  	pid_count++;
>  }
>  
> @@ -77,6 +80,7 @@ int main(int argc, char *argv[])
>  	act.sa_flags = 0;
>  	sigemptyset(&act.sa_mask);
>  	sigaction(SIGRTMIN, &act, 0);
> +	sigaction(SIGCHLD, &act, 0);
>  
>  	while ((c = getopt(argc, argv, "c:b:p:wvh")) != -1) {
>  		switch (c) {
> @@ -268,7 +272,7 @@ int main(int argc, char *argv[])
>  
>  			while ((((unsigned long long)pre_mem - post_mem) <
>  				(unsigned long long)original_maxbytes) &&
> -			       pid_count < pid_cntr) {
> +			       pid_count < pid_cntr && !oom_count) {
>  				sleep(1);
>  				sysinfo(&sstats);
>  				post_mem =
> @@ -284,7 +288,9 @@ int main(int argc, char *argv[])
>  			kill(pid_list[i], SIGKILL);
>  			i++;
>  		}
> -		if (dowrite)
> +		if (oom_count)
> +			tst_resm(TFAIL, "the child process was killed");


I wouldn't combine two "if"s into one if else block, the information how
much was allocated could be useful even when we hit OOM.

But mainly, problem here is that ~6 lines above child processes get killed,
so you are racing with SIGCHLD here. I'd suggest to move this TFAIL before
kill() loop.

Regards,
Jan


> +		else if (dowrite)
>  			tst_resm(TPASS, "%llu kbytes allocated and used.",
>  				 original_maxbytes / 1024);
>  		else
> -- 
> 2.4.3



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [LTP] mtest01 parent/child process synchronization issue
  2015-11-23 10:10 ` Jan Stancek
@ 2015-11-25 10:47   ` Jiri Vohanka
  2015-11-26 14:38     ` Jan Stancek
  2015-11-26 14:50     ` Cyril Hrubis
  0 siblings, 2 replies; 5+ messages in thread
From: Jiri Vohanka @ 2015-11-25 10:47 UTC (permalink / raw)
  To: ltp

Hi,

>> diff --git a/testcases/kernel/mem/mtest01/mtest01.c b/testcases/kernel/mem/mtest01/mtest01.c
>> index 8c9e81c..06f41d5 100644
>> --- a/testcases/kernel/mem/mtest01/mtest01.c
>> +++ b/testcases/kernel/mem/mtest01/mtest01.c

>> @@ -51,9 +51,12 @@
>>   char *TCID = "mtest01";
>>   int TST_TOTAL = 1;
>>   int pid_count = 0;
>> +int oom_count = 0;
>
> both of these should be static

You are right, I also added volatile just to be sure.

>> @@ -284,7 +288,9 @@ int main(int argc, char *argv[])
>>   			kill(pid_list[i], SIGKILL);
>>   			i++;
>>   		}
>> -		if (dowrite)
>> +		if (oom_count)
>> +			tst_resm(TFAIL, "the child process was killed");
>
> I wouldn't combine two "if"s into one if else block, the information how
> much was allocated could be useful even when we hit OOM.

The original_maxbytes tells us how much we want to allocate, not how much was already allocated. You have to watch TINFO messages emitted by 
child processes to see what was actually allocated. I think that printing ""%llu kbytes allocated" would be misleading since that is true only 
if the test passes. I kept the if statement at its current place, but I made the fail message more verbose.

> But mainly, problem here is that ~6 lines above child processes get killed,
> so you are racing with SIGCHLD here. I'd suggest to move this TFAIL before
> kill() loop.

I moved the part that kills the child processes after PASS/FAIL evaluation to avoid possible race condition.
(Another option would be to uninstall the SIGCHLD signal handler before killing the processes.)

>> +		else if (dowrite)
>>   			tst_resm(TPASS, "%llu kbytes allocated and used.",
>>   				 original_maxbytes / 1024);
>>   		else

I am new to the list and I am not sure what the rules for posting patches are. Should I use 'git send-email', sign the patch or something?

Regards,
Jiri
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Watch-for-SIGCHLD-in-mtest01-in-case-the-child-gets-.patch
Type: text/x-patch
Size: 2887 bytes
Desc: not available
URL: <http://lists.linux.it/pipermail/ltp/attachments/20151125/77c5390d/attachment.bin>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [LTP] mtest01 parent/child process synchronization issue
  2015-11-25 10:47   ` Jiri Vohanka
@ 2015-11-26 14:38     ` Jan Stancek
  2015-11-26 14:50     ` Cyril Hrubis
  1 sibling, 0 replies; 5+ messages in thread
From: Jan Stancek @ 2015-11-26 14:38 UTC (permalink / raw)
  To: ltp





----- Original Message -----
> From: "Jiri Vohanka" <jvohanka@redhat.com>
> To: "Jan Stancek" <jstancek@redhat.com>, ltp@lists.linux.it
> Sent: Wednesday, 25 November, 2015 11:47:36 AM
> Subject: Re: [LTP] mtest01 parent/child process synchronization issue
> 
> Hi,

Hi,

pushed with some changes, see comments below.

> 
> >> diff --git a/testcases/kernel/mem/mtest01/mtest01.c
> >> b/testcases/kernel/mem/mtest01/mtest01.c
> >> index 8c9e81c..06f41d5 100644
> >> --- a/testcases/kernel/mem/mtest01/mtest01.c
> >> +++ b/testcases/kernel/mem/mtest01/mtest01.c
> 
> >> @@ -51,9 +51,12 @@
> >>   char *TCID = "mtest01";
> >>   int TST_TOTAL = 1;
> >>   int pid_count = 0;
> >> +int oom_count = 0;
> >
> > both of these should be static
> 
> You are right, I also added volatile just to be sure.

I changed it to:
  static sig_atomic_t sigchld_count;

> 
> >> @@ -284,7 +288,9 @@ int main(int argc, char *argv[])
> >>   			kill(pid_list[i], SIGKILL);
> >>   			i++;
> >>   		}
> >> -		if (dowrite)
> >> +		if (oom_count)
> >> +			tst_resm(TFAIL, "the child process was killed");
> >
> > I wouldn't combine two "if"s into one if else block, the information how
> > much was allocated could be useful even when we hit OOM.
> 
> The original_maxbytes tells us how much we want to allocate, not how much was
> already allocated. You have to watch TINFO messages emitted by
> child processes to see what was actually allocated. I think that printing
> ""%llu kbytes allocated" would be misleading since that is true only
> if the test passes. I kept the if statement at its current place, but I made
> the fail message more verbose.

You're right, printing orig_maxbytes is useless.

> 
> > But mainly, problem here is that ~6 lines above child processes get killed,
> > so you are racing with SIGCHLD here. I'd suggest to move this TFAIL before
> > kill() loop.
> 
> I moved the part that kills the child processes after PASS/FAIL evaluation to
> avoid possible race condition.
> (Another option would be to uninstall the SIGCHLD signal handler before
> killing the processes.)
> 
> >> +		else if (dowrite)
> >>   			tst_resm(TPASS, "%llu kbytes allocated and used.",
> >>   				 original_maxbytes / 1024);
> >>   		else
> 
> I am new to the list and I am not sure what the rules for posting patches
> are. Should I use 'git send-email', sign the patch or something?

Yes, also good start is to check out "style-guide.txt" and
"test-writing-guidelines.txt" in doc directory.

Regards,
Jan

> 
> Regards,
> Jiri
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [LTP] mtest01 parent/child process synchronization issue
  2015-11-25 10:47   ` Jiri Vohanka
  2015-11-26 14:38     ` Jan Stancek
@ 2015-11-26 14:50     ` Cyril Hrubis
  1 sibling, 0 replies; 5+ messages in thread
From: Cyril Hrubis @ 2015-11-26 14:50 UTC (permalink / raw)
  To: ltp

Hi!
> I am new to the list and I am not sure what the rules for posting
> patches are. Should I use 'git send-email', sign the patch or
> something?

The usual workflow is to tell git your name and email (git config
user.email foo_bar@email.com and git config user.name Foo Bar) and use
git commit -s. Then both git format-patch and git send-email would
produce signed patches.

> From 05227cd214ed7fcb5ba9e5234d06c3866bfe98dc Mon Sep 17 00:00:00 2001
> From: jvohanka <jvohanka@redhat.com>
> Date: Wed, 18 Nov 2015 14:39:21 +0100
> Subject: [PATCH] Watch for SIGCHLD in mtest01 in case the child gets killed by
>  oom-killer
> 
> The mtest01 checks that a specified amount of memory could be allocated.
> The memory allocation is performed by several child processes. Each
> child allocates certain amount of memory and then sends SIGRTMIN to
> the parent. The parent waits until it receives SIGRTMIN from all child
> processes or until the amount of free memory is decreased by at least
> the amount it tasked the children to allocate.
> The problem is that in certain situations the child process gets killed
> by oom-killer. The parent never receives SIGRTMIN from that process and
> there is not sufficient decrease in the amount of free memory, so
> the parent keeps waiting indefinitely.
> This patch fixes the above issue. It modifies the test in such a way that
> it also waits for SIGCHLD signal (which is sent when the oom-killer
> terminates the child process). If SIGCHLD is received the test FAILS.

Ah, and new patch is hidden as an atachement in the reply to the email.
I nearly missed it. Ideally new version of patch should be send (inline)
as a separate email with [PATCH vN] (where N increments with each
version) in subject.

-- 
Cyril Hrubis
chrubis@suse.cz

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-11-26 14:50 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-18 16:51 [LTP] mtest01 parent/child process synchronization issue Jiri Vohanka
2015-11-23 10:10 ` Jan Stancek
2015-11-25 10:47   ` Jiri Vohanka
2015-11-26 14:38     ` Jan Stancek
2015-11-26 14:50     ` Cyril Hrubis

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.