linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] oom && coredump
@ 2014-11-27 23:03 Oleg Nesterov
  2014-11-27 23:04 ` [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-11-27 23:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On 11/27, Oleg Nesterov wrote:
>
> So I think the patch below makes sense anyway. Although I should probably
> split it and remove PT_TRACE_EXIT in 2/2.

So let me send the patches.

David, Michal, could you review and ack/nack these changes explicitly?

Let me repeat once again that this patch doesn't pretend to solve
all problems, even with the coredumping. And I have to admit that
my main motivation is 2/2, this PT_TRACE_EXIT check annoys me ;)

Oleg.

 mm/oom_kill.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon
  2014-11-27 23:03 [PATCH 0/2] oom && coredump Oleg Nesterov
@ 2014-11-27 23:04 ` Oleg Nesterov
  2014-12-02  9:19   ` Michal Hocko
  2014-11-27 23:04 ` [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check Oleg Nesterov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-11-27 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

oom_kill.c assumes that PF_EXITING task should exit and free the memory
soon. This is wrong in many ways and one important case is the coredump.
A task can sleep in exit_mm() "forever" while the coredumping sub-thread
can need more memory.

Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
we add the new trivial helper for that.

Note: this is only the first step, this patch doesn't try to solve other
problems. For example it doesn't try to clear the wrongly set TIF_MEMDIE
(SIGNAL_GROUP_COREDUMP check is obviously racy), fatal_signal_pending()
can be false positive, etc.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/oom_kill.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b..7af33b5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	return (task->flags & PF_EXITING) &&
+		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
+}
+
 enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill)
@@ -281,7 +287,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task->flags & PF_EXITING && !force_kill) {
+	if (task_will_free_mem(task) && !force_kill) {
 		/*
 		 * If this task is not being ptraced on exit, then wait for it
 		 * to finish before killing some other task unnecessarily.
@@ -443,7 +449,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
 		return;
@@ -649,7 +655,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check
  2014-11-27 23:03 [PATCH 0/2] oom && coredump Oleg Nesterov
  2014-11-27 23:04 ` [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
@ 2014-11-27 23:04 ` Oleg Nesterov
  2014-12-02  9:35   ` Michal Hocko
  2014-11-28 19:28 ` [PATCH 0/2] oom && coredump Oleg Nesterov
  2014-12-03 19:50 ` [PATCH v2 0/1] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
  3 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-11-27 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

After the previous patch we can remove the PT_TRACE_EXIT check in
oom_scan_process_thread(), it was added to handle the case when the
coredumping was "frozen" by ptrace, but it doesn't really work. If
nothing else, we would need to check all threads which could share
the same ->mm to make it more or less correct.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 mm/oom_kill.c |   11 +++--------
 1 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7af33b5..a2a4036 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -287,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task_will_free_mem(task) && !force_kill) {
-		/*
-		 * If this task is not being ptraced on exit, then wait for it
-		 * to finish before killing some other task unnecessarily.
-		 */
-		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
-			return OOM_SCAN_ABORT;
-	}
+	if (task_will_free_mem(task) && !force_kill)
+		return OOM_SCAN_ABORT;
+
 	return OOM_SCAN_OK;
 }
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/2] oom && coredump
  2014-11-27 23:03 [PATCH 0/2] oom && coredump Oleg Nesterov
  2014-11-27 23:04 ` [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
  2014-11-27 23:04 ` [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check Oleg Nesterov
@ 2014-11-28 19:28 ` Oleg Nesterov
  2014-12-03 19:50 ` [PATCH v2 0/1] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
  3 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-11-28 19:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On 11/28, Oleg Nesterov wrote:
>
> On 11/27, Oleg Nesterov wrote:
> >
> > So I think the patch below makes sense anyway. Although I should probably
> > split it and remove PT_TRACE_EXIT in 2/2.
>
> So let me send the patches.
>
> David, Michal, could you review and ack/nack these changes explicitly?
>
> Let me repeat once again that this patch doesn't pretend to solve
> all problems, even with the coredumping. And I have to admit that
> my main motivation is 2/2, this PT_TRACE_EXIT check annoys me ;)

Another simple test. cat mtsleep.c:

	#include <unistd.h>
	#include <pthread.h>

	void *tfunc(void *arg)
	{
		pause();
		return NULL;
	}

	int main(void)
	{
		pthread_t th;

		pthread_create(&th, NULL, tfunc, NULL);
		pause();

		return 0;
	}

Now,

	# echo '|/bin/sleep 1000' >> /proc/sys/kernel/core_pattern
	# echo 10 >> /proc/sys/kernel/core_pipe_limit

	# ./mtsleep &
	# kill -QUIT %1
	# perl -e 'push @_,"x" x 1000_1000 while 1'

Before this series the system hangs and doesn't respond. With these
patches it correctly kills perl.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon
  2014-11-27 23:04 ` [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
@ 2014-12-02  9:19   ` Michal Hocko
  2014-12-02 17:50     ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2014-12-02  9:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Fri 28-11-14 00:04:05, Oleg Nesterov wrote:
> oom_kill.c assumes that PF_EXITING task should exit and free the memory
> soon. This is wrong in many ways and one important case is the coredump.
> A task can sleep in exit_mm() "forever" while the coredumping sub-thread
> can need more memory.
> 
> Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
> we add the new trivial helper for that.
> 
> Note: this is only the first step, this patch doesn't try to solve other
> problems. For example it doesn't try to clear the wrongly set TIF_MEMDIE
> (SIGNAL_GROUP_COREDUMP check is obviously racy),

I am not sure I understand this. What do you mean by wrongly set
TIF_MEMDIE? That we give a process access to reserves even though it is
already done with the coredumping?

> fatal_signal_pending() can be false positive, etc.

When can this happen?

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

I guess the patch as is makes sense and it is an improvement. We need
to call the helper in mem_cgroup_out_of_memory as well, though.

With that feel free to add
Acked-by: Michal Hocko <mhocko@suse.cz>

Also the original fix for the coredumping (edd45544c6f0 "oom: avoid
deferring oom killer if exiting task is being traced") doesn't work
really as per http://marc.info/?l=linux-kernel&m=141711049013620 then
this and the follow up patch should be marked for stable I guess.

> ---
>  mm/oom_kill.c |   12 +++++++++---
>  1 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5340f6b..7af33b5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>  }
>  #endif
>  
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	return (task->flags & PF_EXITING) &&
> +		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> +}
> +
>  enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  		unsigned long totalpages, const nodemask_t *nodemask,
>  		bool force_kill)
> @@ -281,7 +287,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task->flags & PF_EXITING && !force_kill) {
> +	if (task_will_free_mem(task) && !force_kill) {
>  		/*
>  		 * If this task is not being ptraced on exit, then wait for it
>  		 * to finish before killing some other task unnecessarily.
> @@ -443,7 +449,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (task_will_free_mem(p)) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
>  		return;
> @@ -649,7 +655,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
>  	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}
> -- 
> 1.5.5.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check
  2014-11-27 23:04 ` [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check Oleg Nesterov
@ 2014-12-02  9:35   ` Michal Hocko
  2014-12-02 18:05     ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2014-12-02  9:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Fri 28-11-14 00:04:08, Oleg Nesterov wrote:
> After the previous patch we can remove the PT_TRACE_EXIT check in
> oom_scan_process_thread(), it was added to handle the case when the
> coredumping was "frozen" by ptrace, but it doesn't really work. If
> nothing else, we would need to check all threads which could share
> the same ->mm to make it more or less correct.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

I still do not see why we do not need task->ptrace & PT_TRACE_EXIT check
here. I do understand that the check on group_leader doesn't make much
sense. ptrace_event would block until the tracer let the task run again
which may be never AFAICS.

It is really sad how subtle and racy are all these checks :/

> ---
>  mm/oom_kill.c |   11 +++--------
>  1 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7af33b5..a2a4036 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -287,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task_will_free_mem(task) && !force_kill) {
> -		/*
> -		 * If this task is not being ptraced on exit, then wait for it
> -		 * to finish before killing some other task unnecessarily.
> -		 */
> -		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (task_will_free_mem(task) && !force_kill)
> +		return OOM_SCAN_ABORT;
> +
>  	return OOM_SCAN_OK;
>  }
>  
> -- 
> 1.5.5.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon
  2014-12-02  9:19   ` Michal Hocko
@ 2014-12-02 17:50     ` Oleg Nesterov
  2014-12-02 18:31       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-12-02 17:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On 12/02, Michal Hocko wrote:
>
> On Fri 28-11-14 00:04:05, Oleg Nesterov wrote:
>
> > Note: this is only the first step, this patch doesn't try to solve other
> > problems. For example it doesn't try to clear the wrongly set TIF_MEMDIE
> > (SIGNAL_GROUP_COREDUMP check is obviously racy),
>
> I am not sure I understand this. What do you mean by wrongly set
> TIF_MEMDIE? That we give a process access to reserves even though it is
> already done with the coredumping?

I meant that (say) oom_kill_process() can set TIF_MEMDIE because
PF_EXITING && !SIGNAL_GROUP_COREDUMP, and after that this task can
participate the coredumping. For example, this thread can exit on its
own, but before it calls exit_mm() another thread can start the coredump.

In this case TIF_MEMDIE can fool oom-killer the same way,
oom_scan_process_thread() returns OOM_SCAN_ABORT if TIF_MEMDIE is set.

> > fatal_signal_pending() can be false positive, etc.
>
> When can this happen?

I meant "if (fatal_signal_pending(current) || task_will_free_mem(current))"
in out_of_memory(). Yes, sorry, "false positive" looks confusing. I meant
that fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP.


> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> I guess the patch as is makes sense and it is an improvement. We need
> to call the helper in mem_cgroup_out_of_memory as well, though.

Yes, but can't we do this in a separate patch? try_charge() plays with
TIF_MEMDIE/PF_EXITING too, but probably this is fine.

> With that feel free to add
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks.

> Also the original fix for the coredumping (edd45544c6f0 "oom: avoid
> deferring oom killer if exiting task is being traced") doesn't work
> really as per http://marc.info/?l=linux-kernel&m=141711049013620 then
> this and the follow up patch should be marked for stable I guess.

Perhaps this makes sense. It looks simple enough.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check
  2014-12-02  9:35   ` Michal Hocko
@ 2014-12-02 18:05     ` Oleg Nesterov
  2014-12-02 18:23       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-12-02 18:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On 12/02, Michal Hocko wrote:
>
> On Fri 28-11-14 00:04:08, Oleg Nesterov wrote:
> > After the previous patch we can remove the PT_TRACE_EXIT check in
> > oom_scan_process_thread(), it was added to handle the case when the
> > coredumping was "frozen" by ptrace, but it doesn't really work. If
> > nothing else, we would need to check all threads which could share
> > the same ->mm to make it more or less correct.
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> I still do not see why we do not need task->ptrace & PT_TRACE_EXIT check
> here. I do understand that the check on group_leader doesn't make much
> sense. ptrace_event would block until the tracer let the task run again
> which may be never AFAICS.

No, note that PT_TRACE_EXIT (the last ptrace event) is reported before
PF_EXITING is set.

(just in case... we do have some problems with SIGKILL && ptrace, but
 this is completely off-topic and has nothing to do with oom-kill.c)

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check
  2014-12-02 18:05     ` Oleg Nesterov
@ 2014-12-02 18:23       ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2014-12-02 18:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Tue 02-12-14 19:05:29, Oleg Nesterov wrote:
> On 12/02, Michal Hocko wrote:
> >
> > On Fri 28-11-14 00:04:08, Oleg Nesterov wrote:
> > > After the previous patch we can remove the PT_TRACE_EXIT check in
> > > oom_scan_process_thread(), it was added to handle the case when the
> > > coredumping was "frozen" by ptrace, but it doesn't really work. If
> > > nothing else, we would need to check all threads which could share
> > > the same ->mm to make it more or less correct.
> > >
> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >
> > I still do not see why we do not need task->ptrace & PT_TRACE_EXIT check
> > here. I do understand that the check on group_leader doesn't make much
> > sense. ptrace_event would block until the tracer let the task run again
> > which may be never AFAICS.
> 
> No, note that PT_TRACE_EXIT (the last ptrace event) is reported before
> PF_EXITING is set.

OK, I managed to confuse myself by the PF_EXITING test before
exit_signals in do_exit but now that I am looking closer it seems that
this is not interesting from the OOM POV.

Thanks for the clarification. This code is so subtle that I rather ask
than miss something.

Acked-by: Michal Hocko <mhocko@suse.cz>

> (just in case... we do have some problems with SIGKILL && ptrace, but
>  this is completely off-topic and has nothing to do with oom-kill.c)
> 
> Oleg.
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon
  2014-12-02 17:50     ` Oleg Nesterov
@ 2014-12-02 18:31       ` Michal Hocko
  2014-12-02 19:16         ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2014-12-02 18:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Tue 02-12-14 18:50:41, Oleg Nesterov wrote:
> On 12/02, Michal Hocko wrote:
> >
> > On Fri 28-11-14 00:04:05, Oleg Nesterov wrote:
> >
> > > Note: this is only the first step, this patch doesn't try to solve other
> > > problems. For example it doesn't try to clear the wrongly set TIF_MEMDIE
> > > (SIGNAL_GROUP_COREDUMP check is obviously racy),
> >
> > I am not sure I understand this. What do you mean by wrongly set
> > TIF_MEMDIE? That we give a process access to reserves even though it is
> > already done with the coredumping?
> 
> I meant that (say) oom_kill_process() can set TIF_MEMDIE because
> PF_EXITING && !SIGNAL_GROUP_COREDUMP, and after that this task can
> participate the coredumping. For example, this thread can exit on its
> own, but before it calls exit_mm() another thread can start the coredump.
> 
> In this case TIF_MEMDIE can fool oom-killer the same way,
> oom_scan_process_thread() returns OOM_SCAN_ABORT if TIF_MEMDIE is set.
> 
> > > fatal_signal_pending() can be false positive, etc.
> >
> > When can this happen?
> 
> I meant "if (fatal_signal_pending(current) || task_will_free_mem(current))"
> in out_of_memory(). Yes, sorry, "false positive" looks confusing. I meant
> that fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP.

Ahh, OK I guess I see what you meant.
 
> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >
> > I guess the patch as is makes sense and it is an improvement. We need
> > to call the helper in mem_cgroup_out_of_memory as well, though.
> 
> Yes, but can't we do this in a separate patch?

I would prefer if it was in the same patch because we might be facing
the same problem in memcg as with the global case. And worse, smaller
limit tend to trigger corner cases more often than the global case.

> try_charge() plays with TIF_MEMDIE/PF_EXITING too, but probably this
> is fine.

try_charge is OK because this is from the time when the allocation has
been already done and we just decide to bypass the charge.

> > With that feel free to add
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks.
> 
> > Also the original fix for the coredumping (edd45544c6f0 "oom: avoid
> > deferring oom killer if exiting task is being traced") doesn't work
> > really as per http://marc.info/?l=linux-kernel&m=141711049013620 then
> > this and the follow up patch should be marked for stable I guess.
> 
> Perhaps this makes sense. It looks simple enough.
> 
> Oleg.
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon
  2014-12-02 18:31       ` Michal Hocko
@ 2014-12-02 19:16         ` Oleg Nesterov
  2014-12-03 13:07           ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-12-02 19:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On 12/02, Michal Hocko wrote:
>
> On Tue 02-12-14 18:50:41, Oleg Nesterov wrote:
> > On 12/02, Michal Hocko wrote:
> > >
> > > I guess the patch as is makes sense and it is an improvement. We need
> > > to call the helper in mem_cgroup_out_of_memory as well, though.
> >
> > Yes, but can't we do this in a separate patch?
>
> I would prefer if it was in the same patch because we might be facing
> the same problem in memcg as with the global case. And worse, smaller
> limit tend to trigger corner cases more often than the global case.

OK, I'll do V2...

But let me explain why I thought about another patch. I do not want
to export task_will_free_mem(). If nothing else, its name matches the
current "quickly exit and free its memory" comments but not the reality.
An exiting thread won't free the memory (ignoring task_struct/etc) if
the process is multithreaded.

I'd rather add another helper for oom_kill.c and memcontrol.c which does

	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
		set_thread_flag(TIF_MEMDIE);
		return true;
	}

	return false;

This way the patch could document that fatal_signal_pending() is not
exactly right as we discussed, and then we can improve this helper.

But OK, probably this helper doesn't really make sense, and I can not
invent the good name for it ;)

> > try_charge() plays with TIF_MEMDIE/PF_EXITING too, but probably this
> > is fine.
>
> try_charge is OK because this is from the time when the allocation has
> been already done and we just decide to bypass the charge.

Yes, thanks, this was my vague understanding but I wasn't sure. However,
I am not sure that PF_EXITING check is 100% right (again, this can only
mean that a single thread from a thread group exits), but I do not
understand this code and I agree this is another story in any case.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon
  2014-12-02 19:16         ` Oleg Nesterov
@ 2014-12-03 13:07           ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2014-12-03 13:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Tue 02-12-14 20:16:22, Oleg Nesterov wrote:
> On 12/02, Michal Hocko wrote:
> >
> > On Tue 02-12-14 18:50:41, Oleg Nesterov wrote:
> > > On 12/02, Michal Hocko wrote:
> > > >
> > > > I guess the patch as is makes sense and it is an improvement. We need
> > > > to call the helper in mem_cgroup_out_of_memory as well, though.
> > >
> > > Yes, but can't we do this in a separate patch?
> >
> > I would prefer if it was in the same patch because we might be facing
> > the same problem in memcg as with the global case. And worse, smaller
> > limit tend to trigger corner cases more often than the global case.
> 
> OK, I'll do V2...
> 
> But let me explain why I thought about another patch. I do not want
> to export task_will_free_mem(). If nothing else, its name matches the
> current "quickly exit and free its memory" comments but not the reality.

Yes, the name suggests much more than what it does.

> An exiting thread won't free the memory (ignoring task_struct/etc) if
> the process is multithreaded.
> 
> I'd rather add another helper for oom_kill.c and memcontrol.c which does
> 
> 	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> 		set_thread_flag(TIF_MEMDIE);
> 		return true;
> 	}
> 
> 	return false;
> 
> This way the patch could document that fatal_signal_pending() is not
> exactly right as we discussed, and then we can improve this helper.
> 
> But OK, probably this helper doesn't really make sense, and I can not
> invent the good name for it ;)

I've failed to come up with a better name as well. The code duplication
is not nice either so I guess it would be better to keep the helper
localt to mm/oom_kill.c and have it open coded in mm/memcontro.c. Git
blame will still tell us all the motivation if they are in the single
patch.

> > > try_charge() plays with TIF_MEMDIE/PF_EXITING too, but probably this
> > > is fine.
> >
> > try_charge is OK because this is from the time when the allocation has
> > been already done and we just decide to bypass the charge.
> 
> Yes, thanks, this was my vague understanding but I wasn't sure. However,
> I am not sure that PF_EXITING check is 100% right (again, this can only
> mean that a single thread from a thread group exits), but I do not
> understand this code and I agree this is another story in any case.

See d8dc595ce390 "memcg: do not hang on OOM when killed by userspace OOM
access to memory reserves" for more details.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 0/1] oom: don't assume that a coredumping thread will exit soon
  2014-11-27 23:03 [PATCH 0/2] oom && coredump Oleg Nesterov
                   ` (2 preceding siblings ...)
  2014-11-28 19:28 ` [PATCH 0/2] oom && coredump Oleg Nesterov
@ 2014-12-03 19:50 ` Oleg Nesterov
  2014-12-03 19:50   ` [PATCH v2 1/1] " Oleg Nesterov
  3 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-12-03 19:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

Andrew,

This replaces

	oom-dont-assume-that-a-coredumping-thread-will-exit-soon.patch

in -mm tree.

The patch is basically the same, but as Michal pointed out
mem_cgroup_out_of_memory() can use the new helper too, it can face
the same problems.

Also, I tried to update the changelog.

Michal, can you ack this version?

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/1] oom: don't assume that a coredumping thread will exit soon
  2014-12-03 19:50 ` [PATCH v2 0/1] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
@ 2014-12-03 19:50   ` Oleg Nesterov
  2014-12-03 21:58     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-12-03 19:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

oom_kill.c assumes that PF_EXITING task should exit and free the memory
soon. This is wrong in many ways and one important case is the coredump.
A task can sleep in exit_mm() "forever" while the coredumping sub-thread
can need more memory.

Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
we add the new trivial helper for that.

Note: this is only the first step, this patch doesn't try to solve other
problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
participate in coredump after it was already observed in PF_EXITING state,
so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
And even the name/usage of the new helper is confusing, an exiting thread
can only free its ->mm if it is the only/last task in thread group.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/oom.h |    6 ++++++
 mm/memcontrol.c     |    2 +-
 mm/oom_kill.c       |    6 +++---
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index e8d6e10..e16b945 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -92,6 +92,12 @@ static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	return (task->flags & PF_EXITING) &&
+		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
+}
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d6ac0e3..3d1d49f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1734,7 +1734,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5340f6b..837ff31 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -281,7 +281,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task->flags & PF_EXITING && !force_kill) {
+	if (task_will_free_mem(task) && !force_kill) {
 		/*
 		 * If this task is not being ptraced on exit, then wait for it
 		 * to finish before killing some other task unnecessarily.
@@ -443,7 +443,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
 		return;
@@ -649,7 +649,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] oom: don't assume that a coredumping thread will exit soon
  2014-12-03 19:50   ` [PATCH v2 1/1] " Oleg Nesterov
@ 2014-12-03 21:58     ` Andrew Morton
  2014-12-04 14:34     ` Michal Hocko
  2014-12-05 23:46     ` David Rientjes
  2 siblings, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2014-12-03 21:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Cong Wang, David Rientjes, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Wed, 3 Dec 2014 20:50:20 +0100 Oleg Nesterov <oleg@redhat.com> wrote:

> oom_kill.c assumes that PF_EXITING task should exit and free the memory
> soon. This is wrong in many ways and one important case is the coredump.
> A task can sleep in exit_mm() "forever" while the coredumping sub-thread
> can need more memory.
> 
> Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
> we add the new trivial helper for that.
> 
> Note: this is only the first step, this patch doesn't try to solve other
> problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
> participate in coredump after it was already observed in PF_EXITING state,
> so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
> fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
> out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
> And even the name/usage of the new helper is confusing, an exiting thread
> can only free its ->mm if it is the only/last task in thread group.
> 
> ...
>
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -92,6 +92,12 @@ static inline bool oom_gfp_allowed(gfp_t gfp_mask)
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	return (task->flags & PF_EXITING) &&
> +		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> +}

The SIGNAL_GROUP_COREDUMP test is utterly obscure.  Let's help our poor
readers?

--- a/include/linux/oom.h~oom-dont-assume-that-a-coredumping-thread-will-exit-soon-v2-fix
+++ a/include/linux/oom.h
@@ -89,6 +89,11 @@ extern struct task_struct *find_lock_tas
 
 static inline bool task_will_free_mem(struct task_struct *task)
 {
+	/*
+	 * A coredumping process may sleep for extended periods in exit_mm(), so
+	 * the oom killer cannot assume that the process will promptly exit and
+	 * release memory.
+	 */
 	return (task->flags & PF_EXITING) &&
 		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
 }


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] oom: don't assume that a coredumping thread will exit soon
  2014-12-03 19:50   ` [PATCH v2 1/1] " Oleg Nesterov
  2014-12-03 21:58     ` Andrew Morton
@ 2014-12-04 14:34     ` Michal Hocko
  2014-12-05 23:46     ` David Rientjes
  2 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2014-12-04 14:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, David Rientjes, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Wed 03-12-14 20:50:20, Oleg Nesterov wrote:
> oom_kill.c assumes that PF_EXITING task should exit and free the memory
> soon. This is wrong in many ways and one important case is the coredump.
> A task can sleep in exit_mm() "forever" while the coredumping sub-thread
> can need more memory.
> 
> Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
> we add the new trivial helper for that.
> 
> Note: this is only the first step, this patch doesn't try to solve other
> problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
> participate in coredump after it was already observed in PF_EXITING state,
> so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
> fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
> out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
> And even the name/usage of the new helper is confusing, an exiting thread
> can only free its ->mm if it is the only/last task in thread group.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

OK, I do not have a better name for the helper and code duplication
looks even worse so
Acked-by: Michal Hocko <mhocko@suse.cz>

Andrew's additional comment makes sense as well.
> ---
>  include/linux/oom.h |    6 ++++++
>  mm/memcontrol.c     |    2 +-
>  mm/oom_kill.c       |    6 +++---
>  3 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index e8d6e10..e16b945 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -92,6 +92,12 @@ static inline bool oom_gfp_allowed(gfp_t gfp_mask)
>  
>  extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>  
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	return (task->flags & PF_EXITING) &&
> +		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> +}
> +
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
>  extern int sysctl_oom_kill_allocating_task;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d6ac0e3..3d1d49f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1734,7 +1734,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
>  	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5340f6b..837ff31 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -281,7 +281,7 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task->flags & PF_EXITING && !force_kill) {
> +	if (task_will_free_mem(task) && !force_kill) {
>  		/*
>  		 * If this task is not being ptraced on exit, then wait for it
>  		 * to finish before killing some other task unnecessarily.
> @@ -443,7 +443,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (task_will_free_mem(p)) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
>  		return;
> @@ -649,7 +649,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
>  	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}
> -- 
> 1.5.5.1
> 
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/1] oom: don't assume that a coredumping thread will exit soon
  2014-12-03 19:50   ` [PATCH v2 1/1] " Oleg Nesterov
  2014-12-03 21:58     ` Andrew Morton
  2014-12-04 14:34     ` Michal Hocko
@ 2014-12-05 23:46     ` David Rientjes
  2 siblings, 0 replies; 17+ messages in thread
From: David Rientjes @ 2014-12-05 23:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, Cong Wang, Michal Hocko, Rafael J. Wysocki,
	Tejun Heo, linux-kernel

On Wed, 3 Dec 2014, Oleg Nesterov wrote:

> oom_kill.c assumes that PF_EXITING task should exit and free the memory
> soon. This is wrong in many ways and one important case is the coredump.
> A task can sleep in exit_mm() "forever" while the coredumping sub-thread
> can need more memory.
> 
> Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
> we add the new trivial helper for that.
> 
> Note: this is only the first step, this patch doesn't try to solve other
> problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
> participate in coredump after it was already observed in PF_EXITING state,
> so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
> fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
> out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
> And even the name/usage of the new helper is confusing, an exiting thread
> can only free its ->mm if it is the only/last task in thread group.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-12-05 23:46 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-27 23:03 [PATCH 0/2] oom && coredump Oleg Nesterov
2014-11-27 23:04 ` [PATCH 1/2] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
2014-12-02  9:19   ` Michal Hocko
2014-12-02 17:50     ` Oleg Nesterov
2014-12-02 18:31       ` Michal Hocko
2014-12-02 19:16         ` Oleg Nesterov
2014-12-03 13:07           ` Michal Hocko
2014-11-27 23:04 ` [PATCH 2/2] oom: kill the insufficient and no longer needed PT_TRACE_EXIT check Oleg Nesterov
2014-12-02  9:35   ` Michal Hocko
2014-12-02 18:05     ` Oleg Nesterov
2014-12-02 18:23       ` Michal Hocko
2014-11-28 19:28 ` [PATCH 0/2] oom && coredump Oleg Nesterov
2014-12-03 19:50 ` [PATCH v2 0/1] oom: don't assume that a coredumping thread will exit soon Oleg Nesterov
2014-12-03 19:50   ` [PATCH v2 1/1] " Oleg Nesterov
2014-12-03 21:58     ` Andrew Morton
2014-12-04 14:34     ` Michal Hocko
2014-12-05 23:46     ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).