* Re: + oom-pm-oom-killed-task-cannot-escape-pm-suspend.patch added to -mm tree
From: Oleg Nesterov @ 2014-10-17 17:19 UTC
  To: Michal Hocko, Cong Wang, Rafael J. Wysocki, Tejun Heo,
	David Rientjes, Andrew Morton
  Cc: linux-kernel

Michal, I am not really arguing with this patch, but since you are going
(iiuc) to resend it anyway, let me ask a couple of questions.

> This, however, still keeps
> a window open when a killed task didn't manage to die by the time
> freeze_processes finishes.

Sure,

> Fix this race by checking all tasks after OOM killer has been disabled.

But this doesn't close the race entirely? Please see below.

>  int freeze_processes(void)
>  {
>  	int error;
> +	int oom_kills_saved;
>
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -132,12 +133,40 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> +	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
> -		printk("done.");
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
>  		oom_killer_disable();
> +
> +		/*
> +		 * There was an OOM kill while we were freezing tasks
> +		 * and the killed task might be still on the way out
> +		 * so we have to double check for race.
> +		 */
> +		if (oom_kills_count() != oom_kills_saved) {

OK, I agree, this makes things better, but perhaps we should document
(at least in the changelog) that this is still racy. oom_killer_disable()
obviously can stop the already called out_of_memory(); it can kill a frozen
task right after this check or even after the loop below.
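
IOW, something like this can still happen (a sketch, not exact code):

	freeze_processes()                        out_of_memory()
	  oom_kills_saved = oom_kills_count()
	  try_to_freeze_tasks(true)                 /* already past the
	  oom_killer_disable()                         oom_killer_disabled check */
	  oom_kills_count() == oom_kills_saved
	  /* no OOM noticed, suspend continues */
	                                            oom_kill_process()
	                                              /* kills/wakes a frozen task */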

> +			struct task_struct *g, *p;
> +
> +			read_lock(&tasklist_lock);
> +			do_each_thread(g, p) {
> +				if (p == current || freezer_should_skip(p) ||
> +				    frozen(p))
> +					continue;
> +				error = -EBUSY;
> +				break;
> +			} while_each_thread(g, p);

Please use for_each_process_thread(); do/while_each_thread is deprecated.
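
Something like this, completely untested (and note that
for_each_process_thread() is a double loop too, so a plain break won't
terminate the outer scan; you need a goto):

	read_lock(&tasklist_lock);
	for_each_process_thread(g, p) {
		if (p == current || freezer_should_skip(p) || frozen(p))
			continue;
		error = -EBUSY;
		goto out;
	}
out:
	read_unlock(&tasklist_lock);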

> +/*
> + * Number of OOM killer invocations (including memcg OOM killer).
> + * Primarily used by PM freezer to check for potential races with
> + * OOM killed frozen task.
> + */
> +static atomic_t oom_kills = ATOMIC_INIT(0);
> +
> +int oom_kills_count(void)
> +{
> +	return atomic_read(&oom_kills);
> +}
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  /*
>   * Must be called while holding a reference to p, which will be released upon
> @@ -504,11 +516,13 @@ void oom_kill_process(struct task_struct
>  			pr_err("Kill process %d (%s) sharing same memory\n",
>  				task_pid_nr(p), p->comm);
>  			task_unlock(p);
> +			atomic_inc(&oom_kills);

Do we really need this? Can't freeze_processes() (ab)use oom_notify_list?

Yes, we can have more false positives this way, but probably this doesn't
matter? This is an unlikely case anyway.
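
Something like this, perhaps (completely untested, all names made up; the
notifier fires on every out_of_memory() entry, hence the extra false
positives):

	static atomic_t pm_oom_count = ATOMIC_INIT(0);

	/* called from the oom_notify_list chain at out_of_memory() entry */
	static int pm_oom_notify(struct notifier_block *nb,
				 unsigned long unused, void *parm)
	{
		atomic_inc(&pm_oom_count);
		return NOTIFY_OK;
	}

	static struct notifier_block pm_oom_nb = {
		.notifier_call = pm_oom_notify,
	};

Then freeze_processes() could register_oom_notifier(&pm_oom_nb) once and
compare atomic_read(&pm_oom_count) before/after try_to_freeze_tasks().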

Oleg.


* Re: + oom-pm-oom-killed-task-cannot-escape-pm-suspend.patch added to -mm tree
From: Michal Hocko @ 2014-10-20 18:46 UTC
  To: Oleg Nesterov
  Cc: Cong Wang, Rafael J. Wysocki, Tejun Heo, David Rientjes,
	Andrew Morton, linux-kernel

On Fri 17-10-14 19:19:04, Oleg Nesterov wrote:
> Michal, I am not really arguing with this patch, but since you are going
> (iiuc) to resend it anyway, let me ask a couple of questions.
> 
> > This, however, still keeps
> > a window open when a killed task didn't manage to die by the time
> > freeze_processes finishes.
> 
> Sure,
> 
> > Fix this race by checking all tasks after OOM killer has been disabled.
> 
> But this doesn't close the race entirely? Please see below.
> 
> >  int freeze_processes(void)
> >  {
> >  	int error;
> > +	int oom_kills_saved;
> >
> >  	error = __usermodehelper_disable(UMH_FREEZING);
> >  	if (error)
> > @@ -132,12 +133,40 @@ int freeze_processes(void)
> >  	pm_wakeup_clear();
> >  	printk("Freezing user space processes ... ");
> >  	pm_freezing = true;
> > +	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> >  	if (!error) {
> > -		printk("done.");
> >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> >  		oom_killer_disable();
> > +
> > +		/*
> > +		 * There was an OOM kill while we were freezing tasks
> > +		 * and the killed task might be still on the way out
> > +		 * so we have to double check for race.
> > +		 */
> > +		if (oom_kills_count() != oom_kills_saved) {
> 
> OK, I agree, this makes things better, but perhaps we should document
> (at least in the changelog) that this is still racy. oom_killer_disable()
> obviously can stop the already called out_of_memory(); it can kill a frozen

I guess you meant "can't stop the already called..."

> task right after this check or even after the loop below.

You are right. The race window is still there. I had considered all tasks
being frozen as sufficient, but kernel threads and workqueue items may
allocate memory while we are freezing tasks and trigger the OOM killer as
well. This will be inherently racy unless we use locking between the
freezer and the OOM killer, which sounds too heavyweight to me. I can
reduce the race window by noting an OOM much earlier, when the allocator
enters its last round before the OOM killer fires (this is what
note_oom_kill() does in the patch below). The question is whether that is
sufficient, because it is only a half solution too.

> > +			struct task_struct *g, *p;
> > +
> > +			read_lock(&tasklist_lock);
> > +			do_each_thread(g, p) {
> > +				if (p == current || freezer_should_skip(p) ||
> > +				    frozen(p))
> > +					continue;
> > +				error = -EBUSY;
> > +				break;
> > +			} while_each_thread(g, p);
> 
> Please use for_each_process_thread(); do/while_each_thread is deprecated.

Sure, I was mimicking try_to_freeze_tasks, which still uses the old
interface. I will send a patch which uses the new macro.

> > +/*
> > + * Number of OOM killer invocations (including memcg OOM killer).
> > + * Primarily used by PM freezer to check for potential races with
> > + * OOM killed frozen task.
> > + */
> > +static atomic_t oom_kills = ATOMIC_INIT(0);
> > +
> > +int oom_kills_count(void)
> > +{
> > +	return atomic_read(&oom_kills);
> > +}
> > +
> >  #define K(x) ((x) << (PAGE_SHIFT-10))
> >  /*
> >   * Must be called while holding a reference to p, which will be released upon
> > @@ -504,11 +516,13 @@ void oom_kill_process(struct task_struct
> >  			pr_err("Kill process %d (%s) sharing same memory\n",
> >  				task_pid_nr(p), p->comm);
> >  			task_unlock(p);
> > +			atomic_inc(&oom_kills);
> 
> Do we really need this? Can't freeze_processes() (ab)use oom_notify_list?

I would really prefer not using oom_notify_list. It is just an ugly
interface.

> Yes, we can have more false positives this way, but probably this doesn't
> matter? This is an unlikely case anyway.

Yeah, false positives are not a big deal.

I cannot say I am happy about the following because it doesn't close the
race window completely, but it is very well possible that closing it
completely would require much bigger changes, and maybe this is
sufficient for now?
---
From d5f7b3e8bb4859288a759635fdf502c6779faafd Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Mon, 20 Oct 2014 18:12:32 +0200
Subject: [PATCH] OOM, PM: OOM killed task cannot escape PM suspend

The PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But the OOM killer is allowed to kill an already frozen task in
order to handle an OOM situation. In order to protect from late wake-ups,
the OOM killer is disabled after all tasks are frozen. This, however,
still keeps a window open when a killed task didn't manage to die by the
time freeze_processes finishes.

Reduce the race window by checking all tasks after the OOM killer has been
disabled. This is still not completely race free, unfortunately, because
oom_killer_disable cannot stop an already ongoing OOM killer, so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of the OOM killer and the
freezer is, however, too heavyweight for this highly unlikely case.

Introduce and check an oom_kills counter which gets incremented early,
when the allocator enters the __alloc_pages_may_oom path, and check all
the tasks only if the counter changes during the freezing attempt. The
counter is updated this early to reduce the race window, since the
allocator has already checked oom_killer_disabled, which is set by the
PM-freezing code. A false positive will push the PM freezer into a slow
path, but that is not a big deal.

Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/oom.h    |  3 +++
 kernel/power/process.c | 31 ++++++++++++++++++++++++++++++-
 mm/oom_kill.c          | 17 +++++++++++++++++
 mm/page_alloc.c        |  8 ++++++++
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 647395a1a550..e8d6e1058723 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -50,6 +50,9 @@ static inline bool oom_task_origin(const struct task_struct *p)
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
+
+extern int oom_kills_count(void);
+extern void note_oom_kill(void);
 extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			     unsigned int points, unsigned long totalpages,
 			     struct mem_cgroup *memcg, nodemask_t *nodemask,
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 4ee194eb524b..a397fa161d11 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -118,6 +118,7 @@ static int try_to_freeze_tasks(bool user_only)
 int freeze_processes(void)
 {
 	int error;
+	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -131,12 +132,40 @@ int freeze_processes(void)
 
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
+	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
-		printk("done.");
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
 		oom_killer_disable();
+
+		/*
+		 * There might have been an OOM kill while we were
+		 * freezing tasks and the killed task might be still
+		 * on the way out so we have to double check for race.
+		 */
+		if (oom_kills_count() != oom_kills_saved) {
+			struct task_struct *g, *p;
+
+			read_lock(&tasklist_lock);
+			for_each_process_thread(g, p) {
+				if (p == current || freezer_should_skip(p) ||
+				    frozen(p))
+					continue;
+				error = -EBUSY;
+				goto out_loop;
+			}
+out_loop:
+			read_unlock(&tasklist_lock);
+
+			if (error) {
+				__usermodehelper_set_disable_depth(UMH_ENABLED);
+				printk("OOM in progress.");
+				goto done;
+			}
+		}
+		printk("done.");
 	}
+done:
 	printk("\n");
 	BUG_ON(in_atomic());
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bbf405a3a18f..5340f6b91312 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -404,6 +404,23 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 		dump_tasks(memcg, nodemask);
 }
 
+/*
+ * Number of OOM killer invocations (including memcg OOM killer).
+ * Primarily used by PM freezer to check for potential races with
+ * OOM killed frozen task.
+ */
+static atomic_t oom_kills = ATOMIC_INIT(0);
+
+int oom_kills_count(void)
+{
+	return atomic_read(&oom_kills);
+}
+
+void note_oom_kill(void)
+{
+	atomic_inc(&oom_kills);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9710c9bbee2..e0c7832f8e5a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2252,6 +2252,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
+	 * PM-freezer should be notified that there might be an OOM killer on its
+	 * way to kill and wake somebody up. This is too early and we might end
+	 * up not killing anything but false positives are acceptable.
+	 * See freeze_processes.
+	 */
+	note_oom_kill();
+
+	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
-- 
2.1.1

-- 
Michal Hocko
SUSE Labs

* Re: + oom-pm-oom-killed-task-cannot-escape-pm-suspend.patch added to -mm tree
From: Oleg Nesterov @ 2014-10-20 19:06 UTC
  To: Michal Hocko
  Cc: Cong Wang, Rafael J. Wysocki, Tejun Heo, David Rientjes,
	Andrew Morton, linux-kernel

On 10/20, Michal Hocko wrote:
>
> On Fri 17-10-14 19:19:04, Oleg Nesterov wrote:
> > > @@ -504,11 +516,13 @@ void oom_kill_process(struct task_struct
> > >  			pr_err("Kill process %d (%s) sharing same memory\n",
> > >  				task_pid_nr(p), p->comm);
> > >  			task_unlock(p);
> > > +			atomic_inc(&oom_kills);
> >
> > Do we really need this? Can't freeze_processes() (ab)use oom_notify_list?
>
> I would really prefer not using oom_notify_list. It is just an ugly
> interface.

And to me oom_kills_count() is uglier ;) But! of course this is
subjective; I am not going to insist.

> Reduce the race window by checking all tasks after the OOM killer has been
> disabled. This is still not completely race free

Yes, thanks.

I only argued because this fact was not documented. And I agree that it
is hardly possible to close this race, and this patch makes things
better.

I think this version is fine.

Oleg.


* oom && coredump
From: Oleg Nesterov @ 2014-10-20 19:56 UTC
  To: Michal Hocko
  Cc: Cong Wang, Rafael J. Wysocki, Tejun Heo, David Rientjes,
	Andrew Morton, linux-kernel

On 10/20, Oleg Nesterov wrote:
>
> And I agree that it
> is hardly possible to close this race, and this patch makes things
> better.

speaking of "partial" fixes for oom problems...

Perhaps the patch below makes sense? Sure, it is racy, but probably
better than nothing. And in any case (imo) this SIGNAL_GROUP_COREDUMP
check doesn't look bad: the coredumping task can consume more memory,
and we can't assume it is actually going to exit soon.

And at least we can kill that ugly and wrong ptrace check.

What do you think?

Oleg.

--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
 }
 #endif
 
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+	return (task->flags & PF_EXITING) &&
+		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
+}
+
 enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill)
@@ -281,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task->flags & PF_EXITING && !force_kill) {
-		/*
-		 * If this task is not being ptraced on exit, then wait for it
-		 * to finish before killing some other task unnecessarily.
-		 */
-		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
-			return OOM_SCAN_ABORT;
-	}
+	if (task_will_free_mem(task) && !force_kill)
+		return OOM_SCAN_ABORT;
+
 	return OOM_SCAN_OK;
 }
 
@@ -426,7 +427,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (task_will_free_mem(p)) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		put_task_struct(p);
 		return;
@@ -632,7 +633,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}


* Re: oom && coredump
From: Michal Hocko @ 2014-11-27 12:29 UTC
  To: Oleg Nesterov
  Cc: Cong Wang, Rafael J. Wysocki, Tejun Heo, David Rientjes,
	Andrew Morton, linux-kernel

[Sorry this one has somehow completely fallen off my radar]

On Mon 20-10-14 21:56:18, Oleg Nesterov wrote:
> On 10/20, Oleg Nesterov wrote:
> >
> > And I agree that it
> > is hardly possible to close this race, and this patch makes things
> > better.
> 
> speaking of "partial" fixes for oom problems...
> 
> Perhaps the patch below makes sense? Sure, it is racy, but probably
> better than nothing. And in any case (imo) this SIGNAL_GROUP_COREDUMP
> check doesn't look bad: the coredumping task can consume more memory,
> and we can't assume it is actually going to exit soon.

I am not very familiar with this area (it is too scary...).
I guess the issue is that the OOM killer might try to kill a task which is
currently in the middle of coredumping and is not killable, right? And
if it is blocked on a memory allocation then we are effectively
deadlocked. Right?

Wouldn't it be better to make coredumping killable? Is this even
possible?

> And at least we can kill that ugly and wrong ptrace check.

Why is the ptrace check wrong? PF_EXITING should be set after
ptrace_event(PTRACE_EVENT_EXIT, code). But then I can see
unlikely(tsk->flags & PF_EXITING) check right after PTRACE_EVENT_EXIT
notification. Is this the thing?

But I do not get why we no longer have to care about PTRACE_EVENT_EXIT
once SIGNAL_GROUP_COREDUMP is checked, or how the two are related. What
prevents the original issue when the OOM victim is blocked by ptrace
forever?

> What do you think?
> 
> Oleg.
> 
> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>  }
>  #endif
>  
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> +	return (task->flags & PF_EXITING) &&
> +		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> +}
> +
>  enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  		unsigned long totalpages, const nodemask_t *nodemask,
>  		bool force_kill)
> @@ -281,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task->flags & PF_EXITING && !force_kill) {
> -		/*
> -		 * If this task is not being ptraced on exit, then wait for it
> -		 * to finish before killing some other task unnecessarily.
> -		 */
> -		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> -			return OOM_SCAN_ABORT;
> -	}
> +	if (task_will_free_mem(task) && !force_kill)
> +		return OOM_SCAN_ABORT;
> +
>  	return OOM_SCAN_OK;
>  }
>  
> @@ -426,7 +427,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (task_will_free_mem(p)) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		put_task_struct(p);
>  		return;
> @@ -632,7 +633,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
>  	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}
> 

-- 
Michal Hocko
SUSE Labs

* Re: oom && coredump
From: Oleg Nesterov @ 2014-11-27 17:47 UTC
  To: Michal Hocko
  Cc: Cong Wang, Rafael J. Wysocki, Tejun Heo, David Rientjes,
	Andrew Morton, linux-kernel

On 11/27, Michal Hocko wrote:
>
> On Mon 20-10-14 21:56:18, Oleg Nesterov wrote:
> > speaking of "partial" fixes for oom problems...
> >
> > Perhaps the patch below makes sense? Sure, it is racy, but probably
> > better than nothing. And in any case (imo) this SIGNAL_GROUP_COREDUMP
> > check doesn't look bad: the coredumping task can consume more memory,
> > and we can't assume it is actually going to exit soon.
>
> I am not very familiar with this area (it is too scary...).
> I guess the issue is that the OOM killer might try to kill a task which is
> currently in the middle of coredumping and is not killable, right? And
> if it is blocked on a memory allocation then we are effectively
> deadlocked. Right?
>
> Wouldn't it be better to make coredumping killable? Is this even
> possible?

It is already killable.

The problem is that the oom-killer assumes that a PF_EXITING task should
exit and release its memory "soon", so oom_scan_process_thread() returns
OOM_SCAN_ABORT.

This is obviously wrong if this PF_EXITING task participates in a coredump
and sleeps in exit_mm(). (iirc there are other issues with mt tasks, but
let's not discuss this now.) This task won't exit until the coredumping
completes, and this can take a lot of time, consume more memory, etc.
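
Roughly (a sketch, not the exact call chains):

	thread T1                             thread T2 (same process)
	  do_coredump()                         do_exit()
	    dumping core; can block               PF_EXITING is set
	    indefinitely (pipe, memory            exit_mm() sleeps until the
	    allocation, frozen by ptrace)         coredump completes

	           oom_scan_process_thread() sees T2 with PF_EXITING
	           and returns OOM_SCAN_ABORT over and over, while
	           the memory is never released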


> > And at least we can kill that ugly and wrong ptrace check.
>
> Why is the ptrace check wrong?

It was added in reply to an exploit I sent. But:

- It doesn't (and can't) really work; it can only detect this particular
  case, and the same exploit still blocks the oom-killer with minimal
  modifications.

- Once again, this has nothing to do with ptrace. That exploit used
  ptrace only to control (freeze) the coredumping process; the coredumping
  can "hang" for other reasons.

- It is no longer needed after this patch; the coredumping process will
  be killed.

So I think the patch below makes sense anyway, although I should probably
split it and remove PT_TRACE_EXIT in 2/2.

> PF_EXITING should be set after
> ptrace_event(PTRACE_EVENT_EXIT, code). But then I can see
> unlikely(tsk->flags & PF_EXITING) check right after PTRACE_EVENT_EXIT
> notification.

Note that it checks task->group_leader, not task. But see above. This makes
no sense.

> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> >  }
> >  #endif
> >
> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > +	return (task->flags & PF_EXITING) &&
> > +		!(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> > +}
> > +
> >  enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> >  		unsigned long totalpages, const nodemask_t *nodemask,
> >  		bool force_kill)
> > @@ -281,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> >  	if (oom_task_origin(task))
> >  		return OOM_SCAN_SELECT;
> >
> > -	if (task->flags & PF_EXITING && !force_kill) {
> > -		/*
> > -		 * If this task is not being ptraced on exit, then wait for it
> > -		 * to finish before killing some other task unnecessarily.
> > -		 */
> > -		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> > -			return OOM_SCAN_ABORT;
> > -	}
> > +	if (task_will_free_mem(task) && !force_kill)
> > +		return OOM_SCAN_ABORT;
> > +
> >  	return OOM_SCAN_OK;
> >  }
> >
> > @@ -426,7 +427,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> >  	 */
> > -	if (p->flags & PF_EXITING) {
> > +	if (task_will_free_mem(p)) {
> >  		set_tsk_thread_flag(p, TIF_MEMDIE);
> >  		put_task_struct(p);
> >  		return;
> > @@ -632,7 +633,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  	 * select it.  The goal is to allow it to allocate so that it may
> >  	 * quickly exit and free its memory.
> >  	 */
> > -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > +	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> >  		set_thread_flag(TIF_MEMDIE);
> >  		return;
> >  	}
> >
>
> --
> Michal Hocko
> SUSE Labs


* Re: oom && coredump
From: Michal Hocko @ 2014-12-02  8:59 UTC
  To: Oleg Nesterov
  Cc: Cong Wang, Rafael J. Wysocki, Tejun Heo, David Rientjes,
	Andrew Morton, linux-kernel

On Thu 27-11-14 18:47:47, Oleg Nesterov wrote:
> On 11/27, Michal Hocko wrote:
[...]
> > Why is the ptrace check wrong?
> 
> It was added in reply to an exploit I sent. But:
> 
> - It doesn't (and can't) really work; it can only detect this particular
>   case, and the same exploit still blocks the oom-killer with minimal
>   modifications.
> 
> - Once again, this has nothing to do with ptrace. That exploit used
>   ptrace only to control (freeze) the coredumping process; the coredumping
>   can "hang" for other reasons.
> 
> - It is no longer needed after this patch; the coredumping process will
>   be killed.

OK, I guess I am seeing it now. Thanks for the clarification!

[...]
-- 
Michal Hocko
SUSE Labs
