* [patch 1/3 -mmotm] oom: move oom_adj value from task_struct to mm_struct
@ 2009-06-02  1:31 David Rientjes
  2009-06-02  1:31 ` [patch 2/3 -mmotm] oom: avoid unnecessary mm locking and scanning for OOM_DISABLE David Rientjes
  2009-06-02  1:31 ` [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
  0 siblings, 2 replies; 13+ messages in thread
From: David Rientjes @ 2009-06-02  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Rik van Riel, Mel Gorman, Peter Zijlstra,
	Christoph Lameter, Dave Hansen, linux-kernel

The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares
the mm.  If a task were to be killed while attached to an mm that could
not be freed because another thread was set to OOM_DISABLE, it would
have needlessly been terminated since there is no potential for future
memory freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.
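
The oom_adj value is applied to the badness score as a pure bit shift.  A
standalone userspace sketch of those semantics (an illustrative model, not
the kernel routine; constant and behavior copied from badness()):

```c
#include <assert.h>

/* Simplified model of the oom_adj bit-shift adjustment in badness().
 * Illustrative only: the real code runs under task_lock() on live
 * task/mm state. */
static unsigned long adjust_points(unsigned long points, int oom_adj)
{
	if (oom_adj > 0) {
		if (!points)
			points = 1;	/* give a zero score something to shift */
		points <<= oom_adj;
	} else if (oom_adj < 0) {
		points >>= -oom_adj;
	}
	return points;
}
```

Each positive step doubles the score and each negative step halves it, so
-16 drives almost any score to 0, while OOM_DISABLE (-17) is handled
separately by skipping the task entirely.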

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task, while select_bad_process() keeps choosing the same task on
each retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
in oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/filesystems/proc.txt |   15 ++++++++++-----
 fs/proc/base.c                     |   19 ++++++++++++++++---
 include/linux/mm_types.h           |    2 ++
 include/linux/sched.h              |    1 -
 mm/oom_kill.c                      |   34 ++++++++++++++++++++++------------
 5 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
 ------------------------------------------------------
 
-This file can be used to adjust the score used to select which processes
-should be killed in an  out-of-memory  situation.  Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer.  Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
+This file can be used to adjust the score used to select which processes should
+be killed in an out-of-memory situation.  The oom_adj value is a characteristic
+of the task's mm, so all threads that share an mm with pid will have the same
+oom_adj value.  A high value will increase the likelihood of this process being
+killed by the oom-killer.  Valid values are in the range -16 to +15 as
+explained below and a special value of -17, which disables oom-killing
+altogether for threads sharing pid's mm.
 
 The process to be killed in an out-of-memory situation is selected among all others
 based on its badness score. This value equals the original memory size of the process
@@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
 are the prime candidates to be killed. Having only one 'hungry' child will make
 parent less preferable than the child.
 
+/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
+oom-killing already.
+
 /proc/<pid>/oom_score shows process' current badness score.
 
 The following heuristics are then applied:
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1006,7 +1006,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
 
 	if (!task)
 		return -ESRCH;
-	oom_adjust = task->oomkilladj;
+	task_lock(task);
+	if (task->mm)
+		oom_adjust = task->mm->oom_adj;
+	else
+		oom_adjust = OOM_DISABLE;
+	task_unlock(task);
 	put_task_struct(task);
 
 	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
@@ -1035,11 +1040,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
 	task = get_proc_task(file->f_path.dentry->d_inode);
 	if (!task)
 		return -ESRCH;
-	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
+	task_lock(task);
+	if (!task->mm) {
+		task_unlock(task);
+		put_task_struct(task);
+		return -EINVAL;
+	}
+	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
+		task_unlock(task);
 		put_task_struct(task);
 		return -EACCES;
 	}
-	task->oomkilladj = oom_adjust;
+	task->mm->oom_adj = oom_adjust;
+	task_unlock(task);
 	put_task_struct(task);
 	if (end - buffer == 0)
 		return -EIO;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -232,6 +232,8 @@ struct mm_struct {
 
 	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
+	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
+
 	cpumask_t cpu_vm_mask;
 
 	/* Architecture-specific MM context */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1149,7 +1149,6 @@ struct task_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
-	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	unsigned int btrace_seq;
 #endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
 	struct task_struct *child;
+	int oom_adj;
 
 	task_lock(p);
 	mm = p->mm;
@@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		task_unlock(p);
 		return 0;
 	}
+	oom_adj = mm->oom_adj;
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		points /= 8;
 
 	/*
-	 * Adjust the score by oomkilladj.
+	 * Adjust the score by oom_adj.
 	 */
-	if (p->oomkilladj) {
-		if (p->oomkilladj > 0) {
+	if (oom_adj) {
+		if (oom_adj > 0) {
 			if (!points)
 				points = 1;
-			points <<= p->oomkilladj;
+			points <<= oom_adj;
 		} else
-			points >>= -(p->oomkilladj);
+			points >>= -(oom_adj);
 	}
 
 #ifdef DEBUG
@@ -251,8 +253,12 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		if (p->oomkilladj == OOM_DISABLE)
+		task_lock(p);
+		if (p->mm && p->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(p);
 			continue;
+		}
+		task_unlock(p);
 
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -304,8 +310,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
 		}
 		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
 		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
-		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
-		       p->comm);
+		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
 		task_unlock(p);
 	} while_each_thread(g, p);
 }
@@ -367,8 +372,12 @@ static int oom_kill_task(struct task_struct *p)
 	 * Don't kill the process if any threads are set to OOM_DISABLE
 	 */
 	do_each_thread(g, q) {
-		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
+		task_lock(q);
+		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
+			task_unlock(q);
 			return 1;
+		}
+		task_unlock(q);
 	} while_each_thread(g, q);
 
 	__oom_kill_task(p, 1);
@@ -393,10 +402,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct task_struct *c;
 
 	if (printk_ratelimit()) {
-		printk(KERN_WARNING "%s invoked oom-killer: "
-			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
-			current->comm, gfp_mask, order, current->oomkilladj);
 		task_lock(current);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
+			current->comm, gfp_mask, order,
+			current->mm ? current->mm->oom_adj : OOM_DISABLE);
 		cpuset_print_task_mems_allowed(current);
 		task_unlock(current);
 		dump_stack();

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [patch 2/3 -mmotm] oom: avoid unnecessary mm locking and scanning for OOM_DISABLE
  2009-06-02  1:31 [patch 1/3 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
@ 2009-06-02  1:31 ` David Rientjes
  2009-06-02  1:31 ` [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
  1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2009-06-02  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Rik van Riel, Mel Gorman, Peter Zijlstra,
	Christoph Lameter, Dave Hansen, linux-kernel

This moves the check for OOM_DISABLE to the badness heuristic so it is
only necessary to hold task_lock() once.  If the mm's oom_adj value is
OOM_DISABLE, the score is 0, which is also correctly exported via
/proc/pid/oom_score.
This requires that tasks with badness scores of 0 are prohibited from
being oom killed, which makes sense since they would not allow for
future memory freeing anyway.
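
The resulting control flow can be modeled as a small standalone sketch
(hypothetical helper names; the real code operates on task_struct state
under task_lock()):

```c
#define OOM_DISABLE (-17)

/* Illustrative model: after this patch a badness score of 0 means
 * "never select", covering both the no-mm case and OOM_DISABLE. */
static unsigned long badness_model(int has_mm, int oom_adj,
				   unsigned long base_points)
{
	if (!has_mm || oom_adj == OOM_DISABLE)
		return 0;
	return base_points;
}

/* select_bad_process() now keeps a candidate only on a strictly
 * greater score, so zero-score tasks can never be chosen. */
static int should_choose(unsigned long points, unsigned long best_so_far)
{
	return points > best_so_far;
}
```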
 
Since the oom_adj value is a characteristic of an mm and not a task, it
is no longer necessary to check the oom_adj value for threads sharing the
same memory (except when simply issuing SIGKILLs for threads in other
thread groups).

Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   42 ++++++++++--------------------------------
 1 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -67,6 +67,10 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 		return 0;
 	}
 	oom_adj = mm->oom_adj;
+	if (oom_adj == OOM_DISABLE) {
+		task_unlock(p);
+		return 0;
+	}
 
 	/*
 	 * The memory size of the process is the basis for the badness.
@@ -253,15 +257,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 			*ppoints = ULONG_MAX;
 		}
 
-		task_lock(p);
-		if (p->mm && p->mm->oom_adj == OOM_DISABLE) {
-			task_unlock(p);
-			continue;
-		}
-		task_unlock(p);
-
 		points = badness(p, uptime.tv_sec);
-		if (points > *ppoints || !chosen) {
+		if (points > *ppoints) {
 			chosen = p;
 			*ppoints = points;
 		}
@@ -354,32 +351,13 @@ static int oom_kill_task(struct task_struct *p)
 	struct mm_struct *mm;
 	struct task_struct *g, *q;
 
+	task_lock(p);
 	mm = p->mm;
-
-	/* WARNING: mm may not be dereferenced since we did not obtain its
-	 * value from get_task_mm(p).  This is OK since all we need to do is
-	 * compare mm to q->mm below.
-	 *
-	 * Furthermore, even if mm contains a non-NULL value, p->mm may
-	 * change to NULL at any time since we do not hold task_lock(p).
-	 * However, this is of no concern to us.
-	 */
-
-	if (mm == NULL)
+	if (!mm || mm->oom_adj == OOM_DISABLE) {
+		task_unlock(p);
 		return 1;
-
-	/*
-	 * Don't kill the process if any threads are set to OOM_DISABLE
-	 */
-	do_each_thread(g, q) {
-		task_lock(q);
-		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
-			task_unlock(q);
-			return 1;
-		}
-		task_unlock(q);
-	} while_each_thread(g, q);
-
+	}
+	task_unlock(p);
 	__oom_kill_task(p, 1);
 
 	/*


* [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  1:31 [patch 1/3 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
  2009-06-02  1:31 ` [patch 2/3 -mmotm] oom: avoid unnecessary mm locking and scanning for OOM_DISABLE David Rientjes
@ 2009-06-02  1:31 ` David Rientjes
  2009-06-02  5:56   ` Andrew Morton
  1 sibling, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-06-02  1:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Rik van Riel, Mel Gorman, Peter Zijlstra,
	Christoph Lameter, Dave Hansen, linux-kernel

The oom killer must be invoked regardless of the order if the allocation
is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
free some memory.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
@@ -1765,11 +1765,13 @@ rebalance:
 				goto got_pg;
 
 			/*
-			 * The OOM killer does not trigger for high-order allocations
-			 * but if no progress is being made, there are no other
-			 * options and retrying is unlikely to help
+			 * The OOM killer does not trigger for high-order
+			 * ~__GFP_NOFAIL allocations so if no progress is being
+			 * made, there are no other options and retrying is
+			 * unlikely to help.
 			 */
-			if (order > PAGE_ALLOC_COSTLY_ORDER)
+			if (order > PAGE_ALLOC_COSTLY_ORDER &&
+						!(gfp_mask & __GFP_NOFAIL))
 				goto nopage;
 
 			goto restart;


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  1:31 ` [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
@ 2009-06-02  5:56   ` Andrew Morton
  2009-06-02  6:27     ` Nick Piggin
  2009-06-02  7:26     ` David Rientjes
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2009-06-02  5:56 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nick Piggin, Rik van Riel, Mel Gorman, Peter Zijlstra,
	Christoph Lameter, Dave Hansen, linux-kernel

On Mon, 1 Jun 2009 18:31:14 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> The oom killer must be invoked regardless of the order if the allocation
> is __GFP_NOFAIL, otherwise it will loop forever when reclaim fails to
> free some memory.
> 
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mel@csn.ul.ie>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Dave Hansen <dave@linux.vnet.ibm.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c |   12 +++++++-----
>  1 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		goto out;
>  
>  	/* The OOM killer will not help higher order allocs */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
>  		goto out;
>  
>  	/* Exhausted what can be done so it's blamo time */
> @@ -1765,11 +1765,13 @@ rebalance:
>  				goto got_pg;
>  
>  			/*
> -			 * The OOM killer does not trigger for high-order allocations
> -			 * but if no progress is being made, there are no other
> -			 * options and retrying is unlikely to help
> +			 * The OOM killer does not trigger for high-order
> +			 * ~__GFP_NOFAIL allocations so if no progress is being
> +			 * made, there are no other options and retrying is
> +			 * unlikely to help.
>  			 */
> -			if (order > PAGE_ALLOC_COSTLY_ORDER)
> +			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> +						!(gfp_mask & __GFP_NOFAIL))
>  				goto nopage;
>  
>  			goto restart;

I really think/hope/expect that this is unneeded.

Do we know of any callsites which do greater-than-order-0 allocations
with GFP_NOFAIL?  If so, we should fix them.

Then just ban order>0 && GFP_NOFAIL allocations.


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  5:56   ` Andrew Morton
@ 2009-06-02  6:27     ` Nick Piggin
  2009-06-02  6:41       ` Pekka Enberg
  2009-06-02  7:26     ` David Rientjes
  1 sibling, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2009-06-02  6:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Rik van Riel, Mel Gorman, Peter Zijlstra,
	Christoph Lameter, Dave Hansen, linux-kernel

On Mon, Jun 01, 2009 at 10:56:02PM -0700, Andrew Morton wrote:
> On Mon, 1 Jun 2009 18:31:14 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> Do we know of any callsites which do greater-than-order-0 allocations
> with GFP_NOFAIL?  If so, we should fix them.

Yeah, we have GFP_NOFAIL going to the slab allocator, so all bets are
off in regard to the order.

Hmm, checkpatch should warn about it too shouldn't it? We recently
acquired one right on the IO submission path, unfortunately.

7ba1ba12eeef0aa7113beb16410ef8b7c748e18b



* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  6:27     ` Nick Piggin
@ 2009-06-02  6:41       ` Pekka Enberg
  0 siblings, 0 replies; 13+ messages in thread
From: Pekka Enberg @ 2009-06-02  6:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, David Rientjes, Rik van Riel, Mel Gorman,
	Peter Zijlstra, Christoph Lameter, Dave Hansen, linux-kernel

Hi Nick,

On Mon, Jun 01, 2009 at 10:56:02PM -0700, Andrew Morton wrote:
>> Do we know of any callsites which do greater-than-order-0 allocations
>> with GFP_NOFAIL?  If so, we should fix them.

On Tue, Jun 2, 2009 at 9:27 AM, Nick Piggin <npiggin@suse.de> wrote:
> Yeah, we have GFP_NOFAIL going to the slab allocator, so all bets are
> off in regard to the order.

SLUB supports variable order allocations, so we can certainly do
GFP_NOFAIL special casing there for < PAGE_SIZE allocations to enforce
zero order allocation for those. I suspect we could do something
similar in SLQB too?
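
A sketch of that special casing might look like the following
(hypothetical helper and an illustrative __GFP_NOFAIL value; the real
slab allocators compute their preferred order quite differently):

```c
#include <stddef.h>

#define PAGE_SIZE	4096UL
#define __GFP_NOFAIL	0x800u	/* illustrative flag value */

/* Force order-0 for sub-page __GFP_NOFAIL slab allocations so the
 * page allocator never sees a costly-order must-not-fail request. */
static unsigned int slab_alloc_order(size_t object_size,
				     unsigned int gfp_mask,
				     unsigned int preferred_order)
{
	if ((gfp_mask & __GFP_NOFAIL) && object_size < PAGE_SIZE)
		return 0;
	return preferred_order;
}
```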

                        Pekka


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  5:56   ` Andrew Morton
  2009-06-02  6:27     ` Nick Piggin
@ 2009-06-02  7:26     ` David Rientjes
  2009-06-02  7:34       ` Peter Zijlstra
  1 sibling, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-06-02  7:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Rik van Riel, Mel Gorman, Peter Zijlstra,
	Christoph Lameter, Dave Hansen, linux-kernel

On Mon, 1 Jun 2009, Andrew Morton wrote:

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1547,7 +1547,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		goto out;
> >  
> >  	/* The OOM killer will not help higher order allocs */
> > -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> >  		goto out;
> >  
> >  	/* Exhausted what can be done so it's blamo time */
> > @@ -1765,11 +1765,13 @@ rebalance:
> >  				goto got_pg;
> >  
> >  			/*
> > -			 * The OOM killer does not trigger for high-order allocations
> > -			 * but if no progress is being made, there are no other
> > -			 * options and retrying is unlikely to help
> > +			 * The OOM killer does not trigger for high-order
> > +			 * ~__GFP_NOFAIL allocations so if no progress is being
> > +			 * made, there are no other options and retrying is
> > +			 * unlikely to help.
> >  			 */
> > -			if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +			if (order > PAGE_ALLOC_COSTLY_ORDER &&
> > +						!(gfp_mask & __GFP_NOFAIL))
> >  				goto nopage;
> >  
> >  			goto restart;
> 
> I really think/hope/expect that this is unneeded.
> 
> Do we know of any callsites which do greater-than-order-0 allocations
> with GFP_NOFAIL?  If so, we should fix them.
> 
> Then just ban order>0 && GFP_NOFAIL allocations.
> 

That seems like a different topic: banning higher-order __GFP_NOFAIL 
allocations or just deprecating __GFP_NOFAIL altogether and slowly 
switching users over is a worthwhile effort, but is unrelated.

This patch is necessary because we explicitly deny the oom killer from 
being used when the order is greater than PAGE_ALLOC_COSTLY_ORDER because 
of an assumption that it won't help.  That assumption isn't always true, 
especially for large memory-hogging tasks that have mlocked large chunks 
of contiguous memory, for example.  The only thing we do know is that 
direct reclaim has not made any progress so we're unlikely to get a 
substantial amount of memory freeing in the immediate future.  Such an 
instance will simply loop forever without killing that rogue task for a 
__GFP_NOFAIL allocation.
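
The decision the patch changes reduces to a small predicate; modeled as a
standalone sketch (the __GFP_NOFAIL value here is illustrative):

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3
#define __GFP_NOFAIL		0x800u	/* illustrative flag value */

/* Post-patch behavior: the OOM killer is skipped for costly orders
 * unless the caller can never tolerate allocation failure. */
static bool may_invoke_oom_killer(unsigned int order, unsigned int gfp_mask)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
		return false;
	return true;
}
```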

So while it's better in the long-term to deprecate the flag as much as 
possible and perhaps someday remove it from the page allocator entirely, 
we're faced with the current behavior of either looping endlessly or 
freeing memory so the kernel allocation may succeed when direct reclaim 
has failed, which also makes this a rare instance where the oom killer 
will never needlessly kill a task.


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  7:26     ` David Rientjes
@ 2009-06-02  7:34       ` Peter Zijlstra
  2009-06-02  7:58         ` Nick Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2009-06-02  7:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Nick Piggin, Rik van Riel, Mel Gorman,
	Christoph Lameter, Dave Hansen, linux-kernel

On Tue, 2009-06-02 at 00:26 -0700, David Rientjes wrote:
> > I really think/hope/expect that this is unneeded.
> > 
> > Do we know of any callsites which do greater-than-order-0 allocations
> > with GFP_NOFAIL?  If so, we should fix them.
> > 
> > Then just ban order>0 && GFP_NOFAIL allocations.
> > 
> 
> That seems like a different topic: banning higher-order __GFP_NOFAIL 
> allocations or just deprecating __GFP_NOFAIL altogether and slowly 
> switching users over is a worthwhile effort, but is unrelated.
> 
> This patch is necessary because we explicitly deny the oom killer from 
> being used when the order is greater than PAGE_ALLOC_COSTLY_ORDER because 
> of an assumption that it won't help.  That assumption isn't always true, 
> especially for large memory-hogging tasks that have mlocked large chunks 
> of contiguous memory, for example.  The only thing we do know is that 
> direct reclaim has not made any progress so we're unlikely to get a 
> substantial amount of memory freeing in the immediate future.  Such an 
> instance will simply loop forever without killing that rogue task for a 
> __GFP_NOFAIL allocation.
> 
> So while it's better in the long-term to deprecate the flag as much as 
> possible and perhaps someday remove it from the page allocator entirely, 
> we're faced with the current behavior of either looping endlessly or 
> freeing memory so the kernel allocation may succeed when direct reclaim 
> has failed, which also makes this a rare instance where the oom killer 
> will never needlessly kill a task.

I would really prefer if we do as Andrew suggests. Both will fix this
problem, so I don't see it as a different topic at all.

Eradicating __GFP_NOFAIL is a fine goal, but very hard work (people have
been wanting to do that for many years). But simply limiting it to
0-order allocation should be much(?) easier.




* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  7:34       ` Peter Zijlstra
@ 2009-06-02  7:58         ` Nick Piggin
  2009-06-02  8:14           ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Nick Piggin @ 2009-06-02  7:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Andrew Morton, Rik van Riel, Mel Gorman,
	Christoph Lameter, Dave Hansen, linux-kernel

On Tue, Jun 02, 2009 at 09:34:55AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-06-02 at 00:26 -0700, David Rientjes wrote:
> > > I really think/hope/expect that this is unneeded.
> > > 
> > > Do we know of any callsites which do greater-than-order-0 allocations
> > > with GFP_NOFAIL?  If so, we should fix them.
> > > 
> > > Then just ban order>0 && GFP_NOFAIL allocations.
> > > 
> > 
> > That seems like a different topic: banning higher-order __GFP_NOFAIL 
> > allocations or just deprecating __GFP_NOFAIL altogether and slowly 
> > switching users over is a worthwhile effort, but is unrelated.
> > 
> > This patch is necessary because we explicitly deny the oom killer from 
> > being used when the order is greater than PAGE_ALLOC_COSTLY_ORDER because 
> > of an assumption that it won't help.  That assumption isn't always true, 
> > especially for large memory-hogging tasks that have mlocked large chunks 
> > of contiguous memory, for example.  The only thing we do know is that 
> > direct reclaim has not made any progress so we're unlikely to get a 
> > substantial amount of memory freeing in the immediate future.  Such an 
> > instance will simply loop forever without killing that rogue task for a 
> > __GFP_NOFAIL allocation.
> > 
> > So while it's better in the long-term to deprecate the flag as much as 
> > possible and perhaps someday remove it from the page allocator entirely, 
> > we're faced with the current behavior of either looping endlessly or 
> > freeing memory so the kernel allocation may succeed when direct reclaim 
> > has failed, which also makes this a rare instance where the oom killer 
> > will never needlessly kill a task.
> 
> I would really prefer if we do as Andrew suggests. Both will fix this
> problem, so I don't see it as a different topic at all.

Well, his patch, as it stands, is a good one. Because we do have
potential higher order GFP_NOFAIL.

I don't particularly want to add complexity (not a single branch)
to SLQB to handle this (and how does the caller *really* know
anyway? they know the exact object size, the hardware alignment
constraints, the page size, etc. in order to know that all of the
many slab allocators will be able to satisfy it with an order-0
allocation?)


> Eradicating __GFP_NOFAIL is a fine goal, but very hard work (people have
> been wanting to do that for many years). But simply limiting it to
> 0-order allocation should be much(?) easier.

Some of them may be hard work, but I don't think anybody has
been working too hard at it ;) 




* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  7:58         ` Nick Piggin
@ 2009-06-02  8:14           ` David Rientjes
  2009-06-03 22:10             ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-06-02  8:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman,
	Christoph Lameter, Dave Hansen, linux-kernel

On Tue, 2 Jun 2009, Nick Piggin wrote:

> > I would really prefer if we do as Andrew suggests. Both will fix this
> > problem, so I don't see it as a different topic at all.
> 
> Well, his patch, as it stands, is a good one. Because we do have
> potential higher order GFP_NOFAIL.
> 

There's currently an inconsistency in the definition of __GFP_NOFAIL and 
its implementation.  The clearly defined purpose of the flag is:

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.

Yet __GFP_NOFAIL allocations may fail if no progress is made via direct 
reclaim and order > PAGE_ALLOC_COSTLY_ORDER.  That's the behavior in the 
git HEAD and Mel's allocator rework in mmotm.

I've been addressing this implicitly by requiring __GFP_NOFAIL to always 
abide by the definition: we simply can never return NULL because the 
caller can't handle it (and, by definition, shouldn't even be responsible 
for considering it).

With my patch, we kill a memory hogging task that will free some memory so 
the allocation will succeed (or multiple tasks if insufficient contiguous 
memory is available).  Kernel allocations use __GFP_NOFAIL, so the fault 
of this memory freeing is entirely on the caller, not the page allocator.

My preference for handling this is to merge my patch (obviously :), and 
then hopefully deprecate __GFP_NOFAIL as much as possible although I don't 
suspect it could be eradicated forever.


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-02  8:14           ` David Rientjes
@ 2009-06-03 22:10             ` David Rientjes
  2009-06-03 22:26               ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2009-06-03 22:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, Rik van Riel, Mel Gorman,
	Christoph Lameter, Dave Hansen, linux-kernel

On Tue, 2 Jun 2009, David Rientjes wrote:

> With my patch, we kill a memory hogging task that will free some memory so 
> the allocation will succeed (or multiple tasks if insufficient contiguous 
> memory is available).  Kernel allocations use __GFP_NOFAIL, so the fault 
> of this memory freeing is entirely on the caller, not the page allocator.
> 
> My preference for handling this is to merge my patch (obviously :), and 
> then hopefully deprecate __GFP_NOFAIL as much as possible although I don't 
> suspect it could be eradicated forever.
> 

I really hope this patch isn't getting dropped because it fixes the 
possibility that a __GFP_NOFAIL allocation will fail when its definition 
is to the contrary.  Depending on the size of the allocation, that can 
cause a panic in at least the reiserfs, ntfs, cxgb3, and gfs2 cases.

As I mentioned before, it's a noble goal to deprecate __GFP_NOFAIL as much 
as possible and (at the least) prevent it from trying high-order 
allocation attempts.  The current implementation of the flag is 
problematic, however, and this patch addresses it by attempting to free 
some memory when direct reclaim fails.

This change has already been acked by Rik van Riel in 
http://marc.info/?l=linux-kernel&m=124199910508930 and Mel Gorman in 
http://marc.info/?l=linux-kernel&m=124203850416159.


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-03 22:10             ` David Rientjes
@ 2009-06-03 22:26               ` Andrew Morton
  2009-06-03 22:54                 ` Divy Le Ray
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2009-06-03 22:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: npiggin, a.p.zijlstra, riel, mel, cl, dave, linux-kernel,
	Divy Le Ray, Jens Axboe

On Wed, 3 Jun 2009 15:10:38 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 2 Jun 2009, David Rientjes wrote:
> 
> > With my patch, we kill a memory hogging task that will free some memory so 
> > the allocation will succeed (or multiple tasks if insufficient contiguous 
> > memory is available).  Kernel allocations use __GFP_NOFAIL, so the fault 
> > of this memory freeing is entirely on the caller, not the page allocator.
> > 
> > My preference for handling this is to merge my patch (obviously :), and 
> > then hopefully deprecate __GFP_NOFAIL as much as possible although I don't 
> > suspect it could be eradicated forever.
> > 
> 
> I really hope this patch isn't getting dropped because it fixes the 
> possibility that a __GFP_NOFAIL allocation will fail when its definition 
> is to the contrary.  Depending on the size of the allocation, that can 
> cause a panic in at least the reiserfs, ntfs, cxgb3, and gfs2 cases.
> 
> As I mentioned before, it's a noble goal to deprecate __GFP_NOFAIL as much 
> as possible and (at the least) prevent it from trying high-order 
> allocation attempts.  The current implementation of the flag is 
> problematic, however, and this patch addresses it by attempting to free 
> some memory when direct reclaim fails.
> 

Sigh, all right, but we suck.

Divy, could we please at least remove __GFP_NOFAIL from
drivers/net/cxgb?  It's really quite inappropriate for a driver to
assume that core VM can do magic.  Drivers should test the return value
and handle the -ENOMEM in the old-fashioned way, please.

Ditto-in-spades cfq-iosched.c.  We discussed that recently but I forgot the
upshot.  The code and its comment are still in flagrant disagreement?


* Re: [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL
  2009-06-03 22:26               ` Andrew Morton
@ 2009-06-03 22:54                 ` Divy Le Ray
  0 siblings, 0 replies; 13+ messages in thread
From: Divy Le Ray @ 2009-06-03 22:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, npiggin, a.p.zijlstra, riel, mel, cl, dave,
	linux-kernel, Jens Axboe

Andrew Morton wrote:
> On Wed, 3 Jun 2009 15:10:38 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
>   
>> On Tue, 2 Jun 2009, David Rientjes wrote:
>>
>>     
>>> With my patch, we kill a memory hogging task that will free some memory so 
>>> the allocation will succeed (or multiple tasks if insufficient contiguous 
>>> memory is available).  Kernel allocations use __GFP_NOFAIL, so the fault 
>>> of this memory freeing is entirely on the caller, not the page allocator.
>>>
>>> My preference for handling this is to merge my patch (obviously :), and 
>>> then hopefully deprecate __GFP_NOFAIL as much as possible although I don't 
>>> suspect it could be eradicated forever.
>>>
>>>       
>> I really hope this patch isn't getting dropped because it fixes the 
>> possibility that a __GFP_NOFAIL allocation will fail when its definition 
>> is to the contrary.  Depending on the size of the allocation, that can 
>> cause a panic in at least the reiserfs, ntfs, cxgb3, and gfs2 cases.
>>
>> As I mentioned before, it's a noble goal to deprecate __GFP_NOFAIL as much 
>> as possible and (at the least) prevent it from trying high-order 
>> allocation attempts.  The current implementation of the flag is 
>> problematic, however, and this patch addresses it by attempting to free 
>> some memory when direct reclaim fails.
>>
>>     
>
> Sigh, all right, but we suck.
>
> Divy, could we please at least remove __GFP_NOFAIL from
> drivers/net/cxgb?  It's really quite inappropriate for a driver to
> assume that core VM can do magic.  Drivers should test the return value
> and handle the -ENOMEM in the old-fashioned way, please.
>   

We started working on it and have eliminated them from 
cxgb3_main.c::init_tp_parity().
We're still looking at eliminating its use in 
cxgb3_offload.c::t3_process_tid_release_list().
I'll post the patches as soon as they're ready.

cheers,
Divy


end of thread, other threads:[~2009-06-03 23:04 UTC | newest]

Thread overview: 13+ messages
2009-06-02  1:31 [patch 1/3 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
2009-06-02  1:31 ` [patch 2/3 -mmotm] oom: avoid unnecessary mm locking and scanning for OOM_DISABLE David Rientjes
2009-06-02  1:31 ` [patch 3/3 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
2009-06-02  5:56   ` Andrew Morton
2009-06-02  6:27     ` Nick Piggin
2009-06-02  6:41       ` Pekka Enberg
2009-06-02  7:26     ` David Rientjes
2009-06-02  7:34       ` Peter Zijlstra
2009-06-02  7:58         ` Nick Piggin
2009-06-02  8:14           ` David Rientjes
2009-06-03 22:10             ` David Rientjes
2009-06-03 22:26               ` Andrew Morton
2009-06-03 22:54                 ` Divy Le Ray
