* [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask
From: KAMEZAWA Hiroyuki @ 2009-11-04  8:09 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, akpm, cl, rientjes

If this clarification is not enough, let me know.
Thank you for all your reviews of the RFC version.
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

Fix node-oriented allocation handling in oom_kill.c.
I consider this a bug fix, not an enhancement.

Recently, things have changed:
  - alloc_pages() now takes a nodemask argument, via __alloc_pages_nodemask().
  - mempolicy no longer maintains its own private zonelists.
    (And cpuset does not pass a nodemask to __alloc_pages_nodemask().)

So the oom-killer's current constraint check is wrong.

This patch:
  - Checks the nodemask: if nodemask && nodemask != node_states[N_HIGH_MEMORY],
    the constraint can never be CONSTRAINT_NONE. We assume such a nodemask
    comes from mempolicy.
  - Scans all zones in the zonelist under the nodemask; if the scan hits a
    cpuset wall, the failure came from the cpuset.
And:
  - Modifies the caller of out_of_memory() not to invoke the OOM killer for
    __GFP_THISNODE allocations. This does not change current behavior:
    a caller that uses __GFP_THISNODE should handle the page allocation
    failure by itself.
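
The classification the patch implements can be sketched as a standalone
userspace program. Nodemasks are modeled here as plain bitmasks, and every
name (classify_failure and friends) is an illustrative stand-in, not a
kernel API; the sketch also uses the negated nodes_equal() sense that the
review later in this thread established as the correct one:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-ins for kernel types: a nodemask is a bitmask in
 * which bit i means "node i", and the cpuset wall is a per-node allow
 * mask. None of these names exist in the kernel. */
typedef uint64_t nodemask_t;

enum oom_constraint {
	CONSTRAINT_NONE,
	CONSTRAINT_CPUSET,
	CONSTRAINT_MEMORY_POLICY,
};

/* Classify an allocation failure the way the patch does:
 *  1. a nodemask narrower than all memory nodes implies mempolicy
 *     (note: the posted patch tested nodes_equal() here; the review
 *     in this thread established it must be the negation, as below);
 *  2. a node in the scanned set that the cpuset forbids implies the
 *     cpuset wall, and this check returns immediately;
 *  3. otherwise the failure is unconstrained. */
enum oom_constraint classify_failure(const nodemask_t *nodemask,
				     nodemask_t all_memory_nodes,
				     nodemask_t cpuset_allowed)
{
	enum oom_constraint ret = CONSTRAINT_NONE;
	nodemask_t scanned = nodemask ? *nodemask : all_memory_nodes;

	if (nodemask && *nodemask != all_memory_nodes)
		ret = CONSTRAINT_MEMORY_POLICY;

	if (scanned & ~cpuset_allowed)
		return CONSTRAINT_CPUSET;

	return ret;
}
```

For example, classify_failure(NULL, 0xF, 0x3) reports CONSTRAINT_CPUSET,
because the allocation scanned nodes the cpuset forbids.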

Changelog: 2009/11/04
 - split out as a bug fix for passing the nodemask.
 - added a __GFP_THISNODE sanity check.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
---
 drivers/char/sysrq.c |    2 +-
 mm/oom_kill.c        |   38 ++++++++++++++++++++++++++------------
 mm/page_alloc.c      |   10 ++++++++--
 3 files changed, 35 insertions(+), 15 deletions(-)

Index: mmotm-2.6.32-Nov2/drivers/char/sysrq.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/drivers/char/sysrq.c
+++ mmotm-2.6.32-Nov2/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
 /*
  * Determine the type of allocation constraint.
  */
+#ifdef CONFIG_NUMA
 static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-						    gfp_t gfp_mask)
+				    gfp_t gfp_mask, nodemask_t *nodemask)
 {
-#ifdef CONFIG_NUMA
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	nodemask_t nodes = node_states[N_HIGH_MEMORY];
+	int ret = CONSTRAINT_NONE;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
-			node_clear(zone_to_nid(zone), nodes);
-		else
+	/*
+ 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
+ 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ 	 * feature. Then, only mempolicy use this nodemask.
+ 	 */
+	if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
+		ret = CONSTRAINT_MEMORY_POLICY;
+
+	/* Check this allocation failure is caused by cpuset's wall function */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+			high_zoneidx, nodemask)
+		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
 			return CONSTRAINT_CPUSET;
 
-	if (!nodes_empty(nodes))
-		return CONSTRAINT_MEMORY_POLICY;
-#endif
+	/* __GFP_THISNODE never calls OOM.*/
 
+	return ret;
+}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	return CONSTRAINT_NONE;
 }
+#endif
 
 /*
  * Simple selection loop. We chose the process with the highest
@@ -603,7 +616,8 @@ rest_and_return:
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -622,7 +636,7 @@ void out_of_memory(struct zonelist *zone
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
Index: mmotm-2.6.32-Nov2/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Nov2/mm/page_alloc.c
@@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	/* The OOM killer will not help higher order allocs */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
 		goto out;
-
+	/*
+	 * In usual, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+	 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+	 */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);




* [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
From: KAMEZAWA Hiroyuki @ 2009-11-06  0:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, akpm, cl, rientjes

On Wed, 4 Nov 2009 17:09:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>  drivers/char/sysrq.c |    2 +-
>  mm/oom_kill.c        |   38 ++++++++++++++++++++++++++------------
>  mm/page_alloc.c      |   10 ++++++++--

Very sorry, the previous patch didn't include the changes to oom.h...

From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

Fix node-oriented allocation handling in oom_kill.c.
I consider this a bug fix, not an enhancement.

Recently, things have changed:
  - alloc_pages() now takes a nodemask argument, via __alloc_pages_nodemask().
  - mempolicy no longer maintains its own private zonelists.
    (And cpuset does not pass a nodemask to __alloc_pages_nodemask().)

So the oom-killer's current constraint check is wrong.

This patch:
  - Checks the nodemask: if nodemask && nodemask != node_states[N_HIGH_MEMORY],
    the constraint can never be CONSTRAINT_NONE. We assume such a nodemask
    comes from mempolicy.
  - Scans all zones in the zonelist under the nodemask; if the scan hits a
    cpuset wall, the failure came from the cpuset.
And:
  - Modifies the caller of out_of_memory() not to invoke the OOM killer for
    __GFP_THISNODE allocations. This does not change current behavior:
    a caller that uses __GFP_THISNODE should handle the page allocation
    failure by itself.

Changelog: 2009/11/06
 - added the missing oom.h change.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
---
 drivers/char/sysrq.c |    2 +-
 include/linux/oom.h  |    3 ++-
 mm/oom_kill.c        |   38 ++++++++++++++++++++++++++------------
 mm/page_alloc.c      |   10 ++++++++--
 4 files changed, 37 insertions(+), 16 deletions(-)

Index: mmotm-2.6.32-Nov2/drivers/char/sysrq.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/drivers/char/sysrq.c
+++ mmotm-2.6.32-Nov2/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
 /*
  * Determine the type of allocation constraint.
  */
+#ifdef CONFIG_NUMA
 static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-						    gfp_t gfp_mask)
+				    gfp_t gfp_mask, nodemask_t *nodemask)
 {
-#ifdef CONFIG_NUMA
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	nodemask_t nodes = node_states[N_HIGH_MEMORY];
+	int ret = CONSTRAINT_NONE;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
-			node_clear(zone_to_nid(zone), nodes);
-		else
+	/*
+ 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
+ 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ 	 * feature. Then, only mempolicy use this nodemask.
+ 	 */
+	if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
+		ret = CONSTRAINT_MEMORY_POLICY;
+
+	/* Check this allocation failure is caused by cpuset's wall function */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+			high_zoneidx, nodemask)
+		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
 			return CONSTRAINT_CPUSET;
 
-	if (!nodes_empty(nodes))
-		return CONSTRAINT_MEMORY_POLICY;
-#endif
+	/* __GFP_THISNODE never calls OOM.*/
 
+	return ret;
+}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	return CONSTRAINT_NONE;
 }
+#endif
 
 /*
  * Simple selection loop. We chose the process with the highest
@@ -603,7 +616,8 @@ rest_and_return:
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -622,7 +636,7 @@ void out_of_memory(struct zonelist *zone
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
Index: mmotm-2.6.32-Nov2/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Nov2/mm/page_alloc.c
@@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	/* The OOM killer will not help higher order allocs */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
 		goto out;
-
+	/*
+	 * In usual, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+	 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+	 */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);
Index: mmotm-2.6.32-Nov2/include/linux/oom.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/oom.h
+++ mmotm-2.6.32-Nov2/include/linux/oom.h
@@ -27,7 +27,8 @@ enum oom_constraint {
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *mask);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 




* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
From: KOSAKI Motohiro @ 2009-11-10  7:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, linux-mm, linux-kernel, akpm, cl, rientjes

Hi

> ===================================================================
> --- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
> +++ mmotm-2.6.32-Nov2/mm/oom_kill.c
> @@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
>  /*
>   * Determine the type of allocation constraint.
>   */
> +#ifdef CONFIG_NUMA
>  static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> -						    gfp_t gfp_mask)
> +				    gfp_t gfp_mask, nodemask_t *nodemask)
>  {
> -#ifdef CONFIG_NUMA
>  	struct zone *zone;
>  	struct zoneref *z;
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> +	int ret = CONSTRAINT_NONE;
>  
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> -			node_clear(zone_to_nid(zone), nodes);
> -		else
> +	/*
> + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> + 	 * feature. Then, only mempolicy use this nodemask.
> + 	 */
> +	if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
> +		ret = CONSTRAINT_MEMORY_POLICY;

!nodes_equal() ?


> +
> +	/* Check this allocation failure is caused by cpuset's wall function */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> +			high_zoneidx, nodemask)
> +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  			return CONSTRAINT_CPUSET;

If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is
probably the better choice.



>  
> -	if (!nodes_empty(nodes))
> -		return CONSTRAINT_MEMORY_POLICY;
> -#endif
> +	/* __GFP_THISNODE never calls OOM.*/
>  
> +	return ret;
> +}
> +#else
> +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> +				gfp_t gfp_mask, nodemask_t *nodemask)
> +{
>  	return CONSTRAINT_NONE;
>  }
> +#endif





* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
From: KAMEZAWA Hiroyuki @ 2009-11-10  7:24 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel, akpm, cl, rientjes

On Tue, 10 Nov 2009 16:24:22 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi
> 
> > ===================================================================
> > --- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
> > +++ mmotm-2.6.32-Nov2/mm/oom_kill.c
> > @@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
> >  /*
> >   * Determine the type of allocation constraint.
> >   */
> > +#ifdef CONFIG_NUMA
> >  static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > -						    gfp_t gfp_mask)
> > +				    gfp_t gfp_mask, nodemask_t *nodemask)
> >  {
> > -#ifdef CONFIG_NUMA
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > +	int ret = CONSTRAINT_NONE;
> >  
> > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -			node_clear(zone_to_nid(zone), nodes);
> > -		else
> > +	/*
> > + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> > + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > + 	 * feature. Then, only mempolicy use this nodemask.
> > + 	 */
> > +	if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
> > +		ret = CONSTRAINT_MEMORY_POLICY;
> 
> !nodes_equal() ?
> 
Yes, I will fix that.

> 
> > +
> > +	/* Check this allocation failure is caused by cpuset's wall function */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > +			high_zoneidx, nodemask)
> > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  			return CONSTRAINT_CPUSET;
> 
> If cpuset and MPOL_BIND are both used, Probably CONSTRAINT_MEMORY_POLICY is
> better choice.
> 

No. This memory allocation failed because of the limitation imposed by
the cpuset's allocation mask, not because of mempolicy.


Thanks,
-Kame
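
The precedence described here, with the cpuset wall winning over a
restrictive mempolicy nodemask, follows from the patch's control flow:
the mempolicy check only records a tentative result, while the cpuset
scan returns immediately. A minimal standalone sketch (bitmask
nodemasks; classify_both and all other names are illustrative, not
kernel code):

```c
#include <stdint.h>

typedef uint64_t nodemask_t;	/* bit i set => node i (illustrative) */

enum constraint { NONE, CPUSET, MEMPOLICY };

/* Mirror of the patch's ordering: the mempolicy check only records a
 * tentative result, while a cpuset-forbidden node in the policy's mask
 * returns immediately, so CPUSET wins when both constraints apply. */
enum constraint classify_both(nodemask_t policy_mask, nodemask_t all_nodes,
			      nodemask_t cpuset_allowed)
{
	enum constraint ret = NONE;

	if (policy_mask != all_nodes)		/* MPOL_BIND narrowed the mask */
		ret = MEMPOLICY;

	if (policy_mask & ~cpuset_allowed)	/* allocation hit the cpuset wall */
		return CPUSET;

	return ret;
}
```

With policy_mask 0x3, all_nodes 0xF, and cpuset_allowed 0x1, both
constraints apply and the function returns CPUSET, matching Kame's
answer above.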



* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
  2009-11-10  7:24       ` KAMEZAWA Hiroyuki
@ 2009-11-10  7:39         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-11-10  7:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, linux-mm, linux-kernel, akpm, cl, rientjes

> > > +
> > > +	/* Check this allocation failure is caused by cpuset's wall function */
> > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > +			high_zoneidx, nodemask)
> > > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > >  			return CONSTRAINT_CPUSET;
> > 
> > If cpuset and MPOL_BIND are both used, Probably CONSTRAINT_MEMORY_POLICY is
> > better choice.
> 
> No. this memory allocation is failed by limitation of cpuset's alloc mask.
> Not from mempolicy.

But CONSTRAINT_CPUSET doesn't help free memory on the needed nodes. That isn't
your fault; the original code is wrong too, but I think we should fix it.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
  2009-11-10  7:39         ` KOSAKI Motohiro
@ 2009-11-10  7:40           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-10  7:40 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-mm, linux-kernel, akpm, cl, rientjes

On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > > +
> > > > +	/* Check this allocation failure is caused by cpuset's wall function */
> > > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > +			high_zoneidx, nodemask)
> > > > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > >  			return CONSTRAINT_CPUSET;
> > > 
> > > If cpuset and MPOL_BIND are both used, Probably CONSTRAINT_MEMORY_POLICY is
> > > better choice.
> > 
> > No. this memory allocation is failed by limitation of cpuset's alloc mask.
> > Not from mempolicy.
> 
> But CONSTRAINT_CPUSET doesn't help to free necessary node memory. It isn't
> your fault. original code is wrong too. but I hope we should fix it.
> 
Hmm, fair enough.

My 3rd version will use the logic "always kill current if the allocation
uses mempolicy" (which is what CONSTRAINT_MEMPOLICY does).

Objections?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
  2009-11-10  7:40           ` KAMEZAWA Hiroyuki
@ 2009-11-10  8:03             ` Daisuke Nishimura
  -1 siblings, 0 replies; 92+ messages in thread
From: Daisuke Nishimura @ 2009-11-10  8:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, akpm, cl, rientjes,
	Daisuke Nishimura

On Tue, 10 Nov 2009 16:40:55 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > > > +
> > > > > +	/* Check this allocation failure is caused by cpuset's wall function */
> > > > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > > +			high_zoneidx, nodemask)
> > > > > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > >  			return CONSTRAINT_CPUSET;
> > > > 
> > > > If cpuset and MPOL_BIND are both used, Probably CONSTRAINT_MEMORY_POLICY is
> > > > better choice.
> > > 
> > > No. this memory allocation is failed by limitation of cpuset's alloc mask.
> > > Not from mempolicy.
> > 
> > But CONSTRAINT_CPUSET doesn't help to free necessary node memory. It isn't
> > your fault. original code is wrong too. but I hope we should fix it.
> > 
I think so too.

> Hmm, maybe fair enough.
> 
> My 3rd version will use "kill always current(CONSTRAINT_MEMPOLICY does this)
> if it uses mempolicy" logic.
> 
"if it uses mempolicy"?
Do you mean "always kill current if the memory allocation failed because of the
limitation of cpuset's mask" (i.e. the CONSTRAINT_CPUSET case)?


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2
  2009-11-10  8:03             ` Daisuke Nishimura
@ 2009-11-10  8:17               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-10  8:17 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, akpm, cl, rientjes

On Tue, 10 Nov 2009 17:03:38 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Tue, 10 Nov 2009 16:40:55 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > > > > +
> > > > > > +	/* Check this allocation failure is caused by cpuset's wall function */
> > > > > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > > > +			high_zoneidx, nodemask)
> > > > > > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > > >  			return CONSTRAINT_CPUSET;
> > > > > 
> > > > > If cpuset and MPOL_BIND are both used, Probably CONSTRAINT_MEMORY_POLICY is
> > > > > better choice.
> > > > 
> > > > No. this memory allocation is failed by limitation of cpuset's alloc mask.
> > > > Not from mempolicy.
> > > 
> > > But CONSTRAINT_CPUSET doesn't help to free necessary node memory. It isn't
> > > your fault. original code is wrong too. but I hope we should fix it.
> > > 
> I think so too.
> 
> > Hmm, maybe fair enough.
> > 
> > My 3rd version will use "kill always current(CONSTRAINT_MEMPOLICY does this)
> > if it uses mempolicy" logic.
> > 
> "if it uses mempoicy" ?
> You mean "kill allways current if memory allocation has failed by limitation of
> cpuset's mask"(i.e. CONSTRAINT_CPUSET case) ?
> 

No: "always kill the current process if the memory allocation uses mempolicy",
regardless of cpuset. If the task doesn't use a mempolicy allocation, the
usual CONSTRAINT_CPUSET/CONSTRAINT_NONE OOM handler will be invoked.

Right now, without the patch, CONSTRAINT_MEMPOLICY is never returned at all.
I'd like to limit the scope of this patch to returning it. When it is
returned, current will be killed.

Eventually, we'll have to consider the "how to manage OOM under cpuset"
problem again. It isn't handled well now.

The main problems are...
   - Cpusets allow their node sets to intersect across groups.
   - A task can be migrated to another cpuset without moving its memory.
   - We don't have per-task, per-node RSS information.

Then,
   - We have to scan all tasks.
   - We have to invoke Totally-Random-Innocent-Task-Killer and pray that
     someone bad will be killed.

IMHO, "finding the correct one" is too heavy for the kernel (under cpuset).
If we had a notifier to userland, a daemon could check the numa_maps of all
tasks and do something reasonable.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread

* [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3
  2009-11-10  8:17               ` KAMEZAWA Hiroyuki
@ 2009-11-11  2:24                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  2:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel, akpm,
	cl, rientjes

From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

This fixes node-oriented allocation handling in oom_kill.c.
I consider this a bugfix, not an enhancement.

These days, things have changed:
  - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
  - mempolicy no longer maintains its own private zonelists.
  (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

But the current oom-killer doesn't handle the nodemask, and that is wrong.

This patch does
  - check the nodemask: if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
  - scan all zones in the zonelist under the nodemask; if it hits cpuset's
    wall, this failure came from cpuset.
And
  - modify the caller of out_of_memory() not to invoke the OOM killer if
    __GFP_THISNODE is set. This doesn't change "current" behavior; callers
    using __GFP_THISNODE should handle the page allocation failure themselves.

Changelog: 2009/11/11
 - fixed nodes_equal() calculation.
 - always return CONSTRAINT_MEMPOLICY if the given nodemask does not cover all memory nodes.

Changelog: 2009/11/06
 - fixed lack of oom.h

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
---
 drivers/char/sysrq.c |    2 +-
 include/linux/oom.h  |    4 +++-
 mm/oom_kill.c        |   44 ++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c      |   10 ++++++++--
 4 files changed, 44 insertions(+), 16 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
 /*
  * Determine the type of allocation constraint.
  */
+#ifdef CONFIG_NUMA
 static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-						    gfp_t gfp_mask)
+				    gfp_t gfp_mask, nodemask_t *nodemask)
 {
-#ifdef CONFIG_NUMA
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	nodemask_t nodes = node_states[N_HIGH_MEMORY];
+	int ret = CONSTRAINT_NONE;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
-			node_clear(zone_to_nid(zone), nodes);
-		else
+	/*
+ 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
+ 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ 	 * feature. mempolicy is an only user of nodemask here.
+ 	 */
+	if (nodemask) {
+		nodemask_t mask;
+		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
+		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
+		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
+			return CONSTRAINT_MEMORY_POLICY;
+	}
+
+	/* Check this allocation failure is caused by cpuset's wall function */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+			high_zoneidx, nodemask)
+		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
 			return CONSTRAINT_CPUSET;
 
-	if (!nodes_empty(nodes))
-		return CONSTRAINT_MEMORY_POLICY;
-#endif
+	/* __GFP_THISNODE never calls OOM.*/
 
 	return CONSTRAINT_NONE;
 }
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	return CONSTRAINT_NONE;
+}
+#endif
 
 /*
  * Simple selection loop. We chose the process with the highest
@@ -603,7 +621,8 @@ rest_and_return:
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
 	case CONSTRAINT_MEMORY_POLICY:
+		/* kill by process's its own memory alloc limitation */
 		oom_kill_process(current, gfp_mask, order, 0, NULL,
 				"No available memory (MPOL_BIND)");
 		break;
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	/* The OOM killer will not help higher order allocs */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
 		goto out;
-
+	/*
+	 * In usual, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+	 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+	 */
+	if (gfp_mask & __GFP_THISNODE)
+		goto out;
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
 #ifdef __KERNEL__
 
 #include <linux/types.h>
+#include <linux/nodemask.h>
 
 struct zonelist;
 struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *mask);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3
  2009-11-11  2:24                 ` KAMEZAWA Hiroyuki
@ 2009-11-11  2:36                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-11-11  2:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Daisuke Nishimura, linux-mm, linux-kernel, akpm,
	cl, rientjes

> From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
> 
> This fixes node-oriented allocation handling in oom_kill.c.
> I consider this a bugfix, not an enhancement.
> 
> These days, things have changed:
>   - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
>   - mempolicy no longer maintains its own private zonelists.
>   (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
> 
> But the current oom-killer doesn't handle the nodemask, and that is wrong.
> 
> This patch does
>   - check the nodemask: if a nodemask is given and it doesn't cover all of
>     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
>   - scan all zones in the zonelist under the nodemask; if it hits cpuset's
>     wall, this failure came from cpuset.
> And
>   - modify the caller of out_of_memory() not to invoke the OOM killer if
>     __GFP_THISNODE is set. This doesn't change "current" behavior; callers
>     using __GFP_THISNODE should handle the page allocation failure themselves.
> 
> Changelog: 2009/11/11
>  - fixed the nodes_equal() calculation.
>  - always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.
> 
> Changelog: 2009/11/06
>  - fixed lack of oom.h
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
> ---
>  drivers/char/sysrq.c |    2 +-
>  include/linux/oom.h  |    4 +++-
>  mm/oom_kill.c        |   44 ++++++++++++++++++++++++++++++++------------
>  mm/page_alloc.c      |   10 ++++++++--
>  4 files changed, 44 insertions(+), 16 deletions(-)
> 

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v3
  2009-11-11  2:24                 ` KAMEZAWA Hiroyuki
@ 2009-11-11  2:49                   ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-11  2:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Index: mm-test-kernel/drivers/char/sysrq.c
> ===================================================================
> --- mm-test-kernel.orig/drivers/char/sysrq.c
> +++ mm-test-kernel/drivers/char/sysrq.c
> @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> +	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
>  }
>  
>  static DECLARE_WORK(moom_work, moom_callback);
> Index: mm-test-kernel/mm/oom_kill.c
> ===================================================================
> --- mm-test-kernel.orig/mm/oom_kill.c
> +++ mm-test-kernel/mm/oom_kill.c
> @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
>  /*
>   * Determine the type of allocation constraint.
>   */
> +#ifdef CONFIG_NUMA
>  static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> -						    gfp_t gfp_mask)
> +				    gfp_t gfp_mask, nodemask_t *nodemask)

We should probably remove the inline specifier; there's only one caller 
currently, and if additional callers were added in the future, this 
function probably should not be replicated anyway.

>  {
> -#ifdef CONFIG_NUMA
>  	struct zone *zone;
>  	struct zoneref *z;
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> +	int ret = CONSTRAINT_NONE;
>  
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> -			node_clear(zone_to_nid(zone), nodes);
> -		else
> +	/*
> + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> + 	 * feature. mempolicy is an only user of nodemask here.
> + 	 */
> +	if (nodemask) {
> +		nodemask_t mask;
> +		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> +		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> +		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> +			return CONSTRAINT_MEMORY_POLICY;
> +	}

Although a nodemask_t was previously allocated on the stack, we should 
probably change this to use NODEMASK_ALLOC() for kernels with higher 
CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.

There should be a way around that, however.  Shouldn't

	if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
		return CONSTRAINT_MEMORY_POLICY;

be sufficient?

> +
> +	/* Check this allocation failure is caused by cpuset's wall function */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> +			high_zoneidx, nodemask)
> +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
>  			return CONSTRAINT_CPUSET;
>  
> -	if (!nodes_empty(nodes))
> -		return CONSTRAINT_MEMORY_POLICY;
> -#endif
> +	/* __GFP_THISNODE never calls OOM.*/
>  
>  	return CONSTRAINT_NONE;
>  }
> +#else
> +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> +				gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +	return CONSTRAINT_NONE;
> +}
> +#endif
>  
>  /*
>   * Simple selection loop. We chose the process with the highest
> @@ -603,7 +621,8 @@ rest_and_return:
>   * OR try to be smart about which process to kill. Note that we
>   * don't have to be perfect here, we just have to be good.
>   */
> -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> +void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask)
>  {
>  	unsigned long freed = 0;
>  	enum oom_constraint constraint;
> @@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.
>  	 */
> -	constraint = constrained_alloc(zonelist, gfp_mask);
> +	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
>  	read_lock(&tasklist_lock);
>  
>  	switch (constraint) {
>  	case CONSTRAINT_MEMORY_POLICY:
> +		/* kill by process's its own memory alloc limitation */

I don't understand this comment.

>  		oom_kill_process(current, gfp_mask, order, 0, NULL,
>  				"No available memory (MPOL_BIND)");
>  		break;
> Index: mm-test-kernel/mm/page_alloc.c
> ===================================================================
> --- mm-test-kernel.orig/mm/page_alloc.c
> +++ mm-test-kernel/mm/page_alloc.c
> @@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
>  	/* The OOM killer will not help higher order allocs */
>  	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
>  		goto out;
> -
> +	/*
> +	 * In usual, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> +	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> +	 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
> +	 */
> +	if (gfp_mask & __GFP_THISNODE)
> +		goto out;
>  	/* Exhausted what can be done so it's blamo time */
> -	out_of_memory(zonelist, gfp_mask, order);
> +	out_of_memory(zonelist, gfp_mask, order, nodemask);
>  
>  out:
>  	clear_zonelist_oom(zonelist, gfp_mask);

This doesn't seem like the right place for this check; should we even try 
direct reclaim for bare users of __GFP_THISNODE?  If we're adding it only 
for sanity, even though no caller would currently hit it, it is also a 
potential escape route for __GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v3
  2009-11-11  2:49                   ` David Rientjes
@ 2009-11-11  3:02                     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-11-11  3:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm,
	linux-kernel, Andrew Morton, Christoph Lameter

Hi

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > +	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> >  }
> >  
> >  static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
> >  /*
> >   * Determine the type of allocation constraint.
> >   */
> > +#ifdef CONFIG_NUMA
> >  static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > -						    gfp_t gfp_mask)
> > +				    gfp_t gfp_mask, nodemask_t *nodemask)
> 
> We should probably remove the inline specifier, there's only one caller 
> currently and if additional ones were added in the future this function 
> should probably not be replicated.

Good catch.


> >  {
> > -#ifdef CONFIG_NUMA
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > +	int ret = CONSTRAINT_NONE;
> >  
> > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -			node_clear(zone_to_nid(zone), nodes);
> > -		else
> > +	/*
> > + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> > + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > + 	 * feature. mempolicy is an only user of nodemask here.
> > + 	 */
> > +	if (nodemask) {
> > +		nodemask_t mask;
> > +		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > +		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > +		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > +			return CONSTRAINT_MEMORY_POLICY;
> > +	}
> 
> Although a nodemask_t was previously allocated on the stack, we should 
> probably change this to use NODEMASK_ALLOC() for kernels with higher 
> CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.

No. NODEMASK_ALLOC() is crap; we should remove it. 
BTW, CPUMASK_ALLOC() was already removed.


> There should be a way around that, however.  Shouldn't
> 
> 	if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> 		return CONSTRAINT_MEMORY_POLICY;
> 
> be sufficient?

Is this safe in the memory hotplug case?





^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v3
  2009-11-11  2:49                   ` David Rientjes
@ 2009-11-11  3:04                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  3:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Tue, 10 Nov 2009 18:49:51 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > +	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> >  }
> >  
> >  static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
> >  /*
> >   * Determine the type of allocation constraint.
> >   */
> > +#ifdef CONFIG_NUMA
> >  static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > -						    gfp_t gfp_mask)
> > +				    gfp_t gfp_mask, nodemask_t *nodemask)
> 
> We should probably remove the inline specifier, there's only one caller 
> currently and if additional ones were added in the future this function 
> should probably not be replicated.
> 
Hmm, ok, remove.


> >  {
> > -#ifdef CONFIG_NUMA
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > +	int ret = CONSTRAINT_NONE;
> >  
> > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -			node_clear(zone_to_nid(zone), nodes);
> > -		else
> > +	/*
> > + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> > + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > + 	 * feature. mempolicy is an only user of nodemask here.
> > + 	 */
> > +	if (nodemask) {
> > +		nodemask_t mask;
> > +		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > +		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > +		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > +			return CONSTRAINT_MEMORY_POLICY;
> > +	}
> 
> Although a nodemask_t was previously allocated on the stack, we should 
> probably change this to use NODEMASK_ALLOC() for kernels with higher 
> CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
> 
> There should be a way around that, however.  Shouldn't
> 
> 	if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> 		return CONSTRAINT_MEMORY_POLICY;
> 
> be sufficient?
> 

Ah, I didn't notice nodes_subset(). Thank you, I'll use it.


> > +
> > +	/* Check this allocation failure is caused by cpuset's wall function */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > +			high_zoneidx, nodemask)
> > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  			return CONSTRAINT_CPUSET;
> >  
> > -	if (!nodes_empty(nodes))
> > -		return CONSTRAINT_MEMORY_POLICY;
> > -#endif
> > +	/* __GFP_THISNODE never calls OOM.*/
> >  
> >  	return CONSTRAINT_NONE;
> >  }
> > +#else
> > +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > +				gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	return CONSTRAINT_NONE;
> > +}
> > +#endif
> >  
> >  /*
> >   * Simple selection loop. We chose the process with the highest
> > @@ -603,7 +621,8 @@ rest_and_return:
> >   * OR try to be smart about which process to kill. Note that we
> >   * don't have to be perfect here, we just have to be good.
> >   */
> > -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> > +void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > +		int order, nodemask_t *nodemask)
> >  {
> >  	unsigned long freed = 0;
> >  	enum oom_constraint constraint;
> > @@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
> >  	 * Check if there were limitations on the allocation (only relevant for
> >  	 * NUMA) that may require different handling.
> >  	 */
> > -	constraint = constrained_alloc(zonelist, gfp_mask);
> > +	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> >  	read_lock(&tasklist_lock);
> >  
> >  	switch (constraint) {
> >  	case CONSTRAINT_MEMORY_POLICY:
> > +		/* kill by process's its own memory alloc limitation */
> 
> I don't understand this comment.
> 
I'll remove this. But it seems not to be well known that current is always
killed under CONSTRAINT_MEMORY_POLICY. 

> >  		oom_kill_process(current, gfp_mask, order, 0, NULL,
> >  				"No available memory (MPOL_BIND)");
> >  		break;
> > Index: mm-test-kernel/mm/page_alloc.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/page_alloc.c
> > +++ mm-test-kernel/mm/page_alloc.c
> > @@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
> >  	/* The OOM killer will not help higher order allocs */
> >  	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> >  		goto out;
> > -
> > +	/*
> > +	 * In usual, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > +	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > +	 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
> > +	 */
> > +	if (gfp_mask & __GFP_THISNODE)
> > +		goto out;
> >  	/* Exhausted what can be done so it's blamo time */
> > -	out_of_memory(zonelist, gfp_mask, order);
> > +	out_of_memory(zonelist, gfp_mask, order, nodemask);
> >  
> >  out:
> >  	clear_zonelist_oom(zonelist, gfp_mask);
> 
> This doesn't seem like the right place for this check; should we even try 
> direct reclaim for bare users of __GFP_THISNODE?
No, hugepage has to do reclaim.

> If we're adding it for  sanity even though no callers would currently hit it,
> it also is a potential escape route for __GFP_NOFAIL.

I will add a __GFP_NOFAIL check.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v3
@ 2009-11-11  3:04                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  3:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Tue, 10 Nov 2009 18:49:51 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > +	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> >  }
> >  
> >  static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
> >  /*
> >   * Determine the type of allocation constraint.
> >   */
> > +#ifdef CONFIG_NUMA
> >  static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > -						    gfp_t gfp_mask)
> > +				    gfp_t gfp_mask, nodemask_t *nodemask)
> 
> We should probably remove the inline specifier, there's only one caller 
> currently and if additional ones were added in the future this function 
> should probably not be replicated.
> 
Hmm, ok, remove.


> >  {
> > -#ifdef CONFIG_NUMA
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > +	int ret = CONSTRAINT_NONE;
> >  
> > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -			node_clear(zone_to_nid(zone), nodes);
> > -		else
> > +	/*
> > + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> > + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > + 	 * feature. mempolicy is an only user of nodemask here.
> > + 	 */
> > +	if (nodemask) {
> > +		nodemask_t mask;
> > +		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > +		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > +		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > +			return CONSTRAINT_MEMORY_POLICY;
> > +	}
> 
> Although a nodemask_t was previously allocated on the stack, we should 
> probably change this to use NODEMASK_ALLOC() for kernels with higher 
> CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
> 
> There should be a way around that, however.  Shouldn't
> 
> 	if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> 		return CONSTRAINT_MEMORY_POLICY;
> 
> be sufficient?
> 

Ah, I didn't notice nodes_subset(). Thank you, I'll use it.


> > +
> > +	/* Check this allocation failure is caused by cpuset's wall function */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > +			high_zoneidx, nodemask)
> > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> >  			return CONSTRAINT_CPUSET;
> >  
> > -	if (!nodes_empty(nodes))
> > -		return CONSTRAINT_MEMORY_POLICY;
> > -#endif
> > +	/* __GFP_THISNODE never calls OOM.*/
> >  
> >  	return CONSTRAINT_NONE;
> >  }
> > +#else
> > +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > +				gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	return CONSTRAINT_NONE;
> > +}
> > +#endif
> >  
> >  /*
> >   * Simple selection loop. We chose the process with the highest
> > @@ -603,7 +621,8 @@ rest_and_return:
> >   * OR try to be smart about which process to kill. Note that we
> >   * don't have to be perfect here, we just have to be good.
> >   */
> > -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> > +void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > +		int order, nodemask_t *nodemask)
> >  {
> >  	unsigned long freed = 0;
> >  	enum oom_constraint constraint;
> > @@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
> >  	 * Check if there were limitations on the allocation (only relevant for
> >  	 * NUMA) that may require different handling.
> >  	 */
> > -	constraint = constrained_alloc(zonelist, gfp_mask);
> > +	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> >  	read_lock(&tasklist_lock);
> >  
> >  	switch (constraint) {
> >  	case CONSTRAINT_MEMORY_POLICY:
> > +		/* kill by process's its own memory alloc limitation */
> 
> I don't understand this comment.
> 
I'll remove this. But it seems not to be well known that current is always
killed under CONSTRAINT_MEMORY_POLICY.

> >  		oom_kill_process(current, gfp_mask, order, 0, NULL,
> >  				"No available memory (MPOL_BIND)");
> >  		break;
> > Index: mm-test-kernel/mm/page_alloc.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/page_alloc.c
> > +++ mm-test-kernel/mm/page_alloc.c
> > @@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
> >  	/* The OOM killer will not help higher order allocs */
> >  	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> >  		goto out;
> > -
> > +	/*
> > +	 * In usual, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > +	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > +	 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
> > +	 */
> > +	if (gfp_mask & __GFP_THISNODE)
> > +		goto out;
> >  	/* Exhausted what can be done so it's blamo time */
> > -	out_of_memory(zonelist, gfp_mask, order);
> > +	out_of_memory(zonelist, gfp_mask, order, nodemask);
> >  
> >  out:
> >  	clear_zonelist_oom(zonelist, gfp_mask);
> 
> This doesn't seem like the right place for this check; should we even try 
> direct reclaim for bare users of __GFP_THISNODE?
No, hugepage has to do reclaim.

> If we're adding it for  sanity even though no callers would currently hit it,
> it also is a potential escape route for __GFP_NOFAIL.

I will add a __GFP_NOFAIL check.
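The gating this discussion converges on can be sketched in userspace as a predicate; the flag values and the helper name below are illustrative, not the kernel's:

```c
#include <stdbool.h>

/* Illustrative flag values; the real kernel values differ. */
#define __GFP_NOFAIL		0x1u
#define __GFP_THISNODE		0x2u
#define PAGE_ALLOC_COSTLY_ORDER	3

/* Should __alloc_pages_may_oom() fall through to out_of_memory()? */
static bool should_call_oom(unsigned int gfp_mask, int order)
{
	if (!(gfp_mask & __GFP_NOFAIL)) {
		/* The OOM killer will not help higher order allocs */
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			return false;
		/* Bare __GFP_THISNODE callers handle the failure themselves */
		if (gfp_mask & __GFP_THISNODE)
			return false;
	}
	/* __GFP_NOFAIL must not return NULL, so the OOM killer still runs */
	return true;
}
```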

Thanks,
-Kame





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3
  2009-11-11  3:02                     ` KOSAKI Motohiro
@ 2009-11-11  3:10                       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  3:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Daisuke Nishimura, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009 12:02:06 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > There should be a way around that, however.  Shouldn't
> > 
> > 	if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> > 		return CONSTRAINT_MEMORY_POLICY;
> > 
> > be sufficient?
> 
> Is this safe on memory hotplug case?
> 
N_HIGH_MEMORY is updated at memory hotplug.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3
  2009-11-11  3:02                     ` KOSAKI Motohiro
@ 2009-11-11  3:14                       ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-11  3:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:

> > >  {
> > > -#ifdef CONFIG_NUMA
> > >  	struct zone *zone;
> > >  	struct zoneref *z;
> > >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > > +	int ret = CONSTRAINT_NONE;
> > >  
> > > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > -			node_clear(zone_to_nid(zone), nodes);
> > > -		else
> > > +	/*
> > > + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> > > + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > > + 	 * feature. mempolicy is an only user of nodemask here.
> > > + 	 */
> > > +	if (nodemask) {
> > > +		nodemask_t mask;
> > > +		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > > +		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > > +		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > > +			return CONSTRAINT_MEMORY_POLICY;
> > > +	}
> > 
> > Although a nodemask_t was previously allocated on the stack, we should 
> > probably change this to use NODEMASK_ALLOC() for kernels with higher 
> > CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
> 
> No. NODEMASK_ALLOC() is crap. we should remove it. 

I've booted 1K node systems and have found it to be helpful to ensure that 
the stack will not overflow especially in areas where we normally are deep 
already, such as in the page allocator.
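For scale: with CONFIG_NODES_SHIFT=10 a nodemask_t is 1024 bits, i.e. 128 bytes of stack per copy. A userspace model of the size argument (the real kernel NODEMASK_ALLOC() macro's interface differs; this only illustrates why deep call chains prefer the heap):

```c
#include <stdlib.h>

#define NODES_SHIFT	10	/* CONFIG_NODES_SHIFT for ~1K nodes */
#define MAX_NUMNODES	(1 << NODES_SHIFT)
#define BITS_PER_LONG	(8 * (int)sizeof(unsigned long))

typedef struct {
	unsigned long bits[(MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG];
} nodemask_t;

/* Heap-allocate instead of putting 128 bytes on an already-deep stack. */
static nodemask_t *nodemask_alloc(void)
{
	return calloc(1, sizeof(nodemask_t));
}
```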

> btw, CPUMASK_ALLOC was already removed.

I don't remember CPUMASK_ALLOC() actually being merged.  I know the 
comment exists in nodemask.h, but I don't recall any CPUMASK_ALLOC() users 
in the tree.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3
  2009-11-11  3:14                       ` David Rientjes
@ 2009-11-11  3:23                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-11-11  3:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm,
	linux-kernel, Andrew Morton, Christoph Lameter

> On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:
> 
> > > >  {
> > > > -#ifdef CONFIG_NUMA
> > > >  	struct zone *zone;
> > > >  	struct zoneref *z;
> > > >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > > > +	int ret = CONSTRAINT_NONE;
> > > >  
> > > > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > > > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > -			node_clear(zone_to_nid(zone), nodes);
> > > > -		else
> > > > +	/*
> > > > + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> > > > + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > > > + 	 * feature. mempolicy is an only user of nodemask here.
> > > > + 	 */
> > > > +	if (nodemask) {
> > > > +		nodemask_t mask;
> > > > +		/* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > > > +		nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > > > +		if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > > > +			return CONSTRAINT_MEMORY_POLICY;
> > > > +	}
> > > 
> > > Although a nodemask_t was previously allocated on the stack, we should 
> > > probably change this to use NODEMASK_ALLOC() for kernels with higher 
> > > CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
> > 
> > No. NODEMASK_ALLOC() is crap. we should remove it. 
> 
> I've booted 1K node systems and have found it to be helpful to ensure that 
> the stack will not overflow especially in areas where we normally are deep 
> already, such as in the page allocator.

Linux doesn't support 1K nodes. (And only SGI's huge machines use 512 nodes.)

At least, NODEMASK_ALLOC should provide a cleaner interface. The current one
and struct nodemask_scratch are pretty ugly.


> > btw, CPUMASK_ALLOC was already removed.
> 
> I don't remember CPUMASK_ALLOC() actually being merged.  I know the 
> comment exists in nodemask.h, but I don't recall any CPUMASK_ALLOC() users 
> in the tree.




^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3
  2009-11-11  3:23                         ` KOSAKI Motohiro
@ 2009-11-11  3:27                           ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-11  3:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:

> Linux doesn't support 1K nodes. (and only SGI huge machine use 512 nodes)
> 

I know for a fact that it does on x86 if you adjust CONFIG_NODES_SHIFT, 
I've booted kernels all the way back to 2.6.18 with 1K nodes.

> At least, NODEMASK_ALLOC should make more cleaner interface. current one
> and struct nodemask_scratch are pretty ugly.
> 

I agree, I haven't been a fan of nodemask_scratch because I think its use 
case is pretty limited, but I do advocate using NODEMASK_ALLOC() when deep 
in the stack.  We've made sure that most of the mempolicy code does that 
where manipulating nodemasks is common in -mm.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4
  2009-11-11  2:24                 ` KAMEZAWA Hiroyuki
@ 2009-11-11  4:45                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  4:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel, akpm,
	cl, rientjes

From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

Fixing node-oriented allocation handling in oom-kill.c
I myself think of this as a bugfix, not as an enhancement.

These days, things have changed:
  - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
  - mempolicy doesn't maintain its own private zonelists.
  (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

So the current oom-killer's check function is wrong.

This patch does
  - check the nodemask: if a nodemask is passed and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
  - Scan all zones in the zonelist under the nodemask; if the scan hits
    cpuset's wall, this failure is from cpuset.
And
  - modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
    This doesn't change "current" behavior. If a caller uses __GFP_THISNODE,
    it should handle "page allocation failure" by itself.

  - handle __GFP_NOFAIL+__GFP_THISNODE path.
    This is something like a FIXME but this gfpmask is not used now.
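The decision order the patch establishes (THISNODE first, then mempolicy, then cpuset) can be sketched in userspace; nodemasks are modeled as plain bitmasks, and all names and flag values below are illustrative stand-ins:

```c
#include <stddef.h>

enum oom_constraint {
	CONSTRAINT_NONE,
	CONSTRAINT_CPUSET,
	CONSTRAINT_MEMORY_POLICY,
};

#define GFP_THISNODE_FLAG 0x1u	/* stand-in for __GFP_THISNODE */

static enum oom_constraint classify(unsigned int gfp_mask,
				    const unsigned long *nodemask,
				    unsigned long all_mem_nodes,
				    unsigned long cpuset_allowed)
{
	/* Bare __GFP_THISNODE (reached only with __GFP_NOFAIL) gets no
	 * special constraint. */
	if (gfp_mask & GFP_THISNODE_FLAG)
		return CONSTRAINT_NONE;

	/* mempolicy is the only source of a partial nodemask here. */
	if (nodemask && (all_mem_nodes & ~*nodemask))
		return CONSTRAINT_MEMORY_POLICY;

	/* Would the zonelist walk hit a node the cpuset forbids? */
	if (all_mem_nodes & ~cpuset_allowed)
		return CONSTRAINT_CPUSET;

	return CONSTRAINT_NONE;
}
```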

Changelog: 2009/11/11(2)
 - uses nodes_subset().
 - clean up.
 - added __GFP_NOFAIL case. And added a warning.

Changelog: 2009/11/11
 - fixed nodes_equal() calculation.
 - always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
 - fixed lack of oom.h

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
---
 drivers/char/sysrq.c |    2 +-
 include/linux/oom.h  |    4 +++-
 mm/oom_kill.c        |   45 +++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c      |   20 +++++++++++++++-----
 4 files changed, 52 insertions(+), 19 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
 /*
  * Determine the type of allocation constraint.
  */
+#ifdef CONFIG_NUMA
 static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-						    gfp_t gfp_mask)
+				    gfp_t gfp_mask, nodemask_t *nodemask)
 {
-#ifdef CONFIG_NUMA
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	nodemask_t nodes = node_states[N_HIGH_MEMORY];
+	int ret = CONSTRAINT_NONE;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
-			node_clear(zone_to_nid(zone), nodes);
-		else
-			return CONSTRAINT_CPUSET;
+	/*
+	 * We reach here only when __GFP_NOFAIL is used, so we should avoid
+	 * killing current; we have to kill a random task in this case.
+	 * Hopefully CONSTRAINT_THISNODE...but there is no way to handle it now.
+	 */
+	if (gfp_mask & __GFP_THISNODE)
+		return ret;
 
-	if (!nodes_empty(nodes))
+	/*
+	 * The nodemask here is the nodemask passed to alloc_pages(). Now,
+	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+	 * feature; mempolicy is the only user of the nodemask here. Check
+	 * that mempolicy's nodemask contains all of N_HIGH_MEMORY.
+	 */
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
 		return CONSTRAINT_MEMORY_POLICY;
-#endif
 
+	/* Check this allocation failure is caused by cpuset's wall function */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+			high_zoneidx, nodemask)
+		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+			return CONSTRAINT_CPUSET;
+
+	return CONSTRAINT_NONE;
+}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	return CONSTRAINT_NONE;
 }
+#endif
 
 /*
  * Simple selection loop. We chose the process with the highest
@@ -603,7 +623,8 @@ rest_and_return:
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -622,7 +643,7 @@ void out_of_memory(struct zonelist *zone
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1664,12 +1664,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	if (page)
 		goto out;
 
-	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
-		goto out;
-
+	if (!(gfp_mask & __GFP_NOFAIL)) {
+		/* The OOM killer will not help higher order allocs */
+		if (order > PAGE_ALLOC_COSTLY_ORDER)
+			goto out;
+		/*
+		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+		 * The caller should handle page allocation failure by itself if
+		 * it specifies __GFP_THISNODE.
+		 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+		 */
+		if (gfp_mask & __GFP_THISNODE)
+			goto out;
+	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
 #ifdef __KERNEL__
 
 #include <linux/types.h>
+#include <linux/nodemask.h>
 
 struct zonelist;
 struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *mask);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 


^ permalink raw reply	[flat|nested] 92+ messages in thread


* [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.1
  2009-11-11  4:45                   ` KAMEZAWA Hiroyuki
@ 2009-11-11  5:28                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  5:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel, akpm,
	cl, rientjes

Sorry, I missed removing 'inline'...
==
From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

Fix node-oriented allocation handling in oom-kill.c.
I consider this a bugfix, not an enhancement.

Things have changed over time:
  - alloc_pages() now takes a nodemask argument, via __alloc_pages_nodemask().
  - mempolicy no longer maintains its own private zonelists.
  (And cpuset doesn't use the nodemask for __alloc_pages_nodemask().)

So the current OOM killer's constraint check is wrong.

This patch does the following:
  - Check the nodemask: if a nodemask is passed and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
  - Scan all zones in the zonelist under the nodemask; if the scan hits a
    cpuset's wall, the failure comes from the cpuset.
And
  - Modify the caller of out_of_memory() not to invoke the OOM killer for
    __GFP_THISNODE allocations. This doesn't change current behavior: a
    caller that uses __GFP_THISNODE should handle page allocation failure
    by itself.
  - Handle the __GFP_NOFAIL + __GFP_THISNODE path.
    This is something like a FIXME, but that gfp mask combination is unused
    for now.

Changelog: 2009/11/11 (2)
 - use nodes_subset().
 - clean up.
 - added the __GFP_NOFAIL case and a warning.
 - removed 'inline'.

Changelog: 2009/11/11
 - fixed the nodes_equal() calculation.
 - always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
 - fixed the missing oom.h include.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
---
 drivers/char/sysrq.c |    2 +-
 include/linux/oom.h  |    4 +++-
 mm/oom_kill.c        |   47 ++++++++++++++++++++++++++++++++++-------------
 mm/page_alloc.c      |   20 +++++++++++++++-----
 4 files changed, 53 insertions(+), 20 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
 /*
  * Determine the type of allocation constraint.
  */
-static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-						    gfp_t gfp_mask)
-{
 #ifdef CONFIG_NUMA
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				    gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	nodemask_t nodes = node_states[N_HIGH_MEMORY];
+	int ret = CONSTRAINT_NONE;
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
-			node_clear(zone_to_nid(zone), nodes);
-		else
-			return CONSTRAINT_CPUSET;
+	/*
+	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
+ 	 * to kill current.We have to random task kill in this case.
+ 	 * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
+ 	 */
+	if (gfp_mask & __GPF_THISNODE)
+		return ret;
 
-	if (!nodes_empty(nodes))
+	/*
+ 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
+ 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ 	 * feature. mempolicy is an only user of nodemask here.
+	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
+ 	 */
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
 		return CONSTRAINT_MEMORY_POLICY;
-#endif
 
+	/* Check this allocation failure is caused by cpuset's wall function */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+			high_zoneidx, nodemask)
+		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+			return CONSTRAINT_CPUSET;
+
+	return CONSTRAINT_NONE;
+}
+#else
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	return CONSTRAINT_NONE;
 }
+#endif
 
 /*
  * Simple selection loop. We chose the process with the highest
@@ -603,7 +623,8 @@ rest_and_return:
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -622,7 +643,7 @@ void out_of_memory(struct zonelist *zone
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1664,12 +1664,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	if (page)
 		goto out;
 
-	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
-		goto out;
-
+	if (!(gfp_mask & __GFP_NOFAIL)) {
+		/* The OOM killer will not help higher order allocs */
+		if (order > PAGE_ALLOC_COSTLY_ORDER)
+			goto out;
+		/*
+	 	* GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+	 	* Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+	 	* The caller should handle page allocation failure by itself if
+	 	* it specifies __GFP_THISNODE.
+	 	* Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+	 	*/
+		if (gfp_mask & __GFP_THISNODE)
+			goto out;
+	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
 #ifdef __KERNEL__
 
 #include <linux/types.h>
+#include <linux/nodemask.h>
 
 struct zonelist;
 struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *mask);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 




* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.1
  2009-11-11  5:28                     ` KAMEZAWA Hiroyuki
@ 2009-11-11  5:58                       ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-11  5:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Index: mm-test-kernel/drivers/char/sysrq.c
> ===================================================================
> --- mm-test-kernel.orig/drivers/char/sysrq.c
> +++ mm-test-kernel/drivers/char/sysrq.c
> @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> +	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
>  }
>  
>  static DECLARE_WORK(moom_work, moom_callback);
> Index: mm-test-kernel/mm/oom_kill.c
> ===================================================================
> --- mm-test-kernel.orig/mm/oom_kill.c
> +++ mm-test-kernel/mm/oom_kill.c
> @@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
>  /*
>   * Determine the type of allocation constraint.
>   */
> -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> -						    gfp_t gfp_mask)
> -{
>  #ifdef CONFIG_NUMA
> +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> +				    gfp_t gfp_mask, nodemask_t *nodemask)
> +{
>  	struct zone *zone;
>  	struct zoneref *z;
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> +	int ret = CONSTRAINT_NONE;
>  
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> -			node_clear(zone_to_nid(zone), nodes);
> -		else
> -			return CONSTRAINT_CPUSET;
> +	/*
> +	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
> + 	 * to kill current.We have to random task kill in this case.
> + 	 * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
> + 	 */
> +	if (gfp_mask & __GPF_THISNODE)
> +		return ret;
>  

That shouldn't compile.

> -	if (!nodes_empty(nodes))
> +	/*
> + 	 * The nodemask here is a nodemask passed to alloc_pages(). Now,
> + 	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> + 	 * feature. mempolicy is an only user of nodemask here.
> +	 * check mempolicy's nodemask contains all N_HIGH_MEMORY
> + 	 */
> +	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
>  		return CONSTRAINT_MEMORY_POLICY;
> -#endif
>  
> +	/* Check this allocation failure is caused by cpuset's wall function */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> +			high_zoneidx, nodemask)
> +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> +			return CONSTRAINT_CPUSET;
> +
> +	return CONSTRAINT_NONE;
> +}
> +#else
> +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> +				gfp_t gfp_mask, nodemask_t *nodemask)

inline seems appropriate in this case; gcc will optimize it anyway.



* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.1
  2009-11-11  5:58                       ` David Rientjes
@ 2009-11-11  6:20                         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  6:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Tue, 10 Nov 2009 21:58:31 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > +	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> >  }
> >  
> >  static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
> >  /*
> >   * Determine the type of allocation constraint.
> >   */
> > -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > -						    gfp_t gfp_mask)
> > -{
> >  #ifdef CONFIG_NUMA
> > +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > +				    gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > +	int ret = CONSTRAINT_NONE;
> >  
> > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > -			node_clear(zone_to_nid(zone), nodes);
> > -		else
> > -			return CONSTRAINT_CPUSET;
> > +	/*
> > +	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
> > + 	 * to kill current.We have to random task kill in this case.
> > + 	 * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
> > + 	 */
> > +	if (gfp_mask & __GPF_THISNODE)
> > +		return ret;
> >  
> 
> That shouldn't compile.
> 
Why ?

Thanks,
-Kame




* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.1
  2009-11-11  6:20                         ` KAMEZAWA Hiroyuki
@ 2009-11-11  6:26                           ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-11  6:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > > Index: mm-test-kernel/mm/oom_kill.c
> > > ===================================================================
> > > --- mm-test-kernel.orig/mm/oom_kill.c
> > > +++ mm-test-kernel/mm/oom_kill.c
> > > @@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
> > >  /*
> > >   * Determine the type of allocation constraint.
> > >   */
> > > -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > > -						    gfp_t gfp_mask)
> > > -{
> > >  #ifdef CONFIG_NUMA
> > > +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > > +				    gfp_t gfp_mask, nodemask_t *nodemask)
> > > +{
> > >  	struct zone *zone;
> > >  	struct zoneref *z;
> > >  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > -	nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > > +	int ret = CONSTRAINT_NONE;
> > >  
> > > -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > > -		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > -			node_clear(zone_to_nid(zone), nodes);
> > > -		else
> > > -			return CONSTRAINT_CPUSET;
> > > +	/*
> > > +	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
> > > + 	 * to kill current.We have to random task kill in this case.
> > > + 	 * Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
> > > + 	 */
> > > +	if (gfp_mask & __GPF_THISNODE)
> > > +		return ret;
> > >  
> > 
> > That shouldn't compile.
> > 
> Why ?
> 

Even when I pointed it out, you still didn't bother to try compiling it?  
You need s/GPF/GFP, it stands for "get free pages."

You can also eliminate the ret variable, it's not doing anything.



* [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-11-11  6:26                           ` David Rientjes
@ 2009-11-11  6:34                             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-11  6:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Tue, 10 Nov 2009 22:26:19 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
	 */
> > > > +	if (gfp_mask & __GPF_THISNODE)
> > > > +		return ret;
> > > >  
> > > 
> > > That shouldn't compile.
> > > 
> > Why ?
> > 
> 
> Even when I pointed it out, you still didn't bother to try compiling it?  
> You need s/GPF/GFP, it stands for "get free pages."
> 
You're right.
Ah, I noticed my build had CONFIG_NUMA=n
....(of course, I usually set it to =y)

> You can also eliminate the ret variable, it's not doing anything.
> 
OK, here is the fixed version. Thank you.
==
From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

Fix node-oriented allocation handling in oom-kill.c.
I myself consider this a bugfix, not an enhancement.

These days, things have changed:
  - alloc_pages() now takes a nodemask argument, via __alloc_pages_nodemask().
  - mempolicy no longer maintains its own private zonelists.
  (And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())

So the current oom-killer's constraint check is wrong.

This patch does the following:
  - Check the nodemask: if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
  - Scan all zones in the zonelist under the nodemask; if the allocation
    hit a cpuset's wall, the failure came from the cpuset (CONSTRAINT_CPUSET).
And
  - Modify the caller of out_of_memory() not to invoke the OOM killer for
    __GFP_THISNODE. This doesn't change "current" behavior: a caller that
    uses __GFP_THISNODE should handle the page allocation failure by itself.

  - Handle the __GFP_NOFAIL + __GFP_THISNODE path.
    This is something like a FIXME, but this gfp mask combination is not
    used now.

Changelog: 2009/11/11(2)
 - uses nodes_subset().
 - clean up.
 - added the __GFP_NOFAIL case. And added a warning.
 - removed inline
 - removed 'ret'

Changelog: 2009/11/11
 - fixed nodes_equal() calculation.
 - always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
 - fixed the missing oom.h include.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
---
 drivers/char/sysrq.c |    2 +-
 include/linux/oom.h  |    4 +++-
 mm/oom_kill.c        |   46 +++++++++++++++++++++++++++++++++-------------
 mm/page_alloc.c      |   20 +++++++++++++++-----
 4 files changed, 52 insertions(+), 20 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,46 @@ unsigned long badness(struct task_struct
 /*
  * Determine the type of allocation constraint.
  */
-static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
-						    gfp_t gfp_mask)
-{
 #ifdef CONFIG_NUMA
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				    gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	struct zone *zone;
 	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	nodemask_t nodes = node_states[N_HIGH_MEMORY];
 
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
-			node_clear(zone_to_nid(zone), nodes);
-		else
-			return CONSTRAINT_CPUSET;
+	/*
+	 * Reach here only when __GFP_NOFAIL is used. So, avoid killing
+	 * current; we have to do a random task kill in this case.
+	 * Hopefully CONSTRAINT_THISNODE... but no way to handle it now.
+	 */
+	if (gfp_mask & __GFP_THISNODE)
+		return CONSTRAINT_NONE;
 
-	if (!nodes_empty(nodes))
+	/*
+	 * The nodemask here is the nodemask passed to alloc_pages(). Now,
+	 * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+	 * feature; mempolicy is the only user of the nodemask here.
+	 * Check whether mempolicy's nodemask contains all of N_HIGH_MEMORY.
+	 */
+	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
 		return CONSTRAINT_MEMORY_POLICY;
-#endif
 
+	/* Check this allocation failure is caused by cpuset's wall function */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+			high_zoneidx, nodemask)
+		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+			return CONSTRAINT_CPUSET;
+
+	return CONSTRAINT_NONE;
+}
+#else
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+				gfp_t gfp_mask, nodemask_t *nodemask)
+{
 	return CONSTRAINT_NONE;
 }
+#endif
 
 /*
  * Simple selection loop. We chose the process with the highest
@@ -603,7 +622,8 @@ rest_and_return:
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask)
 {
 	unsigned long freed = 0;
 	enum oom_constraint constraint;
@@ -622,7 +642,7 @@ void out_of_memory(struct zonelist *zone
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-	constraint = constrained_alloc(zonelist, gfp_mask);
+	constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1664,12 +1664,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	if (page)
 		goto out;
 
-	/* The OOM killer will not help higher order allocs */
-	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
-		goto out;
-
+	if (!(gfp_mask & __GFP_NOFAIL)) {
+		/* The OOM killer will not help higher order allocs */
+		if (order > PAGE_ALLOC_COSTLY_ORDER)
+			goto out;
+		/*
+		 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+		 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+		 * The caller should handle page allocation failure by itself if
+		 * it specifies __GFP_THISNODE.
+		 * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+		 */
+		if (gfp_mask & __GFP_THISNODE)
+			goto out;
+	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order);
+	out_of_memory(zonelist, gfp_mask, order, nodemask);
 
 out:
 	clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
 #ifdef __KERNEL__
 
 #include <linux/types.h>
+#include <linux/nodemask.h>
 
 struct zonelist;
 struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
 extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *mask);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 


^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2
  2009-11-11  6:34                             ` KAMEZAWA Hiroyuki
@ 2009-11-11  7:32                               ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-11  7:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
> 
> Fix node-oriented allocation handling in oom-kill.c.
> I myself consider this a bugfix, not an enhancement.
> 
> These days, things have changed:
>   - alloc_pages() now takes a nodemask argument, via __alloc_pages_nodemask().
>   - mempolicy no longer maintains its own private zonelists.
>   (And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())
> 
> So the current oom-killer's constraint check is wrong.
> 
> This patch does the following:
>   - Check the nodemask: if a nodemask is given and it doesn't cover all of
>     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
>   - Scan all zones in the zonelist under the nodemask; if the allocation
>     hit a cpuset's wall, the failure came from the cpuset (CONSTRAINT_CPUSET).
> And
>   - Modify the caller of out_of_memory() not to invoke the OOM killer for
>     __GFP_THISNODE. This doesn't change "current" behavior: a caller that
>     uses __GFP_THISNODE should handle the page allocation failure by itself.
> 
>   - Handle the __GFP_NOFAIL + __GFP_THISNODE path.
>     This is something like a FIXME, but this gfp mask combination is not
>     used now.
> 
> Changelog: 2009/11/11(2)
>  - uses nodes_subset().
>  - clean up.
>  - added the __GFP_NOFAIL case. And added a warning.
>  - removed inline
>  - removed 'ret'
> 
> Changelog: 2009/11/11
>  - fixed nodes_equal() calculation.
>  - always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.
> 
> Changelog: 2009/11/06
>  - fixed the missing oom.h include.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2
  2009-11-11  6:34                             ` KAMEZAWA Hiroyuki
@ 2009-11-18  0:11                               ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-18  0:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Fix node-oriented allocation handling in oom-kill.c.
> I myself consider this a bugfix, not an enhancement.
> 
> These days, things have changed:
>   - alloc_pages() now takes a nodemask argument, via __alloc_pages_nodemask().
>   - mempolicy no longer maintains its own private zonelists.
>   (And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())
> 
> So the current oom-killer's constraint check is wrong.
> 
> This patch does the following:
>   - Check the nodemask: if a nodemask is given and it doesn't cover all of
>     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
>   - Scan all zones in the zonelist under the nodemask; if the allocation
>     hit a cpuset's wall, the failure came from the cpuset (CONSTRAINT_CPUSET).
> And
>   - Modify the caller of out_of_memory() not to invoke the OOM killer for
>     __GFP_THISNODE. This doesn't change "current" behavior: a caller that
>     uses __GFP_THISNODE should handle the page allocation failure by itself.
> 
>   - Handle the __GFP_NOFAIL + __GFP_THISNODE path.
>     This is something like a FIXME, but this gfp mask combination is not
>     used now.
> 

Now that we're passing the nodemask into the oom killer, we should be able 
to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not 
always the ideal task to kill, so it's better to scan the tasklist and 
determine the best task depending on our heuristics, similar to how we 
penalize candidates if they do not share the same cpuset.

Something like the following (untested) patch.  Comments?
---
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
-extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+				nodemask_t *mask);
+extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+							nodemask_t *mask)
+{
+	return false;
+}
+
+static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed3f392..3ab3021 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
 		 * global hstate attribute
 		 */
 		if (!(obey_mempolicy &&
-				init_nodemask_of_mempolicy(nodes_allowed))) {
+			init_nodemask_of_current_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
 			nodes_allowed = &node_states[N_HIGH_MEMORY];
 		}
@@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 		NODEMASK_ALLOC(nodemask_t, nodes_allowed,
 						GFP_KERNEL | __GFP_NORETRY);
 		if (!(obey_mempolicy &&
-			       init_nodemask_of_mempolicy(nodes_allowed))) {
+			init_nodemask_of_current_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
 			nodes_allowed = &node_states[N_HIGH_MEMORY];
 		}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f11fdad..23c84bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 	}
 	return zl;
 }
+#endif
 
 /*
- * init_nodemask_of_mempolicy
- *
- * If the current task's mempolicy is "default" [NULL], return 'false'
- * to indicate default policy.  Otherwise, extract the policy nodemask
- * for 'bind' or 'interleave' policy into the argument nodemask, or
- * initialize the argument nodemask to contain the single node for
- * 'preferred' or 'local' policy and return 'true' to indicate presence
- * of non-default mempolicy.
- *
- * We don't bother with reference counting the mempolicy [mpol_get/put]
- * because the current task is examining it's own mempolicy and a task's
- * mempolicy is only ever changed by the task itself.
+ * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
+ * policy.  Otherwise, extract the policy nodemask for 'bind' or 'interleave'
+ * policy into the argument nodemask, or initialize the argument nodemask to
+ * contain the single node for 'preferred' or 'local' policy and return 'true'
+ * to indicate presence of non-default mempolicy.
  *
  * N.B., it is the caller's responsibility to free a returned nodemask.
  */
-bool init_nodemask_of_mempolicy(nodemask_t *mask)
+bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
 	int nid;
@@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 
 	return true;
 }
-#endif
+
+/*
+ * We don't bother with reference counting the mempolicy [mpol_get/put]
+ * because the current task is examining its own mempolicy and a task's
+ * mempolicy is only ever changed by the task itself.
+ */
+bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+	return init_nodemask_of_task_mempolicy(current, mask);
+}
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab04537..4c5c58b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
 #include <linux/security.h>
+#include <linux/mempolicy.h>
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
@@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do the nodes allowed by any of tsk's threads overlap ours?
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static int has_intersects_mems_allowed(struct task_struct *tsk,
+						nodemask_t *nodemask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
+	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
 
-	t = tsk;
+	if (!nodemask)
+		mpol_nodemask = NULL;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
+		if (mpol_nodemask) {
+			mpol_get(tsk->mempolicy);
+			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
+				nodes_intersects(*nodemask, *mpol_nodemask)) {
+				mpol_put(tsk->mempolicy);
+				return 1;
+			}
+			mpol_put(tsk->mempolicy);
+		}
+		if (cpuset_mems_allowed_intersects(current, tsk))
 			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
+		tsk = next_thread(tsk);
+	} while (tsk != start);
 
 	return 0;
 }
@@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
  * @uptime: current uptime in seconds
+ * @constraint: type of oom constraint
+ * @nodemask: nodemask passed to page allocator
  *
  * The formula used is relatively simple and documented inline in the
  * function. The main rationale is that we want to select a good task
@@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
  *    of least surprise ... (be careful when you change it)
  */
 
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, unsigned long uptime,
+			enum oom_constraint constraint, nodemask_t *nodemask)
 {
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
@@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * because p may have allocated or otherwise mapped memory on
 	 * this node before. However it will be less likely.
 	 */
-	if (!has_intersects_mems_allowed(p))
+	if (!has_intersects_mems_allowed(p,
+			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+								 NULL))
 		points /= 8;
 
 	/*
@@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+			struct mem_cgroup *mem, enum oom_constraint constraint,
+			nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (p->signal->oom_adj == OOM_DISABLE)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = badness(p, uptime.tv_sec, constraint, nodemask);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -568,7 +588,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, nodemask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
 	case CONSTRAINT_NONE:
 		if (sysctl_panic_on_oom) {
 			dump_header(gfp_mask, order, NULL);
@@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		}
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+	case CONSTRAINT_MEMORY_POLICY:
+		__out_of_memory(gfp_mask, order, constraint, nodemask);
 		break;
 	}
 

^ permalink raw reply related	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
@ 2009-11-18  0:11                               ` David Rientjes
  0 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-18  0:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Fixing node-oriented allocation handling in oom-kill.c.
> I myself think of this as a bugfix, not an enhancement.
> 
> These days, things have changed:
>   - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
>   - mempolicy no longer maintains its own private zonelists.
>   (And cpuset doesn't pass a nodemask to __alloc_pages_nodemask().)
> 
> So the current oom-killer's constraint check is wrong.
> 
> This patch does the following:
>   - Check the nodemask: if a nodemask is passed and it doesn't cover all of
>     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
>   - Scan all zonelists under the nodemask; if the scan hits a cpuset's wall,
>     the failure came from the cpuset.
> And
>   - Modify the callers of out_of_memory() not to invoke the OOM killer for
>     __GFP_THISNODE. This doesn't change the "current" behavior: a caller
>     using __GFP_THISNODE should handle the page allocation failure itself.
> 
>   - Handle the __GFP_NOFAIL + __GFP_THISNODE path.
>     This is something like a FIXME, but this gfp mask combination is unused now.
> 

Now that we're passing the nodemask into the oom killer, we should be able 
to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not 
always the ideal task to kill, so it's better to scan the tasklist and 
determine the best task based on our heuristics, similar to how we 
penalize candidates that do not share the same cpuset.

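The penalty described above can be sketched in userspace C. Here `scale_badness()` is a hypothetical stand-in for the `points /= 8` adjustment inside badness(), with allowed-node sets modeled as bitmasks (names are illustrative, not the kernel's):

```c
#include <assert.h>

/*
 * Illustrative sketch, not kernel code: if the candidate's allowed
 * nodes (its cpuset mems, or its mempolicy nodes under
 * CONSTRAINT_MEMORY_POLICY) do not intersect the nodemask the failing
 * allocation used, killing it is unlikely to free useful memory, so
 * its badness score is cut to 1/8, mirroring the existing cpuset
 * penalty in badness().
 */
static unsigned long scale_badness(unsigned long points,
				   unsigned long victim_nodes,
				   unsigned long oom_nodemask)
{
	if (!(victim_nodes & oom_nodemask))
		points /= 8;	/* poor victim: no overlap with starved nodes */
	return points;
}
```

A candidate scoring 800 that only runs on node 0 (mask 0x1) while the OOM is on node 1 (mask 0x2) would drop to 100, falling behind candidates that actually occupy the constrained nodes.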
Something like the following (untested) patch.  Comments?
---
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
-extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+				nodemask_t *mask);
+extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 	return node_zonelist(0, gfp_flags);
 }
 
-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+							nodemask_t *mask)
+{
+	return false;
+}
+
+static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+	return false;
+}
 
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed3f392..3ab3021 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
 		 * global hstate attribute
 		 */
 		if (!(obey_mempolicy &&
-				init_nodemask_of_mempolicy(nodes_allowed))) {
+			init_nodemask_of_current_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
 			nodes_allowed = &node_states[N_HIGH_MEMORY];
 		}
@@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 		NODEMASK_ALLOC(nodemask_t, nodes_allowed,
 						GFP_KERNEL | __GFP_NORETRY);
 		if (!(obey_mempolicy &&
-			       init_nodemask_of_mempolicy(nodes_allowed))) {
+			init_nodemask_of_current_mempolicy(nodes_allowed))) {
 			NODEMASK_FREE(nodes_allowed);
 			nodes_allowed = &node_states[N_HIGH_MEMORY];
 		}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f11fdad..23c84bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 	}
 	return zl;
 }
+#endif
 
 /*
- * init_nodemask_of_mempolicy
- *
- * If the current task's mempolicy is "default" [NULL], return 'false'
- * to indicate default policy.  Otherwise, extract the policy nodemask
- * for 'bind' or 'interleave' policy into the argument nodemask, or
- * initialize the argument nodemask to contain the single node for
- * 'preferred' or 'local' policy and return 'true' to indicate presence
- * of non-default mempolicy.
- *
- * We don't bother with reference counting the mempolicy [mpol_get/put]
- * because the current task is examining it's own mempolicy and a task's
- * mempolicy is only ever changed by the task itself.
+ * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
+ * policy.  Otherwise, extract the policy nodemask for 'bind' or 'interleave'
+ * policy into the argument nodemask, or initialize the argument nodemask to
+ * contain the single node for 'preferred' or 'local' policy and return 'true'
+ * to indicate presence of non-default mempolicy.
  *
  * N.B., it is the caller's responsibility to free a returned nodemask.
  */
-bool init_nodemask_of_mempolicy(nodemask_t *mask)
+bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
 	int nid;
@@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 
 	return true;
 }
-#endif
+
+/*
+ * We don't bother with reference counting the mempolicy [mpol_get/put]
+ * because the current task is examining its own mempolicy and a task's
+ * mempolicy is only ever changed by the task itself.
+ */
+bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+	return init_nodemask_of_task_mempolicy(current, mask);
+}
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab04537..4c5c58b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
 #include <linux/notifier.h>
 #include <linux/memcontrol.h>
 #include <linux/security.h>
+#include <linux/mempolicy.h>
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
@@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
 /* #define DEBUG */
 
 /*
- * Is all threads of the target process nodes overlap ours?
+ * Do the nodes allowed by any of tsk's threads overlap ours?
  */
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static int has_intersects_mems_allowed(struct task_struct *tsk,
+						nodemask_t *nodemask)
 {
-	struct task_struct *t;
+	struct task_struct *start = tsk;
+	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
 
-	t = tsk;
+	if (!nodemask)
+		mpol_nodemask = NULL;
 	do {
-		if (cpuset_mems_allowed_intersects(current, t))
+		if (mpol_nodemask) {
+			mpol_get(tsk->mempolicy);
+			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
+				nodes_intersects(*nodemask, *mpol_nodemask)) {
+				mpol_put(tsk->mempolicy);
+				return 1;
+			}
+			mpol_put(tsk->mempolicy);
+		}
+		if (cpuset_mems_allowed_intersects(current, tsk))
 			return 1;
-		t = next_thread(t);
-	} while (t != tsk);
+		tsk = next_thread(tsk);
+	} while (tsk != start);
 
 	return 0;
 }
@@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
  * badness - calculate a numeric value for how bad this task has been
  * @p: task struct of which task we should calculate
  * @uptime: current uptime in seconds
+ * @constraint: type of oom constraint
+ * @nodemask: nodemask passed to page allocator
  *
  * The formula used is relatively simple and documented inline in the
  * function. The main rationale is that we want to select a good task
@@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
  *    of least surprise ... (be careful when you change it)
  */
 
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, unsigned long uptime,
+			enum oom_constraint constraint, nodemask_t *nodemask)
 {
 	unsigned long points, cpu_time, run_time;
 	struct mm_struct *mm;
@@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
 	 * because p may have allocated or otherwise mapped memory on
 	 * this node before. However it will be less likely.
 	 */
-	if (!has_intersects_mems_allowed(p))
+	if (!has_intersects_mems_allowed(p,
+			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+								 NULL))
 		points /= 8;
 
 	/*
@@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
  * (not docbooked, we don't want this one cluttering up the manual)
  */
 static struct task_struct *select_bad_process(unsigned long *ppoints,
-						struct mem_cgroup *mem)
+			struct mem_cgroup *mem, enum oom_constraint constraint,
+			nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	struct task_struct *chosen = NULL;
@@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (p->signal->oom_adj == OOM_DISABLE)
 			continue;
 
-		points = badness(p, uptime.tv_sec);
+		points = badness(p, uptime.tv_sec, constraint, nodemask);
 		if (points > *ppoints || !chosen) {
 			chosen = p;
 			*ppoints = points;
@@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
 
 	read_lock(&tasklist_lock);
 retry:
-	p = select_bad_process(&points, mem);
+	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
 	if (PTR_ERR(p) == -1UL)
 		goto out;
 
@@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 /*
  * Must be called with tasklist_lock held for read.
  */
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+			enum oom_constraint constraint, nodemask_t *nodemask)
 {
 	struct task_struct *p;
 	unsigned long points;
@@ -568,7 +588,7 @@ retry:
 	 * Rambo mode: Shoot down a process and hope it solves whatever
 	 * issues we may have.
 	 */
-	p = select_bad_process(&points, NULL);
+	p = select_bad_process(&points, NULL, constraint, nodemask);
 
 	if (PTR_ERR(p) == -1UL)
 		return;
@@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
 	read_lock(&tasklist_lock);
-	__out_of_memory(0, 0); /* unknown gfp_mask and order */
+	/* unknown gfp_mask and order */
+	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
 	read_unlock(&tasklist_lock);
 
 	/*
@@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	read_lock(&tasklist_lock);
 
 	switch (constraint) {
-	case CONSTRAINT_MEMORY_POLICY:
-		oom_kill_process(current, gfp_mask, order, 0, NULL,
-				"No available memory (MPOL_BIND)");
-		break;
-
 	case CONSTRAINT_NONE:
 		if (sysctl_panic_on_oom) {
 			dump_header(gfp_mask, order, NULL);
@@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		}
 		/* Fall-through */
 	case CONSTRAINT_CPUSET:
-		__out_of_memory(gfp_mask, order);
+	case CONSTRAINT_MEMORY_POLICY:
+		__out_of_memory(gfp_mask, order, constraint, nodemask);
 		break;
 	}
 



* Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2
  2009-11-18  0:11                               ` David Rientjes
@ 2009-11-18  0:58                                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-18  0:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

Hi,

On Tue, 17 Nov 2009 16:11:58 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Fixing node-oriented allocation handling in oom-kill.c.
> > I myself think of this as a bugfix, not an enhancement.
> > 
> > These days, things have changed:
> >   - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> >   - mempolicy no longer maintains its own private zonelists.
> >   (And cpuset doesn't pass a nodemask to __alloc_pages_nodemask().)
> > 
> > So the current oom-killer's constraint check is wrong.
> > 
> > This patch does the following:
> >   - Check the nodemask: if a nodemask is passed and it doesn't cover all of
> >     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> >   - Scan all zonelists under the nodemask; if the scan hits a cpuset's wall,
> >     the failure came from the cpuset.
> > And
> >   - Modify the callers of out_of_memory() not to invoke the OOM killer for
> >     __GFP_THISNODE. This doesn't change the "current" behavior: a caller
> >     using __GFP_THISNODE should handle the page allocation failure itself.
> > 
> >   - Handle the __GFP_NOFAIL + __GFP_THISNODE path.
> >     This is something like a FIXME, but this gfp mask combination is unused now.
> > 
> 
> Now that we're passing the nodemask into the oom killer, we should be able 
> to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not 
> always the ideal task to kill, so it's better to scan the tasklist and 
> determine the best task based on our heuristics, similar to how we 
> penalize candidates that do not share the same cpuset.
> 
> Something like the following (untested) patch.  Comments?

Hm, yes. I think this direction is good.
I have my own version, but yours looks nicer.
(I'm busy with other troubles these days, sorry.)


> ---
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> -extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> +				nodemask_t *mask);
> +extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> +							nodemask_t *mask)
> +{
> +	return false;
> +}
> +
> +static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> +	return false;
> +}
>  
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed3f392..3ab3021 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
>  		 * global hstate attribute
>  		 */
>  		if (!(obey_mempolicy &&
> -				init_nodemask_of_mempolicy(nodes_allowed))) {
> +			init_nodemask_of_current_mempolicy(nodes_allowed))) {
>  			NODEMASK_FREE(nodes_allowed);
>  			nodes_allowed = &node_states[N_HIGH_MEMORY];
>  		}
> @@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  		NODEMASK_ALLOC(nodemask_t, nodes_allowed,
>  						GFP_KERNEL | __GFP_NORETRY);
>  		if (!(obey_mempolicy &&
> -			       init_nodemask_of_mempolicy(nodes_allowed))) {
> +			init_nodemask_of_current_mempolicy(nodes_allowed))) {
>  			NODEMASK_FREE(nodes_allowed);
>  			nodes_allowed = &node_states[N_HIGH_MEMORY];
>  		}
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f11fdad..23c84bb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	return zl;
>  }
> +#endif
>  
>  /*
> - * init_nodemask_of_mempolicy
> - *
> - * If the current task's mempolicy is "default" [NULL], return 'false'
> - * to indicate default policy.  Otherwise, extract the policy nodemask
> - * for 'bind' or 'interleave' policy into the argument nodemask, or
> - * initialize the argument nodemask to contain the single node for
> - * 'preferred' or 'local' policy and return 'true' to indicate presence
> - * of non-default mempolicy.
> - *
> - * We don't bother with reference counting the mempolicy [mpol_get/put]
> - * because the current task is examining it's own mempolicy and a task's
> - * mempolicy is only ever changed by the task itself.
> + * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
> + * policy.  Otherwise, extract the policy nodemask for 'bind' or 'interleave'
> + * policy into the argument nodemask, or initialize the argument nodemask to
> + * contain the single node for 'preferred' or 'local' policy and return 'true'
> + * to indicate presence of non-default mempolicy.
>   *
>   * N.B., it is the caller's responsibility to free a returned nodemask.
>   */
> -bool init_nodemask_of_mempolicy(nodemask_t *mask)
> +bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
>  	int nid;
> @@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  
>  	return true;
>  }
> -#endif
> +
> +/*
> + * We don't bother with reference counting the mempolicy [mpol_get/put]
> + * because the current task is examining its own mempolicy and a task's
> + * mempolicy is only ever changed by the task itself.
> + */
> +bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> +	return init_nodemask_of_task_mempolicy(current, mask);
> +}
>  
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ab04537..4c5c58b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
>  #include <linux/security.h>
> +#include <linux/mempolicy.h>
>  
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
> @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
>  /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do the nodes allowed by any of tsk's threads overlap ours?
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static int has_intersects_mems_allowed(struct task_struct *tsk,
> +						nodemask_t *nodemask)
>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
> +	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
>  
> -	t = tsk;
> +	if (!nodemask)
> +		mpol_nodemask = NULL;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> +		if (mpol_nodemask) {
> +			mpol_get(tsk->mempolicy);
> +			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> +				nodes_intersects(*nodemask, *mpol_nodemask)) {
> +				mpol_put(tsk->mempolicy);
> +				return 1;
> +			}
> +			mpol_put(tsk->mempolicy);
> +		}

Hmm, are these mpol_get()/mpol_put() calls necessary while tasklist_lock is held?
And... I wonder about:

	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
		return 1; /* this task uses default policy */


> +		if (cpuset_mems_allowed_intersects(current, tsk))
>  			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
>  

Sigh... we have to scan all threads again.
Do you have any idea how to improve this?

For example,
	mm->mask_of_nodes_which_a_page_was_allocated_on
or
	mm->mask_of_nodes_made_by_some_magical_technique
or some such?
(Maybe per-node RSS is overkill.)

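One hypothetical shape for the per-mm node mask floated above, as a userspace sketch (all names are invented here; this is not an existing kernel interface): the mm would record at allocation time which nodes it has pages on, so the OOM killer could test intersection in O(1) instead of walking every thread:

```c
#include <assert.h>

/* Userspace model of an mm that remembers which nodes it allocated on. */
struct mm_model {
	unsigned long nodes_used;	/* bit n set: some page came from node n */
};

/* Called from the (modeled) page allocation path. */
static void mm_note_alloc(struct mm_model *mm, int nid)
{
	mm->nodes_used |= 1UL << nid;
}

/* O(1) check the OOM killer could use instead of a per-thread scan. */
static int mm_intersects(const struct mm_model *mm, unsigned long oom_nodemask)
{
	return (mm->nodes_used & oom_nodemask) != 0;
}
```

The cost is one extra bit-set per allocation path and a stale-bit problem when pages are freed, which is presumably why the thread leaves this as an open question.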

>  	return 0;
>  }
> @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
>   * @uptime: current uptime in seconds
> + * @constraint: type of oom constraint
> + * @nodemask: nodemask passed to page allocator
>   *
>   * The formula used is relatively simple and documented inline in the
>   * function. The main rationale is that we want to select a good task
> @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>   *    of least surprise ... (be careful when you change it)
>   */
>  
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +unsigned long badness(struct task_struct *p, unsigned long uptime,
> +			enum oom_constraint constraint, nodemask_t *nodemask)
>  {
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
> @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 * because p may have allocated or otherwise mapped memory on
>  	 * this node before. However it will be less likely.
>  	 */
> -	if (!has_intersects_mems_allowed(p))
> +	if (!has_intersects_mems_allowed(p,
> +			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> +								 NULL))
>  		points /= 8;
>  
>  	/*
> @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -						struct mem_cgroup *mem)
> +			struct mem_cgroup *mem, enum oom_constraint constraint,
> +			nodemask_t *nodemask)
>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		if (p->signal->oom_adj == OOM_DISABLE)
>  			continue;
>  
> -		points = badness(p, uptime.tv_sec);
> +		points = badness(p, uptime.tv_sec, constraint, nodemask);
>  		if (points > *ppoints || !chosen) {
>  			chosen = p;
>  			*ppoints = points;
> @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
>  	if (PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, nodemask_t *nodemask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -568,7 +588,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, nodemask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>  
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	read_lock(&tasklist_lock);
>  
>  	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
>  	case CONSTRAINT_NONE:
>  		if (sysctl_panic_on_oom) {
>  			dump_header(gfp_mask, order, NULL);
> @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		}
>  		/* Fall-through */
>  	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> +	case CONSTRAINT_MEMORY_POLICY:
> +		__out_of_memory(gfp_mask, order, constraint, nodemask);
>  		break;
>  	}
Maybe good. But hmm... does this work well with per-VMA mempolicies?

I wonder whether something like
  mm->mask_of_nodes_made_by_some_magical_technique
will be necessary for completeness.

Thanks,
-Kame



* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
@ 2009-11-18  0:58                                 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-18  0:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

Hi,

On Tue, 17 Nov 2009 16:11:58 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > Fixing node-oriented allocation handling in oom-kill.c
> > I myself think this as bugfix not as ehnancement.
> > 
> > In these days, things are changed as
> >   - alloc_pages() eats nodemask as its arguments, __alloc_pages_nodemask().
> >   - mempolicy don't maintain its own private zonelists.
> >   (And cpuset doesn't use nodemask for __alloc_pages_nodemask())
> > 
> > So, current oom-killer's check function is wrong.
> > 
> > This patch does
> >   - check nodemask, if nodemask && nodemask doesn't cover all
> >     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> >   - Scan all zonelist under nodemask, if it hits cpuset's wall
> >     this faiulre is from cpuset.
> > And
> >   - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
> >     This doesn't change "current" behavior. If callers use __GFP_THISNODE
> >     it should handle "page allocation failure" by itself.
> > 
> >   - handle __GFP_NOFAIL+__GFP_THISNODE path.
> >     This is something like a FIXME but this gfpmask is not used now.
> > 
> 
> Now that we're passing the nodemask into the oom killer, we should be able 
> to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not 
> always the ideal task to kill, so it's better to scan the tasklist and 
> determine the best task depending on our heuristics, similiar to how we 
> penalize candidates if they do not share the same cpuset.
> 
> Something like the following (untested) patch.  Comments?

Hm, yes. I think this direction is good.
I have my own but your version looks nicer.
(I'm busy with troubles in these days, sorry.)


> ---
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> -extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> +				nodemask_t *mask);
> +extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> +							nodemask_t *mask)
> +{
> +	return false;
> +}
> +
> +static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> +	return false;
> +}
>  
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed3f392..3ab3021 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
>  		 * global hstate attribute
>  		 */
>  		if (!(obey_mempolicy &&
> -				init_nodemask_of_mempolicy(nodes_allowed))) {
> +			init_nodemask_of_current_mempolicy(nodes_allowed))) {
>  			NODEMASK_FREE(nodes_allowed);
>  			nodes_allowed = &node_states[N_HIGH_MEMORY];
>  		}
> @@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  		NODEMASK_ALLOC(nodemask_t, nodes_allowed,
>  						GFP_KERNEL | __GFP_NORETRY);
>  		if (!(obey_mempolicy &&
> -			       init_nodemask_of_mempolicy(nodes_allowed))) {
> +			init_nodemask_of_current_mempolicy(nodes_allowed))) {
>  			NODEMASK_FREE(nodes_allowed);
>  			nodes_allowed = &node_states[N_HIGH_MEMORY];
>  		}
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f11fdad..23c84bb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	return zl;
>  }
> +#endif
>  
>  /*
> - * init_nodemask_of_mempolicy
> - *
> - * If the current task's mempolicy is "default" [NULL], return 'false'
> - * to indicate default policy.  Otherwise, extract the policy nodemask
> - * for 'bind' or 'interleave' policy into the argument nodemask, or
> - * initialize the argument nodemask to contain the single node for
> - * 'preferred' or 'local' policy and return 'true' to indicate presence
> - * of non-default mempolicy.
> - *
> - * We don't bother with reference counting the mempolicy [mpol_get/put]
> - * because the current task is examining it's own mempolicy and a task's
> - * mempolicy is only ever changed by the task itself.
> + * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
> + * policy.  Otherwise, extract the policy nodemask for 'bind' or 'interleave'
> + * policy into the argument nodemask, or initialize the argument nodemask to
> + * contain the single node for 'preferred' or 'local' policy and return 'true'
> + * to indicate presence of non-default mempolicy.
>   *
>   * N.B., it is the caller's responsibility to free a returned nodemask.
>   */
> -bool init_nodemask_of_mempolicy(nodemask_t *mask)
> +bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
>  	int nid;
> @@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  
>  	return true;
>  }
> -#endif
> +
> +/*
> + * We don't bother with reference counting the mempolicy [mpol_get/put]
> + * because the current task is examining it's own mempolicy and a task's
> + * mempolicy is only ever changed by the task itself.
> + */
> +bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> +	return init_nodemask_of_task_mempolicy(current, mask);
> +}
>  
>  /* Allocate a page in interleaved policy.
>     Own path because it needs to do special accounting. */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ab04537..4c5c58b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
>  #include <linux/notifier.h>
>  #include <linux/memcontrol.h>
>  #include <linux/security.h>
> +#include <linux/mempolicy.h>
>  
>  int sysctl_panic_on_oom;
>  int sysctl_oom_kill_allocating_task;
> @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
>  /* #define DEBUG */
>  
>  /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do the nodes allowed by any of tsk's threads overlap ours?
>   */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static int has_intersects_mems_allowed(struct task_struct *tsk,
> +						nodemask_t *nodemask)
>  {
> -	struct task_struct *t;
> +	struct task_struct *start = tsk;
> +	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
>  
> -	t = tsk;
> +	if (!nodemask)
> +		mpol_nodemask = NULL;
>  	do {
> -		if (cpuset_mems_allowed_intersects(current, t))
> +		if (mpol_nodemask) {
> +			mpol_get(tsk->mempolicy);
> +			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> +				nodes_intersects(*nodemask, *mpol_nodemask)) {
> +				mpol_put(tsk->mempolicy);
> +				return 1;
> +			}
> +			mpol_put(tsk->mempolicy);
> +		}

Hmm this mpol_get()/mpol_put() are necessary under tasklist_lock held ?
And...I wonder

	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
		return 1; /* this task uses default policy */


> +		if (cpuset_mems_allowed_intersects(current, tsk))
>  			return 1;
> -		t = next_thread(t);
> -	} while (t != tsk);
> +		tsk = next_thread(tsk);
> +	} while (tsk != start);
>  

Sigh... we have to scan all threads again.
Do you have an idea for improving this?

For example:
	mm->mask_of_nodes_which_a_page_was_allocated_on
or
        mm->mask_of_nodes_made_by_some_magical_technique
or some such?
(Maybe per-node RSS is overkill.)
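For what it's worth, the per-thread check above ultimately reduces to a bitmap intersection; a minimal userspace model of what nodes_intersects() computes (names here are hypothetical — a real nodemask_t is a MAX_NUMNODES-sized bitmap, not a single word):

```c
#include <assert.h>

/* Userspace model of the kernel's nodes_intersects(): a real nodemask_t
 * is a bitmap of MAX_NUMNODES bits; one unsigned long is enough here. */
typedef unsigned long nodemask_model_t;

static int nodes_intersects_model(nodemask_model_t a, nodemask_model_t b)
{
	return (a & b) != 0;	/* is any node allowed by both masks? */
}
```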


>  	return 0;
>  }
> @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>   * badness - calculate a numeric value for how bad this task has been
>   * @p: task struct of which task we should calculate
>   * @uptime: current uptime in seconds
> + * @constraint: type of oom constraint
> + * @nodemask: nodemask passed to page allocator
>   *
>   * The formula used is relatively simple and documented inline in the
>   * function. The main rationale is that we want to select a good task
> @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
>   *    of least surprise ... (be careful when you change it)
>   */
>  
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +unsigned long badness(struct task_struct *p, unsigned long uptime,
> +			enum oom_constraint constraint, nodemask_t *nodemask)
>  {
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
> @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	 * because p may have allocated or otherwise mapped memory on
>  	 * this node before. However it will be less likely.
>  	 */
> -	if (!has_intersects_mems_allowed(p))
> +	if (!has_intersects_mems_allowed(p,
> +			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> +								 NULL))
>  		points /= 8;
>  
>  	/*
> @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
>   * (not docbooked, we don't want this one cluttering up the manual)
>   */
>  static struct task_struct *select_bad_process(unsigned long *ppoints,
> -						struct mem_cgroup *mem)
> +			struct mem_cgroup *mem, enum oom_constraint constraint,
> +			nodemask_t *nodemask)
>  {
>  	struct task_struct *p;
>  	struct task_struct *chosen = NULL;
> @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		if (p->signal->oom_adj == OOM_DISABLE)
>  			continue;
>  
> -		points = badness(p, uptime.tv_sec);
> +		points = badness(p, uptime.tv_sec, constraint, nodemask);
>  		if (points > *ppoints || !chosen) {
>  			chosen = p;
>  			*ppoints = points;
> @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, NULL);
>  	if (PTR_ERR(p) == -1UL)
>  		goto out;
>  
> @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  /*
>   * Must be called with tasklist_lock held for read.
>   */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> +			enum oom_constraint constraint, nodemask_t *nodemask)
>  {
>  	struct task_struct *p;
>  	unsigned long points;
> @@ -568,7 +588,7 @@ retry:
>  	 * Rambo mode: Shoot down a process and hope it solves whatever
>  	 * issues we may have.
>  	 */
> -	p = select_bad_process(&points, NULL);
> +	p = select_bad_process(&points, NULL, constraint, nodemask);
>  
>  	if (PTR_ERR(p) == -1UL)
>  		return;
> @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
>  		panic("out of memory from page fault. panic_on_oom is selected.\n");
>  
>  	read_lock(&tasklist_lock);
> -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> +	/* unknown gfp_mask and order */
> +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
>  	read_unlock(&tasklist_lock);
>  
>  	/*
> @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  	read_lock(&tasklist_lock);
>  
>  	switch (constraint) {
> -	case CONSTRAINT_MEMORY_POLICY:
> -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> -				"No available memory (MPOL_BIND)");
> -		break;
> -
>  	case CONSTRAINT_NONE:
>  		if (sysctl_panic_on_oom) {
>  			dump_header(gfp_mask, order, NULL);
> @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		}
>  		/* Fall-through */
>  	case CONSTRAINT_CPUSET:
> -		__out_of_memory(gfp_mask, order);
> +	case CONSTRAINT_MEMORY_POLICY:
> +		__out_of_memory(gfp_mask, order, constraint, nodemask);
>  		break;
>  	}
Maybe good. But hmm... does this work well with per-VMA mempolicies?

I wonder whether
  mm->mask_of_nodes_made_by_some_magical_technique
will be necessary for completeness.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-11-18  0:11                               ` David Rientjes
@ 2009-11-18  1:41                                 ` Daisuke Nishimura
  -1 siblings, 0 replies; 92+ messages in thread
From: Daisuke Nishimura @ 2009-11-18  1:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter, Daisuke Nishimura

Hi.

On Tue, 17 Nov 2009 16:11:58 -0800 (PST), David Rientjes <rientjes@google.com> wrote:
> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
> 
> > This fixes node-oriented allocation handling in oom-kill.c.
> > I myself think of this as a bugfix, not an enhancement.
> > 
> > These days, things have changed:
> >   - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> >   - mempolicy doesn't maintain its own private zonelists.
> >     (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
> > 
> > So the current oom-killer's check function is wrong.
> > 
> > This patch:
> >   - checks the nodemask: if nodemask && nodemask doesn't cover all of
> >     node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> >   - scans all zonelists under the nodemask; if it hits the cpuset's wall,
> >     this failure is from the cpuset.
> > And
> >   - modifies the callers of out_of_memory() not to call oom if __GFP_THISNODE.
> >     This doesn't change "current" behavior. If callers use __GFP_THISNODE,
> >     they should handle the page allocation failure by themselves.
> > 
> >   - handles the __GFP_NOFAIL+__GFP_THISNODE path.
> >     This is something like a FIXME, but this gfpmask is not used now.
> > 
> 
> Now that we're passing the nodemask into the oom killer, we should be able 
> to do more intelligent CONSTRAINT_MEMORY_POLICY selection.  current is not 
> always the ideal task to kill, so it's better to scan the tasklist and 
> determine the best task depending on our heuristics, similar to how we 
> penalize candidates if they do not share the same cpuset.
> 
> Something like the following (untested) patch.  Comments?
I agree with this direction.

Taking into account the per-node usage covered by the nodemask might be useful,
but we don't have a per-node RSS counter per task now and it would add some overhead,
so I think this would be enough (at least for now).
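As a sanity check of the rule in the quoted changelog, the constraint classification can be modeled in userspace C. This is a simplified sketch: the kernel's constrained_alloc() actually walks the zonelist for the cpuset case, and all names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum oom_constraint_model {
	MODEL_CONSTRAINT_NONE,
	MODEL_CONSTRAINT_CPUSET,
	MODEL_CONSTRAINT_MEMORY_POLICY,
};

/* all_nodes stands in for node_states[N_HIGH_MEMORY]. If a nodemask was
 * passed and it doesn't cover every node with memory, the failure came
 * from a mempolicy; otherwise a blocked zonelist walk implicates cpuset. */
static enum oom_constraint_model classify(const unsigned long *nodemask,
					  unsigned long all_nodes,
					  int hit_cpuset_wall)
{
	if (nodemask && (*nodemask & all_nodes) != all_nodes)
		return MODEL_CONSTRAINT_MEMORY_POLICY;
	if (hit_cpuset_wall)
		return MODEL_CONSTRAINT_CPUSET;
	return MODEL_CONSTRAINT_NONE;
}
```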

Just a minor nitpick:

> @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>  
>  	read_lock(&tasklist_lock);
>  retry:
> -	p = select_bad_process(&points, mem);
> +	p = select_bad_process(&points, mem, NULL);
>  	if (PTR_ERR(p) == -1UL)
>  		goto out;
>  
We need to pass "CONSTRAINT_NONE" here too.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-11-18  0:58                                 ` KAMEZAWA Hiroyuki
@ 2009-11-18  2:13                                   ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-11-18  2:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, KOSAKI Motohiro, linux-mm, linux-kernel,
	Andrew Morton, Christoph Lameter

On Wed, 18 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ab04537..4c5c58b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/notifier.h>
> >  #include <linux/memcontrol.h>
> >  #include <linux/security.h>
> > +#include <linux/mempolicy.h>
> >  
> >  int sysctl_panic_on_oom;
> >  int sysctl_oom_kill_allocating_task;
> > @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> >  /* #define DEBUG */
> >  
> >  /*
> > - * Is all threads of the target process nodes overlap ours?
> > + * Do the nodes allowed by any of tsk's threads overlap ours?
> >   */
> > -static int has_intersects_mems_allowed(struct task_struct *tsk)
> > +static int has_intersects_mems_allowed(struct task_struct *tsk,
> > +						nodemask_t *nodemask)
> >  {
> > -	struct task_struct *t;
> > +	struct task_struct *start = tsk;
> > +	NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
> >  
> > -	t = tsk;
> > +	if (!nodemask)
> > +		mpol_nodemask = NULL;
> >  	do {
> > -		if (cpuset_mems_allowed_intersects(current, t))
> > +		if (mpol_nodemask) {
> > +			mpol_get(tsk->mempolicy);
> > +			if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> > +				nodes_intersects(*nodemask, *mpol_nodemask)) {
> > +				mpol_put(tsk->mempolicy);
> > +				return 1;
> > +			}
> > +			mpol_put(tsk->mempolicy);
> > +		}
> 
> Hmm, are these mpol_get()/mpol_put() calls necessary while tasklist_lock is held?

They are; we don't hold tasklist_lock while dropping the reference count 
in do_exit().

> And... I wonder:
> 
> 	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
> 		return 1; /* this task uses default policy */
> 
> 
> > +		if (cpuset_mems_allowed_intersects(current, tsk))
> >  			return 1;
> > -		t = next_thread(t);
> > -	} while (t != tsk);
> > +		tsk = next_thread(tsk);
> > +	} while (tsk != start);
> >  
> 
> Sigh... we have to scan all threads again.
> Do you have an idea for improving this?
> 
> For example:
> 	mm->mask_of_nodes_which_a_page_was_allocated_on
> or
>         mm->mask_of_nodes_made_by_some_magical_technique
> or some such?
> (Maybe per-node RSS is overkill.)
> 

The same criticism could be made of CONSTRAINT_CPUSET.  In neither case, 
mempolicy or cpusets, do we actually know whether memory was ever 
allocated on a particular node in tsk->mempolicy->v.nodes or 
tsk->mems_allowed, respectively.  We assume, however, that if a node is 
included in a mempolicy nodemask or cpuset mems, it is an allowed node to 
allocate from for all attached tasks (and those tasks aren't solely 
allocating on a subset), so killing tasks based on their potential for 
allocating on OOM nodes is actually helpful.

> 
> >  	return 0;
> >  }
> > @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> >   * badness - calculate a numeric value for how bad this task has been
> >   * @p: task struct of which task we should calculate
> >   * @uptime: current uptime in seconds
> > + * @constraint: type of oom constraint
> > + * @nodemask: nodemask passed to page allocator
> >   *
> >   * The formula used is relatively simple and documented inline in the
> >   * function. The main rationale is that we want to select a good task
> > @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> >   *    of least surprise ... (be careful when you change it)
> >   */
> >  
> > -unsigned long badness(struct task_struct *p, unsigned long uptime)
> > +unsigned long badness(struct task_struct *p, unsigned long uptime,
> > +			enum oom_constraint constraint, nodemask_t *nodemask)
> >  {
> >  	unsigned long points, cpu_time, run_time;
> >  	struct mm_struct *mm;
> > @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> >  	 * because p may have allocated or otherwise mapped memory on
> >  	 * this node before. However it will be less likely.
> >  	 */
> > -	if (!has_intersects_mems_allowed(p))
> > +	if (!has_intersects_mems_allowed(p,
> > +			constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> > +								 NULL))
> >  		points /= 8;
> >  
> >  	/*
> > @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> >   * (not docbooked, we don't want this one cluttering up the manual)
> >   */
> >  static struct task_struct *select_bad_process(unsigned long *ppoints,
> > -						struct mem_cgroup *mem)
> > +			struct mem_cgroup *mem, enum oom_constraint constraint,
> > +			nodemask_t *nodemask)
> >  {
> >  	struct task_struct *p;
> >  	struct task_struct *chosen = NULL;
> > @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> >  		if (p->signal->oom_adj == OOM_DISABLE)
> >  			continue;
> >  
> > -		points = badness(p, uptime.tv_sec);
> > +		points = badness(p, uptime.tv_sec, constraint, nodemask);
> >  		if (points > *ppoints || !chosen) {
> >  			chosen = p;
> >  			*ppoints = points;
> > @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >  
> >  	read_lock(&tasklist_lock);
> >  retry:
> > -	p = select_bad_process(&points, mem);
> > +	p = select_bad_process(&points, mem, NULL);
> >  	if (PTR_ERR(p) == -1UL)
> >  		goto out;
> >  
> > @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> >  /*
> >   * Must be called with tasklist_lock held for read.
> >   */
> > -static void __out_of_memory(gfp_t gfp_mask, int order)
> > +static void __out_of_memory(gfp_t gfp_mask, int order,
> > +			enum oom_constraint constraint, nodemask_t *nodemask)
> >  {
> >  	struct task_struct *p;
> >  	unsigned long points;
> > @@ -568,7 +588,7 @@ retry:
> >  	 * Rambo mode: Shoot down a process and hope it solves whatever
> >  	 * issues we may have.
> >  	 */
> > -	p = select_bad_process(&points, NULL);
> > +	p = select_bad_process(&points, NULL, constraint, nodemask);
> >  
> >  	if (PTR_ERR(p) == -1UL)
> >  		return;
> > @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
> >  		panic("out of memory from page fault. panic_on_oom is selected.\n");
> >  
> >  	read_lock(&tasklist_lock);
> > -	__out_of_memory(0, 0); /* unknown gfp_mask and order */
> > +	/* unknown gfp_mask and order */
> > +	__out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> >  	read_unlock(&tasklist_lock);
> >  
> >  	/*
> > @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  	read_lock(&tasklist_lock);
> >  
> >  	switch (constraint) {
> > -	case CONSTRAINT_MEMORY_POLICY:
> > -		oom_kill_process(current, gfp_mask, order, 0, NULL,
> > -				"No available memory (MPOL_BIND)");
> > -		break;
> > -
> >  	case CONSTRAINT_NONE:
> >  		if (sysctl_panic_on_oom) {
> >  			dump_header(gfp_mask, order, NULL);
> > @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		}
> >  		/* Fall-through */
> >  	case CONSTRAINT_CPUSET:
> > -		__out_of_memory(gfp_mask, order);
> > +	case CONSTRAINT_MEMORY_POLICY:
> > +		__out_of_memory(gfp_mask, order, constraint, nodemask);
> >  		break;
> >  	}
> Maybe good. But hmm... does this work well with per-VMA mempolicies?
> 
> I wonder whether
>   mm->mask_of_nodes_made_by_some_magical_technique
> will be necessary for completeness.
> 

I think that would probably be rejected because of its implications for 
the allocation fastpath.  The change here isn't making the oom killing 
any less ideal; current may never have allocated any memory on its 
mempolicy nodes prior to the oom, so killing it may be entirely useless.  
It's better in that case to use our heuristics to determine the ideal 
task to kill, restricting our subset of eligible tasks by the same 
criteria that we use for cpusets.
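The heuristic in question is the points /= 8 penalty in badness(); a toy model (simplified from the quoted patch, not the kernel's full scoring) of how non-intersecting candidates are deprioritized:

```c
#include <assert.h>

/* Toy model of the badness() penalty in the quoted patch: a candidate
 * whose allowed nodes don't intersect the OOM nodemask keeps only 1/8
 * of its score, making it much less likely to be selected for killing. */
static unsigned long penalized_badness(unsigned long points, int intersects)
{
	if (!intersects)
		points /= 8;
	return points;
}
```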

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-11-18  2:13                                   ` David Rientjes
@ 2009-12-15  1:16                                     ` Andrew Morton
  -1 siblings, 0 replies; 92+ messages in thread
From: Andrew Morton @ 2009-12-15  1:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter


So I have a note-to-self here that these patches:

oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch

are tentative and it was unclear whether I should merge them.

What do we think?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  1:16                                     ` Andrew Morton
@ 2009-12-15  1:32                                       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-15  1:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Mon, 14 Dec 2009 17:16:32 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> So I have a note-to-self here that these patches:
> 
> oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
> oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
> oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch
> 
> are tentative and it was unclear whether I should merge them.
> 
> What do we think?
> 

In my view,
  oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
  - should be merged. We tend to get several OOM reports a month, and
    more precise information is always welcome.

  oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch
  - should be merged. This is a bug fix.

  oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
  - should not be merged.
    I'm now preparing more counters for mm statistics. It's better to
    wait and see what more we can do, and other patches for overall
    oom-killer improvement are under development.

    Also, there is a compatibility problem.
    As David says, this may break some crazy software which uses
    fake_numa+cpuset+oom_killer+oom_adj for resource control
    (even though I'd recommend memcg rather than such tricks...).
    
    Two ideas I can think of now:
    1) add sysctl_oom_calc_on_committed_memory
       If this is set, use vm-size instead of RSS.

    2) add /proc/<pid>/oom_guard_size
       This allows users to specify the "valid/expected size" of a task.
       After
       #echo 10M > /proc/<pid>/oom_guard_size
       10MB is subtracted from the RSS size at OOM calculation.
       (The best way would be to estimate this automatically from vm_size... but...)
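Idea (2) would amount to subtracting the guard from the task's RSS before scoring; a rough userspace sketch (all names hypothetical, units in pages):

```c
#include <assert.h>

/* Hypothetical sketch of idea (2): an admin-supplied guard size is the
 * portion of RSS considered legitimate and excluded from OOM scoring. */
static unsigned long guarded_rss(unsigned long rss_pages,
				 unsigned long guard_pages)
{
	/* Clamp at zero so a generous guard can't underflow the score. */
	return rss_pages > guard_pages ? rss_pages - guard_pages : 0;
}
```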



Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
@ 2009-12-15  1:32                                       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-15  1:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Mon, 14 Dec 2009 17:16:32 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> 
> So I have a note-to-self here that these patches:
> 
> oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
> oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
> oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch
> 
> are tentative and it was unclear whether I should merge them.
> 
> What do we think?
> 

In my view,
  oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
  - should be merged. Because we tend to get several OOM reports a month,
    more precise information is always welcome.

  oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch
  - should be merged. This is a bug fix.

  oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
  - should not be merged.
    I'm now preparing more counters for mm's statistics. It's better to
    wait and see what more we can do. Other patches for overall
    oom-killer improvement are under development.

    And there is a compatibility problem.
    As David says, this may break some crazy software which uses
    fake_numa+cpuset+oom_killer+oom_adj for resource control
    (even though I recommend they use memcg rather than crazy tricks...)
    
    Two ideas I can think of now are:
    1) add sysctl_oom_calc_on_committed_memory
       If this is set, use vm-size instead of rss.

    2) add /proc/<pid>/oom_guard_size
       This allows users to specify the "valid/expected size" of a task.
       For example, after
       #echo 10M > /proc/<pid>/oom_guard_size
       10 Mbytes are subtracted from the rss size at OOM calculation.
       (The best way would be to estimate this automatically from vm_size.. but...)
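A minimal sketch of idea 2), purely for illustration — the function name and the idea of a per-task "guard" value are hypothetical, not existing kernel code:

```c
/* Hypothetical sketch of the oom_guard_size idea: a user-declared
 * "expected" size is subtracted from rss before scoring, so only
 * memory beyond the declared guard counts against the task. */
unsigned long oom_badness_sketch(unsigned long rss_pages,
                                 unsigned long guard_pages)
{
        /* A task within its declared footprint scores zero. */
        if (rss_pages <= guard_pages)
                return 0;
        return rss_pages - guard_pages;
}
```

With a 10M guard (2560 pages at 4K), a task with 20M resident would be scored only on the 10M excess.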



Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  1:32                                       ` KAMEZAWA Hiroyuki
@ 2009-12-15  1:38                                         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-12-15  1:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Andrew Morton, David Rientjes,
	Daisuke Nishimura, linux-mm, linux-kernel, Christoph Lameter

> On Mon, 14 Dec 2009 17:16:32 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > 
> > So I have a note-to-self here that these patches:
> > 
> > oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
> > oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
> > oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch
> > 
> > are tentative and it was unclear whether I should merge them.
> > 
> > What do we think?
> 
> In my view,
>   oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
>   - should be merged.
> 
>   oom-kill-fix-numa-consraint-check-with-nodemask-v42.patch
>   - should be merged.
> 
>   oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
>   - should not be merged.

I agree with all of the above.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  1:32                                       ` KAMEZAWA Hiroyuki
@ 2009-12-15  4:30                                         ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-15  4:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

>     I'm now preparing more counters for mm's statistics. It's better to
>     wait  and to see what we can do more. And other patches for total
>     oom-killer improvement is under development.
> 
>     And, there is a compatibility problem.
>     As David says, this may break some crazy software which uses
>     fake_numa+cpuset+oom_killer+oom_adj for resource controlling.
>    (even if I recommend them to use memcg rather than crazy tricks...)
>     

That's not at all what I said.  I said using total_vm as a baseline allows 
users to define when a process is to be considered "rogue," that is, using 
more memory than expected.  Using rss would be inappropriate since it is 
highly dynamic and depends on the state of the VM at the time of oom, 
which userspace cannot possibly keep updated.

You consistently ignore that point: the power of /proc/pid/oom_adj to 
influence when a process, such as a memory leaker, is to be considered as 
a high priority for an oom kill.  It has absolutely nothing to do with 
fake NUMA, cpusets, or memcg.

>     2 ideas which I can think of now are..
>     1) add sysctl_oom_calc_on_committed_memory 
>        If this is set, use vm-size instead of rss.
> 

I would agree only if the oom killer used total_vm as the default; it is
long-standing and allows for the aforementioned capability that you lose
with rss.  I have no problem with the added sysctl to use rss as the
baseline when enabled.

>     2) add /proc/<pid>/oom_guard_size
>        This allows users to specify "valid/expected size" of a task.
>        When
>        #echo 10M > /proc/<pid>/oom_guard_size
>        At OOM calculation, 10Mbytes is subtracted from rss size.
>        (The best way is to estimate this automatically from vm_size..but...)

Expected rss is almost impossible to tune for cpusets that have a highly 
dynamic set of mems, let alone without containment.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:30                                         ` David Rientjes
@ 2009-12-15  4:35                                           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-15  4:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Mon, 14 Dec 2009 20:30:37 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> >     2 ideas which I can think of now are..
> >     1) add sysctl_oom_calc_on_committed_memory 
> >        If this is set, use vm-size instead of rss.
> > 
> 
> I would agree only if the oom killer used total_vm as a the default, it is 
> long-standing and allows for the aforementioned capability that you lose 
> with rss.  I have no problem with the added sysctl to use rss as the 
> baseline when enabled.
> 
I'll prepare a patch which adds

  sysctl_oom_kill_based_on_rss (default=0)

OK?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:30                                         ` David Rientjes
@ 2009-12-15  4:43                                           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-15  4:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Mon, 14 Dec 2009 20:30:37 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
> 
> >     I'm now preparing more counters for mm's statistics. It's better to
> >     wait  and to see what we can do more. And other patches for total
> >     oom-killer improvement is under development.
> > 
> >     And, there is a compatibility problem.
> >     As David says, this may break some crazy software which uses
> >     fake_numa+cpuset+oom_killer+oom_adj for resource controlling.
> >    (even if I recommend them to use memcg rather than crazy tricks...)
> >     
> 
> That's not at all what I said.  I said using total_vm as a baseline allows 
> users to define when a process is to be considered "rogue," that is, using 
> more memory than expected.  Using rss would be inappropriate since it is 
> highly dynamic and depends on the state of the VM at the time of oom, 
> which userspace cannot possibly keep updated.
> 
> You consistently ignore that point: the power of /proc/pid/oom_adj to 
> influence when a process, such as a memory leaker, is to be considered as 
> a high priority for an oom kill.  It has absolutely nothing to do with 
> fake NUMA, cpusets, or memcg.
> 
You also ignore that it's not sane to use oom kill for resource control ;)

Anyway, the rss patch was dropped as you wanted.
I'll prepare the other ones.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:30                                         ` David Rientjes
@ 2009-12-15  4:47                                           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-12-15  4:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton,
	Daisuke Nishimura, linux-mm, linux-kernel, Christoph Lameter

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
> 
> >     I'm now preparing more counters for mm's statistics. It's better to
> >     wait  and to see what we can do more. And other patches for total
> >     oom-killer improvement is under development.
> > 
> >     And, there is a compatibility problem.
> >     As David says, this may break some crazy software which uses
> >     fake_numa+cpuset+oom_killer+oom_adj for resource controlling.
> >    (even if I recommend them to use memcg rather than crazy tricks...)
> >     
> 
> That's not at all what I said.  I said using total_vm as a baseline allows 
> users to define when a process is to be considered "rogue," that is, using 
> more memory than expected.  Using rss would be inappropriate since it is 
> highly dynamic and depends on the state of the VM at the time of oom, 
> which userspace cannot possibly keep updated.
> 
> You consistently ignore that point: the power of /proc/pid/oom_adj to 
> influence when a process, such as a memory leaker, is to be considered as 
> a high priority for an oom kill.  It has absolutely nothing to do with 
> fake NUMA, cpusets, or memcg.

Comparing vsz is only meaningful between instances of the same program,
but the oom killer doesn't do that. Comparing vsz across different
programs does NOT detect memory leaks.


> 
> >     2 ideas which I can think of now are..
> >     1) add sysctl_oom_calc_on_committed_memory 
> >        If this is set, use vm-size instead of rss.
> > 
> 
> I would agree only if the oom killer used total_vm as a the default, it is 
> long-standing and allows for the aforementioned capability that you lose 
> with rss.  I have no problem with the added sysctl to use rss as the 
> baseline when enabled.

Probably nobody agrees with you; your proposal doesn't solve the
original issue. Kernel developers can't ignore bug reports.


> 
> >     2) add /proc/<pid>/oom_guard_size
> >        This allows users to specify "valid/expected size" of a task.
> >        When
> >        #echo 10M > /proc/<pid>/oom_guard_size
> >        At OOM calculation, 10Mbytes is subtracted from rss size.
> >        (The best way is to estimate this automatically from vm_size..but...)
> 
> Expected rss is almost impossible to tune for cpusets that have a highly 
> dynamic set of mems, let alone without containment.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:35                                           ` KAMEZAWA Hiroyuki
@ 2009-12-15  4:54                                             ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-15  4:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> > I would agree only if the oom killer used total_vm as a the default, it is 
> > long-standing and allows for the aforementioned capability that you lose 
> > with rss.  I have no problem with the added sysctl to use rss as the 
> > baseline when enabled.
> > 
> I'll prepare a patch for adds
> 
>   sysctl_oom_kill_based_on_rss (default=0)
> 
> ok ?
> 

I have no strong feelings either for or against that, I guess users who 
want to always kill the biggest memory hogger even when single page 
__GFP_WAIT allocations fail could use it.  I'm not sure it would get much 
use, though.

I think we should methodically work out an oom killer badness rewrite that 
won't compound the problem by adding more and more userspace knobs.  In 
other words, we should slow down, construct a list of goals that we want 
to achieve, and then see what type of solution we can create.

A few requirements that I have:

 - we must be able to define when a task is a memory hogger; this is
   currently done by /proc/pid/oom_adj relying on the overall total_vm
   size of the task as a baseline.  Most users should have a good sense
   of when their task is using more memory than expected and killing a
   memory leaker should always be the optimal oom killer result.  A better 
   set of units other than a shift on total_vm would be helpful, though.

 - we must prefer tasks that run on a cpuset or mempolicy's nodes if the
   oom condition is constrained by that cpuset or mempolicy and it's not a
   system-wide issue.

 - we must be able to polarize the badness heuristic to always select a
   particular task if it's very low priority, or to disable oom killing
   for a task if it's must-run.
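The polarization mechanism referred to above can be sketched roughly as follows; this is a simplified illustration of the historical badness() bitshift on a total_vm-based score, not the exact kernel code:

```c
#define OOM_DISABLE (-17)

/* Simplified sketch of how oom_adj polarizes a badness score:
 * a positive adj shifts points up, a negative adj shifts them down,
 * and OOM_DISABLE exempts the task from oom killing entirely. */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
        if (oom_adj == OOM_DISABLE)
                return 0;
        if (oom_adj > 0) {
                /* Make sure a positive adj always yields a nonzero score. */
                if (!points)
                        points = 1;
                points <<= oom_adj;
        } else if (oom_adj < 0) {
                points >>= -oom_adj;
        }
        return points;
}
```

Because the adjustment is a bitshift, each oom_adj step doubles or halves the score, which is the coarse "set of units" the message above suggests could be improved.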

The proposal may be to remove /proc/pid/oom_adj completely since I know 
both you and KOSAKI-san dislike it, but we'd need an alternative which 
keeps the above functionality intact.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:35                                           ` KAMEZAWA Hiroyuki
@ 2009-12-15  4:57                                             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-15  4:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: David Rientjes, Andrew Morton, Daisuke Nishimura,
	KOSAKI Motohiro, linux-mm, linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009 13:35:46 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 14 Dec 2009 20:30:37 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
> > >     2 ideas which I can think of now are..
> > >     1) add sysctl_oom_calc_on_committed_memory 
> > >        If this is set, use vm-size instead of rss.
> > > 
> > 
> > I would agree only if the oom killer used total_vm as a the default, it is 
> > long-standing and allows for the aforementioned capability that you lose 
> > with rss.  I have no problem with the added sysctl to use rss as the 
> > baseline when enabled.
> > 
> I'll prepare a patch for adds
> 
>   sysctl_oom_kill_based_on_rss (default=0)
> 
Hmm..

But for usual desktop users, using rss as the default ("memory eaters
should die") sounds reasonable to me.
Maybe some automatic detection logic is required.

As my 1st version shows,

   CONSTRAINT_CPUSET -> use vm_size
   CONSTRAINT_LOWMEM -> use lowmem_rss
   CONSTRAINT_NONE   -> use rss

seems like a landing point for all stakeholders, no?
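The mapping above could be sketched as a simple metric selector; the enum, struct, and helper below are illustrative assumptions (the real code would work on struct mm_struct counters), not an existing kernel interface:

```c
/* Illustrative sketch of the proposed per-constraint baseline:
 * pick which size metric feeds the badness score depending on how
 * the OOM condition was constrained. */
enum oom_constraint { CONSTRAINT_NONE, CONSTRAINT_CPUSET, CONSTRAINT_LOWMEM };

struct task_mem {
        unsigned long vm_size;    /* total mapped pages (total_vm) */
        unsigned long rss;        /* resident pages */
        unsigned long lowmem_rss; /* resident pages in lowmem zones */
};

unsigned long badness_baseline(enum oom_constraint c, const struct task_mem *t)
{
        switch (c) {
        case CONSTRAINT_CPUSET:
                return t->vm_size;    /* rss may be resident on other nodes */
        case CONSTRAINT_LOWMEM:
                return t->lowmem_rss; /* only lowmem usage matters here */
        default:
                return t->rss;        /* global OOM: biggest consumer dies */
        }
}
```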

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:43                                           ` KAMEZAWA Hiroyuki
@ 2009-12-15  4:57                                             ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-15  4:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> > That's not at all what I said.  I said using total_vm as a baseline allows 
> > users to define when a process is to be considered "rogue," that is, using 
> > more memory than expected.  Using rss would be inappropriate since it is 
> > highly dynamic and depends on the state of the VM at the time of oom, 
> > which userspace cannot possibly keep updated.
> > 
> > You consistently ignore that point: the power of /proc/pid/oom_adj to 
> > influence when a process, such as a memory leaker, is to be considered as 
> > a high priority for an oom kill.  It has absolutely nothing to do with 
> > fake NUMA, cpusets, or memcg.
> > 
> You also ignore that it's not sane to use oom kill for resource control ;)
> 

Please read my email.  Did I say anything about resource control AT ALL?  
I said /proc/pid/oom_adj currently allows userspace to define when a task 
is "rogue," meaning it's consuming much more memory than expected.  Those 
memory leakers should always be the optimal result for the oom killer to 
kill.  Using rss as the baseline would not allow userspace to effectively 
do the same thing since it's dynamic and depends on the state of the VM at 
the time of oom which is probably not reflected in the /proc/pid/oom_adj 
values for all tasks.  It has absolutely nothing to do with resource 
control, so please address this very trivial issue without going off on 
tangents.  Thanks.
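[Editorial sketch of the mechanism under discussion. This is a simplification of the 2.6.32-era badness() behavior, not the real kernel code: the function name is made up and everything except the total_vm baseline and the oom_adj bitshift is omitted.]

```c
#include <assert.h>

/*
 * Illustrative only: the 2.6.32-era badness() used total_vm as the
 * baseline and applied /proc/<pid>/oom_adj as a bitshift on the score.
 * The real function also weighs children, CPU time, nice level,
 * capabilities, and node overlap; this keeps just the two terms being
 * debated here.
 */
static unsigned long simple_badness(unsigned long total_vm_pages, int oom_adj)
{
	unsigned long points = total_vm_pages;

	if (oom_adj > 0) {
		if (!points)
			points = 1;
		points <<= oom_adj;	/* push toward "kill first" (+15) */
	} else if (oom_adj < 0) {
		points >>= -oom_adj;	/* push toward "kill last" (-16) */
	}
	return points;
}
```

With oom_adj = +15 even a small task outscores an unadjusted multi-gigabyte one, which is the "polarizing" use that comes up later in the thread.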

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:47                                           ` KOSAKI Motohiro
@ 2009-12-15  5:03                                             ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-15  5:03 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, linux-mm,
	linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:

> Comparing vsz is only meaningful when instances of the same program are
> compared. But the oom killer doesn't do that. Comparing vsz across
> different programs does NOT detect any memory leak.
> 

You're losing the ability to detect that memory leak because you'd be 
using a baseline that userspace cannot possibly know at the time of oom.  
You cannot possibly insist that users understand the amount of resident 
memory for all applications when tuning the heuristic adjuster from 
userspace.

In other words, how do you plan on userspace being able to identify tasks 
that are memory leakers if you change the baseline to rss?  Unless you 
have an answer to this question, you're not admitting the problem that 
the oom killer is primarily designed to address.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:57                                             ` David Rientjes
@ 2009-12-15  5:09                                               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-15  5:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Mon, 14 Dec 2009 20:57:53 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
> 
> > > That's not at all what I said.  I said using total_vm as a baseline allows 
> > > users to define when a process is to be considered "rogue," that is, using 
> > > more memory than expected.  Using rss would be inappropriate since it is 
> > > highly dynamic and depends on the state of the VM at the time of oom, 
> > > which userspace cannot possibly keep updated.
> > > 
> > > You consistently ignore that point: the power of /proc/pid/oom_adj to 
> > > influence when a process, such as a memory leaker, is to be considered as 
> > > a high priority for an oom kill.  It has absolutely nothing to do with 
> > > fake NUMA, cpusets, or memcg.
> > > 
> > You also ignore that it's not sane to use oom kill for resource control ;)
> > 
> 
> Please read my email.  Did I say anything about resource control AT ALL?  
> I said /proc/pid/oom_adj currently allows userspace to define when a task 
> is "rogue," meaning it's consuming much more memory than expected.  Those 
> memory leakers should always be the optimal result for the oom killer to 
> kill.  Using rss as the baseline would not allow userspace to effectively 
> do the same thing since it's dynamic and depends on the state of the VM at 
> the time of oom which is probably not reflected in the /proc/pid/oom_adj 
> values for all tasks.  It has absolutely nothing to do with resource 
> control, so please address this very trivial issue without going off on 
> tangents.  Thanks.

What I can't understand is the technique for knowing whether an (unknown)
process is leaking memory or not by checking vm_size.
And why don't you use overcommit_memory, if you can depend on vm_size?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  4:54                                             ` David Rientjes
@ 2009-12-15  5:19                                               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-12-15  5:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton,
	Daisuke Nishimura, linux-mm, linux-kernel, Christoph Lameter

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
> 
> > > I would agree only if the oom killer used total_vm as a the default, it is 
> > > long-standing and allows for the aforementioned capability that you lose 
> > > with rss.  I have no problem with the added sysctl to use rss as the 
> > > baseline when enabled.
> > > 
> > I'll prepare a patch for adds
> > 
> >   sysctl_oom_kill_based_on_rss (default=0)
> > 
> > ok ?
> > 
> 
> I have no strong feelings either for or against that, I guess users who 
> want to always kill the biggest memory hogger even when single page 
> __GFP_WAIT allocations fail could use it.  I'm not sure it would get much 
> use, though.
> 
> I think we should methodically work out an oom killer badness rewrite that 
> won't compound the problem by adding more and more userspace knobs.  In 
> other words, we should slow down, construct a list of goals that we want 
> to achieve, and then see what type of solution we can create.
> 
> A few requirements that I have:

Um, good analysis! really.

>
>  - we must be able to define when a task is a memory hogger; this is
>    currently done by /proc/pid/oom_adj relying on the overall total_vm
>    size of the task as a baseline.  Most users should have a good sense
>    of when their task is using more memory than expected and killing a
>    memory leaker should always be the optimal oom killer result.  A better 
>    set of units other than a shift on total_vm would be helpful, though.

nit: What does "Most users" mean? Desktop users (one of the largest groups
of users) don't have any expectation of memory usage.

But if an admin does have a memory expectation, they should be able to tune
the optimal oom result.

I think you pointed out the right thing.


>  - we must prefer tasks that run on a cpuset or mempolicy's nodes if the 
>    oom condition is constrained by that cpuset or mempolicy and it's not a
>    system-wide issue.

agreed. (who would disagree with it?)


>  - we must be able to polarize the badness heuristic to always select a
>    particular task if it's very low priority, or disable oom killing for
>    a task if it's must-run.

Probably I haven't caught your point. What does "polarize" mean? Can you
please describe it more?


> The proposal may be to remove /proc/pid/oom_adj completely since I know 
> both you and KOSAKI-san dislike it, but we'd need an alternative which 
> keeps the above functionality intact.

Yes, providing an alternative way is a must.




^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  5:19                                               ` KOSAKI Motohiro
@ 2009-12-17 22:21                                                 ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-17 22:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, linux-mm,
	linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:

> > A few requirements that I have:
> 
> Um, good analysis! really.
> 
> >
> >  - we must be able to define when a task is a memory hogger; this is
> >    currently done by /proc/pid/oom_adj relying on the overall total_vm
> >    size of the task as a baseline.  Most users should have a good sense
> >    of when their task is using more memory than expected and killing a
> >    memory leaker should always be the optimal oom killer result.  A better 
> >    set of units other than a shift on total_vm would be helpful, though.
> 
> nit: What does "Most users" mean? Desktop users (one of the largest groups
> of users) don't have any expectation of memory usage.
> 
> But if an admin does have a memory expectation, they should be able to tune
> the optimal oom result.
> 
> I think you pointed out the right thing.
> 

This is mostly referring to production server users where memory 
consumption by particular applications can be estimated, which allows the 
kernel to determine when a task is using a wildly unexpected amount that 
happens to become egregious enough to force the oom killer into killing a 
task.

That is in contrast to using rss as a baseline, where we prefer killing the 
application with the most resident RAM.  It is not always ideal to kill a 
task with 8GB of rss when we fail to allocate a single page for a low 
priority task.

> >  - we must prefer tasks that run on a cpuset or mempolicy's nodes if the 
> >    oom condition is constrained by that cpuset or mempolicy and it's not a
> >    system-wide issue.
> 
> agreed. (who disagree it?)
> 

It's possible to nullify the current penalization in the badness heuristic 
(order 3 reduction) if a candidate task does not share nodes with 
current's allowed set either by way of cpusets or mempolicies.  For 
example, an oom caused by an application with an MPOL_BIND on a single 
node can easily kill a task that has no memory resident on that node if 
its usage (or rss) is 3 orders higher than any candidate that is allowed 
on my bound node.
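[Editorial sketch, numbers only: the penalty David refers to divides a non-overlapping candidate's score by 8 (an order-3 reduction). The helper name is made up; in the kernel the check is done against the allowed-node intersection inside badness().]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative only: candidates whose allowed nodes do not intersect the
 * triggering task's allowed set are penalized by a factor of 8.  A task
 * bound elsewhere with more than 8x the points therefore still outscores
 * every candidate local to the constrained node.
 */
static unsigned long constrained_points(unsigned long points,
					bool nodes_intersect)
{
	return nodes_intersect ? points : points / 8;
}
```

So a 9000-point task with no memory on the bound node (9000 / 8 = 1125) still beats a 1000-point task that is allowed there, which is the problem case above.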

> >  - we must be able to polarize the badness heuristic to always select a
> >    particular task if it's very low priority, or disable oom killing for
> >    a task if it's must-run.
> 
> Probably I haven't caught your point. What does "polarize" mean? Can you
> please describe it more?
> 

We need to be able to polarize tasks so they are always killed regardless 
of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always 
chosen last (-16, currently).  We also need a way of completely disabling 
oom killing for certain tasks such as with OOM_DISABLE.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-15  5:09                                               ` KAMEZAWA Hiroyuki
@ 2009-12-17 22:23                                                 ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-17 22:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> What I can't understand is the technique for knowing whether an (unknown)
> process is leaking memory or not by checking vm_size.

Memory leaks are better identified via total_vm since leaked memory has a 
lower probability of staying resident in physical memory.

^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-17 22:23                                                 ` David Rientjes
@ 2009-12-17 23:33                                                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 92+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-17 23:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Daisuke Nishimura, KOSAKI Motohiro, linux-mm,
	linux-kernel, Christoph Lameter

On Thu, 17 Dec 2009 14:23:39 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
> 
> > What I can't understand is the technique for knowing whether an (unknown)
> > process is leaking memory or not by checking vm_size.
> 
> Memory leaks are better identified via total_vm since leaked memory has a 
> lower probability of staying resident in physical memory.
> 
Because malloc() writes a header on newly allocated memory, (vm_size - rss)
cannot be far apart even for an important program that wakes up only once a
day, or sleeps during the day and works at night.

I wish users knew the expected memory size of their applications, but I know
they can't.
Sigh...

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 92+ messages in thread


* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-17 22:21                                                 ` David Rientjes
@ 2009-12-18  4:30                                                   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-12-18  4:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton,
	Daisuke Nishimura, linux-mm, linux-kernel, Christoph Lameter

> On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:
> 
> > > A few requirements that I have:
> > 
> > Um, good analysis! really.
> > 
> > >
> > >  - we must be able to define when a task is a memory hogger; this is
> > >    currently done by /proc/pid/oom_adj relying on the overall total_vm
> > >    size of the task as a baseline.  Most users should have a good sense
> > >    of when their task is using more memory than expected and killing a
> > >    memory leaker should always be the optimal oom killer result.  A better 
> > >    set of units other than a shift on total_vm would be helpful, though.
> > 
> > nit: What does "Most users" mean? Desktop users (one of the largest groups
> > of users) don't have any expectation of memory usage.
> > 
> > But if an admin does have a memory expectation, they should be able to tune
> > the optimal oom result.
> > 
> > I think you pointed out the right thing.
> > 
> 
> This is mostly referring to production server users where memory 
> consumption by particular applications can be estimated, which allows the 
> kernel to determine when a task is using a wildly unexpected amount that 
> happens to become egregious enough to force the oom killer into killing a 
> task.
> 
> That is contrast to using rss as a baseline where we prefer on killing the 
> application with the most resident RAM.  It is not always ideal to kill a 
> task with 8GB of rss when we fail to allocate a single page for a low 
> priority task.

VSZ has the same problem if a low-priority task allocates the last single page.


> > >  - we must prefer tasks that run on a cpuset or mempolicy's nodes if the 
> > >    oom condition is constrained by that cpuset or mempolicy and it's not a
> > >    system-wide issue.
> > 
> > agreed. (who disagree it?)
> > 
> 
> It's possible to nullify the current penalization in the badness heuristic 
> (order 3 reduction) if a candidate task does not share nodes with 
> current's allowed set either by way of cpusets or mempolicies.  For 
> example, an oom caused by an application with an MPOL_BIND on a single 
> node can easily kill a task that has no memory resident on that node if 
> its usage (or rss) is 3 orders higher than any candidate that is allowed 
> on my bound node.

Yes, it's possible. However, this heuristic is intentional. The code comment says:

        /*
         * If p's nodes don't overlap ours, it may still help to kill p
         * because p may have allocated or otherwise mapped memory on
         * this node before. However it will be less likely.
         */

Do you have an alternative plan? How do we know the task doesn't have any
pages on the memory-busted node? We can't add any statistics for oom because
most systems never use oom; thus, many developers would oppose such a slowdown.


> > >  - we must be able to polarize the badness heuristic to always select a
> > >    particular task if it's very low priority, or disable oom killing for
> > >    a task if it's must-run.
> > 
> > Probably I haven't caught your point. What does "polarize" mean? Can you
> > please describe it more?
> 
> We need to be able to polarize tasks so they are always killed regardless 
> of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always 
> chosen last (-16, currently).  We also need a way of completely disabling 
> oom killing for certain tasks such as with OOM_DISABLE.

afaik, when admins use the +15 or -16 adjustment, they usually hope not to
use the kernel heuristic at all. This is the reason I proposed the new
/proc/pid/oom_priority tunable knob.
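[Editorial sketch of what such a knob could mean. This is one hypothetical semantic for the proposed /proc/<pid>/oom_priority, not existing kernel behavior: the admin-set priority decides first, and the badness heuristic only breaks ties.]

```c
#include <assert.h>

/* Hypothetical candidate record for the proposed oom_priority knob. */
struct oom_candidate {
	int priority;		/* admin-set: higher = kill first */
	unsigned long badness;	/* kernel heuristic, tie-break only */
};

/* Return 0 if a should be killed before b, 1 otherwise. */
static int pick_victim(const struct oom_candidate *a,
		       const struct oom_candidate *b)
{
	if (a->priority != b->priority)
		return a->priority > b->priority ? 0 : 1;
	return a->badness >= b->badness ? 0 : 1;
}
```

Under this semantic a small task marked high priority is killed before a large unmarked one, so the heuristic never overrides an explicit admin decision.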




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
@ 2009-12-18  4:30                                                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 92+ messages in thread
From: KOSAKI Motohiro @ 2009-12-18  4:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, Andrew Morton,
	Daisuke Nishimura, linux-mm, linux-kernel, Christoph Lameter

> On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:
> 
> > > A few requirements that I have:
> > 
> > Um, good analysis! really.
> > 
> > >
> > >  - we must be able to define when a task is a memory hogger; this is
> > >    currently done by /proc/pid/oom_adj relying on the overall total_vm
> > >    size of the task as a baseline.  Most users should have a good sense
> > >    of when their task is using more memory than expected and killing a
> > >    memory leaker should always be the optimal oom killer result.  A better 
> > >    set of units other than a shift on total_vm would be helpful, though.
> > 
> > nit: what does "most users" mean? Desktop users (a large majority of users)
> > don't have any expectation of memory usage.
> > 
> > But if the admin does have a memory expectation, they should be able to
> > tune for the optimal oom result.
> > 
> > I think you pointed out the right thing.
> > 
> 
> This is mostly referring to production server users where memory 
> consumption by particular applications can be estimated, which allows the 
> kernel to determine when a task is using a wildly unexpected amount that 
> happens to become egregious enough to force the oom killer into killing a 
> task.
> 
> That is in contrast to using rss as a baseline, where we would prefer killing 
> the application with the most resident RAM.  It is not always ideal to kill a 
> task with 8GB of rss when we fail to allocate a single page for a low 
> priority task.

VSZ has the same problem if a low-priority task allocates the last single page.


> > >  - we must prefer tasks that run on a cpuset or mempolicy's nodes if the 
> > >    oom condition is constrained by that cpuset or mempolicy and its not a
> > >    system-wide issue.
> > 
> > agreed. (who disagree it?)
> > 
> 
> It's possible to nullify the current penalization in the badness heuristic 
> (order 3 reduction) if a candidate task does not share nodes with 
> current's allowed set either by way of cpusets or mempolicies.  For 
> example, an oom caused by an application with an MPOL_BIND on a single 
> node can easily kill a task that has no memory resident on that node if 
> its usage (or rss) is 3 orders higher than any candidate that is allowed 
> on my bound node.

Yes, that's possible. However, this heuristic is intentional; the code comment says:

        /*
         * If p's nodes don't overlap ours, it may still help to kill p
         * because p may have allocated or otherwise mapped memory on
         * this node before. However it will be less likely.
         */

Do you have an alternative plan? How would we know that a task doesn't have any
pages on the memory-exhausted node? We can't add per-node statistics just for
the oom killer, because most systems never trigger it; many developers would
oppose that kind of slowdown.


> > >  - we must be able to polarize the badness heuristic to always select a
> > >    particular task if it's very low priority, or to disable oom killing
> > >    for a task if it's must-run.
> > 
> > I probably haven't caught your point. What does "polarize" mean? Could you
> > please describe it more?
> 
> We need to be able to polarize tasks so they are always killed regardless 
> of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always 
> chosen last (-16, currently).  We also need a way of completely disabling 
> oom killing for certain tasks such as with OOM_DISABLE.

AFAIK, when an admin uses a +15 or -16 adjustment, they usually want to bypass
the kernel heuristic entirely. That is why I proposed the new
/proc/pid/oom_priority tunable knob.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2
  2009-12-18  4:30                                                   ` KOSAKI Motohiro
@ 2009-12-18 10:04                                                     ` David Rientjes
  -1 siblings, 0 replies; 92+ messages in thread
From: David Rientjes @ 2009-12-18 10:04 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Daisuke Nishimura, linux-mm,
	linux-kernel, Christoph Lameter

On Fri, 18 Dec 2009, KOSAKI Motohiro wrote:

> > That is in contrast to using rss as a baseline, where we would prefer killing 
> > the application with the most resident RAM.  It is not always ideal to kill a 
> > task with 8GB of rss when we fail to allocate a single page for a low 
> > priority task.
> 
> VSZ has the same problem if a low-priority task allocates the last single page.
> 

I don't understand what you're trying to say, sorry.  Why, in your mind, 
do we always want to prefer to kill the application with the largest 
amount of memory present in physical RAM for a single, failed order-0 
allocation attempt from a lower priority task?

Additionally, when would it be sufficient to simply fail a non-__GFP_NOFAIL 
allocation instead of killing anything?

> Yes, that's possible. However, this heuristic is intentional; the code comment says:
> 
>         /*
>          * If p's nodes don't overlap ours, it may still help to kill p
>          * because p may have allocated or otherwise mapped memory on
>          * this node before. However it will be less likely.
>          */
> 
> Do you have an alternative plan? How would we know that a task doesn't have any
> pages on the memory-exhausted node? We can't add per-node statistics just for
> the oom killer, because most systems never trigger it; many developers would
> oppose that kind of slowdown.
> 

There's nothing wrong with that currently (except it doesn't work for 
mempolicies); I'm stating that it is a requirement that we keep such a 
penalization in our heuristic if we plan on rewriting it.  I was 
attempting to get a list of requirements for oom killing decisions so that 
we can write a sane heuristic, and you're simply defending the status quo 
which you insist we should change.

> > We need to be able to polarize tasks so they are always killed regardless 
> > of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always 
> > chosen last (-16, currently).  We also need a way of completely disabling 
> > oom killing for certain tasks such as with OOM_DISABLE.
> 
> AFAIK, when an admin uses a +15 or -16 adjustment, they usually want to bypass
> the kernel heuristic entirely.

That's exactly what I said above.

> That is why I proposed the new /proc/pid/oom_priority
> tunable knob.
> 

In addition to /proc/pid/oom_adj?  oom_priority on its own does not 
allow us to define when a task is a memory leaker based on the expected 
memory consumption of a single application.  That should be the single 
biggest consideration in the new badness heuristic: to define when a task 
should be killed because it is rogue.

^ permalink raw reply	[flat|nested] 92+ messages in thread


end of thread, other threads:[~2009-12-18 10:05 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-04  8:09 [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask KAMEZAWA Hiroyuki
2009-11-04  8:09 ` KAMEZAWA Hiroyuki
2009-11-06  0:02 ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v2 KAMEZAWA Hiroyuki
2009-11-06  0:02   ` KAMEZAWA Hiroyuki
2009-11-10  7:24   ` KOSAKI Motohiro
2009-11-10  7:24     ` KOSAKI Motohiro
2009-11-10  7:24     ` KAMEZAWA Hiroyuki
2009-11-10  7:24       ` KAMEZAWA Hiroyuki
2009-11-10  7:39       ` KOSAKI Motohiro
2009-11-10  7:39         ` KOSAKI Motohiro
2009-11-10  7:40         ` KAMEZAWA Hiroyuki
2009-11-10  7:40           ` KAMEZAWA Hiroyuki
2009-11-10  8:03           ` Daisuke Nishimura
2009-11-10  8:03             ` Daisuke Nishimura
2009-11-10  8:17             ` KAMEZAWA Hiroyuki
2009-11-10  8:17               ` KAMEZAWA Hiroyuki
2009-11-11  2:24               ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v3 KAMEZAWA Hiroyuki
2009-11-11  2:24                 ` KAMEZAWA Hiroyuki
2009-11-11  2:36                 ` KOSAKI Motohiro
2009-11-11  2:36                   ` KOSAKI Motohiro
2009-11-11  2:49                 ` David Rientjes
2009-11-11  2:49                   ` David Rientjes
2009-11-11  3:02                   ` KOSAKI Motohiro
2009-11-11  3:02                     ` KOSAKI Motohiro
2009-11-11  3:10                     ` KAMEZAWA Hiroyuki
2009-11-11  3:10                       ` KAMEZAWA Hiroyuki
2009-11-11  3:14                     ` David Rientjes
2009-11-11  3:14                       ` David Rientjes
2009-11-11  3:23                       ` KOSAKI Motohiro
2009-11-11  3:23                         ` KOSAKI Motohiro
2009-11-11  3:27                         ` David Rientjes
2009-11-11  3:27                           ` David Rientjes
2009-11-11  3:04                   ` KAMEZAWA Hiroyuki
2009-11-11  3:04                     ` KAMEZAWA Hiroyuki
2009-11-11  4:45                 ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4 KAMEZAWA Hiroyuki
2009-11-11  4:45                   ` KAMEZAWA Hiroyuki
2009-11-11  5:28                   ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.1 KAMEZAWA Hiroyuki
2009-11-11  5:28                     ` KAMEZAWA Hiroyuki
2009-11-11  5:58                     ` David Rientjes
2009-11-11  5:58                       ` David Rientjes
2009-11-11  6:20                       ` KAMEZAWA Hiroyuki
2009-11-11  6:20                         ` KAMEZAWA Hiroyuki
2009-11-11  6:26                         ` David Rientjes
2009-11-11  6:26                           ` David Rientjes
2009-11-11  6:34                           ` [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2 KAMEZAWA Hiroyuki
2009-11-11  6:34                             ` KAMEZAWA Hiroyuki
2009-11-11  7:32                             ` David Rientjes
2009-11-11  7:32                               ` David Rientjes
2009-11-18  0:11                             ` David Rientjes
2009-11-18  0:11                               ` David Rientjes
2009-11-18  0:58                               ` KAMEZAWA Hiroyuki
2009-11-18  0:58                                 ` KAMEZAWA Hiroyuki
2009-11-18  2:13                                 ` David Rientjes
2009-11-18  2:13                                   ` David Rientjes
2009-12-15  1:16                                   ` Andrew Morton
2009-12-15  1:16                                     ` Andrew Morton
2009-12-15  1:32                                     ` KAMEZAWA Hiroyuki
2009-12-15  1:32                                       ` KAMEZAWA Hiroyuki
2009-12-15  1:38                                       ` KOSAKI Motohiro
2009-12-15  1:38                                         ` KOSAKI Motohiro
2009-12-15  4:30                                       ` David Rientjes
2009-12-15  4:30                                         ` David Rientjes
2009-12-15  4:35                                         ` KAMEZAWA Hiroyuki
2009-12-15  4:35                                           ` KAMEZAWA Hiroyuki
2009-12-15  4:54                                           ` David Rientjes
2009-12-15  4:54                                             ` David Rientjes
2009-12-15  5:19                                             ` KOSAKI Motohiro
2009-12-15  5:19                                               ` KOSAKI Motohiro
2009-12-17 22:21                                               ` David Rientjes
2009-12-17 22:21                                                 ` David Rientjes
2009-12-18  4:30                                                 ` KOSAKI Motohiro
2009-12-18  4:30                                                   ` KOSAKI Motohiro
2009-12-18 10:04                                                   ` David Rientjes
2009-12-18 10:04                                                     ` David Rientjes
2009-12-15  4:57                                           ` KAMEZAWA Hiroyuki
2009-12-15  4:57                                             ` KAMEZAWA Hiroyuki
2009-12-15  4:43                                         ` KAMEZAWA Hiroyuki
2009-12-15  4:43                                           ` KAMEZAWA Hiroyuki
2009-12-15  4:57                                           ` David Rientjes
2009-12-15  4:57                                             ` David Rientjes
2009-12-15  5:09                                             ` KAMEZAWA Hiroyuki
2009-12-15  5:09                                               ` KAMEZAWA Hiroyuki
2009-12-17 22:23                                               ` David Rientjes
2009-12-17 22:23                                                 ` David Rientjes
2009-12-17 23:33                                                 ` KAMEZAWA Hiroyuki
2009-12-17 23:33                                                   ` KAMEZAWA Hiroyuki
2009-12-15  4:47                                         ` KOSAKI Motohiro
2009-12-15  4:47                                           ` KOSAKI Motohiro
2009-12-15  5:03                                           ` David Rientjes
2009-12-15  5:03                                             ` David Rientjes
2009-11-18  1:41                               ` Daisuke Nishimura
2009-11-18  1:41                                 ` Daisuke Nishimura
