* [RFC PATCH 0/2] fix unnecessary accidental OOM problem on concurrent reclaim
@ 2009-07-07  9:40 ` KOSAKI Motohiro
  0 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-07  9:40 UTC (permalink / raw)
  To: LKML, linux-mm, Andrew Morton, Rik van Riel, Wu Fengguang, Minchan Kim
  Cc: kosaki.motohiro

This patch series depends on the "OOM analysis helper patches" series.

The current reclaim logic doesn't consider concurrent reclaim, which can
cause accidental OOMs on systems with many CPUs.

I think this patch series addresses this issue.
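
To make the race concrete, here is a toy userspace model (a sketch only;
the real kernel code looks nothing like this, and the page counts and
batch size below are made up). Several "reclaimers" pull batches of pages
off a shared LRU under a lock; a latecomer that finds the list empty
declares OOM even though nothing was genuinely unreclaimable:

#include <pthread.h>
#include <stdio.h>

#define RECLAIMERS	8	/* concurrent direct reclaimers */
#define LRU_PAGES	100	/* pages on the shared inactive list */
#define BATCH		32	/* pages each reclaimer tries to isolate */

static int lru = LRU_PAGES;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *reclaimer(void *arg)
{
	long id = (long)arg;
	int isolated = 0;

	pthread_mutex_lock(&lock);
	while (isolated < BATCH && lru > 0) {
		lru--;			/* "isolate" one page off the LRU */
		isolated++;
	}
	pthread_mutex_unlock(&lock);

	if (isolated == 0)
		printf("reclaimer %ld: LRU empty -> accidental OOM\n", id);
	else
		printf("reclaimer %ld: isolated %d pages\n", id, isolated);
	return NULL;
}

int main(void)
{
	pthread_t threads[RECLAIMERS];
	long i;

	for (i = 0; i < RECLAIMERS; i++)
		pthread_create(&threads[i], NULL, reclaimer, (void *)i);
	for (i = 0; i < RECLAIMERS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}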




^ permalink raw reply	[flat|nested] 38+ messages in thread


* [RFC PATCH 1/2] vmscan don't isolate too many pages
  2009-07-07  9:40 ` KOSAKI Motohiro
@ 2009-07-07  9:47   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-07  9:47 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel,
	Wu Fengguang, Minchan Kim

Subject: [PATCH] vmscan don't isolate too many pages

If the system has many threads or processes, concurrent reclaim can
isolate a very large number of pages.

And if the other processes isolate _all_ pages on the LRU, a reclaimer
can't find any reclaimable page, which causes an accidental OOM.

The solution is to restrict the maximum number of isolated pages.
(This patch uses inactive_pages/2: the NR_INACTIVE counters don't include
pages that are currently isolated, so bailing out when nr_isolated >
nr_inactive caps isolation at half of the inactive pages.)


FAQ
-------
Q: Why do you compare pages accumulated over the zonelist, not individual
   zone pages?
A: If we check each zone individually, the number of reclaimers is
   restricted by the smallest zone. That would decrease performance on
   systems with a small DMA zone (e.g. the 16MB x86 DMA zone would
   throttle reclaim for the whole machine).
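
As a quick way to watch the numbers this check compares, the following
snippet dumps the relevant /proc/vmstat counters (a sketch; it assumes
the NR_ISOLATED_* counters from the prerequisite "OOM analysis helper
patches" series are exported as nr_isolated_anon and nr_isolated_file):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* print the counters too_many_isolated() sums over the zonelist */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "nr_inactive_", 12) ||
		    !strncmp(line, "nr_isolated_", 12))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}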


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1721,6 +1721,28 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+static bool too_many_isolated(struct zonelist *zonelist,
+			      enum zone_type high_zoneidx, nodemask_t *nodemask)
+{
+	unsigned long nr_inactive = 0;
+	unsigned long nr_isolated = 0;
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					high_zoneidx, nodemask) {
+		if (!populated_zone(zone))
+			continue;
+
+		nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
+		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
+		nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
+		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
+	}
+
+	return nr_isolated > nr_inactive;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1789,6 +1811,11 @@ rebalance:
 	if (p->flags & PF_MEMALLOC)
 		goto nopage;
 
+	if (too_many_isolated(gfp_mask, zonelist, high_zoneidx, nodemask)) {
+		schedule_timeout_uninterruptible(HZ/10);
+		goto restart;
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,



^ permalink raw reply	[flat|nested] 38+ messages in thread


* [RFC PATCH 2/2] Don't continue reclaim if the system has plenty of free memory
  2009-07-07  9:40 ` KOSAKI Motohiro
@ 2009-07-07  9:48   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-07  9:48 UTC (permalink / raw)
  To: LKML
  Cc: kosaki.motohiro, linux-mm, Andrew Morton, Rik van Riel,
	Wu Fengguang, Minchan Kim

Subject: [PATCH] Don't continue reclaim if the system has plenty of free memory

In a concurrent reclaim situation, if one reclaimer triggers an OOM, the
other reclaimers could stop reclaiming, because the OOM killer has freed
enough memory.

But the current kernel doesn't have this logic, so we can face the
following accidental second-OOM scenario.

1. System memory is used by only one big process.
2. Memory shortage occurs and concurrent reclaim starts.
3. One reclaimer triggers an OOM and the OOM killer kills the big process.
4. Almost all reclaimable pages are freed.
5. The other reclaimers can't find any reclaimable pages because those
   pages have already been freed.
6. Then the system triggers an accidental and unnecessary second OOM kill.


Plus, nowadays datacenter systems have "bad boy" process monitors that
kill processes consuming too much memory. But that doesn't stop the
other reclaimers, so it causes an accidental second OOM for the same
reason.


This patch has one good side effect: it increases performance on
reclaim-heavy benchmarks.

e.g.
=====
% ./hackbench 140 process 100

before:
	Time: 93.361
after:
	Time: 28.799



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/buffer.c          |    2 +-
 include/linux/swap.h |    3 ++-
 mm/page_alloc.c      |    3 ++-
 mm/vmscan.c          |   29 ++++++++++++++++++++++++++++-
 4 files changed, 33 insertions(+), 4 deletions(-)

Index: b/mm/vmscan.c
===================================================================
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -87,6 +87,9 @@ struct scan_control {
 	 */
 	nodemask_t	*nodemask;
 
+	/* Caller's preferred zone. */
+	struct zone	*preferred_zone;
+
 	/* Pluggable isolate pages callback */
 	unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
 			unsigned long *scanned, int order, int mode,
@@ -1535,6 +1538,10 @@ static void shrink_zone(int priority, st
 	unsigned long nr_reclaimed = sc->nr_reclaimed;
 	unsigned long swap_cluster_max = sc->swap_cluster_max;
 	int noswap = 0;
+	int classzone_idx = 0;
+
+	if (sc->preferred_zone)
+		classzone_idx = zone_idx(sc->preferred_zone);
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || (nr_swap_pages <= 0)) {
@@ -1583,6 +1590,20 @@ static void shrink_zone(int priority, st
 		if (nr_reclaimed > swap_cluster_max &&
 			priority < DEF_PRIORITY && !current_is_kswapd())
 			break;
+
+		/*
+		 * Now we have plenty of free memory. Perhaps big processes
+		 * exited or were killed by the OOM killer. Continuing
+		 * reclaim doesn't make any sense.
+		 */
+		if (zone_page_state(zone, NR_FREE_PAGES) >
+		    zone_lru_pages(zone) &&
+		    zone_watermark_ok(zone, sc->order, high_wmark_pages(zone),
+				      classzone_idx, 0)) {
+			/* fake result for reclaim stop */
+			nr_reclaimed += swap_cluster_max;
+			break;
+		}
 	}
 
 	sc->nr_reclaimed = nr_reclaimed;
@@ -1767,7 +1788,8 @@ out:
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-				gfp_t gfp_mask, nodemask_t *nodemask)
+				gfp_t gfp_mask, nodemask_t *nodemask,
+				struct zone *preferred_zone)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1780,6 +1802,7 @@ unsigned long try_to_free_pages(struct z
 		.mem_cgroup = NULL,
 		.isolate_pages = isolate_pages_global,
 		.nodemask = nodemask,
+		.preferred_zone = preferred_zone,
 	};
 
 	return do_try_to_free_pages(zonelist, &sc);
@@ -1808,6 +1831,10 @@ unsigned long try_to_free_mem_cgroup_pag
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 			(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
 	zonelist = NODE_DATA(numa_node_id())->node_zonelists;
+	first_zones_zonelist(zonelist,
+			     gfp_zone(sc.gfp_mask), NULL,
+			     &sc.preferred_zone);
+
 	return do_try_to_free_pages(zonelist, &sc);
 }
 #endif
Index: b/fs/buffer.c
===================================================================
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -290,7 +290,7 @@ static void free_more_memory(void)
 						&zone);
 		if (zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
-						GFP_NOFS, NULL);
+					  GFP_NOFS, NULL, zone);
 	}
 }
 
Index: b/include/linux/swap.h
===================================================================
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -213,7 +213,8 @@ static inline void lru_cache_add_active_
 
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
-					gfp_t gfp_mask, nodemask_t *mask);
+				       gfp_t gfp_mask, nodemask_t *mask,
+				       struct zone *preferred_zone);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap,
 						  unsigned int swappiness);
Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1629,7 +1629,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask,
+					       nodemask, preferred_zone);
 
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();



^ permalink raw reply	[flat|nested] 38+ messages in thread



* Re: [RFC PATCH 2/2] Don't continue reclaim if the system has plenty of free memory
@ 2009-07-07 13:20     ` Minchan Kim
  0 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2009-07-07 13:20 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Wu Fengguang


Hi, Kosaki.

On Tue, Jul 7, 2009 at 6:48 PM, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> Subject: [PATCH] Don't continue reclaim if the system has plenty of free memory
>
> In a concurrent reclaim situation, if one reclaimer triggers an OOM, the
> other reclaimers could stop reclaiming, because the OOM killer has freed
> enough memory.
>
> But the current kernel doesn't have this logic, so we can face the
> following accidental second-OOM scenario.
>
> 1. System memory is used by only one big process.
> 2. Memory shortage occurs and concurrent reclaim starts.
> 3. One reclaimer triggers an OOM and the OOM killer kills the big process.
> 4. Almost all reclaimable pages are freed.
> 5. The other reclaimers can't find any reclaimable pages because those
>    pages have already been freed.
> 6. Then the system triggers an accidental and unnecessary second OOM kill.
>

Did you actually see this situation?
I ask because we already have a routine for preventing parallel
OOM killing in __alloc_pages_may_oom.

Couldn't it protect against your scenario?
If it can't, could you explain the scenario in more detail?

I think we should first try to modify the existing routine to handle
this efficiently.
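
For reference, the existing guard looks roughly like this (a from-memory
sketch of the circa-2.6.31 code in mm/page_alloc.c; details may differ):

	/* in __alloc_pages_may_oom(): only one task may invoke the
	 * OOM killer per zonelist at a time */
	if (!try_set_zone_oom(zonelist, gfp_mask)) {
		/* someone else is already OOM killing: back off */
		schedule_timeout_uninterruptible(1);
		return NULL;
	}
	/* ... retry the allocation one last time, then: */
	out_of_memory(zonelist, gfp_mask, order);
	clear_zonelist_oom(zonelist, gfp_mask);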

>
> Plus, nowadays datacenter systems have "bad boy" process monitors that
> kill processes consuming too much memory. But that doesn't stop the
> other reclaimers, so it causes an accidental second OOM for the same
> reason.
>
>
> This patch has one good side effect: it increases performance on
> reclaim-heavy benchmarks.
>
> e.g.
> =====
> % ./hackbench 140 process 100
>
> before:
>        Time: 93.361
> after:
>        Time: 28.799
>
>
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  fs/buffer.c          |    2 +-
>  include/linux/swap.h |    3 ++-
>  mm/page_alloc.c      |    3 ++-
>  mm/vmscan.c          |   29 ++++++++++++++++++++++++++++-
>  4 files changed, 33 insertions(+), 4 deletions(-)
>
> Index: b/mm/vmscan.c
> ===================================================================
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -87,6 +87,9 @@ struct scan_control {
>         */
>        nodemask_t      *nodemask;
>
> +       /* Caller's preferred zone. */
> +       struct zone     *preferred_zone;
> +
>        /* Pluggable isolate pages callback */
>        unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
>                        unsigned long *scanned, int order, int mode,
> @@ -1535,6 +1538,10 @@ static void shrink_zone(int priority, st
>        unsigned long nr_reclaimed = sc->nr_reclaimed;
>        unsigned long swap_cluster_max = sc->swap_cluster_max;
>        int noswap = 0;
> +       int classzone_idx = 0;
> +
> +       if (sc->preferred_zone)
> +               classzone_idx = zone_idx(sc->preferred_zone);
>
>        /* If we have no swap space, do not bother scanning anon pages. */
>        if (!sc->may_swap || (nr_swap_pages <= 0)) {
> @@ -1583,6 +1590,20 @@ static void shrink_zone(int priority, st
>                if (nr_reclaimed > swap_cluster_max &&
>                        priority < DEF_PRIORITY && !current_is_kswapd())
>                        break;
> +
> +               /*
> +                * Now we have plenty of free memory. Perhaps big
> +                * processes exited or were killed by the OOM killer.
> +                * Continuing reclaim doesn't make any sense.
> +                */
> +               if (zone_page_state(zone, NR_FREE_PAGES) >
> +                   zone_lru_pages(zone) &&
> +                   zone_watermark_ok(zone, sc->order, high_wmark_pages(zone),
> +                                     classzone_idx, 0)) {
> +                       /* fake result for reclaim stop */
> +                       nr_reclaimed += swap_cluster_max;
> +                       break;
> +               }
>        }
>
>        sc->nr_reclaimed = nr_reclaimed;
> @@ -1767,7 +1788,8 @@ out:
>  }
>
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> -                               gfp_t gfp_mask, nodemask_t *nodemask)
> +                               gfp_t gfp_mask, nodemask_t *nodemask,
> +                               struct zone *preferred_zone)
>  {
>        struct scan_control sc = {
>                .gfp_mask = gfp_mask,
> @@ -1780,6 +1802,7 @@ unsigned long try_to_free_pages(struct z
>                .mem_cgroup = NULL,
>                .isolate_pages = isolate_pages_global,
>                .nodemask = nodemask,
> +               .preferred_zone = preferred_zone,
>        };
>
>        return do_try_to_free_pages(zonelist, &sc);
> @@ -1808,6 +1831,10 @@ unsigned long try_to_free_mem_cgroup_pag
>        sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>                        (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
>        zonelist = NODE_DATA(numa_node_id())->node_zonelists;
> +       first_zones_zonelist(zonelist,
> +                            gfp_zone(sc.gfp_mask), NULL,
> +                            &sc.preferred_zone);
> +
>        return do_try_to_free_pages(zonelist, &sc);
>  }
>  #endif
> Index: b/fs/buffer.c
> ===================================================================
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -290,7 +290,7 @@ static void free_more_memory(void)
>                                                &zone);
>                if (zone)
>                        try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
> -                                               GFP_NOFS, NULL);
> +                                         GFP_NOFS, NULL, zone);
>        }
>  }
>
> Index: b/include/linux/swap.h
> ===================================================================
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -213,7 +213,8 @@ static inline void lru_cache_add_active_
>
>  /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> -                                       gfp_t gfp_mask, nodemask_t *mask);
> +                                      gfp_t gfp_mask, nodemask_t *mask,
> +                                      struct zone *preferred_zone);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>                                                  gfp_t gfp_mask, bool noswap,
>                                                  unsigned int swappiness);
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1629,7 +1629,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>        reclaim_state.reclaimed_slab = 0;
>        p->reclaim_state = &reclaim_state;
>
> -       *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
> +       *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask,
> +                                              nodemask, preferred_zone);
>
>        p->reclaim_state = NULL;
>        lockdep_clear_current_reclaim_state();
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages
  2009-07-07  9:47   ` KOSAKI Motohiro
@ 2009-07-07 13:23     ` Wu Fengguang
  -1 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2009-07-07 13:23 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Minchan Kim

On Tue, Jul 07, 2009 at 05:47:13PM +0800, KOSAKI Motohiro wrote:
> Subject: [PATCH] vmscan don't isolate too many pages
> 
> If the system has many threads or processes, concurrent reclaim can
> isolate a very large number of pages.
> 
> And if the other processes isolate _all_ pages on the LRU, a reclaimer
> can't find any reclaimable page, which causes an accidental OOM.
> 
> The solution is to restrict the maximum number of isolated pages.
> (This patch uses inactive_pages/2.)

Now I think this is a better solution than per-cpu throttling :)
Will test it tomorrow.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

> 
> FAQ
> -------
> Q: Why do you compare pages accumulated over the zonelist, not individual
>    zone pages?
> A: If we check each zone individually, the number of reclaimers is
>    restricted by the smallest zone, which would decrease performance on
>    systems with a small DMA zone.
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |   27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1721,6 +1721,28 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	return alloc_flags;
>  }
>  
> +static bool too_many_isolated(struct zonelist *zonelist,
> +			      enum zone_type high_zoneidx, nodemask_t *nodemask)
> +{
> +	unsigned long nr_inactive = 0;
> +	unsigned long nr_isolated = 0;
> +	struct zoneref *z;
> +	struct zone *zone;
> +
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> +					high_zoneidx, nodemask) {
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
> +		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
> +		nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
> +		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
> +	}
> +
> +	return nr_isolated > nr_inactive;
> +}
> +
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> @@ -1789,6 +1811,11 @@ rebalance:
>  	if (p->flags & PF_MEMALLOC)
>  		goto nopage;
>  
> +	if (too_many_isolated(gfp_mask, zonelist, high_zoneidx, nodemask)) {
> +		schedule_timeout_uninterruptible(HZ/10);
> +		goto restart;
> +	}
> +
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order,
>  					zonelist, high_zoneidx,
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages
  2009-07-07  9:47   ` KOSAKI Motohiro
@ 2009-07-07 18:59     ` Rik van Riel
  -1 siblings, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2009-07-07 18:59 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Wu Fengguang, Minchan Kim

KOSAKI Motohiro wrote:

> FAQ
> -------
> Q: Why do you compare pages accumulated over the zonelist, not individual
>    zone pages?
> A: If we check each zone individually, the number of reclaimers is
>    restricted by the smallest zone, which would decrease performance on
>    systems with a small DMA zone.

That is a clever solution!  I was playing around a bit with
doing it on a per-zone basis.  Your idea is much nicer.

However, I can see one potential problem with your patch:

+		nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
+		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
+		nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
+		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
+	}
+
+	return nr_isolated > nr_inactive;

What if we ran out of swap space, or are not scanning the
anon list at all for some reason?

It is possible that there are no inactive_file pages left,
with all file pages already isolated, and your function
still letting reclaimers through.

This means you could still get a spurious OOM.

I guess I should mail out my (ugly) approach, so we can
compare the two :)

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages
  2009-07-07  9:47   ` KOSAKI Motohiro
@ 2009-07-07 23:39     ` Minchan Kim
  -1 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2009-07-07 23:39 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Wu Fengguang

On Tue, Jul 7, 2009 at 6:47 PM, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> Subject: [PATCH] vmscan don't isolate too many pages
>
> If the system has many threads or processes, concurrent reclaim can
> isolate a very large number of pages.
>
> And if the other processes isolate _all_ pages on the LRU, a reclaimer
> can't find any reclaimable page, which causes an accidental OOM.
>
> The solution is to restrict the maximum number of isolated pages.
> (This patch uses inactive_pages/2.)
>
>
> FAQ
> -------
> Q: Why do you compare pages accumulated over the zonelist, not individual
>    zone pages?
> A: If we check each zone individually, the number of reclaimers is
>    restricted by the smallest zone, which would decrease performance on
>    systems with a small DMA zone.
>
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/page_alloc.c |   27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
>
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1721,6 +1721,28 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>        return alloc_flags;
>  }
>
> +static bool too_many_isolated(struct zonelist *zonelist,
> +                             enum zone_type high_zoneidx, nodemask_t *nodemask)
> +{
> +       unsigned long nr_inactive = 0;
> +       unsigned long nr_isolated = 0;
> +       struct zoneref *z;
> +       struct zone *zone;
> +
> +       for_each_zone_zonelist_nodemask(zone, z, zonelist,
> +                                       high_zoneidx, nodemask) {
> +               if (!populated_zone(zone))
> +                       continue;
> +
> +               nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
> +               nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
> +               nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
> +               nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
> +       }
> +
> +       return nr_isolated > nr_inactive;
> +}
> +
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>        struct zonelist *zonelist, enum zone_type high_zoneidx,
> @@ -1789,6 +1811,11 @@ rebalance:
>        if (p->flags & PF_MEMALLOC)
>                goto nopage;
>
> +       if (too_many_isolated(gfp_mask, zonelist, high_zoneidx, nodemask)) {

This call doesn't match the function's signature above (there is no
gfp_mask parameter); it should be:

	too_many_isolated(zonelist, high_zoneidx, nodemask)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages
  2009-07-07 18:59     ` Rik van Riel
@ 2009-07-08  3:19       ` Wu Fengguang
  -1 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2009-07-08  3:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Minchan Kim

On Wed, Jul 08, 2009 at 02:59:29AM +0800, Rik van Riel wrote:
> KOSAKI Motohiro wrote:
> 
> > FAQ
> > -------
> > Q: Why do you compare pages accumulated over the zonelist, not individual
> >    zone pages?
> > A: If we check each zone individually, the number of reclaimers is
> >    restricted by the smallest zone, which would decrease performance on
> >    systems with a small DMA zone.
> 
> That is a clever solution!  I was playing around a bit with
> doing it on a per-zone basis.  Your idea is much nicer.
> 
> However, I can see one potential problem with your patch:
> 
> +		nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
> +		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
> +		nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
> +		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
> +	}
> +
> +	return nr_isolated > nr_inactive;
> 
> What if we ran out of swap space, or are not scanning the
> anon list at all for some reason?
> 
> It is possible that there are no inactive_file pages left,
> with all file pages already isolated, and your function
> still letting reclaimers through.

Good catch!

If swap is always off, NR_ISOLATED_ANON = 0. So it becomes

        NR_ISOLATED_FILE > NR_INACTIVE_FILE + NR_INACTIVE_ANON

which will never be true if there are more anon pages than file pages.

If swap is on but fills up at some point, comparing the *ANON counters is
also meaningless, because the anon list won't be scanned.

> This means you could still get a spurious OOM.
> 
> I guess I should mail out my (ugly) approach, so we can
> compare the two :)

And it helps to be aware of all the alternatives, now and future :)

KOSAKI, I tested this updated patch. The OOM seems to be gone, but
now a process can sleep for far too long.

[  316.756006] BUG: soft lockup - CPU#1 stuck for 61s! [msgctl11:12497]
[  316.756006] Modules linked in: drm snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device iwlagn snd iwlcore soundcore snd_page_alloc video
[  316.756006] irq event stamp: 269858
[  316.756006] hardirqs last  enabled at (269857): [<ffffffff8100cc50>] restore_args+0x0/0x30
[  316.756006] hardirqs last disabled at (269858): [<ffffffff8100bf6a>] save_args+0x6a/0x70
[  316.756006] softirqs last  enabled at (269856): [<ffffffff81055d9e>] __do_softirq+0x19e/0x1f0
[  316.756006] softirqs last disabled at (269841): [<ffffffff8100d3cc>] call_softirq+0x1c/0x50
[  316.756006] CPU 1:
[  316.756006] Modules linked in: drm snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device iwlagn snd iwlcore soundcore snd_page_alloc video
[  316.756006] Pid: 12497, comm: msgctl11 Not tainted 2.6.31-rc1 #33 HP Compaq 6910p
[  316.756006] RIP: 0010:[<ffffffff810804a9>]  [<ffffffff810804a9>] lock_acquire+0xf9/0x120
[  316.756006] RSP: 0000:ffff880013a9fcd8  EFLAGS: 00000246
[  316.756006] RAX: ffff880013a7c500 RBX: ffff880013a9fd28 RCX: ffffffff81b6c928
[  316.756006] RDX: 0000000000000002 RSI: ffffffff82130ff0 RDI: 0000000000000246
[  316.756006] RBP: ffffffff8100cb8e R08: ffffff18f84dc1fb R09: 0000000000000001
[  316.756006] R10: 00000000000001ce R11: 0000000000000001 R12: 0000000000000002
[  316.756006] R13: ffff880013a7cc90 R14: 000000008107eca9 R15: ffff880013a9fd08
[  316.756006] FS:  00007f91a8bf76f0(0000) GS:ffff88000272f000(0000) knlGS:0000000000000000
[  316.756006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  316.756006] CR2: 00007f91a8c079a0 CR3: 0000000013a81000 CR4: 00000000000006e0
[  316.756006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  316.756006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  316.756006] Call Trace:
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff8158e0e6>] ? _spin_lock+0x36/0x70
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810faf43>] ? swapcache_prepare+0x13/0x20
[  316.756006]  [<ffffffff810fa423>] ? read_swap_cache_async+0x63/0x120
[  316.756006]  [<ffffffff810fa567>] ? swapin_readahead+0x87/0xc0
[  316.756006]  [<ffffffff810ec9f9>] ? handle_mm_fault+0x719/0x840
[  316.756006]  [<ffffffff815911cb>] ? do_page_fault+0x1cb/0x330
[  316.756006]  [<ffffffff8158e9e5>] ? page_fault+0x25/0x30
[  316.756006] Kernel panic - not syncing: softlockup: hung tasks
[  316.756006] Pid: 12497, comm: msgctl11 Not tainted 2.6.31-rc1 #33
[  316.756006] Call Trace:
[  316.756006]  <IRQ>  [<ffffffff8158a01a>] panic+0xa5/0x173
[  316.756006]  [<ffffffff8100cb8e>] ? common_interrupt+0xe/0x13
[  316.756006]  [<ffffffff81012e69>] ? sched_clock+0x9/0x10
[  316.756006]  [<ffffffff8107b745>] ? lock_release_holdtime+0x35/0x1c0
[  316.756006]  [<ffffffff8158df1b>] ? _spin_unlock+0x2b/0x40
[  316.756006]  [<ffffffff810a733d>] softlockup_tick+0x1ad/0x1e0
[  316.756006]  [<ffffffff8105b91d>] run_local_timers+0x1d/0x30
[  316.756006]  [<ffffffff8105b96c>] update_process_times+0x3c/0x80
[  316.756006]  [<ffffffff810773fc>] tick_periodic+0x2c/0x80
[  316.756006]  [<ffffffff81077476>] tick_handle_periodic+0x26/0x90
[  316.756006]  [<ffffffff81077848>] tick_do_broadcast+0x88/0x90
[  316.756006]  [<ffffffff810779a9>] tick_do_periodic_broadcast+0x39/0x50
[  316.756006]  [<ffffffff81077f34>] tick_handle_periodic_broadcast+0x14/0x50
[  316.756006]  [<ffffffff8100f5ef>] timer_interrupt+0x1f/0x30
[  316.756006]  [<ffffffff810a7e70>] handle_IRQ_event+0x70/0x180
[  316.756006]  [<ffffffff810a9cf1>] handle_edge_irq+0xc1/0x160
[  316.756006]  [<ffffffff8100ee6b>] handle_irq+0x4b/0xb0
[  316.756006]  [<ffffffff8159346f>] do_IRQ+0x6f/0xf0
[  316.756006]  [<ffffffff8100cb93>] ret_from_intr+0x0/0x16
[  316.756006]  <EOI>  [<ffffffff810804a9>] ? lock_acquire+0xf9/0x120
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff8158e0e6>] ? _spin_lock+0x36/0x70
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810faf43>] ? swapcache_prepare+0x13/0x20
[  316.756006]  [<ffffffff810fa423>] ? read_swap_cache_async+0x63/0x120
[  316.756006]  [<ffffffff810fa567>] ? swapin_readahead+0x87/0xc0
[  316.756006]  [<ffffffff810ec9f9>] ? handle_mm_fault+0x719/0x840
[  316.756006]  [<ffffffff815911cb>] ? do_page_fault+0x1cb/0x330
[  316.756006]  [<ffffffff8158e9e5>] ? page_fault+0x25/0x30
[  316.756006] Rebooting in 100 seconds..


---
 mm/page_alloc.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1721,6 +1721,30 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+static bool too_many_isolated(struct zonelist *zonelist,
+			      enum zone_type high_zoneidx, nodemask_t *nodemask)
+{
+	unsigned long nr_inactive = 0;
+	unsigned long nr_isolated = 0;
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					high_zoneidx, nodemask) {
+		if (!populated_zone(zone))
+			continue;
+
+		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
+		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
+		if (nr_swap_pages) {
+			nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
+			nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
+		}
+	}
+
+	return nr_isolated > nr_inactive;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1789,6 +1813,11 @@ rebalance:
 	if (p->flags & PF_MEMALLOC)
 		goto nopage;
 
+	if (too_many_isolated(zonelist, high_zoneidx, nodemask)) {
+		schedule_timeout_uninterruptible(HZ/10);
+		goto restart;
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages
@ 2009-07-08  3:19       ` Wu Fengguang
  0 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2009-07-08  3:19 UTC (permalink / raw)
  To: Rik van Riel; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Minchan Kim

On Wed, Jul 08, 2009 at 02:59:29AM +0800, Rik van Riel wrote:
> KOSAKI Motohiro wrote:
> 
> > FAQ
> > -------
> > Q: Why do you compare pages accumulated across zones, not individual zone pages?
> > A: If we check each zone individually, the number of reclaimers is restricted
> >    by the smallest zone, which hurts performance on systems with a small DMA zone.
> 
> That is a clever solution!  I was playing around a bit with
> doing it on a per-zone basis.  Your idea is much nicer.
> 
> However, I can see one potential problem with your patch:
> 
> +		nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
> +		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
> +		nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
> +		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
> +	}
> +
> +	return nr_isolated > nr_inactive;
> 
> What if we run out of swap space, or are not scanning the
> anon list at all for some reason?
> 
> It is possible that there are no inactive_file pages left,
> with all file pages already isolated, while your function
> still lets reclaimers through.

Good catch!

If swap is always off, NR_ISOLATED_ANON = 0. So it becomes

        NR_ISOLATED_FILE > NR_INACTIVE_FILE + NR_INACTIVE_ANON

which will never be true if there are more anon pages than file pages.

If swap is on but fills up at some point, comparing the *ANON counters is
also meaningless, because the anon list won't be scanned.
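
(A worked example with made-up numbers: swap off, 900 inactive anon pages,
100 inactive file pages.  The original check only fires once

        NR_ISOLATED_FILE > 100 + 900 = 1000

yet at most ~100 file pages exist to be isolated, so the throttle can
never engage and reclaimers stream through unthrottled.)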

> This means you could still get a spurious OOM.
> 
> I guess I should mail out my (ugly) approach, so we can
> compare the two :)

And it helps to be aware of all the alternatives, now and future :)

KOSAKI, I tested this updated patch. The OOM seems to be gone, but
now a process can sleep for far too long.

[  316.756006] BUG: soft lockup - CPU#1 stuck for 61s! [msgctl11:12497]
[  316.756006] Modules linked in: drm snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device iwlagn snd iwlcore soundcore snd_page_alloc video
[  316.756006] irq event stamp: 269858
[  316.756006] hardirqs last  enabled at (269857): [<ffffffff8100cc50>] restore_args+0x0/0x30
[  316.756006] hardirqs last disabled at (269858): [<ffffffff8100bf6a>] save_args+0x6a/0x70
[  316.756006] softirqs last  enabled at (269856): [<ffffffff81055d9e>] __do_softirq+0x19e/0x1f0
[  316.756006] softirqs last disabled at (269841): [<ffffffff8100d3cc>] call_softirq+0x1c/0x50
[  316.756006] CPU 1:
[  316.756006] Modules linked in: drm snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq snd_timer snd_seq_device iwlagn snd iwlcore soundcore snd_page_alloc video
[  316.756006] Pid: 12497, comm: msgctl11 Not tainted 2.6.31-rc1 #33 HP Compaq 6910p
[  316.756006] RIP: 0010:[<ffffffff810804a9>]  [<ffffffff810804a9>] lock_acquire+0xf9/0x120
[  316.756006] RSP: 0000:ffff880013a9fcd8  EFLAGS: 00000246
[  316.756006] RAX: ffff880013a7c500 RBX: ffff880013a9fd28 RCX: ffffffff81b6c928
[  316.756006] RDX: 0000000000000002 RSI: ffffffff82130ff0 RDI: 0000000000000246
[  316.756006] RBP: ffffffff8100cb8e R08: ffffff18f84dc1fb R09: 0000000000000001
[  316.756006] R10: 00000000000001ce R11: 0000000000000001 R12: 0000000000000002
[  316.756006] R13: ffff880013a7cc90 R14: 000000008107eca9 R15: ffff880013a9fd08
[  316.756006] FS:  00007f91a8bf76f0(0000) GS:ffff88000272f000(0000) knlGS:0000000000000000
[  316.756006] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  316.756006] CR2: 00007f91a8c079a0 CR3: 0000000013a81000 CR4: 00000000000006e0
[  316.756006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  316.756006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  316.756006] Call Trace:
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff8158e0e6>] ? _spin_lock+0x36/0x70
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810faf43>] ? swapcache_prepare+0x13/0x20
[  316.756006]  [<ffffffff810fa423>] ? read_swap_cache_async+0x63/0x120
[  316.756006]  [<ffffffff810fa567>] ? swapin_readahead+0x87/0xc0
[  316.756006]  [<ffffffff810ec9f9>] ? handle_mm_fault+0x719/0x840
[  316.756006]  [<ffffffff815911cb>] ? do_page_fault+0x1cb/0x330
[  316.756006]  [<ffffffff8158e9e5>] ? page_fault+0x25/0x30
[  316.756006] Kernel panic - not syncing: softlockup: hung tasks
[  316.756006] Pid: 12497, comm: msgctl11 Not tainted 2.6.31-rc1 #33
[  316.756006] Call Trace:
[  316.756006]  <IRQ>  [<ffffffff8158a01a>] panic+0xa5/0x173
[  316.756006]  [<ffffffff8100cb8e>] ? common_interrupt+0xe/0x13
[  316.756006]  [<ffffffff81012e69>] ? sched_clock+0x9/0x10
[  316.756006]  [<ffffffff8107b745>] ? lock_release_holdtime+0x35/0x1c0
[  316.756006]  [<ffffffff8158df1b>] ? _spin_unlock+0x2b/0x40
[  316.756006]  [<ffffffff810a733d>] softlockup_tick+0x1ad/0x1e0
[  316.756006]  [<ffffffff8105b91d>] run_local_timers+0x1d/0x30
[  316.756006]  [<ffffffff8105b96c>] update_process_times+0x3c/0x80
[  316.756006]  [<ffffffff810773fc>] tick_periodic+0x2c/0x80
[  316.756006]  [<ffffffff81077476>] tick_handle_periodic+0x26/0x90
[  316.756006]  [<ffffffff81077848>] tick_do_broadcast+0x88/0x90
[  316.756006]  [<ffffffff810779a9>] tick_do_periodic_broadcast+0x39/0x50
[  316.756006]  [<ffffffff81077f34>] tick_handle_periodic_broadcast+0x14/0x50
[  316.756006]  [<ffffffff8100f5ef>] timer_interrupt+0x1f/0x30
[  316.756006]  [<ffffffff810a7e70>] handle_IRQ_event+0x70/0x180
[  316.756006]  [<ffffffff810a9cf1>] handle_edge_irq+0xc1/0x160
[  316.756006]  [<ffffffff8100ee6b>] handle_irq+0x4b/0xb0
[  316.756006]  [<ffffffff8159346f>] do_IRQ+0x6f/0xf0
[  316.756006]  [<ffffffff8100cb93>] ret_from_intr+0x0/0x16
[  316.756006]  <EOI>  [<ffffffff810804a9>] ? lock_acquire+0xf9/0x120
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff8158e0e6>] ? _spin_lock+0x36/0x70
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810fade9>] ? __swap_duplicate+0x59/0x1a0
[  316.756006]  [<ffffffff810faf43>] ? swapcache_prepare+0x13/0x20
[  316.756006]  [<ffffffff810fa423>] ? read_swap_cache_async+0x63/0x120
[  316.756006]  [<ffffffff810fa567>] ? swapin_readahead+0x87/0xc0
[  316.756006]  [<ffffffff810ec9f9>] ? handle_mm_fault+0x719/0x840
[  316.756006]  [<ffffffff815911cb>] ? do_page_fault+0x1cb/0x330
[  316.756006]  [<ffffffff8158e9e5>] ? page_fault+0x25/0x30
[  316.756006] Rebooting in 100 seconds..


---
 mm/page_alloc.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1721,6 +1721,30 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+static bool too_many_isolated(struct zonelist *zonelist,
+			      enum zone_type high_zoneidx, nodemask_t *nodemask)
+{
+	unsigned long nr_inactive = 0;
+	unsigned long nr_isolated = 0;
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					high_zoneidx, nodemask) {
+		if (!populated_zone(zone))
+			continue;
+
+		nr_inactive += zone_page_state(zone, NR_INACTIVE_FILE);
+		nr_isolated += zone_page_state(zone, NR_ISOLATED_FILE);
+		if (nr_swap_pages) {
+			nr_inactive += zone_page_state(zone, NR_INACTIVE_ANON);
+			nr_isolated += zone_page_state(zone, NR_ISOLATED_ANON);
+		}
+	}
+
+	return nr_isolated > nr_inactive;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1789,6 +1813,11 @@ rebalance:
 	if (p->flags & PF_MEMALLOC)
 		goto nopage;
 
+	if (too_many_isolated(zonelist, high_zoneidx, nodemask)) {
+		schedule_timeout_uninterruptible(HZ/10);
+		goto restart;
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-08  3:19       ` Wu Fengguang
@ 2009-07-09  1:51         ` Rik van Riel
  -1 siblings, 0 replies; 38+ messages in thread
From: Rik van Riel @ 2009-07-09  1:51 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Minchan Kim

When way too many processes go into direct reclaim, it is possible
for all of the pages to be taken off the LRU.  One result of this
is that the next process in the page reclaim code thinks there are
no reclaimable pages left and triggers an out of memory kill.

One solution to this problem is to never let so many processes into
the page reclaim path that the entire LRU is emptied.  Limiting the
system to only having half of each inactive list isolated for
reclaim should be safe.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
On Wed, 8 Jul 2009 11:19:01 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> > I guess I should mail out my (ugly) approach, so we can
> > compare the two :)
> 
> And it helps to be aware of all the alternatives, now and future :)

Here is the per-zone alternative to Kosaki's patch.

I believe Kosaki's patch will result in better performance
and is more elegant overall, but here it is :)

 mm/vmscan.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Index: mmotm/mm/vmscan.c
===================================================================
--- mmotm.orig/mm/vmscan.c	2009-07-08 21:37:01.000000000 -0400
+++ mmotm/mm/vmscan.c	2009-07-08 21:39:02.000000000 -0400
@@ -1035,6 +1035,27 @@ int isolate_lru_page(struct page *page)
 }
 
 /*
+ * Are there way too many processes in the direct reclaim path already?
+ */
+static int too_many_isolated(struct zone *zone, int file)
+{
+	unsigned long inactive, isolated;
+
+	if (current_is_kswapd())
+		return 0;
+
+	if (file) {
+		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+	} else {
+		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
+		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+	}
+
+	return isolated > inactive;
+}
+
+/*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
@@ -1049,6 +1070,10 @@ static unsigned long shrink_inactive_lis
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int lumpy_reclaim = 0;
 
+	while (unlikely(too_many_isolated(zone, file))) {
+		schedule_timeout_interruptible(HZ/10);
+	}
+
 	/*
 	 * If we need a large contiguous chunk of memory, or have
 	 * trouble getting a small set of contiguous pages, we
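
(A side-by-side toy model -- plain user-space C with made-up page counts,
not kernel code -- of the trade-off KOSAKI's FAQ describes: the per-zone
test can throttle on a small DMA zone even when the zonelist-wide sum
would not.)

#include <stdbool.h>
#include <stdio.h>

/* Toy per-zone counters; the numbers below are invented for illustration. */
struct zone_counts {
	const char *name;
	long inactive;
	long isolated;
};

/* Zonelist-wide check in the style of KOSAKI's patch: sum, then compare. */
static bool too_many_isolated_global(const struct zone_counts *z, int n)
{
	long inactive = 0, isolated = 0;
	int i;

	for (i = 0; i < n; i++) {
		inactive += z[i].inactive;
		isolated += z[i].isolated;
	}
	return isolated > inactive;
}

/* Per-zone check in the style of Rik's patch above. */
static bool too_many_isolated_zone(const struct zone_counts *z)
{
	return z->isolated > z->inactive;
}

int main(void)
{
	const struct zone_counts zones[] = {
		{ "DMA",    100,  150 },	/* small zone, over-isolated */
		{ "Normal", 9000, 200 },	/* large zone, mostly idle  */
	};

	/* 350 > 9100 is false: the global check lets reclaimers through. */
	printf("global check throttles:  %d\n",
	       too_many_isolated_global(zones, 2));

	/* 150 > 100 is true: the per-zone check throttles on the DMA zone. */
	printf("per-zone DMA throttles:  %d\n",
	       too_many_isolated_zone(&zones[0]));
	return 0;
}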

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-09  1:51         ` Rik van Riel
@ 2009-07-09  2:47           ` Wu Fengguang
  -1 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2009-07-09  2:47 UTC (permalink / raw)
  To: Rik van Riel; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Minchan Kim

On Thu, Jul 09, 2009 at 09:51:05AM +0800, Rik van Riel wrote:
> When way too many processes go into direct reclaim, it is possible
> for all of the pages to be taken off the LRU.  One result of this
> is that the next process in the page reclaim code thinks there are
> no reclaimable pages left and triggers an out of memory kill.
> 
> One solution to this problem is to never let so many processes into
> the page reclaim path that the entire LRU is emptied.  Limiting the
> system to only having half of each inactive list isolated for
> reclaim should be safe.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> On Wed, 8 Jul 2009 11:19:01 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > I guess I should mail out my (ugly) approach, so we can
> > > compare the two :)
> > 
> > And it helps to be aware of all the alternatives, now and future :)
> 
> Here is the per-zone alternative to Kosaki's patch.
> 
> I believe Kosaki's patch will result in better performance
> and is more elegant overall, but here it is :)
> 
>  mm/vmscan.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> Index: mmotm/mm/vmscan.c
> ===================================================================
> --- mmotm.orig/mm/vmscan.c	2009-07-08 21:37:01.000000000 -0400
> +++ mmotm/mm/vmscan.c	2009-07-08 21:39:02.000000000 -0400
> @@ -1035,6 +1035,27 @@ int isolate_lru_page(struct page *page)
>  }
>  
>  /*
> + * Are there way too many processes in the direct reclaim path already?
> + */
> +static int too_many_isolated(struct zone *zone, int file)
> +{
> +	unsigned long inactive, isolated;
> +
> +	if (current_is_kswapd())
> +		return 0;
> +
> +	if (file) {
> +		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> +		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +	} else {
> +		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> +		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> +	}
> +
> +	return isolated > inactive;
> +}
> +
> +/*
>   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
>   * of reclaimed pages
>   */
> @@ -1049,6 +1070,10 @@ static unsigned long shrink_inactive_lis
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int lumpy_reclaim = 0;
>  
> +	while (unlikely(too_many_isolated(zone, file))) {
> +		schedule_timeout_interruptible(HZ/10);
> +	}
> +
>  	/*
>  	 * If we need a large contiguous chunk of memory, or have
>  	 * trouble getting a small set of contiguous pages, we

It survived 5 runs. The first 4 runs were relatively smooth; the 5th run was much
slower, and the 6th run triggered a soft-lockup warning. Anyway, this record seems
better than KOSAKI's patch, which triggered a soft-lockup on the first run yesterday.

        Last login: Wed Jul  8 11:10:06 2009 from 192.168.2.1
1)      wfg@hp ~% /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
        msgctl11    0  INFO  :  Using upto 16300 pids
        msgctl11    1  PASS  :  msgctl11 ran successfully!
2)      wfg@hp ~% /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
        msgctl11    0  INFO  :  Using upto 16300 pids
        msgctl11    1  PASS  :  msgctl11 ran successfully!
3)      wfg@hp ~% /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
        msgctl11    0  INFO  :  Using upto 16300 pids
        msgctl11    1  PASS  :  msgctl11 ran successfully!
4)      wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
        msgctl11    0  INFO  :  Using upto 16300 pids
        msgctl11    1  PASS  :  msgctl11 ran successfully!
        /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.38s user 52.90s system 191% cpu 29.399 total
5)      wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
        msgctl11    0  INFO  :  Using upto 16300 pids
        msgctl11    1  PASS  :  msgctl11 ran successfully!
        /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  4.54s user 488.33s system 129% cpu 6:19.14 total
6)      wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
        msgctl11    0  INFO  :  Using upto 16300 pids
        msgctl11    1  PASS  :  msgctl11 ran successfully!
        /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  4.62s user 778.82s system 149% cpu 8:43.85 total


[ 1440.932891] INFO: task msgctl11:30739 blocked for more than 120 seconds.
[ 1440.935108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1440.937857] msgctl11      D ffffffff8180f650  5992 30739  26108 0x00000000
[ 1440.940491]  ffff880035d9bdd8 0000000000000046 0000000000000000 0000000000000046
[ 1440.943174]  ffff880035d9bd48 00000000001d2d80 000000000000cec8 ffff8800308a0000
[ 1440.946854]  ffff8800140ba280 ffff8800308a0380 0000000135d9bd88 ffffffff8107d5d8
[ 1440.949513] Call Trace:
[ 1440.951006]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1440.953274]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1440.955519]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1440.957084]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1440.958426]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1440.960642]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1440.961813]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1440.963110]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1440.965340]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1440.967504]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1440.968734]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1440.971005]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1440.973433]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1440.975958]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1440.978280] 1 lock held by msgctl11/30739:
[ 1440.980199]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1440.984155] INFO: task msgctl11:30751 blocked for more than 120 seconds.
[ 1440.985763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1440.988407] msgctl11      D ffffffff8180f650  5992 30751  26108 0x00000000
[ 1440.991030]  ffff880011917dd8 0000000000000046 0000000000000000 0000000000000046
[ 1440.993476]  ffff880011917d48 00000000001d2d80 000000000000cec8 ffff88000b82c500
[ 1440.997447]  ffff8800104e8000 ffff88000b82c880 0000000111917d88 ffffffff8107d5d8
[ 1441.001098] Call Trace:
[ 1441.001657]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.004954]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.007229]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.009664]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.012093]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.013202]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.014389]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.015637]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.017001]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.018256]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.020376]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.022552]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.024070]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.025494]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.026933] 1 lock held by msgctl11/30751:
[ 1441.027855]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.032825] INFO: task msgctl11:30765 blocked for more than 120 seconds.
[ 1441.034316] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.037090] msgctl11      D ffffffff8180f650  5992 30765  26108 0x00000000
[ 1441.038633]  ffff8800175e1dd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.042420]  ffff8800175e1d48 00000000001d2d80 000000000000cec8 ffff880026b54500
[ 1441.046070]  ffff88003ff74500 ffff880026b54880 00000001175e1d88 000000010003abf8
[ 1441.049564] Call Trace:
[ 1441.050349]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.052493]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.055100]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.057366]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.058529]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.060741]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.063105]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.065298]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.067490]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.069609]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.070947]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.072394]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.074809]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.076236] 1 lock held by msgctl11/30765:
[ 1441.077146]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.081127] INFO: task msgctl11:30767 blocked for more than 120 seconds.
[ 1441.082590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.085415] msgctl11      D ffffffff8180f650  5992 30767  26108 0x00000000
[ 1441.086987]  ffff88003671bdd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.089704]  ffff88003671bd48 00000000001d2d80 000000000000cec8 ffff880037e22280
[ 1441.092409]  ffff88000aacc500 ffff880037e22600 000000013671bd88 ffffffff8107d5d8
[ 1441.096056] Call Trace:
[ 1441.096604]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.098759]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.100328]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.102622]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.104030]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.105198]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.107339]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.109589]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.111046]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.113119]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.114270]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.115482]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.117045]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.118482]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.121630] 1 lock held by msgctl11/30767:
[ 1441.122735]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.126682] INFO: task msgctl11:30778 blocked for more than 120 seconds.
[ 1441.129232] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.132064] msgctl11      D ffffffff8180f650  5992 30778  26108 0x00000000
[ 1441.134534]  ffff880015085dd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.137341]  ffff880015085d48 00000000001d2d80 000000000000cec8 ffff880024190000
[ 1441.139971]  ffff88001e6fa280 ffff880024190380 0000000115085d88 ffffffff8107d5d8
[ 1441.143691] Call Trace:
[ 1441.145193]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.147441]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.148737]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.152295]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.154582]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.156767]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.157843]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.159290]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.160587]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.162714]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.164849]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.167166]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.169578]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.171147]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.172447] 1 lock held by msgctl11/30778:
[ 1441.173391]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.177342] INFO: task msgctl11:30779 blocked for more than 120 seconds.
[ 1441.178802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.181607] msgctl11      D ffffffff8180f650  5992 30779  26108 0x00000000
[ 1441.184220]  ffff8800141bddd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.186951]  ffff8800141bdd48 00000000001d2d80 000000000000cec8 ffff880024194500
[ 1441.190716]  ffff88003ff74500 ffff880024194880 00000001141bdd88 000000010003ad99
[ 1441.194288] Call Trace:
[ 1441.194855]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.196988]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.198579]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.201039]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.203022]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.204322]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.206599]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.208802]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.209990]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.212208]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.213447]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.215947]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.218461]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.220860] 1 lock held by msgctl11/30779:
[ 1441.222667]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.226674] INFO: task msgctl11:30781 blocked for more than 120 seconds.
[ 1441.228079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.231064] msgctl11      D ffffffff8180f650  5992 30781  26108 0x00000000
[ 1441.234493]  ffff88001d955dd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.236997]  ffff88001d955d48 00000000001d2d80 000000000000cec8 ffff88001896a280
[ 1441.238927]  ffff880039554500 ffff88001896a600 000000011d955d88 ffffffff8107d5d8
[ 1441.242591] Call Trace:
[ 1441.243146]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.245393]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.246704]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.249133]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.252552]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.253690]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.254861]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.257165]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.259506]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.260715]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.262789]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.265018]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.266554]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.267990]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.269422] 1 lock held by msgctl11/30781:
[ 1441.271342]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.275352] INFO: task msgctl11:30782 blocked for more than 120 seconds.
[ 1441.278605] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.280612] msgctl11      D ffffffff8180f650  6168 30782  26108 0x00000000
[ 1441.283112]  ffff8800141b5dd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.285893]  ffff8800141b5d48 00000000001d2d80 000000000000cec8 ffff8800232d0000
[ 1441.289531]  ffff88003ff82280 ffff8800232d0380 00000001141b5d88 ffffffff8107d5d8
[ 1441.292220] Call Trace:
[ 1441.293698]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.296193]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.298242]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.299805]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.302286]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.304316]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.306543]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.308789]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.310067]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.312322]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.314468]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.315669]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.317154]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.319681]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.320988] 1 lock held by msgctl11/30782:
[ 1441.322917]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.326888] INFO: task msgctl11:30783 blocked for more than 120 seconds.
[ 1441.329309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.332241] msgctl11      D ffffffff8180f650  5992 30783  26108 0x00000000
[ 1441.335720]  ffff880017ffddd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.338168]  ffff880017ffdd48 00000000001d2d80 000000000000cec8 ffff8800232d2280
[ 1441.341925]  ffff880025a92280 ffff8800232d2600 0000000117ffdd88 ffffffff8107d5d8
[ 1441.345809] Call Trace:
[ 1441.346389]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.348624]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.350996]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.353383]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.354782]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.355940]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.357084]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.358357]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.359733]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.360854]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.362108]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.364234]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.365787]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.367227]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.370663] 1 lock held by msgctl11/30783:
[ 1441.371584]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.374517] INFO: task msgctl11:30784 blocked for more than 120 seconds.
[ 1441.376004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.378807] msgctl11      D ffffffff8180f650  6024 30784  26108 0x00000000
[ 1441.380338]  ffff880013e13dd8 0000000000000046 0000000000000000 0000000000000046
[ 1441.383118]  ffff880013e13d48 00000000001d2d80 000000000000cec8 ffff880032408000
[ 1441.386763]  ffff88003ff82280 ffff880032408380 0000000113e13d88 ffffffff8107d5d8
[ 1441.390296] Call Trace:
[ 1441.390944]  [<ffffffff8107d5d8>] ? mark_held_locks+0x68/0x90
[ 1441.392288]  [<ffffffff8158e020>] ? _spin_unlock_irq+0x30/0x40
[ 1441.394469]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.397020]  [<ffffffff8158d9f5>] __down_write_nested+0x85/0xc0
[ 1441.398316]  [<ffffffff8158da3b>] __down_write+0xb/0x10
[ 1441.399907]  [<ffffffff8158cc2d>] down_write+0x6d/0x90
[ 1441.401881]  [<ffffffff8126dc0d>] ? ipcctl_pre_down+0x3d/0x150
[ 1441.404015]  [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150
[ 1441.405272]  [<ffffffff8126f3ce>] sys_msgctl+0xbe/0x5a0
[ 1441.407434]  [<ffffffff8106e74b>] ? up_read+0x2b/0x40
[ 1441.409661]  [<ffffffff8100cc35>] ? retint_swapgs+0x13/0x1b
[ 1441.410959]  [<ffffffff8107d915>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1441.413347]  [<ffffffff8158db2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1441.414874]  [<ffffffff8100c0f2>] system_call_fastpath+0x16/0x1b
[ 1441.417228] 1 lock held by msgctl11/30784:
[ 1441.419163]  #0:  (&ids->rw_mutex){+++++.}, at: [<ffffffff8126dc0d>] ipcctl_pre_down+0x3d/0x150

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-09  2:47           ` Wu Fengguang
@ 2009-07-09  3:07             ` Wu Fengguang
  -1 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2009-07-09  3:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: KOSAKI Motohiro, LKML, linux-mm, Andrew Morton, Minchan Kim

On Thu, Jul 09, 2009 at 10:47:10AM +0800, Wu Fengguang wrote:
> On Thu, Jul 09, 2009 at 09:51:05AM +0800, Rik van Riel wrote:
> > When way too many processes go into direct reclaim, it is possible
> > for all of the pages to be taken off the LRU.  One result of this
> > is that the next process in the page reclaim code thinks there are
> > no reclaimable pages left and triggers an out of memory kill.
> > 
> > One solution to this problem is to never let so many processes into
> > the page reclaim path that the entire LRU is emptied.  Limiting the
> > system to only having half of each inactive list isolated for
> > reclaim should be safe.
> > 
> > Signed-off-by: Rik van Riel <riel@redhat.com>
> > ---
> > On Wed, 8 Jul 2009 11:19:01 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > > I guess I should mail out my (ugly) approach, so we can
> > > > compare the two :)
> > > 
> > > And it helps to be aware of all the alternatives, now and future :)
> > 
> > Here is the per-zone alternative to Kosaki's patch.
> > 
> > I believe Kosaki's patch will result in better performance
> > and is more elegant overall, but here it is :)
> > 
> >  mm/vmscan.c |   25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > Index: mmotm/mm/vmscan.c
> > ===================================================================
> > --- mmotm.orig/mm/vmscan.c	2009-07-08 21:37:01.000000000 -0400
> > +++ mmotm/mm/vmscan.c	2009-07-08 21:39:02.000000000 -0400
> > @@ -1035,6 +1035,27 @@ int isolate_lru_page(struct page *page)
> >  }
> >  
> >  /*
> > + * Are there way too many processes in the direct reclaim path already?
> > + */
> > +static int too_many_isolated(struct zone *zone, int file)
> > +{
> > +	unsigned long inactive, isolated;
> > +
> > +	if (current_is_kswapd())
> > +		return 0;
> > +
> > +	if (file) {
> > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> > +		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> > +	} else {
> > +		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
> > +		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
> > +	}
> > +
> > +	return isolated > inactive;
> > +}
> > +
> > +/*
> >   * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
> >   * of reclaimed pages
> >   */
> > @@ -1049,6 +1070,10 @@ static unsigned long shrink_inactive_lis
> >  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> >  	int lumpy_reclaim = 0;
> >  
> > +	while (unlikely(too_many_isolated(zone, file))) {
> > +		schedule_timeout_interruptible(HZ/10);
> > +	}
> > +
> >  	/*
> >  	 * If we need a large contiguous chunk of memory, or have
> >  	 * trouble getting a small set of contiguous pages, we
> 
> It survives 5 runs. The first 4 runs are relatively smooth. The 5th run is much
> slower, and the 6th run triggered a soft-lockup warning. Anyway this record seems
> better than KOSAKI's patch, which triggered soft-lockup at the first run yesterday.
> 
>         Last login: Wed Jul  8 11:10:06 2009 from 192.168.2.1
> 1)      wfg@hp ~% /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
>         msgctl11    0  INFO  :  Using upto 16300 pids
>         msgctl11    1  PASS  :  msgctl11 ran successfully!
> 2)      wfg@hp ~% /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
>         msgctl11    0  INFO  :  Using upto 16300 pids
>         msgctl11    1  PASS  :  msgctl11 ran successfully!
> 3)      wfg@hp ~% /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
>         msgctl11    0  INFO  :  Using upto 16300 pids
>         msgctl11    1  PASS  :  msgctl11 ran successfully!
> 4)      wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
>         msgctl11    0  INFO  :  Using upto 16300 pids
>         msgctl11    1  PASS  :  msgctl11 ran successfully!
>         /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.38s user 52.90s system 191% cpu 29.399 total
> 5)      wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
>         msgctl11    0  INFO  :  Using upto 16300 pids
>         msgctl11    1  PASS  :  msgctl11 ran successfully!
>         /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  4.54s user 488.33s system 129% cpu 6:19.14 total
> 6)      wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
>         msgctl11    0  INFO  :  Using upto 16300 pids
>         msgctl11    1  PASS  :  msgctl11 ran successfully!
>         /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  4.62s user 778.82s system 149% cpu 8:43.85 total

I tried the semaphore-based concurrent direct reclaim throttling and got
these numbers. The run time is normally 30s, but it can sometimes go up
many-fold. It seems there are more hidden problems...

Last login: Thu Jul  9 10:13:12 2009 from 192.168.2.1
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16298 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.38s user 51.28s system 182% cpu 30.002 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16298 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.78s user 52.04s system 185% cpu 30.168 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16298 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.59s user 51.95s system 193% cpu 28.628 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16298 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.87s user 283.66s system 167% cpu 2:51.17 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16297 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.32s user 49.80s system 178% cpu 29.673 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16297 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.70s user 52.56s system 190% cpu 29.601 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16297 pids
msgctl11    1  PASS  :  msgctl11 ran successfully!
/cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11  3.92s user 251.55s system 158% cpu 2:41.40 total
wfg@hp ~% time /cc/ltp/ltp-full-20090531/./testcases/kernel/syscalls/ipc/msgctl/msgctl11
msgctl11    0  INFO  :  Using upto 16297 pids
(soft lockup)

--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -1042,6 +1042,7 @@ static unsigned long shrink_inactive_lis
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int lumpy_reclaim = 0;
+	static struct semaphore direct_reclaim_sem = __SEMAPHORE_INITIALIZER(direct_reclaim_sem, 32);
 
 	/*
 	 * If we need a large contiguous chunk of memory, or have
@@ -1057,6 +1058,9 @@ static unsigned long shrink_inactive_lis
 
 	pagevec_init(&pvec, 1);
 
+	if (!current_is_kswapd())
+		down(&direct_reclaim_sem);
+
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 	do {
@@ -1173,6 +1177,10 @@ static unsigned long shrink_inactive_lis
 done:
 	local_irq_enable();
 	pagevec_release(&pvec);
+
+	if (!current_is_kswapd())
+		up(&direct_reclaim_sem);
+
 	return nr_reclaimed;
 }
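
A sketch of a possible variant (not part of either patch in this thread):
the same NR_ISOLATED throttle, but backing off via congestion_wait()
instead of a bare schedule_timeout_interruptible() loop, assuming the
congestion_wait(WRITE, HZ/10) interface of kernels from this period.
Sleeping on the congestion wait queue lets throttled reclaimers wake as
soon as writeback congestion clears instead of always waiting the full
timeout:

	/* Sketch only: throttle direct reclaimers while too many pages
	 * are isolated from this zone's LRU lists; too_many_isolated()
	 * already exempts kswapd, so it is never throttled here. */
	while (unlikely(too_many_isolated(zone, file)))
		congestion_wait(WRITE, HZ/10);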
 
 Thanks,
 Fengguang


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages
  2009-07-07 23:39     ` Minchan Kim
@ 2009-07-09  3:12       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-09  3:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Wu Fengguang

> > +	if (too_many_isolated(gfp_mask, zonelist, high_zoneidx, nodemask)) {
> 
> too_many_isolated(zonelist, high_zoneidx, nodemask)

Correct.
I forgot to quilt refresh before sending. Sorry.




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/2] Don't continue reclaim if the system have plenty  free memory
  2009-07-07 13:20     ` Minchan Kim
@ 2009-07-09  5:08       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-09  5:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Wu Fengguang

> Hi, Kosaki.
> 
> On Tue, Jul 7, 2009 at 6:48 PM, KOSAKI
> Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> > Subject: [PATCH] Don't continue reclaim if the system have plenty free memory
> >
> > In a concurrent reclaim situation, if one reclaimer triggers OOM, other
> > reclaimers may be able to stop reclaiming, because the OOM killer frees
> > enough memory.
> >
> > But the current kernel has no such logic. So we can face the following
> > accidental 2nd OOM scenario:
> >
> > 1. System memory is used by only one big process.
> > 2. Memory shortage occurs and concurrent reclaim starts.
> > 3. One reclaimer triggers OOM and the OOM killer kills the big process above.
> > 4. Almost all reclaimable pages are freed.
> > 5. Another reclaimer can't find any reclaimable page, because those pages
> >    are already freed.
> > 6. Then the system triggers an accidental and unnecessary 2nd OOM kill.
> >
> 
> Did you see this situation?
> The reason I ask is that we already have a routine for preventing parallel
> OOM killing in __alloc_pages_may_oom.
>
> Couldn't it protect against your scenario?

Can you please look at the actual code of this patch?
These two patches fix different problems.

1/2 fixes the issue that concurrent direct reclaimers can isolate
too many pages.
2/2 fixes the issue that a race between reclaim and process exit causes an
accidental OOM.


> If it can't, could you explain the scenario in more detail?

The __alloc_pages_may_oom() check doesn't affect threads that have already
entered reclaim. That's obvious.





^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-09  1:51         ` Rik van Riel
@ 2009-07-09  6:39           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-09  6:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kosaki.motohiro, Wu Fengguang, LKML, linux-mm, Andrew Morton,
	Minchan Kim

Hi

> When way too many processes go into direct reclaim, it is possible
> for all of the pages to be taken off the LRU.  One result of this
> is that the next process in the page reclaim code thinks there are
> no reclaimable pages left and triggers an out of memory kill.
> 
> One solution to this problem is to never let so many processes into
> the page reclaim path that the entire LRU is emptied.  Limiting the
> system to only having half of each inactive list isolated for
> reclaim should be safe.

Thanks, good patch.
I'd like to run several benchmarks and compare my patch with yours.

Can you please give me a few days?




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-09  3:07             ` Wu Fengguang
@ 2009-07-09  7:01               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-09  7:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Rik van Riel, LKML, linux-mm, Andrew Morton,
	Minchan Kim

Hi

> I tried the semaphore-based concurrent direct reclaim throttling and got
> these numbers. The run time is normally 30s, but it can sometimes go up
> many-fold. It seems there are more hidden problems...

Hmm...
I think you and I have different priority lists. May I explain why Rik
and I decided to use half of the LRU pages?

Suppose the system has 4GB (= 1M pages) of memory; my patch then allows
1M/2/32 = 16384 concurrent reclaimers. I agree this is very large and
inefficient. However, put another way, it is very conservative.
I believe it doesn't impose an overly strong restriction.
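
For concreteness, a tiny userspace sketch of the arithmetic above; the
inputs (4GB of memory, 4KB pages, SWAP_CLUSTER_MAX = 32 pages isolated
per reclaimer, and the "half of the inactive list" cap) are the
assumptions used in this thread, not values read out of the patch:

	#include <stdio.h>

	#define PAGE_SIZE		4096UL
	#define SWAP_CLUSTER_MAX	32UL	/* pages isolated per reclaimer */

	int main(void)
	{
		unsigned long pages = (4UL << 30) / PAGE_SIZE;	/* 4GB -> ~1M pages */
		/* at most half of the list may be isolated at any one time */
		unsigned long reclaimers = pages / 2 / SWAP_CLUSTER_MAX;

		printf("%lu\n", reclaimers);	/* prints 16384 */
		return 0;
	}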

On the other hand, your patch's concurrency restriction is a small constant
value (= 32).
It can be more efficient, but it can also cause regressions. IOW, it is a
more aggressive approach.

e.g.
if the system has >100 CPUs, my patch admits plenty of reclaimers, but
your patch leaves tons of CPUs idle.


And recall that the original issue teaches us this is a rare and somewhat
insane workload.
So I prioritize:

1. prevent unnecessary OOM
2. no regression to typical workload
3. msgctl11 performance


IOW, I don't think msgctl11 performance is so important.
May I ask why you think it is?


>
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -1042,6 +1042,7 @@ static unsigned long shrink_inactive_lis
>  	unsigned long nr_reclaimed = 0;
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int lumpy_reclaim = 0;
> +	static struct semaphore direct_reclaim_sem = __SEMAPHORE_INITIALIZER(direct_reclaim_sem, 32);
>  
>  	/*
>  	 * If we need a large contiguous chunk of memory, or have
> @@ -1057,6 +1058,9 @@ static unsigned long shrink_inactive_lis
>  
>  	pagevec_init(&pvec, 1);
>  
> +	if (!current_is_kswapd())
> +		down(&direct_reclaim_sem);
> +
>  	lru_add_drain();
>  	spin_lock_irq(&zone->lru_lock);
>  	do {
> @@ -1173,6 +1177,10 @@ static unsigned long shrink_inactive_lis
>  done:
>  	local_irq_enable();
>  	pagevec_release(&pvec);
> +
> +	if (!current_is_kswapd())
> +		up(&direct_reclaim_sem);
> +
>  	return nr_reclaimed;
>  }






^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-09  7:01               ` KOSAKI Motohiro
@ 2009-07-09  8:42                 ` Wu Fengguang
  -1 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2009-07-09  8:42 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Rik van Riel, LKML, linux-mm, Andrew Morton, Minchan Kim

On Thu, Jul 09, 2009 at 03:01:26PM +0800, KOSAKI Motohiro wrote:
> Hi
> 
> > I tried the semaphore-based concurrent direct reclaim throttling and got
> > these numbers. The run time is normally 30s, but it can sometimes go up
> > many-fold. It seems there are more hidden problems...
> 
> Hmm...
> I think you and I have different priority lists. May I explain why Rik
> and I decided to use half of the LRU pages?
> 
> Suppose the system has 4GB (= 1M pages) of memory; my patch then allows
> 1M/2/32 = 16384 concurrent reclaimers. I agree this is very large and
> inefficient. However, put another way, it is very conservative.
> I believe it doesn't impose an overly strong restriction.
 
Sorry if I caused confusion. I agree on the NR_ISOLATED-based throttling.
It risks much less than limiting the concurrency of direct reclaim.
Isolating half of the LRU pages normally costs nothing.

> On the other hand, your patch's concurrency restriction is a small constant
> value (= 32).
> It can be more efficient, but it can also cause regressions. IOW, it is a
> more aggressive approach.
> 
> e.g.
> if the system has >100 CPUs, my patch admits plenty of reclaimers, but
> your patch leaves tons of CPUs idle.

That's a quick (and clueless) hack to check whether the (very unstable)
reclaim behavior can be improved by limiting the concurrency. I didn't
mean to push it any further :)

> And recall that the original issue teaches us this is a rare and somewhat
> insane workload.
> So I prioritize:
> 
> 1. prevent unnecessary OOM
> 2. no regression to typical workload
> 3. msgctl11 performance

I totally agree on the above priorities.

> 
> IOW, I don't think msgctl11 performance is so important.
> May I ask why you think it is?

Now that we have addressed (1)/(2) with your patch, naturally the
msgctl11 performance problem catches my eye. Strictly speaking,
I'm not particularly interested in the performance itself, but in
the obviously high _fluctuations_ of performance. Something bad
is happening there which deserves some attention.

> 
> >
> > --- linux.orig/mm/vmscan.c
> > +++ linux/mm/vmscan.c
> > @@ -1042,6 +1042,7 @@ static unsigned long shrink_inactive_lis
> >  	unsigned long nr_reclaimed = 0;
> >  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> >  	int lumpy_reclaim = 0;
> > +	static struct semaphore direct_reclaim_sem = __SEMAPHORE_INITIALIZER(direct_reclaim_sem, 32);
> >  
> >  	/*
> >  	 * If we need a large contiguous chunk of memory, or have
> > @@ -1057,6 +1058,9 @@ static unsigned long shrink_inactive_lis
> >  
> >  	pagevec_init(&pvec, 1);
> >  
> > +	if (!current_is_kswapd())
> > +		down(&direct_reclaim_sem);
> > +
> >  	lru_add_drain();
> >  	spin_lock_irq(&zone->lru_lock);
> >  	do {
> > @@ -1173,6 +1177,10 @@ static unsigned long shrink_inactive_lis
> >  done:
> >  	local_irq_enable();
> >  	pagevec_release(&pvec);
> > +
> > +	if (!current_is_kswapd())
> > +		up(&direct_reclaim_sem);
> > +
> >  	return nr_reclaimed;
> >  }
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/2] Don't continue reclaim if the system have plenty  free memory
  2009-07-09  5:08       ` KOSAKI Motohiro
@ 2009-07-09 10:58         ` Minchan Kim
  -1 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2009-07-09 10:58 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: LKML, linux-mm, Andrew Morton, Rik van Riel, Wu Fengguang

On Thu, Jul 9, 2009 at 2:08 PM, KOSAKI
Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
>> Hi, Kosaki.
>>
>> On Tue, Jul 7, 2009 at 6:48 PM, KOSAKI
>> Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
>> > Subject: [PATCH] Don't continue reclaim if the system have plenty free memory
>> >
>> > In a concurrent reclaim situation, if one reclaimer triggers OOM, other
>> > reclaimers may be able to stop reclaiming, because the OOM killer frees
>> > enough memory.
>> >
>> > But the current kernel has no such logic. So we can face the following
>> > accidental 2nd OOM scenario:
>> >
>> > 1. System memory is used by only one big process.
>> > 2. Memory shortage occurs and concurrent reclaim starts.
>> > 3. One reclaimer triggers OOM and the OOM killer kills the big process above.
>> > 4. Almost all reclaimable pages are freed.
>> > 5. Another reclaimer can't find any reclaimable page, because those pages
>> >    are already freed.
>> > 6. Then the system triggers an accidental and unnecessary 2nd OOM kill.
>> >
>>
>> Did you see this situation?
>> The reason I ask is that we already have a routine for preventing parallel
>> OOM killing in __alloc_pages_may_oom.
>>
>> Couldn't it protect against your scenario?
>
> Can you please look at the actual code of this patch?

I mean the following:

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
        struct zonelist *zonelist, enum zone_type high_zoneidx,
...
<snip>

        /*
         * Go through the zonelist yet one more time, keep very high watermark
         * here, this is only to catch a parallel oom killing, we must fail if
         * we're still under heavy pressure.
         */
        page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
                order, zonelist, high_zoneidx,
                ALLOC_WMARK_HIGH|ALLOC_CPUSET,
                preferred_zone, migratetype);


> These two patches fix different problems.
>
> 1/2 fixes the issue that concurrent direct reclaimers can isolate
> too many pages.
> 2/2 fixes the issue that a race between reclaim and process exit causes an
> accidental OOM.
>
>
>> If it can't, could you explain the scenario in more detail?
>
> The __alloc_pages_may_oom() check doesn't affect threads that have already
> entered reclaim. That's obvious.

Threads that have entered direct reclaim will call
__alloc_pages_may_oom before out_of_memory.
At that point, if one big process was killed a while ago,
get_page_from_freelist in __alloc_pages_may_oom will eventually
succeed. So I think OOM doesn't occur.

But in that case, we still suffer unnecessary page scanning at each
priority (12~0). So in this case, your patch is good to me; you
would just be better off changing the log. :)
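
A minimal sketch of the priority loop being referred to, loosely based
on do_try_to_free_pages() in mm/vmscan.c of this period; the exit
condition in the real code differs in detail, so treat this as an
illustration rather than the actual source:

	/* Direct reclaim retries with increasing intensity: priority
	 * starts at DEF_PRIORITY (12), and each lower priority scans a
	 * larger fraction of the LRU, so a reclaimer that finds nothing
	 * may rescan through all 13 levels before giving up. */
	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		shrink_zones(priority, zonelist, sc);
		if (sc->nr_reclaimed >= sc->swap_cluster_max)
			break;	/* reclaimed enough, stop early */
	}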

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone
  2009-07-09  8:42                 ` Wu Fengguang
@ 2009-07-09 11:07                   ` Minchan Kim
  -1 siblings, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2009-07-09 11:07 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: KOSAKI Motohiro, Rik van Riel, LKML, linux-mm, Andrew Morton

Hi, Wu.

On Thu, Jul 9, 2009 at 5:42 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Jul 09, 2009 at 03:01:26PM +0800, KOSAKI Motohiro wrote:
>> Hi
>>
>> > I tried the semaphore-based concurrent direct reclaim throttling and got
>> > these numbers. The run time is normally 30s, but it can sometimes go up
>> > many-fold. It seems there are more hidden problems...
>>
>> Hmm...
>> I think you and I have different priority lists. May I explain why Rik
>> and I decided to use half of the LRU pages?
>>
>> Suppose the system has 4GB (= 1M pages) of memory; my patch then allows
>> 1M/2/32 = 16384 concurrent reclaimers. I agree this is very large and
>> inefficient. However, put another way, it is very conservative.
>> I believe it doesn't impose an overly strong restriction.
>
> Sorry if I caused confusion. I agree on the NR_ISOLATED-based throttling.
> It risks much less than limiting the concurrency of direct reclaim.
> Isolating half of the LRU pages normally costs nothing.
>
>> On the other hand, your patch's concurrency restriction is a small constant
>> value (= 32).
>> It can be more efficient, but it can also cause regressions. IOW, it is a
>> more aggressive approach.
>>
>> e.g.
>> if the system has >100 CPUs, my patch admits plenty of reclaimers, but
>> your patch leaves tons of CPUs idle.
>
> That's a quick (and clueless) hack to check whether the (very unstable)
> reclaim behavior can be improved by limiting the concurrency. I didn't
> mean to push it any further :)
>
>> And recall that the original issue teaches us this is a rare and somewhat
>> insane workload.
>> So I prioritize:
>>
>> 1. prevent unnecessary OOM
>> 2. no regression to typical workload
>> 3. msgctl11 performance
>
> I totally agree on the above priorities.
>
>>
>> IOW, I don't think msgctl11 performance is so important.
>> May I ask why you think it is?
>
> Now that we have addressed (1)/(2) with your patch, naturally the
> msgctl11 performance problem catches my eye. Strictly speaking,
> I'm not particularly interested in the performance itself, but in
> the obviously high _fluctuations_ of performance. Something bad

Me too. I have also looked into this problem.
Unfortunately, I can't devote my attention to it until
this weekend.
If you find the cause, let me know :)

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 2/2] Don't continue reclaim if the system have plenty  free memory
  2009-07-09 10:58         ` Minchan Kim
@ 2009-07-13  0:37           ` KOSAKI Motohiro
  -1 siblings, 0 replies; 38+ messages in thread
From: KOSAKI Motohiro @ 2009-07-13  0:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kosaki.motohiro, LKML, linux-mm, Andrew Morton, Rik van Riel,
	Wu Fengguang

> On Thu, Jul 9, 2009 at 2:08 PM, KOSAKI
> Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> >> Hi, Kosaki.
> >>
> >> On Tue, Jul 7, 2009 at 6:48 PM, KOSAKI
> >> Motohiro<kosaki.motohiro@jp.fujitsu.com> wrote:
> >> > Subject: [PATCH] Don't continue reclaim if the system have plenty free memory
> >> >
> >> > In a concurrent reclaim situation, if one reclaimer triggers OOM, other
> >> > reclaimers may be able to stop reclaiming, because the OOM killer frees
> >> > enough memory.
> >> >
> >> > But the current kernel has no such logic. So we can face the following
> >> > accidental 2nd OOM scenario:
> >> >
> >> > 1. System memory is used by only one big process.
> >> > 2. Memory shortage occurs and concurrent reclaim starts.
> >> > 3. One reclaimer triggers OOM and the OOM killer kills the big process above.
> >> > 4. Almost all reclaimable pages are freed.
> >> > 5. Another reclaimer can't find any reclaimable page, because those pages
> >> >    are already freed.
> >> > 6. Then the system triggers an accidental and unnecessary 2nd OOM kill.
> >> >
> >>
> >> Did you see this situation?
> >> The reason I ask is that we already have a routine for preventing parallel
> >> OOM killing in __alloc_pages_may_oom.
> >>
> >> Couldn't it protect against your scenario?
> >
> > Can you please look at the actual code of this patch?
> 
> I mean the following:
> 
> static inline struct page *
> __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>         struct zonelist *zonelist, enum zone_type high_zoneidx,
> ...
> <snip>
> 
>         /*
>          * Go through the zonelist yet one more time, keep very high watermark
>          * here, this is only to catch a parallel oom killing, we must fail if
>          * we're still under heavy pressure.
>          */
>         page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
>                 order, zonelist, high_zoneidx,
>                 ALLOC_WMARK_HIGH|ALLOC_CPUSET,
>                 preferred_zone, migratetype);

Thanks, I see your point.
Yes, the issue explained in my description only happens on old distro kernels.
I hadn't noticed this issue was already fixed. Many thanks.

But the above fix is not fully satisfying. It means:
 - concurrent reclaim can still drop too much usable memory,
 - it is only guaranteed that this doesn't cause OOM.

So I'll fix my patch description.
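
For illustration, a hedged sketch of the idea behind patch 2/2 as
discussed here: bail out of direct reclaim once every eligible zone is
back above its high watermark, e.g. after a parallel OOM kill freed a
lot of memory. This is a reconstruction under assumptions, not the
actual patch, and the helper name is made up:

	/* Sketch: true when every zone we could allocate from is already
	 * above its high watermark, so continuing to reclaim would only
	 * throw away usable memory. */
	static bool reclaim_no_longer_needed(struct zonelist *zonelist,
					     enum zone_type high_zoneidx)
	{
		struct zoneref *z;
		struct zone *zone;

		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
			if (!populated_zone(zone))
				continue;
			if (!zone_watermark_ok(zone, 0,
					       high_wmark_pages(zone), 0, 0))
				return false;
		}
		return true;
	}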



> > These two patches fix different problems.
> >
> > 1/2 fixes the issue that concurrent direct reclaimers can isolate
> > too many pages.
> > 2/2 fixes the issue that a race between reclaim and process exit causes an
> > accidental OOM.
> >
> >
> >> If it can't, could you explain the scenario in more detail?
> >
> > The __alloc_pages_may_oom() check doesn't affect threads that have already
> > entered reclaim. That's obvious.
> 
> Threads that have entered direct reclaim will call
> __alloc_pages_may_oom before out_of_memory.
> At that point, if one big process was killed a while ago,
> get_page_from_freelist in __alloc_pages_may_oom will eventually
> succeed. So I think OOM doesn't occur.
> 
> But in that case, we still suffer unnecessary page scanning at each
> priority (12~0). So in this case, your patch is good to me; you
> would just be better off changing the log. :)







^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2009-07-13  0:38 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-07  9:40 [RFC PATCH 0/2] fix unnecessary accidental OOM problem on concurrent reclaim KOSAKI Motohiro
2009-07-07  9:47 ` [RFC PATCH 1/2] vmscan don't isolate too many pages KOSAKI Motohiro
2009-07-07 13:23   ` Wu Fengguang
2009-07-07 18:59   ` Rik van Riel
2009-07-08  3:19     ` Wu Fengguang
2009-07-09  1:51       ` [RFC PATCH 1/2] vmscan don't isolate too many pages in a zone Rik van Riel
2009-07-09  2:47         ` Wu Fengguang
2009-07-09  3:07           ` Wu Fengguang
2009-07-09  7:01             ` KOSAKI Motohiro
2009-07-09  8:42               ` Wu Fengguang
2009-07-09 11:07                 ` Minchan Kim
2009-07-09  6:39         ` KOSAKI Motohiro
2009-07-07 23:39   ` [RFC PATCH 1/2] vmscan don't isolate too many pages Minchan Kim
2009-07-09  3:12     ` KOSAKI Motohiro
2009-07-07  9:48 ` [RFC PATCH 2/2] Don't continue reclaim if the system have plenty free memory KOSAKI Motohiro
2009-07-07 13:20   ` Minchan Kim
2009-07-09  5:08     ` KOSAKI Motohiro
2009-07-09 10:58       ` Minchan Kim
2009-07-13  0:37         ` KOSAKI Motohiro
