linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations
@ 2010-11-30 17:15 Mel Gorman
  2010-11-30 17:15 ` [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mel Gorman @ 2010-11-30 17:15 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, linux-mm, linux-kernel,
	Mel Gorman

Simon Kirby reported the following problem

   We're seeing cases on a number of servers where cache never fully
   grows to use all available memory.  Sometimes we see servers with 4
   GB of memory that never seem to have less than 1.5 GB free, even with
   a constantly-active VM.  In some cases, these servers also swap out
   while this happens, even though they are constantly reading the working
   set into memory.  We have been seeing this happening for a long time;
   I don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the prevailing
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressively.

There are two apparent problems here. On the target machine, there is a small
Normal zone in comparison to DMA32. As kswapd tries to balance all zones, it
would continually try reclaiming for Normal even though DMA32 was balanced
enough for callers. The second problem is that sleeping_prematurely() uses
the requested order, not the order kswapd finally reclaimed at. This keeps
kswapd artificially awake.

This series aims to alleviate these problems but needs testing to confirm
it fixes the actual problem, and wider review to consider whether there is a
better alternative approach. Local tests passed but unfortunately do not
reproduce the same problem, so the results are inconclusive.

 include/linux/mmzone.h |    3 +-
 mm/page_alloc.c        |    2 +-
 mm/vmscan.c            |   90 ++++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 79 insertions(+), 16 deletions(-)



* [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-11-30 17:15 [RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations Mel Gorman
@ 2010-11-30 17:15 ` Mel Gorman
  2010-12-01  2:13   ` Shaohua Li
  2010-11-30 17:15 ` [PATCH 2/3] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
  2010-11-30 17:15 ` [PATCH 3/3] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced Mel Gorman
  2 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2010-11-30 17:15 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, linux-mm, linux-kernel,
	Mel Gorman

When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced. For
order-0 allocations, this makes perfect sense but for higher orders it can
have unintended side-effects. If the zone sizes are imbalanced, kswapd
may reclaim heavily on a smaller zone discarding an excessive number of
pages. The user-visible behaviour is that kswapd is awake and reclaiming
even though plenty of pages are free from a suitable zone.

This patch alters the "balance" logic to stop kswapd if any suitable zone
becomes balanced to reduce the number of pages it reclaims from other zones.
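
As a rough illustration of the stop condition described above, the following
is a minimal userspace sketch, not the kernel code. struct zone_state and
kswapd_may_stop() are hypothetical names invented for this example; the real
logic lives in balance_pgdat() and also handles scan priority, congestion and
the end_zone bound, none of which are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone: only the field this sketch needs. */
struct zone_state {
	bool watermark_ok;	/* zone meets its high watermark at 'order' */
};

/*
 * Returns true when kswapd may stop balancing. Zones are indexed from 0
 * (lowest) upwards, and high_zoneidx is the highest zone index usable by
 * the allocation that woke kswapd.
 */
static bool kswapd_may_stop(const struct zone_state *zones, size_t nr_zones,
			    int order, size_t high_zoneidx)
{
	bool all_zones_ok = true;
	bool any_zone_ok = false;
	size_t i;

	assert(zones != NULL || nr_zones == 0);

	for (i = 0; i < nr_zones; i++) {
		if (!zones[i].watermark_ok)
			all_zones_ok = false;
		else if (i <= high_zoneidx)
			any_zone_ok = true;	/* balanced and usable */
	}

	/* Order-0 needs every zone balanced; higher orders need one usable zone. */
	return all_zones_ok || (order > 0 && any_zone_ok);
}
```

With two zones where only the lower one is balanced, an order-3 request with
high_zoneidx = 1 lets kswapd stop, while an order-0 request does not.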

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    3 ++-
 mm/page_alloc.c        |    2 +-
 mm/vmscan.c            |   48 +++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..25fe08d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+	enum zone_type high_zoneidx;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {
 
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..344b597 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 	struct zone *zone;
 
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+		wakeup_kswapd(zone, order, high_zoneidx);
 }
 
 static inline int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..67e4283 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2165,11 +2165,14 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+							int high_zoneidx)
 {
 	int all_zones_ok;
+	int any_zone_ok;
 	int priority;
 	int i;
+	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct scan_control sc = {
@@ -2192,7 +2195,6 @@ loop_again:
 	count_vm_event(PAGEOUTRUN);
 
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
-		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
 		int has_under_min_watermark_zone = 0;
 
@@ -2201,6 +2203,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
+		any_zone_ok = 0;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2310,10 +2313,12 @@ loop_again:
 				 * spectulatively avoid congestion waits
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
+				if (i <= high_zoneidx)
+					any_zone_ok = 1;
 			}
 
 		}
-		if (all_zones_ok)
+		if (all_zones_ok || (order && any_zone_ok))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2336,7 +2341,7 @@ loop_again:
 			break;
 	}
 out:
-	if (!all_zones_ok) {
+	if (!(all_zones_ok || (order && any_zone_ok))) {
 		cond_resched();
 
 		try_to_freeze();
@@ -2361,6 +2366,22 @@ out:
 		goto loop_again;
 	}
 
+	/* kswapd should always balance all zones for order-0 */
+	if (order && !all_zones_ok) {
+		order = sc.order = 0;
+		goto loop_again;
+	}
+
+	/*
+	 * As kswapd could be going to sleep, unconditionally mark all
+	 * zones as uncongested as kswapd is the only mechanism which
+	 * clears congestion flags
+	 */
+	for (i = 0; i <= end_zone; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+		zone_clear_flag(zone, ZONE_CONGESTED);
+	}
+
 	return sc.nr_reclaimed;
 }
 
@@ -2380,6 +2401,7 @@ out:
 static int kswapd(void *p)
 {
 	unsigned long order;
+	int zone_highidx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 	DEFINE_WAIT(wait);
@@ -2410,19 +2432,24 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = 0;
+	zone_highidx = MAX_NR_ZONES;
 	for ( ; ; ) {
 		unsigned long new_order;
+		int new_zone_highidx;
 		int ret;
 
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
+		new_zone_highidx = pgdat->high_zoneidx;
 		pgdat->kswapd_max_order = 0;
-		if (order < new_order) {
+		pgdat->high_zoneidx = MAX_NR_ZONES;
+		if (order < new_order || new_zone_highidx < zone_highidx) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
-			 * allocation
+			 * allocation or an order at a higher zone
 			 */
 			order = new_order;
+			zone_highidx = new_zone_highidx;
 		} else {
 			if (!freezing(current) && !kthread_should_stop()) {
 				long remaining = 0;
@@ -2451,6 +2478,7 @@ static int kswapd(void *p)
 			}
 
 			order = pgdat->kswapd_max_order;
+			zone_highidx = pgdat->high_zoneidx;
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
@@ -2464,7 +2492,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order);
+			balance_pgdat(pgdat, order, zone_highidx);
 		}
 	}
 	return 0;
@@ -2473,7 +2501,7 @@ static int kswapd(void *p)
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
 {
 	pg_data_t *pgdat;
 
@@ -2483,8 +2511,10 @@ void wakeup_kswapd(struct zone *zone, int order)
 	pgdat = zone->zone_pgdat;
 	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
-	if (pgdat->kswapd_max_order < order)
+	if (pgdat->kswapd_max_order < order) {
 		pgdat->kswapd_max_order = order;
+		pgdat->high_zoneidx = min(pgdat->high_zoneidx, high_zoneidx);
+	}
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
-- 
1.7.1



* [PATCH 2/3] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely()
  2010-11-30 17:15 [RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations Mel Gorman
  2010-11-30 17:15 ` [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
@ 2010-11-30 17:15 ` Mel Gorman
  2010-11-30 17:15 ` [PATCH 3/3] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced Mel Gorman
  2 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2010-11-30 17:15 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, linux-mm, linux-kernel,
	Mel Gorman

Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
there was a race pushing a zone below its watermark. If the race
happened, it stays awake. However, balance_pgdat() can decide to reclaim
at a lower order if it decides that high-order reclaim is not working as
expected. This information is not passed back to sleeping_prematurely().
The impact is that kswapd remains awake reclaiming pages long after it
should have gone to sleep. This patch passes the adjusted order to
sleeping_prematurely and uses the same logic as balance_pgdat to decide
if it's ok to go to sleep.
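
To make the effect concrete, here is a small userspace sketch under stated
assumptions. struct zone_state, zone_ok() and sleeping_prematurely_sketch()
are hypothetical names for this example only; the kernel's
zone_watermark_ok() is replaced by two precomputed flags per zone, and
unpopulated or unreclaimable zones are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone state: is the high watermark met at each order? */
struct zone_state {
	bool ok_order0;		/* enough free base pages */
	bool ok_high_order;	/* enough free high-order blocks */
};

static bool zone_ok(const struct zone_state *z, int order)
{
	return order > 0 ? z->ok_high_order : z->ok_order0;
}

/* Returns true if kswapd should stay awake. */
static bool sleeping_prematurely_sketch(const struct zone_state *zones,
					size_t nr_zones, int order,
					long remaining)
{
	bool all_zones_ok = true;
	bool any_zone_ok = false;
	size_t i;

	assert(zones != NULL || nr_zones == 0);

	/* Woken again before the short sleep expired: premature. */
	if (remaining)
		return true;

	for (i = 0; i < nr_zones; i++) {
		if (zone_ok(&zones[i], order))
			any_zone_ok = true;
		else
			all_zones_ok = false;
	}

	/* High orders: one balanced zone is enough. Order-0: all zones. */
	return order > 0 ? !any_zone_ok : !all_zones_ok;
}
```

If every zone has plenty of order-0 pages but no high-order blocks, checking
at the originally requested order 3 keeps kswapd awake, whereas checking at
the order it actually fell back to (0) lets it sleep.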

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   30 ++++++++++++++++++++++++------
 1 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67e4283..9891efd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 #endif
 
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
 	int i;
+	bool all_zones_ok = true;
+	bool any_zone_ok = false;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
-	/* If after HZ/10, a zone is below the high mark, it's premature */
+	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
-			return 1;
+			all_zones_ok = false;
+		else
+			any_zone_ok = true;
 	}
 
-	return 0;
+	/*
+	 * For high-order requests, any zone meeting the watermark is enough
+	 *   to allow kswapd go back to sleep
+	 * For order-0, all zones must be balanced
+	 */
+	if (order)
+		return !any_zone_ok;
+	else
+		return !all_zones_ok;
 }
 
 /*
@@ -2382,7 +2394,13 @@ out:
 		zone_clear_flag(zone, ZONE_CONGESTED);
 	}
 
-	return sc.nr_reclaimed;
+	/*
+	 * Return the order we were reclaiming at so sleeping_prematurely()
+	 * makes a decision on the order we were last reclaiming at. However,
+	 * if another caller entered the allocator slow path while kswapd
+	 * was awake, order will remain at the higher level
+	 */
+	return order;
 }
 
 /*
@@ -2492,7 +2510,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order, zone_highidx);
+			order = balance_pgdat(pgdat, order, zone_highidx);
 		}
 	}
 	return 0;
-- 
1.7.1



* [PATCH 3/3] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced
  2010-11-30 17:15 [RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations Mel Gorman
  2010-11-30 17:15 ` [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
  2010-11-30 17:15 ` [PATCH 2/3] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
@ 2010-11-30 17:15 ` Mel Gorman
  2 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2010-11-30 17:15 UTC (permalink / raw)
  To: Simon Kirby
  Cc: KOSAKI Motohiro, Shaohua Li, Dave Hansen, linux-mm, linux-kernel,
	Mel Gorman

When reclaiming for high-orders, kswapd is responsible for balancing a
node but it should not reclaim excessively. It avoids excessive reclaim
by considering the node balanced if any zone in it is balanced. In cases
where zone sizes are imbalanced (e.g. ZONE_DMA with both ZONE_DMA32 and
ZONE_NORMAL), kswapd can go to sleep prematurely because just one small
zone was balanced.

This alters the sleep logic of kswapd slightly. It counts the number of pages
that make up the balanced zones. If the total number of balanced pages is
more than a quarter of the node, kswapd will go back to sleep.  This should
keep a node balanced without reclaiming an excessive number of pages.
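
The 25% rule can be sketched in userspace as follows. This is only an
illustration: struct zone_state and node_balanced() are hypothetical names,
and the page counts used in the note below are made-up round numbers, not
measurements from the reported machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone. */
struct zone_state {
	unsigned long present_pages;	/* pages in the zone */
	bool watermark_ok;		/* zone meets its watermark */
};

/*
 * A node counts as balanced for high-order requests when the zones that
 * meet their watermark hold more than a quarter of the node's pages.
 */
static bool node_balanced(const struct zone_state *zones, size_t nr_zones)
{
	unsigned long node_present_pages = 0;
	unsigned long balanced = 0;
	size_t i;

	assert(zones != NULL || nr_zones == 0);

	for (i = 0; i < nr_zones; i++) {
		node_present_pages += zones[i].present_pages;
		if (zones[i].watermark_ok)
			balanced += zones[i].present_pages;
	}

	/* Strictly more than a quarter of the node must be balanced. */
	return balanced > node_present_pages / 4;
}
```

For a node with zones of 4096, 1000000 and 200000 pages, a balanced 4096-page
zone alone is not enough, while the 1000000-page zone alone is. Because the
comparison is strict, a zone holding exactly a quarter of the node does not
satisfy it on its own.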

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   30 ++++++++++++++++++++++--------
 1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9891efd..77c511f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2117,12 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 }
 #endif
 
+/*
+ * pgdat_balanced is used when checking if a node is balanced for high-order
+ * allocations. Only zones that meet watermarks make up "balanced".
+ * The total of balanced pages must be at least 25% of the node for the
+ * node to be considered balanced. Forcing all zones to be balanced for high
+ * orders can cause excessive reclaim when there are imbalanced zones.
+ * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
+ * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
+{
+	return balanced > pgdat->node_present_pages / 4;
+}
+
 /* is kswapd sleeping prematurely? */
 static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
 	int i;
+	unsigned long balanced = 0;
 	bool all_zones_ok = true;
-	bool any_zone_ok = false;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
@@ -2142,7 +2156,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 								0, 0))
 			all_zones_ok = false;
 		else
-			any_zone_ok = true;
+			balanced += zone->present_pages;
 	}
 
 	/*
@@ -2151,7 +2165,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 	 * For order-0, all zones must be balanced
 	 */
 	if (order)
-		return !any_zone_ok;
+		return !pgdat_balanced(pgdat, balanced);
 	else
 		return !all_zones_ok;
 }
@@ -2181,7 +2195,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 							int high_zoneidx)
 {
 	int all_zones_ok;
-	int any_zone_ok;
+	unsigned long balanced;
 	int priority;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
@@ -2215,7 +2229,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
-		any_zone_ok = 0;
+		balanced = 0;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2326,11 +2340,11 @@ loop_again:
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
 				if (i <= high_zoneidx)
-					any_zone_ok = 1;
+					balanced += zone->present_pages;
 			}
 
 		}
-		if (all_zones_ok || (order && any_zone_ok))
+		if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2353,7 +2367,7 @@ loop_again:
 			break;
 	}
 out:
-	if (!(all_zones_ok || (order && any_zone_ok))) {
+	if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced)))) {
 		cond_resched();
 
 		try_to_freeze();
-- 
1.7.1



* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-11-30 17:15 ` [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
@ 2010-12-01  2:13   ` Shaohua Li
  2010-12-01  2:23     ` KOSAKI Motohiro
  2010-12-01 11:07     ` Mel Gorman
  0 siblings, 2 replies; 14+ messages in thread
From: Shaohua Li @ 2010-12-01  2:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Simon Kirby, KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel

On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> When the allocator enters its slow path, kswapd is woken up to balance the
> node. It continues working until all zones within the node are balanced. For
> order-0 allocations, this makes perfect sense but for higher orders it can
> have unintended side-effects. If the zone sizes are imbalanced, kswapd
> may reclaim heavily on a smaller zone discarding an excessive number of
> pages. The user-visible behaviour is that kswapd is awake and reclaiming
> even though plenty of pages are free from a suitable zone.
> 
> This patch alters the "balance" logic to stop kswapd if any suitable zone
> becomes balanced to reduce the number of pages it reclaims from other zones.
From my understanding, the patch will stop reclaim from a high zone if a low
zone meets the high-order allocation, even if the high zone doesn't meet
it. This will, for example, make a high-order allocation from a high zone
fall back to a low zone and quickly exhaust the low zone, for example DMA.
This will break some drivers.



* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  2:13   ` Shaohua Li
@ 2010-12-01  2:23     ` KOSAKI Motohiro
  2010-12-01  2:47       ` Shaohua Li
  2010-12-01 11:07     ` Mel Gorman
  1 sibling, 1 reply; 14+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01  2:23 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, Mel Gorman, Simon Kirby, Dave Hansen, linux-mm,
	linux-kernel

> On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > When the allocator enters its slow path, kswapd is woken up to balance the
> > node. It continues working until all zones within the node are balanced. For
> > order-0 allocations, this makes perfect sense but for higher orders it can
> > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > may reclaim heavily on a smaller zone discarding an excessive number of
> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > even though plenty of pages are free from a suitable zone.
> > 
> > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > becomes balanced to reduce the number of pages it reclaims from other zones.
> from my understanding, the patch will break reclaim high zone if a low
> zone meets the high order allocation, even the high zone doesn't meet
> the high order allocation. This, for example, will make a high order
> allocation from a high zone fallback to low zone and quickly exhaust low
> zone, for example DMA. This will break some drivers.

Have you seen patch [3/3]? I think it mitigates the issue you pointed out.





* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  2:23     ` KOSAKI Motohiro
@ 2010-12-01  2:47       ` Shaohua Li
  2010-12-01  2:59         ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2010-12-01  2:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Simon Kirby, Dave Hansen, linux-mm, linux-kernel

On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > node. It continues working until all zones within the node are balanced. For
> > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > even though plenty of pages are free from a suitable zone.
> > > 
> > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > from my understanding, the patch will break reclaim high zone if a low
> > zone meets the high order allocation, even the high zone doesn't meet
> > the high order allocation. This, for example, will make a high order
> > allocation from a high zone fallback to low zone and quickly exhaust low
> > zone, for example DMA. This will break some drivers.
> 
> Have you seen patch [3/3]? I think it migigate your pointed issue.
Yes, it improves things a lot, but the problem is still possible on small systems.



* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  2:47       ` Shaohua Li
@ 2010-12-01  2:59         ` KOSAKI Motohiro
  2010-12-01  3:20           ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01  2:59 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, Mel Gorman, Simon Kirby, Dave Hansen, linux-mm,
	linux-kernel

> On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > node. It continues working until all zones within the node are balanced. For
> > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > even though plenty of pages are free from a suitable zone.
> > > > 
> > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > from my understanding, the patch will break reclaim high zone if a low
> > > zone meets the high order allocation, even the high zone doesn't meet
> > > the high order allocation. This, for example, will make a high order
> > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > zone, for example DMA. This will break some drivers.
> > 
> > Have you seen patch [3/3]? I think it migigate your pointed issue.
> yes, it improves a lot, but still possible for small systems.

OK, I see. So could you define what you mean by "small systems"? We obviously
can't make perfect VM heuristics, so we need to compare pros and cons.

Of course, I'd be glad if you have a better idea and can show it.





* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  2:59         ` KOSAKI Motohiro
@ 2010-12-01  3:20           ` Shaohua Li
  2010-12-01  3:28             ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2010-12-01  3:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Simon Kirby, Dave Hansen, linux-mm, linux-kernel

On Wed, 2010-12-01 at 10:59 +0800, KOSAKI Motohiro wrote:
> > On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > > node. It continues working until all zones within the node are balanced. For
> > > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > > even though plenty of pages are free from a suitable zone.
> > > > > 
> > > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > > from my understanding, the patch will break reclaim high zone if a low
> > > > zone meets the high order allocation, even the high zone doesn't meet
> > > > the high order allocation. This, for example, will make a high order
> > > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > > zone, for example DMA. This will break some drivers.
> > > 
> > > Have you seen patch [3/3]? I think it migigate your pointed issue.
> > yes, it improves a lot, but still possible for small systems.
> 
> Ok, I got you. so please define your "small systems" word? 
an embedded system with less memory, obviously

> we can't make
> perfect VM heuristics obviously, then we need to compare pros/cons.
If you don't care about small systems, let's consider a normal i386
system with an 896M normal zone and an 896M*3 high zone. The normal zone
will quickly be exhausted by high-order high-zone allocations, leaving a
later allocation which does need the normal zone to fail.



* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  3:20           ` Shaohua Li
@ 2010-12-01  3:28             ` KOSAKI Motohiro
  2010-12-01  7:40               ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01  3:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, Mel Gorman, Simon Kirby, Dave Hansen, linux-mm,
	linux-kernel

> On Wed, 2010-12-01 at 10:59 +0800, KOSAKI Motohiro wrote:
> > > On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > > > node. It continues working until all zones within the node are balanced. For
> > > > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > > > even though plenty of pages are free from a suitable zone.
> > > > > > 
> > > > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > > > from my understanding, the patch will break reclaim high zone if a low
> > > > > zone meets the high order allocation, even the high zone doesn't meet
> > > > > the high order allocation. This, for example, will make a high order
> > > > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > > > zone, for example DMA. This will break some drivers.
> > > > 
> > > > Have you seen patch [3/3]? I think it migigate your pointed issue.
> > > yes, it improves a lot, but still possible for small systems.
> > 
> > Ok, I got you. so please define your "small systems" word? 
> an embedded system with less memory memory, obviously

A typical embedded system doesn't have multiple zones. It's not obvious.


> > we can't make
> > perfect VM heuristics obviously, then we need to compare pros/cons.
> if you don't care about small system, let's consider a NORMAL i386
> system with 896m normal zone, and 896M*3 high zone. normal zone will
> quickly exhaust by high order high zone allocation, leave a latter
> allocation which does need normal zone fail.

That doesn't happen. Slab doesn't allocate from highmem and page cache
allocation always uses order-0. When would a high-order high-zone
allocation happen?







* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  3:28             ` KOSAKI Motohiro
@ 2010-12-01  7:40               ` Shaohua Li
  2010-12-01  7:52                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2010-12-01  7:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Simon Kirby, Dave Hansen, linux-mm, linux-kernel

On Wed, 2010-12-01 at 11:28 +0800, KOSAKI Motohiro wrote:
> > On Wed, 2010-12-01 at 10:59 +0800, KOSAKI Motohiro wrote:
> > > > On Wed, 2010-12-01 at 10:23 +0800, KOSAKI Motohiro wrote:
> > > > > > On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > > > > > > When the allocator enters its slow path, kswapd is woken up to balance the
> > > > > > > node. It continues working until all zones within the node are balanced. For
> > > > > > > order-0 allocations, this makes perfect sense but for higher orders it can
> > > > > > > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > > > > > > may reclaim heavily on a smaller zone discarding an excessive number of
> > > > > > > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > > > > > > even though plenty of pages are free from a suitable zone.
> > > > > > > 
> > > > > > > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > > > > > > becomes balanced to reduce the number of pages it reclaims from other zones.
> > > > > > from my understanding, the patch will break reclaim high zone if a low
> > > > > > zone meets the high order allocation, even the high zone doesn't meet
> > > > > > the high order allocation. This, for example, will make a high order
> > > > > > allocation from a high zone fallback to low zone and quickly exhaust low
> > > > > > zone, for example DMA. This will break some drivers.
> > > > > 
> > > > > Have you seen patch [3/3]? I think it migigate your pointed issue.
> > > > yes, it improves a lot, but still possible for small systems.
> > > 
> > > Ok, I got you. so please define your "small systems" word? 
> > an embedded system with less memory memory, obviously
> 
> Typical embedded system don't have multiple zone. It's not obvious.
IIRC, ARM supports highmem. But you are right, slub doesn't allocate from
highmem.
> > > we can't make
> > > perfect VM heuristics obviously, then we need to compare pros/cons.
> > if you don't care about small system, let's consider a NORMAL i386
> > system with 896m normal zone, and 896M*3 high zone. normal zone will
> > quickly exhaust by high order high zone allocation, leave a latter
> > allocation which does need normal zone fail.
> 
> Not happen. slab don't allocate from highmem and page cache allocation
> is always using order-0. When happen high order high zone allocation?
OK, thanks, I missed this. Then how about an x86_64 box with 896M DMA32
and 896M*3 NORMAL? Some PCI devices can only DMA to the DMA32 zone.



* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  7:40               ` Shaohua Li
@ 2010-12-01  7:52                 ` KOSAKI Motohiro
  2010-12-01  8:24                   ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: KOSAKI Motohiro @ 2010-12-01  7:52 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kosaki.motohiro, Mel Gorman, Simon Kirby, Dave Hansen, linux-mm,
	linux-kernel

> > > > we can't make
> > > > perfect VM heuristics obviously, then we need to compare pros/cons.
> > > if you don't care about small system, let's consider a NORMAL i386
> > > system with 896m normal zone, and 896M*3 high zone. normal zone will
> > > quickly exhaust by high order high zone allocation, leave a latter
> > > allocation which does need normal zone fail.
> > 
> > Not happen. slab don't allocate from highmem and page cache allocation
> > is always using order-0. When happen high order high zone allocation?
> ok, thanks, I missed this. then how about a x86_64 box with 896M DMA32
> and 896*3M NORMAL? some pci devices can only dma to DMA32 zone.

First, DMA32 is 4GB. Second, modern high-end systems don't use 32-bit PCI
devices. Third, if we are thinking about desktop users, 4GB is not a small
amount of room. Nowadays, a typical desktop has only 2GB or 4GB of memory.

In other words, I agree that the issue you pointed out exists
_potentially_, but I don't think it is more frequent than Simon's case.

When deciding on heuristics, we can't avoid thinking about how frequent
an issue is. That's very important.


Of course, if you have a better idea, I won't oppose it.




* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  7:52                 ` KOSAKI Motohiro
@ 2010-12-01  8:24                   ` Shaohua Li
  0 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2010-12-01  8:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Simon Kirby, Dave Hansen, linux-mm, linux-kernel

On Wed, 2010-12-01 at 15:52 +0800, KOSAKI Motohiro wrote:
> > > > > we can't make
> > > > > perfect VM heuristics obviously, then we need to compare pros/cons.
> > > > if you don't care about small system, let's consider a NORMAL i386
> > > > system with 896m normal zone, and 896M*3 high zone. normal zone will
> > > > quickly exhaust by high order high zone allocation, leave a latter
> > > > allocation which does need normal zone fail.
> > > 
> > > Not happen. slab don't allocate from highmem and page cache allocation
> > > is always using order-0. When happen high order high zone allocation?
> > ok, thanks, I missed this. then how about a x86_64 box with 896M DMA32
> > and 896*3M NORMAL? some pci devices can only dma to DMA32 zone.
> 
> First, DMA32 is 4GB. Second, modern high end system don't use 32bit PCI
> device. Third, while we are thinking desktop users, 4GB is not small
> room. nowadays, typical desktop have only 2GB or 4GB memory.
DMA32 isn't 4G, because there is a hole under 4G for PCI BARs. I don't
think 32-bit PCI devices are rare either. But anyway, if you insist this
isn't a big issue, I'm OK with it.



* Re: [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
  2010-12-01  2:13   ` Shaohua Li
  2010-12-01  2:23     ` KOSAKI Motohiro
@ 2010-12-01 11:07     ` Mel Gorman
  1 sibling, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2010-12-01 11:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Simon Kirby, KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel

On Wed, Dec 01, 2010 at 10:13:56AM +0800, Shaohua Li wrote:
> On Wed, 2010-12-01 at 01:15 +0800, Mel Gorman wrote:
> > When the allocator enters its slow path, kswapd is woken up to balance the
> > node. It continues working until all zones within the node are balanced. For
> > order-0 allocations, this makes perfect sense but for higher orders it can
> > have unintended side-effects. If the zone sizes are imbalanced, kswapd
> > may reclaim heavily on a smaller zone discarding an excessive number of
> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
> > even though plenty of pages are free from a suitable zone.
> > 
> > This patch alters the "balance" logic to stop kswapd if any suitable zone
> > becomes balanced to reduce the number of pages it reclaims from other zones.
>
> from my understanding, the patch will break reclaim high zone if a low
> zone meets the high order allocation, even the high zone doesn't meet
> the high order allocation.

Indeed this is possible and it's a situation confirmed by Simon. Patch 3
should cover it because it replaces the check "are any zones ok?" with
"are zones representing at least 25% of the node balanced?"
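
To make the difference between the two checks concrete, here is a minimal
user-space sketch in C. It is not code from the patch: struct zone_sketch,
its fields and node_balanced_sketch() are hypothetical names, and "balanced"
is reduced to a precomputed flag per zone. It only illustrates counting the
pages held by balanced zones and comparing them against a quarter of the
pages the request could use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_sketch {
	unsigned long present_pages;	/* pages managed by this zone */
	bool balanced;			/* assumed: watermark met at the requested order */
};

/*
 * Return true if the balanced zones in zones[0..classzone_idx] together
 * hold at least 25% of the pages present in those zones.
 */
static bool node_balanced_sketch(const struct zone_sketch *zones,
				 int classzone_idx)
{
	unsigned long present = 0, balanced = 0;
	int i;

	assert(zones != NULL && classzone_idx >= 0);

	for (i = 0; i <= classzone_idx; i++) {
		present += zones[i].present_pages;
		if (zones[i].balanced)
			balanced += zones[i].present_pages;
	}

	/* "at least 25%" written without integer-division rounding */
	return balanced * 4 >= present;
}
```

With made-up zone sizes of the shape discussed in this thread (a tiny DMA
zone, a large DMA32 and a smaller Normal), a balanced DMA zone alone would
not satisfy this check, while a balanced DMA32 would.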

> This, for example, will make a high order
> allocation from a high zone fallback to low zone and quickly exhaust low
> zone, for example DMA. This will break some drivers.
> 

The lowmem reserve would prevent that happening so the drivers would be
fine. The real impact is that kswapd would stop when DMA was balanced
even though it was really DMA32 or Normal that needed to be balanced for
proper behaviour.
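
For illustration, the role of the lowmem reserve can be sketched as below.
This is a simplified user-space model of an order-0 watermark check, not the
kernel's zone_watermark_ok(); the struct, its fields, MAX_NR_ZONES_SKETCH and
watermark_ok_sketch() are hypothetical names. A request that could have been
satisfied from a higher zone (a larger classzone_idx) has to leave
lowmem_reserve[classzone_idx] pages free in the lower zone on top of the
watermark, which is why fallback allocations cannot drain DMA.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES_SKETCH 4	/* hypothetical bound for this sketch */

/* Hypothetical, simplified view of the per-zone watermark state. */
struct zone_wm_sketch {
	unsigned long free_pages;
	unsigned long min_wmark;
	/* pages held back from requests whose preferred zone has this index */
	unsigned long lowmem_reserve[MAX_NR_ZONES_SKETCH];
};

/*
 * Order-0 only: the zone is usable for a request with the given
 * classzone_idx if its free pages exceed the watermark plus the
 * reserve kept back from that class of request.
 */
static bool watermark_ok_sketch(const struct zone_wm_sketch *z,
				int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < MAX_NR_ZONES_SKETCH);

	return z->free_pages > z->min_wmark + z->lowmem_reserve[classzone_idx];
}
```

The index selects which reserve is applied, so the same zone can pass the
check for a request that must come from it and fail for a request that is
merely falling back to it.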

On lowmem reserves though, there is another buglet in
sleeping_prematurely. The classzone_idx it uses means that the wrong
lowmem_reserve is used for the majority of allocation requests.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


Thread overview: 14+ messages
2010-11-30 17:15 [RFC PATCH 0/3] Prevent kswapd dumping excessive amounts of memory in response to high-order allocations Mel Gorman
2010-11-30 17:15 ` [PATCH 1/3] mm: kswapd: Stop high-order balancing when any suitable zone is balanced Mel Gorman
2010-12-01  2:13   ` Shaohua Li
2010-12-01  2:23     ` KOSAKI Motohiro
2010-12-01  2:47       ` Shaohua Li
2010-12-01  2:59         ` KOSAKI Motohiro
2010-12-01  3:20           ` Shaohua Li
2010-12-01  3:28             ` KOSAKI Motohiro
2010-12-01  7:40               ` Shaohua Li
2010-12-01  7:52                 ` KOSAKI Motohiro
2010-12-01  8:24                   ` Shaohua Li
2010-12-01 11:07     ` Mel Gorman
2010-11-30 17:15 ` [PATCH 2/3] mm: kswapd: Use the order that kswapd was reclaiming at for sleeping_prematurely() Mel Gorman
2010-11-30 17:15 ` [PATCH 3/3] mm: kswapd: Keep kswapd awake for high-order allocations until a percentage of the node is balanced Mel Gorman
