* [PATCH v2 -mm 0/3] kswapd vs compaction improvements
@ 2012-01-24 18:18 Rik van Riel
  2012-01-24 18:21 ` [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Rik van Riel @ 2012-01-24 18:18 UTC (permalink / raw)
  To: linux-mm
  Cc: lkml, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Andrew Morton, Minchan Kim, KOSAKI Motohiro

Running a kernel with compaction enabled under some memory pressure has
caused my system to run into all kinds of trouble.

One of the more obvious problems is that, while kswapd does not try to
free contiguous pages when CONFIG_COMPACTION is enabled, it does
continue reclaiming until enough contiguous pages have become available.
This can lead to enormous swap storms, where far more memory is freed
than needed and a fair amount of the working set ends up in swap.

A second problem is that memory compaction currently does nothing for
network allocations in the receive path, e.g. jumbo frames, because
those allocations are done in interrupt context and cannot compact
memory themselves.  In the past we have tried to have kswapd invoke
memory compaction, but it used too much CPU time.  The second patch in
this series has kswapd invoke compaction very carefully, taking into
account the desired page order as well as zone->compaction_deferred.
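
In outline, the kswapd side ends up as the sketch below (condensed
from patch 2/3; the zone walk is abbreviated into a hypothetical
for_each_populated_zone_in_pgdat() helper, so see the patch itself for
the real diff):

	/* at the end of balance_pgdat(), once reclaim is done */
	int zones_need_compaction = 1;

	/* hypothetical helper standing in for the usual zoneid loop */
	for_each_populated_zone_in_pgdat(pgdat, zone) {
		/* a zone that already has a free area of the requested
		 * order does not need to be defragmented */
		if (zone_watermark_ok(zone, order,
				      low_wmark_pages(zone), *classzone_idx, 0))
			zones_need_compaction = 0;
	}

	if (zones_need_compaction)
		compact_pgdat(pgdat, order);	/* skips deferred zones */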

I have tested these patches on my system, and things seem to behave
well. Any tests and reviews would be appreciated.


* [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-24 18:18 [PATCH v2 -mm 0/3] kswapd vs compaction improvements Rik van Riel
@ 2012-01-24 18:21 ` Rik van Riel
  2012-01-25 15:00   ` Mel Gorman
  2012-01-24 18:22 ` [PATCH v2 -mm 2/3] mm: kswapd carefully call compaction Rik van Riel
  2012-01-24 18:23 ` [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher Rik van Riel
  2 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2012-01-24 18:21 UTC (permalink / raw)
  To: linux-mm
  Cc: lkml, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Andrew Morton, Minchan Kim, KOSAKI Motohiro

When built with CONFIG_COMPACTION, kswapd does not try to free
contiguous pages.  Because it is not trying, it should also not
test whether it succeeded; doing so can result in continuous
page reclaim until a large fraction of memory is free and a large
part of the working set has been evicted.

In shrink_inactive_list, we should not try to do higher order
(out of LRU order) page isolation unless we really are in
lumpy reclaim mode.  This gives all pages a good amount of time
on the inactive list, so actively used pages get the chance
to be referenced and avoid eviction.

Also remove a line of code that increments balanced right before
exiting the function.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   29 ++++++++++++++++++++++-------
 1 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2880396..0398fab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1512,6 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 	unsigned long nr_writeback = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
 	struct zone *zone = mz->zone;
+	int order = 0;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1522,8 +1523,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 	}
 
 	set_reclaim_mode(priority, sc, false);
-	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
+	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM) {
 		reclaim_mode |= ISOLATE_ACTIVE;
+		order = sc->order;
+	}
 
 	lru_add_drain();
 
@@ -1535,7 +1538,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
 	spin_lock_irq(&zone->lru_lock);
 
 	nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list,
-				     &nr_scanned, sc->order,
+				     &nr_scanned, order,
 				     reclaim_mode, 0, file);
 	if (global_reclaim(sc)) {
 		zone->pages_scanned += nr_scanned;
@@ -2754,7 +2757,7 @@ loop_again:
 		 */
 		for (i = 0; i <= end_zone; i++) {
 			struct zone *zone = pgdat->node_zones + i;
-			int nr_slab;
+			int nr_slab, testorder;
 			unsigned long balance_gap;
 
 			if (!populated_zone(zone))
@@ -2783,11 +2786,25 @@ loop_again:
 			 * gap is either the low watermark or 1%
 			 * of the zone, whichever is smaller.
 			 */
+			testorder = order;
 			balance_gap = min(low_wmark_pages(zone),
 				(zone->present_pages +
 					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
 				KSWAPD_ZONE_BALANCE_GAP_RATIO);
-			if (!zone_watermark_ok_safe(zone, order,
+			/*
+			 * Kswapd reclaims only single pages when
+			 * COMPACTION_BUILD. Trying too hard to get
+			 * contiguous free pages can result in excessive
+			 * amounts of free memory, and useful things
+			 * getting kicked out of memory.
+			 * Limit the amount of reclaim to something sane,
+			 * plus space for compaction to do its thing.
+			 */
+			if (COMPACTION_BUILD) {
+				testorder = 0;
+				balance_gap += 2<<order;
+			}
+			if (!zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone) + balance_gap,
 					end_zone, 0)) {
 				shrink_zone(priority, zone, &sc);
@@ -2816,7 +2833,7 @@ loop_again:
 				continue;
 			}
 
-			if (!zone_watermark_ok_safe(zone, order,
+			if (!zone_watermark_ok_safe(zone, testorder,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2922,8 +2939,6 @@ out:
 
 			/* If balanced, clear the congested flag */
 			zone_clear_flag(zone, ZONE_CONGESTED);
-			if (i <= *classzone_idx)
-				balanced += zone->present_pages;
 		}
 	}
 


* [PATCH v2 -mm 2/3] mm: kswapd carefully call compaction
  2012-01-24 18:18 [PATCH v2 -mm 0/3] kswapd vs compaction improvements Rik van Riel
  2012-01-24 18:21 ` [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled Rik van Riel
@ 2012-01-24 18:22 ` Rik van Riel
  2012-01-25 15:19   ` Mel Gorman
  2012-01-24 18:23 ` [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher Rik van Riel
  2 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2012-01-24 18:22 UTC (permalink / raw)
  To: linux-mm
  Cc: lkml, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Andrew Morton, Minchan Kim, KOSAKI Motohiro

With CONFIG_COMPACTION enabled, kswapd does not try to free
contiguous free pages, even when it is woken for a higher order
request.

This can be bad for e.g. jumbo frame network allocations, which
are done from interrupt context and cannot compact memory themselves.
Higher allocation failure rates than before have been observed in
the network receive path in kernels with compaction enabled.

Teach kswapd to defragment the memory zones in a node, but only
if required and compaction is not deferred in a zone.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
-v2: cleanups suggested by Mel Gorman

 include/linux/compaction.h |    6 +++++
 mm/compaction.c            |   53 +++++++++++++++++++++++++++++---------------
 mm/vmscan.c                |    9 +++++++
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index bb2bbdb..7a9323a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -23,6 +23,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync);
+extern int compact_pgdat(pg_data_t *pgdat, int order);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -62,6 +63,11 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_CONTINUE;
 }
 
+static inline int compact_pgdat(pg_data_t *pgdat, int order)
+{
+	return COMPACT_CONTINUE;
+}
+
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 71a58f6..51ece75 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -653,44 +653,61 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 
 
 /* Compact all zones within a node */
-static int compact_node(int nid)
+static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 {
 	int zoneid;
-	pg_data_t *pgdat;
 	struct zone *zone;
 
-	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
-		return -EINVAL;
-	pgdat = NODE_DATA(nid);
-
 	/* Flush pending updates to the LRU lists */
 	lru_add_drain_all();
 
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
-		struct compact_control cc = {
-			.nr_freepages = 0,
-			.nr_migratepages = 0,
-			.order = -1,
-			.sync = true,
-		};
 
 		zone = &pgdat->node_zones[zoneid];
 		if (!populated_zone(zone))
 			continue;
 
-		cc.zone = zone;
-		INIT_LIST_HEAD(&cc.freepages);
-		INIT_LIST_HEAD(&cc.migratepages);
+		cc->nr_freepages = 0;
+		cc->nr_migratepages = 0;
+		cc->zone = zone;
+		INIT_LIST_HEAD(&cc->freepages);
+		INIT_LIST_HEAD(&cc->migratepages);
 
-		compact_zone(zone, &cc);
+		if (cc->order < 0 || !compaction_deferred(zone))
+			compact_zone(zone, cc);
 
-		VM_BUG_ON(!list_empty(&cc.freepages));
-		VM_BUG_ON(!list_empty(&cc.migratepages));
+		VM_BUG_ON(!list_empty(&cc->freepages));
+		VM_BUG_ON(!list_empty(&cc->migratepages));
 	}
 
 	return 0;
 }
 
+int compact_pgdat(pg_data_t *pgdat, int order)
+{
+	struct compact_control cc = {
+		.order = order,
+		.sync = false,
+	};
+
+	return __compact_pgdat(pgdat, &cc);
+}
+
+static int compact_node(int nid)
+{
+	pg_data_t *pgdat;
+	struct compact_control cc = {
+		.order = -1,
+		.sync = true,
+	};
+
+	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	return __compact_pgdat(pgdat, &cc);
+}
+
 /* Compact all nodes in the system */
 static int compact_nodes(void)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0398fab..fa17794 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2673,6 +2673,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	int priority;
 	int i;
 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+	int zones_need_compaction = 1;
 	unsigned long total_scanned;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_soft_reclaimed;
@@ -2937,9 +2938,17 @@ out:
 				goto loop_again;
 			}
 
+			/* Check if the memory needs to be defragmented. */
+			if (zone_watermark_ok(zone, order,
+				    low_wmark_pages(zone), *classzone_idx, 0))
+				zones_need_compaction = 0;
+
 			/* If balanced, clear the congested flag */
 			zone_clear_flag(zone, ZONE_CONGESTED);
 		}
+
+		if (zones_need_compaction)
+			compact_pgdat(pgdat, order);
 	}
 
 	/*


* [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher
  2012-01-24 18:18 [PATCH v2 -mm 0/3] kswapd vs compaction improvements Rik van Riel
  2012-01-24 18:21 ` [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled Rik van Riel
  2012-01-24 18:22 ` [PATCH v2 -mm 2/3] mm: kswapd carefully call compaction Rik van Riel
@ 2012-01-24 18:23 ` Rik van Riel
  2012-01-25 15:41   ` Mel Gorman
  2 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2012-01-24 18:23 UTC (permalink / raw)
  To: linux-mm
  Cc: lkml, Andrea Arcangeli, Mel Gorman, Johannes Weiner,
	Andrew Morton, Minchan Kim, KOSAKI Motohiro

Currently a failed order-9 (transparent hugepage) compaction can
lead to memory compaction being temporarily disabled for a memory
zone, even if we only need compaction for an order-2 allocation,
e.g. for jumbo frame networking.

The fix is relatively straightforward: keep track of the order at
which compaction failed to create a free memory area. Only defer
compaction at that order and higher, while letting compaction go
through for lower orders.
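
As a worked example (illustrative only): if sync compaction has just
failed at order 9, the zone ends up with compact_order_failed == 9,
and subsequent checks behave as follows:

	compaction_deferred(zone, 2)  -> false; order-2 compaction still runs
	compaction_deferred(zone, 9)  -> the usual compact_defer_shift
	                                 back-off applies
	compaction_deferred(zone, 10) -> the back-off applies as well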

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   14 ++++++++++----
 include/linux/mmzone.h     |    1 +
 mm/compaction.c            |   11 ++++++++++-
 mm/page_alloc.c            |    6 ++++--
 mm/vmscan.c                |    2 +-
 5 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7a9323a..51a90b7 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -34,20 +34,26 @@ extern unsigned long compaction_suitable(struct zone *zone, int order);
  * allocation success. 1 << compact_defer_limit compactions are skipped up
  * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
  */
-static inline void defer_compaction(struct zone *zone)
+static inline void defer_compaction(struct zone *zone, int order)
 {
 	zone->compact_considered = 0;
 	zone->compact_defer_shift++;
 
+	if (order < zone->compact_order_failed)
+		zone->compact_order_failed = order;
+
 	if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
 		zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
 }
 
 /* Returns true if compaction should be skipped this time */
-static inline bool compaction_deferred(struct zone *zone)
+static inline bool compaction_deferred(struct zone *zone, int order)
 {
 	unsigned long defer_limit = 1UL << zone->compact_defer_shift;
 
+	if (order < zone->compact_order_failed)
+		return false;
+
 	/* Avoid possible overflow */
 	if (++zone->compact_considered > defer_limit)
 		zone->compact_considered = defer_limit;
@@ -73,11 +79,11 @@ static inline unsigned long compaction_suitable(struct zone *zone, int order)
 	return COMPACT_SKIPPED;
 }
 
-static inline void defer_compaction(struct zone *zone)
+static inline void defer_compaction(struct zone *zone, int order)
 {
 }
 
-static inline bool compaction_deferred(struct zone *zone)
+static inline bool compaction_deferred(struct zone *zone, int order)
 {
 	return 1;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 650ba2f..dff7115 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -365,6 +365,7 @@ struct zone {
 	 */
 	unsigned int		compact_considered;
 	unsigned int		compact_defer_shift;
+	int			compact_order_failed;
 #endif
 
 	ZONE_PADDING(_pad1_)
diff --git a/mm/compaction.c b/mm/compaction.c
index 51ece75..e8cff81 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -673,9 +673,18 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 		INIT_LIST_HEAD(&cc->freepages);
 		INIT_LIST_HEAD(&cc->migratepages);
 
-		if (cc->order < 0 || !compaction_deferred(zone))
+		if (cc->order < 0 || !compaction_deferred(zone, cc->order))
 			compact_zone(zone, cc);
 
+		if (cc->order > 0) {
+			int ok = zone_watermark_ok(zone, cc->order,
+						low_wmark_pages(zone), 0, 0);
+			if (ok && cc->order > zone->compact_order_failed)
+				zone->compact_order_failed = cc->order + 1;
+			else if (!ok && cc->sync)
+				defer_compaction(zone, cc->order);
+		}
+
 		VM_BUG_ON(!list_empty(&cc->freepages));
 		VM_BUG_ON(!list_empty(&cc->migratepages));
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0027d8f..cd617d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1990,7 +1990,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	if (!order)
 		return NULL;
 
-	if (compaction_deferred(preferred_zone)) {
+	if (compaction_deferred(preferred_zone, order)) {
 		*deferred_compaction = true;
 		return NULL;
 	}
@@ -2012,6 +2012,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		if (page) {
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
+			if (order >= preferred_zone->compact_order_failed)
+				preferred_zone->compact_order_failed = order + 1;
 			count_vm_event(COMPACTSUCCESS);
 			return page;
 		}
@@ -2028,7 +2030,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		 * defer if the failure was a sync compaction failure.
 		 */
 		if (sync_migration)
-			defer_compaction(preferred_zone);
+			defer_compaction(preferred_zone, order);
 
 		cond_resched();
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa17794..5d65991 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2199,7 +2199,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	 * If compaction is deferred, reclaim up to a point where
 	 * compaction will have a chance of success when re-enabled
 	 */
-	if (compaction_deferred(zone))
+	if (compaction_deferred(zone, sc->order))
 		return watermark_ok;
 
 	/* If compaction is not ready to start, keep reclaiming */


* Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-24 18:21 ` [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled Rik van Riel
@ 2012-01-25 15:00   ` Mel Gorman
  2012-01-25 15:27     ` Rik van Riel
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2012-01-25 15:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On Tue, Jan 24, 2012 at 01:21:36PM -0500, Rik van Riel wrote:
> When built with CONFIG_COMPACTION, kswapd does not try to free
> contiguous pages. 

balance_pgdat() gets its order from wakeup_kswapd(). This does not apply
to THP because kswapd does not get woken for THP but it should be woken
up for allocations like jumbo frames or order-1.

This order is passed to shrink_zone(), shrink_list, shrink_inactive_list
and to isolate_lru_pages() which will isolate pages within a naturally
aligned boundary. This happens even with CONFIG_COMPACTION enabled.
There have been a number of changes in vmscan.c recently so maybe we are
looking at different versions but why do you say it doesn't free
contiguous pages?

That said, this is lumpy-reclaim-like behaviour and we want to move
away from it. There is no guarantee that there are pages that can be
isolated in the naturally aligned region, which is particularly true if
swap is disabled. That move has not happened yet, as each cycle had
changes to compaction that took priority (the most recent being
sync-light). I'm glad you picked this up, so thanks.

As kswapd does no memory compaction itself, this patch still makes
sense but I found the changelog misleading.

> Because it is not trying, it should also not
> test whether it succeeded; doing so can result in continuous
> page reclaim until a large fraction of memory is free and a large
> part of the working set has been evicted.
> 
> In shrink_inactive_list, we should not try to do higher order
> (out of LRU order) page isolation unless we really are in
> lumpy reclaim mode.  This gives all pages a good amount of time
> on the inactive list, so actively used pages get the chance
> to be referenced and avoid eviction.
> 
> Also remove a line of code that increments balanced right before
> exiting the function.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   29 ++++++++++++++++++++++-------
>  1 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2880396..0398fab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1512,6 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
>  	unsigned long nr_writeback = 0;
>  	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
>  	struct zone *zone = mz->zone;
> +	int order = 0;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1522,8 +1523,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
>  	}
>  
>  	set_reclaim_mode(priority, sc, false);
> -	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
> +	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM) {
>  		reclaim_mode |= ISOLATE_ACTIVE;
> +		order = sc->order;
> +	}
>  
>  	lru_add_drain();
>  

This is a nit-pick, but I would far prefer if you did not bypass
sc->order like this and instead changed isolate_lru_pages to do a

if (!order || !(sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM))
	continue;

That would very clearly mark where LUMPYRECLAIM takes effect in
isolate_lru_pages() and make deleting LUMPYRECLAIM easier in the
future.

The second, non-obvious side-effect of this change is that kswapd
will now isolate fewer pages per cycle, because it will isolate
SWAP_CLUSTER_MAX pages instead of the SWAP_CLUSTER_MAX<<order it
potentially does currently. This is not wrong as such, and may be
desirable to limit how much reclaim kswapd does, but it potentially
impacts success rates for compaction. As this does not apply to THP,
it will be difficult to detect, but bear it in mind if we see an
increase in high-order allocation failures after this patch is merged.
I am not suggesting a change here, but it would be nice to note this
in the changelog if there is a new version of this patch.
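
For scale (assuming SWAP_CLUSTER_MAX is 32 and 4kB pages; the order-9
line is only the extreme case, since kswapd is not woken for THP):

	order-2 (jumbo frames): up to 32 << 2 =   128 pages (512kB) isolated
	                        per cycle before, 32 pages (128kB) after
	order-9 (THP):          up to 32 << 9 = 16384 pages (64MB) isolated
	                        per cycle before, 32 pages (128kB) after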

> @@ -1535,7 +1538,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
>  	spin_lock_irq(&zone->lru_lock);
>  
>  	nr_taken = isolate_lru_pages(nr_to_scan, mz, &page_list,
> -				     &nr_scanned, sc->order,
> +				     &nr_scanned, order,
>  				     reclaim_mode, 0, file);
>  	if (global_reclaim(sc)) {
>  		zone->pages_scanned += nr_scanned;
> @@ -2754,7 +2757,7 @@ loop_again:
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
>  			struct zone *zone = pgdat->node_zones + i;
> -			int nr_slab;
> +			int nr_slab, testorder;
>  			unsigned long balance_gap;
>  
>  			if (!populated_zone(zone))
> @@ -2783,11 +2786,25 @@ loop_again:
>  			 * gap is either the low watermark or 1%
>  			 * of the zone, whichever is smaller.
>  			 */
> +			testorder = order;
>  			balance_gap = min(low_wmark_pages(zone),
>  				(zone->present_pages +
>  					KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
>  				KSWAPD_ZONE_BALANCE_GAP_RATIO);
> -			if (!zone_watermark_ok_safe(zone, order,
> +			/*
> +			 * Kswapd reclaims only single pages when
> +			 * COMPACTION_BUILD. Trying too hard to get
> +			 * contiguous free pages can result in excessive
> +			 * amounts of free memory, and useful things
> +			 * getting kicked out of memory.
> +			 * Limit the amount of reclaim to something sane,
> +			 * plus space for compaction to do its thing.
> +			 */
> +			if (COMPACTION_BUILD) {
> +				testorder = 0;
> +				balance_gap += 2<<order;
> +			}

This 2<<order logic now appears in a few places. I am not expecting
it to be handled in this patch, but at some point the various
different balance_gap calculations need to be put in one place.
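
For illustration, it could eventually be collapsed into something like
the helper below (a sketch only; the helper name and its placement are
made up, not something in this series):

	static unsigned long zone_balance_gap(struct zone *zone, int order)
	{
		/* gap is either the low watermark or 1% of the zone,
		 * whichever is smaller */
		unsigned long gap = min(low_wmark_pages(zone),
			(zone->present_pages +
				KSWAPD_ZONE_BALANCE_GAP_RATIO - 1) /
				KSWAPD_ZONE_BALANCE_GAP_RATIO);

		/* leave room for compaction to do its work */
		if (COMPACTION_BUILD)
			gap += 2 << order;

		return gap;
	}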

> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone) + balance_gap,
>  					end_zone, 0)) {
>  				shrink_zone(priority, zone, &sc);
> @@ -2816,7 +2833,7 @@ loop_again:
>  				continue;
>  			}
>  
> -			if (!zone_watermark_ok_safe(zone, order,
> +			if (!zone_watermark_ok_safe(zone, testorder,
>  					high_wmark_pages(zone), end_zone, 0)) {
>  				all_zones_ok = 0;
>  				/*
> @@ -2922,8 +2939,6 @@ out:
>  
>  			/* If balanced, clear the congested flag */
>  			zone_clear_flag(zone, ZONE_CONGESTED);
> -			if (i <= *classzone_idx)
> -				balanced += zone->present_pages;
>  		}

Why is this being deleted? It is still used by pgdat_balanced().

>  	}
>  

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 -mm 2/3] mm: kswapd carefully call compaction
  2012-01-24 18:22 ` [PATCH v2 -mm 2/3] mm: kswapd carefully call compaction Rik van Riel
@ 2012-01-25 15:19   ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2012-01-25 15:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On Tue, Jan 24, 2012 at 01:22:43PM -0500, Rik van Riel wrote:
> With CONFIG_COMPACTION enabled, kswapd does not try to free
> contiguous free pages, even when it is woken for a higher order
> request.
> 
> This can be bad for e.g. jumbo frame network allocations, which
> are done from interrupt context and cannot compact memory themselves.
> Higher allocation failure rates than before have been observed in
> the network receive path in kernels with compaction enabled.
> 
> Teach kswapd to defragment the memory zones in a node, but only
> if required and compaction is not deferred in a zone.
> 

We used to do something vaguely like this in the past and it was
reverted because compaction was stalling for too long. With the
merging of sync-light, this should be less of an issue, but we should
be watchful of high CPU usage from kswapd with too much time spent
in memory compaction, even though I recognise that compaction takes
place in kswapd's exit path. In 3.3-rc1, there is a risk of high
CPU usage anyway because kswapd may be scanning over large numbers
of dirty pages it is no longer writing, so care will be needed to
distinguish between different high CPU usage problems.

That said, I didn't spot any obvious problems so;

Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-25 15:00   ` Mel Gorman
@ 2012-01-25 15:27     ` Rik van Riel
  2012-01-25 16:07       ` Mel Gorman
  2012-01-25 22:16       ` Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: Rik van Riel @ 2012-01-25 15:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On 01/25/2012 10:00 AM, Mel Gorman wrote:
> On Tue, Jan 24, 2012 at 01:21:36PM -0500, Rik van Riel wrote:
>> When built with CONFIG_COMPACTION, kswapd does not try to free
>> contiguous pages.
>
> balance_pgdat() gets its order from wakeup_kswapd(). This does not apply
> to THP because kswapd does not get woken for THP but it should be woken
> up for allocations like jumbo frames or order-1.

In the kernel I run at home, I wake up kswapd for THP
as well. This is a larger change, which Andrea asked
me to delay submitting upstream for a bit.

So far there seem to be no ill effects. I'll continue
watching for them.

> As kswapd does no memory compaction itself, this patch still makes
> sense but I found the changelog misleading.

Fair enough.  I will adjust the changelog.

>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2880396..0398fab 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1512,6 +1512,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
>>   	unsigned long nr_writeback = 0;
>>   	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
>>   	struct zone *zone = mz->zone;
>> +	int order = 0;
>>
>>   	while (unlikely(too_many_isolated(zone, file, sc))) {
>>   		congestion_wait(BLK_RW_ASYNC, HZ/10);
>> @@ -1522,8 +1523,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct mem_cgroup_zone *mz,
>>   	}
>>
>>   	set_reclaim_mode(priority, sc, false);
>> -	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
>> +	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM) {
>>   		reclaim_mode |= ISOLATE_ACTIVE;
>> +		order = sc->order;
>> +	}
>>
>>   	lru_add_drain();
>>
>
> This is a nit-pick, but I would far prefer if you did not bypass
> sc->order like this and instead changed isolate_lru_pages to do a
>
> if (!order || !(sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM))
> 	continue;
>
> That would very clearly mark where LUMPYRECLAIM takes effect in
> isolate_lru_pages() and make deleting LUMPYRECLAIM easier in the
> future.

OK, I will do things that way instead.

> The second, non-obvious side-effect of this change is that kswapd
> will now isolate fewer pages per cycle, because it will isolate
> SWAP_CLUSTER_MAX pages instead of the SWAP_CLUSTER_MAX<<order it
> potentially does currently. This is not wrong as such, and may be
> desirable to limit how much reclaim kswapd does, but it potentially
> impacts success rates for compaction. As this does not apply to THP,
> it will be difficult to detect, but bear it in mind if we see an
> increase in high-order allocation failures after this patch is merged.
> I am not suggesting a change here, but it would be nice to note this
> in the changelog if there is a new version of this patch.

Good point.  I am running with THP waking up kswapd, and
things seem to behave (and compaction seems to succeed),
but we might indeed want to change balance_pgdat to free
more pages for higher order allocations.

Maybe this is the place to check (in balance_pgdat)?

                 /*
                  * We do this so kswapd doesn't build up large priorities for
                  * example when it is freeing in parallel with allocators. It
                  * matches the direct reclaim path behaviour in terms of impact
                  * on zone->*_priority.
                  */
                 if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
                         break;

>> @@ -2922,8 +2939,6 @@ out:
>>
>>   			/* If balanced, clear the congested flag */
>>   			zone_clear_flag(zone, ZONE_CONGESTED);
>> -			if (i <= *classzone_idx)
>> -				balanced += zone->present_pages;
>>   		}
>
> Why is this being deleted? It is still used by pgdat_balanced().

This is outside of the big while loop and is not used again
in the function.  This final for loop does not appear to
use the variable balanced at all, except for incrementing
it.



* Re: [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher
  2012-01-24 18:23 ` [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher Rik van Riel
@ 2012-01-25 15:41   ` Mel Gorman
  2012-01-25 15:55     ` Rik van Riel
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2012-01-25 15:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On Tue, Jan 24, 2012 at 01:23:32PM -0500, Rik van Riel wrote:
> Currently a failed order-9 (transparent hugepage) compaction can
> lead to memory compaction being temporarily disabled for a memory
> zone, even if we only need compaction for an order-2 allocation,
> e.g. for jumbo frame networking.
> 

This is a throwback to when compaction was introduced just for
THP and hugetlbfs and compaction did not take place for order <=
PAGE_ALLOC_COSTLY_ORDER.

> The fix is relatively straightforward: keep track of the order at
> which compaction failed to create a free memory area. Only defer
> compaction at that order and higher, while letting compaction go
> through for lower orders.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/compaction.h |   14 ++++++++++----
>  include/linux/mmzone.h     |    1 +
>  mm/compaction.c            |   11 ++++++++++-
>  mm/page_alloc.c            |    6 ++++--
>  mm/vmscan.c                |    2 +-
>  5 files changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7a9323a..51a90b7 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -34,20 +34,26 @@ extern unsigned long compaction_suitable(struct zone *zone, int order);
>   * allocation success. 1 << compact_defer_limit compactions are skipped up
>   * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
>   */
> -static inline void defer_compaction(struct zone *zone)
> +static inline void defer_compaction(struct zone *zone, int order)
>  {
>  	zone->compact_considered = 0;
>  	zone->compact_defer_shift++;
>  
> +	if (order < zone->compact_order_failed)
> +		zone->compact_order_failed = order;
> +
>  	if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
>  		zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
>  }
>  
>  /* Returns true if compaction should be skipped this time */
> -static inline bool compaction_deferred(struct zone *zone)
> +static inline bool compaction_deferred(struct zone *zone, int order)
>  {
>  	unsigned long defer_limit = 1UL << zone->compact_defer_shift;
>  
> +	if (order < zone->compact_order_failed)
> +		return false;
> +
>  	/* Avoid possible overflow */
>  	if (++zone->compact_considered > defer_limit)
>  		zone->compact_considered = defer_limit;
> @@ -73,11 +79,11 @@ static inline unsigned long compaction_suitable(struct zone *zone, int order)
>  	return COMPACT_SKIPPED;
>  }
>  
> -static inline void defer_compaction(struct zone *zone)
> +static inline void defer_compaction(struct zone *zone, int order)
>  {
>  }
>  
> -static inline bool compaction_deferred(struct zone *zone)
> +static inline bool compaction_deferred(struct zone *zone, int order)
>  {
>  	return 1;
>  }
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 650ba2f..dff7115 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -365,6 +365,7 @@ struct zone {
>  	 */
>  	unsigned int		compact_considered;
>  	unsigned int		compact_defer_shift;
> +	int			compact_order_failed;
>  #endif
>  
>  	ZONE_PADDING(_pad1_)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 51ece75..e8cff81 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -673,9 +673,18 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
>  		INIT_LIST_HEAD(&cc->freepages);
>  		INIT_LIST_HEAD(&cc->migratepages);
>  
> -		if (cc->order < 0 || !compaction_deferred(zone))
> +		if (cc->order < 0 || !compaction_deferred(zone, cc->order))
>  			compact_zone(zone, cc);
>  
> +		if (cc->order > 0) {
> +			int ok = zone_watermark_ok(zone, cc->order,
> +						low_wmark_pages(zone), 0, 0);
> +			if (ok && cc->order > zone->compact_order_failed)
> +				zone->compact_order_failed = cc->order + 1;
> +			else if (!ok && cc->sync)
> +				defer_compaction(zone, cc->order);
> +		}
> +

That needs a comment. I think what you're trying to do is reset
compact_order_failed once compaction is successful.

The "!ok && cc->sync" check may be broken. __compact_pgdat() is
called from kswapd, not direct compaction so cc->sync will not be true.
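
Something along these lines above the block would do (the wording is
only a suggestion):

		/*
		 * If the zone now has enough free memory for this order,
		 * raise compact_order_failed so compaction is no longer
		 * deferred at this order; if not, and this was sync
		 * compaction, defer further attempts at this order.
		 */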

>  		VM_BUG_ON(!list_empty(&cc->freepages));
>  		VM_BUG_ON(!list_empty(&cc->migratepages));
>  	}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0027d8f..cd617d9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1990,7 +1990,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	if (!order)
>  		return NULL;
>  
> -	if (compaction_deferred(preferred_zone)) {
> +	if (compaction_deferred(preferred_zone, order)) {
>  		*deferred_compaction = true;
>  		return NULL;
>  	}
> @@ -2012,6 +2012,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		if (page) {
>  			preferred_zone->compact_considered = 0;
>  			preferred_zone->compact_defer_shift = 0;
> +			if (order >= preferred_zone->compact_order_failed)
> +				preferred_zone->compact_order_failed = order + 1;
>  			count_vm_event(COMPACTSUCCESS);
>  			return page;
>  		}
> @@ -2028,7 +2030,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		 * defer if the failure was a sync compaction failure.
>  		 */
>  		if (sync_migration)
> -			defer_compaction(preferred_zone);
> +			defer_compaction(preferred_zone, order);
>  
>  		cond_resched();
>  	}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa17794..5d65991 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2199,7 +2199,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
>  	 * If compaction is deferred, reclaim up to a point where
>  	 * compaction will have a chance of success when re-enabled
>  	 */
> -	if (compaction_deferred(zone))
> +	if (compaction_deferred(zone, sc->order))
>  		return watermark_ok;
>  
>  	/* If compaction is not ready to start, keep reclaiming */
> 

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher
  2012-01-25 15:41   ` Mel Gorman
@ 2012-01-25 15:55     ` Rik van Riel
  2012-01-25 16:21       ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: Rik van Riel @ 2012-01-25 15:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On 01/25/2012 10:41 AM, Mel Gorman wrote:

>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -673,9 +673,18 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
>>   		INIT_LIST_HEAD(&cc->freepages);
>>   		INIT_LIST_HEAD(&cc->migratepages);
>>
>> -		if (cc->order < 0 || !compaction_deferred(zone))
>> +		if (cc->order < 0 || !compaction_deferred(zone, cc->order))
>>   			compact_zone(zone, cc);
>>
>> +		if (cc->order > 0) {
>> +			int ok = zone_watermark_ok(zone, cc->order,
>> +						low_wmark_pages(zone), 0, 0);
>> +			if (ok && cc->order > zone->compact_order_failed)
>> +				zone->compact_order_failed = cc->order + 1;
>> +			else if (!ok && cc->sync)
>> +				defer_compaction(zone, cc->order);
>> +		}
>> +
>
> That needs a comment. I think what you're trying to do is reset
> compact_order_failed once compaction is successful.
>
> The "!ok && cc->sync" check may be broken. __compact_pgdat() is
> called from kswapd, not direct compaction so cc->sync will not be true.

The problem with doing that is that we would be deferring
synchronous compaction (by allocators), just because
asynchronous compaction from kswapd failed...

That is the reason the code is like it is above.  And
indeed, it will not defer compaction from this code path
right now.

Then again, neither does async compaction from page
allocators defer compaction - only sync compaction does.

If it turns out we need a separate compaction deferral
for async compaction, we can always introduce that later,
and this code will be ready for it.

If you prefer, I can replace the whole "else if" bit with
a big fat comment explaining why we cannot currently
defer compaction from this point.


* Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-25 15:27     ` Rik van Riel
@ 2012-01-25 16:07       ` Mel Gorman
  2012-01-25 17:17         ` Rik van Riel
  2012-01-25 22:16       ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2012-01-25 16:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On Wed, Jan 25, 2012 at 10:27:28AM -0500, Rik van Riel wrote:
> On 01/25/2012 10:00 AM, Mel Gorman wrote:
> >On Tue, Jan 24, 2012 at 01:21:36PM -0500, Rik van Riel wrote:
> >>When built with CONFIG_COMPACTION, kswapd does not try to free
> >>contiguous pages.
> >
> >balance_pgdat() gets its order from wakeup_kswapd(). This does not apply
> >to THP because kswapd does not get woken for THP but it should be woken
> >up for allocations like jumbo frames or order-1.
> 
> In the kernel I run at home, I wake up kswapd for THP
> as well. This is a larger change, which Andrea asked
> me to delay submitting upstream for a bit.
> 

Ok, good call. Waking kswapd up for THP is still premature.

> So far there seem to be no ill effects. I'll continue
> watching for them.
> 
> >As kswapd does no memory compaction itself, this patch still makes
> >sense but I found the changelog misleading.
> 
> Fair enough.  I will adjust the changelog.
> 

Thanks.

> ><SNIP>
> >The second, non-obvious side-effect of this change is that kswapd
> >will now isolate fewer pages per cycle, because it will isolate
> >SWAP_CLUSTER_MAX pages instead of the SWAP_CLUSTER_MAX<<order it
> >potentially does currently. This is not wrong as such, and may be
> >desirable to limit how much reclaim kswapd does, but it potentially
> >impacts success rates for compaction. As this does not apply to THP,
> >it will be difficult to detect, but bear it in mind if we see an
> >increase in high-order allocation failures after this patch is merged.
> >I am not suggesting a change here, but it would be nice to note this
> >in the changelog if there is a new version of this patch.
> 
> Good point.  I am running with THP waking up kswapd, and
> things seem to behave (and compaction seems to succeed),
> but we might indeed want to change balance_pgdat to free
> more pages for higher order allocations.
> 
> Maybe this is the place to check (in balance_pgdat)?
> 
>                 /*
>                  * We do this so kswapd doesn't build up large priorities for
>                  * example when it is freeing in parallel with allocators. It
>                  * matches the direct reclaim path behaviour in terms of impact
>                  * on zone->*_priority.
>                  */
>                 if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>                         break;
> 

It would be a good place all right. Preferably it would tie into
compaction_ready() to decide whether to continue reclaiming or not.
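
Roughly this kind of thing, inside the per-zone loop of balance_pgdat()
(purely a sketch; the exact placement and condition are assumptions and
would need testing):

			/*
			 * Stop reclaiming from a zone once it has enough
			 * free memory for compaction to take over at the
			 * order kswapd was woken for.
			 */
			if (COMPACTION_BUILD && order &&
			    compaction_ready(zone, &sc))
				continue;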

> >>@@ -2922,8 +2939,6 @@ out:
> >>
> >>  			/* If balanced, clear the congested flag */
> >>  			zone_clear_flag(zone, ZONE_CONGESTED);
> >>-			if (i <= *classzone_idx)
> >>-				balanced += zone->present_pages;
> >>  		}
> >
> >Why is this being deleted? It is still used by pgdat_balanced().
> 
> This is outside of the big while loop and is not used again
> in the function. 

How about here?

               if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
                        break;          /* kswapd: all done */

Either way, it looks like something that should be in its own patch.

> This final for loop does not appear to
> use the variable balanced at all, except for incrementing
> it.
> 

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 -mm 3/3] mm: only defer compaction for failed order and higher
  2012-01-25 15:55     ` Rik van Riel
@ 2012-01-25 16:21       ` Mel Gorman
  0 siblings, 0 replies; 14+ messages in thread
From: Mel Gorman @ 2012-01-25 16:21 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On Wed, Jan 25, 2012 at 10:55:05AM -0500, Rik van Riel wrote:
> On 01/25/2012 10:41 AM, Mel Gorman wrote:
> 
> >>--- a/mm/compaction.c
> >>+++ b/mm/compaction.c
> >>@@ -673,9 +673,18 @@ static int __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
> >>  		INIT_LIST_HEAD(&cc->freepages);
> >>  		INIT_LIST_HEAD(&cc->migratepages);
> >>
> >>-		if (cc->order < 0 || !compaction_deferred(zone))
> >>+		if (cc->order < 0 || !compaction_deferred(zone, cc->order))
> >>  			compact_zone(zone, cc);
> >>
> >>+		if (cc->order > 0) {
> >>+			int ok = zone_watermark_ok(zone, cc->order,
> >>+						low_wmark_pages(zone), 0, 0);
> >>+			if (ok && cc->order > zone->compact_order_failed)
> >>+				zone->compact_order_failed = cc->order + 1;
> >>+			else if (!ok && cc->sync)
> >>+				defer_compaction(zone, cc->order);
> >>+		}
> >>+
> >
> >That needs a comment. I think what you're trying to do is reset
> >compact_order_failed once compaction is successful.
> >
> >The "!ok && cc->sync" check may be broken. __compact_pgdat() is
> >called from kswapd, not direct compaction so cc->sync will not be true.
> 
> The problem with doing that is that we would be deferring
> synchronous compaction (by allocators), just because
> asynchronous compaction from kswapd failed...
> 

I should have been clear. I was not suggesting that we defer compaction
here for !cc->sync. I was pointing out that the code as-is is dead.

> That is the reason the code is like it is above.  And
> indeed, it will not defer compaction from this code path
> right now.
> 

Ok, that was my understanding, I just wanted to be sure I understood
your intentions.

> Then again, neither does async compaction from page
> allocators defer compaction - only sync compaction does.
> 

Yep, this is on purpose.

> If it turns out we need a separate compaction deferral
> for async compaction, we can always introduce that later,
> and this code will be ready for it.
> 

Ok, I can accept that.

> If you prefer, I can replace the whole "else if" bit with
> a big fat comment explaining why we cannot currently
> defer compaction from this point.
> 

Explaining that it is dead code for kswapd would also do.

If you do replace the code, add a WARN_ON(cc->sync) in case the
assumption changes in the future so defer_compaction() gets thought
about properly.
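
i.e. the block from patch 2/3 would then end up roughly as below (a
sketch of the variant being discussed here, not a posted patch):

		if (cc->order > 0) {
			int ok = zone_watermark_ok(zone, cc->order,
						low_wmark_pages(zone), 0, 0);
			if (ok && cc->order > zone->compact_order_failed)
				zone->compact_order_failed = cc->order + 1;
			/*
			 * Only sync compaction defers, and kswapd never does
			 * sync compaction from this path today; warn if that
			 * assumption ever changes so defer_compaction() gets
			 * thought about properly.
			 */
			WARN_ON(cc->sync);
		}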

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-25 16:07       ` Mel Gorman
@ 2012-01-25 17:17         ` Rik van Riel
  0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2012-01-25 17:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, lkml, Andrea Arcangeli, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On 01/25/2012 11:07 AM, Mel Gorman wrote:
> On Wed, Jan 25, 2012 at 10:27:28AM -0500, Rik van Riel wrote:

>> Maybe this is the place to check (in balance_pgdat)?
>>
>>                  /*
>>                   * We do this so kswapd doesn't build up large priorities for
>>                   * example when it is freeing in parallel with allocators. It
>>                   * matches the direct reclaim path behaviour in terms of impact
>>                   * on zone->*_priority.
>>                   */
>>                  if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
>>                          break;
>>
>
> It would be a good place all right. Preferably it would tie into
> compaction_ready() to decide whether to continue reclaiming or not.

Good point. We can just track compaction_ready for all
of the zones and bail out if at least one of the zones
is ready for compaction (somehow taking the classzone
index into account?).

Or maybe simply keep reclaiming order 0 pages until all
the memory zones in the pgdat are ready for compaction?

>>>> @@ -2922,8 +2939,6 @@ out:
>>>>
>>>>   			/* If balanced, clear the congested flag */
>>>>   			zone_clear_flag(zone, ZONE_CONGESTED);
>>>> -			if (i <= *classzone_idx)
>>>> -				balanced += zone->present_pages;
>>>>   		}
>>>
>>> Why is this being deleted? It is still used by pgdat_balanced().
>>
>> This is outside of the big while loop and is not used again
>> in the function.
>
> How about here?
>
>                 if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
>                          break;          /* kswapd: all done */
>
> Either way, it looks like something that should be in its own patch.

That code is above the lines that I removed.

The only way to get back there is to jump up to
loop_again, from where we will get to the main loop,
where balanced is zeroed out.

If you want I will split this out into its own patch.


* Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-25 15:27     ` Rik van Riel
  2012-01-25 16:07       ` Mel Gorman
@ 2012-01-25 22:16       ` Andrea Arcangeli
  2012-01-26  2:12         ` Rik van Riel
  1 sibling, 1 reply; 14+ messages in thread
From: Andrea Arcangeli @ 2012-01-25 22:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, linux-mm, lkml, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On Wed, Jan 25, 2012 at 10:27:28AM -0500, Rik van Riel wrote:
> On 01/25/2012 10:00 AM, Mel Gorman wrote:
> > On Tue, Jan 24, 2012 at 01:21:36PM -0500, Rik van Riel wrote:
> >> When built with CONFIG_COMPACTION, kswapd does not try to free
> >> contiguous pages.
> >
> > balance_pgdat() gets its order from wakeup_kswapd(). This does not apply
> > to THP because kswapd does not get woken for THP but it should be woken
> > up for allocations like jumbo frames or order-1.
> 
> In the kernel I run at home, I wake up kswapd for THP
> as well. This is a larger change, which Andrea asked
> me to delay submitting upstream for a bit.
> 
> So far there seem to be no ill effects. I'll continue
> watching for them.

The only problem we had last time, when we managed to add compaction in
kswapd upstream, was that the kswapd wakeup frequency was too high,
which kept kswapd spinning at 100% load and destroyed specsfs
performance. It may have been a fundamental problem of compaction not
being worthwhile to run to generate jumbo frames, because the cost of
migrating memory, copying, and flushing ptes is significantly higher
than the benefit of doing the network DMA in larger chunks and saving a
few interrupts. This is why the logic needs to be tested with specsfs
on nfs and stuff that triggers jumbo frames heavily and generates huge
network traffic.

About THP, it may not cause problems because the allocation rate
is much slower.

OTOH THP pages are bigger, so they would make life harder for
compaction than the smaller orders requested by the jumbo
frames.

The main reason THP doesn't strictly need this is that those
allocations can block, so THP can run compaction in direct reclaim.
Jumbo frames, in contrast, can't wait, so they absolutely need kswapd
if we ever intend them to benefit from compaction.

I think the idea is sound; we just need to know whether we need to
enable it selectively and tell jumbo frames to use
__GFP_NO_KSWAPD_COMPACTION or something, to disable only compaction but
still free memory, as we'll fall back to a 4k allocation in the nic
driver. (But if we needed that, it would call into question whether
this is worthwhile, as jumbo frames were the ones intended to benefit;
most other order > 0 allocations don't run in atomic context.)

I'm still quite afraid that compaction in kswapd woken by jumbo frames
may not work well, but I'm happy that you are trying this again to be
sure of the best direction to take; there wasn't enough time to
investigate this properly last time, so we quickly backed it out to be
safe. Just running specsfs on nfs with a nic using jumbo frames is
enough to tell whether this works or not, so the good thing is there's
no risk of regression if it works fine.


* Re: [PATCH v2 -mm 1/3] mm: reclaim at order 0 when compaction is enabled
  2012-01-25 22:16       ` Andrea Arcangeli
@ 2012-01-26  2:12         ` Rik van Riel
  0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2012-01-26  2:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-mm, lkml, Johannes Weiner, Andrew Morton,
	Minchan Kim, KOSAKI Motohiro

On 01/25/2012 05:16 PM, Andrea Arcangeli wrote:
> On Wed, Jan 25, 2012 at 10:27:28AM -0500, Rik van Riel wrote:
>> On 01/25/2012 10:00 AM, Mel Gorman wrote:
>>> On Tue, Jan 24, 2012 at 01:21:36PM -0500, Rik van Riel wrote:
>>>> When built with CONFIG_COMPACTION, kswapd does not try to free
>>>> contiguous pages.
>>>
>>> balance_pgdat() gets its order from wakeup_kswapd(). This does not apply
>>> to THP because kswapd does not get woken for THP but it should be woken
>>> up for allocations like jumbo frames or order-1.
>>
>> In the kernel I run at home, I wake up kswapd for THP
>> as well. This is a larger change, which Andrea asked
>> me to delay submitting upstream for a bit.
>>
>> So far there seem to be no ill effects. I'll continue
>> watching for them.
>
> The only problem we had last time, when we managed to add compaction in
> kswapd upstream, was that the kswapd wakeup frequency was too high,
> which kept kswapd spinning at 100% load and destroyed specsfs
> performance. It may have been a fundamental problem of compaction not
> being worthwhile to run to generate jumbo frames, because the cost of
> migrating memory, copying, and flushing ptes

I suspect the problem was much simpler back then.  Kswapd
invoked compaction inside the loop, instead of outside the
loop, and there was no throttling at all.

> About THP, it may not cause problems because the allocation rate
> is much slower.

> I'm still quite afraid that compaction in kswapd woken by jumbo frames
> may not work well,

THP allocations may be slower, but jumbo frames get freed
again quickly. We do not have to compact memory for every
few jumbo frame allocations, only when the number of packets
in flight is going up...

You are right that it should be tested, though :)

I will look into that.

-- 
All rights reversed

