linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] Reduce compaction scanning and lock contention
@ 2012-09-21 10:46 Mel Gorman
  2012-09-21 10:46 ` [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock" Mel Gorman
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

Hi Andrew,

Richard Davies and Shaohua Li have both reported lock contention
problems in compaction on the zone and LRU locks, as well as
significant amounts of time being spent in compaction. This series
aims to reduce lock contention and scanning rates in order to cut that
CPU usage. Richard reported at https://lkml.org/lkml/2012/9/21/91 that
this series made a big difference to a problem he reported in August
(http://marc.info/?l=kvm&m=134511507015614&w=2).

Patches 1-3 revert existing patches in Andrew's tree that are replaced
	later in the series.

Patch 4 is a fix for c67fe375 (mm: compaction: Abort async compaction if
	locks are contended or taking too long) to properly abort in all
	cases when contention is detected.

Patch 5 defers acquiring the zone->lru_lock as long as possible.

Patch 6 defers acquiring the zone->lock as late as possible.

Patch 7 reverts Rik's "skip-free" patches as the core concept gets
	reimplemented later and the remaining patches are easier to
	understand if this is reverted first.

Patch 8 adds a pageblock-skip bit to the pageblock flags to cache what
	pageblocks should be skipped by the migrate and free scanners.
	This drastically reduces the amount of scanning compaction has
	to do.

Patch 9 reimplements something similar to Rik's idea except it uses the
	pageblock-skip information to decide where the scanners should
	restart from and does not need to wrap around (a simplified
	sketch of the combined idea follows this list).
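
For anyone who wants the gist of patches 8 and 9 before reading the
diffs, the following is a simplified userspace model of the idea. Every
name and number below is invented for illustration and is not the kernel
code; only the migrate scanner is modelled and the free scanner is
symmetric. A skip bit is kept per pageblock, the scanner consults it and
records where it should restart, and the hints are reset at most once
every five seconds.

/*
 * Toy model of the pageblock-skip cache (illustration only).
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NR_PAGEBLOCKS	64
#define RESET_INTERVAL	5	/* seconds, mirroring the 5s heuristic */

static bool skip[NR_PAGEBLOCKS];	/* one "skip" bit per pageblock */
static time_t skip_expire;		/* earliest time a reset is allowed */
static int cached_migrate_block;	/* where the scanner restarts */

/* Clear all skip bits, but at most once every RESET_INTERVAL seconds. */
static void reset_isolation_hints(void)
{
	time_t now = time(NULL);

	if (now < skip_expire)
		return;
	skip_expire = now + RESET_INTERVAL;
	memset(skip, 0, sizeof(skip));
	cached_migrate_block = 0;
}

/* Fake scan: pretend the first 8 pageblocks never yield any pages. */
static int scan_block(int block)
{
	return block < 8 ? 0 : 32;
}

int main(void)
{
	bool finished_update = false;
	int block;

	/* First call initialises the hints; later calls are rate-limited. */
	reset_isolation_hints();

	for (block = cached_migrate_block; block < NR_PAGEBLOCKS; block++) {
		if (skip[block])	/* cached: do not rescan */
			continue;
		if (scan_block(block)) {
			/* Pages isolated: stop moving the restart hint. */
			finished_update = true;
			continue;
		}
		/* Nothing isolated: remember it and advance the hint. */
		skip[block] = true;
		if (!finished_update)
			cached_migrate_block = block + 1;
	}

	printf("next compaction restarts the migrate scanner at block %d\n",
	       cached_migrate_block);
	return 0;
}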

I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

akpm-20120920	3.6-rc6 + linux-next/akpm as of September 20th, 2012
lesslock	Patches 1-6
revert		Patches 1-7
cachefail	Patches 1-8
skipuseless	Patches 1-9

Stress high-order allocation tests looked ok. Success rates are more or
less the same with the full series applied, but the expectation is that less
scanning leaves less opportunity to race with other allocation requests. The
time to complete the tests did not vary much and, like the vmstat statistics,
is uninteresting, so I will not present either here.

Using ftrace I recorded how much scanning was done by compaction and got this:

                            3.6.0-rc6     3.6.0-rc6   3.6.0-rc6  3.6.0-rc6 3.6.0-rc6
                            akpm-20120920 lockless  revert-v2r2  cachefail skipuseless

Total   free    scanned         360753976  515414028  565479007   17103281   18916589 
Total   free    isolated          2852429    3597369    4048601     670493     727840 
Total   free    efficiency        0.0079%    0.0070%    0.0072%    0.0392%    0.0385% 
Total   migrate scanned         247728664  822729112 1004645830   17946827   14118903 
Total   migrate isolated          2555324    3245937    3437501     616359     658616 
Total   migrate efficiency        0.0103%    0.0039%    0.0034%    0.0343%    0.0466% 

The efficiency is worthless because of the nature of the test and the
number of failures.  The really interesting point as far as this patch
series is concerned is the number of pages scanned. Note that reverting
Rik's patches massively increases the number of pages scanned, indicating
that those patches really did make a difference to CPU usage.

However, caching which pageblocks should be skipped has a much higher
impact. With patches 1-8 applied, free page and migrate page scanning are
both reduced by 95% in comparison to the akpm kernel.  If the basic concept
of Rik's patches is then implemented on top, free page scanning barely
changes but migrate scanning is further reduced. That said, tests on
3.6-rc5 indicated that the last patch had a greater impact than what was
measured here, so the result is somewhat variable.

One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.

 include/linux/mmzone.h          |    5 +-
 include/linux/pageblock-flags.h |   19 +-
 mm/compaction.c                 |  397 +++++++++++++++++++++++++--------------
 mm/internal.h                   |   11 +-
 mm/page_alloc.c                 |    6 +-
 5 files changed, 280 insertions(+), 158 deletions(-)

-- 
1.7.9.2


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock"
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:46   ` Rafael Aquini
  2012-09-21 10:46 ` [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix" Mel Gorman
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

This reverts
mm-compaction-check-lock-contention-first-before-taking-lock.patch as it
is replaced by a later patch in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 3bb7232..4a77b4b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -347,9 +347,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	cond_resched();
-	locked = compact_trylock_irqsave(&zone->lru_lock, &flags, cc);
-	if (!locked)
-		return 0;
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	locked = true;
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix"
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
  2012-09-21 10:46 ` [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock" Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:47   ` Rafael Aquini
  2012-09-21 10:46 ` [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long" Mel Gorman
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

This reverts
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix
as it is replaced by a later patch in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4a77b4b..1c873bb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -907,7 +907,8 @@ static unsigned long compact_zone_order(struct zone *zone,
 	INIT_LIST_HEAD(&cc.migratepages);
 
 	ret = compact_zone(zone, &cc);
-	*contended = cc.contended;
+	if (contended)
+		*contended = cc.contended;
 	return ret;
 }
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long"
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
  2012-09-21 10:46 ` [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock" Mel Gorman
  2012-09-21 10:46 ` [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix" Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:48   ` Rafael Aquini
  2012-09-21 10:46 ` [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long Mel Gorman
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

This reverts
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
as it is replaced by a later patch in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |   12 +++++-------
 mm/internal.h   |    2 +-
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 1c873bb..614f18b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,7 +70,8 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 		/* async aborts if taking too long or contended */
 		if (!cc->sync) {
-			cc->contended = true;
+			if (cc->contended)
+				*cc->contended = true;
 			return false;
 		}
 
@@ -685,7 +686,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	/* Perform the isolation */
 	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn || cc->contended)
+	if (!low_pfn)
 		return ISOLATE_ABORT;
 
 	cc->migrate_pfn = low_pfn;
@@ -893,7 +894,6 @@ static unsigned long compact_zone_order(struct zone *zone,
 				 bool sync, bool *contended,
 				 struct page **page)
 {
-	unsigned long ret;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -901,15 +901,13 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
+		.contended = contended,
 		.page = page,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
-	ret = compact_zone(zone, &cc);
-	if (contended)
-		*contended = cc.contended;
-	return ret;
+	return compact_zone(zone, &cc);
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index eebbed5..386772f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -131,7 +131,7 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
-	bool contended;			/* True if a lock was contended */
+	bool *contended;		/* True if a lock was contended */
 	struct page **page;		/* Page captured of requested size */
 };
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (2 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long" Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:50   ` Rafael Aquini
                     ` (2 more replies)
  2012-09-21 10:46 ` [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible Mel Gorman
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

From: Shaohua Li <shli@fusionio.com>

Changelog since V2
o Fix BUG_ON triggered due to pages left on cc.migratepages
o Make compact_zone_order() require non-NULL arg `contended'

Changelog since V1
o Only abort the compaction if the lock is contended or it runs too long
o Code rearranged by Andrea Arcangeli

isolate_migratepages_range() might isolate no pages if, for example,
zone->lru_lock is contended and asynchronous compaction is running. In
this case we should abort compaction, otherwise compact_zone will run a
useless loop and make zone->lru_lock even more contended.

[minchan@kernel.org: Putback pages isolated for migration if aborting]
[akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c |   17 ++++++++++++-----
 mm/internal.h   |    2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 614f18b..6b55491 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -70,8 +70,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 		/* async aborts if taking too long or contended */
 		if (!cc->sync) {
-			if (cc->contended)
-				*cc->contended = true;
+			cc->contended = true;
 			return false;
 		}
 
@@ -686,7 +685,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	/* Perform the isolation */
 	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn);
-	if (!low_pfn)
+	if (!low_pfn || cc->contended)
 		return ISOLATE_ABORT;
 
 	cc->migrate_pfn = low_pfn;
@@ -846,6 +845,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		switch (isolate_migratepages(zone, cc)) {
 		case ISOLATE_ABORT:
 			ret = COMPACT_PARTIAL;
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
 			goto out;
 		case ISOLATE_NONE:
 			continue;
@@ -894,6 +895,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 				 bool sync, bool *contended,
 				 struct page **page)
 {
+	unsigned long ret;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -901,13 +903,18 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
-		.contended = contended,
 		.page = page,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
-	return compact_zone(zone, &cc);
+	ret = compact_zone(zone, &cc);
+
+	VM_BUG_ON(!list_empty(&cc.freepages));
+	VM_BUG_ON(!list_empty(&cc.migratepages));
+
+	*contended = cc.contended;
+	return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/internal.h b/mm/internal.h
index 386772f..eebbed5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -131,7 +131,7 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
-	bool *contended;		/* True if a lock was contended */
+	bool contended;			/* True if a lock was contended */
 	struct page **page;		/* Page captured of requested size */
 };
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (3 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:51   ` Rafael Aquini
  2012-09-25  7:05   ` Minchan Kim
  2012-09-21 10:46 ` [PATCH 6/9] mm: compaction: Acquire the zone->lock " Mel Gorman
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

Compaction's migrate scanner acquires the zone->lru_lock when scanning a
range of pages looking for LRU pages to isolate. It does this even if there
are no LRU pages in the range. If multiple processes are compacting then
this can cause severe locking contention. To make matters worse, commit
b2eef8c0 (mm: compaction: minimise the time IRQs are disabled while
isolating pages for migration) releases the lru_lock every SWAP_CLUSTER_MAX
pages that are scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
the lock is contended. This reduces the number of times it unnecessarily
disables and re-enables IRQs. Second, it defers acquiring the LRU lock for
as long as possible. If there are no LRU pages, or the only LRU pages are
transhuge, then the LRU lock is not acquired at all, which reduces
contention on zone->lru_lock.
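
As an illustration of the resulting locking pattern (not the patch itself;
the pthread mutex and helper names below are stand-ins invented for the
example), the scanner checks pages locklessly, takes the lock only when a
candidate is found, rechecks under the lock, and periodically drops the
lock only when it is actually contended:

#include <pthread.h>
#include <stdbool.h>

#define SCAN_BATCH 32	/* stand-in for SWAP_CLUSTER_MAX */

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/* Would check need_resched()/spin_is_contended() in the kernel. */
static bool should_release_lock(void)
{
	return false;	/* assume uncontended in this toy example */
}

static bool page_on_lru(unsigned long pfn)
{
	return (pfn % 4) == 0;	/* pretend every 4th page is on the LRU */
}

static void isolate_range(unsigned long start, unsigned long end)
{
	bool locked = false;
	unsigned long pfn;

	for (pfn = start; pfn < end; pfn++) {
		/* Periodically give the lock up, but only if contended. */
		if (locked && !((pfn + 1) % SCAN_BATCH) &&
		    should_release_lock()) {
			pthread_mutex_unlock(&lru_lock);
			locked = false;
		}

		/* Lockless check first; no lock taken for non-LRU pages. */
		if (!page_on_lru(pfn))
			continue;

		/* Only now acquire the lock, then recheck under it. */
		if (!locked) {
			pthread_mutex_lock(&lru_lock);
			locked = true;
		}
		if (!page_on_lru(pfn))
			continue;

		/* ... isolate the page here ... */
	}

	if (locked)
		pthread_mutex_unlock(&lru_lock);
}

int main(void)
{
	isolate_range(0, 1024);
	return 0;
}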

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c |   63 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 20 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 6b55491..a6068ff 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+static inline bool should_release_lock(spinlock_t *lock)
+{
+	return need_resched() || spin_is_contended(lock);
+}
+
 /*
  * Compaction requires the taking of some coarse locks that are potentially
  * very heavily contended. Check if the process needs to be scheduled or
@@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype)
 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 				      bool locked, struct compact_control *cc)
 {
-	if (need_resched() || spin_is_contended(lock)) {
+	if (should_release_lock(lock)) {
 		if (locked) {
 			spin_unlock_irqrestore(lock, *flags);
 			locked = false;
@@ -327,7 +332,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	isolate_mode_t mode = 0;
 	struct lruvec *lruvec;
 	unsigned long flags;
-	bool locked;
+	bool locked = false;
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -347,23 +352,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	cond_resched();
-	spin_lock_irqsave(&zone->lru_lock, flags);
-	locked = true;
 	for (; low_pfn < end_pfn; low_pfn++) {
 		struct page *page;
 
 		/* give a chance to irqs before checking need_resched() */
-		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
-			locked = false;
+		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
+			if (should_release_lock(&zone->lru_lock)) {
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				locked = false;
+			}
 		}
 
-		/* Check if it is ok to still hold the lock */
-		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
-								locked, cc);
-		if (!locked || fatal_signal_pending(current))
-			break;
-
 		/*
 		 * migrate_pfn does not necessarily start aligned to a
 		 * pageblock. Ensure that pfn_valid is called when moving
@@ -403,21 +402,38 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
-			low_pfn += pageblock_nr_pages;
-			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
-			last_pageblock_nr = pageblock_nr;
-			continue;
+			goto next_pageblock;
 		}
 
+		/* Check may be lockless but that's ok as we recheck later */
 		if (!PageLRU(page))
 			continue;
 
 		/*
-		 * PageLRU is set, and lru_lock excludes isolation,
-		 * splitting and collapsing (collapsing has already
-		 * happened if PageLRU is set).
+		 * PageLRU is set. lru_lock normally excludes isolation
+		 * splitting and collapsing (collapsing has already happened
+		 * if PageLRU is set) but the lock is not necessarily taken
+		 * here and it is wasteful to take it just to check transhuge.
+		 * Check transhuge without lock and skip if it's either a
+		 * transhuge or hugetlbfs page.
 		 */
 		if (PageTransHuge(page)) {
+			if (!locked)
+				goto next_pageblock;
+			low_pfn += (1 << compound_order(page)) - 1;
+			continue;
+		}
+
+		/* Check if it is ok to still hold the lock */
+		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
+								locked, cc);
+		if (!locked || fatal_signal_pending(current))
+			break;
+
+		/* Recheck PageLRU and PageTransHuge under lock */
+		if (!PageLRU(page))
+			continue;
+		if (PageTransHuge(page)) {
 			low_pfn += (1 << compound_order(page)) - 1;
 			continue;
 		}
@@ -444,6 +460,13 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			++low_pfn;
 			break;
 		}
+
+		continue;
+
+next_pageblock:
+		low_pfn += pageblock_nr_pages;
+		low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
+		last_pageblock_nr = pageblock_nr;
 	}
 
 	acct_isolated(zone, locked, cc);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (4 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:52   ` Rafael Aquini
                     ` (2 more replies)
  2012-09-21 10:46 ` [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left" Mel Gorman
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

Compaction's free scanner acquires the zone->lock when checking for
PageBuddy pages and isolating them. It does this even if there are no
PageBuddy pages in the range.

This patch defers acquiring the zone lock for as long as possible. If
there are no free pages in the pageblock then the lock is not acquired at
all, which reduces contention on zone->lock.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c |  141 ++++++++++++++++++++++++++++++-------------------------
 1 file changed, 78 insertions(+), 63 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a6068ff..8e56594 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
 	return compact_checklock_irqsave(lock, flags, false, cc);
 }
 
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
+{
+
+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return false;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return true;
+
+	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
+	if (migrate_async_suitable(migratetype))
+		return true;
+
+	/* Otherwise skip the block */
+	return false;
+}
+
 static void compact_capture_page(struct compact_control *cc)
 {
 	unsigned long flags;
@@ -153,13 +175,16 @@ static void compact_capture_page(struct compact_control *cc)
  * pages inside of the pageblock (even though it may still end up isolating
  * some pages).
  */
-static unsigned long isolate_freepages_block(unsigned long blockpfn,
+static unsigned long isolate_freepages_block(struct compact_control *cc,
+				unsigned long blockpfn,
 				unsigned long end_pfn,
 				struct list_head *freelist,
 				bool strict)
 {
 	int nr_scanned = 0, total_isolated = 0;
 	struct page *cursor;
+	unsigned long flags;
+	bool locked = false;
 
 	cursor = pfn_to_page(blockpfn);
 
@@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 		int isolated, i;
 		struct page *page = cursor;
 
-		if (!pfn_valid_within(blockpfn)) {
-			if (strict)
-				return 0;
-			continue;
-		}
+		if (!pfn_valid_within(blockpfn))
+			goto strict_check;
 		nr_scanned++;
 
-		if (!PageBuddy(page)) {
-			if (strict)
-				return 0;
-			continue;
-		}
+		if (!PageBuddy(page))
+			goto strict_check;
+
+		/*
+		 * The zone lock must be held to isolate freepages. This
+		 * unfortunately this is a very coarse lock and can be
+		 * heavily contended if there are parallel allocations
+		 * or parallel compactions. For async compaction do not
+		 * spin on the lock and we acquire the lock as late as
+		 * possible.
+		 */
+		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
+								locked, cc);
+		if (!locked)
+			break;
+
+		/* Recheck this is a suitable migration target under lock */
+		if (!strict && !suitable_migration_target(page))
+			break;
+
+		/* Recheck this is a buddy page under lock */
+		if (!PageBuddy(page))
+			goto strict_check;
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
 		if (!isolated && strict)
-			return 0;
+			goto strict_check;
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 			blockpfn += isolated - 1;
 			cursor += isolated - 1;
 		}
+
+		continue;
+
+strict_check:
+		/* Abort isolation if the caller requested strict isolation */
+		if (strict) {
+			total_isolated = 0;
+			goto out;
+		}
 	}
 
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
+
+out:
+	if (locked)
+		spin_unlock_irqrestore(&cc->zone->lock, flags);
+
 	return total_isolated;
 }
 
@@ -218,13 +272,18 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
 unsigned long
 isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 {
-	unsigned long isolated, pfn, block_end_pfn, flags;
+	unsigned long isolated, pfn, block_end_pfn;
 	struct zone *zone = NULL;
 	LIST_HEAD(freelist);
+	struct compact_control cc;
 
 	if (pfn_valid(start_pfn))
 		zone = page_zone(pfn_to_page(start_pfn));
 
+	/* cc needed for isolate_freepages_block to acquire zone->lock */
+	cc.zone = zone;
+	cc.sync = true;
+
 	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
 		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))
 			break;
@@ -236,10 +295,8 @@ isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		spin_lock_irqsave(&zone->lock, flags);
-		isolated = isolate_freepages_block(pfn, block_end_pfn,
+		isolated = isolate_freepages_block(&cc, pfn, block_end_pfn,
 						   &freelist, true);
-		spin_unlock_irqrestore(&zone->lock, flags);
 
 		/*
 		 * In strict mode, isolate_freepages_block() returns 0 if
@@ -481,29 +538,6 @@ next_pageblock:
 
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
-
-/* Returns true if the page is within a block suitable for migration to */
-static bool suitable_migration_target(struct page *page)
-{
-
-	int migratetype = get_pageblock_migratetype(page);
-
-	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
-	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
-		return false;
-
-	/* If the page is a large free page, then allow migration */
-	if (PageBuddy(page) && page_order(page) >= pageblock_order)
-		return true;
-
-	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
-	if (migrate_async_suitable(migratetype))
-		return true;
-
-	/* Otherwise skip the block */
-	return false;
-}
-
 /*
  * Returns the start pfn of the last page block in a zone.  This is the starting
  * point for full compaction of a zone.  Compaction searches for free pages from
@@ -527,7 +561,6 @@ static void isolate_freepages(struct zone *zone,
 {
 	struct page *page;
 	unsigned long high_pfn, low_pfn, pfn, zone_end_pfn, end_pfn;
-	unsigned long flags;
 	int nr_freepages = cc->nr_freepages;
 	struct list_head *freelist = &cc->freepages;
 
@@ -575,30 +608,12 @@ static void isolate_freepages(struct zone *zone,
 		if (!suitable_migration_target(page))
 			continue;
 
-		/*
-		 * Found a block suitable for isolating free pages from. Now
-		 * we disabled interrupts, double check things are ok and
-		 * isolate the pages. This is to minimise the time IRQs
-		 * are disabled
-		 */
+		/* Found a block suitable for isolating free pages from */
 		isolated = 0;
-
-		/*
-		 * The zone lock must be held to isolate freepages. This
-		 * unfortunately this is a very coarse lock and can be
-		 * heavily contended if there are parallel allocations
-		 * or parallel compactions. For async compaction do not
-		 * spin on the lock
-		 */
-		if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
-			break;
-		if (suitable_migration_target(page)) {
-			end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
-			isolated = isolate_freepages_block(pfn, end_pfn,
-							   freelist, false);
-			nr_freepages += isolated;
-		}
-		spin_unlock_irqrestore(&zone->lock, flags);
+		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
+		isolated = isolate_freepages_block(cc, pfn, end_pfn,
+						   freelist, false);
+		nr_freepages += isolated;
 
 		/*
 		 * Record the highest PFN we isolated pages from. When next
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left"
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (5 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 6/9] mm: compaction: Acquire the zone->lock " Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:52   ` Rafael Aquini
  2012-09-25  7:37   ` Minchan Kim
  2012-09-21 10:46 ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

This reverts commit 7db8889a (mm: have order > 0 compaction start off
where it left) and commit de74f1cc (mm: have order > 0 compaction start
near a pageblock with free pages). These patches were a good idea and
tests confirmed that they massively reduced the amount of scanning, but
the implementation is complex and tricky to understand. A later patch
will cache which pageblocks should be skipped and reimplement the concept
of compact_cached_free_pfn on top of it for both the migration and free
scanners.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h |    4 ---
 mm/compaction.c        |   65 ++++--------------------------------------------
 mm/internal.h          |    6 -----
 mm/page_alloc.c        |    5 ----
 4 files changed, 5 insertions(+), 75 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2daa54f..603d0b5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,10 +368,6 @@ struct zone {
 	 */
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	/* pfn where the last incremental compaction isolated free pages */
-	unsigned long		compact_cached_free_pfn;
-#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 8e56594..9fc1b61 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -539,20 +539,6 @@ next_pageblock:
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
 /*
- * Returns the start pfn of the last page block in a zone.  This is the starting
- * point for full compaction of a zone.  Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-	unsigned long free_pfn;
-	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	free_pfn &= ~(pageblock_nr_pages-1);
-	return free_pfn;
-}
-
-/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -620,19 +606,8 @@ static void isolate_freepages(struct zone *zone,
 		 * looking for free pages, the search will restart here as
 		 * page migration may have returned some pages to the allocator
 		 */
-		if (isolated) {
+		if (isolated)
 			high_pfn = max(high_pfn, pfn);
-
-			/*
-			 * If the free scanner has wrapped, update
-			 * compact_cached_free_pfn to point to the highest
-			 * pageblock with free pages. This reduces excessive
-			 * scanning of full pageblocks near the end of the
-			 * zone
-			 */
-			if (cc->order > 0 && cc->wrapped)
-				zone->compact_cached_free_pfn = high_pfn;
-		}
 	}
 
 	/* split_free_page does not map the pages */
@@ -640,11 +615,6 @@ static void isolate_freepages(struct zone *zone,
 
 	cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
-
-	/* If compact_cached_free_pfn is reset then set it now */
-	if (cc->order > 0 && !cc->wrapped &&
-			zone->compact_cached_free_pfn == start_free_pfn(zone))
-		zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -739,26 +709,8 @@ static int compact_finished(struct zone *zone,
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
 
-	/*
-	 * A full (order == -1) compaction run starts at the beginning and
-	 * end of a zone; it completes when the migrate and free scanner meet.
-	 * A partial (order > 0) compaction can start with the free scanner
-	 * at a random point in the zone, and may have to restart.
-	 */
-	if (cc->free_pfn <= cc->migrate_pfn) {
-		if (cc->order > 0 && !cc->wrapped) {
-			/* We started partway through; restart at the end. */
-			unsigned long free_pfn = start_free_pfn(zone);
-			zone->compact_cached_free_pfn = free_pfn;
-			cc->free_pfn = free_pfn;
-			cc->wrapped = 1;
-			return COMPACT_CONTINUE;
-		}
-		return COMPACT_COMPLETE;
-	}
-
-	/* We wrapped around and ended up where we started. */
-	if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
 	/*
@@ -864,15 +816,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	/* Setup to move all movable pages to the end of the zone */
 	cc->migrate_pfn = zone->zone_start_pfn;
-
-	if (cc->order > 0) {
-		/* Incremental compaction. Start where the last one stopped. */
-		cc->free_pfn = zone->compact_cached_free_pfn;
-		cc->start_free_pfn = cc->free_pfn;
-	} else {
-		/* Order == -1 starts at the end of the zone. */
-		cc->free_pfn = start_free_pfn(zone);
-	}
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
 
 	migrate_prep_local();
 
diff --git a/mm/internal.h b/mm/internal.h
index eebbed5..f590a6e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -119,14 +119,8 @@ struct compact_control {
 	unsigned long nr_freepages;	/* Number of isolated free pages */
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
-	unsigned long start_free_pfn;	/* where we started the search */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
-	bool wrapped;			/* Order > 0 compactions are
-					   incremental, once free_pfn
-					   and migrate_pfn meet, we restart
-					   from the top of the zone;
-					   remember we wrapped around. */
 
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc5229b..c03037e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4472,11 +4472,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
-#if defined CONFIG_COMPACTION || defined CONFIG_CMA
-		zone->compact_cached_free_pfn = zone->zone_start_pfn +
-						zone->spanned_pages;
-		zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
-#endif
 #ifdef CONFIG_NUMA
 		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (6 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left" Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:53   ` Rafael Aquini
  2012-09-21 21:36   ` Andrew Morton
  2012-09-21 10:46 ` [PATCH 9/9] mm: compaction: Restart compaction from near where it left off Mel Gorman
  2012-09-21 13:51 ` [PATCH 0/9] Reduce compaction scanning and lock contention Rik van Riel
  9 siblings, 2 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

When compaction was implemented it was known that scanning could potentially
be excessive. The ideal was that a counter be maintained for each pageblock,
but maintaining this information would incur a severe penalty due to a
shared writable cache line. It has reached the point where the scanning
costs are a serious problem, particularly on long-lived systems where a
large process starts and allocates a large number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip. If a pageblock is scanned by
either the migrate or free scanner and 0 pages were isolated, the
pageblock is marked to be skipped in the future. When scanning, this bit
is checked before any scanning takes place and the block is skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it is ignored too often, the scanning rates will still
be excessive. If the information is too stale then allocations that might
otherwise have succeeded will fail. In this patch:

o CMA always ignores the information
o If the migrate and free scanners meet then the cached information will
  be discarded if at least 5 seconds have passed since the last time the
  cache was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event. Depending solely on multiple allocation failures still allows
excessive scanning when THP allocations are failing in quick succession
due to memory pressure. Waiting until memory pressure is relieved would
cause compaction to continually fail instead of using reclaim/compaction
to try to allocate the page. So, while the time-based mechanism is clumsy,
a better option is not obvious.
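
For illustration, here is a toy model of keeping the new skip bit next to
the three migratetype bits in a per-pageblock flags word. The bit layout,
array and helpers are invented for the example; the real definitions are
in the pageblock-flags.h hunk below.

#include <stdbool.h>
#include <stdio.h>

#define PB_MIGRATE_BITS	3			/* 3 bits for migratetype */
#define PB_MIGRATE_MASK	((1u << PB_MIGRATE_BITS) - 1)
#define PB_SKIP_BIT	(1u << PB_MIGRATE_BITS)	/* new bit, one per block */

static unsigned int pageblock_flags[64];	/* one word per pageblock */

static bool get_pageblock_skip(int block)
{
	return pageblock_flags[block] & PB_SKIP_BIT;
}

static void set_pageblock_skip(int block)
{
	pageblock_flags[block] |= PB_SKIP_BIT;
}

static void clear_pageblock_skip(int block)
{
	pageblock_flags[block] &= ~PB_SKIP_BIT;
}

int main(void)
{
	pageblock_flags[3] = 2;			/* some migratetype */
	set_pageblock_skip(3);			/* nothing isolated here */
	printf("block 3: migratetype=%u skip=%d\n",
	       pageblock_flags[3] & PB_MIGRATE_MASK, get_pageblock_skip(3));
	clear_pageblock_skip(3);		/* done on a cache reset */
	return 0;
}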

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h          |    3 ++
 include/linux/pageblock-flags.h |   19 +++++++-
 mm/compaction.c                 |   93 +++++++++++++++++++++++++++++++++++++--
 mm/internal.h                   |    1 +
 mm/page_alloc.c                 |    1 +
 5 files changed, 111 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 603d0b5..a456361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,6 +368,9 @@ struct zone {
 	 */
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	unsigned long		compact_blockskip_expire;
+#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 19ef95d..eed27f4 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,6 +30,9 @@ enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
+#ifdef CONFIG_COMPACTION
+	PB_migrate_skip,/* If set the block is skipped by compaction */
+#endif /* CONFIG_COMPACTION */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -65,10 +68,22 @@ unsigned long get_pageblock_flags_group(struct page *page,
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
 					int start_bitidx, int end_bitidx);
 
+#ifdef CONFIG_COMPACTION
+#define get_pageblock_skip(page) \
+			get_pageblock_flags_group(page, PB_migrate_skip,     \
+							PB_migrate_skip + 1)
+#define clear_pageblock_skip(page) \
+			set_pageblock_flags_group(page, 0, PB_migrate_skip,  \
+							PB_migrate_skip + 1)
+#define set_pageblock_skip(page) \
+			set_pageblock_flags_group(page, 1, PB_migrate_skip,  \
+							PB_migrate_skip + 1)
+#endif /* CONFIG_COMPACTION */
+
 #define get_pageblock_flags(page) \
-			get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+			get_pageblock_flags_group(page, 0, PB_migrate_end)
 #define set_pageblock_flags(page, flags) \
 			set_pageblock_flags_group(page, flags,	\
-						  0, NR_PAGEBLOCK_BITS-1)
+						  0, PB_migrate_end)
 
 #endif	/* PAGEBLOCK_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 9fc1b61..9276bc8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,64 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+/* Returns true if the pageblock should be scanned for pages to isolate. */
+static inline bool isolation_suitable(struct compact_control *cc,
+					struct page *page)
+{
+	if (cc->ignore_skip_hint)
+		return true;
+
+	return !get_pageblock_skip(page);
+}
+
+/*
+ * This function is called to clear all cached information on pageblocks that
+ * should be skipped for page isolation when the migrate and free page scanner
+ * meet.
+ */
+static void reset_isolation_suitable(struct zone *zone)
+{
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	unsigned long pfn;
+
+	/*
+	 * Do not reset more than once every five seconds. If allocations are
+	 * failing sufficiently quickly to allow this to happen then continually
+	 * scanning for compaction is not going to help. The choice of five
+	 * seconds is arbitrary but will mitigate excessive scanning.
+	 */
+	if (time_before(jiffies, zone->compact_blockskip_expire))
+		return;
+	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+
+	/* Walk the zone and mark every pageblock as suitable for isolation */
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+		if (zone != page_zone(page))
+			continue;
+
+		clear_pageblock_skip(page);
+	}
+}
+
+/*
+ * If no pages were isolated then mark this pageblock to be skipped in the
+ * future. The information is later cleared by reset_isolation_suitable().
+ */
+static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+{
+	if (!page)
+		return;
+
+	if (!nr_isolated)
+		set_pageblock_skip(page);
+}
+
 static inline bool should_release_lock(spinlock_t *lock)
 {
 	return need_resched() || spin_is_contended(lock);
@@ -182,7 +240,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 				bool strict)
 {
 	int nr_scanned = 0, total_isolated = 0;
-	struct page *cursor;
+	struct page *cursor, *valid_page = NULL;
 	unsigned long flags;
 	bool locked = false;
 
@@ -196,6 +254,8 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		if (!pfn_valid_within(blockpfn))
 			goto strict_check;
 		nr_scanned++;
+		if (!valid_page)
+			valid_page = page;
 
 		if (!PageBuddy(page))
 			goto strict_check;
@@ -253,6 +313,10 @@ out:
 	if (locked)
 		spin_unlock_irqrestore(&cc->zone->lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (blockpfn == end_pfn)
+		update_pageblock_skip(valid_page, total_isolated);
+
 	return total_isolated;
 }
 
@@ -390,6 +454,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	struct lruvec *lruvec;
 	unsigned long flags;
 	bool locked = false;
+	struct page *page = NULL, *valid_page = NULL;
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -410,7 +475,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	/* Time to isolate some pages for migration */
 	cond_resched();
 	for (; low_pfn < end_pfn; low_pfn++) {
-		struct page *page;
 
 		/* give a chance to irqs before checking need_resched() */
 		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
@@ -447,6 +511,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (page_zone(page) != zone)
 			continue;
 
+		if (!valid_page)
+			valid_page = page;
+
+		/* If isolation recently failed, do not retry */
+		pageblock_nr = low_pfn >> pageblock_order;
+		if (!isolation_suitable(cc, page))
+			goto next_pageblock;
+
 		/* Skip if free */
 		if (PageBuddy(page))
 			continue;
@@ -456,7 +528,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 * migration is optimistic to see if the minimum amount of work
 		 * satisfies the allocation
 		 */
-		pageblock_nr = low_pfn >> pageblock_order;
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
 			goto next_pageblock;
@@ -531,6 +602,10 @@ next_pageblock:
 	if (locked)
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
+	/* Update the pageblock-skip if the whole pageblock was scanned */
+	if (low_pfn == end_pfn)
+		update_pageblock_skip(valid_page, nr_isolated);
+
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
 	return low_pfn;
@@ -594,6 +669,10 @@ static void isolate_freepages(struct zone *zone,
 		if (!suitable_migration_target(page))
 			continue;
 
+		/* If isolation recently failed, do not retry */
+		if (!isolation_suitable(cc, page))
+			continue;
+
 		/* Found a block suitable for isolating free pages from */
 		isolated = 0;
 		end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
@@ -710,8 +789,10 @@ static int compact_finished(struct zone *zone,
 		return COMPACT_PARTIAL;
 
 	/* Compaction run completes if the migrate and free scanner meet */
-	if (cc->free_pfn <= cc->migrate_pfn)
+	if (cc->free_pfn <= cc->migrate_pfn) {
+		reset_isolation_suitable(cc->zone);
 		return COMPACT_COMPLETE;
+	}
 
 	/*
 	 * order == -1 is expected when compacting via
@@ -819,6 +900,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
 	cc->free_pfn &= ~(pageblock_nr_pages-1);
 
+	/* Clear pageblock skip if there are numerous alloc failures */
+	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
+		reset_isolation_suitable(zone);
+
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
diff --git a/mm/internal.h b/mm/internal.h
index f590a6e..fbf5ddc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -121,6 +121,7 @@ struct compact_control {
 	unsigned long free_pfn;		/* isolate_freepages search base */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
+	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
 
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c03037e..ba34eee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5662,6 +5662,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 		.order = -1,
 		.zone = page_zone(pfn_to_page(start)),
 		.sync = true,
+		.ignore_skip_hint = true,
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 9/9] mm: compaction: Restart compaction from near where it left off
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (7 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
@ 2012-09-21 10:46 ` Mel Gorman
  2012-09-21 17:54   ` Rafael Aquini
  2012-09-21 13:51 ` [PATCH 0/9] Reduce compaction scanning and lock contention Rik van Riel
  9 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-21 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML, Mel Gorman

This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order allocations,
it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to compact
at the start of the zone, and for free pages to compact things to at the
end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at
the end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
As a result, isolate_freepages can take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanners should start on
subsequent compaction invocations, using the pageblock-skip information.
When compaction starts it begins from the cached restart points and keeps
updating them until a page is isolated, or until a pageblock that
synchronous compaction would have scanned is skipped.
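
A minimal sketch of how the cached restart points are sanity-checked
before use (plain C with an invented toy_zone type; compare with the
compact_zone() hunk below, which additionally aligns the free pfn to a
pageblock boundary). If a cached pfn falls outside the zone it is treated
as uninitialised and the scanner falls back to the zone boundary.

#include <stdio.h>

struct toy_zone {
	unsigned long start_pfn;
	unsigned long end_pfn;
	unsigned long cached_migrate_pfn;
	unsigned long cached_free_pfn;
};

static void pick_start_points(struct toy_zone *z,
			      unsigned long *migrate_pfn,
			      unsigned long *free_pfn)
{
	*migrate_pfn = z->cached_migrate_pfn;
	*free_pfn = z->cached_free_pfn;

	if (*free_pfn < z->start_pfn || *free_pfn > z->end_pfn) {
		*free_pfn = z->end_pfn;		/* restart from the end */
		z->cached_free_pfn = *free_pfn;
	}
	if (*migrate_pfn < z->start_pfn || *migrate_pfn > z->end_pfn) {
		*migrate_pfn = z->start_pfn;	/* restart from the start */
		z->cached_migrate_pfn = *migrate_pfn;
	}
}

int main(void)
{
	struct toy_zone z = { .start_pfn = 4096, .end_pfn = 262144 };
	unsigned long m, f;

	pick_start_points(&z, &m, &f);	/* cached pfns are 0: fall back */
	printf("migrate starts at %lu, free starts at %lu\n", m, f);
	return 0;
}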

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h |    4 ++++
 mm/compaction.c        |   54 ++++++++++++++++++++++++++++++++++++++++--------
 mm/internal.h          |    4 ++++
 3 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a456361..e7792a3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -370,6 +370,10 @@ struct zone {
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 	unsigned long		compact_blockskip_expire;
+
+	/* pfns where compaction scanners should start */
+	unsigned long		compact_cached_free_pfn;
+	unsigned long		compact_cached_migrate_pfn;
 #endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
diff --git a/mm/compaction.c b/mm/compaction.c
index 9276bc8..4bd96f3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -79,6 +79,9 @@ static void reset_isolation_suitable(struct zone *zone)
 	 */
 	if (time_before(jiffies, zone->compact_blockskip_expire))
 		return;
+
+	zone->compact_cached_migrate_pfn = start_pfn;
+	zone->compact_cached_free_pfn = end_pfn;
 	zone->compact_blockskip_expire = jiffies + (HZ * 5);
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
@@ -99,13 +102,29 @@ static void reset_isolation_suitable(struct zone *zone)
  * If no pages were isolated then mark this pageblock to be skipped in the
  * future. The information is later cleared by reset_isolation_suitable().
  */
-static void update_pageblock_skip(struct page *page, unsigned long nr_isolated)
+static void update_pageblock_skip(struct compact_control *cc,
+			struct page *page, unsigned long nr_isolated,
+			bool migrate_scanner)
 {
+	struct zone *zone = cc->zone;
 	if (!page)
 		return;
 
-	if (!nr_isolated)
+	if (!nr_isolated) {
+		unsigned long pfn = page_to_pfn(page);
 		set_pageblock_skip(page);
+
+		/* Update where compaction should restart */
+		if (migrate_scanner) {
+			if (!cc->finished_update_migrate &&
+			    pfn > zone->compact_cached_migrate_pfn)
+				zone->compact_cached_migrate_pfn = pfn;
+		} else {
+			if (!cc->finished_update_free &&
+			    pfn < zone->compact_cached_free_pfn)
+				zone->compact_cached_free_pfn = pfn;
+		}
+	}
 }
 
 static inline bool should_release_lock(spinlock_t *lock)
@@ -315,7 +334,7 @@ out:
 
 	/* Update the pageblock-skip if the whole pageblock was scanned */
 	if (blockpfn == end_pfn)
-		update_pageblock_skip(valid_page, total_isolated);
+		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
 	return total_isolated;
 }
@@ -530,6 +549,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 */
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
+			cc->finished_update_migrate = true;
 			goto next_pageblock;
 		}
 
@@ -578,6 +598,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		VM_BUG_ON(PageTransCompound(page));
 
 		/* Successfully isolated */
+		cc->finished_update_migrate = true;
 		del_page_from_lru_list(page, lruvec, page_lru(page));
 		list_add(&page->lru, migratelist);
 		cc->nr_migratepages++;
@@ -604,7 +625,7 @@ next_pageblock:
 
 	/* Update the pageblock-skip if the whole pageblock was scanned */
 	if (low_pfn == end_pfn)
-		update_pageblock_skip(valid_page, nr_isolated);
+		update_pageblock_skip(cc, valid_page, nr_isolated, true);
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
@@ -685,8 +706,10 @@ static void isolate_freepages(struct zone *zone,
 		 * looking for free pages, the search will restart here as
 		 * page migration may have returned some pages to the allocator
 		 */
-		if (isolated)
+		if (isolated) {
+			cc->finished_update_free = true;
 			high_pfn = max(high_pfn, pfn);
+		}
 	}
 
 	/* split_free_page does not map the pages */
@@ -883,6 +906,8 @@ unsigned long compaction_suitable(struct zone *zone, int order)
 static int compact_zone(struct zone *zone, struct compact_control *cc)
 {
 	int ret;
+	unsigned long start_pfn = zone->zone_start_pfn;
+	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 
 	ret = compaction_suitable(zone, cc->order);
 	switch (ret) {
@@ -895,10 +920,21 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		;
 	}
 
-	/* Setup to move all movable pages to the end of the zone */
-	cc->migrate_pfn = zone->zone_start_pfn;
-	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
-	cc->free_pfn &= ~(pageblock_nr_pages-1);
+	/*
+	 * Setup to move all movable pages to the end of the zone. Used cached
+	 * information on where the scanners should start but check that it
+	 * is initialised by ensuring the values are within zone boundaries.
+	 */
+	cc->migrate_pfn = zone->compact_cached_migrate_pfn;
+	cc->free_pfn = zone->compact_cached_free_pfn;
+	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
+		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
+		zone->compact_cached_free_pfn = cc->free_pfn;
+	}
+	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
+		cc->migrate_pfn = start_pfn;
+		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
+	}
 
 	/* Clear pageblock skip if there are numerous alloc failures */
 	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
diff --git a/mm/internal.h b/mm/internal.h
index fbf5ddc..7052289 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -122,6 +122,10 @@ struct compact_control {
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
+	bool finished_update_free;	/* True when the zone cached pfns are
+					 * no longer being updated
+					 */
+	bool finished_update_migrate;
 
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/9] Reduce compaction scanning and lock contention
  2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
                   ` (8 preceding siblings ...)
  2012-09-21 10:46 ` [PATCH 9/9] mm: compaction: Restart compaction from near where it left off Mel Gorman
@ 2012-09-21 13:51 ` Rik van Riel
  9 siblings, 0 replies; 42+ messages in thread
From: Rik van Riel @ 2012-09-21 13:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Avi Kivity,
	QEMU-devel, KVM, Linux-MM, LKML

On 09/21/2012 06:46 AM, Mel Gorman wrote:
> Hi Andrew,
>
> Richard Davies and Shaohua Li have both reported lock contention
> problems in compaction on the zone and LRU locks as well as
> significant amounts of time being spent in compaction. This series
> aims to reduce lock contention and scanning rates to reduce that CPU
> usage. Richard reported at https://lkml.org/lkml/2012/9/21/91 that
> this series made a big different to a problem he reported in August
> (http://marc.info/?l=kvm&m=134511507015614&w=2).

> One way or the other, this series has a large impact on the amount of
> scanning compaction does when there is a storm of THP allocations.

Andrew,

Mel and I have discussed the stuff in this series quite a bit,
and I am convinced this is the way forward with compaction.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock"
  2012-09-21 10:46 ` [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock" Mel Gorman
@ 2012-09-21 17:46   ` Rafael Aquini
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:15AM +0100, Mel Gorman wrote:
> This reverts
> mm-compaction-check-lock-contention-first-before-taking-lock.patch as it
> is replaced by a later patch in the series.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix"
  2012-09-21 10:46 ` [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix" Mel Gorman
@ 2012-09-21 17:47   ` Rafael Aquini
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:16AM +0100, Mel Gorman wrote:
> This reverts
> mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix
> as it is replaced by a later patch in the series.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long"
  2012-09-21 10:46 ` [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long" Mel Gorman
@ 2012-09-21 17:48   ` Rafael Aquini
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:17AM +0100, Mel Gorman wrote:
> This reverts
> mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
> as it is replaced by a later patch in the series.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long
  2012-09-21 10:46 ` [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long Mel Gorman
@ 2012-09-21 17:50   ` Rafael Aquini
  2012-09-21 21:31   ` Andrew Morton
  2012-09-25  7:34   ` Minchan Kim
  2 siblings, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:18AM +0100, Mel Gorman wrote:
> From: Shaohua Li <shli@fusionio.com>
> 
> Changelog since V2
> o Fix BUG_ON triggered due to pages left on cc.migratepages
> o Make compact_zone_order() require non-NULL arg `contended'
> 
> Changelog since V1
> o only abort the compaction if lock is contended or run too long
> o Rearranged the code by Andrea Arcangeli.
> 
> isolate_migratepages_range() might isolate no pages if for example when
> zone->lru_lock is contended and running asynchronous compaction. In this
> case, we should abort compaction, otherwise, compact_zone will run a
> useless loop and make zone->lru_lock is even contended.
> 
> [minchan@kernel.org: Putback pages isolated for migration if aborting]
> [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-21 10:46 ` [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible Mel Gorman
@ 2012-09-21 17:51   ` Rafael Aquini
  2012-09-25  7:05   ` Minchan Kim
  1 sibling, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:19AM +0100, Mel Gorman wrote:
> Compactions migrate scanner acquires the zone->lru_lock when scanning a range
> of pages looking for LRU pages to acquire. It does this even if there are
> no LRU pages in the range. If multiple processes are compacting then this
> can cause severe locking contention. To make matters worse commit b2eef8c0
> (mm: compaction: minimise the time IRQs are disabled while isolating pages
> for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
> scanned.
> 
> This patch makes two changes to how the migrate scanner acquires the LRU
> lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
> the lock is contended. This reduces the number of times it unnecessarily
> disables and re-enables IRQs. The second is that it defers acquiring the
> LRU lock for as long as possible. If there are no LRU pages or the only
> LRU pages are transhuge then the LRU lock will not be acquired at all
> which reduces contention on zone->lru_lock.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
  2012-09-21 10:46 ` [PATCH 6/9] mm: compaction: Acquire the zone->lock " Mel Gorman
@ 2012-09-21 17:52   ` Rafael Aquini
  2012-09-21 21:35   ` Andrew Morton
  2012-09-25  7:35   ` Minchan Kim
  2 siblings, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:20AM +0100, Mel Gorman wrote:
> Compactions free scanner acquires the zone->lock when checking for PageBuddy
> pages and isolating them. It does this even if there are no PageBuddy pages
> in the range.
> 
> This patch defers acquiring the zone lock for as long as possible. In the
> event there are no free pages in the pageblock then the lock will not be
> acquired at all which reduces contention on zone->lock.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left"
  2012-09-21 10:46 ` [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left" Mel Gorman
@ 2012-09-21 17:52   ` Rafael Aquini
  2012-09-25  7:37   ` Minchan Kim
  1 sibling, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:21AM +0100, Mel Gorman wrote:
> This reverts commit 7db8889a (mm: have order > 0 compaction start off
> where it left) and commit de74f1cc (mm: have order > 0 compaction start
> near a pageblock with free pages). These patches were a good idea and
> tests confirmed that they massively reduced the amount of scanning but
> the implementation is complex and tricky to understand. A later patch
> will cache what pageblocks should be skipped and reimplements the
> concept of compact_cached_free_pfn on top for both migration and
> free scanners.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-21 10:46 ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
@ 2012-09-21 17:53   ` Rafael Aquini
  2012-09-21 21:36   ` Andrew Morton
  1 sibling, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:22AM +0100, Mel Gorman wrote:
> When compaction was implemented it was known that scanning could potentially
> be excessive. The ideal was that a counter be maintained for each pageblock
> but maintaining this information would incur a severe penalty due to a
> shared writable cache line. It has reached the point where the scanning
> costs are an serious problem, particularly on long-lived systems where a
> large process starts and allocates a large number of THPs at the same time.
> 
> Instead of using a shared counter, this patch adds another bit to the
> pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> either migrate or free scanner and 0 pages were isolated, the pageblock
> is marked to be skipped in the future. When scanning, this bit is checked
> before any scanning takes place and the block skipped if set.
> 
> The main difficulty with a patch like this is "when to ignore the cached
> information?" If it's ignored too often, the scanning rates will still
> be excessive. If the information is too stale then allocations will fail
> that might have otherwise succeeded. In this patch
> 
> o CMA always ignores the information
> o If the migrate and free scanner meet then the cached information will
>   be discarded if it's at least 5 seconds since the last time the cache
>   was discarded
> o If there are a large number of allocation failures, discard the cache.
> 
> The time-based heuristic is very clumsy but there are few choices for a
> better event. Depending solely on multiple allocation failures still allows
> excessive scanning when THP allocations are failing in quick succession
> due to memory pressure. Waiting until memory pressure is relieved would
> cause compaction to continually fail instead of using reclaim/compaction
> to try allocate the page. The time-based mechanism is clumsy but a better
> option is not obvious.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 9/9] mm: compaction: Restart compaction from near where it left off
  2012-09-21 10:46 ` [PATCH 9/9] mm: compaction: Restart compaction from near where it left off Mel Gorman
@ 2012-09-21 17:54   ` Rafael Aquini
  0 siblings, 0 replies; 42+ messages in thread
From: Rafael Aquini @ 2012-09-21 17:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:23AM +0100, Mel Gorman wrote:
> This is almost entirely based on Rik's previous patches and discussions
> with him about how this might be implemented.
> 
> Order > 0 compaction stops when enough free pages of the correct page
> order have been coalesced.  When doing subsequent higher order allocations,
> it is possible for compaction to be invoked many times.
> 
> However, the compaction code always starts out looking for things to compact
> at the start of the zone, and for free pages to compact things to at the
> end of the zone.
> 
> This can cause quadratic behaviour, with isolate_freepages starting at
> the end of the zone each time, even though previous invocations of the
> compaction code already filled up all free memory on that end of the zone.
> This can cause isolate_freepages to take enormous amounts of CPU with
> certain workloads on larger memory systems.
> 
> This patch caches where the migration and free scanner should start from on
> subsequent compaction invocations using the pageblock-skip information. When
> compaction starts it begins from the cached restart points and will
> update the cached restart points until a page is isolated or a pageblock
> is skipped that would have been scanned by synchronous compaction.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---

Acked-by: Rafael Aquini <aquini@redhat.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long
  2012-09-21 10:46 ` [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long Mel Gorman
  2012-09-21 17:50   ` Rafael Aquini
@ 2012-09-21 21:31   ` Andrew Morton
  2012-09-25  7:34   ` Minchan Kim
  2 siblings, 0 replies; 42+ messages in thread
From: Andrew Morton @ 2012-09-21 21:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Fri, 21 Sep 2012 11:46:18 +0100
Mel Gorman <mgorman@suse.de> wrote:

> Changelog since V2
> o Fix BUG_ON triggered due to pages left on cc.migratepages
> o Make compact_zone_order() require non-NULL arg `contended'
> 
> Changelog since V1
> o only abort the compaction if lock is contended or run too long
> o Rearranged the code by Andrea Arcangeli.
> 
> isolate_migratepages_range() might isolate no pages if for example when
> zone->lru_lock is contended and running asynchronous compaction. In this
> case, we should abort compaction, otherwise, compact_zone will run a
> useless loop and make zone->lru_lock is even contended.

hm, this appears to be identical to

mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix-2.patch

so I simply omitted patches 2, 3 and 4.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
  2012-09-21 10:46 ` [PATCH 6/9] mm: compaction: Acquire the zone->lock " Mel Gorman
  2012-09-21 17:52   ` Rafael Aquini
@ 2012-09-21 21:35   ` Andrew Morton
  2012-09-24  8:52     ` Mel Gorman
  2012-09-25  7:35   ` Minchan Kim
  2 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2012-09-21 21:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Fri, 21 Sep 2012 11:46:20 +0100
Mel Gorman <mgorman@suse.de> wrote:

> Compactions free scanner acquires the zone->lock when checking for PageBuddy
> pages and isolating them. It does this even if there are no PageBuddy pages
> in the range.
> 
> This patch defers acquiring the zone lock for as long as possible. In the
> event there are no free pages in the pageblock then the lock will not be
> acquired at all which reduces contention on zone->lock.
>
> ...
>
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
>  	return compact_checklock_irqsave(lock, flags, false, cc);
>  }
>  
> +/* Returns true if the page is within a block suitable for migration to */
> +static bool suitable_migration_target(struct page *page)
> +{
> +

stray newline

> +	int migratetype = get_pageblock_migratetype(page);
> +
> +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> +		return false;
> +
> +	/* If the page is a large free page, then allow migration */
> +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> +		return true;
> +
> +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> +	if (migrate_async_suitable(migratetype))
> +		return true;
> +
> +	/* Otherwise skip the block */
> +	return false;
> +}
> +
>
> ...
>
> @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> -		if (!pfn_valid_within(blockpfn)) {
> -			if (strict)
> -				return 0;
> -			continue;
> -		}
> +		if (!pfn_valid_within(blockpfn))
> +			goto strict_check;
>  		nr_scanned++;
>  
> -		if (!PageBuddy(page)) {
> -			if (strict)
> -				return 0;
> -			continue;
> -		}
> +		if (!PageBuddy(page))
> +			goto strict_check;
> +
> +		/*
> +		 * The zone lock must be held to isolate freepages. This
> +		 * unfortunately this is a very coarse lock and can be

this this

> +		 * heavily contended if there are parallel allocations
> +		 * or parallel compactions. For async compaction do not
> +		 * spin on the lock and we acquire the lock as late as
> +		 * possible.
> +		 */
> +		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> +								locked, cc);
> +		if (!locked)
> +			break;
> +
> +		/* Recheck this is a suitable migration target under lock */
> +		if (!strict && !suitable_migration_target(page))
> +			break;
> +
> +		/* Recheck this is a buddy page under lock */
> +		if (!PageBuddy(page))
> +			goto strict_check;
>  
>  		/* Found a free page, break it into order-0 pages */
>  		isolated = split_free_page(page);
>  		if (!isolated && strict)
> -			return 0;
> +			goto strict_check;
>  		total_isolated += isolated;
>  		for (i = 0; i < isolated; i++) {
>  			list_add(&page->lru, freelist);
> @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>  			blockpfn += isolated - 1;
>  			cursor += isolated - 1;
>  		}
> +
> +		continue;
> +
> +strict_check:
> +		/* Abort isolation if the caller requested strict isolation */
> +		if (strict) {
> +			total_isolated = 0;
> +			goto out;
> +		}
>  	}
>  
>  	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
> +
> +out:
> +	if (locked)
> +		spin_unlock_irqrestore(&cc->zone->lock, flags);
> +
>  	return total_isolated;
>  }

A few things about this function.

Would it be cleaner if we did

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page))
			goto strict_check;
	}

and then remove the test of `strict' from strict_check (which then
should be renamed)?

Which perhaps means that the whole `strict_check' block can go away:

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page)) {
			total_isolated = 0;
			goto out;
		}
	}

Have a think about it?  The function is a little straggly.

Secondly, is it correct/desirable to skip the (now misnamed
`trace_mm_compaction_isolate_freepages') tracepoint generation if we
bailed out of the loop?  The fact that we entered
isolate_freepages_block() but failed to isolate anything is data which
people might be interested in?

Thirdly, that (existing) comment "This assumes the block is valid" is
either too vague or wrong.  If it's valid, why do we check
pfn_valid_within()?

> @@ -218,13 +272,18 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>  unsigned long
>  isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
>  {
> -	unsigned long isolated, pfn, block_end_pfn, flags;
> +	unsigned long isolated, pfn, block_end_pfn;
>  	struct zone *zone = NULL;
>  	LIST_HEAD(freelist);
> +	struct compact_control cc;
>  
>  	if (pfn_valid(start_pfn))
>  		zone = page_zone(pfn_to_page(start_pfn));
>  
> +	/* cc needed for isolate_freepages_block to acquire zone->lock */
> +	cc.zone = zone;
> +	cc.sync = true;

We initialise two of cc's fields, leave the other 12 fields containing
random garbage, then start using it.  I see no bug here, but...


>  	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
>  		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))
>  			break;
> ...
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-21 10:46 ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
  2012-09-21 17:53   ` Rafael Aquini
@ 2012-09-21 21:36   ` Andrew Morton
  2012-09-24  9:39     ` Mel Gorman
  1 sibling, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2012-09-21 21:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Fri, 21 Sep 2012 11:46:22 +0100
Mel Gorman <mgorman@suse.de> wrote:

> When compaction was implemented it was known that scanning could potentially
> be excessive. The ideal was that a counter be maintained for each pageblock
> but maintaining this information would incur a severe penalty due to a
> shared writable cache line. It has reached the point where the scanning
> costs are an serious problem, particularly on long-lived systems where a
> large process starts and allocates a large number of THPs at the same time.
> 
> Instead of using a shared counter, this patch adds another bit to the
> pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> either migrate or free scanner and 0 pages were isolated, the pageblock
> is marked to be skipped in the future. When scanning, this bit is checked
> before any scanning takes place and the block skipped if set.
> 
> The main difficulty with a patch like this is "when to ignore the cached
> information?" If it's ignored too often, the scanning rates will still
> be excessive. If the information is too stale then allocations will fail
> that might have otherwise succeeded. In this patch
> 
> o CMA always ignores the information
> o If the migrate and free scanner meet then the cached information will
>   be discarded if it's at least 5 seconds since the last time the cache
>   was discarded
> o If there are a large number of allocation failures, discard the cache.
> 
> The time-based heuristic is very clumsy but there are few choices for a
> better event. Depending solely on multiple allocation failures still allows
> excessive scanning when THP allocations are failing in quick succession
> due to memory pressure. Waiting until memory pressure is relieved would
> cause compaction to continually fail instead of using reclaim/compaction
> to try allocate the page. The time-based mechanism is clumsy but a better
> option is not obvious.

ick.

Wall time has sooo little relationship to what's happening in there. 
If we *have* to use polling, cannot we clock the poll with some metric
which is at least vaguely related to the amount of activity?  Number
(or proportion) of pages scanned, for example?  Or reset everything on
the Nth trip around the zone?  Or even a combination of one of these
*and* of wall time, so the system will at least work harder when MM is
under load.

Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?

>
> ...
>
> +static void reset_isolation_suitable(struct zone *zone)
> +{
> +	unsigned long start_pfn = zone->zone_start_pfn;
> +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Do not reset more than once every five seconds. If allocations are
> +	 * failing sufficiently quickly to allow this to happen then continually
> +	 * scanning for compaction is not going to help. The choice of five
> +	 * seconds is arbitrary but will mitigate excessive scanning.
> +	 */
> +	if (time_before(jiffies, zone->compact_blockskip_expire))
> +		return;
> +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> +
> +	/* Walk the zone and mark every pageblock as suitable for isolation */
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> +		struct page *page;
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		page = pfn_to_page(pfn);
> +		if (zone != page_zone(page))
> +			continue;
> +
> +		clear_pageblock_skip(page);
> +	}

What's the worst-case loop count here?

> +}
> +
>
> ...
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
  2012-09-21 21:35   ` Andrew Morton
@ 2012-09-24  8:52     ` Mel Gorman
  2012-09-25  7:36       ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-24  8:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 02:35:57PM -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2012 11:46:20 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > Compactions free scanner acquires the zone->lock when checking for PageBuddy
> > pages and isolating them. It does this even if there are no PageBuddy pages
> > in the range.
> > 
> > This patch defers acquiring the zone lock for as long as possible. In the
> > event there are no free pages in the pageblock then the lock will not be
> > acquired at all which reduces contention on zone->lock.
> >
> > ...
> >
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
> >  	return compact_checklock_irqsave(lock, flags, false, cc);
> >  }
> >  
> > +/* Returns true if the page is within a block suitable for migration to */
> > +static bool suitable_migration_target(struct page *page)
> > +{
> > +
> 
> stray newline
> 

Fixed.

> > +	int migratetype = get_pageblock_migratetype(page);
> > +
> > +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> > +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> > +		return false;
> > +
> > +	/* If the page is a large free page, then allow migration */
> > +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> > +		return true;
> > +
> > +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> > +	if (migrate_async_suitable(migratetype))
> > +		return true;
> > +
> > +	/* Otherwise skip the block */
> > +	return false;
> > +}
> > +
> >
> > ...
> >
> > @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
> >  		int isolated, i;
> >  		struct page *page = cursor;
> >  
> > -		if (!pfn_valid_within(blockpfn)) {
> > -			if (strict)
> > -				return 0;
> > -			continue;
> > -		}
> > +		if (!pfn_valid_within(blockpfn))
> > +			goto strict_check;
> >  		nr_scanned++;
> >  
> > -		if (!PageBuddy(page)) {
> > -			if (strict)
> > -				return 0;
> > -			continue;
> > -		}
> > +		if (!PageBuddy(page))
> > +			goto strict_check;
> > +
> > +		/*
> > +		 * The zone lock must be held to isolate freepages. This
> > +		 * unfortunately this is a very coarse lock and can be
> 
> this this
> 

Fixed.

> > +		 * heavily contended if there are parallel allocations
> > +		 * or parallel compactions. For async compaction do not
> > +		 * spin on the lock and we acquire the lock as late as
> > +		 * possible.
> > +		 */
> > +		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> > +								locked, cc);
> > +		if (!locked)
> > +			break;
> > +
> > +		/* Recheck this is a suitable migration target under lock */
> > +		if (!strict && !suitable_migration_target(page))
> > +			break;
> > +
> > +		/* Recheck this is a buddy page under lock */
> > +		if (!PageBuddy(page))
> > +			goto strict_check;
> >  
> >  		/* Found a free page, break it into order-0 pages */
> >  		isolated = split_free_page(page);
> >  		if (!isolated && strict)
> > -			return 0;
> > +			goto strict_check;
> >  		total_isolated += isolated;
> >  		for (i = 0; i < isolated; i++) {
> >  			list_add(&page->lru, freelist);
> > @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
> >  			blockpfn += isolated - 1;
> >  			cursor += isolated - 1;
> >  		}
> > +
> > +		continue;
> > +
> > +strict_check:
> > +		/* Abort isolation if the caller requested strict isolation */
> > +		if (strict) {
> > +			total_isolated = 0;
> > +			goto out;
> > +		}
> >  	}
> >  
> >  	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
> > +
> > +out:
> > +	if (locked)
> > +		spin_unlock_irqrestore(&cc->zone->lock, flags);
> > +
> >  	return total_isolated;
> >  }
> 
> A a few things about this function.
> 
> Would it be cleaner if we did
> 
> 	if (!strict) {
> 		if (!suitable_migration_target(page))
> 			break;
> 	} else {
> 		if (!PageBuddy(page))
> 			goto strict_check;
> 	}
> 
> and then remove the test of `strict' from strict_check (which then
> should be renamed)?
> 

I was not able to implement what you suggested without breakage.
However, I can do something very similar if "strict" mode does not bail
out ASAP and instead double checks at the end that everything was
isolated correctly. This does mean that CMA's failure case is slightly
more expensive but that really is a corner case and I think it's
acceptable if the code is easier to follow as a result.

> Which perhaps means that the whole `strict_check' block can go away:
> 
> 	if (!strict) {
> 		if (!suitable_migration_target(page))
> 			break;
> 	} else {
> 		if (!PageBuddy(page)) {
> 			total_isolated = 0;
> 			goto out;
> 	}
> 
> Have a think about it?  The function is a little straggly.
> 

Proposal below.

> Secondly, it is correct/desirable to skip the (now misnamed
> `trace_mm_compaction_isolate_freepages') tracepoint generation if we
> baled out of the loop?  The fact that we entered
> isolate_freepages_block() but failed to isolate anything is data which
> people might be interested in?
> 

You're right, it may be interesting for someone debugging CMA to know that
nr_scanned != nr_isolated at the time of allocation failure. 

> Thirdly, that (existing) comment "This assumes the block is valid" is
> either too vague or wrong.  If it's valid, why wo we check
> pfn_valid_within()?
> 

Comment is stale and should be removed.

> > @@ -218,13 +272,18 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
> >  unsigned long
> >  isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
> >  {
> > -	unsigned long isolated, pfn, block_end_pfn, flags;
> > +	unsigned long isolated, pfn, block_end_pfn;
> >  	struct zone *zone = NULL;
> >  	LIST_HEAD(freelist);
> > +	struct compact_control cc;
> >  
> >  	if (pfn_valid(start_pfn))
> >  		zone = page_zone(pfn_to_page(start_pfn));
> >  
> > +	/* cc needed for isolate_freepages_block to acquire zone->lock */
> > +	cc.zone = zone;
> > +	cc.sync = true;
> 
> We initialise two of cc's fields, leave the other 12 fields containing
> random garbage, then start using it.  I see no bug here, but...
> 

I get your point. At the very least we should initialise it with zeros.

How about this?

---8<---
mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range()

Andrew pointed out that isolate_freepages_block() is "straggly" and
isolate_freepages_range() is making assumptions on how compact_control is
used, which is delicate. This patch straightens isolate_freepages_block()
and makes it fly straight and initialises compact_control to zeros in
isolate_freepages_range(). The code should be easier to follow and
is functionally equivalent. The CMA failure path is now a little more
expensive but that is a marginal corner-case.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |   47 ++++++++++++++++++++++-------------------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9276bc8..5ffe9a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -154,7 +154,6 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
 /* Returns true if the page is within a block suitable for migration to */
 static bool suitable_migration_target(struct page *page)
 {
-
 	int migratetype = get_pageblock_migratetype(page);
 
 	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
@@ -246,23 +245,23 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 
 	cursor = pfn_to_page(blockpfn);
 
-	/* Isolate free pages. This assumes the block is valid */
+	/* Isolate free pages. */
 	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
 		int isolated, i;
 		struct page *page = cursor;
 
-		if (!pfn_valid_within(blockpfn))
-			goto strict_check;
 		nr_scanned++;
+		if (!pfn_valid_within(blockpfn))
+			continue;
 		if (!valid_page)
 			valid_page = page;
 
 		if (!PageBuddy(page))
-			goto strict_check;
+			continue;
 
 		/*
-		 * The zone lock must be held to isolate freepages. This
-		 * unfortunately this is a very coarse lock and can be
+		 * The zone lock must be held to isolate freepages.
+		 * Unfortunately this is a very coarse lock and can be
 		 * heavily contended if there are parallel allocations
 		 * or parallel compactions. For async compaction do not
 		 * spin on the lock and we acquire the lock as late as
@@ -279,12 +278,12 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 
 		/* Recheck this is a buddy page under lock */
 		if (!PageBuddy(page))
-			goto strict_check;
+			continue;
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
 		if (!isolated && strict)
-			goto strict_check;
+			break;
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -296,20 +295,18 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 			blockpfn += isolated - 1;
 			cursor += isolated - 1;
 		}
-
-		continue;
-
-strict_check:
-		/* Abort isolation if the caller requested strict isolation */
-		if (strict) {
-			total_isolated = 0;
-			goto out;
-		}
 	}
 
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
 
-out:
+	/*
+	 * If strict isolation is requested by CMA then check that all the
+	 * pages scanned were isolated. If there were any failures, 0 is
+	 * returned and CMA will fail.
+	 */
+	if (strict && nr_scanned != total_isolated)
+		total_isolated = 0;
+
 	if (locked)
 		spin_unlock_irqrestore(&cc->zone->lock, flags);
 
@@ -339,14 +336,14 @@ isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
 	unsigned long isolated, pfn, block_end_pfn;
 	struct zone *zone = NULL;
 	LIST_HEAD(freelist);
-	struct compact_control cc;
-
-	if (pfn_valid(start_pfn))
-		zone = page_zone(pfn_to_page(start_pfn));
 
 	/* cc needed for isolate_freepages_block to acquire zone->lock */
-	cc.zone = zone;
-	cc.sync = true;
+	struct compact_control cc = {
+		.sync = true,
+	};
+
+	if (pfn_valid(start_pfn))
+		cc.zone = zone = page_zone(pfn_to_page(start_pfn));
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
 		if (!pfn_valid(pfn) || zone != page_zone(pfn_to_page(pfn)))

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-21 21:36   ` Andrew Morton
@ 2012-09-24  9:39     ` Mel Gorman
  2012-09-24 21:26       ` Andrew Morton
  0 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-24  9:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2012 11:46:22 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > When compaction was implemented it was known that scanning could potentially
> > be excessive. The ideal was that a counter be maintained for each pageblock
> > but maintaining this information would incur a severe penalty due to a
> > shared writable cache line. It has reached the point where the scanning
> > costs are an serious problem, particularly on long-lived systems where a
> > large process starts and allocates a large number of THPs at the same time.
> > 
> > Instead of using a shared counter, this patch adds another bit to the
> > pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> > either migrate or free scanner and 0 pages were isolated, the pageblock
> > is marked to be skipped in the future. When scanning, this bit is checked
> > before any scanning takes place and the block skipped if set.
> > 
> > The main difficulty with a patch like this is "when to ignore the cached
> > information?" If it's ignored too often, the scanning rates will still
> > be excessive. If the information is too stale then allocations will fail
> > that might have otherwise succeeded. In this patch
> > 
> > o CMA always ignores the information
> > o If the migrate and free scanner meet then the cached information will
> >   be discarded if it's at least 5 seconds since the last time the cache
> >   was discarded
> > o If there are a large number of allocation failures, discard the cache.
> > 
> > The time-based heuristic is very clumsy but there are few choices for a
> > better event. Depending solely on multiple allocation failures still allows
> > excessive scanning when THP allocations are failing in quick succession
> > due to memory pressure. Waiting until memory pressure is relieved would
> > cause compaction to continually fail instead of using reclaim/compaction
> > to try allocate the page. The time-based mechanism is clumsy but a better
> > option is not obvious.
> 
> ick.
> 

I know. I was being generous when I described it as "clumsy".

> Wall time has sooo little relationship to what's happening in there. 
> If we *have* to use polling, cannot we clock the poll with some metric
> which is at least vaguely related to the amount of activity?

Initially I wanted to depend on just this

        /* Clear pageblock skip if there are numerous alloc failures */
        if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
                reset_isolation_suitable(zone);

because this is at least related to activity but it's weak for two
reasons. One, it's depending on failures to make the decisions - i.e. when
it already is too late. Two, even this condition can be hit very quickly
and can result in many resets per second in the worst case.

> Number
> (or proportion) of pages scanned, for example?  Or reset everything on
> the Nth trip around the zone? 

For a full compaction failure we have scanned all pages in the zone so
there is no proportion to use there.

Resetting everything every Nth trip around the zone is similar to the
above check except it would look like

if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
		zone->compact_reset_laps == COMPACT_MAX_LAPS)
	reset_isolation_suitable(zone)

but it's weak for the same reasons - it depends on failures to make
decisions and can happen too quickly.

I also considered using the PGFREE vmstat to reset if a pageblock worth of
pages had been freed since the last reset but this happens very quickly
under memory pressure and would not throttle enough. I also considered
deferring until NR_FREE_PAGES was high enough but this would severely
impact allocation success rates under memory pressure.

> Or even a combination of one of these
> *and* of wall time, so the system will at least work harder when MM is
> under load.
> 

We sort of do that now - we are depending on a number of failures and
time before clearing the bits.
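
Collapsed into one place purely for illustration (in the patch the time
check lives inside reset_isolation_suitable() and the defer check is
made by the caller in compact_zone()), the effective gate on the
allocation-failure path is:

	/* reset only after repeated failures AND at least 5 seconds */
	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
	    !time_before(jiffies, zone->compact_blockskip_expire))
		reset_isolation_suitable(zone);

so both a run of failures and the passage of wall time are required
before the skip bits are cleared again.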

> Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> a pageblock's PB_migrate_skip synchronously, when something was done to
> that pageblock which justifies repolling it?
> 

The "something" event you are looking for is pages being freed or
allocated in the page allocator. A movable page being allocated in the
block or a page being freed should clear the PB_migrate_skip bit if it's
set. Unfortunately this would impact the alloc and free fast paths of
the page allocator. I felt that that was too high a price to pay.
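
For illustration only, the synchronous alternative would boil down to a
hook along these lines on every page free (with something similar on
the allocation side). The code is trivial but it adds a pageblock-flags
access to a path we try hard to keep lean:

	/*
	 * Hypothetical sketch, not part of this series: clear the skip
	 * hint whenever a page in the block is freed back to the buddy
	 * allocator. clear_pageblock_skip() is the helper used by the
	 * reset path in patch 8.
	 */
	static inline void note_pageblock_activity(struct page *page)
	{
		clear_pageblock_skip(page);
	}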

> >
> > ...
> >
> > +static void reset_isolation_suitable(struct zone *zone)
> > +{
> > +	unsigned long start_pfn = zone->zone_start_pfn;
> > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > +	unsigned long pfn;
> > +
> > +	/*
> > +	 * Do not reset more than once every five seconds. If allocations are
> > +	 * failing sufficiently quickly to allow this to happen then continually
> > +	 * scanning for compaction is not going to help. The choice of five
> > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > +	 */
> > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > +		return;
> > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > +
> > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > +		struct page *page;
> > +		if (!pfn_valid(pfn))
> > +			continue;
> > +
> > +		page = pfn_to_page(pfn);
> > +		if (zone != page_zone(page))
> > +			continue;
> > +
> > +		clear_pageblock_skip(page);
> > +	}
> 
> What's the worst-case loop count here?
> 

zone->spanned_pages >> pageblock_order

> > +}
> > +
> >
> > ...
> >
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-24  9:39     ` Mel Gorman
@ 2012-09-24 21:26       ` Andrew Morton
  2012-09-25  9:12         ` Mel Gorman
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Morton @ 2012-09-24 21:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Mon, 24 Sep 2012 10:39:38 +0100
Mel Gorman <mgorman@suse.de> wrote:

> On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> 
> > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > a pageblock's PB_migrate_skip synchronously, when something was done to
> > that pageblock which justifies repolling it?
> > 
> 
> The "something" event you are looking for is pages being freed or
> allocated in the page allocator. A movable page being allocated in block
> or a page being freed should clear the PB_migrate_skip bit if it's set.
> Unfortunately this would impact the fast path of the alloc and free paths
> of the page allocator. I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the "fast
path" really - it happens once per pcp unload.  Can we do something
like that?  Drop some hint into the zone without having to visit each
page?

> > >
> > > ...
> > >
> > > +static void reset_isolation_suitable(struct zone *zone)
> > > +{
> > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > +	unsigned long pfn;
> > > +
> > > +	/*
> > > +	 * Do not reset more than once every five seconds. If allocations are
> > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > +	 * scanning for compaction is not going to help. The choice of five
> > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > +	 */
> > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > +		return;
> > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > +
> > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > +		struct page *page;
> > > +		if (!pfn_valid(pfn))
> > > +			continue;
> > > +
> > > +		page = pfn_to_page(pfn);
> > > +		if (zone != page_zone(page))
> > > +			continue;
> > > +
> > > +		clear_pageblock_skip(page);
> > > +	}
> > 
> > What's the worst-case loop count here?
> > 
> 
> zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-21 10:46 ` [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible Mel Gorman
  2012-09-21 17:51   ` Rafael Aquini
@ 2012-09-25  7:05   ` Minchan Kim
  2012-09-25  7:51     ` Mel Gorman
  1 sibling, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2012-09-25  7:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

Hi Mel,

I have a question below.

On Fri, Sep 21, 2012 at 11:46:19AM +0100, Mel Gorman wrote:
> Compactions migrate scanner acquires the zone->lru_lock when scanning a range
> of pages looking for LRU pages to acquire. It does this even if there are
> no LRU pages in the range. If multiple processes are compacting then this
> can cause severe locking contention. To make matters worse commit b2eef8c0
> (mm: compaction: minimise the time IRQs are disabled while isolating pages
> for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
> scanned.
> 
> This patch makes two changes to how the migrate scanner acquires the LRU
> lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
> the lock is contended. This reduces the number of times it unnecessarily
> disables and re-enables IRQs. The second is that it defers acquiring the
> LRU lock for as long as possible. If there are no LRU pages or the only
> LRU pages are transhuge then the LRU lock will not be acquired at all
> which reduces contention on zone->lru_lock.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/compaction.c |   63 +++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 43 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 6b55491..a6068ff 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype)
>  	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
>  }
>  
> +static inline bool should_release_lock(spinlock_t *lock)
> +{
> +	return need_resched() || spin_is_contended(lock);
> +}
> +
>  /*
>   * Compaction requires the taking of some coarse locks that are potentially
>   * very heavily contended. Check if the process needs to be scheduled or
> @@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype)
>  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  				      bool locked, struct compact_control *cc)
>  {
> -	if (need_resched() || spin_is_contended(lock)) {
> +	if (should_release_lock(lock)) {
>  		if (locked) {
>  			spin_unlock_irqrestore(lock, *flags);
>  			locked = false;
> @@ -327,7 +332,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  	isolate_mode_t mode = 0;
>  	struct lruvec *lruvec;
>  	unsigned long flags;
> -	bool locked;
> +	bool locked = false;
>  
>  	/*
>  	 * Ensure that there are not too many pages isolated from the LRU
> @@ -347,23 +352,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  	/* Time to isolate some pages for migration */
>  	cond_resched();
> -	spin_lock_irqsave(&zone->lru_lock, flags);
> -	locked = true;
>  	for (; low_pfn < end_pfn; low_pfn++) {
>  		struct page *page;
>  
>  		/* give a chance to irqs before checking need_resched() */
> -		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> -			spin_unlock_irqrestore(&zone->lru_lock, flags);
> -			locked = false;
> +		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> +			if (should_release_lock(&zone->lru_lock)) {
> +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> +				locked = false;
> +			}
>  		}
>  
> -		/* Check if it is ok to still hold the lock */
> -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> -								locked, cc);
> -		if (!locked || fatal_signal_pending(current))
> -			break;
> -
>  		/*
>  		 * migrate_pfn does not necessarily start aligned to a
>  		 * pageblock. Ensure that pfn_valid is called when moving
> @@ -403,21 +402,38 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		pageblock_nr = low_pfn >> pageblock_order;
>  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
>  		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
> -			low_pfn += pageblock_nr_pages;
> -			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
> -			last_pageblock_nr = pageblock_nr;
> -			continue;
> +			goto next_pageblock;
>  		}
>  
> +		/* Check may be lockless but that's ok as we recheck later */
>  		if (!PageLRU(page))
>  			continue;
>  
>  		/*
> -		 * PageLRU is set, and lru_lock excludes isolation,
> -		 * splitting and collapsing (collapsing has already
> -		 * happened if PageLRU is set).
> +		 * PageLRU is set. lru_lock normally excludes isolation
> +		 * splitting and collapsing (collapsing has already happened
> +		 * if PageLRU is set) but the lock is not necessarily taken
> +		 * here and it is wasteful to take it just to check transhuge.
> +		 * Check transhuge without lock and skip if it's either a
> +		 * transhuge or hugetlbfs page.
>  		 */
>  		if (PageTransHuge(page)) {
> +			if (!locked)
> +				goto next_pageblock;

Why skip all pages in a pageblock if !locked?
Shouldn't we add some comment?

> +			low_pfn += (1 << compound_order(page)) - 1;
> +			continue;
> +		}
> +
> +		/* Check if it is ok to still hold the lock */
> +		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> +								locked, cc);
> +		if (!locked || fatal_signal_pending(current))
> +			break;
> +
> +		/* Recheck PageLRU and PageTransHuge under lock */
> +		if (!PageLRU(page))
> +			continue;
> +		if (PageTransHuge(page)) {
>  			low_pfn += (1 << compound_order(page)) - 1;
>  			continue;
>  		}
> @@ -444,6 +460,13 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			++low_pfn;
>  			break;
>  		}
> +
> +		continue;
> +
> +next_pageblock:
> +		low_pfn += pageblock_nr_pages;
> +		low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
> +		last_pageblock_nr = pageblock_nr;
>  	}
>  
>  	acct_isolated(zone, locked, cc);
> -- 
> 1.7.9.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long
  2012-09-21 10:46 ` [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long Mel Gorman
  2012-09-21 17:50   ` Rafael Aquini
  2012-09-21 21:31   ` Andrew Morton
@ 2012-09-25  7:34   ` Minchan Kim
  2 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2012-09-25  7:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:18AM +0100, Mel Gorman wrote:
> From: Shaohua Li <shli@fusionio.com>
> 
> Changelog since V2
> o Fix BUG_ON triggered due to pages left on cc.migratepages
> o Make compact_zone_order() require non-NULL arg `contended'
> 
> Changelog since V1
> o only abort the compaction if lock is contended or run too long
> o Rearranged the code by Andrea Arcangeli.
> 
> isolate_migratepages_range() might isolate no pages if for example when
> zone->lru_lock is contended and running asynchronous compaction. In this
> case, we should abort compaction, otherwise, compact_zone will run a
> useless loop and make zone->lru_lock is even contended.
> 
> [minchan@kernel.org: Putback pages isolated for migration if aborting]
> [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
  2012-09-21 10:46 ` [PATCH 6/9] mm: compaction: Acquire the zone->lock " Mel Gorman
  2012-09-21 17:52   ` Rafael Aquini
  2012-09-21 21:35   ` Andrew Morton
@ 2012-09-25  7:35   ` Minchan Kim
  2 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2012-09-25  7:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:20AM +0100, Mel Gorman wrote:
> Compactions free scanner acquires the zone->lock when checking for PageBuddy
> pages and isolating them. It does this even if there are no PageBuddy pages
> in the range.
> 
> This patch defers acquiring the zone lock for as long as possible. In the
> event there are no free pages in the pageblock then the lock will not be
> acquired at all which reduces contention on zone->lock.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
  2012-09-24  8:52     ` Mel Gorman
@ 2012-09-25  7:36       ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2012-09-25  7:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Mon, Sep 24, 2012 at 09:52:38AM +0100, Mel Gorman wrote:
> On Fri, Sep 21, 2012 at 02:35:57PM -0700, Andrew Morton wrote:
> > On Fri, 21 Sep 2012 11:46:20 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > Compactions free scanner acquires the zone->lock when checking for PageBuddy
> > > pages and isolating them. It does this even if there are no PageBuddy pages
> > > in the range.
> > > 
> > > This patch defers acquiring the zone lock for as long as possible. In the
> > > event there are no free pages in the pageblock then the lock will not be
> > > acquired at all which reduces contention on zone->lock.
> > >
> > > ...
> > >
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
> > >  	return compact_checklock_irqsave(lock, flags, false, cc);
> > >  }
> > >  
> > > +/* Returns true if the page is within a block suitable for migration to */
> > > +static bool suitable_migration_target(struct page *page)
> > > +{
> > > +
> > 
> > stray newline
> > 
> 
> Fixed.
> 
> > > +	int migratetype = get_pageblock_migratetype(page);
> > > +
> > > +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> > > +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> > > +		return false;
> > > +
> > > +	/* If the page is a large free page, then allow migration */
> > > +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> > > +		return true;
> > > +
> > > +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> > > +	if (migrate_async_suitable(migratetype))
> > > +		return true;
> > > +
> > > +	/* Otherwise skip the block */
> > > +	return false;
> > > +}
> > > +
> > >
> > > ...
> > >
> > > @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
> > >  		int isolated, i;
> > >  		struct page *page = cursor;
> > >  
> > > -		if (!pfn_valid_within(blockpfn)) {
> > > -			if (strict)
> > > -				return 0;
> > > -			continue;
> > > -		}
> > > +		if (!pfn_valid_within(blockpfn))
> > > +			goto strict_check;
> > >  		nr_scanned++;
> > >  
> > > -		if (!PageBuddy(page)) {
> > > -			if (strict)
> > > -				return 0;
> > > -			continue;
> > > -		}
> > > +		if (!PageBuddy(page))
> > > +			goto strict_check;
> > > +
> > > +		/*
> > > +		 * The zone lock must be held to isolate freepages. This
> > > +		 * unfortunately this is a very coarse lock and can be
> > 
> > this this
> > 
> 
> Fixed.
> 
> > > +		 * heavily contended if there are parallel allocations
> > > +		 * or parallel compactions. For async compaction do not
> > > +		 * spin on the lock and we acquire the lock as late as
> > > +		 * possible.
> > > +		 */
> > > +		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> > > +								locked, cc);
> > > +		if (!locked)
> > > +			break;
> > > +
> > > +		/* Recheck this is a suitable migration target under lock */
> > > +		if (!strict && !suitable_migration_target(page))
> > > +			break;
> > > +
> > > +		/* Recheck this is a buddy page under lock */
> > > +		if (!PageBuddy(page))
> > > +			goto strict_check;
> > >  
> > >  		/* Found a free page, break it into order-0 pages */
> > >  		isolated = split_free_page(page);
> > >  		if (!isolated && strict)
> > > -			return 0;
> > > +			goto strict_check;
> > >  		total_isolated += isolated;
> > >  		for (i = 0; i < isolated; i++) {
> > >  			list_add(&page->lru, freelist);
> > > @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
> > >  			blockpfn += isolated - 1;
> > >  			cursor += isolated - 1;
> > >  		}
> > > +
> > > +		continue;
> > > +
> > > +strict_check:
> > > +		/* Abort isolation if the caller requested strict isolation */
> > > +		if (strict) {
> > > +			total_isolated = 0;
> > > +			goto out;
> > > +		}
> > >  	}
> > >  
> > >  	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
> > > +
> > > +out:
> > > +	if (locked)
> > > +		spin_unlock_irqrestore(&cc->zone->lock, flags);
> > > +
> > >  	return total_isolated;
> > >  }
> > 
> > A a few things about this function.
> > 
> > Would it be cleaner if we did
> > 
> > 	if (!strict) {
> > 		if (!suitable_migration_target(page))
> > 			break;
> > 	} else {
> > 		if (!PageBuddy(page))
> > 			goto strict_check;
> > 	}
> > 
> > and then remove the test of `strict' from strict_check (which then
> > should be renamed)?
> > 
> 
> I was not able to implement what you suggested without breakage.
> However, I can do something very similar if "strict" mode does not bail
> out ASAP and instead double checks at the end that everything was
> isolated correctly. This does mean that CMA's failure case is slightly
> more expensive but that really is a corner case and I think it's
> acceptable if the code is easier to follow as a result.
> 
> > Which perhaps means that the whole `strict_check' block can go away:
> > 
> > 	if (!strict) {
> > 		if (!suitable_migration_target(page))
> > 			break;
> > 	} else {
> > 		if (!PageBuddy(page)) {
> > 			total_isolated = 0;
> > 			goto out;
> > 	}
> > 
> > Have a think about it?  The function is a little straggly.
> > 
> 
> Proposal below.
> 
> > Secondly, is it correct/desirable to skip the (now misnamed
> > `trace_mm_compaction_isolate_freepages') tracepoint generation if we
> > bailed out of the loop?  The fact that we entered
> > isolate_freepages_block() but failed to isolate anything is data which
> > people might be interested in?
> > 
> 
> You're right, it may be interesting for someone debugging CMA to know that
> nr_scanned != nr_isolated at the time of allocation failure. 
> 
> > Thirdly, that (existing) comment "This assumes the block is valid" is
> > either too vague or wrong.  If it's valid, why do we check
> > pfn_valid_within()?
> > 
> 
> Comment is stale and should be removed.
> 
> > > @@ -218,13 +272,18 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
> > >  unsigned long
> > >  isolate_freepages_range(unsigned long start_pfn, unsigned long end_pfn)
> > >  {
> > > -	unsigned long isolated, pfn, block_end_pfn, flags;
> > > +	unsigned long isolated, pfn, block_end_pfn;
> > >  	struct zone *zone = NULL;
> > >  	LIST_HEAD(freelist);
> > > +	struct compact_control cc;
> > >  
> > >  	if (pfn_valid(start_pfn))
> > >  		zone = page_zone(pfn_to_page(start_pfn));
> > >  
> > > +	/* cc needed for isolate_freepages_block to acquire zone->lock */
> > > +	cc.zone = zone;
> > > +	cc.sync = true;
> > 
> > We initialise two of cc's fields, leave the other 12 fields containing
> > random garbage, then start using it.  I see no bug here, but...
> > 
> 
> I get your point. At the very least we should initialise it with zeros.
> 
> How about this?
> 
> ---8<---
> mm: compaction: Iron out isolate_freepages_block() and isolate_freepages_range()
> 
> Andrew pointed out that isolate_freepages_block() is "straggly" and that
> isolate_freepages_range() makes delicate assumptions about how
> compact_control is used. This patch straightens isolate_freepages_block()
> out so that it flies straight, and initialises compact_control to zeros in
> isolate_freepages_range(). The code should be easier to follow and
> is functionally equivalent. The CMA failure path is now a little more
> expensive but that is a marginal corner-case.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left"
  2012-09-21 10:46 ` [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left" Mel Gorman
  2012-09-21 17:52   ` Rafael Aquini
@ 2012-09-25  7:37   ` Minchan Kim
  1 sibling, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2012-09-25  7:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Fri, Sep 21, 2012 at 11:46:21AM +0100, Mel Gorman wrote:
> This reverts commit 7db8889a (mm: have order > 0 compaction start off
> where it left) and commit de74f1cc (mm: have order > 0 compaction start
> near a pageblock with free pages). These patches were a good idea and
> tests confirmed that they massively reduced the amount of scanning but
> the implementation is complex and tricky to understand. A later patch
> will cache what pageblocks should be skipped and reimplements the
> concept of compact_cached_free_pfn on top for both migration and
> free scanners.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-25  7:05   ` Minchan Kim
@ 2012-09-25  7:51     ` Mel Gorman
  2012-09-25  8:13       ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Mel Gorman @ 2012-09-25  7:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Tue, Sep 25, 2012 at 04:05:17PM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> I have a question below.
> 
> On Fri, Sep 21, 2012 at 11:46:19AM +0100, Mel Gorman wrote:
> > Compaction's migrate scanner acquires the zone->lru_lock when scanning a range
> > of pages looking for LRU pages to acquire. It does this even if there are
> > no LRU pages in the range. If multiple processes are compacting then this
> > can cause severe locking contention. To make matters worse commit b2eef8c0
> > (mm: compaction: minimise the time IRQs are disabled while isolating pages
> > for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
> > scanned.
> > 
> > This patch makes two changes to how the migrate scanner acquires the LRU
> > lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
> > the lock is contended. This reduces the number of times it unnecessarily
> > disables and re-enables IRQs. The second is that it defers acquiring the
> > LRU lock for as long as possible. If there are no LRU pages or the only
> > LRU pages are transhuge then the LRU lock will not be acquired at all
> > which reduces contention on zone->lru_lock.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> >  mm/compaction.c |   63 +++++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 43 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 6b55491..a6068ff 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype)
> >  	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
> >  }
> >  
> > +static inline bool should_release_lock(spinlock_t *lock)
> > +{
> > +	return need_resched() || spin_is_contended(lock);
> > +}
> > +
> >  /*
> >   * Compaction requires the taking of some coarse locks that are potentially
> >   * very heavily contended. Check if the process needs to be scheduled or
> > @@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype)
> >  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >  				      bool locked, struct compact_control *cc)
> >  {
> > -	if (need_resched() || spin_is_contended(lock)) {
> > +	if (should_release_lock(lock)) {
> >  		if (locked) {
> >  			spin_unlock_irqrestore(lock, *flags);
> >  			locked = false;
> > @@ -327,7 +332,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >  	isolate_mode_t mode = 0;
> >  	struct lruvec *lruvec;
> >  	unsigned long flags;
> > -	bool locked;
> > +	bool locked = false;
> >  
> >  	/*
> >  	 * Ensure that there are not too many pages isolated from the LRU
> > @@ -347,23 +352,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >  
> >  	/* Time to isolate some pages for migration */
> >  	cond_resched();
> > -	spin_lock_irqsave(&zone->lru_lock, flags);
> > -	locked = true;
> >  	for (; low_pfn < end_pfn; low_pfn++) {
> >  		struct page *page;
> >  
> >  		/* give a chance to irqs before checking need_resched() */
> > -		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> > -			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > -			locked = false;
> > +		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> > +			if (should_release_lock(&zone->lru_lock)) {
> > +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> > +				locked = false;
> > +			}
> >  		}
> >  
> > -		/* Check if it is ok to still hold the lock */
> > -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> > -								locked, cc);
> > -		if (!locked || fatal_signal_pending(current))
> > -			break;
> > -
> >  		/*
> >  		 * migrate_pfn does not necessarily start aligned to a
> >  		 * pageblock. Ensure that pfn_valid is called when moving
> > @@ -403,21 +402,38 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >  		pageblock_nr = low_pfn >> pageblock_order;
> >  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
> >  		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
> > -			low_pfn += pageblock_nr_pages;
> > -			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
> > -			last_pageblock_nr = pageblock_nr;
> > -			continue;
> > +			goto next_pageblock;
> >  		}
> >  
> > +		/* Check may be lockless but that's ok as we recheck later */
> >  		if (!PageLRU(page))
> >  			continue;
> >  
> >  		/*
> > -		 * PageLRU is set, and lru_lock excludes isolation,
> > -		 * splitting and collapsing (collapsing has already
> > -		 * happened if PageLRU is set).
> > +		 * PageLRU is set. lru_lock normally excludes isolation
> > +		 * splitting and collapsing (collapsing has already happened
> > +		 * if PageLRU is set) but the lock is not necessarily taken
> > +		 * here and it is wasteful to take it just to check transhuge.
> > +		 * Check transhuge without lock and skip if it's either a
> > +		 * transhuge or hugetlbfs page.
> >  		 */
> >  		if (PageTransHuge(page)) {
> > +			if (!locked)
> > +				goto next_pageblock;
> 
> Why skip all pages in a pageblock if !locked?
> Shouldn't we add some comment?
> 

The comment is above the block already. The lru_lock normally excludes
isolation and splitting. If we do not hold the lock, it's not safe to
call compound_order so instead we skip the entire pageblock.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-25  7:51     ` Mel Gorman
@ 2012-09-25  8:13       ` Minchan Kim
  2012-09-25 21:39         ` Andrew Morton
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2012-09-25  8:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Tue, Sep 25, 2012 at 08:51:05AM +0100, Mel Gorman wrote:
> On Tue, Sep 25, 2012 at 04:05:17PM +0900, Minchan Kim wrote:
> > Hi Mel,
> > 
> > I have a question below.
> > 
> > On Fri, Sep 21, 2012 at 11:46:19AM +0100, Mel Gorman wrote:
> > > Compaction's migrate scanner acquires the zone->lru_lock when scanning a range
> > > of pages looking for LRU pages to acquire. It does this even if there are
> > > no LRU pages in the range. If multiple processes are compacting then this
> > > can cause severe locking contention. To make matters worse commit b2eef8c0
> > > (mm: compaction: minimise the time IRQs are disabled while isolating pages
> > > for migration) releases the lru_lock every SWAP_CLUSTER_MAX pages that are
> > > scanned.
> > > 
> > > This patch makes two changes to how the migrate scanner acquires the LRU
> > > lock. First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages if
> > > the lock is contended. This reduces the number of times it unnecessarily
> > > disables and re-enables IRQs. The second is that it defers acquiring the
> > > LRU lock for as long as possible. If there are no LRU pages or the only
> > > LRU pages are transhuge then the LRU lock will not be acquired at all
> > > which reduces contention on zone->lru_lock.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > ---
> > >  mm/compaction.c |   63 +++++++++++++++++++++++++++++++++++++------------------
> > >  1 file changed, 43 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 6b55491..a6068ff 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -50,6 +50,11 @@ static inline bool migrate_async_suitable(int migratetype)
> > >  	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
> > >  }
> > >  
> > > +static inline bool should_release_lock(spinlock_t *lock)
> > > +{
> > > +	return need_resched() || spin_is_contended(lock);
> > > +}
> > > +
> > >  /*
> > >   * Compaction requires the taking of some coarse locks that are potentially
> > >   * very heavily contended. Check if the process needs to be scheduled or
> > > @@ -62,7 +67,7 @@ static inline bool migrate_async_suitable(int migratetype)
> > >  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> > >  				      bool locked, struct compact_control *cc)
> > >  {
> > > -	if (need_resched() || spin_is_contended(lock)) {
> > > +	if (should_release_lock(lock)) {
> > >  		if (locked) {
> > >  			spin_unlock_irqrestore(lock, *flags);
> > >  			locked = false;
> > > @@ -327,7 +332,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> > >  	isolate_mode_t mode = 0;
> > >  	struct lruvec *lruvec;
> > >  	unsigned long flags;
> > > -	bool locked;
> > > +	bool locked = false;
> > >  
> > >  	/*
> > >  	 * Ensure that there are not too many pages isolated from the LRU
> > > @@ -347,23 +352,17 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> > >  
> > >  	/* Time to isolate some pages for migration */
> > >  	cond_resched();
> > > -	spin_lock_irqsave(&zone->lru_lock, flags);
> > > -	locked = true;
> > >  	for (; low_pfn < end_pfn; low_pfn++) {
> > >  		struct page *page;
> > >  
> > >  		/* give a chance to irqs before checking need_resched() */
> > > -		if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> > > -			spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > -			locked = false;
> > > +		if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
> > > +			if (should_release_lock(&zone->lru_lock)) {
> > > +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> > > +				locked = false;
> > > +			}
> > >  		}
> > >  
> > > -		/* Check if it is ok to still hold the lock */
> > > -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> > > -								locked, cc);
> > > -		if (!locked || fatal_signal_pending(current))
> > > -			break;
> > > -
> > >  		/*
> > >  		 * migrate_pfn does not necessarily start aligned to a
> > >  		 * pageblock. Ensure that pfn_valid is called when moving
> > > @@ -403,21 +402,38 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> > >  		pageblock_nr = low_pfn >> pageblock_order;
> > >  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
> > >  		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
> > > -			low_pfn += pageblock_nr_pages;
> > > -			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
> > > -			last_pageblock_nr = pageblock_nr;
> > > -			continue;
> > > +			goto next_pageblock;
> > >  		}
> > >  
> > > +		/* Check may be lockless but that's ok as we recheck later */
> > >  		if (!PageLRU(page))
> > >  			continue;
> > >  
> > >  		/*
> > > -		 * PageLRU is set, and lru_lock excludes isolation,
> > > -		 * splitting and collapsing (collapsing has already
> > > -		 * happened if PageLRU is set).
> > > +		 * PageLRU is set. lru_lock normally excludes isolation
> > > +		 * splitting and collapsing (collapsing has already happened
> > > +		 * if PageLRU is set) but the lock is not necessarily taken
> > > +		 * here and it is wasteful to take it just to check transhuge.
> > > +		 * Check transhuge without lock and skip if it's either a
> > > +		 * transhuge or hugetlbfs page.
> > >  		 */
> > >  		if (PageTransHuge(page)) {
> > > +			if (!locked)
> > > +				goto next_pageblock;
> > 
> > Why skip all pages in a pageblock if !locked?
> > Shouldn't we add some comment?
> > 
> 
> The comment is above the block already. The lru_lock normally excludes
> isolation and splitting. If we do not hold the lock, it's not safe to
> call compound_order so instead we skip the entire pageblock.

I see. To me, your saying is better than current comment.
I hope comment could be more explicit.

diff --git a/mm/compaction.c b/mm/compaction.c
index df01b4e..f1d2cc7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
                 * splitting and collapsing (collapsing has already happened
                 * if PageLRU is set) but the lock is not necessarily taken
                 * here and it is wasteful to take it just to check transhuge.
-                * Check transhuge without lock and skip if it's either a
-                * transhuge or hugetlbfs page.
+                * Check transhuge without lock and *skip* if it's either a
+                * transhuge or hugetlbfs page because it's not safe to call
+                * compound_order.
                 */
                if (PageTransHuge(page)) {
                        if (!locked)


Anyway, it's trivial; if anyone thinks it's valuable, feel free to apply it.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-24 21:26       ` Andrew Morton
@ 2012-09-25  9:12         ` Mel Gorman
  2012-09-25 20:03           ` Andrew Morton
  2012-09-26  0:49           ` Minchan Kim
  0 siblings, 2 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-25  9:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> On Mon, 24 Sep 2012 10:39:38 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > 
> > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > that pageblock which justifies repolling it?
> > > 
> > 
> > The "something" event you are looking for is pages being freed or
> > allocated in the page allocator. A movable page being allocated in block
> > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > Unfortunately this would impact the fast path of the alloc and free paths
> > of the page allocator. I felt that that was too high a price to pay.
> 
> We already do a similar thing in the page allocator: clearing of
> ->all_unreclaimable and ->pages_scanned. 

That is true but that is a simple write (shared cache line but still) to
a struct zone. Worse, now that you point it out, that's pretty stupid. It
should be checking if the value is non-zero before writing to it to avoid
a cache line bounce.
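
Roughly, a check-before-write in the free path (e.g. free_pcppages_bulk())
would look like this untested sketch:

	/* untested sketch: only dirty the cache line if the values changed */
	if (zone->all_unreclaimable)
		zone->all_unreclaimable = 0;
	if (zone->pages_scanned)
		zone->pages_scanned = 0;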

Clearing the PG_migrate_skip in this path to avoid the need to ever poll is
not as cheap as it needs to be:

set_pageblock_skip
  -> set_pageblock_flags_group
    -> page_zone
    -> page_to_pfn
    -> get_pageblock_bitmap
    -> pfn_to_bitidx
    -> __set_bit

> But that isn't on the "fast
> path" really - it happens once per pcp unload. 

That's still an important enough path that I'm wary of making it fatter
and that only covers the free path. To avoid the polling, the allocation
side needs to be handled too. It could be shoved down into rmqueue() to
put it into a slightly colder path but still, it's a price to pay to keep
compaction happy.

> Can we do something
> like that?  Drop some hint into the zone without having to visit each
> page?
> 

Not without incurring a cost, but yes, it is possible to give a hint on when
PG_migrate_skip should be cleared and move away from that time-based hammer.

First, we'd introduce a variant of get_pageblock_migratetype() that returns
all the bits for the pageblock flags and then helpers to extract either the
migratetype or the PG_migrate_skip. We already are incurring the cost of
get_pageblock_migratetype() so it will not be much more expensive than what
is already there. If there is an allocation or free within a pageblock that
has the PG_migrate_skip bit set then we increment a counter. When the counter
reaches some to-be-decided "threshold" then compaction may clear all the
bits. This would match the criteria of the clearing being based on activity.
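
A minimal sketch of what I mean; get_pageblock_flags(), pbflags_skip() and
NR_SKIPPED_BLOCK_ACTIVITY below are hypothetical names, only
inc_zone_page_state() exists today:

	/* hypothetical helpers, sketch only */
	static inline void note_skipped_block_activity(struct page *page)
	{
		/* one lookup returns the migratetype and PG_migrate_skip bits */
		unsigned long pbflags = get_pageblock_flags(page);

		if (pbflags_skip(pbflags))
			inc_zone_page_state(page, NR_SKIPPED_BLOCK_ACTIVITY);
	}

Compaction would then clear all the PG_migrate_skip bits and reset the
counter once it crosses the to-be-decided threshold.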

There are four potential problems with this

1. The logic to retrieve all the bits and split them up will be a little
   convoluted but maybe it would not be that bad.

2. The counter is a shared-writable cache line but obviously it could
   be moved to vmstat and incremented with inc_zone_page_state to offset
   the cost a little.

3. The biggest weakness is that there is no way to know if the
   counter is incremented based on activity in a small subset of blocks.

4. What should the threshold be?

The first problem is minor but the other three are potentially a mess.
Adding another vmstat counter is bad enough in itself but if the counter
is incremented based on a small subset of pageblocks, the hint becomes
potentially useless.

However, does this match what you have in mind or am I over-complicating
things?

> > > >
> > > > ...
> > > >
> > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > +{
> > > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > +	unsigned long pfn;
> > > > +
> > > > +	/*
> > > > +	 * Do not reset more than once every five seconds. If allocations are
> > > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > > +	 * scanning for compaction is not going to help. The choice of five
> > > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > > +	 */
> > > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > +		return;
> > > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > +
> > > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > +		struct page *page;
> > > > +		if (!pfn_valid(pfn))
> > > > +			continue;
> > > > +
> > > > +		page = pfn_to_page(pfn);
> > > > +		if (zone != page_zone(page))
> > > > +			continue;
> > > > +
> > > > +		clear_pageblock_skip(page);
> > > > +	}
> > > 
> > > What's the worst-case loop count here?
> > > 
> > 
> > zone->spanned_pages >> pageblock_order
> 
> What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)

Let's take an unlikely case - a 128G single-node machine. That loop count
on x86-64 would be 65536. It'll be fast enough, particularly in this
path.
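
For reference, the arithmetic behind that figure, assuming 4K pages and
pageblock_order == 9 (2M pageblocks):

	128G / 4K      = 33554432 pages spanned
	33554432 >> 9  = 65536 pageblocks to walk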

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-25  9:12         ` Mel Gorman
@ 2012-09-25 20:03           ` Andrew Morton
  2012-09-27 12:06             ` [PATCH] mm: compaction: cache if a pageblock was scanned and no pages were isolated -fix2 Mel Gorman
  2012-09-27 13:12             ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
  2012-09-26  0:49           ` Minchan Kim
  1 sibling, 2 replies; 42+ messages in thread
From: Andrew Morton @ 2012-09-25 20:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Tue, 25 Sep 2012 10:12:07 +0100
Mel Gorman <mgorman@suse.de> wrote:

> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We already are incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> as the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
> 
> There are four potential problems with this
> 
> 1. The logic to retrieve all the bits and split them up will be a little
>    convulated but maybe it would not be that bad.
> 
> 2. The counter is a shared-writable cache line but obviously it could
>    be moved to vmstat and incremented with inc_zone_page_state to offset
>    the cost a little.
> 
> 3. The biggested weakness is that there is not way to know if the
>    counter is incremented based on activity in a small subset of blocks.
> 
> 4. What should the threshold be?
> 
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subsets of pageblocks, the hint becomes
> is potentially useless.
> 
> However, does this match what you have in mind or am I over-complicating
> things?

Sounds complicated.

Using wall time really does suck.  Are you sure you can't think of
something more logical?

How would we demonstrate the suckage?  What would be the observable downside of
switching that 5 seconds to 5 hours?

> > > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > +		struct page *page;
> > > > > +		if (!pfn_valid(pfn))
> > > > > +			continue;
> > > > > +
> > > > > +		page = pfn_to_page(pfn);
> > > > > +		if (zone != page_zone(page))
> > > > > +			continue;
> > > > > +
> > > > > +		clear_pageblock_skip(page);
> > > > > +	}
> > > > 
> > > > What's the worst-case loop count here?
> > > > 
> > > 
> > > zone->spanned_pages >> pageblock_order
> > 
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
> 
> Lets take an unlikely case - 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.

That could easily exceed a millisecond.  Can/should we stick a
cond_resched() in there?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-25  8:13       ` Minchan Kim
@ 2012-09-25 21:39         ` Andrew Morton
  2012-09-26  0:23           ` Minchan Kim
  2012-09-26 10:17           ` Mel Gorman
  0 siblings, 2 replies; 42+ messages in thread
From: Andrew Morton @ 2012-09-25 21:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity,
	QEMU-devel, KVM, Linux-MM, LKML

On Tue, 25 Sep 2012 17:13:27 +0900
Minchan Kim <minchan@kernel.org> wrote:

> I see. To me, your saying is better than current comment.
> I hope comment could be more explicit.
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index df01b4e..f1d2cc7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>                  * splitting and collapsing (collapsing has already happened
>                  * if PageLRU is set) but the lock is not necessarily taken
>                  * here and it is wasteful to take it just to check transhuge.
> -                * Check transhuge without lock and skip if it's either a
> -                * transhuge or hugetlbfs page.
> +                * Check transhuge without lock and *skip* if it's either a
> +                * transhuge or hugetlbfs page because it's not safe to call
> +                * compound_order.
>                  */
>                 if (PageTransHuge(page)) {
>                         if (!locked)

Going a bit further:

--- a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
+++ a/mm/compaction.c
@@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
 		 * if PageLRU is set) but the lock is not necessarily taken
 		 * here and it is wasteful to take it just to check transhuge.
 		 * Check transhuge without lock and skip if it's either a
-		 * transhuge or hugetlbfs page.
+		 * transhuge or hugetlbfs page because calling compound_order()
+		 * requires lru_lock to exclude isolation and splitting.
 		 */
 		if (PageTransHuge(page)) {
 			if (!locked)
_


but...  the requirement to hold lru_lock for compound_order() is news
to me.  It doesn't seem to be written down or explained anywhere, and
one wonders why the cheerily undocumented compound_lock() doesn't have
this effect.  What's going on here??


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-25 21:39         ` Andrew Morton
@ 2012-09-26  0:23           ` Minchan Kim
  2012-09-26 10:17           ` Mel Gorman
  1 sibling, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2012-09-26  0:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity,
	QEMU-devel, KVM, Linux-MM, LKML

On Tue, Sep 25, 2012 at 02:39:31PM -0700, Andrew Morton wrote:
> On Tue, 25 Sep 2012 17:13:27 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > I see. To me, your saying is better than current comment.
> > I hope comment could be more explicit.
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index df01b4e..f1d2cc7 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >                  * splitting and collapsing (collapsing has already happened
> >                  * if PageLRU is set) but the lock is not necessarily taken
> >                  * here and it is wasteful to take it just to check transhuge.
> > -                * Check transhuge without lock and skip if it's either a
> > -                * transhuge or hugetlbfs page.
> > +                * Check transhuge without lock and *skip* if it's either a
> > +                * transhuge or hugetlbfs page because it's not safe to call
> > +                * compound_order.
> >                  */
> >                 if (PageTransHuge(page)) {
> >                         if (!locked)
> 
> Going a bit further:
> 
> --- a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
> +++ a/mm/compaction.c
> @@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
>  		 * if PageLRU is set) but the lock is not necessarily taken
>  		 * here and it is wasteful to take it just to check transhuge.
>  		 * Check transhuge without lock and skip if it's either a
> -		 * transhuge or hugetlbfs page.
> +		 * transhuge or hugetlbfs page because calling compound_order()
> +		 * requires lru_lock to exclude isolation and splitting.
>  		 */
>  		if (PageTransHuge(page)) {
>  			if (!locked)
> _
> 
> 
> but...  the requirement to hold lru_lock for compound_order() is news
> to me.  It doesn't seem to be written down or explained anywhere, and
> one wonders why the cheerily undocumented compound_lock() doesn't have
> this effect.  What's going on here??

First of all, I don't know why we should mention hugetlbfs in the comment.
I don't know hugetlbfs well, so I took some time to look through the code,
but I couldn't find a place that sets PG_lru, so I'm not sure a hugetlbfs
page can ever reach this code. Please correct me if I am wrong.

On THP, I think the compound_lock you mentioned would be okay too, but it
would be a sort of optimisation because we don't need both lru_lock and
compound_lock. If we hold lru_lock, we prevent the race with
__split_huge_page_refcount, so the page can't be split underneath us.

Namely, it's safe to call compound_order.
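
For reference, the exclusion comes from the split path taking the same
lock; heavily abbreviated from mm/huge_memory.c (the real function does
much more):

	static void __split_huge_page_refcount(struct page *page)
	{
		struct zone *zone = page_zone(page);

		/* serialises against lru_lock holders such as compaction */
		spin_lock_irq(&zone->lru_lock);
		/* ... distribute refcounts and flags to the tail pages ... */
		spin_unlock_irq(&zone->lru_lock);
	}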


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-25  9:12         ` Mel Gorman
  2012-09-25 20:03           ` Andrew Morton
@ 2012-09-26  0:49           ` Minchan Kim
  2012-09-27 12:14             ` Mel Gorman
  1 sibling, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2012-09-26  0:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > On Mon, 24 Sep 2012 10:39:38 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > > 
> > > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > that pageblock which justifies repolling it?
> > > > 
> > > 
> > > The "something" event you are looking for is pages being freed or
> > > allocated in the page allocator. A movable page being allocated in block
> > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > Unfortunately this would impact the fast path of the alloc and free paths
> > > of the page allocator. I felt that that was too high a price to pay.
> > 
> > We already do a similar thing in the page allocator: clearing of
> > ->all_unreclaimable and ->pages_scanned. 
> 
> That is true but that is a simple write (shared cache line but still) to
> a struct zone. Worse, now that you point it out, that's pretty stupid. It
> should be checking if the value is non-zero before writing to it to avoid
> a cache line bounce.
> 
> Clearing the PG_migrate_skip in this path to avoid the need to ever pool is
> not as cheap as it needs to
> 
> set_pageblock_skip
>   -> set_pageblock_flags_group
>     -> page_zone
>     -> page_to_pfn
>     -> get_pageblock_bitmap
>     -> pfn_to_bitidx
>     -> __set_bit
> 
> > But that isn't on the "fast
> > path" really - it happens once per pcp unload. 
> 
> That's still an important enough path that I'm wary of making it fatter
> and that only covers the free path. To avoid the polling, the allocation
> side needs to be handled too. It could be shoved down into rmqueue() to
> put it into a slightly colder path but still, it's a price to pay to keep
> compaction happy.
> 
> > Can we do something
> > like that?  Drop some hint into the zone without having to visit each
> > page?
> > 
> 
> Not without incurring a cost, but yes, t is possible to give a hint on when
> PG_migrate_skip should be cleared and move away from that time-based hammer.
> 
> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We already are incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> as the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
> 
> There are four potential problems with this
> 
> 1. The logic to retrieve all the bits and split them up will be a little
>    convulated but maybe it would not be that bad.
> 
> 2. The counter is a shared-writable cache line but obviously it could
>    be moved to vmstat and incremented with inc_zone_page_state to offset
>    the cost a little.
> 
> 3. The biggested weakness is that there is not way to know if the
>    counter is incremented based on activity in a small subset of blocks.
> 
> 4. What should the threshold be?
> 
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subsets of pageblocks, the hint becomes
> is potentially useless.

Another idea is that we could add two bits (PG_check_migrate/PG_check_free)
to the pageblock flags group.
In the allocation path, we would set PG_check_migrate in a pageblock;
in the free path, we would set PG_check_free in a pageblock.
They would then be cleared by compaction's scan, as happens now.
That way we can discard problems 3 and 4 at least.
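
A rough sketch of the idea, where PB_check_migrate and PB_check_free are
hypothetical new pageblock bits (set_pageblock_flags_group() is the
existing helper):

	/* sketch only: PB_check_migrate/PB_check_free do not exist today */
	static inline void mark_pageblock_alloc(struct page *page)
	{
		set_pageblock_flags_group(page, 1, PB_check_migrate,
					  PB_check_migrate);
	}

	static inline void mark_pageblock_free(struct page *page)
	{
		set_pageblock_flags_group(page, 1, PB_check_free,
					  PB_check_free);
	}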

Another idea is to cure it by fixing the fundamental problem:
make the zone's locks more fine-grained.
As time goes by, systems use bigger memory but our zone locks
aren't scalable. Recently, lru_lock and zone->lock contention reports
aren't rare, so I think it's a good time to move to the next step.

How about defining a struct sub_zone per 2G or 4G?
A zone could then have several sub_zones depending on its size, a sub_zone
could take over the current zone's role, and the zone would be just a
container of sub_zones.
Of course, it's not easy to implement but I think someday we should
go that way. Is it really overkill?
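
Very roughly, and purely illustrative (nothing like this exists today):

	/* illustrative only: per-chunk locks and LRU inside a zone */
	struct sub_zone {
		spinlock_t		lock;		/* buddy lists */
		spinlock_t		lru_lock;	/* LRU lists */
		struct free_area	free_area[MAX_ORDER];
		unsigned long		start_pfn;
		unsigned long		spanned_pages;
	};

	struct zone {
		/* ... existing fields ... */
		struct sub_zone		*subzones;
		int			nr_subzones;
	};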

> 
> However, does this match what you have in mind or am I over-complicating
> things?
> 
> > > > >
> > > > > ...
> > > > >
> > > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > > +{
> > > > > +	unsigned long start_pfn = zone->zone_start_pfn;
> > > > > +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > > +	unsigned long pfn;
> > > > > +
> > > > > +	/*
> > > > > +	 * Do not reset more than once every five seconds. If allocations are
> > > > > +	 * failing sufficiently quickly to allow this to happen then continually
> > > > > +	 * scanning for compaction is not going to help. The choice of five
> > > > > +	 * seconds is arbitrary but will mitigate excessive scanning.
> > > > > +	 */
> > > > > +	if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > > +		return;
> > > > > +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > > +
> > > > > +	/* Walk the zone and mark every pageblock as suitable for isolation */
> > > > > +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > +		struct page *page;
> > > > > +		if (!pfn_valid(pfn))
> > > > > +			continue;
> > > > > +
> > > > > +		page = pfn_to_page(pfn);
> > > > > +		if (zone != page_zone(page))
> > > > > +			continue;
> > > > > +
> > > > > +		clear_pageblock_skip(page);
> > > > > +	}
> > > > 
> > > > What's the worst-case loop count here?
> > > > 
> > > 
> > > zone->spanned_pages >> pageblock_order
> > 
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
> 
> Lets take an unlikely case - 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
  2012-09-25 21:39         ` Andrew Morton
  2012-09-26  0:23           ` Minchan Kim
@ 2012-09-26 10:17           ` Mel Gorman
  1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-26 10:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Tue, Sep 25, 2012 at 02:39:31PM -0700, Andrew Morton wrote:
> On Tue, 25 Sep 2012 17:13:27 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > I see. To me, your saying is better than current comment.
> > I hope comment could be more explicit.
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index df01b4e..f1d2cc7 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
> >                  * splitting and collapsing (collapsing has already happened
> >                  * if PageLRU is set) but the lock is not necessarily taken
> >                  * here and it is wasteful to take it just to check transhuge.
> > -                * Check transhuge without lock and skip if it's either a
> > -                * transhuge or hugetlbfs page.
> > +                * Check transhuge without lock and *skip* if it's either a
> > +                * transhuge or hugetlbfs page because it's not safe to call
> > +                * compound_order.
> >                  */
> >                 if (PageTransHuge(page)) {
> >                         if (!locked)
> 
> Going a bit further:
> 
> --- a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
> +++ a/mm/compaction.c
> @@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
>  		 * if PageLRU is set) but the lock is not necessarily taken
>  		 * here and it is wasteful to take it just to check transhuge.
>  		 * Check transhuge without lock and skip if it's either a
> -		 * transhuge or hugetlbfs page.
> +		 * transhuge or hugetlbfs page because calling compound_order()
> +		 * requires lru_lock to exclude isolation and splitting.
>  		 */
>  		if (PageTransHuge(page)) {
>  			if (!locked)
> _
> 
> but...  the requirement to hold lru_lock for compound_order() is news
> to me.  It doesn't seem to be written down or explained anywhere, and
> one wonders why the cheerily undocumented compound_lock() doesn't have
> this effect.  What's going on here??
> 

The lru_lock is not *required* for compound_order(). Normally, users of
compound_order() know that the page is not going to collapse underneath
them. The slub allocator is not going to have a compound page it controls
disappear unexpectedly and does not need additional locking for example.

In the case where we are potentially dealing with a THP page, we have to
take into account if it can collapse underneath us. In this case, there is a
race between when PageTransHuge is checked and when compound_order is called.
The race is probably harmless but it was easy to take into account in
this case.

I think the comment saying that lru_lock is required is misleading. How
about this?

                 * Check TransHuge without lock and skip the whole
                 * pageblock if it's either a transhuge or hugetlbfs page
		 * as calling compound_order without preventing THP
		 * splitting the page underneath us may return
		 * surprising results.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] mm: compaction: cache if a pageblock was scanned and no pages were isolated -fix2
  2012-09-25 20:03           ` Andrew Morton
@ 2012-09-27 12:06             ` Mel Gorman
  2012-09-27 13:12             ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
  1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-27 12:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

The clearing of PG_migrate_skip potentially takes a long time if the
zone is massive. Be safe and check if it needs to reschedule.

This is a fix for
mm-compaction-cache-if-a-pageblock-was-scanned-and-no-pages-were-isolated.patch

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index fb07abb..722d10f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -85,6 +85,9 @@ static void reset_isolation_suitable(struct zone *zone)
 	/* Walk the zone and mark every pageblock as suitable for isolation */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		struct page *page;
+
+		cond_resched();
+
 		if (!pfn_valid(pfn))
 			continue;
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-26  0:49           ` Minchan Kim
@ 2012-09-27 12:14             ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-27 12:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Richard Davies, Shaohua Li, Rik van Riel,
	Avi Kivity, QEMU-devel, KVM, Linux-MM, LKML

On Wed, Sep 26, 2012 at 09:49:30AM +0900, Minchan Kim wrote:
> On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> > On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > > On Mon, 24 Sep 2012 10:39:38 +0100
> > > Mel Gorman <mgorman@suse.de> wrote:
> > > 
> > > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > > > 
> > > > > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > > that pageblock which justifies repolling it?
> > > > > 
> > > > 
> > > > The "something" event you are looking for is pages being freed or
> > > > allocated in the page allocator. A movable page being allocated in block
> > > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > > Unfortunately this would impact the fast path of the alloc and free paths
> > > > of the page allocator. I felt that that was too high a price to pay.
> > > 
> > > We already do a similar thing in the page allocator: clearing of
> > > ->all_unreclaimable and ->pages_scanned. 
> > 
> > That is true but that is a simple write (shared cache line but still) to
> > a struct zone. Worse, now that you point it out, that's pretty stupid. It
> > should be checking if the value is non-zero before writing to it to avoid
> > a cache line bounce.
> > 
> > Clearing the PG_migrate_skip in this path to avoid the need to ever pool is
> > not as cheap as it needs to
> > 
> > set_pageblock_skip
> >   -> set_pageblock_flags_group
> >     -> page_zone
> >     -> page_to_pfn
> >     -> get_pageblock_bitmap
> >     -> pfn_to_bitidx
> >     -> __set_bit
> > 
> > > But that isn't on the "fast
> > > path" really - it happens once per pcp unload. 
> > 
> > That's still an important enough path that I'm wary of making it fatter
> > and that only covers the free path. To avoid the polling, the allocation
> > side needs to be handled too. It could be shoved down into rmqueue() to
> > put it into a slightly colder path but still, it's a price to pay to keep
> > compaction happy.
> > 
> > > Can we do something
> > > like that?  Drop some hint into the zone without having to visit each
> > > page?
> > > 
> > 
> > Not without incurring a cost, but yes, t is possible to give a hint on when
> > PG_migrate_skip should be cleared and move away from that time-based hammer.
> > 
> > First, we'd introduce a variant of get_pageblock_migratetype() that returns
> > all the bits for the pageblock flags and then helpers to extract either the
> > migratetype or the PG_migrate_skip. We already are incurring the cost of
> > get_pageblock_migratetype() so it will not be much more expensive than what
> > is already there. If there is an allocation or free within a pageblock that
> > as the PG_migrate_skip bit set then we increment a counter. When the counter
> > reaches some to-be-decided "threshold" then compaction may clear all the
> > bits. This would match the criteria of the clearing being based on activity.
> > 
> > There are four potential problems with this
> > 
> > 1. The logic to retrieve all the bits and split them up will be a little
> >    convulated but maybe it would not be that bad.
> > 
> > 2. The counter is a shared-writable cache line but obviously it could
> >    be moved to vmstat and incremented with inc_zone_page_state to offset
> >    the cost a little.
> > 
> > 3. The biggested weakness is that there is not way to know if the
> >    counter is incremented based on activity in a small subset of blocks.
> > 
> > 4. What should the threshold be?
> > 
> > The first problem is minor but the other three are potentially a mess.
> > Adding another vmstat counter is bad enough in itself but if the counter
> > is incremented based on a small subsets of pageblocks, the hint becomes
> > is potentially useless.
> 
> Another idea is that we could add two bits (PG_check_migrate/PG_check_free)
> to the pageblock flags group.
> In the allocation path, we would set PG_check_migrate in a pageblock;
> in the free path, we would set PG_check_free in a pageblock.
> They would then be cleared by compaction's scan, as happens now.
> That way we can discard problems 3 and 4 at least.
> 

Adding a second bit does not fix problem 3 or problem 4 at all. With two
bits, all activity could be concentrated on two blocks - one migrate and
one free. The threshold still has to be selected.

> Another idea is to cure it by fixing the fundamental problem:
> make the zone's locks more fine-grained.

Far easier said than done and only covers the contention problem. It
does nothing for the scanning problem.

> As time goes by, systems use bigger memory but our zone locks
> aren't scalable. Recently, lru_lock and zone->lock contention reports
> aren't rare, so I think it's a good time to move to the next step.
> 

Lock contention on both those locks recently was due to compaction
rather than something more fundamental.

> How about defining a struct sub_zone per 2G or 4G?
> A zone could then have several sub_zones depending on its size, a sub_zone
> could take over the current zone's role, and the zone would be just a
> container of sub_zones.
> Of course, it's not easy to implement but I think someday we should
> go that way. Is it really overkill?
> 

On one side, that greatly increases the cost of the page allocator and
the size of the zonelist it must walk as it'll need additional walks for
each of these lists. The interaction with fragmentation avoidance and
how it handles fallbacks would be particularly problematic. On the other
side, multiple sub-zones will also introduce multiple LRUs, making the
existing balancing problem considerably worse.

And again, all this would be aimed at contention and do nothing for the
scanning problem at hand.

I'm working on a patch that removes the time heuristic and that I think
might work. I will hopefully post it today.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
  2012-09-25 20:03           ` Andrew Morton
  2012-09-27 12:06             ` [PATCH] mm: compaction: cache if a pageblock was scanned and no pages were isolated -fix2 Mel Gorman
@ 2012-09-27 13:12             ` Mel Gorman
  1 sibling, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2012-09-27 13:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Richard Davies, Shaohua Li, Rik van Riel, Avi Kivity, QEMU-devel,
	KVM, Linux-MM, LKML

On Tue, Sep 25, 2012 at 01:03:52PM -0700, Andrew Morton wrote:
> On Tue, 25 Sep 2012 10:12:07 +0100
> Mel Gorman <mgorman@suse.de> wrote:
> 
> > First, we'd introduce a variant of get_pageblock_migratetype() that returns
> > all the bits for the pageblock flags and then helpers to extract either the
> > migratetype or the PG_migrate_skip. We already are incurring the cost of
> > get_pageblock_migratetype() so it will not be much more expensive than what
> > is already there. If there is an allocation or free within a pageblock that
> > as the PG_migrate_skip bit set then we increment a counter. When the counter
> > reaches some to-be-decided "threshold" then compaction may clear all the
> > bits. This would match the criteria of the clearing being based on activity.
> > 
> > There are four potential problems with this
> > 
> > 1. The logic to retrieve all the bits and split them up will be a little
> >    convulated but maybe it would not be that bad.
> > 
> > 2. The counter is a shared-writable cache line but obviously it could
> >    be moved to vmstat and incremented with inc_zone_page_state to offset
> >    the cost a little.
> > 
> > 3. The biggested weakness is that there is not way to know if the
> >    counter is incremented based on activity in a small subset of blocks.
> > 
> > 4. What should the threshold be?
> > 
> > The first problem is minor but the other three are potentially a mess.
> > Adding another vmstat counter is bad enough in itself but if the counter
> > is incremented based on a small subsets of pageblocks, the hint becomes
> > is potentially useless.
> > 
> > However, does this match what you have in mind or am I over-complicating
> > things?
> 
> Sounds complicated.
> 
> Using wall time really does suck. 

I know, we spent a fair amount of effort getting rid of congestion_wait()
from paths it did not belong to for similar reasons.

> Are you sure you can't think of
> something more logical?
> 

No, I'm not sure.

As a matter of general policy I should not encourage this, but apparently
you can nag better code out of me; the patch is below :). I would rather
it was added on top than merged with the time-based series so that it can
be reverted easily if necessary.

> How would we demonstrate the suckage?  What would be the observable downside of
> switching that 5 seconds to 5 hours?
> 

Reduced allocation success rates.

> > Let's take an unlikely case - a 128G single-node machine. That loop count
> > on x86-64 would be 65536. It'll be fast enough, particularly in this
> > path.
> 
> That could easily exceed a millisecond.  Can/should we stick a
> cond_resched() in there?

Ok, I think it is very unlikely but not impossible. I posted a patch for
it already.
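
For illustration, a sketch of where the cond_resched() could slot into
the pageblock walk in __reset_isolation_suitable() is below. The loop
body here is reconstructed for illustration and may differ in detail
from the patch that was actually posted.

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		struct page *page;

		/* give the scheduler a chance; the walk can cover ~64K pageblocks */
		cond_resched();

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);
		if (zone != page_zone(page))
			continue;

		clear_pageblock_skip(page);
	}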

Here is a candidate patch that replaces the time heuristic with one that
is based on VM activity. My own testing indicates that scan rates are
slightly higher with this patch than with the time heuristic, but well
within acceptable limits.

Richard, can you also test this patch and make sure your test case has
not regressed again please?

---8<---
mm: compaction: Clear PG_migrate_skip based on compaction and reclaim activity

Compaction caches if a pageblock was scanned and no pages were isolated
so that the pageblocks can be skipped in the future to reduce scanning.
This information is not cleared by the page allocator based on activity
due to the impact it would have on the page allocator fast paths. Hence
there is a requirement that something clear the cache or pageblocks will
be skipped forever. Currently the cache is cleared if there were a number
of recent allocation failures and it has not been cleared within the last
5 seconds. Time-based decisions like this are terrible as they have no
relationship to VM activity and are basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so this
patch implements a rough heuristic. There are two cases where the cache
is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanners meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction has just
   finished being deferred then a process will clear the cache and start
   a full scan. This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless. Allocations that can fail, such as THP,
will simply fail and requests that cannot fail will retry the allocation.
Tests indicated that scanning rates were roughly similar to when the
time-based heuristic was used and that allocation success rates were
also similar.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/compaction.h |   15 +++++++++++++
 include/linux/mmzone.h     |    3 ++-
 mm/compaction.c            |   50 ++++++++++++++++++++++++++++++--------------
 mm/page_alloc.c            |    1 +
 mm/vmscan.c                |    8 +++++++
 5 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 0e38a1d..6ecb6dc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -24,6 +24,7 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			bool sync, bool *contended, struct page **page);
 extern int compact_pgdat(pg_data_t *pgdat, int order);
+extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
 /* Do not skip compaction more than 64 times */
@@ -61,6 +62,16 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return zone->compact_considered < defer_limit;
 }
 
+/* Returns true if restarting compaction after many failures */
+static inline bool compaction_restarting(struct zone *zone, int order)
+{
+	if (order < zone->compact_order_failed)
+		return false;
+
+	return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
+		zone->compact_considered >= 1UL << zone->compact_defer_shift;
+}
+
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
@@ -74,6 +85,10 @@ static inline int compact_pgdat(pg_data_t *pgdat, int order)
 	return COMPACT_CONTINUE;
 }
 
+static inline void reset_isolation_suitable(pg_data_t *pgdat)
+{
+}
+
 static inline unsigned long compaction_suitable(struct zone *zone, int order)
 {
 	return COMPACT_SKIPPED;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e7792a3..ddce36a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,7 +369,8 @@ struct zone {
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-	unsigned long		compact_blockskip_expire;
+	/* Set to true when the PG_migrate_skip bits should be cleared */
+	bool			compact_blockskip_flush;
 
 	/* pfns where compaction scanners should start */
 	unsigned long		compact_cached_free_pfn;
diff --git a/mm/compaction.c b/mm/compaction.c
index 55d0999..8250b69 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -66,24 +66,15 @@ static inline bool isolation_suitable(struct compact_control *cc,
  * should be skipped for page isolation when the migrate and free page scanner
  * meet.
  */
-static void reset_isolation_suitable(struct zone *zone)
+static void __reset_isolation_suitable(struct zone *zone)
 {
 	unsigned long start_pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 	unsigned long pfn;
 
-	/*
-	 * Do not reset more than once every five seconds. If allocations are
-	 * failing sufficiently quickly to allow this to happen then continually
-	 * scanning for compaction is not going to help. The choice of five
-	 * seconds is arbitrary but will mitigate excessive scanning.
-	 */
-	if (time_before(jiffies, zone->compact_blockskip_expire))
-		return;
-
 	zone->compact_cached_migrate_pfn = start_pfn;
 	zone->compact_cached_free_pfn = end_pfn;
-	zone->compact_blockskip_expire = jiffies + (HZ * 5);
+	zone->compact_blockskip_flush = false;
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -102,9 +93,24 @@ static void reset_isolation_suitable(struct zone *zone)
 	}
 }
 
+void reset_isolation_suitable(pg_data_t *pgdat)
+{
+	int zoneid;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct zone *zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		/* Only flush if a full compaction finished recently */
+		if (zone->compact_blockskip_flush)
+			__reset_isolation_suitable(zone);
+	}
+}
+
 /*
  * If no pages were isolated then mark this pageblock to be skipped in the
- * future. The information is later cleared by reset_isolation_suitable().
+ * future. The information is later cleared by __reset_isolation_suitable().
  */
 static void update_pageblock_skip(struct compact_control *cc,
 			struct page *page, unsigned long nr_isolated,
@@ -827,7 +833,15 @@ static int compact_finished(struct zone *zone,
 
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn) {
-		reset_isolation_suitable(cc->zone);
+		/*
+		 * Mark that the PG_migrate_skip information should be cleared
+		 * by kswapd when it goes to sleep. kswapd does not set the
+		 * flag itself as the decision to clear it should be directly
+		 * based on an allocation request.
+		 */
+		if (!current_is_kswapd())
+			zone->compact_blockskip_flush = true;
+		
 		return COMPACT_COMPLETE;
 	}
 
@@ -950,9 +964,13 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
 	}
 
-	/* Clear pageblock skip if there are numerous alloc failures */
-	if (zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT)
-		reset_isolation_suitable(zone);
+	/*
+	 * Clear pageblock skip if there were failures recently and compaction
+	 * is about to be retried after being deferred. kswapd does not do
+	 * this reset as it'll reset the cached information when going to sleep.
+	 */
+	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+		__reset_isolation_suitable(zone);
 
 	migrate_prep_local();
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba34eee..0a1906b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2166,6 +2166,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				preferred_zone, migratetype);
 		if (page) {
 got_page:
+			preferred_zone->compact_blockskip_flush = false;
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f8f56f8..12b5b5a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2895,6 +2895,14 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
 
+		/*
+		 * Compaction records what page blocks it recently failed to
+		 * isolate pages from and skips them in the future scanning.
+		 * When kswapd is going to sleep, it is reasonable to assume
+		 * that pages and compaction may succeed so reset the cache.
+		 */
+		reset_isolation_suitable(pgdat);
+
 		if (!kthread_should_stop())
 			schedule();
 

^ permalink raw reply related	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2012-09-27 13:12 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-21 10:46 [PATCH 0/9] Reduce compaction scanning and lock contention Mel Gorman
2012-09-21 10:46 ` [PATCH 1/9] Revert "mm: compaction: check lock contention first before taking lock" Mel Gorman
2012-09-21 17:46   ` Rafael Aquini
2012-09-21 10:46 ` [PATCH 2/9] Revert "mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix" Mel Gorman
2012-09-21 17:47   ` Rafael Aquini
2012-09-21 10:46 ` [PATCH 3/9] Revert "mm: compaction: abort compaction loop if lock is contended or run too long" Mel Gorman
2012-09-21 17:48   ` Rafael Aquini
2012-09-21 10:46 ` [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long Mel Gorman
2012-09-21 17:50   ` Rafael Aquini
2012-09-21 21:31   ` Andrew Morton
2012-09-25  7:34   ` Minchan Kim
2012-09-21 10:46 ` [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible Mel Gorman
2012-09-21 17:51   ` Rafael Aquini
2012-09-25  7:05   ` Minchan Kim
2012-09-25  7:51     ` Mel Gorman
2012-09-25  8:13       ` Minchan Kim
2012-09-25 21:39         ` Andrew Morton
2012-09-26  0:23           ` Minchan Kim
2012-09-26 10:17           ` Mel Gorman
2012-09-21 10:46 ` [PATCH 6/9] mm: compaction: Acquire the zone->lock " Mel Gorman
2012-09-21 17:52   ` Rafael Aquini
2012-09-21 21:35   ` Andrew Morton
2012-09-24  8:52     ` Mel Gorman
2012-09-25  7:36       ` Minchan Kim
2012-09-25  7:35   ` Minchan Kim
2012-09-21 10:46 ` [PATCH 7/9] Revert "mm: have order > 0 compaction start off where it left" Mel Gorman
2012-09-21 17:52   ` Rafael Aquini
2012-09-25  7:37   ` Minchan Kim
2012-09-21 10:46 ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
2012-09-21 17:53   ` Rafael Aquini
2012-09-21 21:36   ` Andrew Morton
2012-09-24  9:39     ` Mel Gorman
2012-09-24 21:26       ` Andrew Morton
2012-09-25  9:12         ` Mel Gorman
2012-09-25 20:03           ` Andrew Morton
2012-09-27 12:06             ` [PATCH] mm: compaction: cache if a pageblock was scanned and no pages were isolated -fix2 Mel Gorman
2012-09-27 13:12             ` [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated Mel Gorman
2012-09-26  0:49           ` Minchan Kim
2012-09-27 12:14             ` Mel Gorman
2012-09-21 10:46 ` [PATCH 9/9] mm: compaction: Restart compaction from near where it left off Mel Gorman
2012-09-21 17:54   ` Rafael Aquini
2012-09-21 13:51 ` [PATCH 0/9] Reduce compaction scanning and lock contention Rik van Riel
