* [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock
@ 2014-06-09  9:26 ` Vlastimil Babka
  0 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

isolate_freepages_block() rechecks whether the pageblock is a suitable
migration target after it has taken the zone->lock. However, the check has
been optimized to occur only once per pageblock, and
compact_checklock_irqsave() might drop and reacquire the lock, which means
somebody else might have changed the pageblock's migratetype in the meantime.

Furthermore, nothing prevents the migratetype from changing right after
isolate_freepages_block() has finished isolating. Given how imperfect this is,
it's simpler to just rely on the unlocked check done in isolate_freepages(),
and not pretend that the recheck under lock guarantees anything. It is just a
heuristic after all.
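
For illustration, the unlocked check that compaction keeps relying on is
conceptually just a migratetype heuristic. A minimal userspace model (the
types and helper below are illustrative stand-ins, not the kernel's actual
suitable_migration_target() implementation):

#include <stdbool.h>

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE,
		   MIGRATE_MOVABLE, MIGRATE_CMA };

struct pageblock_info { enum migratetype mt; };

/* Read without zone->lock; the result may be stale and that is tolerated. */
static bool block_looks_suitable(const struct pageblock_info *pb)
{
	return pb->mt == MIGRATE_MOVABLE || pb->mt == MIGRATE_CMA;
}

int main(void)
{
	struct pageblock_info pb = { .mt = MIGRATE_MOVABLE };
	return block_looks_suitable(&pb) ? 0 : 1;
}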

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
I suggest folding mm-compactionc-isolate_freepages_block-small-tuneup.patch into this

 mm/compaction.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 5175019..b73b182 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	struct page *cursor, *valid_page = NULL;
 	unsigned long flags;
 	bool locked = false;
-	bool checked_pageblock = false;
 
 	cursor = pfn_to_page(blockpfn);
 
@@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		if (!locked)
 			break;
 
-		/* Recheck this is a suitable migration target under lock */
-		if (!strict && !checked_pageblock) {
-			/*
-			 * We need to check suitability of pageblock only once
-			 * and this isolate_freepages_block() is called with
-			 * pageblock range, so just check once is sufficient.
-			 */
-			checked_pageblock = true;
-			if (!suitable_migration_target(page))
-				break;
-		}
-
 		/* Recheck this is a buddy page under lock */
 		if (!PageBuddy(page))
 			goto isolate_fail;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

Async compaction aborts when it detects zone lock contention or when
need_resched() is true. David Rientjes has reported that in practice, most
direct async compactions for THP allocation abort due to need_resched(). This
means that a second direct compaction is never attempted, which might be OK
for a page fault, but khugepaged is intended to attempt a sync compaction in
that case, and currently it won't.

This patch replaces "bool contended" in compact_control with an enum that
distinguishes between aborting due to need_resched() and aborting due to lock
contention. This allows the abort to be propagated through all compaction
functions as before, but the direct compaction is declared contended only
when lock contention has been detected.

As a result, khugepaged will proceed with a second, sync compaction as
intended when the preceding async compaction aborted due to need_resched().
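
To make the intended reporting explicit, here is a small standalone model
(plain userspace C, not the kernel code itself) of the decision that the new
*contended assignment in compact_zone_order() makes in the diff below when
translating the enum back into the single bool seen by the allocator:

#include <stdbool.h>
#include <stdio.h>

enum compact_contended {
	COMPACT_CONTENDED_NONE = 0,	/* no contention detected */
	COMPACT_CONTENDED_SCHED,	/* aborted because need_resched() */
	COMPACT_CONTENDED_LOCK,		/* aborted because a lock was contended */
};

/* Only genuine lock contention is reported back as "contended". */
static bool report_contended(enum compact_contended c)
{
	return c == COMPACT_CONTENDED_LOCK;
}

int main(void)
{
	/* An abort caused purely by need_resched() no longer counts. */
	printf("sched abort contended: %d\n",
	       report_contended(COMPACT_CONTENDED_SCHED));	/* prints 0 */
	printf("lock abort contended:  %d\n",
	       report_contended(COMPACT_CONTENDED_LOCK));	/* prints 1 */
	return 0;
}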

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c | 20 ++++++++++++++------
 mm/internal.h   | 15 +++++++++++----
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b73b182..d37f4a8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
 }
 #endif /* CONFIG_COMPACTION */
 
-static inline bool should_release_lock(spinlock_t *lock)
+enum compact_contended should_release_lock(spinlock_t *lock)
 {
-	return need_resched() || spin_is_contended(lock);
+	if (need_resched())
+		return COMPACT_CONTENDED_SCHED;
+	else if (spin_is_contended(lock))
+		return COMPACT_CONTENDED_LOCK;
+	else
+		return COMPACT_CONTENDED_NONE;
 }
 
 /*
@@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 				      bool locked, struct compact_control *cc)
 {
-	if (should_release_lock(lock)) {
+	enum compact_contended contended = should_release_lock(lock);
+
+	if (contended) {
 		if (locked) {
 			spin_unlock_irqrestore(lock, *flags);
 			locked = false;
@@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 		/* async aborts if taking too long or contended */
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = true;
+			cc->contended = contended;
 			return false;
 		}
 
@@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
 	/* async compaction aborts if contended */
 	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = true;
+			cc->contended = COMPACT_CONTENDED_SCHED;
 			return true;
 		}
 
@@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
-	*contended = cc.contended;
+	/* We only signal lock contention back to the allocator */
+	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
 	return ret;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11f..4659e8e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
+/* Used to signal whether compaction detected need_sched() or lock contention */
+enum compact_contended {
+	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
+	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
+	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
+};
+
 /*
  * in mm/compaction.c
  */
@@ -144,10 +151,10 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
-	bool contended;			/* True if a lock was contended, or
-					 * need_resched() true during async
-					 * compaction
-					 */
+	enum compact_contended contended; /* Signal need_sched() or lock
+					   * contention detected during
+					   * compaction
+					   */
 };
 
 unsigned long
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

Compaction scanners regularly check for lock contention and need_resched()
through the compact_checklock_irqsave() function. However, if there is no
contention, the lock can be held and IRQs disabled for a potentially long
time.

This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
time IRQs are disabled while isolating pages for migration") for the migration
scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
acquire the zone->lru_lock as late as possible") has changed the conditions so
that the lock is dropped only when there's contention on the lock or
need_resched() is true. Also, need_resched() is checked only when the lock is
already held. The comment "give a chance to irqs before checking need_resched"
is therefore misleading, as IRQs remain disabled when the check is done.

This patch restores the behavior intended by commit b2eef8c0d0, and also tries
to better balance, and make more deterministic, the time spent checking for
contention versus the time the scanners run between checks. It also avoids
situations where the checking was previously not done often enough. The result
should be avoiding both too frequent and too infrequent contention checking,
and especially the potentially long-running scans with IRQs disabled and no
checking of need_resched() or of a pending fatal signal, which could happen
when many consecutive pages or pageblocks failed the preliminary tests and
never reached the later call site of compact_checklock_irqsave(), as explained
below.

Before the patch:

In the migration scanner, compact_checklock_irqsave() was called on each loop
iteration, if reached. If not reached, some lower-frequency checking could
still be done if the lock was already held, but this would not result in
aborting contended async compaction until compact_checklock_irqsave() or the
end of the pageblock was reached. The free scanner was similar, but completely
without the periodic checking, so the lock could potentially be held until the
end of the pageblock.

After the patch, in both scanners:

The periodic check is done as the first thing in the loop, on each
SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
function, which always drops the lock (if it is held) and aborts async
compaction if scheduling is needed or someone else holds the lock. It also
aborts any type of compaction when a fatal signal is pending.

The compact_checklock_irqsave() function is replaced with a slightly different
compact_trylock_irqsave(). The biggest difference is that the function is not
called at all if the lock is already held; the periodic contention checking is
left solely to compact_unlock_should_abort(). If the lock is not held, the
function still avoids a contended run for async compaction by aborting when
the trylock fails. Sync compaction does not use trylock.
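
To give a feel for the resulting cadence, here is a small standalone model
(plain userspace C; SWAP_CLUSTER_MAX of 32 and an order-9 pageblock of 512
pages are assumed for illustration, they are not taken from this patch):

#include <stdio.h>

#define SWAP_CLUSTER_MAX	32UL	/* assumed typical value */
#define PAGEBLOCK_NR_PAGES	512UL	/* e.g. 2MB pageblock with 4KB pages */

int main(void)
{
	unsigned long pfn, checks = 0;

	for (pfn = 0; pfn < PAGEBLOCK_NR_PAGES; pfn++)
		if (!(pfn % SWAP_CLUSTER_MAX))
			checks++;	/* compact_unlock_should_abort() runs here */

	/* prints: 16 contention checks per 512-page pageblock */
	printf("%lu contention checks per %lu-page pageblock\n",
	       checks, PAGEBLOCK_NR_PAGES);
	return 0;
}

So the scanners now drop the lock and re-evaluate contention roughly every 32
pages, instead of, in the worst case, not at all within a pageblock.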

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
V2: do not consider need/cond_resched() in compact_trylock_irqsave(); spelling
    remove inline: compaction.o size reduced
 mm/compaction.c | 121 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 79 insertions(+), 42 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d37f4a8..e1a4283 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -185,54 +185,77 @@ static void update_pageblock_skip(struct compact_control *cc,
 }
 #endif /* CONFIG_COMPACTION */
 
-enum compact_contended should_release_lock(spinlock_t *lock)
+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. For async compaction, back out if the lock cannot
+ * be taken immediately. For sync compaction, spin on the lock if needed.
+ *
+ * Returns true if the lock is held
+ * Returns false if the lock is not held and compaction should abort
+ */
+static bool compact_trylock_irqsave(spinlock_t *lock,
+			unsigned long *flags, struct compact_control *cc)
 {
-	if (need_resched())
-		return COMPACT_CONTENDED_SCHED;
-	else if (spin_is_contended(lock))
-		return COMPACT_CONTENDED_LOCK;
-	else
-		return COMPACT_CONTENDED_NONE;
+	if (cc->mode == MIGRATE_ASYNC) {
+		if (!spin_trylock_irqsave(lock, *flags)) {
+			cc->contended = COMPACT_CONTENDED_LOCK;
+			return false;
+		}
+	} else {
+		spin_lock_irqsave(lock, *flags);
+	}
+
+	return true;
 }
 
 /*
  * Compaction requires the taking of some coarse locks that are potentially
- * very heavily contended. Check if the process needs to be scheduled or
- * if the lock is contended. For async compaction, back out in the event
- * if contention is severe. For sync compaction, schedule.
+ * very heavily contended. The lock should be periodically unlocked to avoid
+ * having disabled IRQs for a long time, even when there is nobody waiting on
+ * the lock. It might also be that allowing the IRQs will result in
+ * need_resched() becoming true. If scheduling is needed, or somebody else
+ * has taken the lock, async compaction aborts. Sync compaction schedules.
+ * Either compaction type will also abort if a fatal signal is pending.
+ * In either case if the lock was locked, it is dropped and not regained.
  *
- * Returns true if the lock is held.
- * Returns false if the lock is released and compaction should abort
+ * Returns true if compaction should abort due to fatal signal pending, or
+ *		async compaction due to lock contention or need to schedule
+ * Returns false when compaction can continue (sync compaction might have
+ *		scheduled)
  */
-static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
-				      bool locked, struct compact_control *cc)
+static bool compact_unlock_should_abort(spinlock_t *lock,
+		unsigned long flags, bool *locked, struct compact_control *cc)
 {
-	enum compact_contended contended = should_release_lock(lock);
+	if (*locked) {
+		spin_unlock_irqrestore(lock, flags);
+		*locked = false;
+	}
 
-	if (contended) {
-		if (locked) {
-			spin_unlock_irqrestore(lock, *flags);
-			locked = false;
-		}
+	if (fatal_signal_pending(current)) {
+		cc->contended = COMPACT_CONTENDED_SCHED;
+		return true;
+	}
 
-		/* async aborts if taking too long or contended */
-		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = contended;
-			return false;
+	if (cc->mode == MIGRATE_ASYNC) {
+		if (need_resched()) {
+			cc->contended = COMPACT_CONTENDED_SCHED;
+			return true;
 		}
-
+		if (spin_is_locked(lock)) {
+			cc->contended = COMPACT_CONTENDED_LOCK;
+			return true;
+		}
+	} else {
 		cond_resched();
 	}
 
-	if (!locked)
-		spin_lock_irqsave(lock, *flags);
-	return true;
+	return false;
 }
 
 /*
  * Aside from avoiding lock contention, compaction also periodically checks
  * need_resched() and either schedules in sync compaction or aborts async
- * compaction. This is similar to what compact_checklock_irqsave() does, but
+ * compaction. This is similar to what compact_unlock_should_abort() does, but
  * is used where no lock is concerned.
  *
  * Returns false when no scheduling was needed, or sync compaction scheduled.
@@ -291,6 +314,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		int isolated, i;
 		struct page *page = cursor;
 
+		/*
+		 * Periodically drop the lock (if held) regardless of its
+		 * contention, to give chance to IRQs. Abort async compaction
+		 * if contended.
+		 */
+		if (!(blockpfn % SWAP_CLUSTER_MAX)
+		    && compact_unlock_should_abort(&cc->zone->lock, flags,
+								&locked, cc))
+			break;
+
 		nr_scanned++;
 		if (!pfn_valid_within(blockpfn))
 			goto isolate_fail;
@@ -308,8 +341,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		 * spin on the lock and we acquire the lock as late as
 		 * possible.
 		 */
-		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
-								locked, cc);
+		if (!locked)
+			locked = compact_trylock_irqsave(&cc->zone->lock,
+								&flags, cc);
 		if (!locked)
 			break;
 
@@ -514,13 +548,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
-		/* give a chance to irqs before checking need_resched() */
-		if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
-			if (should_release_lock(&zone->lru_lock)) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				locked = false;
-			}
-		}
+		/*
+		 * Periodically drop the lock (if held) regardless of its
+		 * contention, to give chance to IRQs. Abort async compaction
+		 * if contended.
+		 */
+		if (!(low_pfn % SWAP_CLUSTER_MAX)
+		    && compact_unlock_should_abort(&zone->lru_lock, flags,
+								&locked, cc))
+			break;
 
 		/*
 		 * migrate_pfn does not necessarily start aligned to a
@@ -622,10 +658,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		    page_count(page) > page_mapcount(page))
 			continue;
 
-		/* Check if it is ok to still hold the lock */
-		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
-								locked, cc);
-		if (!locked || fatal_signal_pending(current))
+		/* If the lock is not held, try to take it */
+		if (!locked)
+			locked = compact_trylock_irqsave(&zone->lru_lock,
+								&flags, cc);
+		if (!locked)
 			break;
 
 		/* Recheck PageLRU and PageTransHuge under lock */
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 04/10] mm, compaction: skip rechecks when lock was already held
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

Compaction scanners try to take the zone locks as late as possible by checking
many page or pageblock properties opportunistically without the lock, and
skipping the pages or pageblocks that turn out to be unsuitable. For pages
that pass the initial checks, some properties have to be checked again safely
under the lock. However, if the lock was already held from a previous
iteration when the initial checks were done, the rechecks are unnecessary.

This patch therefore skips the rechecks when the lock was already held. This
is now possible to do, since we no longer (potentially) drop and reacquire the
lock between the initial checks and the safe rechecks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
V2: remove goto skip_recheck

 mm/compaction.c | 53 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e1a4283..83f72bd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -334,22 +334,30 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 			goto isolate_fail;
 
 		/*
-		 * The zone lock must be held to isolate freepages.
-		 * Unfortunately this is a very coarse lock and can be
-		 * heavily contended if there are parallel allocations
-		 * or parallel compactions. For async compaction do not
-		 * spin on the lock and we acquire the lock as late as
-		 * possible.
+		 * If we already hold the lock, we can skip some rechecking.
+		 * Note that if we hold the lock now, checked_pageblock was
+		 * already set in some previous iteration (or strict is true),
+		 * so it is correct to skip the suitable migration target
+		 * recheck as well.
 		 */
-		if (!locked)
+		if (!locked) {
+			/*
+			 * The zone lock must be held to isolate freepages.
+			 * Unfortunately this is a very coarse lock and can be
+			 * heavily contended if there are parallel allocations
+			 * or parallel compactions. For async compaction do not
+			 * spin on the lock and we acquire the lock as late as
+			 * possible.
+			 */
 			locked = compact_trylock_irqsave(&cc->zone->lock,
 								&flags, cc);
-		if (!locked)
-			break;
+			if (!locked)
+				break;
 
-		/* Recheck this is a buddy page under lock */
-		if (!PageBuddy(page))
-			goto isolate_fail;
+			/* Recheck this is a buddy page under lock */
+			if (!PageBuddy(page))
+				goto isolate_fail;
+		}
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
@@ -658,19 +666,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		    page_count(page) > page_mapcount(page))
 			continue;
 
-		/* If the lock is not held, try to take it */
-		if (!locked)
+		/* If we already hold the lock, we can skip some rechecking */
+		if (!locked) {
 			locked = compact_trylock_irqsave(&zone->lru_lock,
 								&flags, cc);
-		if (!locked)
-			break;
+			if (!locked)
+				break;
 
-		/* Recheck PageLRU and PageTransHuge under lock */
-		if (!PageLRU(page))
-			continue;
-		if (PageTransHuge(page)) {
-			low_pfn += (1 << compound_order(page)) - 1;
-			continue;
+			/* Recheck PageLRU and PageTransHuge under lock */
+			if (!PageLRU(page))
+				continue;
+			if (PageTransHuge(page)) {
+				low_pfn += (1 << compound_order(page)) - 1;
+				continue;
+			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, zone);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

Unlike the migration scanner, the free scanner remembers the beginning of the
last scanned pageblock in cc->free_pfn. It might therefore rescan pages
uselessly when called several times during a single compaction. This might
have been useful when pages were returned to the buddy allocator after a
failed migration, but that is no longer the case.

This patch changes the meaning of cc->free_pfn so that if it points to the
middle of a pageblock, that pageblock is scanned only from cc->free_pfn to its
end. isolate_freepages_block() will record the pfn of the last page it looked
at, which is then used to update cc->free_pfn.
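
A small standalone sketch of the restart logic (plain userspace C; the pfn
values and the 512-page pageblock size are made up for illustration):

#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 512UL

int main(void)
{
	unsigned long block_start_pfn = 1024, free_pfn = 1200; /* hypothetical */
	unsigned long block_end_pfn =
		(block_start_pfn / PAGEBLOCK_NR_PAGES + 1) * PAGEBLOCK_NR_PAGES;
	unsigned long isolate_start_pfn = block_start_pfn;

	/* If restarting within this pageblock, do not rescan its beginning. */
	if (free_pfn < block_end_pfn)
		isolate_start_pfn = free_pfn;

	/* prints: scan [1200, 1536) instead of [1024, 1536) */
	printf("scan [%lu, %lu) instead of [%lu, %lu)\n",
	       isolate_start_pfn, block_end_pfn, block_start_pfn, block_end_pfn);
	return 0;
}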

In the mmtests stress-highalloc benchmark, this has lowered the ratio of pages
scanned by the two scanners from 2.5 to 2.25 free pages per migrate page,
without affecting success rates.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 83f72bd..58dfaaa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
  * (even though it may still end up isolating some pages).
  */
 static unsigned long isolate_freepages_block(struct compact_control *cc,
-				unsigned long blockpfn,
+				unsigned long *start_pfn,
 				unsigned long end_pfn,
 				struct list_head *freelist,
 				bool strict)
@@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	struct page *cursor, *valid_page = NULL;
 	unsigned long flags;
 	bool locked = false;
+	unsigned long blockpfn = *start_pfn;
 
 	cursor = pfn_to_page(blockpfn);
 
@@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		int isolated, i;
 		struct page *page = cursor;
 
+		/* Record how far we have got within the block */
+		*start_pfn = blockpfn;
+
 		/*
 		 * Periodically drop the lock (if held) regardless of its
 		 * contention, to give chance to IRQs. Abort async compaction
@@ -424,6 +428,9 @@ isolate_freepages_range(struct compact_control *cc,
 	LIST_HEAD(freelist);
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
+		/* Protect pfn from changing by isolate_freepages_block */
+		unsigned long isolate_start_pfn = pfn;
+
 		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
 			break;
 
@@ -434,8 +441,8 @@ isolate_freepages_range(struct compact_control *cc,
 		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
-						   &freelist, true);
+		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
+						block_end_pfn, &freelist, true);
 
 		/*
 		 * In strict mode, isolate_freepages_block() returns 0 if
@@ -774,6 +781,7 @@ static void isolate_freepages(struct zone *zone,
 				block_end_pfn = block_start_pfn,
 				block_start_pfn -= pageblock_nr_pages) {
 		unsigned long isolated;
+		unsigned long isolate_start_pfn;
 
 		/*
 		 * This can iterate a massively long zone without finding any
@@ -807,12 +815,27 @@ static void isolate_freepages(struct zone *zone,
 			continue;
 
 		/* Found a block suitable for isolating free pages from */
-		cc->free_pfn = block_start_pfn;
-		isolated = isolate_freepages_block(cc, block_start_pfn,
+		isolate_start_pfn = block_start_pfn;
+
+		/*
+		 * If we are restarting the free scanner in this block, do not
+		 * rescan the beginning of the block
+		 */
+		if (cc->free_pfn < block_end_pfn)
+			isolate_start_pfn = cc->free_pfn;
+
+		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
 					block_end_pfn, freelist, false);
 		nr_freepages += isolated;
 
 		/*
+		 * Remember where the free scanner should restart next time.
+		 * This will point to the last page of pageblock we just
+		 * scanned, if we scanned it fully.
+		 */
+		cc->free_pfn = isolate_start_pfn;
+
+		/*
 		 * Set a flag that we successfully isolated in this pageblock.
 		 * In the next loop iteration, zone->compact_cached_free_pfn
 		 * will not be updated and thus it will effectively contain the
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 06/10] mm, compaction: skip buddy pages by their order in the migrate scanner
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

The migration scanner skips PageBuddy pages, but does not consider their
order, as checking page_order() is generally unsafe without holding the
zone->lock, and acquiring the lock just for the check wouldn't be a good
tradeoff.

Still, skipping by order could avoid some iterations over the rest of the
buddy page, and if we are careful, the race window between the PageBuddy()
check and page_order() is small; the worst that can happen is that we skip too
much and miss some isolation candidates. This is not that bad, as compaction
can already fail for many other reasons, such as parallel allocations, and
those have a much larger race window.

This patch therefore makes the migration scanner obtain the buddy page order
and use it to skip the whole buddy page, if the order appears to be in the
valid range.

It's important that page_order() is read only once, so that the value used in
the checks and in the pfn calculation is the same. But in theory the compiler
can replace the local variable with multiple inlined calls to page_order().
Therefore, the patch introduces page_order_unsafe(), which uses ACCESS_ONCE()
to prevent this.
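
As a rough illustration of the hazard (plain userspace C, not the kernel
code; ACCESS_ONCE() is modelled as a volatile cast and MAX_ORDER is assumed
to be 11):

#define READ_ONCE_UL(x)	(*(volatile unsigned long *)&(x))

/* Stand-in for the racy page_private(page) that another CPU may rewrite. */
static unsigned long fake_page_private;

static unsigned long pfns_to_skip(void)
{
	/*
	 * Read the order exactly once into a local. With a plain read, the
	 * compiler could legally reload fake_page_private for each use, and
	 * the range check below would no longer guard the shift.
	 */
	unsigned long freepage_order = READ_ONCE_UL(fake_page_private);

	if (freepage_order > 0 && freepage_order < 11 /* assumed MAX_ORDER */)
		return (1UL << freepage_order) - 1;

	return 0;
}

int main(void)
{
	fake_page_private = 3;
	return pfns_to_skip() == 7 ? 0 : 1;	/* order 3 skips 7 more pfns */
}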

Preliminary results with stress-highalloc from mmtests show a 10% reduction in
the number of pages scanned by the migration scanner. This change is also
important for later work that detects when a cc->order block of pages cannot
be compacted, so that the scanner can skip to the next block instead of
wasting time.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
V2: fix low_pfn > end_pfn check; comments
    kept page_order_unsafe() approach for now

 mm/compaction.c | 25 ++++++++++++++++++++++---
 mm/internal.h   | 20 +++++++++++++++++++-
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 58dfaaa..11c0926 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -626,11 +626,23 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		}
 
 		/*
-		 * Skip if free. page_order cannot be used without zone->lock
-		 * as nothing prevents parallel allocations or buddy merging.
+		 * Skip if free. We read page order here without zone lock
+		 * which is generally unsafe, but the race window is small and
+		 * the worst thing that can happen is that we skip some
+		 * potential isolation targets.
 		 */
-		if (PageBuddy(page))
+		if (PageBuddy(page)) {
+			unsigned long freepage_order = page_order_unsafe(page);
+
+			/*
+			 * Without lock, we cannot be sure that what we got is
+			 * a valid page order. Consider only values in the
+			 * valid order range to prevent low_pfn overflow.
+			 */
+			if (freepage_order > 0 && freepage_order < MAX_ORDER)
+				low_pfn += (1UL << freepage_order) - 1;
 			continue;
+		}
 
 		/*
 		 * Check may be lockless but that's ok as we recheck later.
@@ -718,6 +730,13 @@ next_pageblock:
 		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
 	}
 
+	/*
+	 * The PageBuddy() check could have potentially brought us outside
+	 * the range to be scanned.
+	 */
+	if (unlikely(low_pfn > end_pfn))
+		low_pfn = end_pfn;
+
 	acct_isolated(zone, locked, cc);
 
 	if (locked)
diff --git a/mm/internal.h b/mm/internal.h
index 4659e8e..584d04f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -171,7 +171,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
  * general, page_zone(page)->lock must be held by the caller to prevent the
  * page from being allocated in parallel and returning garbage as the order.
  * If a caller does not hold page_zone(page)->lock, it must guarantee that the
- * page cannot be allocated or merged in parallel.
+ * page cannot be allocated or merged in parallel. Alternatively, it must
+ * handle invalid values gracefully, and use page_order_unsafe() below.
  */
 static inline unsigned long page_order(struct page *page)
 {
@@ -179,6 +180,23 @@ static inline unsigned long page_order(struct page *page)
 	return page_private(page);
 }
 
+/*
+ * Like page_order(), but for callers who cannot afford to hold the zone lock,
+ * and handle invalid values gracefully. ACCESS_ONCE is used so that if the
+ * caller assigns the result into a local variable and e.g. tests it for valid
+ * range  before using, the compiler cannot decide to remove the variable and
+ * inline the function multiple times, potentially observing different values
+ * in the tests and the actual use of the result.
+ */
+static inline unsigned long page_order_unsafe(struct page *page)
+{
+	/*
+	 * PageBuddy() should be checked by the caller to minimize race window,
+	 * and invalid values must be handled gracefully.
+	 */
+	return ACCESS_ONCE(page_private(page));
+}
+
 static inline bool is_cow_mapping(vm_flags_t flags)
 {
 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 07/10] mm: rename allocflags_to_migratetype for clarity
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

From: David Rientjes <rientjes@google.com>

The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
ALLOC_CPUSET) that have separate semantics.

The function allocflags_to_migratetype() actually takes gfp flags, not alloc
flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
---
 include/linux/gfp.h | 2 +-
 mm/compaction.c     | 4 ++--
 mm/page_alloc.c     | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6eb1fb3..ed9627e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -156,7 +156,7 @@ struct vm_area_struct;
 #define GFP_DMA32	__GFP_DMA32
 
 /* Convert GFP flags to their corresponding migrate type */
-static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 {
 	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 11c0926..c339ccd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1178,7 +1178,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
 		.order = order,
-		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.migratetype = gfpflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.mode = mode,
 	};
@@ -1228,7 +1228,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	count_compact_event(COMPACTSTALL);
 
 #ifdef CONFIG_CMA
-	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 	/* Compact each zone in the list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..cc0b687 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2473,7 +2473,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
-	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 	return alloc_flags;
@@ -2716,7 +2716,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct zone *preferred_zone;
 	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
-	int migratetype = allocflags_to_migratetype(gfp_mask);
+	int migratetype = gfpflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
@@ -2750,7 +2750,7 @@ retry_cpuset:
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 #ifdef CONFIG_CMA
-	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 retry:
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 08/10] mm, compaction: pass gfp mask to compact_control
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

From: David Rientjes <rientjes@google.com>

struct compact_control currently converts the gfp mask to a migratetype, but we
need the entire gfp mask in a follow-up patch.

Pass the entire gfp mask as part of struct compact_control.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c | 12 +++++++-----
 mm/internal.h   |  2 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c339ccd..d1e30ba 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -965,8 +965,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
-static int compact_finished(struct zone *zone,
-			    struct compact_control *cc)
+static int compact_finished(struct zone *zone, struct compact_control *cc,
+			    const int migratetype)
 {
 	unsigned int order;
 	unsigned long watermark;
@@ -1012,7 +1012,7 @@ static int compact_finished(struct zone *zone,
 		struct free_area *area = &zone->free_area[order];
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[cc->migratetype]))
+		if (!list_empty(&area->free_list[migratetype]))
 			return COMPACT_PARTIAL;
 
 		/* Job done if allocation would set block type */
@@ -1078,6 +1078,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	int ret;
 	unsigned long start_pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(zone);
+	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
 	const bool sync = cc->mode != MIGRATE_ASYNC;
 
 	ret = compaction_suitable(zone, cc->order);
@@ -1120,7 +1121,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	migrate_prep_local();
 
-	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
+	while ((ret = compact_finished(zone, cc, migratetype)) ==
+						COMPACT_CONTINUE) {
 		int err;
 
 		switch (isolate_migratepages(zone, cc)) {
@@ -1178,7 +1180,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
 		.order = order,
-		.migratetype = gfpflags_to_migratetype(gfp_mask),
+		.gfp_mask = gfp_mask,
 		.zone = zone,
 		.mode = mode,
 	};
diff --git a/mm/internal.h b/mm/internal.h
index 584d04f..af15461 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -149,7 +149,7 @@ struct compact_control {
 	bool finished_update_migrate;
 
 	int order;			/* order a direct compactor needs */
-	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
+	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	struct zone *zone;
 	enum compact_contended contended; /* Signal need_sched() or lock
 					   * contention detected during
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [RFC PATCH 09/10] mm, compaction: try to capture the just-created high-order freepage
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

Compaction uses watermark checking to determine if it succeeded in
creating a high-order free page. My testing has shown that this is quite
racy: it can happen that the watermark check in compaction succeeds and,
moments later, the watermark check in page allocation fails, even though
the number of free pages has increased in the meantime.

It would be more reliable if direct compaction captured the high-order
free page as soon as it detects it, and passed it back to the allocation
path. This would also reduce the window for somebody else to allocate
the free page.

This has already been implemented by commit 1fb3f8ca0e92 ("mm:
compaction: capture a suitable high-order page immediately when it is
made available"), but was later reverted by commit 8fb74b9f ("mm:
compaction: partially revert capture of suitable high-order page") due
to flaws.

This patch differs from the previous attempt in two aspects:

1) The previous patch scanned free lists to capture the page. In this
   patch, only the cc->order aligned block that the migration scanner
   just finished is considered, and only if pages were actually isolated
   for migration in that block (a rough sketch of this tracking follows
   below). Tracking cc->order aligned blocks also has benefits for the
   following patch, which skips blocks where non-migratable pages were
   found.

2) In this patch, the isolated free page is allocated through extending
   get_page_from_freelist() and buffered_rmqueue(). This ensures that
   the page gets all the operations that were missing in the previous
   attempt, such as prep_new_page() and the page->pfmemalloc setting,
   and that zone statistics are updated, etc.

Evaluation is pending.
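
As an editorial aside, the pfn arithmetic behind the capture candidate
tracking can be illustrated with a tiny standalone C sketch (ALIGN_UP
and the variable names here are illustrative only, not the kernel
implementation):

#include <stdio.h>

#define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        unsigned long cc_order = 4;             /* assumed order-4 request */
        unsigned long block = 1UL << cc_order;
        unsigned long low_pfn = 1027;           /* arbitrary scan position */

        /* candidate block containing low_pfn, and the start of the next one */
        unsigned long capture_pfn = low_pfn & ~(block - 1);
        unsigned long next_capture_pfn = ALIGN_UP(low_pfn + 1, block);

        /*
         * Once the scanner reaches next_capture_pfn and something was
         * isolated from the block it just finished, the page at capture_pfn
         * becomes the capture candidate.
         */
        printf("candidate block starts at pfn %lu, next candidate at %lu\n",
               capture_pfn, next_capture_pfn);
        return 0;
}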

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/compaction.h |  5 ++-
 mm/compaction.c            | 92 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h              |  2 +
 mm/page_alloc.c            | 69 +++++++++++++++++++++++++++-------
 4 files changed, 150 insertions(+), 18 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 01e3132..69579f5 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,6 +10,8 @@
 #define COMPACT_PARTIAL		2
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	3
+/* Captured a high-order free page in direct compaction */
+#define COMPACT_CAPTURED	4
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
@@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			enum migrate_mode mode, bool *contended);
+			enum migrate_mode mode, bool *contended,
+			struct page **captured_page);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
diff --git a/mm/compaction.c b/mm/compaction.c
index d1e30ba..b69ac19 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -541,6 +541,16 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
 					ISOLATE_ASYNC_MIGRATE : 0) |
 				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
+	unsigned long capture_pfn = 0;   /* current candidate for capturing */
+	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+
+	if (cc->order > PAGE_ALLOC_COSTLY_ORDER
+		&& gfpflags_to_migratetype(cc->gfp_mask) == MIGRATE_MOVABLE
+			&& cc->order <= pageblock_order) {
+		/* This may be outside the zone, but we check that later */
+		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
+		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+	}
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -563,6 +573,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
+		if (low_pfn == next_capture_pfn) {
+			/*
+			 * We have a capture candidate if we isolated something
+			 * during the last cc->order aligned block of pages
+			 */
+			if (nr_isolated &&
+					capture_pfn >= zone->zone_start_pfn) {
+				cc->capture_page = pfn_to_page(capture_pfn);
+				break;
+			}
+			capture_pfn = next_capture_pfn;
+			next_capture_pfn += (1UL << cc->order);
+		}
+
 		/*
 		 * Periodically drop the lock (if held) regardless of its
 		 * contention, to give chance to IRQs. Abort async compaction
@@ -582,6 +606,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
 			if (!pfn_valid(low_pfn)) {
 				low_pfn += MAX_ORDER_NR_PAGES - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = low_pfn + 1;
 				continue;
 			}
 		}
@@ -639,8 +665,12 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			 * a valid page order. Consider only values in the
 			 * valid order range to prevent low_pfn overflow.
 			 */
-			if (freepage_order > 0 && freepage_order < MAX_ORDER)
+			if (freepage_order > 0 && freepage_order < MAX_ORDER) {
 				low_pfn += (1UL << freepage_order) - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = ALIGN(low_pfn + 1,
+							(1UL << cc->order));
+			}
 			continue;
 		}
 
@@ -673,6 +703,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			if (!locked)
 				goto next_pageblock;
 			low_pfn += (1 << compound_order(page)) - 1;
+			if (next_capture_pfn)
+				next_capture_pfn =
+					ALIGN(low_pfn + 1, (1UL << cc->order));
 			continue;
 		}
 
@@ -697,6 +730,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 				continue;
 			if (PageTransHuge(page)) {
 				low_pfn += (1 << compound_order(page)) - 1;
+				next_capture_pfn = low_pfn + 1;
 				continue;
 			}
 		}
@@ -728,6 +762,8 @@ isolate_success:
 
 next_pageblock:
 		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
+		if (next_capture_pfn)
+			next_capture_pfn = low_pfn + 1;
 	}
 
 	/*
@@ -965,6 +1001,41 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
+/*
+ * When called, cc->capture_page is just a candidate. This function will either
+ * successfully capture the page, or reset it to NULL.
+ */
+static bool compact_capture_page(struct compact_control *cc)
+{
+	struct page *page = cc->capture_page;
+
+	/* Unsafe check if it's worth to try acquiring the zone->lock at all */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	/*
+	 * There's a good chance that we have just put free pages on this CPU's
+	 * pcplists after the page migration. Drain them to allow merging.
+	 */
+	get_cpu();
+	drain_local_pages(NULL);
+	put_cpu();
+
+	/* Did the draining help? */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	goto fail;
+
+try_capture:
+	if (capture_free_page(page, cc->order))
+		return true;
+
+fail:
+	cc->capture_page = NULL;
+	return false;
+}
+
 static int compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
@@ -993,6 +1064,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 		return COMPACT_COMPLETE;
 	}
 
+	/* Did we just finish a pageblock that was capture candidate? */
+	if (cc->capture_page && compact_capture_page(cc))
+		return COMPACT_CAPTURED;
+
 	/*
 	 * order == -1 is expected when compacting via
 	 * /proc/sys/vm/compact_memory
@@ -1173,7 +1248,8 @@ out:
 }
 
 static unsigned long compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
+		gfp_t gfp_mask, enum migrate_mode mode, bool *contended,
+						struct page **captured_page)
 {
 	unsigned long ret;
 	struct compact_control cc = {
@@ -1189,6 +1265,9 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 
 	ret = compact_zone(zone, &cc);
 
+	if (ret == COMPACT_CAPTURED)
+		*captured_page = cc.capture_page;
+
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
@@ -1213,7 +1292,8 @@ int sysctl_extfrag_threshold = 500;
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended)
+			enum migrate_mode mode, bool *contended,
+			struct page **captured_page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1239,9 +1319,13 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		int status;
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
-						contended);
+						contended, captured_page);
 		rc = max(status, rc);
 
+		/* If we captured a page, stop compacting */
+		if (*captured_page)
+			break;
+
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
 				      alloc_flags))
diff --git a/mm/internal.h b/mm/internal.h
index af15461..2b7e5de 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  */
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+extern bool capture_free_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -155,6 +156,7 @@ struct compact_control {
 					   * contention detected during
 					   * compaction
 					   */
+	struct page *capture_page;	/* Free page captured by compaction */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cc0b687..b95f4ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -954,7 +954,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	return NULL;
 }
 
-
 /*
  * This array describes the order lists are fallen back to when
  * the free lists for the desirable migrate type are depleted
@@ -1474,9 +1473,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 {
 	unsigned long watermark;
 	struct zone *zone;
+	struct free_area *area;
 	int mt;
+	unsigned int freepage_order = page_order(page);
 
-	BUG_ON(!PageBuddy(page));
+	VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
 
 	zone = page_zone(page);
 	mt = get_pageblock_migratetype(page);
@@ -1491,9 +1492,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
+	area = &zone->free_area[freepage_order];
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	area->nr_free--;
 	rmv_page_order(page);
+	if (freepage_order != order)
+		expand(zone, page, order, freepage_order, area, mt);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -1536,6 +1540,26 @@ int split_free_page(struct page *page)
 	return nr_pages;
 }
 
+bool capture_free_page(struct page *page, unsigned int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!PageBuddy(page) || page_order(page) < order) {
+		ret = false;
+		goto out;
+	}
+
+	ret = __isolate_free_page(page, order);
+
+out:
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return ret;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
@@ -1544,7 +1568,8 @@ int split_free_page(struct page *page)
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			struct page *isolated_freepage)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1573,6 +1598,9 @@ again:
 
 		list_del(&page->lru);
 		pcp->count--;
+	} else if (unlikely(isolated_freepage)) {
+		page = isolated_freepage;
+		local_irq_save(flags);
 	} else {
 		if (unlikely(gfp_flags & __GFP_NOFAIL)) {
 			/*
@@ -1588,7 +1616,9 @@ again:
 			WARN_ON_ONCE(order > 1);
 		}
 		spin_lock_irqsave(&zone->lock, flags);
+
 		page = __rmqueue(zone, order, migratetype);
+
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -1916,7 +1946,8 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int classzone_idx, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype,
+		struct page *isolated_freepage)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1927,6 +1958,13 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
 
+	if (isolated_freepage) {
+		zone = page_zone(isolated_freepage);
+		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask,
+						migratetype, isolated_freepage);
+		goto got_page;
+	}
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2051,7 +2089,7 @@ zonelist_scan:
 
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+						gfp_mask, migratetype, NULL);
 		if (page)
 			break;
 this_zone_full:
@@ -2065,6 +2103,7 @@ this_zone_full:
 		goto zonelist_scan;
 	}
 
+got_page:
 	if (page)
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
@@ -2202,7 +2241,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, classzone_idx, migratetype);
+		preferred_zone, classzone_idx, migratetype, NULL);
 	if (page)
 		goto out;
 
@@ -2241,6 +2280,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
+	struct page *captured_page;
+
 	if (!order)
 		return NULL;
 
@@ -2252,7 +2293,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, mode,
-						contended_compaction);
+						contended_compaction,
+						&captured_page);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*did_some_progress != COMPACT_SKIPPED) {
@@ -2265,7 +2307,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, classzone_idx, migratetype);
+				preferred_zone, classzone_idx, migratetype,
+				captured_page);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2357,7 +2400,7 @@ retry:
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
 					preferred_zone, classzone_idx,
-					migratetype);
+					migratetype, NULL);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2387,7 +2430,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2548,7 +2591,7 @@ rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 	if (page)
 		goto got_pg;
 
@@ -2757,7 +2800,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [RFC PATCH 09/10] mm, compaction: try to capture the just-created high-order freepage
@ 2014-06-09  9:26   ` Vlastimil Babka
  0 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

Compaction uses watermark checking to determine if it succeeded in creating
a high-order free page. My testing has shown that this is quite racy and it
can happen that watermark checking in compaction succeeds, and moments later
the watermark checking in page allocation fails, even though the number of
free pages has increased meanwhile.

It should be more reliable if direct compaction captured the high-order free
page as soon as it detects it, and pass it back to allocation. This would
also reduce the window for somebody else to allocate the free page.

This has been already implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
suitable high-order page immediately when it is made available"), but later
reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
high-order page") due to flaws.

This patch differs from the previous attempt in two aspects:

1) The previous patch scanned free lists to capture the page. In this patch,
   only the cc->order aligned block that the migration scanner just finished
   is considered, but only if pages were actually isolated for migration in
   that block. Tracking cc->order aligned blocks also has benefits for the
   following patch that skips blocks where non-migratable pages were found.

2) In this patch, the isolated free page is allocated through extending
   get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
   all operations such as prep_new_page() and page->pfmemalloc setting that
   was missing in the previous attempt, zone statistics are updated etc.

Evaluation is pending.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/compaction.h |  5 ++-
 mm/compaction.c            | 92 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h              |  2 +
 mm/page_alloc.c            | 69 +++++++++++++++++++++++++++-------
 4 files changed, 150 insertions(+), 18 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 01e3132..69579f5 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,6 +10,8 @@
 #define COMPACT_PARTIAL		2
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	3
+/* Captured a high-order free page in direct compaction */
+#define COMPACT_CAPTURED	4
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
@@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			enum migrate_mode mode, bool *contended);
+			enum migrate_mode mode, bool *contended,
+			struct page **captured_page);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
diff --git a/mm/compaction.c b/mm/compaction.c
index d1e30ba..b69ac19 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -541,6 +541,16 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
 					ISOLATE_ASYNC_MIGRATE : 0) |
 				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
+	unsigned long capture_pfn = 0;   /* current candidate for capturing */
+	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+
+	if (cc->order > PAGE_ALLOC_COSTLY_ORDER
+		&& gfpflags_to_migratetype(cc->gfp_mask) == MIGRATE_MOVABLE
+			&& cc->order <= pageblock_order) {
+		/* This may be outside the zone, but we check that later */
+		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
+		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+	}
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -563,6 +573,20 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
+		if (low_pfn == next_capture_pfn) {
+			/*
+			 * We have a capture candidate if we isolated something
+			 * during the last cc->order aligned block of pages
+			 */
+			if (nr_isolated &&
+					capture_pfn >= zone->zone_start_pfn) {
+				cc->capture_page = pfn_to_page(capture_pfn);
+				break;
+			}
+			capture_pfn = next_capture_pfn;
+			next_capture_pfn += (1UL << cc->order);
+		}
+
 		/*
 		 * Periodically drop the lock (if held) regardless of its
 		 * contention, to give chance to IRQs. Abort async compaction
@@ -582,6 +606,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
 			if (!pfn_valid(low_pfn)) {
 				low_pfn += MAX_ORDER_NR_PAGES - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = low_pfn + 1;
 				continue;
 			}
 		}
@@ -639,8 +665,12 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			 * a valid page order. Consider only values in the
 			 * valid order range to prevent low_pfn overflow.
 			 */
-			if (freepage_order > 0 && freepage_order < MAX_ORDER)
+			if (freepage_order > 0 && freepage_order < MAX_ORDER) {
 				low_pfn += (1UL << freepage_order) - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = ALIGN(low_pfn + 1,
+							(1UL << cc->order));
+			}
 			continue;
 		}
 
@@ -673,6 +703,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			if (!locked)
 				goto next_pageblock;
 			low_pfn += (1 << compound_order(page)) - 1;
+			if (next_capture_pfn)
+				next_capture_pfn =
+					ALIGN(low_pfn + 1, (1UL << cc->order));
 			continue;
 		}
 
@@ -697,6 +730,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 				continue;
 			if (PageTransHuge(page)) {
 				low_pfn += (1 << compound_order(page)) - 1;
+				next_capture_pfn = low_pfn + 1;
 				continue;
 			}
 		}
@@ -728,6 +762,8 @@ isolate_success:
 
 next_pageblock:
 		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
+		if (next_capture_pfn)
+			next_capture_pfn = low_pfn + 1;
 	}
 
 	/*
@@ -965,6 +1001,41 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
+/*
+ * When called, cc->capture_page is just a candidate. This function will either
+ * successfully capture the page, or reset it to NULL.
+ */
+static bool compact_capture_page(struct compact_control *cc)
+{
+	struct page *page = cc->capture_page;
+
+	/* Unsafe check if it's worth to try acquiring the zone->lock at all */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	/*
+	 * There's a good chance that we have just put free pages on this CPU's
+	 * pcplists after the page migration. Drain them to allow merging.
+	 */
+	get_cpu();
+	drain_local_pages(NULL);
+	put_cpu();
+
+	/* Did the draining help? */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	goto fail;
+
+try_capture:
+	if (capture_free_page(page, cc->order))
+		return true;
+
+fail:
+	cc->capture_page = NULL;
+	return false;
+}
+
 static int compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
@@ -993,6 +1064,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 		return COMPACT_COMPLETE;
 	}
 
+	/* Did we just finish a pageblock that was capture candidate? */
+	if (cc->capture_page && compact_capture_page(cc))
+		return COMPACT_CAPTURED;
+
 	/*
 	 * order == -1 is expected when compacting via
 	 * /proc/sys/vm/compact_memory
@@ -1173,7 +1248,8 @@ out:
 }
 
 static unsigned long compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
+		gfp_t gfp_mask, enum migrate_mode mode, bool *contended,
+						struct page **captured_page)
 {
 	unsigned long ret;
 	struct compact_control cc = {
@@ -1189,6 +1265,9 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 
 	ret = compact_zone(zone, &cc);
 
+	if (ret == COMPACT_CAPTURED)
+		*captured_page = cc.capture_page;
+
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
@@ -1213,7 +1292,8 @@ int sysctl_extfrag_threshold = 500;
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended)
+			enum migrate_mode mode, bool *contended,
+			struct page **captured_page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1239,9 +1319,13 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		int status;
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
-						contended);
+						contended, captured_page);
 		rc = max(status, rc);
 
+		/* If we captured a page, stop compacting */
+		if (*captured_page)
+			break;
+
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
 				      alloc_flags))
diff --git a/mm/internal.h b/mm/internal.h
index af15461..2b7e5de 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  */
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+extern bool capture_free_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -155,6 +156,7 @@ struct compact_control {
 					   * contention detected during
 					   * compaction
 					   */
+	struct page *capture_page;	/* Free page captured by compaction */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cc0b687..b95f4ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -954,7 +954,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	return NULL;
 }
 
-
 /*
  * This array describes the order lists are fallen back to when
  * the free lists for the desirable migrate type are depleted
@@ -1474,9 +1473,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 {
 	unsigned long watermark;
 	struct zone *zone;
+	struct free_area *area;
 	int mt;
+	unsigned int freepage_order = page_order(page);
 
-	BUG_ON(!PageBuddy(page));
+	VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
 
 	zone = page_zone(page);
 	mt = get_pageblock_migratetype(page);
@@ -1491,9 +1492,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
+	area = &zone->free_area[freepage_order];
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	area->nr_free--;
 	rmv_page_order(page);
+	if (freepage_order != order)
+		expand(zone, page, order, freepage_order, area, mt);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -1536,6 +1540,26 @@ int split_free_page(struct page *page)
 	return nr_pages;
 }
 
+bool capture_free_page(struct page *page, unsigned int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!PageBuddy(page) || page_order(page) < order) {
+		ret = false;
+		goto out;
+	}
+
+	ret = __isolate_free_page(page, order);
+
+out:
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return ret;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
@@ -1544,7 +1568,8 @@ int split_free_page(struct page *page)
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			struct page *isolated_freepage)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1573,6 +1598,9 @@ again:
 
 		list_del(&page->lru);
 		pcp->count--;
+	} else if (unlikely(isolated_freepage)) {
+		page = isolated_freepage;
+		local_irq_save(flags);
 	} else {
 		if (unlikely(gfp_flags & __GFP_NOFAIL)) {
 			/*
@@ -1588,7 +1616,9 @@ again:
 			WARN_ON_ONCE(order > 1);
 		}
 		spin_lock_irqsave(&zone->lock, flags);
+
 		page = __rmqueue(zone, order, migratetype);
+
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -1916,7 +1946,8 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int classzone_idx, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype,
+		struct page *isolated_freepage)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1927,6 +1958,13 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
 
+	if (isolated_freepage) {
+		zone = page_zone(isolated_freepage);
+		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask,
+						migratetype, isolated_freepage);
+		goto got_page;
+	}
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2051,7 +2089,7 @@ zonelist_scan:
 
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+						gfp_mask, migratetype, NULL);
 		if (page)
 			break;
 this_zone_full:
@@ -2065,6 +2103,7 @@ this_zone_full:
 		goto zonelist_scan;
 	}
 
+got_page:
 	if (page)
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
@@ -2202,7 +2241,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, classzone_idx, migratetype);
+		preferred_zone, classzone_idx, migratetype, NULL);
 	if (page)
 		goto out;
 
@@ -2241,6 +2280,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
+	struct page *captured_page;
+
 	if (!order)
 		return NULL;
 
@@ -2252,7 +2293,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, mode,
-						contended_compaction);
+						contended_compaction,
+						&captured_page);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*did_some_progress != COMPACT_SKIPPED) {
@@ -2265,7 +2307,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, classzone_idx, migratetype);
+				preferred_zone, classzone_idx, migratetype,
+				captured_page);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2357,7 +2400,7 @@ retry:
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
 					preferred_zone, classzone_idx,
-					migratetype);
+					migratetype, NULL);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2387,7 +2430,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2548,7 +2591,7 @@ rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 	if (page)
 		goto got_pg;
 
@@ -2757,7 +2800,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [RFC PATCH 10/10] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09  9:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-09  9:26 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Vlastimil Babka,
	Minchan Kim, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

In direct compaction for a page fault, we want to allocate the high-order page
as soon as possible, so migrating from a cc->order aligned block of pages that
also contains unmigratable pages just adds to page fault latency.

This patch therefore makes the migration scanner skip to the next cc->order
aligned block of pages as soon as it fails to isolate a page that is not free.
Everything isolated up to that point is put back.

In this mode, the nr_isolated limit of COMPACT_CLUSTER_MAX is not observed,
allowing the scanner to scan a whole block at once instead of migrating
COMPACT_CLUSTER_MAX pages and then finding an unmigratable page in the next
call. This might, however, have some implications for direct reclaimers
through too_many_isolated().

In very preliminary tests, this has reduced migrate_scanned, isolations and
migrations by about 10%, while the success rate of the mmtests
stress-highalloc benchmark actually improved a bit.
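
In outline, the new behaviour amounts to the condensed sketch below; it is
illustrative only, locking details are omitted, and cannot_isolate() is a
placeholder for the individual failure paths handled in the diff:

        for (; low_pfn < end_pfn; low_pfn++) {
                struct page *page = pfn_to_page(low_pfn);

                if (cannot_isolate(page)) {
                        if (!skip_on_failure)
                                continue;
                        /* Put back everything isolated from this block... */
                        putback_movable_pages(migratelist);
                        nr_isolated = 0;
                        /* ...and skip to the next cc->order aligned block. */
                        low_pfn = next_capture_pfn - 1;
                        continue;
                }

                list_add(&page->lru, migratelist);
                nr_isolated++;
        }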

[rientjes@google.com: skip_on_failure logic; cleanups]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 56 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 40 insertions(+), 16 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b69ac19..6dda4eb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -543,6 +543,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
 	unsigned long capture_pfn = 0;   /* current candidate for capturing */
 	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+	bool skip_on_failure = false; /* skip block when isolation fails */
 
 	if (cc->order > PAGE_ALLOC_COSTLY_ORDER
 		&& gfpflags_to_migratetype(cc->gfp_mask) == MIGRATE_MOVABLE
@@ -550,6 +551,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		/* This may be outside the zone, but we check that later */
 		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
 		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+		/*
+		 * It is too expensive for compaction to migrate pages from a
+		 * cc->order block of pages on page faults, unless the entire
+		 * block can become free. But hugepaged should try anyway for
+		 * THP so that general defragmentation happens.
+		 */
+		skip_on_failure = (cc->gfp_mask & __GFP_NO_KSWAPD)
+				&& !(current->flags & PF_KTHREAD);
 	}
 
 	/*
@@ -613,7 +622,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		}
 
 		if (!pfn_valid_within(low_pfn))
-			continue;
+			goto isolation_failed;
 		nr_scanned++;
 
 		/*
@@ -624,7 +633,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 */
 		page = pfn_to_page(low_pfn);
 		if (page_zone(page) != zone)
-			continue;
+			goto isolation_failed;
 
 		if (!valid_page)
 			valid_page = page;
@@ -686,7 +695,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 					goto isolate_success;
 				}
 			}
-			continue;
+			goto isolation_failed;
 		}
 
 		/*
@@ -706,7 +715,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			if (next_capture_pfn)
 				next_capture_pfn =
 					ALIGN(low_pfn + 1, (1UL << cc->order));
-			continue;
+			goto isolation_failed;
 		}
 
 		/*
@@ -716,7 +725,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 */
 		if (!page_mapping(page) &&
 		    page_count(page) > page_mapcount(page))
-			continue;
+			goto isolation_failed;
 
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
@@ -727,11 +736,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 			/* Recheck PageLRU and PageTransHuge under lock */
 			if (!PageLRU(page))
-				continue;
+				goto isolation_failed;
 			if (PageTransHuge(page)) {
 				low_pfn += (1 << compound_order(page)) - 1;
 				next_capture_pfn = low_pfn + 1;
-				continue;
+				goto isolation_failed;
 			}
 		}
 
@@ -739,7 +748,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode) != 0)
-			continue;
+			goto isolation_failed;
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
@@ -749,11 +758,14 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 isolate_success:
 		cc->finished_update_migrate = true;
 		list_add(&page->lru, migratelist);
-		cc->nr_migratepages++;
 		nr_isolated++;
 
-		/* Avoid isolating too much */
-		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
+		/*
+		 * Avoid isolating too much, except if we try to capture a
+		 * free page and want to find out at once if it can be done
+		 * or we should skip to the next block.
+		 */
+		if (!skip_on_failure && nr_isolated == COMPACT_CLUSTER_MAX) {
 			++low_pfn;
 			break;
 		}
@@ -764,6 +776,20 @@ next_pageblock:
 		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
 		if (next_capture_pfn)
 			next_capture_pfn = low_pfn + 1;
+
+isolation_failed:
+		if (skip_on_failure) {
+			if (nr_isolated) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+				putback_movable_pages(migratelist);
+				nr_isolated = 0;
+			}
+			low_pfn = next_capture_pfn - 1;
+		}
 	}
 
 	/*
@@ -773,6 +799,7 @@ next_pageblock:
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	cc->nr_migratepages = nr_isolated;
 	acct_isolated(zone, locked, cc);
 
 	if (locked)
@@ -782,7 +809,7 @@ next_pageblock:
 	 * Update the pageblock-skip information and cached scanner pfn,
 	 * if the whole pageblock was scanned without isolating any page.
 	 */
-	if (low_pfn == end_pfn)
+	if (low_pfn == end_pfn && !skip_on_failure)
 		update_pageblock_skip(cc, valid_page, nr_isolated,
 				      set_unsuitable, true);
 
@@ -998,7 +1025,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 
 	cc->migrate_pfn = low_pfn;
 
-	return ISOLATE_SUCCESS;
+	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
 
 /*
@@ -1212,9 +1239,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 			;
 		}
 
-		if (!cc->nr_migratepages)
-			continue;
-
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
 				MR_COMPACTION);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-09 23:41   ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-09 23:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, 9 Jun 2014, Vlastimil Babka wrote:

> isolate_freepages_block() rechecks if the pageblock is suitable to be a target
> for migration after it has taken the zone->lock. However, the check has been
> optimized to occur only once per pageblock, and compact_checklock_irqsave()
> might be dropping and reacquiring lock, which means somebody else might have
> changed the pageblock's migratetype meanwhile.
> 
> Furthermore, nothing prevents the migratetype to change right after
> isolate_freepages_block() has finished isolating. Given how imperfect this is,
> it's simpler to just rely on the check done in isolate_freepages() without
> lock, and not pretend that the recheck under lock guarantees anything. It is
> just a heuristic after all.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: David Rientjes <rientjes@google.com>

We already do a preliminary check for suitable_migration_target() in 
isolate_freepages() in a racy way to avoid unnecessary work (and the 
page_order() there is unprotected, which I think you already mentioned), so 
this seems fine to abandon.
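
For context, the unlocked check being referred to looks roughly like this in 
isolate_freepages() (a simplified sketch, not the exact code):

        page = pfn_to_page(block_start_pfn);

        /*
         * Checked without zone->lock: the pageblock's migratetype (and any
         * page_order() read inside suitable_migration_target()) can be
         * stale, so this is only a heuristic to avoid pointless work.
         */
        if (!suitable_migration_target(page))
                continue;

        /* ... isolate_freepages_block() is then called for this block ... */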

> ---
> I suggest folding mm-compactionc-isolate_freepages_block-small-tuneup.patch into this
> 

Hmm, Andrew was just moving some code around in that patch; I'm not sure 
it makes sense to couple the two together, and your patch here is addressing 
an optimization rather than a cleanup (you've documented it well, so there's 
no need to obscure it with unrelated changes).

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-09 23:50     ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-09 23:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, 9 Jun 2014, Vlastimil Babka wrote:

> Async compaction aborts when it detects zone lock contention or need_resched()
> is true. David Rientjes has reported that in practice, most direct async
> compactions for THP allocation abort due to need_resched(). This means that a
> second direct compaction is never attempted, which might be OK for a page
> fault, but hugepaged is intended to attempt a sync compaction in such case and
> in these cases it won't.
> 
> This patch replaces "bool contended" in compact_control with an enum that
> distinguishes between aborting due to need_resched() and aborting due to lock
> contention. This allows propagating the abort through all compaction functions
> as before, but declaring the direct compaction as contended only when lock
> contention has been detected.
> 
> As a result, hugepaged will proceed with second sync compaction as intended,
> when the preceding async compaction aborted due to need_resched().
> 

s/hugepaged/khugepaged/ on the changelog.

> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/compaction.c | 20 ++++++++++++++------
>  mm/internal.h   | 15 +++++++++++----
>  2 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b73b182..d37f4a8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -static inline bool should_release_lock(spinlock_t *lock)
> +enum compact_contended should_release_lock(spinlock_t *lock)
>  {
> -	return need_resched() || spin_is_contended(lock);
> +	if (need_resched())
> +		return COMPACT_CONTENDED_SCHED;
> +	else if (spin_is_contended(lock))
> +		return COMPACT_CONTENDED_LOCK;
> +	else
> +		return COMPACT_CONTENDED_NONE;
>  }
>  
>  /*

I think eventually we're going to remove the need_resched() heuristic 
entirely and so enum compact_contended might be overkill, but do we need 
to worry about spin_is_contended(lock) && need_resched() reporting 
COMPACT_CONTENDED_SCHED here instead of COMPACT_CONTENDED_LOCK?
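
To make the question concrete, a reordering that prefers reporting lock 
contention when both conditions hold might look like this (an illustrative 
sketch only, not something from the patch):

        static enum compact_contended should_release_lock(spinlock_t *lock)
        {
                /*
                 * Check the lock first, so that a contended lock with a
                 * pending reschedule is reported as COMPACT_CONTENDED_LOCK.
                 */
                if (spin_is_contended(lock))
                        return COMPACT_CONTENDED_LOCK;
                else if (need_resched())
                        return COMPACT_CONTENDED_SCHED;
                else
                        return COMPACT_CONTENDED_NONE;
        }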

> @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  				      bool locked, struct compact_control *cc)
>  {
> -	if (should_release_lock(lock)) {
> +	enum compact_contended contended = should_release_lock(lock);
> +
> +	if (contended) {
>  		if (locked) {
>  			spin_unlock_irqrestore(lock, *flags);
>  			locked = false;
> @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  
>  		/* async aborts if taking too long or contended */
>  		if (cc->mode == MIGRATE_ASYNC) {
> -			cc->contended = true;
> +			cc->contended = contended;
>  			return false;
>  		}
>  
> @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>  	/* async compaction aborts if contended */
>  	if (need_resched()) {
>  		if (cc->mode == MIGRATE_ASYNC) {
> -			cc->contended = true;
> +			cc->contended = COMPACT_CONTENDED_SCHED;
>  			return true;
>  		}
>  
> @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>  	VM_BUG_ON(!list_empty(&cc.freepages));
>  	VM_BUG_ON(!list_empty(&cc.migratepages));
>  
> -	*contended = cc.contended;
> +	/* We only signal lock contention back to the allocator */
> +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
>  	return ret;
>  }
>  

Hmm, since the only thing that matters for cc->contended is 
COMPACT_CONTENDED_LOCK, it may make sense to just leave this as a bool 
within struct compact_control instead of passing the actual reason around 
when it doesn't matter.
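
In other words (an illustrative sketch of the bool variant, not proposed 
code), compact_checklock_irqsave() would only record what the allocator 
actually consumes:

        /* cc->contended stays a bool; only lock contention is recorded */
        if (cc->mode == MIGRATE_ASYNC) {
                cc->contended = (contended == COMPACT_CONTENDED_LOCK);
                return false;
        }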

> diff --git a/mm/internal.h b/mm/internal.h
> index 7f22a11f..4659e8e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
> +/* Used to signal whether compaction detected need_sched() or lock contention */
> +enum compact_contended {
> +	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> +	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
> +	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
> +};
> +
>  /*
>   * in mm/compaction.c
>   */
> @@ -144,10 +151,10 @@ struct compact_control {
>  	int order;			/* order a direct compactor needs */
>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>  	struct zone *zone;
> -	bool contended;			/* True if a lock was contended, or
> -					 * need_resched() true during async
> -					 * compaction
> -					 */
> +	enum compact_contended contended; /* Signal need_sched() or lock
> +					   * contention detected during
> +					   * compaction
> +					   */
>  };
>  
>  unsigned long

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-09 23:58     ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-09 23:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, 9 Jun 2014, Vlastimil Babka wrote:

> diff --git a/mm/compaction.c b/mm/compaction.c
> index d37f4a8..e1a4283 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -185,54 +185,77 @@ static void update_pageblock_skip(struct compact_control *cc,
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> +			unsigned long *flags, struct compact_control *cc)
>  {
> -	if (need_resched())
> -		return COMPACT_CONTENDED_SCHED;
> -	else if (spin_is_contended(lock))
> -		return COMPACT_CONTENDED_LOCK;
> -	else
> -		return COMPACT_CONTENDED_NONE;
> +	if (cc->mode == MIGRATE_ASYNC) {
> +		if (!spin_trylock_irqsave(lock, *flags)) {
> +			cc->contended = COMPACT_CONTENDED_LOCK;
> +			return false;
> +		}
> +	} else {
> +		spin_lock_irqsave(lock, *flags);
> +	}
> +
> +	return true;
>  }
>  
>  /*
>   * Compaction requires the taking of some coarse locks that are potentially
> - * very heavily contended. Check if the process needs to be scheduled or
> - * if the lock is contended. For async compaction, back out in the event
> - * if contention is severe. For sync compaction, schedule.
> + * very heavily contended. The lock should be periodically unlocked to avoid
> + * having disabled IRQs for a long time, even when there is nobody waiting on
> + * the lock. It might also be that allowing the IRQs will result in
> + * need_resched() becoming true. If scheduling is needed, or somebody else
> + * has taken the lock, async compaction aborts. Sync compaction schedules.
> + * Either compaction type will also abort if a fatal signal is pending.
> + * In either case if the lock was locked, it is dropped and not regained.
>   *
> - * Returns true if the lock is held.
> - * Returns false if the lock is released and compaction should abort
> + * Returns true if compaction should abort due to fatal signal pending, or
> + *		async compaction due to lock contention or need to schedule
> + * Returns false when compaction can continue (sync compaction might have
> + *		scheduled)
>   */
> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> -				      bool locked, struct compact_control *cc)
> +static bool compact_unlock_should_abort(spinlock_t *lock,
> +		unsigned long flags, bool *locked, struct compact_control *cc)
>  {
> -	enum compact_contended contended = should_release_lock(lock);
> +	if (*locked) {
> +		spin_unlock_irqrestore(lock, flags);
> +		*locked = false;
> +	}
>  
> -	if (contended) {
> -		if (locked) {
> -			spin_unlock_irqrestore(lock, *flags);
> -			locked = false;
> -		}
> +	if (fatal_signal_pending(current)) {
> +		cc->contended = COMPACT_CONTENDED_SCHED;
> +		return true;
> +	}
>  
> -		/* async aborts if taking too long or contended */
> -		if (cc->mode == MIGRATE_ASYNC) {
> -			cc->contended = contended;
> -			return false;
> +	if (cc->mode == MIGRATE_ASYNC) {
> +		if (need_resched()) {
> +			cc->contended = COMPACT_CONTENDED_SCHED;
> +			return true;
>  		}
> -
> +		if (spin_is_locked(lock)) {
> +			cc->contended = COMPACT_CONTENDED_LOCK;
> +			return true;
> +		}

Any reason to abort here?  If we need to do compact_trylock_irqsave() on 
this lock again then we'll abort when we come to that point, but it seems 
pointless to abort early if the lock isn't actually needed anymore or it 
is dropped before trying to acquire it again.
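
Concretely, the async branch without the early abort would reduce to 
something like this (sketch only; contention would then be noticed by the 
next compact_trylock_irqsave()):

        if (cc->mode == MIGRATE_ASYNC) {
                if (need_resched()) {
                        cc->contended = COMPACT_CONTENDED_SCHED;
                        return true;
                }
                /*
                 * No spin_is_locked() check here: if the lock is needed
                 * again, compact_trylock_irqsave() detects the contention
                 * and aborts at that point.
                 */
        } else {
                cond_resched();
        }

        return false;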

> +	} else {
>  		cond_resched();
>  	}
>  
> -	if (!locked)
> -		spin_lock_irqsave(lock, *flags);
> -	return true;
> +	return false;
>  }
>  
>  /*
>   * Aside from avoiding lock contention, compaction also periodically checks
>   * need_resched() and either schedules in sync compaction or aborts async
> - * compaction. This is similar to what compact_checklock_irqsave() does, but
> + * compaction. This is similar to what compact_unlock_should_abort() does, but
>   * is used where no lock is concerned.
>   *
>   * Returns false when no scheduling was needed, or sync compaction scheduled.
> @@ -291,6 +314,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> +		/*
> +		 * Periodically drop the lock (if held) regardless of its
> +		 * contention, to give chance to IRQs. Abort async compaction
> +		 * if contended.
> +		 */
> +		if (!(blockpfn % SWAP_CLUSTER_MAX)
> +		    && compact_unlock_should_abort(&cc->zone->lock, flags,
> +								&locked, cc))
> +			break;
> +
>  		nr_scanned++;
>  		if (!pfn_valid_within(blockpfn))
>  			goto isolate_fail;
> @@ -308,8 +341,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		 * spin on the lock and we acquire the lock as late as
>  		 * possible.
>  		 */
> -		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> -								locked, cc);
> +		if (!locked)
> +			locked = compact_trylock_irqsave(&cc->zone->lock,
> +								&flags, cc);
>  		if (!locked)
>  			break;
>  
> @@ -514,13 +548,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  	/* Time to isolate some pages for migration */
>  	for (; low_pfn < end_pfn; low_pfn++) {
> -		/* give a chance to irqs before checking need_resched() */
> -		if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
> -			if (should_release_lock(&zone->lru_lock)) {
> -				spin_unlock_irqrestore(&zone->lru_lock, flags);
> -				locked = false;
> -			}
> -		}
> +		/*
> +		 * Periodically drop the lock (if held) regardless of its
> +		 * contention, to give chance to IRQs. Abort async compaction
> +		 * if contended.
> +		 */
> +		if (!(low_pfn % SWAP_CLUSTER_MAX)
> +		    && compact_unlock_should_abort(&zone->lru_lock, flags,
> +								&locked, cc))
> +			break;
>  
>  		/*
>  		 * migrate_pfn does not necessarily start aligned to a
> @@ -622,10 +658,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		    page_count(page) > page_mapcount(page))
>  			continue;
>  
> -		/* Check if it is ok to still hold the lock */
> -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> -								locked, cc);
> -		if (!locked || fatal_signal_pending(current))
> +		/* If the lock is not held, try to take it */
> +		if (!locked)
> +			locked = compact_trylock_irqsave(&zone->lru_lock,
> +								&flags, cc);
> +		if (!locked)
>  			break;
>  
>  		/* Recheck PageLRU and PageTransHuge under lock */

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] mm, compaction: skip rechecks when lock was already held
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-10  0:00     ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-10  0:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, 9 Jun 2014, Vlastimil Babka wrote:

> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if not unsuitable. For pages that pass the initial checks, some properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
> 
> This patch therefore skips the rechecks when the lock was already held. This is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-10  0:07     ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-10  0:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, 9 Jun 2014, Vlastimil Babka wrote:

> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 06/10] mm, compaction: skip buddy pages by their order in the migrate scanner
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-10  0:08     ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-10  0:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, 9 Jun 2014, Vlastimil Babka wrote:

> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
> 
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and miss
> some isolation candidates. This is not that bad, as compaction can already fail
> for many other reasons like parallel allocations, and those have much larger
> race window.
> 
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
> 
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable by multiple inlines of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
> 
> Preliminary results with stress-highalloc from mmtests show a 10% reduction in
> number of pages scanned by migration scanner. This change is also important to
> later allow detecting when a cc->order block of pages cannot be compacted, and
> the scanner should skip to the next block instead of wasting time.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
> V2: fix low_pfn > end_pfn check; comments
>     kept page_order_unsafe() approach for now
> 
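
For illustration, the skip described above is roughly the following
(simplified sketch; the helper name is a placeholder, and freepage_order
must come from a single ACCESS_ONCE() read as explained in the changelog):

	/*
	 * freepage_order was read racily from a PageBuddy page, so only
	 * trust it if it lies in the valid range.  The -1 accounts for
	 * the scanning loop's own low_pfn++.
	 */
	static unsigned long skip_buddy_range(unsigned long low_pfn,
					      unsigned long freepage_order)
	{
		if (freepage_order > 0 && freepage_order < MAX_ORDER)
			low_pfn += (1UL << freepage_order) - 1;

		return low_pfn;
	}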

Please see http://marc.info/?l=linux-mm&m=140235272808846, I'd love to be 
proved wrong.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-09 23:50     ` David Rientjes
@ 2014-06-10  7:11       ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-10  7:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On 06/10/2014 01:50 AM, David Rientjes wrote:
> On Mon, 9 Jun 2014, Vlastimil Babka wrote:
>
>> Async compaction aborts when it detects zone lock contention or need_resched()
>> is true. David Rientjes has reported that in practice, most direct async
>> compactions for THP allocation abort due to need_resched(). This means that a
>> second direct compaction is never attempted, which might be OK for a page
>> fault, but hugepaged is intended to attempt a sync compaction in such case and
>> in these cases it won't.
>>
>> This patch replaces "bool contended" in compact_control with an enum that
>> distinguishes between aborting due to need_resched() and aborting due to lock
>> contention. This allows propagating the abort through all compaction functions
>> as before, but declaring the direct compaction as contended only when lock
>> contention has been detected.
>>
>> As a result, hugepaged will proceed with second sync compaction as intended,
>> when the preceding async compaction aborted due to need_resched().
>>
>
> s/hugepaged/khugepaged/ on the changelog.
>
>> Reported-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> ---
>>   mm/compaction.c | 20 ++++++++++++++------
>>   mm/internal.h   | 15 +++++++++++----
>>   2 files changed, 25 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b73b182..d37f4a8 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>>   }
>>   #endif /* CONFIG_COMPACTION */
>>
>> -static inline bool should_release_lock(spinlock_t *lock)
>> +enum compact_contended should_release_lock(spinlock_t *lock)
>>   {
>> -	return need_resched() || spin_is_contended(lock);
>> +	if (need_resched())
>> +		return COMPACT_CONTENDED_SCHED;
>> +	else if (spin_is_contended(lock))
>> +		return COMPACT_CONTENDED_LOCK;
>> +	else
>> +		return COMPACT_CONTENDED_NONE;
>>   }
>>
>>   /*
>
> I think eventually we're going to remove the need_resched() heuristic
> entirely and so enum compact_contended might be overkill, but do we need
> to worry about spin_is_contended(lock) && need_resched() reporting
> COMPACT_CONTENDED_SCHED here instead of COMPACT_CONTENDED_LOCK?

Hm right, maybe I should reorder the two tests.
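
I.e. something like this (sketch), so that simultaneous lock contention
and need_resched() gets reported as lock contention:

	if (spin_is_contended(lock))
		return COMPACT_CONTENDED_LOCK;
	else if (need_resched())
		return COMPACT_CONTENDED_SCHED;
	else
		return COMPACT_CONTENDED_NONE;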

>> @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>>   static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>   				      bool locked, struct compact_control *cc)
>>   {
>> -	if (should_release_lock(lock)) {
>> +	enum compact_contended contended = should_release_lock(lock);
>> +
>> +	if (contended) {
>>   		if (locked) {
>>   			spin_unlock_irqrestore(lock, *flags);
>>   			locked = false;
>> @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>
>>   		/* async aborts if taking too long or contended */
>>   		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = true;
>> +			cc->contended = contended;
>>   			return false;
>>   		}
>>
>> @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>>   	/* async compaction aborts if contended */
>>   	if (need_resched()) {
>>   		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = true;
>> +			cc->contended = COMPACT_CONTENDED_SCHED;
>>   			return true;
>>   		}
>>
>> @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>>   	VM_BUG_ON(!list_empty(&cc.freepages));
>>   	VM_BUG_ON(!list_empty(&cc.migratepages));
>>
>> -	*contended = cc.contended;
>> +	/* We only signal lock contention back to the allocator */
>> +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
>>   	return ret;
>>   }
>>
>
> Hmm, since the only thing that matters for cc->contended is
> COMPACT_CONTENDED_LOCK, it may make sense to just leave this as a bool
> within struct compact_control instead of passing the actual reason around
> when it doesn't matter.

That's what I thought first. But we set cc->contended in 
isolate_freepages_block() and then check it in isolate_freepages() and 
compaction_alloc() to make sure we don't continue the free scanner once 
contention (or need_resched()) is detected. And introducing an enum, 
even if temporary measure, seemed simpler than making that checking more 
complex. This way it can stay the same once we get rid of need_resched().
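
E.g. the caller-side check can stay a plain truth test (sketch only),
because COMPACT_CONTENDED_NONE is 0:

	/* e.g. in the free scanner loop */
	if (cc->contended)	/* set for any COMPACT_CONTENDED_* reason */
		break;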

>> diff --git a/mm/internal.h b/mm/internal.h
>> index 7f22a11f..4659e8e 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>
>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>
>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>> +enum compact_contended {
>> +	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> +	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
>> +	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
>> +};
>> +
>>   /*
>>    * in mm/compaction.c
>>    */
>> @@ -144,10 +151,10 @@ struct compact_control {
>>   	int order;			/* order a direct compactor needs */
>>   	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>>   	struct zone *zone;
>> -	bool contended;			/* True if a lock was contended, or
>> -					 * need_resched() true during async
>> -					 * compaction
>> -					 */
>> +	enum compact_contended contended; /* Signal need_sched() or lock
>> +					   * contention detected during
>> +					   * compaction
>> +					   */
>>   };
>>
>>   unsigned long


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-06-09 23:58     ` David Rientjes
@ 2014-06-10  7:15       ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-10  7:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On 06/10/2014 01:58 AM, David Rientjes wrote:
> On Mon, 9 Jun 2014, Vlastimil Babka wrote:
>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index d37f4a8..e1a4283 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -185,54 +185,77 @@ static void update_pageblock_skip(struct compact_control *cc,
>>   }
>>   #endif /* CONFIG_COMPACTION */
>>
>> -enum compact_contended should_release_lock(spinlock_t *lock)
>> +/*
>> + * Compaction requires the taking of some coarse locks that are potentially
>> + * very heavily contended. For async compaction, back out if the lock cannot
>> + * be taken immediately. For sync compaction, spin on the lock if needed.
>> + *
>> + * Returns true if the lock is held
>> + * Returns false if the lock is not held and compaction should abort
>> + */
>> +static bool compact_trylock_irqsave(spinlock_t *lock,
>> +			unsigned long *flags, struct compact_control *cc)
>>   {
>> -	if (need_resched())
>> -		return COMPACT_CONTENDED_SCHED;
>> -	else if (spin_is_contended(lock))
>> -		return COMPACT_CONTENDED_LOCK;
>> -	else
>> -		return COMPACT_CONTENDED_NONE;
>> +	if (cc->mode == MIGRATE_ASYNC) {
>> +		if (!spin_trylock_irqsave(lock, *flags)) {
>> +			cc->contended = COMPACT_CONTENDED_LOCK;
>> +			return false;
>> +		}
>> +	} else {
>> +		spin_lock_irqsave(lock, *flags);
>> +	}
>> +
>> +	return true;
>>   }
>>
>>   /*
>>    * Compaction requires the taking of some coarse locks that are potentially
>> - * very heavily contended. Check if the process needs to be scheduled or
>> - * if the lock is contended. For async compaction, back out in the event
>> - * if contention is severe. For sync compaction, schedule.
>> + * very heavily contended. The lock should be periodically unlocked to avoid
>> + * having disabled IRQs for a long time, even when there is nobody waiting on
>> + * the lock. It might also be that allowing the IRQs will result in
>> + * need_resched() becoming true. If scheduling is needed, or somebody else
>> + * has taken the lock, async compaction aborts. Sync compaction schedules.
>> + * Either compaction type will also abort if a fatal signal is pending.
>> + * In either case if the lock was locked, it is dropped and not regained.
>>    *
>> - * Returns true if the lock is held.
>> - * Returns false if the lock is released and compaction should abort
>> + * Returns true if compaction should abort due to fatal signal pending, or
>> + *		async compaction due to lock contention or need to schedule
>> + * Returns false when compaction can continue (sync compaction might have
>> + *		scheduled)
>>    */
>> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>> -				      bool locked, struct compact_control *cc)
>> +static bool compact_unlock_should_abort(spinlock_t *lock,
>> +		unsigned long flags, bool *locked, struct compact_control *cc)
>>   {
>> -	enum compact_contended contended = should_release_lock(lock);
>> +	if (*locked) {
>> +		spin_unlock_irqrestore(lock, flags);
>> +		*locked = false;
>> +	}
>>
>> -	if (contended) {
>> -		if (locked) {
>> -			spin_unlock_irqrestore(lock, *flags);
>> -			locked = false;
>> -		}
>> +	if (fatal_signal_pending(current)) {
>> +		cc->contended = COMPACT_CONTENDED_SCHED;
>> +		return true;
>> +	}
>>
>> -		/* async aborts if taking too long or contended */
>> -		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = contended;
>> -			return false;
>> +	if (cc->mode == MIGRATE_ASYNC) {
>> +		if (need_resched()) {
>> +			cc->contended = COMPACT_CONTENDED_SCHED;
>> +			return true;
>>   		}
>> -
>> +		if (spin_is_locked(lock)) {
>> +			cc->contended = COMPACT_CONTENDED_LOCK;
>> +			return true;
>> +		}
>
> Any reason to abort here?  If we need to do compact_trylock_irqsave() on
> this lock again then we'll abort when we come to that point, but it seems
> pointless to abort early if the lock isn't actually needed anymore or it
> is dropped before trying to acquire it again.

spin_is_locked() true means somebody was most probably waiting for us to 
unlock so maybe we should back off. But I'm not sure if that check can 
actually succeed so early after unlock.

>> +	} else {
>>   		cond_resched();
>>   	}
>>
>> -	if (!locked)
>> -		spin_lock_irqsave(lock, *flags);
>> -	return true;
>> +	return false;
>>   }
>>
>>   /*
>>    * Aside from avoiding lock contention, compaction also periodically checks
>>    * need_resched() and either schedules in sync compaction or aborts async
>> - * compaction. This is similar to what compact_checklock_irqsave() does, but
>> + * compaction. This is similar to what compact_unlock_should_abort() does, but
>>    * is used where no lock is concerned.
>>    *
>>    * Returns false when no scheduling was needed, or sync compaction scheduled.
>> @@ -291,6 +314,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>>   		int isolated, i;
>>   		struct page *page = cursor;
>>
>> +		/*
>> +		 * Periodically drop the lock (if held) regardless of its
>> +		 * contention, to give chance to IRQs. Abort async compaction
>> +		 * if contended.
>> +		 */
>> +		if (!(blockpfn % SWAP_CLUSTER_MAX)
>> +		    && compact_unlock_should_abort(&cc->zone->lock, flags,
>> +								&locked, cc))
>> +			break;
>> +
>>   		nr_scanned++;
>>   		if (!pfn_valid_within(blockpfn))
>>   			goto isolate_fail;
>> @@ -308,8 +341,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>>   		 * spin on the lock and we acquire the lock as late as
>>   		 * possible.
>>   		 */
>> -		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
>> -								locked, cc);
>> +		if (!locked)
>> +			locked = compact_trylock_irqsave(&cc->zone->lock,
>> +								&flags, cc);
>>   		if (!locked)
>>   			break;
>>
>> @@ -514,13 +548,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>
>>   	/* Time to isolate some pages for migration */
>>   	for (; low_pfn < end_pfn; low_pfn++) {
>> -		/* give a chance to irqs before checking need_resched() */
>> -		if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
>> -			if (should_release_lock(&zone->lru_lock)) {
>> -				spin_unlock_irqrestore(&zone->lru_lock, flags);
>> -				locked = false;
>> -			}
>> -		}
>> +		/*
>> +		 * Periodically drop the lock (if held) regardless of its
>> +		 * contention, to give chance to IRQs. Abort async compaction
>> +		 * if contended.
>> +		 */
>> +		if (!(low_pfn % SWAP_CLUSTER_MAX)
>> +		    && compact_unlock_should_abort(&zone->lru_lock, flags,
>> +								&locked, cc))
>> +			break;
>>
>>   		/*
>>   		 * migrate_pfn does not necessarily start aligned to a
>> @@ -622,10 +658,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>   		    page_count(page) > page_mapcount(page))
>>   			continue;
>>
>> -		/* Check if it is ok to still hold the lock */
>> -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
>> -								locked, cc);
>> -		if (!locked || fatal_signal_pending(current))
>> +		/* If the lock is not held, try to take it */
>> +		if (!locked)
>> +			locked = compact_trylock_irqsave(&zone->lru_lock,
>> +								&flags, cc);
>> +		if (!locked)
>>   			break;
>>
>>   		/* Recheck PageLRU and PageTransHuge under lock */


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-10  7:11       ` Vlastimil Babka
@ 2014-06-10 23:40         ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-10 23:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Tue, 10 Jun 2014, Vlastimil Babka wrote:

> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index b73b182..d37f4a8 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct
> > > compact_control *cc,
> > >   }
> > >   #endif /* CONFIG_COMPACTION */
> > > 
> > > -static inline bool should_release_lock(spinlock_t *lock)
> > > +enum compact_contended should_release_lock(spinlock_t *lock)
> > >   {
> > > -	return need_resched() || spin_is_contended(lock);
> > > +	if (need_resched())
> > > +		return COMPACT_CONTENDED_SCHED;
> > > +	else if (spin_is_contended(lock))
> > > +		return COMPACT_CONTENDED_LOCK;
> > > +	else
> > > +		return COMPACT_CONTENDED_NONE;
> > >   }
> > > 
> > >   /*
> > 
> > I think eventually we're going to remove the need_resched() heuristic
> > entirely and so enum compact_contended might be overkill, but do we need
> > to worry about spin_is_contended(lock) && need_resched() reporting
> > COMPACT_CONTENDED_SCHED here instead of COMPACT_CONTENDED_LOCK?
> 
> Hm right, maybe I should reorder the two tests.
> 

Yes, please.

> > > @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t
> > > *lock)
> > >   static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long
> > > *flags,
> > >   				      bool locked, struct compact_control *cc)
> > >   {
> > > -	if (should_release_lock(lock)) {
> > > +	enum compact_contended contended = should_release_lock(lock);
> > > +
> > > +	if (contended) {
> > >   		if (locked) {
> > >   			spin_unlock_irqrestore(lock, *flags);
> > >   			locked = false;
> > > @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t
> > > *lock, unsigned long *flags,
> > > 
> > >   		/* async aborts if taking too long or contended */
> > >   		if (cc->mode == MIGRATE_ASYNC) {
> > > -			cc->contended = true;
> > > +			cc->contended = contended;
> > >   			return false;
> > >   		}
> > > 
> > > @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct
> > > compact_control *cc)
> > >   	/* async compaction aborts if contended */
> > >   	if (need_resched()) {
> > >   		if (cc->mode == MIGRATE_ASYNC) {
> > > -			cc->contended = true;
> > > +			cc->contended = COMPACT_CONTENDED_SCHED;
> > >   			return true;
> > >   		}
> > > 
> > > @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone
> > > *zone, int order,
> > >   	VM_BUG_ON(!list_empty(&cc.freepages));
> > >   	VM_BUG_ON(!list_empty(&cc.migratepages));
> > > 
> > > -	*contended = cc.contended;
> > > +	/* We only signal lock contention back to the allocator */
> > > +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
> > >   	return ret;
> > >   }
> > > 
> > 
> > Hmm, since the only thing that matters for cc->contended is
> > COMPACT_CONTENDED_LOCK, it may make sense to just leave this as a bool
> > within struct compact_control instead of passing the actual reason around
> > when it doesn't matter.
> 
> That's what I thought first. But we set cc->contended in
> isolate_freepages_block() and then check it in isolate_freepages() and
> compaction_alloc() to make sure we don't continue the free scanner once
> contention (or need_resched()) is detected. And introducing an enum, even if
> temporary measure, seemed simpler than making that checking more complex. This
> way it can stay the same once we get rid of need_resched().
> 

Ok, we can always reconsider it later after need_resched() is removed.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-06-10  7:15       ` Vlastimil Babka
@ 2014-06-10 23:41         ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-10 23:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Tue, 10 Jun 2014, Vlastimil Babka wrote:

> On 06/10/2014 01:58 AM, David Rientjes wrote:
> > On Mon, 9 Jun 2014, Vlastimil Babka wrote:
> > 
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index d37f4a8..e1a4283 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -185,54 +185,77 @@ static void update_pageblock_skip(struct
> > > compact_control *cc,
> > >   }
> > >   #endif /* CONFIG_COMPACTION */
> > > 
> > > -enum compact_contended should_release_lock(spinlock_t *lock)
> > > +/*
> > > + * Compaction requires the taking of some coarse locks that are
> > > potentially
> > > + * very heavily contended. For async compaction, back out if the lock
> > > cannot
> > > + * be taken immediately. For sync compaction, spin on the lock if needed.
> > > + *
> > > + * Returns true if the lock is held
> > > + * Returns false if the lock is not held and compaction should abort
> > > + */
> > > +static bool compact_trylock_irqsave(spinlock_t *lock,
> > > +			unsigned long *flags, struct compact_control *cc)
> > >   {
> > > -	if (need_resched())
> > > -		return COMPACT_CONTENDED_SCHED;
> > > -	else if (spin_is_contended(lock))
> > > -		return COMPACT_CONTENDED_LOCK;
> > > -	else
> > > -		return COMPACT_CONTENDED_NONE;
> > > +	if (cc->mode == MIGRATE_ASYNC) {
> > > +		if (!spin_trylock_irqsave(lock, *flags)) {
> > > +			cc->contended = COMPACT_CONTENDED_LOCK;
> > > +			return false;
> > > +		}
> > > +	} else {
> > > +		spin_lock_irqsave(lock, *flags);
> > > +	}
> > > +
> > > +	return true;
> > >   }
> > > 
> > >   /*
> > >    * Compaction requires the taking of some coarse locks that are
> > > potentially
> > > - * very heavily contended. Check if the process needs to be scheduled or
> > > - * if the lock is contended. For async compaction, back out in the event
> > > - * if contention is severe. For sync compaction, schedule.
> > > + * very heavily contended. The lock should be periodically unlocked to
> > > avoid
> > > + * having disabled IRQs for a long time, even when there is nobody
> > > waiting on
> > > + * the lock. It might also be that allowing the IRQs will result in
> > > + * need_resched() becoming true. If scheduling is needed, or somebody
> > > else
> > > + * has taken the lock, async compaction aborts. Sync compaction
> > > schedules.
> > > + * Either compaction type will also abort if a fatal signal is pending.
> > > + * In either case if the lock was locked, it is dropped and not regained.
> > >    *
> > > - * Returns true if the lock is held.
> > > - * Returns false if the lock is released and compaction should abort
> > > + * Returns true if compaction should abort due to fatal signal pending,
> > > or
> > > + *		async compaction due to lock contention or need to schedule
> > > + * Returns false when compaction can continue (sync compaction might have
> > > + *		scheduled)
> > >    */
> > > -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long
> > > *flags,
> > > -				      bool locked, struct compact_control *cc)
> > > +static bool compact_unlock_should_abort(spinlock_t *lock,
> > > +		unsigned long flags, bool *locked, struct compact_control *cc)
> > >   {
> > > -	enum compact_contended contended = should_release_lock(lock);
> > > +	if (*locked) {
> > > +		spin_unlock_irqrestore(lock, flags);
> > > +		*locked = false;
> > > +	}
> > > 
> > > -	if (contended) {
> > > -		if (locked) {
> > > -			spin_unlock_irqrestore(lock, *flags);
> > > -			locked = false;
> > > -		}
> > > +	if (fatal_signal_pending(current)) {
> > > +		cc->contended = COMPACT_CONTENDED_SCHED;
> > > +		return true;
> > > +	}
> > > 
> > > -		/* async aborts if taking too long or contended */
> > > -		if (cc->mode == MIGRATE_ASYNC) {
> > > -			cc->contended = contended;
> > > -			return false;
> > > +	if (cc->mode == MIGRATE_ASYNC) {
> > > +		if (need_resched()) {
> > > +			cc->contended = COMPACT_CONTENDED_SCHED;
> > > +			return true;
> > >   		}
> > > -
> > > +		if (spin_is_locked(lock)) {
> > > +			cc->contended = COMPACT_CONTENDED_LOCK;
> > > +			return true;
> > > +		}
> > 
> > Any reason to abort here?  If we need to do compact_trylock_irqsave() on
> > this lock again then we'll abort when we come to that point, but it seems
> > pointless to abort early if the lock isn't actually needed anymore or it
> > is dropped before trying to acquire it again.
> 
> spin_is_locked() true means somebody was most probably waiting for us to
> unlock so maybe we should back off. But I'm not sure if that check can
> actually succeed so early after unlock.
> 

The fact remains, however, that we may never actually need to grab that 
specific lock again and this would cause us to terminate prematurely.  I 
think the preemptive spin_is_locked() test should be removed here.
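
I.e. the async branch would keep only the need_resched() check,
something like (sketch):

	if (cc->mode == MIGRATE_ASYNC) {
		if (need_resched()) {
			cc->contended = COMPACT_CONTENDED_SCHED;
			return true;
		}
	} else {
		cond_resched();
	}

	return false;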

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-11  0:33   ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  0:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:13AM +0200, Vlastimil Babka wrote:
> isolate_freepages_block() rechecks if the pageblock is suitable to be a target
> for migration after it has taken the zone->lock. However, the check has been
> optimized to occur only once per pageblock, and compact_checklock_irqsave()
> might be dropping and reacquiring lock, which means somebody else might have
> changed the pageblock's migratetype meanwhile.
> 
> Furthermore, nothing prevents the migratetype to change right after
> isolate_freepages_block() has finished isolating. Given how imperfect this is,
> it's simpler to just rely on the check done in isolate_freepages() without
> lock, and not pretend that the recheck under lock guarantees anything. It is
> just a heuristic after all.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  1:10     ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  1:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
> Async compaction aborts when it detects zone lock contention or need_resched()
> is true. David Rientjes has reported that in practice, most direct async
> compactions for THP allocation abort due to need_resched(). This means that a
> second direct compaction is never attempted, which might be OK for a page
> fault, but hugepaged is intended to attempt a sync compaction in such case and
> in these cases it won't.
> 
> This patch replaces "bool contended" in compact_control with an enum that
> distinguishes between aborting due to need_resched() and aborting due to lock
> contention. This allows propagating the abort through all compaction functions
> as before, but declaring the direct compaction as contended only when lock
> contention has been detected.
> 
> As a result, hugepaged will proceed with second sync compaction as intended,
> when the preceding async compaction aborted due to need_resched().

You said "second direct compaction is never attempted, which might be OK
for a page fault" and "khugepaged is intended to attempt a sync compaction",
so it sounds like you want to treat khugepaged specially, unlike other direct
compaction users (e.g. the page fault path).

With this patch, direct compaction only cares about lock contention, not
rescheduling, and that raises a few questions.

Is it really okay for direct compaction not to consider need_resched()?
We take care of it in the direct reclaim path, so why is direct compaction
so special?

And why should khugepaged give up so easily when lock contention or
need_resched() happens? khugepaged is important for the success ratio, as I
read your description, so IMO khugepaged should compact synchronously,
without these early bail-outs on lock contention or rescheduling.

If that causes problems, the user can increase
scan_sleep_millisecs/alloc_sleep_millisecs, which are exactly the knobs for
such cases.

So my point is: how about making khugepaged always do a dumb, fully
synchronous compaction, via something like PG_KHUGEPAGED or
GFP_SYNC_TRANSHUGE?
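
To make the idea concrete, a rough sketch only, not a real proposal:
GFP_SYNC_TRANSHUGE is a made-up flag name and the helper below does not
exist in the tree. It just illustrates how a gfp flag set by khugepaged
could force sync compaction while the page fault path stays async.

    /*
     * Sketch only: GFP_SYNC_TRANSHUGE is a hypothetical gfp flag that
     * khugepaged would add to its THP allocation mask.  Compaction mode
     * selection could then honour it like this.
     */
    static enum migrate_mode thp_compact_mode(gfp_t gfp_mask)
    {
    	/* khugepaged can afford to sleep, so never bail out early */
    	if (gfp_mask & GFP_SYNC_TRANSHUGE)
    		return MIGRATE_SYNC_LIGHT;

    	/* page fault and other callers keep the cheap async attempt */
    	return MIGRATE_ASYNC;
    }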

> 
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/compaction.c | 20 ++++++++++++++------
>  mm/internal.h   | 15 +++++++++++----
>  2 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b73b182..d37f4a8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -static inline bool should_release_lock(spinlock_t *lock)
> +enum compact_contended should_release_lock(spinlock_t *lock)
>  {
> -	return need_resched() || spin_is_contended(lock);
> +	if (need_resched())
> +		return COMPACT_CONTENDED_SCHED;
> +	else if (spin_is_contended(lock))
> +		return COMPACT_CONTENDED_LOCK;
> +	else
> +		return COMPACT_CONTENDED_NONE;
>  }
>  
>  /*
> @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  				      bool locked, struct compact_control *cc)
>  {
> -	if (should_release_lock(lock)) {
> +	enum compact_contended contended = should_release_lock(lock);
> +
> +	if (contended) {
>  		if (locked) {
>  			spin_unlock_irqrestore(lock, *flags);
>  			locked = false;
> @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>  
>  		/* async aborts if taking too long or contended */
>  		if (cc->mode == MIGRATE_ASYNC) {
> -			cc->contended = true;
> +			cc->contended = contended;
>  			return false;
>  		}
>  
> @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>  	/* async compaction aborts if contended */
>  	if (need_resched()) {
>  		if (cc->mode == MIGRATE_ASYNC) {
> -			cc->contended = true;
> +			cc->contended = COMPACT_CONTENDED_SCHED;
>  			return true;
>  		}
>  
> @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>  	VM_BUG_ON(!list_empty(&cc.freepages));
>  	VM_BUG_ON(!list_empty(&cc.migratepages));
>  
> -	*contended = cc.contended;
> +	/* We only signal lock contention back to the allocator */
> +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
>  	return ret;
>  }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index 7f22a11f..4659e8e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
> +/* Used to signal whether compaction detected need_sched() or lock contention */
> +enum compact_contended {
> +	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> +	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
> +	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
> +};
> +
>  /*
>   * in mm/compaction.c
>   */
> @@ -144,10 +151,10 @@ struct compact_control {
>  	int order;			/* order a direct compactor needs */
>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>  	struct zone *zone;
> -	bool contended;			/* True if a lock was contended, or
> -					 * need_resched() true during async
> -					 * compaction
> -					 */
> +	enum compact_contended contended; /* Signal need_sched() or lock
> +					   * contention detected during
> +					   * compaction
> +					   */
>  };
>  
>  unsigned long
> -- 
> 1.8.4.5
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  1:32     ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  1:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:15AM +0200, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
> 
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
> 
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can happen
> when many consecutive pages or pageblocks fail the preliminary tests and do not
> reach the later call site to compact_checklock_irqsave(), as explained below.
> 
> Before the patch:
> 
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
> 
> After the patch, in both scanners:
> 
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async compaction
> if scheduling is needed or someone else holds the lock. It also aborts any type
> of compaction when a fatal signal is pending.
> 
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical contention checking
> is left solely to compact_unlock_should_abort(). If the lock is not held, the
> function however does avoid contended run for async compaction by aborting when
> trylock fails. Sync compaction does not use trylock.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Generally, I like this, but I have a question below.

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
> V2: do not consider need/cond_resched() in compact_trylock_irqsave(); spelling
>     remove inline: compaction.o size reduced
>  mm/compaction.c | 121 ++++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 79 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index d37f4a8..e1a4283 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -185,54 +185,77 @@ static void update_pageblock_skip(struct compact_control *cc,
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> +			unsigned long *flags, struct compact_control *cc)
>  {
> -	if (need_resched())
> -		return COMPACT_CONTENDED_SCHED;
> -	else if (spin_is_contended(lock))
> -		return COMPACT_CONTENDED_LOCK;
> -	else
> -		return COMPACT_CONTENDED_NONE;
> +	if (cc->mode == MIGRATE_ASYNC) {
> +		if (!spin_trylock_irqsave(lock, *flags)) {
> +			cc->contended = COMPACT_CONTENDED_LOCK;
> +			return false;
> +		}
> +	} else {
> +		spin_lock_irqsave(lock, *flags);
> +	}
> +
> +	return true;
>  }
>  
>  /*
>   * Compaction requires the taking of some coarse locks that are potentially
> - * very heavily contended. Check if the process needs to be scheduled or
> - * if the lock is contended. For async compaction, back out in the event
> - * if contention is severe. For sync compaction, schedule.
> + * very heavily contended. The lock should be periodically unlocked to avoid
> + * having disabled IRQs for a long time, even when there is nobody waiting on
> + * the lock. It might also be that allowing the IRQs will result in
> + * need_resched() becoming true. If scheduling is needed, or somebody else
> + * has taken the lock, async compaction aborts. Sync compaction schedules.
> + * Either compaction type will also abort if a fatal signal is pending.
> + * In either case if the lock was locked, it is dropped and not regained.
>   *
> - * Returns true if the lock is held.
> - * Returns false if the lock is released and compaction should abort
> + * Returns true if compaction should abort due to fatal signal pending, or
> + *		async compaction due to lock contention or need to schedule
> + * Returns false when compaction can continue (sync compaction might have
> + *		scheduled)
>   */
> -static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> -				      bool locked, struct compact_control *cc)
> +static bool compact_unlock_should_abort(spinlock_t *lock,
> +		unsigned long flags, bool *locked, struct compact_control *cc)
>  {
> -	enum compact_contended contended = should_release_lock(lock);
> +	if (*locked) {
> +		spin_unlock_irqrestore(lock, flags);
> +		*locked = false;
> +	}
>  
> -	if (contended) {
> -		if (locked) {
> -			spin_unlock_irqrestore(lock, *flags);
> -			locked = false;
> -		}
> +	if (fatal_signal_pending(current)) {
> +		cc->contended = COMPACT_CONTENDED_SCHED;
> +		return true;
> +	}
>  
> -		/* async aborts if taking too long or contended */
> -		if (cc->mode == MIGRATE_ASYNC) {
> -			cc->contended = contended;
> -			return false;
> +	if (cc->mode == MIGRATE_ASYNC) {
> +		if (need_resched()) {
> +			cc->contended = COMPACT_CONTENDED_SCHED;
> +			return true;
>  		}
> -
> +		if (spin_is_locked(lock)) {

Why do you use spin_is_locked instead of spin_is_contended?
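
For reference, the variant I would have expected — just a sketch of the
hunk above with the one check swapped, nothing else changed;
spin_is_contended() only reports a lock that another CPU is actually
waiting for, while spin_is_locked() fires as soon as anybody holds it:

    	if (cc->mode == MIGRATE_ASYNC) {
    		if (need_resched()) {
    			cc->contended = COMPACT_CONTENDED_SCHED;
    			return true;
    		}
    		/* abort only if another CPU is waiting on the lock */
    		if (spin_is_contended(lock)) {
    			cc->contended = COMPACT_CONTENDED_LOCK;
    			return true;
    		}
    	} else {
    		cond_resched();
    	}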

> +			cc->contended = COMPACT_CONTENDED_LOCK;
> +			return true;
> +		}
> +	} else {
>  		cond_resched();
>  	}
>  
> -	if (!locked)
> -		spin_lock_irqsave(lock, *flags);
> -	return true;
> +	return false;
>  }
>  
>  /*
>   * Aside from avoiding lock contention, compaction also periodically checks
>   * need_resched() and either schedules in sync compaction or aborts async
> - * compaction. This is similar to what compact_checklock_irqsave() does, but
> + * compaction. This is similar to what compact_unlock_should_abort() does, but
>   * is used where no lock is concerned.
>   *
>   * Returns false when no scheduling was needed, or sync compaction scheduled.
> @@ -291,6 +314,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> +		/*
> +		 * Periodically drop the lock (if held) regardless of its
> +		 * contention, to give chance to IRQs. Abort async compaction
> +		 * if contended.
> +		 */
> +		if (!(blockpfn % SWAP_CLUSTER_MAX)
> +		    && compact_unlock_should_abort(&cc->zone->lock, flags,
> +								&locked, cc))
> +			break;
> +
>  		nr_scanned++;
>  		if (!pfn_valid_within(blockpfn))
>  			goto isolate_fail;
> @@ -308,8 +341,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		 * spin on the lock and we acquire the lock as late as
>  		 * possible.
>  		 */
> -		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
> -								locked, cc);
> +		if (!locked)
> +			locked = compact_trylock_irqsave(&cc->zone->lock,
> +								&flags, cc);
>  		if (!locked)
>  			break;
>  
> @@ -514,13 +548,15 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  	/* Time to isolate some pages for migration */
>  	for (; low_pfn < end_pfn; low_pfn++) {
> -		/* give a chance to irqs before checking need_resched() */
> -		if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
> -			if (should_release_lock(&zone->lru_lock)) {
> -				spin_unlock_irqrestore(&zone->lru_lock, flags);
> -				locked = false;
> -			}
> -		}
> +		/*
> +		 * Periodically drop the lock (if held) regardless of its
> +		 * contention, to give chance to IRQs. Abort async compaction
> +		 * if contended.
> +		 */
> +		if (!(low_pfn % SWAP_CLUSTER_MAX)
> +		    && compact_unlock_should_abort(&zone->lru_lock, flags,
> +								&locked, cc))
> +			break;
>  
>  		/*
>  		 * migrate_pfn does not necessarily start aligned to a
> @@ -622,10 +658,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		    page_count(page) > page_mapcount(page))
>  			continue;
>  
> -		/* Check if it is ok to still hold the lock */
> -		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
> -								locked, cc);
> -		if (!locked || fatal_signal_pending(current))
> +		/* If the lock is not held, try to take it */
> +		if (!locked)
> +			locked = compact_trylock_irqsave(&zone->lru_lock,
> +								&flags, cc);
> +		if (!locked)
>  			break;
>  
>  		/* Recheck PageLRU and PageTransHuge under lock */
> -- 
> 1.8.4.5
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 04/10] mm, compaction: skip rechecks when lock was already held
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  1:50     ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  1:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:16AM +0200, Vlastimil Babka wrote:
> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if not unsuitable. For pages that pass the initial checks, some properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
> 
> This patch therefore skips the rechecks when the lock was already held. This is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  2:12     ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  2:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:17AM +0200, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Below is a nitpick.

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/compaction.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 83f72bd..58dfaaa 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
>   * (even though it may still end up isolating some pages).
>   */
>  static unsigned long isolate_freepages_block(struct compact_control *cc,
> -				unsigned long blockpfn,
> +				unsigned long *start_pfn,
>  				unsigned long end_pfn,
>  				struct list_head *freelist,
>  				bool strict)
> @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  	struct page *cursor, *valid_page = NULL;
>  	unsigned long flags;
>  	bool locked = false;
> +	unsigned long blockpfn = *start_pfn;
>  
>  	cursor = pfn_to_page(blockpfn);
>  
> @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> +		/* Record how far we have got within the block */
> +		*start_pfn = blockpfn;
> +

Couldn't we move this out of the loop for just one store?
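
Something like the following is what I have in mind — an untested sketch
of just the relevant shape of isolate_freepages_block(), loop header
simplified, with the store hoisted out of the loop:

    	unsigned long blockpfn = *start_pfn;

    	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
    		/* ... existing isolation work; may break out early ... */
    	}

    	/*
    	 * Record how far we got within the block: a single store here
    	 * covers both the normal loop exit and every break above.
    	 */
    	*start_pfn = blockpfn;

(One subtlety: after a full scan this records end_pfn rather than the last
scanned pfn, which the caller would need to be fine with.)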

>  		/*
>  		 * Periodically drop the lock (if held) regardless of its
>  		 * contention, to give chance to IRQs. Abort async compaction
> @@ -424,6 +428,9 @@ isolate_freepages_range(struct compact_control *cc,
>  	LIST_HEAD(freelist);
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> +		/* Protect pfn from changing by isolate_freepages_block */
> +		unsigned long isolate_start_pfn = pfn;
> +
>  		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>  			break;
>  
> @@ -434,8 +441,8 @@ isolate_freepages_range(struct compact_control *cc,
>  		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>  		block_end_pfn = min(block_end_pfn, end_pfn);
>  
> -		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> -						   &freelist, true);
> +		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> +						block_end_pfn, &freelist, true);
>  
>  		/*
>  		 * In strict mode, isolate_freepages_block() returns 0 if
> @@ -774,6 +781,7 @@ static void isolate_freepages(struct zone *zone,
>  				block_end_pfn = block_start_pfn,
>  				block_start_pfn -= pageblock_nr_pages) {
>  		unsigned long isolated;
> +		unsigned long isolate_start_pfn;
>  
>  		/*
>  		 * This can iterate a massively long zone without finding any
> @@ -807,12 +815,27 @@ static void isolate_freepages(struct zone *zone,
>  			continue;
>  
>  		/* Found a block suitable for isolating free pages from */
> -		cc->free_pfn = block_start_pfn;
> -		isolated = isolate_freepages_block(cc, block_start_pfn,
> +		isolate_start_pfn = block_start_pfn;
> +
> +		/*
> +		 * If we are restarting the free scanner in this block, do not
> +		 * rescan the beginning of the block
> +		 */
> +		if (cc->free_pfn < block_end_pfn)
> +			isolate_start_pfn = cc->free_pfn;
> +
> +		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
>  					block_end_pfn, freelist, false);
>  		nr_freepages += isolated;
>  
>  		/*
> +		 * Remember where the free scanner should restart next time.
> +		 * This will point to the last page of pageblock we just
> +		 * scanned, if we scanned it fully.
> +		 */
> +		cc->free_pfn = isolate_start_pfn;
> +
> +		/*
>  		 * Set a flag that we successfully isolated in this pageblock.
>  		 * In the next loop iteration, zone->compact_cached_free_pfn
>  		 * will not be updated and thus it will effectively contain the
> -- 
> 1.8.4.5
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/10] mm: rename allocflags_to_migratetype for clarity
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  2:41     ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  2:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:19AM +0200, Vlastimil Babka wrote:
> From: David Rientjes <rientjes@google.com>
> 
> The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
> ALLOC_CPUSET) that have separate semantics.
> 
> The function allocflags_to_migratetype() actually takes gfp flags, not alloc
> flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I was one of the people who got confused by this from time to time.

Acked-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock
  2014-06-09  9:26 ` Vlastimil Babka
@ 2014-06-11  2:45   ` Zhang Yanfei
  -1 siblings, 0 replies; 88+ messages in thread
From: Zhang Yanfei @ 2014-06-11  2:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Minchan Kim, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel

On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
> isolate_freepages_block() rechecks if the pageblock is suitable to be a target
> for migration after it has taken the zone->lock. However, the check has been
> optimized to occur only once per pageblock, and compact_checklock_irqsave()
> might be dropping and reacquiring lock, which means somebody else might have
> changed the pageblock's migratetype meanwhile.
> 
> Furthermore, nothing prevents the migratetype to change right after
> isolate_freepages_block() has finished isolating. Given how imperfect this is,
> it's simpler to just rely on the check done in isolate_freepages() without
> lock, and not pretend that the recheck under lock guarantees anything. It is
> just a heuristic after all.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
> I suggest folding mm-compactionc-isolate_freepages_block-small-tuneup.patch into this
> 
>  mm/compaction.c | 13 -------------
>  1 file changed, 13 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..b73b182 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  	struct page *cursor, *valid_page = NULL;
>  	unsigned long flags;
>  	bool locked = false;
> -	bool checked_pageblock = false;
>  
>  	cursor = pfn_to_page(blockpfn);
>  
> @@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		if (!locked)
>  			break;
>  
> -		/* Recheck this is a suitable migration target under lock */
> -		if (!strict && !checked_pageblock) {
> -			/*
> -			 * We need to check suitability of pageblock only once
> -			 * and this isolate_freepages_block() is called with
> -			 * pageblock range, so just check once is sufficient.
> -			 */
> -			checked_pageblock = true;
> -			if (!suitable_migration_target(page))
> -				break;
> -		}
> -
>  		/* Recheck this is a buddy page under lock */
>  		if (!PageBuddy(page))
>  			goto isolate_fail;
> 


-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/10] mm, compaction: pass gfp mask to compact_control
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  2:48     ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11  2:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Mon, Jun 09, 2014 at 11:26:20AM +0200, Vlastimil Babka wrote:
> From: David Rientjes <rientjes@google.com>
> 
> struct compact_control currently converts the gfp mask to a migratetype, but we
> need the entire gfp mask in a follow-up patch.
> 
> Pass the entire gfp mask as part of struct compact_control.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> ---
>  mm/compaction.c | 12 +++++++-----
>  mm/internal.h   |  2 +-
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c339ccd..d1e30ba 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -965,8 +965,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	return ISOLATE_SUCCESS;
>  }
>  
> -static int compact_finished(struct zone *zone,
> -			    struct compact_control *cc)
> +static int compact_finished(struct zone *zone, struct compact_control *cc,
> +			    const int migratetype)

If we have gfp_mask, we could use gfpflags_to_migratetype() on cc->gfp_mask.
What is your intention?
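
I mean something like the following (just an untested sketch to show what
I'm asking, not a real patch):

	static int compact_finished(struct zone *zone, struct compact_control *cc)
	{
		/* derive the migratetype here instead of passing it in */
		const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);

		/* ... rest of the function unchanged, using migratetype ... */
	}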

>  {
>  	unsigned int order;
>  	unsigned long watermark;
> @@ -1012,7 +1012,7 @@ static int compact_finished(struct zone *zone,
>  		struct free_area *area = &zone->free_area[order];
>  
>  		/* Job done if page is free of the right migratetype */
> -		if (!list_empty(&area->free_list[cc->migratetype]))
> +		if (!list_empty(&area->free_list[migratetype]))
>  			return COMPACT_PARTIAL;
>  
>  		/* Job done if allocation would set block type */
> @@ -1078,6 +1078,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  	int ret;
>  	unsigned long start_pfn = zone->zone_start_pfn;
>  	unsigned long end_pfn = zone_end_pfn(zone);
> +	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
>  	const bool sync = cc->mode != MIGRATE_ASYNC;
>  
>  	ret = compaction_suitable(zone, cc->order);
> @@ -1120,7 +1121,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  
>  	migrate_prep_local();
>  
> -	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> +	while ((ret = compact_finished(zone, cc, migratetype)) ==
> +						COMPACT_CONTINUE) {
>  		int err;
>  
>  		switch (isolate_migratepages(zone, cc)) {
> @@ -1178,7 +1180,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>  		.nr_freepages = 0,
>  		.nr_migratepages = 0,
>  		.order = order,
> -		.migratetype = gfpflags_to_migratetype(gfp_mask),
> +		.gfp_mask = gfp_mask,
>  		.zone = zone,
>  		.mode = mode,
>  	};
> diff --git a/mm/internal.h b/mm/internal.h
> index 584d04f..af15461 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -149,7 +149,7 @@ struct compact_control {
>  	bool finished_update_migrate;
>  
>  	int order;			/* order a direct compactor needs */
> -	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> +	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
>  	struct zone *zone;
>  	enum compact_contended contended; /* Signal need_sched() or lock
>  					   * contention detected during
> -- 
> 1.8.4.5
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11  3:29     ` Zhang Yanfei
  -1 siblings, 0 replies; 88+ messages in thread
From: Zhang Yanfei @ 2014-06-11  3:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Minchan Kim, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel

On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/compaction.c | 33 ++++++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 83f72bd..58dfaaa 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
>   * (even though it may still end up isolating some pages).
>   */
>  static unsigned long isolate_freepages_block(struct compact_control *cc,
> -				unsigned long blockpfn,
> +				unsigned long *start_pfn,
>  				unsigned long end_pfn,
>  				struct list_head *freelist,
>  				bool strict)
> @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  	struct page *cursor, *valid_page = NULL;
>  	unsigned long flags;
>  	bool locked = false;
> +	unsigned long blockpfn = *start_pfn;
>  
>  	cursor = pfn_to_page(blockpfn);
>  
> @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> +		/* Record how far we have got within the block */
> +		*start_pfn = blockpfn;
> +
>  		/*
>  		 * Periodically drop the lock (if held) regardless of its
>  		 * contention, to give chance to IRQs. Abort async compaction
> @@ -424,6 +428,9 @@ isolate_freepages_range(struct compact_control *cc,
>  	LIST_HEAD(freelist);
>  
>  	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> +		/* Protect pfn from changing by isolate_freepages_block */
> +		unsigned long isolate_start_pfn = pfn;
> +
>  		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>  			break;
>  
> @@ -434,8 +441,8 @@ isolate_freepages_range(struct compact_control *cc,
>  		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>  		block_end_pfn = min(block_end_pfn, end_pfn);
>  
> -		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> -						   &freelist, true);
> +		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> +						block_end_pfn, &freelist, true);
>  
>  		/*
>  		 * In strict mode, isolate_freepages_block() returns 0 if
> @@ -774,6 +781,7 @@ static void isolate_freepages(struct zone *zone,
>  				block_end_pfn = block_start_pfn,
>  				block_start_pfn -= pageblock_nr_pages) {
>  		unsigned long isolated;
> +		unsigned long isolate_start_pfn;
>  
>  		/*
>  		 * This can iterate a massively long zone without finding any
> @@ -807,12 +815,27 @@ static void isolate_freepages(struct zone *zone,
>  			continue;
>  
>  		/* Found a block suitable for isolating free pages from */
> -		cc->free_pfn = block_start_pfn;
> -		isolated = isolate_freepages_block(cc, block_start_pfn,
> +		isolate_start_pfn = block_start_pfn;
> +
> +		/*
> +		 * If we are restarting the free scanner in this block, do not
> +		 * rescan the beginning of the block
> +		 */
> +		if (cc->free_pfn < block_end_pfn)
> +			isolate_start_pfn = cc->free_pfn;
> +
> +		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
>  					block_end_pfn, freelist, false);
>  		nr_freepages += isolated;
>  
>  		/*
> +		 * Remember where the free scanner should restart next time.
> +		 * This will point to the last page of pageblock we just
> +		 * scanned, if we scanned it fully.
> +		 */
> +		cc->free_pfn = isolate_start_pfn;
> +
> +		/*
>  		 * Set a flag that we successfully isolated in this pageblock.
>  		 * In the next loop iteration, zone->compact_cached_free_pfn
>  		 * will not be updated and thus it will effectively contain the
> 


-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 07/10] mm: rename allocflags_to_migratetype for clarity
  2014-06-11  2:41     ` Minchan Kim
@ 2014-06-11  3:38       ` Zhang Yanfei
  -1 siblings, 0 replies; 88+ messages in thread
From: Zhang Yanfei @ 2014-06-11  3:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Vlastimil Babka, David Rientjes, linux-mm, linux-kernel,
	Andrew Morton, Greg Thelen, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel

On 06/11/2014 10:41 AM, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:19AM +0200, Vlastimil Babka wrote:
>> From: David Rientjes <rientjes@google.com>
>>
>> The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
>> ALLOC_CPUSET) that have separate semantics.
>>
>> The function allocflags_to_migratetype() actually takes gfp flags, not alloc
>> flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().
>>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> I was one of person who got confused sometime.

Some names in MM really are confusing. But coming up with an appropriate
name is sometimes hard, too. For example, I once wanted to rename
nr_free_zone_pages() and nr_free_buffer_pages(), but good names were hard
to find, so in the end Andrew just suggested adding detailed function
descriptions to make them clear.

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

> 
> Acked-by: Minchan Kim <minchan@kernel.org>
> 


-- 
Thanks.
Zhang Yanfei

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-11  2:12     ` Minchan Kim
@ 2014-06-11  8:16       ` Joonsoo Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Joonsoo Kim @ 2014-06-11  8:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Vlastimil Babka, David Rientjes, linux-mm, linux-kernel,
	Andrew Morton, Greg Thelen, Mel Gorman, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Wed, Jun 11, 2014 at 11:12:13AM +0900, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:17AM +0200, Vlastimil Babka wrote:
> > Unlike the migration scanner, the free scanner remembers the beginning of the
> > last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> > uselessly when called several times during single compaction. This might have
> > been useful when pages were returned to the buddy allocator after a failed
> > migration, but this is no longer the case.
> > 
> > This patch changes the meaning of cc->free_pfn so that if it points to a
> > middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> > end. isolate_freepages_block() will record the pfn of the last page it looked
> > at, which is then used to update cc->free_pfn.
> > 
> > In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> > ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> > page, to 2.25 free pages per migrate page, without affecting success rates.
> > 
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Minchan Kim <minchan@kernel.org>
> 
> Below is a nitpick.
> 
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Michal Nazarewicz <mina86@mina86.com>
> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: David Rientjes <rientjes@google.com>
> > ---
> >  mm/compaction.c | 33 ++++++++++++++++++++++++++++-----
> >  1 file changed, 28 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 83f72bd..58dfaaa 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
> >   * (even though it may still end up isolating some pages).
> >   */
> >  static unsigned long isolate_freepages_block(struct compact_control *cc,
> > -				unsigned long blockpfn,
> > +				unsigned long *start_pfn,
> >  				unsigned long end_pfn,
> >  				struct list_head *freelist,
> >  				bool strict)
> > @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> >  	struct page *cursor, *valid_page = NULL;
> >  	unsigned long flags;
> >  	bool locked = false;
> > +	unsigned long blockpfn = *start_pfn;
> >  
> >  	cursor = pfn_to_page(blockpfn);
> >  
> > @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
> >  		int isolated, i;
> >  		struct page *page = cursor;
> >  
> > +		/* Record how far we have got within the block */
> > +		*start_pfn = blockpfn;
> > +
> 
> Couldn't we move this out of the loop for just one store?

Hello, Vlastimil.

Moreover, with this approach start_pfn can't be updated to end_pfn.
Is that okay?

Thanks.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-06-11  1:32     ` Minchan Kim
@ 2014-06-11 11:24       ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 11:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On 06/11/2014 03:32 AM, Minchan Kim wrote:
>> >+	if (cc->mode == MIGRATE_ASYNC) {
>> >+		if (need_resched()) {
>> >+			cc->contended = COMPACT_CONTENDED_SCHED;
>> >+			return true;
>> >  		}
>> >-
>> >+		if (spin_is_locked(lock)) {
> Why do you use spin_is_locked instead of spin_is_contended?

Because I know I have dropped the lock. AFAIK spin_is_locked() means
somebody else is holding it, which would be contention for me if I
wanted to take it back. spin_is_contended() means that somebody else
#1 is holding it AND somebody else #2 is already waiting for it.

Previously in should_release_lock() the code assumed that it was me who
held the lock, so I checked whether somebody else was waiting for it,
hence spin_is_contended().

But note that the assumption was not always true when
should_release_lock() was called from compact_checklock_irqsave(). So it
was another subtle suboptimality. In async compaction, when I don't hold
the lock, I should decide whether to take it based on whether somebody
else is holding it. Instead, the decision was based on whether somebody
else #1 is holding it and somebody else #2 is waiting.
Then there's still a chance of a race between this check and the call to
spin_lock_irqsave(), so I could end up spinning on the lock even though I
don't want to. Using spin_trylock_irqsave() instead is like checking
spin_is_locked() and then locking, but without this race.

So even though I will probably remove the spin_is_locked() check per 
David's objection, the trylock will still nicely prevent waiting on the 
lock in async compaction.
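
For illustration, roughly what I mean, as an untested sketch (the helper
name is made up):

	static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
					    struct compact_control *cc)
	{
		if (cc->mode == MIGRATE_ASYNC) {
			/* never wait on the lock in async compaction */
			if (!spin_trylock_irqsave(lock, *flags)) {
				cc->contended = COMPACT_CONTENDED_LOCK;
				return false;
			}
		} else {
			spin_lock_irqsave(lock, *flags);
		}

		return true;
	}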

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-11  2:12     ` Minchan Kim
@ 2014-06-11 11:33       ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 11:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/11/2014 04:12 AM, Minchan Kim wrote:
>> >@@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>> >  		int isolated, i;
>> >  		struct page *page = cursor;
>> >
>> >+		/* Record how far we have got within the block */
>> >+		*start_pfn = blockpfn;
>> >+
> Couldn't we move this out of the loop for just one store?

You mean using a local variable inside the loop, and assigning once, for 
performance reasons (register vs memory access)?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner
  2014-06-11  8:16       ` Joonsoo Kim
@ 2014-06-11 11:41         ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 11:41 UTC (permalink / raw)
  To: Joonsoo Kim, Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On 06/11/2014 10:16 AM, Joonsoo Kim wrote:
> On Wed, Jun 11, 2014 at 11:12:13AM +0900, Minchan Kim wrote:
>> On Mon, Jun 09, 2014 at 11:26:17AM +0200, Vlastimil Babka wrote:
>>> Unlike the migration scanner, the free scanner remembers the beginning of the
>>> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
>>> uselessly when called several times during single compaction. This might have
>>> been useful when pages were returned to the buddy allocator after a failed
>>> migration, but this is no longer the case.
>>>
>>> This patch changes the meaning of cc->free_pfn so that if it points to a
>>> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
>>> end. isolate_freepages_block() will record the pfn of the last page it looked
>>> at, which is then used to update cc->free_pfn.
>>>
>>> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
>>> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
>>> page, to 2.25 free pages per migrate page, without affecting success rates.
>>>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Reviewed-by: Minchan Kim <minchan@kernel.org>
>>
>> Below is a nitpick.
>>
>>> Cc: Minchan Kim <minchan@kernel.org>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>> Cc: Christoph Lameter <cl@linux.com>
>>> Cc: Rik van Riel <riel@redhat.com>
>>> Cc: David Rientjes <rientjes@google.com>
>>> ---
>>>   mm/compaction.c | 33 ++++++++++++++++++++++++++++-----
>>>   1 file changed, 28 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>> index 83f72bd..58dfaaa 100644
>>> --- a/mm/compaction.c
>>> +++ b/mm/compaction.c
>>> @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
>>>    * (even though it may still end up isolating some pages).
>>>    */
>>>   static unsigned long isolate_freepages_block(struct compact_control *cc,
>>> -				unsigned long blockpfn,
>>> +				unsigned long *start_pfn,
>>>   				unsigned long end_pfn,
>>>   				struct list_head *freelist,
>>>   				bool strict)
>>> @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>>>   	struct page *cursor, *valid_page = NULL;
>>>   	unsigned long flags;
>>>   	bool locked = false;
>>> +	unsigned long blockpfn = *start_pfn;
>>>
>>>   	cursor = pfn_to_page(blockpfn);
>>>
>>> @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
>>>   		int isolated, i;
>>>   		struct page *page = cursor;
>>>
>>> +		/* Record how far we have got within the block */
>>> +		*start_pfn = blockpfn;
>>> +
>>
>> Couldn't we move this out of the loop for just one store?

Ah, I get it now. Ignore my previous reply.

> Hello, Vlastimil.
>
> Moreover, start_pfn can't be updated to end pfn with this approach.
> Is it okay?

That's intentional, as recording end_pfn would mean the scanner restarts
at the beginning of the next pageblock. So I want to record the last pfn
*inside* the pageblock that was fully scanned. Note that there's a high
chance that fully scanning the pageblock means I haven't isolated enough,
so isolate_freepages() will advance to the previous pageblock anyway and
the recorded value will be overwritten. But it's still better to handle
this corner case.

So outside the loop, I would need to do:

*start_pfn = min(blockpfn, end_pfn - 1);

It looks a bit tricky but probably better than multiple assignments.
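
The tail of isolate_freepages_block() would then look roughly like this
(sketch only, on top of this patch):

	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
		/* ... isolation logic unchanged, no *start_pfn store here ... */
	}

	/* Record how far we got, staying within this pageblock */
	*start_pfn = min(blockpfn, end_pfn - 1);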

Thanks.

> Thanks.
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/10] mm, compaction: pass gfp mask to compact_control
  2014-06-11  2:48     ` Minchan Kim
@ 2014-06-11 11:46       ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 11:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/11/2014 04:48 AM, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:20AM +0200, Vlastimil Babka wrote:
>> From: David Rientjes <rientjes@google.com>
>>
>> struct compact_control currently converts the gfp mask to a migratetype, but we
>> need the entire gfp mask in a follow-up patch.
>>
>> Pass the entire gfp mask as part of struct compact_control.
>>
>> Signed-off-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> ---
>>   mm/compaction.c | 12 +++++++-----
>>   mm/internal.h   |  2 +-
>>   2 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index c339ccd..d1e30ba 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -965,8 +965,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>>   	return ISOLATE_SUCCESS;
>>   }
>>
>> -static int compact_finished(struct zone *zone,
>> -			    struct compact_control *cc)
>> +static int compact_finished(struct zone *zone, struct compact_control *cc,
>> +			    const int migratetype)
>
> If we has gfp_mask, we could use gfpflags_to_migratetype from cc->gfp_mask.
> What's is your intention?

I can't speak for David, but I left it this way because it means
gfpflags_to_migratetype() is only called once per compact_zone(), instead
of once per compact_finished() call. Now I realize my patch 10/10 repeats
the call in isolate_migratepages_range(), so I'll probably update that as
well.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-11  1:10     ` Minchan Kim
@ 2014-06-11 12:22       ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 12:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/11/2014 03:10 AM, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
>> Async compaction aborts when it detects zone lock contention or need_resched()
>> is true. David Rientjes has reported that in practice, most direct async
>> compactions for THP allocation abort due to need_resched(). This means that a
>> second direct compaction is never attempted, which might be OK for a page
>> fault, but hugepaged is intended to attempt a sync compaction in such case and
>> in these cases it won't.
>>
>> This patch replaces "bool contended" in compact_control with an enum that
>> distinguieshes between aborting due to need_resched() and aborting due to lock
>> contention. This allows propagating the abort through all compaction functions
>> as before, but declaring the direct compaction as contended only when lock
>> contantion has been detected.
>>
>> As a result, hugepaged will proceed with second sync compaction as intended,
>> when the preceding async compaction aborted due to need_resched().
>
> You said "second direct compaction is never attempted, which might be OK
> for a page fault" and said "hugepagd is intented to attempt a sync compaction"
> so I feel you want to handle khugepaged so special unlike other direct compact
> (ex, page fault).

Well khugepaged is my primary concern, but I imagine there are other 
direct compaction users besides THP page fault and khugepaged.

> By this patch, direct compaction take care only lock contention, not rescheduling
> so that pop questions.
>
> Is it okay not to consider need_resched in direct compaction really?

It still considers need_resched() to back off from async compaction. It's
only about what gets signaled back to __alloc_pages_slowpath() via
contended_compaction. This code is executed after the first, async
compaction fails:

/*
  * It can become very expensive to allocate transparent hugepages at
  * fault, so use asynchronous memory compaction for THP unless it is
  * khugepaged trying to collapse.
  */
if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
         migration_mode = MIGRATE_SYNC_LIGHT;

/*
  * If compaction is deferred for high-order allocations, it is because
  * sync compaction recently failed. In this is the case and the caller
  * requested a movable allocation that does not heavily disrupt the
  * system then fail the allocation instead of entering direct reclaim.
  */
if ((deferred_compaction || contended_compaction) &&
                                         (gfp_mask & __GFP_NO_KSWAPD))
         goto nopage;

Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first if() 
decides whether the second attempt will be sync (for khugepaged) or 
async (page fault). The second if() decides that if compaction was 
contended, then there won't be any second attempt (and reclaim) at all. 
Counting need_resched() as contended in this case is bad for khugepaged. 
Even for a page fault it means skipping direct reclaim and the second
async compaction attempt. David says need_resched() occurs so often that
it is a poor heuristic for this decision.
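
So with this patch, the decision effectively becomes (illustration only,
combining the hunk below with the slowpath code above):

	/* compact_zone_order() now reports only lock contention */
	contended_compaction = (cc.contended == COMPACT_CONTENDED_LOCK);

	if ((deferred_compaction || contended_compaction) &&
					(gfp_mask & __GFP_NO_KSWAPD))
		goto nopage;

i.e. need_resched() during async compaction no longer makes the allocator
give up without reclaim and a second compaction attempt.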

> We have taken care of it in direct reclaim path so why direct compaction is
> so special?

I admit I'm not that familiar with reclaim, but I didn't quickly find any
need_resched() there. There's plenty of cond_resched(), but that doesn't
mean it will abort, does it? Could you explain it to me?

> Why does khugepaged give up easily if lock contention/need_resched happens?
> khugepaged is important for success ratio as I read your description so IMO,
> khugepaged should do synchronously without considering early bail out by
> lock/rescheduling.

Well, a stupid answer is that's how __alloc_pages_slowpath() works :) I
don't think it's bad to first try a more lightweight approach before the
heavyweight one, as long as the heavyweight one is not skipped for
khugepaged.

> If it causes problems, user should increase scan_sleep_millisecs/alloc_sleep_millisecs,
> which is exactly the knob for that cases.
>
> So, my point is how about making khugepaged doing always dumb synchronous
> compaction thorough PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
>
>>
>> Reported-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> ---
>>   mm/compaction.c | 20 ++++++++++++++------
>>   mm/internal.h   | 15 +++++++++++----
>>   2 files changed, 25 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b73b182..d37f4a8 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>>   }
>>   #endif /* CONFIG_COMPACTION */
>>
>> -static inline bool should_release_lock(spinlock_t *lock)
>> +enum compact_contended should_release_lock(spinlock_t *lock)
>>   {
>> -	return need_resched() || spin_is_contended(lock);
>> +	if (need_resched())
>> +		return COMPACT_CONTENDED_SCHED;
>> +	else if (spin_is_contended(lock))
>> +		return COMPACT_CONTENDED_LOCK;
>> +	else
>> +		return COMPACT_CONTENDED_NONE;
>>   }
>>
>>   /*
>> @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>>   static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>   				      bool locked, struct compact_control *cc)
>>   {
>> -	if (should_release_lock(lock)) {
>> +	enum compact_contended contended = should_release_lock(lock);
>> +
>> +	if (contended) {
>>   		if (locked) {
>>   			spin_unlock_irqrestore(lock, *flags);
>>   			locked = false;
>> @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>
>>   		/* async aborts if taking too long or contended */
>>   		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = true;
>> +			cc->contended = contended;
>>   			return false;
>>   		}
>>
>> @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>>   	/* async compaction aborts if contended */
>>   	if (need_resched()) {
>>   		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = true;
>> +			cc->contended = COMPACT_CONTENDED_SCHED;
>>   			return true;
>>   		}
>>
>> @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>>   	VM_BUG_ON(!list_empty(&cc.freepages));
>>   	VM_BUG_ON(!list_empty(&cc.migratepages));
>>
>> -	*contended = cc.contended;
>> +	/* We only signal lock contention back to the allocator */
>> +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
>>   	return ret;
>>   }
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 7f22a11f..4659e8e 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>
>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>
>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>> +enum compact_contended {
>> +	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> +	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
>> +	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
>> +};
>> +
>>   /*
>>    * in mm/compaction.c
>>    */
>> @@ -144,10 +151,10 @@ struct compact_control {
>>   	int order;			/* order a direct compactor needs */
>>   	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>>   	struct zone *zone;
>> -	bool contended;			/* True if a lock was contended, or
>> -					 * need_resched() true during async
>> -					 * compaction
>> -					 */
>> +	enum compact_contended contended; /* Signal need_sched() or lock
>> +					   * contention detected during
>> +					   * compaction
>> +					   */
>>   };
>>
>>   unsigned long
>> --
>> 1.8.4.5
>>
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
@ 2014-06-11 12:22       ` Vlastimil Babka
  0 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 12:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/11/2014 03:10 AM, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
>> Async compaction aborts when it detects zone lock contention or need_resched()
>> is true. David Rientjes has reported that in practice, most direct async
>> compactions for THP allocation abort due to need_resched(). This means that a
>> second direct compaction is never attempted, which might be OK for a page
>> fault, but khugepaged is intended to attempt a sync compaction in such a case and
>> in these cases it won't.
>>
>> This patch replaces "bool contended" in compact_control with an enum that
>> distinguishes between aborting due to need_resched() and aborting due to lock
>> contention. This allows propagating the abort through all compaction functions
>> as before, but declaring the direct compaction as contended only when lock
>> contention has been detected.
>>
>> As a result, khugepaged will proceed with a second sync compaction as intended,
>> when the preceding async compaction aborted due to need_resched().
>
> You said "second direct compaction is never attempted, which might be OK
> for a page fault" and said "khugepaged is intended to attempt a sync compaction"
> so I feel you want to handle khugepaged specially, unlike other direct compaction
> users (e.g., page fault).

Well khugepaged is my primary concern, but I imagine there are other 
direct compaction users besides THP page fault and khugepaged.

> With this patch, direct compaction takes care of only lock contention, not
> rescheduling, which raises some questions.
>
> Is it really okay not to consider need_resched() in direct compaction?

It still considers need_resched() to back off from async compaction. It's 
only about signaling contended_compaction back to 
__alloc_pages_slowpath(). There's this code executed after the first, 
async compaction fails:

/*
  * It can become very expensive to allocate transparent hugepages at
  * fault, so use asynchronous memory compaction for THP unless it is
  * khugepaged trying to collapse.
  */
if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
         migration_mode = MIGRATE_SYNC_LIGHT;

/*
  * If compaction is deferred for high-order allocations, it is because
  * sync compaction recently failed. In this is the case and the caller
  * requested a movable allocation that does not heavily disrupt the
  * system then fail the allocation instead of entering direct reclaim.
  */
if ((deferred_compaction || contended_compaction) &&
                                         (gfp_mask & __GFP_NO_KSWAPD))
         goto nopage;

Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first if() 
decides whether the second attempt will be sync (for khugepaged) or 
async (page fault). The second if() decides that if compaction was 
contended, then there won't be any second attempt (and reclaim) at all. 
Counting need_resched() as contended in this case is bad for khugepaged. 
Even for a page fault it means no direct reclaim and no second async 
compaction. David says need_resched() occurs so often that it is a poor 
heuristic for this decision.
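
Roughly, the effect of this patch on that path (a simplified sketch, not the 
exact code):

	/* in compact_zone_order(): only real lock contention is reported back */
	*contended = (cc.contended == COMPACT_CONTENDED_LOCK);

So when async compaction bails out because of need_resched(), 
contended_compaction stays false, the second if() above is not taken, and 
khugepaged falls through to direct reclaim and the MIGRATE_SYNC_LIGHT retry 
instead of jumping to nopage.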

> We have taken care of it in direct reclaim path so why direct compaction is
> so special?

I admit I'm not that familiar with reclaim but I didn't quickly find any 
need_resched() there? There's plenty of cond_resched() but that doesn't 
mean it will abort? Could you explain for me?

> Why does khugepaged give up easily if lock contention/need_resched happens?
> khugepaged is important for success ratio as I read your description so IMO,
> khugepaged should do synchronously without considering early bail out by
> lock/rescheduling.

Well a stupid answer is that's how __alloc_pages_slowpath() works :) I 
don't think it's bad to try using first a more lightweight approach 
before trying the heavyweight one. As long as the heavyweight one is not 
skipped for khugepaged.

> If it causes problems, the user should increase scan_sleep_millisecs/alloc_sleep_millisecs,
> which are exactly the knobs for such cases.
>
> So, my point is: how about making khugepaged always do dumb synchronous
> compaction through PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
>
>>
>> Reported-by: David Rientjes <rientjes@google.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> ---
>>   mm/compaction.c | 20 ++++++++++++++------
>>   mm/internal.h   | 15 +++++++++++----
>>   2 files changed, 25 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b73b182..d37f4a8 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>>   }
>>   #endif /* CONFIG_COMPACTION */
>>
>> -static inline bool should_release_lock(spinlock_t *lock)
>> +enum compact_contended should_release_lock(spinlock_t *lock)
>>   {
>> -	return need_resched() || spin_is_contended(lock);
>> +	if (need_resched())
>> +		return COMPACT_CONTENDED_SCHED;
>> +	else if (spin_is_contended(lock))
>> +		return COMPACT_CONTENDED_LOCK;
>> +	else
>> +		return COMPACT_CONTENDED_NONE;
>>   }
>>
>>   /*
>> @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>>   static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>   				      bool locked, struct compact_control *cc)
>>   {
>> -	if (should_release_lock(lock)) {
>> +	enum compact_contended contended = should_release_lock(lock);
>> +
>> +	if (contended) {
>>   		if (locked) {
>>   			spin_unlock_irqrestore(lock, *flags);
>>   			locked = false;
>> @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>
>>   		/* async aborts if taking too long or contended */
>>   		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = true;
>> +			cc->contended = contended;
>>   			return false;
>>   		}
>>
>> @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>>   	/* async compaction aborts if contended */
>>   	if (need_resched()) {
>>   		if (cc->mode == MIGRATE_ASYNC) {
>> -			cc->contended = true;
>> +			cc->contended = COMPACT_CONTENDED_SCHED;
>>   			return true;
>>   		}
>>
>> @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>>   	VM_BUG_ON(!list_empty(&cc.freepages));
>>   	VM_BUG_ON(!list_empty(&cc.migratepages));
>>
>> -	*contended = cc.contended;
>> +	/* We only signal lock contention back to the allocator */
>> +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
>>   	return ret;
>>   }
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 7f22a11f..4659e8e 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>
>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>
>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>> +enum compact_contended {
>> +	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> +	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
>> +	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
>> +};
>> +
>>   /*
>>    * in mm/compaction.c
>>    */
>> @@ -144,10 +151,10 @@ struct compact_control {
>>   	int order;			/* order a direct compactor needs */
>>   	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>>   	struct zone *zone;
>> -	bool contended;			/* True if a lock was contended, or
>> -					 * need_resched() true during async
>> -					 * compaction
>> -					 */
>> +	enum compact_contended contended; /* Signal need_sched() or lock
>> +					   * contention detected during
>> +					   * compaction
>> +					   */
>>   };
>>
>>   unsigned long
>> --
>> 1.8.4.5
>>
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC PATCH 09/10] mm, compaction: try to capture the just-created high-order freepage
  2014-06-09  9:26   ` Vlastimil Babka
@ 2014-06-11 14:56     ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-11 14:56 UTC (permalink / raw)
  To: David Rientjes, linux-mm
  Cc: linux-kernel, Andrew Morton, Greg Thelen, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On 06/09/2014 11:26 AM, Vlastimil Babka wrote:
> Compaction uses watermark checking to determine if it succeeded in creating
> a high-order free page. My testing has shown that this is quite racy and it
> can happen that watermark checking in compaction succeeds, and moments later
> the watermark checking in page allocation fails, even though the number of
> free pages has increased meanwhile.
> 
> It should be more reliable if direct compaction captures the high-order free
> page as soon as it detects it, and passes it back to allocation. This would
> also reduce the window for somebody else to allocate the free page.
> 
> This has already been implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
> suitable high-order page immediately when it is made available"), but later
> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
> high-order page") due to flaws.
> 
> This patch differs from the previous attempt in two aspects:
> 
> 1) The previous patch scanned free lists to capture the page. In this patch,
>     only the cc->order aligned block that the migration scanner just finished
>     is considered, but only if pages were actually isolated for migration in
>     that block. Tracking cc->order aligned blocks also has benefits for the
>     following patch that skips blocks where non-migratable pages were found.
> 
> 2) In this patch, the isolated free page is allocated through extending
>     get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
>     all operations such as prep_new_page() and page->pfmemalloc setting that
>     was missing in the previous attempt, zone statistics are updated etc.
> 
> Evaluation is pending.

Uh, so if anyone wants to test it, here's a fixed version, as initial evaluation
showed it does not actually capture anything (which should not affect patch 10/10
though) and debugging this took a while.

- for pageblock_order (i.e. THP), capture was never attempted, as the for loop
  in isolate_migratepages_range() ended right before the
  low_pfn == next_capture_pfn check
- lru_add_drain() has to be done before pcplists drain. This made a big difference
  (~50 successful captures -> ~1300 successful captures)
  Note that __alloc_pages_direct_compact() is missing lru_add_drain() as well, and
  all the existing watermark-based compaction termination decisions (which happen
  before the drain in __alloc_pages_direct_compact()) don't do any draining at all.
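
For reference, the per-cpu drains in compact_capture_page() below are ordered
so that pages still sitting on this CPU's lru pagevecs are released to the
pcplists first, and the pcplist drain then returns them to the buddy lists
where they can merge:

	cpu = get_cpu();
	lru_add_drain_cpu(cpu);		/* release pages held on lru pagevecs */
	drain_local_pages(NULL);	/* flush pcplists so free pages can merge */
	put_cpu();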
  
-----8<-----
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 28 May 2014 17:05:18 +0200
Subject: [PATCH fixed 09/10] mm, compaction: try to capture the just-created
 high-order freepage

Compaction uses watermark checking to determine if it succeeded in creating
a high-order free page. My testing has shown that this is quite racy and it
can happen that watermark checking in compaction succeeds, and moments later
the watermark checking in page allocation fails, even though the number of
free pages has increased meanwhile.

It should be more reliable if direct compaction captures the high-order free
page as soon as it detects it, and passes it back to allocation. This would
also reduce the window for somebody else to allocate the free page.

This has already been implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
suitable high-order page immediately when it is made available"), but later
reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
high-order page") due to flaws.

This patch differs from the previous attempt in two aspects:

1) The previous patch scanned free lists to capture the page. In this patch,
   only the cc->order aligned block that the migration scanner just finished
   is considered, but only if pages were actually isolated for migration in
   that block. Tracking cc->order aligned blocks also has benefits for the
   following patch that skips blocks where non-migratable pages were found.

2) In this patch, the isolated free page is allocated through extending
   get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
   all operations such as prep_new_page() and page->pfmemalloc setting that
   was missing in the previous attempt, zone statistics are updated etc.

Evaluation is pending.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/compaction.h |   5 ++-
 mm/compaction.c            | 103 +++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h              |   2 +
 mm/page_alloc.c            |  69 ++++++++++++++++++++++++------
 4 files changed, 161 insertions(+), 18 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 01e3132..69579f5 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,6 +10,8 @@
 #define COMPACT_PARTIAL		2
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	3
+/* Captured a high-order free page in direct compaction */
+#define COMPACT_CAPTURED	4
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
@@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			enum migrate_mode mode, bool *contended);
+			enum migrate_mode mode, bool *contended,
+			struct page **captured_page);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
diff --git a/mm/compaction.c b/mm/compaction.c
index d1e30ba..2988758 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -541,6 +541,16 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
 					ISOLATE_ASYNC_MIGRATE : 0) |
 				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
+	unsigned long capture_pfn = 0;   /* current candidate for capturing */
+	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+
+	if (cc->order > PAGE_ALLOC_COSTLY_ORDER
+		&& gfpflags_to_migratetype(cc->gfp_mask) == MIGRATE_MOVABLE
+			&& cc->order <= pageblock_order) {
+		/* This may be outside the zone, but we check that later */
+		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
+		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+	}
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -563,6 +573,19 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
+		if (low_pfn == next_capture_pfn) {
+			/*
+			 * We have a capture candidate if we isolated something
+			 * during the last cc->order aligned block of pages.
+			 */
+			if (nr_isolated && capture_pfn >= zone->zone_start_pfn)
+				break;
+
+			/* Prepare for a new capture candidate */
+			capture_pfn = next_capture_pfn;
+			next_capture_pfn += (1UL << cc->order);
+		}
+
 		/*
 		 * Periodically drop the lock (if held) regardless of its
 		 * contention, to give chance to IRQs. Abort async compaction
@@ -582,6 +605,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
 			if (!pfn_valid(low_pfn)) {
 				low_pfn += MAX_ORDER_NR_PAGES - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = low_pfn + 1;
 				continue;
 			}
 		}
@@ -639,8 +664,12 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			 * a valid page order. Consider only values in the
 			 * valid order range to prevent low_pfn overflow.
 			 */
-			if (freepage_order > 0 && freepage_order < MAX_ORDER)
+			if (freepage_order > 0 && freepage_order < MAX_ORDER) {
 				low_pfn += (1UL << freepage_order) - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = ALIGN(low_pfn + 1,
+							(1UL << cc->order));
+			}
 			continue;
 		}
 
@@ -673,6 +702,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			if (!locked)
 				goto next_pageblock;
 			low_pfn += (1 << compound_order(page)) - 1;
+			if (next_capture_pfn)
+				next_capture_pfn =
+					ALIGN(low_pfn + 1, (1UL << cc->order));
 			continue;
 		}
 
@@ -697,6 +729,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 				continue;
 			if (PageTransHuge(page)) {
 				low_pfn += (1 << compound_order(page)) - 1;
+				next_capture_pfn = low_pfn + 1;
 				continue;
 			}
 		}
@@ -728,9 +761,20 @@ isolate_success:
 
 next_pageblock:
 		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
+		if (next_capture_pfn)
+			next_capture_pfn = low_pfn + 1;
 	}
 
 	/*
+	 * For cases when next_capture_pfn == end_pfn, such as end of
+	 * pageblock, we couldn't have determined capture candidate inside
+	 * the for cycle, so we have to do it here.
+	 */
+	if (low_pfn == next_capture_pfn && nr_isolated
+			&& capture_pfn >= zone->zone_start_pfn)
+		cc->capture_page = pfn_to_page(capture_pfn);
+
+	/*
 	 * The PageBuddy() check could have potentially brought us outside
 	 * the range to be scanned.
 	 */
@@ -965,6 +1009,44 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
+/*
+ * When called, cc->capture_page is just a candidate. This function will either
+ * successfully capture the page, or reset it to NULL.
+ */
+static bool compact_capture_page(struct compact_control *cc)
+{
+	struct page *page = cc->capture_page;
+	int cpu;
+
+	/* Unsafe check if it's worth to try acquiring the zone->lock at all */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	/*
+	 * There's a good chance that we have just put free pages on this CPU's
+	 * lru cache and pcplists after the page migrations. Drain them to
+	 * allow merging.
+	 */
+	cpu = get_cpu();
+	lru_add_drain_cpu(cpu);
+	drain_local_pages(NULL);
+	put_cpu();
+
+	/* Did the draining help? */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	goto fail;
+
+try_capture:
+	if (capture_free_page(page, cc->order))
+		return true;
+
+fail:
+	cc->capture_page = NULL;
+	return false;
+}
+
 static int compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
@@ -993,6 +1075,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 		return COMPACT_COMPLETE;
 	}
 
+	/* Did we just finish a pageblock that was capture candidate? */
+	if (cc->capture_page && compact_capture_page(cc))
+		return COMPACT_CAPTURED;
+
 	/*
 	 * order == -1 is expected when compacting via
 	 * /proc/sys/vm/compact_memory
@@ -1173,7 +1259,8 @@ out:
 }
 
 static unsigned long compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
+		gfp_t gfp_mask, enum migrate_mode mode, bool *contended,
+						struct page **captured_page)
 {
 	unsigned long ret;
 	struct compact_control cc = {
@@ -1189,6 +1276,9 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 
 	ret = compact_zone(zone, &cc);
 
+	if (ret == COMPACT_CAPTURED)
+		*captured_page = cc.capture_page;
+
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
@@ -1213,7 +1303,8 @@ int sysctl_extfrag_threshold = 500;
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended)
+			enum migrate_mode mode, bool *contended,
+			struct page **captured_page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1239,9 +1330,13 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		int status;
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
-						contended);
+						contended, captured_page);
 		rc = max(status, rc);
 
+		/* If we captured a page, stop compacting */
+		if (*captured_page)
+			break;
+
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
 				      alloc_flags))
diff --git a/mm/internal.h b/mm/internal.h
index af15461..2b7e5de 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  */
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+extern bool capture_free_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -155,6 +156,7 @@ struct compact_control {
 					   * contention detected during
 					   * compaction
 					   */
+	struct page *capture_page;	/* Free page captured by compaction */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3acb83..6235cad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -954,7 +954,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	return NULL;
 }
 
-
 /*
  * This array describes the order lists are fallen back to when
  * the free lists for the desirable migrate type are depleted
@@ -1474,9 +1473,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 {
 	unsigned long watermark;
 	struct zone *zone;
+	struct free_area *area;
 	int mt;
+	unsigned int freepage_order = page_order(page);
 
-	BUG_ON(!PageBuddy(page));
+	VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
 
 	zone = page_zone(page);
 	mt = get_pageblock_migratetype(page);
@@ -1491,9 +1492,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
+	area = &zone->free_area[freepage_order];
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	area->nr_free--;
 	rmv_page_order(page);
+	if (freepage_order != order)
+		expand(zone, page, order, freepage_order, area, mt);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -1536,6 +1540,26 @@ int split_free_page(struct page *page)
 	return nr_pages;
 }
 
+bool capture_free_page(struct page *page, unsigned int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+	bool ret;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!PageBuddy(page) || page_order(page) < order) {
+		ret = false;
+		goto out;
+	}
+
+	ret = __isolate_free_page(page, order);
+
+out:
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return ret;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
@@ -1544,7 +1568,8 @@ int split_free_page(struct page *page)
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			struct page *isolated_freepage)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1573,6 +1598,9 @@ again:
 
 		list_del(&page->lru);
 		pcp->count--;
+	} else if (unlikely(isolated_freepage)) {
+		page = isolated_freepage;
+		local_irq_save(flags);
 	} else {
 		if (unlikely(gfp_flags & __GFP_NOFAIL)) {
 			/*
@@ -1588,7 +1616,9 @@ again:
 			WARN_ON_ONCE(order > 1);
 		}
 		spin_lock_irqsave(&zone->lock, flags);
+
 		page = __rmqueue(zone, order, migratetype);
+
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -1916,7 +1946,8 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int classzone_idx, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype,
+		struct page *isolated_freepage)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
@@ -1927,6 +1958,13 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
 
+	if (isolated_freepage) {
+		zone = page_zone(isolated_freepage);
+		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask,
+						migratetype, isolated_freepage);
+		goto got_page;
+	}
+
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2051,7 +2089,7 @@ zonelist_scan:
 
 try_this_zone:
 		page = buffered_rmqueue(preferred_zone, zone, order,
-						gfp_mask, migratetype);
+						gfp_mask, migratetype, NULL);
 		if (page)
 			break;
 this_zone_full:
@@ -2065,6 +2103,7 @@ this_zone_full:
 		goto zonelist_scan;
 	}
 
+got_page:
 	if (page)
 		/*
 		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
@@ -2202,7 +2241,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, classzone_idx, migratetype);
+		preferred_zone, classzone_idx, migratetype, NULL);
 	if (page)
 		goto out;
 
@@ -2241,6 +2280,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
+	struct page *captured_page;
+
 	if (!order)
 		return NULL;
 
@@ -2252,7 +2293,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, mode,
-						contended_compaction);
+						contended_compaction,
+						&captured_page);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*did_some_progress != COMPACT_SKIPPED) {
@@ -2265,7 +2307,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, classzone_idx, migratetype);
+				preferred_zone, classzone_idx, migratetype,
+				captured_page);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2357,7 +2400,7 @@ retry:
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
 					preferred_zone, classzone_idx,
-					migratetype);
+					migratetype, NULL);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2387,7 +2430,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2548,7 +2591,7 @@ rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 	if (page)
 		goto got_pg;
 
@@ -2757,7 +2800,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, classzone_idx, migratetype);
+			preferred_zone, classzone_idx, migratetype, NULL);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
-- 
1.8.4.5




^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-11 12:22       ` Vlastimil Babka
@ 2014-06-11 23:49         ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-11 23:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Wed, Jun 11, 2014 at 02:22:30PM +0200, Vlastimil Babka wrote:
> On 06/11/2014 03:10 AM, Minchan Kim wrote:
> >On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
> >>Async compaction aborts when it detects zone lock contention or need_resched()
> >>is true. David Rientjes has reported that in practice, most direct async
> >>compactions for THP allocation abort due to need_resched(). This means that a
> >>second direct compaction is never attempted, which might be OK for a page
> >>fault, but khugepaged is intended to attempt a sync compaction in such a case and
> >>in these cases it won't.
> >>
> >>This patch replaces "bool contended" in compact_control with an enum that
> >>distinguishes between aborting due to need_resched() and aborting due to lock
> >>contention. This allows propagating the abort through all compaction functions
> >>as before, but declaring the direct compaction as contended only when lock
> >>contention has been detected.
> >>
> >>As a result, khugepaged will proceed with a second sync compaction as intended,
> >>when the preceding async compaction aborted due to need_resched().
> >
> >You said "second direct compaction is never attempted, which might be OK
> >for a page fault" and said "khugepaged is intended to attempt a sync compaction"
> >so I feel you want to handle khugepaged specially, unlike other direct compaction
> >users (e.g., page fault).
> 
> Well khugepaged is my primary concern, but I imagine there are other
> direct compaction users besides THP page fault and khugepaged.
> 
> >With this patch, direct compaction takes care of only lock contention, not
> >rescheduling, which raises some questions.
> >
> >Is it really okay not to consider need_resched() in direct compaction?
> 
> It still considers need_resched() to back off from async compaction.
> It's only about signaling contended_compaction back to
> __alloc_pages_slowpath(). There's this code executed after the
> first, async compaction fails:
> 
> /*
>  * It can become very expensive to allocate transparent hugepages at
>  * fault, so use asynchronous memory compaction for THP unless it is
>  * khugepaged trying to collapse.
>  */
> if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
>         migration_mode = MIGRATE_SYNC_LIGHT;
> 
> /*
>  * If compaction is deferred for high-order allocations, it is because
>  * sync compaction recently failed. In this is the case and the caller
>  * requested a movable allocation that does not heavily disrupt the
>  * system then fail the allocation instead of entering direct reclaim.
>  */
> if ((deferred_compaction || contended_compaction) &&
>                                         (gfp_mask & __GFP_NO_KSWAPD))
>         goto nopage;
> 
> Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first
> if() decides whether the second attempt will be sync (for
> khugepaged) or async (page fault). The second if() decides that if
> compaction was contended, then there won't be any second attempt
> (and reclaim) at all. Counting need_resched() as contended in this
> case is bad for khugepaged. Even for page fault it means no direct

I agree khugepaged shouldn't take need_resched(), or even lock contention,
into account, because running it is the result of the admin's decision.
If it hurts system performance, the admin should adjust the khugepaged knobs.
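(e.g. the tunables under /sys/kernel/mm/transparent_hugepage/khugepaged/,
such as scan_sleep_millisecs and alloc_sleep_millisecs)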

> reclaim and no second async compaction. David says need_resched()
> occurs so often that it is a poor heuristic for this decision.

But page fault is a bit different. Inherently, high-order allocation
(i.e., above PAGE_ALLOC_COSTLY_ORDER) is fragile, so every caller should
keep that in mind and prepare a second plan (e.g., a 4K allocation), which
means direct reclaim/compaction should care about latency rather than
success ratio.
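
For example, a caller-side second plan might look roughly like this
(illustrative only, not taken from any particular caller):

	/* opportunistic high-order attempt: don't retry hard, don't warn */
	page = alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN, order);
	if (!page)
		/* second plan: fall back to a base 4K page, as THP faults do */
		page = alloc_pages(GFP_KERNEL, 0);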

If need_resched() during the second attempt (i.e., synchronous compaction) is
almost always true, it means the process has consumed its timeslice, so it
shouldn't be greedy and should give the CPU to others.
I don't mean we should abort, but the process could sleep and retry.
The point is that we should give the latency pain to the process requesting
the high-order allocation, not to another random process.

IMHO, if we want to increase the high-order allocation success ratio at page
fault, kswapd should be more aggressive than it is now, via a feedback loop
driven by the failure rate of direct compaction.

> 
> >We have taken care of it in direct reclaim path so why direct compaction is
> >so special?
> 
> I admit I'm not that familiar with reclaim but I didn't quickly find
> any need_resched() there? There's plenty of cond_resched() but that
> doesn't mean it will abort? Could you explain for me?

I meant cond_resched.
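
Roughly, the difference (illustrative sketch):

	/* direct reclaim style: yield the CPU if needed, then keep going */
	cond_resched();

	/* async compaction style: a pending reschedule is a reason to abort */
	if (need_resched()) {
		cc->contended = COMPACT_CONTENDED_SCHED;
		return true;
	}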

> 
> >Why does khugepaged give up easily if lock contention/need_resched happens?
> >khugepaged is important for success ratio as I read your description so IMO,
> >khugepaged should do synchronously without considering early bail out by
> >lock/rescheduling.
> 
> Well a stupid answer is that's how __alloc_pages_slowpath() works :)
> I don't think it's bad to try using first a more lightweight
> approach before trying the heavyweight one. As long as the
> heavyweight one is not skipped for khugepaged.

I'm not saying the current two-stage approach is bad. My point is that we
should take care of need_resched and shouldn't become greedy, but khugepaged
would be okay.

> 
> >If it causes problems, the user should increase scan_sleep_millisecs/alloc_sleep_millisecs,
> >which are exactly the knobs for such cases.
> >
> >So, my point is: how about making khugepaged always do dumb synchronous
> >compaction through PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
> >
> >>
> >>Reported-by: David Rientjes <rientjes@google.com>
> >>Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>Cc: Minchan Kim <minchan@kernel.org>
> >>Cc: Mel Gorman <mgorman@suse.de>
> >>Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>Cc: Michal Nazarewicz <mina86@mina86.com>
> >>Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >>Cc: Christoph Lameter <cl@linux.com>
> >>Cc: Rik van Riel <riel@redhat.com>
> >>---
> >>  mm/compaction.c | 20 ++++++++++++++------
> >>  mm/internal.h   | 15 +++++++++++----
> >>  2 files changed, 25 insertions(+), 10 deletions(-)
> >>
> >>diff --git a/mm/compaction.c b/mm/compaction.c
> >>index b73b182..d37f4a8 100644
> >>--- a/mm/compaction.c
> >>+++ b/mm/compaction.c
> >>@@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
> >>  }
> >>  #endif /* CONFIG_COMPACTION */
> >>
> >>-static inline bool should_release_lock(spinlock_t *lock)
> >>+enum compact_contended should_release_lock(spinlock_t *lock)
> >>  {
> >>-	return need_resched() || spin_is_contended(lock);
> >>+	if (need_resched())
> >>+		return COMPACT_CONTENDED_SCHED;
> >>+	else if (spin_is_contended(lock))
> >>+		return COMPACT_CONTENDED_LOCK;
> >>+	else
> >>+		return COMPACT_CONTENDED_NONE;
> >>  }
> >>
> >>  /*
> >>@@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> >>  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>  				      bool locked, struct compact_control *cc)
> >>  {
> >>-	if (should_release_lock(lock)) {
> >>+	enum compact_contended contended = should_release_lock(lock);
> >>+
> >>+	if (contended) {
> >>  		if (locked) {
> >>  			spin_unlock_irqrestore(lock, *flags);
> >>  			locked = false;
> >>@@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>
> >>  		/* async aborts if taking too long or contended */
> >>  		if (cc->mode == MIGRATE_ASYNC) {
> >>-			cc->contended = true;
> >>+			cc->contended = contended;
> >>  			return false;
> >>  		}
> >>
> >>@@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> >>  	/* async compaction aborts if contended */
> >>  	if (need_resched()) {
> >>  		if (cc->mode == MIGRATE_ASYNC) {
> >>-			cc->contended = true;
> >>+			cc->contended = COMPACT_CONTENDED_SCHED;
> >>  			return true;
> >>  		}
> >>
> >>@@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> >>  	VM_BUG_ON(!list_empty(&cc.freepages));
> >>  	VM_BUG_ON(!list_empty(&cc.migratepages));
> >>
> >>-	*contended = cc.contended;
> >>+	/* We only signal lock contention back to the allocator */
> >>+	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
> >>  	return ret;
> >>  }
> >>
> >>diff --git a/mm/internal.h b/mm/internal.h
> >>index 7f22a11f..4659e8e 100644
> >>--- a/mm/internal.h
> >>+++ b/mm/internal.h
> >>@@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
> >>
> >>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> >>
> >>+/* Used to signal whether compaction detected need_sched() or lock contention */
> >>+enum compact_contended {
> >>+	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> >>+	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
> >>+	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
> >>+};
> >>+
> >>  /*
> >>   * in mm/compaction.c
> >>   */
> >>@@ -144,10 +151,10 @@ struct compact_control {
> >>  	int order;			/* order a direct compactor needs */
> >>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> >>  	struct zone *zone;
> >>-	bool contended;			/* True if a lock was contended, or
> >>-					 * need_resched() true during async
> >>-					 * compaction
> >>-					 */
> >>+	enum compact_contended contended; /* Signal need_sched() or lock
> >>+					   * contention detected during
> >>+					   * compaction
> >>+					   */
> >>  };
> >>
> >>  unsigned long
> >>--
> >>1.8.4.5
> >>
> >>--
> >>To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >>the body to majordomo@kvack.org.  For more info on Linux MM,
> >>see: http://www.linux-mm.org/ .
> >>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 08/10] mm, compaction: pass gfp mask to compact_control
  2014-06-11 11:46       ` Vlastimil Babka
@ 2014-06-12  0:24         ` David Rientjes
  -1 siblings, 0 replies; 88+ messages in thread
From: David Rientjes @ 2014-06-12  0:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Minchan Kim, linux-mm, linux-kernel, Andrew Morton, Greg Thelen,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel

On Wed, 11 Jun 2014, Vlastimil Babka wrote:

> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index c339ccd..d1e30ba 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -965,8 +965,8 @@ static isolate_migrate_t isolate_migratepages(struct
> > > zone *zone,
> > >   	return ISOLATE_SUCCESS;
> > >   }
> > > 
> > > -static int compact_finished(struct zone *zone,
> > > -			    struct compact_control *cc)
> > > +static int compact_finished(struct zone *zone, struct compact_control
> > > *cc,
> > > +			    const int migratetype)
> > 
> > If we have gfp_mask, we could use gfpflags_to_migratetype() on cc->gfp_mask.
> > What is your intention?
> 
> Can't speak for David but I left it this way as it means
> gfpflags_to_migratetype is only called once per compact_zone. Now I realize my
> patch 10/10 repeats the call in isolate_migratepages_range so I'll probably
> update that as well.
> 

Yes, that was definitely the intention: call it once in compact_zone(),
store it as const, and then avoid calling it on every invocation of
compact_finished().
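
A condensed sketch of that intent (not the actual patch; the compaction loop
is reduced to its skeleton): translate the gfp mask once per compact_zone()
call and hand the constant result down.

static int compact_zone(struct zone *zone, struct compact_control *cc)
{
	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
	int ret;

	do {
		/* ... isolate and migrate pages ... */
		ret = compact_finished(zone, cc, migratetype);
	} while (ret == COMPACT_CONTINUE);

	return ret;
}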

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC PATCH 09/10] mm, compaction: try to capture the just-created high-order freepage
  2014-06-11 14:56     ` Vlastimil Babka
@ 2014-06-12  2:20       ` Minchan Kim
  -1 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-12  2:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Wed, Jun 11, 2014 at 04:56:49PM +0200, Vlastimil Babka wrote:
> On 06/09/2014 11:26 AM, Vlastimil Babka wrote:
> > Compaction uses watermark checking to determine if it succeeded in creating
> > a high-order free page. My testing has shown that this is quite racy and it
> > can happen that watermark checking in compaction succeeds, and moments later
> > the watermark checking in page allocation fails, even though the number of
> > free pages has increased meanwhile.
> > 
> > It should be more reliable if direct compaction captured the high-order free
> > page as soon as it detects it, and pass it back to allocation. This would
> > also reduce the window for somebody else to allocate the free page.
> > 
> > This has been already implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
> > suitable high-order page immediately when it is made available"), but later
> > reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
> > high-order page") due to flaws.
> > 
> > This patch differs from the previous attempt in two aspects:
> > 
> > 1) The previous patch scanned free lists to capture the page. In this patch,
> >     only the cc->order aligned block that the migration scanner just finished
> >     is considered, but only if pages were actually isolated for migration in
> >     that block. Tracking cc->order aligned blocks also has benefits for the
> >     following patch that skips blocks where non-migratable pages were found.
> > 

Generally I like this.
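
To make the cc->order aligned block tracking in 1) concrete, here is a
worked example of the arithmetic used further down in the patch (the pfn
value is arbitrary):

/*
 * With cc->order == 9 (512 pages, i.e. a 2MB THP on x86) and the migration
 * scanner currently at low_pfn == 0x100203:
 *
 *	capture_pfn      = low_pfn & ~((1UL << 9) - 1)    = 0x100200
 *	next_capture_pfn = ALIGN(low_pfn + 1, 1UL << 9)   = 0x100400
 *
 * The scanner thus remembers the order-aligned block it is in and the pfn
 * where the next one starts; once low_pfn reaches next_capture_pfn, the
 * block it just finished can be tested as a capture candidate.
 */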

> > 2) In this patch, the isolated free page is allocated through extending
> >     get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
> >     all operations such as prep_new_page() and page->pfmemalloc setting that
> >     was missing in the previous attempt, zone statistics are updated etc.
> > 

But this part is a problem.
Capturing is not common, yet you are adding overhead to the hot path for a rare
case that is allowed to fail anyway, so it's not a good deal.
In that case, we have no choice but to do the things you mentioned (e.g. statistics,
prep_new_page, pfmemalloc setting) manually in __alloc_pages_direct_compact().
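
A rough sketch of that alternative, assuming a hypothetical helper
do_captured_page_prep() that stands in for the prep_new_page(), statistics
and page->pfmemalloc work that would then have to be done by hand (none of
these names are from the patch):

/* Hypothetical: finish the captured page directly in the compaction exit
 * path instead of routing it through get_page_from_freelist(). */
static void do_captured_page_prep(struct page *page, unsigned int order,
				  gfp_t gfp_mask, int alloc_flags)
{
	/* Stand-in for prep_new_page(), zone statistic updates and the
	 * pfmemalloc handling normally done by the allocator fast path. */
}

static struct page *finish_captured_page(struct page *captured_page,
					 unsigned int order, gfp_t gfp_mask,
					 int alloc_flags)
{
	if (!captured_page)
		return NULL;

	do_captured_page_prep(captured_page, order, gfp_mask, alloc_flags);
	return captured_page;
}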

> > Evaluation is pending.
> 
> Uh, so if anyone wants to test it, here's a fixed version, as initial evaluation
> showed it does not actually capture anything (which should not affect patch 10/10
> though) and debugging this took a while.
> 
> - for pageblock_order (i.e. THP), capture was never attempted, as the for cycle
>   in isolate_migratepages_range() has ended right before the
>   low_pfn == next_capture_pfn check
> - lru_add_drain() has to be done before pcplists drain. This made a big difference
>   (~50 successful captures -> ~1300 successful captures)
>   Note that __alloc_pages_direct_compact() is missing lru_add_drain() as well, and
>   all the existing watermark-based compaction termination decisions (which happen
>   before the drain in __alloc_pages_direct_compact()) don't do any draining at all.
>   
> -----8<-----
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 28 May 2014 17:05:18 +0200
> Subject: [PATCH fixed 09/10] mm, compaction: try to capture the just-created
>  high-order freepage
> 
> Compaction uses watermark checking to determine if it succeeded in creating
> a high-order free page. My testing has shown that this is quite racy and it
> can happen that watermark checking in compaction succeeds, and moments later
> the watermark checking in page allocation fails, even though the number of
> free pages has increased meanwhile.
> 
> It should be more reliable if direct compaction captured the high-order free
> page as soon as it detects it, and pass it back to allocation. This would
> also reduce the window for somebody else to allocate the free page.
> 
> This has been already implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
> suitable high-order page immediately when it is made available"), but later
> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
> high-order page") due to flaws.
> 
> This patch differs from the previous attempt in two aspects:
> 
> 1) The previous patch scanned free lists to capture the page. In this patch,
>    only the cc->order aligned block that the migration scanner just finished
>    is considered, but only if pages were actually isolated for migration in
>    that block. Tracking cc->order aligned blocks also has benefits for the
>    following patch that skips blocks where non-migratable pages were found.
> 
> 2) In this patch, the isolated free page is allocated through extending
>    get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
>    all operations such as prep_new_page() and page->pfmemalloc setting that
>    was missing in the previous attempt, zone statistics are updated etc.
> 
> Evaluation is pending.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  include/linux/compaction.h |   5 ++-
>  mm/compaction.c            | 103 +++++++++++++++++++++++++++++++++++++++++++--
>  mm/internal.h              |   2 +
>  mm/page_alloc.c            |  69 ++++++++++++++++++++++++------
>  4 files changed, 161 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..69579f5 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -10,6 +10,8 @@
>  #define COMPACT_PARTIAL		2
>  /* The full zone was compacted */
>  #define COMPACT_COMPLETE	3
> +/* Captured a high-order free page in direct compaction */
> +#define COMPACT_CAPTURED	4
>  
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
> @@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask,
> -			enum migrate_mode mode, bool *contended);
> +			enum migrate_mode mode, bool *contended,
> +			struct page **captured_page);
>  extern void compact_pgdat(pg_data_t *pgdat, int order);
>  extern void reset_isolation_suitable(pg_data_t *pgdat);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index d1e30ba..2988758 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -541,6 +541,16 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  	const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
>  					ISOLATE_ASYNC_MIGRATE : 0) |
>  				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
> +	unsigned long capture_pfn = 0;   /* current candidate for capturing */
> +	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
> +
> +	if (cc->order > PAGE_ALLOC_COSTLY_ORDER
> +		&& gfpflags_to_migratetype(cc->gfp_mask) == MIGRATE_MOVABLE
> +			&& cc->order <= pageblock_order) {

You sent this with an RFC mark, so I will not review the details, just the design.

Why does capture only work for a limited range of high orders?
Direct compaction is a really costly operation for the process, paid for with
its own resources (i.e. its timeslice), so anyone doing direct compaction
deserves the precious result regardless of order.

Another question: why couldn't the capture also work for MIGRATE_CMA?

> +		/* This may be outside the zone, but we check that later */
> +		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
> +		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
> +	}


>  
>  	/*
>  	 * Ensure that there are not too many pages isolated from the LRU
> @@ -563,6 +573,19 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  
>  	/* Time to isolate some pages for migration */
>  	for (; low_pfn < end_pfn; low_pfn++) {
> +		if (low_pfn == next_capture_pfn) {
> +			/*
> +			 * We have a capture candidate if we isolated something
> +			 * during the last cc->order aligned block of pages.
> +			 */
> +			if (nr_isolated && capture_pfn >= zone->zone_start_pfn)
> +				break;
> +
> +			/* Prepare for a new capture candidate */
> +			capture_pfn = next_capture_pfn;
> +			next_capture_pfn += (1UL << cc->order);
> +		}
> +
>  		/*
>  		 * Periodically drop the lock (if held) regardless of its
>  		 * contention, to give chance to IRQs. Abort async compaction
> @@ -582,6 +605,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
>  			if (!pfn_valid(low_pfn)) {
>  				low_pfn += MAX_ORDER_NR_PAGES - 1;
> +				if (next_capture_pfn)
> +					next_capture_pfn = low_pfn + 1;
>  				continue;
>  			}
>  		}
> @@ -639,8 +664,12 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			 * a valid page order. Consider only values in the
>  			 * valid order range to prevent low_pfn overflow.
>  			 */
> -			if (freepage_order > 0 && freepage_order < MAX_ORDER)
> +			if (freepage_order > 0 && freepage_order < MAX_ORDER) {
>  				low_pfn += (1UL << freepage_order) - 1;
> +				if (next_capture_pfn)
> +					next_capture_pfn = ALIGN(low_pfn + 1,
> +							(1UL << cc->order));
> +			}
>  			continue;
>  		}
>  
> @@ -673,6 +702,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  			if (!locked)
>  				goto next_pageblock;
>  			low_pfn += (1 << compound_order(page)) - 1;
> +			if (next_capture_pfn)
> +				next_capture_pfn =
> +					ALIGN(low_pfn + 1, (1UL << cc->order));
>  			continue;
>  		}
>  
> @@ -697,6 +729,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  				continue;
>  			if (PageTransHuge(page)) {
>  				low_pfn += (1 << compound_order(page)) - 1;
> +				next_capture_pfn = low_pfn + 1;
>  				continue;
>  			}
>  		}
> @@ -728,9 +761,20 @@ isolate_success:
>  
>  next_pageblock:
>  		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
> +		if (next_capture_pfn)
> +			next_capture_pfn = low_pfn + 1;
>  	}
>  
>  	/*
> +	 * For cases when next_capture_pfn == end_pfn, such as end of
> +	 * pageblock, we couldn't have determined capture candidate inside
> +	 * the for cycle, so we have to do it here.
> +	 */
> +	if (low_pfn == next_capture_pfn && nr_isolated
> +			&& capture_pfn >= zone->zone_start_pfn)
> +		cc->capture_page = pfn_to_page(capture_pfn);
> +
> +	/*
>  	 * The PageBuddy() check could have potentially brought us outside
>  	 * the range to be scanned.
>  	 */
> @@ -965,6 +1009,44 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	return ISOLATE_SUCCESS;
>  }
>  
> +/*
> + * When called, cc->capture_page is just a candidate. This function will either
> + * successfully capture the page, or reset it to NULL.
> + */
> +static bool compact_capture_page(struct compact_control *cc)
> +{
> +	struct page *page = cc->capture_page;
> +	int cpu;
> +
> +	/* Unsafe check if it's worth to try acquiring the zone->lock at all */
> +	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
> +		goto try_capture;
> +
> +	/*
> +	 * There's a good chance that we have just put free pages on this CPU's
> +	 * lru cache and pcplists after the page migrations. Drain them to
> +	 * allow merging.
> +	 */
> +	cpu = get_cpu();
> +	lru_add_drain_cpu(cpu);
> +	drain_local_pages(NULL);
> +	put_cpu();
> +
> +	/* Did the draining help? */
> +	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
> +		goto try_capture;
> +
> +	goto fail;
> +
> +try_capture:
> +	if (capture_free_page(page, cc->order))
> +		return true;
> +
> +fail:
> +	cc->capture_page = NULL;
> +	return false;
> +}
> +
>  static int compact_finished(struct zone *zone, struct compact_control *cc,
>  			    const int migratetype)
>  {
> @@ -993,6 +1075,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
>  		return COMPACT_COMPLETE;
>  	}
>  
> +	/* Did we just finish a pageblock that was capture candidate? */
> +	if (cc->capture_page && compact_capture_page(cc))
> +		return COMPACT_CAPTURED;
> +
>  	/*
>  	 * order == -1 is expected when compacting via
>  	 * /proc/sys/vm/compact_memory
> @@ -1173,7 +1259,8 @@ out:
>  }
>  
>  static unsigned long compact_zone_order(struct zone *zone, int order,
> -		gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
> +		gfp_t gfp_mask, enum migrate_mode mode, bool *contended,
> +						struct page **captured_page)
>  {
>  	unsigned long ret;
>  	struct compact_control cc = {
> @@ -1189,6 +1276,9 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>  
>  	ret = compact_zone(zone, &cc);
>  
> +	if (ret == COMPACT_CAPTURED)
> +		*captured_page = cc.capture_page;
> +
>  	VM_BUG_ON(!list_empty(&cc.freepages));
>  	VM_BUG_ON(!list_empty(&cc.migratepages));
>  
> @@ -1213,7 +1303,8 @@ int sysctl_extfrag_threshold = 500;
>   */
>  unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *nodemask,
> -			enum migrate_mode mode, bool *contended)
> +			enum migrate_mode mode, bool *contended,
> +			struct page **captured_page)
>  {
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>  	int may_enter_fs = gfp_mask & __GFP_FS;
> @@ -1239,9 +1330,13 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  		int status;
>  
>  		status = compact_zone_order(zone, order, gfp_mask, mode,
> -						contended);
> +						contended, captured_page);
>  		rc = max(status, rc);
>  
> +		/* If we captured a page, stop compacting */
> +		if (*captured_page)
> +			break;
> +
>  		/* If a normal allocation would succeed, stop compacting */
>  		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
>  				      alloc_flags))
> diff --git a/mm/internal.h b/mm/internal.h
> index af15461..2b7e5de 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>   */
>  extern void __free_pages_bootmem(struct page *page, unsigned int order);
>  extern void prep_compound_page(struct page *page, unsigned long order);
> +extern bool capture_free_page(struct page *page, unsigned int order);
>  #ifdef CONFIG_MEMORY_FAILURE
>  extern bool is_free_buddy_page(struct page *page);
>  #endif
> @@ -155,6 +156,7 @@ struct compact_control {
>  					   * contention detected during
>  					   * compaction
>  					   */
> +	struct page *capture_page;	/* Free page captured by compaction */
>  };
>  
>  unsigned long
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a3acb83..6235cad 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -954,7 +954,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  	return NULL;
>  }
>  
> -
>  /*
>   * This array describes the order lists are fallen back to when
>   * the free lists for the desirable migrate type are depleted
> @@ -1474,9 +1473,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>  {
>  	unsigned long watermark;
>  	struct zone *zone;
> +	struct free_area *area;
>  	int mt;
> +	unsigned int freepage_order = page_order(page);
>  
> -	BUG_ON(!PageBuddy(page));
> +	VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
>  
>  	zone = page_zone(page);
>  	mt = get_pageblock_migratetype(page);
> @@ -1491,9 +1492,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
>  	}
>  
>  	/* Remove page from free list */
> +	area = &zone->free_area[freepage_order];
>  	list_del(&page->lru);
> -	zone->free_area[order].nr_free--;
> +	area->nr_free--;
>  	rmv_page_order(page);
> +	if (freepage_order != order)
> +		expand(zone, page, order, freepage_order, area, mt);
>  
>  	/* Set the pageblock if the isolated page is at least a pageblock */
>  	if (order >= pageblock_order - 1) {
> @@ -1536,6 +1540,26 @@ int split_free_page(struct page *page)
>  	return nr_pages;
>  }
>  
> +bool capture_free_page(struct page *page, unsigned int order)
> +{
> +	struct zone *zone = page_zone(page);
> +	unsigned long flags;
> +	bool ret;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	if (!PageBuddy(page) || page_order(page) < order) {
> +		ret = false;
> +		goto out;
> +	}
> +
> +	ret = __isolate_free_page(page, order);
> +
> +out:
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +	return ret;
> +}
> +
>  /*
>   * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
>   * we cheat by calling it from here, in the order > 0 path.  Saves a branch
> @@ -1544,7 +1568,8 @@ int split_free_page(struct page *page)
>  static inline
>  struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, unsigned int order,
> -			gfp_t gfp_flags, int migratetype)
> +			gfp_t gfp_flags, int migratetype,
> +			struct page *isolated_freepage)
>  {
>  	unsigned long flags;
>  	struct page *page;
> @@ -1573,6 +1598,9 @@ again:
>  
>  		list_del(&page->lru);
>  		pcp->count--;
> +	} else if (unlikely(isolated_freepage)) {
> +		page = isolated_freepage;
> +		local_irq_save(flags);
>  	} else {
>  		if (unlikely(gfp_flags & __GFP_NOFAIL)) {
>  			/*
> @@ -1588,7 +1616,9 @@ again:
>  			WARN_ON_ONCE(order > 1);
>  		}
>  		spin_lock_irqsave(&zone->lock, flags);
> +
>  		page = __rmqueue(zone, order, migratetype);
> +
>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;
> @@ -1916,7 +1946,8 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  static struct page *
>  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> -		struct zone *preferred_zone, int classzone_idx, int migratetype)
> +		struct zone *preferred_zone, int classzone_idx, int migratetype,
> +		struct page *isolated_freepage)
>  {
>  	struct zoneref *z;
>  	struct page *page = NULL;
> @@ -1927,6 +1958,13 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
>  				(gfp_mask & __GFP_WRITE);
>  
> +	if (isolated_freepage) {
> +		zone = page_zone(isolated_freepage);
> +		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask,
> +						migratetype, isolated_freepage);
> +		goto got_page;
> +	}
> +
>  zonelist_scan:
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> @@ -2051,7 +2089,7 @@ zonelist_scan:
>  
>  try_this_zone:
>  		page = buffered_rmqueue(preferred_zone, zone, order,
> -						gfp_mask, migratetype);
> +						gfp_mask, migratetype, NULL);
>  		if (page)
>  			break;
>  this_zone_full:
> @@ -2065,6 +2103,7 @@ this_zone_full:
>  		goto zonelist_scan;
>  	}
>  
> +got_page:
>  	if (page)
>  		/*
>  		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> @@ -2202,7 +2241,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
>  		order, zonelist, high_zoneidx,
>  		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
> -		preferred_zone, classzone_idx, migratetype);
> +		preferred_zone, classzone_idx, migratetype, NULL);
>  	if (page)
>  		goto out;
>  
> @@ -2241,6 +2280,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	bool *contended_compaction, bool *deferred_compaction,
>  	unsigned long *did_some_progress)
>  {
> +	struct page *captured_page;
> +
>  	if (!order)
>  		return NULL;
>  
> @@ -2252,7 +2293,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	current->flags |= PF_MEMALLOC;
>  	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
>  						nodemask, mode,
> -						contended_compaction);
> +						contended_compaction,
> +						&captured_page);
>  	current->flags &= ~PF_MEMALLOC;
>  
>  	if (*did_some_progress != COMPACT_SKIPPED) {
> @@ -2265,7 +2307,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		page = get_page_from_freelist(gfp_mask, nodemask,
>  				order, zonelist, high_zoneidx,
>  				alloc_flags & ~ALLOC_NO_WATERMARKS,
> -				preferred_zone, classzone_idx, migratetype);
> +				preferred_zone, classzone_idx, migratetype,
> +				captured_page);
>  		if (page) {
>  			preferred_zone->compact_blockskip_flush = false;
>  			compaction_defer_reset(preferred_zone, order, true);
> @@ -2357,7 +2400,7 @@ retry:
>  					zonelist, high_zoneidx,
>  					alloc_flags & ~ALLOC_NO_WATERMARKS,
>  					preferred_zone, classzone_idx,
> -					migratetype);
> +					migratetype, NULL);
>  
>  	/*
>  	 * If an allocation failed after direct reclaim, it could be because
> @@ -2387,7 +2430,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	do {
>  		page = get_page_from_freelist(gfp_mask, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> -			preferred_zone, classzone_idx, migratetype);
> +			preferred_zone, classzone_idx, migratetype, NULL);
>  
>  		if (!page && gfp_mask & __GFP_NOFAIL)
>  			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
> @@ -2548,7 +2591,7 @@ rebalance:
>  	/* This is the last chance, in general, before the goto nopage. */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> -			preferred_zone, classzone_idx, migratetype);
> +			preferred_zone, classzone_idx, migratetype, NULL);
>  	if (page)
>  		goto got_pg;
>  
> @@ -2757,7 +2800,7 @@ retry:
>  	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, alloc_flags,
> -			preferred_zone, classzone_idx, migratetype);
> +			preferred_zone, classzone_idx, migratetype, NULL);
>  	if (unlikely(!page)) {
>  		/*
>  		 * The first pass makes sure allocations are spread
> -- 
> 1.8.4.5
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 88+ messages in thread

>  	struct page *page = NULL;
> @@ -1927,6 +1958,13 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
>  				(gfp_mask & __GFP_WRITE);
>  
> +	if (isolated_freepage) {
> +		zone = page_zone(isolated_freepage);
> +		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask,
> +						migratetype, isolated_freepage);
> +		goto got_page;
> +	}
> +
>  zonelist_scan:
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> @@ -2051,7 +2089,7 @@ zonelist_scan:
>  
>  try_this_zone:
>  		page = buffered_rmqueue(preferred_zone, zone, order,
> -						gfp_mask, migratetype);
> +						gfp_mask, migratetype, NULL);
>  		if (page)
>  			break;
>  this_zone_full:
> @@ -2065,6 +2103,7 @@ this_zone_full:
>  		goto zonelist_scan;
>  	}
>  
> +got_page:
>  	if (page)
>  		/*
>  		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
> @@ -2202,7 +2241,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
>  		order, zonelist, high_zoneidx,
>  		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
> -		preferred_zone, classzone_idx, migratetype);
> +		preferred_zone, classzone_idx, migratetype, NULL);
>  	if (page)
>  		goto out;
>  
> @@ -2241,6 +2280,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	bool *contended_compaction, bool *deferred_compaction,
>  	unsigned long *did_some_progress)
>  {
> +	struct page *captured_page;
> +
>  	if (!order)
>  		return NULL;
>  
> @@ -2252,7 +2293,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	current->flags |= PF_MEMALLOC;
>  	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
>  						nodemask, mode,
> -						contended_compaction);
> +						contended_compaction,
> +						&captured_page);
>  	current->flags &= ~PF_MEMALLOC;
>  
>  	if (*did_some_progress != COMPACT_SKIPPED) {
> @@ -2265,7 +2307,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		page = get_page_from_freelist(gfp_mask, nodemask,
>  				order, zonelist, high_zoneidx,
>  				alloc_flags & ~ALLOC_NO_WATERMARKS,
> -				preferred_zone, classzone_idx, migratetype);
> +				preferred_zone, classzone_idx, migratetype,
> +				captured_page);
>  		if (page) {
>  			preferred_zone->compact_blockskip_flush = false;
>  			compaction_defer_reset(preferred_zone, order, true);
> @@ -2357,7 +2400,7 @@ retry:
>  					zonelist, high_zoneidx,
>  					alloc_flags & ~ALLOC_NO_WATERMARKS,
>  					preferred_zone, classzone_idx,
> -					migratetype);
> +					migratetype, NULL);
>  
>  	/*
>  	 * If an allocation failed after direct reclaim, it could be because
> @@ -2387,7 +2430,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	do {
>  		page = get_page_from_freelist(gfp_mask, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
> -			preferred_zone, classzone_idx, migratetype);
> +			preferred_zone, classzone_idx, migratetype, NULL);
>  
>  		if (!page && gfp_mask & __GFP_NOFAIL)
>  			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
> @@ -2548,7 +2591,7 @@ rebalance:
>  	/* This is the last chance, in general, before the goto nopage. */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> -			preferred_zone, classzone_idx, migratetype);
> +			preferred_zone, classzone_idx, migratetype, NULL);
>  	if (page)
>  		goto got_pg;
>  
> @@ -2757,7 +2800,7 @@ retry:
>  	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, alloc_flags,
> -			preferred_zone, classzone_idx, migratetype);
> +			preferred_zone, classzone_idx, migratetype, NULL);
>  	if (unlikely(!page)) {
>  		/*
>  		 * The first pass makes sure allocations are spread
> -- 
> 1.8.4.5
> 
> 
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [RFC PATCH 09/10] mm, compaction: try to capture the just-created high-order freepage
  2014-06-12  2:20       ` Minchan Kim
@ 2014-06-12  8:21         ` Vlastimil Babka
  0 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-12  8:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/12/2014 04:20 AM, Minchan Kim wrote:
> On Wed, Jun 11, 2014 at 04:56:49PM +0200, Vlastimil Babka wrote:
>> On 06/09/2014 11:26 AM, Vlastimil Babka wrote:
>>> Compaction uses watermark checking to determine if it succeeded in creating
>>> a high-order free page. My testing has shown that this is quite racy and it
>>> can happen that watermark checking in compaction succeeds, and moments later
>>> the watermark checking in page allocation fails, even though the number of
>>> free pages has increased meanwhile.
>>>
>>> It should be more reliable if direct compaction captured the high-order free
>>> page as soon as it detects it, and pass it back to allocation. This would
>>> also reduce the window for somebody else to allocate the free page.
>>>
>>> This has been already implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
>>> suitable high-order page immediately when it is made available"), but later
>>> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
>>> high-order page") due to flaws.
>>>
>>> This patch differs from the previous attempt in two aspects:
>>>
>>> 1) The previous patch scanned free lists to capture the page. In this patch,
>>>      only the cc->order aligned block that the migration scanner just finished
>>>      is considered, but only if pages were actually isolated for migration in
>>>      that block. Tracking cc->order aligned blocks also has benefits for the
>>>      following patch that skips blocks where non-migratable pages were found.
>>>
>
> Generally I like this.

Thanks.

>>> 2) In this patch, the isolated free page is allocated through extending
>>>      get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
>>>      all operations such as prep_new_page() and page->pfmemalloc setting that
>>>      was missing in the previous attempt, zone statistics are updated etc.
>>>
>
> But this part is a problem.
> Capturing is not common, yet you are adding more overhead to the hot path for rare cases
> that are OK to fail anyway, so it's not a good deal.
> In that case, we have no choice but to do the things you mentioned (e.g. statistics,
> prep_new_page, pfmemalloc) manually in __alloc_pages_direct_compact.

OK, I will try.

>>> Evaluation is pending.
>>
>> Uh, so if anyone wants to test it, here's a fixed version, as initial evaluation
>> showed it does not actually capture anything (which should not affect patch 10/10
>> though) and debugging this took a while.
>>
>> - for pageblock_order (i.e. THP), capture was never attempted, as the for cycle
>>    in isolate_migratepages_range() has ended right before the
>>    low_pfn == next_capture_pfn check
>> - lru_add_drain() has to be done before pcplists drain. This made a big difference
>>    (~50 successful captures -> ~1300 successful captures)
>>    Note that __alloc_pages_direct_compact() is missing lru_add_drain() as well, and
>>    all the existing watermark-based compaction termination decisions (which happen
>>    before the drain in __alloc_pages_direct_compact()) don't do any draining at all.
>>
>> -----8<-----
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Date: Wed, 28 May 2014 17:05:18 +0200
>> Subject: [PATCH fixed 09/10] mm, compaction: try to capture the just-created
>>   high-order freepage
>>
>> Compaction uses watermark checking to determine if it succeeded in creating
>> a high-order free page. My testing has shown that this is quite racy and it
>> can happen that watermark checking in compaction succeeds, and moments later
>> the watermark checking in page allocation fails, even though the number of
>> free pages has increased meanwhile.
>>
>> It should be more reliable if direct compaction captured the high-order free
>> page as soon as it detects it, and pass it back to allocation. This would
>> also reduce the window for somebody else to allocate the free page.
>>
>> This has been already implemented by 1fb3f8ca0e92 ("mm: compaction: capture a
>> suitable high-order page immediately when it is made available"), but later
>> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
>> high-order page") due to flaws.
>>
>> This patch differs from the previous attempt in two aspects:
>>
>> 1) The previous patch scanned free lists to capture the page. In this patch,
>>     only the cc->order aligned block that the migration scanner just finished
>>     is considered, but only if pages were actually isolated for migration in
>>     that block. Tracking cc->order aligned blocks also has benefits for the
>>     following patch that skips blocks where non-migratable pages were found.
>>
>> 2) In this patch, the isolated free page is allocated through extending
>>     get_page_from_freelist() and buffered_rmqueue(). This ensures that it gets
>>     all operations such as prep_new_page() and page->pfmemalloc setting that
>>     was missing in the previous attempt, zone statistics are updated etc.
>>
>> Evaluation is pending.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: David Rientjes <rientjes@google.com>
>> ---
>>   include/linux/compaction.h |   5 ++-
>>   mm/compaction.c            | 103 +++++++++++++++++++++++++++++++++++++++++++--
>>   mm/internal.h              |   2 +
>>   mm/page_alloc.c            |  69 ++++++++++++++++++++++++------
>>   4 files changed, 161 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>> index 01e3132..69579f5 100644
>> --- a/include/linux/compaction.h
>> +++ b/include/linux/compaction.h
>> @@ -10,6 +10,8 @@
>>   #define COMPACT_PARTIAL		2
>>   /* The full zone was compacted */
>>   #define COMPACT_COMPLETE	3
>> +/* Captured a high-order free page in direct compaction */
>> +#define COMPACT_CAPTURED	4
>>
>>   #ifdef CONFIG_COMPACTION
>>   extern int sysctl_compact_memory;
>> @@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
>>   extern int fragmentation_index(struct zone *zone, unsigned int order);
>>   extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>>   			int order, gfp_t gfp_mask, nodemask_t *mask,
>> -			enum migrate_mode mode, bool *contended);
>> +			enum migrate_mode mode, bool *contended,
>> +			struct page **captured_page);
>>   extern void compact_pgdat(pg_data_t *pgdat, int order);
>>   extern void reset_isolation_suitable(pg_data_t *pgdat);
>>   extern unsigned long compaction_suitable(struct zone *zone, int order);
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index d1e30ba..2988758 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -541,6 +541,16 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>>   	const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
>>   					ISOLATE_ASYNC_MIGRATE : 0) |
>>   				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
>> +	unsigned long capture_pfn = 0;   /* current candidate for capturing */
>> +	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
>> +
>> +	if (cc->order > PAGE_ALLOC_COSTLY_ORDER
>> +		&& gfpflags_to_migratetype(cc->gfp_mask) == MIGRATE_MOVABLE
>> +			&& cc->order <= pageblock_order) {
>
> You sent this with an RFC mark, so I will not review the details, just the design.
>
> Why does capture only work for a limited range of high orders?

I thought the overhead of maintaining the pfns and trying the capture 
would be a bad tradeoff for low-order compactions, which I suppose have a 
good chance of succeeding even without capture. But I admit I don't have 
data to support this yet.

> Direct compaction is a really costly operation for the process, and it is done
> at the cost of the process's own resources (i.e., its timeslice), so anyone who does
> direct compaction deserves a precious result, regardless of order.
> Another question: why couldn't capture also work for MIGRATE_CMA?

CMA allocations don't go through standard direct compaction. They also 
use memory isolation to prevent parallel activity from stealing the 
pages freed by compaction. And importantly, they set cc->order = -1, as 
the goal is not to create a single high-order page, but to free an 
arbitrarily long range.
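
A minimal sketch of why the capture path is naturally a no-op for CMA, assuming 
the alloc_contig_range() setup of this era together with the gate quoted from the 
RFC patch; the field list is abbreviated and illustrative, not a verbatim copy of 
mm/page_alloc.c:

	/*
	 * Sketch: alloc_contig_range()-style setup. With .order == -1 the
	 * "cc->order > PAGE_ALLOC_COSTLY_ORDER" gate in the quoted
	 * isolate_migratepages_range() hunk is never true, so capture_pfn
	 * and next_capture_pfn stay 0 and compact_capture_page() never runs.
	 */
	struct compact_control cc = {
		.nr_migratepages = 0,
		.order = -1,		/* whole range, no single target order */
		.zone = page_zone(pfn_to_page(start)),
		.mode = MIGRATE_SYNC,
		.ignore_skip_hint = true,
		/* other fields left zeroed */
	};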


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-11 23:49         ` Minchan Kim
@ 2014-06-12 14:02           ` Vlastimil Babka
  0 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-12 14:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/12/2014 01:49 AM, Minchan Kim wrote:
> On Wed, Jun 11, 2014 at 02:22:30PM +0200, Vlastimil Babka wrote:
>> On 06/11/2014 03:10 AM, Minchan Kim wrote:
>>> On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
>>>> Async compaction aborts when it detects zone lock contention or need_resched()
>>>> is true. David Rientjes has reported that in practice, most direct async
>>>> compactions for THP allocation abort due to need_resched(). This means that a
>>>> second direct compaction is never attempted, which might be OK for a page
>>>> fault, but hugepaged is intended to attempt a sync compaction in such case and
>>>> in these cases it won't.
>>>>
>>>> This patch replaces "bool contended" in compact_control with an enum that
>>>> distinguishes between aborting due to need_resched() and aborting due to lock
>>>> contention. This allows propagating the abort through all compaction functions
>>>> as before, but declaring the direct compaction as contended only when lock
>>>> contention has been detected.
>>>>
>>>> As a result, hugepaged will proceed with second sync compaction as intended,
>>>> when the preceding async compaction aborted due to need_resched().
>>>
>>> You said "second direct compaction is never attempted, which might be OK
>>> for a page fault" and said "hugepagd is intented to attempt a sync compaction"
>>> so I feel you want to handle khugepaged so special unlike other direct compact
>>> (ex, page fault).
>>
>> Well khugepaged is my primary concern, but I imagine there are other
>> direct compaction users besides THP page fault and khugepaged.
>>
>>> With this patch, direct compaction takes care of only lock contention, not rescheduling,
>>> which raises some questions.
>>>
>>> Is it really okay not to consider need_resched() in direct compaction?
>>
>> It still considers need_resched() to back off from async compaction.
>> It's only about signaling contended_compaction back to
>> __alloc_pages_slowpath(). There's this code executed after the
>> first, async compaction fails:
>>
>> /*
>>   * It can become very expensive to allocate transparent hugepages at
>>   * fault, so use asynchronous memory compaction for THP unless it is
>>   * khugepaged trying to collapse.
>>   */
>> if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
>>          migration_mode = MIGRATE_SYNC_LIGHT;
>>
>> /*
>>   * If compaction is deferred for high-order allocations, it is because
>>   * sync compaction recently failed. In this is the case and the caller
>>   * requested a movable allocation that does not heavily disrupt the
>>   * system then fail the allocation instead of entering direct reclaim.
>>   */
>> if ((deferred_compaction || contended_compaction) &&
>>                                          (gfp_mask & __GFP_NO_KSWAPD))
>>          goto nopage;
>>
>> Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first
>> if() decides whether the second attempt will be sync (for
>> khugepaged) or async (page fault). The second if() decides that if
>> compaction was contended, then there won't be any second attempt
>> (and reclaim) at all. Counting need_resched() as contended in this
>> case is bad for khugepaged. Even for page fault it means no direct
>
> I agree khugepaged shouldn't count on need_resched, even lock contention
> because it was a result from admin's decision.
> If it hurts system performance, he should adjust knobs for khugepaged.
>
>> reclaim and a second async compaction. David says need_resched()
>> occurs so often then it is a poor heuristic to decide this.
>
> But page fault is a bit different. Inherently, high-order allocation
> (i.e., above PAGE_ALLOC_COSTLY_ORDER) is fragile, so every caller
> should keep that in mind and prepare a second plan (e.g., a 4K allocation),
> so direct reclaim/compaction should take care of latency rather than
> success ratio.

Yes, it's a rather delicate balance. But the plan is now to try to balance 
this differently than by using need_resched().

> If need_resched() is almost always true in the second attempt (i.e., synchronous
> compaction), it means the process has consumed its timeslice, so it shouldn't be
> greedy and should give the CPU to others.

Synchronous compaction uses cond_resched() so that's fine I think?

> I don't mean we should abort, but the process could sleep and retry.
> The point is that we should give the latency pain to the process requesting the
> high-order allocation, not to another random process.

So basically you are saying that there should be cond_resched() also for 
async compaction when need_resched() is true? Now need_resched() is a 
trigger to back off rather quickly all the way back to 
__alloc_pages_direct_compact() which does contain a cond_resched(). So 
there should be a yield before retry. Or are you worried that the back 
off is not quick enough and it should cond_resched() immediately?
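
A minimal sketch of the variant being discussed here (not part of this series):
an immediate yield in the async back-off, shown on top of the compact_should_abort()
from patch 02; the non-async branch below is the existing cond_resched() referred
to above:

static inline bool compact_should_abort(struct compact_control *cc)
{
	/* async compaction aborts if contended */
	if (need_resched()) {
		if (cc->mode == MIGRATE_ASYNC) {
			cc->contended = COMPACT_CONTENDED_SCHED;
			/*
			 * Hypothetical change under discussion: yield right
			 * here instead of only after unwinding back to
			 * __alloc_pages_direct_compact().
			 */
			cond_resched();
			return true;
		}

		/* sync compaction just yields and continues */
		cond_resched();
	}

	return false;
}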

> IMHO, if we want to increase high-order alloc ratio in page fault,
> kswapd should be more aggressive than now via feedback loop from
> fail rate from direct compaction.

Recently I think we have been rather decreasing high-order alloc ratio 
in page fault :) But (at least for the THP) page fault allocation 
attempts contain __GFP_NO_KSWAPD, so there's no feedback loop. I guess 
changing that would be rather disruptive.
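
For reference, the relevant gfp mask of this era (roughly as in include/linux/gfp.h
around 3.15, shown here only for illustration): both the THP fault path and khugepaged
allocate with GFP_TRANSHUGE, which includes __GFP_NO_KSWAPD, and khugepaged is then
told apart only by PF_KTHREAD in the hunk quoted above:

#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
			 __GFP_NO_KSWAPD)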

>>
>>> We have taken care of it in direct reclaim path so why direct compaction is
>>> so special?
>>
>> I admit I'm not that familiar with reclaim but I didn't quickly find
>> any need_resched() there? There's plenty of cond_resched() but that
>> doesn't mean it will abort? Could you explain for me?
>
> I meant cond_resched.
>
>>
>>> Why does khugepaged give up easily if lock contention/need_resched happens?
>>> khugepaged is important for success ratio as I read your description so IMO,
>>> khugepaged should do synchronously without considering early bail out by
>>> lock/rescheduling.
>>
>> Well a stupid answer is that's how __alloc_pages_slowpath() works :)
>> I don't think it's bad to try using first a more lightweight
>> approach before trying the heavyweight one. As long as the
>> heavyweight one is not skipped for khugepaged.
>
> I'm not saying the current two-stage approach is bad. My stance is that we should
> take care of need_resched() and shouldn't become greedy, but khugepaged would
> be okay.
>
>>
>>> If it causes problems, the user should increase scan_sleep_millisecs/alloc_sleep_millisecs,
>>> which are exactly the knobs for such cases.
>>>
>>> So, my point is: how about making khugepaged always do dumb synchronous
>>> compaction through PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
>>>
>>>>
>>>> Reported-by: David Rientjes <rientjes@google.com>
>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>> Cc: Michal Nazarewicz <mina86@mina86.com>
>>>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>> Cc: Christoph Lameter <cl@linux.com>
>>>> Cc: Rik van Riel <riel@redhat.com>
>>>> ---
>>>>   mm/compaction.c | 20 ++++++++++++++------
>>>>   mm/internal.h   | 15 +++++++++++----
>>>>   2 files changed, 25 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index b73b182..d37f4a8 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>>>>   }
>>>>   #endif /* CONFIG_COMPACTION */
>>>>
>>>> -static inline bool should_release_lock(spinlock_t *lock)
>>>> +enum compact_contended should_release_lock(spinlock_t *lock)
>>>>   {
>>>> -	return need_resched() || spin_is_contended(lock);
>>>> +	if (need_resched())
>>>> +		return COMPACT_CONTENDED_SCHED;
>>>> +	else if (spin_is_contended(lock))
>>>> +		return COMPACT_CONTENDED_LOCK;
>>>> +	else
>>>> +		return COMPACT_CONTENDED_NONE;
>>>>   }
>>>>
>>>>   /*
>>>> @@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
>>>>   static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>>>   				      bool locked, struct compact_control *cc)
>>>>   {
>>>> -	if (should_release_lock(lock)) {
>>>> +	enum compact_contended contended = should_release_lock(lock);
>>>> +
>>>> +	if (contended) {
>>>>   		if (locked) {
>>>>   			spin_unlock_irqrestore(lock, *flags);
>>>>   			locked = false;
>>>> @@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
>>>>
>>>>   		/* async aborts if taking too long or contended */
>>>>   		if (cc->mode == MIGRATE_ASYNC) {
>>>> -			cc->contended = true;
>>>> +			cc->contended = contended;
>>>>   			return false;
>>>>   		}
>>>>
>>>> @@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
>>>>   	/* async compaction aborts if contended */
>>>>   	if (need_resched()) {
>>>>   		if (cc->mode == MIGRATE_ASYNC) {
>>>> -			cc->contended = true;
>>>> +			cc->contended = COMPACT_CONTENDED_SCHED;
>>>>   			return true;
>>>>   		}
>>>>
>>>> @@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
>>>>   	VM_BUG_ON(!list_empty(&cc.freepages));
>>>>   	VM_BUG_ON(!list_empty(&cc.migratepages));
>>>>
>>>> -	*contended = cc.contended;
>>>> +	/* We only signal lock contention back to the allocator */
>>>> +	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
>>>>   	return ret;
>>>>   }
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 7f22a11f..4659e8e 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>>>
>>>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>>>
>>>> +/* Used to signal whether compaction detected need_sched() or lock contention */
>>>> +enum compact_contended {
>>>> +	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>>>> +	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
>>>> +	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
>>>> +};
>>>> +
>>>>   /*
>>>>    * in mm/compaction.c
>>>>    */
>>>> @@ -144,10 +151,10 @@ struct compact_control {
>>>>   	int order;			/* order a direct compactor needs */
>>>>   	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>>>>   	struct zone *zone;
>>>> -	bool contended;			/* True if a lock was contended, or
>>>> -					 * need_resched() true during async
>>>> -					 * compaction
>>>> -					 */
>>>> +	enum compact_contended contended; /* Signal need_sched() or lock
>>>> +					   * contention detected during
>>>> +					   * compaction
>>>> +					   */
>>>>   };
>>>>
>>>>   unsigned long
>>>> --
>>>> 1.8.4.5
>>>>
>>>
>>
>


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-12 14:02           ` Vlastimil Babka
@ 2014-06-13  2:40             ` Minchan Kim
  0 siblings, 0 replies; 88+ messages in thread
From: Minchan Kim @ 2014-06-13  2:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On Thu, Jun 12, 2014 at 04:02:04PM +0200, Vlastimil Babka wrote:
> On 06/12/2014 01:49 AM, Minchan Kim wrote:
> >On Wed, Jun 11, 2014 at 02:22:30PM +0200, Vlastimil Babka wrote:
> >>On 06/11/2014 03:10 AM, Minchan Kim wrote:
> >>>On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
> >>>>Async compaction aborts when it detects zone lock contention or need_resched()
> >>>>is true. David Rientjes has reported that in practice, most direct async
> >>>>compactions for THP allocation abort due to need_resched(). This means that a
> >>>>second direct compaction is never attempted, which might be OK for a page
> >>>>fault, but hugepaged is intended to attempt a sync compaction in such case and
> >>>>in these cases it won't.
> >>>>
> >>>>This patch replaces "bool contended" in compact_control with an enum that
> >>>>distinguishes between aborting due to need_resched() and aborting due to lock
> >>>>contention. This allows propagating the abort through all compaction functions
> >>>>as before, but declaring the direct compaction as contended only when lock
> >>>>contention has been detected.
> >>>>
> >>>>As a result, hugepaged will proceed with second sync compaction as intended,
> >>>>when the preceding async compaction aborted due to need_resched().
> >>>
> >>>You said "second direct compaction is never attempted, which might be OK
> >>>for a page fault" and said "hugepagd is intented to attempt a sync compaction"
> >>>so I feel you want to handle khugepaged so special unlike other direct compact
> >>>(ex, page fault).
> >>
> >>Well khugepaged is my primary concern, but I imagine there are other
> >>direct compaction users besides THP page fault and khugepaged.
> >>
> >>>With this patch, direct compaction takes care of only lock contention, not rescheduling,
> >>>which raises some questions.
> >>>
> >>>Is it really okay not to consider need_resched() in direct compaction?
> >>
> >>It still considers need_resched() to back off from async compaction.
> >>It's only about signaling contended_compaction back to
> >>__alloc_pages_slowpath(). There's this code executed after the
> >>first, async compaction fails:
> >>
> >>/*
> >>  * It can become very expensive to allocate transparent hugepages at
> >>  * fault, so use asynchronous memory compaction for THP unless it is
> >>  * khugepaged trying to collapse.
> >>  */
> >>if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
> >>         migration_mode = MIGRATE_SYNC_LIGHT;
> >>
> >>/*
> >>  * If compaction is deferred for high-order allocations, it is because
> >>  * sync compaction recently failed. In this is the case and the caller
> >>  * requested a movable allocation that does not heavily disrupt the
> >>  * system then fail the allocation instead of entering direct reclaim.
> >>  */
> >>if ((deferred_compaction || contended_compaction) &&
> >>                                         (gfp_mask & __GFP_NO_KSWAPD))
> >>         goto nopage;
> >>
> >>Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first
> >>if() decides whether the second attempt will be sync (for
> >>khugepaged) or async (page fault). The second if() decides that if
> >>compaction was contended, then there won't be any second attempt
> >>(and reclaim) at all. Counting need_resched() as contended in this
> >>case is bad for khugepaged. Even for page fault it means no direct
> >
> >I agree khugepaged shouldn't count on need_resched, even lock contention
> >because it was a result from admin's decision.
> >If it hurts system performance, he should adjust knobs for khugepaged.
> >
> >>reclaim and a second async compaction. David says need_resched()
> >>occurs so often then it is a poor heuristic to decide this.
> >
> >But page fault is a bit different. Inherently, high-order allocation
> >(i.e., above PAGE_ALLOC_COSTLY_ORDER) is fragile, so every caller
> >should keep that in mind and prepare a second plan (e.g., a 4K allocation),
> >so direct reclaim/compaction should take care of latency rather than
> >success ratio.
> 
> Yes, it's a rather delicate balance. But the plan is now to try to
> balance this differently than by using need_resched().
> 
> >If need_resched() is almost always true in the second attempt (i.e., synchronous
> >compaction), it means the process has consumed its timeslice, so it shouldn't be
> >greedy and should give the CPU to others.
> 
> Synchronous compaction uses cond_resched() so that's fine I think?

Sorry for not being clear. I'm posting this clarification before taking
a holiday break. :)

When a THP page fault occurs and rescheduling is detected while doing async
direct compaction, it goes to "nopage" and falls back to 4K pages.
That's fine with me.

Another topic: I couldn't find any cond_resched() there. Anyway, it could be
another patch.

>From a4b7c288d8de670adbc45c85991ed3bef31e4e16 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 13 Jun 2014 10:59:26 +0900
Subject: [PATCH] mm: call cond_resched right before failing compaction

David reported that in many cases direct compaction for a THP page fault
fails because the async compaction is aborted by need_resched().
That is okay because THP can fall back to 4K pages, but the problem
is that if need_resched() is true, we should give the next process a chance
to schedule in, for latency's sake, so that we are not greedy any more.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f59fa2..1ac5133 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,8 +2617,16 @@ rebalance:
 	 * system then fail the allocation instead of entering direct reclaim.
 	 */
 	if ((deferred_compaction || contended_compaction) &&
-						(gfp_mask & __GFP_NO_KSWAPD))
+						(gfp_mask & __GFP_NO_KSWAPD)) {
+		/*
+		 * When THP page fault occurs in large memory system,
+		 * contended_compaction is likely to be true by need_resched
+		 * checking so let's schedule right before returning NULL page.
+		 * That makes I'm not greedy!
+		 */
+		cond_resched();
 		goto nopage;
+	}
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
-- 
2.0.0

With your change (i.e., direct compaction is only aware of lock contention,
not need_resched), when a THP page fault occurs and rescheduling is detected
while doing async direct compaction, it goes down the *direct reclaim path*
and then async direct compaction again, and only then finally to "nopage",
rather than going straight to "nopage".
I think you are changing the behavior heavily and increasing latency,
which is not what the direct reclaim path wants, even though I have no data.

So, what I want is the following.
It is based on the previous inline patch.

---
 mm/page_alloc.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1ac5133..8a4480e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2624,8 +2624,17 @@ rebalance:
 		 * check, so let's schedule right before returning a NULL page.
 		 * That way we are not greedy!
 		 */
-		cond_resched();
-		goto nopage;
+		int ret = cond_resched();
+
+		/* On a THP page fault, we want to bail out for latency */
+		if (!(current->flags & PF_KTHREAD) || !ret)
+			goto nopage;
+
+		/*
+		 * We are khugepaged and have actually rescheduled, so try
+		 * synchronous compaction rather than giving up easily.
+		 */
+		WARN_ON(migration_mode == MIGRATE_ASYNC);
 	}
 
 	/* Try direct reclaim and then allocating */
-- 
2.0.0
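
Since cond_resched() returns nonzero only when it actually rescheduled, the
hunk above routes callers roughly like this (just spelling it out, no new
logic beyond the patch):

  caller        did cond_resched() yield?   result
  page fault    yes or no                   goto nopage (4K fallback)
  khugepaged    no                          goto nopage
  khugepaged    yes                         fall through to direct reclaim
                                            and sync(-light) compaction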

I'm off on holiday from now on. :)

> 
> >I don't mean we should abort but the process could sleep and retry.
> >The point is that we should give latency pain to the process request
> >high-order allocation, not another random process.
> 
> So basically you are saying that there should be cond_resched() also
> for async compaction when need_resched() is true? Now need_resched()
> is a trigger to back off rather quickly all the way back to
> __alloc_pages_direct_compact() which does contain a cond_resched().
> So there should be a yield before retry. Or are you worried that the
> back off is not quick enough and it should cond_resched()
> immediately?
> 
> >IMHO, if we want to increase high-order alloc ratio in page fault,
> >kswapd should be more aggressive than now via feedback loop from
> >fail rate from direct compaction.
> 
> Recently I think we have been rather decreasing high-order alloc
> ratio in page fault :) But (at least for the THP) page fault
> allocation attempts contain __GFP_NO_KSWAPD, so there's no feedback
> loop. I guess changing that would be rather disruptive.
> 
> >>
> >>>We have taken care of it in direct reclaim path so why direct compaction is
> >>>so special?
> >>
> >>I admit I'm not that familiar with reclaim but I didn't quickly find
> >>any need_resched() there? There's plenty of cond_resched() but that
> >>doesn't mean it will abort? Could you explain for me?
> >
> >I meant cond_resched.
> >
> >>
> >>>Why does khugepaged give up easily if lock contention/need_resched happens?
> >>>khugepaged is important for success ratio as I read your description so IMO,
> >>>khugepaged should do synchronously without considering early bail out by
> >>>lock/rescheduling.
> >>
> >>Well a stupid answer is that's how __alloc_pages_slowpath() works :)
> >>I don't think it's bad to try using first a more lightweight
> >>approach before trying the heavyweight one. As long as the
> >>heavyweight one is not skipped for khugepaged.
> >
> >I'm not saying current two-stage trying is bad. My stand is that we should
> >take care of need_resched and shouldn't become a greedy but khugepaged would
> >be okay.
> >
> >>
> >>>If it causes problems, user should increase scan_sleep_millisecs/alloc_sleep_millisecs,
> >>>which is exactly the knob for that cases.
> >>>
> >>>So, my point is how about making khugepaged doing always dumb synchronous
> >>>compaction thorough PG_KHUGEPAGED or GFP_SYNC_TRANSHUGE?
> >>>
> >>>>
> >>>>Reported-by: David Rientjes <rientjes@google.com>
> >>>>Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>>>Cc: Minchan Kim <minchan@kernel.org>
> >>>>Cc: Mel Gorman <mgorman@suse.de>
> >>>>Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>>>Cc: Michal Nazarewicz <mina86@mina86.com>
> >>>>Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >>>>Cc: Christoph Lameter <cl@linux.com>
> >>>>Cc: Rik van Riel <riel@redhat.com>
> >>>>---
> >>>>  mm/compaction.c | 20 ++++++++++++++------
> >>>>  mm/internal.h   | 15 +++++++++++----
> >>>>  2 files changed, 25 insertions(+), 10 deletions(-)
> >>>>
> >>>>diff --git a/mm/compaction.c b/mm/compaction.c
> >>>>index b73b182..d37f4a8 100644
> >>>>--- a/mm/compaction.c
> >>>>+++ b/mm/compaction.c
> >>>>@@ -185,9 +185,14 @@ static void update_pageblock_skip(struct compact_control *cc,
> >>>>  }
> >>>>  #endif /* CONFIG_COMPACTION */
> >>>>
> >>>>-static inline bool should_release_lock(spinlock_t *lock)
> >>>>+enum compact_contended should_release_lock(spinlock_t *lock)
> >>>>  {
> >>>>-	return need_resched() || spin_is_contended(lock);
> >>>>+	if (need_resched())
> >>>>+		return COMPACT_CONTENDED_SCHED;
> >>>>+	else if (spin_is_contended(lock))
> >>>>+		return COMPACT_CONTENDED_LOCK;
> >>>>+	else
> >>>>+		return COMPACT_CONTENDED_NONE;
> >>>>  }
> >>>>
> >>>>  /*
> >>>>@@ -202,7 +207,9 @@ static inline bool should_release_lock(spinlock_t *lock)
> >>>>  static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>>>  				      bool locked, struct compact_control *cc)
> >>>>  {
> >>>>-	if (should_release_lock(lock)) {
> >>>>+	enum compact_contended contended = should_release_lock(lock);
> >>>>+
> >>>>+	if (contended) {
> >>>>  		if (locked) {
> >>>>  			spin_unlock_irqrestore(lock, *flags);
> >>>>  			locked = false;
> >>>>@@ -210,7 +217,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
> >>>>
> >>>>  		/* async aborts if taking too long or contended */
> >>>>  		if (cc->mode == MIGRATE_ASYNC) {
> >>>>-			cc->contended = true;
> >>>>+			cc->contended = contended;
> >>>>  			return false;
> >>>>  		}
> >>>>
> >>>>@@ -236,7 +243,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
> >>>>  	/* async compaction aborts if contended */
> >>>>  	if (need_resched()) {
> >>>>  		if (cc->mode == MIGRATE_ASYNC) {
> >>>>-			cc->contended = true;
> >>>>+			cc->contended = COMPACT_CONTENDED_SCHED;
> >>>>  			return true;
> >>>>  		}
> >>>>
> >>>>@@ -1095,7 +1102,8 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
> >>>>  	VM_BUG_ON(!list_empty(&cc.freepages));
> >>>>  	VM_BUG_ON(!list_empty(&cc.migratepages));
> >>>>
> >>>>-	*contended = cc.contended;
> >>>>+	/* We only signal lock contention back to the allocator */
> >>>>+	*contended = cc.contended == COMPACT_CONTENDED_LOCK;
> >>>>  	return ret;
> >>>>  }
> >>>>
> >>>>diff --git a/mm/internal.h b/mm/internal.h
> >>>>index 7f22a11f..4659e8e 100644
> >>>>--- a/mm/internal.h
> >>>>+++ b/mm/internal.h
> >>>>@@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
> >>>>
> >>>>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> >>>>
> >>>>+/* Used to signal whether compaction detected need_sched() or lock contention */
> >>>>+enum compact_contended {
> >>>>+	COMPACT_CONTENDED_NONE = 0, /* no contention detected */
> >>>>+	COMPACT_CONTENDED_SCHED,    /* need_sched() was true */
> >>>>+	COMPACT_CONTENDED_LOCK,     /* zone lock or lru_lock was contended */
> >>>>+};
> >>>>+
> >>>>  /*
> >>>>   * in mm/compaction.c
> >>>>   */
> >>>>@@ -144,10 +151,10 @@ struct compact_control {
> >>>>  	int order;			/* order a direct compactor needs */
> >>>>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
> >>>>  	struct zone *zone;
> >>>>-	bool contended;			/* True if a lock was contended, or
> >>>>-					 * need_resched() true during async
> >>>>-					 * compaction
> >>>>-					 */
> >>>>+	enum compact_contended contended; /* Signal need_sched() or lock
> >>>>+					   * contention detected during
> >>>>+					   * compaction
> >>>>+					   */
> >>>>  };
> >>>>
> >>>>  unsigned long
> >>>>--
> >>>>1.8.4.5
> >>>>
> >>>
> >>
> >
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention
  2014-06-13  2:40             ` Minchan Kim
@ 2014-06-20 11:47               ` Vlastimil Babka
  -1 siblings, 0 replies; 88+ messages in thread
From: Vlastimil Babka @ 2014-06-20 11:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, linux-kernel, Andrew Morton,
	Greg Thelen, Mel Gorman, Joonsoo Kim, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel

On 06/13/2014 04:40 AM, Minchan Kim wrote:
> On Thu, Jun 12, 2014 at 04:02:04PM +0200, Vlastimil Babka wrote:
>> On 06/12/2014 01:49 AM, Minchan Kim wrote:
>> >On Wed, Jun 11, 2014 at 02:22:30PM +0200, Vlastimil Babka wrote:
>> >>On 06/11/2014 03:10 AM, Minchan Kim wrote:
>> >>>On Mon, Jun 09, 2014 at 11:26:14AM +0200, Vlastimil Babka wrote:
>> >>>>Async compaction aborts when it detects zone lock contention or need_resched()
>> >>>>is true. David Rientjes has reported that in practice, most direct async
>> >>>>compactions for THP allocation abort due to need_resched(). This means that a
>> >>>>second direct compaction is never attempted, which might be OK for a page
>> >>>>fault, but hugepaged is intended to attempt a sync compaction in such case and
>> >>>>in these cases it won't.
>> >>>>
>> >>>>This patch replaces "bool contended" in compact_control with an enum that
>> >>>>distinguieshes between aborting due to need_resched() and aborting due to lock
>> >>>>contention. This allows propagating the abort through all compaction functions
>> >>>>as before, but declaring the direct compaction as contended only when lock
>> >>>>contantion has been detected.
>> >>>>
>> >>>>As a result, hugepaged will proceed with second sync compaction as intended,
>> >>>>when the preceding async compaction aborted due to need_resched().
>> >>>
>> >>>You said "second direct compaction is never attempted, which might be OK
>> >>>for a page fault" and said "hugepagd is intented to attempt a sync compaction"
>> >>>so I feel you want to handle khugepaged so special unlike other direct compact
>> >>>(ex, page fault).
>> >>
>> >>Well khugepaged is my primary concern, but I imagine there are other
>> >>direct compaction users besides THP page fault and khugepaged.
>> >>
>> >>>By this patch, direct compaction take care only lock contention, not rescheduling
>> >>>so that pop questions.
>> >>>
>> >>>Is it okay not to consider need_resched in direct compaction really?
>> >>
>> >>It still considers need_resched() to back of from async compaction.
>> >>It's only about signaling contended_compaction back to
>> >>__alloc_pages_slowpath(). There's this code executed after the
>> >>first, async compaction fails:
>> >>
>> >>/*
>> >>  * It can become very expensive to allocate transparent hugepages at
>> >>  * fault, so use asynchronous memory compaction for THP unless it is
>> >>  * khugepaged trying to collapse.
>> >>  */
>> >>if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
>> >>         migration_mode = MIGRATE_SYNC_LIGHT;
>> >>
>> >>/*
>> >>  * If compaction is deferred for high-order allocations, it is because
>> >>  * sync compaction recently failed. In this is the case and the caller
>> >>  * requested a movable allocation that does not heavily disrupt the
>> >>  * system then fail the allocation instead of entering direct reclaim.
>> >>  */
>> >>if ((deferred_compaction || contended_compaction) &&
>> >>                                         (gfp_mask & __GFP_NO_KSWAPD))
>> >>         goto nopage;
>> >>
>> >>Both THP page fault and khugepaged use __GFP_NO_KSWAPD. The first
>> >>if() decides whether the second attempt will be sync (for
>> >>khugepaged) or async (page fault). The second if() decides that if
>> >>compaction was contended, then there won't be any second attempt
>> >>(and reclaim) at all. Counting need_resched() as contended in this
>> >>case is bad for khugepaged. Even for page fault it means no direct
>> >
>> >I agree khugepaged shouldn't count on need_resched, even lock contention
>> >because it was a result from admin's decision.
>> >If it hurts system performance, he should adjust knobs for khugepaged.
>> >
>> >>reclaim and a second async compaction. David says need_resched()
>> >>occurs so often then it is a poor heuristic to decide this.
>> >
>> >But page fault is a bit different. Inherently, high-order allocation
>> >(ie, above PAGE_ALLOC_COSTLY_ORDER) is fragile so all of the caller
>> >shoud keep in mind that and prepare second plan(ex, 4K allocation)
>> >so direct reclaim/compaction should take care of latency rather than
>> >success ratio.
>> 
>> Yes it's a rather delicate balance. But the plan is now to try
>> balance this differently than using need_resched.
>> 
>> >If need_resched in second attempt(ie, synchronous compaction) is almost
>> >true, it means the process consumed his timeslice so it shouldn't be
>> >greedy and gives a CPU resource to others.
>> 
>> Synchronous compaction uses cond_resched() so that's fine I think?
> 
> Sorry for not being clear. I'm posting this clarification before taking
> a rest for the holidays. :)
> 
> When a THP page fault occurs and rescheduling is needed while doing async
> direct compaction, it goes to "nopage" and falls back to a 4K page.
> That's fine with me.
> 
> Another topic: I couldn't find any cond_resched(). Anyway, that could be
> another patch.
> 

Thanks for the explanation. I'll include a cond_resched() at the level of
try_to_compact_pages(), where it fits better, so it's not necessary in the place you
suggested. That should solve the "don't be greedy" problem. I will not yet include
the "bail out for latency" part, because we are now slowly moving towards removing
need_resched() as a condition for stopping compaction, and that part would, on the
contrary, extend it to preventing direct reclaim as well. David's data suggests that
compaction often bails out due to need_resched(), so that part would also reduce the
amount of direct reclaim, and I don't want to touch that area in this series :)
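
For illustration, roughly what I have in mind -- a sketch only, the placement
is tentative and this is not the actual patch:

	/* in try_to_compact_pages(), before walking the zonelist */
	count_compact_event(COMPACTSTALL);

	/*
	 * A caller retrying after an async need_resched() back-off resumes
	 * here, so yield once instead of being greedy.
	 */
	cond_resched();

	/* Compact each zone in the zonelist */
	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
								nodemask) {
		int status = compact_zone_order(zone, order, gfp_mask,
						mode, contended);
		/* ... pick the best status and check watermarks as before ... */
	}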

^ permalink raw reply	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2014-06-20 11:47 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-09  9:26 [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock Vlastimil Babka
2014-06-09  9:26 ` Vlastimil Babka
2014-06-09  9:26 ` [PATCH 02/10] mm, compaction: report compaction as contended only due to lock contention Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-09 23:50   ` David Rientjes
2014-06-09 23:50     ` David Rientjes
2014-06-10  7:11     ` Vlastimil Babka
2014-06-10  7:11       ` Vlastimil Babka
2014-06-10 23:40       ` David Rientjes
2014-06-10 23:40         ` David Rientjes
2014-06-11  1:10   ` Minchan Kim
2014-06-11  1:10     ` Minchan Kim
2014-06-11 12:22     ` Vlastimil Babka
2014-06-11 12:22       ` Vlastimil Babka
2014-06-11 23:49       ` Minchan Kim
2014-06-11 23:49         ` Minchan Kim
2014-06-12 14:02         ` Vlastimil Babka
2014-06-12 14:02           ` Vlastimil Babka
2014-06-13  2:40           ` Minchan Kim
2014-06-13  2:40             ` Minchan Kim
2014-06-20 11:47             ` Vlastimil Babka
2014-06-20 11:47               ` Vlastimil Babka
2014-06-09  9:26 ` [PATCH 03/10] mm, compaction: periodically drop lock and restore IRQs in scanners Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-09 23:58   ` David Rientjes
2014-06-09 23:58     ` David Rientjes
2014-06-10  7:15     ` Vlastimil Babka
2014-06-10  7:15       ` Vlastimil Babka
2014-06-10 23:41       ` David Rientjes
2014-06-10 23:41         ` David Rientjes
2014-06-11  1:32   ` Minchan Kim
2014-06-11  1:32     ` Minchan Kim
2014-06-11 11:24     ` Vlastimil Babka
2014-06-11 11:24       ` Vlastimil Babka
2014-06-09  9:26 ` [PATCH 04/10] mm, compaction: skip rechecks when lock was already held Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-10  0:00   ` David Rientjes
2014-06-10  0:00     ` David Rientjes
2014-06-11  1:50   ` Minchan Kim
2014-06-11  1:50     ` Minchan Kim
2014-06-09  9:26 ` [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-10  0:07   ` David Rientjes
2014-06-10  0:07     ` David Rientjes
2014-06-11  2:12   ` Minchan Kim
2014-06-11  2:12     ` Minchan Kim
2014-06-11  8:16     ` Joonsoo Kim
2014-06-11  8:16       ` Joonsoo Kim
2014-06-11 11:41       ` Vlastimil Babka
2014-06-11 11:41         ` Vlastimil Babka
2014-06-11 11:33     ` Vlastimil Babka
2014-06-11 11:33       ` Vlastimil Babka
2014-06-11  3:29   ` Zhang Yanfei
2014-06-11  3:29     ` Zhang Yanfei
2014-06-09  9:26 ` [PATCH 06/10] mm, compaction: skip buddy pages by their order in the migrate scanner Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-10  0:08   ` David Rientjes
2014-06-10  0:08     ` David Rientjes
2014-06-09  9:26 ` [PATCH 07/10] mm: rename allocflags_to_migratetype for clarity Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-11  2:41   ` Minchan Kim
2014-06-11  2:41     ` Minchan Kim
2014-06-11  3:38     ` Zhang Yanfei
2014-06-11  3:38       ` Zhang Yanfei
2014-06-09  9:26 ` [PATCH 08/10] mm, compaction: pass gfp mask to compact_control Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-11  2:48   ` Minchan Kim
2014-06-11  2:48     ` Minchan Kim
2014-06-11 11:46     ` Vlastimil Babka
2014-06-11 11:46       ` Vlastimil Babka
2014-06-12  0:24       ` David Rientjes
2014-06-12  0:24         ` David Rientjes
2014-06-09  9:26 ` [RFC PATCH 09/10] mm, compaction: try to capture the just-created high-order freepage Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-11 14:56   ` Vlastimil Babka
2014-06-11 14:56     ` Vlastimil Babka
2014-06-12  2:20     ` Minchan Kim
2014-06-12  2:20       ` Minchan Kim
2014-06-12  8:21       ` Vlastimil Babka
2014-06-12  8:21         ` Vlastimil Babka
2014-06-09  9:26 ` [RFC PATCH 10/10] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation Vlastimil Babka
2014-06-09  9:26   ` Vlastimil Babka
2014-06-09 23:41 ` [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock David Rientjes
2014-06-09 23:41   ` David Rientjes
2014-06-11  0:33 ` Minchan Kim
2014-06-11  0:33   ` Minchan Kim
2014-06-11  2:45 ` Zhang Yanfei
2014-06-11  2:45   ` Zhang Yanfei
