linux-mm.kvack.org archive mirror
* [RFC 0/3] reduce THP fault thrashing
@ 2018-12-11 14:29 Vlastimil Babka
  2018-12-11 14:29 ` [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations Vlastimil Babka
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Vlastimil Babka @ 2018-12-11 14:29 UTC (permalink / raw)
  To: David Rientjes, Andrea Arcangeli, Mel Gorman
  Cc: Michal Hocko, Linus Torvalds, linux-mm, Andrew Morton, Vlastimil Babka

Hi,

this is my attempt at reducing the local node thrashing caused by the
reclaim+compaction attempts of madvised THP faults, which Andrea reported, by
trying to better utilize recent compaction results. It doesn't introduce any
new __GFP_ONLY_COMPACT flag or add order-specific decisions like Andrea's and
David's previous patches did, but it does add __GFP_NORETRY back to madvised
THP faults, as they both did (patch 1). Patch 2 is based on another suggestion
from Andrea, where any compaction failure (not just deferred compaction) is a
reason to not try further. Finally, patch 3 introduces deferred compaction
tracking for async mode, which is what THP faults use. Details are in the
respective patch changelogs.

I haven't tested it yet beyond running transhuge-stress and verifying via
tracepoints that deferred async compaction does happen. I hope all interested
parties can test the series on their workloads, thanks in advance. I expect
that THP fault success rates will be worse, but hopefully it will also fix
the local node thrashing issue. The success rates can then likely be improved
by making the compaction core smarter, but that's a separate topic.

The series is based on v4.20-rc6.

Vlastimil

Vlastimil Babka (3):
  mm, thp: restore __GFP_NORETRY for madvised thp fault allocations
  mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when
    compaction was skipped
  mm, compaction: introduce deferred async compaction

 include/linux/compaction.h        | 10 ++--
 include/linux/mmzone.h            |  6 +--
 include/trace/events/compaction.h | 29 ++++++-----
 mm/compaction.c                   | 80 ++++++++++++++++++-------------
 mm/huge_memory.c                  | 13 +++--
 mm/page_alloc.c                   | 14 +++---
 6 files changed, 84 insertions(+), 68 deletions(-)

-- 
2.19.2

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations
  2018-12-11 14:29 [RFC 0/3] reduce THP fault thrashing Vlastimil Babka
@ 2018-12-11 14:29 ` Vlastimil Babka
  2019-01-08 11:16   ` Mel Gorman
  2018-12-11 14:29 ` [RFC 2/3] mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when compaction was skipped Vlastimil Babka
  2018-12-11 14:29 ` [RFC 3/3] mm, compaction: introduce deferred async compaction Vlastimil Babka
  2 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2018-12-11 14:29 UTC (permalink / raw)
  To: David Rientjes, Andrea Arcangeli, Mel Gorman
  Cc: Michal Hocko, Linus Torvalds, linux-mm, Andrew Morton, Vlastimil Babka

Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more
successful for processes that indicate they are willing to pay a higher
initial setup cost for long-term THP benefits. In the current page allocator
implementation this means that the allocations will try reclaim and the more
costly sync compaction mode if the initial direct async compaction fails.

However, THP faults also include __GFP_THISNODE, which, combined with direct
reclaim, can result in a node-reclaim-like local node thrashing behavior, as
reported by Andrea [1].

While this patch is not a full fix, the first step is to restore __GFP_NORETRY
for madvised THP faults. The expected downside is potentially worse THP
fault success rates for the madvised areas, which will then have to rely more
on khugepaged. For khugepaged itself, __GFP_NORETRY is not restored, as its
activity should be limited enough by sleeping not to cause noticeable thrashing.

Note that alloc_new_node_page() and new_page() are probably further candidates,
as they handle the migrate_pages(2) and mbind(2) syscalls respectively, which
can thus allow unprivileged node-reclaim-like behavior.

The patch also updates the comments in alloc_hugepage_direct_gfpmask(), because
elsewhere compaction during a page fault is called direct compaction, and
'synchronous' refers to the migration mode, which is not used for THP faults.

[1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@redhat.com

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/huge_memory.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5da55b38b1b7..c442b12b060c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
-	/* Always do synchronous compaction */
+	/* Always try direct compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | __GFP_NORETRY;
 
 	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
-	/* Synchronous compaction if madvised, otherwise kick kcompactd */
+	/* Direct compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT |
-			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+			(vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY):
 					__GFP_KSWAPD_RECLAIM);
 
-	/* Only do synchronous compaction if madvised */
+	/* Only do direct compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT |
-		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+		return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT;
 
 	return GFP_TRANSHUGE_LIGHT;
 }
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC 2/3] mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when compaction was skipped
  2018-12-11 14:29 [RFC 0/3] reduce THP fault thrashing Vlastimil Babka
  2018-12-11 14:29 ` [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations Vlastimil Babka
@ 2018-12-11 14:29 ` Vlastimil Babka
  2019-01-08 11:25   ` Mel Gorman
  2018-12-11 14:29 ` [RFC 3/3] mm, compaction: introduce deferred async compaction Vlastimil Babka
  2 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2018-12-11 14:29 UTC (permalink / raw)
  To: David Rientjes, Andrea Arcangeli, Mel Gorman
  Cc: Michal Hocko, Linus Torvalds, linux-mm, Andrew Morton, Vlastimil Babka

For costly __GFP_NORETRY allocations (including THPs) we first do an initial
compaction attempt, and if that fails, we proceed with reclaim and another
round of compaction, unless compaction was deferred due to earlier multiple
failures. Andrea proposed [1] that we count all compaction failures as the
deferred case in try_to_compact_pages(), but I don't think that's a good idea
in general. Instead, change the __GFP_NORETRY specific condition so that it
only proceeds with further reclaim/compaction when the initial compaction
attempt was skipped due to lack of free base pages.

Note that the original condition probably never worked properly for THPs,
because compaction can only become deferred after a sync compaction failure,
and THPs only perform async compaction; the exceptions are khugepaged, which
is infrequent, and madvised faults (until the previous patch restored
__GFP_NORETRY for those), which are not the default case. Deferring due to
async compaction failures should, however, also be beneficial, and is thus
introduced in the next patch.

Also note that due to how try_to_compact_pages() constructs its return value
from compaction attempts across the whole zonelist, returning COMPACT_SKIPPED
means that compaction was skipped for *all* attempted zones/nodes, which means
all zones/nodes are low on memory at the same moment. This is probably rare,
which would mean that the resulting 'goto nopage' would be very common, just
because e.g. a single zone had enough memory and compaction failed there, while
the rest of the nodes could succeed after reclaim. However, since THP faults
use __GFP_THISNODE, compaction is also attempted only for a single node, so in
practice there should be no significant loss of information when constructing
the return value, nor a bias towards 'goto nopage' for THP faults.
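
As an illustration of that last point, here is a minimal user-space sketch of
how the return value is combined across zones; the enum values and the plain
array standing in for the zonelist walk are simplifications for this example,
not the kernel's actual definitions:

#include <stdio.h>

/* Simplified stand-ins for the kernel's compact_result values. */
enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_COMPLETE };

/*
 * try_to_compact_pages() keeps the "best" result seen across the zonelist,
 * so the combined value is COMPACT_SKIPPED only if every attempted zone
 * was skipped.
 */
static enum compact_result combine(const enum compact_result *per_zone, int nr)
{
	enum compact_result rc = COMPACT_SKIPPED;
	int i;

	for (i = 0; i < nr; i++)
		if (per_zone[i] > rc)
			rc = per_zone[i];
	return rc;
}

int main(void)
{
	/* One zone completed a full (failed) run, another was skipped. */
	enum compact_result zones[] = { COMPACT_COMPLETE, COMPACT_SKIPPED };

	/* Prints 2 (COMPACT_COMPLETE), i.e. not "skipped", so no reclaim+retry. */
	printf("%d\n", combine(zones, 2));
	return 0;
}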

[1] https://lkml.kernel.org/r/20181206005425.GB21159@redhat.com

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/page_alloc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..3d83a6093ada 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4129,14 +4129,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
 			/*
-			 * If compaction is deferred for high-order allocations,
-			 * it is because sync compaction recently failed. If
-			 * this is the case and the caller requested a THP
-			 * allocation, we do not want to heavily disrupt the
-			 * system, so we fail the allocation instead of entering
-			 * direct reclaim.
+			 * If compaction was skipped because of insufficient
+			 * free pages, proceed with reclaim and another
+			 * compaction attempt. If it failed for other reasons or
+			 * was deferred, do not reclaim and retry, as we do not
+			 * want to heavily disrupt the system for a costly
+			 * __GFP_NORETRY allocation such as THP.
 			 */
-			if (compact_result == COMPACT_DEFERRED)
+			if (compact_result != COMPACT_SKIPPED)
 				goto nopage;
 
 			/*
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC 3/3] mm, compaction: introduce deferred async compaction
  2018-12-11 14:29 [RFC 0/3] reduce THP fault thrashing Vlastimil Babka
  2018-12-11 14:29 ` [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations Vlastimil Babka
  2018-12-11 14:29 ` [RFC 2/3] mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when compaction was skipped Vlastimil Babka
@ 2018-12-11 14:29 ` Vlastimil Babka
  2019-01-08 11:28   ` Mel Gorman
  2 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2018-12-11 14:29 UTC (permalink / raw)
  To: David Rientjes, Andrea Arcangeli, Mel Gorman
  Cc: Michal Hocko, Linus Torvalds, linux-mm, Andrew Morton, Vlastimil Babka

Deferring compaction happens when it fails to fulfill the allocation request
at a given order; a number of the following direct compaction attempts for the
same or higher orders is then skipped, and with further failures that number
grows exponentially, up to 64. The deferral is reset e.g. when compaction
succeeds.
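
As a rough illustration of that backoff, here is a self-contained user-space
sketch; the struct and function names below are made up for this example and
simplify the per-zone bookkeeping the kernel actually uses:

#include <stdio.h>

#define COMPACT_MAX_DEFER_SHIFT	6	/* cap: at most 1 << 6 == 64 */

struct defer_state {
	unsigned int considered;	/* attempts seen since the last failure */
	unsigned int defer_shift;	/* grows by one per failure, capped above */
};

/* Called when compaction fails: double the backoff window (up to the cap). */
static void record_failure(struct defer_state *ds)
{
	ds->considered = 0;
	if (++ds->defer_shift > COMPACT_MAX_DEFER_SHIFT)
		ds->defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Called before each attempt: returns 1 if this attempt should be skipped. */
static int should_skip(struct defer_state *ds)
{
	unsigned long limit = 1UL << ds->defer_shift;

	if (++ds->considered > limit)
		ds->considered = limit;
	return ds->considered < limit;
}

int main(void)
{
	struct defer_state ds = { 0, 0 };
	int i, skipped = 0;

	record_failure(&ds);	/* after one failure, 1 attempt gets skipped */
	record_failure(&ds);	/* after two failures, 3 attempts get skipped */
	for (i = 0; i < 10; i++)
		skipped += should_skip(&ds);
	printf("skipped %d of 10 attempts\n", skipped);	/* prints: skipped 3 of 10 */
	return 0;
}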

Until now, deferring compaction has only been performed after a sync compaction
failure, and it then also blocks async compaction attempts. The rationale is
that only a failed sync compaction is expected to fully exhaust the compaction
potential of a zone. However, for THP page faults that use __GFP_NORETRY, this
means only async compaction is attempted and thus it is never deferred,
potentially resulting in pointless reclaim/compaction attempts in a badly
fragmented node.

This patch therefore additionally tracks and checks the deferred status of
async compaction, mostly separately from sync compaction. This allows deferring
THP fault compaction without affecting any sync pageblock-order compaction.
Deferring for sync compaction, however, implies deferring for async compaction
as well. When the deferred status is reset, it is reset for both modes.

The expected outcome is less compaction/reclaim activity for failing THP
faults, likely at some expense to THP fault success rates.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
---
 include/linux/compaction.h        | 10 ++--
 include/linux/mmzone.h            |  6 +--
 include/trace/events/compaction.h | 29 ++++++-----
 mm/compaction.c                   | 80 ++++++++++++++++++-------------
 4 files changed, 71 insertions(+), 54 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 68250a57aace..f1d4dc1deec9 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -100,11 +100,11 @@ extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
 		unsigned int alloc_flags, int classzone_idx);
 
-extern void defer_compaction(struct zone *zone, int order);
-extern bool compaction_deferred(struct zone *zone, int order);
+extern void defer_compaction(struct zone *zone, int order, bool sync);
+extern bool compaction_deferred(struct zone *zone, int order, bool sync);
 extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
-extern bool compaction_restarting(struct zone *zone, int order);
+extern bool compaction_restarting(struct zone *zone, int order, bool sync);
 
 /* Compaction has made some progress and retrying makes sense */
 static inline bool compaction_made_progress(enum compact_result result)
@@ -189,11 +189,11 @@ static inline enum compact_result compaction_suitable(struct zone *zone, int ord
 	return COMPACT_SKIPPED;
 }
 
-static inline void defer_compaction(struct zone *zone, int order)
+static inline void defer_compaction(struct zone *zone, int order, bool sync)
 {
 }
 
-static inline bool compaction_deferred(struct zone *zone, int order)
+static inline bool compaction_deferred(struct zone *zone, int order, bool sync)
 {
 	return true;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..4c59996dd4f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -492,9 +492,9 @@ struct zone {
 	 * are skipped before trying again. The number attempted since
 	 * last failure is tracked with compact_considered.
 	 */
-	unsigned int		compact_considered;
-	unsigned int		compact_defer_shift;
-	int			compact_order_failed;
+	unsigned int		compact_considered[2];
+	unsigned int		compact_defer_shift[2];
+	int			compact_order_failed[2];
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 6074eff3d766..7ef40c76bfed 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -245,9 +245,9 @@ DEFINE_EVENT(mm_compaction_suitable_template, mm_compaction_suitable,
 
 DECLARE_EVENT_CLASS(mm_compaction_defer_template,
 
-	TP_PROTO(struct zone *zone, int order),
+	TP_PROTO(struct zone *zone, int order, bool sync),
 
-	TP_ARGS(zone, order),
+	TP_ARGS(zone, order, sync),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
@@ -256,45 +256,48 @@ DECLARE_EVENT_CLASS(mm_compaction_defer_template,
 		__field(unsigned int, considered)
 		__field(unsigned int, defer_shift)
 		__field(int, order_failed)
+		__field(bool, sync)
 	),
 
 	TP_fast_assign(
 		__entry->nid = zone_to_nid(zone);
 		__entry->idx = zone_idx(zone);
 		__entry->order = order;
-		__entry->considered = zone->compact_considered;
-		__entry->defer_shift = zone->compact_defer_shift;
-		__entry->order_failed = zone->compact_order_failed;
+		__entry->considered = zone->compact_considered[sync];
+		__entry->defer_shift = zone->compact_defer_shift[sync];
+		__entry->order_failed = zone->compact_order_failed[sync];
+		__entry->sync = sync;
 	),
 
-	TP_printk("node=%d zone=%-8s order=%d order_failed=%d consider=%u limit=%lu",
+	TP_printk("node=%d zone=%-8s order=%d order_failed=%d consider=%u limit=%lu sync=%d",
 		__entry->nid,
 		__print_symbolic(__entry->idx, ZONE_TYPE),
 		__entry->order,
 		__entry->order_failed,
 		__entry->considered,
-		1UL << __entry->defer_shift)
+		1UL << __entry->defer_shift,
+		__entry->sync)
 );
 
 DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
 
-	TP_PROTO(struct zone *zone, int order),
+	TP_PROTO(struct zone *zone, int order, bool sync),
 
-	TP_ARGS(zone, order)
+	TP_ARGS(zone, order, sync)
 );
 
 DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_compaction,
 
-	TP_PROTO(struct zone *zone, int order),
+	TP_PROTO(struct zone *zone, int order, bool sync),
 
-	TP_ARGS(zone, order)
+	TP_ARGS(zone, order, sync)
 );
 
 DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_defer_reset,
 
-	TP_PROTO(struct zone *zone, int order),
+	TP_PROTO(struct zone *zone, int order, bool sync),
 
-	TP_ARGS(zone, order)
+	TP_ARGS(zone, order, sync)
 );
 #endif
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 7c607479de4a..cb139b63a754 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -139,36 +139,40 @@ EXPORT_SYMBOL(__ClearPageMovable);
  * allocation success. 1 << compact_defer_limit compactions are skipped up
  * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
  */
-void defer_compaction(struct zone *zone, int order)
+void defer_compaction(struct zone *zone, int order, bool sync)
 {
-	zone->compact_considered = 0;
-	zone->compact_defer_shift++;
+	zone->compact_considered[sync] = 0;
+	zone->compact_defer_shift[sync]++;
 
-	if (order < zone->compact_order_failed)
-		zone->compact_order_failed = order;
+	if (order < zone->compact_order_failed[sync])
+		zone->compact_order_failed[sync] = order;
 
-	if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
-		zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
+	if (zone->compact_defer_shift[sync] > COMPACT_MAX_DEFER_SHIFT)
+		zone->compact_defer_shift[sync] = COMPACT_MAX_DEFER_SHIFT;
 
-	trace_mm_compaction_defer_compaction(zone, order);
+	trace_mm_compaction_defer_compaction(zone, order, sync);
+
+	/* deferred sync compaction implies deferred async compaction */
+	if (sync)
+		defer_compaction(zone, order, false);
 }
 
 /* Returns true if compaction should be skipped this time */
-bool compaction_deferred(struct zone *zone, int order)
+bool compaction_deferred(struct zone *zone, int order, bool sync)
 {
-	unsigned long defer_limit = 1UL << zone->compact_defer_shift;
+	unsigned long defer_limit = 1UL << zone->compact_defer_shift[sync];
 
-	if (order < zone->compact_order_failed)
+	if (order < zone->compact_order_failed[sync])
 		return false;
 
 	/* Avoid possible overflow */
-	if (++zone->compact_considered > defer_limit)
-		zone->compact_considered = defer_limit;
+	if (++zone->compact_considered[sync] > defer_limit)
+		zone->compact_considered[sync] = defer_limit;
 
-	if (zone->compact_considered >= defer_limit)
+	if (zone->compact_considered[sync] >= defer_limit)
 		return false;
 
-	trace_mm_compaction_deferred(zone, order);
+	trace_mm_compaction_deferred(zone, order, sync);
 
 	return true;
 }
@@ -181,24 +185,32 @@ bool compaction_deferred(struct zone *zone, int order)
 void compaction_defer_reset(struct zone *zone, int order,
 		bool alloc_success)
 {
-	if (alloc_success) {
-		zone->compact_considered = 0;
-		zone->compact_defer_shift = 0;
-	}
-	if (order >= zone->compact_order_failed)
-		zone->compact_order_failed = order + 1;
+	int sync;
+
+	for (sync = 0; sync <= 1; sync++) {
+		if (alloc_success) {
+			zone->compact_considered[sync] = 0;
+			zone->compact_defer_shift[sync] = 0;
+		}
+		if (order >= zone->compact_order_failed[sync])
+			zone->compact_order_failed[sync] = order + 1;
 
-	trace_mm_compaction_defer_reset(zone, order);
+		trace_mm_compaction_defer_reset(zone, order, sync);
+	}
 }
 
 /* Returns true if restarting compaction after many failures */
-bool compaction_restarting(struct zone *zone, int order)
+bool compaction_restarting(struct zone *zone, int order, bool sync)
 {
-	if (order < zone->compact_order_failed)
+	int defer_shift;
+
+	if (order < zone->compact_order_failed[sync])
 		return false;
 
-	return zone->compact_defer_shift == COMPACT_MAX_DEFER_SHIFT &&
-		zone->compact_considered >= 1UL << zone->compact_defer_shift;
+	defer_shift = zone->compact_defer_shift[sync];
+
+	return defer_shift == COMPACT_MAX_DEFER_SHIFT &&
+		zone->compact_considered[sync] >= 1UL << defer_shift;
 }
 
 /* Returns true if the pageblock should be scanned for pages to isolate. */
@@ -1555,7 +1567,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	 * Clear pageblock skip if there were failures recently and compaction
 	 * is about to be retried after being deferred.
 	 */
-	if (compaction_restarting(zone, cc->order))
+	if (compaction_restarting(zone, cc->order, sync))
 		__reset_isolation_suitable(zone);
 
 	/*
@@ -1767,7 +1779,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		enum compact_result status;
 
 		if (prio > MIN_COMPACT_PRIORITY
-					&& compaction_deferred(zone, order)) {
+				&& compaction_deferred(zone, order,
+					prio != COMPACT_PRIO_ASYNC)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
@@ -1789,14 +1802,15 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			break;
 		}
 
-		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
-					status == COMPACT_PARTIAL_SKIPPED))
+		if (status == COMPACT_COMPLETE ||
+				status == COMPACT_PARTIAL_SKIPPED)
 			/*
 			 * We think that allocation won't succeed in this zone
 			 * so we defer compaction there. If it ends up
 			 * succeeding after all, it will be reset.
 			 */
-			defer_compaction(zone, order);
+			defer_compaction(zone, order,
+						prio != COMPACT_PRIO_ASYNC);
 
 		/*
 		 * We might have stopped compacting due to need_resched() in
@@ -1966,7 +1980,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 		if (!populated_zone(zone))
 			continue;
 
-		if (compaction_deferred(zone, cc.order))
+		if (compaction_deferred(zone, cc.order, true))
 			continue;
 
 		if (compaction_suitable(zone, cc.order, 0, zoneid) !=
@@ -2000,7 +2014,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 			 * We use sync migration mode here, so we defer like
 			 * sync direct compaction does.
 			 */
-			defer_compaction(zone, cc.order);
+			defer_compaction(zone, cc.order, true);
 		}
 
 		count_compact_events(KCOMPACTD_MIGRATE_SCANNED,
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations
  2018-12-11 14:29 ` [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations Vlastimil Babka
@ 2019-01-08 11:16   ` Mel Gorman
  2019-01-10 13:52     ` Vlastimil Babka
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2019-01-08 11:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Andrea Arcangeli, Michal Hocko, Linus Torvalds,
	linux-mm, Andrew Morton

On Tue, Dec 11, 2018 at 03:29:39PM +0100, Vlastimil Babka wrote:
> Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more
> successful for processes that indicate that they are willing to pay a higher
> initial setup cost for long-term THP benefits. In the current page allocator
> implementation this means that the allocations will try to use reclaim and the
> more costly sync compaction mode, in case the initial direct async compaction
> fails.
> 

First off, I'm going to have trouble reasoning about this because there
is also my own series that reduces compaction failure rates. If that
series gets approved, then it's far more likely that when compaction
fails, we really mean it, and retries from the page allocator context
may be pointless unless the caller indicates it should really retry for
a long time.

The corner case I have in mind is a compaction failure on a very small
percentage of remaining pageblocks that are currently under writeback.

It also means that the merit of this series needs to account for whether
it's before or after the compaction series, as the impact will be
different. FWIW, I had the same problem with evaluating the compaction
series in the context of __GFP_THISNODE vs !__GFP_THISNODE.

> However, THP faults also include __GFP_THISNODE, which, combined with direct
> reclaim, can result in a node-reclaim-like local node thrashing behavior, as
> reported by Andrea [1].
> 
> While this patch is not a full fix, the first step is to restore __GFP_NORETRY
> for madvised THP faults. The expected downside are potentially worse THP
> fault success rates for the madvised areas, which will have to then rely more
> on khugepaged. For khugepaged, __GFP_NORETRY is not restored, as its activity
> should be limited enough by sleeping to cause noticeable thrashing.
> 
> Note that alloc_new_node_page() and new_page() is probably another candidate as
> they handle the migrate_pages(2), resp. mbind(2) syscall, which can thus allow
> unprivileged node-reclaim-like behavior.
> 
> The patch also updates the comments in alloc_hugepage_direct_gfpmask() because
> elsewhere compaction during page fault is called direct compaction, and
> 'synchronous' refers to the migration mode, which is not used for THP faults.
> 
> [1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@redhat.com
> 
> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  mm/huge_memory.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5da55b38b1b7..c442b12b060c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
> -	/* Always do synchronous compaction */
> +	/* Always try direct compaction */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | __GFP_NORETRY;
>  

While I get that you want to reduce thrashing, the configuration item
really indicates the system (not just the caller, but everyone) is willing
to take a hit to get a THP.

>  	/* Kick kcompactd and fail quickly */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>  
> -	/* Synchronous compaction if madvised, otherwise kick kcompactd */
> +	/* Direct compaction if madvised, otherwise kick kcompactd */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT |
> -			(vma_madvised ? __GFP_DIRECT_RECLAIM :
> +			(vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY):
>  					__GFP_KSWAPD_RECLAIM);
>  

Similar note.

> -	/* Only do synchronous compaction if madvised */
> +	/* Only do direct compaction if madvised */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT |
> -		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
> +		return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT;
>  

Similar note.

That said, if this series went in before the compaction series then I
would think it's ok and I would Ack it. However, *after* the compaction
series, I think it would make more sense to rip out or severely reduce the
complex should_compact_retry logic and move towards "if compaction returns,
the page allocator should take its word for it except in extreme cases".

Compaction previously was a bit too casual about partially scanning
pageblocks and the setting/clearing of pageblock bits. The retry logic
sort of dealt with it by making reset/clear cycles very frequent, but
after the series, they are relatively rare[1].

[1] Says he bravely before proven otherwise

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 2/3] mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when compaction was skipped
  2018-12-11 14:29 ` [RFC 2/3] mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when compaction was skipped Vlastimil Babka
@ 2019-01-08 11:25   ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2019-01-08 11:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Andrea Arcangeli, Michal Hocko, Linus Torvalds,
	linux-mm, Andrew Morton

On Tue, Dec 11, 2018 at 03:29:40PM +0100, Vlastimil Babka wrote:
> For costly __GFP_NORETRY allocations (including THP's) we first do an initial
> compaction attempt and if that fails, we proceed with reclaim and another
> round of compaction, unless compaction was deferred due to earlier multiple
> failures. Andrea proposed [1] that we count all compaction failures as the
> defered case in try_to_compact_pages(), but I don't think that's a good idea
> in general.

I'm still stuck on the pre/post compaction series dilemma. I agree with
you before the compaction series and disagree with you after. I'm not
sitting on the fence here, I really think the series materially alters
the facts :(.

> Instead, change the __GFP_NORETRY specific condition so that it
> only proceeds with further reclaim/compaction when the initial compaction
> attempt was skipped due to lack of free base pages.
> 
> Note that the original condition probably never worked properly for THP's,
> because compaction can only become deferred after a sync compaction failure,
> and THP's only perform async compaction, except khugepaged, which is
> infrequent, or madvised faults (until the previous patch restored __GFP_NORETRY
> for those) which are not the default case. Deferring due to async compaction
> failures should be however also beneficial and thus introduced in the next
> patch.
> 
> Also note that due to how try_to_compact_pages() constructs its return value
> from compaction attempts across the whole zonelist, returning COMPACT_SKIPPED
> means that compaction was skipped for *all* attempted zones/nodes, which means
> all zones/nodes are low on memory at the same moment. This is probably rare,

It's not as rare as I imagined, but the compaction series changes things so
that it really is rare, or, when it does occur, it probably means compaction
will fail if retried.

> which would mean that the resulting 'goto nopage' would be very common,just
> because e.g. a single zone had enough memory and compaction failed there, while
> the rest of nodes could succeed after reclaim.  However, since THP faults use
> __GFP_THISNODE, compaction is also attempted only for a single node, so in
> practice there should be no significant loss of information when constructing
> the return value, nor bias towards 'goto nopage' for THP faults.
> 

Post the compaction series, we never direct compact a remote node. I
think the concept still makes sense post the series, but it would turn
into:

a) try async compaction
b) try sync compaction
c) defer unconditionally

Ideally Andrea would be able to report back on this but if we can decide
on the compaction series and the ordering of this, I can shove the results
through the SUSE test grid with both the usemem (similar to Andrea's case)
and the thpscale/thpfioscale cases and see what the latency and success
rates look like.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 3/3] mm, compaction: introduce deferred async compaction
  2018-12-11 14:29 ` [RFC 3/3] mm, compaction: introduce deferred async compaction Vlastimil Babka
@ 2019-01-08 11:28   ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2019-01-08 11:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Andrea Arcangeli, Michal Hocko, Linus Torvalds,
	linux-mm, Andrew Morton

On Tue, Dec 11, 2018 at 03:29:41PM +0100, Vlastimil Babka wrote:
> Deferring compaction happens when it fails to fulfill the allocation request at
> given order, and then a number of the following direct compaction attempts for
> same or higher orders is skipped; with further failures, the number grows
> exponentially up to 64. This is reset e.g. when compaction succeeds.
> 
> Until now, defering compaction is only performed after a sync compaction fails,
> and then it also blocks async compaction attempts. The rationale is that only a
> failed sync compaction is expected to fully exhaust all compaction potential of
> a zone. However, for THP page faults that use __GFP_NORETRY, this means only
> async compaction is attempted and thus it is never deferred, potentially
> resulting in pointless reclaim/compaction attempts in a badly fragmented node.
> 
> This patch therefore tracks and checks async compaction deferred status in
> addition, and mostly separately from sync compaction. This allows deferring THP
> fault compaction without affecting any sync pageblock-order compaction.
> Deferring for sync compaction however implies deferring for async compaction as
> well. When deferred status is reset, it is reset for both modes.
> 
> The expected outcome is less compaction/reclaim activity for failing THP faults
> likely with some expense on THP fault success rate.
> 

Either pre- or post-compaction series, I think this makes sense, although the
details may change slightly. If the caller allows it, then do async compaction,
sync compaction if that fails, and defer if that also fails. If the caller can
only do async compaction, then defer upon failure and keep track of those two
callers separately.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations
  2019-01-08 11:16   ` Mel Gorman
@ 2019-01-10 13:52     ` Vlastimil Babka
  2019-01-10 14:55       ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2019-01-10 13:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrea Arcangeli, Michal Hocko, Linus Torvalds,
	linux-mm, Andrew Morton

On 1/8/19 12:16 PM, Mel Gorman wrote:
> On Tue, Dec 11, 2018 at 03:29:39PM +0100, Vlastimil Babka wrote:
>> Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
>> madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more
>> successful for processes that indicate that they are willing to pay a higher
>> initial setup cost for long-term THP benefits. In the current page allocator
>> implementation this means that the allocations will try to use reclaim and the
>> more costly sync compaction mode, in case the initial direct async compaction
>> fails.
>>
> 
> First off, I'm going to have trouble reasoning about this because there
> is also my own series that reduces compaction failure rates. If that
> series get approved, then it's far more likely that when compaction
> fails that we really mean it and retries from the page allocator context
> may be pointless unless the caller indicates it should really retry for
> a long time.

Understood.

> The corner case I have in mind is a compaction failure on a very small
> percentage of remaining pageblocks that are currently under writeback.
> 
> It also means that the merit of this series needs to account for whether
> it's before or after the compaction series as the impact will be
> different. FWIW, I had the same problem with evaluating the compaction
> series on the context of __GFP_THISNODE vs !__GFP_THISNODE

Right. In that case I think that for mainline, making compaction better has
priority over trying to compensate for it. The question is whether somebody
wants to fix stable/older distro kernels. Now that it wasn't possible to
remove __GFP_THISNODE for THPs, I thought this might be an alternative
acceptable to everyone, provided that it works. Backporting your compaction
series would be much more difficult, I guess. Of course distro kernels can
also diverge from mainline and go with the __GFP_THISNODE removal privately.

>> However, THP faults also include __GFP_THISNODE, which, combined with direct
>> reclaim, can result in a node-reclaim-like local node thrashing behavior, as
>> reported by Andrea [1].
>>
>> While this patch is not a full fix, the first step is to restore __GFP_NORETRY
>> for madvised THP faults. The expected downside are potentially worse THP
>> fault success rates for the madvised areas, which will have to then rely more
>> on khugepaged. For khugepaged, __GFP_NORETRY is not restored, as its activity
>> should be limited enough by sleeping to cause noticeable thrashing.
>>
>> Note that alloc_new_node_page() and new_page() is probably another candidate as
>> they handle the migrate_pages(2), resp. mbind(2) syscall, which can thus allow
>> unprivileged node-reclaim-like behavior.
>>
>> The patch also updates the comments in alloc_hugepage_direct_gfpmask() because
>> elsewhere compaction during page fault is called direct compaction, and
>> 'synchronous' refers to the migration mode, which is not used for THP faults.
>>
>> [1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@redhat.com
>>
>> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> ---
>>  mm/huge_memory.c | 13 ++++++-------
>>  1 file changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 5da55b38b1b7..c442b12b060c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>  {
>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>  
>> -	/* Always do synchronous compaction */
>> +	/* Always try direct compaction */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>> +		return GFP_TRANSHUGE | __GFP_NORETRY;
>>  
> 
> While I get that you want to reduce thrashing, the configuration item
> really indicates the system (not just the caller, but everyone) is willing
> to take a hit to get a THP.

Yeah, some hit in exchange for THPs, but probably not overreclaim due to the
__GFP_THISNODE implications.

>>  	/* Kick kcompactd and fail quickly */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>  		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>  
>> -	/* Synchronous compaction if madvised, otherwise kick kcompactd */
>> +	/* Direct compaction if madvised, otherwise kick kcompactd */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>  		return GFP_TRANSHUGE_LIGHT |
>> -			(vma_madvised ? __GFP_DIRECT_RECLAIM :
>> +			(vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY):
>>  					__GFP_KSWAPD_RECLAIM);
>>  
> 
> Similar note.
> 
>> -	/* Only do synchronous compaction if madvised */
>> +	/* Only do direct compaction if madvised */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>> -		return GFP_TRANSHUGE_LIGHT |
>> -		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
>> +		return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT;
>>  
> 
> Similar note.
> 
> That said, if this series went in before the compaction series then I
> would think it's ok and I would Ack it. However, *after* the compaction
> series, I think it would make more sense to rip out or severely reduce the
> complex should_compact_retry logic and move towards "if compaction returns,
> the page allocator should take its word for it except in extreme cases".

OK.

> Compaction previously was a bit too casual about partially scanning
> backblocks and the setting/clearing of pageblock bits. The retry logic
> sortof dealt with it by making reset/clear cycles very frequent but
> after the series, they are relatively rare[1].
> 
> [1] Says he bravely before proven otherwise
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations
  2019-01-10 13:52     ` Vlastimil Babka
@ 2019-01-10 14:55       ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2019-01-10 14:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Rientjes, Andrea Arcangeli, Michal Hocko, Linus Torvalds,
	linux-mm, Andrew Morton

On Thu, Jan 10, 2019 at 02:52:32PM +0100, Vlastimil Babka wrote:
> > It also means that the merit of this series needs to account for whether
> > it's before or after the compaction series as the impact will be
> > different. FWIW, I had the same problem with evaluating the compaction
> > series on the context of __GFP_THISNODE vs !__GFP_THISNODE
> 
> Right. In that case I think for mainline, making compaction better has
> priority over trying to compensate for it.

Thanks, I agree.

> The question is if somebody
> wants to fix stable/older distro kernels. Now that it wasn't possible to
> remove the __GFP_THISNODE for THP's, I thought this might be an
> alternative acceptable to anyone, provided that it works. Backporting
> your compaction series would be much more difficult I guess. Of course
> distro kernels can also divert from mainline and go with the
> __GFP_THISNODE removal privately.
> 

That is a good point and hopefully Andrea can come back with some data from
his side. I can queue up something on our side and see how it affects the
usemem case. As it's a backporting issue that I think would be rejected by
the stable rules, we can discuss the specifics offline and keep "did it work
or not" for here.

I agree that backporting the compaction series too far back would get
"interesting", as some of the prerequisites are unexpected -- e.g. all
the data we have assumes the fragmentation avoidance stuff is in place, and
that in turn has other dependencies, such as when kcompactd gets woken up,
your patches on how fallbacks are managed, etc.

> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 5da55b38b1b7..c442b12b060c 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> >>  {
> >>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> >>  
> >> -	/* Always do synchronous compaction */
> >> +	/* Always try direct compaction */
> >>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> >> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> >> +		return GFP_TRANSHUGE | __GFP_NORETRY;
> >>  
> > 
> > While I get that you want to reduce thrashing, the configuration item
> > really indicates the system (not just the caller, but everyone) is willing
> > to take a hit to get a THP.
> 
> Yeah some hit in exchange for THP's, but probably not an overreclaim due
> to __GFP_THISNODE implications.
> 

Fair point, we can get that data.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-01-10 14:55 UTC | newest]

Thread overview: 9+ messages
2018-12-11 14:29 [RFC 0/3] reduce THP fault thrashing Vlastimil Babka
2018-12-11 14:29 ` [RFC 1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations Vlastimil Babka
2019-01-08 11:16   ` Mel Gorman
2019-01-10 13:52     ` Vlastimil Babka
2019-01-10 14:55       ` Mel Gorman
2018-12-11 14:29 ` [RFC 2/3] mm, page_alloc: reclaim for __GFP_NORETRY costly requests only when compaction was skipped Vlastimil Babka
2019-01-08 11:25   ` Mel Gorman
2018-12-11 14:29 ` [RFC 3/3] mm, compaction: introduce deferred async compaction Vlastimil Babka
2019-01-08 11:28   ` Mel Gorman
